Building a Semantic Search Engine and Open-Status Classifier over the ResearchMath-14k Dataset

Learn to build a semantic search engine and open-status classifier using the ResearchMath-14k dataset, applying TF-IDF, embeddings, and clustering techniques.

Introduction

In this tutorial, we'll walk through building a semantic search engine and open-status classifier over the ResearchMath-14k dataset, a collection of mathematical research problems labeled with their open/closed status. We'll extract keywords with TF-IDF, embed each problem with a pre-trained transformer, visualize and cluster the embeddings, and build a search function that retrieves similar problems. Along the way you'll see core natural language processing (NLP) and machine learning techniques applied to research mathematics.

Prerequisites

Before starting, you'll need:

Basic Python knowledge
Installed Python packages: numpy, pandas, scikit-learn, umap-learn, matplotlib, seaborn, transformers, torch
A dataset file (we'll use ResearchMath-14k)

You can install the required packages with:

pip install numpy pandas scikit-learn umap-learn matplotlib seaborn transformers torch

Step-by-Step Instructions

Step 1: Load and Explore the Dataset

First, we'll load the dataset and examine its structure. This helps us understand what data we're working with.

1.1 Import Required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
import umap
from transformers import AutoTokenizer, AutoModel
import torch

1.2 Load the Dataset

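Assuming you've downloaded the ResearchMath-14k dataset as a CSV file, load it with pandas. The rest of the tutorial assumes a text column named problem_description and a label column named open_status; adjust these names if your copy of the data differs: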

# Load dataset
df = pd.read_csv('ResearchMath-14k.csv')

# Display basic info
print(df.head())
df.info()  # info() prints its summary directly and returns None

Why this step? Understanding your data structure is crucial before any processing. This helps identify columns like problem descriptions and open status.
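
A quick sanity check on the columns can catch schema surprises before any modeling. The sketch below runs on a tiny in-memory stand-in for the real file: the sample rows are invented for illustration, and the column names are the ones this tutorial assumes rather than a confirmed schema.

```python
import pandas as pd

# Tiny stand-in for the real CSV; these rows are invented for illustration.
sample = pd.DataFrame({
    "problem_description": [
        "Is every even integer greater than 2 a sum of two primes?",
        "Classify the finite simple groups.",
        None,  # a missing description, as might occur in real data
    ],
    "open_status": ["open", "closed", "open"],
})

# Column names assumed throughout this tutorial.
required = ["problem_description", "open_status"]
missing = [col for col in required if col not in sample.columns]
if missing:
    raise KeyError(f"Missing expected columns: {missing}")

# Drop rows without text, since the vectorizer and tokenizer need strings.
clean = sample.dropna(subset=["problem_description"]).reset_index(drop=True)

# Label balance matters later when judging any classifier.
status_counts = clean["open_status"].value_counts()
print(status_counts)
```

Running the same checks on df before Step 2 should surface missing text or unexpected label values early.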

Step 2: Extract Keywords with TF-IDF

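TF-IDF (Term Frequency-Inverse Document Frequency) scores a term highly when it appears often in one problem description but rarely across the rest of the dataset, which helps us identify the words that characterize each problem.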

2.1 Create TF-IDF Vectorizer

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer(
    max_features=1000,
    stop_words='english',
    ngram_range=(1, 2)  # Include both single words and bigrams
)

# Fit and transform the problem descriptions
tfidf_matrix = vectorizer.fit_transform(df['problem_description'])

# Get feature names
feature_names = vectorizer.get_feature_names_out()
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")

2.2 Extract Top Keywords

We'll identify the most important keywords for each problem:

# Function to get top keywords for a problem
def get_top_keywords(doc_idx, num_keywords=5):
    # Get TF-IDF scores for this document
    tfidf_scores = tfidf_matrix[doc_idx].toarray()[0]
    
    # Get indices of top scores
    top_indices = tfidf_scores.argsort()[::-1][:num_keywords]
    
    # Get keywords
    keywords = [feature_names[i] for i in top_indices if tfidf_scores[i] > 0]
    return keywords

# Example: Get top keywords for first problem
print("Top keywords for first problem:", get_top_keywords(0))

Why this step? TF-IDF helps us understand what makes each mathematical problem unique, which is essential for clustering and search.
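
To make the weighting concrete, here is a hand-rolled sketch on a toy corpus. It uses the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which matches scikit-learn's default settings as far as I know; the corpus and the helper name tfidf_vector are made up for illustration, and this is not the library's own code.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "prime gap conjecture",
    "prime number theorem",
    "graph coloring conjecture",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return {term: weight} using smoothed IDF and L2 normalization."""
    counts = Counter(tokens)
    raw = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_vector(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "gap" occurs nowhere else in the corpus, so it outranks "prime" and "conjecture", which each appear in two documents.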

Step 3: Generate Sentence Embeddings

Embeddings convert text into numerical vectors that capture semantic meaning. We'll use a pre-trained model for this.

3.1 Load Pre-trained Model

# Load pre-trained model and tokenizer
model_name = 'sentence-transformers/all-MiniLM-L6-v2'

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

3.2 Create Embeddings for All Problems

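# Function to get embeddings
def get_embeddings(texts, batch_size=32):
    model.eval()
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Tokenize a batch of inputs, padding to the longest text in the batch
        inputs = tokenizer(batch, return_tensors='pt', truncation=True, padding=True)
        
        # Get model outputs without tracking gradients
        with torch.no_grad():
            outputs = model(**inputs)
        
        # Mean-pool the token embeddings, ignoring padding tokens.
        # all-MiniLM-L6-v2 is designed for mean pooling, so the [CLS]
        # vector alone is not the intended sentence embedding.
        mask = inputs['attention_mask'].unsqueeze(-1).float()
        summed = (outputs.last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        embeddings.append((summed / counts).numpy())
    
    return np.vstack(embeddings)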

# Generate embeddings for all problems
embeddings = get_embeddings(df['problem_description'].tolist())
print(f"Embeddings shape: {embeddings.shape}")

Why this step? Embeddings capture the meaning of text, which is crucial for finding similar problems in our semantic search.
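
One common way to turn per-token vectors into a single sentence vector is masked mean pooling: average the token vectors while ignoring padded positions. The numpy sketch below illustrates the arithmetic on made-up numbers; mean_pool is a hypothetical helper, not part of any library.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim) array
    attention_mask:   (batch, seq_len) array of 1 (real token) / 0 (padding)
    """
    mask = attention_mask[:, :, None].astype(float)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Two "sentences" of up to 3 tokens with 2-dimensional vectors (made-up numbers).
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[0.0, 1.0], [2.0, 3.0], [7.0, 8.0]],
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = mean_pool(tokens, mask)
print(pooled)
```

The large values in the padded slot of the first sentence do not leak into its average, which is the point of applying the mask.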

Step 4: Visualize Problem Landscape with UMAP

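UMAP (Uniform Manifold Approximation and Projection) projects high-dimensional vectors into 2D or 3D while trying to keep nearby points close together, which makes the embedding space easier to inspect visually.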

4.1 Apply UMAP

# Reduce dimensionality with UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
embedding_2d = reducer.fit_transform(embeddings)

# Plot the results
plt.figure(figsize=(10, 8))
plt.scatter(embedding_2d[:, 0], embedding_2d[:, 1], alpha=0.6)
plt.title('UMAP Visualization of Mathematical Problems')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.show()

4.2 Color by Open Status

Let's color our visualization by whether problems are open or closed:

# Assuming we have an 'open_status' column where open problems are marked 'open'
is_open = (df['open_status'] == 'open').to_numpy()

# Plot each group separately so the legend gets one labeled entry per status
plt.figure(figsize=(10, 8))
plt.scatter(embedding_2d[~is_open, 0], embedding_2d[~is_open, 1], c='blue', alpha=0.6, label='Closed')
plt.scatter(embedding_2d[is_open, 0], embedding_2d[is_open, 1], c='red', alpha=0.6, label='Open')
plt.title('UMAP Visualization by Open Status')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.legend()
plt.show()

Why this step? Visualization helps us understand how problems cluster and identify patterns in the mathematical landscape.

Step 5: Cluster Problems with K-Means

Clustering groups similar problems together, which is useful for organization and finding related research.

5.1 Perform K-Means Clustering

# The number of clusters is a free choice; the elbow method is one way to pick it.
# For simplicity, we'll fix it at 5 clusters
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
clusters = kmeans.fit_predict(embeddings)

# Add cluster labels to dataframe
df['cluster'] = clusters
print(df['cluster'].value_counts())

5.2 Analyze Clusters

Let's examine what topics each cluster represents:

# Analyze cluster topics
for i in range(kmeans.n_clusters):
    cluster_problems = df[df['cluster'] == i]
    print(f"\nCluster {i} problems:")
    print(cluster_problems['problem_description'].head(3))

Why this step? Clustering groups problems by topic, and the cluster sizes give a rough picture of which research areas are heavily or sparsely represented in the dataset.
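
The elbow method mentioned in Step 5.1 can be sketched as follows: fit K-Means for several values of k and look for the point where inertia (the sum of squared distances to the nearest cluster center) stops dropping sharply. The data here is synthetic, three well-separated blobs standing in for real embeddings, so the elbow is far cleaner than it is likely to be on ResearchMath-14k.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Inertia = sum of squared distances from each point to its cluster center.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(points)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

On this toy data the inertia falls steeply up to k=3 and only slightly afterwards, so 3 would be the natural choice. Replacing points with the real embeddings array gives the corresponding curve for the dataset.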

Step 6: Build Semantic Search Engine

Now we'll create a system that can find similar problems to a given query.

6.1 Create Search Function

# Function to find similar problems
def find_similar_problems(query, top_n=5):
    # Get embedding for query
    query_embedding = get_embeddings([query])
    
    # Calculate cosine similarity with all problems
    similarities = cosine_similarity(query_embedding, embeddings)[0]
    
    # Get top similar problems. The query is free text rather than a dataset
    # row, so the best match is kept instead of being skipped.
    top_indices = similarities.argsort()[::-1][:top_n]
    
    # Return similar problems
    similar_problems = df.iloc[top_indices]
    return similar_problems

# Example usage
query = "Prove that the Riemann hypothesis is true"
similar = find_similar_problems(query)
print("Similar problems:")
print(similar[['problem_description', 'open_status']])
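
When the query is itself a row of the dataset rather than free text, the best match is that row, so it needs to be excluded explicitly. A minimal numpy sketch of that variant follows; find_similar_to_row is a hypothetical helper and the vectors are made up for illustration.

```python
import numpy as np

def find_similar_to_row(row_idx, vectors, top_n=5):
    """Return indices of the top_n rows most cosine-similar to vectors[row_idx]."""
    # Normalize rows once so cosine similarity reduces to a dot product.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = unit @ unit[row_idx]
    sims[row_idx] = -np.inf  # exclude the query row itself
    return np.argsort(sims)[::-1][:top_n]

# Made-up 2-D "embeddings" for four problems.
toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

neighbors = find_similar_to_row(0, toy_vectors, top_n=2)
print(neighbors)
```

With the real data, vectors would be the embeddings array from Step 3, and the returned indices can be passed to df.iloc to retrieve the matching problems.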

6.2 Test the Search Engine

Try searching for different mathematical concepts:

# Test with another query
query2 = "Find a solution to the Navier-Stokes equations"
similar2 = find_similar_problems(query2)
print("\nSimilar problems to Navier-Stokes:")
print(similar2[['problem_description', 'open_status']])

Why this step? A semantic search engine allows researchers to quickly find related work, avoiding duplication and discovering connections between different areas of mathematics.
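
The same embeddings can also feed an open-status classifier. The sketch below is one possible baseline, not a tuned model: it trains logistic regression on synthetic vectors that stand in for the embeddings, with 'open' and 'closed' as the assumed label values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (embeddings, open_status); real data would replace both.
rng = np.random.default_rng(0)
n_per_class, dim = 100, 8
closed_vecs = rng.normal(loc=-1.0, size=(n_per_class, dim))
open_vecs = rng.normal(loc=1.0, size=(n_per_class, dim))
features = np.vstack([closed_vecs, open_vecs])
labels = np.array(["closed"] * n_per_class + ["open"] * n_per_class)

# Hold out a stratified test split so accuracy is measured on unseen problems.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy on synthetic data: {accuracy:.2f}")
```

With the real data, features would be the embeddings array from Step 3 and labels would be df['open_status']. The synthetic classes are separable by construction, so the accuracy printed here says nothing about how well open status can be predicted for actual problems.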

Summary

In this tutorial, we've built a complete NLP pipeline for mathematical research problems. We've learned how to:

Load and explore a research dataset
Extract important keywords using TF-IDF
Create semantic embeddings with transformer models
Visualize the mathematical problem landscape with UMAP
Cluster similar problems together
Build a semantic search engine

This pipeline demonstrates key concepts in natural language processing and machine learning applied to research mathematics. The same sequence of steps (vectorize, embed, visualize, cluster, search) carries over to other text collections with little more than a change of dataset and column names.