Introduction
In this tutorial, we'll build a semantic search engine over the ResearchMath-14k dataset, which contains mathematical research problems labeled with their open/closed status. We'll extract keywords, create embeddings, visualize and cluster the problems, and build a search system that finds problems similar to a query. The open/closed labels are used to color the visualization and to annotate search results. Along the way, we'll apply core natural language processing (NLP) and machine learning techniques to research mathematics.
Prerequisites
Before starting, you'll need:
- Basic Python knowledge
- Installed Python packages: numpy, pandas, scikit-learn, umap-learn, matplotlib, seaborn, transformers, torch
- A dataset file (we'll use ResearchMath-14k)
You can install the required packages with:
pip install numpy pandas scikit-learn umap-learn matplotlib seaborn transformers torch
Step-by-Step Instructions
Step 1: Load and Explore the Dataset
First, we'll load the dataset and examine its structure. This helps us understand what data we're working with.
1.1 Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
import umap
from transformers import AutoTokenizer, AutoModel
import torch
1.2 Load the Dataset
Assuming you've downloaded the ResearchMath-14k dataset as a CSV file:
# Load dataset
df = pd.read_csv('ResearchMath-14k.csv')
# Display basic info
print(df.head())
print(df.info())
Why this step? Understanding your data structure is crucial before any processing. This helps identify columns like problem descriptions and open status.
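Before moving on, it can help to confirm that the columns used in later steps exist and that no description is missing. The following is a minimal sketch on a small stand-in DataFrame; the column names problem_description and open_status are the ones this tutorial assumes, and df_demo with its rows is purely hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset, with one missing description.
df_demo = pd.DataFrame({
    "problem_description": [
        "Is every even n > 2 a sum of two primes?",
        None,
        "Classify the finite simple groups",
    ],
    "open_status": ["open", "open", "closed"],
})

# Fail early if a column the later steps rely on is absent.
required = ["problem_description", "open_status"]
missing_cols = [c for c in required if c not in df_demo.columns]
if missing_cols:
    raise KeyError(f"Missing expected columns: {missing_cols}")

# Drop rows without text; the vectorizer and the tokenizer both need strings.
# Resetting the index keeps row positions aligned with arrays built later.
clean = df_demo.dropna(subset=["problem_description"]).reset_index(drop=True)

print(f"Kept {len(clean)} of {len(df_demo)} rows")
print(clean["open_status"].value_counts())
```

Applying the same checks to the real df right after loading would surface a renamed column or empty rows before they cause errors in Step 2 or Step 3.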
Step 2: Extract Keywords with TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) helps us identify important words in each problem description.
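To make the scoring concrete, here is a minimal sketch of how a TF-IDF weight can be computed by hand on a toy corpus. It assumes the smoothed IDF variant that scikit-learn's TfidfVectorizer uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The three toy documents and the helper name tfidf_weights are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["prime", "gap", "conjecture"],
    ["prime", "number", "theorem"],
    ["graph", "coloring", "conjecture"],
]

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        raw = {t: count * idf[t] for t, count in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

weights = tfidf_weights(docs)
# "gap" appears in one document and "prime" in two, so "gap" gets the larger weight.
print(weights[0])
```

Terms that are rare across the corpus end up with higher weights than terms shared by many documents, which is why the top-scoring terms work as keywords.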
2.1 Create TF-IDF Vectorizer
# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer(
    max_features=1000,
    stop_words='english',
    ngram_range=(1, 2)  # Include both single words and bigrams
)
# Fit and transform the problem descriptions
tfidf_matrix = vectorizer.fit_transform(df['problem_description'])
# Get feature names
feature_names = vectorizer.get_feature_names_out()
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
2.2 Extract Top Keywords
We'll identify the most important keywords for each problem:
# Function to get top keywords for a problem
def get_top_keywords(doc_idx, num_keywords=5):
    # Get TF-IDF scores for this document
    tfidf_scores = tfidf_matrix[doc_idx].toarray()[0]
    # Get indices of top scores
    top_indices = tfidf_scores.argsort()[::-1][:num_keywords]
    # Get keywords, skipping terms that do not occur in this document
    keywords = [feature_names[i] for i in top_indices if tfidf_scores[i] > 0]
    return keywords
# Example: Get top keywords for first problem
print("Top keywords for first problem:", get_top_keywords(0))
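Why this step? TF-IDF keywords give a quick, human-readable summary of what distinguishes each problem from the rest of the dataset. They are useful for sanity-checking the data and for interpreting the clusters and search results produced in later steps, which rely on embeddings rather than TF-IDF.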
Step 3: Generate Sentence Embeddings
Embeddings convert text into numerical vectors that capture semantic meaning. We'll use a pre-trained model for this.
3.1 Load Pre-trained Model
# Load pre-trained model and tokenizer
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
3.2 Create Embeddings for All Problems
# Function to get embeddings
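def get_embeddings(texts):
    embeddings = []
    for text in texts:
        # Tokenize input (truncated to the model's maximum length)
        inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
        # Get model outputs without tracking gradients
        with torch.no_grad():
            outputs = model(**inputs)
            # Mean-pool the token embeddings, weighted by the attention mask.
            # This model is meant to be used with mean pooling rather than
            # the [CLS] token embedding.
            mask = inputs['attention_mask'].unsqueeze(-1).float()
            embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        embeddings.append(embedding[0].numpy())
    return np.array(embeddings)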
# Generate embeddings for all problems
embeddings = get_embeddings(df['problem_description'].tolist())
print(f"Embeddings shape: {embeddings.shape}")
Why this step? Embeddings capture the meaning of text, which is crucial for finding similar problems in our semantic search.
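One common way to turn per-token vectors into a single sentence vector is masked mean pooling: average the token vectors while ignoring padded positions. The sketch below illustrates the idea with plain NumPy on toy arrays. The arrays stand in for a model's outputs.last_hidden_state and the tokenizer's attention_mask, and the helper name mean_pool is hypothetical.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts

# Toy batch: 2 sequences, 3 token slots, hidden size 2.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],      # no padding
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(tokens, mask)
print(pooled)  # the first row averages only its two real tokens
```

Without the mask, the padded slot in the first sequence would dominate its average, which is why the mask matters as soon as texts are embedded in batches of different lengths.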
Step 4: Visualize Problem Landscape with UMAP
UMAP (Uniform Manifold Approximation and Projection) helps us visualize high-dimensional data in 2D or 3D.
4.1 Apply UMAP
# Reduce dimensionality with UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
embedding_2d = reducer.fit_transform(embeddings)
# Plot the results
plt.figure(figsize=(10, 8))
plt.scatter(embedding_2d[:, 0], embedding_2d[:, 1], alpha=0.6)
plt.title('UMAP Visualization of Mathematical Problems')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.show()
4.2 Color by Open Status
Let's color our visualization by whether problems are open or closed:
# Assuming we have an 'open_status' column; any value other than 'open'
# is treated as closed
is_open = (df['open_status'] == 'open').to_numpy()
# Plot each group separately so the legend matches the colors
plt.figure(figsize=(10, 8))
plt.scatter(embedding_2d[~is_open, 0], embedding_2d[~is_open, 1], c='blue', alpha=0.6, label='Closed')
plt.scatter(embedding_2d[is_open, 0], embedding_2d[is_open, 1], c='red', alpha=0.6, label='Open')
plt.title('UMAP Visualization by Open Status')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.legend()
plt.show()
Why this step? Visualization helps us understand how problems cluster and identify patterns in the mathematical landscape.
Step 5: Cluster Problems with K-Means
Clustering groups similar problems together, which is useful for organization and finding related research.
5.1 Perform K-Means Clustering
# The number of clusters could be chosen with the elbow method;
# for simplicity, we'll use 5 clusters
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
clusters = kmeans.fit_predict(embeddings)
# Add cluster labels to dataframe
df['cluster'] = clusters
print(df['cluster'].value_counts())
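The elbow method mentioned in the comment above can be sketched as follows: fit K-Means for several values of k and compare the inertia (the sum of squared distances from each point to its assigned centroid). This is an illustration on synthetic data, not a result for ResearchMath-14k; the array X is a hypothetical stand-in for the embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-Means for a range of k and record the inertia for each fit.
k_values = list(range(1, 7))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting k_values against inertias and looking for the point where the curve flattens gives a rough choice of k. On real embeddings the bend is often much less sharp than on synthetic blobs, so treat it as a heuristic.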
5.2 Analyze Clusters
Let's examine what topics each cluster represents:
# Analyze cluster topics
for i in range(5):
    cluster_problems = df[df['cluster'] == i]
    print(f"\nCluster {i} problems:")
    print(cluster_problems['problem_description'].head(3))
Why this step? Clustering helps organize research problems and can reveal research areas that are underexplored or overexplored.
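One way to connect the clusters back to the open/closed labels is to compute the share of open problems in each cluster. The sketch below uses a small hypothetical DataFrame, df_demo, in place of the real df; it assumes the cluster and open_status columns used in this tutorial, with 'open' marking open problems.

```python
import pandas as pd

# Hypothetical stand-in for df after clustering.
df_demo = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 2],
    "open_status": ["open", "closed", "open", "closed", "closed", "open"],
})

# Fraction of problems labeled 'open' in each cluster, plus the cluster size.
summary = (
    df_demo.assign(is_open=df_demo["open_status"].eq("open"))
    .groupby("cluster")["is_open"]
    .agg(open_share="mean", n_problems="count")
)
print(summary)
```

Reading open_share next to n_problems matters: a cluster with very few problems can show an extreme share purely by chance.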
Step 6: Build Semantic Search Engine
Now we'll create a system that can find similar problems to a given query.
6.1 Create Search Function
# Function to find similar problems
def find_similar_problems(query, top_n=5):
    # Get embedding for query
    query_embedding = get_embeddings([query])
    # Calculate cosine similarity with all problems
    similarities = cosine_similarity(query_embedding, embeddings)[0]
    # Get the top matches; the query is not part of the dataset,
    # so the best match must not be skipped
    top_indices = similarities.argsort()[::-1][:top_n]
    # Return similar problems
    similar_problems = df.iloc[top_indices]
    return similar_problems
# Example usage
query = "Prove that the Riemann hypothesis is true"
similar = find_similar_problems(query)
print("Similar problems:")
print(similar[['problem_description', 'open_status']])
6.2 Test the Search Engine
Try searching for different mathematical concepts:
# Test with another query
query2 = "Find a solution to the Navier-Stokes equations"
similar2 = find_similar_problems(query2)
print("\nSimilar problems to Navier-Stokes:")
print(similar2[['problem_description', 'open_status']])
Why this step? A semantic search engine allows researchers to quickly find related work, avoiding duplication and discovering connections between different areas of mathematics.
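If many queries are run against the same dataset, one option is to L2-normalize the problem embeddings once, so that cosine similarity reduces to a dot product. The sketch below shows this with NumPy on toy 2-D vectors; the helper names l2_normalize and top_k_similar and the toy corpus are hypothetical. Because a free-text query is not itself a row of the dataset, the top-ranked result is kept.

```python
import numpy as np

def l2_normalize(matrix):
    """Scale each row to unit length so dot products equal cosine similarities."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

def top_k_similar(query_vec, corpus_unit, k=5):
    """Return (indices, scores) of the k most similar corpus rows, best first."""
    query_unit = l2_normalize(query_vec.reshape(1, -1))[0]
    scores = corpus_unit @ query_unit
    k = min(k, len(scores))
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]

# Toy 2-D "embeddings" (hypothetical stand-ins for the real vectors).
corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
corpus_unit = l2_normalize(corpus)  # normalize once, reuse for every query

indices, scores = top_k_similar(np.array([2.0, 0.0]), corpus_unit, k=2)
print(indices, scores)
```

The returned indices can be passed to df.iloc in the same way as in find_similar_problems, and the scores can be shown next to each result to indicate how close the match is.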
Summary
In this tutorial, we've built a complete NLP pipeline for mathematical research problems. We've learned how to:
- Load and explore a research dataset
- Extract important keywords using TF-IDF
- Create semantic embeddings with transformer models
- Visualize the mathematical problem landscape with UMAP
- Cluster similar problems together
- Build a semantic search engine
This pipeline demonstrates key concepts in natural language processing and machine learning applied to research mathematics. The tools and techniques we've used can be adapted for other domains, making this a valuable foundation for NLP projects.


