Introduction
In information retrieval systems, the quality of search results is often determined by how well the system can rank relevant documents. A two-stage retrieve-and-rerank pipeline is a widely adopted approach that combines the efficiency of fast retrieval with the precision of sophisticated reranking. In this tutorial, we'll build a high-precision retrieve-and-rerank pipeline using the zeroentropy/zerank-2-reranker, a 4B parameter cross-encoder model based on Qwen3. This model excels at scoring query-document pairs for relevance, making it ideal for reranking candidate documents retrieved by a faster bi-encoder.
Prerequisites
Before starting this tutorial, you should have:
- Basic understanding of Python programming
- Intermediate knowledge of machine learning concepts
- Installed Python packages: transformers, torch, sentence-transformers, faiss, and numpy
- Access to a machine with a GPU (optional but recommended for performance)
Step-by-Step Instructions
1. Setting Up the Environment
1.1 Install Required Packages
First, ensure you have all necessary packages installed. Run the following command in your terminal:
pip install transformers torch sentence-transformers faiss-cpu numpy
Why: These packages are essential for loading the model, handling tensors, and performing efficient vector similarity searches.
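Before moving on, it can help to confirm that the packages are visible to your interpreter. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this example, and the list assumes the usual import names (`faiss` for faiss-cpu, `sentence_transformers` for sentence-transformers).

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names assumed for the packages installed above.
required = ["transformers", "torch", "faiss", "numpy", "sentence_transformers"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

The check only locates each module without importing it, so it runs quickly even when the heavy libraries are installed.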
1.2 Import Libraries
Next, import the required libraries in your Python script:
import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
Why: These libraries provide the tools to load the reranker model, encode text, and perform fast similarity searches.
2. Loading the Zerank-2 Reranker
2.1 Load the Model and Tokenizer
We'll load the zeroentropy/zerank-2-reranker model using the Hugging Face Transformers library:
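from transformers import AutoModelForSequenceClassification
model_name = "zeroentropy/zerank-2-reranker"
# Assumption: the checkpoint exposes a sequence-classification head. Plain AutoModel
# returns hidden states only, with no .logits for the scoring function to read.
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)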
Why: This loads the pre-trained model and tokenizer, enabling us to process query-document pairs.
2.2 Set Model to Evaluation Mode
Ensure the model is in evaluation mode to disable dropout and other training-specific behaviors:
model.eval()
Why: This ensures consistent and reproducible results during inference.
3. Implementing Pairwise Scoring
3.1 Define a Scoring Function
Create a function that takes a query and a document and returns a relevance score:
def score_query_document(query, document):
    # Tokenize the query and document together as one pair
    inputs = tokenizer(query, document, return_tensors="pt", truncation=True, padding=True)
    # Run the model without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    # Assuming the model outputs logits for binary classification, index 1 is the
    # "relevant" class. The value is a raw logit, not a probability.
    score = outputs.logits[0][1].item()
    return score
Why: This function processes a query-document pair through the model and returns a numeric score indicating relevance.
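The value read from `outputs.logits` is a raw logit rather than a probability. If a score between 0 and 1 is easier to threshold or display, a pair of class logits can be passed through a softmax. The sketch below uses plain Python and assumes the same two-logit output the scoring function assumes; `relevance_probability` is a hypothetical helper, not part of any library.

```python
import math

def relevance_probability(logits):
    """Softmax over a pair of class logits; returns the probability of class 1."""
    not_relevant, relevant = logits
    # Subtract the larger logit before exponentiating for numerical stability.
    largest = max(not_relevant, relevant)
    e0 = math.exp(not_relevant - largest)
    e1 = math.exp(relevant - largest)
    return e1 / (e0 + e1)

print(relevance_probability([0.0, 0.0]))   # equal logits
print(relevance_probability([-1.2, 3.4]))  # class 1 clearly ahead
```

Note that this probability depends on the difference between the two logits, so sorting by it can give a different order than sorting by the class-1 logit alone.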
3.2 Test the Scoring Function
Test the scoring function with a simple example:
query = "What is machine learning?"
document = "Machine learning is a subset of artificial intelligence that focuses on algorithms."
relevance_score = score_query_document(query, document)
print(f"Relevance Score: {relevance_score}")
Why: This confirms that the model loads and runs, and gives a first look at the range of scores it produces.
4. Building the Retrieve-and-Rerank Pipeline
4.1 Implement a Fast Retriever
For the first stage, we'll use a simple bi-encoder (e.g., SentenceTransformer) to retrieve candidate documents:
# Load a bi-encoder for fast retrieval
retriever = SentenceTransformer('all-MiniLM-L6-v2')
candidates = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms.",
    "Deep learning uses neural networks with multiple layers.",
    "Python is a popular programming language for data science."
]
# Encode the candidates
candidate_embeddings = retriever.encode(candidates)
Why: A bi-encoder like SentenceTransformer is efficient for initial retrieval, allowing us to quickly find potentially relevant documents.
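A bi-encoder embeds the query and each document independently, so relevance at this stage reduces to comparing vectors. The sketch below illustrates that comparison with hand-written toy vectors (invented for illustration, not real model output) and shows that the inner product of L2-normalized vectors equals their cosine similarity, which is why an inner-product index works for normalized embeddings.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed as the inner product of normalized vectors."""
    return dot(l2_normalize(a), l2_normalize(b))

# Toy 3-dimensional "embeddings", invented for illustration.
query_vec = [1.0, 2.0, 0.0]
doc_vecs = {"doc_a": [2.0, 4.0, 0.0], "doc_b": [0.0, 0.0, 5.0]}
for name, vec in doc_vecs.items():
    print(name, round(cosine_similarity(query_vec, vec), 4))
```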
4.2 Create a FAISS Index for Fast Search
Use FAISS to index the candidate embeddings for fast similarity search:
# Create FAISS index
index = faiss.IndexFlatIP(384) # 384 is the embedding dimension for all-MiniLM-L6-v2
index.add(candidate_embeddings)
# Query the index
query_embedding = retriever.encode(["What is the relationship between ML and AI?"])
D, I = index.search(query_embedding, k=2) # Retrieve top 2 candidates
Why: FAISS provides an efficient way to perform similarity search over large embedding spaces, making retrieval scalable.
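To make the search step less of a black box: a flat inner-product index compares the query against every stored vector and keeps the highest scores. The NumPy sketch below mirrors that behavior for illustration; it is not FAISS's implementation, and the function name and toy vectors are assumptions of this example.

```python
import numpy as np

def brute_force_inner_product_search(doc_embeddings, query_embeddings, k):
    """Return (scores, indices) of the k highest inner products for each query."""
    docs = np.asarray(doc_embeddings, dtype=np.float32)
    queries = np.asarray(query_embeddings, dtype=np.float32)
    # Clamp k so we never ask for more results than there are documents.
    k = min(k, docs.shape[0])
    # One row per query, one column per document.
    scores = queries @ docs.T
    # Sort each row by descending score and keep the first k columns.
    order = np.argsort(-scores, axis=1)[:, :k]
    top_scores = np.take_along_axis(scores, order, axis=1)
    return top_scores, order

# Toy 2-dimensional vectors, invented for illustration.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([[1.0, 0.1]])
D, I = brute_force_inner_product_search(docs, query, k=2)
print(I[0], D[0])
```

The return shape matches the `D, I` pair used in the tutorial: one row per query, with scores and document indices in descending order of similarity.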
4.3 Rerank Candidates with Zerank-2
Now, rerank the retrieved candidates using the Zerank-2 reranker:
reranked_candidates = []
for i in I[0]:
    score = score_query_document("What is the relationship between ML and AI?", candidates[i])
    reranked_candidates.append((candidates[i], score))
# Sort by score in descending order
reranked_candidates.sort(key=lambda x: x[1], reverse=True)
for doc, score in reranked_candidates:
    print(f"Score: {score:.4f} - {doc}")
Why: This step improves the quality of results by applying the high-precision reranker to the top candidates retrieved by the fast bi-encoder.
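One detail worth guarding against when mapping search indices back to documents: when asked for more neighbors than the index holds, FAISS pads the result with -1, and in Python `candidates[-1]` silently returns the last document. The helper below is a hypothetical sketch that skips such padding, and any repeated indices, before reranking.

```python
def select_candidates(candidates, index_row):
    """Map one row of search indices to documents, skipping -1 padding and repeats."""
    selected = []
    seen = set()
    for i in index_row:
        i = int(i)
        # Ignore padding, out-of-range values, and indices already used.
        if i < 0 or i >= len(candidates) or i in seen:
            continue
        seen.add(i)
        selected.append(candidates[i])
    return selected

docs = ["doc zero", "doc one", "doc two"]
print(select_candidates(docs, [2, 0, -1, -1]))
```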
5. Putting It All Together
5.1 Create a Complete Pipeline Function
Wrap everything into a single function that performs the full retrieve-and-rerank pipeline:
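def retrieve_and_rerank(query, candidates, top_k=5):
    # Step 1: Retrieve candidates
    # The retriever is reloaded on every call for simplicity; load it once
    # outside the function when serving many queries
    retriever = SentenceTransformer('all-MiniLM-L6-v2')
    candidate_embeddings = retriever.encode(candidates)
    # Read the embedding dimension from the data instead of hard-coding 384
    index = faiss.IndexFlatIP(candidate_embeddings.shape[1])
    index.add(candidate_embeddings)
    query_embedding = retriever.encode([query])
    # Never request more neighbors than there are candidates: FAISS pads with -1,
    # and candidates[-1] would silently return the last document
    k = min(top_k, len(candidates))
    D, I = index.search(query_embedding, k=k)
    # Step 2: Rerank candidates
    reranked = []
    for i in I[0]:
        score = score_query_document(query, candidates[i])
        reranked.append((candidates[i], score))
    # Highest reranker score first
    reranked.sort(key=lambda x: x[1], reverse=True)
    return reranked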
Why: This modular function encapsulates the entire pipeline, making it reusable and easy to integrate into larger systems.
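The same two-stage shape can be written independently of any particular model, which makes it easy to test the control flow without downloading weights. The sketch below is a simplified illustration: `retrieve_then_rerank`, `word_overlap`, and `overlap_ratio` are hypothetical names, and the two word-counting scorers are toy stand-ins for the bi-encoder and the reranker, not a substitute for them.

```python
def retrieve_then_rerank(query, documents, cheap_score, precise_score, retrieve_k=3, final_k=2):
    """Two-stage ranking: shortlist with cheap_score, reorder the shortlist with precise_score."""
    # Stage 1: score every document with the cheap function and keep the best retrieve_k.
    shortlist = sorted(documents, key=lambda d: cheap_score(query, d), reverse=True)[:retrieve_k]
    # Stage 2: score only the shortlist with the expensive function.
    rescored = [(d, precise_score(query, d)) for d in shortlist]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]

def word_overlap(query, document):
    """Toy cheap scorer: number of distinct lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def overlap_ratio(query, document):
    """Toy precise scorer: shared words divided by the document length in words."""
    words = document.lower().split()
    return word_overlap(query, document) / len(words) if words else 0.0

docs = [
    "machine learning is a field of study",
    "learning to cook is fun",
    "machine learning",
    "gardening tips",
]
for doc, score in retrieve_then_rerank("machine learning", docs, word_overlap, overlap_ratio):
    print(f"{score:.2f} - {doc}")
```

Passing the scorers in as arguments keeps the expensive function confined to the shortlist, which is the point of the two-stage design.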
5.2 Use the Pipeline
Finally, test your pipeline with a sample dataset:
candidates = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms.",
    "Deep learning uses neural networks with multiple layers.",
    "Python is a popular programming language for data science.",
    "Natural language processing helps computers understand human language.",
    "Computer vision enables machines to interpret visual information."
]
results = retrieve_and_rerank("What is machine learning?", candidates, top_k=3)
for doc, score in results:
    print(f"Score: {score:.4f} - {doc}")
Why: This runs the pipeline end to end: the bi-encoder shortlists three documents, and the reranker orders them by relevance to the query.
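To check whether reranking actually helps on your own data, compare the order before and after reranking against documents you have labeled as relevant. The sketch below computes reciprocal rank and precision at k in plain Python; the document identifiers, orderings, and labels are invented placeholders, not results from the model.

```python
def reciprocal_rank(ranked_docs, relevant_docs):
    """Return 1 / rank of the first relevant document, or 0.0 if none appears."""
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_docs, relevant_docs, k):
    """Fraction of the top k ranked documents that are labeled relevant."""
    top = ranked_docs[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant_docs) / len(top)

# Hypothetical orderings and labels, for illustration only.
retrieval_order = ["doc_b", "doc_a", "doc_c"]
reranked_order = ["doc_a", "doc_b", "doc_c"]
relevant = {"doc_a"}

print("RR before rerank:", reciprocal_rank(retrieval_order, relevant))
print("RR after rerank:", reciprocal_rank(reranked_order, relevant))
print("P@2 after rerank:", precision_at_k(reranked_order, relevant, k=2))
```

Averaging these values over a set of labeled queries gives a more reliable picture than inspecting a single example.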
Summary
In this tutorial, we built a high-precision retrieve-and-rerank pipeline using the zeroentropy/zerank-2-reranker. We started by setting up the environment and loading the model, then implemented pairwise scoring to see how the reranker evaluates query-document pairs. We then created a two-stage pipeline in which a fast bi-encoder retrieves candidates and the Zerank-2 reranker rescores and reorders them. This approach combines the efficiency of fast retrieval with the precision of cross-encoder reranking.



