Microsoft AI Releases Harrier-OSS-v1: A New Family of Multilingual Embedding Models Hitting SOTA on Multilingual MTEB v2

March 30, 2026 · 5 min read

Learn how to use Microsoft's new Harrier-OSS-v1 multilingual embedding models to generate semantic embeddings and calculate similarity scores across multiple languages.

Introduction

In this tutorial, you'll learn how to use Microsoft's new Harrier-OSS-v1 multilingual embedding models to perform semantic similarity tasks across multiple languages. These models are designed to generate high-quality text embeddings that capture semantic meaning, making them ideal for applications like cross-lingual information retrieval, document clustering, and translation quality assessment.

Harrier-OSS-v1 models are part of a new family of text embedding models released by Microsoft, available at three scales (270M, 0.6B, and 27B parameters) to suit different computational budgets. The models have achieved state-of-the-art performance on the Multilingual MTEB v2 benchmark, making them excellent choices for multilingual NLP tasks.

Prerequisites

  • Python 3.9 or higher installed on your system (recent versions of transformers no longer support older Python releases)
  • Basic understanding of machine learning and NLP concepts
  • Intermediate knowledge of Python programming
  • Access to a Python environment with the required libraries installed

Step-by-Step Instructions

1. Install Required Libraries

First, you'll need to install the necessary Python libraries to work with the Harrier-OSS-v1 models. The primary libraries are Hugging Face's transformers and torch (PyTorch).

pip install transformers torch

Why this step? The transformers library provides easy access to pre-trained models from Hugging Face, including Microsoft's Harrier-OSS models. torch is required for model inference and tensor operations.

2. Load the Harrier-OSS-v1 Model

Harrier-OSS-v1 models are available through Hugging Face's model hub. For this tutorial, we'll use the 0.6B parameter model, which offers a good balance between performance and computational efficiency.

from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and model
model_name = "microsoft/harrier-oss-v1-0.6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

Why this step? Loading the tokenizer and model allows us to process text inputs and generate embeddings. The tokenizer converts text into tokens that the model can understand, while the model generates the actual semantic embeddings.

3. Prepare Input Text

Before generating embeddings, we need to prepare our input text. Let's create a simple example with sentences in English and Spanish.

# Sample sentences in different languages
sentences = [
    "The weather is beautiful today.",
    "Hoy hace mucho sol.",
    "The sky is clear and blue.",
    "El cielo está claro y azul."
]

# Tokenize the sentences
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

Why this step? Tokenization prepares the text for model input. The padding ensures all sequences have the same length, and truncation prevents overly long sequences from causing memory issues.
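To make the padding behavior concrete, here is a tiny hand-rolled sketch of what padding=True does under the hood (the token ids below are made up for illustration, not real vocabulary ids): shorter sequences are padded to the batch maximum with a pad id, and a matching attention mask records which positions hold real tokens.

```python
# Minimal sketch of batch padding: pad shorter token-id sequences to the
# batch max length and build a matching attention mask.
# Token ids are invented for illustration; 0 stands in for the pad id.
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
max_len = max(len(seq) for seq in batch)

input_ids = [seq + [0] * (max_len - len(seq)) for seq in batch]
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(input_ids)       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The attention mask produced here is exactly what the mean-pooling step later uses to ignore padded positions.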

4. Generate Embeddings

Now we'll generate embeddings for our sentences using the model. The model outputs hidden states that we'll aggregate to create sentence-level embeddings.

# Generate embeddings (no gradients needed at inference time)
with torch.no_grad():
    outputs = model(**inputs)
    # Token-level hidden states: (batch, seq_len, hidden_size)
    last_hidden_states = outputs.last_hidden_state

# Mean pooling: average token embeddings, ignoring padding positions.
# Cast the mask to float so the multiplication and division are well-typed.
attention_mask = inputs["attention_mask"].unsqueeze(-1).float()
masked_embeddings = last_hidden_states * attention_mask
# Sum over tokens and divide by the number of non-padding tokens;
# clamp guards against division by zero for an all-padding row
sentence_embeddings = masked_embeddings.sum(dim=1) / attention_mask.sum(dim=1).clamp(min=1e-9)

Why this step? We use mean pooling to aggregate token-level embeddings into sentence-level embeddings. This approach effectively captures the overall semantic meaning of each sentence by averaging the embeddings of all tokens, weighted by the attention mask to ignore padding.
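To see the arithmetic in isolation, here is a toy numeric example of masked mean pooling with hand-made numbers (no model needed): a padding token is zeroed out by the mask and therefore contributes nothing to the average.

```python
# Toy masked mean pooling: two real tokens plus one padding token,
# each a 3-dimensional "embedding". The padding row must not affect
# the averaged result.
token_embeddings = [
    [1.0, 2.0, 3.0],  # real token
    [3.0, 4.0, 5.0],  # real token
    [9.0, 9.0, 9.0],  # padding token (mask = 0)
]
attention_mask = [1, 1, 0]

# Zero out padded positions, then average over real tokens only
n_real = sum(attention_mask)
pooled = [
    sum(tok[d] * m for tok, m in zip(token_embeddings, attention_mask)) / n_real
    for d in range(3)
]
print(pooled)  # [2.0, 3.0, 4.0]
```

Note the result is the plain average of the two real tokens; the `[9.0, 9.0, 9.0]` padding row is fully ignored, which is exactly what the mask multiplication achieves in the tensor version above.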

5. Calculate Similarity Scores

Once we have sentence embeddings, we can calculate similarity scores between pairs of sentences to demonstrate the model's cross-lingual capabilities.

from sklearn.metrics.pairwise import cosine_similarity

# sklearn expects NumPy input, so convert the PyTorch tensor first
similarity_matrix = cosine_similarity(sentence_embeddings.numpy())

# Print similarity scores for English/Spanish pairs
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if i % 2 == 0 and j % 2 == 1:  # even indices are English, odd are Spanish
            similarity = similarity_matrix[i][j]
            print(f"Similarity between '{sentences[i]}' and '{sentences[j]}': {similarity:.4f}")

Why this step? Cosine similarity measures the angle between two vectors in embedding space. Similar sentences will have embeddings that are close together, resulting in high similarity scores. This demonstrates how the model captures semantic meaning across languages.
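The formula behind cosine_similarity is simple enough to write out by hand, which can help demystify the scores: it is the dot product of the two vectors divided by the product of their magnitudes.

```python
import math

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal, unrelated)
```

Scores range from -1 to 1; for embedding models, semantically related sentences typically land close to 1 regardless of their absolute vector magnitudes.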

6. Compare Different Model Sizes

Harrier-OSS-v1 comes in three sizes (270M, 0.6B, and 27B parameters). Let's compare two of them with a simple example. Be aware that the 27B model requires very large amounts of memory; on limited hardware, substitute the 270M model.

# Compare different model sizes (the 27B model needs a large GPU or
# tens of GB of RAM; swap in the 270M model on limited hardware)
model_sizes = ["microsoft/harrier-oss-v1-0.6b", "microsoft/harrier-oss-v1-27b"]

for model_name in model_sizes:
    print(f"\nTesting {model_name}:")

    # Load tokenizer and model for this size
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Tokenize and generate embeddings with the same mean pooling as before
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)
        last_hidden_states = outputs.last_hidden_state
        attention_mask = inputs["attention_mask"].unsqueeze(-1).float()
        masked_embeddings = last_hidden_states * attention_mask
        sentence_embeddings = masked_embeddings.sum(dim=1) / attention_mask.sum(dim=1).clamp(min=1e-9)
        
    # Print shape of embeddings
    print(f"Embedding shape: {sentence_embeddings.shape}")

Why this step? Comparing different model sizes helps you understand the trade-off between computational efficiency and performance. Larger models typically offer better accuracy but require more memory and processing time.
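A quick back-of-envelope calculation shows why the parameter count matters so much for memory. Weight storage alone is roughly the parameter count times the bytes per parameter (4 for fp32, 2 for fp16), before counting activations or optimizer state; the figures below are estimates from that rule of thumb, not measured numbers.

```python
# Rough memory estimate for model weights alone, by precision.
def weight_memory_gb(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1024**3

for n_params, label in [(270e6, "270M"), (0.6e9, "0.6B"), (27e9, "27B")]:
    fp32 = weight_memory_gb(n_params, 4)
    fp16 = weight_memory_gb(n_params, 2)
    print(f"{label}: ~{fp32:.1f} GB fp32, ~{fp16:.1f} GB fp16")
```

By this estimate the 270M and 0.6B models fit comfortably on consumer hardware, while the 27B model needs roughly 100 GB of fp32 weights, which is why multi-GPU or reduced-precision loading is typically required at that scale.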

7. Save and Load Embeddings

It's often useful to save computed embeddings for later use without re-computing them. (For PyTorch tensors, torch.save and torch.load are an equally good option.)

import pickle

# Save embeddings to a file
with open("sentence_embeddings.pkl", "wb") as f:
    pickle.dump(sentence_embeddings, f)

# Load them back later
with open("sentence_embeddings.pkl", "rb") as f:
    loaded_embeddings = pickle.load(f)

print(f"Loaded embeddings shape: {loaded_embeddings.shape}")

Why this step? Saving embeddings allows you to reuse them in different applications without re-computing them. This is especially useful for large datasets or when running multiple experiments.
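A common pattern built on this idea is a simple on-disk cache: recompute embeddings only when no saved copy exists. The sketch below uses plain Python lists in place of tensors so it runs without PyTorch; compute_embeddings and the cache filename are placeholders for your own pipeline.

```python
import os
import pickle
import tempfile

# Placeholder for a real embedding pipeline; plain lists stand in for tensors
def compute_embeddings():
    return [[0.1, 0.2], [0.3, 0.4]]

cache_path = os.path.join(tempfile.gettempdir(), "demo_embeddings.pkl")

if os.path.exists(cache_path):
    # Reuse the saved embeddings instead of recomputing
    with open(cache_path, "rb") as f:
        embeddings = pickle.load(f)
else:
    # First run: compute and cache for next time
    embeddings = compute_embeddings()
    with open(cache_path, "wb") as f:
        pickle.dump(embeddings, f)

print(len(embeddings))  # 2
```

For large corpora this pattern turns a slow model pass into a one-time cost; just remember to invalidate the cache when the model or input data changes.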

Summary

In this tutorial, you've learned how to work with Microsoft's Harrier-OSS-v1 multilingual embedding models. You've seen how to:

  • Install and load the Harrier-OSS-v1 models using Hugging Face's transformers library
  • Prepare text inputs for model processing
  • Generate sentence-level embeddings using mean pooling
  • Calculate semantic similarity between sentences in different languages
  • Compare different model sizes for performance vs. efficiency trade-offs
  • Save and load embeddings for reuse

These models are particularly powerful for cross-lingual tasks and can be easily integrated into larger NLP pipelines. The ability to generate high-quality embeddings across multiple languages makes them valuable for building multilingual applications, from search engines to translation systems.

Source: MarkTechPost
