Helical closes $10M seed to turn bio foundation models into systems

April 14, 2026 · 4 min read

Learn how to build a basic pharmaceutical AI pipeline using transformer models and biological data, similar to what Helical is doing in the bio foundation models space.

Introduction

In the rapidly evolving field of pharmaceutical AI, companies like Helical are pioneering the use of foundation models to transform drug discovery and development processes. This tutorial will guide you through building a basic bioinformatics pipeline using Python and machine learning libraries that mirrors the foundational concepts behind Helical's approach to pharmaceutical AI. You'll learn how to process biological data, create embeddings, and build simple predictive models using transformer architectures.

Prerequisites

To follow this tutorial, you should have:

  • Intermediate Python programming skills
  • Familiarity with machine learning concepts
  • Basic understanding of biological data types (DNA, RNA, proteins)
  • Installed Python packages: transformers, torch, biopython, pandas, numpy

Step-by-Step Instructions

1. Setting Up Your Environment

First, we'll create a clean Python environment and install the necessary packages. This ensures we have all dependencies needed for our bioinformatics pipeline.

pip install transformers torch biopython pandas numpy scikit-learn

Why this step? The transformers library provides pre-trained models for various biological sequences, while biopython handles biological data formats. These tools form the foundation of modern bioinformatics AI pipelines.
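If you prefer to isolate dependencies first, a typical setup might look like the following (assuming Python 3.8+ with the built-in venv module; the environment name is arbitrary):

```shell
# Create and activate an isolated environment, then install the packages
python -m venv bio-pipeline-env
source bio-pipeline-env/bin/activate   # on Windows: bio-pipeline-env\Scripts\activate
pip install transformers torch biopython pandas numpy scikit-learn
```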

2. Loading and Preparing Biological Data

We'll start by creating a basic data loading function for protein sequences. This simulates how Helical might process raw biological data.

import pandas as pd
from Bio import SeqIO
import torch
from transformers import AutoTokenizer, AutoModel

# Sample protein sequences (identical placeholders for demonstration;
# a real dataset would contain a distinct sequence per identifier)
protein_data = [
    {'id': 'P12345', 'sequence': 'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG'},
    {'id': 'P67890', 'sequence': 'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG'},
    {'id': 'P54321', 'sequence': 'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG'}
]

df = pd.DataFrame(protein_data)
print(df.head())

Why this step? Biological data often comes in FASTA format or structured databases. This step prepares us to handle real-world data sources that Helical and similar companies work with daily.
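For context, FASTA files interleave header lines (starting with `>`) and sequence lines that may wrap across multiple lines. Here is a minimal pure-Python sketch of that parsing logic; in practice, Biopython's `SeqIO.parse` handles the format along with its many edge cases, and the example data below is made up for illustration:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {id: sequence}."""
    records = {}
    current_id = None
    for line in text.strip().splitlines():
        if line.startswith(">"):
            # Header line: ">ID optional description" -> keep only the ID
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines may wrap; collect fragments for later joining
            records[current_id].append(line.strip())
    return {rid: "".join(parts) for rid, parts in records.items()}

fasta_text = """>P12345 example protein
MKTVRQERLKSIVRILERSK
EPVSGAQLAEELSVSRQVIV
>P67890 another protein
MKTVRQERLK
"""

print(parse_fasta(fasta_text))
```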

3. Creating Protein Embeddings

Next, we'll use a pre-trained model to generate embeddings for our protein sequences. These embeddings capture semantic meaning and are crucial for downstream AI tasks.

# Load pre-trained model and tokenizer
model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Function to generate one embedding vector per sequence
def generate_embeddings(sequences):
    embeddings = []
    for seq in sequences:
        inputs = tokenizer(seq, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
            # Mean-pool the last hidden states into a single vector per sequence
            embedding = outputs.last_hidden_state.mean(dim=1)
            embeddings.append(embedding)
    # Concatenate along the batch dimension: shape (num_sequences, hidden_size).
    # torch.stack here would add a spurious singleton dimension.
    return torch.cat(embeddings, dim=0)

# Generate embeddings for our data
embeddings = generate_embeddings(df['sequence'].tolist())
print(f"Embeddings shape: {embeddings.shape}")

Why this step? Protein embeddings are fundamental representations that capture structural and functional information. These are the building blocks for machine learning models that predict drug interactions or protein functions.
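One way to see why such embeddings are useful: sequences can be compared by the cosine similarity of their vectors, with similar sequences landing close together. A minimal NumPy sketch with toy 4-dimensional vectors (real ESM-2 embeddings have hundreds of dimensions; the values below are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors; 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" standing in for real model output
emb_a = np.array([0.2, 0.8, 0.1, 0.4])
emb_b = np.array([0.2, 0.8, 0.1, 0.4])   # identical sequence -> identical embedding
emb_c = np.array([0.9, -0.1, 0.5, 0.0])  # a dissimilar sequence

print(cosine_similarity(emb_a, emb_b))  # 1.0 for identical vectors
print(cosine_similarity(emb_a, emb_c))
```

In a real pipeline, nearest-neighbor search over such similarities is a common first step for finding functionally related proteins.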

4. Building a Simple Predictive Model

Now we'll create a basic machine learning model to predict protein function based on embeddings. This mimics how Helical might build systems to predict drug targets.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create dummy labels for demonstration
# In real applications, these would come from experimental data
labels = [0, 1, 0]  # 0 = enzyme, 1 = receptor

# Split data (three samples is purely illustrative; real pipelines
# need far larger labelled datasets)
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=42
)

# Train a simple classifier; name it clf so it does not shadow
# the transformer model loaded earlier
clf = LogisticRegression()
clf.fit(X_train.numpy(), y_train)

# Make predictions
predictions = clf.predict(X_test.numpy())
accuracy = accuracy_score(y_test, predictions)
print(f"Model accuracy: {accuracy:.2f}")

Why this step? This demonstrates how embeddings can be fed into traditional ML models to solve specific pharmaceutical problems. Real applications might predict drug toxicity or identify new therapeutic targets.

5. Integrating with Hugging Face Transformers

For scalability, we'll show how to save and reload the transformer encoder and tokenizer using Hugging Face's ecosystem, which is commonly used in production AI systems. Note that this covers only the Hugging Face components; the scikit-learn classifier must be persisted separately.

# Save the ESM encoder and tokenizer
model.save_pretrained("./protein_model")
tokenizer.save_pretrained("./protein_model")

# Load them back
loaded_model = AutoModel.from_pretrained("./protein_model")
loaded_tokenizer = AutoTokenizer.from_pretrained("./protein_model")

print("Model and tokenizer saved and loaded successfully!")

Why this step? Production AI systems need to be easily deployable and shareable. Hugging Face's ecosystem allows for seamless model sharing and deployment, similar to how Helical might distribute their systems to pharma partners.

6. Creating a Prediction Pipeline

Finally, we'll build a complete pipeline that can process new protein sequences and make predictions.

def predict_protein_function(sequence):
    # Generate an embedding with the reloaded encoder
    inputs = loaded_tokenizer(sequence, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = loaded_model(**inputs)
        embedding = outputs.last_hidden_state.mean(dim=1)

    # Make a prediction with the trained classifier
    prediction = clf.predict(embedding.numpy())
    probability = clf.predict_proba(embedding.numpy())

    return {
        'prediction': int(prediction[0]),
        'confidence': float(max(probability[0]))
    }

# Test with a new sequence
new_sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
result = predict_protein_function(new_sequence)
print(f"Prediction: {result}")

Why this step? A complete pipeline is essential for practical AI applications. This demonstrates how the theoretical concepts translate into a working system that could be used by pharmaceutical companies.
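One practical detail worth noting: Hugging Face's save_pretrained persists the transformer encoder and tokenizer, not a scikit-learn classifier trained on top of the embeddings. A minimal sketch of persisting such a classifier with the standard library's pickle (a stand-in classifier is trained on made-up data here; joblib is a common alternative):

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a tiny stand-in classifier on made-up 2-D "embeddings"
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.1]])
y = [0, 1, 0, 1]
clf = LogisticRegression().fit(X, y)

# Persist the classifier alongside the Hugging Face artifacts
with open("protein_classifier.pkl", "wb") as f:
    pickle.dump(clf, f)

# Reload it and confirm predictions are unchanged
with open("protein_classifier.pkl", "rb") as f:
    restored = pickle.load(f)

print((restored.predict(X) == clf.predict(X)).all())
```

As with any pickled model, only load files from sources you trust, and keep the scikit-learn version consistent between saving and loading.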

Summary

This tutorial walked you through creating a basic pharmaceutical AI pipeline using transformer models and biological data. You learned how to:

  • Load and process protein sequence data
  • Generate biological embeddings using pre-trained models
  • Build and train simple predictive models
  • Deploy models using the Hugging Face ecosystem
  • Create a complete prediction pipeline

These concepts form the foundation of systems like those developed by Helical. While this example is simplified, it demonstrates the core principles behind how AI is transforming pharmaceutical research and development.

Source: TNW Neural