mbiomics raises €30M Series A to take its microbiome cancer co-therapy into clinical trials

Learn to analyze microbiome data using Python to identify potential therapeutic targets for cancer immunotherapy, similar to what mbiomics is developing for advanced melanoma treatment.

Introduction

In this tutorial, you'll learn how to analyze microbiome data using Python to look for bacterial species associated with response to cancer immunotherapy, the kind of question companies like mbiomics are working on. We'll process a microbiome dataset, examine its bacterial composition, and test how that composition correlates with treatment response labels. Analyses like this are a starting point for developing microbiome-based treatments for advanced melanoma and other cancers.

Prerequisites

Basic Python programming knowledge
Installed Python 3.8+ with pip
Required packages: numpy, pandas, scikit-learn, matplotlib, seaborn, biom-format (biom-format is only needed if you later load real BIOM tables; the code below does not import it)
Sample microbiome dataset (we'll use a synthetic dataset for demonstration)

Step-by-Step Instructions

1. Install Required Python Packages

First, we need to install the necessary libraries for microbiome data analysis. These packages will help us process, visualize, and analyze the microbiome datasets.

pip install numpy pandas scikit-learn matplotlib seaborn biom-format

Why: These libraries provide essential functionality for handling microbiome data, performing statistical analysis, and creating visualizations to understand bacterial composition patterns.

2. Import Libraries and Load Sample Data

Let's start by importing our required libraries and creating a sample microbiome dataset that mimics real-world data from cancer patients.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Create sample microbiome data
np.random.seed(42)

# Simulate 100 patients with 50 bacterial species
n_samples = 100
n_species = 50

# Generate random bacterial abundance data
abundance_data = np.random.lognormal(mean=2, sigma=1.5, size=(n_samples, n_species))

# Create sample patient metadata
metadata = pd.DataFrame({
    'patient_id': [f'P{i:03d}' for i in range(n_samples)],
    'treatment_response': np.random.choice(['Good', 'Moderate', 'Poor'], n_samples),
    'immune_checkpoint_inhibitor_response': np.random.choice(['High', 'Medium', 'Low'], n_samples),
    'melanoma_stage': np.random.choice(['Stage III', 'Stage IV'], n_samples)
})

# Convert abundance data to DataFrame
species_names = [f'Species_{i:02d}' for i in range(n_species)]
abundance_df = pd.DataFrame(abundance_data, columns=species_names)

# Combine with metadata
microbiome_data = pd.concat([metadata, abundance_df], axis=1)

print("Dataset shape:", microbiome_data.shape)
print("\nFirst few rows of metadata:")
print(metadata.head())

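Why: A synthetic dataset lets us demonstrate the analysis workflow without requiring access to real patient data. Note that the response labels here are drawn at random, independently of the abundances, so any association found in later steps is noise. Real microbiome tables are also sparser (many zeros) and are usually stored as counts.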
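Because randomly assigned labels carry no signal, it can help to check the workflow against a dataset where the answer is known. The following is a minimal sketch, not part of the original tutorial: it assumes a simplified binary responder label and plants a deliberately exaggerated fold change in one species. The names `responder`, `planted_idx`, and `fold_change` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_species = 100, 50

# Assumption: a binary responder label instead of the three-level response
responder = rng.integers(0, 2, size=n_samples)

# Baseline log-normal abundances, same parameters as the tutorial's data
abundance = rng.lognormal(mean=2, sigma=1.5, size=(n_samples, n_species))

# Plant a signal: inflate one species in responders only.
# The fold change is exaggerated on purpose so the effect is easy to recover.
planted_idx = 0
fold_change = 20.0
abundance[responder == 1, planted_idx] *= fold_change

# Compare mean log abundance between the two groups for every species
log_abund = np.log1p(abundance)
diff = log_abund[responder == 1].mean(axis=0) - log_abund[responder == 0].mean(axis=0)
top_species = int(np.argmax(np.abs(diff)))
print("Species with the largest group difference:", top_species)
```

If the analysis steps recover the planted species, the pipeline is at least wired correctly; it says nothing about how it behaves on real data.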

3. Data Preprocessing and Quality Control

Before analysis, we need to clean and preprocess our microbiome data to ensure reliable results.

# Check for missing values (total across all columns)
print("Total missing values:", microbiome_data.isnull().sum().sum())

# Remove low-abundance species (keep those with mean abundance > 0.1)
abundance_cols = [col for col in microbiome_data.columns if col.startswith('Species_')]
mean_abundances = microbiome_data[abundance_cols].mean()
high_abundance_species = mean_abundances[mean_abundances > 0.1].index

# Keep the metadata columns alongside the high-abundance species;
# later steps group and color the data by treatment response
metadata_cols = list(metadata.columns)
filtered_data = microbiome_data[metadata_cols + list(high_abundance_species)].copy()

# Log-transform abundances (log1p handles zeros) to reduce right skew
filtered_data[high_abundance_species] = np.log1p(filtered_data[high_abundance_species])

print(f"Original species count: {n_species}")
print(f"Filtered species count: {len(high_abundance_species)}")

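Why: Low-abundance species can introduce noise into the analysis. The 0.1 cutoff is an illustrative choice; in this synthetic dataset every species passes it, so the filtered count equals the original. The log1p transformation compresses the long right tail of abundance values, making them better suited to PCA and correlation analysis.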
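A plain log transform does not account for the fact that sequencing-based abundances are compositional (each sample's values are relative to its total). One commonly used alternative is the centered log-ratio (CLR) transform. The sketch below is an illustration rather than the tutorial's method; the function name `clr_transform` and the pseudocount of 1.0 are assumptions.

```python
import numpy as np

def clr_transform(abundance, pseudocount=1.0):
    """Centered log-ratio transform applied row-wise (one row per sample).

    The pseudocount (an assumed value) avoids log(0) for absent species.
    """
    x = np.asarray(abundance, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtracting the row mean of the logs divides by the geometric mean
    return log_x - log_x.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
example_abundance = rng.lognormal(mean=2, sigma=1.5, size=(5, 8))
clr_values = clr_transform(example_abundance)
print(clr_values.round(2))
```

Each row of the result sums to zero, and with a pseudocount of 0 the values do not change when a sample is multiplied by a constant, which is why CLR is less sensitive to differences in sequencing depth.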

4. Exploratory Data Analysis

Let's examine the distribution of bacterial species and understand how different patient groups vary.

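# Calculate Shannon diversity per patient from the raw (untransformed) abundances
raw_abundance = microbiome_data[high_abundance_species]
proportions = raw_abundance.div(raw_abundance.sum(axis=1), axis=0)
# Zero proportions are replaced by 1 inside the log so they contribute nothing
shannon_diversity = -(proportions * np.log(proportions.where(proportions > 0, 1))).sum(axis=1)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot the distribution of per-patient diversity
sns.histplot(shannon_diversity, kde=True, ax=axes[0])
axes[0].set_title('Distribution of Shannon Diversity')
axes[0].set_xlabel('Shannon Index')

# Compare mean log abundance per species across treatment response groups
# (grouping by the metadata column, which shares the same row index)
response_groups = filtered_data[high_abundance_species].groupby(microbiome_data['treatment_response']).mean()
response_groups.T.plot(kind='bar', ax=axes[1])
axes[1].set_title('Average Abundance by Treatment Response')
axes[1].set_xlabel('Species')
axes[1].set_ylabel('Mean log abundance')
axes[1].tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()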

Why: This analysis helps identify which bacterial species are most abundant and how they differ between treatment response groups, which is crucial for understanding potential therapeutic targets.
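To quantify how much two patients' microbiomes differ overall, rather than species by species, a common choice is the Bray-Curtis dissimilarity. The sketch below is a NumPy-only illustration under the assumption that abundances are non-negative and no pair of samples is entirely zero; `bray_curtis_matrix` is a hypothetical helper name.

```python
import numpy as np

def bray_curtis_matrix(abundance):
    """Pairwise Bray-Curtis dissimilarity between samples (rows).

    Assumes non-negative abundances and that no two samples are both all-zero
    (which would make the denominator zero).
    """
    x = np.asarray(abundance, dtype=float)
    # Sum over species of |x_i - x_j| for every pair of samples
    abs_diff = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=2)
    totals = x.sum(axis=1)
    denom = totals[:, None] + totals[None, :]
    return abs_diff / denom

rng = np.random.default_rng(0)
example_abundance = rng.lognormal(mean=2, sigma=1.5, size=(20, 50))
bc = bray_curtis_matrix(example_abundance)
print("Mean dissimilarity between distinct samples:",
      bc[np.triu_indices_from(bc, k=1)].mean().round(3))
```

Values range from 0 (identical composition) to 1 (no shared species). Comparing within-group and between-group dissimilarities is one way to see whether response groups differ in overall composition.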

5. Dimensionality Reduction with PCA

PCA helps us visualize high-dimensional microbiome data in 2D or 3D space, revealing patterns and clusters in patient microbiomes.

# Prepare data for PCA
pca_data = filtered_data[high_abundance_species]

# Standardize the data
scaler = StandardScaler()
pca_data_scaled = scaler.fit_transform(pca_data)

# Perform PCA
pca = PCA(n_components=3)
pca_result = pca.fit_transform(pca_data_scaled)

# Create a DataFrame with PCA results and attach the response labels
# (taken from the metadata columns, which share the same row order)
pca_df = pd.DataFrame(pca_result, columns=['PC1', 'PC2', 'PC3'])
pca_df['treatment_response'] = microbiome_data['treatment_response'].values
pca_df['immune_checkpoint_response'] = microbiome_data['immune_checkpoint_inhibitor_response'].values

# Visualize PCA results
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='treatment_response')
plt.title('PCA: PC1 vs PC2')

plt.subplot(1, 3, 2)
sns.scatterplot(data=pca_df, x='PC1', y='PC3', hue='treatment_response')
plt.title('PCA: PC1 vs PC3')

plt.subplot(1, 3, 3)
sns.scatterplot(data=pca_df, x='PC2', y='PC3', hue='treatment_response')
plt.title('PCA: PC2 vs PC3')

plt.tight_layout()
plt.show()

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative explained variance: {np.sum(pca.explained_variance_ratio_):.3f}")

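Why: PCA projects high-dimensional microbiome data onto a few axes that capture as much variance as possible. How much is preserved depends on the data: check the explained variance printed above, because with many weakly correlated species three components may capture only a small fraction. The scatter plots show whether treatment responses cluster by microbiome composition; with the randomly labeled synthetic data, no separation is expected.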
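To judge how many components are worth keeping, it helps to look at the cumulative explained variance across all components, not just the first three. The sketch below computes it directly from a singular value decomposition with NumPy, which yields the same ratios that PCA reports on standardized data. The 80% target and the helper name `explained_variance_ratio` are assumptions chosen for illustration.

```python
import numpy as np

def explained_variance_ratio(data):
    """Explained variance ratio per component, from an SVD of standardized data."""
    x = np.asarray(data, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    singular_values = np.linalg.svd(x, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

rng = np.random.default_rng(0)
# Synthetic log-transformed abundances with the tutorial's parameters
log_abundance = np.log1p(rng.lognormal(mean=2, sigma=1.5, size=(100, 50)))

ratios = explained_variance_ratio(log_abundance)
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches 80%
n_for_80 = int(np.searchsorted(cumulative, 0.80) + 1)
print(f"First 3 components explain {cumulative[2]:.1%} of the variance")
print(f"Components needed for 80%: {n_for_80}")
```

On independent random species, the first few components explain little, so a 2D or 3D scatter plot shows only a thin slice of the data. Real datasets with correlated taxa usually concentrate more variance in the leading components.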

6. Correlation Analysis with Clinical Outcomes

Finally, we'll analyze correlations between specific bacterial species and treatment outcomes to identify potential therapeutic targets.

# Correlation analysis between species and treatment response
# Encode the ordered categories as integers (read from the metadata columns,
# which share the same row index as the filtered abundances)
filtered_data['response_numeric'] = microbiome_data['treatment_response'].map({'Poor': 0, 'Moderate': 1, 'Good': 2})
filtered_data['checkpoint_numeric'] = microbiome_data['immune_checkpoint_inhibitor_response'].map({'Low': 0, 'Medium': 1, 'High': 2})

# Calculate correlations
species_correlations = []
for species in high_abundance_species:
    corr = np.corrcoef(filtered_data[species], filtered_data['response_numeric'])[0, 1]
    species_correlations.append({'species': species, 'correlation': corr})

# Create correlation DataFrame
correlation_df = pd.DataFrame(species_correlations)
correlation_df = correlation_df.sort_values('correlation', key=abs, ascending=False)

# Display top 10 most correlated species
print("Top 10 most correlated species with treatment response:")
print(correlation_df.head(10))

# Visualize correlations
plt.figure(figsize=(10, 6))
sns.barplot(data=correlation_df.head(10), x='correlation', y='species')
plt.title('Top 10 Species Correlated with Treatment Response')
plt.xlabel('Correlation Coefficient')
plt.tight_layout()
plt.show()

Why: Bacterial species whose abundance tracks treatment outcome are candidates for follow-up when developing targeted microbiome therapies, in the spirit of the approach mbiomics is pursuing. Treat the ranking with care: a correlation does not establish causation, testing many species at once will surface some sizable coefficients by chance, and in this synthetic dataset the labels are random, so the top 10 here are noise.
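When dozens of species are tested at once, it is worth attaching p-values to the correlations and correcting them for multiple comparisons. The sketch below is one possible approach, not the tutorial's method: it uses a permutation test for each Pearson correlation and the Benjamini-Hochberg procedure for false discovery rate control. The function names, the 1000 permutations, and the 0.05 threshold are all assumptions.

```python
import numpy as np

def permutation_pvalues(features, outcome, n_permutations=1000, seed=0):
    """Two-sided permutation p-values for the Pearson correlation
    between each feature column and the outcome."""
    rng = np.random.default_rng(seed)
    x = np.asarray(features, dtype=float)
    y = np.asarray(outcome, dtype=float)
    x_std = (x - x.mean(axis=0)) / x.std(axis=0)

    def correlations(y_values):
        y_std = (y_values - y_values.mean()) / y_values.std()
        return x_std.T @ y_std / len(y_values)

    observed = np.abs(correlations(y))
    exceed = np.zeros(x.shape[1])
    for _ in range(n_permutations):
        # Shuffling the outcome breaks any real association
        exceed += np.abs(correlations(rng.permutation(y))) >= observed
    # Add-one correction keeps p-values strictly positive
    return (exceed + 1) / (n_permutations + 1)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(adjusted, 1.0)
    return q

rng = np.random.default_rng(42)
log_abundance = np.log1p(rng.lognormal(mean=2, sigma=1.5, size=(100, 50)))
response = rng.integers(0, 3, size=100)  # random labels, as in the tutorial

p_values = permutation_pvalues(log_abundance, response)
q_values = benjamini_hochberg(p_values)
print("Smallest raw p-value:", p_values.min().round(4))
print("Species with q < 0.05:", int((q_values < 0.05).sum()))
```

With random labels, some raw p-values will look small purely by chance, while the adjusted q-values typically will not pass the threshold. That gap is the reason to apply the correction before calling any species a candidate.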

Summary

In this tutorial, you've learned how to process and analyze microbiome data to look for potential therapeutic targets for cancer immunotherapy. We've covered data preprocessing, exploratory analysis, dimensionality reduction with PCA, and correlation analysis between bacterial species and treatment outcomes. These techniques are a foundation for understanding how microbiome composition relates to immune response in cancer patients, the kind of analysis that informs microbiome-based cancer therapies such as those mbiomics is developing.

The skills you've learned here can be applied to real-world datasets to identify bacterial species that may enhance treatment response, ultimately contributing to the development of personalized microbiome-based cancer therapeutics.