This startup is betting India’s gig economy can train the world’s robots

Learn to process and analyze wearable sensor data for AI training by simulating the kind of data collection pipeline Human Archive uses to gather real-world physical data for training robots.

Introduction

In this tutorial, you'll learn how to work with wearable sensor data using Python and machine learning libraries. The tutorial builds on Human Archive's approach, in which gig workers collect physical training data through wearable devices. Because real recordings aren't available here, you'll generate synthetic sensor readings and build a system that processes and analyzes them, simulating the data collection pipeline used in AI training for robotics.

Prerequisites

Basic Python programming knowledge
Python 3.8 or later installed
Required packages: numpy, pandas, scikit-learn, matplotlib
Basic understanding of sensor data and time-series analysis

Step-by-step instructions

1. Setting Up Your Environment

1.1 Install Required Packages

First, create a virtual environment and install the necessary libraries:

python -m venv sensor_env
source sensor_env/bin/activate  # On Windows: sensor_env\Scripts\activate
pip install numpy pandas scikit-learn matplotlib

This creates an isolated environment to avoid conflicts with existing Python packages, ensuring consistent behavior across different systems.
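
Before moving on, it can help to confirm that the packages are visible from the active environment. The following is a minimal sketch using the standard-library importlib.metadata module (available from Python 3.8); the helper name package_versions is illustrative and not part of the tutorial's scripts.

```python
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    required = ["numpy", "pandas", "scikit-learn", "matplotlib"]
    for name, version in package_versions(required).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any package is reported as not installed, check that the virtual environment is activated and rerun the pip command.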

1.2 Create Project Structure

Set up your project directory:

mkdir wearable_data_analysis
mkdir wearable_data_analysis/data
mkdir wearable_data_analysis/src
mkdir wearable_data_analysis/models

This organization helps maintain clean code structure and separates data, source code, and model files.
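
If you prefer to create the folders from Python, for example so that the same setup works unchanged on Windows, macOS, and Linux, a sketch with pathlib could look like this. The function name create_project is an assumption made for illustration.

```python
from pathlib import Path


def create_project(root="wearable_data_analysis", subdirs=("data", "src", "models")):
    """Create the project root and its subdirectories; safe to call repeatedly."""
    root_path = Path(root)
    for name in subdirs:
        # parents=True creates the root if needed; exist_ok=True avoids errors on reruns
        (root_path / name).mkdir(parents=True, exist_ok=True)
    return root_path


if __name__ == "__main__":
    print(f"Project created at {create_project().resolve()}")
```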

2. Generating Synthetic Sensor Data

2.1 Create Data Generation Script

Create a script to simulate sensor data from wearable devices:

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

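def generate_wearable_data(n_samples=1000):
    """Generate synthetic sensor data for wearable devices"""
    # One reading per minute, oldest first, so the time index is ascending
    # and later rolling windows look backward in time
    start = datetime.now() - timedelta(minutes=n_samples - 1)
    timestamps = [start + timedelta(minutes=i) for i in range(n_samples)]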
    
    data = {
        'timestamp': timestamps,
        'acceleration_x': np.random.normal(0, 0.5, n_samples),
        'acceleration_y': np.random.normal(0, 0.5, n_samples),
        'acceleration_z': np.random.normal(0, 0.5, n_samples),
        'gyro_x': np.random.normal(0, 0.1, n_samples),
        'gyro_y': np.random.normal(0, 0.1, n_samples),
        'gyro_z': np.random.normal(0, 0.1, n_samples),
        'temperature': np.random.normal(37, 0.5, n_samples),
        'heart_rate': np.random.normal(72, 8, n_samples),
        'location_lat': np.random.uniform(18.9, 19.0, n_samples),
        'location_lon': np.random.uniform(72.7, 72.8, n_samples)
    }
    
    return pd.DataFrame(data)

# Generate and save data
sensor_data = generate_wearable_data(1000)
sensor_data.to_csv('wearable_data_analysis/data/sensor_data.csv', index=False)
print(f"Generated {len(sensor_data)} sensor readings")

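This simulates the kinds of readings that gig workers would collect through wearable devices: acceleration, gyroscope data, physiological measurements, and location. Each channel is drawn independently at random, so the values have plausible ranges but carry no real motion patterns.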
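
If you want synthetic data with some structure for a classifier to find, one possible variation is to alternate "rest" and "active" segments and keep the ground-truth label. The sketch below is an assumption made for illustration only: the segment length, noise scales, fixed start date, and the function name generate_labeled_motion are invented here and do not describe Human Archive's actual data.

```python
import numpy as np
import pandas as pd


def generate_labeled_motion(n_samples=1000, segment_len=50, seed=0):
    """Synthetic accelerometer data with alternating rest/active segments.

    All parameters (segment length, noise scales) are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)  # seeded for reproducible output
    # 0 = rest, 1 = active; the label flips every `segment_len` samples
    labels = (np.arange(n_samples) // segment_len) % 2
    # Active segments get larger accelerometer noise than rest segments
    scale = np.where(labels == 1, 1.0, 0.2)
    data = {
        "timestamp": pd.date_range("2024-01-01", periods=n_samples, freq="min"),
        "acceleration_x": rng.normal(0, scale),
        "acceleration_y": rng.normal(0, scale),
        "acceleration_z": rng.normal(0, scale),
        "activity": labels,
    }
    return pd.DataFrame(data)


labeled = generate_labeled_motion()
print(labeled.groupby("activity")["acceleration_x"].std())
```

Because the label is generated first and the signal follows from it, a model trained on this data has to infer the activity from the readings rather than from a feature that encodes the answer.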

3. Data Processing and Analysis

3.1 Load and Inspect Data

Create a data processing script to analyze the collected information:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler

# Load the data
df = pd.read_csv('wearable_data_analysis/data/sensor_data.csv')

df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

print("Dataset Info:")
df.info()  # info() prints its summary directly and returns None
print("\nFirst 5 rows:")
print(df.head())

# Basic statistics
print("\nBasic Statistics:")
print(df.describe())

Inspect the data before applying machine learning algorithms. The info() summary shows column types and non-null counts, while describe() shows the range of each channel, which helps you spot anomalies and data quality issues early.
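
Beyond eyeballing the summaries, a few checks can be automated. The sketch below counts duplicate timestamps, missing values, and out-of-range readings. The function name quality_report and the heart-rate limits in the usage example are assumptions chosen for illustration, not clinical or vendor-specified thresholds.

```python
import pandas as pd


def quality_report(df, valid_ranges):
    """Summarise basic data-quality problems in a timestamp-indexed frame.

    valid_ranges maps a column name to an inclusive (low, high) tuple.
    """
    report = {
        "duplicate_timestamps": int(df.index.duplicated().sum()),
        "monotonic_increasing": bool(df.index.is_monotonic_increasing),
        "missing_values": int(df.isnull().sum().sum()),
    }
    for col, (low, high) in valid_ranges.items():
        # between() is False for NaN, so missing readings also count here
        out_of_range = ~df[col].between(low, high)
        report[f"{col}_out_of_range"] = int(out_of_range.sum())
    return report


# Tiny hand-made frame: one implausible heart rate and one repeated timestamp
sample = pd.DataFrame(
    {"heart_rate": [70.0, 250.0, 80.0]},
    index=pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 00:01", "2024-01-01 00:01"]
    ),
)
print(quality_report(sample, {"heart_rate": (30, 220)}))
```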

3.2 Visualize Sensor Data

Create visualizations to understand sensor behavior patterns:

# Plot acceleration data
fig, axes = plt.subplots(3, 1, figsize=(12, 8))

axes[0].plot(df.index, df['acceleration_x'], label='X-axis')
axes[0].plot(df.index, df['acceleration_y'], label='Y-axis')
axes[0].plot(df.index, df['acceleration_z'], label='Z-axis')
axes[0].set_title('Acceleration Data')
axes[0].legend()

axes[1].plot(df.index, df['gyro_x'], label='X-axis')
axes[1].plot(df.index, df['gyro_y'], label='Y-axis')
axes[1].plot(df.index, df['gyro_z'], label='Z-axis')
axes[1].set_title('Gyro Data')
axes[1].legend()

axes[2].plot(df.index, df['heart_rate'], label='Heart Rate')
axes[2].set_title('Heart Rate')
axes[2].legend()

plt.tight_layout()
plt.savefig('wearable_data_analysis/data/sensor_analysis.png')
plt.show()

Plots reveal trends, drift, and outliers that are hard to spot in raw numbers, which makes them a quick first check on the quality of AI training data.

4. Feature Engineering for AI Training

4.1 Create Time-based, Movement, and Rolling Features

Enhance your dataset with engineered features that AI models can use:

# Create time-based features
df['hour'] = df.index.hour
df['day_of_week'] = df.index.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Create movement features
df['acceleration_magnitude'] = np.sqrt(df['acceleration_x']**2 + df['acceleration_y']**2 + df['acceleration_z']**2)
df['gyro_magnitude'] = np.sqrt(df['gyro_x']**2 + df['gyro_y']**2 + df['gyro_z']**2)

# Create rolling statistics
for col in ['acceleration_x', 'acceleration_y', 'acceleration_z', 'heart_rate']:
    df[f'{col}_rolling_mean_5'] = df[col].rolling(window=5).mean()
    df[f'{col}_rolling_std_5'] = df[col].rolling(window=5).std()

print("New features created:")
print(df.columns.tolist())

Feature engineering transforms raw sensor data into meaningful inputs for machine learning models, which is critical for training robots to understand real-world scenarios.
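
To see what the rolling statistics do, it helps to run them on a series small enough to check by hand. This sketch uses a window of 3 instead of 5 to keep the output short; the min_periods variant is shown as an option, not something the tutorial's pipeline uses.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# window=3: each output averages the current value and the two before it,
# so the first two positions have no full window and come out as NaN
strict = values.rolling(window=3).mean()
# → NaN, NaN, 2.0, 3.0, 4.0, 5.0

# min_periods=1 averages whatever is available so far instead of returning NaN
relaxed = values.rolling(window=3, min_periods=1).mean()
# → 1.0, 1.5, 2.0, 3.0, 4.0, 5.0

print(pd.DataFrame({"value": values, "strict": strict, "relaxed": relaxed}))
```

By the same logic, rolling(window=5) in the feature code leaves the first four rows of every rolling column empty, which is where the missing values in the next step come from.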

5. Preparing Data for Machine Learning

5.1 Data Preprocessing Pipeline

Prepare your data for machine learning training:

# Handle missing values
print("Missing values before processing:")
print(df.isnull().sum())

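# Forward-fill gaps from the previous reading
# (fillna(method='ffill') is deprecated in recent pandas releases)
df.ffill(inplace=True)

# Drop rows that are still NaN: the first rows of the rolling features
# have no earlier value to fill from
df.dropna(inplace=True)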

print("\nMissing values after processing:")
print(df.isnull().sum())

# Select features for training
feature_columns = [col for col in df.columns if col not in ['timestamp', 'location_lat', 'location_lon']]
X = df[feature_columns]

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"\nData shape: {X_scaled.shape}")
print(f"Features: {len(feature_columns)}")

Preprocessing gives the model complete, consistently scaled inputs: missing values are filled or dropped, the location columns are excluded, and every remaining feature is standardized to zero mean and unit variance.
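
One caveat: the code above fits the scaler on the whole dataset, so statistics from rows that later land in the test set influence the training inputs. A common alternative is to split first and fit the scaler on the training portion only. The sketch below shows that ordering on random stand-in data; the function name split_then_scale is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_then_scale(X, test_size=0.2, seed=42):
    """Split first, then fit the scaler on the training rows only."""
    X_train, X_test = train_test_split(X, test_size=test_size, random_state=seed)
    scaler = StandardScaler().fit(X_train)  # learns mean/std from training rows
    # The test rows are transformed with the training statistics, never refit
    return scaler.transform(X_train), scaler.transform(X_test), scaler


rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # stand-in feature matrix
X_train_s, X_test_s, fitted_scaler = split_then_scale(X_demo)
print(X_train_s.shape, X_test_s.shape)
```

The training columns end up with mean 0 exactly, while the test columns are only approximately centered, which is the expected behavior when the test set is treated as unseen data.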

6. Training a Simple Classification Model

6.1 Create Activity Classification Model

Build a basic model to classify different activities based on sensor data:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

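# Create target variable (simple activity classification)
# For demonstration, label a reading "active" (1) when its acceleration
# magnitude is above the median. Because acceleration_magnitude is also an
# input feature, the model can recover this label almost perfectly; treat the
# accuracy as a pipeline check, not as real activity recognition.
threshold = df['acceleration_magnitude'].median()
df['activity'] = (df['acceleration_magnitude'] > threshold).astype(int)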

# Prepare training data
y = df['activity']
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

print("Model Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

This demonstrates the mechanics of training a model on sensor data to recognize activities, which is fundamental for robotics applications that need to understand human behavior. A real activity recognizer would need labels recorded independently of the input features.
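
One way to check what a trained model actually relies on is to inspect its feature importances. The sketch below rebuilds a small stand-in version of the setup (three random acceleration channels plus their magnitude, with the label thresholded on the magnitude) rather than reusing the tutorial's DataFrame, so it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
axes = ["acceleration_x", "acceleration_y", "acceleration_z"]
demo = pd.DataFrame(rng.normal(0, 0.5, size=(1000, 3)), columns=axes)
demo["acceleration_magnitude"] = np.sqrt((demo[axes] ** 2).sum(axis=1))

# Same labelling rule as the tutorial: above-median magnitude means "active"
labels = (
    demo["acceleration_magnitude"] > demo["acceleration_magnitude"].median()
).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    demo, labels, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
importances = pd.Series(clf.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)

print(f"Accuracy: {accuracy:.3f}")
print(importances)
```

When a single derived feature dominates the ranking and the accuracy is close to perfect, it usually means the label can be read off that feature, which is the situation in this demonstration.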

7. Save and Export Results

7.1 Export Processed Data and Model

Save your processed data and trained model for future use:

# Save processed data
processed_data = df.copy()
processed_data.to_csv('wearable_data_analysis/data/processed_sensor_data.csv', index=True)

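# Save the model and the scaler together: new data must be scaled the same way
# (joblib is installed as a dependency of scikit-learn)
import joblib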
joblib.dump(model, 'wearable_data_analysis/models/sensor_activity_model.pkl')
joblib.dump(scaler, 'wearable_data_analysis/models/sensor_scaler.pkl')

print("Processed data and model saved successfully!")

Saving processed data and models ensures you can reuse your work and maintain reproducible results, which is essential for AI development pipelines.
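
To reuse the saved files later, load both artifacts and pass new readings through the scaler before the model. The sketch below is self-contained: it trains a small stand-in model on random data and writes it to a temporary directory, so the data, the helper name load_and_predict, and the temporary location are all assumptions for illustration. Only the two file names match the tutorial.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler


def load_and_predict(model_dir, raw_features):
    """Load the saved scaler and model, then classify unscaled feature rows."""
    model_dir = Path(model_dir)
    scaler = joblib.load(model_dir / "sensor_scaler.pkl")
    model = joblib.load(model_dir / "sensor_activity_model.pkl")
    # New data must go through the same scaler that was fit during training
    return model.predict(scaler.transform(raw_features))


# Stand-in artifacts so the example runs on its own (not the tutorial's model)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_scaler = StandardScaler().fit(X_demo)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(
    demo_scaler.transform(X_demo), y_demo
)

with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(demo_model, Path(tmp) / "sensor_activity_model.pkl")
    joblib.dump(demo_scaler, Path(tmp) / "sensor_scaler.pkl")
    reloaded_predictions = load_and_predict(tmp, X_demo[:5])

print(reloaded_predictions)
```

The columns passed to load_and_predict must be in the same order as the features used for training. Since joblib files are pickle-based, load them only from sources you trust.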

Summary

In this tutorial, you worked with synthetic wearable sensor data modeled on the kind Human Archive collects from gig workers in India. You built a pipeline that covers data generation, processing, feature engineering, and machine learning model training. The approach mirrors, in simplified form, the real-world data collection methods used to train AI systems for robotics applications. The same skills apply to IoT sensor data, human activity recognition systems, and AI training data preparation for autonomous systems.