Introduction
In this tutorial, you'll learn how to work with sensor data collected from real-world environments using Python and machine learning libraries. It builds on the approach taken by Human Archive, in which gig workers collect physical training data through wearable devices. You'll build a small system that generates, processes, and analyzes synthetic wearable sensor data, simulating the data collection pipeline used to train AI for robotics.
Prerequisites
- Basic Python programming knowledge
- Python 3.8 or later installed
- Required packages: numpy, pandas, scikit-learn, matplotlib
- Basic understanding of sensor data and time-series analysis
Step-by-step instructions
1. Setting Up Your Environment
1.1 Install Required Packages
First, create a virtual environment and install the necessary libraries:
python -m venv sensor_env
source sensor_env/bin/activate # On Windows: sensor_env\Scripts\activate
pip install numpy pandas scikit-learn matplotlib
This creates an isolated environment to avoid conflicts with existing Python packages, ensuring consistent behavior across different systems.
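Before moving on, it can help to confirm that the activated environment actually sees the installed packages. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of any library. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util
import sys

# Import names used in this tutorial (scikit-learn imports as "sklearn").
REQUIRED = ["numpy", "pandas", "sklearn", "matplotlib"]

def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The executable path should point inside sensor_env when the venv is active.
print("Python executable:", sys.executable)
missing = missing_packages(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

If any package is listed as missing, re-run the pip install command with the virtual environment activated.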
1.2 Create Project Structure
Set up your project directory:
mkdir wearable_data_analysis
mkdir wearable_data_analysis/data
mkdir wearable_data_analysis/src
mkdir wearable_data_analysis/models
This organization helps maintain clean code structure and separates data, source code, and model files.
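The same layout can also be created from Python, which behaves the same on Windows, macOS, and Linux. This is a sketch, assuming the folder names above; `create_project` is a hypothetical helper introduced only for illustration.

```python
from pathlib import Path

def create_project(root="wearable_data_analysis"):
    """Create the data/, src/ and models/ folders under root; safe to re-run."""
    root_path = Path(root)
    for sub in ("data", "src", "models"):
        # parents=True creates the root folder too; exist_ok=True avoids errors on re-runs
        (root_path / sub).mkdir(parents=True, exist_ok=True)
    return root_path

project_root = create_project()
print(sorted(p.name for p in project_root.iterdir() if p.is_dir()))
```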
2. Generating Synthetic Sensor Data
2.1 Create Data Generation Script
Create a script to simulate sensor data from wearable devices:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
def generate_wearable_data(n_samples=1000):
    """Generate synthetic sensor data for wearable devices."""
    # One reading per minute, oldest first, so later rolling windows run forward in time
    start = datetime.now() - timedelta(minutes=n_samples - 1)
    timestamps = [start + timedelta(minutes=i) for i in range(n_samples)]
    data = {
        'timestamp': timestamps,
        'acceleration_x': np.random.normal(0, 0.5, n_samples),
        'acceleration_y': np.random.normal(0, 0.5, n_samples),
        'acceleration_z': np.random.normal(0, 0.5, n_samples),
        'gyro_x': np.random.normal(0, 0.1, n_samples),
        'gyro_y': np.random.normal(0, 0.1, n_samples),
        'gyro_z': np.random.normal(0, 0.1, n_samples),
        'temperature': np.random.normal(37, 0.5, n_samples),  # body temperature
        'heart_rate': np.random.normal(72, 8, n_samples),  # beats per minute
        'location_lat': np.random.uniform(18.9, 19.0, n_samples),
        'location_lon': np.random.uniform(72.7, 72.8, n_samples)
    }
    return pd.DataFrame(data)
# Generate and save data
sensor_data = generate_wearable_data(1000)
sensor_data.to_csv('wearable_data_analysis/data/sensor_data.csv', index=False)
print(f"Generated {len(sensor_data)} sensor readings")
This simulates the kind of data that gig workers would collect through wearable devices: accelerometer and gyroscope readings, physiological measurements, and location information.
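Because the generator draws from NumPy's global random state, every run produces different values. If repeatable data is useful (for example, to compare results between runs), one option is a seeded generator. The sketch below is an assumption-laden variant, not the tutorial's script: `generate_seeded_data`, the fixed start date, and the reduced set of columns are all hypothetical choices made to keep the example short.

```python
import numpy as np
import pandas as pd

def generate_seeded_data(n_samples=1000, seed=42):
    """Hypothetical variant of generate_wearable_data with a fixed random seed."""
    rng = np.random.default_rng(seed)  # independent, reproducible random stream
    return pd.DataFrame({
        # Fixed start date (an assumption) so timestamps are also repeatable
        "timestamp": pd.date_range("2024-01-01", periods=n_samples, freq="min"),
        "acceleration_x": rng.normal(0, 0.5, n_samples),
        "heart_rate": rng.normal(72, 8, n_samples),
    })

a = generate_seeded_data(5)
b = generate_seeded_data(5)
print(a.equals(b))  # True when both calls use the same seed
```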
3. Data Processing and Analysis
3.1 Load and Inspect Data
Create a data processing script to analyze the collected information:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
# Load the data
df = pd.read_csv('wearable_data_analysis/data/sensor_data.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)
print("Dataset Info:")
print(df.info())
print("\nFirst 5 rows:")
print(df.head())
# Basic statistics
print("\nBasic Statistics:")
print(df.describe())
This step is crucial for understanding your data before applying machine learning algorithms. It helps identify patterns, anomalies, and data quality issues.
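One concrete data quality check is to count readings that fall outside a physically plausible range. The sketch below runs on a tiny hand-made frame; the bounds in `PLAUSIBLE_RANGES` and the helper `count_out_of_range` are illustrative assumptions, not values or functions from the tutorial. Temperature is assumed to be in degrees Celsius.

```python
import pandas as pd

# Illustrative bounds only; adjust them to your sensors.
PLAUSIBLE_RANGES = {
    "heart_rate": (30, 220),   # beats per minute
    "temperature": (30, 43),   # degrees Celsius (assumed unit)
}

def count_out_of_range(frame, ranges):
    """Return {column: number of rows outside the inclusive [low, high] interval}."""
    return {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
        if col in frame.columns
    }

sample = pd.DataFrame({
    "heart_rate": [72, 250, 65, 10],
    "temperature": [36.8, 37.1, 45.0, 36.5],
})
print(count_out_of_range(sample, PLAUSIBLE_RANGES))
# → {'heart_rate': 2, 'temperature': 1}
```

Missing values are also counted as out of range here, because `between` returns False for NaN.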
3.2 Visualize Sensor Data
Create visualizations to understand sensor behavior patterns:
# Plot acceleration data
fig, axes = plt.subplots(3, 1, figsize=(12, 8))
axes[0].plot(df.index, df['acceleration_x'], label='X-axis')
axes[0].plot(df.index, df['acceleration_y'], label='Y-axis')
axes[0].plot(df.index, df['acceleration_z'], label='Z-axis')
axes[0].set_title('Acceleration Data')
axes[0].legend()
axes[1].plot(df.index, df['gyro_x'], label='X-axis')
axes[1].plot(df.index, df['gyro_y'], label='Y-axis')
axes[1].plot(df.index, df['gyro_z'], label='Z-axis')
axes[1].set_title('Gyro Data')
axes[1].legend()
axes[2].plot(df.index, df['heart_rate'], label='Heart Rate')
axes[2].set_title('Heart Rate')
axes[2].legend()
plt.tight_layout()
plt.savefig('wearable_data_analysis/data/sensor_analysis.png')
plt.show()
Visualizations help identify trends and patterns that might not be apparent from raw numbers, which is essential for AI training data quality assessment.
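Line plots of noisy readings can be hard to read, so a histogram of a single channel is a useful complement when assessing data quality. The sketch below uses stand-in heart rate values and the non-interactive Agg backend, so it also works on machines without a display; the output filename is a hypothetical choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, selected before any plotting
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
heart_rate = rng.normal(72, 8, 200)  # stand-in for df['heart_rate']

fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(heart_rate, bins=20)
ax.set_xlabel("Heart rate (bpm)")
ax.set_ylabel("Count")
fig.tight_layout()
fig.savefig("heart_rate_hist.png")  # hypothetical output path
plt.close(fig)  # release the figure's memory
```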
4. Feature Engineering for AI Training
4.1 Create Time-based Features
Enhance your dataset with engineered features that AI models can use:
# Create time-based features
df['hour'] = df.index.hour
df['day_of_week'] = df.index.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
# Create movement features
df['acceleration_magnitude'] = np.sqrt(df['acceleration_x']**2 + df['acceleration_y']**2 + df['acceleration_z']**2)
df['gyro_magnitude'] = np.sqrt(df['gyro_x']**2 + df['gyro_y']**2 + df['gyro_z']**2)
# Create rolling statistics
for col in ['acceleration_x', 'acceleration_y', 'acceleration_z', 'heart_rate']:
    df[f'{col}_rolling_mean_5'] = df[col].rolling(window=5).mean()
    df[f'{col}_rolling_std_5'] = df[col].rolling(window=5).std()
print("New features created:")
print(df.columns.tolist())
Feature engineering transforms raw sensor data into meaningful inputs for machine learning models, which is critical for training robots to understand real-world scenarios.
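To make the magnitude and rolling features concrete, here is a small worked example on four hand-picked readings. The frame `toy` and the window of 2 are assumptions chosen so the arithmetic is easy to check by hand; the tutorial itself uses a window of 5.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "acceleration_x": [3.0, 0.0, 1.0, 2.0],
    "acceleration_y": [4.0, 0.0, 2.0, 0.0],
    "acceleration_z": [0.0, 2.0, 2.0, 0.0],
})

# Euclidean norm per row, e.g. sqrt(3^2 + 4^2 + 0^2) = 5 for the first reading
toy["acceleration_magnitude"] = np.sqrt(
    toy["acceleration_x"]**2 + toy["acceleration_y"]**2 + toy["acceleration_z"]**2
)

# Mean of the current and previous row; the first row has no full window, so it is NaN
toy["magnitude_rolling_mean_2"] = toy["acceleration_magnitude"].rolling(window=2).mean()

print(toy[["acceleration_magnitude", "magnitude_rolling_mean_2"]])
```

The magnitudes come out as 5, 2, 3, 2 and the rolling means as NaN, 3.5, 2.5, 2.5. In general, a window of size w leaves the first w - 1 rows as NaN, which is why a window of 5 produces four incomplete rows at the start of the dataset.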
5. Preparing Data for Machine Learning
5.1 Data Preprocessing Pipeline
Prepare your data for machine learning training:
# Handle missing values
print("Missing values before processing:")
print(df.isnull().sum())
# Forward-fill gaps (DataFrame.ffill replaces the deprecated fillna(method='ffill'))
df.ffill(inplace=True)
# Drop rows that still contain NaN, such as the first rows of each rolling window
df.dropna(inplace=True)
print("\nMissing values after processing:")
print(df.isnull().sum())
# Select features for training
feature_columns = [col for col in df.columns if col not in ['timestamp', 'location_lat', 'location_lon']]
X = df[feature_columns]
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"\nData shape: {X_scaled.shape}")
print(f"Features: {len(feature_columns)}")
Proper preprocessing ensures that machine learning models receive consistent, quality data, which is essential for training robust AI systems for robotics applications.
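A common refinement is to fit the scaler on the training rows only and reuse its statistics for the test rows, so that no information from the test set influences preprocessing. The sketch below shows that ordering on stand-in data; `X_demo` and the `_demo` names are hypothetical and stand in for `df[feature_columns]`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 4))  # stand-in feature matrix

# Split first, then scale
X_train_raw, X_test_raw = train_test_split(X_demo, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
X_train_demo = demo_scaler.fit_transform(X_train_raw)  # mean/std learned from training rows only
X_test_demo = demo_scaler.transform(X_test_raw)        # reuse those statistics; do not refit

print("Train column means:", X_train_demo.mean(axis=0).round(3))
print("Test column means: ", X_test_demo.mean(axis=0).round(3))
```

The training columns end up with mean 0 and unit variance, while the test columns are only approximately standardized, which is expected.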
6. Training a Simple Classification Model
6.1 Create Activity Classification Model
Build a basic model to classify different activities based on sensor data:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
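# Create target variable (simple activity classification)
# For demonstration, label each reading by whether its acceleration magnitude exceeds the median.
# Because acceleration_magnitude is also an input feature, expect near-perfect accuracy here;
# this shows the workflow, not real predictive skill.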
threshold = df['acceleration_magnitude'].median()
df['activity'] = (df['acceleration_magnitude'] > threshold).astype(int)
# Prepare training data
y = df['activity']
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
print("Model Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
This demonstrates how sensor data can be used to train AI models for recognizing activities, which is fundamental for robotics applications that need to understand human behavior.
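An accuracy figure is easier to interpret next to a trivial baseline. The sketch below builds stand-in acceleration data, derives labels from a median threshold on the magnitude (as in the step above), and compares a random forest against scikit-learn's `DummyClassifier`. All variable names here are hypothetical, and only the raw x, y, z values are used as features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
accel = rng.normal(0, 0.5, size=(500, 3))          # stand-in x, y, z acceleration
magnitude = np.sqrt((accel ** 2).sum(axis=1))
labels = (magnitude > np.median(magnitude)).astype(int)

# stratify keeps the class balance identical in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    accel, labels, test_size=0.2, random_state=42, stratify=labels
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

print("Baseline accuracy:", baseline.score(X_te, y_te))
print("Forest accuracy:  ", forest.score(X_te, y_te))
```

With a median split the classes are balanced, so the baseline sits near 50%; a model is only informative to the extent that it beats that figure.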
7. Save and Export Results
7.1 Export Processed Data and Model
Save your processed data and trained model for future use:
# Save processed data
processed_data = df.copy()
processed_data.to_csv('wearable_data_analysis/data/processed_sensor_data.csv', index=True)
# Save the model
import joblib
joblib.dump(model, 'wearable_data_analysis/models/sensor_activity_model.pkl')
joblib.dump(scaler, 'wearable_data_analysis/models/sensor_scaler.pkl')
print("Processed data and model saved successfully!")
Saving processed data and models ensures you can reuse your work and maintain reproducible results, which is essential for AI development pipelines.
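To reuse the saved artifacts later, load both the scaler and the model and apply them in the same order as during training. The sketch below is self-contained: it trains a tiny stand-in model, saves it to a temporary folder, and checks that the reloaded objects give the same predictions. The `_demo` names and file names are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))            # stand-in sensor features
y_demo = (X_raw[:, 0] > 0).astype(int)       # stand-in labels

scaler_demo = StandardScaler().fit(X_raw)
model_demo = RandomForestClassifier(n_estimators=10, random_state=42)
model_demo.fit(scaler_demo.transform(X_raw), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(model_demo, Path(tmp) / "model.pkl")
    joblib.dump(scaler_demo, Path(tmp) / "scaler.pkl")
    loaded_model = joblib.load(Path(tmp) / "model.pkl")
    loaded_scaler = joblib.load(Path(tmp) / "scaler.pkl")

# New readings must be scaled with the saved scaler before prediction
new_readings = rng.normal(size=(5, 3))
restored = loaded_model.predict(loaded_scaler.transform(new_readings))
original = model_demo.predict(scaler_demo.transform(new_readings))
print(np.array_equal(restored, original))
```

Only load pickle-based files from sources you trust, since unpickling can execute arbitrary code.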
Summary
In this tutorial, you've learned how to work with wearable sensor data similar to what Human Archive collects from gig workers in India. You've built a complete pipeline covering data generation, processing, feature engineering, and machine learning model training. Although the data here is synthetic, the pipeline follows the same stages as real-world data collection used to train AI systems for robotics. These skills apply directly to IoT sensor data, human activity recognition, and training data preparation for autonomous systems.



