Introduction
In this tutorial, we'll explore how to analyze and visualize health query data similar to what OpenAI revealed about ChatGPT's usage patterns in underserved areas. You'll learn to process real-world health data using Python, extract meaningful insights, and create visualizations that highlight geographic and temporal trends in healthcare access. This intermediate-level tutorial assumes familiarity with Python programming, data analysis concepts, and basic machine learning principles.
Prerequisites
- Python 3.8 or higher installed
- Basic understanding of pandas, numpy, and matplotlib
- Experience with API data fetching (requests library)
- Knowledge of geographic data handling (geopandas recommended)
- Basic understanding of data visualization concepts
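Before diving in, you can optionally sanity-check that the core libraries import. This is a convenience sketch, not part of the analysis; note that scikit-learn's import name is `sklearn`:

```python
# Optional sanity check: confirm the libraries used in this tutorial import.
# Package names mirror the pip install step below; 'sklearn' is scikit-learn.
import importlib

required = ["pandas", "numpy", "matplotlib", "seaborn",
            "geopandas", "folium", "requests", "sklearn"]
missing = []
for name in required:
    try:
        importlib.import_module(name)
    except ImportError:
        missing.append(name)

print("Missing packages:", ", ".join(missing) if missing else "none")
```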
Step-by-Step Instructions
1. Setting Up Your Environment
1.1 Create a Virtual Environment
First, create a dedicated Python environment to manage dependencies:
python -m venv health_query_env
source health_query_env/bin/activate # On Windows: health_query_env\Scripts\activate
Why: Isolating your project dependencies prevents conflicts with other Python packages on your system.
1.2 Install Required Libraries
Install the necessary packages for data processing and visualization:
pip install pandas numpy matplotlib seaborn geopandas folium requests scikit-learn
Why: These libraries provide essential functionality for data manipulation, geographic analysis, and interactive mapping.
2. Data Collection and Preparation
2.1 Simulate Health Query Data
Since we don't have access to actual OpenAI data, we'll create a realistic dataset that mimics the described patterns:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
# Create a sample dataset
np.random.seed(42)
# Generate hospital desert locations (latitude, longitude)
desert_locations = [
    {'city': 'Ruralville', 'lat': 37.5, 'lng': -120.2},
    {'city': 'Desert Town', 'lat': 35.1, 'lng': -118.5},
    {'city': 'Mountain View', 'lat': 39.8, 'lng': -121.9}
]
# Generate sample queries
n_queries = 600000 # Approximate weekly queries
queries = []
for _ in range(n_queries):
    location = random.choice(desert_locations)
    # Pick a random moment within the past week
    timestamp = datetime.now() - timedelta(hours=random.randint(0, 168))
    # Derive the hour from the timestamp so the after_hours flag stays
    # consistent when we extract the hour again during temporal analysis
    hour = timestamp.hour
    # After hours: evening (18:00 onward) or early morning (through 06:59)
    is_after_hours = hour >= 18 or hour <= 6
    queries.append({
        'timestamp': timestamp,
        'lat': location['lat'] + random.uniform(-0.2, 0.2),
        'lng': location['lng'] + random.uniform(-0.2, 0.2),
        'city': location['city'],
        'after_hours': is_after_hours,
        'query_type': random.choice(['symptom_check', 'medication_info', 'emergency_info'])
    })
# Convert to DataFrame
df = pd.DataFrame(queries)
df['date'] = pd.to_datetime(df['timestamp']).dt.date
Why: This simulates the actual data structure and patterns observed in the OpenAI study, including after-hours usage and geographic distribution.
2.2 Data Exploration
Examine the structure and quality of your dataset:
print(df.head())
print(df.info())
print(df.groupby('after_hours').size())
Why: Understanding your data structure is crucial before analysis. This step reveals the distribution of after-hours queries.
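Beyond head() and info(), a few programmatic quality checks catch missing values or out-of-range coordinates early. Here is a small sketch; check_query_data is a helper name of our own, and the tiny demo frame stands in for the simulated df:

```python
import pandas as pd

def check_query_data(df):
    """Basic quality checks for a query DataFrame with
    'lat', 'lng', and 'after_hours' columns."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "lat_in_range": bool(df["lat"].between(-90, 90).all()),
        "lng_in_range": bool(df["lng"].between(-180, 180).all()),
        "after_hours_share": float(df["after_hours"].mean()),
    }

# Tiny illustrative frame standing in for the simulated dataset above
demo = pd.DataFrame({
    "lat": [37.5, 35.1],
    "lng": [-120.2, -118.5],
    "after_hours": [True, False],
})
print(check_query_data(demo))
```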
3. Analyzing Geographic Patterns
3.1 Create Geographic Distribution Map
Visualize where health queries are coming from using folium:
import folium
# Create a base map centered on the US
m = folium.Map(location=[39.8283, -98.5795], zoom_start=4)
# Add markers for each query location
for idx, row in df.sample(n=1000).iterrows():  # Sample for performance
    folium.CircleMarker(
        location=[row['lat'], row['lng']],
        radius=2,
        color='red' if row['after_hours'] else 'blue',
        fill=True,
        fill_color='red' if row['after_hours'] else 'blue',
        popup=f"{row['city']} - After Hours: {row['after_hours']}"
    ).add_to(m)
# Save the map
m.save('health_queries_map.html')
Why: Interactive maps help visualize geographic hotspots and reveal patterns in healthcare access that might not be apparent in tabular data.
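Circle markers show individual queries; for density, folium's HeatMap plugin (from folium.plugins) accepts a list of [lat, lng] pairs. A sketch, guarded in case folium isn't installed; the coordinates here are illustrative stand-ins for df[['lat', 'lng']].values.tolist():

```python
# [lat, lng] pairs for a density layer; illustrative stand-ins for
# df[['lat', 'lng']].values.tolist() from the tutorial.
heat_data = [
    [37.5, -120.2], [37.6, -120.1], [35.1, -118.5], [39.8, -121.9],
]

try:
    import folium
    from folium.plugins import HeatMap

    m = folium.Map(location=[39.8283, -98.5795], zoom_start=4)
    HeatMap(heat_data, radius=15).add_to(m)  # larger radius = smoother blobs
    m.save('health_queries_heatmap.html')
    print('Saved health_queries_heatmap.html')
except ImportError:
    print('folium not installed; skipping heat map')
```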
3.2 Analyze Query Distribution by Location
Examine how queries are distributed across different desert areas:
# Group by city and after-hours status
location_analysis = df.groupby(['city', 'after_hours']).size().unstack(fill_value=0)
print(location_analysis)
# Calculate percentages
location_analysis['total'] = location_analysis.sum(axis=1)
location_analysis['after_hours_pct'] = (location_analysis[True] / location_analysis['total']) * 100
print(location_analysis[['after_hours_pct']])
Why: This analysis helps identify which areas have the highest proportion of after-hours healthcare queries, highlighting potential access gaps.
4. Temporal Analysis of Health Queries
4.1 Hourly Query Patterns
Understand when users are most active with health queries:
# Extract hour from timestamp
df['hour'] = pd.to_datetime(df['timestamp']).dt.hour
# Group by hour and after-hours status
hourly_analysis = df.groupby(['hour', 'after_hours']).size().unstack(fill_value=0)
print(hourly_analysis)
# Visualize hourly patterns
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(hourly_analysis.index, hourly_analysis[False], label='Daytime Queries', marker='o')
plt.plot(hourly_analysis.index, hourly_analysis[True], label='After Hours Queries', marker='s')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Queries')
plt.title('Health Query Patterns by Hour')
plt.legend()
plt.grid(True)
plt.show()
Why: Understanding temporal patterns helps healthcare providers plan staffing and resource allocation more effectively.
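From the hourly table you can also pull out the single busiest after-hours hour, a number that translates directly into staffing advice. A sketch with illustrative counts standing in for hourly_analysis[True]:

```python
import pandas as pd

# Hour -> after-hours query counts; illustrative stand-ins for
# hourly_analysis[True] computed above.
after_hours_counts = pd.Series(
    {0: 120, 1: 95, 19: 310, 20: 340, 21: 280, 22: 200, 23: 150}
)
peak_hour = int(after_hours_counts.idxmax())
print(f"Peak after-hours demand at {peak_hour}:00")
```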
4.2 Weekly Trends Analysis
Identify patterns across different days of the week:
# Extract day of week
df['day_of_week'] = pd.to_datetime(df['timestamp']).dt.day_name()
# Group by day and after-hours status
weekly_analysis = df.groupby(['day_of_week', 'after_hours']).size().unstack(fill_value=0)
print(weekly_analysis)
# Reorder days for better visualization
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekly_analysis = weekly_analysis.reindex(days_order)
# Plot weekly trends
weekly_analysis.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.xlabel('Day of Week')
plt.ylabel('Number of Queries')
plt.title('Health Query Patterns by Day')
plt.xticks(rotation=45)
plt.legend(title='After Hours')
plt.tight_layout()
plt.show()
Why: Weekly trends can reveal patterns in when people seek healthcare help, potentially correlating with work schedules or availability of healthcare services.
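The reordered table also makes it easy to quantify the weekend share of demand, a single figure that is often easier to act on than a bar chart. A sketch with illustrative daily totals standing in for the row sums of weekly_analysis:

```python
import pandas as pd

# Daily query totals; illustrative stand-ins for weekly_analysis.sum(axis=1)
daily_totals = pd.Series({
    "Monday": 900, "Tuesday": 880, "Wednesday": 870, "Thursday": 860,
    "Friday": 850, "Saturday": 1200, "Sunday": 1250,
})
weekend = daily_totals[["Saturday", "Sunday"]].sum()
weekend_share = weekend / daily_totals.sum()
print(f"Weekend share of queries: {weekend_share:.1%}")
```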
5. Advanced Analysis with Machine Learning
5.1 Predictive Modeling for Query Patterns
Use simple machine learning to predict after-hours queries based on temporal features:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# One-hot encode day of week and combine with the hour feature
df_encoded = pd.get_dummies(df, columns=['day_of_week'], prefix='day')
day_columns = [col for col in df_encoded.columns if col.startswith('day_')]
X = df_encoded[['hour'] + day_columns]
Y = df_encoded['after_hours']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate
print(classification_report(y_test, y_pred))
Why: Predictive models can help healthcare systems anticipate demand and optimize resource allocation. Note that in this simulated dataset the after_hours label is derived directly from the hour, so the classifier will score near-perfectly; on real data the target would be an outcome that is not a simple function of the features.
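Before trusting any classifier, it helps to compare it against a majority-class baseline: a model that always predicts the most common label. A small sketch; y_demo is an illustrative stand-in for the y_test labels from the split above:

```python
import pandas as pd

# Illustrative stand-in for y_test from the train/test split above
y_demo = pd.Series([True, True, False, True, False, True])

# A "model" that always predicts the most common class
majority = y_demo.mode()[0]
baseline_accuracy = (y_demo == majority).mean()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

If the random forest does not clearly beat this number, the temporal features are adding little signal.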
5.2 Feature Importance Analysis
Identify which factors most influence after-hours healthcare queries:
# Get feature importance
importances = model.feature_importances_
feature_names = X.columns
# Create importance dataframe
importance_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
importance_df = importance_df.sort_values('importance', ascending=False)
print(importance_df.head(10))
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(importance_df.head(10)['feature'], importance_df.head(10)['importance'])
plt.xlabel('Importance')
plt.title('Top 10 Features Influencing After Hours Queries')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Why: Understanding feature importance helps prioritize interventions and resources where they're most needed.
6. Creating a Comprehensive Dashboard
6.1 Combine All Analysis into One View
Create a comprehensive dashboard that displays all insights:
# Create a summary report
summary = {
    'total_queries': len(df),
    'after_hours_queries': df['after_hours'].sum(),
    'after_hours_percentage': (df['after_hours'].sum() / len(df)) * 100,
    'average_hourly_queries': df.groupby('hour').size().mean(),
    'top_city': df['city'].value_counts().index[0],
    'top_query_type': df['query_type'].value_counts().index[0]
}
print("Health Query Analysis Summary:")
for key, value in summary.items():
    print(f"{key}: {value}")
Why: A comprehensive summary provides a quick overview of key insights that can inform decision-making.
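To feed the summary into a dashboard or report pipeline, you can persist it as JSON. A minimal sketch, with illustrative values standing in for the computed summary dict:

```python
import json

# Illustrative values standing in for the summary computed above
summary = {
    "total_queries": 600000,
    "after_hours_percentage": 54.2,
    "top_city": "Ruralville",
}

# Write the summary so other tools (or a web dashboard) can consume it
with open("health_query_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
print("Saved health_query_summary.json")
```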
Summary
In this tutorial, you've learned to work with health query data similar to what OpenAI analyzed in their study. You've processed and visualized geographic patterns, analyzed temporal trends, and applied machine learning techniques to predict after-hours healthcare demand. These skills are directly applicable to understanding healthcare access challenges in underserved areas, which is critical for improving public health outcomes. The techniques demonstrated here can be extended to real-world datasets, helping healthcare providers and policymakers make data-driven decisions about resource allocation and service delivery in areas with limited healthcare access.



