Introduction
In the wake of Coinbase's recent data center outage, it's crucial for engineers to understand how to build resilient systems that can handle hardware failures gracefully. This tutorial will teach you how to implement a basic fault-tolerant system using AWS services, specifically focusing on auto-scaling groups and load balancing to prevent single points of failure. We'll create a simple web application that can automatically recover from instance failures.
Prerequisites
- Basic understanding of AWS services (EC2, ELB, Auto Scaling)
- Active AWS account with appropriate permissions
- Basic knowledge of Python and Flask web framework
- Command line access to AWS CLI
Step-by-Step Instructions
1. Create a Simple Flask Application
First, we'll create a basic web application that will run on our EC2 instances. This application will help us demonstrate how the system handles failures.
# app.py
from flask import Flask
import socket
import time

app = Flask(__name__)

@app.route('/')
def home():
    # HOSTNAME is often unset outside an interactive shell, so ask the OS directly
    hostname = socket.gethostname()
    return (
        'Coinbase Fault-Tolerant System\n'
        f'Running on: {hostname}\n'
        f'Timestamp: {time.time()}\n'
    )

if __name__ == '__main__':
    # Binding to port 80 requires root privileges
    app.run(host='0.0.0.0', port=80)
Why this step? This creates a simple web application that we can deploy to EC2 instances. The application will display the hostname to help us identify which instance is serving requests during load balancing.
2. Create a Deployment Script
Next, we'll create a script to automate the deployment of our application to EC2 instances.
#!/bin/bash
# deploy.sh
# The shebang must be the first line so cloud-init runs this file as a script.
# Update package lists
sudo apt update
# Install Python and pip
sudo apt install -y python3 python3-pip
# Install Flask
pip3 install flask
# Create application directory
sudo mkdir -p /var/www/coinbase-app
# Copy application files (assumes app.py is already present in the working directory on the instance)
sudo cp app.py /var/www/coinbase-app/
# Set permissions
sudo chown -R ubuntu:ubuntu /var/www/coinbase-app
# Start the application in the background, sending stdout and stderr to the log
nohup python3 /var/www/coinbase-app/app.py > /var/log/app.log 2>&1 &
Why this step? This script automates the setup process on each EC2 instance, ensuring consistency across all servers in our auto-scaling group.
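Two details are worth checking before the script is passed as user data: cloud-init only executes user data as a shell script when it begins with `#!`, and EC2 caps raw user data at 16 KB. A minimal pre-flight check might look like the sketch below. The function name `check_user_data` is hypothetical and not part of any AWS tooling.

```python
from typing import List

MAX_USER_DATA_BYTES = 16 * 1024  # EC2 limit on raw (pre-base64) user data


def check_user_data(script: str) -> List[str]:
    """Return a list of problems found in a user-data script (empty if none)."""
    problems = []
    # cloud-init treats user data as a shell script only if it starts with "#!"
    if not script.startswith("#!"):
        problems.append("first line is not a shebang")
    size = len(script.encode("utf-8"))
    if size > MAX_USER_DATA_BYTES:
        problems.append(f"script is {size} bytes, over the {MAX_USER_DATA_BYTES}-byte limit")
    return problems


print(check_user_data("#!/bin/bash\necho hello\n"))
print(check_user_data("# deploy.sh\n#!/bin/bash\necho hello\n"))
```

The second call reports a problem because a comment line precedes the shebang.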
3. Launch EC2 Instances with Auto Scaling
We'll use AWS CLI to create an Auto Scaling group that maintains a minimum of 3 instances.
# Create launch configuration
aws autoscaling create-launch-configuration \
  --launch-configuration-name coinbase-lc \
  --image-id ami-0c047a86012500412 \
  --instance-type t3.medium \
  --key-name your-key-pair \
  --security-groups your-security-group \
  --user-data file://deploy.sh
# Create auto scaling group
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name coinbase-asg \
  --launch-configuration-name coinbase-lc \
  --min-size 3 \
  --max-size 10 \
  --desired-capacity 3 \
  --vpc-zone-identifier 'subnet-12345,subnet-67890'
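Why this step? This creates a baseline of 3 instances spread across two subnets. If an instance fails, the Auto Scaling group launches a replacement to keep the count at the minimum, which removes the kind of single point of failure that affected Coinbase. Scaling in response to demand is added in step 5.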
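Auto Scaling tries to balance instances evenly across the Availability Zones of the subnets it is given. Assuming `subnet-12345` and `subnet-67890` sit in different zones (the tutorial does not say), a rough model shows how many instances remain right after one zone is lost, before any replacements launch. `spread` and `worst_case_survivors` are illustrative helpers, not AWS APIs.

```python
def spread(instances: int, zones: int) -> list:
    """Distribute instances as evenly as possible across zones."""
    base, extra = divmod(instances, zones)
    # The first `extra` zones each receive one additional instance
    return [base + 1 if i < extra else base for i in range(zones)]


def worst_case_survivors(instances: int, zones: int) -> int:
    """Instances still running immediately after the fullest zone fails."""
    layout = spread(instances, zones)
    return instances - max(layout)


print(spread(3, 2))                # [2, 1]
print(worst_case_survivors(3, 2))  # 1
print(worst_case_survivors(3, 3))  # 2
```

Under this model, three instances over two zones can briefly leave a single instance serving all traffic, while adding a third subnet in another zone raises that floor to two.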
4. Set Up Load Balancer
Now we'll create an Application Load Balancer to distribute traffic across our instances.
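# Create target group
aws elbv2 create-target-group \
  --name coinbase-target-group \
  --protocol HTTP \
  --port 80 \
  --vpc-id vpc-12345 \
  --target-type instance
# Create load balancer
aws elbv2 create-load-balancer \
  --name coinbase-alb \
  --subnets subnet-12345 subnet-67890 \
  --scheme internet-facing \
  --type application
# Create a listener that forwards HTTP traffic to the target group
# (the load balancer ARN is a placeholder; use the one returned by the previous command)
aws elbv2 create-listener \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/coinbase-alb/1234567890123456 \
  --protocol HTTP \
  --port 80 \
  --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/coinbase-target-group/1234567890123456
# Attach the target group to the Auto Scaling group so that current and
# replacement instances are registered automatically
aws autoscaling attach-load-balancer-target-groups \
  --auto-scaling-group-name coinbase-asg \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/coinbase-target-group/1234567890123456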
Why this step? Load balancing distributes traffic across multiple instances. If one instance fails its health check, the load balancer stops routing to it and the remaining instances continue serving requests.
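Because each response includes a `Running on:` line, one quick way to confirm that the load balancer is rotating across instances is to tally that line over several responses. The sketch below works on response bodies already fetched by any HTTP client. `tally_hosts` is a hypothetical helper and the sample hostnames are made up.

```python
from collections import Counter


def tally_hosts(bodies):
    """Count how many responses each backend hostname served."""
    counts = Counter()
    for body in bodies:
        for line in body.splitlines():
            line = line.strip()
            if line.startswith("Running on:"):
                # Everything after the first colon is the hostname
                counts[line.split(":", 1)[1].strip()] += 1
    return counts


sample = [
    "Coinbase Fault-Tolerant System\nRunning on: ip-10-0-1-12\nTimestamp: 1.0\n",
    "Coinbase Fault-Tolerant System\nRunning on: ip-10-0-2-34\nTimestamp: 2.0\n",
    "Coinbase Fault-Tolerant System\nRunning on: ip-10-0-1-12\nTimestamp: 3.0\n",
]
print(tally_hosts(sample))
```

Seeing more than one hostname in the tally suggests that requests are reaching several instances.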
5. Configure Auto Scaling Policies
We'll set up a target tracking policy that keeps average CPU utilization near 70%, adding instances when it rises above that level and removing them when load drops.
# Create scaling policy
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name coinbase-asg \
  --policy-name cpu-scaling-policy \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":70.0}'
# Create CloudWatch alarm that notifies on sustained high CPU across the group
aws cloudwatch put-metric-alarm \
  --alarm-name instance-health-check \
  --alarm-description "High CPU across the Auto Scaling group" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --dimensions Name=AutoScalingGroupName,Value=coinbase-asg \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:coinbase-alerts
Why this step? The scaling policy ensures that the system adds instances when traffic increases. The CloudWatch alarm sends a notification when average CPU stays above 80% for two consecutive 5-minute periods. Replacing failed instances is handled by the Auto Scaling group's own health checks, which terminate and relaunch instances that fail EC2 status checks.
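Target tracking can be pictured as proportional control: when average CPU is above the target, capacity grows roughly in proportion, and it shrinks when CPU is well below. AWS does not publish its exact algorithm, so the sketch below is a simplified model only, using the group limits from step 3 (minimum 3, maximum 10) and the 70% target. `estimate_desired_capacity` is a hypothetical name.

```python
import math


def estimate_desired_capacity(current, avg_cpu, target=70.0, min_size=3, max_size=10):
    """Simplified proportional model of target tracking (not AWS's real algorithm)."""
    # Capacity needed so the same total load averages out to `target` percent
    needed = math.ceil(current * avg_cpu / target)
    # Clamp to the Auto Scaling group's configured bounds
    return max(min_size, min(max_size, needed))


print(estimate_desired_capacity(3, 95.0))  # 5
print(estimate_desired_capacity(5, 30.0))  # 3
print(estimate_desired_capacity(8, 99.0))  # 10
```

The last two calls show the clamping: the model never drops below the minimum of 3 or rises above the maximum of 10, whatever the CPU reading.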
6. Test the Fault Tolerance
Finally, we'll simulate a failure to verify our system's resilience.
# Simulate instance failure
aws ec2 terminate-instances \
  --instance-ids i-0123456789abcdef0
# Monitor the auto scaling group
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names coinbase-asg
# Check load balancer health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/coinbase-target-group/1234567890123456
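Why this step? Terminating an instance is a small-scale stand-in for the hardware failure behind Coinbase's outage. The Auto Scaling group should detect the missing instance and launch a replacement to return to the desired capacity of 3, while the load balancer keeps routing requests to the remaining healthy instances.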
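The `describe-target-health` command returns JSON with one entry per target under `TargetHealthDescriptions`. A small helper can turn that output into a pass/fail check during the test. The sketch assumes the CLI's default JSON output. `healthy_targets` and the sample payload are illustrative.

```python
import json


def healthy_targets(cli_output: str) -> list:
    """Return the IDs of targets whose reported state is 'healthy'."""
    data = json.loads(cli_output)
    return [
        d["Target"]["Id"]
        for d in data.get("TargetHealthDescriptions", [])
        if d.get("TargetHealth", {}).get("State") == "healthy"
    ]


# Made-up payload in the shape the CLI returns
sample = json.dumps({
    "TargetHealthDescriptions": [
        {"Target": {"Id": "i-0123456789abcdef1", "Port": 80},
         "TargetHealth": {"State": "healthy"}},
        {"Target": {"Id": "i-0123456789abcdef0", "Port": 80},
         "TargetHealth": {"State": "draining"}},
    ]
})
print(healthy_targets(sample))  # ['i-0123456789abcdef1']
```

Re-running the command every few seconds and feeding its output to the helper shows when the healthy count returns to three.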
Summary
This tutorial demonstrated how to build a fault-tolerant system that can handle hardware failures like the one that affected Coinbase. By implementing auto-scaling groups, load balancing, and proper health checks, we've created a system that automatically recovers from instance failures with little or no service interruption.
The key lessons learned include:
- Never rely on a single point of failure
- Use auto-scaling to handle both load and failure scenarios
- Implement proper monitoring and health checks
- Design systems with redundancy from the ground up
This approach addresses the class of failure behind Coinbase's outage and provides a framework for building more resilient infrastructure.



