Coinbase cut 700 jobs on Monday, lost $394 million on Thursday, and went dark on Friday because a data centre overheated

Learn how to build a fault-tolerant system using AWS auto-scaling groups and load balancing to prevent single points of failure like the one that affected Coinbase.

Introduction

In the wake of Coinbase's recent data center outage, it's crucial for engineers to understand how to build resilient systems that can handle hardware failures gracefully. This tutorial will teach you how to implement a basic fault-tolerant system using AWS services, specifically focusing on auto-scaling groups and load balancing to prevent single points of failure. We'll create a simple web application that can automatically recover from instance failures.

Prerequisites

Basic understanding of AWS services (EC2, ELB, Auto Scaling)
Active AWS account with appropriate permissions
Basic knowledge of Python and Flask web framework
Command line access to AWS CLI

Step-by-Step Instructions

1. Create a Simple Flask Application

First, we'll create a basic web application that will run on our EC2 instances. This application will help us demonstrate how the system handles failures.

# app.py
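from flask import Flask
import os
import socket
import time

app = Flask(__name__)

@app.route('/')
def home():
    # HOSTNAME is often not exported to child processes, so fall back to the OS hostname
    hostname = os.environ.get('HOSTNAME') or socket.gethostname()
    body = (
        'Coinbase Fault-Tolerant System\n'
        f'Running on: {hostname}\n'
        f'Timestamp: {time.time()}\n'
    )
    # Serve as plain text so the line breaks survive in a browser
    return body, 200, {'Content-Type': 'text/plain; charset=utf-8'}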

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)

Why this step? This creates a simple web application that we can deploy to EC2 instances. The application will display the hostname to help us identify which instance is serving requests during load balancing.
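To see the hostname line doing its job, the responses collected from repeated requests can be tallied per instance. The sketch below is a minimal, standard-library illustration. The function name count_hosts and the sample hostnames are hypothetical, and it assumes each response body contains a "Running on: <hostname>" line as in the app above.

```python
from collections import Counter


def count_hosts(bodies):
    """Tally how many responses each instance served.

    Assumes every body contains a line of the form 'Running on: <hostname>',
    as produced by the Flask app in this step.
    """
    counts = Counter()
    for body in bodies:
        for line in body.splitlines():
            if line.startswith("Running on:"):
                counts[line.split(":", 1)[1].strip()] += 1
                break
    return counts


# Hypothetical response bodies gathered from three requests.
sample = [
    "Coinbase Fault-Tolerant System\nRunning on: ip-10-0-1-12\nTimestamp: 1.0",
    "Coinbase Fault-Tolerant System\nRunning on: ip-10-0-2-34\nTimestamp: 2.0",
    "Coinbase Fault-Tolerant System\nRunning on: ip-10-0-1-12\nTimestamp: 3.0",
]
print(count_hosts(sample))
```

A roughly even spread across hostnames suggests traffic is being distributed; a single hostname suggests only one instance is receiving requests.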

2. Create a Deployment Script

Next, we'll create a script to automate the deployment of our application to EC2 instances.

#!/bin/bash
# deploy.sh (the shebang must be the first line for EC2 to run this as user data)

# Update system
sudo apt update

# Install Python and pip
sudo apt install -y python3 python3-pip

# Install Flask
pip3 install flask

# Create application directory
sudo mkdir -p /var/www/coinbase-app

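# Copy application files
# (app.py must already be on the instance, e.g. baked into the AMI or fetched earlier in this script)
sudo cp app.py /var/www/coinbase-app/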

# Set permissions
sudo chown -R ubuntu:ubuntu /var/www/coinbase-app

# Start the application in the background, sending stdout and stderr to the log
nohup python3 /var/www/coinbase-app/app.py > /var/log/app.log 2>&1 &

Why this step? This script automates the setup process on each EC2 instance, ensuring consistency across all servers in our auto-scaling group.

3. Launch EC2 Instances with Auto Scaling

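We'll use the AWS CLI to create an Auto Scaling group that keeps at least 3 instances running. The example uses a launch configuration; AWS now recommends launch templates for new deployments, so treat this as a starting point.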

# Create launch configuration
aws autoscaling create-launch-configuration \
    --launch-configuration-name coinbase-lc \
    --image-id ami-0c047a86012500412 \
    --instance-type t3.medium \
    --key-name your-key-pair \
    --security-groups your-security-group \
    --user-data file://deploy.sh

# Create auto scaling group
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name coinbase-asg \
    --launch-configuration-name coinbase-lc \
    --min-size 3 \
    --max-size 10 \
    --desired-capacity 3 \
    --vpc-zone-identifier 'subnet-12345,subnet-67890'

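Why this step? This creates a baseline of 3 instances. If an instance fails its health check, the group terminates and replaces it, and placing the two subnets in different Availability Zones means the loss of a single zone does not take down every instance. Scaling in response to demand comes from the policy added in step 5.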
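The replacement behaviour can be pictured with a small simulation. This is a simplified model under stated assumptions, not AWS's actual placement or health-check logic: the reconcile function, the "sim-" instance IDs, and the zone names are all hypothetical.

```python
from itertools import count

_ids = count(1)  # source of fake instance IDs for the simulation


def reconcile(instances, desired, zones):
    """Return the fleet after one simplified Auto Scaling pass.

    instances: list of dicts with 'id', 'zone', and 'healthy' keys.
    Unhealthy instances are dropped, then replacements are launched in the
    least-populated zone until the fleet reaches `desired`.
    """
    fleet = [i for i in instances if i["healthy"]]
    while len(fleet) < desired:
        zone = min(zones, key=lambda z: sum(i["zone"] == z for i in fleet))
        fleet.append({"id": f"sim-{next(_ids)}", "zone": zone, "healthy": True})
    return fleet


zones = ["zone-a", "zone-b"]
fleet = reconcile([], desired=3, zones=zones)      # initial launch
fleet[0]["healthy"] = False                        # simulate a failure
fleet = reconcile(fleet, desired=3, zones=zones)   # group replaces it
print([(i["id"], i["zone"]) for i in fleet])
```

The point of the model is that the group works toward a desired count rather than tracking individual servers, which is why losing any one instance is recoverable.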

4. Set Up Load Balancer

Now we'll create an Application Load Balancer to distribute traffic across our instances.

# Create target group
aws elbv2 create-target-group \
    --name coinbase-target-group \
    --protocol HTTP \
    --port 80 \
    --vpc-id vpc-12345 \
    --target-type instance

# Create load balancer
aws elbv2 create-load-balancer \
    --name coinbase-alb \
    --subnets subnet-12345 subnet-67890 \
    --scheme internet-facing \
    --type application

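# Register the current instances with the target group
# (instance IDs are placeholders; separate multiple targets with spaces)
aws elbv2 register-targets \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/coinbase-target-group/1234567890123456 \
    --targets Id=i-0123456789abcdef0 Id=i-0123456789abcdef1

# Attach the target group to the Auto Scaling group so replacement instances register automatically
aws autoscaling attach-load-balancer-target-groups \
    --auto-scaling-group-name coinbase-asg \
    --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/coinbase-target-group/1234567890123456

# Create a listener so the load balancer forwards HTTP traffic to the target group
# (the load balancer ARN is a placeholder; use the one returned by create-load-balancer)
aws elbv2 create-listener \
    --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/coinbase-alb/1234567890123456 \
    --protocol HTTP \
    --port 80 \
    --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/coinbase-target-group/1234567890123456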

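Why this step? Load balancing distributes traffic across multiple instances, so if one instance fails its health check the load balancer stops routing to it and the others keep serving requests. When the two subnets sit in different Availability Zones, the same mechanism also covers the loss of a whole data center, which is closer to what happened to Coinbase.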
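The routing idea can be sketched in a few lines. The example below models round robin, one of the routing algorithms an Application Load Balancer supports, and skips targets marked unhealthy. The route function and the target names are hypothetical and exist only for illustration.

```python
from itertools import cycle


def route(targets, health, requests):
    """Assign each request to the next healthy target in round-robin order.

    targets: ordered list of target names.
    health: dict mapping target name -> bool (True means healthy).
    Returns the list of targets chosen, one per request.
    """
    if not any(health.get(t, False) for t in targets):
        raise RuntimeError("no healthy targets available")
    chosen = []
    ring = cycle(targets)
    while len(chosen) < requests:
        target = next(ring)
        if health.get(target, False):
            chosen.append(target)
    return chosen


targets = ["instance-a", "instance-b", "instance-c"]
health = {"instance-a": True, "instance-b": False, "instance-c": True}
print(route(targets, health, 4))
```

With instance-b marked unhealthy, requests alternate between the two remaining targets, and callers never see the failed one.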

5. Configure Auto Scaling Policies

We'll set up a target tracking policy that adds or removes instances to keep the group's average CPU utilization near 70%.

# Create scaling policy
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name coinbase-asg \
    --policy-name cpu-scaling-policy \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":70.0}'

# Create a CloudWatch alarm that notifies on sustained high CPU across the group
aws cloudwatch put-metric-alarm \
    --alarm-name instance-health-check \
    --alarm-description "Average CPU above 80% for two 5-minute periods" \
    --metric-name CPUUtilization \
    --namespace AWS/EC2 \
    --dimensions Name=AutoScalingGroupName,Value=coinbase-asg \
    --statistic Average \
    --period 300 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 2 \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:coinbase-alerts

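Why this step? The target tracking policy adds instances when traffic pushes CPU up and removes them when it falls. The CloudWatch alarm only sends a notification on sustained high CPU; it is not a health check. Replacing failed instances is handled by the Auto Scaling group itself, which by default uses EC2 status checks and can also use the load balancer's health checks if the group's health check type is set to ELB.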
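As a rough mental model of target tracking, capacity can be thought of as scaling in proportion to how far the metric is from its target. The sketch below is a simplified proportional rule, not AWS's actual algorithm, and the function name target_tracking_capacity is hypothetical. The limits of 3 and 10 match the group created in step 3.

```python
import math


def target_tracking_capacity(current, metric, target, min_size, max_size):
    """Estimate desired capacity with a simple proportional rule.

    Simplified model, not AWS's actual algorithm: scale the fleet by
    metric / target, round up, then clamp to the group's size limits.
    """
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    proposed = math.ceil(current * metric / target)
    return max(min_size, min(max_size, proposed))


# 3 instances at 90% average CPU against a 70% target
print(target_tracking_capacity(3, 90.0, 70.0, min_size=3, max_size=10))
# 3 instances at 35% CPU: the minimum size keeps the group from shrinking
print(target_tracking_capacity(3, 35.0, 70.0, min_size=3, max_size=10))
```

The clamp is the part that matters for fault tolerance: however low the load drops, the group never goes below its minimum of 3 instances.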

6. Test the Fault Tolerance

Finally, we'll simulate a failure to verify our system's resilience.

# Simulate instance failure
aws ec2 terminate-instances \
    --instance-ids i-0123456789abcdef0

# Monitor the auto scaling group
aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names coinbase-asg

# Check load balancer health
aws elbv2 describe-target-health \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/coinbase-target-group/1234567890123456

Why this step? Terminating an instance is a small-scale stand-in for the hardware failure that affected Coinbase. The Auto Scaling group should notice that it is below its desired capacity and launch a replacement, while the load balancer keeps sending traffic to the remaining healthy instances.
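Rather than rerunning the health command by hand, recovery can be checked with a small polling loop. The sketch below parses the JSON that describe-target-health returns and waits until enough targets report "healthy". The helper names, the fetch callable, and the fake responses are hypothetical; in practice fetch might shell out to the AWS CLI.

```python
import json
import time


def healthy_count(describe_output):
    """Count targets whose state is 'healthy' in describe-target-health JSON."""
    data = json.loads(describe_output)
    return sum(
        1
        for d in data.get("TargetHealthDescriptions", [])
        if d.get("TargetHealth", {}).get("State") == "healthy"
    )


def wait_for_healthy(fetch, expected, attempts=20, delay=15, sleep=time.sleep):
    """Poll until at least `expected` targets are healthy.

    fetch: zero-argument callable returning describe-target-health JSON text.
    Returns True on success, False if every attempt falls short.
    """
    for attempt in range(attempts):
        if healthy_count(fetch()) >= expected:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


def fake_output(states):
    """Build a stand-in for the CLI's JSON output from a list of states."""
    return json.dumps({"TargetHealthDescriptions": [
        {"Target": {"Id": f"i-{n}"}, "TargetHealth": {"State": s}}
        for n, s in enumerate(states)
    ]})


# Simulated recovery: one target drains, a replacement initialises, then passes.
responses = iter([
    fake_output(["healthy", "healthy", "draining"]),
    fake_output(["healthy", "healthy", "initial"]),
    fake_output(["healthy", "healthy", "healthy"]),
])
recovered = wait_for_healthy(lambda: next(responses), expected=3, sleep=lambda s: None)
print(recovered)
```

Bounding the number of attempts keeps the check from hanging forever if the replacement instance never becomes healthy.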

Summary

This tutorial demonstrated how to build a fault-tolerant system that can handle hardware failures like the one that affected Coinbase. By implementing auto-scaling groups, load balancing, and proper health checks, we've created a system that recovers from instance failures automatically, with little or no interruption as long as healthy instances remain.

The key lessons learned include:

Never rely on a single point of failure
Use auto-scaling to handle both load and failure scenarios
Implement proper monitoring and health checks
Design systems with redundancy from the ground up

This approach targets the kind of single point of failure behind Coinbase's outage and provides a starting framework for building more resilient infrastructure.
