Ubuntu infrastructure has been down for more than a day

Learn to set up a comprehensive monitoring system for Ubuntu infrastructure using Prometheus and Grafana to prevent extended outages and maintain system availability.

Introduction

In this tutorial, you'll learn how to set up and configure a robust monitoring system for Ubuntu infrastructure using Prometheus and Grafana. This is particularly important given recent outages like the one described in the Ars Technica article, where infrastructure downtime affected critical vulnerability communications. By implementing proper monitoring, you'll be able to detect issues before they escalate and maintain better visibility into your Ubuntu systems.

Prerequisites

Ubuntu 20.04 or 22.04 server with sudo privileges
Basic understanding of Linux command line and networking
Docker and Docker Compose (Step 1 covers installation if they're missing)
At least 4GB RAM available for containerized services

Step 1: Install Docker and Docker Compose

First, we need to set up the containerization environment that will run our monitoring stack. Docker allows us to easily deploy and manage our monitoring services without worrying about system dependencies.

Install Docker

sudo apt update
sudo apt install -y docker.io
sudo systemctl enable --now docker
sudo usermod -aG docker "$USER"   # log out and back in so docker runs without sudo
docker --version

Install Docker Compose

sudo apt install -y docker-compose

Why: Docker provides isolated environments for our monitoring services, ensuring they don't conflict with existing system software and making deployment consistent across different environments.
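Before moving on, it can help to confirm that both tools are actually on the PATH. The snippet below is a minimal sketch; `require_cmd` is a hypothetical helper name introduced here for illustration, not part of Docker.

```shell
# require_cmd NAME: succeed if NAME is an executable on the PATH (hypothetical helper).
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check the two tools this tutorial relies on; report instead of aborting.
for tool in docker docker-compose; do
  require_cmd "$tool" || echo "Install $tool before continuing (see Step 1)."
done
```

This only checks that the commands exist; it does not verify that the Docker daemon is running or that your user may talk to it.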

Step 2: Create Monitoring Directory Structure

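We'll create a dedicated directory for our monitoring configuration files and data storage. The data directories are bind-mounted into the containers later, so they must be writable by the users those containers run as. The IDs below (65534 for Prometheus, 472 for Grafana) are, to our knowledge, the defaults of the official images; verify them if you pin different versions.

Create Project Directory

mkdir -p ubuntu-monitoring/{prometheus,grafana,data/prometheus,data/grafana}
cd ubuntu-monitoring
sudo chown -R 65534:65534 data/prometheus
sudo chown -R 472:472 data/grafana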

Create Prometheus Configuration

cat > prometheus/prometheus.yml << 'EOF'
# Global config
global:
  scrape_interval:     15s
  evaluation_interval: 15s

# Alerting configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# Scrape configurations: Prometheus itself and the node exporter
scrape_configs:
  # The job name is added as the label job=<job_name> to any time series scraped from this config
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'ubuntu-server'
    static_configs:
      # Use the Compose service name: inside the Prometheus container, localhost is the container itself
      - targets: ['node-exporter:9100']
EOF

Why: This configuration tells Prometheus where to scrape metrics from: its own endpoint and the node exporter, which exposes system metrics for our Ubuntu server.
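A quick way to double-check what Prometheus will scrape is to pull the targets back out of the file you just wrote. This is a rough sketch: `list_targets` is a hypothetical helper, and it only understands the inline `targets: [...]` style used in this tutorial.

```shell
# list_targets FILE: print every host:port found in "targets: [...]" lines.
# Hypothetical helper; it does not parse YAML in general.
list_targets() {
  grep -oE "targets: *\[[^]]*\]" "$1" | grep -oE "'[^']+'" | tr -d "'"
}

# Run it against the tutorial's config if it exists in the current directory.
config="prometheus/prometheus.yml"
if [ -f "$config" ]; then
  list_targets "$config"
fi
```

For a full syntax check once the stack is running, the Prometheus image ships promtool, so running `promtool check config /etc/prometheus/prometheus.yml` inside the container is an option.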

Step 3: Set Up Docker Compose File

Now we'll define our monitoring stack using Docker Compose, which will orchestrate all our services together.

Create Docker Compose Configuration

cat > docker-compose.yml << EOF
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./data/prometheus:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=24h'
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - ./data/grafana:/var/lib/grafana
    restart: unless-stopped
EOF

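Why: Docker Compose allows us to define and run multi-container Docker applications with a single command, making it easy to deploy our entire monitoring stack. Compose also places the services on a shared network where each one is reachable by its service name, which is why Grafana can later use http://prometheus:9090. One caveat: run this way, node-exporter still reads CPU and memory figures from the host's /proc, but its filesystem and network metrics describe the container. To cover those as well, the node exporter documentation describes mounting the host's root filesystem and using host networking.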

Step 4: Start the Monitoring Stack

With our configuration files in place, we can now start all our monitoring services.

Launch Services

docker-compose up -d

Verify Services Are Running

docker ps
# Expected output should show prometheus, node-exporter, and grafana containers running

Why: This command starts all services in detached mode, allowing them to run in the background while we continue working. The verification step ensures all containers are properly initialized.
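A container that shows as running is not necessarily ready to serve requests, so polling each service over HTTP is a useful second check. The sketch below assumes curl is installed; `wait_for_url`, `HTTP_GET`, and `WAIT_SECONDS` are hypothetical names introduced for this example.

```shell
# wait_for_url URL [TRIES]: poll URL until it returns a successful response, or give up.
# Hypothetical helper. HTTP_GET and WAIT_SECONDS are overridable assumptions,
# not standard environment variables.
HTTP_GET="${HTTP_GET:-curl -fsS -o /dev/null}"

wait_for_url() {
  url="$1"
  tries="${2:-10}"
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    # HTTP_GET is intentionally unquoted so its options split into separate words.
    if $HTTP_GET "$url" 2>/dev/null; then
      echo "up: $url"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "${WAIT_SECONDS:-2}"
  done
  echo "down: $url" >&2
  return 1
}
```

With the stack from this tutorial you could then run `wait_for_url http://localhost:9090/-/healthy` for Prometheus, `wait_for_url http://localhost:3000/api/health` for Grafana, and `wait_for_url http://localhost:9100/metrics` for the node exporter.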

Step 5: Configure Grafana Dashboard

Now we'll set up Grafana to visualize our Ubuntu system metrics.

Access Grafana Web Interface

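Open your browser and navigate to http://localhost:3000 (replace localhost with the server's address if you're connecting from another machine). The default login is admin/admin, and Grafana prompts you to change the password on first login.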

Add Prometheus Data Source

Click on the gear icon (Configuration) in the left sidebar; in newer Grafana releases, you may need to open "Connections" from the main menu instead
Select "Data Sources"
Click "Add data source"
Select "Prometheus"
Set URL to http://prometheus:9090
Click "Save & Test"

Why: Grafana needs to know where to fetch metrics from, and Prometheus serves as our time-series database for system monitoring data.
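If you'd rather script this step than click through the UI, Grafana also accepts data sources through its HTTP API. The sketch below only builds the request body; `datasource_payload` is a hypothetical helper, and the field values mirror the UI steps above.

```shell
# datasource_payload NAME URL: print the JSON body for a Prometheus data source.
# Hypothetical helper; it does no escaping, so keep NAME and URL free of quotes.
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","access":"proxy","url":"%s","isDefault":true}\n' "$1" "$2"
}

datasource_payload "Prometheus" "http://prometheus:9090"
```

Assuming the admin password is still the default, the payload could be sent with `datasource_payload Prometheus http://prometheus:9090 | curl -fsS -u admin:admin -H 'Content-Type: application/json' -X POST --data @- http://localhost:3000/api/datasources`.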

Step 6: Create Ubuntu System Dashboard

We'll create a custom dashboard to monitor key Ubuntu infrastructure metrics.

Create New Dashboard

Click the "+" icon in the left sidebar
Select "Dashboard"
Click "Add new panel"
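Set Query to: sum by (mode) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) (the metric is a cumulative counter, so rate() is needed to turn it into usage)
Change the panel type to "Graph" ("Time series" in newer Grafana versions)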
Set Title to "CPU Usage by Mode"

Add Memory Usage Panel

Click "Add panel"
Set Query to: node_memory_MemAvailable_bytes
Change the panel type to "Gauge"
Set Title to "Available Memory"

Why: Creating custom dashboards allows you to focus on the most critical metrics for your Ubuntu infrastructure, making it easier to spot potential issues before they cause outages like the one mentioned in the news article.
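To see what a CPU panel is measuring, it helps to compute the same figure by hand. The node exporter's CPU counters come from /proc/stat, and usage is the change between two readings, not the raw value. The sketch below illustrates that idea; `cpu_busy_pct` is a hypothetical helper, and it treats everything except idle time as busy.

```shell
# cpu_busy_pct SNAP1 SNAP2: each argument is the aggregate "cpu ..." line of /proc/stat,
# taken at two different times. Prints the share of non-idle time between them.
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      # Fields 2-9 are user, nice, system, idle, iowait, irq, softirq, steal.
      total = 0
      for (i = 2; i <= 9 && i <= NF; i++) total += $i
      idle = $5
    }
    NR == 1 { t1 = total; idle1 = idle }
    NR == 2 && total > t1 {
      printf "%.1f\n", 100 * (1 - (idle - idle1) / (total - t1))
    }'
}

# Two made-up snapshots: 1000 ticks elapsed, 850 of them idle.
cpu_busy_pct "cpu 100 0 100 800 0 0 0 0 0 0" "cpu 200 0 150 1650 0 0 0 0 0 0"
```

On a live system you could capture the snapshots with `a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)` and pass both to the function.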

Step 7: Set Up Alerting Rules

To proactively detect issues, we'll configure alerting rules in Prometheus.

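Update Prometheus Configuration

The configuration from Step 2 already contains a rule_files key, and a YAML mapping can't define the same key twice, so enable the rules file by editing the existing entry rather than appending a new one:

sed -i 's|^  # - "first_rules.yml"|  - "alert.rules.yml"|' prometheus/prometheus.yml

Create Alert Rules File

cat > prometheus/alert.rules.yml << 'EOF'
groups:
- name: ubuntu-alerts
  rules:
  - alert: HighCPUUsage
    # Average non-idle share across all cores, per instance
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage has been above 80% for more than 2 minutes"

  - alert: LowMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Low memory warning"
      description: "Available memory is below 10% for more than 5 minutes"
EOF

Mount the Rules File

So far only prometheus.yml is mounted into the container. Add the rules file to the volumes list of the prometheus service in docker-compose.yml:

      - ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml

Restart Prometheus to Apply Rules

A plain restart would not pick up the new volume, so let Compose recreate the container:

docker-compose up -d prometheus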

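Why: Alerting rules make Prometheus flag metrics that cross predefined thresholds, helping you identify and respond to issues before they cause infrastructure outages. Firing alerts are visible on the Alerts page at http://localhost:9090/alerts. To have notifications sent by email or chat, you would additionally need to deploy Alertmanager and uncomment its target in prometheus.yml.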
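The LowMemory threshold is easy to sanity-check by hand, because the node exporter's memory metrics are read from /proc/meminfo. The sketch below reproduces the same ratio locally; `mem_available_pct` is a hypothetical helper, and the 10% cutoff simply mirrors the rule above.

```shell
# mem_available_pct [FILE]: print MemAvailable as a percentage of MemTotal.
# FILE defaults to /proc/meminfo, where both fields are reported in kB.
mem_available_pct() {
  awk '/^MemTotal:/ { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%.1f\n", 100 * avail / total }' "${1:-/proc/meminfo}"
}

# Mirror the LowMemory threshold: flag anything under 10%.
if [ -r /proc/meminfo ]; then
  pct=$(mem_available_pct)
  if awk -v p="$pct" 'BEGIN { exit !(p < 10) }'; then
    echo "below the LowMemory threshold: ${pct}% available"
  else
    echo "memory ok: ${pct}% available"
  fi
fi
```

Unlike the Prometheus rule, this is a single reading; the rule only fires after the condition has held for the full five minutes.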

Summary

In this tutorial, you've set up a comprehensive monitoring solution for Ubuntu infrastructure using Prometheus and Grafana. You've learned how to:

Install and configure Docker and Docker Compose
Create a multi-container monitoring stack
Configure Prometheus to scrape system metrics
Set up Grafana for visualization
Create custom alerting rules for critical infrastructure metrics

This monitoring setup gives you early warning of the kind of problems that can turn into extended outages like the one that recently affected Ubuntu infrastructure. By watching CPU usage, memory consumption, and other key metrics, you can detect and address issues before they escalate into major problems that impact critical vulnerability communications and system availability.