Implementing Deep Q-Learning (DQN) from Scratch Using RLax, JAX, Haiku, and Optax to Train a CartPole Reinforcement Learning Agent


March 22, 2026 · 4 min read

Learn to build a Deep Q-Network (DQN) reinforcement learning agent from scratch using JAX, RLax, Haiku, and Optax to solve the CartPole environment.

Introduction

In this beginner-friendly tutorial, you'll learn how to build a Deep Q-Network (DQN) reinforcement learning agent from scratch using tools from the Google and DeepMind JAX ecosystem: JAX, RLax, Haiku, and Optax. We'll train our agent to solve the classic CartPole problem, where the goal is to balance a pole on a moving cart without letting it fall. This tutorial is designed for complete beginners who want to understand how reinforcement learning works under the hood, rather than just using pre-built libraries.

Prerequisites

Before starting, ensure you have the following installed:

  • Python 3.9+ (required by recent JAX releases)
  • JAX (Google's XLA-based machine learning library)
  • RLax (DeepMind's reinforcement learning library built on JAX)
  • Haiku (JAX-based neural network library)
  • Optax (Optimization library for JAX)
  • OpenAI Gym (for the CartPole environment)

You can install these using pip:

pip install jax jaxlib rlax dm-haiku optax gym

Step-by-Step Instructions

1. Import Required Libraries

First, we need to import all the necessary libraries to build our DQN agent.

import jax
import jax.numpy as jnp
import rlax
import haiku as hk
import optax
import gym
import numpy as np
import random

Why? These libraries provide the core functionality for our reinforcement learning agent. JAX handles the numerical operations and automatic differentiation, RLax provides reinforcement learning utilities, Haiku helps define neural networks, Optax manages optimization, and Gym provides the CartPole environment.
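
To get a feel for what JAX contributes, here is a tiny, self-contained example of automatic differentiation with `jax.grad` (the toy function is purely illustrative and not part of the agent):

```python
import jax

# A toy scalar function: f(x) = x^2 + 3x
def f(x):
    return x ** 2 + 3.0 * x

# jax.grad returns a new function that computes df/dx
df = jax.grad(f)

print(f(2.0))   # 4 + 6 = 10
print(df(2.0))  # df/dx = 2x + 3 = 7 at x = 2
```

The same `jax.grad` mechanism is what lets us differentiate the DQN loss with respect to every network parameter later on.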

2. Set Up the CartPole Environment

We'll create an instance of the CartPole environment using OpenAI Gym.

env = gym.make('CartPole-v1')
state, info = env.reset()  # Gym >= 0.26 returns (observation, info)
print(f"Initial state shape: {state.shape}")

Why? The CartPole environment gives us a simple but challenging reinforcement learning task. The agent receives a state (4 values representing cart position, velocity, pole angle, and angular velocity) and must choose between two actions (0 or 1) to keep the pole balanced.

3. Define the DQN Network with Haiku

We'll create a neural network using Haiku that will approximate the Q-values for each action.

def q_network_fn(observation):
    # Define the network architecture
    net = hk.Sequential([
        hk.Linear(128),
        jax.nn.relu,
        hk.Linear(128),
        jax.nn.relu,
        hk.Linear(env.action_space.n)
    ])
    return net(observation)

# Wrap the network with Haiku
q_network = hk.without_apply_rng(hk.transform(q_network_fn))

Why? Haiku allows us to define neural networks in a clean and functional way. This network will take the current state as input and output Q-values for each possible action. The architecture has two hidden layers with 128 neurons each, followed by a linear output layer matching the number of actions.

4. Initialize the Agent and Optimizer

We'll create our DQN agent and set up the optimizer for training.

# Initialize parameters
key = jax.random.PRNGKey(42)
observation, _ = env.reset()  # Gym >= 0.26 returns (observation, info)
observation = jnp.asarray(observation)
params = q_network.init(key, observation)

# Set up optimizer
optimizer = optax.adam(learning_rate=1e-3)
opt_state = optimizer.init(params)

Why? We initialize the neural network parameters randomly and set up the Adam optimizer, which will help us update the network weights during training. The PRNGKey ensures reproducibility of our results.

5. Implement the DQN Training Loop

Now we'll create the main training loop where our agent learns to balance the pole.

def loss_fn(params, state, action, reward, discount, next_state):
    # Squared TD error for a single transition, via RLax's Q-learning op
    q_tm1 = q_network.apply(params, state)
    q_t = q_network.apply(params, next_state)
    td_error = rlax.q_learning(q_tm1, action, reward, discount, q_t)
    return jnp.square(td_error)

@jax.jit
def update(params, opt_state, state, action, reward, discount, next_state):
    grads = jax.grad(loss_fn)(params, state, action, reward, discount, next_state)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state

def train_agent(episodes=1000, epsilon=0.1, gamma=0.99):
    global params, opt_state  # updated in place as training proceeds

    for episode in range(episodes):
        state, _ = env.reset()
        state = jnp.asarray(state)
        total_reward = 0.0
        done = False

        while not done:
            # Select action using epsilon-greedy
            if random.random() < epsilon:  # Exploration
                action = env.action_space.sample()
            else:  # Exploitation
                q_values = q_network.apply(params, state)
                action = int(jnp.argmax(q_values))

            # Take action and observe (Gym >= 0.26 returns a 5-tuple)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_state = jnp.asarray(next_state)
            total_reward += reward

            # One-step Q-learning update on this transition
            # (a full DQN would add experience replay and a target network)
            discount = 0.0 if terminated else gamma
            params, opt_state = update(
                params, opt_state, state, jnp.asarray(action),
                jnp.asarray(reward, dtype=jnp.float32),
                jnp.asarray(discount, dtype=jnp.float32), next_state)

            state = next_state

        if episode % 100 == 0:
            print(f"Episode {episode}, Total Reward: {total_reward}")
Why? This loop simulates the agent interacting with the environment. It balances exploration (random actions) and exploitation (using learned Q-values). We print progress every 100 episodes to monitor learning.
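
The loop above uses a fixed exploration rate of 0.1. In practice, DQN implementations usually anneal epsilon from a high starting value down to a small floor, so the agent explores heavily early on and exploits more as its Q-estimates improve. A minimal linear-decay sketch (the schedule constants here are illustrative, not from the original code):

```python
def epsilon_schedule(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from start to end over decay_steps steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)

print(epsilon_schedule(0))       # 1.0 at the first step
print(epsilon_schedule(5_000))   # halfway between start and end
print(epsilon_schedule(20_000))  # clamped at the floor value
```

Inside the training loop, you would replace the constant exploration rate with `epsilon_schedule(global_step)`, incrementing `global_step` once per environment step.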

6. Run the Training

Finally, let's execute our training function.

train_agent(1000)

Why? This runs the agent for 1000 episodes, printing the total reward every 100 episodes so you can monitor progress. Without experience replay and a target network, learning can be noisy, so don't expect a smooth upward curve, but episode rewards should trend higher as the Q-network improves.

Summary

In this tutorial, you've learned how to implement a basic DQN agent using Google's powerful JAX ecosystem. You've seen how to:

  • Set up the CartPole environment
  • Define a neural network with Haiku
  • Initialize parameters and an optimizer with Optax
  • Implement a simple training loop

This foundation provides a stepping stone to more advanced reinforcement learning concepts. While this simplified version doesn't include full DQN features like experience replay and target networks, it demonstrates the core components of how reinforcement learning agents learn from their environment.
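
As a pointer toward one of those missing pieces, experience replay is usually just a fixed-capacity buffer of past transitions that is sampled uniformly at training time, which breaks the correlation between consecutive transitions that destabilises naive Q-learning. A minimal sketch using only the standard library (the class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity=50_000):
        # deque with maxlen silently evicts the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch of stored transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=100)
for i in range(150):
    buffer.add(i, 0, 1.0, i + 1, False)

print(len(buffer))  # 100: the 50 oldest transitions were evicted
batch = buffer.sample(32)
print(len(batch))   # 32
```

In a full DQN, the training loop would push every transition into such a buffer and, once it holds enough data, update the network on sampled minibatches instead of on each transition as it arrives.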

Source: MarkTechPost
