Introduction
In this tutorial, you'll learn how to use FlashKDA, an open-source, high-performance implementation of Kimi Delta Attention developed by Moonshot AI. FlashKDA leverages CUTLASS kernels to accelerate attention computation in large language models; it is particularly strong with variable-length batching and shows significant speedups in H20 benchmarks. You'll set up the environment, explore the key components, and run benchmark tests to understand how FlashKDA can accelerate your attention workloads.
Prerequisites
- Basic understanding of attention mechanisms in transformers
- Python 3.8 or higher
- NVIDIA GPU with CUDA support (recommended: H20 or A100)
- PyTorch installed with CUDA support (a quick verification snippet follows this list)
- Git installed for cloning repositories
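Before moving on, you can confirm the Python, PyTorch, and GPU prerequisites with a short check like the one below. This is a minimal sketch using only standard PyTorch calls; it does not depend on FlashKDA itself.
import sys
import torch

# Confirm the Python version meets the 3.8+ requirement
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")

# Confirm PyTorch was built with CUDA support and a GPU is visible
print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")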
Step-by-Step Instructions
1. Clone the FlashKDA Repository
The first step is to get the source code. FlashKDA is open-sourced and available on GitHub. Cloning the repository gives you access to all the necessary files and benchmarks.
git clone https://github.com/MoonshotAI/FlashKDA.git
Why: This step ensures you have the latest implementation of the CUTLASS kernels for Kimi Delta Attention and all the necessary benchmarking scripts.
2. Set Up the Environment
FlashKDA requires a specific Python environment with dependencies. We recommend using a virtual environment to avoid conflicts.
cd FlashKDA
python -m venv flashkda_env
source flashkda_env/bin/activate # On Windows: flashkda_env\Scripts\activate
pip install -r requirements.txt
Why: Setting up a dedicated environment isolates the dependencies required for FlashKDA, ensuring that your system's Python setup remains unaffected.
3. Install CUTLASS and PyTorch Extensions
FlashKDA is built on top of CUTLASS kernels, so you need to install these components. The installation process involves building the CUTLASS kernels and installing PyTorch extensions.
pip install flash-attn --no-build-isolation
Why: CUTLASS provides the optimized attention kernels, and the --no-build-isolation flag builds the PyTorch extension against your already-installed PyTorch rather than an isolated build environment, keeping the compiled kernels compatible with your PyTorch and CUDA versions.
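Once the install finishes, a quick import check confirms that the extension loads against your PyTorch and CUDA setup. This is a minimal sketch that only verifies the packages import cleanly:
import torch
import flash_attn

# If the build failed or is incompatible with your PyTorch, this import raises an error
print(f"torch {torch.__version__} (CUDA {torch.version.cuda})")
print(f"flash_attn {flash_attn.__version__}")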
4. Explore the Benchmark Scripts
FlashKDA includes benchmarking scripts to evaluate performance improvements. These scripts help you understand how FlashKDA compares to standard attention mechanisms.
ls benchmarks/
Key files include benchmark_kimi_delta.py, benchmark_standard_attention.py, and plot_results.py.
Why: Benchmarking scripts are crucial for validating performance gains. They allow you to measure and visualize how FlashKDA improves attention computation speed and memory usage.
5. Run a Simple Benchmark
Now, let's run a basic benchmark to see how FlashKDA performs. We'll use a simple script that compares attention mechanisms.
python benchmarks/benchmark_kimi_delta.py --batch_size 8 --seq_len 1024 --num_heads 32 --head_dim 128
Why: Running the benchmark with specific parameters allows you to see how FlashKDA scales with different batch sizes and sequence lengths, providing insights into its performance characteristics.
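If you want to sanity-check the numbers outside the provided scripts, a manual timing loop like the one below compares the FlashKDA module against PyTorch's scaled_dot_product_attention baseline. This is a sketch: the flashkda.attention import path, the KimiDeltaAttention constructor, and its (query, key, value) call signature are assumed from the integration example in step 7, not taken from FlashKDA documentation.
import torch
import torch.nn.functional as F
from flashkda.attention import KimiDeltaAttention  # assumed import path, as in step 7

def time_fn(fn, *args, iters=50):
    # Warm up, then time with CUDA events for accurate GPU measurements
    for _ in range(5):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

batch, seq_len, num_heads, head_dim = 8, 1024, 32, 128
embed_dim = num_heads * head_dim

# Input for the FlashKDA module (batch, seq, embed) and for the SDPA baseline (batch, heads, seq, head_dim)
x = torch.randn(batch, seq_len, embed_dim, device="cuda", dtype=torch.float16)
q = k = v = torch.randn(batch, num_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)

kda = KimiDeltaAttention(embed_dim, num_heads, head_dim).cuda().half()
print("KimiDeltaAttention:", time_fn(kda, x, x, x), "ms")
print("SDPA baseline:     ", time_fn(F.scaled_dot_product_attention, q, k, v), "ms")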
6. Analyze Results
After running the benchmark, examine the output to compare performance metrics. Look for speed improvements and memory usage differences.
python benchmarks/plot_results.py --results_dir ./results
Why: Visualization helps you understand performance trends and confirm that FlashKDA delivers the expected speedups over standard attention mechanisms.
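The layout of the results directory depends on the benchmark scripts, so here is a more generic sketch: it plots latencies from a CSV that you produce yourself (for example, from the timing loop in step 5). The file name and column names are placeholders, not part of FlashKDA.
import csv
import matplotlib.pyplot as plt

# Expects a CSV you generated yourself with columns: seq_len,kda_ms,baseline_ms
seq_lens, kda_ms, baseline_ms = [], [], []
with open("my_timings.csv") as f:
    for row in csv.DictReader(f):
        seq_lens.append(int(row["seq_len"]))
        kda_ms.append(float(row["kda_ms"]))
        baseline_ms.append(float(row["baseline_ms"]))

plt.plot(seq_lens, kda_ms, marker="o", label="KimiDeltaAttention")
plt.plot(seq_lens, baseline_ms, marker="o", label="baseline attention")
plt.xlabel("sequence length")
plt.ylabel("latency (ms)")
plt.legend()
plt.savefig("attention_latency.png")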
7. Integrate FlashKDA into Your Model
To use FlashKDA in your own model, import the KimiDeltaAttention module and use it in place of standard attention. Here's a code snippet:
import torch
import torch.nn as nn
from flashkda.attention import KimiDeltaAttention

# Example usage in a transformer block
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, head_dim):
        super().__init__()
        # Drop-in replacement for standard multi-head self-attention
        self.attn = KimiDeltaAttention(embed_dim, num_heads, head_dim)
        self.ffn = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # Self-attention: query, key, and value all come from x
        attn_out = self.attn(x, x, x)
        return self.ffn(attn_out)
Why: Integrating FlashKDA into your model allows you to leverage its performance benefits in real-world applications, such as language modeling or NLP tasks.
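With the block defined, a quick smoke test on random data confirms that shapes flow through. This is a sketch: the dimensions are illustrative, and the constructor arguments follow the snippet above rather than documented defaults.
# Continuing from the snippet above: a quick shape check on random data
embed_dim, num_heads, head_dim = 512, 8, 64  # chosen so embed_dim = num_heads * head_dim
block = TransformerBlock(embed_dim, num_heads, head_dim).cuda().half()

x = torch.randn(2, 256, embed_dim, device="cuda", dtype=torch.float16)
print(block(x).shape)  # expected: torch.Size([2, 256, 512])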
8. Test with Variable-Length Batching
One of FlashKDA's strengths is variable-length batching. You can test this by creating inputs of different sequence lengths:
import torch
from flashkda.attention import KimiDeltaAttention

# Dimensions chosen so that embed_dim = num_heads * head_dim
embed_dim, num_heads, head_dim = 128, 4, 32
attn = KimiDeltaAttention(embed_dim, num_heads, head_dim)

# Create variable-length inputs
batch_sizes = [4, 6, 8]
seq_lengths = [512, 768, 1024]
for bs, sl in zip(batch_sizes, seq_lengths):
    x = torch.randn(bs, sl, embed_dim)
    output = attn(x, x, x)
    print(f"Batch size: {bs}, Sequence length: {sl}, Output shape: {output.shape}")
Why: Testing variable-length batching ensures that FlashKDA handles dynamic inputs efficiently, which is common in real-world NLP tasks.
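FlashKDA's exact variable-length interface isn't covered in this tutorial, but the common convention for variable-length attention kernels (used, for example, by flash-attn) is to pack all sequences into a single unpadded tensor and describe their boundaries with cumulative sequence lengths. The sketch below shows only that packing step and assumes no FlashKDA-specific calls.
import torch

# Three sequences of different lengths, each with embedding dimension 512
seq_lens = [512, 768, 1024]
embed_dim = 512
seqs = [torch.randn(sl, embed_dim, device="cuda", dtype=torch.float16) for sl in seq_lens]

# Pack all tokens into one (total_tokens, embed_dim) tensor with no padding
packed = torch.cat(seqs, dim=0)

# cu_seqlens marks where each sequence begins and ends in the packed tensor
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1:] = torch.cumsum(torch.tensor(seq_lens, device="cuda"), dim=0)

print(packed.shape)  # torch.Size([2304, 512])
print(cu_seqlens)    # tensor([   0,  512, 1280, 2304], ...)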
Summary
In this tutorial, you've learned how to set up and use FlashKDA, an open-source implementation of Kimi Delta Attention from Moonshot AI. You've cloned the repository, installed dependencies, run benchmarks, and integrated FlashKDA into a model. FlashKDA offers significant performance improvements over standard attention mechanisms, especially on H20 hardware, and supports variable-length batching, making it a valuable tool for accelerating attention computation in large language models.