Introduction
In this tutorial, we'll explore how to implement and use kvcached, a dynamic KV-cache solution built on top of vLLM, to optimize GPU memory usage when serving large language models (LLMs). This implementation is especially useful for handling bursty LLM serving workloads and enabling multi-model GPU sharing, which are common in production environments. By the end of this tutorial, you'll understand how to set up a lightweight Qwen2.5 model with an OpenAI-compatible API and experiment with dynamic KV-cache allocation.
Prerequisites
- Basic understanding of Python and machine learning concepts
- Access to a machine with at least one NVIDIA GPU and CUDA support
- Python 3.8 or higher installed
- Installed packages:
vllm, openai, fastapi, uvicorn, torch, and aiohttp (used for the load test in section 4)
Step-by-Step Instructions
1. Setting Up the Environment
1.1 Install Required Dependencies
We start by installing all necessary packages. The vllm library provides the core inference engine, while openai and fastapi help us create an OpenAI-compatible API server.
pip install vllm openai fastapi uvicorn torch aiohttp
1.2 Verify GPU Availability
Ensure that your system has CUDA-compatible GPUs available for running LLMs.
python -c "import torch; print(torch.cuda.is_available())"
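If you want more detail than a boolean, a short check like the following prints each visible device's name and total memory:

import torch

# Print the name and total memory of every visible CUDA device
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / (1024**3):.1f} GiB")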
2. Deploying Qwen2.5 with OpenAI-Compatible API
2.1 Download and Load the Qwen2.5 Model
We'll use Qwen2.5-0.5B, a lightweight yet capable model. We'll load it with the vLLM engine and set up a basic inference server; save the code below as main.py so the uvicorn command in the next step can find it. (The response schema here is deliberately simplified rather than fully OpenAI-compatible; if you need strict compatibility, vLLM also ships its own OpenAI-compatible server entrypoint.)
from vllm import LLM, SamplingParams
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm = LLM(model="Qwen/Qwen2.5-0.5B", tensor_parallel_size=1)

# Define the request schema
class ChatRequest(BaseModel):
    messages: list
    temperature: float = 0.7

@app.post("/v1/chat/completions")
def chat_completion(request: ChatRequest):
    # Join the chat turns into a single prompt; passing each message as its
    # own prompt would treat them as unrelated requests.
    prompt = "\n".join(msg["content"] for msg in request.messages)
    sampling_params = SamplingParams(temperature=request.temperature)
    outputs = llm.generate([prompt], sampling_params)
    return {"response": outputs[0].outputs[0].text}
2.2 Run the API Server
Start the FastAPI server to expose the model via an OpenAI-compatible API.
uvicorn main:app --host 0.0.0.0 --port 8000
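Once the server is up, you can exercise the endpoint with a minimal client like the one below. It matches the request schema defined above; the requests library is assumed to be installed (pip install requests).

import requests

# Send one chat request to the server started above
payload = {
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["response"])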
3. Implementing Dynamic KV-Cache with kvcached
3.1 Configure kvcached Parameters
To prepare for dynamic KV-cache allocation, we tune vLLM's standard memory-related parameters when initializing the engine. These bound how much KV-cache the engine can demand at once, which matters most during bursty inference patterns. Note that these are ordinary vLLM options; enabling kvcached itself happens outside the constructor, as shown after the code.
from vllm import LLM

# Standard vLLM knobs that bound KV-cache demand
llm = LLM(
    model="Qwen/Qwen2.5-0.5B",
    tensor_parallel_size=1,
    kv_cache_dtype="auto",        # let vLLM choose the KV-cache precision
    enable_prefix_caching=True,   # reuse KV blocks across shared prompt prefixes
    max_num_seqs=100,             # cap on concurrently scheduled sequences
    max_model_len=2048,           # cap on per-sequence context length
)
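kvcached itself is not toggled through an LLM constructor argument. Per the kvcached project's documentation, it hooks into the serving engine and is switched on via an environment variable; the snippet below assumes the variable is named ENABLE_KVCACHED, which you should verify against your installed version. It must be set before the engine is constructed.

import os

# Assumption: the kvcached docs enable the integration via an environment
# variable; confirm the exact name and value for your installed version.
# Set it before vLLM creates its engine (or export it in the shell instead).
os.environ["ENABLE_KVCACHED"] = "true"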
3.2 Monitor Memory Usage
With kvcached enabled, we can watch GPU memory to observe how dynamic KV-cache allocation affects consumption. Note that the PyTorch counters below only cover allocations made by the current process, so run them in the same process as the engine.
import torch
# Check memory usage
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / (1024**2):.2f} MB")
print(f"GPU memory reserved: {torch.cuda.memory_reserved() / (1024**2):.2f} MB")
4. Simulating Bursty LLM Serving
4.1 Create a Simulated Bursty Load
We simulate a bursty inference pattern by firing a large batch of concurrent requests at the running server.
import asyncio
import random

import aiohttp

async def send_request(session, prompt, temperature=0.7):
    url = "http://localhost:8000/v1/chat/completions"
    data = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    async with session.post(url, json=data) as response:
        return await response.json()

async def simulate_bursty_load():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(50):
            prompt = f"Explain quantum computing in {random.randint(1, 5)} sentences."
            tasks.append(send_request(session, prompt))
        # Fire all 50 requests at once to create a burst
        responses = await asyncio.gather(*tasks)
        return responses

# Run the bursty load simulation (asyncio.run replaces a bare top-level await)
asyncio.run(simulate_bursty_load())
4.2 Analyze KV-Cache Behavior
By observing the behavior of the KV-cache under bursty load, we can understand how kvcached adapts to varying memory demands. This is crucial for multi-model GPU sharing scenarios.
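Because the server runs in its own process, calling torch.cuda.memory_allocated() from this client script would only measure the client. One way to watch the cache under load is to poll device-level usage with nvidia-smi while the burst runs. The sketch below is illustrative (the helper names are ours) and assumes GPU 0:

import asyncio
import subprocess

def gpu_memory_used_mb() -> int:
    # Device-level usage for GPU 0, as reported by nvidia-smi
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", "-i", "0"]
    )
    return int(out.decode().strip())

async def burst_with_monitoring():
    samples = []

    async def poll():
        while True:
            samples.append(gpu_memory_used_mb())
            await asyncio.sleep(0.5)

    monitor = asyncio.create_task(poll())
    try:
        await simulate_bursty_load()  # defined in section 4.1
    finally:
        monitor.cancel()
    print(f"Peak GPU memory during burst: {max(samples)} MiB")

asyncio.run(burst_with_monitoring())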
5. Multi-Model GPU Sharing
5.1 Configure Multiple Models
To enable multi-model GPU sharing, we can run multiple LLM instances with different models on the same GPU, using kvcached to manage memory efficiently. By default each vLLM instance tries to reserve roughly 90% of GPU memory for itself, so when co-locating engines we give each an explicit share.
from vllm import LLM

# Load two models in one process. Each engine gets an explicit slice of GPU
# memory via gpu_memory_utilization; with kvcached enabled, the KV-cache part
# of that slice is allocated on demand rather than reserved up front.
model1 = LLM(model="Qwen/Qwen2.5-0.5B",
             tensor_parallel_size=1,
             gpu_memory_utilization=0.45)
model2 = LLM(model="Qwen/Qwen2.5-1.5B",
             tensor_parallel_size=1,
             gpu_memory_utilization=0.45)
5.2 Switch Between Models Dynamically
With kvcached, switching between models dynamically becomes more efficient, as the KV-cache is managed centrally and can be reused across models.
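As an illustration, a thin dispatch layer can keep both engines resident and route each request by model name. This is a sketch under the assumption that both engines from section 5.1 are loaded; the MODELS and route names are ours:

# Hypothetical dispatch layer over the two engines from section 5.1
MODELS = {
    "qwen-0.5b": model1,
    "qwen-1.5b": model2,
}

def route(model_name: str, prompt: str) -> str:
    # Generate with whichever resident model the caller asked for
    llm = MODELS[model_name]
    outputs = llm.generate([prompt])
    return outputs[0].outputs[0].text

print(route("qwen-0.5b", "Summarize kvcached in one sentence."))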
Summary
In this tutorial, we explored how to implement and use kvcached for dynamic KV-cache management in LLM serving. We set up a lightweight Qwen2.5 model with an OpenAI-compatible API, configured kvcached parameters for efficient memory usage, simulated bursty inference workloads, and demonstrated multi-model GPU sharing. By leveraging kvcached, you can significantly improve GPU memory utilization and enable more flexible and scalable LLM serving architectures.