Introduction
In this tutorial, we'll explore how to implement and use kvcached, a dynamic KV-cache solution built on top of vLLM, to optimize GPU memory usage when serving large language models (LLMs). This implementation is especially useful for handling bursty LLM serving workloads and enabling multi-model GPU sharing, which are common in production environments. By the end of this tutorial, you'll understand how to set up a lightweight Qwen2.5 model with an OpenAI-compatible API and experiment with dynamic KV-cache allocation.
Prerequisites
- Basic understanding of Python and machine learning concepts
- Access to a machine with at least one NVIDIA GPU and CUDA support
- Python 3.8 or higher installed
- Installed packages:
vllm, openai, fastapi, uvicorn, torch, and aiohttp (used for the load test in section 4)
Step-by-Step Instructions
1. Setting Up the Environment
1.1 Install Required Dependencies
We start by installing all necessary packages. The vllm library provides the core inference engine, while openai and fastapi help us create an OpenAI-compatible API server.
pip install vllm openai fastapi uvicorn torch aiohttp
1.2 Verify GPU Availability
Ensure that your system has CUDA-compatible GPUs available for running LLMs.
python -c "import torch; print(torch.cuda.is_available())"
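If you want more detail than a boolean, a short check like the following prints each visible device's name and total memory:

import torch

# Print the name and total memory of every visible CUDA device
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / (1024**3):.1f} GiB")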
2. Deploying Qwen2.5 with OpenAI-Compatible API
2.1 Download and Load the Qwen2.5 Model
We'll use Qwen2.5-0.5B, a lightweight yet capable model. We'll load it with the vLLM engine and set up a basic inference server; save the code below as main.py so the uvicorn command in the next step can find it. (The response schema here is deliberately simplified rather than fully OpenAI-compatible; if you need strict compatibility, vLLM also ships its own OpenAI-compatible server entrypoint.)
from vllm import LLM, SamplingParams
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm = LLM(model="Qwen/Qwen2.5-0.5B", tensor_parallel_size=1)

# Define the request schema
class ChatRequest(BaseModel):
    messages: list
    temperature: float = 0.7

@app.post("/v1/chat/completions")
def chat_completion(request: ChatRequest):
    # Join the chat turns into a single prompt; passing each message as its
    # own prompt would treat them as unrelated requests.
    prompt = "\n".join(msg["content"] for msg in request.messages)
    sampling_params = SamplingParams(temperature=request.temperature)
    outputs = llm.generate([prompt], sampling_params)
    return {"response": outputs[0].outputs[0].text}
2.2 Run the API Server
Start the FastAPI server to expose the model via an OpenAI-compatible API.
uvicorn main:app --host 0.0.0.0 --port 8000
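Once the server is up, you can exercise the endpoint with a minimal client like the one below. It matches the request schema defined above; the requests library is assumed to be installed (pip install requests).

import requests

# Send one chat request to the server started above
payload = {
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["response"])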
3. Implementing Dynamic KV-Cache with kvcached
3.1 Configure kvcached Parameters
To prepare for dynamic KV-cache allocation, we tune vLLM's standard memory-related parameters when initializing the engine. These bound how much KV-cache the engine can demand at once, which matters most during bursty inference patterns. Note that these are ordinary vLLM options; enabling kvcached itself happens outside the constructor, as shown after the code.
from vllm import LLM

# Standard vLLM knobs that bound KV-cache demand
llm = LLM(
    model="Qwen/Qwen2.5-0.5B",
    tensor_parallel_size=1,
    kv_cache_dtype="auto",        # let vLLM choose the KV-cache precision
    enable_prefix_caching=True,   # reuse KV blocks across shared prompt prefixes
    max_num_seqs=100,             # cap on concurrently scheduled sequences
    max_model_len=2048,           # cap on per-sequence context length
)
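kvcached itself is not toggled through an LLM constructor argument. Per the kvcached project's documentation, it hooks into the serving engine and is switched on via an environment variable; the snippet below assumes the variable is named ENABLE_KVCACHED, which you should verify against your installed version. It must be set before the engine is constructed.

import os

# Assumption: the kvcached docs enable the integration via an environment
# variable; confirm the exact name and value for your installed version.
# Set it before vLLM creates its engine (or export it in the shell instead).
os.environ["ENABLE_KVCACHED"] = "true"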
3.2 Monitor Memory Usage
With kvcached enabled, we can watch GPU memory to observe how dynamic KV-cache allocation affects consumption. Note that the PyTorch counters below only cover allocations made by the current process, so run them in the same process as the engine.
import torch
# Check memory usage
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / (1024**2):.2f} MB")
print(f"GPU memory reserved: {torch.cuda.memory_reserved() / (1024**2):.2f} MB")
4. Simulating Bursty LLM Serving
4.1 Create a Simulated Bursty Load
We simulate a bursty inference pattern by firing a large batch of concurrent requests at the running server.
import asyncio
import random

import aiohttp

async def send_request(session, prompt, temperature=0.7):
    url = "http://localhost:8000/v1/chat/completions"
    data = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    async with session.post(url, json=data) as response:
        return await response.json()

async def simulate_bursty_load():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(50):
            prompt = f"Explain quantum computing in {random.randint(1, 5)} sentences."
            tasks.append(send_request(session, prompt))
        # Fire all 50 requests at once to create a burst
        responses = await asyncio.gather(*tasks)
        return responses

# Run the bursty load simulation (asyncio.run replaces a bare top-level await)
asyncio.run(simulate_bursty_load())
4.2 Analyze KV-Cache Behavior
By observing the behavior of the KV-cache under bursty load, we can understand how kvcached adapts to varying memory demands. This is crucial for multi-model GPU sharing scenarios.
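Because the server runs in its own process, calling torch.cuda.memory_allocated() from this client script would only measure the client. One way to watch the cache under load is to poll device-level usage with nvidia-smi while the burst runs. The sketch below is illustrative (the helper names are ours) and assumes GPU 0:

import asyncio
import subprocess

def gpu_memory_used_mb() -> int:
    # Device-level usage for GPU 0, as reported by nvidia-smi
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", "-i", "0"]
    )
    return int(out.decode().strip())

async def burst_with_monitoring():
    samples = []

    async def poll():
        while True:
            samples.append(gpu_memory_used_mb())
            await asyncio.sleep(0.5)

    monitor = asyncio.create_task(poll())
    try:
        await simulate_bursty_load()  # defined in section 4.1
    finally:
        monitor.cancel()
    print(f"Peak GPU memory during burst: {max(samples)} MiB")

asyncio.run(burst_with_monitoring())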
5. Multi-Model GPU Sharing
5.1 Configure Multiple Models
To enable multi-model GPU sharing, we can run multiple LLM instances with different models on the same GPU, using kvcached to manage memory efficiently. By default each vLLM instance tries to reserve roughly 90% of GPU memory for itself, so when co-locating engines we give each an explicit share.
from vllm import LLM

# Load two models in one process. Each engine gets an explicit slice of GPU
# memory via gpu_memory_utilization; with kvcached enabled, the KV-cache part
# of that slice is allocated on demand rather than reserved up front.
model1 = LLM(model="Qwen/Qwen2.5-0.5B",
             tensor_parallel_size=1,
             gpu_memory_utilization=0.45)
model2 = LLM(model="Qwen/Qwen2.5-1.5B",
             tensor_parallel_size=1,
             gpu_memory_utilization=0.45)
5.2 Switch Between Models Dynamically
With kvcached, switching between models dynamically becomes more efficient, as the KV-cache is managed centrally and can be reused across models.
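As an illustration, a thin dispatch layer can keep both engines resident and route each request by model name. This is a sketch under the assumption that both engines from section 5.1 are loaded; the MODELS and route names are ours:

# Hypothetical dispatch layer over the two engines from section 5.1
MODELS = {
    "qwen-0.5b": model1,
    "qwen-1.5b": model2,
}

def route(model_name: str, prompt: str) -> str:
    # Generate with whichever resident model the caller asked for
    llm = MODELS[model_name]
    outputs = llm.generate([prompt])
    return outputs[0].outputs[0].text

print(route("qwen-0.5b", "Summarize kvcached in one sentence."))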
Summary
In this tutorial, we explored how to implement and use kvcached for dynamic KV-cache management in LLM serving. We set up a lightweight Qwen2.5 model with an OpenAI-compatible API, configured kvcached parameters for efficient memory usage, simulated bursty inference workloads, and demonstrated multi-model GPU sharing. By leveraging kvcached, you can significantly improve GPU memory utilization and enable more flexible and scalable LLM serving architectures.