vLLM
A high-throughput and memory-efficient inference and serving engine for Large Language Models
Alternative To
- HuggingFace TGI
- TensorRT-LLM
- LMDeploy
- SGLang
Difficulty Level
Requires some technical experience. Moderate setup complexity.
Overview
vLLM is a fast and efficient library for Large Language Model (LLM) inference and serving, originally developed at UC Berkeley’s Sky Computing Lab. It addresses the critical challenges of LLM deployment: high memory usage, slow inference speed, and inefficient resource utilization. At its core, vLLM introduces PagedAttention, an innovative attention algorithm that significantly optimizes memory management for attention keys and values, inspired by virtual memory and paging concepts from operating systems.
The library enables high-throughput LLM serving with dramatically improved performance compared to traditional solutions, making it possible to serve complex language models efficiently even with limited computational resources. vLLM has evolved into a community-driven project with contributions from both academia and industry, becoming a cornerstone technology for organizations looking to deploy LLMs in production environments.
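As a minimal taste of the Python API, the sketch below uses vLLM's offline entry point (the LLM class plus SamplingParams) to batch-generate completions; the tiny facebook/opt-125m model is only an example and can be swapped for any supported Hugging Face model ID.

```python
from vllm import LLM, SamplingParams

# Load a model once; vLLM manages the KV cache with PagedAttention internally.
llm = LLM(model="facebook/opt-125m")  # small example model, swap for any supported model ID

# Sampling settings applied to every prompt in the batch
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

prompts = [
    "The capital of France is",
    "In one sentence, a large language model is",
]

# generate() batches the prompts together and returns one RequestOutput per prompt
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```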
Key Features
Feature | Description |
---|---|
PagedAttention | Novel attention algorithm that partitions KV cache into blocks, reducing memory waste by up to 96% |
Continuous Batching | Efficiently processes incoming requests without waiting for a full batch to form |
High Throughput | Delivers up to 24x higher throughput than HuggingFace Transformers |
Tensor Parallelism | Supports distributed inference across multiple GPUs |
Pipeline Parallelism | Enables model distribution across multiple devices for larger models |
Quantization Support | Includes GPTQ, AWQ, INT4, INT8, and FP8 for reduced memory footprint |
Optimized CUDA Kernels | Integration with FlashAttention and FlashInfer for maximum performance |
Streaming Outputs | Real-time token generation with minimal latency |
OpenAI-Compatible API | Drop-in replacement for OpenAI’s API, simplifying integration |
Multi-Hardware Support | Works with NVIDIA GPUs, AMD CPUs/GPUs, Intel CPUs/GPUs, TPUs, and AWS Neuron |
Prefix Caching | Automatically caches and reuses computation for common prefixes |
Multi-LoRA Support | Efficiently serves multiple fine-tuned models with minimal overhead |
Speculative Decoding | Accelerates generation by drafting several tokens ahead and verifying them in a single forward pass |
Technical Details
vLLM’s architecture is designed to maximize throughput while minimizing memory overhead, particularly addressing the inefficiencies in traditional attention mechanisms.
PagedAttention Technology
PagedAttention is the core innovation behind vLLM’s performance advantages. It solves the key-value (KV) cache fragmentation problem by:
- Partitioning the KV cache into fixed-size blocks (similar to memory pages in operating systems)
- Allowing non-contiguous storage of keys and values in memory
- Using a block table to map logical sequence positions to physical memory blocks
- Enabling efficient memory sharing between sequences with common prefixes
This approach dramatically reduces memory waste and enables more efficient batching of requests, leading to significantly higher throughput.
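As a rough illustration of the bookkeeping involved (a toy sketch, not vLLM's actual implementation), the snippet below keeps a per-sequence block table that maps logical cache blocks to whatever free physical blocks an allocator hands out; a new physical block is claimed only when the previous one is full, so at most one block per sequence is partially wasted.

```python
# Toy illustration of PagedAttention-style bookkeeping (not vLLM internals):
# a block table maps logical KV-cache blocks to non-contiguous physical blocks.
BLOCK_SIZE = 16  # tokens per KV-cache block

class ToyAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        # Any free block will do: physical storage does not need to be contiguous.
        return self.free.pop()

class ToyBlockTable:
    def __init__(self):
        self.blocks = []          # logical block index -> physical block id
        self.tokens_in_last = 0   # fill level of the most recent block

    def append_token(self, allocator):
        # Claim a new physical block only when the current one is full.
        if not self.blocks or self.tokens_in_last == BLOCK_SIZE:
            self.blocks.append(allocator.allocate())
            self.tokens_in_last = 0
        self.tokens_in_last += 1

allocator = ToyAllocator(num_blocks=8)
seq = ToyBlockTable()
for _ in range(40):               # 40 tokens -> 3 blocks (16 + 16 + 8)
    seq.append_token(allocator)
print(seq.blocks, seq.tokens_in_last)   # -> [7, 6, 5] 8: scattered blocks, last one half full
```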
Supported Model Architectures
vLLM supports a wide range of model architectures:
- Transformer-based LLMs (Llama, Mistral, Falcon, etc.)
- Mixture-of-Experts models (Mixtral, DeepSeek-V2/V3)
- Embedding models (E5-Mistral)
- Multi-modal LLMs (LLaVA)
Version Information
Version | Release Date | Key Features Added |
---|---|---|
v0.6.0 | September 2024 | 2.7x throughput improvement, 5x latency reduction |
v0.5.0 | July 2024 | Enhanced multi-modal support, improved quantization |
v0.4.0 | May 2024 | Expanded hardware support, speculative decoding |
v0.3.0 | March 2024 | Prefix caching, multi-LoRA support |
v0.2.0 | December 2023 | Tensor parallelism improvements, streaming outputs |
v0.1.0 | June 2023 | Initial release with PagedAttention |
Why Use vLLM
vLLM offers several compelling advantages over alternative LLM serving solutions:
Superior Performance: vLLM delivers up to 24x higher throughput than HuggingFace Transformers and up to 3.5x higher throughput than HuggingFace's Text Generation Inference (TGI) in benchmark tests.
Memory Efficiency: PagedAttention reduces memory waste by up to 96%, allowing more efficient use of GPU resources and enabling larger batch sizes.
Cost Effectiveness: The improved throughput means the same infrastructure can handle significantly more traffic (up to 5x more in some cases) without requiring additional GPUs, translating to direct cost savings.
Seamless Integration: vLLM provides an OpenAI-compatible API server, making it easy to integrate with existing applications that use OpenAI’s API.
Flexibility: Support for various decoding algorithms (parallel sampling, beam search), tensor parallelism, and streaming outputs provides flexibility for different use cases.
Multi-Hardware Support: Unlike some alternatives that only support NVIDIA GPUs, vLLM works across a variety of hardware platforms.
Active Development: As an active open-source project with contributions from both academia and industry, vLLM continues to improve with regular updates and new features.
System Requirements
Minimum Requirements
- Operating System: Linux (vLLM is fully supported only on Linux)
- Python: 3.8 - 3.11
- CPU: 4+ cores
- RAM: 16GB+
- GPU: NVIDIA GPU with compute capability 7.0 or higher (V100, T4, RTX20xx, A100, L4, H100, etc.); see the quick check after this list
- Storage: 5GB+
- CUDA: CUDA 11.8 or 12.1 (binaries are compiled with CUDA 12.1 by default)
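The GPU requirement is easy to verify from Python, assuming a CUDA-enabled PyTorch build is already installed:

```python
# Check whether the local GPU meets vLLM's minimum compute capability of 7.0.
import torch

if not torch.cuda.is_available():
    print("No CUDA-capable GPU detected.")
else:
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
    print("Meets vLLM minimum (7.0):", (major, minor) >= (7, 0))
```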
Recommended Requirements
- CPU: 8+ cores
- RAM: 32GB+
- GPU: NVIDIA A100 (80GB) or H100 for large models
- Storage: 20GB+ SSD
Hardware Compatibility
vLLM supports multiple hardware platforms:
- NVIDIA GPUs (primary support)
- AMD CPUs and GPUs (GPUs via ROCm)
- Intel CPUs and GPUs
- PowerPC CPUs
- TPUs
- AWS Neuron (Trainium and Inferentia)
Installation Guide
Prerequisites
- Linux operating system
- Python 3.8 - 3.11
- CUDA-compatible GPU
- Git (for source installation)
Installation with pip
The simplest way to install vLLM is using pip:
```bash
# Create a new conda environment (recommended)
conda create -n vllm python=3.9 -y
conda activate vllm

# Install vLLM with CUDA 12.1 (the default build)
pip install vllm
```
For CUDA 11.8 compatibility:
```bash
# Install vLLM with CUDA 11.8
export VLLM_VERSION=0.6.0
export PYTHON_VERSION=39
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
Building from Source
For maximum compatibility or to use specific features:
```bash
# Clone the repository
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Optionally build with multi-LoRA (Punica) kernels
# export VLLM_INSTALL_PUNICA_KERNELS=1

# Install in development mode
pip install -e .  # This may take 5-10 minutes
```
Using Docker
vLLM also provides Docker images for easy deployment:
```bash
# Use the NVIDIA PyTorch Docker image
# `--ipc=host` ensures sufficient shared memory
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3

# Inside the container, install vLLM
pip install vllm
```
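Whichever installation path you use, a quick sanity check is to import the package and confirm that CUDA is visible:

```python
# Post-install sanity check: the package imports and a GPU is visible to PyTorch.
import torch
import vllm

print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
```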
Practical Exercise: Serving a Model with vLLM
This exercise demonstrates how to set up and use vLLM to serve a language model with an OpenAI-compatible API.
Step 1: Install vLLM
First, ensure vLLM is installed as described in the installation guide above.
Step 2: Start an OpenAI-compatible API Server
vLLM provides a simple command to start an API server compatible with the OpenAI API:
```bash
# Start a server with Llama-2-7b-chat
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 8000
```
This command:
- Loads the Llama-2-7b-chat model from Hugging Face
- Uses a tensor parallelism size of 1 (using a single GPU)
- Binds the server to all network interfaces on port 8000
For larger models or multiple GPUs, adjust the --tensor-parallel-size parameter.
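Before sending any prompts, you can confirm the server is up by listing the models it serves via the OpenAI-compatible /v1/models endpoint:

```python
# Sanity check: list the models served by the running vLLM API server.
import requests

resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```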
Step 3: Query the API Server
Once the server is running, you can query it using standard HTTP requests:
```python
import requests
import json

# Define the API endpoint
api_url = "http://localhost:8000/v1/chat/completions"

# Prepare the request payload
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention in vLLM?"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
}

# Send the request
headers = {"Content-Type": "application/json"}
response = requests.post(api_url, headers=headers, data=json.dumps(payload))

# Print the response
print(json.dumps(response.json(), indent=4))
```
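Because the server implements the OpenAI protocol, the official openai Python client (version 1.0 or later) works as well; the API key can be any placeholder string unless the server was started with --api-key. A sketch of the same request:

```python
# Same request via the official OpenAI Python client (requires openai>=1.0).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention in vLLM?"},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(completion.choices[0].message.content)
```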
Step 4: Streaming Responses
vLLM also supports streaming responses, which is useful for real-time applications:
```python
import requests
import json

api_url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about artificial intelligence."}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": True  # Enable streaming
}

headers = {"Content-Type": "application/json"}
response = requests.post(api_url, headers=headers, data=json.dumps(payload), stream=True)

# Process the streaming response (server-sent events)
for line in response.iter_lines():
    if line:
        line_text = line.decode('utf-8')
        if line_text.startswith('data: '):
            data_str = line_text[6:]  # Remove 'data: ' prefix
            if data_str != '[DONE]':
                try:
                    data = json.loads(data_str)
                    if 'choices' in data and len(data['choices']) > 0:
                        delta = data['choices'][0].get('delta', {})
                        if 'content' in delta:
                            print(delta['content'], end='', flush=True)
                except json.JSONDecodeError:
                    pass
```
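The same streaming output is simpler to consume through the openai client, which parses the server-sent events for you (again assuming openai>=1.0 is installed):

```python
# Streaming via the OpenAI client: iterate over chunks as they arrive.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Write a short poem about artificial intelligence."},
    ],
    temperature=0.7,
    max_tokens=500,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```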
Step 5: Advanced Configuration
For production deployments, you might want to adjust various parameters:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 256 \
    --host 0.0.0.0 \
    --port 8000
```
This configuration:
- Distributes the model across 2 GPUs
- Sets the maximum sequence length to 8192 tokens
- Limits the maximum number of tokens processed in a batch to 16384
- Allows up to 256 concurrent sequences
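When vLLM is embedded directly in Python rather than run as a standalone server, roughly the same knobs are available as engine arguments on the LLM class; the sketch below mirrors the command above, and the values are illustrative rather than tuned recommendations:

```python
from vllm import LLM, SamplingParams

# Engine arguments mirroring the server flags above (illustrative values).
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,        # distribute the model across 2 GPUs
    max_model_len=8192,            # maximum sequence length
    max_num_batched_tokens=16384,  # per-batch token budget
    max_num_seqs=256,              # maximum concurrent sequences
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may reserve
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```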
Resources
Official Documentation and Repositories
- vLLM GitHub Repository - Source code, releases, and issue tracker (github.com/vllm-project/vllm)
- vLLM Documentation - Official installation and usage guides (docs.vllm.ai)
Technical Papers and Articles
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - Original announcement blog post
- vLLM Paper (SOSP 2023) - Academic paper detailing the technology
Community Resources
- vLLM Meetups - Regular community events
- vLLM Office Hours - Biweekly online sessions