vLLM
A high-throughput and memory-efficient inference and serving engine for Large Language Models
Alternative To
- HuggingFace TGI
- TensorRT-LLM
- LMDeploy
- SGLang
Difficulty Level
Requires some technical experience. Moderate setup complexity.
Overview
vLLM is a fast and efficient library for Large Language Model (LLM) inference and serving, originally developed at UC Berkeley’s Sky Computing Lab. It addresses the critical challenges of LLM deployment: high memory usage, slow inference speed, and inefficient resource utilization. At its core, vLLM introduces PagedAttention, an innovative attention algorithm that significantly optimizes memory management for attention keys and values, inspired by virtual memory and paging concepts from operating systems.
The library enables high-throughput LLM serving with dramatically improved performance compared to traditional solutions, making it possible to serve complex language models efficiently even with limited computational resources. vLLM has evolved into a community-driven project with contributions from both academia and industry, becoming a cornerstone technology for organizations looking to deploy LLMs in production environments.
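As a minimal taste of the Python API, the sketch below uses vLLM's offline entry point (the LLM class plus SamplingParams) to batch-generate completions; the tiny facebook/opt-125m model is only an example and can be swapped for any supported Hugging Face model ID.

```python
from vllm import LLM, SamplingParams

# Load a model once; vLLM manages the KV cache with PagedAttention internally.
llm = LLM(model="facebook/opt-125m")  # small example model, swap for any supported model ID

# Sampling settings applied to every prompt in the batch
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

prompts = [
    "The capital of France is",
    "In one sentence, a large language model is",
]

# generate() batches the prompts together and returns one RequestOutput per prompt
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```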
Key Features
Feature | Description |
---|---|
PagedAttention | Novel attention algorithm that partitions KV cache into blocks, reducing memory waste by up to 96% |
Continuous Batching | Efficiently processes incoming requests without waiting for a full batch to form |
High Throughput | Delivers up to 24x higher throughput than HuggingFace Transformers |
Tensor Parallelism | Supports distributed inference across multiple GPUs |
Pipeline Parallelism | Enables model distribution across multiple devices for larger models |
Quantization Support | Includes GPTQ, AWQ, INT4, INT8, and FP8 for reduced memory footprint |
Optimized CUDA Kernels | Integration with FlashAttention and FlashInfer for maximum performance |
Streaming Outputs | Real-time token generation with minimal latency |
OpenAI-Compatible API | Drop-in replacement for OpenAI’s API, simplifying integration |
Multi-Hardware Support | Works with NVIDIA GPUs, AMD CPUs/GPUs, Intel CPUs/GPUs, TPUs, and AWS Neuron |
Prefix Caching | Automatically caches and reuses computation for common prefixes |
Multi-LoRA Support | Efficiently serves multiple fine-tuned models with minimal overhead |
Speculative Decoding | Accelerates generation by drafting several tokens ahead and verifying them in a single forward pass |
Technical Details
vLLM’s architecture is designed to maximize throughput while minimizing memory overhead, particularly addressing the inefficiencies in traditional attention mechanisms.
PagedAttention Technology
PagedAttention is the core innovation behind vLLM’s performance advantages. It solves the key-value (KV) cache fragmentation problem by:
- Partitioning the KV cache into fixed-size blocks (similar to memory pages in operating systems)
- Allowing non-contiguous storage of keys and values in memory
- Using a block table to map logical sequence positions to physical memory blocks
- Enabling efficient memory sharing between sequences with common prefixes
This approach dramatically reduces memory waste and enables more efficient batching of requests, leading to significantly higher throughput.
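As a rough illustration of the bookkeeping involved (a toy sketch, not vLLM's actual implementation), the snippet below keeps a per-sequence block table that maps logical cache blocks to whatever free physical blocks an allocator hands out; a new physical block is claimed only when the previous one is full, so at most one block per sequence is partially wasted.

```python
# Toy illustration of PagedAttention-style bookkeeping (not vLLM internals):
# a block table maps logical KV-cache blocks to non-contiguous physical blocks.
BLOCK_SIZE = 16  # tokens per KV-cache block

class ToyAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        # Any free block will do: physical storage does not need to be contiguous.
        return self.free.pop()

class ToyBlockTable:
    def __init__(self):
        self.blocks = []          # logical block index -> physical block id
        self.tokens_in_last = 0   # fill level of the most recent block

    def append_token(self, allocator):
        # Claim a new physical block only when the current one is full.
        if not self.blocks or self.tokens_in_last == BLOCK_SIZE:
            self.blocks.append(allocator.allocate())
            self.tokens_in_last = 0
        self.tokens_in_last += 1

allocator = ToyAllocator(num_blocks=8)
seq = ToyBlockTable()
for _ in range(40):               # 40 tokens -> 3 blocks (16 + 16 + 8)
    seq.append_token(allocator)
print(seq.blocks, seq.tokens_in_last)   # -> [7, 6, 5] 8: scattered blocks, last one half full
```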
Supported Model Architectures
vLLM supports a wide range of model architectures:
- Transformer-based LLMs (Llama, Mistral, Falcon, etc.)
- Mixture-of-Experts models (Mixtral, DeepSeek-V2/V3)
- Embedding models (E5-Mistral)
- Multi-modal LLMs (LLaVA)
Version Information
Version | Release Date | Key Features Added |
---|---|---|
v0.6.0 | September 2024 | 2.7x throughput improvement, 5x latency reduction |
v0.5.0 | July 2024 | Enhanced multi-modal support, improved quantization |
v0.4.0 | May 2024 | Expanded hardware support, speculative decoding |
v0.3.0 | March 2024 | Prefix caching, multi-LoRA support |
v0.2.0 | December 2023 | Tensor parallelism improvements, streaming outputs |
v0.1.0 | June 2023 | Initial release with PagedAttention |
Why Use vLLM
vLLM offers several compelling advantages over alternative LLM serving solutions:
Superior Performance: vLLM delivers up to 24x higher throughput than HuggingFace Transformers and up to 3.5x higher throughput than HuggingFace's Text Generation Inference (TGI) in benchmark tests.
Memory Efficiency: PagedAttention reduces memory waste by up to 96%, allowing more efficient use of GPU resources and enabling larger batch sizes.
Cost Effectiveness: The improved throughput means the same infrastructure can handle significantly more traffic (up to 5x more in some cases) without requiring additional GPUs, translating to direct cost savings.
Seamless Integration: vLLM provides an OpenAI-compatible API server, making it easy to integrate with existing applications that use OpenAI’s API.
Flexibility: Support for various decoding algorithms (parallel sampling, beam search), tensor parallelism, and streaming outputs provides flexibility for different use cases.
Multi-Hardware Support: Unlike some alternatives that only support NVIDIA GPUs, vLLM works across a variety of hardware platforms.
Active Development: As an active open-source project with contributions from both academia and industry, vLLM continues to improve with regular updates and new features.
System Requirements
Minimum Requirements
- Operating System: Linux (vLLM is fully supported only on Linux)
- Python: 3.8 - 3.11
- CPU: 4+ cores
- RAM: 16GB+
- GPU: NVIDIA GPU with compute capability 7.0 or higher (V100, T4, RTX20xx, A100, L4, H100, etc.); see the quick check after this list
- Storage: 5GB+
- CUDA: CUDA 11.8 or 12.1 (binaries are compiled with CUDA 12.1 by default)
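The GPU requirement is easy to verify from Python, assuming a CUDA-enabled PyTorch build is already installed:

```python
# Check whether the local GPU meets vLLM's minimum compute capability of 7.0.
import torch

if not torch.cuda.is_available():
    print("No CUDA-capable GPU detected.")
else:
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
    print("Meets vLLM minimum (7.0):", (major, minor) >= (7, 0))
```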
Recommended Requirements
- CPU: 8+ cores
- RAM: 32GB+
- GPU: NVIDIA A100 (80GB) or H100 for large models
- Storage: 20GB+ SSD
Hardware Compatibility
vLLM supports multiple hardware platforms:
- NVIDIA GPUs (primary support)
- AMD CPUs and GPUs (GPUs via ROCm)
- Intel CPUs and GPUs
- PowerPC CPUs
- TPUs
- AWS Neuron (Trainium and Inferentia)
Installation Guide
Prerequisites
- Linux operating system
- Python 3.8 - 3.11
- CUDA-compatible GPU
- Git (for source installation)
Installation with pip
The simplest way to install vLLM is using pip:
```bash
# Create a new conda environment (recommended)
conda create -n vllm python=3.9 -y
conda activate vllm

# Install vLLM with CUDA 12.1 (the default build)
pip install vllm
```
For CUDA 11.8 compatibility:
```bash
# Install vLLM with CUDA 11.8
export VLLM_VERSION=0.6.0
export PYTHON_VERSION=39
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
Building from Source
For maximum compatibility or to use specific features:
```bash
# Clone the repository
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Optionally build with multi-LoRA (Punica) kernels
# export VLLM_INSTALL_PUNICA_KERNELS=1

# Install in development mode
pip install -e .  # This may take 5-10 minutes
```
Using Docker
vLLM also provides Docker images for easy deployment:
```bash
# Use the NVIDIA PyTorch Docker image
# `--ipc=host` ensures sufficient shared memory
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3

# Inside the container, install vLLM
pip install vllm
```
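Whichever installation path you use, a quick sanity check is to import the package and confirm that CUDA is visible:

```python
# Post-install sanity check: the package imports and a GPU is visible to PyTorch.
import torch
import vllm

print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
```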
Practical Exercise: Serving a Model with vLLM
This exercise demonstrates how to set up and use vLLM to serve a language model with an OpenAI-compatible API.
Step 1: Install vLLM
First, ensure vLLM is installed as described in the installation guide above.
Step 2: Start an OpenAI-compatible API Server
vLLM provides a simple command to start an API server compatible with the OpenAI API:
```bash
# Start a server with Llama-2-7b-chat
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 8000
```
This command:
- Loads the Llama-2-7b-chat model from Hugging Face
- Uses a tensor parallelism size of 1 (using a single GPU)
- Binds the server to all network interfaces on port 8000
For larger models or multiple GPUs, adjust the --tensor-parallel-size parameter.
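Before sending any prompts, you can confirm the server is up by listing the models it serves via the OpenAI-compatible /v1/models endpoint:

```python
# Sanity check: list the models served by the running vLLM API server.
import requests

resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```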
Step 3: Query the API Server
Once the server is running, you can query it using standard HTTP requests:
```python
import requests
import json

# Define the API endpoint
api_url = "http://localhost:8000/v1/chat/completions"

# Prepare the request payload
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention in vLLM?"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
}

# Send the request
headers = {"Content-Type": "application/json"}
response = requests.post(api_url, headers=headers, data=json.dumps(payload))

# Print the response
print(json.dumps(response.json(), indent=4))
```
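Because the server implements the OpenAI protocol, the official openai Python client (version 1.0 or later) works as well; the API key can be any placeholder string unless the server was started with --api-key. A sketch of the same request:

```python
# Same request via the official OpenAI Python client (requires openai>=1.0).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention in vLLM?"},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(completion.choices[0].message.content)
```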
Step 4: Streaming Responses
vLLM also supports streaming responses, which is useful for real-time applications:
```python
import requests
import json

api_url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about artificial intelligence."}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": True  # Enable streaming
}

headers = {"Content-Type": "application/json"}
response = requests.post(api_url, headers=headers, data=json.dumps(payload), stream=True)

# Process the streaming response (server-sent events)
for line in response.iter_lines():
    if line:
        line_text = line.decode('utf-8')
        if line_text.startswith('data: '):
            data_str = line_text[6:]  # Remove 'data: ' prefix
            if data_str != '[DONE]':
                try:
                    data = json.loads(data_str)
                    if 'choices' in data and len(data['choices']) > 0:
                        delta = data['choices'][0].get('delta', {})
                        if 'content' in delta:
                            print(delta['content'], end='', flush=True)
                except json.JSONDecodeError:
                    pass
```
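The same streaming output is simpler to consume through the openai client, which parses the server-sent events for you (again assuming openai>=1.0 is installed):

```python
# Streaming via the OpenAI client: iterate over chunks as they arrive.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Write a short poem about artificial intelligence."},
    ],
    temperature=0.7,
    max_tokens=500,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```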
Step 5: Advanced Configuration
For production deployments, you might want to adjust various parameters:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 256 \
    --host 0.0.0.0 \
    --port 8000
```
This configuration:
- Distributes the model across 2 GPUs
- Sets the maximum sequence length to 8192 tokens
- Limits the maximum number of tokens processed in a batch to 16384
- Allows up to 256 concurrent sequences
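When vLLM is embedded directly in Python rather than run as a standalone server, roughly the same knobs are available as engine arguments on the LLM class; the sketch below mirrors the command above, and the values are illustrative rather than tuned recommendations:

```python
from vllm import LLM, SamplingParams

# Engine arguments mirroring the server flags above (illustrative values).
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,        # distribute the model across 2 GPUs
    max_model_len=8192,            # maximum sequence length
    max_num_batched_tokens=16384,  # per-batch token budget
    max_num_seqs=256,              # maximum concurrent sequences
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may reserve
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```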
Resources
Official Documentation and Repositories
- vLLM GitHub Repository - Source code, releases, and issue tracker (github.com/vllm-project/vllm)
- vLLM Documentation - Official installation and usage guides (docs.vllm.ai)
Technical Papers and Articles
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - Original announcement blog post
- vLLM Paper (SOSP 2023) - Academic paper detailing the technology
Community Resources
- vLLM Meetups - Regular community events
- vLLM Office Hours - Biweekly online sessions