vLLM

A high-throughput and memory-efficient inference and serving engine for Large Language Models

Tags: intermediate, inference, serving, optimization, GPU, open-source

Alternative To

  • HuggingFace TGI
  • TensorRT-LLM
  • LMDeploy
  • SGLang

Difficulty Level

Intermediate

Requires some technical experience. Moderate setup complexity.

Overview

vLLM is a fast and efficient library for Large Language Model (LLM) inference and serving, originally developed at UC Berkeley’s Sky Computing Lab. It addresses the critical challenges of LLM deployment: high memory usage, slow inference speed, and inefficient resource utilization. At its core, vLLM introduces PagedAttention, an innovative attention algorithm that significantly optimizes memory management for attention keys and values, inspired by virtual memory and paging concepts from operating systems.

The library enables high-throughput LLM serving with dramatically improved performance compared to traditional solutions, making it possible to serve complex language models efficiently even with limited computational resources. vLLM has evolved into a community-driven project with contributions from both academia and industry, becoming a cornerstone technology for organizations looking to deploy LLMs in production environments.

Key Features

Feature | Description
PagedAttention | Novel attention algorithm that partitions KV cache into blocks, reducing memory waste by up to 96%
Continuous Batching | Efficiently processes incoming requests without waiting for a full batch to form
High Throughput | Delivers up to 24x higher throughput than HuggingFace Transformers
Tensor Parallelism | Supports distributed inference across multiple GPUs
Pipeline Parallelism | Enables model distribution across multiple devices for larger models
Quantization Support | Includes GPTQ, AWQ, INT4, INT8, and FP8 for reduced memory footprint
Optimized CUDA Kernels | Integration with FlashAttention and FlashInfer for maximum performance
Streaming Outputs | Real-time token generation with minimal latency
OpenAI-Compatible API | Drop-in replacement for OpenAI’s API, simplifying integration
Multi-Hardware Support | Works with NVIDIA GPUs, AMD CPUs/GPUs, Intel CPUs/GPUs, TPUs, and AWS Neuron
Prefix Caching | Automatically caches and reuses computation for common prefixes
Multi-LoRA Support | Efficiently serves multiple fine-tuned models with minimal overhead
Speculative Decoding | Accelerates generation by predicting multiple tokens at once
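
As a quick taste of the library in use, the sketch below runs offline batched generation with vLLM's LLM and SamplingParams classes. The tiny facebook/opt-125m model is only an example chosen so the snippet runs on modest hardware; any Hugging Face model that fits your GPU can be substituted:

from vllm import LLM, SamplingParams

# Load a small example model (substitute any Hugging Face model that fits in GPU memory)
llm = LLM(model="facebook/opt-125m")

# Nucleus sampling with a modest temperature; cap each completion at 64 tokens
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "The capital of France is",
    "The key idea behind paged memory management is",
]

# generate() batches the prompts internally using continuous batching
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)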

Technical Details

vLLM’s architecture is designed to maximize throughput while minimizing memory overhead, particularly addressing the inefficiencies in traditional attention mechanisms.

PagedAttention Technology

PagedAttention is the core innovation behind vLLM’s performance advantages. It solves the key-value (KV) cache fragmentation problem by:

  1. Partitioning the KV cache into fixed-size blocks (similar to memory pages in operating systems)
  2. Allowing non-contiguous storage of keys and values in memory
  3. Using a block table to map logical sequence positions to physical memory blocks
  4. Enabling efficient memory sharing between sequences with common prefixes

This approach dramatically reduces memory waste and enables more efficient batching of requests, leading to significantly higher throughput.
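
To make the block-table idea concrete, here is a purely conceptual Python sketch (it is not vLLM's actual implementation): each sequence maps logical token positions to whatever physical blocks happen to be free, so memory is claimed one block at a time instead of reserving a large contiguous region per sequence.

# Conceptual sketch only -- NOT vLLM's internal code
BLOCK_SIZE = 4  # tokens stored per KV-cache block

class Allocator:
    """Hands out physical block ids from a shared pool."""
    def __init__(self):
        self.next_free = 0
    def allocate(self):
        block_id = self.next_free
        self.next_free += 1
        return block_id

class BlockTable:
    """Maps one sequence's logical token positions to physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.blocks = []  # logical block index -> physical block id

    def append_token(self, logical_pos):
        # Allocate a new physical block only when a block boundary is crossed
        if logical_pos // BLOCK_SIZE >= len(self.blocks):
            self.blocks.append(self.allocator.allocate())

    def lookup(self, logical_pos):
        # Translate a logical position into (physical block, offset within block)
        return self.blocks[logical_pos // BLOCK_SIZE], logical_pos % BLOCK_SIZE

# Two sequences growing together end up with interleaved, non-contiguous blocks,
# and neither one over-allocates memory it has not yet used
alloc = Allocator()
seq_a, seq_b = BlockTable(alloc), BlockTable(alloc)
for pos in range(6):
    seq_a.append_token(pos)
    seq_b.append_token(pos)
print(seq_a.blocks, seq_b.blocks)  # [0, 2] [1, 3]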

Supported Model Architectures

vLLM supports a wide range of model architectures:

  • Transformer-based LLMs (Llama, Mistral, Falcon, etc.)
  • Mixture-of-Experts models (Mixtral, DeepSeek-V2/V3)
  • Embedding models (E5-Mistral)
  • Multi-modal LLMs (LLaVA)

Version Information

Version | Release Date | Key Features Added
v0.6.0 | September 2024 | 2.7x throughput improvement, 5x latency reduction
v0.5.0 | July 2024 | Enhanced multi-modal support, improved quantization
v0.4.0 | May 2024 | Expanded hardware support, speculative decoding
v0.3.0 | March 2024 | Prefix caching, multi-LoRA support
v0.2.0 | December 2023 | Tensor parallelism improvements, streaming outputs
v0.1.0 | June 2023 | Initial release with PagedAttention

Why Use vLLM

vLLM offers several compelling advantages over alternative LLM serving solutions:

  1. Superior Performance: vLLM delivers up to 24x higher throughput than HuggingFace Transformers and up to 3.5x higher throughput than HuggingFace’s Text Generation Inference (TGI) in benchmark tests.

  2. Memory Efficiency: PagedAttention reduces memory waste by up to 96%, allowing more efficient use of GPU resources and enabling larger batch sizes.

  3. Cost Effectiveness: The improved throughput means the same infrastructure can handle significantly more traffic (up to 5x more in some cases) without requiring additional GPUs, translating to direct cost savings.

  4. Seamless Integration: vLLM provides an OpenAI-compatible API server, making it easy to integrate with existing applications that use OpenAI’s API.

  5. Flexibility: Support for various decoding algorithms (parallel sampling, beam search), tensor parallelism, and streaming outputs provides flexibility for different use cases; a short parallel-sampling sketch follows this list.

  6. Multi-Hardware Support: Unlike some alternatives that only support NVIDIA GPUs, vLLM works across a variety of hardware platforms.

  7. Active Development: As an active open-source project with contributions from both academia and industry, vLLM continues to improve with regular updates and new features.
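
To illustrate the decoding flexibility mentioned in point 5, here is a minimal parallel-sampling sketch. The model name is only an example; SamplingParams(n=...) requests several independent completions per prompt:

from vllm import LLM, SamplingParams

# Example model; substitute any model that fits your GPU
llm = LLM(model="facebook/opt-125m")

# Parallel sampling: return three independent completions for the same prompt
params = SamplingParams(n=3, temperature=0.9, top_p=0.95, max_tokens=32)
result = llm.generate(["Write a tagline for an open-source LLM server:"], params)[0]
for i, candidate in enumerate(result.outputs, start=1):
    print(f"Candidate {i}: {candidate.text.strip()}")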

System Requirements

Minimum Requirements

  • Operating System: Linux (the only fully supported platform)
  • Python: 3.8 - 3.11
  • CPU: 4+ cores
  • RAM: 16GB+
  • GPU: NVIDIA GPU with compute capability 7.0 or higher (V100, T4, RTX20xx, A100, L4, H100, etc.)
  • Storage: 5GB+
  • CUDA: CUDA 11.8 or 12.1 (binaries are compiled with CUDA 12.1 by default)

Recommended Requirements

  • CPU: 8+ cores
  • RAM: 32GB+
  • GPU: NVIDIA A100 (80GB) or H100 for large models
  • Storage: 20GB+ SSD

Hardware Compatibility

vLLM supports multiple hardware platforms:

  • NVIDIA GPUs (primary support)
  • AMD GPUs (via ROCm) and AMD CPUs
  • Intel CPUs and GPUs
  • PowerPC CPUs
  • TPUs
  • AWS Neuron (Trainium and Inferentia)

Installation Guide

Prerequisites

  • Linux operating system
  • Python 3.8 - 3.11
  • CUDA-compatible GPU
  • Git (for source installation)

Installation with pip

The simplest way to install vLLM is using pip:

# Create a new conda environment (recommended)
conda create -n vllm python=3.9 -y
conda activate vllm

# Install vLLM with CUDA 12.1
pip install vllm

For CUDA 11.8 compatibility:

# Install vLLM with CUDA 11.8
export VLLM_VERSION=0.6.0
export PYTHON_VERSION=39
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118

Building from Source

For maximum compatibility or to use specific features:

# Clone the repository
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Optionally build with multi-LoRA capability
# export VLLM_INSTALL_PUNICA_KERNELS=1

# Install in development mode
pip install -e .  # This may take 5-10 minutes

Using Docker

vLLM also provides Docker images for easy deployment:

# Use the NVIDIA PyTorch Docker image
# Use `--ipc=host` to ensure sufficient shared memory
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3

# Inside the container, install vLLM
pip install vllm
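
The project also publishes a pre-built vllm/vllm-openai image whose entrypoint is the OpenAI-compatible server. The model name below is only an example, and the exact tag and flags may vary between releases:

# Run the pre-built vLLM server image; arguments after the image name go to the server
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-2-7b-chat-hf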

Practical Exercise: Serving a Model with vLLM

This exercise demonstrates how to set up and use vLLM to serve a language model with an OpenAI-compatible API.

Step 1: Install vLLM

First, ensure vLLM is installed as described in the installation guide above.

Step 2: Start an OpenAI-compatible API Server

vLLM provides a simple command to start an API server compatible with the OpenAI API:

# Start a server with Llama-2-7b-chat
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 8000

This command:

  • Loads the Llama-2-7b-chat model from Hugging Face
  • Uses a tensor parallelism size of 1 (using a single GPU)
  • Binds the server to all network interfaces on port 8000

For larger models or multiple GPUs, adjust the --tensor-parallel-size parameter.
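
Before wiring up a client, you can sanity-check the server with the standard OpenAI-style endpoints it exposes (the model field must match the --model value used at startup):

# List the models the server is serving
curl http://localhost:8000/v1/models

# Quick smoke test against the completions endpoint
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "prompt": "Hello,", "max_tokens": 16}'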

Step 3: Query the API Server

Once the server is running, you can query it using standard HTTP requests:

import requests
import json

# Define the API endpoint
api_url = "http://localhost:8000/v1/chat/completions"

# Prepare the request payload
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention in vLLM?"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
}

# Send the request
headers = {"Content-Type": "application/json"}
response = requests.post(api_url, headers=headers, data=json.dumps(payload))

# Print the response
print(json.dumps(response.json(), indent=4))
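
Because the server speaks the OpenAI protocol, the official openai Python client (v1.x) can be pointed at it as well. Unless you started the server with an --api-key, the key value is arbitrary; "EMPTY" below is just a common placeholder:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention in vLLM?"},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)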

Step 4: Streaming Responses

vLLM also supports streaming responses, which is useful for real-time applications:

import requests
import json

api_url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about artificial intelligence."}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": True  # Enable streaming
}

headers = {"Content-Type": "application/json"}
response = requests.post(api_url, headers=headers, data=json.dumps(payload), stream=True)

# Process the streaming response
for line in response.iter_lines():
    if line:
        line_text = line.decode('utf-8')
        if line_text.startswith('data: '):
            data_str = line_text[6:]  # Remove 'data: ' prefix
            if data_str != '[DONE]':
                try:
                    data = json.loads(data_str)
                    if 'choices' in data and len(data['choices']) > 0:
                        delta = data['choices'][0].get('delta', {})
                        if 'content' in delta:
                            print(delta['content'], end='', flush=True)
                except json.JSONDecodeError:
                    pass
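
The same streaming flow is shorter with the openai client (v1.x assumed, as above), which parses the server-sent events for you:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# stream=True yields incremental chunks instead of a single final response
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Write a short poem about artificial intelligence."}],
    temperature=0.7,
    max_tokens=500,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()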

Step 5: Advanced Configuration

For production deployments, you might want to adjust various parameters:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 256 \
    --host 0.0.0.0 \
    --port 8000

This configuration:

  • Distributes the model across 2 GPUs
  • Sets the maximum sequence length to 8192 tokens
  • Limits the maximum number of tokens processed in a batch to 16384
  • Allows up to 256 concurrent sequences
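
A few other commonly tuned flags are sketched below; the names match recent vLLM releases (roughly the v0.6.x line), but confirm against --help for your installed version:

# --gpu-memory-utilization : fraction of each GPU's memory vLLM may use for weights and KV cache
# --enable-prefix-caching  : reuse cached KV blocks across requests that share a prompt prefix
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching \
    --host 0.0.0.0 \
    --port 8000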
