vLLM
AI & LLMs
A high-throughput and memory-efficient inference and serving engine for LLMs
Release History
v0.22.0 (23 fixes, 95 features): This release focuses heavily on DeepSeek V4 maturity with new kernel support and packaging, significant advancements in Model Runner V2, and the introduction of an experimental Rust frontend. Performance saw notable gains from batch-invariant inference with Cutlass FP8 and the rollout of multi-tier KV cache offloading.
v0.21.0 (Breaking; 15 fixes, 22 features): This release introduces significant performance and stability improvements, notably integrating KV offloading with the Hybrid Memory Allocator and enabling speculative decoding with thinking budgets. It also formally deprecates support for older versions of the Transformers library.
v0.20.2 (4 fixes): vLLM v0.20.2 is a small patch release focused on bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL models.
v0.20.1 (11 fixes, 6 features): vLLM v0.20.1 is a patch release focused on stabilizing and improving performance for DeepSeek V4, including various kernel optimizations and critical bug fixes across CUDA and ROCm platforms.
v0.20.0 (Breaking; 6 fixes, 10 features): v0.20.0 introduces major infrastructure upgrades, including a default switch to CUDA 13.0 and PyTorch 2.11, alongside significant performance enhancements like TurboQuant 2-bit KV cache and the re-enabling of FlashAttention 4 as default prefill backend.
v0.19.1 (7 fixes, 2 features): This patch release upgrades to Transformers v5.5.4 and delivers numerous bug fixes specifically targeting Gemma4 streaming, tool calls, and model loading, alongside adding support for Gemma4 Eagle3 and quantized MoE.
v0.19.0
v0.18.1: This is a patch release on top of v0.18.0 intended to address a few minor issues.
v0.18.0 (Breaking; 26 fixes, 29 features): v0.18.0 introduces major features like gRPC serving, GPU-less render serving, and significant improvements to KV cache offloading and Elastic Expert Parallelism. Ray is now an optional dependency, and numerous model-specific fixes and kernel optimizations have been integrated.
v0.17.1 (6 fixes): This patch release addresses several issues primarily related to TRTLLM MoE backends, Mamba/Qwen SSM caching, and MTP handling.
v0.17.0 (Breaking; 11 fixes, 18 features): vLLM v0.17.0 introduces a major upgrade to PyTorch 2.10, integrates FlashAttention 4, and significantly matures Model Runner V2 with features like Pipeline Parallelism. This release also adds full support for the Qwen3.5 model family and introduces new performance tuning flags.
v0.16.0 (Breaking; 31 fixes, 51 features): vLLM v0.16.0 introduces full support for Async scheduling with Pipeline Parallelism, a new Realtime WebSocket API, and a major overhaul of XPU platform support by deprecating IPEX in favor of vllm-xpu-kernels. This release also includes extensive model support additions and performance optimizations across various hardware platforms.
v0.15.1
v0.15.0 (Breaking; 7 fixes, 50 features): This release introduces extensive model support, significant performance enhancements across NVIDIA and AMD hardware (especially for MoE and FP4), and new API features like session-based streaming input. Several deprecated metrics and quantization methods have been removed.
v0.14.1 (2 fixes): This is a patch release addressing security vulnerabilities and memory leaks found in the previous version.
v0.14.0
v0.13.0 (Breaking; 6 fixes, 10 features): vLLM v0.13.0 introduces support for NVIDIA Blackwell Ultra and DeepSeek-V3.2, alongside a major performance overhaul for Whisper models. This release transitions attention configuration from environment variables to CLI arguments and includes significant core engine optimizations like Model Runner V2.
v0.12.0 (Breaking; 5 fixes, 8 features): vLLM v0.12.0 introduces a major architectural shift with GPU Model Runner V2 and PyTorch 2.9.0 integration. It delivers significant performance gains (up to 18% throughput) and expands support for DeepSeek-V3, multimodal models, and AMD hardware.
v0.11.2 (4 fixes): This release provides four critical bug fixes addressing Ray multi-node clusters, speculative decoding assertions, and FlashAttention MLA scheduling.
v0.11.1 (Breaking; 9 fixes, 10 features): This release updates vLLM to PyTorch 2.9.0 and CUDA 12.9.1, introduces Anthropic API compatibility, and significantly improves the stability of async scheduling and torch.compile integration.
v0.11.0 (Breaking; 6 fixes, 8 features): This release marks the complete transition to the V1 engine, removing all V0 components while introducing support for DeepSeek-V3.2 and Qwen3 architectures. It features significant performance optimizations including KV cache CPU offloading, DeepGEMM by default, and Dual-Batch Overlap.
v0.10.2 (Breaking; 5 fixes, 8 features): vLLM 0.10.2 introduces native aarch64 support, PyTorch 2.8.0 integration, and extensive optimizations for NVIDIA Blackwell GPUs. It expands model support to include Whisper and various vision-language models while maturing the V1 engine core.
v0.10.1.1 (3 fixes): A critical bugfix and security release addressing vulnerabilities in HTTP header handling and unsafe type conversion, alongside a fix for CUTLASS MLA CUDAGraphs.
v0.10.1 (Breaking; 5 fixes, 10 features): v0.10.1 introduces support for Blackwell and RTX 5090 GPUs, expands vision-language model compatibility, and adds a plugin system for model loaders. It also deprecates V0 FA3 support and removes AQLM quantization.
v0.10.1rc1 (Breaking; 10 fixes, 11 features): This release introduces model loader plugins, official Emu3 support, and significant performance optimizations for MoE kernels and FlashInfer. It also includes critical bug fixes for TPU, ROCm, and various quantization backends.
v0.10.0 (Breaking; 6 fixes, 9 features): v0.10.0 introduces the V1 engine as the primary focus, removing several legacy V0 backends and features while adding support for Llama 4 and NVIDIA Blackwell optimizations. It features significant performance improvements via async scheduling and microbatch tokenization.
v0.10.0rc2 (Breaking; 9 fixes, 11 features): This release introduces VLM support via the transformers backend, enables shared-memory pipeline parallelism for CPUs, and adds support for NVIDIA SM100 (Blackwell) architectures. It also includes significant performance optimizations for MLA kernels and KV cache management alongside various bug fixes for distributed logging and ray integration.
v0.10.0rc1 (Breaking; 10 fixes, 12 features): This release introduces fp8 support for Triton experts, adds Llama 4 support, and migrates CPU/XPU/TPU backends exclusively to the V1 engine. It also includes significant performance optimizations for quantization kernels and initial support for the OpenAI Responses API.
v0.9.2 (Breaking; 6 fixes, 10 features): This release marks the final transition phase to the V1 engine, introducing Blackwell (SM100/120) support, Expert-Parallel Load Balancing, and expanded multi-modal/audio API capabilities. It includes significant performance optimizations for CUDA-Graphs and broadens hardware support for Intel GPUs and TPUs.
v0.9.2rc2 (Breaking; 10 fixes, 10 features): This release focuses on the transition to the V1 engine by removing legacy V0 backends for CPU/TPU/XPU, while adding support for Blackwell (SM100) and Llama 4. Key improvements include FP8 kernel optimizations, FlexAttention enhancements, and expanded multimodal support in the frontend.
v0.9.2rc1 (Breaking; 9 fixes, 9 features): This release introduces support for Qwen3 Embedding/Reranker models, enables ROCm V1 by default, and adds several performance optimizations including deep_gemm support and vectorized INT8 kernels. It also includes critical bug fixes for structured outputs and CUDAGraph stability.
v0.9.1 (Breaking; 7 fixes, 11 features): This release introduces significant performance optimizations for large-scale serving, including DP/EP CUDA graph support and Blackwell hardware integration. It also enforces stricter API usage by removing several long-standing deprecations and positional argument support in the LLM class.
v0.9.1rc1 (Breaking; 8 fixes, 7 features): This release introduces quantization and multi-LoRA support for Neuron/TPU, migrates configurations to Pydantic dataclasses, and enforces stricter keyword-only arguments for LLM initialization. It also includes significant bug fixes for MLA attention accuracy and V1 backend stability.
v0.9.0.1 (1 fix): This patch release provides a critical bug fix for DeepSeek models running on NVIDIA Ampere and older GPU architectures.
v0.9.0 (Breaking; 5 fixes, 8 features): vLLM v0.9.0 upgrades to PyTorch 2.7 and CUDA 12.8, introducing initial NVIDIA Blackwell support and advanced scaling features like Expert and Data Parallelism. It also includes significant model expansions, a migration to MkDocs, and a shift to deterministic defaults for the V1 engine.
v0.8.5.post1 (2 fixes): This patch release addresses a memory leak in request data caching and corrects accuracy issues related to sliding window attention in the V1 engine.
v0.8.5 (Breaking; 8 fixes, 10 features): This release introduces Day 0 support for Qwen3, structural tag tool calling via xgrammar, and disaggregated serving via the KV Connector API. It includes significant performance optimizations for MoE kernels and breaking changes to CLI argument formatting for chunked prefill and multi-step outputs.
v0.8.4 (Breaking; 7 fixes, 10 features): This release introduces support for Llama4 and Qwen3 models, alongside significant performance optimizations for DeepSeek MLA and MoE kernels. It also stabilizes the V1 engine by enabling multi-input and structured outputs by default.
v0.8.3 (Breaking; 7 fixes, 11 features): This release introduces Day 0 support for Llama 4 (V1 engine only) and native sliding window attention. It also features significant performance optimizations for MoE kernels, expanded hardware support for AMD and TPU, and architectural improvements to the V1 engine.
v0.8.3rc1 (10 fixes, 11 features): This release focuses on expanding V1 engine capabilities, including CPU MLA support, improved TPU stability, and new model support for Molmo and Granite. It also introduces several kernel optimizations for MoE and FP8 quantization.
v0.8.2 (Breaking; 7 fixes, 11 features): This release focuses on V1 engine stability, including a critical memory fix and crash prevention for duplicate request IDs. It introduces FP8 KV Cache support, a new fastsafetensors loader, and expands hardware support for TPU and ROCm.
v0.8.1 (Breaking; 10 fixes, 6 features): v0.8.1 is a maintenance release focusing on V1 engine stability, adding Zamba2 support, and enabling LoRA for embedding models. It includes critical fixes for sampling dtypes, quantization, and TPU performance.
v0.8.0 (Breaking; 4 fixes, 8 features): v0.8.0 enables the V1 engine by default, introduces support for NVIDIA Blackwell and Gemma 3, and significantly optimizes DeepSeek model performance through FlashMLA and Expert Parallelism.
v0.8.0rc2 (Breaking; 7 fixes, 4 features): This release focuses on V1 engine refinements, including making MLA the default and removing the input cache client. It also includes critical bug fixes for Ultravox, Mixtral, and ROCm testing environments.
v0.8.0rc1 (Breaking; 10 fixes, 12 features): This release introduces Expert Parallelism for DeepSeek models, a new /score endpoint for embeddings, and significant V1 engine enhancements including parallel sampling. It also removes global seed setting, requiring users to manually define seeds for reproducibility.
v0.7.3 (Breaking; 6 fixes, 9 features): This release introduces significant DeepSeek optimizations including Multi-Token Prediction and MLA FlashAttention3 support, alongside major V1 Engine updates like LoRA and Pipeline Parallelism. It expands hardware support for TPU, ROCm, and Gaudi while adding several new model architectures and quantization methods.
v0.7.2 (Breaking; 8 fixes, 10 features): This release introduces support for Qwen2.5-VL and a new transformers backend for arbitrary model support. It significantly improves DeepSeek model performance through KV cache memory alignment and torch.compile optimizations.
v0.7.1 (9 fixes, 10 features): This release introduces significant MLA and FP8 kernel optimizations for DeepSeek models, resulting in 3x throughput and 10x memory capacity improvements. It also expands hardware support for Neuron and AMD, adds MiniCPM-o, and enhances the V1 engine with new metrics and prefix caching.
v0.7.0 (Breaking; 6 fixes, 8 features): This release introduces the V1 engine alpha for improved performance and architectural simplicity, alongside full torch.compile integration. It adds support for several new models including DeepSeek-VL2 and Whisper, while expanding hardware compatibility for Apple Silicon, AMD, and TPU.
Common Errors
EngineDeadError (4 reports): EngineDeadError in vLLM usually indicates a fatal error within the engine, such as a CUDA out-of-memory condition, a kernel crash, or an assertion failure caused by a hardware or software incompatibility. To fix it, examine the logs for the specific error message (CUDA error, assertion failure) and address the root cause, which might involve reducing the batch size, using a different kv_cache_dtype, or resolving a hardware or driver incompatibility. Restart the vLLM engine after resolving the issue to ensure a clean state.
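As a rough aid for the log examination step, the sketch below scans log text for a few failure signatures. The pattern table and the `classify_log` helper are hypothetical and not part of vLLM, and the exact log wording varies between vLLM, PyTorch, and driver versions, so treat the patterns as a starting point.

```python
import re
from typing import Dict, List

# Hypothetical signature table for illustration only; extend it with the
# messages you actually see in your own logs.
PATTERNS: Dict[str, str] = {
    "cuda_oom": r"CUDA out of memory|OutOfMemoryError",
    "cuda_error": r"CUDA error",
    "assertion": r"AssertionError",
}


def classify_log(text: str) -> List[str]:
    """Return the labels whose pattern appears anywhere in the log text."""
    return [label for label, pattern in PATTERNS.items() if re.search(pattern, text)]
```

A `cuda_oom` hit points toward reducing the batch size or changing kv_cache_dtype, while `cuda_error` or `assertion` hits are more likely to need a driver, hardware, or version check.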
InternalServerError (3 reports): InternalServerError in vLLM often arises from unexpected exceptions during tensor operations, model loading, or inference requests, especially when resources are limited or component versions are incompatible. To fix it, examine the vLLM logs for specific error messages and stack traces and address the underlying cause; this may involve increasing available resources (CPU/GPU memory), updating vLLM or its dependencies to compatible versions, or adjusting the model configuration to reduce resource demands. Also verify that the model weights are valid and load correctly, and that multimodal models receive images in a supported format.
NotImplementedError (3 reports): NotImplementedError in vLLM usually arises when a requested feature, often a CUDA or otherwise optimized operation for a particular data type (such as Float8), has not been implemented or compiled for your hardware (especially ROCm or older GPUs) or for a specific model architecture. To fix this, use a supported data type (such as float16 or bfloat16) and, if you need ROCm, make sure your vLLM build includes ROCm support. Otherwise, wait for a release that implements the operation for your hardware or model, or contribute the missing implementation yourself following the vLLM documentation.
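The data-type fallback can be sketched as a small retry loop. `load_with_dtype_fallback` and its `load` callable are hypothetical names used for illustration; in practice `load` might wrap something like `LLM(model=..., dtype=dtype)`, and the error may only surface at inference time rather than at load time, so this is a sketch under those assumptions.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def load_with_dtype_fallback(
    load: Callable[[str], T],
    dtypes: Iterable[str] = ("bfloat16", "float16"),
) -> Tuple[T, str]:
    """Try each dtype in order and return (result, dtype) for the first success.

    `load` is a caller-supplied (hypothetical) function that builds the model
    for a given dtype string and may raise NotImplementedError.
    """
    last_error = None
    for dtype in dtypes:
        try:
            return load(dtype), dtype
        except NotImplementedError as exc:
            last_error = exc  # remember the failure and try the next dtype
    raise NotImplementedError("no candidate dtype was supported") from last_error
```

Returning the dtype alongside the result makes it easy to log which precision was actually used.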
RuntimeError (2 reports): RuntimeError in vLLM often arises from unexpected tensor shape mismatches, especially during operations like attention or sampling. To resolve this, inspect the shapes of the tensors involved with debug prints just before the failing operation and check that they match the dimensions expected from the batch size, sequence length, and vocabulary size. Then adjust the reshaping, padding, or slicing logic so that shapes stay consistent across all inputs and intermediate computations.
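A minimal sketch of such a shape check, written without any tensor library so it runs anywhere. `check_shape` is a hypothetical helper, not a vLLM function; it accepts any sequence of integers, which includes the `.shape` of a PyTorch tensor.

```python
from typing import Optional, Sequence, Tuple


def check_shape(
    name: str,
    shape: Sequence[int],
    expected: Tuple[Optional[int], ...],
) -> None:
    """Print the shape, then raise ValueError if it does not match `expected`.

    A `None` entry in `expected` accepts any size on that axis.
    """
    actual = tuple(shape)
    print(f"[shape-debug] {name}: {actual}")  # debug print before the failing op
    if len(actual) != len(expected):
        raise ValueError(f"{name}: rank {len(actual)}, expected rank {len(expected)}")
    for axis, (got, want) in enumerate(zip(actual, expected)):
        if want is not None and got != want:
            raise ValueError(f"{name}: axis {axis} is {got}, expected {want}")
```

For example, `check_shape("logits", logits.shape, (batch_size, None, vocab_size))` placed just before the failing call reports which axis disagrees instead of leaving you with a generic RuntimeError.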
UnboundLocalError (2 reports): UnboundLocalError arises when a variable is referenced before assignment within its scope. To fix this, ensure the variable is assigned a value before being used, often by initializing it (e.g., `name_mapped = None`) before the conditional block where it might be conditionally assigned. This guarantees the variable exists in the local scope regardless of the execution path.
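A minimal illustration of the failure and the fix. The function names and the `ALIASES` table are hypothetical and not taken from vLLM; only the `name_mapped` variable comes from the description above.

```python
from typing import Optional

# Hypothetical alias table, for illustration only.
ALIASES = {"wq": "q_proj", "wk": "k_proj"}


def map_weight_name_buggy(name: str) -> str:
    # `name_mapped` is assigned only when the branch is taken.
    if name in ALIASES:
        name_mapped = ALIASES[name]
    # Raises UnboundLocalError when `name` is not in ALIASES.
    return name_mapped


def map_weight_name_fixed(name: str) -> Optional[str]:
    # Initialize first so the name exists on every execution path.
    name_mapped = None
    if name in ALIASES:
        name_mapped = ALIASES[name]
    return name_mapped
```

With the initialization in place, callers get `None` for an unknown name and can decide how to handle it, rather than hitting an exception inside the helper.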
ValueError (2 reports): ValueError in vLLM often arises from incorrect or missing configuration data, such as an unsupported model architecture, a missing score.weight file for specific runners like pooling, or runner arguments that are incompatible with embedding models. To fix this, ensure that the model architecture specified in the command-line arguments matches the model actually being loaded, verify that all required weight files are present and correctly located, and review runner-specific requirements, such as using a compatible runner (e.g. the 'pooling' runner) for embedding models.
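One way to check the architecture before launching is to read the `architectures` field from the model's Hugging Face `config.json`. The helper name `unsupported_architectures` is hypothetical, and the list of supported architectures is an assumption that the caller must supply for their own vLLM version.

```python
import json
from pathlib import Path
from typing import Iterable, List


def unsupported_architectures(model_dir: str, supported: Iterable[str]) -> List[str]:
    """Return architectures declared in config.json that are not in `supported`.

    `supported` is caller-supplied; this sketch does not query vLLM for it.
    """
    config = json.loads((Path(model_dir) / "config.json").read_text())
    declared = config.get("architectures") or []
    supported_set = set(supported)
    return [arch for arch in declared if arch not in supported_set]
```

An empty result means every declared architecture is in the supplied list; a non-empty result names the architecture to look up in the vLLM supported-models documentation.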
Related AI & LLMs Packages
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
🦜🔗 The platform for reliable agents.
The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
LLM inference in C/C++
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.