vLLM
A high-throughput and memory-efficient inference and serving engine for LLMs
Release History
v0.16.0 (Breaking, 31 fixes, 51 features): vLLM v0.16.0 introduces full support for async scheduling with pipeline parallelism, a new Realtime WebSocket API, and a major overhaul of XPU platform support, deprecating IPEX in favor of vllm-xpu-kernels. This release also includes extensive model support additions and performance optimizations across various hardware platforms.
v0.15.1 / v0.15.0 (Breaking, 7 fixes, 50 features): This release introduces extensive model support, significant performance enhancements across NVIDIA and AMD hardware (especially for MoE and FP4), and new API features like session-based streaming input. Several deprecated metrics and quantization methods have been removed.
v0.14.1 (2 fixes): This is a patch release addressing security vulnerabilities and memory leaks found in the previous version.
v0.14.0 / v0.13.0 (Breaking, 6 fixes, 10 features): vLLM v0.13.0 introduces support for NVIDIA Blackwell Ultra and DeepSeek-V3.2, alongside a major performance overhaul for Whisper models. This release transitions attention configuration from environment variables to CLI arguments and includes significant core engine optimizations like Model Runner V2.
v0.12.0 (Breaking, 5 fixes, 8 features): vLLM v0.12.0 introduces a major architectural shift with GPU Model Runner V2 and PyTorch 2.9.0 integration. It delivers significant performance gains (up to 18% throughput) and expands support for DeepSeek-V3, multimodal models, and AMD hardware.
v0.11.2 (4 fixes): This release provides four critical bug fixes addressing Ray multi-node clusters, speculative decoding assertions, and FlashAttention MLA scheduling.
v0.11.1 (Breaking, 9 fixes, 10 features): This release updates vLLM to PyTorch 2.9.0 and CUDA 12.9.1, introduces Anthropic API compatibility, and significantly improves the stability of async scheduling and torch.compile integration.
v0.11.0 (Breaking, 6 fixes, 8 features): This release marks the complete transition to the V1 engine, removing all V0 components while introducing support for DeepSeek-V3.2 and Qwen3 architectures. It features significant performance optimizations including KV cache CPU offloading, DeepGEMM by default, and Dual-Batch Overlap.
v0.10.2 (Breaking, 5 fixes, 8 features): vLLM 0.10.2 introduces native aarch64 support, PyTorch 2.8.0 integration, and extensive optimizations for NVIDIA Blackwell GPUs. It expands model support to include Whisper and various vision-language models while maturing the V1 engine core.
v0.10.1.1 (3 fixes): A critical bugfix and security release addressing vulnerabilities in HTTP header handling and unsafe type conversion, alongside a fix for CUTLASS MLA CUDAGraphs.
v0.10.1 (Breaking, 5 fixes, 10 features): v0.10.1 introduces support for Blackwell and RTX 5090 GPUs, expands vision-language model compatibility, and adds a plugin system for model loaders. It also deprecates V0 FA3 support and removes AQLM quantization.
v0.10.1rc1 (Breaking, 10 fixes, 11 features): This release introduces model loader plugins, official Emu3 support, and significant performance optimizations for MoE kernels and FlashInfer. It also includes critical bug fixes for TPU, ROCm, and various quantization backends.
v0.10.0 (Breaking, 6 fixes, 9 features): v0.10.0 introduces the V1 engine as the primary focus, removing several legacy V0 backends and features while adding support for Llama 4 and NVIDIA Blackwell optimizations. It features significant performance improvements via async scheduling and microbatch tokenization.
v0.10.0rc2 (Breaking, 9 fixes, 11 features): This release introduces VLM support via the transformers backend, enables shared-memory pipeline parallelism for CPUs, and adds support for NVIDIA SM100 (Blackwell) architectures. It also includes significant performance optimizations for MLA kernels and KV cache management alongside various bug fixes for distributed logging and Ray integration.
v0.10.0rc1 (Breaking, 10 fixes, 12 features): This release introduces fp8 support for Triton experts, adds Llama 4 support, and migrates CPU/XPU/TPU backends exclusively to the V1 engine. It also includes significant performance optimizations for quantization kernels and initial support for the OpenAI Responses API.
v0.9.2 (Breaking, 6 fixes, 10 features): This release marks the final transition phase to the V1 engine, introducing Blackwell (SM100/120) support, Expert-Parallel Load Balancing, and expanded multi-modal/audio API capabilities. It includes significant performance optimizations for CUDA Graphs and broadens hardware support for Intel GPUs and TPUs.
v0.9.2rc2 (Breaking, 10 fixes, 10 features): This release focuses on the transition to the V1 engine by removing legacy V0 backends for CPU/TPU/XPU, while adding support for Blackwell (SM100) and Llama 4. Key improvements include FP8 kernel optimizations, FlexAttention enhancements, and expanded multimodal support in the frontend.
v0.9.2rc1 (Breaking, 9 fixes, 9 features): This release introduces support for Qwen3 Embedding/Reranker models, enables ROCm V1 by default, and adds several performance optimizations including deep_gemm support and vectorized INT8 kernels. It also includes critical bug fixes for structured outputs and CUDAGraph stability.
v0.9.1 (Breaking, 7 fixes, 11 features): This release introduces significant performance optimizations for large-scale serving, including DP/EP CUDA graph support and Blackwell hardware integration. It also enforces stricter API usage by removing several long-standing deprecations and positional argument support in the LLM class.
v0.9.1rc1 (Breaking, 8 fixes, 7 features): This release introduces quantization and multi-LoRA support for Neuron/TPU, migrates configurations to Pydantic dataclasses, and enforces stricter keyword-only arguments for LLM initialization. It also includes significant bug fixes for MLA attention accuracy and V1 backend stability.
v0.9.0.1 (1 fix): This patch release provides a critical bug fix for DeepSeek models running on NVIDIA Ampere and older GPU architectures.
v0.9.0 (Breaking, 5 fixes, 8 features): vLLM v0.9.0 upgrades to PyTorch 2.7 and CUDA 12.8, introducing initial NVIDIA Blackwell support and advanced scaling features like Expert and Data Parallelism. It also includes significant model expansions, a migration to MkDocs, and a shift to deterministic defaults for the V1 engine.
v0.8.5.post1 (2 fixes): This patch release addresses a memory leak in request data caching and corrects accuracy issues related to sliding window attention in the V1 engine.
v0.8.5 (Breaking, 8 fixes, 10 features): This release introduces Day 0 support for Qwen3, structural tag tool calling via xgrammar, and disaggregated serving via the KV Connector API. It includes significant performance optimizations for MoE kernels and breaking changes to CLI argument formatting for chunked prefill and multi-step outputs.
v0.8.4 (Breaking, 7 fixes, 10 features): This release introduces support for Llama 4 and Qwen3 models, alongside significant performance optimizations for DeepSeek MLA and MoE kernels. It also stabilizes the V1 engine by enabling multi-input and structured outputs by default.
v0.8.3 (Breaking, 7 fixes, 11 features): This release introduces Day 0 support for Llama 4 (V1 engine only) and native sliding window attention. It also features significant performance optimizations for MoE kernels, expanded hardware support for AMD and TPU, and architectural improvements to the V1 engine.
v0.8.3rc1 (10 fixes, 11 features): This release focuses on expanding V1 engine capabilities, including CPU MLA support, improved TPU stability, and new model support for Molmo and Granite. It also introduces several kernel optimizations for MoE and FP8 quantization.
v0.8.2 (Breaking, 7 fixes, 11 features): This release focuses on V1 engine stability, including a critical memory fix and crash prevention for duplicate request IDs. It introduces FP8 KV Cache support, a new fastsafetensors loader, and expands hardware support for TPU and ROCm.
v0.8.1 (Breaking, 10 fixes, 6 features): v0.8.1 is a maintenance release focusing on V1 engine stability, adding Zamba2 support, and enabling LoRA for embedding models. It includes critical fixes for sampling dtypes, quantization, and TPU performance.
v0.8.0 (Breaking, 4 fixes, 8 features): v0.8.0 enables the V1 engine by default, introduces support for NVIDIA Blackwell and Gemma 3, and significantly optimizes DeepSeek model performance through FlashMLA and Expert Parallelism.
v0.8.0rc2 (Breaking, 7 fixes, 4 features): This release focuses on V1 engine refinements, including making MLA the default and removing the input cache client. It also includes critical bug fixes for Ultravox, Mixtral, and ROCm testing environments.
v0.8.0rc1 (Breaking, 10 fixes, 12 features): This release introduces Expert Parallelism for DeepSeek models, a new /score endpoint, and significant V1 engine enhancements including parallel sampling. It also removes global seed setting, requiring users to manually define seeds for reproducibility.
v0.7.3 (Breaking, 6 fixes, 9 features): This release introduces significant DeepSeek optimizations including Multi-Token Prediction and MLA FlashAttention3 support, alongside major V1 Engine updates like LoRA and Pipeline Parallelism. It expands hardware support for TPU, ROCm, and Gaudi while adding several new model architectures and quantization methods.
v0.7.2 (Breaking, 8 fixes, 10 features): This release introduces support for Qwen2.5-VL and a new transformers backend for arbitrary model support. It significantly improves DeepSeek model performance through KV cache memory alignment and torch.compile optimizations.
v0.7.1 (9 fixes, 10 features): This release introduces significant MLA and FP8 kernel optimizations for DeepSeek models, resulting in 3x throughput and 10x memory capacity improvements. It also expands hardware support for Neuron and AMD, adds MiniCPM-o, and enhances the V1 engine with new metrics and prefix caching.
v0.7.0 (Breaking, 6 fixes, 8 features): This release introduces the V1 engine alpha for improved performance and architectural simplicity, alongside full torch.compile integration. It adds support for several new models including Deepseek-VL2 and Whisper, while expanding hardware compatibility for Apple Silicon, AMD, and TPU.
Common Errors
InternalServerError (2 reports): InternalServerError in vLLM often stems from unexpected issues during inference, such as CUDA errors during sampling or invalid configurations. To fix it, examine the vLLM server logs for the specific error message, then address the root cause: this might mean adjusting the CUDA configuration, fixing invalid request parameters (e.g., a malformed response_format), or increasing the RPC timeout if the logs point to one. Always validate your input data and configuration for compatibility with the model and vLLM version.
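While the root cause is being diagnosed from the logs, transient server errors can often be absorbed client-side with a bounded retry. The sketch below is a generic wrapper, not vLLM API: request_fn is any callable you supply, and RuntimeError stands in for your client library's actual error class.

```python
import time

def call_with_retries(request_fn, max_attempts=3, base_delay=0.5):
    """Retry a server call that may fail transiently.

    request_fn is a zero-argument callable that issues the request;
    RuntimeError is a stand-in for your client's InternalServerError
    type. Persistent failures (e.g., a genuinely invalid request)
    still surface after max_attempts, so real bugs are not swallowed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except RuntimeError:
            if attempt == max_attempts:
                raise
            # Exponential backoff before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Keep max_attempts small: if the error persists past a couple of retries, the problem is almost certainly a configuration or request bug that the logs will identify.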
EngineDeadError (2 reports): EngineDeadError usually arises when the vLLM engine hits an unrecoverable error during request processing (e.g., CUDA errors, or assertions failing on unexpected input sizes). To fix it, examine the error logs for root causes such as shape mismatches or invalid tensor values, and validate requests rigorously so problematic inputs never reach the engine. On the client side, wrap engine calls in try/except blocks with logging so failures are handled gracefully and leave enough information for debugging.
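The validation step above can be sketched as a pre-flight check run before a request is submitted. This is a hypothetical helper, not part of vLLM; the 4-characters-per-token heuristic is a rough stand-in, and a production check should count tokens with the model's actual tokenizer.

```python
def validate_request(prompt: str, max_model_len: int, max_new_tokens: int) -> None:
    """Reject requests likely to trip engine-side assertions.

    Raising ValueError here keeps a bad request from reaching the
    engine, where a crash could take down all in-flight requests.
    """
    if not prompt or not prompt.strip():
        raise ValueError("prompt is empty")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be positive")
    # Rough heuristic: ~4 characters per token. Replace with a real
    # tokenizer count in production.
    approx_prompt_tokens = len(prompt) // 4 + 1
    if approx_prompt_tokens + max_new_tokens > max_model_len:
        raise ValueError(
            f"request needs ~{approx_prompt_tokens + max_new_tokens} tokens "
            f"but the model context length is only {max_model_len}"
        )
```
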
RayChannelTimeoutError (2 reports): RayChannelTimeoutError in vLLM often arises from insufficient Ray worker resources or network bottlenecks that delay inter-process communication, most visibly with pipeline parallelism (PP > 1) or complex requests. To fix it, increase the object_store_memory and num_cpus allocated to Ray workers at initialization, or, in multi-node settings, inspect the network bandwidth and latency between Ray nodes and apply network optimizations. Also consider simplifying complex requests or reducing the batch size to lower worker load.
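The resource-allocation side of the fix can be sketched as follows, using Ray's standard ray.init parameters; the numbers are illustrative placeholders and should be sized to your hosts, not taken as recommendations.

```python
import ray

# Give Ray workers more object-store and CPU headroom before vLLM
# starts. Both parameters are part of the standard ray.init API;
# the values below are illustrative only.
ray.init(
    object_store_memory=8 * 1024**3,  # 8 GiB shared object store
    num_cpus=16,                      # CPUs this node contributes
)
```

If vLLM launches Ray itself, the same headroom can be provided by starting the cluster beforehand (ray start) with equivalent flags and letting vLLM attach to it.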
RayTaskError (2 reports): RayTaskError in vLLM often arises from GPU memory allocation failures during distributed execution, particularly with pipeline parallelism or Triton kernels. To fix it, reduce the gpu_memory_utilization engine argument, restrict which GPUs are visible via the CUDA_VISIBLE_DEVICES environment variable (using device IDs, not ranks), or adjust the model configuration (e.g., lower max_model_len or the parallelism degree) so the model fits within available GPU memory; also ensure all nodes in the cluster run compatible CUDA versions.
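A minimal sketch of the memory-side mitigations, assuming the standard vllm.LLM entry point; the model name is a placeholder, and the exact values depend on your GPU and workload.

```python
from vllm import LLM

# Leave more headroom for activations and CUDA graphs by lowering
# gpu_memory_utilization below its 0.90 default, and cap the context
# length so the KV cache fits. Model name is a placeholder.
llm = LLM(
    model="my-org/my-model",
    gpu_memory_utilization=0.80,
    max_model_len=4096,
)
```
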
NotImplementedError (2 reports): NotImplementedError in vLLM usually arises when a requested feature or optimization (like a specific attention implementation or quantization method) hasn't been fully implemented for the current hardware or architecture. To fix it, either choose a supported configuration (different quantization, attention mechanism, or hardware), or implement the missing functionality within the vLLM codebase, potentially requiring modifications to attention kernels or quantization routines for the targeted architecture.
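For example, when the automatically selected attention backend is unimplemented for your hardware, forcing a supported one is often enough. A sketch using vLLM's VLLM_ATTENTION_BACKEND environment variable; the model name is a placeholder, and the set of valid backend values varies by vLLM version and platform.

```shell
# Force a widely supported attention backend instead of the one the
# platform auto-selected; consult your vLLM version's docs for the
# valid values (e.g., FLASH_ATTN, XFORMERS, FLASHINFER).
export VLLM_ATTENTION_BACKEND=XFORMERS
vllm serve my-org/my-model   # placeholder model name
```
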
ModuleNotFoundError (2 reports): ModuleNotFoundError in vLLM usually indicates that a required vLLM module is missing or the vLLM installation is incomplete or corrupted. Reinstall vLLM using `pip install --upgrade vllm` or `pip install --no-cache-dir --force-reinstall vllm` to ensure all necessary modules are correctly installed. If using a specific branch or commit, ensure you've properly built and installed vLLM from source as per the documentation.
Related AI & LLMs Packages
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
Ollama: Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
LangChain: 🦜🔗 The platform for reliable agents.
ComfyUI: The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.
llama.cpp: LLM inference in C/C++.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.