vLLM
A high-throughput and memory-efficient inference and serving engine for LLMs
Release History
v0.16.0 (Breaking, 31 fixes, 51 features): vLLM v0.16.0 introduces full support for async scheduling with pipeline parallelism, a new Realtime WebSocket API, and a major overhaul of XPU platform support, deprecating IPEX in favor of vllm-xpu-kernels. This release also includes extensive model support additions and performance optimizations across various hardware platforms.
v0.15.1 / v0.15.0 (Breaking, 7 fixes, 50 features): This release introduces extensive model support, significant performance enhancements across NVIDIA and AMD hardware (especially for MoE and FP4), and new API features like session-based streaming input. Several deprecated metrics and quantization methods have been removed.
v0.14.1 (2 fixes): This is a patch release addressing security vulnerabilities and memory leaks found in the previous version.
v0.14.0 / v0.13.0 (Breaking, 6 fixes, 10 features): vLLM v0.13.0 introduces support for NVIDIA Blackwell Ultra and DeepSeek-V3.2, alongside a major performance overhaul for Whisper models. This release transitions attention configuration from environment variables to CLI arguments and includes significant core engine optimizations like Model Runner V2.
v0.12.0 (Breaking, 5 fixes, 8 features): vLLM v0.12.0 introduces a major architectural shift with GPU Model Runner V2 and PyTorch 2.9.0 integration. It delivers significant performance gains (up to 18% throughput) and expands support for DeepSeek-V3, multimodal models, and AMD hardware.
v0.11.2 (4 fixes): This release provides four critical bug fixes addressing Ray multi-node clusters, speculative decoding assertions, and FlashAttention MLA scheduling.
v0.11.1 (Breaking, 9 fixes, 10 features): This release updates vLLM to PyTorch 2.9.0 and CUDA 12.9.1, introduces Anthropic API compatibility, and significantly improves the stability of async scheduling and torch.compile integration.
v0.11.0 (Breaking, 6 fixes, 8 features): This release marks the complete transition to the V1 engine, removing all V0 components while introducing support for DeepSeek-V3.2 and Qwen3 architectures. It features significant performance optimizations including KV cache CPU offloading, DeepGEMM by default, and Dual-Batch Overlap.
v0.10.2 (Breaking, 5 fixes, 8 features): vLLM 0.10.2 introduces native aarch64 support, PyTorch 2.8.0 integration, and extensive optimizations for NVIDIA Blackwell GPUs. It expands model support to include Whisper and various vision-language models while maturing the V1 engine core.
v0.10.1.1 (3 fixes): A critical bugfix and security release addressing vulnerabilities in HTTP header handling and unsafe type conversion, alongside a fix for CUTLASS MLA CUDAGraphs.
v0.10.1 (Breaking, 5 fixes, 10 features): v0.10.1 introduces support for Blackwell and RTX 5090 GPUs, expands vision-language model compatibility, and adds a plugin system for model loaders. It also deprecates V0 FA3 support and removes AQLM quantization.
v0.10.1rc1 (Breaking, 10 fixes, 11 features): This release introduces model loader plugins, official Emu3 support, and significant performance optimizations for MoE kernels and FlashInfer. It also includes critical bug fixes for TPU, ROCm, and various quantization backends.
v0.10.0 (Breaking, 6 fixes, 9 features): v0.10.0 introduces the V1 engine as the primary focus, removing several legacy V0 backends and features while adding support for Llama 4 and NVIDIA Blackwell optimizations. It features significant performance improvements via async scheduling and microbatch tokenization.
v0.10.0rc2 (Breaking, 9 fixes, 11 features): This release introduces VLM support via the transformers backend, enables shared-memory pipeline parallelism for CPUs, and adds support for NVIDIA SM100 (Blackwell) architectures. It also includes significant performance optimizations for MLA kernels and KV cache management alongside various bug fixes for distributed logging and Ray integration.
v0.10.0rc1 (Breaking, 10 fixes, 12 features): This release introduces fp8 support for Triton experts, adds Llama 4 support, and migrates CPU/XPU/TPU backends exclusively to the V1 engine. It also includes significant performance optimizations for quantization kernels and initial support for the OpenAI Responses API.
v0.9.2 (Breaking, 6 fixes, 10 features): This release marks the final transition phase to the V1 engine, introducing Blackwell (SM100/120) support, Expert-Parallel Load Balancing, and expanded multi-modal/audio API capabilities. It includes significant performance optimizations for CUDA Graphs and broadens hardware support for Intel GPUs and TPUs.
v0.9.2rc2 (Breaking, 10 fixes, 10 features): This release focuses on the transition to the V1 engine by removing legacy V0 backends for CPU/TPU/XPU, while adding support for Blackwell (SM100) and Llama 4. Key improvements include FP8 kernel optimizations, FlexAttention enhancements, and expanded multimodal support in the frontend.
v0.9.2rc1 (Breaking, 9 fixes, 9 features): This release introduces support for Qwen3 Embedding/Reranker models, enables ROCm V1 by default, and adds several performance optimizations including deep_gemm support and vectorized INT8 kernels. It also includes critical bug fixes for structured outputs and CUDAGraph stability.
v0.9.1 (Breaking, 7 fixes, 11 features): This release introduces significant performance optimizations for large-scale serving, including DP/EP CUDA graph support and Blackwell hardware integration. It also enforces stricter API usage by removing several long-standing deprecations and positional argument support in the LLM class.
v0.9.1rc1 (Breaking, 8 fixes, 7 features): This release introduces quantization and multi-LoRA support for Neuron/TPU, migrates configurations to Pydantic dataclasses, and enforces stricter keyword-only arguments for LLM initialization. It also includes significant bug fixes for MLA attention accuracy and V1 backend stability.
v0.9.0.1 (1 fix): This patch release provides a critical bug fix for DeepSeek models running on NVIDIA Ampere and older GPU architectures.
v0.9.0 (Breaking, 5 fixes, 8 features): vLLM v0.9.0 upgrades to PyTorch 2.7 and CUDA 12.8, introducing initial NVIDIA Blackwell support and advanced scaling features like Expert and Data Parallelism. It also includes significant model expansions, a migration to MkDocs, and a shift to deterministic defaults for the V1 engine.
v0.8.5.post1 (2 fixes): This patch release addresses a memory leak in request data caching and corrects accuracy issues related to sliding window attention in the V1 engine.
v0.8.5 (Breaking, 8 fixes, 10 features): This release introduces Day 0 support for Qwen3, structural tag tool calling via xgrammar, and disaggregated serving via the KV Connector API. It includes significant performance optimizations for MoE kernels and breaking changes to CLI argument formatting for chunked prefill and multi-step outputs.
v0.8.4 (Breaking, 7 fixes, 10 features): This release introduces support for Llama 4 and Qwen3 models, alongside significant performance optimizations for DeepSeek MLA and MoE kernels. It also stabilizes the V1 engine by enabling multi-input and structured outputs by default.
v0.8.3 (Breaking, 7 fixes, 11 features): This release introduces Day 0 support for Llama 4 (V1 engine only) and native sliding window attention. It also features significant performance optimizations for MoE kernels, expanded hardware support for AMD and TPU, and architectural improvements to the V1 engine.
v0.8.3rc1 (10 fixes, 11 features): This release focuses on expanding V1 engine capabilities, including CPU MLA support, improved TPU stability, and new model support for Molmo and Granite. It also introduces several kernel optimizations for MoE and FP8 quantization.
v0.8.2 (Breaking, 7 fixes, 11 features): This release focuses on V1 engine stability, including a critical memory fix and crash prevention for duplicate request IDs. It introduces FP8 KV Cache support, a new fastsafetensors loader, and expands hardware support for TPU and ROCm.
v0.8.1 (Breaking, 10 fixes, 6 features): v0.8.1 is a maintenance release focusing on V1 engine stability, adding Zamba2 support, and enabling LoRA for embedding models. It includes critical fixes for sampling dtypes, quantization, and TPU performance.
v0.8.0 (Breaking, 4 fixes, 8 features): v0.8.0 enables the V1 engine by default, introduces support for NVIDIA Blackwell and Gemma 3, and significantly optimizes DeepSeek model performance through FlashMLA and Expert Parallelism.
v0.8.0rc2 (Breaking, 7 fixes, 4 features): This release focuses on V1 engine refinements, including making MLA the default and removing the input cache client. It also includes critical bug fixes for Ultravox, Mixtral, and ROCm testing environments.
v0.8.0rc1 (Breaking, 10 fixes, 12 features): This release introduces Expert Parallelism for DeepSeek models, a new /score endpoint, and significant V1 engine enhancements including parallel sampling. It also removes global seed setting, requiring users to manually define seeds for reproducibility.
v0.7.3 (Breaking, 6 fixes, 9 features): This release introduces significant DeepSeek optimizations including Multi-Token Prediction and MLA FlashAttention3 support, alongside major V1 Engine updates like LoRA and Pipeline Parallelism. It expands hardware support for TPU, ROCm, and Gaudi while adding several new model architectures and quantization methods.
v0.7.2 (Breaking, 8 fixes, 10 features): This release introduces support for Qwen2.5-VL and a new transformers backend for arbitrary model support. It significantly improves DeepSeek model performance through KV cache memory alignment and torch.compile optimizations.
v0.7.1 (9 fixes, 10 features): This release introduces significant MLA and FP8 kernel optimizations for DeepSeek models, resulting in 3x throughput and 10x memory capacity improvements. It also expands hardware support for Neuron and AMD, adds MiniCPM-o, and enhances the V1 engine with new metrics and prefix caching.
v0.7.0 (Breaking, 6 fixes, 8 features): This release introduces the V1 engine alpha for improved performance and architectural simplicity, alongside full torch.compile integration. It adds support for several new models including Deepseek-VL2 and Whisper, while expanding hardware compatibility for Apple Silicon, AMD, and TPU.
Common Errors
InternalServerError (2 reports): InternalServerError in vLLM often stems from unexpected issues during inference, such as CUDA errors during sampling or invalid configurations. To fix it, examine the vLLM server logs for the specific error message, then address the root cause: this might mean adjusting the CUDA configuration, fixing invalid request parameters (e.g., a malformed response_format), or increasing the RPC timeout if the logs point to one. Always validate your input data and configuration for compatibility with the model and vLLM version.
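While the root cause is being diagnosed from the logs, transient server errors can often be absorbed client-side with a bounded retry. The sketch below is a generic wrapper, not vLLM API: request_fn is any callable you supply, and RuntimeError stands in for your client library's actual error class.

```python
import time

def call_with_retries(request_fn, max_attempts=3, base_delay=0.5):
    """Retry a server call that may fail transiently.

    request_fn is a zero-argument callable that issues the request;
    RuntimeError is a stand-in for your client's InternalServerError
    type. Persistent failures (e.g., a genuinely invalid request)
    still surface after max_attempts, so real bugs are not swallowed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except RuntimeError:
            if attempt == max_attempts:
                raise
            # Exponential backoff before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Keep max_attempts small: if the error persists past a couple of retries, the problem is almost certainly a configuration or request bug that the logs will identify.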
EngineDeadError (2 reports): EngineDeadError usually arises when the vLLM engine hits an unrecoverable error during request processing (e.g., CUDA errors, or assertions failing on unexpected input sizes). To fix it, examine the error logs for root causes such as shape mismatches or invalid tensor values, and validate requests rigorously so problematic inputs never reach the engine. On the client side, wrap engine calls in try/except blocks with logging so failures are handled gracefully and leave enough information for debugging.
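The validation step above can be sketched as a pre-flight check run before a request is submitted. This is a hypothetical helper, not part of vLLM; the 4-characters-per-token heuristic is a rough stand-in, and a production check should count tokens with the model's actual tokenizer.

```python
def validate_request(prompt: str, max_model_len: int, max_new_tokens: int) -> None:
    """Reject requests likely to trip engine-side assertions.

    Raising ValueError here keeps a bad request from reaching the
    engine, where a crash could take down all in-flight requests.
    """
    if not prompt or not prompt.strip():
        raise ValueError("prompt is empty")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be positive")
    # Rough heuristic: ~4 characters per token. Replace with a real
    # tokenizer count in production.
    approx_prompt_tokens = len(prompt) // 4 + 1
    if approx_prompt_tokens + max_new_tokens > max_model_len:
        raise ValueError(
            f"request needs ~{approx_prompt_tokens + max_new_tokens} tokens "
            f"but the model context length is only {max_model_len}"
        )
```
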
RayChannelTimeoutError (2 reports): RayChannelTimeoutError in vLLM often arises from insufficient Ray worker resources or network bottlenecks that delay inter-process communication, most visibly with pipeline parallelism (PP > 1) or complex requests. To fix it, increase the object_store_memory and num_cpus allocated to Ray workers at initialization, or, in multi-node settings, inspect the network bandwidth and latency between Ray nodes and apply network optimizations. Also consider simplifying complex requests or reducing the batch size to lower worker load.
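The resource-allocation side of the fix can be sketched as follows, using Ray's standard ray.init parameters; the numbers are illustrative placeholders and should be sized to your hosts, not taken as recommendations.

```python
import ray

# Give Ray workers more object-store and CPU headroom before vLLM
# starts. Both parameters are part of the standard ray.init API;
# the values below are illustrative only.
ray.init(
    object_store_memory=8 * 1024**3,  # 8 GiB shared object store
    num_cpus=16,                      # CPUs this node contributes
)
```

If vLLM launches Ray itself, the same headroom can be provided by starting the cluster beforehand (ray start) with equivalent flags and letting vLLM attach to it.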
RayTaskError (2 reports): RayTaskError in vLLM often arises from GPU memory allocation failures during distributed execution, particularly with pipeline parallelism or Triton kernels. To fix it, reduce the gpu_memory_utilization engine argument, restrict which GPUs are visible via the CUDA_VISIBLE_DEVICES environment variable (using device IDs, not ranks), or adjust the model configuration (e.g., lower max_model_len or the parallelism degree) so the model fits within available GPU memory; also ensure all nodes in the cluster run compatible CUDA versions.
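A minimal sketch of the memory-side mitigations, assuming the standard vllm.LLM entry point; the model name is a placeholder, and the exact values depend on your GPU and workload.

```python
from vllm import LLM

# Leave more headroom for activations and CUDA graphs by lowering
# gpu_memory_utilization below its 0.90 default, and cap the context
# length so the KV cache fits. Model name is a placeholder.
llm = LLM(
    model="my-org/my-model",
    gpu_memory_utilization=0.80,
    max_model_len=4096,
)
```
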
NotImplementedError (2 reports): NotImplementedError in vLLM usually arises when a requested feature or optimization (like a specific attention implementation or quantization method) hasn't been fully implemented for the current hardware or architecture. To fix it, either choose a supported configuration (different quantization, attention mechanism, or hardware), or implement the missing functionality within the vLLM codebase, potentially requiring modifications to attention kernels or quantization routines for the targeted architecture.
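For example, when the automatically selected attention backend is unimplemented for your hardware, forcing a supported one is often enough. A sketch using vLLM's VLLM_ATTENTION_BACKEND environment variable; the model name is a placeholder, and the set of valid backend values varies by vLLM version and platform.

```shell
# Force a widely supported attention backend instead of the one the
# platform auto-selected; consult your vLLM version's docs for the
# valid values (e.g., FLASH_ATTN, XFORMERS, FLASHINFER).
export VLLM_ATTENTION_BACKEND=XFORMERS
vllm serve my-org/my-model   # placeholder model name
```
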
ModuleNotFoundError (2 reports): ModuleNotFoundError in vLLM usually indicates that a required vLLM module is missing or the vLLM installation is incomplete or corrupted. Reinstall vLLM using `pip install --upgrade vllm` or `pip install --no-cache-dir --force-reinstall vllm` to ensure all necessary modules are correctly installed. If using a specific branch or commit, ensure you've properly built and installed vLLM from source as per the documentation.
Related AI & LLMs Packages
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
Ollama: Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
LangChain: 🦜🔗 The platform for reliable agents.
ComfyUI: The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.
llama.cpp: LLM inference in C/C++.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.