
vLLM


A high-throughput and memory-efficient inference and serving engine for LLMs

Latest: v0.13.0 · 33 releases · 27 breaking changes

Release History

v0.13.0 · Breaking · 6 fixes · 10 features
Dec 19, 2025

vLLM v0.13.0 introduces support for NVIDIA Blackwell Ultra and DeepSeek-V3.2, alongside a major performance overhaul for Whisper models. This release transitions attention configuration from environment variables to CLI arguments and includes significant core engine optimizations like Model Runner V2.
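
A minimal sketch of that configuration shift follows. `VLLM_ATTENTION_BACKEND` is vLLM's long-standing environment variable; the CLI flag named below is hypothetical, since the notes only state that the setting moved to CLI arguments, so verify the exact name with `vllm serve --help` on 0.13.0.

```python
# Sketch of the attention-config shift. VLLM_ATTENTION_BACKEND is the
# long-standing env var; the CLI flag below is a hypothetical name.
import os

# Pre-0.13.0 style: select the attention backend via the environment.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# v0.13.0 style: pass the backend when launching the server, e.g.
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --attention-backend FLASHINFER
```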

v0.12.0 · Breaking · 5 fixes · 8 features
Dec 3, 2025

vLLM v0.12.0 introduces a major architectural shift with GPU Model Runner V2 and PyTorch 2.9.0 integration. It delivers significant performance gains (up to 18% higher throughput) and expands support for DeepSeek-V3, multimodal models, and AMD hardware.

v0.11.2 · 4 fixes
Nov 20, 2025

This release provides four critical bug fixes addressing Ray multi-node clusters, speculative decoding assertions, and FlashAttention MLA scheduling.

v0.11.1 · Breaking · 9 fixes · 10 features
Nov 18, 2025

This release updates vLLM to PyTorch 2.9.0 and CUDA 12.9.1, introduces Anthropic API compatibility, and significantly improves the stability of async scheduling and torch.compile integration.
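
A minimal sketch of the new Anthropic compatibility, assuming the server exposes an Anthropic-style /v1/messages route; the URL, API key, and model ID below are placeholders.

```python
# Sketch: calling a vLLM server through the Anthropic SDK. Assumes the
# v0.11.1 compatibility layer serves Anthropic-style /v1/messages;
# URL, key, and model ID are placeholders.
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="EMPTY")
message = client.messages.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(message.content[0].text)
```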

v0.11.0 · Breaking · 6 fixes · 8 features
Oct 2, 2025

This release marks the complete transition to the V1 engine, removing all V0 components while introducing support for DeepSeek-V3.2 and Qwen3 architectures. It features significant performance optimizations including KV cache CPU offloading, DeepGEMM by default, and Dual-Batch Overlap.

v0.10.2 · Breaking · 5 fixes · 8 features
Sep 13, 2025

vLLM 0.10.2 introduces native aarch64 support, PyTorch 2.8.0 integration, and extensive optimizations for NVIDIA Blackwell GPUs. It expands model support to include Whisper and various vision-language models while maturing the V1 engine core.
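
For the new Whisper support, a sketch of transcription through the OpenAI SDK, assuming vLLM's OpenAI-compatible /v1/audio/transcriptions route; the audio file and model ID are placeholders.

```python
# Sketch: transcribe audio against a vLLM server running Whisper via
# the OpenAI-compatible audio endpoint. File and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
with open("sample.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio_file,
    )
print(transcript.text)
```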

v0.10.1.1 · 3 fixes
Aug 20, 2025

A critical bugfix and security release addressing vulnerabilities in HTTP header handling and unsafe type conversion, alongside a fix for CUTLASS MLA CUDAGraphs.

v0.10.1 · Breaking · 5 fixes · 10 features
Aug 18, 2025

v0.10.1 introduces support for Blackwell and RTX 5090 GPUs, expands vision-language model compatibility, and adds a plugin system for model loaders. It also deprecates V0 FA3 support and removes AQLM quantization.

v0.10.1rc1 · Breaking · 10 fixes · 11 features
Aug 17, 2025

This release introduces model loader plugins, official Emu3 support, and significant performance optimizations for MoE kernels and FlashInfer. It also includes critical bug fixes for TPU, ROCm, and various quantization backends.

v0.10.0 · Breaking · 6 fixes · 9 features
Jul 24, 2025

v0.10.0 makes the V1 engine the primary focus, removing several legacy V0 backends and features while adding support for Llama 4 and NVIDIA Blackwell optimizations. It also delivers significant performance improvements via async scheduling and microbatch tokenization.

v0.10.0rc2 · Breaking · 9 fixes · 11 features
Jul 24, 2025

This release introduces VLM support via the transformers backend, enables shared-memory pipeline parallelism for CPUs, and adds support for NVIDIA SM100 (Blackwell) architectures. It also includes significant performance optimizations for MLA kernels and KV cache management, alongside various bug fixes for distributed logging and Ray integration.

v0.10.0rc1 · Breaking · 10 fixes · 12 features
Jul 20, 2025

This release introduces fp8 support for Triton experts, adds Llama 4 support, and migrates CPU/XPU/TPU backends exclusively to the V1 engine. It also includes significant performance optimizations for quantization kernels and initial support for the OpenAI Responses API.
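
A sketch of the initial Responses API support through the OpenAI SDK; since support is described as initial, treat the route as subject to change, and the URL and model ID as placeholders.

```python
# Sketch: OpenAI Responses API against a vLLM server. Assumes the
# /v1/responses route is enabled; URL and model ID are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    input="Explain paged attention in two sentences.",
)
print(response.output_text)
```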

v0.9.2 · Breaking · 6 fixes · 10 features
Jul 7, 2025

This release marks the final transition phase to the V1 engine, introducing Blackwell (SM100/120) support, Expert-Parallel Load Balancing, and expanded multi-modal/audio API capabilities. It includes significant performance optimizations for CUDA-Graphs and broadens hardware support for Intel GPUs and TPUs.

v0.9.2rc2 · Breaking · 10 fixes · 10 features
Jul 6, 2025

This release focuses on the transition to the V1 engine by removing legacy V0 backends for CPU/TPU/XPU, while adding support for Blackwell (SM100) and Llama 4. Key improvements include FP8 kernel optimizations, FlexAttention enhancements, and expanded multimodal support in the frontend.

v0.9.2rc1 · Breaking · 9 fixes · 9 features
Jul 3, 2025

This release introduces support for Qwen3 Embedding/Reranker models, enables ROCm V1 by default, and adds several performance optimizations including deep_gemm support and vectorized INT8 kernels. It also includes critical bug fixes for structured outputs and CUDAGraph stability.
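
A minimal offline-embedding sketch for the new Qwen3 Embedding support; `LLM.embed` and `task="embed"` are standard vLLM interfaces, while the model ID and prompt are placeholders.

```python
# Sketch: offline embeddings with a Qwen3 embedding model. The model ID
# and prompt are placeholders.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")
outputs = llm.embed(["vLLM is a fast inference engine."])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```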

v0.9.1 · Breaking · 7 fixes · 11 features
Jun 10, 2025

This release introduces significant performance optimizations for large-scale serving, including DP/EP CUDA graph support and Blackwell hardware integration. It also enforces stricter API usage by removing several long-standing deprecations and positional argument support in the LLM class.
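
For the LLM-class change, a before/after sketch; the model ID and argument values are placeholders.

```python
# Positional engine arguments are no longer accepted by the LLM class.
# Old style (now a TypeError):
#   llm = LLM("meta-llama/Llama-3.1-8B-Instruct", "auto", 42)
from vllm import LLM

# v0.9.1 style: engine arguments are keyword-only.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",
    seed=42,
)
```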

v0.9.1rc1 · Breaking · 8 fixes · 7 features
Jun 9, 2025

This release introduces quantization and multi-LoRA support for Neuron/TPU, migrates configurations to Pydantic dataclasses, and enforces stricter keyword-only arguments for LLM initialization. It also includes significant bug fixes for MLA attention accuracy and V1 backend stability.

v0.9.0.1 · 1 fix
May 30, 2025

This patch release provides a critical bug fix for DeepSeek models running on NVIDIA Ampere and older GPU architectures.

v0.9.0 · Breaking · 5 fixes · 8 features
May 15, 2025

vLLM v0.9.0 upgrades to PyTorch 2.7 and CUDA 12.8, introducing initial NVIDIA Blackwell support and advanced scaling features like Expert and Data Parallelism. It also includes significant model expansions, a migration to MkDocs, and a shift to deterministic defaults for the V1 engine.
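
A sketch of the new scaling knobs; `tensor_parallel_size` and `enable_expert_parallel` are vLLM engine arguments, but the model ID and sizes below are placeholders to adapt to your topology.

```python
# Sketch: combine tensor parallelism with expert parallelism for a MoE
# model. The model ID and parallel sizes are placeholders.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    tensor_parallel_size=2,        # shard weights across 2 GPUs
    enable_expert_parallel=True,   # shard MoE experts across ranks
)
```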

v0.8.5.post1 · 2 fixes
May 2, 2025

This patch release addresses a memory leak in request data caching and corrects accuracy issues related to sliding window attention in the V1 engine.

v0.8.5 · Breaking · 8 fixes · 10 features
Apr 28, 2025

This release introduces Day 0 support for Qwen3, structural tag tool calling via xgrammar, and disaggregated serving via the KV Connector API. It includes significant performance optimizations for MoE kernels and breaking changes to CLI argument formatting for chunked prefill and multi-step outputs.
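
The release note covers structural-tag tool calling; the sketch below shows the adjacent JSON-schema path through the same xgrammar backend, with a placeholder schema and model ID.

```python
# Sketch: structured JSON output via the xgrammar backend. The schema
# and model ID are placeholders.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "temp_c": {"type": "number"}},
    "required": ["city", "temp_c"],
}

llm = LLM(model="Qwen/Qwen3-8B", guided_decoding_backend="xgrammar")
params = SamplingParams(guided_decoding=GuidedDecodingParams(json=schema))
outputs = llm.generate("Report the weather in Paris as JSON.", params)
print(outputs[0].outputs[0].text)
```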

v0.8.4 · Breaking · 7 fixes · 10 features
Apr 14, 2025

This release introduces support for Llama4 and Qwen3 models, alongside significant performance optimizations for DeepSeek MLA and MoE kernels. It also stabilizes the V1 engine by enabling multi-input and structured outputs by default.

v0.8.3 · Breaking · 7 fixes · 11 features
Apr 6, 2025

This release introduces Day 0 support for Llama 4 (V1 engine only) and native sliding window attention. It also features significant performance optimizations for MoE kernels, expanded hardware support for AMD and TPU, and architectural improvements to the V1 engine.

v0.8.3rc1 · 10 fixes · 11 features
Apr 5, 2025

This release focuses on expanding V1 engine capabilities, including CPU MLA support, improved TPU stability, and new model support for Molmo and Granite. It also introduces several kernel optimizations for MoE and FP8 quantization.

v0.8.2 · Breaking · 7 fixes · 11 features
Mar 23, 2025

This release focuses on V1 engine stability, including a critical memory fix and crash prevention for duplicate request IDs. It introduces FP8 KV Cache support, a new fastsafetensors loader, and expands hardware support for TPU and ROCm.
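
Enabling the new FP8 KV cache is a one-argument change; `kv_cache_dtype` is the engine argument, and the model ID below is a placeholder.

```python
# Sketch: enable the FP8 KV cache to roughly halve KV memory vs fp16.
# The model ID is a placeholder.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_dtype="fp8",
)
```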

v0.8.1 · Breaking · 10 fixes · 6 features
Mar 19, 2025

v0.8.1 is a maintenance release focusing on V1 engine stability, adding Zamba2 support, and enabling LoRA for embedding models. It includes critical fixes for sampling dtypes, quantization, and TPU performance.
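
A sketch of the LoRA-for-embeddings path, assuming `LLM.embed` accepts a `lora_request` the way `LLM.generate` does; the base model and adapter path are hypothetical.

```python
# Sketch: serve an embedding model with a LoRA adapter. The base model
# and adapter path are hypothetical; passing lora_request to LLM.embed
# is assumed to mirror LLM.generate.
from vllm import LLM
from vllm.lora.request import LoRARequest

llm = LLM(model="BAAI/bge-base-en-v1.5", task="embed", enable_lora=True)
outputs = llm.embed(
    ["a query to embed"],
    lora_request=LoRARequest("my_adapter", 1, "/path/to/adapter"),
)
print(len(outputs[0].outputs.embedding))
```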

v0.8.0 · Breaking · 4 fixes · 8 features
Mar 18, 2025

v0.8.0 enables the V1 engine by default, introduces support for NVIDIA Blackwell and Gemma 3, and significantly optimizes DeepSeek model performance through FlashMLA and Expert Parallelism.
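
During the transition, the `VLLM_USE_V1` environment variable still selects the engine; a minimal opt-out sketch, with a placeholder model ID:

```python
# Sketch: force the legacy V0 engine after V1 became the default.
# Set the variable before importing vLLM; the model ID is a placeholder.
import os

os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(model="google/gemma-3-4b-it")
```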

v0.8.0rc2 · Breaking · 7 fixes · 4 features
Mar 17, 2025

This release focuses on V1 engine refinements, including making MLA the default and removing the input cache client. It also includes critical bug fixes for Ultravox, Mixtral, and ROCm testing environments.

v0.8.0rc1 · Breaking · 10 fixes · 12 features
Mar 17, 2025

This release introduces Expert Parallelism for DeepSeek models, a new /score endpoint for embeddings, and significant V1 engine enhancements including parallel sampling. It also removes global seed setting, requiring users to manually define seeds for reproducibility.
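
With the global seed removed, reproducibility becomes opt-in; a minimal sketch with a placeholder model ID:

```python
# Sketch: pass seeds explicitly for reproducible runs, at engine
# construction and/or per request. The model ID is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", seed=42)
params = SamplingParams(temperature=0.8, seed=42)  # per-request seed
outputs = llm.generate("Tell me a short story.", params)
print(outputs[0].outputs[0].text)
```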

v0.7.3 · Breaking · 6 fixes · 9 features
Feb 20, 2025

This release introduces significant DeepSeek optimizations including Multi-Token Prediction and MLA FlashAttention3 support, alongside major V1 Engine updates like LoRA and Pipeline Parallelism. It expands hardware support for TPU, ROCm, and Gaudi while adding several new model architectures and quantization methods.

v0.7.2 · Breaking · 8 fixes · 10 features
Feb 6, 2025

This release introduces support for Qwen2.5-VL and a new transformers backend for arbitrary model support. It significantly improves DeepSeek model performance through KV cache memory alignment and torch.compile optimizations.
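
Opting into the Transformers fallback for a model without a native vLLM implementation is a one-argument change; `model_impl` is the switch in current vLLM, and the model ID below is a placeholder.

```python
# Sketch: run an architecture without a native vLLM implementation
# through the Transformers fallback backend. The model ID is a
# placeholder.
from vllm import LLM

llm = LLM(model="my-org/custom-decoder-model", model_impl="transformers")
```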

v0.7.1 · 9 fixes · 10 features
Feb 1, 2025

This release introduces significant MLA and FP8 kernel optimizations for DeepSeek models, yielding up to 3x higher throughput and 10x greater memory capacity. It also expands hardware support for Neuron and AMD, adds MiniCPM-o, and enhances the V1 engine with new metrics and prefix caching.

v0.7.0 · Breaking · 6 fixes · 8 features
Jan 27, 2025

This release introduces the V1 engine alpha for improved performance and architectural simplicity, alongside full torch.compile integration. It adds support for several new models including DeepSeek-VL2 and Whisper, while expanding hardware compatibility for Apple Silicon, AMD, and TPU.