
v0.10.2

Breaking Changes
📦 vllm

⚠️ 4 breaking · ✨ 8 features · 🐛 5 fixes · ⚡ 3 deprecations · 🔧 10 symbols

Summary

vLLM 0.10.2 introduces native aarch64 support, PyTorch 2.8.0 integration, and extensive optimizations for NVIDIA Blackwell GPUs. It expands model support to include Whisper and various vision-language models while maturing the V1 engine core.

⚠️ Breaking Changes

  • PyTorch 2.8.0 upgrade requires environment updates and may affect existing installations.
  • FlashMLA is now disabled on NVIDIA Blackwell GPUs due to compatibility issues.
  • Original Marlin quantization format has been removed.
  • V0 Neuron backend and V0 pooling model support have been deprecated/removed.

Migration Steps

  1. Upgrade environment to support PyTorch 2.8.0.
  2. For aarch64, install via 'uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto'.
  3. Transition away from the original Marlin quantization format to supported alternatives.
  4. Update scripts using V0 Neuron or V0 pooling backends to use V1 or alternative implementations.
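
A quick sanity check after these steps is to confirm which versions the interpreter actually sees. This is a minimal sketch, assuming a standard Python environment with both packages installed:

  # Post-upgrade check: vLLM 0.10.2 expects PyTorch 2.8.0.
  import torch
  import vllm

  print("torch:", torch.__version__)   # expect 2.8.x
  print("vllm:", vllm.__version__)     # expect 0.10.2
  assert torch.__version__.startswith("2.8"), "upgrade the environment to PyTorch 2.8.0"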

✨ New Features

  • Native aarch64 support for the GB200 platform, with multi-platform Docker images.
  • Support for new model families: Apertus, LFM2, MiDashengLM, Motif-1-Tiny, Seed-Oss, EmbeddingGemma-300m, GTE, Donut OCR, KeyeVL-1.5-8B, R-4B, Ernie4.5 VL, MiniCPM-V 4.5, Ovis2.5, Qwen3-Next, InternVL3.5, Qwen2Audio, NemotronH Nano VLM, BLOOM V1, and Whisper.
  • Terratorch backend integration for non-language tasks like semantic segmentation.
  • NVIDIA Blackwell/SM100 support including FP8 MLA, DeepGEMM, and MXFP4 MoE.
  • Apple Silicon bfloat16 support for M2+ chips.
  • Decode Context Parallel (DCP) support for MLA.
  • Per-layer quantization routing and GGUF quantization with layer skipping.
  • OpenAI API enhancements for audio transcription/translation and usage statistics.
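
To illustrate the audio transcription item above: with a Whisper model served through vLLM's OpenAI-compatible server, the standard openai client can call the transcription endpoint. The model name, port, and file name below are illustrative assumptions, not taken from the release notes:

  # Sketch: audio transcription via the OpenAI-compatible endpoint.
  # Assumes a server started with a Whisper model, e.g.:
  #   vllm serve openai/whisper-large-v3
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

  with open("sample.wav", "rb") as audio_file:
      result = client.audio.transcriptions.create(
          model="openai/whisper-large-v3",
          file=audio_file,
      )
  print(result.text)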

🐛 Bug Fixes

  • Fixed critical CUDA graph capture throughput issue.
  • Removed unnecessary CUDA sync from GLM-4.1V and Qwen2VL preprocessing.
  • Eliminated redundant all-reduce in Qwen3 MoE.
  • Fixed TPU core dump via tpu_info 0.4.0 update.
  • Fixed InternVL CPU threading and GLM4.5-V video frame decoding performance.

🔧 Affected Symbols

torch · flashinfer · marlin · NeuronBackend · FlashMLA · terratorch · Qwen2VL · GLM-4.1V · Mamba1 · Whisper

⚡ Deprecations

  • V0 Neuron backend
  • V0 pooling model support
  • Prefix caching is now disabled for hybrid and Mamba models
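
As a rough illustration of the prefix caching change: the flag can still be enabled explicitly for standard transformer models, while hybrid and Mamba models now have it disabled. This is a hedged sketch; the model name is only an example:

  # Sketch: prefix caching remains available for regular transformer models,
  # but is now disabled for hybrid and Mamba architectures regardless of this flag.
  from vllm import LLM, SamplingParams

  llm = LLM(
      model="Qwen/Qwen2.5-1.5B-Instruct",  # example model, not from the notes
      enable_prefix_caching=True,
  )
  params = SamplingParams(max_tokens=32)
  outputs = llm.generate(["Explain prefix caching in one sentence."], params)
  print(outputs[0].outputs[0].text)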