v0.10.2
📦 vllm · Breaking Changes
⚠ 4 breaking · ✨ 8 features · 🐛 5 fixes · ⚡ 3 deprecations · 🔧 10 symbols
Summary
vLLM 0.10.2 introduces native aarch64 support, PyTorch 2.8.0 integration, and extensive optimizations for NVIDIA Blackwell GPUs. It expands model support to include Whisper and various vision-language models while maturing the V1 engine core.
⚠️ Breaking Changes
- PyTorch 2.8.0 upgrade requires environment updates and may affect existing installations.
- FlashMLA is now disabled on NVIDIA Blackwell GPUs due to compatibility issues.
- Original Marlin quantization format has been removed.
- V0 Neuron backend and V0 pooling model support have been deprecated/removed.
Migration Steps
- Upgrade the environment to PyTorch 2.8.0 (a verification sketch follows this list).
- For aarch64, install via `uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto`.
- Transition away from the original Marlin quantization format to supported alternatives.
- Update scripts using V0 Neuron or V0 pooling backends to use V1 or alternative implementations.
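A minimal post-upgrade sanity check, as a sketch: it only inspects the installed package versions stated in these notes and assumes nothing about your environment beyond the `torch` and `vllm` packages being installed.

```python
# Sketch: verify the upgraded environment matches the versions in these notes.
from importlib.metadata import version

torch_version = version("torch")
vllm_version = version("vllm")

# PyTorch 2.8.0 and vLLM 0.10.2 are the versions this release expects.
assert torch_version.startswith("2.8."), f"Expected PyTorch 2.8.x, got {torch_version}"
assert vllm_version == "0.10.2", f"Expected vLLM 0.10.2, got {vllm_version}"
print(f"torch {torch_version}, vllm {vllm_version} -- environment looks consistent")
```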
✨ New Features
- Native aarch64 support for the GB200 platform, with multi-platform Docker images.
- Support for new model families: Apertus, LFM2, MiDashengLM, Motif-1-Tiny, Seed-Oss, EmbeddingGemma-300m, GTE, Donut OCR, KeyeVL-1.5-8B, R-4B, Ernie4.5 VL, MiniCPM-V 4.5, Ovis2.5, Qwen3-Next, InternVL3.5, Qwen2Audio, NemotronH Nano VLM, BLOOM V1, and Whisper.
- Terratorch backend integration for non-language tasks like semantic segmentation.
- NVIDIA Blackwell/SM100 support including FP8 MLA, DeepGEMM, and MXFP4 MoE.
- Apple Silicon bfloat16 support for M2+ chips.
- Decode Context Parallel (DCP) support for MLA.
- Per-layer quantization routing and GGUF quantization with layer skipping.
- OpenAI API enhancements for audio transcription/translation and usage statistics (see the sketch after this list).
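A minimal sketch of transcribing audio through the OpenAI-compatible `/v1/audio/transcriptions` endpoint, assuming a vLLM server has been started with a Whisper model (e.g. via `vllm serve openai/whisper-large-v3`); the model name, port, and file path below are illustrative assumptions.

```python
# Sketch: audio transcription against a local vLLM OpenAI-compatible server.
# Assumes the server is listening on localhost:8000 and serving a Whisper model;
# "sample.wav" is a placeholder for your own audio file.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio_file,
    )

print(transcription.text)
```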
🐛 Bug Fixes
- Fixed critical CUDA graph capture throughput issue.
- Removed unnecessary CUDA sync from GLM-4.1V and Qwen2VL preprocessing.
- Eliminated redundant all-reduce in Qwen3 MoE.
- Fixed TPU core dump via tpu_info 0.4.0 update.
- Fixed performance issues with InternVL CPU threading and GLM4.5-V video frame decoding.
🔧 Affected Symbols
torch · flashinfer · marlin · NeuronBackend · FlashMLA · terratorch · Qwen2VL · GLM-4.1V · Mamba1 · Whisper
⚡ Deprecations
- V0 Neuron backend
- V0 pooling model support
- Prefix caching is now disabled for hybrid and Mamba models (see the sketch below).
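If your scripts explicitly enabled prefix caching for a hybrid or Mamba model, that setting no longer takes effect. A minimal sketch of the relevant knob; the model name below is a placeholder assumption, not taken from these notes.

```python
# Sketch: prefix caching is a per-engine option; for hybrid/Mamba models
# vLLM now keeps it off. The model name is a placeholder assumption.
from vllm import LLM

llm = LLM(
    model="state-spaces/mamba-2.8b-hf",  # placeholder Mamba model
    enable_prefix_caching=False,         # explicit, matching the new behavior for these models
)
```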