
v0.10.2

Breaking Changes
📦 vllm

⚠️ 4 breaking · ✨ 8 features · 🐛 5 fixes · ⚡ 3 deprecations · 🔧 10 symbols

Summary

vLLM 0.10.2 introduces native aarch64 support, PyTorch 2.8.0 integration, and extensive optimizations for NVIDIA Blackwell GPUs. It expands model support to include Whisper and various vision-language models while maturing the V1 engine core.

⚠️ Breaking Changes

  • PyTorch 2.8.0 upgrade requires environment updates and may affect existing installations.
  • FlashMLA is now disabled on NVIDIA Blackwell GPUs due to compatibility issues.
  • Original Marlin quantization format has been removed.
  • V0 Neuron backend and V0 pooling model support have been deprecated/removed.

Migration Steps

  1. Upgrade environment to support PyTorch 2.8.0.
  2. For aarch64, install via 'uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto'.
  3. Transition away from the original Marlin quantization format to supported alternatives.
  4. Update scripts using V0 Neuron or V0 pooling backends to use V1 or alternative implementations.
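
A quick sanity check after these steps is to confirm which versions the interpreter actually sees. This is a minimal sketch, assuming a standard Python environment with both packages installed:

  # Post-upgrade check: vLLM 0.10.2 expects PyTorch 2.8.0.
  import torch
  import vllm

  print("torch:", torch.__version__)   # expect 2.8.x
  print("vllm:", vllm.__version__)     # expect 0.10.2
  assert torch.__version__.startswith("2.8"), "upgrade the environment to PyTorch 2.8.0"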

✨ New Features

  • Native aarch64 support for the GB200 platform, with multi-platform Docker images.
  • Support for new model families: Apertus, LFM2, MiDashengLM, Motif-1-Tiny, Seed-Oss, EmbeddingGemma-300m, GTE, Donut OCR, KeyeVL-1.5-8B, R-4B, Ernie4.5 VL, MiniCPM-V 4.5, Ovis2.5, Qwen3-Next, InternVL3.5, Qwen2Audio, NemotronH Nano VLM, BLOOM V1, and Whisper.
  • Terratorch backend integration for non-language tasks like semantic segmentation.
  • NVIDIA Blackwell/SM100 support including FP8 MLA, DeepGEMM, and MXFP4 MoE.
  • Apple Silicon bfloat16 support for M2+ chips.
  • Decode Context Parallel (DCP) support for MLA.
  • Per-layer quantization routing and GGUF quantization with layer skipping.
  • OpenAI API enhancements for audio transcription/translation and usage statistics.
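
To illustrate the audio transcription item above: with a Whisper model served through vLLM's OpenAI-compatible server, the standard openai client can call the transcription endpoint. The model name, port, and file name below are illustrative assumptions, not taken from the release notes:

  # Sketch: audio transcription via the OpenAI-compatible endpoint.
  # Assumes a server started with a Whisper model, e.g.:
  #   vllm serve openai/whisper-large-v3
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

  with open("sample.wav", "rb") as audio_file:
      result = client.audio.transcriptions.create(
          model="openai/whisper-large-v3",
          file=audio_file,
      )
  print(result.text)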

🐛 Bug Fixes

  • Fixed critical CUDA graph capture throughput issue.
  • Removed unnecessary CUDA sync from GLM-4.1V and Qwen2VL preprocessing.
  • Eliminated redundant all-reduce in Qwen3 MoE.
  • Fixed TPU core dump via tpu_info 0.4.0 update.
  • Fixed InternVL CPU threading and GLM4.5-V video frame decoding performance.

🔧 Affected Symbols

torch · flashinfer · marlin · NeuronBackend · FlashMLA · terratorch · Qwen2VL · GLM-4.1V · Mamba1 · Whisper

⚡ Deprecations

  • V0 Neuron backend
  • V0 pooling model support
  • Prefix caching is now disabled for hybrid and Mamba models
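
As a rough illustration of the prefix caching change: the flag can still be enabled explicitly for standard transformer models, while hybrid and Mamba models now have it disabled. This is a hedged sketch; the model name is only an example:

  # Sketch: prefix caching remains available for regular transformer models,
  # but is now disabled for hybrid and Mamba architectures regardless of this flag.
  from vllm import LLM, SamplingParams

  llm = LLM(
      model="Qwen/Qwen2.5-1.5B-Instruct",  # example model, not from the notes
      enable_prefix_caching=True,
  )
  params = SamplingParams(max_tokens=32)
  outputs = llm.generate(["Explain prefix caching in one sentence."], params)
  print(outputs[0].outputs[0].text)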