Changelog

v0.7.0

Breaking Changes
📦 vllm
⚠️ 2 breaking · ✨ 8 features · 🐛 6 fixes · 🔧 9 symbols

Summary

This release introduces the V1 engine alpha for improved performance and architectural simplicity, alongside full torch.compile integration. It adds support for several new models including Deepseek-VL2 and Whisper, while expanding hardware compatibility for Apple Silicon, AMD, and TPU.

⚠️ Breaking Changes

  • The Deepseekv3 integration initially broke quantization for other methods; if you use mixed quantization, make sure you are on the latest patch (see #11547).
  • V1 engine is a complete rewrite; while optional, it may have different performance characteristics or edge-case behaviors compared to V0.

Migration Steps

  1. To test the new V1 engine, set the environment variable VLLM_USE_V1=1.
  2. To enable torch.compile optimizations, use the -O3 engine parameter.
  3. For VLM developers, implement the merged multi-modal processor and get_*_embeddings methods to support the V1 engine.

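The V1 opt-in from step 1 can be sketched as follows. Note that `VLLM_USE_V1` must be set before vLLM is imported, since the engine selection is read from the environment at startup; the model name is a placeholder.

```python
import os

# Opt in to the experimental V1 engine. This must happen
# before vLLM is imported, as the flag is read at import time.
os.environ["VLLM_USE_V1"] = "1"

# Importing and constructing the engine requires a supported
# accelerator, so it is left commented here:
# from vllm import LLM
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
```

Unsetting the variable (or setting it to `0`) falls back to the V0 engine.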
✨ New Features

  • V1 Engine: A rewritten high-performance engine (alpha) enabled via VLLM_USE_V1=1.
  • torch.compile integration: Enabled by default in V1 and via -O3 engine parameter.
  • New Models: CogAgent, Deepseek-VL2, fairseq2 Llama, InternLM3, Whisper, Qwen2 PRM, and InternLM2 reward models.
  • Hardware Support: Native Apple Silicon support, AMD MI300 FP8 support, and TPU W8A8 support.
  • API Server: Added Jina- and Cohere-compatible Rerank API.
  • Distributed: Support for torchrun, SPMD-style offline inference, and new collective_rpc abstraction.
  • Kernels: Flash Attention 3 support and Punica prefill kernels fusion.
  • VLM: New merged multi-modal processor for easier model development.
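As an illustration of the new Rerank API, the request body below follows the Cohere/Jina-compatible rerank schema. The model name and document strings are placeholders; the exact endpoint path and response fields should be checked against your server's docs.

```python
import json

# Rerank request in the Cohere/Jina-compatible schema:
# a query plus a list of candidate documents to score.
payload = {
    "model": "BAAI/bge-reranker-base",
    "query": "What is vLLM?",
    "documents": [
        "vLLM is a high-throughput LLM inference engine.",
        "Whisper is a speech recognition model.",
    ],
}

# Serialized body, ready to POST to the server's rerank endpoint.
body = json.dumps(payload)
```

The server responds with the documents ranked by relevance score, mirroring the Cohere rerank response shape.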

🐛 Bug Fixes

  • Fixed Deepseekv3 quantization breakage for other methods (#11547).
  • Fixed TeleChat2ForCausalLM weights mapper (#11546).
  • Fixed ROCm compressed tensor support (#11561).
  • Fixed interleaving sliding window for Cohere2 model (#11583).
  • Fixed OpenAI parallel sampling when using xgrammar (#11637).
  • Fixed last token measurement (#11376).

🔧 Affected Symbols

  • LLM.sleep
  • LLM.wake_up
  • LLM.collective_rpc
  • LLM.reset_prefix_cache
  • torch.compile
  • DeepseekScalingRotaryEmbedding
  • MiniCPMVBaseModel
  • MolmoForCausalLM
  • VLLM_USE_V1