vllm v0.7.0
⚠ 2 breaking · ✨ 8 features · 🐛 6 fixes · 🔧 9 affected symbols
Summary
This release introduces the V1 engine alpha for improved performance and architectural simplicity, alongside full torch.compile integration. It adds support for several new models including Deepseek-VL2 and Whisper, while expanding hardware compatibility for Apple Silicon, AMD, and TPU.
⚠️ Breaking Changes
- The Deepseekv3 integration initially broke quantization for other methods (fixed in #11547); ensure you are on the latest patch if you use any other quantization method.
- The V1 engine is a complete rewrite; it is opt-in, but it may show different performance characteristics or edge-case behavior than V0.
Migration Steps
- To test the new V1 engine, set the environment variable VLLM_USE_V1=1 (see the sketch after this list).
- To enable torch.compile optimizations, pass the -O3 engine parameter.
- For VLM developers: to support the V1 engine, implement the merged multi-modal processor and the get_*_embeddings methods.
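A minimal sketch of opting in to the V1 engine for offline inference, assuming the environment variable is set before vllm is imported; the model name is only a placeholder:

```python
import os

# Opt in to the V1 engine (alpha). Set this before importing vllm so the
# engine picks it up at initialization time.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

# Placeholder model; substitute whatever you already serve.
llm = LLM(model="facebook/opt-125m")

for output in llm.generate(["The capital of France is"]):
    print(output.outputs[0].text)
```

For the API server, the -O3 flag mentioned above is passed on the command line, for example vllm serve facebook/opt-125m -O3.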
✨ New Features
- V1 Engine: A rewritten high-performance engine (alpha) enabled via VLLM_USE_V1=1.
- torch.compile integration: Enabled by default in V1, or opt-in via the -O3 engine parameter.
- New Models: CogAgent, Deepseek-VL2, fairseq2 Llama, InternLM3, Whisper, Qwen2 PRM, and InternLM2 reward models.
- Hardware Support: Native Apple Silicon support, AMD MI300 FP8 support, and TPU W8A8 support.
- API Server: Added a Jina- and Cohere-compatible Rerank API (example after this list).
- Distributed: Support for torchrun, SPMD-style offline inference, and new collective_rpc abstraction.
- Kernels: Flash Attention 3 support and fusion of the Punica prefill kernels.
- VLM: New merged multi-modal processor for easier model development.
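The new Rerank API can be exercised with plain HTTP; a hedged sketch, assuming a server started with vllm serve BAAI/bge-reranker-base and a Jina-style /rerank route and payload, both of which are worth verifying against your installed version:

```python
import requests

# Assumes a local server started with: vllm serve BAAI/bge-reranker-base
# The /rerank route and payload shape follow the Jina-compatible convention;
# check both against your vLLM version before relying on them.
response = requests.post(
    "http://localhost:8000/rerank",
    json={
        "model": "BAAI/bge-reranker-base",
        "query": "What is the capital of France?",
        "documents": [
            "Paris is the capital of France.",
            "Berlin is the capital of Germany.",
        ],
    },
)
print(response.json())
```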
🐛 Bug Fixes
- Fixed Deepseekv3 quantization breakage for other methods (#11547).
- Fixed TeleChat2ForCausalLM weights mapper (#11546).
- Fixed ROCm compressed tensor support (#11561).
- Fixed the interleaved sliding window for the Cohere2 model (#11583).
- Fixed parallel sampling in the OpenAI-compatible server when using xgrammar (#11637).
- Fixed last token measurement (#11376).
🔧 Affected Symbols
LLM.sleep, LLM.wake_up, LLM.collective_rpc, LLM.reset_prefix_cache, torch.compile, DeepseekScalingRotaryEmbedding, MiniCPMVBaseModel, MolmoForCausalLM, VLLM_USE_V1
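The new LLM control methods can be exercised directly; a minimal sketch, where the enable_sleep_mode flag and the sleep-level semantics are assumptions drawn from current vLLM docs and should be verified against your installed version:

```python
from vllm import LLM

# enable_sleep_mode and the level semantics below are assumptions; verify
# them against the docs for your vLLM version.
llm = LLM(
    model="facebook/opt-125m",
    enable_sleep_mode=True,
    enable_prefix_caching=True,
)

# Release accelerator memory between bursts of traffic, then restore it.
llm.sleep(level=1)  # level=1 offloads weights, level=2 discards them
llm.wake_up()

# Invalidate the prefix cache, e.g. after the served weights change.
llm.reset_prefix_cache()

# collective_rpc broadcasts a method call to every worker; the method name
# below is purely illustrative.
# llm.collective_rpc("report_device_id")
```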