vllm v0.7.0
⚠ 2 breaking · ✨ 8 features · 🐛 6 fixes · 🔧 9 affected symbols
Summary
This release introduces the V1 engine alpha for improved performance and architectural simplicity, alongside full torch.compile integration. It adds support for several new models including Deepseek-VL2 and Whisper, while expanding hardware compatibility for Apple Silicon, AMD, and TPU.
⚠️ Breaking Changes
- The Deepseekv3 integration initially broke quantization for other methods (fixed in #11547); ensure you are on the latest patch if you use any other quantization method.
- The V1 engine is a complete rewrite; it is opt-in, but it may show different performance characteristics or edge-case behavior than V0.
Migration Steps
- To test the new V1 engine, set the environment variable VLLM_USE_V1=1 (see the sketch after this list).
- To enable torch.compile optimizations, pass the -O3 engine parameter.
- For VLM developers: to support the V1 engine, implement the merged multi-modal processor and the get_*_embeddings methods.
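A minimal sketch of opting in to the V1 engine for offline inference, assuming the environment variable is set before vllm is imported; the model name is only a placeholder:

```python
import os

# Opt in to the V1 engine (alpha). Set this before importing vllm so the
# engine picks it up at initialization time.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

# Placeholder model; substitute whatever you already serve.
llm = LLM(model="facebook/opt-125m")

for output in llm.generate(["The capital of France is"]):
    print(output.outputs[0].text)
```

For the API server, the -O3 flag mentioned above is passed on the command line, for example vllm serve facebook/opt-125m -O3.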
✨ New Features
- V1 Engine: A rewritten high-performance engine (alpha) enabled via VLLM_USE_V1=1.
- torch.compile integration: Enabled by default in V1, or opt-in via the -O3 engine parameter.
- New Models: CogAgent, Deepseek-VL2, fairseq2 Llama, InternLM3, Whisper, Qwen2 PRM, and InternLM2 reward models.
- Hardware Support: Native Apple Silicon support, AMD MI300 FP8 support, and TPU W8A8 support.
- API Server: Added a Jina- and Cohere-compatible Rerank API (example after this list).
- Distributed: Support for torchrun, SPMD-style offline inference, and new collective_rpc abstraction.
- Kernels: Flash Attention 3 support and fusion of the Punica prefill kernels.
- VLM: New merged multi-modal processor for easier model development.
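The new Rerank API can be exercised with plain HTTP; a hedged sketch, assuming a server started with vllm serve BAAI/bge-reranker-base and a Jina-style /rerank route and payload, both of which are worth verifying against your installed version:

```python
import requests

# Assumes a local server started with: vllm serve BAAI/bge-reranker-base
# The /rerank route and payload shape follow the Jina-compatible convention;
# check both against your vLLM version before relying on them.
response = requests.post(
    "http://localhost:8000/rerank",
    json={
        "model": "BAAI/bge-reranker-base",
        "query": "What is the capital of France?",
        "documents": [
            "Paris is the capital of France.",
            "Berlin is the capital of Germany.",
        ],
    },
)
print(response.json())
```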
🐛 Bug Fixes
- Fixed Deepseekv3 quantization breakage for other methods (#11547).
- Fixed TeleChat2ForCausalLM weights mapper (#11546).
- Fixed ROCm compressed tensor support (#11561).
- Fixed the interleaved sliding window for the Cohere2 model (#11583).
- Fixed parallel sampling in the OpenAI-compatible server when using xgrammar (#11637).
- Fixed last token measurement (#11376).
🔧 Affected Symbols
LLM.sleep, LLM.wake_up, LLM.collective_rpc, LLM.reset_prefix_cache, torch.compile, DeepseekScalingRotaryEmbedding, MiniCPMVBaseModel, MolmoForCausalLM, VLLM_USE_V1
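The new LLM control methods can be exercised directly; a minimal sketch, where the enable_sleep_mode flag and the sleep-level semantics are assumptions drawn from current vLLM docs and should be verified against your installed version:

```python
from vllm import LLM

# enable_sleep_mode and the level semantics below are assumptions; verify
# them against the docs for your vLLM version.
llm = LLM(
    model="facebook/opt-125m",
    enable_sleep_mode=True,
    enable_prefix_caching=True,
)

# Release accelerator memory between bursts of traffic, then restore it.
llm.sleep(level=1)  # level=1 offloads weights, level=2 discards them
llm.wake_up()

# Invalidate the prefix cache, e.g. after the served weights change.
llm.reset_prefix_cache()

# collective_rpc broadcasts a method call to every worker; the method name
# below is purely illustrative.
# llm.collective_rpc("report_device_id")
```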