v0.20.1
📦 vllm
✨ 6 features · 🐛 11 fixes · 🔧 4 symbols
Summary
vLLM v0.20.1 is a patch release focused on stabilizing and improving DeepSeek V4 performance, with kernel optimizations and critical bug fixes across the CUDA and ROCm platforms.
Migration Steps
- If you encounter issues related to persistent topk, note that it has been temporarily disabled as a workaround for deadlocks and races.
✨ New Features
- Added base model support for DeepSeek V4.
- Implemented multi-stream pre-attention GEMM for DeepSeek V4, configurable via a tuning knob with a default threshold (see the overlap sketch after this list).
- Added BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication.
- Introduced a PTX cvt instruction path for faster FP32->FP4 conversion (an e2m1 rounding illustration follows this list).
- Integrated tile kernels (head_compute_mix_kernel) for optimized head computation.
- Guarded the megamoe flag with pure TP.
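
The multi-stream pre-attention GEMM feature overlaps independent projection GEMMs on separate CUDA streams before attention runs. Below is a minimal PyTorch sketch of that overlap pattern; the function name, tensor shapes, and the `seq_len_threshold` gate are illustrative assumptions, not vLLM's actual kernels or knob names.

```python
import torch

def pre_attention_gemms(x, w_q, w_kv, seq_len_threshold=256):
    """Illustrative sketch: run two independent pre-attention GEMMs on
    separate CUDA streams when the token count is large enough for the
    overlap to pay off. Names and the threshold are assumptions."""
    if not torch.cuda.is_available() or x.shape[0] < seq_len_threshold:
        # Below the threshold, stream setup costs more than it saves.
        return x @ w_q, x @ w_kv

    main = torch.cuda.current_stream()
    side = torch.cuda.Stream()
    side.wait_stream(main)              # side stream sees x fully written

    q = x @ w_q                         # GEMM 1 stays on the main stream
    with torch.cuda.stream(side):
        kv = x @ w_kv                   # GEMM 2 overlaps on the side stream
    main.wait_stream(side)              # attention must wait for both
    kv.record_stream(main)              # keep the caching allocator stream-aware
    return q, kv

if __name__ == "__main__":
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(512, 1024, device=dev)
    w_q = torch.randn(1024, 1024, device=dev)
    w_kv = torch.randn(1024, 512, device=dev)
    q, kv = pre_attention_gemms(x, w_q, w_kv)
    print(q.shape, kv.shape)
```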
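
The PTX cvt change targets the on-GPU FP32->FP4 conversion; the kernel itself is hardware-specific, but the numeric effect of rounding onto the e2m1 (FP4) value grid can be emulated in plain PyTorch. This is only a reference illustration of the conversion semantics, not the CUDA/PTX code path.

```python
import torch

# The eight non-negative values representable in e2m1 FP4
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp32_to_fp4_e2m1(x: torch.Tensor) -> torch.Tensor:
    """Round each FP32 value to the nearest e2m1-representable value.
    Emulates what a cvt-based FP32->FP4 path produces numerically; the
    real kernel also packs two 4-bit codes per byte, omitted here."""
    grid = E2M1_GRID.to(x.device)
    sign = torch.sign(x)
    mag = x.abs().clamp(max=grid[-1])                  # saturate at +/-6.0
    idx = (mag.unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return sign * grid[idx]

if __name__ == "__main__":
    x = torch.tensor([0.3, 0.74, -2.4, 5.1, 100.0])
    print(fp32_to_fp4_e2m1(x))   # values snapped onto the e2m1 grid
```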
🐛 Bug Fixes
- Fixed a persistent topk cooperative deadlock at TopK=1024 and an inter-CTA init race on RadixRowState (persistent topk remains temporarily disabled as a workaround).
- Fixed an import error caused by AOT compile cache loading.
- Fixed a torch inductor error.
- Fixed repeated RoPE cache initialization.
- Fixed missing type conversion for non-streaming tool calls in DSV3.2/V4.
- Fixed max_num_batched_tokens not being captured in the CUDA graph (a usage sketch follows this list).
- Fixed num_gpu_blocks_override not being accounted for in max_model_len checks.
- Auto-disabled expandable_segments around the cumem memory pool (a guard-pattern sketch follows this list).
- Fixed BailingMoE linear layer and MLA RoPE rotation for BailingMoE V2.5.
- Fixed reasoning parser kwargs not being passed to structured output.
- [ROCm] Fixed input_ids and expert_map args for Quark W4A8 GPT-OSS.
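
For the max_num_batched_tokens fix: the value is an engine argument that CUDA graph capture now honors. A minimal usage sketch, assuming a GPU and locally available weights; the model name and sizes here are placeholders.

```python
from vllm import LLM, SamplingParams

# max_num_batched_tokens bounds how many tokens the scheduler batches per step;
# with CUDA graphs enabled (enforce_eager=False, the default) captured graphs
# must respect this bound -- the bug fixed here was the value not being captured.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",   # placeholder model name
    max_num_batched_tokens=8192,
    enforce_eager=False,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```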
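
For the expandable_segments workaround: expandable_segments is a PyTorch caching-allocator option requested through `PYTORCH_CUDA_ALLOC_CONF`, and it conflicts with vLLM's cumem (CUDA virtual-memory) pool, so this release disables it automatically around the pool. Below is a hedged, environment-variable-level sketch of that guard pattern; the helper name is illustrative and this is not vLLM's internal implementation.

```python
import os
from contextlib import contextmanager

@contextmanager
def without_expandable_segments():
    """Illustrative guard: make sure expandable_segments is not requested
    while a cumem-style pool is in use. Note that PYTORCH_CUDA_ALLOC_CONF
    only takes effect at the first CUDA allocation, so a real guard (like
    the one shipped in this release) must act earlier, inside the allocator."""
    key = "PYTORCH_CUDA_ALLOC_CONF"
    old = os.environ.get(key)
    if old and "expandable_segments:True" in old:
        os.environ[key] = old.replace("expandable_segments:True",
                                      "expandable_segments:False")
    try:
        yield
    finally:
        if old is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = old

# Usage: wrap the code that sets up the cumem memory pool.
# with without_expandable_segments():
#     init_cumem_pool()   # hypothetical setup call
```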