Change 8

v0.20.1

📦 vllm
✨ 6 features · 🐛 11 fixes · 🔧 4 symbols

Summary

vLLM v0.20.1 is a patch release that stabilizes DeepSeek V4 support and improves its performance, with kernel optimizations and critical bug fixes across the CUDA and ROCm platforms.

Migration Steps

  1. If you encounter issues related to persistent top-k, note that it has been temporarily disabled as a workaround for the deadlock and inter-CTA race fixed in this release; a sketch of the equivalent eager-mode operation follows below.
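
For context, the persistent kernel accelerates ordinary top-k logit filtering. The sketch below shows the equivalent operation in plain PyTorch; the function name and shapes are illustrative assumptions, not vLLM's internal API.

```python
# Minimal eager-mode sketch of top-k logit filtering, the operation the
# disabled persistent kernel accelerates. Illustration only; this is not
# vLLM's internal implementation.
import torch

def topk_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest logits per row; mask the rest to -inf."""
    kth_vals = torch.topk(logits, k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth_vals, float("-inf"))

logits = torch.randn(2, 32000)            # [batch, vocab]
filtered = topk_filter(logits, k=1024)    # k=1024 was the deadlocking case
probs = torch.softmax(filtered, dim=-1)   # renormalize over the survivors
```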

✨ New Features

  • Added base model support for DeepSeek V4.
  • Implemented multi-stream pre-attention GEMM for DeepSeek V4, configurable via a tuning knob with a default activation threshold (see the overlap sketch after this list).
  • Added BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication.
  • Introduced a PTX cvt instruction path for faster FP32->FP4 conversion (an E2M1 reference sketch follows this list).
  • Integrated tile kernels (head_compute_mix_kernel) for optimized head computation.
  • Guarded the megamoe flag so it only applies under pure TP (tensor parallelism).
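
To make the multi-stream pre-attention GEMM feature concrete, the following sketch shows the general overlap pattern: independent projection GEMMs issued on separate CUDA streams so they can execute concurrently. The shapes and the w_q/w_kv weights are hypothetical; this is the pattern, not vLLM's kernel.

```python
# Hypothetical illustration of the multi-stream overlap pattern:
# two independent pre-attention projections run on separate streams.
import torch

x = torch.randn(8192, 4096, device="cuda", dtype=torch.bfloat16)
w_q = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
w_kv = torch.randn(4096, 1024, device="cuda", dtype=torch.bfloat16)

main = torch.cuda.current_stream()
side = torch.cuda.Stream()

side.wait_stream(main)            # inputs were produced on the main stream
with torch.cuda.stream(side):
    kv = x @ w_kv                 # KV projection overlaps with Q below
q = x @ w_q                       # Q projection stays on the main stream

main.wait_stream(side)            # attention must wait for both GEMMs
```

As for the FP32->FP4 path: assuming the E2M1 encoding used by NVFP4/MXFP4 (1 sign, 2 exponent, 1 mantissa bit), FP4 represents only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}. The pure-PyTorch reference below illustrates the conversion the PTX cvt instruction performs in hardware; the helper name is hypothetical, and tie-breaking here is simplified relative to the hardware's round-to-nearest mode.

```python
import torch

# All finite E2M1 magnitudes (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_e2m1(x: torch.Tensor) -> torch.Tensor:
    """Round each value to the nearest representable E2M1 value."""
    sign = torch.sign(x)
    mag = x.abs().clamp(max=6.0)  # saturate to the largest finite value
    idx = (mag.unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    return sign * E2M1_GRID[idx]

print(quantize_e2m1(torch.tensor([0.7, -2.4, 10.0])))  # [0.5, -2.0, 6.0]
```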

🐛 Bug Fixes

  • Fixed a persistent top-k cooperative deadlock at TopK=1024 and an inter-CTA initialization race on RadixRowState (persistent top-k is temporarily disabled as a workaround; see Migration Steps).
  • Fixed an import error caused by AOT compile cache loading.
  • Fixed a torch inductor error.
  • Fixed repeated RoPE cache initialization.
  • Fixed missing type conversion for non-streaming tool calls in DSV3.2/V4.
  • Fixed max_num_batched_tokens not being captured in the CUDA graph.
  • Fixed num_gpu_blocks_override not being accounted for in max_model_len checks.
  • Auto-disabled expandable_segments around the cumem memory pool (see the note after this list).
  • Fixed BailingMoE linear layer and MLA RoPE rotation for BailingMoE V2.5.
  • Fixed reasoning parser kwargs not being passed to structured output.
  • [ROCm] Fixed input_ids and expert_map args for Quark W4A8 GPT-OSS.
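
For context on the expandable_segments fix: PyTorch's caching allocator enables expandable segments through the PYTORCH_CUDA_ALLOC_CONF environment variable, a mode this entry indicates conflicts with vLLM's cumem-based memory pool and which the release now disables automatically around the pool. The snippet below only shows which setting is involved; no manual action should be needed.

```python
# Illustration of the PyTorch allocator setting involved in the
# expandable_segments fix. It must be set before the first CUDA
# allocation; vLLM now toggles it automatically around the cumem
# memory pool, so this is shown for context only.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # import after setting the env var so the allocator sees it

x = torch.zeros(1024, device="cuda")  # served from an expandable segment
```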

Affected Symbols