v0.20.2
Summary
vLLM v0.20.2 is a small patch release focused on bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL models.
🐛 Bug Fixes
- Re-enabled the persistent topk path on Hopper and ensured the memset kernel runs at CUDA graph capture time regardless of `max_seq_len` in DeepSeek V4 sparse attention, fixing a hang when MTP=1.
- Fixed a "failure to allocate KV blocks" error in the V1 engine KV cache manager for DeepSeek V4.
- Plumbed `hidden_dim_unpadded` through the `moe_forward` fake op so MXFP4 works under `torch.compile` on v0.20.x for gpt-oss.
- Removed an invalid deepstack boundary check that could fail under heavy load for Qwen3-VL.
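The `moe_forward` fix reflects a general `torch.compile` pattern: every custom op needs a "fake" (meta) counterpart that computes output shapes without running the kernel, and when the real op gains a parameter, the fake must accept and use it too or shape inference diverges from the real kernel. A minimal plain-Python sketch of that contract (illustrative only; the names `moe_forward_real`/`moe_forward_fake` and the shapes are hypothetical, not vLLM's actual code):

```python
def moe_forward_real(x, hidden_dim_unpadded):
    # "Real kernel": computes on padded storage, returns the unpadded slice.
    return [row[:hidden_dim_unpadded] for row in x]

def moe_forward_fake(shape, hidden_dim_unpadded):
    # "Fake op": predicts only the output shape, (num_tokens, hidden_dim_unpadded).
    # If this function did not receive hidden_dim_unpadded, it would have to
    # guess the padded width and disagree with the real kernel above.
    num_tokens, _padded = shape
    return (num_tokens, hidden_dim_unpadded)

x = [[1.0] * 128 for _ in range(4)]   # 4 tokens, hidden dim padded to 128
out = moe_forward_real(x, hidden_dim_unpadded=100)
# The predicted shape must match the real output's shape.
assert moe_forward_fake((4, 128), 100) == (len(out), len(out[0]))
```

In PyTorch terms, the fake function is what `torch.library`'s `register_fake` attaches to a custom op; plumbing the new argument through both sides is what keeps tracing consistent under `torch.compile`.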