v0.20.2
Summary
vLLM v0.20.2 is a small patch release focused on bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL models.
🐛 Bug Fixes
- Re-enabled the persistent topk path on Hopper and ensured the memset kernel runs at CUDA graph capture time regardless of `max_seq_len` in DeepSeek V4 sparse attention, fixing a hang when MTP=1.
- Fixed a "failure to allocate KV blocks" error in the V1 engine KV cache manager for DeepSeek V4.
- Plumbed `hidden_dim_unpadded` through the `moe_forward` fake op so MXFP4 works under `torch.compile` on v0.20.x for gpt-oss.
- Removed an invalid deepstack boundary check that could fail under heavy load for Qwen3-VL.
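The `moe_forward` fix reflects a general `torch.compile` pattern: every custom op needs a "fake" (meta) counterpart that computes output shapes without running the kernel, and when the real op gains a parameter, the fake must accept and use it too or shape inference diverges from the real kernel. A minimal plain-Python sketch of that contract (illustrative only; the names `moe_forward_real`/`moe_forward_fake` and the shapes are hypothetical, not vLLM's actual code):

```python
def moe_forward_real(x, hidden_dim_unpadded):
    # "Real kernel": computes on padded storage, returns the unpadded slice.
    return [row[:hidden_dim_unpadded] for row in x]

def moe_forward_fake(shape, hidden_dim_unpadded):
    # "Fake op": predicts only the output shape, (num_tokens, hidden_dim_unpadded).
    # If this function did not receive hidden_dim_unpadded, it would have to
    # guess the padded width and disagree with the real kernel above.
    num_tokens, _padded = shape
    return (num_tokens, hidden_dim_unpadded)

x = [[1.0] * 128 for _ in range(4)]   # 4 tokens, hidden dim padded to 128
out = moe_forward_real(x, hidden_dim_unpadded=100)
# The predicted shape must match the real output's shape.
assert moe_forward_fake((4, 128), 100) == (len(out), len(out[0]))
```

In PyTorch terms, the fake function is what `torch.library`'s `register_fake` attaches to a custom op; plumbing the new argument through both sides is what keeps tracing consistent under `torch.compile`.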