v0.20.2

📦 vllm
🐛 4 fixes · 🔧 4 symbols

Summary

vLLM v0.20.2 is a small patch release focused on bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL models.

🐛 Bug Fixes

  • Re-enabled the persistent topk path on Hopper and ensured the memset kernel runs at CUDA graph capture time regardless of `max_seq_len` for DeepSeek V4 sparse attention, fixing a hang with MTP=1.
  • Fixed a "failure to allocate KV blocks" error in the V1 engine KV cache manager for DeepSeek V4.
  • Plumbed `hidden_dim_unpadded` through the `moe_forward` fake op so MXFP4 works under `torch.compile` on v0.20.x for gpt-oss.
  • Removed an invalid deepstack boundary check that could fail under heavy load for Qwen3-VL.

Affected Symbols