
v0.8.3rc1

📦 vllm
✨ 11 features · 🐛 10 fixes · 🔧 9 symbols

Summary

This release focuses on expanding V1 engine capabilities, including CPU MLA support, improved TPU stability, multi-image input support for Molmo, and a reasoning parser for Granite. It also introduces several kernel optimizations for MoE and FP8 quantization.

Migration Steps

  1. Review user-defined chat templates, as the warnings emitted for them have been enhanced and may surface new messages.
  2. Note that the supported Ray version range is now more tightly restricted; check your pinned Ray version against the new requirements.

✨ New Features

  • Added CPU MLA (Multi-Head Latent Attention) kernel support.
  • Support for multi-image inputs in Molmo models (a usage sketch follows this list).
  • Added Reasoning Parser for Granite Models.
  • Integrated Fused MoE Kernels from AITER for ROCm.
  • Added CUTLASS grouped gemm fp8 MoE kernel.
  • Support for SHA256 as a hash function in prefix caching (see the sketch after this list).
  • Support for FIPS-enabled machines via an MD5 non-security-use workaround (mechanism sketched below).
  • Added middleware to log API Server responses (the general pattern is sketched below).
  • Support for long_prefill_token_threshold in the V1 scheduler (a simplified sketch follows this list).
  • Added Fp8 Channelwise Dynamic Per Token GroupedGEMM.
  • MiniCPM-V/O is now supported on the V1 engine.
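
For the Molmo multi-image support, the input follows vLLM's usual multi-modal shape: pass a list of images under multi_modal_data and set limit_mm_per_prompt to cover the image count. A minimal sketch, where the image paths and prompt wording are placeholders to adapt:

```python
from PIL import Image
from vllm import LLM

# limit_mm_per_prompt must allow as many images as the prompt supplies.
llm = LLM(
    model="allenai/Molmo-7B-D-0924",
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 2},
)

images = [Image.open("photo1.jpg"), Image.open("photo2.jpg")]  # placeholder paths
outputs = llm.generate({
    "prompt": "USER: Compare these two images. ASSISTANT:",
    "multi_modal_data": {"image": images},
})
print(outputs[0].outputs[0].text)
```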
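
For SHA256 prefix caching, the option trades some hashing speed for collision resistance in block keys. A minimal sketch of the general idea, assuming each block is keyed on its parent block's key plus its own token IDs; the helper below is illustrative, not vLLM's internal API:

```python
import hashlib
import pickle
from typing import Optional

def sha256_block_key(parent_key: Optional[bytes],
                     token_ids: tuple) -> bytes:
    """Chain the parent block's key with this block's token IDs so that
    identical prefixes always map to identical cache keys."""
    return hashlib.sha256(pickle.dumps((parent_key, token_ids))).digest()

# Two requests sharing the same 4-token prefix hit the same cache block.
assert sha256_block_key(None, (1, 2, 3, 4)) == sha256_block_key(None, (1, 2, 3, 4))
```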
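
The FIPS workaround rests on a standard-library mechanism: since Python 3.9, a caller can declare an MD5 digest as non-security use, which FIPS mode permits for purposes such as cache keys. A minimal sketch of that mechanism, not vLLM's exact call site:

```python
import hashlib

# On FIPS-enabled machines, hashlib.md5(data) can raise because MD5 is
# disallowed for cryptographic use. Marking the digest as non-security
# (Python 3.9+) keeps it available for cache keys and similar purposes.
digest = hashlib.md5(b"some-cache-key", usedforsecurity=False).hexdigest()
print(digest)
```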
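
Logging responses on an ASGI app generally means draining the streamed body, logging it, and re-emitting it unchanged. A generic sketch of that pattern on a FastAPI app; it shows the technique rather than vLLM's exact middleware:

```python
import logging

from fastapi import FastAPI, Request
from starlette.responses import Response

logger = logging.getLogger("api.responses")
app = FastAPI()

@app.middleware("http")
async def log_response(request: Request, call_next):
    # Drain the (possibly streamed) body so it can be logged, then re-emit it.
    response = await call_next(request)
    body = b"".join([chunk async for chunk in response.body_iterator])
    logger.info("%s %s -> %d %r", request.method, request.url.path,
                response.status_code, body)
    return Response(content=body, status_code=response.status_code,
                    headers=dict(response.headers), media_type=response.media_type)
```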
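
Finally, long_prefill_token_threshold caps how many prompt tokens a single long prefill may consume in one scheduling step, so it cannot starve shorter requests. A simplified sketch of the rule, assuming the scheduler chunks each request against a per-step token budget; the function name is illustrative:

```python
def tokens_to_schedule(remaining_prefill: int,
                       long_prefill_token_threshold: int,
                       token_budget: int) -> int:
    """Chunk a long prefill down to the threshold (when one is set),
    then cap it by whatever token budget is left for this step."""
    n = remaining_prefill
    if 0 < long_prefill_token_threshold < n:
        n = long_prefill_token_threshold
    return min(n, token_budget)

# A 10k-token prompt contributes only 2,048 tokens this step,
# leaving budget for other requests in the same batch.
assert tokens_to_schedule(10_000, 2_048, 8_192) == 2_048
```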

🐛 Bug Fixes

  • Fixed CUDA kernel index data type in gptq_marlin/awq_marlin_repack.
  • Fixed TPU v1 mp profiler.
  • Fixed conflicting macro names for gguf kernels.
  • Fixed inductor cache on max_position_embeddings.
  • Fixed nightly MLA failure involving FA2 and chunked prefill in V1.
  • Fixed raw_request extraction in load_aware_call decorator.
  • Fixed weight loading for specific models in Transformers backend.
  • Fixed use_cascade_attention handling for ALiBi-based models on V1.
  • Fixed TPU Sampler recompilation and bucket padding issues.
  • Added support for triton==3.3.0+git95326d9f, restoring RTX 5090 compatibility.

🔧 Affected Symbols

SchedulerInterface, TransformersModel, SiglipMLP, LRUCache, tpu_model_runner, vllm.csrc.quantization.gptq_marlin, vllm_compile_cache, MiniCPM-V, Molmo