v0.22.1
📦 vllm
✨ 2 features · 🐛 6 fixes · 🔧 7 symbols
Summary
v0.22.1 is a patch release that adds support for Mellum v2 and enables accelerated quantized inference on AMD Zen CPUs. It also includes several critical fixes covering model initialization, Ray serving stability, and Docker image builds.
Migration Steps
- If using HyperCLOVAX, ensure transformers >= 5.9.0 is installed, as the model loading logic now relies on native support or vLLM's vendored config.
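
The version requirement above can be checked before loading the model. This is a minimal sketch and not part of vLLM: the helper names `parse_release`, `meets_minimum`, and `check_transformers` are hypothetical, and the parser deliberately ignores pre-release suffixes, so `5.9.0rc1` is treated as `5.9.0`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(ver: str) -> tuple:
    """Extract the leading numeric components, padded to three fields.

    Simplification (assumption): suffixes such as 'rc1' or '.dev0' are dropped.
    """
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "5.9.0") -> bool:
    """Return True if `installed` is at least `minimum`."""
    return parse_release(installed) >= parse_release(minimum)


def check_transformers(minimum: str = "5.9.0"):
    """Return (ok, installed_version); installed_version is None if missing."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return False, None
    return meets_minimum(installed, minimum), installed


if __name__ == "__main__":
    ok, installed = check_transformers()
    print(f"transformers installed: {installed}, meets >= 5.9.0: {ok}")
```

For strict handling of pre-releases, a full version parser such as the one in the `packaging` library would be the more robust choice.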
✨ New Features
- Added support for JetBrains' Mellum v2 model.
- Enabled zentorch-accelerated quantized linear inference (W8A8 and W4A16) on AMD Zen CPUs, with fallback for other hardware.
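
The "accelerate when available, otherwise fall back" pattern described above can be illustrated roughly as follows. This is a sketch under stated assumptions and does not reproduce vLLM's actual dispatch logic: it assumes the accelerator is importable as a Python module named `zentorch`, and the function names and the `"zentorch"` / `"default"` labels are hypothetical.

```python
import importlib.util

# Quantization schemes named in the release notes as accelerated.
ACCELERATED_SCHEMES = {"W8A8", "W4A16"}


def zentorch_importable() -> bool:
    """Return True if a module named 'zentorch' can be found (assumed name)."""
    return importlib.util.find_spec("zentorch") is not None


def pick_linear_backend(scheme: str, zentorch_available: bool) -> str:
    """Choose a backend label for a quantized linear layer (hypothetical helper).

    Falls back to "default" when the scheme is not accelerated or the
    accelerator is unavailable.
    """
    if zentorch_available and scheme in ACCELERATED_SCHEMES:
        return "zentorch"
    return "default"


if __name__ == "__main__":
    available = zentorch_importable()
    for scheme in ("W8A8", "W4A16", "FP8"):
        print(scheme, "->", pick_linear_backend(scheme, available))
```

Passing availability in as a parameter keeps the selection logic testable on machines without the accelerator installed.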
🐛 Bug Fixes
- Resolved a CUTLASS fmin compatibility issue that prevented DeepSeek-V4 initialization.
- Fixed OlmoHybridForCausalLM failing to initialize after a checkpoint change involving rope_parameters.
- Fixed HyperCLOVAX loading by registering the model_type to use vLLM's vendored config instead of stale upstream remote code.
- Fixed a deterministic hang in multi-node Ray data-parallel serving when num_api_servers > 1 by adjusting port allocation.
- Stopped installing flashinfer-jit-cache via --extra-index-url, fixing Docker image builds that were failing because of a PyPI quarantine.
- Normalized NIXL KV-connector wheel installs to only keep the wheel matching the image's CUDA major version, resolving ImportError: libcudart.so.12 on CUDA 13 images.
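
The idea behind the last fix, keeping only the wheel that matches the image's CUDA major version, can be sketched as below. This is an illustration, not the actual build script: the wheel filenames, the `cuNN` tag convention, and the helper names are assumptions, and wheels with no CUDA tag are kept on the assumption that they are CUDA-independent.

```python
import re

# Assumed convention: the CUDA major version appears as 'cu12', 'cu13', ...
CUDA_TAG = re.compile(r"cu(\d{2})")


def cuda_major_of(wheel_name: str):
    """Return the CUDA major version encoded in a wheel name, or None."""
    match = CUDA_TAG.search(wheel_name)
    return int(match.group(1)) if match else None


def select_wheels(wheel_names, cuda_major: int):
    """Keep wheels tagged for `cuda_major`, plus wheels with no CUDA tag."""
    return [
        name for name in wheel_names
        if cuda_major_of(name) in (None, cuda_major)
    ]


if __name__ == "__main__":
    # Hypothetical filenames for illustration only.
    wheels = [
        "nixl_cu12-1.0.0-cp312-cp312-linux_x86_64.whl",
        "nixl_cu13-1.0.0-cp312-cp312-linux_x86_64.whl",
    ]
    print(select_wheels(wheels, cuda_major=13))
```

Installing a wheel built for a different CUDA major is what leads to errors such as the `libcudart.so.12` ImportError on a CUDA 13 image, since the shared library it links against is not present.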