v0.8.1
📦 vllm · Breaking Changes
⚠ 2 breaking · ✨ 6 features · 🐛 10 fixes · ⚡ 1 deprecation · 🔧 11 symbols
Summary
v0.8.1 is a maintenance release focusing on V1 engine stability, adding Zamba2 support, and enabling LoRA for embedding models. It includes critical fixes for sampling dtypes, quantization, and TPU performance.
⚠️ Breaking Changes
- Retired the SGMV and BGMV LoRA kernels; users relying on these kernels in custom implementations may need to migrate to the newer kernel path.
- Removed custom_cache_manager from the frontend, which may affect custom engine implementations.
Migration Steps
- Upgrade to v0.8.1 to receive critical bug fixes for the v0.8.0 release.
- If using a custom Dockerfile, ensure the latest version of the transformers library is installed, as described in the updated documentation (a quick version check is sketched after this list).
- Update any code referencing SGMV/BGMV kernels to use the standard LoRA kernel path.
- Remove references to custom_cache_manager if previously used in frontend integrations.
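As a quick sanity check that the upgrade landed, the installed package versions can be compared against what the updated documentation requires. A minimal sketch, assuming a standard pip-based install (the documentation specifies the actual minimum transformers version):

```python
# Print the installed vllm and transformers versions so they can be
# compared against the documented requirements for v0.8.1.
import importlib.metadata as md

for pkg in ("vllm", "transformers"):
    print(f"{pkg}: {md.version(pkg)}")
```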
✨ New Features
- Added support for Zamba2 models.
- Embedding models now support LoRA adapters (see the sketch after this list).
- Refactored Structured Output to support multiple backends in V1.
- Optimized the rejection sampler with Triton kernels for V1 speculative decoding.
- Added support for different tokenizer_mode in benchmarks.
- Re-enabled Gemma3 support for V1 engine.
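The embedding-plus-LoRA feature can be exercised through the offline LLM API. A minimal sketch, assuming the `task="embed"` offline interface and the standard LoRARequest keyword argument; the model id and adapter path are placeholders, not values from this release:

```python
from vllm import LLM
from vllm.lora.request import LoRARequest

# Load an embedding model with LoRA support enabled (model id is a placeholder).
llm = LLM(
    model="intfloat/e5-mistral-7b-instruct",
    task="embed",
    enable_lora=True,
)

# Embed a prompt through a hypothetical adapter stored at a local path.
outputs = llm.embed(
    ["What is the capital of France?"],
    lora_request=LoRARequest("my_adapter", 1, "/path/to/my_adapter"),
)
print(outputs[0].outputs.embedding[:8])
```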
🐛 Bug Fixes
- Fixed interface for Olmo2 on V1 engine.
- Fixed BitsAndBytes (bnb) quantization for models with mixed HF and Mistral format weights.
- Fixed chunked prefill with padding on TPU V1.
- Fixed broken CPU quantization caused by triton import issues.
- Fixed LoRA extra vocab size calculation.
- Fixed validation of logprobs in ChatCompletionRequest (see the example after this list).
- Fixed size calculation of processing cache.
- Ensured int64 usage for sampled token IDs and fixed long dtype in topk sampling.
- Fixed torchrun compatibility issues.
- Fixed fused_moe_kernel for MI325 configurations.
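For context on the logprobs validation fix, the fields it concerns are the OpenAI-compatible `logprobs` and `top_logprobs` request parameters. A minimal sketch of a request that exercises them against a locally served model (base URL, API key, and model name are placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at a local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,    # request per-token logprobs
    top_logprobs=5,   # number of alternatives returned per token
)
print(resp.choices[0].logprobs)
```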
🔧 Affected Symbols
ChatCompletionRequest, AutoModelForImageTextToText, SGMV, BGMV, custom_cache_manager, fused_moe_kernel, Gemma3, Olmo2, Zamba2, Pixtral, CALL_RESHAPE_AND_CACHE_XX
⚡ Deprecations
- The SGMV and BGMV kernels have been retired in favor of updated LoRA implementations.