v0.8.1
📦 vllm · Breaking Changes
⚠ 2 breaking · ✨ 6 features · 🐛 10 fixes · ⚡ 1 deprecation · 🔧 11 symbols
Summary
v0.8.1 is a maintenance release focusing on V1 engine stability, adding Zamba2 support, and enabling LoRA for embedding models. It includes critical fixes for sampling dtypes, quantization, and TPU performance.
⚠️ Breaking Changes
- Retired the SGMV and BGMV LoRA kernels; users relying on these kernels in custom implementations may need to migrate to the newer kernel path.
- Removed custom_cache_manager from the frontend, which may affect custom engine implementations.
Migration Steps
- Upgrade to v0.8.1 to receive critical bug fixes for the v0.8.0 release.
- If using a custom Dockerfile, ensure the latest version of the transformers library is installed, as described in the updated documentation (a quick version check is sketched after this list).
- Update any code referencing SGMV/BGMV kernels to use the standard LoRA kernel path.
- Remove references to custom_cache_manager if previously used in frontend integrations.
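As a quick sanity check that the upgrade landed, the installed package versions can be compared against what the updated documentation requires. A minimal sketch, assuming a standard pip-based install (the documentation specifies the actual minimum transformers version):

```python
# Print the installed vllm and transformers versions so they can be
# compared against the documented requirements for v0.8.1.
import importlib.metadata as md

for pkg in ("vllm", "transformers"):
    print(f"{pkg}: {md.version(pkg)}")
```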
✨ New Features
- Added support for Zamba2 models.
- Embedding models now support LoRA adapters (see the sketch after this list).
- Refactored Structured Output to support multiple backends in V1.
- Optimized the rejection sampler with Triton kernels for V1 speculative decoding.
- Added support for different tokenizer_mode in benchmarks.
- Re-enabled Gemma3 support for V1 engine.
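The embedding-plus-LoRA feature can be exercised through the offline LLM API. A minimal sketch, assuming the `task="embed"` offline interface and the standard LoRARequest keyword argument; the model id and adapter path are placeholders, not values from this release:

```python
from vllm import LLM
from vllm.lora.request import LoRARequest

# Load an embedding model with LoRA support enabled (model id is a placeholder).
llm = LLM(
    model="intfloat/e5-mistral-7b-instruct",
    task="embed",
    enable_lora=True,
)

# Embed a prompt through a hypothetical adapter stored at a local path.
outputs = llm.embed(
    ["What is the capital of France?"],
    lora_request=LoRARequest("my_adapter", 1, "/path/to/my_adapter"),
)
print(outputs[0].outputs.embedding[:8])
```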
🐛 Bug Fixes
- Fixed interface for Olmo2 on V1 engine.
- Fixed BitsAndBytes (bnb) quantization for models with mixed HF and Mistral format weights.
- Fixed chunked prefill with padding on TPU V1.
- Fixed broken CPU quantization caused by triton import issues.
- Fixed LoRA extra vocab size calculation.
- Fixed validation of logprobs in ChatCompletionRequest (see the example after this list).
- Fixed size calculation of processing cache.
- Ensured int64 usage for sampled token IDs and fixed long dtype in topk sampling.
- Fixed torchrun compatibility issues.
- Fixed fused_moe_kernel for MI325 configurations.
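For context on the logprobs validation fix, the fields it concerns are the OpenAI-compatible `logprobs` and `top_logprobs` request parameters. A minimal sketch of a request that exercises them against a locally served model (base URL, API key, and model name are placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at a local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,    # request per-token logprobs
    top_logprobs=5,   # number of alternatives returned per token
)
print(resp.choices[0].logprobs)
```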
🔧 Affected Symbols
ChatCompletionRequest, AutoModelForImageTextToText, SGMV, BGMV, custom_cache_manager, fused_moe_kernel, Gemma3, Olmo2, Zamba2, Pixtral, CALL_RESHAPE_AND_CACHE_XX
⚡ Deprecations
- The SGMV and BGMV kernels have been retired in favor of updated LoRA implementations.