v0.10.0rc2


📅 Jul 24, 2025 · 📦 vllm

⚠ 1 breaking · ✨ 11 features · 🐛 9 fixes · ⚡ 2 deprecations · 🔧 9 symbols

Summary

This release adds Vision Language Model (VLM) support to the Transformers backend, enables shared-memory pipeline parallelism on the CPU backend, and adds SM100 (Blackwell) support for the CUTLASS FP8 grouped GEMM. It also includes performance optimizations for MLA kernels and KV cache management, along with bug fixes for Prometheus logging under data parallelism and for the Ray integration.

⚠️ Breaking Changes

Arguments that were deprecated and scheduled for removal in v0.10 have now been removed. Users must update their configurations to use the current argument names.

Migration Steps

Remove any arguments scheduled for removal in v0.10 from startup scripts and API calls.
Update out-of-tree HPU plugins to align with the new kv_cache_dtype handling.
If you use ROCm, make sure the latest build fixes are applied to avoid regressions.
Update FP4 quantization calls to match the new API signature.

✨ New Features

Support for Vision Language Models (VLMs) using the transformers backend.
Support for multiple poolers at the model level.
Enabled shared-memory-based pipeline parallelism for the CPU backend.
Added NVIDIA ModelOpt config adaptation.
Added SM100 (Blackwell) support for the CUTLASS FP8 grouped GEMM.
Added support for Arcee models.
Added Qwen3CoderToolParser for tool-use capabilities.
Enabled multi-modal (mm) caching for the transformers backend.
Added parallel model weight loading for runai_streamer.
Added tokenization_kwargs to encode so that inputs to embedding models can be truncated.
Added a fused MLA QKV projection with strided layernorm to improve performance.
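Qwen3CoderToolParser, listed above, turns the model's tool-call text into structured calls. The sketch below illustrates that kind of parsing under one assumption: that Qwen3-Coder emits XML-like markup of the form <tool_call><function=NAME><parameter=KEY>VALUE</parameter></function></tool_call>. The function name parse_tool_calls and the OpenAI-style output shape are hypothetical; this is not vLLM's parser, and it ignores streaming output.

```python
import json
import re
from typing import Any

# Assumed Qwen3-Coder tool-call markup (XML-like); not taken from vLLM's source.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict[str, Any]]:
    """Extract tool calls as OpenAI-style {"name", "arguments"} dicts (hypothetical helper)."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        for name, body in FUNCTION_RE.findall(block):
            args: dict[str, Any] = {}
            for key, raw in PARAM_RE.findall(body):
                value = raw.strip()
                try:
                    # Numbers, booleans and lists decode to typed values.
                    args[key] = json.loads(value)
                except json.JSONDecodeError:
                    args[key] = value  # plain text stays a string
            # OpenAI-style APIs carry arguments as a JSON-encoded string.
            calls.append({"name": name, "arguments": json.dumps(args)})
    return calls


sample = (
    "Let me check.\n<tool_call>\n<function=get_weather>\n"
    "<parameter=city>\nParis\n</parameter>\n"
    "<parameter=days>\n3\n</parameter>\n</function>\n</tool_call>"
)
print(parse_tool_calls(sample))
```

Decoding each value with json.loads lets numbers and lists come through typed, but it also converts strings such as "123"; a production parser would more likely consult the tool's JSON schema to decide each parameter's type.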

🐛 Bug Fixes

Fixed thread-safety issues in utils.current_stream.
Fixed Prometheus logging for Data Parallelism (DP).
Fixed DeepGemm CUDA initialization error.
Fixed the eviction logic for cached blocks in the core engine.
Fixed CUDA FP8 KV cache dtype support.
Fixed tool_choice handling when null is passed in JSON payloads.
Fixed a deepseek-v2-lite failure caused by the fused_qkv_a_proj name update.
Fixed a Ray import error and a memory cleanup bug.
Fixed a missing placeholder in a logger debug string.
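To illustrate what evicting a cached block means, here is a minimal sketch of a prefix-caching block pool. It assumes a design in which freed blocks keep their content hash and sit in an LRU free queue until they are reallocated, which is the general idea behind a free block queue such as FreeKVCacheBlockQueue. Block, BlockPool and its methods are hypothetical names; this is not vLLM's implementation and does not reproduce the specific bug that was fixed.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    block_id: int
    ref_cnt: int = 0
    block_hash: Optional[int] = None  # set once the block's contents are cached


class BlockPool:
    """Toy prefix-cache block pool; names are illustrative, not vLLM's API."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [Block(i) for i in range(num_blocks)]
        # Free blocks in LRU order: the front is reallocated (and evicted) first.
        self.free_queue: "OrderedDict[int, Block]" = OrderedDict(
            (b.block_id, b) for b in self.blocks
        )
        # hash -> block, for blocks whose contents can be reused.
        self.cached: dict[int, Block] = {}

    def allocate(self) -> Block:
        """Take the least recently freed block, evicting its cache entry if any."""
        _, block = self.free_queue.popitem(last=False)
        if block.block_hash is not None:
            # Drop the mapping only if it still points at this block.
            if self.cached.get(block.block_hash) is block:
                del self.cached[block.block_hash]
            block.block_hash = None
        block.ref_cnt = 1
        return block

    def cache(self, block: Block, block_hash: int) -> None:
        block.block_hash = block_hash
        self.cached[block_hash] = block

    def lookup(self, block_hash: int) -> Optional[Block]:
        """Reuse a cached block; if it was free, pull it out of the free queue."""
        block = self.cached.get(block_hash)
        if block is None:
            return None
        if block.ref_cnt == 0:
            del self.free_queue[block.block_id]
        block.ref_cnt += 1
        return block

    def free(self, block: Block) -> None:
        block.ref_cnt -= 1
        if block.ref_cnt == 0:
            # Keep the hash so later requests can still hit this block.
            self.free_queue[block.block_id] = block
```

Eviction here is lazy: a freed block remains a cache hit until it reaches the front of the free queue and is handed out again, and only then is its hash removed.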

🔧 Affected Symbols

AutoWeightsLoader, FreeKVCacheBlockQueue, LogitsProcessor, utils.current_stream, Qwen3CoderToolParser, KVCacheTensor, llm.chat, hf_processor, DeepGemm
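utils.current_stream appears in this release alongside a thread-safety fix. The sketch below shows one general way to make a lazily cached accessor thread-safe, using Python's threading.local so that each thread keeps its own cached value. It assumes nothing about vLLM's actual implementation: PerThreadCache and its factory argument are hypothetical names, and in a CUDA setting the factory might be something like torch.cuda.current_stream.

```python
import threading
from typing import Any, Callable

_MISSING = object()  # sentinel, so a factory may legitimately return None


class PerThreadCache:
    """Caches one lazily created object per thread (hypothetical helper)."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._local = threading.local()  # each thread sees its own attributes

    def get(self) -> Any:
        value = getattr(self._local, "value", _MISSING)
        if value is _MISSING:
            # First access from this thread: create and remember the value.
            value = self._factory()
            self._local.value = value
        return value

    def set(self, value: Any) -> None:
        # Only affects the calling thread, so threads cannot clobber each other.
        self._local.value = value
```

Because the cached value lives in threading.local, one thread switching its stream cannot leak into another thread's lookups, which is the class of bug a plain module-level global is prone to.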

⚡ Deprecations

Deprecated models now trigger clear warning messages to notify users of upcoming removal.
Arguments that were deprecated and scheduled for removal in v0.10 have been removed.
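The deprecation warnings for models described above can be pictured with a small sketch built on Python's standard warnings module. The registry DEPRECATED_MODELS, the architecture name ExampleOldForCausalLM, the placeholder version "vX.Y" and the function check_model_deprecation are all hypothetical; vLLM's real mechanism and wording may differ.

```python
import warnings

# Hypothetical registry: architecture name -> release in which it will be removed.
DEPRECATED_MODELS: dict[str, str] = {
    "ExampleOldForCausalLM": "vX.Y",
}


def check_model_deprecation(architecture: str) -> None:
    """Emit a clear, actionable warning if the architecture is deprecated."""
    removal = DEPRECATED_MODELS.get(architecture)
    if removal is not None:
        warnings.warn(
            f"Model architecture {architecture!r} is deprecated and will be "
            f"removed in {removal}. Please migrate to a supported model.",
            FutureWarning,  # shown to end users by default
            stacklevel=2,  # attribute the warning to the caller
        )
```

FutureWarning is used here because Python displays it by default, whereas DeprecationWarning is hidden by default unless triggered from code in __main__, so end users might never see it.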