v0.7.1
📦 vllm
✨ 10 features · 🐛 9 fixes · 🔧 9 symbols
Summary
This release introduces significant MLA and FP8 kernel optimizations for DeepSeek models, yielding roughly 3x generation throughput and 10x token memory capacity. It also expands hardware support for AWS Neuron and AMD ROCm, adds MiniCPM-o model support, and enhances the V1 engine with new Prometheus metrics and a zero-overhead prefix-caching design.
Migration Steps
- Update pre-commit hooks if contributing to the repository.
- For DeepSeek-V3, make sure the new block-quantized CUTLASS kernels are used for optimal performance (see the sketch below).
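A minimal offline-inference sketch for the DeepSeek-V3 step, assuming the block-quantized CUTLASS path is selected automatically when the FP8 checkpoint is loaded on a supported GPU; the model name and tensor-parallel size are illustrative, not prescribed by this release:

```python
from vllm import LLM, SamplingParams

# Hedged sketch: DeepSeek-V3 checkpoints ship block-wise FP8 weights, and vLLM is
# expected to dispatch them to the block-quantized CUTLASS kernels without extra
# flags on supported hardware. Model name and TP size are assumptions.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,   # adjust to the number of available GPUs
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Summarize multi-head latent attention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```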
✨ New Features
- MLA (Multi-head Latent Attention) optimization for DeepSeek models, providing ~3x generation throughput and ~10x token memory capacity.
- New Model Support: MiniCPM-o (text outputs only) and MiniCPM-o-2.6.
- Support for returning reasoning content for DeepSeek-R1 via the OpenAI-compatible API (client sketch after this list).
- Support for overriding generation config in engine arguments.
- Offline scoring for embedding models, mirroring the /score endpoint (scoring sketch after this list).
- NKI-based flash-attention kernel with paged KV cache for Neuron hardware.
- Upstreamed Llama 3.2 support for AMD ROCm.
- Enable MLPSpeculator/Medusa speculative decoding and prompt_logprobs with chunked prefill.
- Triton fused MoE kernel for GPTQ/AWQ.
- V1 Engine: initial Prometheus logger, new metrics (TTFT, TPOT, GPU cache usage), and a zero-overhead prefix-caching design (metrics scraping sketch below).
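For the reasoning-content feature, a hedged sketch of how a client might read the separated reasoning field through the OpenAI-compatible API. The server launch flags, the model name, and the `reasoning_content` attribute are assumptions about this release rather than confirmed API:

```python
from openai import OpenAI

# Assumed setup (not confirmed for this exact release): the server was started with
# something like
#   vllm serve deepseek-ai/DeepSeek-R1 --enable-reasoning --reasoning-parser deepseek_r1
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What is 9 * 13?"}],
)

message = resp.choices[0].message
# `reasoning_content` is assumed to carry the model's reasoning trace, separated
# from the final answer in `content`; getattr keeps the sketch safe if absent.
print(getattr(message, "reasoning_content", None))
print(message.content)
```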
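For offline scoring, a minimal sketch assuming an offline counterpart to the /score endpoint; the model name, the `task="score"` argument, and the output layout are assumptions:

```python
from vllm import LLM

# Hedged sketch: score a query/document pair offline. Model name, task argument,
# and the .outputs.score field are assumptions about this release.
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")

(output,) = llm.score(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
print(output.outputs.score)
```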
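For the new V1 metrics, a sketch of scraping the server's Prometheus endpoint; the `/metrics` path is standard for the OpenAI-compatible server, while the `vllm:` metric prefix and the specific TTFT/TPOT and cache-usage series are assumptions about the initial V1 logger:

```python
import requests

# Hedged sketch: fetch the Prometheus exposition text and print vLLM series only.
text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```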
🐛 Bug Fixes
- Fix missing seq_start_loc in xformers prefill metadata.
- Fix GPT2 GGUF inference.
- Fix Whisper quantization caused by the fake k_proj bias.
- Fix torch.compile for DeepSeek models.
- Fix ModelOpt model loading of k/v scales for Llama models.
- Fix alignment of arguments in convert_sparse_cross_attention_mask_to_dense.
- Fix pydantic logging validator.
- V1: Free encoder cache for aborted requests.
- V1: Add extra_keys to block_hash for prefix caching.
🔧 Affected Symbols
MLA Kernel, FP8 Kernels, ServingCompletion, MiniCPM-o, Qwen-VL, cutlass_scaled_mm, torch.compile, IterationStats, Prometheus