v3.3.0
📦 tgi
✨ 4 features · 🐛 15 fixes · 🔧 14 symbols
Summary
This release introduces prefill chunking for Vision-Language Models (VLMs) and brings numerous stability fixes across hardware backends such as Gaudi and NVIDIA L4. It also includes dependency bumps and model-specific support improvements.
Migration Steps
- If you rely on Prometheus metrics, the port is now configurable; update your scrape configuration if you move away from the default port (a quick reachability check is sketched below).
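As a sanity check after changing the port, the sketch below polls TGI's Prometheus `/metrics` endpoint. The host and port values are assumptions, not documented defaults; substitute whatever you configured via the new option.

```python
# Minimal sketch: verify the TGI metrics endpoint answers on the port you
# configured. The port below (9000) is an assumed value, not a documented default.
import urllib.request

def metrics_reachable(host: str = "localhost", port: int = 9000) -> bool:
    """Return True if the Prometheus /metrics endpoint responds on the given port."""
    url = f"http://{host}:{port}/metrics"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            # Prometheus exposition format usually carries '# HELP' / '# TYPE' comments.
            return resp.status == 200 and "# HELP" in body
    except OSError:
        return False

if __name__ == "__main__":
    print("metrics endpoint reachable:", metrics_reachable())
```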
✨ New Features
- Prefill chunking support for Vision-Language Models (VLMs); a conceptual sketch follows this list.
- Added flashinfer support for Gemma3 prefill.
- Added an option to configure the Prometheus port.
- IPEX support for FP8 KV cache, softcap, and sliding window.
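For the prefill-chunking items in this release, the sketch below illustrates the general technique only: the prompt's prefill is run in fixed-size chunks that share one KV cache, bounding peak memory for long prompts. The `forward` callback, the `KVCache` stand-in type, and the chunk size are hypothetical; TGI's actual implementation (and the extra care needed to keep VLM image-token spans inside a single chunk) lives in the server, not in this example.

```python
# Conceptual sketch of prefill chunking: process a long prompt in fixed-size
# chunks, threading the KV cache through so the result matches one big prefill.
from typing import Callable, List, Sequence, Tuple

KVCache = List[Tuple[int, int]]  # stand-in; a real cache holds key/value tensors

def chunked_prefill(
    token_ids: Sequence[int],
    forward: Callable[[Sequence[int], KVCache], KVCache],
    chunk_size: int = 256,
) -> KVCache:
    """Run prefill over `token_ids` chunk by chunk, reusing the growing KV cache."""
    kv_cache: KVCache = []
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        # Each chunk attends over the keys/values cached from earlier chunks.
        kv_cache = forward(chunk, kv_cache)
    return kv_cache

if __name__ == "__main__":
    # Dummy forward that just appends "cached" entries for each token.
    def fake_forward(chunk: Sequence[int], cache: KVCache) -> KVCache:
        return cache + [(t, 0) for t in chunk]

    assert len(chunked_prefill(list(range(1000)), fake_forward)) == 1000
```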
🐛 Bug Fixes
- Fixed Qwen 2.5 VL (32B) issues.
- Fixed tokenization issues related to text-embeddin...
- Cleaned up CUDA/ROCm code in the HPU backend and enabled flat_hpu for Gaudi.
- Fixed L4-related issues.
- Addressed vulnerability CVE-2024-6345 in setuptools < 70.0.
- Enabled transformers flash LLM/VLM support in IPEX.
- Hotfixed Gaudi dependencies and Gaudi2 compatibility with newer transformers versions.
- Fixed an issue where an OpenTelemetry trace ID was created instead of being read from request headers (sketched after this list).
- Fixed CI build issues.
- Fixed router and template handling for Qwen3.
- Skipped template handling for `{% generation %}` and `{% endgeneration %}`.
- Fixed mllama snapshot tests.
- Fixed HF_HUB_OFFLINE=1 behavior for the Gaudi backend.
- Ensured forward and tokenize chooser use the same shape.
- Implemented Chunked Prefill for VLMs.
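Regarding the OpenTelemetry fix above, the sketch below shows the intended behavior in spirit: reuse the trace ID from an incoming W3C `traceparent` header when the caller supplies one, and only mint a new one otherwise. The parsing is deliberately simplified and the function name is hypothetical; the actual fix is in TGI's router, and real services should use an OpenTelemetry propagator rather than hand-rolled parsing.

```python
# Sketch: honor an incoming W3C `traceparent` header instead of always creating
# a fresh trace ID. Format: "<version>-<trace-id 32 hex>-<parent-id 16 hex>-<flags>".
import re
import secrets
from typing import Mapping, Tuple

_TRACEPARENT = re.compile(r"^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$")

def trace_ids_for_request(headers: Mapping[str, str]) -> Tuple[str, str]:
    """Return (trace_id, parent_span_id), continuing the caller's trace when present."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match:
        return match.group(1), match.group(2)            # reuse the caller's trace
    return secrets.token_hex(16), secrets.token_hex(8)   # otherwise start a new one

# A request carrying a traceparent keeps its original trace ID.
hdrs = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
assert trace_ids_for_request(hdrs)[0] == "4bf92f3577b34da6a3ce929d0e0e4736"
```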
🔧 Affected Symbols
Qwen 2.5 VL (32B), text-embeddin..., Gaudi hpu backend, L4, setuptools, transformers ipex, Gemma3, flashinfer, opentelemetry, sccache, Qwen3 router/template, HF_HUB_OFFLINE=1, IPEX FP8 kvcache/softcap/sliding window, chooser shape logic