v3.1.1

📅 Mar 4, 2025 | 📦 tgi

✨ 9 features | 🐛 14 fixes | 🔧 2 symbols

Summary

This release focuses on backend expansion, adding support for Llamacpp, Neuron, and Gaudi backends, alongside significant improvements to Qwen VL handling and template features. It also includes various stability fixes and dependency updates.

✨ New Features

Added `strftime_now` callable function for `minijinja` chat templates.
Added `loop_controls` feature to `minijinja` to handle `{% break %}`.
Added Llamacpp backend support.
Added Neuron backend support.
Added Gaudi backend support.
Support sigmoid scoring function in GPTQ-MoE.
Added initial Qwen2.5-VL model and test.
Added parsing of the `HF_HUB_USER_AGENT_ORIGIN` environment variable for telemetry.
Added support for `HF_HUB_USER_AGENT_ORIGIN` to add a user-agent Origin field to Hub requests.
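As a rough illustration of the `HF_HUB_USER_AGENT_ORIGIN` entries above, the following is a minimal sketch of reading the variable and appending an origin field to a user-agent string. The helper name `build_user_agent`, the base string, and the `; origin/<value>` layout are assumptions made for this example; the release notes do not specify the exact format.

```python
import os


def build_user_agent(base, env=None):
    """Return `base`, with an origin field appended when the variable is set.

    `build_user_agent` is a hypothetical helper, not part of any library.
    """
    env = os.environ if env is None else env
    origin = env.get("HF_HUB_USER_AGENT_ORIGIN", "").strip()
    if not origin:
        # Variable unset or empty: leave the user-agent unchanged.
        return base
    # Assumed layout for illustration only: "<base>; origin/<value>".
    return f"{base}; origin/{origin}"


if __name__ == "__main__":
    print(build_user_agent("example-client/0.0"))
```

Passing a plain dictionary as `env` makes the behavior easy to check without modifying the process environment.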

🐛 Bug Fixes

Fixed TRTLLM CI build on release.
Fixed gcc version in impureWithCuda setup.
Improved Qwen VL implementation.
Fixed Triton issues.
Fixed Qwen VL breakage on Intel platforms.
Fixed flaky mllama test.
Prevented a single user from overloading the server by limiting requests.
Restored NCCL forced upgrade.
Fixed Qwen2 VL crash in continuous batching.
Simplified logs2 output.
Improved tool call message processing.
Fixed two edge cases in `RadixTrie::find`.
Fixed minor formatting and linter issues.
Avoided running neuron integration tests twice.

🔧 Affected Symbols

`RadixTrie::find`
`RadixAllocator`