TGI
Large Language Model Text Generation Inference
Release History
v3.3.7 (1 fix, 1 feature): Introduces support for limiting the image-fetching size and fixes an issue with automatic device count computation. The project is also entering maintenance mode.
v3.3.6 (2 fixes): A patch release focused on bug fixes, including a correction to flashinfer masking and the removal of Azure references, alongside minor documentation and code cleanup.
v3.3.5 (breaking; 5 fixes, 8 features): Brings significant hardware-acceleration updates, including a migration to Pydantic V2, XPU LoRA support, and various Gaudi optimizations for models such as Gemma 3 and DeepSeek V2. Also bumps core dependencies such as transformers and huggingface_hub.
v3.3.4 (1 fix, 2 features): Adds initial support for Gemma 3 models on Gaudi and fixes a bug affecting Neuron models exported with batch_size 1.
v3.3.3 (4 fixes, 1 feature): Updates the Neuron backend, including an SDK version bump, and adds support for the Qwen3_moe model on Gaudi. Several Gaudi-specific fixes and performance optimizations are also included.
v3.3.2 (3 fixes, 2 features): Focuses on Gaudi improvements, including OOM fixes and new hardware support, alongside an upgrade of the vllm extension operations and the addition of the Qwen3 model.
v3.3.1 (2 fixes, 2 features): Updates TGI to Torch 2.7 and CUDA 12.8, with HPU warmup-logic refinements, kernel updates, and bug fixes.
v3.3.0 (15 fixes, 4 features): Introduces prefill chunking for VLMs and numerous stability fixes across hardware backends such as Gaudi and L4, along with dependency bumps and model-specific support enhancements.
v3.2.3 (1 fix, 1 feature): Adds a patch for Llama 4 and updates underlying dependencies such as ROCm and transformers. Also fixes a compute-type typo.
v3.2.2 (1 fix, 3 features): Adds support for the Llama 4 model and a configurable termination timeout, and includes several fixes, notably for Gaudi hardware.
v3.2.1 (2 fixes, 2 features): Adds support for the Gemma 3 text model type and marks the official release of the Gaudi backend. Also includes updates needed for Triton kernel compilation and various bug fixes.
v3.2.0 (breaking; 6 fixes, 3 features): Adds support for the Gemma 3 model and significantly updates tool-calling behavior to align it more closely with OpenAI's specification, alongside various backend- and model-specific bug fixes.
v3.1.1 (14 fixes, 9 features): Focuses on backend expansion, adding Llamacpp, Neuron, and Gaudi backends, alongside significant improvements to Qwen VL handling and template features. Also includes various stability fixes and dependency updates.
v3.1.0 (4 fixes, 3 features): Adds full hardware support for DeepSeek R1 on AMD and NVIDIA, fp8 support for MoE models, and several stability fixes and dependency updates.
v3.0.2 (14 fixes, 11 features): Introduces a major new transformers backend that enables flash attention for otherwise unsupported models, and adds several new models including Cohere2 and OLMo variants. Numerous bug fixes target model-specific issues, VLM handling, and hardware acceleration across CUDA, ROCm, and XPU.
Common Errors
ModuleNotFoundError (4 reports): A "ModuleNotFoundError" in TGI usually indicates that a required Python package is missing from your environment. Identify the missing module from the error message and install it with `pip install <missing_module_name>`. Note that some modules, such as `punica_sgmv`, are custom kernels that ship with TGI's source tree and are built during a from-source install rather than installed from PyPI; if one of these is missing, rebuild the server following the installation docs.
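When a module is optional (for example, a hardware-specific kernel), a defensive import can surface a clearer message than a raw traceback. A minimal sketch; the helper name is illustrative and not part of TGI:

```python
import importlib

def optional_import(name: str):
    """Try to import an optional module; return None with a hint if it is absent."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError:
        print(f"Optional module '{name}' not found; "
              f"install it or rebuild TGI with its kernels enabled.")
        return None

# Returns None (with a hint) unless the kernel module is actually installed.
kernels = optional_import("punica_sgmv")
```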
NotImplementedError (4 reports): A NotImplementedError in TGI usually means a specific feature or model architecture hasn't been fully implemented yet. Check the TGI documentation or issue tracker for updates implementing the required functionality or for workarounds. If support is absent, you may need to contribute the missing implementation yourself or wait for the TGI team to add it.
LocalEntryNotFoundError (2 reports): This error is raised by huggingface_hub when a requested file (model weights, tokenizer, or config) is not present in the local cache and cannot be downloaded, typically because the machine is offline, `HF_HUB_OFFLINE=1` is set, or the model ID or revision is wrong. Verify the model ID and revision, ensure the server can reach the Hugging Face Hub, and pre-download the model (e.g. with `huggingface-cli download <model_id>`) if TGI must run without network access.
ZeroDivisionError (1 report): A ZeroDivisionError occurs when code divides by zero. Check the denominator at the location shown in the traceback and ensure it can never be zero, either by guarding the division with a conditional that substitutes a safe default value or by skipping the division altogether.
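The guard-with-default pattern can be sketched as follows (the function name is illustrative):

```python
def safe_div(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Divide, falling back to a default when the denominator is zero."""
    if denominator == 0:
        return default
    return numerator / denominator

print(safe_div(10, 2))  # 5.0
print(safe_div(10, 0))  # 0.0 -- the default, instead of a ZeroDivisionError
```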
BadRequestError (1 report): A "BadRequestError" in TGI often arises from inconsistencies between the client's request parameters and the API's expected input format, such as incorrect data types or missing required fields. Carefully review the API documentation and ensure your client requests match the expected schema, validating the data types, names, and presence of required parameters. Consider using a tool like Postman or a Python library with request validation during development to catch discrepancies before production deployment.
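A lightweight client-side check can catch malformed payloads before they hit the server. The sketch below assumes TGI's standard `/generate` request shape (`inputs` string plus a `parameters` object); the validator function itself is illustrative, not part of any library:

```python
def validate_generate_payload(payload: dict) -> list[str]:
    """Return a list of problems with a TGI /generate request body (empty if OK)."""
    errors = []
    if not isinstance(payload.get("inputs"), str):
        errors.append("'inputs' must be a string")
    params = payload.get("parameters", {})
    if not isinstance(params, dict):
        errors.append("'parameters' must be an object")
        return errors
    max_new = params.get("max_new_tokens")
    if max_new is not None and (isinstance(max_new, bool) or not isinstance(max_new, int) or max_new <= 0):
        errors.append("'max_new_tokens' must be a positive integer")
    temp = params.get("temperature")
    if temp is not None and (isinstance(temp, bool) or not isinstance(temp, (int, float)) or temp <= 0):
        errors.append("'temperature' must be a positive number")
    return errors

print(validate_generate_payload({"inputs": "Hello", "parameters": {"max_new_tokens": 20}}))  # []
```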
ConnectionResetError (1 report): A ConnectionResetError in TGI often arises when the server unexpectedly closes a connection while the client is still sending or receiving data, frequently due to timeouts or exceeded server limits. Implement keep-alive mechanisms on both client and server sides to maintain persistent connections, and adjust limits in the TGI configuration (e.g. `--max-concurrent-requests`, `--max-input-length`) to accommodate the expected workload. Also check for resource exhaustion on the server.
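On the client side, transient resets can be absorbed with a retry-and-backoff wrapper. A minimal sketch (the wrapper and its parameters are illustrative, not a TGI API):

```python
import time

def with_retries(fn, attempts: int = 3, backoff: float = 0.1):
    """Call fn(), retrying on ConnectionResetError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionResetError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(backoff * (2 ** attempt))

# Simulate a flaky request that fails once, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionResetError("server closed the connection")
    return "ok"

print(with_retries(flaky))  # ok
```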
Related AI & LLMs Packages
AutoGPT: the vision of accessible AI for everyone, to use and to build on. Its mission is to provide the tools so that you can focus on what matters.
Ollama: Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
LangChain: 🦜🔗 The platform for reliable agents.
ComfyUI: The most powerful and modular diffusion model GUI, API, and backend with a graph/nodes interface.
llama.cpp: LLM inference in C/C++.
GPT4All: Run local LLMs on any device. Open-source and available for commercial use.