llama.cpp
LLM inference in C/C++
Release History
b8797 (1 fix, 2 features): This release focuses heavily on optimizing Hexagon (HMX) performance by introducing asynchronous workers and queues to overlap computation stages, alongside fixing a race condition in the worker drain mechanism.
b8796 (1 fix): This release removes the deprecated ggml-ext.h file and corrects its placement, alongside providing numerous pre-compiled binaries for various operating systems and hardware configurations.
b8795 (1 fix): This release includes a fix for the FA support logic within the Metal backend and provides updated binaries for various operating systems and hardware configurations.
b8794 (2 fixes, 1 feature): This release introduces the new mtmd_image_tokens_get_decoder_pos() API and includes fixes for build issues and naming consistency.
b8793 (1 fix, 1 feature): This release focuses on Vulkan shader improvements by conditionally enabling RoundingModeRTE and refactoring the SPIRV-Headers fetching logic, resolving a build issue on Ubuntu.
b8792 (1 fix): This release re-enables macOS CI workflows and includes fixes for Vulkan compilation warnings, alongside providing numerous pre-built binaries for various platforms and hardware configurations.
b8791 (1 feature): This release introduces XIELU unary operation support for the Metal backend and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler across various hardware and acceleration configurations.
b8790: This release updates the internal BoringSSL dependency to version 0.20260413.0 and provides a comprehensive set of pre-built binaries for macOS, Linux, Windows, and openEuler across various architectures and acceleration backends.
b8789 (1 fix): This release addresses a specific bug in the ggml library concerning ARM NEON nvfp4 dot-product calculations on non-dotprod targets.
b8788 (1 fix): This release addresses a CMake warning on Windows/MSVC by adjusting policy settings in the build system. It also includes various pre-compiled binaries for multiple operating systems and hardware configurations.
b8787 (4 fixes, 1 feature): This release focuses on updates to the ggml-webgpu backend, specifically improving matmul accumulation precision and fixing several related bugs across different platforms.
b8786 (1 fix, 2 features): This release optimizes performance by conditionally creating the reasoning-budget sampler, ensuring backend sampling remains enabled when no token budget is set. It also preserves sampler creation when the grammar is lazy to maintain tool-usage compatibility.
b8785 (1 feature): This release introduces NVFP4 support within the Vulkan backend for several core tensor operations. It also provides a comprehensive set of pre-compiled binaries for diverse operating systems and hardware accelerators.
b8784 (1 fix, 1 feature): The server component has been updated to support the OpenAI /v1/audio/transcriptions API, alongside various platform-specific binary releases and a fix for the default response_format value.
b8783 (1 fix): This release primarily addresses parsing edge cases for the common/gemma4 component and provides updated binary distributions for numerous operating systems and hardware configurations.
b8781 (2 features): This release introduces dedicated support and an official template for the DeepSeek v3.2 model parser. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b8779 (3 fixes, 3 features): This release introduces Vulkan Flash Attention DP4A support for quantized KV caches using integer dot products and includes several fixes related to indexing and quantization checks in the Vulkan backend.
b8778 (2 features): This release introduces download cancellation and temporary-file cleanup features. It also provides updated pre-compiled binaries for various operating systems and hardware configurations.
b8777 (1 feature): This release exposes build_info when operating in router mode. It also provides numerous pre-compiled binaries for macOS, iOS, Linux, Windows, and openEuler targeting various CPU/GPU backends.
b8776 (1 fix, 1 feature): This release limits DeviceSegmentedSort to immediate mode due to CUDA graph capture limitations, ensuring stability when using CUDA graphs, and includes performance comparisons between the two sorting methods.
b8775 (1 feature): This release updates model processing for Gemma 4 audio by implementing causal attention. It also provides a comprehensive set of pre-compiled binaries for numerous platforms and hardware configurations.
b8772 (1 fix): This release removes an unnecessary conditional check related to debug mode and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various CPU/GPU architectures.
b8771: This release disables Q1_0 in the SYCL backend and cleans up unused variables within the backend code. It also provides updated binary distributions for macOS, Linux, Windows, and openEuler.
b8770 (1 fix): This release primarily addresses a stability issue in the mtmd module related to small-image handling. Various pre-compiled binaries for different operating systems and hardware configurations are provided.
b8769 (breaking; 2 fixes, 3 features): This release introduces comprehensive support for Qwen3 audio models (omni and ASR) and includes several internal fixes, notably removing the deepstack dependency for audio.
b8766 (4 fixes, 6 features): This release introduces support for the Gemma 4 audio conformer encoder, detailing its specific architecture and preprocessing steps. Several internal fixes were implemented related to tensor loading and mask matching.
b8763: This release provides updated pre-compiled binaries for macOS, Linux (including specialized builds for ROCm 7.2 and OpenVINO), and Windows across multiple hardware architectures and acceleration backends.
b8762 (2 fixes, 4 features): This release introduces comprehensive support for the MERaLiON-2 multimodal audio model, including its specific architecture components and supported tasks. It also includes minor cleanups in the MERaLiON adaptor comments.
b8761 (1 fix, 3 features): This release introduces basic support for the q5_k quantization format on OpenCL, including the necessary matrix-operation implementations and associated unit-test fixes. It also provides a wide array of pre-compiled binaries for diverse operating systems and hardware accelerators.
b8760 (1 fix): This release primarily addresses a bug related to data splitting for the Qwen 3 Next model. It also provides updated binary distributions for macOS, Linux, Windows, and openEuler platforms.
b8759 (1 fix): This release addresses missing cases for GGML_TYPE_Q1_0 within the ggml library and provides a comprehensive set of updated pre-compiled binaries for diverse operating systems and hardware configurations.
b8757: This release primarily focuses on distributing pre-built binaries for numerous platforms, including macOS, Linux (with Vulkan, ROCm, and OpenVINO support), and Windows (with CUDA, Vulkan, SYCL, and HIP support). A minor change was made in CUDA to store node->src ne/nb for graph equality checks.
b8756 (1 fix): This release addresses a bug in structured output generation when JSON schema $refs are used. It also provides numerous pre-compiled binaries for different platforms.
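To illustrate the class of input involved in the $ref fix above, the sketch below shows a JSON schema that uses a local `$ref` into `$defs`, and a minimal pointer-resolution step of the kind a schema-to-grammar converter must perform. The schema and helper are illustrative, not llama.cpp's actual implementation.

```python
# Hypothetical schema of the shape that exercises $ref handling in
# structured output: "person" is defined once and referenced by pointer.
schema = {
    "type": "object",
    "properties": {
        "person": {"$ref": "#/$defs/person"},
    },
    "$defs": {
        "person": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
        }
    },
}

def resolve_local_ref(root: dict, ref: str) -> dict:
    """Follow a local JSON pointer like '#/$defs/person' into the schema."""
    assert ref.startswith("#/"), "only local refs handled in this sketch"
    node = root
    for part in ref[2:].split("/"):
        node = node[part]
    return node

target = resolve_local_ref(schema, schema["properties"]["person"]["$ref"])
print(target["properties"]["name"]["type"])  # -> string
```

A converter that skips this resolution step would see the bare `{"$ref": ...}` object and emit a grammar with no constraints for that property, which matches the kind of bug described above.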
b8755 (1 fix, 3 features): This release focuses heavily on expanding hardware and OS support, particularly adding Linux-on-Snapdragon support via the Hexagon backend, and updating build configurations and documentation.
b8754 (8 fixes, 10 features): This release significantly improves Hexagon performance through op request batching, a buffer-management rewrite, and explicit L2 cache control. It also removes the deprecated GGML_HEXAGON_EXPERIMENTAL environment variable.
b8753 (1 feature): This release updates common components to align with the official gemma4 template and provides numerous pre-compiled binaries for macOS, Linux, Windows, and openEuler across various CPU/GPU architectures.
b8752 (1 feature): This release introduces a new callback interface for monitoring download progress and provides updated binary distributions for macOS, Linux, Windows, and openEuler across various hardware architectures and acceleration backends.
b8751 (1 feature): This release updates Gemma 4 model loading, making shared-KV tail attention tensors optional. It also provides a comprehensive set of pre-compiled binaries for diverse platforms and hardware configurations.
b8750 (1 feature): This release introduces support for non-square subgroup matrix configurations in ggml-webgpu for Intel GPUs and provides updated pre-compiled binaries across macOS, Linux, Windows, and openEuler platforms.
b8749 (9 fixes, 1 feature): This release addresses numerous quantization precision issues within ggml WebGPU, especially concerning f16 stability and NaN handling. It also improves backend lifecycle management for WebGPU and cleans up deprecated code.
b8748 (1 fix): This release fixes an issue in llama-server where the --alias flag conflicted with model presets, and provides updated pre-compiled binaries for broad platform compatibility.
b8747 (1 fix): This release includes a fix for loading cached Hugging Face models when the API is unavailable and provides updated binary distributions for macOS, Linux, Windows, and openEuler.
b8746 (1 feature): This release marks the --split-mode tensor flag as experimental and provides a comprehensive set of pre-built binaries for numerous platforms and hardware accelerators.
b8744 (2 fixes, 1 feature): This release enables the reasoning-budget sampler for gemma4 by updating parameter initialization and fixing a related parsing issue in the thought-block handling.
b8742 (2 features): This release introduces support for the Q1_0 quantization format within the Vulkan backend and updates internal dependency usage by incorporating 'get_dm'.
b8741 (1 feature): This release makes progress-bar updates smoother and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler targeting various CPU/GPU backends.
b8740 (1 feature): This release improves performance on CUDA devices via kernel fusion for multiplications and provides updated binary distributions for numerous platforms and hardware configurations.
b8739 (3 features): This release introduces support for the AMD CDNA4 architecture (gfx950) for MI350X/MI355X accelerators, adjusting matrix multiplication paths accordingly. Various pre-compiled binaries for different operating systems and hardware configurations are also provided.
b8738 (16 fixes, 9 features): This release introduces experimental backend-agnostic tensor parallelism in ggml, supporting models like GPT-OSS and Qwen 3 MoE across multiple GPUs. Numerous bug fixes address stability, quantization handling, and backend-specific issues across Vulkan, Metal, ROCm, and various model implementations.
b8737 (1 fix): This release includes a stability fix in ggml by checking the return values of CUB calls used in argsort and top-k operations. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b8734 (2 fixes): This release addresses an ambiguous grammar rule in gemma4 and resolves a missing-comma issue. It also provides extensive pre-compiled binaries for macOS, Linux, Windows, and openEuler platforms.
b8733 (3 fixes, 1 feature): This release simplifies autoparser tagged parser rules and fixes several issues related to parameter initialization and optional-argument ordering within the continuation logic. It also provides numerous pre-compiled binaries for diverse hardware and operating systems.
b8732 (1 fix): This release primarily fixes the multimodal padding token for gemma3n/gemma4 models and includes minor nits.
b8731 (1 fix, 3 features): This release introduces support for dots in model names within the mtmd feature and adds GGUF conversion capabilities, alongside various platform-specific binary updates.
b8730 (1 fix, 1 feature): This release introduces new tokenizer tests for Gemma 4, fixes an associated edge case, and includes minor internal code cleanup.
b8729 (1 fix, 4 features): This release enhances the Jinja engine with Python-style string repetition, improved ASCII handling in tojson, and identity operations for int/float types. A bug related to invalid UTF-8 byte escaping during JSON serialization was also fixed.
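The b8729 template features mirror familiar Python semantics. The sketch below shows, in plain Python, the behavior a Jinja-style engine is expected to reproduce; the template syntax in the comments is illustrative, and the `ensure_ascii` flag stands in for whatever option the engine's tojson filter actually uses.

```python
import json

# What a template expression like {{ "ab" * 3 }} should render:
repeated = "ab" * 3           # Python-style string repetition
print(repeated)               # -> ababab

# tojson-style serialization: non-ASCII either escaped or kept verbatim
print(json.dumps("héllo", ensure_ascii=True))   # -> "h\u00e9llo"
print(json.dumps("héllo", ensure_ascii=False))  # -> "héllo"
```

The UTF-8 escaping bug mentioned above is in the second category: serializing a string whose bytes are not valid UTF-8 must not emit broken escape sequences.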
b8728 (1 feature): This release adds missing mm-id specializations for q1_0 within the Metal backend and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler platforms.
b8726 (1 fix): This release includes a fix for grammar command-line arguments in the server component and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various hardware architectures and accelerators.
b8724 (3 fixes, 2 features): This release enhances SYCL performance by adding Flash Attention support for head size 512 and cleans up backend initialization logic. It also removes defunct mxfp4 reordering logic.
b8722 (1 feature): This release unifies Vulkan type macros to use Vx instead of the older _VECx convention. It also provides extensive pre-built binaries for diverse hardware and operating-system configurations.
b8721 (1 fix): This release addresses an issue where GGUF split files were incorrectly ordered during model selection. It ensures robustness by skipping non-primary split files regardless of their listing order.
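The b8721 selection rule can be sketched as follows, assuming the usual GGUF split naming convention (`name-00001-of-00003.gguf`); the helper and listing are illustrative, not the actual llama.cpp code.

```python
import re

# Matches the GGUF split suffix: -<index>-of-<count>.gguf
SPLIT_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")

def primary_splits(paths):
    """Keep unsplit models and first shards, skipping non-primary splits
    regardless of where they appear in the listing."""
    keep = []
    for p in paths:
        m = SPLIT_RE.search(p)
        if m is None or m.group(1) == "00001":
            keep.append(p)
    return keep

listing = [
    "llama-00003-of-00003.gguf",   # non-primary shard, deliberately listed first
    "llama-00001-of-00003.gguf",   # primary shard
    "llama-00002-of-00003.gguf",   # non-primary shard
    "tiny.gguf",                   # unsplit model
]
print(primary_splits(listing))     # -> ['llama-00001-of-00003.gguf', 'tiny.gguf']
```

The key property is that the filter depends only on each filename's shard index, not on listing order, which is what makes the selection robust to shuffled directory listings.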
b8720 (1 fix): This release includes an internal fix for CUDA equality checks by storing source data pointers and provides updated pre-compiled binaries across multiple platforms and accelerators.
b8719 (1 fix): This release addresses a significant memory leak in the optimization-context freeing routine (ggml_opt_free) by ensuring the per-batch context copy (ctx_copy) is properly released.
b8718 (1 feature): This release introduces a server-side feature to respect the ignore-eos flag and provides numerous pre-compiled binaries for diverse operating systems and hardware configurations.
b8717 (1 feature): This release updates binary builds for numerous platforms (macOS, Linux, Windows, openEuler) and implements a vocabulary change by removing the </s> eog token for Gemma 4 models.
b8715 (1 fix): This release primarily addresses minor documentation issues by fixing a couple of typos. It also provides updated binary distributions across various operating systems and hardware configurations.
b8714 (1 fix): This release enhances KV-cache quantization checks to correctly account for enabled flash attention. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b8713 (1 feature): This release introduces adapter-support querying for the WebGPU backend and provides updated binary distributions for numerous operating systems and hardware configurations.
b8712 (4 features): This release introduces the initial Q1_0 Metal backend, including kernel tuning and associated testing infrastructure. It also provides a wide array of pre-compiled binaries for macOS, Linux, Windows, and openEuler platforms.
b8711 (3 features): This release introduces architectural optimizations for the Gemma model by restructuring projection operations within the first layer and before the main layer loop. Pre-compiled binaries are provided for various operating systems and hardware configurations.
b8710 (1 fix): This release updates the debug example to disable the cb_eval callback when using the --save-logits flag, reducing noise in the output. It also provides numerous pre-compiled binaries for macOS, Linux, Windows, and openEuler.
b8709 (1 fix): This release includes a fix for MiniMax handling within the parser and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler across various CPU/GPU configurations.
b8708 (1 fix, 1 feature): This release focuses on updating pre-compiled binaries across multiple operating systems and hardware targets, including new support for ROCm 7.2 and CUDA 13.1, alongside minor test cleanup.
b8705 (1 fix, 6 features): This release introduces support for the step3-vl-10b model and includes several internal optimizations and refactorings, such as using fused QKV and updating parameter handling for MmprojModel.
b8703 (3 features): This release introduces KleidiAI-enabled ARM artifacts for macOS and standardizes the macOS release build process. It also provides an extensive set of pre-compiled binaries for various Linux, Windows, and openEuler configurations.
b8702 (3 features): This release significantly speeds up CUDA graph property checks using hashing and introduces optimizations like 'seen node' and 'memcp'. It also provides extensive pre-built binaries for diverse operating systems and hardware configurations.
b8701 (3 fixes, 2 features): This release introduces performance improvements for q4_0 and q4_1 mmq kernels on AMD GPUs via a ds_read_b128 optimization and includes various bug fixes and code cleanup in the CUDA implementation.
b8699 (1 fix, 1 feature): This release introduces support for attention rotation within the kv-cache for heterogeneous iSWA configurations and removes an unnecessary assertion.
b8698 (2 fixes, 6 features): This release focuses heavily on optimizing and stabilizing the ggml-webgpu backend by parameterizing submission sizes, adding iOS limits, and removing internal deadlocks. Internal refactoring includes moving types and simplifying profiling futures.
b8697 (1 fix): This release introduces a safety check in CUDA operations to prevent buffer overlap during fusion and provides numerous updated pre-built binaries across different operating systems and hardware configurations.
b8696 (1 fix): This release addresses a bug in llama-server where model parameters were not being propagated correctly. It also provides numerous pre-compiled binaries for different operating systems and hardware setups.
b8694 (breaking): This release removes per-architecture tensor name lists from the llama component and provides updated binary distributions for numerous operating systems and hardware configurations.
b8693 (1 fix): This release addresses a specific bug in the server component related to checkpoint restoration when pos_min is 0. It also provides a comprehensive set of pre-compiled binaries for different operating systems and hardware architectures.
b8692 (1 fix): This release deprecates the GGML_OP_ADD1 operation and includes minor internal cleanup, alongside providing updated pre-built binaries for numerous operating systems and hardware configurations.
b8691 (2 features): This release introduces Vulkan build support for ggml on Linux and improves error reporting for fork failures. It also provides a comprehensive set of pre-compiled binaries across multiple platforms and hardware configurations.
b8690 (3 features): This release introduces support for FA dequantization of Q4_1, Q5_0, Q5_1, and IQ4_NL formats within the Vulkan backend. Various pre-compiled binaries for different operating systems and hardware configurations are provided.
b8688 (1 fix): This release primarily fixes an incorrect compute-capability constant for CDNA2 (gfx90a/MI210) within the ggml-cuda backend. It also provides updated pre-compiled binaries across multiple platforms.
b8685 (1 fix, 2 features): This release introduces a significant Q8_0 reorder optimization for the SYCL backend, improving performance on Intel Arc hardware, and fixes a bug that prevented this optimization from activating for Q8_0 tensors.
b8683 (1 feature): This release introduces support for the MUL_MAT_ID operation within the ggml-webgpu backend and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler platforms.
b8682 (breaking; 2 fixes, 2 features): This release introduces Q1_0 1-bit quantization support for the CPU, involving renaming and removing specific quantization variants and fixing related enum issues.
b8681 (breaking; 1 fix): This release fixes an issue where newline characters were incorrectly stripped in multiline input for llama-cli and includes an internal change replacing '&' with 'string_view'.
b8680 (4 features): This release introduces an optimized flash_attn_stream_k_fixup kernel for CUDA to improve performance under specific conditions. It also provides updated pre-compiled binaries across multiple operating systems and hardware targets.
b8679 (1 feature): This release updates the llama-bench tool by adding new command-line arguments (`-fitc` and `-fitt`) and provides a comprehensive set of pre-compiled binaries for numerous operating systems and hardware configurations.
b8678 (1 feature): This release introduces byte-token handling support for the Gemma4 BPE detokenizer and provides updated pre-compiled binaries for numerous operating systems and hardware configurations.
b8676 (1 fix): This release fixes an issue in the server's chunked stream provider to correctly handle unsuccessful writes to the sink, ensuring stream abortion on connection failure. It also provides updated pre-compiled binaries for numerous platforms.
b8672 (1 feature): This release includes a minor optimization for argsort output initialization within the Hexagon component and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various CPU/GPU architectures.
b8671 (3 fixes): This release focuses on fixing issues related to GGUF boolean metadata loading, ensuring platform-independent behavior for BOOL metadata within the llama component.
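The b8671 issue class is a classic portability trap: a serialized BOOL occupies a fixed one-byte slot in the file, so reading it must use an explicit one-byte format rather than a native-width C `bool` or `int`, whose size and byte order vary by platform. The sketch below illustrates the portable read; the field layout is illustrative, not the actual GGUF loader code.

```python
import struct

raw = b"\x01"                      # serialized one-byte BOOL metadata value

# Portable: explicit little-endian one-byte unsigned read.
# No native type sizes or host endianness are involved, so the result
# is identical on every platform.
(value,) = struct.unpack("<B", raw)
flag = bool(value)
print(flag)                        # -> True
```

Any byte other than zero reads back as `True`; a loader that instead reinterpreted a wider native integer over those bytes could pick up adjacent data and flip the flag depending on the platform.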
b8670 (2 fixes, 6 features): This release introduces comprehensive support for HunyuanOCR models, including vision capabilities and a new chat template. It also includes various fixes related to token IDs and tensor mappings during conversion.
b8668 (2 fixes): This release updates llama-server startup logging to remove redundancy and include build/commit information. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b8665 (6 fixes, 5 features): This release introduces specialized parsing and improved tool-call handling for Gemma 4, alongside various internal cleanups and bug fixes.
Common Errors
NotImplementedError (2 reports): NotImplementedError in llama-cpp often arises when attempting to use a feature or model architecture that hasn't yet been fully implemented in the conversion or evaluation code. To resolve it, either update to the latest version of llama-cpp, which may include the necessary implementation, or contribute the missing functionality by implementing the required logic for the specific operator or model architecture and submitting a pull request. If an update is not available, switching to a model known to work can also provide a workaround.
DeviceLostError (2 reports): DeviceLostError in llama-cpp usually indicates the GPU lost connection or encountered a critical error, often due to out-of-memory conditions or driver instability, especially with Vulkan. Try reducing the model size, batch size, or number of threads to decrease GPU memory usage, or update to the latest GPU drivers to rule out driver bugs. Consider using a different backend, such as CUDA or Metal, if available, to circumvent Vulkan-specific problems.
InternalServerError (2 reports): InternalServerError in llama-cpp often arises from unsupported model architectures or operations, such as attempting multimodal input with a model not designed for it or faulty tool calling within a specific model. To resolve it, verify model compatibility with the requested operation, and update llama-cpp to the latest version or use a compatible model known to work with multimodal inputs or tool calling. If issues persist, inspect the model's configuration, particularly its handling of vision or function calling, and revise your prompts accordingly.
FileNotFoundError (1 report): A FileNotFoundError in llama-cpp usually means a required file path, often a model or tokenizer component, isn't valid or the file doesn't exist at that location. Double-check the path specified in your command-line arguments or configuration files for typos, and ensure the necessary files are actually present in the indicated directory. If converting from Hugging Face, ensure all necessary files, such as "tokenizer.model", were downloaded correctly.
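A cheap way to avoid the FileNotFoundError case entirely is a pre-flight check on the model path before handing it to the loader, so the failure is immediate and the message names the offending path. The helper below is a minimal sketch; the filename is hypothetical.

```python
import os

def check_model_path(path: str) -> str:
    """Fail early with a clear message instead of a deep FileNotFoundError
    from inside the model loader."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"model file not found: {path!r} - check for typos and confirm "
            "the download (e.g. tokenizer.model for HF conversion) completed"
        )
    return path

try:
    check_model_path("definitely-missing.gguf")   # hypothetical bad path
except FileNotFoundError as e:
    print("caught:", e)
```

Running the check up front also makes the error easy to distinguish from genuine loader bugs, since the path is validated before any llama-cpp code runs.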
Related AI & LLMs Packages
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
🦜🔗 The platform for reliable agents.
The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
A high-throughput and memory-efficient inference and serving engine for LLMs