
llama.cpp


LLM inference in C/C++

Latest: b8797 · 100 releases · 4 breaking changes · 4 common errors

Release History

b8797 (1 fix, 2 features)
3h ago

This release focuses heavily on optimizing Hexagon (HMX) performance by introducing asynchronous workers and queues to overlap computation stages, alongside fixing a race condition in the worker drain mechanism.

b8796 (1 fix)
4h ago

This release removes the deprecated ggml-ext.h file and corrects its placement, alongside providing numerous pre-compiled binaries for various operating systems and hardware configurations.

b8795 (1 fix)
7h ago

This release includes a fix for the FA support logic within the metal backend and provides updated binaries for various operating systems and hardware configurations.

b8794 (2 fixes, 1 feature)
8h ago

This release introduces the new mtmd_image_tokens_get_decoder_pos() API and includes fixes for build issues and naming consistency.

b8793 (1 fix, 1 feature)
10h ago

This release focuses on Vulkan shader improvements by conditionally enabling RoundingModeRTE and refactors SPIRV-Headers fetching logic, resolving a build issue on Ubuntu.

b8792 (1 fix)
11h ago

This release re-enables macOS CI workflows and includes fixes for Vulkan compilation warnings, alongside providing numerous pre-built binaries for various platforms and hardware configurations.

b8791 (1 feature)
12h ago

This release introduces the XIELU unary operation support for the metal backend and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler across various hardware and acceleration configurations.

b8790
13h ago

This release updates the internal BoringSSL dependency to version 0.20260413.0 and provides a comprehensive set of pre-built binaries for macOS, Linux, Windows, and openEuler across various architectures and acceleration backends.

b8789 (1 fix)
14h ago

This release addresses a specific bug in the ggml library concerning ARM NEON nvfp4 dot product calculations on non-dotprod targets.

b8788 (1 fix)
14h ago

This release addresses a CMake warning on Windows/MSVC by adjusting policy settings in the build system. It also includes various pre-compiled binaries for multiple operating systems and hardware configurations.

b8787 (4 fixes, 1 feature)
15h ago

This release focuses on updates to the ggml-webgpu backend, specifically improving matmul accumulation precision and fixing several related bugs across different platforms.

b8786 (1 fix, 2 features)
15h ago

This release optimizes performance by conditionally creating the reasoning budget sampler, ensuring backend sampling remains enabled when no token budget is set. It also preserves sampler creation when grammar is lazy to maintain tool usage compatibility.

b8785 (1 feature)
16h ago

This release introduces NVFP4 support within the Vulkan backend for several core tensor operations. It also provides a comprehensive set of pre-compiled binaries for diverse operating systems and hardware accelerators.

b8784 (1 fix, 1 feature)
17h ago

The server component has been updated to support the OpenAI /v1/audio/transcriptions API, alongside various platform-specific binary releases and a fix for the default response_format value.

b8783 (1 fix)
Apr 14, 2026

This release primarily addresses parsing edge cases for the common/gemma4 component and provides updated binary distributions for numerous operating systems and hardware configurations.

b8781 (2 features)
Apr 13, 2026

This release introduces dedicated support and an official template for the DeepSeek v3.2 model parser. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b8779 (3 fixes, 3 features)
Apr 13, 2026

This release introduces Vulkan Flash Attention DP4A support for quantized KV caches using integer dot products and includes several fixes related to indexing and quantization checks in the Vulkan backend.

b8778 (2 features)
Apr 13, 2026

This release introduces download cancellation and temporary file cleanup features. It also provides updated pre-compiled binaries for various operating systems and hardware configurations.

b8777 (1 feature)
Apr 13, 2026

This release introduces the exposure of build_info when operating in router mode. It also provides numerous pre-compiled binaries for macOS, iOS, Linux, Windows, and openEuler targeting various CPU/GPU backends.

b8776 (1 fix, 1 feature)
Apr 13, 2026

This release limits DeviceSegmentedSort to immediate mode due to CUDA graph capture limitations, ensuring stability when using CUDA graphs, and includes performance comparisons between the two sorting methods.

b8775 (1 feature)
Apr 13, 2026

This release updates the model processing for Gemma 4 audio by implementing causal attention. It also provides a comprehensive set of pre-compiled binaries for numerous platforms and hardware configurations.

b8772 (1 fix)
Apr 13, 2026

This release removes an unnecessary conditional check related to debug mode and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various CPU/GPU architectures.

b8771
Apr 13, 2026

This release disables Q1_0 in the SYCL backend and performs cleanup of unused variables within the backend code. It also provides updated binary distributions for macOS, Linux, Windows, and openEuler.

b8770 (1 fix)
Apr 12, 2026

This release primarily addresses a stability issue in the mtmd module related to small image handling. Various pre-compiled binaries for different operating systems and hardware configurations are provided.

b8769 (breaking, 2 fixes, 3 features)
Apr 12, 2026

This release introduces comprehensive support for Qwen3 audio models (omni and ASR) and includes several internal fixes, notably removing the deepstack dependency for audio.

b8766 (4 fixes, 6 features)
Apr 12, 2026

This release introduces support for the Gemma 4 audio conformer encoder, detailing its specific architecture and preprocessing steps. Several internal fixes were implemented related to tensor loading and mask matching.

b8763
Apr 11, 2026

This release provides updated pre-compiled binaries for macOS, Linux (including specialized builds for ROCm 7.2 and OpenVINO), and Windows across multiple hardware architectures and acceleration backends.

b8762 (2 fixes, 4 features)
Apr 11, 2026

This release introduces comprehensive support for the MERaLiON-2 multimodal audio model, including its specific architecture components and supported tasks. It also includes minor cleanups in the MERaLiON adaptor comments.

b8761 (1 fix, 3 features)
Apr 11, 2026

This release introduces basic support for the q5_k quantization format on OpenCL, including necessary matrix operation implementations and associated unit test fixes. It also provides a wide array of pre-compiled binaries for diverse operating systems and hardware accelerators.

b8760 (1 fix)
Apr 11, 2026

This release primarily addresses a bug related to data splitting for the Qwen 3 Next model. It also provides updated binary distributions for macOS, Linux, Windows, and openEuler platforms.

b8759 (1 fix)
Apr 11, 2026

This release addresses missing cases for GGML_TYPE_Q1_0 within the ggml library and provides a comprehensive set of updated pre-compiled binaries for diverse operating systems and hardware configurations.

b8757
Apr 11, 2026

This release primarily focuses on distributing pre-built binaries for numerous platforms including macOS, Linux (with Vulkan, ROCm, OpenVINO support), and Windows (with CUDA, Vulkan, SYCL, HIP support). A minor change was made in CUDA to store node->src ne/nb for graph equality.

b8756 (1 fix)
Apr 11, 2026

This release addresses a bug related to structured output generation when JSON schema $refs are utilized. It also provides numerous pre-compiled binaries for different platforms.
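Structured output via JSON schema commonly reaches shared definitions through local "$ref" pointers, which is exactly the shape of schema that triggered this class of bug. The sketch below is hypothetical (the schema and helper are illustrative, not taken from the release): it inlines local "#/$defs/..." references so a ref-free consumer can still use the schema.

```python
import copy

def inline_local_refs(node, root):
    """Replace local '#/$defs/...' $refs with the referenced definition.
    Hypothetical helper: handles only local, non-recursive references."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/"):
            target = root
            for key in ref[2:].split("/"):
                target = target[key]  # walk the JSON pointer segments
            return inline_local_refs(copy.deepcopy(target), root)
        return {k: inline_local_refs(v, root) for k, v in node.items()}
    if isinstance(node, list):
        return [inline_local_refs(v, root) for v in node]
    return node

# A response schema that reuses one definition in two places via $ref.
schema = {
    "type": "object",
    "properties": {
        "home": {"$ref": "#/$defs/address"},
        "work": {"$ref": "#/$defs/address"},
    },
    "$defs": {
        "address": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        }
    },
}

flat = inline_local_refs(schema, schema)
```

After inlining, both "home" and "work" carry the full address definition, so the schema no longer depends on $ref resolution downstream.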

b8755 (1 fix, 3 features)
Apr 11, 2026

This release focuses heavily on expanding hardware and OS support, particularly adding Linux on Snapdragon support via the hexagon backend, and updating build configurations and documentation.

b8754 (8 fixes, 10 features)
Apr 11, 2026

This release significantly improves hexagon performance through op request batching, buffer management rewrite, and explicit L2 cache control. It also removes the deprecated GGML_HEXAGON_EXPERIMENTAL environment variable.

b8753 (1 feature)
Apr 11, 2026

This release updates common components to align with the official gemma4 template and provides numerous pre-compiled binaries for macOS, Linux, Windows, and openEuler across various CPU/GPU architectures.

b8751 (1 feature)
Apr 11, 2026

This release introduces an update to Gemma 4 model loading, making shared-KV tail attention tensors optional. It also provides a comprehensive set of pre-compiled binaries for diverse platforms and hardware configurations.

b8752 (1 feature)
Apr 11, 2026

This release introduces a new callback interface for monitoring download progress and provides updated binary distributions for macOS, Linux, Windows, and openEuler across various hardware architectures and acceleration backends.

b8750 (1 feature)
Apr 10, 2026

This release introduces support for non-square subgroup matrix configurations in ggml-webgpu for Intel GPUs and provides updated pre-compiled binaries across macOS, Linux, Windows, and openEuler platforms.

b8749 (9 fixes, 1 feature)
Apr 10, 2026

This release addresses numerous quantization precision issues within ggml WebGPU, especially concerning f16 stability and NaN handling. It also improves backend lifecycle management for WebGPU and cleans up deprecated code.

b8748 (1 fix)
Apr 10, 2026

This release fixes an issue in llama-server where the --alias flag conflicted with model presets, and provides updated pre-compiled binaries for broad platform compatibility.

b8747 (1 fix)
Apr 10, 2026

This release includes a fix for loading cached Hugging Face models when the API is unavailable and provides updated binary distributions for macOS, Linux, Windows, and openEuler.

b8746 (1 feature)
Apr 10, 2026

This release introduces the experimental status for the --split-mode tensor flag and provides a comprehensive set of pre-built binaries for numerous platforms and hardware accelerators.

b8744 (2 fixes, 1 feature)
Apr 10, 2026

This release enables the reasoning budget sampler for gemma4 by updating parameter initialization and fixing a related parsing issue in the thought block handling.

b8742 (2 features)
Apr 10, 2026

This release introduces support for Q1_0 quantization format within the Vulkan backend and updates internal dependency usage by incorporating 'get_dm'.

b8741 (1 feature)
Apr 10, 2026

This release makes progress-bar updates smoother and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler targeting various CPU/GPU backends.

b8740 (1 feature)
Apr 10, 2026

This release introduces performance improvements on CUDA devices via kernel fusion for multiplications and provides updated binary distributions for numerous platforms and hardware configurations.

b8739 (3 features)
Apr 9, 2026

This release introduces support for the AMD CDNA4 architecture (gfx950) for MI350X/MI355X accelerators, adjusting matrix multiplication paths accordingly. Various pre-compiled binaries for different operating systems and hardware configurations are also provided.

b8738 (16 fixes, 9 features)
Apr 9, 2026

This release introduces experimental backend-agnostic tensor parallelism in ggml, supporting models like GPT-OSS and Qwen 3 MoE across multiple GPUs. Numerous bug fixes address stability, quantization handling, and backend-specific issues across Vulkan, Metal, ROCm, and various model implementations.

b8734 (2 fixes)
Apr 9, 2026

This release addresses an ambiguous grammar rule in gemma4 and resolves a missing comma issue. It also provides extensive pre-compiled binaries for macOS, Linux, Windows, and openEuler platforms.

b8737 (1 fix)
Apr 9, 2026

This release includes a stability fix in ggml by checking the return values of CUB calls used in argsort and top-k operations. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b8733 (3 fixes, 1 feature)
Apr 9, 2026

This release simplifies autoparser tagged parser rules and fixes several issues related to parameter initialization and optional argument ordering within the continuation logic. It also provides numerous pre-compiled binaries for diverse hardware and operating systems.

b8732 (1 fix)
Apr 9, 2026

This release fixes the multimodal padding token for gemma3n/gemma4 models and includes minor code cleanups.

b8730 (1 fix, 1 feature)
Apr 9, 2026

This release introduces new tokenizer tests for Gemma 4, fixes an associated edge case, and includes minor internal code cleanup.

b8731 (1 fix, 3 features)
Apr 9, 2026

This release introduces support for dots in model names within the mtmd feature and adds GGUF conversion capabilities, alongside various platform-specific binary updates.

b8729 (1 fix, 4 features)
Apr 9, 2026

This release enhances the Jinja engine with Python-style string repetition, improved ASCII handling in tojson, and identity operations for int/float types. A bug related to invalid UTF-8 byte escaping during JSON serialization was also fixed.

b8728 (1 feature)
Apr 9, 2026

This release introduces missing mm-id specializations for q1_0 within the metal backend and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler platforms.

b8726 (1 fix)
Apr 9, 2026

This release includes a fix for grammar commandline arguments in the server component and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various hardware architectures and accelerators.

b8724 (3 fixes, 2 features)
Apr 9, 2026

This release enhances SYCL performance by adding Flash Attention support for head size 512 and cleans up backend initialization logic. It also removes defunct mxfp4 reordering logic.

b8722 (1 feature)
Apr 9, 2026

This release unifies Vulkan type macros to use Vx instead of the older _VECx convention. It also provides extensive pre-built binaries for diverse hardware and operating system configurations.

b8721 (1 fix)
Apr 9, 2026

This release addresses an issue where GGUF split files were incorrectly ordered during model selection. It ensures robustness by skipping non-primary split files regardless of their listing order.
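llama.cpp shards large models into files named with a numbered suffix (for example model-00001-of-00003.gguf), and only the first shard should be offered for selection. A minimal sketch of that filtering logic, assuming the numbered-suffix convention; this is illustrative, not the project's actual implementation:

```python
import re

# Matches the split-file suffix '-NNNNN-of-NNNNN.gguf'.
SPLIT_SUFFIX = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")

def is_primary_file(filename):
    """True for unsplit models and for the first shard of a split model,
    regardless of the order files are listed in."""
    m = SPLIT_SUFFIX.search(filename)
    return m is None or m.group(1) == "00001"

files = [
    "model-00002-of-00003.gguf",  # deliberately listed out of order
    "model-00001-of-00003.gguf",
    "model-00003-of-00003.gguf",
]
primaries = [f for f in files if is_primary_file(f)]
```

Filtering on the suffix rather than on list position is what makes the selection robust to out-of-order listings.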

b8720 (1 fix)
Apr 9, 2026

This release includes an internal fix for CUDA equality checks by storing source data pointers and provides updated pre-compiled binaries across multiple platforms and accelerators.

b8719 (1 fix)
Apr 9, 2026

This release addresses a significant memory leak in the optimization context freeing routine (ggml_opt_free) by ensuring the per-batch context copy (ctx_copy) is properly released.

b8718 (1 feature)
Apr 9, 2026

This release introduces a server-side feature to respect the ignore eos flag and provides numerous pre-compiled binaries for diverse operating systems and hardware configurations.

b8717 (1 feature)
Apr 9, 2026

This release updates binary builds for numerous platforms (macOS, Linux, Windows, openEuler) and implements a vocabulary change by removing the </s> eog token for Gemma 4 models.

b8715 (1 fix)
Apr 9, 2026

This release primarily addresses minor documentation issues by fixing a couple of typos. It also provides updated binary distributions across various operating systems and hardware configurations.

b8714 (1 fix)
Apr 9, 2026

This release enhances KV cache quantization checks to correctly account for enabled flash attention. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b8713 (1 feature)
Apr 9, 2026

This release introduces adapter support querying for the WebGPU backend and provides updated binary distributions for numerous operating systems and hardware configurations.

b8712 (4 features)
Apr 9, 2026

This release introduces the initial Q1_0 Metal backend, including kernel tuning and associated testing infrastructure. It also provides a wide array of pre-compiled binaries for macOS, Linux, Windows, and openEuler platforms.

b8711 (3 features)
Apr 9, 2026

This release introduces architectural optimizations for the Gemma model by restructuring projection operations within the first layer and before the main layer loop. Pre-compiled binaries are provided for various operating systems and hardware configurations.

b8710 (1 fix)
Apr 8, 2026

This release updates the debug example to disable the cb_eval callback when using the --save-logits flag, reducing noise in the output. It also provides numerous pre-compiled binaries for macOS, Linux, Windows, and openEuler.

b8709 (1 fix)
Apr 8, 2026

This release includes a fix for MiniMax handling within the parser and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler across various CPU/GPU configurations.

b8708 (1 fix, 1 feature)
Apr 8, 2026

This release focuses on updating pre-compiled binaries across multiple operating systems and hardware targets, including new support for ROCm 7.2 and CUDA 13.1, alongside minor test cleanup.

b8705 (1 fix, 6 features)
Apr 8, 2026

This release introduces support for the step3-vl-10b model and includes several internal optimizations and refactoring, such as using fused QKV and updating parameter handling for MmprojModel.

b8703 (3 features)
Apr 8, 2026

This release introduces KleidiAI-enabled ARM artifacts for macOS and standardizes the macOS release build process. It also provides an extensive set of pre-compiled binaries for various Linux, Windows, and openEuler configurations.

b8702 (3 features)
Apr 8, 2026

This release significantly speeds up CUDA graph property checks using hashing and introduces optimizations like 'seen node' and 'memcp'. It also provides extensive pre-built binaries for diverse operating systems and hardware configurations.

b8701 (3 fixes, 2 features)
Apr 8, 2026

This release introduces performance improvements for q4_0 and q4_1 mmq kernels on AMD GPUs via ds_read_b128 optimization and includes various bug fixes and code cleanup in the CUDA implementation.

b8699 (1 fix, 1 feature)
Apr 8, 2026

This release introduces support for attention rotation within the kv-cache for heterogeneous iSWA configurations and removes an unnecessary assertion.

b8697 (1 fix)
Apr 8, 2026

This release introduces a safety check in CUDA operations to prevent buffer overlap during fusion and provides numerous updated pre-built binaries across different operating systems and hardware configurations.

b8698 (2 fixes, 6 features)
Apr 8, 2026

This release focuses heavily on optimizing and stabilizing the ggml-webgpu backend by parameterizing submission sizes, adding iOS limits, and removing internal deadlocks. Internal refactoring includes moving types and simplifying profiling futures.

b8696 (1 fix)
Apr 8, 2026

This release addresses a bug in llama-server where model parameters were not being propagated correctly. It also provides numerous pre-compiled binaries for different operating systems and hardware setups.

b8694 (breaking)
Apr 7, 2026

This release removes per-architecture tensor name lists from the llama component and provides updated binary distributions for numerous operating systems and hardware configurations.

b8693 (1 fix)
Apr 7, 2026

This release addresses a specific bug in the server component related to checkpoint restoration when pos_min is 0. It also provides a comprehensive set of pre-compiled binaries for different operating systems and hardware architectures.

b8692 (1 fix)
Apr 7, 2026

This release deprecates the GGML_OP_ADD1 operation and includes minor internal cleanup, alongside providing updated pre-built binaries for numerous operating systems and hardware configurations.

b8691 (2 features)
Apr 7, 2026

This release introduces Vulkan build support for ggml on Linux and improves error reporting for fork failures. It also provides a comprehensive set of pre-compiled binaries across multiple platforms and hardware configurations.

b8690 (3 features)
Apr 7, 2026

This release introduces support for FA dequantization of Q4_1, Q5_0, Q5_1, and IQ4_NL formats within the Vulkan backend. Various pre-compiled binaries for different operating systems and hardware configurations are provided.

b8688 (1 fix)
Apr 7, 2026

This release primarily fixes an incorrect compute capability constant for CDNA2 (gfx90a/MI210) within the ggml-cuda backend. It also provides updated pre-compiled binaries across multiple platforms.

b8685 (1 fix, 2 features)
Apr 7, 2026

This release introduces a significant Q8_0 reorder optimization for the SYCL backend, improving performance on Intel Arc hardware, and fixes a bug preventing this optimization from activating for Q8_0 tensors.

b8683 (1 feature)
Apr 6, 2026

This release introduces support for the MUL_MAT_ID operation within the ggml-webgpu backend and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler platforms.

b8682 (breaking, 2 fixes, 2 features)
Apr 6, 2026

This release introduces Q1_0 1-bit quantization support for the CPU, involving renaming and removing specific quantization variants and fixing related enum issues.

b8681 (breaking, 1 fix)
Apr 6, 2026

This release fixes an issue where newline characters were incorrectly stripped in multiline input for llama-cli and includes an internal change replacing '&' with 'string_view'.

b8680 (4 features)
Apr 6, 2026

This release introduces an optimized flash_attn_stream_k_fixup kernel for CUDA to improve performance under specific conditions. It also provides updated pre-compiled binaries across multiple operating systems and hardware targets.

b8679 (1 feature)
Apr 6, 2026

This release updates the llama-bench tool by adding new command-line arguments (`-fitc` and `-fitt`) and provides a comprehensive set of pre-compiled binaries for numerous operating systems and hardware configurations.

b8678 (1 feature)
Apr 6, 2026

This release introduces byte token handling support for the Gemma4 BPE detokenizer and provides updated pre-compiled binaries for numerous operating systems and hardware configurations.

b8676 (1 fix)
Apr 6, 2026

This release fixes an issue in the server's chunked stream provider to correctly handle unsuccessful writes to the sink, ensuring stream abortion on connection failure. It also provides updated pre-compiled binaries for numerous platforms.

b8672 (1 feature)
Apr 6, 2026

This release includes a minor optimization for argsort output initialization within the hexagon component and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various CPU/GPU architectures.

b8671 (3 fixes)
Apr 6, 2026

This release focuses on fixing issues related to GGUF boolean metadata loading, ensuring platform-independent behavior for BOOL metadata within the llama component.

b8670 (2 fixes, 6 features)
Apr 6, 2026

This release introduces comprehensive support for HunyuanOCR models, including vision capabilities and a new chat template. It also includes various fixes related to token IDs and tensor mappings during conversion.

b8668 (2 fixes)
Apr 5, 2026

This release updates the llama-server startup logging to remove redundancy and include build/commit information. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b8665 (6 fixes, 5 features)
Apr 4, 2026

This release introduces specialized parsing and improved tool call handling for Gemma 4, alongside various internal cleanups and bug fixes.

Common Errors

NotImplementedError (2 reports)

NotImplementedError in llama.cpp typically arises when the conversion or evaluation code does not yet implement the feature or model architecture being used. Update to the latest version of llama.cpp, which may already include the missing implementation, or implement the required operator or architecture support and submit a pull request. If no update is available, switching to a model known to work is a practical workaround.

DeviceLostError (2 reports)

DeviceLostError in llama.cpp usually indicates that the GPU disconnected or hit a critical fault, often due to out-of-memory conditions or driver instability, especially with Vulkan. Try reducing the model size, batch size, or thread count to lower GPU memory usage, and update to the latest GPU drivers to rule out driver bugs. If the problem persists, switching to another backend such as CUDA or Metal can sidestep Vulkan-specific issues.
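One concrete way to lower GPU memory pressure is to shrink the context, batch, and offloaded-layer settings at launch. The sketch below builds such a command line; the flag names follow llama.cpp's common CLI options (--ctx-size, --batch-size, --n-gpu-layers), but the specific values and the model path are illustrative assumptions, not recommendations:

```python
def low_memory_cmd(model_path, ctx=2048, batch=256, gpu_layers=16):
    """Build a llama-server invocation that trades throughput for a
    smaller GPU footprint. Values are illustrative starting points."""
    return [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx),            # shorter context -> smaller KV cache
        "--batch-size", str(batch),        # smaller batches -> smaller compute buffers
        "--n-gpu-layers", str(gpu_layers), # offload fewer layers to the GPU
    ]

cmd = low_memory_cmd("models/llama-7b-q4_0.gguf")
```

Lowering these one at a time makes it easier to tell which setting was actually exhausting device memory.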

InternalServerError (2 reports)

InternalServerError in llama.cpp often stems from unsupported model architectures or operations, such as sending multimodal input to a model that lacks vision support or invoking faulty tool calling in a specific model. Verify that the model supports the requested operation, and update llama.cpp to the latest version or switch to a model known to work with multimodal input or tool calling. If the issue persists, inspect the model's configuration, particularly its vision and function-calling setup, and adjust your prompts accordingly.

FileNotFoundError (1 report)

FileNotFoundError in llama.cpp usually means a required file path, often a model or tokenizer component, is invalid or the file does not exist at that location. Double-check the paths in your command-line arguments and configuration files for typos, and confirm the necessary files are actually present in the indicated directory. When converting from Hugging Face, make sure all required files, such as tokenizer.model, were downloaded correctly.
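A small pre-flight check can turn this error into a clear message before the model loader ever runs. The helper below is a hypothetical sketch (the function name and example path are not part of llama.cpp), validating that a model path exists before it is passed on:

```python
from pathlib import Path

def validate_model_path(path_str):
    """Resolve and check a model path up front, so the failure message
    points at the real problem. Illustrative pre-flight check only."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(
            f"model file not found: {path} "
            "(check for typos and confirm the download completed)"
        )
    return path

# Demonstrate the failure mode on a path that does not exist.
try:
    validate_model_path("models/definitely-missing.gguf")
    ok = True
except FileNotFoundError as e:
    ok = False
    message = str(e)
```

Running the same check on every path taken from a config file catches typos before a long model load begins.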
