
llama.cpp

AI & LLMs

LLM inference in C/C++

Latest: b8182 · 100 releases · 3 breaking changes · 4 common errors · View on GitHub

Release History

b8182 · 1 feature
11h ago

This release updates the bundled miniaudio library to version 0.11.24 and provides a comprehensive set of pre-built binaries for numerous operating systems and hardware configurations.

b8181
13h ago

This release updates the bundled cpp-httplib dependency to version 0.35.0 and provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b8180 · 3 fixes · 3 features
17h ago

This release introduces model metadata loading from Hugging Face for testing purposes, along with optimizations for incremental downloading and fixes related to compilation conditions.

b8179 · 3 fixes · 6 features
Feb 27, 2026

This release introduces significant performance enhancements for AMD CDNA3 (MI300X) hardware by adding MFMA support to the flash attention MMA kernel. It also refines the dispatch logic for flash attention kernels based on batch size and head dimensions.

b8178
Feb 27, 2026

This release adds a `#pragma once` directive to server-context.h and provides numerous pre-compiled binaries for macOS, Linux, Windows, and openEuler targeting different CPU/GPU architectures.

b8177 · 1 feature
Feb 27, 2026

The server API has been updated to mirror the /v1/responses endpoint to /responses for consistency. This release also provides numerous pre-compiled binaries for diverse hardware and operating systems.
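Since the two paths are mirrors, a client can treat them as interchangeable. A minimal sketch, assuming `/v1/responses` is the canonical spelling (the release notes do not say which form is primary):

```python
# Normalize either endpoint spelling to one canonical form before issuing
# a request. Only the two paths below come from the release notes; the
# choice of canonical form is an assumption.
def normalize_responses_path(path: str) -> str:
    if path in ("/responses", "/v1/responses"):
        return "/v1/responses"
    return path

canonical = normalize_responses_path("/responses")
```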

b8175 · 1 feature
Feb 27, 2026

This release introduces repack support for mxfp4 quantization within ggml-cpu and provides updated pre-compiled binaries across various operating systems and hardware configurations.

b8173 · 1 fix · 3 features
Feb 27, 2026

The server component received significant updates to support multiple model aliases via a comma-separated --alias flag and introduced informational tags. The /v1/models endpoint was updated to reflect these new fields.
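A minimal sketch of how the comma-separated `--alias` value might be assembled when launching the server; the binary name and model file below are placeholders, not taken from the release notes:

```python
# Build an argv list for llama-server where several model aliases are
# joined into a single comma-separated --alias value, as this release
# describes. Paths and the model filename are illustrative assumptions.
def build_server_argv(model_path: str, aliases: list[str]) -> list[str]:
    return [
        "llama-server",              # assumes llama-server is on PATH
        "-m", model_path,
        "--alias", ",".join(aliases),
    ]

argv = build_server_argv("model.gguf", ["my-model", "gpt-compat"])
```

The aliases would then be reported by the updated `/v1/models` endpoint.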

b8172 · 1 fix
Feb 27, 2026

This release enables out-of-tree builds for the test-chat binary by correcting the working directory heuristic for finding model files. It also provides extensive pre-compiled binaries for multiple operating systems and hardware configurations.

b8171 · 1 feature
Feb 27, 2026

This release improves iGPU support by replacing a hardcoded value with the dynamic maximum work group size. It also provides numerous pre-compiled binaries for diverse hardware and operating systems.

b8170 · 1 feature
Feb 27, 2026

This release updates the ggml-zendnn backend to align with the latest ZenDNN API changes, including adapting to the new lowoha::matmul interface and updating the CMake configuration.

b8169 · 1 fix · 2 features
Feb 27, 2026

This release enhances ggml performance by fixing AMX issues and introducing batched support, leading to faster perplexity calculation times.

b8168 · 1 fix
Feb 27, 2026

This release primarily addresses a bug concerning fp16 Flash Attention performance on specific Windows AMD hardware configurations using Vulkan. It also provides a comprehensive set of updated pre-built binaries across multiple operating systems and hardware targets.

b8167 · 1 fix
Feb 27, 2026

This release primarily addresses a bug in the padding calculation for n_tokens. It also provides updated pre-compiled binaries for numerous operating systems and hardware configurations.

b8166 · 1 fix
Feb 27, 2026

This release addresses a bug in the server component related to context checkpoint restoration and provides updated binary distributions for numerous operating systems and hardware configurations.

b8165 · 1 fix
Feb 27, 2026

This release contains a critical bug fix related to the kv-cache's can_shift() check when using M-RoPE, alongside updated pre-compiled binaries for various operating systems and hardware configurations.

b8164 · 1 fix · 2 features
Feb 27, 2026

This release introduces support for merging gate and exp weights in llama models and adds necessary components for all MoE models, alongside various pre-compiled binaries.

b8163 · 4 fixes · 6 features
Feb 26, 2026

This release focuses heavily on improving the reliability and safety of the ggml-virtgpu backend through extensive consistency checks, error handling, and fallbacks for optional interface methods. It also includes various minor fixes and documentation updates.

b8162 · 1 fix
Feb 26, 2026

This release primarily addresses a bug where the server's load-on-startup configuration from the INI file was ignored. It also provides updated pre-compiled binaries for numerous operating systems and hardware configurations.

b8161 · 1 fix
Feb 26, 2026

This release corrects a bug related to the default size for string slices in jinja and provides updated binary distributions for numerous platforms including macOS, Linux, Windows, and openEuler.

b8159 · 1 fix
Feb 26, 2026

This release includes an optimization in the GGUF implementation to reduce unnecessary file size calls and provides updated pre-built binaries for various operating systems and hardware configurations.

b8157 · 1 fix · 1 feature
Feb 26, 2026

This release introduces support for permuted quantization formats and removes an outdated check related to s0/s10 parameters, accompanied by numerous pre-compiled binaries for diverse platforms.

b8156 · 1 fix
Feb 26, 2026

This release introduces a safety check in the Vulkan backend to prevent memory overlap during fusion operations. It also provides updated binaries for various operating systems and hardware configurations.

b8155 · 1 feature
Feb 25, 2026

This release introduces additional aliases for sampler CLI parameters and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various hardware and compute backends.

b8153 · 1 feature
Feb 25, 2026

The server component now supports multi-modal prompt caching. This release also provides extensive pre-compiled binaries for a wide range of operating systems and hardware configurations.

b8152 · 1 fix · 1 feature
Feb 25, 2026

This release introduces support for multi-modal context checkpoints on the server side and includes several internal code modifications and a bug fix related to sequence management.

b8149 · 1 fix
Feb 25, 2026

This release addresses a bug related to ftell/fseek functionality on Windows within the gguf component and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler.

b8148 · 1 fix
Feb 25, 2026

This release includes a fix for graph splitting issues within models and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler platforms.

b8147 · 1 fix
Feb 24, 2026

This release primarily addresses a bug in the server component where query parameters were incorrectly handled during request proxying in multi-model router mode.

b8146 · 10 fixes · 2 features
Feb 24, 2026

This release introduces significant stability improvements by preventing various integer overflows across ggml and gguf operations, alongside minor fixes and the removal of a deprecated function.

b8145 · 1 fix · 1 feature
Feb 24, 2026

This release updates the model label for LFM2-24B-A2B in benchmark output and cleans up the output formatting by removing an extra line.

b8144 · 1 feature
Feb 24, 2026

The server API has been updated to support the new "max_completion_tokens" request property, deprecating the older "max_tokens" parameter.
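A minimal sketch of a request body using the newer property; the message structure follows the common chat-completion shape, and everything except the two token-limit field names is an assumption:

```python
import json

# Build a chat request body that uses "max_completion_tokens", the property
# this release adds, rather than the deprecated "max_tokens" spelling.
def completion_body(messages: list[dict], limit: int) -> dict:
    return {
        "messages": messages,
        "max_completion_tokens": limit,  # preferred; "max_tokens" is deprecated
    }

body = completion_body([{"role": "user", "content": "Hello"}], 128)
payload = json.dumps(body)
```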

b8143 · 11 fixes · 18 features
Feb 24, 2026

This release focuses heavily on refactoring and optimizing the Vulkan Scalar Flash Attention implementation, introducing fp16 support, improving synchronization, and applying numerous hardware-specific tuning fixes across AMD, Intel, and Nvidia platforms.

b8142 · 1 fix
Feb 24, 2026

This release primarily addresses a bug in cooperative matrix multiplication support within the Vulkan backend when bf16 is not available, alongside providing updated pre-built binaries for numerous platforms.

b8141 · 1 fix
Feb 24, 2026

This release addresses a data race issue within the Vulkan mul_mat_id shader and provides updated binary distributions across various platforms including macOS, Linux, Windows, and openEuler.

b8140 · 4 fixes · 3 features
Feb 24, 2026

This release focuses heavily on internal refactoring within the hexagon backend, optimizing various Ops by using local context structs and rewriting ROPE for better DMA/VTCM utilization, leading to minor performance gains. Snapdragon builds also received updates to support larger ubatches.

b8138
Feb 23, 2026

This release updates the bundled cpp-httplib dependency to version 0.34.0 and provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b8133 · Breaking · 1 fix · 3 features
Feb 23, 2026

This release removes the storage of output ids, logits, and embeddings from the llama context state, necessitating updates to session handling and state loading mechanisms. Bug fixes include addressing sequence allocation errors in examples for recurrent models.

b8132 · 1 feature
Feb 23, 2026

This release introduces an enhancement to the CLI allowing model specification via text filename and provides updated binary distributions for numerous operating systems and hardware configurations.

b8131 · 1 fix
Feb 22, 2026

This release includes a bug fix for incorrect statistics calculation in Jinja filters and provides updated binary distributions across macOS, Linux, Windows, and openEuler platforms.

b8130 · 1 fix
Feb 22, 2026

This release addresses a bug in the XML parser related to improper trimming upon message completion and provides updated binary distributions for numerous operating systems and hardware configurations.

b8128 · 1 feature
Feb 22, 2026

This release introduces support for the new Kanana-2 model and provides updated pre-compiled binaries across multiple platforms, including specific CUDA and ROCm versions.

b8126 · 1 fix · 1 feature
Feb 22, 2026

This release introduces an improvement on the server side to merge contiguous response input items into a single assistant message, alongside various binary updates for different platforms.

b8123 · 1 feature
Feb 21, 2026

This release introduces support for building ROCm artifacts targeting ROCm 7.2, expanding hardware compatibility. New pre-built binaries are available across various operating systems and hardware configurations.

b8122
Feb 21, 2026

This release updates the internal cpp-httplib vendor dependency to version 0.33.1 and provides a comprehensive set of pre-built binaries for macOS, Linux, Windows, and openEuler targets.

b8121 · 1 fix · 1 feature
Feb 21, 2026

This release significantly improves CUDA graph capture by delaying activation until stability is confirmed, preventing wasted overhead during prompt processing and allowing graphs to re-enable after stabilization. It also includes minor cleanup by removing EM dashes.

b8119 · 1 fix
Feb 21, 2026

This release primarily addresses a build issue related to hexagon and provides updated pre-compiled binaries across multiple platforms and hardware configurations.

b8118 · 1 fix · 2 features
Feb 20, 2026

This release merges the qwen3-coder and nemotron nano 3 parsers, migrating qwen3-coder to PEG parsing, and includes a new JSON parameter test for CI.

b8117 · 2 features
Feb 20, 2026

This release focuses on enhancing CPU performance by adding support for various RVV vec dot kernels within ggml-cpu, alongside providing updated pre-compiled binaries for multiple operating systems and hardware configurations.

b8116 · 4 fixes · 3 features
Feb 20, 2026

This release introduces a --dry-run option for llama-quantize and refines internal tensor dimension handling and quantization logic, including new checks related to imatrix usage.
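A minimal sketch of how the new flag might be passed when invoking the tool; only `--dry-run` comes from the release notes, while the argument order, file names, and quantization type are illustrative assumptions:

```python
# Assemble a llama-quantize invocation, optionally inserting the --dry-run
# flag added in this release so the tool reports its plan without writing
# the output file.
def quantize_cmd(src: str, dst: str, qtype: str, dry_run: bool = False) -> list[str]:
    cmd = ["llama-quantize"]
    if dry_run:
        cmd.append("--dry-run")  # new in this release
    cmd += [src, dst, qtype]
    return cmd

cmd = quantize_cmd("model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M", dry_run=True)
```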

b8115
Feb 20, 2026

This release primarily focuses on providing pre-built binaries for various operating systems and hardware configurations, including specific CUDA versions (12.4 and 13.1) for Windows, and includes tests for matrix multiplication with huge batch sizes.

b8113 · 2 fixes · 4 features
Feb 20, 2026

This release introduces robust support for the Step-3.5-Flash model, including correct XML tool call parsing and thinking support by routing it to the Nemotron v3 PEG parser. Dead thinking code paths in the Qwen3-Coder XML handler were also removed.

b8112 · 1 fix
Feb 20, 2026

This release addresses a Jinja rendering error related to assistant messages containing both content and tool call thinking in gpt-oss and provides updated binaries for numerous operating systems and hardware configurations.

b8111 · 1 fix · 1 feature
Feb 20, 2026

This release introduces support for several unary operations within the ggml-webgpu backend and includes a necessary fix for type casting during trigonometric computations. It also provides updated pre-compiled binaries for numerous operating systems and architectures.

b8110 · 4 features
Feb 20, 2026

This release introduces support for the PaddleOCR-VL model and includes several internal updates related to model loading parameters, preprocessing, and format adjustments.

b8109 · 1 fix
Feb 20, 2026

This release primarily addresses a bug related to MMQ shader push constants and multi-dispatch functionality within the Vulkan backend.

b8108 · 2 fixes
Feb 20, 2026

This release addresses a critical bug in Qwen3.5 model shapes and optimizes contiguous operations by removing unnecessary reshapes. It also provides updated binaries for numerous platforms.

b8107 · 2 features
Feb 20, 2026

This release updates the build_attn logic and introduces control over flash_attn usage through context parameters. Pre-built binaries for numerous platforms are provided.

b8106 · 5 fixes · 2 features
Feb 20, 2026

This release introduces full support for the JAIS-2 model architecture, including specific fixes for tokenizer hashing, RoPE type, and control vector support. It also notes that JAIS-2 requires F32 precision accumulators on CUDA.

b8105 · 1 fix
Feb 19, 2026

This release addresses a bug in CUDA kernel selection logic for tile FA and provides updated binary distributions for numerous platforms including macOS, Linux, and Windows.

b8104 · 1 fix · 2 features
Feb 19, 2026

This release fixes an issue where an extra newline was inserted between text and media markers in MTMD chat output by introducing a specific `media_marker` type. This resolves token count discrepancies when comparing llama-server output with HF implementations for vision models.

b8102 · 3 fixes · 2 features
Feb 19, 2026

This release introduces support for the LFM2.5-Audio-1.5B tokenizer and includes several internal code improvements and fixes related to attention layers and model conversion.

b8101
Feb 19, 2026

Internal refactoring was performed in the llama module to unify batch index resolution by utilizing output_resolve_row() in get_logits_ith() and get_embeddings_ith().

b8100 · 1 fix · 4 features
Feb 19, 2026

This release introduces full support for modern BERT models, including specific architectural adjustments like GELU in rank pooling and dense first layers. Several internal file updates and bug fixes related to mean pooling were also implemented.

b8099 · 1 fix · 2 features
Feb 19, 2026

This release introduces a significant performance enhancement for llamafile on powerpc by adding an FP16 MMA path for Q4/Q8 matrix multiplications, resulting in 1.5x to 2x speedup for relevant workloads.

b8098 · 1 fix · 1 feature
Feb 19, 2026

This release focuses on model graph optimization by deduplicating Qwen3.5 graphs and includes a minor fix adding a missing sigmoid function.

b8095 · 1 fix
Feb 19, 2026

This release primarily addresses a bug in the ggml webgpu backend related to large matrix-vector multiplication dispatching. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b8094 · 1 fix · 1 feature
Feb 18, 2026

This release introduces server-side debugging by saving generated text for the /slots endpoint based on the LLAMA_SERVER_SLOTS_DEBUG environment variable.

b8091 · 4 fixes · 3 features
Feb 18, 2026

This release focuses heavily on refactoring and implementing preliminary JIT compilation for key matrix operations within the ggml WebGPU backend, alongside organizing the shader library.

b8093 · 1 feature
Feb 18, 2026

This release introduces model support for GLM-OCR and updates the conversion script. Pre-built binaries for numerous platforms are provided.

b8089 · 1 feature
Feb 18, 2026

The Vulkan backend was updated to split matrix multiplication operations into multiple dispatches to prevent overflow issues when handling large batch dimensions. This release also includes a comprehensive set of pre-compiled binaries for diverse operating systems and hardware configurations.

b8088 · 1 fix · 1 feature
Feb 18, 2026

This release optimizes internal string handling by inlining small helper functions and utilizing string_view where appropriate, alongside fixing related corner cases. New binaries are provided for numerous platforms.

b8087
Feb 18, 2026

This release refactors the OpenCL implementations for expm1 and softplus kernels and introduces the use of 'h' for half literals in OpenCL operations.

b8086 · 1 feature
Feb 17, 2026

This release includes performance optimizations for OpenCL mean and sum_row kernels and provides updated pre-compiled binaries for a wide range of operating systems and hardware configurations.

b8083 · 1 fix
Feb 17, 2026

This release disables LTO for CPU feature detection in ggml to resolve Illegal instruction errors occurring on older hardware due to aggressive cross-module optimization.

b8082 · 1 feature
Feb 17, 2026

This release enables CUDA graphs for MMID operations when the batch size is small (1 to 4) and includes various pre-compiled binaries for different operating systems and hardware configurations.

b8079 · 1 feature
Feb 17, 2026

This release updates the build process by linking ws2_32 as PUBLIC on Windows and provides a comprehensive set of pre-built binaries across multiple operating systems and hardware architectures.

b8078
Feb 17, 2026

This release includes a cleanup of the library linking logic in the build system. It also provides extensive pre-built binaries for macOS, Linux, Windows, and openEuler across various architectures and acceleration backends.

b8077 · 2 features
Feb 17, 2026

This release introduces support for JoyAI-LLM-Flash conversion by updating tokenizer hash mappings and adding a new pre-tokenizer name for joyai-llm.

b8076 · 1 feature
Feb 17, 2026

This release introduces proper batching support for the Perplexity integration and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various CPU/GPU architectures.

b8075 · 1 feature
Feb 17, 2026

This release introduces inline functions for common operations and provides updated binary distributions for numerous platforms including macOS, Linux, and Windows with various hardware acceleration options.

b8074 · 1 feature
Feb 17, 2026

This release promotes `ggml_is_view` to a public API and renames an internal helper function from `ggml_aux_is_view` to `ggml_impl_is_view`.

b8073 · 1 fix · 1 feature
Feb 17, 2026

This release introduces support for Tiny Aya Models and includes fixes for tokenizer regex edge cases. It also provides numerous pre-compiled binaries for different operating systems and hardware configurations.

b8072
Feb 17, 2026

The build system was reworked to correctly handle the deprecation of llama_option_depr, specifically addressing LLAMA_CURL. Numerous pre-compiled binaries for diverse platforms and hardware configurations are now available.

b8071 · 1 fix
Feb 17, 2026

This release refines the ROCm compilation workaround for ROCWMMA_FATTN/GFX9 to be conditional on newer ROCm versions, resolving an issue observed with ROCm 6.4.4.

b8070 · 1 fix · 2 features
Feb 16, 2026

This release introduces model graph deduplication and updates for Qwen family models, including the addition of `llm_build_delta_net_base`, alongside providing numerous pre-compiled binaries for diverse platforms.

b8069 · 4 fixes
Feb 16, 2026

This release focuses on internal fixes within the graph and continuous modules, specifically addressing issues related to KQ mask reuse and adapter checks.

b8068 · 1 fix · 2 features
Feb 16, 2026

This release introduces SVE optimization for aarch64 in the ggml kernel, improving performance on supported hardware, and includes extensive pre-compiled binaries for multiple platforms.

b8067 · 1 feature
Feb 15, 2026

This release primarily updates the binary distributions for ggml synchronization, providing new builds for macOS, Linux, Windows, and openEuler targeting various CPU/GPU architectures.

b8064 · Breaking · 1 fix · 3 features
Feb 15, 2026

This release focuses heavily on CUDA performance optimizations for iq2xxs/iq2xs/iq3xxs dequantization, including register savings and algorithmic simplification, alongside fixing a type definition issue.

b8061 · Breaking · 1 fix
Feb 15, 2026

This release primarily addresses a build issue related to the KleidiAI backend when compiling multiple CPU backends by correcting the use of CMake's FetchContent functions. It also provides updated pre-compiled binaries for numerous platforms.

b8062
Feb 15, 2026

The LLAMA_HTTPLIB build option was removed due to cpp-httplib now compiling correctly on visionOS, simplifying the build process. This release also provides updated binary distributions for macOS, Linux, Windows, and openEuler.

b8060 · 1 fix
Feb 15, 2026

This release primarily addresses a bug concerning output reordering when backend sampling is utilized. It also provides updated pre-compiled binaries for numerous operating systems and hardware configurations.

b8059 · 1 fix
Feb 15, 2026

This release focuses on stability by avoiding undefined behavior in the ggml gemm ukernel. It also provides updated pre-built binaries for numerous platforms and hardware configurations.

b8058 · 1 feature
Feb 15, 2026

This release introduces performance optimizations for the ggml CPU backend, specifically targeting the ggml_vec_dot_bf16 function on s390x architecture, alongside updated pre-compiled binaries.

b8057 · 7 fixes · 2 features
Feb 15, 2026

This release introduces significant performance enhancements to ggml-cpu via a new GEMM microkernel and addresses several low-level implementation details and warnings. It also provides extensive new pre-compiled binaries for various platforms and accelerators.

b8056 · 1 fix
Feb 15, 2026

This release primarily addresses a CMake build issue related to the KleidiAI install target failure when using EXCLUDE_FROM_ALL, ensuring proper exclusion while maintaining functionality.

b8054 · 2 features
Feb 14, 2026

This release adds support for Nemotron Nano 12B v2 VL models and simplifies related code. It also implements a change to pre-downsample position embeddings during GGUF conversion for fixed input size handling.

b8053 · 4 features
Feb 14, 2026

This release focuses on internal optimizations, primarily targeting the Qwen3Next model graph execution and refining chunking logic by removing redundancy and avoiding mask passing.

b8052 · 1 fix
Feb 14, 2026

This release primarily addresses a bug in GGML related to the interaction between GGML_DEBUG and OpenMP compilation flags. It also provides updated pre-built binaries for numerous operating systems and hardware configurations.

Common Errors

DeviceLostError · 2 reports

vk::DeviceLostError usually signifies that the GPU has encountered an unrecoverable error, often due to exceeding memory limits or hitting a timeout. Reduce the model size, lower batch sizes, or decrease the max context length to alleviate memory pressure. Alternatively, increase the GPU timeout duration in your system settings or driver configurations if that is permissible.
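The memory-pressure mitigations above can be captured as a simple retry policy. A sketch under the assumption that halving both knobs is an acceptable back-off; the variable names mirror the server's context-length and batch-size settings conceptually but are local to this example:

```python
# After a device-lost failure, halve the context length and batch size
# before retrying, never dropping below a minimal floor.
def reduce_pressure(ctx_len: int, batch_size: int, ctx_floor: int = 256) -> tuple[int, int]:
    return max(ctx_len // 2, ctx_floor), max(batch_size // 2, 1)

ctx, batch = reduce_pressure(8192, 512)
```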

InternalServerError · 2 reports

InternalServerError in llama-cpp often arises from unsupported model architectures or operations, such as attempting multimodal input with a model not designed for it or faulty tool calling within a specific model. To resolve this, verify model compatibility with the requested operation in your code, and update llama-cpp to the latest version or use a compatible model known to work with multimodal inputs or tool calling. If issues persist, inspect the model's configuration, particularly its handling of vision or function calling, and revise your prompts accordingly.

FileNotFoundError · 1 report

The "FileNotFoundError" in llama-cpp usually means a required file path, often a model or tokenizer component, isn't valid or the file doesn't exist at that location. Double-check the path specified in your command-line arguments or configuration files for typos and ensure the necessary files are actually present in the indicated directory. If converting from Hugging Face, ensure all necessary files, like "tokenizer.model", were downloaded correctly.
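A quick pre-flight check along these lines can surface the missing file before the loader fails. A sketch; the required-files list here is illustrative (only "tokenizer.model" is mentioned above), so adapt it to your conversion workflow:

```python
from pathlib import Path

# Report which required files are absent from a model directory before
# handing the path to a loader or conversion script.
def missing_files(model_dir: str, required: tuple[str, ...] = ("tokenizer.model",)) -> list[str]:
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]

# A directory that does not exist reports every required file as missing.
missing = missing_files("/nonexistent-model-dir")
```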

NotImplementedError · 1 report

NotImplementedError usually arises when a function or method is called but lacks a concrete implementation in the current class or codebase. To fix it, locate the function causing the error, either implement the missing functionality there, or use a different function call. If you are converting models, ensure the conversion script supports the specific model architecture being used.
