llama.cpp
AI & LLMs: LLM inference in C/C++
Release History
b8182 (1 feature): This release updates the bundled miniaudio library to version 0.11.24 and provides a comprehensive set of pre-built binaries for numerous operating systems and hardware configurations.
b8181: This release updates the bundled cpp-httplib dependency to version 0.35.0 and provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b8180 (3 fixes, 3 features): This release introduces model metadata loading from Hugging Face for testing purposes, along with optimizations for incremental downloading and fixes related to compilation conditions.
b8179 (3 fixes, 6 features): This release introduces significant performance enhancements for AMD CDNA3 (MI300X) hardware by adding MFMA support to the flash attention MMA kernel. It also refines the dispatch logic for flash attention kernels based on batch size and head dimensions.
b8178: This release adds a "pragma once" directive to server-context.h and provides numerous pre-compiled binaries for macOS, Linux, Windows, and openEuler targeting different CPU/GPU architectures.
b8177 (1 feature): The server API has been updated to mirror the /v1/responses endpoint to /responses for consistency. This release also provides numerous pre-compiled binaries for diverse hardware and operating systems.
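Mirroring an endpoint like this usually amounts to registering the same handler under both paths. A minimal sketch of the idea in Python (the route table and handler are illustrative stand-ins, not llama.cpp's actual server code):

```python
# Illustrative route table: both paths resolve to the same handler,
# mirroring /v1/responses to /responses as described in the release note.
def handle_responses(request: dict) -> dict:
    # Placeholder handler; the real server implements the full Responses API.
    return {"object": "response", "input": request.get("input")}

ROUTES = {
    "/v1/responses": handle_responses,
    "/responses": handle_responses,  # mirror: same handler, no duplicated logic
}

def dispatch(path: str, request: dict) -> dict:
    return ROUTES[path](request)
```

Because both keys point at the same function, the mirror cannot drift out of sync with the original endpoint.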
b8175 (1 feature): This release introduces repack support for mxfp4 quantization within ggml-cpu and provides updated pre-compiled binaries across various operating systems and hardware configurations.
b8173 (1 fix, 3 features): The server component received significant updates to support multiple model aliases via a comma-separated --alias flag and introduced informational tags. The /v1/models endpoint was updated to reflect these new fields.
b8172 (1 fix): This release enables out-of-tree builds for the test-chat binary by correcting the working-directory heuristic for finding model files. It also provides extensive pre-compiled binaries for multiple operating systems and hardware configurations.
b8171 (1 feature): This release improves iGPU support by replacing a hardcoded value with the dynamic maximum work-group size. It also provides numerous pre-compiled binaries for diverse hardware and operating systems.
b8170 (1 feature): This release updates the ggml-zendnn backend to align with the latest ZenDNN API changes, including adapting to the new lowoha::matmul interface and updating the CMake configuration.
b8169 (1 fix, 2 features): This release enhances ggml performance by fixing AMX issues and introducing batched support, leading to faster perplexity calculation times.
b8168 (1 fix): This release addresses a bug affecting fp16 Flash Attention performance on specific Windows AMD hardware configurations using Vulkan. It also provides a comprehensive set of updated pre-built binaries across multiple operating systems and hardware targets.
b8167 (1 fix): This release fixes the padding calculation for n_tokens. It also provides updated pre-compiled binaries for numerous operating systems and hardware configurations.
b8166 (1 fix): This release fixes a bug in the server component related to context checkpoint restoration and provides updated binary distributions for numerous operating systems and hardware configurations.
b8165 (1 fix): This release contains a critical bug fix for the kv-cache's can_shift() check when using M-RoPE, alongside updated pre-compiled binaries for various operating systems and hardware configurations.
b8164 (1 fix, 2 features): This release introduces support for merging gate and exp weights in llama models and adds necessary components for all MoE models, alongside various pre-compiled binaries.
b8163 (4 fixes, 6 features): This release focuses heavily on improving the reliability and safety of the ggml-virtgpu backend through extensive consistency checks, error handling, and fallbacks for optional interface methods. It also includes various minor fixes and documentation updates.
b8162 (1 fix): This release fixes a bug where the server's load-on-startup configuration from the INI file was ignored. It also provides updated pre-compiled binaries for numerous operating systems and hardware configurations.
b8161 (1 fix): This release corrects the default size for string slices in jinja and provides updated binary distributions for numerous platforms including macOS, Linux, Windows, and openEuler.
b8159 (1 fix): This release includes an optimization in the GGUF implementation to reduce unnecessary file-size calls and provides updated pre-built binaries for various operating systems and hardware configurations.
b8157 (1 fix, 1 feature): This release introduces support for permuted quantization formats and removes an outdated check related to the s0/s10 parameters, accompanied by numerous pre-compiled binaries for diverse platforms.
b8156 (1 fix): This release introduces a safety check in the Vulkan backend to prevent memory overlap during fusion operations. It also provides updated binaries for various operating systems and hardware configurations.
b8155 (1 feature): This release introduces additional aliases for sampler CLI parameters and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various hardware and compute backends.
b8153 (1 feature): The server component now supports multi-modal prompt caching. This release also provides extensive pre-compiled binaries for a wide range of operating systems and hardware configurations.
b8152 (1 fix, 1 feature): This release introduces support for multi-modal context checkpoints on the server side and includes several internal code modifications and a bug fix related to sequence management.
b8149 (1 fix): This release fixes ftell/fseek behavior on Windows within the gguf component and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler.
b8148 (1 fix): This release includes a fix for graph-splitting issues within models and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler platforms.
b8147 (1 fix): This release fixes a bug in the server component where query parameters were mishandled during request proxying in multi-model router mode.
b8146 (10 fixes, 2 features): This release introduces significant stability improvements by preventing various integer overflows across ggml and gguf operations, alongside minor fixes and the removal of a deprecated function.
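Overflow hardening of this kind typically means guarding size arithmetic before it can wrap. A hedged illustration of the general technique in Python (ggml's actual checks are in C and differ in detail; the function here is the author's sketch):

```python
INT64_MAX = 2**63 - 1

def checked_mul(a: int, b: int) -> int:
    """Multiply two non-negative sizes, raising instead of silently wrapping.

    Python ints never overflow, so we emulate the 64-bit bound that a C
    size calculation (e.g. nbytes = ne0 * ne1 * type_size) must respect.
    """
    product = a * b
    if product > INT64_MAX:
        raise OverflowError(f"size computation overflows int64: {a} * {b}")
    return product
```

In C the equivalent check compares one operand against `INT64_MAX / other` before multiplying, since the overflow itself is undefined behavior there.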
b8145 (1 fix, 1 feature): This release updates the model label for LFM2-24B-A2B in benchmark output and cleans up the output formatting by removing an extra line.
b8144 (1 feature): The server API has been updated to support the new "max_completion_tokens" request property, deprecating the older "max_tokens" parameter.
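For clients, a deprecation like this means preferring the new field while tolerating the old one. A small sketch in Python (the field names come from the release note; the migration policy shown is the author's illustration, not documented server behavior):

```python
def normalize_request(body: dict) -> dict:
    """Prefer "max_completion_tokens", migrating a legacy "max_tokens" value."""
    out = dict(body)  # avoid mutating the caller's request
    if "max_completion_tokens" not in out and "max_tokens" in out:
        # Move the deprecated field to its replacement.
        out["max_completion_tokens"] = out.pop("max_tokens")
    return out
```

An explicitly set "max_completion_tokens" always wins; the legacy key is only consulted as a fallback.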
b8143 (11 fixes, 18 features): This release focuses heavily on refactoring and optimizing the Vulkan scalar Flash Attention implementation, introducing fp16 support, improving synchronization, and applying numerous hardware-specific tuning fixes across AMD, Intel, and Nvidia platforms.
b8142 (1 fix): This release fixes cooperative matrix multiplication support within the Vulkan backend when bf16 is not available, alongside providing updated pre-built binaries for numerous platforms.
b8141 (1 fix): This release addresses a data race within the Vulkan mul_mat_id shader and provides updated binary distributions across various platforms including macOS, Linux, Windows, and openEuler.
b8140 (4 fixes, 3 features): This release focuses heavily on internal refactoring within the hexagon backend, optimizing various ops by using local context structs and rewriting ROPE for better DMA/VTCM utilization, yielding minor performance gains. Snapdragon builds also received updates to support larger ubatches.
b8138: This release updates the bundled cpp-httplib dependency to version 0.34.0 and provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b8133 (Breaking; 1 fix, 3 features): This release removes the storage of output ids, logits, and embeddings from the llama context state, necessitating updates to session handling and state loading mechanisms. Bug fixes include addressing sequence allocation errors in examples for recurrent models.
b8132 (1 feature): This release enhances the CLI to allow model specification via text filename and provides updated binary distributions for numerous operating systems and hardware configurations.
b8131 (1 fix): This release includes a bug fix for incorrect statistics calculation in Jinja filters and provides updated binary distributions across macOS, Linux, Windows, and openEuler platforms.
b8130 (1 fix): This release addresses a bug in the XML parser related to improper trimming upon message completion and provides updated binary distributions for numerous operating systems and hardware configurations.
b8128 (1 feature): This release introduces support for the new Kanana-2 model and provides updated pre-compiled binaries across multiple platforms, including specific CUDA and ROCm versions.
b8126 (1 fix, 1 feature): This release improves the server side by merging contiguous response input items into a single assistant message, alongside various binary updates for different platforms.
b8123 (1 feature): This release introduces support for building ROCm artifacts targeting ROCm 7.2, expanding hardware compatibility. New pre-built binaries are available across various operating systems and hardware configurations.
b8122: This release updates the internal cpp-httplib vendor dependency to version 0.33.1 and provides a comprehensive set of pre-built binaries for macOS, Linux, Windows, and openEuler targets.
b8121 (1 fix, 1 feature): This release significantly improves CUDA graph capture by delaying activation until stability is confirmed, preventing wasted overhead during prompt processing and allowing graphs to re-enable after stabilization. It also includes minor cleanup by removing em dashes.
b8119 (1 fix): This release addresses a build issue related to hexagon and provides updated pre-compiled binaries across multiple platforms and hardware configurations.
b8118 (1 fix, 2 features): This release merges the qwen3-coder and nemotron nano 3 parsers, migrating qwen3-coder to PEG parsing, and includes a new JSON parameter test for CI.
b8117 (2 features): This release enhances CPU performance by adding support for various RVV vec dot kernels within ggml-cpu, alongside providing updated pre-compiled binaries for multiple operating systems and hardware configurations.
b8116 (4 fixes, 3 features): This release introduces a --dry-run option for llama-quantize and refines internal tensor dimension handling and quantization logic, including new checks related to imatrix usage.
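A dry-run flag like this lets you preview a quantization without writing output. A hedged usage sketch in Python (the --dry-run flag name comes from the release note; the binary name, argument order, and file names are assumptions for illustration):

```python
import subprocess

def build_quantize_cmd(src: str, dst: str, qtype: str, dry_run: bool = True) -> list:
    # Assemble an argv list for the llama-quantize tool; --dry-run is the
    # new option from this release, placed before the positional arguments.
    cmd = ["llama-quantize"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [src, dst, qtype]
    return cmd

# Example (requires llama-quantize on PATH):
# subprocess.run(build_quantize_cmd("model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M"))
```

Building the argv as a list, rather than a shell string, avoids quoting pitfalls with paths containing spaces.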
b8115: This release primarily provides pre-built binaries for various operating systems and hardware configurations, including specific CUDA versions (12.4 and 13.1) for Windows, and adds tests for matrix multiplication with huge batch sizes.
b8113 (2 fixes, 4 features): This release introduces robust support for the Step-3.5-Flash model, including correct XML tool-call parsing and thinking support by routing it to the Nemotron v3 PEG parser. Dead thinking code paths in the Qwen3-Coder XML handler were also removed.
b8112 (1 fix): This release fixes a Jinja rendering error for assistant messages containing both content and tool-call thinking in gpt-oss and provides updated binaries for numerous operating systems and hardware configurations.
b8111 (1 fix, 1 feature): This release introduces support for several unary operations within the ggml-webgpu backend and includes a necessary fix for type casting during trigonometric computations. It also provides updated pre-compiled binaries for numerous operating systems and architectures.
b8110 (4 features): This release introduces support for the PaddleOCR-VL model and includes several internal updates related to model loading parameters, preprocessing, and format adjustments.
b8109 (1 fix): This release fixes MMQ shader push constants and multi-dispatch functionality within the Vulkan backend.
b8108 (2 fixes): This release addresses a critical bug in Qwen3.5 model shapes and optimizes contiguous operations by removing unnecessary reshapes. It also provides updated binaries for numerous platforms.
b8107 (2 features): This release updates the build_attn logic and introduces control over flash_attn usage through context parameters. Pre-built binaries for numerous platforms are provided.
b8106 (5 fixes, 2 features): This release introduces full support for the JAIS-2 model architecture, including specific fixes for tokenizer hashing, RoPE type, and control-vector support. It also notes that JAIS-2 requires F32 precision accumulators on CUDA.
b8105 (1 fix): This release addresses a bug in CUDA kernel selection logic for tile FA and provides updated binary distributions for numerous platforms including macOS, Linux, and Windows.
b8104 (1 fix, 2 features): This release fixes an issue where an extra newline was inserted between text and media markers in MTMD chat output by introducing a specific `media_marker` type. This resolves token-count discrepancies when comparing llama-server output with HF implementations for vision models.
b8102 (3 fixes, 2 features): This release introduces support for the LFM2.5-Audio-1.5B tokenizer and includes several internal code improvements and fixes related to attention layers and model conversion.
b8101: Internal refactoring was performed in the llama module to unify batch index resolution by utilizing output_resolve_row() in get_logits_ith() and get_embeddings_ith().
b8100 (1 fix, 4 features): This release introduces full support for modern BERT models, including specific architectural adjustments such as GELU in rank pooling and dense first layers. Several internal file updates and bug fixes related to mean pooling were also implemented.
b8099 (1 fix, 2 features): This release introduces a significant performance enhancement for llamafile on powerpc by adding an FP16 MMA path for Q4/Q8 matrix multiplications, resulting in a 1.5x to 2x speedup for relevant workloads.
b8098 (1 fix, 1 feature): This release focuses on model graph optimization by deduplicating Qwen35 graphs and includes a minor fix by adding a missing sigmoid function.
b8095 (1 fix): This release fixes a bug in the ggml webgpu backend related to large matrix-vector multiplication dispatching. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b8094 (1 fix, 1 feature): This release introduces server-side debugging by saving generated text for the /slots endpoint based on the LLAMA_SERVER_SLOTS_DEBUG environment variable.
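Environment-gated debug capture like this usually checks the variable and records extra state only when it is set. A hedged re-creation of the gating idea in Python (the variable name LLAMA_SERVER_SLOTS_DEBUG comes from the release note; the class and its fields are illustrative):

```python
import os

class Slot:
    """Illustrative stand-in for a server slot; only the debug gating matters."""

    def __init__(self):
        self.generated_text = None

    def record(self, text: str) -> None:
        # Save generated text for later inspection via /slots only when
        # debugging is explicitly enabled via the environment.
        if os.environ.get("LLAMA_SERVER_SLOTS_DEBUG"):
            self.generated_text = text
```

Gating on an environment variable keeps the extra memory cost at zero in normal operation while avoiding any new CLI surface.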
b8093 (1 feature): This release introduces model support for GLM-OCR and updates the conversion script. Pre-built binaries for numerous platforms are provided.
b8091 (4 fixes, 3 features): This release focuses heavily on refactoring and implementing preliminary JIT compilation for key matrix operations within the ggml WebGPU backend, alongside organizing the shader library.
b8089 (1 feature): The Vulkan backend was updated to split matrix multiplication operations into multiple dispatches to prevent overflow when handling large batch dimensions. This release also includes a comprehensive set of pre-compiled binaries for diverse operating systems and hardware configurations.
b8088 (1 fix, 1 feature): This release optimizes internal string handling by inlining small helper functions and utilizing string_view where appropriate, alongside fixing related corner cases. New binaries are provided for numerous platforms.
b8087: This release refactors the OpenCL implementations for the expm1 and softplus kernels and introduces the use of 'h' for half literals in OpenCL operations.
b8086 (1 feature): This release includes performance optimizations for the OpenCL mean and sum_row kernels and provides updated pre-compiled binaries for a wide range of operating systems and hardware configurations.
b8083 (1 fix): This release disables LTO for CPU feature detection in ggml to resolve Illegal instruction errors occurring on older hardware due to aggressive cross-module optimization.
b8082 (1 feature): This release enables CUDA graphs for MMID operations when the batch size is small (1 to 4) and includes various pre-compiled binaries for different operating systems and hardware configurations.
b8079 (1 feature): This release updates the build process by linking ws2_32 as PUBLIC on Windows and provides a comprehensive set of pre-built binaries across multiple operating systems and hardware architectures.
b8078: This release includes a cleanup of the library linking logic in the build system. It also provides extensive pre-built binaries for macOS, Linux, Windows, and openEuler across various architectures and acceleration backends.
b8077 (2 features): This release introduces support for JoyAI-LLM-Flash conversion by updating tokenizer hash mappings and adding a new pre-tokenizer name for joyai-llm.
b8076 (1 feature): This release introduces proper batching support for the Perplexity integration and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various CPU/GPU architectures.
b8075 (1 feature): This release introduces inline functions for common operations and provides updated binary distributions for numerous platforms including macOS, Linux, and Windows with various hardware-acceleration options.
b8074 (1 feature): This release promotes `ggml_is_view` to a public API and renames an internal helper function from `ggml_aux_is_view` to `ggml_impl_is_view`.
b8073 (1 fix, 1 feature): This release introduces support for Tiny Aya models and includes fixes for tokenizer regex edge cases. It also provides numerous pre-compiled binaries for different operating systems and hardware configurations.
b8072: The build system was reworked to correctly handle the deprecation of llama_option_depr, specifically addressing LLAMA_CURL. Numerous pre-compiled binaries for diverse platforms and hardware configurations are now available.
b8071 (1 fix): This release refines the ROCm compilation workaround for ROCWMMA_FATTN/GFX9 to be conditional on newer ROCm versions, resolving an issue observed with ROCm 6.4.4.
b8070 (1 fix, 2 features): This release introduces model graph deduplication and updates for Qwen-family models, including the addition of `llm_build_delta_net_base`, alongside providing numerous pre-compiled binaries for diverse platforms.
b8069 (4 fixes): This release focuses on internal fixes within the graph and continuous modules, specifically addressing issues related to KQ mask reuse and adapter checks.
b8068 (1 fix, 2 features): This release introduces SVE optimization for aarch64 in the ggml kernel, improving performance on supported hardware, and includes extensive pre-compiled binaries for multiple platforms.
b8067 (1 feature): This release primarily updates the binary distributions for ggml synchronization, providing new builds for macOS, Linux, Windows, and openEuler targeting various CPU/GPU architectures.
b8064 (Breaking; 1 fix, 3 features): This release focuses heavily on CUDA performance optimizations for iq2xxs/iq2xs/iq3xxs dequantization, including register savings and algorithmic simplification, alongside fixing a type definition issue.
b8062: The LLAMA_HTTPLIB build option was removed because cpp-httplib now compiles correctly on visionOS, simplifying the build process. This release also provides updated binary distributions for macOS, Linux, Windows, and openEuler.
b8061 (Breaking; 1 fix): This release addresses a build issue related to the KleidiAI backend when compiling multiple CPU backends by correcting the use of CMake's FetchContent functions. It also provides updated pre-compiled binaries for numerous platforms.
b8060 (1 fix): This release fixes a bug concerning output reordering when backend sampling is used. It also provides updated pre-compiled binaries for numerous operating systems and hardware configurations.
b8059 (1 fix): This release improves stability by avoiding undefined behavior in the ggml gemm ukernel. It also provides updated pre-built binaries for numerous platforms and hardware configurations.
b8058 (1 feature): This release introduces performance optimizations for the ggml CPU backend, specifically targeting the ggml_vec_dot_bf16 function on the s390x architecture, alongside updated pre-compiled binaries.
b8057 (7 fixes, 2 features): This release introduces significant performance enhancements to ggml-cpu via a new GEMM microkernel and addresses several low-level implementation details and warnings. It also provides extensive new pre-compiled binaries for various platforms and accelerators.
b8056 (1 fix): This release fixes a CMake issue where the KleidiAI install target failed when using EXCLUDE_FROM_ALL, ensuring proper exclusion while maintaining functionality.
b8054 (2 features): This release adds support for Nemotron Nano 12B v2 VL models and simplifies related code. It also pre-downsamples position embeddings during GGUF conversion to handle a fixed input size.
b8053 (4 features): This release focuses on internal optimizations, primarily targeting Qwen3Next model graph execution and refining chunking logic by removing redundancy and avoiding mask passing.
b8052 (1 fix): This release addresses a bug in GGML related to the interaction between GGML_DEBUG and OpenMP compilation flags. It also provides updated pre-built binaries for numerous operating systems and hardware configurations.
Common Errors
DeviceLostError (2 reports): vk::DeviceLostError usually signifies that the GPU has encountered an unrecoverable error, often due to exceeding memory limits or hitting a timeout. Reduce the model size, lower batch sizes, or decrease the maximum context length to alleviate memory pressure. Alternatively, increase the GPU timeout duration in your system or driver settings if that is permissible.
InternalServerError (2 reports): InternalServerError in llama.cpp often arises from unsupported model architectures or operations, such as attempting multimodal input with a model not designed for it or faulty tool calling within a specific model. To resolve this, verify that the model supports the requested operation, and update llama.cpp to the latest version or switch to a model known to work with multimodal input or tool calling. If issues persist, inspect the model's configuration, particularly its handling of vision or function calling, and revise your prompts accordingly.
FileNotFoundError (1 report): A FileNotFoundError in llama.cpp usually means a required file path, often a model or tokenizer component, isn't valid or the file doesn't exist at that location. Double-check the path specified in your command-line arguments or configuration files for typos and ensure the necessary files are actually present in the indicated directory. If converting from Hugging Face, make sure all required files, such as "tokenizer.model", were downloaded correctly.
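A quick pre-flight check along these lines can turn a vague FileNotFoundError into an actionable message. A sketch in Python (the required file names are examples, not an exhaustive list):

```python
from pathlib import Path

def check_model_files(model_dir: str, required=("tokenizer.model",)) -> list:
    """Return the missing items so the caller can fail with a clear message."""
    root = Path(model_dir)
    if not root.is_dir():
        # The directory itself is absent; report it rather than its contents.
        return [str(root)]
    return [name for name in required if not (root / name).is_file()]
```

Running this before invoking a conversion script reports every missing file at once instead of failing on the first open().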
NotImplementedError (1 report): NotImplementedError usually arises when a function or method is called without a concrete implementation in the current class or codebase. To fix it, locate the function raising the error and either implement the missing functionality or call a different function. If you are converting models, ensure the conversion script supports the specific model architecture being used.
Related AI & LLMs Packages
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
🦜🔗 The platform for reliable agents.
The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
A high-throughput and memory-efficient inference and serving engine for LLMs