Change8

llama.cpp

AI & LLMs

LLM inference in C/C++

Latest: b9352100 releases3 breaking changes4 common errorsView on GitHub

Release History

b93523 fixes
May 26, 2026

This release addresses naming inconsistencies and a print issue within the ggml-zendnn backend. It also provides updated pre-compiled binaries across multiple operating systems and hardware configurations.

b93512 features
May 26, 2026

This release provides a comprehensive set of pre-built binaries for various operating systems (macOS, Linux, Android, Windows) and hardware targets, including updates for specific GPU backends like ROCm 7.2 and CUDA 12.4/13.1.

b93341 fix
May 26, 2026

This release addresses a CUDA-related bug concerning PDL synchronization for FWHT operations and provides a comprehensive set of pre-built binaries for numerous platforms including macOS, Linux, Android, Windows, and openEuler.

b93331 feature
May 26, 2026

This release introduces the ability to retrieve the Apple device ID in the metal backend and provides updated pre-compiled binaries for various operating systems and hardware configurations.

b9331
May 26, 2026

This release focuses on restructuring Continuous Integration (CI) workflows by splitting jobs into separate workflows for better organization and build management. It also provides a comprehensive set of pre-built binaries for macOS, Linux, Android, Windows, and openEuler platforms.

b93301 fix
May 26, 2026

This release corrects the tensor operation tagging for ffn_latent in Nemotron models, resolving a loading issue that negatively impacted performance. Various pre-compiled binaries for different platforms are also provided.

b93292 features
May 26, 2026

This release introduces a fast Walsh-Hadamard transform for CUDA and updates internal kernel logic, including setting the warp size to 64.

b9326
May 26, 2026

This release primarily focuses on providing pre-compiled binaries for various operating systems and hardware configurations, including updates for macOS, Linux, Android, Windows, and openEuler platforms.

b93202 fixes
May 25, 2026

This release addresses critical issues by fixing the ggml context size calculation and resolving a memory leak. It also involves internal restructuring by moving the split state cache back into the context.

b93197 fixes3 features
May 25, 2026

This release introduces new GGUF initialization functions (`gguf_init_from_callback`, `gguf_init_from_buffer`) and resolves several memory management and offset calculation bugs within the GGUF reader implementation.

b93181 fix
May 25, 2026

This release includes a fix where the MTP layer kv-cache now correctly respects the draft type ctk. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b9315
May 25, 2026

This release documents a limitation in the llama module, stating that only one on-device state can be saved per sequence, and provides numerous pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler.

b93131 feature
May 25, 2026

This release introduces performance enhancements to ggml by parallelizing the initialization of quantization look-up tables using OpenMP. It also provides numerous pre-built binaries for various operating systems and hardware configurations.

b9311
May 25, 2026

This release updates the vendored cpp-httplib dependency to version 0.45.1 and provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b93103 fixes2 features
May 25, 2026

This release focuses on improving server checkpoint creation reliability, especially for chat and multimodal prompts, and includes various platform-specific binary updates. A new configuration option `--checkpoint-min-step` has been added to manage checkpoint frequency.

b93052 fixes
May 24, 2026

This release includes fixes for the CMake build system, specifically addressing the UI build by setting -fPIC for the static library and renaming a helper function. It also provides updated binaries across macOS, Linux, Android, Windows, and openEuler platforms.

b92971 fix2 features
May 23, 2026

This release introduces support for NVFP4 MTP scale tensors and links Qwen3.5 MTP tensors, alongside minor internal alignment fixes.

b92961 fix
May 23, 2026

This release includes a bug fix within the ggml library related to interface method checking. It also provides numerous pre-compiled binaries for different operating systems and hardware configurations.

b92951 fix
May 23, 2026

This release primarily addresses a build issue related to SPIRV-Headers on Windows for Vulkan builds. It also provides extensive pre-compiled binaries across macOS, Linux, Android, Windows, and openEuler platforms.

b92941 feature
May 23, 2026

This release introduces a generalization for Adreno MoE kernels on OpenCL and provides extensive pre-compiled binaries across macOS, Linux, Android, Windows, and openEuler platforms.

b92912 features
May 22, 2026

This release significantly improves MoE prefill throughput on SYCL by optimizing the expert routing calculation complexity. It also provides a comprehensive set of pre-built binaries for diverse hardware and operating system targets.

b92921 fix
May 22, 2026

This release addresses a bug fix for an integer overflow in perplexity calculation and provides numerous pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler targets.

b92901 fix1 feature
May 22, 2026

This release centralizes Level Zero detection in the SYCL backend initialization function and restores a previously removed warning message.

b92891 feature
May 22, 2026

This release introduces a gating mechanism for SYCL delta net calculation when K > 1 and provides numerous pre-compiled binaries across various operating systems and hardware configurations.

b92861 feature
May 22, 2026

This release introduces Q8_0 quantization support for the ggml-zendnn backend and includes synchronization updates for that backend. Various pre-compiled binaries for different operating systems and hardware configurations are provided.

b9285
May 22, 2026

This release focuses on updating build configurations, specifically ensuring the router app is only built during standalone builds via CMake changes, and provides extensive pre-built binaries for numerous platforms.

b92841 fix
May 22, 2026

This release addresses a critical bug in the HybridDNA tokenizer to prevent BPE token collisions and includes numerous pre-compiled binaries for various operating systems and hardware configurations.

b92835 fixes
May 22, 2026

This release primarily addresses build system issues by ensuring shared implementation libraries are correctly installed via CMake and fixes various continuous integration build failures across Apple and Android platforms.

b92795 fixes1 feature
May 22, 2026

This release introduces significant performance improvements to the Vulkan backend by fusing the snake activation sequence into a single kernel. Several internal refinements were made to the fusion logic, including stricter type and dimension checks.

b9277
May 22, 2026

This release focuses on internal maintenance by moving the save-load-state example into the test suite and updating continuous integration workflows. Numerous pre-built binaries for macOS, Linux, Android, Windows, and openEuler are provided.

b92761 feature
May 22, 2026

The server now exposes detailed prompt token counts via the /slots endpoint, enhancing monitoring capabilities. This release also includes a wide array of pre-compiled binaries for different platforms.

b92751 fix2 features
May 21, 2026

This release focuses on performance optimizations for Metal, specifically improving the concat kernel with row batching and fixing the set kernel threads. Extensive internal testing refactoring was also performed for CPY shape operations.

b92742 fixes
May 21, 2026

This release fixes a critical VRAM leak occurring during server sleep/resume cycles for Multi-Token Prediction (MTP) models by improving resource cleanup in the destroy function. It also provides numerous pre-compiled binaries for various platforms and hardware configurations.

b92731 fix
May 21, 2026

This release includes a fix for the server component where the subcommand was not being re-injected when the router spawned child processes under a unified binary structure. Various pre-compiled binaries for different platforms are also provided.

b92724 features
May 21, 2026

This release introduces several new application features including batched benchmarking, parameter fitting, quantization, and perplexity calculation. It also provides a comprehensive set of pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler across various CPU/GPU backends.

b92711 fix
May 21, 2026

This release optimizes performance by skipping redundant logit computations during draft model follow-up decoding. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b9270Breaking2 fixes2 features
May 21, 2026

This release introduces full support for the Carbon-3B tokenizer by promoting its specialized DNA handling logic into a new top-level vocabulary type, LLAMA_VOCAB_TYPE_HYBRIDDNA. This involved significant refactoring of tokenizer initialization and conversion logic to align with existing tokenizer family conventions.

b92671 fix
May 21, 2026

This release includes an internal fix in ggml related to 2D tensor operations and provides updated pre-compiled binaries across macOS, Linux, Android, Windows, and openEuler platforms.

b92662 fixes
May 21, 2026

This release addresses a critical null-buffer crash occurring in graph input processing for models with specific attention layer configurations (SWA-only or zero SWA layers). Fixes include adding necessary buffer checks and preventing null dereferences during tensor reuse checks.

b92651 fix1 feature
May 21, 2026

This release focuses on internal optimizations and fixes for the SSM-CONV backend, including better handling for large prompts and resolving an issue in hex-rope related to cache initialization.

b92641 feature
May 21, 2026

This release introduces the ability to display the application version and provides a comprehensive set of pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler across various CPU/GPU architectures and acceleration frameworks.

b92631 fix2 features
May 21, 2026

This release merges HunyuanOCR into HunyuanVL, resolving OCR vision precision issues by aligning its sampling method with the reference implementation. Numerous platform-specific binaries are provided.

b92602 fixes3 features
May 21, 2026

This release focuses on refactoring the OpenCL backend initialization, improving GPU identification, and optimizing kernel loading for argsort and flash_attn operations.

b92591 fix
May 21, 2026

This release addresses a critical nullptr crash in the speculative decoding path related to device enumeration. It primarily contains a low-level fix within the common/speculative module.

b92582 fixes2 features
May 21, 2026

This release focuses heavily on DeepSeek-OCR image processing fixes and refactoring to match Pillow parity, alongside minor fixes for llama-chat and internal code structure improvements.

b92571 feature
May 21, 2026

This release includes an optimization for the Vulkan IM2COL shader and provides updated binary distributions for numerous operating systems and hardware configurations.

b92551 fix2 features
May 21, 2026

This release focuses on reworking the HMX quantized matmul implementation on Hexagon, including updates to dequant logic and removal of non-pipelined versions. It also includes minor platform-specific updates and bug fixes.

b92544 fixes3 features
May 20, 2026

This release introduces Programmatic Dependent Launch (PDL) for significant performance improvements on Hopper+ NVIDIA GPUs by optimizing kernel execution overlap. Several fixes were implemented to correctly enable/disable PDL based on hardware architecture and environment settings.

b9253Breaking1 fix2 features
May 20, 2026

This release introduces a unified llama executable for the application and standardizes server operations using the 'serve' command. Build targets have been updated, and a revert restored previous STATIC behavior.

b92513 features
May 20, 2026

This release updates mtmd fit_params to include mmproj, renames a utility function, and adds support for ggml_backend_dev_t along with debug logging.

b92473 features
May 20, 2026

This release introduces performance optimizations for pad and copy operations on the metal backend, alongside improvements to threadgroup row packing.

b92451 feature
May 20, 2026

This release includes tuning for ggml-cuda RDNA3 Q6_K MMVQ nwarps and provides a comprehensive set of pre-compiled binaries for numerous operating systems and hardware configurations.

b92441 feature
May 20, 2026

This release introduces OpenCL support for MoE models using q4_k, q5_k, and q6_k quantization on Adreno GPUs and provides updated binaries across multiple operating systems and architectures.

b92431 feature
May 20, 2026

This release introduces MROPE and IMROPE support within the HTP rope operation and provides numerous pre-compiled binaries across multiple operating systems and hardware configurations.

b92358 fixes2 features
May 20, 2026

This release focuses on MTP clean-up, primarily affecting speculative decoding implementations by fixing parameter handling, re-enabling certain configurations, and updating documentation. Several deprecated CLI options for speculative decoding were removed.

b92401 fix
May 20, 2026

This release primarily focuses on distributing pre-compiled binaries across various platforms and hardware configurations, including fixes for command-line help output.

b92391 fix
May 20, 2026

This release primarily focuses on distributing pre-compiled binaries across various platforms and hardware configurations, including fixes for verbosity settings.

b92224 fixes2 features
May 19, 2026

This release introduces support for the TRI operation within the Hexagon backend, alongside various cleanups and fixes related to merge conflicts and configuration errors in the Hexagon and GGML components.

b92212 fixes1 feature
May 18, 2026

This release introduces support for the PAD operation on the Hexagon HTP backend via HVX kernels and resolves minor merge conflicts and configuration issues in the Hexagon implementation.

b9219
May 18, 2026

This release removes the Hugging Face cache migration process and provides a comprehensive set of pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler targeting various CPU/GPU backends.

b92165 fixes2 features
May 18, 2026

This release refactors the UI models store, MCP service, and gate logs, scoping console output based on VITE_DEBUG environment variables. It also includes several deduplication and cleanup fixes in the model fetching logic.

b92131 feature
May 18, 2026

This release introduces an initialization of the pre-norm embedding mask flag within the llama module and provides updated binary distributions for macOS, Linux, Android, Windows, and openEuler targeting various CPU/GPU backends.

b92091 feature
May 18, 2026

This release introduces a SYCL optimization for the Q6_K MMVQ dot product via scalar SWAR byte-subtract. It also provides extensive pre-compiled binaries across multiple operating systems and hardware targets.

b92081 feature
May 18, 2026

This release introduces an optimization for SYCL by routing small f32 matmuls to oneMKL. It also provides extensive pre-compiled binaries across multiple operating systems and hardware configurations.

b92041 feature
May 18, 2026

This release introduces support for d_conv=15 within ssm-conv.cu, expanding configuration options for SSM convolutions. Numerous pre-compiled binaries for various operating systems and hardware configurations are provided.

b92031 fix
May 18, 2026

This release primarily addresses a bug fix related to the LLAMA_BUILD_UI logic within the CMake build system and provides updated pre-compiled binaries across multiple operating systems and hardware configurations.

b9202
May 17, 2026

This release primarily focuses on providing updated pre-compiled binaries across numerous platforms and hardware configurations. The CMake build system no longer installs a conversion script.

b92001 fix1 feature
May 17, 2026

This release optimizes prompt decoding performance in MTP for llama models and includes a fix for llama-graph. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b91981 fix
May 17, 2026

This release primarily addresses build configuration issues for ggml-vulkan on macOS CI by ensuring SPIRV-Headers are correctly located during CMake setup. Numerous pre-built binaries for various platforms are also provided.

b91971 feature
May 17, 2026

This release introduces new bf16 to f32 copy pipelines for the Vulkan backend and provides updated pre-compiled binaries for numerous platforms including macOS, Linux, Android, Windows, and openEuler.

b91961 feature
May 17, 2026

This release introduces support for unaligned tensors when using ROPE acceleration via Vulkan. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b91941 feature
May 17, 2026

This release introduces a performance optimization on the Vulkan backend by fusing SSM_CONV, BIAS, and SILU operations. It also provides extensive pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler.

b91931 fix2 features
May 17, 2026

The server logic for embedding normalization has been updated to correctly handle the --embd-normalize CLI argument and use a configurable default value. Numerous platform-specific binaries are also provided.

b91921 fix
May 17, 2026

This release focuses on reducing noisy logging within the ngram component and provides updated binary distributions for macOS, Linux, Android, Windows, and openEuler platforms.

b91901 feature
May 16, 2026

This release includes an internal change in the server router to allocate temporary buffers on the heap. It also provides extensive pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler platforms supporting various CPU/GPU backends.

b91891 fix
May 16, 2026

This release addresses an issue by skipping device enumeration in server router mode to avoid unnecessary CUDA primary context creation. It also provides updated binary distributions for numerous operating systems and hardware architectures.

b9186
May 16, 2026

This release focuses on synchronizing with ggml and providing pre-compiled binaries for a wide range of platforms including macOS, Linux, Android, Windows, and openEuler with various hardware acceleration options.

b9181
May 16, 2026

This release updates the bundled vendor library cpp-httplib to version 0.45.0 and provides extensive pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler targeting different CPU/GPU backends.

b918015 fixes5 features
May 16, 2026

This release introduces significant enhancements to speculative decoding by adding MTP support and enabling partial rollback capabilities across CPU, Vulkan, and Metal backends. Numerous bug fixes were applied across conversion, server logic, and memory handling.

b91746 fixes6 features
May 16, 2026

This release focuses heavily on renaming and restructuring the UI components, standardizing naming from 'webui' to 'ui' across the repository, CMake variables, CLI flags, and internal structures, while maintaining backward compatibility.

b91731 fix
May 15, 2026

This release primarily addresses a fix for release symlinks in the continuous integration process. It also provides updated pre-compiled binaries for numerous operating systems and hardware configurations.

b91721 fix
May 15, 2026

This release updates the web UI checksum verification to use lowercase hashes and provides a comprehensive set of pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler targeting various CPU/GPU backends.

b91693 fixes2 features
May 15, 2026

This release introduces chunking support and preprocessing fixes for the qwen3a model within mtmd, alongside several minor internal adjustments and the distribution of new binaries for various platforms.

b91651 fix
May 15, 2026

This release includes a fix for an issue related to the transformation of the top . entry in the release archive, alongside the distribution of updated binaries for numerous operating systems and hardware configurations.

b91631 fix
May 15, 2026

This release addresses an issue where clone operations were not performing a deep copy and provides updated binary distributions for macOS, Linux, Android, Windows, and openEuler targeting various hardware and acceleration backends.

b91612 fixes1 feature
May 15, 2026

This release introduces support for Codex CLI by selectively skipping unsupported Responses tools and includes fixes related to gpt-oss apply_patch handling. It also provides extensive pre-compiled binaries for multiple platforms.

b91591 feature
May 15, 2026

This release focuses on performance improvements within the ggml-hexagon backend by adding an optimized fast-path for reshape copy operations.

b9158Breaking3 features
May 14, 2026

This release introduces RDNA3 support for the CUDA mma FA kernel and includes performance tuning for RDNA3, RDNA4, and CDNA architectures, while noting a change in accumulator data layout for RDNA3/4 optimizations.

b91564 fixes1 feature
May 14, 2026

This release enables NVIDIA self-hosted CI for ggml-webgpu, addresses several precision and placement issues within WebGPU builds, and provides extensive pre-compiled binaries across multiple operating systems and hardware configurations.

b91513 fixes1 feature
May 14, 2026

This release includes minor fixes across logging, arguments, and server components, along with significant updates to pre-built binaries supporting various hardware accelerators like Vulkan, ROCm, SYCL, and CUDA.

b91501 feature
May 14, 2026

This release introduces IME2 instruction support for the SpacemiT backend within ggml-cpu and provides updated binary distributions for numerous operating systems and hardware targets.

b9148
May 14, 2026
b91453 fixes3 features
May 14, 2026

This release introduces a major fix for SYCL multi-GPU systems by switching to Level Zero memory allocations (zeMemAllocDevice) to prevent system RAM exhaustion. It also adds compile-time and runtime flags to manage the Level Zero path selection.

b91441 fix
May 14, 2026

This release includes a targeted optimization for ggml-webgpu performance based on head dimension divisibility and provides updated binaries across numerous platforms including macOS, Linux, Android, Windows, and openEuler.

b91431 fix
May 14, 2026

This release primarily addresses a numerical stability issue by ensuring intermediate calculations use float casting to avoid operator ambiguity. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b91422 fixes2 features
May 14, 2026

This release focuses on expanding OpenCL support for MoE models on Adreno GPUs, adding q5_0 and q5_1 quantization levels, alongside general stability improvements.

b91411 fix2 features
May 14, 2026

This release introduces support for the `continue_final_message` flag in the server and WebUI to align with the vLLM API, ensuring correct behavior when continuing final messages during generation.

b91401 fix
May 14, 2026

This release addresses a specific crash related to OpenCL MoE warmups on Adreno devices and provides updated binary distributions for numerous operating systems and hardware targets.

b91391 fix
May 13, 2026

This release includes a fix for flushing the GPU profile timestamp before queryset overflow. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.

b91341 fix
May 13, 2026

This release updates the download utility to prevent premature exiting on errors and provides extensive pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler across various CPU/GPU backends.

Common Errors

NotImplementedError2 reports

NotImplementedError in llama-cpp often arises when attempting to use a feature or model architecture that hasn't yet been fully implemented in the conversion or evaluation code. To resolve this, either update to the latest version of llama-cpp which may include the necessary implementation or contribute the missing functionality by implementing the required logic for the specific operator/model architecture and submitting a pull request. If an update is not available, using a model known to work can also provide a workaround.

DeviceLostError2 reports

DeviceLostError in llama-cpp usually indicates the GPU lost connection or encountered a critical error, often due to out-of-memory issues or driver instability, especially with Vulkan. Try reducing the model size, batch size, or number of threads to decrease GPU memory usage or updating to the latest GPU drivers to resolve potential driver bugs. Consider using a different backend like CUDA or Metal if available to circumvent Vulkan-specific problems.

InternalServerError2 reports

InternalServerError in llama-cpp often arises from unsupported model architectures or operations, such as attempting multimodal input with a model not designed for it or faulty tool calling within a specific model. To resolve this, verify model compatibility with the requested operation in your code, and update llama-cpp to the latest version or use a compatible model known to work with multimodal inputs or tool calling. If issues persist, inspect the model's configuration, particularly its handling of vision or function calling, and revise your prompts accordingly.

FileNotFoundError1 report

The "FileNotFoundError" in llama-cpp usually means a required file path, often a model or tokenizer component, isn't valid or the file doesn't exist at that location. Double-check the path specified in your command-line arguments or configuration files for typos and ensure the necessary files are actually present in the indicated directory. If converting from Hugging Face, ensure all necessary files, like "tokenizer.model", were downloaded correctly.

Related AI & LLMs Packages

Subscribe to Updates

Get notified when new versions are released

RSS Feed