llama.cpp
AI & LLMs
LLM inference in C/C++
Release History
b9352 (3 fixes): This release addresses naming inconsistencies and a print issue within the ggml-zendnn backend. It also provides updated pre-compiled binaries across multiple operating systems and hardware configurations.
b9351 (2 features): This release provides a comprehensive set of pre-built binaries for various operating systems (macOS, Linux, Android, Windows) and hardware targets, including updates for specific GPU backends like ROCm 7.2 and CUDA 12.4/13.1.
b9334 (1 fix): This release addresses a CUDA-related bug concerning PDL synchronization for FWHT operations and provides a comprehensive set of pre-built binaries for numerous platforms including macOS, Linux, Android, Windows, and openEuler.
b9333 (1 feature): This release introduces the ability to retrieve the Apple device ID in the Metal backend and provides updated pre-compiled binaries for various operating systems and hardware configurations.
b9331: This release restructures the Continuous Integration (CI) workflows by splitting jobs into separate workflows for better organization and build management. It also provides a comprehensive set of pre-built binaries for macOS, Linux, Android, Windows, and openEuler platforms.
b9330 (1 fix): This release corrects the tensor operation tagging for ffn_latent in Nemotron models, resolving a loading issue that negatively impacted performance. Various pre-compiled binaries for different platforms are also provided.
b9329 (2 features): This release introduces a fast Walsh-Hadamard transform for CUDA and updates internal kernel logic, including setting the warp size to 64.
b9326: This release primarily provides pre-compiled binaries for various operating systems and hardware configurations, including updates for macOS, Linux, Android, Windows, and openEuler platforms.
b9320 (2 fixes): This release fixes the ggml context size calculation and resolves a memory leak. It also moves the split state cache back into the context.
b9319 (7 fixes, 3 features): This release introduces new GGUF initialization functions (`gguf_init_from_callback`, `gguf_init_from_buffer`) and resolves several memory management and offset calculation bugs within the GGUF reader implementation.
b9318 (1 fix): This release includes a fix where the MTP layer kv-cache now correctly respects the draft type ctk. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b9315: This release documents a limitation in the llama module, stating that only one on-device state can be saved per sequence, and provides numerous pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler.
b9313 (1 feature): This release speeds up ggml by parallelizing the initialization of quantization look-up tables using OpenMP. It also provides numerous pre-built binaries for various operating systems and hardware configurations.
b9311: This release updates the vendored cpp-httplib dependency to version 0.45.1 and provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b9310 (3 fixes, 2 features): This release focuses on improving server checkpoint creation reliability, especially for chat and multimodal prompts, and includes various platform-specific binary updates. A new configuration option `--checkpoint-min-step` has been added to manage checkpoint frequency.
b9305 (2 fixes): This release includes fixes for the CMake build system, specifically addressing the UI build by setting -fPIC for the static library and renaming a helper function. It also provides updated binaries across macOS, Linux, Android, Windows, and openEuler platforms.
b9297 (1 fix, 2 features): This release introduces support for NVFP4 MTP scale tensors and links Qwen3.5 MTP tensors, alongside minor internal alignment fixes.
b9296 (1 fix): This release includes a bug fix within the ggml library related to interface method checking. It also provides numerous pre-compiled binaries for different operating systems and hardware configurations.
b9295 (1 fix): This release primarily addresses a build issue related to SPIRV-Headers on Windows for Vulkan builds. It also provides extensive pre-compiled binaries across macOS, Linux, Android, Windows, and openEuler platforms.
b9294 (1 feature): This release introduces a generalization for Adreno MoE kernels on OpenCL and provides extensive pre-compiled binaries across macOS, Linux, Android, Windows, and openEuler platforms.
b9291 (2 features): This release significantly improves MoE prefill throughput on SYCL by reducing the complexity of the expert routing calculation. It also provides a comprehensive set of pre-built binaries for diverse hardware and operating system targets.
b9292 (1 fix): This release fixes an integer overflow in perplexity calculation and provides numerous pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler targets.
b9290 (1 fix, 1 feature): This release centralizes Level Zero detection in the SYCL backend initialization function and restores a previously removed warning message.
b9289 (1 feature): This release introduces a gating mechanism for SYCL delta net calculation when K > 1 and provides numerous pre-compiled binaries across various operating systems and hardware configurations.
b9286 (1 feature): This release introduces Q8_0 quantization support for the ggml-zendnn backend and includes synchronization updates for that backend. Various pre-compiled binaries for different operating systems and hardware configurations are provided.
b9285: This release updates build configurations, specifically ensuring the router app is only built during standalone builds via CMake changes, and provides extensive pre-built binaries for numerous platforms.
b9284 (1 fix): This release addresses a critical bug in the HybridDNA tokenizer to prevent BPE token collisions and includes numerous pre-compiled binaries for various operating systems and hardware configurations.
b9283 (5 fixes): This release primarily addresses build system issues by ensuring shared implementation libraries are correctly installed via CMake and fixes various continuous integration build failures across Apple and Android platforms.
b9279 (5 fixes, 1 feature): This release introduces significant performance improvements to the Vulkan backend by fusing the snake activation sequence into a single kernel. Several internal refinements were made to the fusion logic, including stricter type and dimension checks.
b9277: This release focuses on internal maintenance by moving the save-load-state example into the test suite and updating continuous integration workflows. Numerous pre-built binaries for macOS, Linux, Android, Windows, and openEuler are provided.
b9276 (1 feature): The server now exposes detailed prompt token counts via the /slots endpoint, enhancing monitoring capabilities. This release also includes a wide array of pre-compiled binaries for different platforms.
b9275 (1 fix, 2 features): This release focuses on performance optimizations for Metal, specifically improving the concat kernel with row batching and fixing the set kernel threads. Extensive internal testing refactoring was also performed for CPY shape operations.
b9274 (2 fixes): This release fixes a critical VRAM leak occurring during server sleep/resume cycles for Multi-Token Prediction (MTP) models by improving resource cleanup in the destroy function. It also provides numerous pre-compiled binaries for various platforms and hardware configurations.
b9273 (1 fix): This release includes a fix for the server component where the subcommand was not being re-injected when the router spawned child processes under a unified binary structure. Various pre-compiled binaries for different platforms are also provided.
b9272 (4 features): This release introduces several new application features including batched benchmarking, parameter fitting, quantization, and perplexity calculation. It also provides a comprehensive set of pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler across various CPU/GPU backends.
b9271 (1 fix): This release optimizes performance by skipping redundant logit computations during draft model follow-up decoding. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b9270 (Breaking, 2 fixes, 2 features): This release introduces full support for the Carbon-3B tokenizer by promoting its specialized DNA handling logic into a new top-level vocabulary type, LLAMA_VOCAB_TYPE_HYBRIDDNA. This involved significant refactoring of tokenizer initialization and conversion logic to align with existing tokenizer family conventions.
b9267 (1 fix): This release includes an internal fix in ggml related to 2D tensor operations and provides updated pre-compiled binaries across macOS, Linux, Android, Windows, and openEuler platforms.
b9266 (2 fixes): This release addresses a critical null-buffer crash occurring in graph input processing for models with specific attention layer configurations (SWA-only or zero SWA layers). Fixes include adding necessary buffer checks and preventing null dereferences during tensor reuse checks.
b9265 (1 fix, 1 feature): This release focuses on internal optimizations and fixes for the SSM-CONV backend, including better handling for large prompts and resolving an issue in hex-rope related to cache initialization.
b9264 (1 feature): This release introduces the ability to display the application version and provides a comprehensive set of pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler across various CPU/GPU architectures and acceleration frameworks.
b9263 (1 fix, 2 features): This release merges HunyuanOCR into HunyuanVL, resolving OCR vision precision issues by aligning its sampling method with the reference implementation. Numerous platform-specific binaries are provided.
b9260 (2 fixes, 3 features): This release focuses on refactoring the OpenCL backend initialization, improving GPU identification, and optimizing kernel loading for argsort and flash_attn operations.
b9259 (1 fix): This release addresses a critical nullptr crash in the speculative decoding path related to device enumeration. It primarily contains a low-level fix within the common/speculative module.
b9258 (2 fixes, 2 features): This release focuses heavily on DeepSeek-OCR image processing fixes and refactoring to match Pillow parity, alongside minor fixes for llama-chat and internal code structure improvements.
b9257 (1 feature): This release includes an optimization for the Vulkan IM2COL shader and provides updated binary distributions for numerous operating systems and hardware configurations.
b9255 (1 fix, 2 features): This release focuses on reworking the HMX quantized matmul implementation on Hexagon, including updates to dequant logic and removal of non-pipelined versions. It also includes minor platform-specific updates and bug fixes.
b9254 (4 fixes, 3 features): This release introduces Programmatic Dependent Launch (PDL) for significant performance improvements on Hopper+ NVIDIA GPUs by optimizing kernel execution overlap. Several fixes were implemented to correctly enable/disable PDL based on hardware architecture and environment settings.
b9253 (Breaking, 1 fix, 2 features): This release introduces a unified llama executable for the application and standardizes server operations using the 'serve' command. Build targets have been updated, and a revert restored previous STATIC behavior.
b9251 (3 features): This release updates mtmd fit_params to include mmproj, renames a utility function, and adds support for ggml_backend_dev_t along with debug logging.
b9247 (3 features): This release introduces performance optimizations for pad and copy operations on the Metal backend, alongside improvements to threadgroup row packing.
b9245 (1 feature): This release includes tuning for ggml-cuda RDNA3 Q6_K MMVQ nwarps and provides a comprehensive set of pre-compiled binaries for numerous operating systems and hardware configurations.
b9244 (1 feature): This release introduces OpenCL support for MoE models using q4_k, q5_k, and q6_k quantization on Adreno GPUs and provides updated binaries across multiple operating systems and architectures.
b9243 (1 feature): This release introduces MROPE and IMROPE support within the HTP rope operation and provides numerous pre-compiled binaries across multiple operating systems and hardware configurations.
b9235 (8 fixes, 2 features): This release focuses on MTP clean-up, primarily affecting speculative decoding implementations by fixing parameter handling, re-enabling certain configurations, and updating documentation. Several deprecated CLI options for speculative decoding were removed.
b9240 (1 fix): This release primarily focuses on distributing pre-compiled binaries across various platforms and hardware configurations, including fixes for command-line help output.
b9239 (1 fix): This release primarily focuses on distributing pre-compiled binaries across various platforms and hardware configurations, including fixes for verbosity settings.
b9222 (4 fixes, 2 features): This release introduces support for the TRI operation within the Hexagon backend, alongside various cleanups and fixes related to merge conflicts and configuration errors in the Hexagon and GGML components.
b9221 (2 fixes, 1 feature): This release introduces support for the PAD operation on the Hexagon HTP backend via HVX kernels and resolves minor merge conflicts and configuration issues in the Hexagon implementation.
b9219: This release removes the Hugging Face cache migration process and provides a comprehensive set of pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler targeting various CPU/GPU backends.
b9216 (5 fixes, 2 features): This release refactors the UI models store, MCP service, and gate logs, scoping console output based on VITE_DEBUG environment variables. It also includes several deduplication and cleanup fixes in the model fetching logic.
b9213 (1 feature): This release introduces an initialization of the pre-norm embedding mask flag within the llama module and provides updated binary distributions for macOS, Linux, Android, Windows, and openEuler targeting various CPU/GPU backends.
b9209 (1 feature): This release introduces a SYCL optimization for the Q6_K MMVQ dot product via scalar SWAR byte-subtract. It also provides extensive pre-compiled binaries across multiple operating systems and hardware targets.
b9208 (1 feature): This release introduces an optimization for SYCL by routing small f32 matmuls to oneMKL. It also provides extensive pre-compiled binaries across multiple operating systems and hardware configurations.
b9204 (1 feature): This release introduces support for d_conv=15 within ssm-conv.cu, expanding configuration options for SSM convolutions. Numerous pre-compiled binaries for various operating systems and hardware configurations are provided.
b9203 (1 fix): This release primarily addresses a bug fix related to the LLAMA_BUILD_UI logic within the CMake build system and provides updated pre-compiled binaries across multiple operating systems and hardware configurations.
b9202: This release primarily focuses on providing updated pre-compiled binaries across numerous platforms and hardware configurations. The CMake build system no longer installs a conversion script.
b9200 (1 fix, 1 feature): This release optimizes prompt decoding performance in MTP for llama models and includes a fix for llama-graph. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b9198 (1 fix): This release primarily addresses build configuration issues for ggml-vulkan on macOS CI by ensuring SPIRV-Headers are correctly located during CMake setup. Numerous pre-built binaries for various platforms are also provided.
b9197 (1 feature): This release introduces new bf16 to f32 copy pipelines for the Vulkan backend and provides updated pre-compiled binaries for numerous platforms including macOS, Linux, Android, Windows, and openEuler.
b9196 (1 feature): This release introduces support for unaligned tensors when using ROPE acceleration via Vulkan. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b9194 (1 feature): This release introduces a performance optimization on the Vulkan backend by fusing SSM_CONV, BIAS, and SILU operations. It also provides extensive pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler.
b9193 (1 fix, 2 features): The server logic for embedding normalization has been updated to correctly handle the --embd-normalize CLI argument and use a configurable default value. Numerous platform-specific binaries are also provided.
b9192 (1 fix): This release focuses on reducing noisy logging within the ngram component and provides updated binary distributions for macOS, Linux, Android, Windows, and openEuler platforms.
b9190 (1 feature): This release includes an internal change in the server router to allocate temporary buffers on the heap. It also provides extensive pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler platforms supporting various CPU/GPU backends.
b9189 (1 fix): This release addresses an issue by skipping device enumeration in server router mode to avoid unnecessary CUDA primary context creation. It also provides updated binary distributions for numerous operating systems and hardware architectures.
b9186: This release focuses on synchronizing with ggml and providing pre-compiled binaries for a wide range of platforms including macOS, Linux, Android, Windows, and openEuler with various hardware acceleration options.
b9181: This release updates the bundled vendor library cpp-httplib to version 0.45.0 and provides extensive pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler targeting different CPU/GPU backends.
b9180 (15 fixes, 5 features): This release introduces significant enhancements to speculative decoding by adding MTP support and enabling partial rollback capabilities across CPU, Vulkan, and Metal backends. Numerous bug fixes were applied across conversion, server logic, and memory handling.
b9174 (6 fixes, 6 features): This release focuses heavily on renaming and restructuring the UI components, standardizing naming from 'webui' to 'ui' across the repository, CMake variables, CLI flags, and internal structures, while maintaining backward compatibility.
b9173 (1 fix): This release primarily addresses a fix for release symlinks in the continuous integration process. It also provides updated pre-compiled binaries for numerous operating systems and hardware configurations.
b9172 (1 fix): This release updates the web UI checksum verification to use lowercase hashes and provides a comprehensive set of pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler targeting various CPU/GPU backends.
b9169 (3 fixes, 2 features): This release introduces chunking support and preprocessing fixes for the qwen3a model within mtmd, alongside several minor internal adjustments and the distribution of new binaries for various platforms.
b9165 (1 fix): This release includes a fix for an issue related to the transformation of the top . entry in the release archive, alongside the distribution of updated binaries for numerous operating systems and hardware configurations.
b9163 (1 fix): This release addresses an issue where clone operations were not performing a deep copy and provides updated binary distributions for macOS, Linux, Android, Windows, and openEuler targeting various hardware and acceleration backends.
b9161 (2 fixes, 1 feature): This release introduces support for Codex CLI by selectively skipping unsupported Responses tools and includes fixes related to gpt-oss apply_patch handling. It also provides extensive pre-compiled binaries for multiple platforms.
b9159 (1 feature): This release focuses on performance improvements within the ggml-hexagon backend by adding an optimized fast-path for reshape copy operations.
b9158 (Breaking, 3 features): This release introduces RDNA3 support for the CUDA mma FA kernel and includes performance tuning for RDNA3, RDNA4, and CDNA architectures, while noting a change in accumulator data layout for RDNA3/4 optimizations.
b9156 (4 fixes, 1 feature): This release enables NVIDIA self-hosted CI for ggml-webgpu, addresses several precision and placement issues within WebGPU builds, and provides extensive pre-compiled binaries across multiple operating systems and hardware configurations.
b9151 (3 fixes, 1 feature): This release includes minor fixes across logging, arguments, and server components, along with significant updates to pre-built binaries supporting various hardware accelerators like Vulkan, ROCm, SYCL, and CUDA.
b9150 (1 feature): This release introduces IME2 instruction support for the SpacemiT backend within ggml-cpu and provides updated binary distributions for numerous operating systems and hardware targets.
b9148, b9145 (3 fixes, 3 features): This release introduces a major fix for SYCL multi-GPU systems by switching to Level Zero memory allocations (zeMemAllocDevice) to prevent system RAM exhaustion. It also adds compile-time and runtime flags to manage the Level Zero path selection.
b9144 (1 fix): This release includes a targeted optimization for ggml-webgpu performance based on head dimension divisibility and provides updated binaries across numerous platforms including macOS, Linux, Android, Windows, and openEuler.
b9143 (1 fix): This release primarily addresses a numerical stability issue by ensuring intermediate calculations use float casting to avoid operator ambiguity. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b9142 (2 fixes, 2 features): This release focuses on expanding OpenCL support for MoE models on Adreno GPUs, adding q5_0 and q5_1 quantization levels, alongside general stability improvements.
b9141 (1 fix, 2 features): This release introduces support for the `continue_final_message` flag in the server and WebUI to align with the vLLM API, ensuring correct behavior when continuing final messages during generation.
b9140 (1 fix): This release addresses a specific crash related to OpenCL MoE warmups on Adreno devices and provides updated binary distributions for numerous operating systems and hardware targets.
b9139 (1 fix): This release includes a fix for flushing the GPU profile timestamp before queryset overflow. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b9134 (1 fix): This release updates the download utility to prevent premature exiting on errors and provides extensive pre-compiled binaries for macOS, Linux, Android, Windows, and openEuler across various CPU/GPU backends.
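Several entries above refer to a fast Walsh-Hadamard transform (FWHT) on CUDA. The release notes do not describe the kernel itself, so the following is only a minimal Python sketch of the textbook butterfly algorithm, included to show what the operation computes. The function name `fwht` is hypothetical and is not part of llama.cpp or ggml.

```python
def fwht(values):
    """Return the unnormalized Walsh-Hadamard transform of `values`.

    Illustrative sketch only: `fwht` is a hypothetical helper, not a
    llama.cpp/ggml function. The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)  # work on a copy so the caller's data is untouched
    h = 1
    while h < n:
        # Combine blocks of size h pairwise: (a, b) -> (a + b, a - b).
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))  # [4, 2, 0, -2, 0, 2, 0, 2]
```

The butterfly runs in O(n log n) additions instead of the O(n^2) of a dense Hadamard matrix multiply. Applying the unnormalized transform twice returns the input scaled by n.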
Common Errors
NotImplementedError (2 reports): NotImplementedError in llama.cpp often arises when attempting to use a feature or model architecture that has not yet been fully implemented in the conversion or evaluation code. To resolve this, update to the latest version of llama.cpp, which may include the necessary implementation, or contribute the missing functionality by implementing the required logic for the specific operator or model architecture and submitting a pull request. If no update is available, using a model known to work is a workaround.
DeviceLostError (2 reports): DeviceLostError in llama.cpp usually indicates that the GPU lost connection or encountered a critical error, often due to out-of-memory conditions or driver instability, especially with Vulkan. Try reducing the model size, batch size, or number of threads to decrease GPU memory usage, or update to the latest GPU drivers to resolve potential driver bugs. Consider using a different backend such as CUDA or Metal, if available, to circumvent Vulkan-specific problems.
InternalServerError (2 reports): InternalServerError in llama.cpp often arises from unsupported model architectures or operations, such as attempting multimodal input with a model not designed for it, or faulty tool calling within a specific model. To resolve this, verify that the model supports the requested operation, then update llama.cpp to the latest version or switch to a model known to work with multimodal inputs or tool calling. If issues persist, inspect the model's configuration, particularly its handling of vision or function calling, and revise your prompts accordingly.
FileNotFoundError (1 report): FileNotFoundError in llama.cpp usually means a required file path, often a model or tokenizer component, is invalid or the file does not exist at that location. Double-check the path specified in your command-line arguments or configuration files for typos, and ensure the necessary files are actually present in the indicated directory. If converting from Hugging Face, ensure all necessary files, such as "tokenizer.model", were downloaded correctly.
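For the FileNotFoundError case, a quick pre-flight check can confirm that the expected files exist before a conversion or load is attempted. The sketch below is a generic illustration: the helper name `missing_model_files` and its default file list are assumptions for this example (only "tokenizer.model" comes from the text above), and the files a given model actually needs may differ.

```python
from pathlib import Path


def missing_model_files(model_dir, required=("tokenizer.model",)):
    """Return the names in `required` that are not regular files in model_dir.

    Hypothetical helper for illustration; adjust `required` to the files
    your model actually ships with.
    """
    base = Path(model_dir)
    return [name for name in required if not (base / name).is_file()]


# Example: report what is missing before starting a conversion.
missing = missing_model_files("./my-model")  # "./my-model" is a placeholder path
if missing:
    print("Missing files:", ", ".join(missing))
```

An empty list means every listed file was found, so a remaining FileNotFoundError would point to a different path than the one checked.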