llama.cpp
AI & LLMs: LLM inference in C/C++
Release History
b8182 (1 feature): This release updates the bundled miniaudio library to version 0.11.24 and provides a comprehensive set of pre-built binaries for numerous operating systems and hardware configurations.
b8181: This release updates the bundled cpp-httplib dependency to version 0.35.0 and provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b8180 (3 fixes, 3 features): This release introduces model metadata loading from Hugging Face for testing purposes, along with optimizations for incremental downloading and fixes related to compilation conditions.
b8179 (3 fixes, 6 features): This release introduces significant performance enhancements for AMD CDNA3 (MI300X) hardware by adding MFMA support to the flash attention MMA kernel. It also refines the dispatch logic for flash attention kernels based on batch size and head dimensions.
b8178: This release adds a "pragma once" directive to server-context.h and provides numerous pre-compiled binaries for macOS, Linux, Windows, and openEuler targeting different CPU/GPU architectures.
b8177 (1 feature): The server API has been updated to mirror the /v1/responses endpoint to /responses for consistency. This release also provides numerous pre-compiled binaries for diverse hardware and operating systems.
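Mirroring an endpoint like this usually amounts to registering the same handler under both paths. A minimal sketch of the idea in Python (the route table and handler are illustrative stand-ins, not llama.cpp's actual server code):

```python
# Illustrative route table: both paths resolve to the same handler,
# mirroring /v1/responses to /responses as described in the release note.
def handle_responses(request: dict) -> dict:
    # Placeholder handler; the real server implements the full Responses API.
    return {"object": "response", "input": request.get("input")}

ROUTES = {
    "/v1/responses": handle_responses,
    "/responses": handle_responses,  # mirror: same handler, no duplicated logic
}

def dispatch(path: str, request: dict) -> dict:
    return ROUTES[path](request)
```

Because both keys point at the same function, the mirror cannot drift out of sync with the original endpoint.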
b8175 (1 feature): This release introduces repack support for mxfp4 quantization within ggml-cpu and provides updated pre-compiled binaries across various operating systems and hardware configurations.
b8173 (1 fix, 3 features): The server component received significant updates to support multiple model aliases via a comma-separated --alias flag and introduced informational tags. The /v1/models endpoint was updated to reflect these new fields.
b8172 (1 fix): This release enables out-of-tree builds for the test-chat binary by correcting the working-directory heuristic for finding model files. It also provides extensive pre-compiled binaries for multiple operating systems and hardware configurations.
b8171 (1 feature): This release improves iGPU support by replacing a hardcoded value with the dynamic maximum work-group size. It also provides numerous pre-compiled binaries for diverse hardware and operating systems.
b8170 (1 feature): This release updates the ggml-zendnn backend to align with the latest ZenDNN API changes, including adapting to the new lowoha::matmul interface and updating the CMake configuration.
b8169 (1 fix, 2 features): This release enhances ggml performance by fixing AMX issues and introducing batched support, leading to faster perplexity calculation times.
b8168 (1 fix): This release addresses a bug affecting fp16 Flash Attention performance on specific Windows AMD hardware configurations using Vulkan. It also provides a comprehensive set of updated pre-built binaries across multiple operating systems and hardware targets.
b8167 (1 fix): This release fixes the padding calculation for n_tokens. It also provides updated pre-compiled binaries for numerous operating systems and hardware configurations.
b8166 (1 fix): This release fixes a bug in the server component related to context checkpoint restoration and provides updated binary distributions for numerous operating systems and hardware configurations.
b8165 (1 fix): This release contains a critical bug fix for the kv-cache's can_shift() check when using M-RoPE, alongside updated pre-compiled binaries for various operating systems and hardware configurations.
b8164 (1 fix, 2 features): This release introduces support for merging gate and exp weights in llama models and adds necessary components for all MoE models, alongside various pre-compiled binaries.
b8163 (4 fixes, 6 features): This release focuses heavily on improving the reliability and safety of the ggml-virtgpu backend through extensive consistency checks, error handling, and fallbacks for optional interface methods. It also includes various minor fixes and documentation updates.
b8162 (1 fix): This release fixes a bug where the server's load-on-startup configuration from the INI file was ignored. It also provides updated pre-compiled binaries for numerous operating systems and hardware configurations.
b8161 (1 fix): This release corrects the default size for string slices in jinja and provides updated binary distributions for numerous platforms including macOS, Linux, Windows, and openEuler.
b8159 (1 fix): This release includes an optimization in the GGUF implementation to reduce unnecessary file-size calls and provides updated pre-built binaries for various operating systems and hardware configurations.
b8157 (1 fix, 1 feature): This release introduces support for permuted quantization formats and removes an outdated check related to the s0/s10 parameters, accompanied by numerous pre-compiled binaries for diverse platforms.
b8156 (1 fix): This release introduces a safety check in the Vulkan backend to prevent memory overlap during fusion operations. It also provides updated binaries for various operating systems and hardware configurations.
b8155 (1 feature): This release introduces additional aliases for sampler CLI parameters and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various hardware and compute backends.
b8153 (1 feature): The server component now supports multi-modal prompt caching. This release also provides extensive pre-compiled binaries for a wide range of operating systems and hardware configurations.
b8152 (1 fix, 1 feature): This release introduces support for multi-modal context checkpoints on the server side and includes several internal code modifications and a bug fix related to sequence management.
b8149 (1 fix): This release fixes ftell/fseek behavior on Windows within the gguf component and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler.
b8148 (1 fix): This release includes a fix for graph-splitting issues within models and provides updated pre-compiled binaries for macOS, Linux, Windows, and openEuler platforms.
b8147 (1 fix): This release fixes a bug in the server component where query parameters were mishandled during request proxying in multi-model router mode.
b8146 (10 fixes, 2 features): This release introduces significant stability improvements by preventing various integer overflows across ggml and gguf operations, alongside minor fixes and the removal of a deprecated function.
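Overflow hardening of this kind typically means guarding size arithmetic before it can wrap. A hedged illustration of the general technique in Python (ggml's actual checks are in C and differ in detail; the function here is the author's sketch):

```python
INT64_MAX = 2**63 - 1

def checked_mul(a: int, b: int) -> int:
    """Multiply two non-negative sizes, raising instead of silently wrapping.

    Python ints never overflow, so we emulate the 64-bit bound that a C
    size calculation (e.g. nbytes = ne0 * ne1 * type_size) must respect.
    """
    product = a * b
    if product > INT64_MAX:
        raise OverflowError(f"size computation overflows int64: {a} * {b}")
    return product
```

In C the equivalent check compares one operand against `INT64_MAX / other` before multiplying, since the overflow itself is undefined behavior there.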
b8145 (1 fix, 1 feature): This release updates the model label for LFM2-24B-A2B in benchmark output and cleans up the output formatting by removing an extra line.
b8144 (1 feature): The server API has been updated to support the new "max_completion_tokens" request property, deprecating the older "max_tokens" parameter.
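For clients, a deprecation like this means preferring the new field while tolerating the old one. A small sketch in Python (the field names come from the release note; the migration policy shown is the author's illustration, not documented server behavior):

```python
def normalize_request(body: dict) -> dict:
    """Prefer "max_completion_tokens", migrating a legacy "max_tokens" value."""
    out = dict(body)  # avoid mutating the caller's request
    if "max_completion_tokens" not in out and "max_tokens" in out:
        # Move the deprecated field to its replacement.
        out["max_completion_tokens"] = out.pop("max_tokens")
    return out
```

An explicitly set "max_completion_tokens" always wins; the legacy key is only consulted as a fallback.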
b8143 (11 fixes, 18 features): This release focuses heavily on refactoring and optimizing the Vulkan scalar Flash Attention implementation, introducing fp16 support, improving synchronization, and applying numerous hardware-specific tuning fixes across AMD, Intel, and Nvidia platforms.
b8142 (1 fix): This release fixes cooperative matrix multiplication support within the Vulkan backend when bf16 is not available, alongside providing updated pre-built binaries for numerous platforms.
b8141 (1 fix): This release addresses a data race within the Vulkan mul_mat_id shader and provides updated binary distributions across various platforms including macOS, Linux, Windows, and openEuler.
b8140 (4 fixes, 3 features): This release focuses heavily on internal refactoring within the hexagon backend, optimizing various ops by using local context structs and rewriting ROPE for better DMA/VTCM utilization, yielding minor performance gains. Snapdragon builds also received updates to support larger ubatches.
b8138: This release updates the bundled cpp-httplib dependency to version 0.34.0 and provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b8133 (Breaking; 1 fix, 3 features): This release removes the storage of output ids, logits, and embeddings from the llama context state, necessitating updates to session handling and state loading mechanisms. Bug fixes include addressing sequence allocation errors in examples for recurrent models.
b8132 (1 feature): This release enhances the CLI to allow model specification via text filename and provides updated binary distributions for numerous operating systems and hardware configurations.
b8131 (1 fix): This release includes a bug fix for incorrect statistics calculation in Jinja filters and provides updated binary distributions across macOS, Linux, Windows, and openEuler platforms.
b8130 (1 fix): This release addresses a bug in the XML parser related to improper trimming upon message completion and provides updated binary distributions for numerous operating systems and hardware configurations.
b8128 (1 feature): This release introduces support for the new Kanana-2 model and provides updated pre-compiled binaries across multiple platforms, including specific CUDA and ROCm versions.
b8126 (1 fix, 1 feature): This release improves the server side by merging contiguous response input items into a single assistant message, alongside various binary updates for different platforms.
b8123 (1 feature): This release introduces support for building ROCm artifacts targeting ROCm 7.2, expanding hardware compatibility. New pre-built binaries are available across various operating systems and hardware configurations.
b8122: This release updates the internal cpp-httplib vendor dependency to version 0.33.1 and provides a comprehensive set of pre-built binaries for macOS, Linux, Windows, and openEuler targets.
b8121 (1 fix, 1 feature): This release significantly improves CUDA graph capture by delaying activation until stability is confirmed, preventing wasted overhead during prompt processing and allowing graphs to re-enable after stabilization. It also includes minor cleanup by removing em dashes.
b8119 (1 fix): This release addresses a build issue related to hexagon and provides updated pre-compiled binaries across multiple platforms and hardware configurations.
b8118 (1 fix, 2 features): This release merges the qwen3-coder and nemotron nano 3 parsers, migrating qwen3-coder to PEG parsing, and includes a new JSON parameter test for CI.
b8117 (2 features): This release enhances CPU performance by adding support for various RVV vec dot kernels within ggml-cpu, alongside providing updated pre-compiled binaries for multiple operating systems and hardware configurations.
b8116 (4 fixes, 3 features): This release introduces a --dry-run option for llama-quantize and refines internal tensor dimension handling and quantization logic, including new checks related to imatrix usage.
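A dry-run flag like this lets you preview a quantization without writing output. A hedged usage sketch in Python (the --dry-run flag name comes from the release note; the binary name, argument order, and file names are assumptions for illustration):

```python
import subprocess

def build_quantize_cmd(src: str, dst: str, qtype: str, dry_run: bool = True) -> list:
    # Assemble an argv list for the llama-quantize tool; --dry-run is the
    # new option from this release, placed before the positional arguments.
    cmd = ["llama-quantize"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [src, dst, qtype]
    return cmd

# Example (requires llama-quantize on PATH):
# subprocess.run(build_quantize_cmd("model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M"))
```

Building the argv as a list, rather than a shell string, avoids quoting pitfalls with paths containing spaces.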
b8115: This release primarily provides pre-built binaries for various operating systems and hardware configurations, including specific CUDA versions (12.4 and 13.1) for Windows, and adds tests for matrix multiplication with huge batch sizes.
b8113 (2 fixes, 4 features): This release introduces robust support for the Step-3.5-Flash model, including correct XML tool-call parsing and thinking support by routing it to the Nemotron v3 PEG parser. Dead thinking code paths in the Qwen3-Coder XML handler were also removed.
b8112 (1 fix): This release fixes a Jinja rendering error for assistant messages containing both content and tool-call thinking in gpt-oss and provides updated binaries for numerous operating systems and hardware configurations.
b8111 (1 fix, 1 feature): This release introduces support for several unary operations within the ggml-webgpu backend and includes a necessary fix for type casting during trigonometric computations. It also provides updated pre-compiled binaries for numerous operating systems and architectures.
b8110 (4 features): This release introduces support for the PaddleOCR-VL model and includes several internal updates related to model loading parameters, preprocessing, and format adjustments.
b8109 (1 fix): This release fixes MMQ shader push constants and multi-dispatch functionality within the Vulkan backend.
b8108 (2 fixes): This release addresses a critical bug in Qwen3.5 model shapes and optimizes contiguous operations by removing unnecessary reshapes. It also provides updated binaries for numerous platforms.
b8107 (2 features): This release updates the build_attn logic and introduces control over flash_attn usage through context parameters. Pre-built binaries for numerous platforms are provided.
b8106 (5 fixes, 2 features): This release introduces full support for the JAIS-2 model architecture, including specific fixes for tokenizer hashing, RoPE type, and control-vector support. It also notes that JAIS-2 requires F32 precision accumulators on CUDA.
b8105 (1 fix): This release addresses a bug in CUDA kernel selection logic for tile FA and provides updated binary distributions for numerous platforms including macOS, Linux, and Windows.
b8104 (1 fix, 2 features): This release fixes an issue where an extra newline was inserted between text and media markers in MTMD chat output by introducing a specific `media_marker` type. This resolves token-count discrepancies when comparing llama-server output with HF implementations for vision models.
b8102 (3 fixes, 2 features): This release introduces support for the LFM2.5-Audio-1.5B tokenizer and includes several internal code improvements and fixes related to attention layers and model conversion.
b8101: Internal refactoring was performed in the llama module to unify batch index resolution by utilizing output_resolve_row() in get_logits_ith() and get_embeddings_ith().
b8100 (1 fix, 4 features): This release introduces full support for modern BERT models, including specific architectural adjustments such as GELU in rank pooling and dense first layers. Several internal file updates and bug fixes related to mean pooling were also implemented.
b8099 (1 fix, 2 features): This release introduces a significant performance enhancement for llamafile on powerpc by adding an FP16 MMA path for Q4/Q8 matrix multiplications, resulting in a 1.5x to 2x speedup for relevant workloads.
b8098 (1 fix, 1 feature): This release focuses on model graph optimization by deduplicating Qwen35 graphs and includes a minor fix by adding a missing sigmoid function.
b8095 (1 fix): This release fixes a bug in the ggml webgpu backend related to large matrix-vector multiplication dispatching. It also provides numerous pre-compiled binaries for various operating systems and hardware configurations.
b8094 (1 fix, 1 feature): This release introduces server-side debugging by saving generated text for the /slots endpoint based on the LLAMA_SERVER_SLOTS_DEBUG environment variable.
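Environment-gated debug capture like this usually checks the variable and records extra state only when it is set. A hedged re-creation of the gating idea in Python (the variable name LLAMA_SERVER_SLOTS_DEBUG comes from the release note; the class and its fields are illustrative):

```python
import os

class Slot:
    """Illustrative stand-in for a server slot; only the debug gating matters."""

    def __init__(self):
        self.generated_text = None

    def record(self, text: str) -> None:
        # Save generated text for later inspection via /slots only when
        # debugging is explicitly enabled via the environment.
        if os.environ.get("LLAMA_SERVER_SLOTS_DEBUG"):
            self.generated_text = text
```

Gating on an environment variable keeps the extra memory cost at zero in normal operation while avoiding any new CLI surface.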
b8093 (1 feature): This release introduces model support for GLM-OCR and updates the conversion script. Pre-built binaries for numerous platforms are provided.
b8091 (4 fixes, 3 features): This release focuses heavily on refactoring and implementing preliminary JIT compilation for key matrix operations within the ggml WebGPU backend, alongside organizing the shader library.
b8089 (1 feature): The Vulkan backend was updated to split matrix multiplication operations into multiple dispatches to prevent overflow when handling large batch dimensions. This release also includes a comprehensive set of pre-compiled binaries for diverse operating systems and hardware configurations.
b8088 (1 fix, 1 feature): This release optimizes internal string handling by inlining small helper functions and utilizing string_view where appropriate, alongside fixing related corner cases. New binaries are provided for numerous platforms.
b8087: This release refactors the OpenCL implementations for the expm1 and softplus kernels and introduces the use of 'h' for half literals in OpenCL operations.
b8086 (1 feature): This release includes performance optimizations for the OpenCL mean and sum_row kernels and provides updated pre-compiled binaries for a wide range of operating systems and hardware configurations.
b8083 (1 fix): This release disables LTO for CPU feature detection in ggml to resolve Illegal instruction errors occurring on older hardware due to aggressive cross-module optimization.
b8082 (1 feature): This release enables CUDA graphs for MMID operations when the batch size is small (1 to 4) and includes various pre-compiled binaries for different operating systems and hardware configurations.
b8079 (1 feature): This release updates the build process by linking ws2_32 as PUBLIC on Windows and provides a comprehensive set of pre-built binaries across multiple operating systems and hardware architectures.
b8078: This release includes a cleanup of the library linking logic in the build system. It also provides extensive pre-built binaries for macOS, Linux, Windows, and openEuler across various architectures and acceleration backends.
b8077 (2 features): This release introduces support for JoyAI-LLM-Flash conversion by updating tokenizer hash mappings and adding a new pre-tokenizer name for joyai-llm.
b8076 (1 feature): This release introduces proper batching support for the Perplexity integration and provides updated binary distributions for macOS, Linux, Windows, and openEuler targeting various CPU/GPU architectures.
b8075 (1 feature): This release introduces inline functions for common operations and provides updated binary distributions for numerous platforms including macOS, Linux, and Windows with various hardware-acceleration options.
b8074 (1 feature): This release promotes `ggml_is_view` to a public API and renames an internal helper function from `ggml_aux_is_view` to `ggml_impl_is_view`.
b8073 (1 fix, 1 feature): This release introduces support for Tiny Aya models and includes fixes for tokenizer regex edge cases. It also provides numerous pre-compiled binaries for different operating systems and hardware configurations.
b8072: The build system was reworked to correctly handle the deprecation of llama_option_depr, specifically addressing LLAMA_CURL. Numerous pre-compiled binaries for diverse platforms and hardware configurations are now available.
b8071 (1 fix): This release refines the ROCm compilation workaround for ROCWMMA_FATTN/GFX9 to be conditional on newer ROCm versions, resolving an issue observed with ROCm 6.4.4.
b8070 (1 fix, 2 features): This release introduces model graph deduplication and updates for Qwen-family models, including the addition of `llm_build_delta_net_base`, alongside providing numerous pre-compiled binaries for diverse platforms.
b8069 (4 fixes): This release focuses on internal fixes within the graph and continuous modules, specifically addressing issues related to KQ mask reuse and adapter checks.
b8068 (1 fix, 2 features): This release introduces SVE optimization for aarch64 in the ggml kernel, improving performance on supported hardware, and includes extensive pre-compiled binaries for multiple platforms.
b8067 (1 feature): This release primarily updates the binary distributions for ggml synchronization, providing new builds for macOS, Linux, Windows, and openEuler targeting various CPU/GPU architectures.
b8064 (Breaking; 1 fix, 3 features): This release focuses heavily on CUDA performance optimizations for iq2xxs/iq2xs/iq3xxs dequantization, including register savings and algorithmic simplification, alongside fixing a type definition issue.
b8062: The LLAMA_HTTPLIB build option was removed because cpp-httplib now compiles correctly on visionOS, simplifying the build process. This release also provides updated binary distributions for macOS, Linux, Windows, and openEuler.
b8061 (Breaking; 1 fix): This release addresses a build issue related to the KleidiAI backend when compiling multiple CPU backends by correcting the use of CMake's FetchContent functions. It also provides updated pre-compiled binaries for numerous platforms.
b8060 (1 fix): This release fixes a bug concerning output reordering when backend sampling is used. It also provides updated pre-compiled binaries for numerous operating systems and hardware configurations.
b8059 (1 fix): This release improves stability by avoiding undefined behavior in the ggml gemm ukernel. It also provides updated pre-built binaries for numerous platforms and hardware configurations.
b8058 (1 feature): This release introduces performance optimizations for the ggml CPU backend, specifically targeting the ggml_vec_dot_bf16 function on the s390x architecture, alongside updated pre-compiled binaries.
b8057 (7 fixes, 2 features): This release introduces significant performance enhancements to ggml-cpu via a new GEMM microkernel and addresses several low-level implementation details and warnings. It also provides extensive new pre-compiled binaries for various platforms and accelerators.
b8056 (1 fix): This release fixes a CMake issue where the KleidiAI install target failed when using EXCLUDE_FROM_ALL, ensuring proper exclusion while maintaining functionality.
b8054 (2 features): This release adds support for Nemotron Nano 12B v2 VL models and simplifies related code. It also pre-downsamples position embeddings during GGUF conversion to handle a fixed input size.
b8053 (4 features): This release focuses on internal optimizations, primarily targeting Qwen3Next model graph execution and refining chunking logic by removing redundancy and avoiding mask passing.
b8052 (1 fix): This release addresses a bug in GGML related to the interaction between GGML_DEBUG and OpenMP compilation flags. It also provides updated pre-built binaries for numerous operating systems and hardware configurations.
Common Errors
DeviceLostError (2 reports): vk::DeviceLostError usually signifies that the GPU has encountered an unrecoverable error, often due to exceeding memory limits or hitting a timeout. Reduce the model size, lower batch sizes, or decrease the maximum context length to alleviate memory pressure. Alternatively, increase the GPU timeout duration in your system or driver settings if that is permissible.
InternalServerError (2 reports): InternalServerError in llama.cpp often arises from unsupported model architectures or operations, such as attempting multimodal input with a model not designed for it or faulty tool calling within a specific model. To resolve this, verify that the model supports the requested operation, and update llama.cpp to the latest version or switch to a model known to work with multimodal input or tool calling. If issues persist, inspect the model's configuration, particularly its handling of vision or function calling, and revise your prompts accordingly.
FileNotFoundError (1 report): A FileNotFoundError in llama.cpp usually means a required file path, often a model or tokenizer component, isn't valid or the file doesn't exist at that location. Double-check the path specified in your command-line arguments or configuration files for typos and ensure the necessary files are actually present in the indicated directory. If converting from Hugging Face, make sure all required files, such as "tokenizer.model", were downloaded correctly.
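A quick pre-flight check along these lines can turn a vague FileNotFoundError into an actionable message. A sketch in Python (the required file names are examples, not an exhaustive list):

```python
from pathlib import Path

def check_model_files(model_dir: str, required=("tokenizer.model",)) -> list:
    """Return the missing items so the caller can fail with a clear message."""
    root = Path(model_dir)
    if not root.is_dir():
        # The directory itself is absent; report it rather than its contents.
        return [str(root)]
    return [name for name in required if not (root / name).is_file()]
```

Running this before invoking a conversion script reports every missing file at once instead of failing on the first open().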
NotImplementedError (1 report): NotImplementedError usually arises when a function or method is called without a concrete implementation in the current class or codebase. To fix it, locate the function raising the error and either implement the missing functionality or call a different function. If you are converting models, ensure the conversion script supports the specific model architecture being used.
Related AI & LLMs Packages
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
Get up and running with OpenAI gpt-oss, DeepSeek-R1, Gemma 3 and other models.
🦜🔗 The platform for reliable agents.
The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
A high-throughput and memory-efficient inference and serving engine for LLMs