b8701

📅 Apr 8, 2026 · 📦 llama-cpp

✨ 2 features · 🐛 3 fixes · 🔧 3 symbols

Summary

This release introduces performance improvements for q4_0 and q4_1 mmq kernels on AMD GPUs via ds_read_b128 optimization and includes various bug fixes and code cleanup in the CUDA implementation.

✨ New Features

Switched the q4_0 and q4_1 mmq kernels in ggml-cuda to ds_read_b128 (128-bit LDS read) instructions, which saves LDS bandwidth and improves performance on the MI50 and RX 6800 XT.
Updated the vectorized LDS load to use the ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions, making the implementation generic.
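
The idea behind the wider load can be illustrated with a minimal host-side C++ sketch. It is not the code from mmq.cuh: `copy_in_chunks` and `compare_widths` are hypothetical names introduced here, and the chunk loop only stands in for the load instructions a GPU kernel would issue. It assumes the 16 bytes of packed 4-bit quants in one q4_0 block and shows that a 16-byte (128-bit) chunk width moves them in one copy where a 4-byte width needs four.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper, NOT the ggml implementation: copies NBYTES from src to
// dst in chunks of MAX_CPY bytes and returns how many chunk copies were
// issued. On a GPU, each chunk would roughly correspond to one load instruction.
template <std::size_t NBYTES, std::size_t MAX_CPY>
int copy_in_chunks(void * dst, const void * src) {
    static_assert(MAX_CPY == 4 || MAX_CPY == 8 || MAX_CPY == 16,
                  "unsupported chunk width");
    static_assert(NBYTES % MAX_CPY == 0,
                  "size must be a multiple of the chunk width");
    int n_copies = 0;
    for (std::size_t off = 0; off < NBYTES; off += MAX_CPY) {
        std::memcpy(static_cast<unsigned char *>(dst) + off,
                    static_cast<const unsigned char *>(src) + off, MAX_CPY);
        ++n_copies;
    }
    return n_copies;
}

// Copies the same 16 bytes with narrow (32-bit) and wide (128-bit) chunks.
// Returns true if both copies produced identical bytes; the copy counts are
// reported through the two output parameters.
inline bool compare_widths(int & n_narrow, int & n_wide) {
    std::uint8_t src[16];
    for (int i = 0; i < 16; ++i) {
        src[i] = static_cast<std::uint8_t>(i * 17);  // arbitrary test pattern
    }
    std::uint8_t narrow[16] = {};
    std::uint8_t wide[16]   = {};
    n_narrow = copy_in_chunks<16, 4>(narrow, src);   // four 4-byte copies
    n_wide   = copy_in_chunks<16, 16>(wide, src);    // one 16-byte copy
    return std::memcmp(narrow, wide, 16) == 0;
}
```

Making the chunk width a compile-time parameter mirrors the "generic implementation" wording above: the same loop serves hardware with different maximum copy widths, and only the template argument changes.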

🐛 Bug Fixes

Fixed max_cpy usage in the loading loop.
Fixed typo in q4_1 kernel.
Removed trailing blank lines and whitespace in mmq.cuh.

Affected Symbols

ggml/src/ggml-cuda/mmq.cuh
ggml_cuda_get_max_cpy_bytes
ggml_cuda_memcpy_1