b9158
Breaking Changes · 📦 llama-cpp
⚠ 1 breaking · ✨ 3 features · 🔧 1 symbol
Summary
This release adds RDNA3 support to the CUDA mma FlashAttention (FA) kernel and tunes kernel parameters for the RDNA3, RDNA4, and CDNA architectures. It also changes the accumulator data layout as part of the RDNA3/RDNA4 optimizations.
⚠️ Breaking Changes
- The data layout of accumulators along the attention-head dimension is scrambled when the RDNA3/RDNA4-optimized tile kernel is used (it uses a logical tile size of 32 for FP16 accumulation). This enables more efficient transposition. Code that relies on a specific accumulator layout must account for this change.
Migration Steps
- If FP16 accumulation is requested on RDNA3/RDNA4 but the head size is not divisible by 32 (e.g., 80 or 112), the kernel falls back to the regular logical tile size of 16 with FP32 accumulation.
- While the RDNA3/RDNA4 tile kernel is active, the accumulator data layout is scrambled; this is a deliberate performance optimization, so no action is needed unless your code depends on that layout.
✨ New Features
- Added RDNA3 support to the CUDA mma FA kernel.
- Enabled support for head sizes up to 256 for CDNA architectures.
- Tuned kernel parameters for RDNA3, RDNA4, and CDNA1.