b9158
Breaking Changes · 📦 llama-cpp
⚠ 1 breaking · ✨ 3 features · 🔧 1 symbol
Summary
This release adds RDNA3 support to the CUDA mma FlashAttention (FA) kernel and tunes kernel parameters for the RDNA3, RDNA4, and CDNA architectures. It also changes the accumulator data layout as part of the RDNA3/RDNA4 optimizations.
⚠️ Breaking Changes
- The data layout of accumulators along the attention-head dimension is scrambled when the RDNA3/RDNA4-optimized tile kernel is used (it uses a logical tile size of 32 for FP16 accumulation). This enables more efficient transposition. Code that relies on a specific accumulator layout must account for this change.
Migration Steps
- If FP16 accumulation is requested on RDNA3/RDNA4 but the head size is not divisible by 32 (e.g., 80 or 112), the kernel falls back to the regular logical tile size of 16 with FP32 accumulation.
- While the RDNA3/RDNA4 tile kernel is active, the accumulator data layout is scrambled; this is a deliberate performance optimization, so no action is needed unless your code depends on that layout.
✨ New Features
- Added RDNA3 support to the CUDA mma FA kernel.
- Enabled support for head sizes up to 256 for CDNA architectures.
- Tuned kernel parameters for RDNA3, RDNA4, and CDNA1.