Change8

b8179

📦 llama-cpp
6 features · 3 fixes · 3 symbols

Summary

This release introduces significant performance enhancements for AMD CDNA3 (MI300X) hardware by adding MFMA support to the flash attention MMA kernel. It also refines the dispatch logic for flash attention kernels based on batch size and head dimensions.

Migration Steps

  1. If you were relying on the hardcoded FATTN_WARP_SIZE definition, note that it has been replaced by a dynamic call to ggml_cuda_get_physical_warp_size().
  2. Code relying on the VEC fallback for small batches in flash attention may need adjustment as small batches now fall through to the tile kernel.
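The warp-size change in step 1 can be sketched as follows. This is a hedged stand-in, not the real implementation: `get_physical_warp_size` here is a hypothetical host-side analogue of `ggml_cuda_get_physical_warp_size()`, which in the actual code resolves the warp width from the compile-time target architecture.

```cpp
// Before: a hardcoded constant, wrong for 64-wide CDNA wavefronts.
// #define FATTN_WARP_SIZE 32

// After (illustrative sketch): the warp size is queried per architecture.
// NVIDIA GPUs and AMD RDNA use 32-wide warps; AMD CDNA (e.g. MI300X)
// uses 64-wide wavefronts.
static int get_physical_warp_size(bool is_cdna) {
    return is_cdna ? 64 : 32;
}
```

Any compile-time arithmetic that previously used `FATTN_WARP_SIZE` (tile widths, reduction strides) now has to account for both widths.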

✨ New Features

  • Added CDNA3 MFMA support for flash attention MMA kernel, enabling MI300X (gfx942) acceleration.
  • Implemented configuration for CDNA flash attention with head sizes 64, 80, 96, 112, 128.
  • Introduced FP16 MFMA intrinsic path in mma.cuh.
  • Added manual V transpose load for MFMA register layout.
  • Routed CDNA to use MMA kernel for prompt processing and VEC kernel for token generation.
  • Improved dispatch logic by using a threshold based on effective NQ (eff_nq >= 128) to select between MMA and tile kernels for flash attention.
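The new dispatch rule can be sketched like this. The `eff_nq >= 128` threshold is taken from the release notes; the enum and function names are illustrative, not the actual kernel-selection code.

```cpp
// Hedged sketch of the flash-attention dispatch described above.
enum class FattnKernel { MMA, TILE };

// Large effective query batches (prompt processing) go to the MMA
// kernel; smaller batches now fall through to the tile kernel
// (the former VEC fallback path was removed).
static FattnKernel select_fattn_kernel(int eff_nq) {
    return eff_nq >= 128 ? FattnKernel::MMA : FattnKernel::TILE;
}
```

With this rule, token generation (eff_nq of 1) lands on the tile kernel, while a 512-token prompt lands on the MMA kernel.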

🐛 Bug Fixes

  • Fixed Q loading and combine stride granularity for non-power-of-2 heads in flash attention.
  • Replaced hardcoded FATTN_WARP_SIZE definition with a call to ggml_cuda_get_physical_warp_size() in device functions.
  • Removed VEC fallback path; small batches now fall through to the tile kernel.
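The stride-granularity fix is easiest to see with a rounding helper. As a generic, hedged illustration (not the kernel's actual code): strides for non-power-of-2 head counts such as 80 or 112 must be rounded up to a multiple of the hardware granularity with plain integer arithmetic, which stays correct regardless of whether the inputs are powers of two.

```cpp
// Illustrative helper: round x up to the next multiple of granularity.
// Works for any positive granularity, power of two or not.
static int round_up(int x, int granularity) {
    return ((x + granularity - 1) / granularity) * granularity;
}
```

For example, a head size of 80 is already a multiple of 16, but rounded to a 64-element granularity it becomes 128.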

Affected Symbols