
b8143

📦 llama-cpp
✨ 18 features · 🐛 11 fixes · 🔧 2 symbols

Summary

This release focuses heavily on refactoring and optimizing the Vulkan Scalar Flash Attention implementation, introducing fp16 support, improving synchronization, and applying numerous hardware-specific tuning fixes across AMD, Intel, and Nvidia platforms.

Migration Steps

  1. If you run scalar FA with Bc=4, switch to a different Bc value: Bc=4 is now treated as an invalid configuration (the scalar FA default is now Bc=32).
  2. Users on GCN AMD GPUs using the proprietary driver should note that f16 FA is now disabled.

✨ New Features

  • Refactored the Vulkan scalar Flash Attention (FA) implementation.
  • Enabled using fp16 in scalar flash attention shader.
  • Implemented splitting rows inside of subgroups for faster synchronization in Vulkan FA.
  • Added support for using f32 scalar FA when f16 is not supported by the device.
  • Added support for a medium-rows Br size in the FA shader.
  • Cached q values into registers for KQ computation.
  • Fused lf accumulation, pf, and v accumulation into a single loop.
  • Enabled staging K and V loads through shared memory (shmem); V staging is enabled only on Nvidia.
  • Defaulted Bc to 32 for scalar FA.
  • Enabled dynamic subgroups for Intel devices.
  • Used vectorized stores.
  • Used float_type for dequantize4 functions.
  • Used smaller scalar rows size for smaller rows count.
  • Relaxed flash attention split_k condition to allow non-gqa use.
  • Used minimal subgroup size on Intel.
  • Added Intel shader core count lookup-table.
  • Allowed printing pipeline stats.
  • Limited occupancy for GCN for small batch FA with large HSK.
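
For readers unfamiliar with the Br and Bc parameters mentioned above, the following is a minimal CPU-side sketch of a flash-attention-style loop in plain Python. It is not the Vulkan shader from this release. It assumes the usual flash attention meaning of the two sizes (Br query rows per tile, Bc key/value columns per step), and the function names `naive_attention` and `tiled_attention` are hypothetical. The inner loop updates the softmax denominator and the output accumulator in one pass over the block, which is only loosely analogous to the fused accumulation loop listed above.

```python
import math


def naive_attention(Q, K, V, scale):
    """Reference: softmax(scale * Q K^T) V, computed one query row at a time."""
    dim_v = len(V[0])
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, V)) / total
                    for d in range(dim_v)])
    return out


def tiled_attention(Q, K, V, scale, Br=2, Bc=32):
    """Same result as naive_attention, visiting K/V in blocks of Bc columns.

    Br and Bc follow the common flash attention convention (an assumption
    here, not taken from the shader source).
    """
    dim_v = len(V[0])
    out = []
    for r0 in range(0, len(Q), Br):
        rows = Q[r0:r0 + Br]
        M = [-math.inf] * len(rows)          # running row maximum
        L = [0.0] * len(rows)                # running softmax denominator
        O = [[0.0] * dim_v for _ in rows]    # running unnormalised output
        for c0 in range(0, len(K), Bc):
            k_blk = K[c0:c0 + Bc]
            v_blk = V[c0:c0 + Bc]
            for i, q in enumerate(rows):
                s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
                m_new = max(M[i], max(s))
                # Rescale earlier partial sums to the new maximum.
                # On the first block M[i] is -inf, so the factor is 0.0.
                rescale = math.exp(M[i] - m_new)
                L[i] *= rescale
                O[i] = [o * rescale for o in O[i]]
                # One pass over the block updates denominator and output.
                for s_j, v in zip(s, v_blk):
                    p = math.exp(s_j - m_new)
                    L[i] += p
                    for d in range(dim_v):
                        O[i][d] += p * v[d]
                M[i] = m_new
        out.extend([[o / L[i] for o in O[i]] for i in range(len(rows))])
    return out
```

Because the running maximum and denominator are carried across blocks, the result matches the unblocked computation up to floating-point rounding, including when the sequence length is not a multiple of Br or Bc.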

🐛 Bug Fixes

  • Fixed AMD workgroup size issue in Vulkan FA.
  • Optimized masksh use.
  • Added padding to mask shmem buffer.
  • Fixed an issue where scalar FA could use Bc=4, which is not a valid configuration.
  • Used wave32 on AMD RDNA for scalar FA.
  • Fixed rebase issues.
  • Fixed gqa opt logic.
  • Fixed block_rows issue with small n_rows.
  • Fixed an issue with HSK=72/80.
  • Fixed bad RDNA performance on head size <= 128 by limiting occupancy.
  • Disabled f16 FA for GCN AMD GPUs on the proprietary driver.
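
The hardware-specific rules in the two lists above can be read as a small decision table. The sketch below restates them in Python purely as an illustration: the `Device` fields, the architecture labels, and the function `pick_scalar_fa_config` are all hypothetical and do not correspond to llama.cpp symbols. It also assumes that a device with f16 FA disabled falls back to the f32 scalar path, which the notes imply but do not state outright.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Hypothetical device description; not a llama.cpp or Vulkan type."""
    vendor: str                 # "amd", "intel" or "nvidia"
    arch: str                   # illustrative labels such as "gcn" or "rdna"
    proprietary_driver: bool
    supports_f16: bool
    min_subgroup_size: int


def pick_scalar_fa_config(dev: Device, bc: int = 32) -> dict:
    """Restate the release-note rules for scalar FA as one function."""
    # Bc=4 is listed as an invalid configuration; the default is now 32.
    if bc == 4:
        raise ValueError("Bc=4 is not a valid scalar FA configuration")

    # f16 is used where supported, except on GCN with the proprietary driver.
    use_f16 = dev.supports_f16
    if dev.vendor == "amd" and dev.arch == "gcn" and dev.proprietary_driver:
        use_f16 = False  # assumed to fall back to the f32 scalar path

    # None means "leave the subgroup size to the driver" in this sketch.
    subgroup_size: Optional[int] = None
    if dev.vendor == "amd" and dev.arch == "rdna":
        subgroup_size = 32                     # wave32 on RDNA
    elif dev.vendor == "intel":
        subgroup_size = dev.min_subgroup_size  # minimal subgroup size on Intel

    return {
        "bc": bc,
        "precision": "f16" if use_f16 else "f32",
        "subgroup_size": subgroup_size,
        "stage_v_through_shmem": dev.vendor == "nvidia",  # V staging: Nvidia only
    }
```

For example, a hypothetical `Device("amd", "gcn", True, True, 64)` would come out with `precision` set to `"f32"` under these assumptions, while an RDNA device with f16 support would get `"f16"` and a subgroup size of 32.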

Affected Symbols