b8143
📦 llama-cpp — View on GitHub →
✨ 18 features · 🐛 11 fixes · 🔧 2 symbols
Summary
This release focuses heavily on refactoring and optimizing the Vulkan Scalar Flash Attention implementation, introducing fp16 support, improving synchronization, and applying numerous hardware-specific tuning fixes across AMD, Intel, and Nvidia platforms.
Migration Steps
- Scalar FA with Bc=4 is no longer a valid configuration and must be changed (the default Bc is now 32).
- Users on GCN AMD GPUs using the proprietary driver should note that f16 FA is now disabled.
✨ New Features
- Refactored the Vulkan scalar flash attention implementation.
- Enabled using fp16 in scalar flash attention shader.
- Split rows inside subgroups for faster synchronization in Vulkan FA.
- Added support for using f32 scalar FA when f16 is not supported by the device.
- Added support for a medium-rows FA shader Br size.
- Cached q values into registers for KQ computation.
- Fused lf accumulation, pf, and v accumulation into a single loop.
- Enabled staging K and V loads through shared memory (shmem) (only on Nvidia for V staging).
- Defaulted Bc to 32 for scalar FA.
- Enabled dynamic subgroups for Intel devices.
- Used vectorized stores.
- Used float_type for dequantize4 functions.
- Used a smaller scalar row size for smaller row counts.
- Relaxed flash attention split_k condition to allow non-gqa use.
- Used minimal subgroup size on Intel.
- Added Intel shader core count lookup-table.
- Allowed printing pipeline stats.
- Limited occupancy for GCN for small batch FA with large HSK.
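
The fused lf/pf/v accumulation above follows the standard online-softmax formulation of flash attention: the running max, the exp-weighted probabilities, and the V accumulator are all updated in one pass over the K/V columns. A minimal CPU-side C++ sketch of that idea (illustrative only; the function name and data layout are hypothetical, not the actual Vulkan shader code):

```cpp
#include <cmath>
#include <vector>

// One output row of attention for a single query, computed in a single
// fused loop: the running max (m), the running exp-sum (l, the "lf" term),
// the per-column probability (p, the "pf" term), and the V accumulator
// are all updated together instead of in separate passes.
std::vector<float> fused_fa_row(const std::vector<float>& scores,
                                const std::vector<std::vector<float>>& V) {
    const size_t d = V[0].size();
    float m = -INFINITY;              // running max of scores seen so far
    float l = 0.0f;                   // running sum of exp(score - m)
    std::vector<float> acc(d, 0.0f);  // running V accumulator

    for (size_t c = 0; c < scores.size(); ++c) {
        const float m_new = std::max(m, scores[c]);
        const float scale = std::exp(m - m_new);      // rescale old state
        const float p     = std::exp(scores[c] - m_new);
        l = l * scale + p;
        for (size_t i = 0; i < d; ++i) {
            acc[i] = acc[i] * scale + p * V[c][i];
        }
        m = m_new;
    }
    for (size_t i = 0; i < d; ++i) acc[i] /= l;       // final normalization
    return acc;
}
```

Because every state variable is rescaled by `exp(m - m_new)` when the running max changes, the result matches a naive softmax-then-matmul while needing only one loop over the columns, which is what makes fusing these accumulations into a single shader loop attractive.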
🐛 Bug Fixes
- Fixed AMD workgroup size issue in Vulkan FA.
- Optimized masksh use.
- Added padding to mask shmem buffer.
- Removed Bc=4 for scalar FA, which was an invalid configuration.
- Used wave32 on AMD RDNA for scalar FA.
- Fixed rebase issues.
- Fixed gqa opt logic.
- Fixed block_rows issue with small n_rows.
- Fixed hsk=72/80 issue.
- Fixed bad RDNA performance on head size <= 128 by limiting occupancy.
- Disabled f16 FA for GCN AMD GPUs on the proprietary driver.