b8966
📦 llama-cpp
✨ 2 features · 🐛 2 fixes · 🔧 2 symbols
Summary
This release brings performance enhancements to ggml-cuda by adding flash-attn support for specific Mistral Small 4 configurations (head sizes 320/256 at GQA ratio 32) and fixes a bug in which sink indices were identical across warp groups.
Migration Steps
- If you were using flash-attn with these head sizes at a GQA ratio other than 32, kernel selection now returns BEST_FATTN_KERNEL_NONE for GQA != 32, i.e. no flash-attn kernel is used; see the selection sketch below.
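
A minimal C++ sketch of this selection behavior. Only BEST_FATTN_KERNEL_NONE is named in the release notes; the other enum values and the function shape are illustrative stand-ins based on the MMA-f16 and tile kernels mentioned under New Features:

```cpp
// Illustrative sketch of the new selection behavior; not the actual
// ggml-cuda dispatch code.
enum best_fattn_kernel {
    BEST_FATTN_KERNEL_NONE = 0, // named in the release notes
    BEST_FATTN_KERNEL_TILE,     // assumed, based on the tile kernel mention
    BEST_FATTN_KERNEL_MMA_F16,  // assumed, based on the MMA-f16 mention
};

static best_fattn_kernel pick_fattn_kernel(int dkq, int dv, int gqa_ratio) {
    if (dkq == 320 && dv == 256) {
        // The new head sizes are only wired up for ncols2 == 32,
        // which corresponds to a GQA ratio of exactly 32.
        return gqa_ratio == 32 ? BEST_FATTN_KERNEL_MMA_F16
                               : BEST_FATTN_KERNEL_NONE; // no flash-attn kernel
    }
    // ... existing selection logic for other head sizes ...
    return BEST_FATTN_KERNEL_TILE;
}
```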
✨ New Features
- Added flash-attn support in ggml-cuda for DKQ=320/DV=256 with ncols2=32 (GQA ratio 32), including MMA-f16 and tile kernel configurations.
- Added support for Mistral Small 4 (head sizes 320/256), restricted to ncols2=32, i.e. to a GQA ratio of 32 only; a dispatch sketch follows this list.
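
As a rough illustration of how such a restricted configuration could be wired up, the sketch below dispatches the 320/256 head sizes to a single ncols2=32 instantiation. The function and template names are hypothetical stand-ins, not the actual ggml-cuda instantiation macros:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for the per-configuration kernel launcher; the
// real instantiations in ggml-cuda are generated per (DKQ, DV, ncols2).
template <int DKQ, int DV, int ncols2>
static void launch_fattn_mma_f16_case() {
    std::printf("launching MMA-f16 kernel: DKQ=%d DV=%d ncols2=%d\n",
                DKQ, DV, ncols2);
}

// Dispatch for the new DKQ=320/DV=256 head sizes: only ncols2 == 32 is
// instantiated, matching the GQA=32 restriction described above.
static void launch_fattn_mma_f16_320_256(int ncols2) {
    switch (ncols2) {
        case 32:
            launch_fattn_mma_f16_case<320, 256, 32>();
            break;
        default:
            std::fprintf(stderr, "unsupported ncols2=%d for DKQ=320/DV=256\n",
                         ncols2);
            std::abort();
    }
}
```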
🐛 Bug Fixes
- Fixed a bug where sinks=1 with ncols=32 produced output that did not match CPU results because sink indices were identical across warp groups; a sink_base offset was introduced to give each warp group its own index (see the sketch after this list).
- Changed the default kernel config from DKQ=256/DV=256 to DKQ=512/DV=512.
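
To illustrate the sink-indexing fix above, here is a CUDA-flavored sketch of per-warp-group sink indexing. Only the name sink_base comes from the release notes; the kernel shape, parameters, and exact indexing scheme are assumptions:

```cpp
// Sketch of the fix: each warp group offsets its sink index by a
// per-group base instead of every group reusing the same local index.
__global__ void add_sinks(const float * sinks, float * out, int cols_per_group) {
    // Each warp group (threadIdx.y) handles its own slice of the ncols=32
    // columns, so it must also read its own slice of sink values.
    const int warp_group = threadIdx.y;
    const int sink_base  = warp_group * cols_per_group; // per-group offset (the fix)

    for (int c = 0; c < cols_per_group; ++c) {
        // Before the fix the index was effectively just `c`, identical for
        // every warp group; with sink_base each group addresses distinct
        // sink entries, matching the CPU reference.
        out[sink_base + c] += sinks[sink_base + c];
    }
}
```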