b8966
📦 llama-cpp
✨ 2 features · 🐛 2 fixes · 🔧 2 symbols
Summary
This release brings performance enhancements to ggml-cuda by adding flash-attn support for specific Mistral Small 4 configurations (head sizes 320/256 at GQA ratio 32) and fixes a bug in which sink indices were identical across warp groups.
Migration Steps
- If you were using flash-attn with these head sizes at a GQA ratio other than 32, kernel selection now returns BEST_FATTN_KERNEL_NONE for GQA != 32, i.e. no flash-attn kernel is used; see the selection sketch below.
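
A minimal C++ sketch of this selection behavior. Only BEST_FATTN_KERNEL_NONE is named in the release notes; the other enum values and the function shape are illustrative stand-ins based on the MMA-f16 and tile kernels mentioned under New Features:

```cpp
// Illustrative sketch of the new selection behavior; not the actual
// ggml-cuda dispatch code.
enum best_fattn_kernel {
    BEST_FATTN_KERNEL_NONE = 0, // named in the release notes
    BEST_FATTN_KERNEL_TILE,     // assumed, based on the tile kernel mention
    BEST_FATTN_KERNEL_MMA_F16,  // assumed, based on the MMA-f16 mention
};

static best_fattn_kernel pick_fattn_kernel(int dkq, int dv, int gqa_ratio) {
    if (dkq == 320 && dv == 256) {
        // The new head sizes are only wired up for ncols2 == 32,
        // which corresponds to a GQA ratio of exactly 32.
        return gqa_ratio == 32 ? BEST_FATTN_KERNEL_MMA_F16
                               : BEST_FATTN_KERNEL_NONE; // no flash-attn kernel
    }
    // ... existing selection logic for other head sizes ...
    return BEST_FATTN_KERNEL_TILE;
}
```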
✨ New Features
- Added flash-attn support in ggml-cuda for DKQ=320/DV=256 with ncols2=32 (GQA ratio 32), including MMA-f16 and tile kernel configurations.
- Added support for Mistral Small 4 (head sizes 320/256), restricted to ncols2=32, i.e. to a GQA ratio of 32 only; a dispatch sketch follows this list.
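
As a rough illustration of how such a restricted configuration could be wired up, the sketch below dispatches the 320/256 head sizes to a single ncols2=32 instantiation. The function and template names are hypothetical stand-ins, not the actual ggml-cuda instantiation macros:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for the per-configuration kernel launcher; the
// real instantiations in ggml-cuda are generated per (DKQ, DV, ncols2).
template <int DKQ, int DV, int ncols2>
static void launch_fattn_mma_f16_case() {
    std::printf("launching MMA-f16 kernel: DKQ=%d DV=%d ncols2=%d\n",
                DKQ, DV, ncols2);
}

// Dispatch for the new DKQ=320/DV=256 head sizes: only ncols2 == 32 is
// instantiated, matching the GQA=32 restriction described above.
static void launch_fattn_mma_f16_320_256(int ncols2) {
    switch (ncols2) {
        case 32:
            launch_fattn_mma_f16_case<320, 256, 32>();
            break;
        default:
            std::fprintf(stderr, "unsupported ncols2=%d for DKQ=320/DV=256\n",
                         ncols2);
            std::abort();
    }
}
```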
🐛 Bug Fixes
- Fixed a bug where sinks=1 with ncols=32 produced output that did not match CPU results because sink indices were identical across warp groups; a sink_base offset was introduced to give each warp group its own index (see the sketch after this list).
- Changed the default kernel config from DKQ=256/DV=256 to DKQ=512/DV=512.
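
To illustrate the sink-indexing fix above, here is a CUDA-flavored sketch of per-warp-group sink indexing. Only the name sink_base comes from the release notes; the kernel shape, parameters, and exact indexing scheme are assumptions:

```cpp
// Sketch of the fix: each warp group offsets its sink index by a
// per-group base instead of every group reusing the same local index.
__global__ void add_sinks(const float * sinks, float * out, int cols_per_group) {
    // Each warp group (threadIdx.y) handles its own slice of the ncols=32
    // columns, so it must also read its own slice of sink values.
    const int warp_group = threadIdx.y;
    const int sink_base  = warp_group * cols_per_group; // per-group offset (the fix)

    for (int c = 0; c < cols_per_group; ++c) {
        // Before the fix the index was effectively just `c`, identical for
        // every warp group; with sink_base each group addresses distinct
        // sink entries, matching the CPU reference.
        out[sink_base + c] += sinks[sink_base + c];
    }
}
```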