Change8

b8966

📦 llama-cpp
✨ 2 features · 🐛 2 fixes · 🔧 2 symbols

Summary

This release brings a significant performance enhancement to ggml-cuda, adding flash-attn support for specific Mistral Small 4 configurations, and fixes a critical bug in sink indexing across warp groups.

Migration Steps

  1. If you use flash-attn with a GQA ratio other than 32, note that kernel selection now returns BEST_FATTN_KERNEL_NONE for those configurations, so flash-attn is not applied.

✨ New Features

  • Added flash-attn support for ggml-cuda when DKQ=320/DV=256 with ncols2=32 (GQA=32), including MMA-f16 and tile kernel configs.
  • Added support for Mistral Small 4 (head sizes 320/256) restricted to ncols2=32 for GQA ratio 32 only.

🐛 Bug Fixes

  • Fixed a bug where sinks=1 with ncols=32 produced output that did not match CPU results because sink indices were identical across warp groups; introduced sink_base to give each warp group its own offset.
  • Changed the default kernel config from DKQ=256,DV=256 to DKQ=512,DV=512.

Affected Symbols