Changes

b8922

📦 llama-cpp
✨ 2 features · 🐛 6 fixes · 🔧 1 symbol

Summary

This release extends Flash Attention support in ggml-webgpu to more browser environments through new tile and vec code paths. Internal memory handling and shader path selection were also refined for better performance and stability.

✨ New Features

  • Enabled FLASH_ATTN_EXT in ggml-webgpu on browsers without subgroup matrix support, via a tile-based flash attention fallback.
  • Added vec and tile variants of flash attention to ggml-webgpu for browser environments.
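The fallback idea behind these two features can be sketched as a simple path selection: prefer the subgroup-matrix kernel when the browser exposes it, otherwise fall back to the tile kernel, with the vec kernel covering the single-query-row case (consistent with Q_TILE always being 1 on the vec path). This is an illustrative Python sketch only; the function name, flag names, and the exact selection criteria are assumptions, not the backend's actual code.

```python
def choose_flash_attn_path(supports_subgroup_matrix: bool, n_q_rows: int) -> str:
    """Hypothetical sketch of flash attention shader path selection.

    Assumed order: the vec path handles a single query row (no tiling
    over Q, matching Q_TILE == 1); the subgroup-matrix path is used when
    the browser supports it; the tile path is the fallback otherwise.
    """
    if n_q_rows == 1:
        return "vec"              # single query row: no Q tiling needed
    if supports_subgroup_matrix:
        return "subgroup_matrix"  # fastest path when hardware supports it
    return "tile"                 # fallback for browsers lacking subgroup matrix
```

The point of the fallback is that a browser reporting no subgroup matrix support still gets a working FLASH_ATTN_EXT kernel rather than failing the op.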

🐛 Bug Fixes

  • Modified the vec path in ggml-webgpu to discard the mnk parameter.
  • Removed Q_TILE as it is always 1 for the vec path.
  • Moved the row_max and exp_sum calculations into local registers.
  • Ensured different bindings sharing the same underlying buffer have identical usage flags.
  • Turned off skip_validation and fixed buffer overlap when nwg==1.
  • Merged bindings when KV overlap occurs.

Affected Symbols