b8922
📦 llama-cpp
✨ 2 features · 🐛 6 fixes · 🔧 1 symbol
Summary
This release brings significant enhancements to ggml-webgpu, enabling Flash Attention support in more browser environments through tile and vec shader-path fallbacks. Several internal memory handling and shader-path selection details were also refined for better performance and stability.
✨ New Features
- Enabled FLASH_ATTN_EXT support in ggml-webgpu on browsers lacking subgroup matrix support, via a tile flash-attention fallback.
- Added vec and tile variants of flash attention to ggml-webgpu for broader browser support.
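The fallback logic above can be sketched as a small dispatch decision: use the subgroup-matrix kernel when the device supports it, otherwise fall back to a vec or tile variant. This is an illustrative sketch only; the names (`FAPath`, `select_fa_path`, `q_rows`) are hypothetical and not the actual ggml-webgpu API.

```cpp
// Hypothetical sketch of flash-attention shader-path selection.
// All identifiers here are illustrative, not ggml-webgpu's real symbols.
enum class FAPath { SubgroupMatrix, Tile, Vec };

// q_rows: query rows processed per workgroup. A single query row can use
// the cheaper vec path; larger batches need the tile path.
FAPath select_fa_path(bool has_subgroup_matrix, int q_rows) {
    if (has_subgroup_matrix) {
        return FAPath::SubgroupMatrix; // fastest path when hardware allows
    }
    return q_rows == 1 ? FAPath::Vec : FAPath::Tile; // browser fallback
}
```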
🐛 Bug Fixes
- Modified the vec path in ggml-webgpu to discard the mnk parameter.
- Removed Q_TILE as it is always 1 for the vec path.
- Moved row_max and exp_sum calculations to local registers.
- Ensured different bindings sharing the same underlying buffer have identical usage flags.
- Turned off skip_validation and fixed overlapping buffer bindings when nwg == 1.
- Merged bindings when KV overlap occurs.
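The row_max/exp_sum fix above refers to the online-softmax accumulation used by flash attention, where the running row maximum and exponential sum live in per-thread locals ("registers") rather than workgroup memory. Below is a minimal scalar C++ sketch of that accumulation pattern, not the actual WGSL shader; the function name and tiling are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative online-softmax accumulation: process scores tile by tile,
// keeping the running maximum (row_max) and rescaled exponential sum
// (exp_sum) in local variables. Returns the softmax denominator
// sum_j exp(scores[j] - max(scores)) in a numerically stable way.
float online_softmax_denominator(const std::vector<float>& scores, size_t tile) {
    float row_max = -INFINITY; // running maximum, held in a register
    float exp_sum = 0.0f;      // running sum of exp(score - row_max)
    for (size_t i = 0; i < scores.size(); i += tile) {
        const size_t end = std::min(scores.size(), i + tile);
        // tile-local maximum, merged with the running maximum
        float tmax = row_max;
        for (size_t j = i; j < end; ++j) tmax = std::max(tmax, scores[j]);
        // rescale the previous partial sum when the maximum grows
        exp_sum *= std::exp(row_max - tmax);
        for (size_t j = i; j < end; ++j) exp_sum += std::exp(scores[j] - tmax);
        row_max = tmax;
    }
    return exp_sum;
}
```

Keeping these two accumulators in registers avoids repeated round-trips through shared memory inside the inner loop, which is the motivation hinted at by the fix.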