b8922
📦 llama-cpp
✨ 2 features · 🐛 6 fixes · 🔧 1 symbol
Summary
This release brings significant enhancements to ggml-webgpu, enabling Flash Attention support in more browser environments through tile and vec shader-path fallbacks. Several internal memory handling and shader-path selection details were also refined for better performance and stability.
✨ New Features
- Enabled FLASH_ATTN_EXT support in ggml-webgpu on browsers lacking subgroup matrix support, via a tile flash-attention fallback.
- Added vec and tile variants of flash attention to ggml-webgpu for broader browser support.
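The fallback logic above can be sketched as a small dispatch decision: use the subgroup-matrix kernel when the device supports it, otherwise fall back to a vec or tile variant. This is an illustrative sketch only; the names (`FAPath`, `select_fa_path`, `q_rows`) are hypothetical and not the actual ggml-webgpu API.

```cpp
// Hypothetical sketch of flash-attention shader-path selection.
// All identifiers here are illustrative, not ggml-webgpu's real symbols.
enum class FAPath { SubgroupMatrix, Tile, Vec };

// q_rows: query rows processed per workgroup. A single query row can use
// the cheaper vec path; larger batches need the tile path.
FAPath select_fa_path(bool has_subgroup_matrix, int q_rows) {
    if (has_subgroup_matrix) {
        return FAPath::SubgroupMatrix; // fastest path when hardware allows
    }
    return q_rows == 1 ? FAPath::Vec : FAPath::Tile; // browser fallback
}
```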
🐛 Bug Fixes
- Modified the vec path in ggml-webgpu to discard the mnk parameter.
- Removed Q_TILE as it is always 1 for the vec path.
- Moved row_max and exp_sum calculations to local registers.
- Ensured different bindings sharing the same underlying buffer have identical usage flags.
- Turned off skip_validation and fixed overlapping buffer bindings when nwg == 1.
- Merged bindings when KV overlap occurs.
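The row_max/exp_sum fix above refers to the online-softmax accumulation used by flash attention, where the running row maximum and exponential sum live in per-thread locals ("registers") rather than workgroup memory. Below is a minimal scalar C++ sketch of that accumulation pattern, not the actual WGSL shader; the function name and tiling are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative online-softmax accumulation: process scores tile by tile,
// keeping the running maximum (row_max) and rescaled exponential sum
// (exp_sum) in local variables. Returns the softmax denominator
// sum_j exp(scores[j] - max(scores)) in a numerically stable way.
float online_softmax_denominator(const std::vector<float>& scores, size_t tile) {
    float row_max = -INFINITY; // running maximum, held in a register
    float exp_sum = 0.0f;      // running sum of exp(score - row_max)
    for (size_t i = 0; i < scores.size(); i += tile) {
        const size_t end = std::min(scores.size(), i + tile);
        // tile-local maximum, merged with the running maximum
        float tmax = row_max;
        for (size_t j = i; j < end; ++j) tmax = std::max(tmax, scores[j]);
        // rescale the previous partial sum when the maximum grows
        exp_sum *= std::exp(row_max - tmax);
        for (size_t j = i; j < end; ++j) exp_sum += std::exp(scores[j] - tmax);
        row_max = tmax;
    }
    return exp_sum;
}
```

Keeping these two accumulators in registers avoids repeated round-trips through shared memory inside the inner loop, which is the motivation hinted at by the fix.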