b8639
📦 llama-cpp
✨ 3 features · 🐛 2 fixes · 🔧 3 symbols
Summary
This release introduces vectorized flash attention support in ggml-webgpu, along with internal cleanups and fixes that optimize the vectorized path, particularly for f16 data.
✨ New Features
- Added vectorized flash attention implementation for ggml-webgpu.
- Enabled the vectorized flash attention path when shape and type constraints on the Q, K, and V tensors are satisfied (e.g., Q->ne[1] < 20, alignment/modulo checks on the head dimensions, and K->type == f16).
- Enabled vec path for q4 and q8 quantization types in flash attention.
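The gating described above can be sketched as a simple predicate. This is an illustrative reconstruction, not the actual ggml-webgpu code: the `Tensor`/`Type` structs, the modulo-4 alignment check, and the exact set of supported K types are assumptions based on the conditions listed in these notes.

```cpp
#include <cstdint>

// Hypothetical sketch of the vec-path gating -- names and the exact
// alignment rule are illustrative, not the real implementation.
enum class Type { F16, F32, Q4_0, Q8_0 };

struct Tensor {
    int64_t ne[4]; // dimensions, ggml-style: ne[0] = head dim, ne[1] = rows
    Type type;
};

bool use_flash_attn_vec(const Tensor & q, const Tensor & k, const Tensor & v) {
    const bool small_batch = q.ne[1] < 20;        // few query rows
    const bool aligned     = q.ne[0] % 4 == 0 &&  // head dims fit a vec4
                             v.ne[0] % 4 == 0;    // (assumed modulo check)
    const bool k_supported = k.type == Type::F16 ||
                             k.type == Type::Q4_0 || // q4/q8 newly enabled
                             k.type == Type::Q8_0;   // per this release
    return small_batch && aligned && k_supported;
}
```

The small-batch cutoff reflects that the vectorized kernel targets decode-style workloads with few query rows; larger batches fall back to the tiled path.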
🐛 Bug Fixes
- Fixed flash-attn vec nwg=1 path and tightened vec specialization in ggml-webgpu.
- Reduced redundant workgroup barriers and used select() in flash_attn_vec_split.wgsl.
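The select() change can be illustrated with a minimal WGSL fragment. This is a sketch of the general pattern, not the shader's actual source: replacing a uniform-looking branch with the built-in `select(falseValue, trueValue, condition)` avoids control-flow divergence in the running-max update typical of flash attention.

```wgsl
// Before: a branch per lane to track the running max score.
// var m_new: f32;
// if (s > m_old) { m_new = s; } else { m_new = m_old; }

// After: branch-free update via the WGSL select() built-in.
let m_new = select(m_old, s, s > m_old);
```

Similarly, a workgroupBarrier() is only required between a write to workgroup memory and a subsequent read by another invocation; barriers with no intervening shared-memory access in between are redundant and can be dropped.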