8 changes

b8639

📦 llama-cpp
✨ 3 features · 🐛 2 fixes · 🔧 3 symbols

Summary

This release introduces vectorized flash attention support for ggml-webgpu, along with numerous internal cleanups and fixes to optimize the vectorized path, especially for f16 data types.

✨ New Features

  • Added vectorized flash attention implementation for ggml-webgpu.
  • Enabled the vectorized flash attention path when conditions on the Q, K, and V tensors are met (e.g., Q->ne[1] < 20, certain modulo checks on tensor dimensions, K->type == f16).
  • Enabled the vec path for q4 and q8 quantization types in flash attention.
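The gating described above can be sketched as a predicate over the tensors involved. This is a hedged illustration, not the actual ggml-webgpu code: the struct, type enum, and the exact set of supported types are assumptions, and the modulo checks mentioned in the release notes are elided since their specifics are not given here.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical simplified tensor: ne[] holds dimensions, as in ggml.
enum kv_type { TYPE_F16, TYPE_Q4_0, TYPE_Q8_0, TYPE_OTHER };

struct tensor {
    long ne[4];
    enum kv_type type;
};

// Per the notes, the vec path also covers q4/q8 quantized K/V, not just f16.
static bool supported_kv_type(enum kv_type t) {
    return t == TYPE_F16 || t == TYPE_Q4_0 || t == TYPE_Q8_0;
}

// Sketch of the dispatch decision: take the vectorized path only when the
// query batch is small (Q->ne[1] < 20) and K/V types are supported.
// The additional modulo checks from the release notes are omitted here.
static bool use_flash_attn_vec(const struct tensor *q,
                               const struct tensor *k,
                               const struct tensor *v) {
    return q->ne[1] < 20
        && supported_kv_type(k->type)
        && supported_kv_type(v->type);
}
```

The small-batch cutoff reflects the usual trade-off: the vec kernel wins when few query rows are in flight (e.g., token-by-token decoding), while larger batches favor the tiled matrix path.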

🐛 Bug Fixes

  • Fixed flash-attn vec nwg=1 path and tightened vec specialization in ggml-webgpu.
  • Reduced redundant workgroup barrier usage and used `select` in flash_attn_vec_split.wgsl.
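WGSL's built-in `select(f, t, cond)` returns `t` when `cond` is true and `f` otherwise, which lets a shader pick between two values without a divergent branch. The C helper below (a hypothetical name, for illustration only) mirrors that contract; it is not code from the repository.

```c
#include <assert.h>

// Branchless-style selection mirroring WGSL's select(f, t, cond):
// returns on_true when cond is nonzero, on_false otherwise.
static float select_f32(float on_false, float on_true, int cond) {
    return cond ? on_true : on_false;
}
```

In a shader, replacing an `if`/`else` with `select` keeps all invocations in a workgroup on the same control path, which is generally cheaper on GPUs than divergent branching.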

Affected Symbols