8 changes

b8639

📦 llama-cpp
✨ 3 features · 🐛 2 fixes · 🔧 3 symbols

Summary

This release introduces vectorized flash attention support for ggml-webgpu, along with numerous internal cleanups and fixes to optimize the vectorized path, especially for f16 data types.

✨ New Features

  • Added vectorized flash attention implementation for ggml-webgpu.
  • Enabled the vectorized flash attention path when conditions on the Q, K, and V tensors are met (e.g., Q->ne[1] < 20, certain modulo checks on tensor dimensions, K->type == f16).
  • Enabled the vec path for q4 and q8 quantization types in flash attention.
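The gating described above can be sketched as a predicate over the tensors involved. This is a hedged illustration, not the actual ggml-webgpu code: the struct, type enum, and the exact set of supported types are assumptions, and the modulo checks mentioned in the release notes are elided since their specifics are not given here.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical simplified tensor: ne[] holds dimensions, as in ggml.
enum kv_type { TYPE_F16, TYPE_Q4_0, TYPE_Q8_0, TYPE_OTHER };

struct tensor {
    long ne[4];
    enum kv_type type;
};

// Per the notes, the vec path also covers q4/q8 quantized K/V, not just f16.
static bool supported_kv_type(enum kv_type t) {
    return t == TYPE_F16 || t == TYPE_Q4_0 || t == TYPE_Q8_0;
}

// Sketch of the dispatch decision: take the vectorized path only when the
// query batch is small (Q->ne[1] < 20) and K/V types are supported.
// The additional modulo checks from the release notes are omitted here.
static bool use_flash_attn_vec(const struct tensor *q,
                               const struct tensor *k,
                               const struct tensor *v) {
    return q->ne[1] < 20
        && supported_kv_type(k->type)
        && supported_kv_type(v->type);
}
```

The small-batch cutoff reflects the usual trade-off: the vec kernel wins when few query rows are in flight (e.g., token-by-token decoding), while larger batches favor the tiled matrix path.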

🐛 Bug Fixes

  • Fixed flash-attn vec nwg=1 path and tightened vec specialization in ggml-webgpu.
  • Reduced redundant workgroup barrier usage and used `select` in flash_attn_vec_split.wgsl.
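WGSL's built-in `select(f, t, cond)` returns `t` when `cond` is true and `f` otherwise, which lets a shader pick between two values without a divergent branch. The C helper below (a hypothetical name, for illustration only) mirrors that contract; it is not code from the repository.

```c
#include <assert.h>

// Branchless-style selection mirroring WGSL's select(f, t, cond):
// returns on_true when cond is nonzero, on_false otherwise.
static float select_f32(float on_false, float on_true, int cond) {
    return cond ? on_true : on_false;
}
```

In a shader, replacing an `if`/`else` with `select` keeps all invocations in a workgroup on the same control path, which is generally cheaper on GPUs than divergent branching.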

Affected Symbols