b8204
📦 llama-cpp
✨ 8 features · 🐛 2 fixes · 🔧 8 symbols
Summary
This release focuses heavily on Flash Attention optimizations for the Hexagon backend, including DMA pipelining, mpyacc-based accumulation, and multi-row processing, alongside MatMul updates and bug fixes in Hexagon vector operations.
Migration Steps
- hvx_dot_f16_f16_aa_rx4 now accepts the vector count and the leftover element count as parameters; callers must be updated to pass these counts explicitly.
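The interface change above can be sketched in portable scalar C. The function name `hvx_dot_f16_f16_aa_rx4` comes from the release notes, but the parameter names (`n_vec`, `n_left`), the `VEC_ELEMS` constant, and the scalar loop structure below are illustrative assumptions, not the real HVX kernel, which operates on 128-byte HVX vectors of f16 data:

```c
#include <assert.h>
#include <stddef.h>

/* Scalar stand-in for the refactored signature: the caller passes the number
 * of full vectors (n_vec) and the leftover element count (n_left) instead of
 * a raw length. VEC_ELEMS and all names here are illustrative assumptions. */
#define VEC_ELEMS 4 /* elements per "vector" in this scalar sketch */

static float dot_by_counts(const float *x, const float *y,
                           size_t n_vec, size_t n_left) {
    float acc = 0.0f;
    size_t i = 0;
    /* main loop: whole vectors (vectorized per lane in the real kernel) */
    for (size_t v = 0; v < n_vec; v++)
        for (size_t l = 0; l < VEC_ELEMS; l++, i++)
            acc += x[i] * y[i];
    /* tail loop: leftover elements that don't fill a whole vector */
    for (size_t l = 0; l < n_left; l++, i++)
        acc += x[i] * y[i];
    return acc;
}
```

Passing the two counts separately lets the caller hoist the `n / VEC_ELEMS` split out of the hot loop and makes the tail handling explicit at the call site.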
✨ New Features
- Implemented Flash Attention optimizations (DMA, mpyacc, multi-row) for Hexagon.
- Optimized MatMul operations for Hexagon.
- Enhanced hvx_dot_f16_f16_aa_rx4 for improved performance by expanding vector handling and optimizing accumulation.
- Added hvx_dot_f16_f16_aa_rx32 for enhanced vector processing in flash attention.
- Used block-size 64 for DMA pipelining in hex-fa.
- Optimized vec-dot for v79 and above in hex-fa.
- Rewrote mad_f32_f16 using hvx_vec_mpyacc in hex-fa.
- Used mpyacc in matmul dot functions in hex-mm.
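Several of the bullets above (`mad_f32_f16`, the matmul dot functions) move to mpyacc-style multiply-accumulate. A minimal scalar sketch of that pattern follows; the function name `mad_f32_sketch` and its signature are assumptions for illustration, and where the HVX instruction fuses the multiply and accumulate into one vector op, this sketch only suggests the fusion with a single `acc[i] += x[i] * s` expression:

```c
#include <assert.h>
#include <stddef.h>

/* Scalar sketch of a multiply-accumulate ("mpyacc"-style) inner loop:
 * scale a vector x by s and accumulate into acc. Names are illustrative. */
static void mad_f32_sketch(float *acc, const float *x, float s, size_t n) {
    for (size_t i = 0; i < n; i++)
        acc[i] += x[i] * s; /* one multiply-accumulate per element */
}
```

Using a fused multiply-accumulate keeps the intermediate product in a register and halves the instruction count of the inner loop compared to separate multiply and add steps.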
🐛 Bug Fixes
- Fixed a compile error in ggml-hexagon.
- Fixed hvx_dot_f16_f16_aa_rx4 to handle leftover elements correctly using masking.
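The masking fix above can be illustrated with a scalar sketch: when the tail of an array is processed with a full-width vector, lanes past the leftover count must be zeroed so stale lane data doesn't leak into the accumulator. The `LANES` constant, the function name, and the float mask below are assumptions standing in for the real HVX predicate-mask code:

```c
#include <assert.h>
#include <stddef.h>

/* Scalar sketch of leftover-element masking in a dot product: only the
 * first n_left lanes contribute; the rest are masked to zero. Illustrative
 * names, not the actual HVX implementation. */
#define LANES 4

static float masked_tail_dot(const float *x, const float *y, size_t n_left) {
    float acc = 0.0f;
    for (size_t l = 0; l < LANES; l++) {
        /* mask is 1 for valid lanes, 0 for lanes beyond the leftover count */
        float mask = (l < n_left) ? 1.0f : 0.0f;
        acc += mask * x[l] * y[l];
    }
    return acc;
}
```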