b9857

📅 Jul 1, 2026 · 📦 llama-cpp

✨ 4 features · 🐛 7 fixes · 🔧 3 symbols

Summary

This release reworks the Hexagon flash attention implementation, with optimizations and accuracy improvements across hex-mm, hex-fa, and hmx-fa. It also fixes tracing coverage, dst-spad alignment, VTCM size computation, and src2 stride handling, and retunes the HVX fallback thresholds.

✨ New Features

Reworked the Hexagon flash attention (FA) implementation for better performance and accuracy.
Added support for FA_SELECT in hex-fa.
Added tanh_f16 and exp2_f16 kernels and used them in FA.
Added preliminary support for sinks in hmx-fa.
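
As a rough illustration of where exp2 and tanh kernels usually appear in flash attention, the sketch below is a scalar Python reference, not the Hexagon code. It assumes the common formulation in which softmax exponentials are computed as powers of two after scaling the logits by log2(e), and tanh is used for optional logit soft-capping. The names fa_row and softcap are hypothetical, and everything runs in Python floats rather than f16.

```python
import math

LOG2E = math.log2(math.e)  # exp(x) == 2 ** (x * LOG2E)

def fa_row(q, keys, values, scale, softcap=0.0):
    """Online-softmax attention for one query row (hypothetical reference)."""
    m = -math.inf                 # running maximum, in the log2 domain
    l = 0.0                       # running softmax denominator
    acc = [0.0] * len(values[0])  # running weighted sum of V rows
    for k, v in zip(keys, values):
        s = scale * sum(a * b for a, b in zip(q, k))
        if softcap > 0.0:
            s = softcap * math.tanh(s / softcap)  # logit soft-capping
        s *= LOG2E                # move to the log2 domain so exp2 applies
        m_new = max(m, s)
        corr = 2.0 ** (m - m_new)  # rescales the earlier partial sums
        p = 2.0 ** (s - m_new)     # unnormalized weight of this key
        l = l * corr + p
        acc = [a * corr + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]
```

The result matches a plain softmax(QK^T)V computation for the same row. The only difference is that each exponential is a power of two, which is the operation an exp2 kernel would provide.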

🐛 Bug Fixes

Fixed dst-spad alignment in hex-mm.
Fixed tracing instrumentation to cover all functions in hex-fa.
Updated HVX fallback thresholds to recover token-generation (t/g) regressions.
Fixed VTCM size computation to use fp32 for accumulators.
Fixed src2 stride handling when mm is fused with add in hex-mm.
Stopped using -inf to initialize mask to avoid conversion overflows in hmx-fa.
Removed the need to explicitly guard -inf in the f16->f32 converter.
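
The last two entries concern -inf mask values. The sketch below shows one generic hazard of infinite mask values in an online softmax; it is not necessarily the exact failure fixed in hmx-fa. The sentinel value (the most negative finite f16) and the helper names f16_roundtrip and rescale are assumptions for illustration only; the release notes do not say which value the kernel now uses.

```python
import math
import struct

F16_LOWEST = -65504.0  # most negative finite IEEE 754 half-precision value

def f16_roundtrip(x):
    """Convert a Python float to binary16 and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def rescale(m_old, s):
    """Online-softmax correction factor exp(m_old - max(m_old, s))."""
    m_new = max(m_old, s)
    return math.exp(m_old - m_new)

# A masked score of -inf against a running max of -inf gives
# (-inf) - (-inf), which is NaN and poisons the accumulator.
nan_case = rescale(-math.inf, -math.inf)

# A finite sentinel keeps the same update well defined.
finite_case = rescale(F16_LOWEST, F16_LOWEST)
```

Because the sentinel is exactly representable in f16, it survives f16 to f32 conversion as an ordinary finite number, so the converter needs no special case for it.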

Affected Symbols

hex-mm, hex-fa, hmx-fa