b9857
📦 llama-cpp
✨ 4 features · 🐛 7 fixes · 🔧 3 symbols
Summary
This release reworks the Hexagon flash attention implementation, with optimizations and accuracy improvements across hex-mm, hex-fa, and hmx-fa. It also includes bug fixes and performance changes related to tracing, memory alignment, and kernel usage.
✨ New Features
- Reworked the Hexagon flash attention implementation for optimizations and accuracy improvements.
- Added support for FA_SELECT in hex-fa.
- Added tanh_f16 and exp2_f16 kernels and used them in flash attention (FA).
- Added preliminary support for sinks in hmx-fa.
🐛 Bug Fixes
- Fixed dst-spad alignment in hex-mm.
- Fixed tracing instrumentation to cover all functions in hex-fa.
- Updated HVX fallback thresholds to recover token-generation (t/g) regressions.
- Fixed VTCM size computation to use fp32 for accumulators.
- Fixed src2 stride handling when matrix multiplication (mm) is fused with add in hex-mm.
- Stopped using -inf to initialize mask to avoid conversion overflows in hmx-fa.
- Removed the need to explicitly guard -inf in the f16->f32 converter.