b9857

📦 llama-cpp
4 features · 🐛 7 fixes · 🔧 3 symbols

Summary

This release reworks the Hexagon flash attention implementation, bringing optimizations and accuracy improvements across the hex-mm, hex-fa, and hmx-fa components. It also fixes bugs and performance regressions related to tracing, memory alignment, and kernel usage.

✨ New Features

  • Reworked the Hexagon flash attention implementation for better performance and accuracy.
  • Added support for FA_SELECT in hex-fa.
  • Added tanh_f16 and exp2_f16 kernels and utilized them in FA.
  • Preliminary support for Sinks in hmx-fa.
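
As a rough illustration of why tanh and exp2 kernels are useful in flash attention: softmax exponentials can be computed as base-2 exponentials after scaling by log2(e), and tanh is a common choice for logit soft-capping. The sketch below shows both identities in plain Python. The function names and the cap value of 30.0 are hypothetical and are not taken from the hex-fa source.

```python
import math

LOG2_E = math.log2(math.e)  # exp(x) == 2 ** (x * log2(e))

def softcap(x, cap):
    """Soft-cap a logit into (-cap, cap) using tanh (illustrative helper)."""
    return cap * math.tanh(x / cap)

def softmax_exp2(logits):
    """Numerically stable softmax computed with base-2 exponentials."""
    m = max(logits)
    # Subtract the row maximum first, then switch to base 2.
    weights = [2.0 ** ((x - m) * LOG2_E) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_ref(logits):
    """Reference softmax using the natural exponential."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical raw logits, soft-capped before the softmax.
logits = [softcap(x, cap=30.0) for x in (1.5, -2.0, 40.0, 0.25)]
probs = softmax_exp2(logits)
```

Both softmax variants agree up to floating-point rounding, which is why a fast exp2 kernel can stand in for exp inside an attention loop.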

🐛 Bug Fixes

  • Fixed dst-spad alignment in hex-mm.
  • Fixed tracing instrumentation to cover all functions in hex-fa.
  • Updated hvx fallback thresholds to recover t/g regressions.
  • Fixed the VTCM size computation to use fp32 for accumulators.
  • Fixed src2 stride handling when mm is fused with add in hex-mm.
  • Stopped using -inf to initialize mask to avoid conversion overflows in hmx-fa.
  • Removed the need to explicitly guard -inf in the f16->f32 converter.
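
The release notes do not describe the exact overflow that the mask change avoids, nor which finite value replaces -inf. As a related illustration only, the sketch below shows a well-known hazard of -inf masks in softmax (a fully masked row produces NaN) and how a finite sentinel sidesteps it. The choice of -65504 (the most negative finite f16 value) and all helper names are assumptions made for this example, not details of hmx-fa.

```python
import math
import struct

F16_LOWEST = -65504.0  # assumed sentinel: most negative finite half-precision value

def to_f16(x):
    """Round a Python float to f16 and back, via the stdlib 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def masked_softmax(scores, mask):
    """Softmax over scores + mask, with the usual max subtraction."""
    s = [a + b for a, b in zip(scores, mask)]
    m = max(s)
    w = [math.exp(v - m) for v in s]
    total = sum(w)
    return [v / total for v in w]

scores = [0.5, -1.0, 2.0]

# Fully masked row, mask initialised with -inf: (-inf) - (-inf) is NaN.
inf_row = masked_softmax(scores, [-math.inf] * 3)

# Same row with a finite sentinel: the result stays finite.
finite_row = masked_softmax(scores, [F16_LOWEST] * 3)

# Partially masked row: the masked position still gets zero weight.
partial = masked_softmax(scores, [0.0, F16_LOWEST, 0.0])

# The sentinel is exactly representable in f16, so it survives conversion.
roundtrip = to_f16(F16_LOWEST)
```

This example runs in Python's double precision. In real f16 arithmetic, adding a score to a sentinel this close to the format's limit could itself overflow, so an actual kernel may need a less extreme value.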

Affected Symbols