b9470
📦 llama-cpp
✨ 4 features · 🐛 5 fixes · 🔧 7 symbols
Summary
This release focuses on cleanup and performance optimization across the Hexagon (HEX) and HMX backends, covering matrix multiplication (MUL_MAT, MUL_MAT_ID), Flash Attention, and GDN. It also introduces initial F32 matmul support and fixes several fusion and stride bugs.
✨ New Features
- Initial support for F32 * F32 -> F32 matmuls in hex-mm.
- Added support for F32 * F32 -> F32 matmul_2d on HMX using Q4_0 dequantization to F16.
- Re-introduced a more generic pipelined vs non-pipelined mode for hmx-mm.
- Initial version of MUL_MAT_ID support for HMX.
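
For readers unfamiliar with MUL_MAT_ID, the sketch below shows the general idea of an expert-indexed matmul in plain C: each input vector is multiplied by the weight matrix selected by its expert index. It is a simplified scalar reference with one expert per token, not the HMX implementation. The function name `mul_mat_id_ref` and its flat row-major layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical reference for an expert-indexed matmul.
 * w:   n_expert matrices, each n_rows x n_cols, row-major, back to back
 * x:   n_tokens input vectors of length n_cols
 * ids: one expert index per token (an assumption; real layouts may differ)
 * y:   n_tokens output vectors of length n_rows */
void mul_mat_id_ref(const float *w, const float *x, const int *ids,
                    float *y, size_t n_expert, size_t n_rows,
                    size_t n_cols, size_t n_tokens)
{
    for (size_t t = 0; t < n_tokens; t++) {
        assert(ids[t] >= 0 && (size_t)ids[t] < n_expert);
        /* Pick the weight matrix for this token's expert. */
        const float *we = w + (size_t)ids[t] * n_rows * n_cols;
        const float *xt = x + t * n_cols;
        for (size_t r = 0; r < n_rows; r++) {
            float acc = 0.0f;
            for (size_t c = 0; c < n_cols; c++) {
                acc += we[r * n_cols + c] * xt[c];
            }
            y[t * n_rows + r] = acc;
        }
    }
}
```

A scalar reference of this kind is mainly useful as a correctness baseline when comparing against an accelerated backend.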
🐛 Bug Fixes
- Fixed src1 stride use in fused rms_norm_mul in hex-rms-norm.
- Cleared spad (scratchpad) pointers in ops that clobber them, fixing failures in fused rms-norm-mul for qwen3.5-2B at specific batch sizes.
- Fixed mxfp4 handling for MUL_MAT_ID in hmx-mm.
- Fixed a bug in the hex-ops fusion logic that scrambled the order of the src tensors when some srcs are empty.
- hex-fa now correctly falls back to HVX if sinks are present or the dimensions are not suitable.
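
The release notes do not describe the src1 stride fix in detail. As a generic illustration of that class of bug, the sketch below multiplies two operands row by row, giving each operand its own row stride. The normalization step is omitted to keep the focus on addressing. The helper name `mul_rows_strided` and the element-based strides are assumptions, not the backend's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: y[r][c] = x[r][c] * w[r][c].
 * Every operand carries its own row stride, counted in elements.
 * A stride of 0 for w broadcasts a single row to all rows of x. */
void mul_rows_strided(const float *x, size_t x_stride,
                      const float *w, size_t w_stride,
                      float *y, size_t y_stride,
                      size_t n_rows, size_t n_cols)
{
    assert(x_stride >= n_cols && y_stride >= n_cols);
    for (size_t r = 0; r < n_rows; r++) {
        const float *xr = x + r * x_stride;
        /* Use w's own stride; reusing x_stride here would read the wrong row. */
        const float *wr = w + r * w_stride;
        float *yr = y + r * y_stride;
        for (size_t c = 0; c < n_cols; c++) {
            yr[c] = xr[c] * wr[c];
        }
    }
}
```

When the first operand is padded and the second is not, the two strides differ, which is where indexing the second operand with the first operand's stride goes wrong.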