b9784
📦 llama-cpp
⚠ 2 breaking · ✨ 4 features · 🐛 8 fixes · 🔧 11 symbols
Summary
This release introduces a major rework of matrix multiplication (MUL_MAT and MUL_MAT_ID) in the Hexagon backend, with new tiled weight repacking and performance optimizations across HVX and HMX. Support for hardware architectures older than v73 has been removed.
⚠️ Breaking Changes
- Support for hardware architecture versions older than v73 has been removed because HMX is now required for most use cases.
- The new tiled repack format (renamed from x4x2) is now permanent; older formats are removed.
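To make the idea of a tiled layout concrete, here is a minimal Python sketch that reorders a row-major matrix into tile-major order, defaulting to the 32x32 tile size mentioned under New Features. The function name `repack_tiled` and the element ordering inside and across tiles are assumptions for illustration only; these notes do not specify the backend's actual in-memory format.

```python
def repack_tiled(weights, rows, cols, tile=32):
    """Reorder a flat row-major rows x cols matrix into tile-major order.

    Hypothetical helper: tiles are visited left to right, top to bottom,
    and each tile is stored row by row. The real backend layout may differ.
    """
    if len(weights) != rows * cols:
        raise ValueError("weights length does not match rows * cols")
    if rows % tile or cols % tile:
        raise ValueError("rows and cols must be multiples of the tile size")

    out = []
    for tile_row in range(0, rows, tile):          # top edge of each tile
        for tile_col in range(0, cols, tile):      # left edge of each tile
            for r in range(tile_row, tile_row + tile):
                start = r * cols + tile_col
                out.extend(weights[start:start + tile])  # one tile row
    return out
```

With this ordering every tile occupies one contiguous run of `tile * tile` elements, which is the property that generally makes tiled formats convenient for block-oriented matrix units.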
Migration Steps
- Ensure the target hardware architecture is v73 or newer; older architectures are no longer supported.
- Update the build system to use the new tiled repack format (formerly x4x2) consistently.
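The architecture requirement above can be checked in a deployment or CI script with something like the following sketch. The helper `is_supported_arch` and the `"v73"`-style string format are assumptions made for this example; they are not part of llama.cpp or its build system.

```python
import re

# From the release notes: architectures older than v73 are no longer supported.
MIN_ARCH_VERSION = 73


def is_supported_arch(arch: str) -> bool:
    """Return True if an architecture string such as "v75" is v73 or newer.

    Hypothetical helper; the "v<number>" format is an assumption.
    """
    match = re.fullmatch(r"v(\d+)", arch.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized architecture string: {arch!r}")
    return int(match.group(1)) >= MIN_ARCH_VERSION
```

A script could call `is_supported_arch("v69")` before building and stop early when it returns `False`.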
✨ New Features
- Reworked the MUL_MAT and MUL_MAT_ID operations in the Hexagon backend, including 32x32 tiled weight repack, kernel-params, and cached graphs.
- Added support for non-tiled matrix multiplication (mm) as a fallback option in hex-mm.
- Added support for simple graph caching to avoid recomputing kernel-params.
- Enabled HMX for all builds via CMake update.
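The graph caching feature above avoids recomputing kernel-params for work that has been seen before. The general technique can be sketched as a lookup keyed by operation and shape. Everything named here (`KernelParamCache`, `compute_kernel_params`, the parameter fields) is hypothetical and chosen for illustration; it does not reflect the backend's actual data structures.

```python
def compute_kernel_params(op, rows, cols, tile=32):
    """Hypothetical stand-in for a costly kernel-params computation."""
    return {
        "op": op,
        "tiles_r": -(-rows // tile),  # ceiling division
        "tiles_c": -(-cols // tile),
    }


class KernelParamCache:
    """Memoize kernel params by (op, rows, cols). Illustrative only."""

    def __init__(self, compute=compute_kernel_params):
        self._compute = compute
        self._cache = {}
        self.misses = 0  # number of times params were actually computed

    def get(self, op, rows, cols):
        key = (op, rows, cols)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._compute(op, rows, cols)
        return self._cache[key]
```

Repeated calls such as `cache.get("MUL_MAT", 4096, 4096)` compute the parameters once and return the stored result afterwards, which is the saving a cached graph is meant to provide.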
🐛 Bug Fixes
- Fixed HMX/HVX fallback logic and MUL_MAT_ID allocation, unbreaking OLMoE.
- Fixed matmul-id kernel params selection, unbreaking OLMoE and LFM.
- Fixed HVX flat fallback to pass all MUL_MAT tests.
- Restored pipelined mode in HMX-MM.
- Fixed HVX-MM to accumulate in fp32 in tiled kernels, improving accuracy at the same performance.
- Fixed HVX-MM loop unrolling and removed unnecessary masking for tiled accumulators.
- Fixed MUL_MAT_ID kernel_param handling to ensure host/NPU synchronization.
- Relaxed the hardcoded checks that required rows to be a multiple of 256; the checks now rely on VTCM size requirements instead.
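The fp32 accumulation fix above matters because a half-precision accumulator loses small addends once the running sum grows. The sketch below simulates both accumulator widths in plain Python using the `struct` module's `"e"` (binary16) and `"f"` (binary32) formats. It is a numerical illustration under those assumptions, not the HVX kernel itself, and the helper names are invented for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(a, b, round_acc):
    """Dot product of fp16 inputs, rounding the accumulator after each add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + to_fp16(x) * to_fp16(y))
    return acc


a = [0.01] * 4096
b = [1.0] * 4096
fp16_result = dot(a, b, to_fp16)  # accumulator kept in half precision
fp32_result = dot(a, b, to_fp32)  # accumulator kept in single precision
```

Here `fp32_result` stays close to the expected value of about 40.97, while `fp16_result` falls well short because the half-precision accumulator stops growing once its spacing exceeds the size of each addend.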