
b9784

Breaking Changes
📦 llama-cpp
2 breaking · 4 features · 🐛 8 fixes · 🔧 11 symbols

Summary

This release reworks Hexagon matrix multiplication (MUL_MAT/MUL_MAT_ID) with a new tiled weight repack format and performance optimizations across the HVX and HMX backends. Support for hardware older than architecture v73 has been removed.

⚠️ Breaking Changes

  • Support for hardware architecture versions older than v73 has been removed because HMX is now required for most use-cases.
  • The new tiled repack format (renamed from x4x2) is now permanent; older formats are removed.
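
To make the idea of a tiled repack concrete, here is a minimal sketch of repacking a row-major matrix into contiguous 32x32 tiles (the tile size named under New Features). It is an illustration only: the actual Hexagon layout, element types (real weights are typically quantized) and the function names `repack_tiled` and `can_use_tiled` are assumptions, not the llama.cpp implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tile edge taken from the release notes ("32x32 tiled weight repack").
constexpr std::size_t TILE = 32;

// Hypothetical predicate: this sketch only tiles matrices whose dimensions
// are multiples of TILE. The real backend's criteria are not documented here.
bool can_use_tiled(std::size_t rows, std::size_t cols) {
    return rows % TILE == 0 && cols % TILE == 0;
}

// Repack a row-major rows x cols matrix so that each TILE x TILE tile is
// stored contiguously. Tiles are ordered row-major, and elements inside a
// tile are row-major as well (an assumed ordering).
std::vector<float> repack_tiled(const std::vector<float> & src,
                                std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    assert(can_use_tiled(rows, cols));

    std::vector<float> dst(src.size());
    const std::size_t tiles_per_row = cols / TILE;

    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const std::size_t tile_index = (r / TILE) * tiles_per_row + (c / TILE);
            const std::size_t in_tile    = (r % TILE) * TILE + (c % TILE);
            dst[tile_index * TILE * TILE + in_tile] = src[r * cols + c];
        }
    }
    return dst;
}
```

Storing each tile contiguously lets a kernel stream one tile at a time; a shape that fails `can_use_tiled` in this sketch would have to take a non-tiled path.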

Migration Steps

  1. Ensure the target hardware architecture is v73 or newer; support for older architectures has been dropped.
  2. Update the build system to use the new tiled repack format (formerly x4x2) consistently.
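
For step 1, a project that selects the target architecture from a tag such as `v73` could guard against unsupported targets early. The sketch below assumes tags of the form `v<number>`; `parse_hexagon_arch` and `is_supported_hexagon_arch` are hypothetical helpers, not llama.cpp or Hexagon SDK functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Minimum architecture version stated in these release notes.
constexpr int MIN_HEXAGON_ARCH = 73;

// Hypothetical helper: parse a tag such as "v73" and return its numeric
// part, or -1 if the tag is malformed (wrong prefix, non-digits, too long).
int parse_hexagon_arch(const std::string & tag) {
    if (tag.size() < 2 || tag.size() > 4 || tag[0] != 'v') {
        return -1;
    }
    int value = 0;
    for (std::size_t i = 1; i < tag.size(); ++i) {
        if (tag[i] < '0' || tag[i] > '9') {
            return -1;
        }
        value = value * 10 + (tag[i] - '0');
    }
    return value;
}

// True when the tag names an architecture this release still supports.
bool is_supported_hexagon_arch(const std::string & tag) {
    return parse_hexagon_arch(tag) >= MIN_HEXAGON_ARCH;
}
```

Malformed tags parse to -1 and are therefore rejected along with anything older than v73.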

✨ New Features

  • Reworked the MUL_MAT and MUL_MAT_ID operations in the Hexagon backend, including a 32x32 tiled weight repack, kernel-params, and cached graphs.
  • Added support for non-tiled matrix multiplication (mm) as a fallback option in hex-mm.
  • Added support for simple graph caching to avoid recomputing kernel-params.
  • Enabled HMX for all builds via CMake update.
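
The graph-caching item can be pictured as memoizing kernel parameters by problem shape, so that a repeated shape skips the recomputation. The sketch below is an assumption-laden illustration: `KernelParams`, `KernelParamsCache`, the choice of cache key and the rule for picking the tiled path are all hypothetical and are not taken from the llama.cpp source.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical parameter bundle; the real kernel-params are not described
// in the release notes.
struct KernelParams {
    std::int64_t tile_rows;
    std::int64_t tile_cols;
    bool         use_tiled;  // false means "take the non-tiled fallback"
};

class KernelParamsCache {
public:
    // Return cached parameters for an (m, n, k) shape, computing them once.
    const KernelParams & get(std::int64_t m, std::int64_t n, std::int64_t k) {
        const Key key{m, n, k};
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++misses_;
            it = cache_.emplace(key, compute(n, k)).first;
        }
        return it->second;
    }

    // Number of times parameters had to be computed.
    int misses() const { return misses_; }

private:
    using Key = std::tuple<std::int64_t, std::int64_t, std::int64_t>;

    // Assumed selection rule: use 32x32 tiles only when both weight
    // dimensions are multiples of 32.
    static KernelParams compute(std::int64_t n, std::int64_t k) {
        const bool tiled = (n % 32 == 0) && (k % 32 == 0);
        return KernelParams{32, 32, tiled};
    }

    std::map<Key, KernelParams> cache_;
    int misses_ = 0;
};
```

Calling `get` twice with the same shape computes the parameters only once, which is the saving the caching feature is aiming for.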

🐛 Bug Fixes

  • Fixed HMX/HVX fallback logic and MUL_MAT_ID allocation, unbreaking OLMoE.
  • Fixed matmul-id kernel params selection, unbreaking OLMoE and LFM.
  • Fixed HVX flat fallback to pass all MUL_MAT tests.
  • Restored pipelined mode in HMX-MM.
  • Fixed HVX-MM to accumulate in fp32 in tiled kernels, improving accuracy with no loss of performance.
  • Fixed HVX-MM loop unrolling and removed unnecessary masking for tiled accumulators.
  • Fixed MUL_MAT_ID kernel_param handling to ensure host/NPU synchronization.
  • Relaxed the hardcoded checks that required row counts to be a multiple of 256; the backend now relies on VTCM size requirements instead.
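
The last fix replaces a fixed divisibility rule with a capacity-based one. The contrast can be sketched as follows; the sizing formula in `staging_bytes`, the function names and any VTCM budget passed in are assumptions for illustration, since the release notes do not give the real requirement.

```cpp
#include <cassert>
#include <cstddef>

// Old-style rule as described in the notes: rows must be a multiple of 256.
bool rows_ok_fixed_multiple(std::size_t rows) {
    return rows % 256 == 0;
}

// Hypothetical working-set estimate: a block of weight rows plus one
// activation row, all with the same element size.
std::size_t staging_bytes(std::size_t rows, std::size_t cols,
                          std::size_t bytes_per_elem) {
    return (rows * cols + cols) * bytes_per_elem;
}

// Capacity-style rule: accept any non-zero row count whose estimated
// working set fits in the given VTCM budget.
bool rows_ok_vtcm(std::size_t rows, std::size_t cols,
                  std::size_t bytes_per_elem, std::size_t vtcm_bytes) {
    return rows > 0 && staging_bytes(rows, cols, bytes_per_elem) <= vtcm_bytes;
}
```

Under the capacity rule a row count such as 100 can be accepted when its working set fits, even though it is not a multiple of 256.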

Affected Symbols