b9000
📦 llama-cpp
✨ 3 features · 🐛 4 fixes · 🔧 10 symbols
Summary
This release introduces HMX-accelerated flash attention for prefill and an experimental fp16 softmax path (EXP2_HF) to accelerate it further. It also includes internal optimizations, correctness fixes in the flash-attention paths, and a numerical-stability fix for the combination of softcap and EXP2_HF.
Migration Steps
- If you use the no-ALiBi fast path with an additive positional bias in the mask, that bias is now preserved instead of being dropped (see the sketch after this list).
- If you use softcap together with EXP2_HF, softcapped logits are now numerically correct; outputs may differ from earlier builds.
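For the first item, a minimal sketch of what "preserving the additive mask bias" means, assuming a simplified scalar mask application (function names and layout here are illustrative, not the backend's actual HVX/HMX code):

```c
// Minimal sketch, assuming a scalar fast path; the real backend works on
// HVX/HMX tiles. "mask" holds -INFINITY for masked positions and may also
// carry an additive positional bias.
#include <math.h>

// Previous behaviour (illustrative): the fast path treated the mask as a
// pure keep/drop flag, so any additive bias it carried was silently lost.
static void apply_mask_dropping_bias(float *scores, const float *mask, int n) {
    for (int i = 0; i < n; i++) {
        if (mask[i] == -INFINITY) {
            scores[i] = -INFINITY;
        }
    }
}

// Fixed behaviour: add the mask values, which both applies -INFINITY masking
// and preserves any additive positional bias.
static void apply_mask_preserving_bias(float *scores, const float *mask, int n) {
    for (int i = 0; i < n; i++) {
        scores[i] += mask[i];
    }
}
```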
✨ New Features
- Added HMX-accelerated flash attention support for prefill.
- Introduced experimental fp16 softmax (EXP2_HF) to accelerate flash attention, using hvx_exp2_hf directly (see the sketch after this list).
- Relaxed matmul pipeline gate to cover k > n shapes (e.g., FFN_down).
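The EXP2_HF idea rests on the identity exp(x) = exp2(x · log2(e)): folding log2(e) into the pre-softmax scale lets the hot loop use the cheaper exp2. A minimal scalar sketch under that assumption (illustrative names; the real path uses the hvx_exp2_hf vector routine on fp16 data):

```c
// Minimal scalar sketch of an exp2-based softmax (the idea behind EXP2_HF).
// exp(x) == exp2(x * log2(e)), so folding log2(e) into the pre-softmax scale
// lets the inner loop call exp2 instead of exp. Names are illustrative.
#include <math.h>

#define LOG2_E 1.4426950408889634f

static void softmax_exp2(float *scores, int n, float qk_scale) {
    const float scale = qk_scale * LOG2_E;  // fold log2(e) into the scale

    float max_s = -INFINITY;
    for (int i = 0; i < n; i++) {
        scores[i] *= scale;
        if (scores[i] > max_s) {
            max_s = scores[i];
        }
    }

    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        scores[i] = exp2f(scores[i] - max_s);  // exp2 stands in for exp
        sum += scores[i];
    }
    for (int i = 0; i < n; i++) {
        scores[i] /= sum;
    }
}
```

Without softcap this folding is harmless, since the scale is a plain multiplier on the logits; the fix in the section below addresses what happens when softcap is also active.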
🐛 Bug Fixes
- Fixed prefill correctness issues in HMX flash attention (dst indexing, softmax reduce, V stride).
- Fixed a dual-tile out-of-bounds race condition on p_tiles in HMX flash attention and enabled MT + pipelining for this path.
- Preserved the additive mask bias in the no-ALiBi fast path; it was previously dropped when the mask carried a positional bias.
- Fixed the softcap + EXP2_HF interaction: log2(e) is now folded into the post-tanh multiplier (v_cap) instead of being pre-baked into qk_scale, preventing numerical errors when both are active.
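A minimal scalar sketch of the softcap + EXP2_HF reasoning, assuming the softcap form s = v_cap · tanh(qk_scale · qk / v_cap) (names are illustrative, not the backend's actual symbols):

```c
// Minimal scalar sketch of the softcap + EXP2_HF interaction, assuming the
// softcap form s = v_cap * tanh(qk_scale * qk / v_cap). Names are
// illustrative, not the backend's actual symbols.
#include <math.h>

#define LOG2_E 1.4426950408889634f

// Previous behaviour (illustrative): pre-baking log2(e) into qk_scale
// distorts the tanh argument, so exp2 of the result no longer matches
// exp of the intended softcapped logit.
static float softcapped_logit_exp2_broken(float qk, float qk_scale, float v_cap) {
    const float s = (qk_scale * LOG2_E) * qk;   // log2(e) in the wrong place
    return v_cap * tanhf(s / v_cap);
}

// Fixed: qk_scale stays untouched and log2(e) is folded into the post-tanh
// multiplier, so exp2(result) == exp(softcapped logit).
static float softcapped_logit_exp2_fixed(float qk, float qk_scale, float v_cap) {
    const float s = qk_scale * qk;
    return (v_cap * LOG2_E) * tanhf(s / v_cap);
}
```

Folding log2(e) after the tanh keeps the softcap clamp intact while still letting exp2 stand in for exp, since exp2(v_cap · log2(e) · tanh(x)) == exp(v_cap · tanh(x)).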