b9000
📦 llama-cpp
✨ 3 features · 🐛 4 fixes · 🔧 10 symbols
Summary
This release introduces HMX-accelerated flash attention for prefill and an experimental fp16 softmax path (EXP2_HF) to accelerate it further. It also includes internal optimizations, correctness fixes in the flash-attention paths, and a numerical-stability fix for the combination of softcap and EXP2_HF.
Migration Steps
- If you use the no-ALiBi fast path with an additive positional bias in the mask, that bias is now preserved instead of being dropped (see the sketch after this list).
- If you use softcap together with EXP2_HF, softcapped logits are now numerically correct; outputs may differ from earlier builds.
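For the first item, a minimal sketch of what "preserving the additive mask bias" means, assuming a simplified scalar mask application (function names and layout here are illustrative, not the backend's actual HVX/HMX code):

```c
// Minimal sketch, assuming a scalar fast path; the real backend works on
// HVX/HMX tiles. "mask" holds -INFINITY for masked positions and may also
// carry an additive positional bias.
#include <math.h>

// Previous behaviour (illustrative): the fast path treated the mask as a
// pure keep/drop flag, so any additive bias it carried was silently lost.
static void apply_mask_dropping_bias(float *scores, const float *mask, int n) {
    for (int i = 0; i < n; i++) {
        if (mask[i] == -INFINITY) {
            scores[i] = -INFINITY;
        }
    }
}

// Fixed behaviour: add the mask values, which both applies -INFINITY masking
// and preserves any additive positional bias.
static void apply_mask_preserving_bias(float *scores, const float *mask, int n) {
    for (int i = 0; i < n; i++) {
        scores[i] += mask[i];
    }
}
```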
✨ New Features
- Added HMX-accelerated flash attention support for prefill.
- Introduced experimental fp16 softmax (EXP2_HF) to accelerate flash attention, using hvx_exp2_hf directly (see the sketch after this list).
- Relaxed matmul pipeline gate to cover k > n shapes (e.g., FFN_down).
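The EXP2_HF idea rests on the identity exp(x) = exp2(x · log2(e)): folding log2(e) into the pre-softmax scale lets the hot loop use the cheaper exp2. A minimal scalar sketch under that assumption (illustrative names; the real path uses the hvx_exp2_hf vector routine on fp16 data):

```c
// Minimal scalar sketch of an exp2-based softmax (the idea behind EXP2_HF).
// exp(x) == exp2(x * log2(e)), so folding log2(e) into the pre-softmax scale
// lets the inner loop call exp2 instead of exp. Names are illustrative.
#include <math.h>

#define LOG2_E 1.4426950408889634f

static void softmax_exp2(float *scores, int n, float qk_scale) {
    const float scale = qk_scale * LOG2_E;  // fold log2(e) into the scale

    float max_s = -INFINITY;
    for (int i = 0; i < n; i++) {
        scores[i] *= scale;
        if (scores[i] > max_s) {
            max_s = scores[i];
        }
    }

    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        scores[i] = exp2f(scores[i] - max_s);  // exp2 stands in for exp
        sum += scores[i];
    }
    for (int i = 0; i < n; i++) {
        scores[i] /= sum;
    }
}
```

Without softcap this folding is harmless, since the scale is a plain multiplier on the logits; the fix in the section below addresses what happens when softcap is also active.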
🐛 Bug Fixes
- Fixed prefill correctness issues in HMX flash attention (dst indexing, softmax reduce, V stride).
- Fixed a dual-tile out-of-bounds race condition on p_tiles in HMX flash attention and enabled MT + pipelining for this path.
- Preserved the additive mask bias in the no-ALiBi fast path; it was previously dropped when the mask carried a positional bias.
- Fixed the softcap + EXP2_HF interaction: log2(e) is now folded into the post-tanh multiplier (v_cap) instead of being pre-baked into qk_scale, preventing numerical errors when both are active.
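A minimal scalar sketch of the softcap + EXP2_HF reasoning, assuming the softcap form s = v_cap · tanh(qk_scale · qk / v_cap) (names are illustrative, not the backend's actual symbols):

```c
// Minimal scalar sketch of the softcap + EXP2_HF interaction, assuming the
// softcap form s = v_cap * tanh(qk_scale * qk / v_cap). Names are
// illustrative, not the backend's actual symbols.
#include <math.h>

#define LOG2_E 1.4426950408889634f

// Previous behaviour (illustrative): pre-baking log2(e) into qk_scale
// distorts the tanh argument, so exp2 of the result no longer matches
// exp of the intended softcapped logit.
static float softcapped_logit_exp2_broken(float qk, float qk_scale, float v_cap) {
    const float s = (qk_scale * LOG2_E) * qk;   // log2(e) in the wrong place
    return v_cap * tanhf(s / v_cap);
}

// Fixed: qk_scale stays untouched and log2(e) is folded into the post-tanh
// multiplier, so exp2(result) == exp(softcapped logit).
static float softcapped_logit_exp2_fixed(float qk, float qk_scale, float v_cap) {
    const float s = qk_scale * qk;
    return (v_cap * LOG2_E) * tanhf(s / v_cap);
}
```

Folding log2(e) after the tanh keeps the softcap clamp intact while still letting exp2 stand in for exp, since exp2(v_cap · log2(e) · tanh(x)) == exp(v_cap · tanh(x)).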