b8956

📅 Apr 28, 2026 · 📦 llama-cpp

✨ 8 features · 🐛 3 fixes · 🔧 24 symbols

Summary

This release adds seven CANN operators (SET, CUMSUM, FILL, DIAG, TRI, SOLVE_TRI, and SOFTPLUS) and performance optimizations for existing operations, including GLU, CROSS_ENTROPY_LOSS, PAD_REFLECT_1D, GET_ROWS, SET_ROWS, and OUT_PROD. It also fixes an ACL graph cache bug in which a cached graph could be shared between nodes that differ only in tensor type (F16 vs BF16), producing incorrect results.

Migration Steps

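L2_NORM behavior changed in two ways: the in-place ClampMin now clamps the correct tensor (it previously clamped the wrong one), and the norm is clamped to eps before the division to prevent divide-by-zero. L2_NORM results on the CANN backend may therefore differ from earlier builds.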
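The eps clamp described above can be illustrated with a small reference sketch. It assumes the usual formulation y = x / max(||x||, eps); the function name `l2_norm` and the default eps are hypothetical, and this is not the CANN kernel sequence itself.

```python
import math

def l2_norm(row, eps=1e-12):
    """Scale a row to unit L2 length, clamping the norm to eps first."""
    norm = math.sqrt(sum(v * v for v in row))
    # Clamp the norm (not the input row) from below, then divide.
    denom = max(norm, eps)
    return [v / denom for v in row]

print(l2_norm([3.0, 4.0]))  # [0.6, 0.8]
print(l2_norm([0.0, 0.0]))  # [0.0, 0.0] instead of a division by zero
```

Clamping the denominator rather than the input is the point of the fix: an all-zero row stays all-zero and no division by zero occurs.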

✨ New Features

Added new CANN operators: GGML_OP_SET (via aclnnInplaceCopy), GGML_OP_CUMSUM (via aclnnCumsum), GGML_OP_FILL (via aclnnInplaceFillScalar), GGML_OP_DIAG (via aclnnInplaceCopy on diagonal strides), GGML_OP_TRI (via aclnnTril/aclnnTriu), GGML_OP_SOLVE_TRI (via aclnnTriangularSolve), and GGML_UNARY_OP_SOFTPLUS (via aclnnSoftplus).
Optimized GLU variants (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK) by fusing them with aclnnSwiGlu / aclnnGeGluV3 when applicable and moving fallback checks internally.
Optimized CROSS_ENTROPY_LOSS by replacing a 5-kernel sequence with a single aclnnSoftmaxCrossEntropyWithLogits call.
Optimized PAD_REFLECT_1D by eliminating a per-ne[3] loop, asserting contiguity, and calling ReflectionPad1d once on the full 4-D view.
Optimized GET_ROWS by replacing IndexSelect with GatherV2 per batch slice and inlining the batch loop.
Optimized SET_ROWS by replacing IndexCopy with InplaceIndexCopy per batch slice and inlining the batch loop.
Optimized OUT_PROD by replacing an O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with a per-slice Matmul loop, improving handling of strided-broadcast batch dims.
Implemented backend memset_tensor via aclrtMemset (previously NULL).
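To show why the OUT_PROD change above is a pure restructuring, the sketch below compares an accumulation of rank-1 outer products (the Ger plus in-place add pattern) with a single matrix product per slice. The list-of-columns layout and both function names are illustrative assumptions and do not reflect ggml's tensor layout or the aclnn calls.

```python
def outer_sum(a_cols, b_cols):
    """Accumulate one rank-1 outer product per shared index k."""
    m, n = len(a_cols[0]), len(b_cols[0])
    out = [[0.0] * n for _ in range(m)]
    for a, b in zip(a_cols, b_cols):      # one Ger-style update per k
        for i in range(m):
            for j in range(n):
                out[i][j] += a[i] * b[j]
    return out

def matmul_form(a_cols, b_cols):
    """Same values as one matrix product: out[i][j] = sum_k a_k[i] * b_k[j]."""
    m, n = len(a_cols[0]), len(b_cols[0])
    return [[sum(a[i] * b[j] for a, b in zip(a_cols, b_cols))
             for j in range(n)]
            for i in range(m)]

a_cols = [[1.0, 2.0], [3.0, 4.0]]
b_cols = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]
print(outer_sum(a_cols, b_cols) == matmul_form(a_cols, b_cols))  # True
```

Both forms compute the same sums; the matmul form replaces many small kernel launches with one per slice.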

🐛 Bug Fixes

Fixed COUNT_EQUAL to use non-inplace EqTensor into a same-type temporary buffer instead of InplaceEqTensor, preventing src0 corruption.
Fixed ACL graph cache (USE_ACL_GRAPH) by restoring the node_type and src_type[] fields in ggml_graph_node_properties, so that type checks distinguish F16 from BF16 and a cached graph is no longer shared between nodes of different types.
Fixed graph cache op_params matching by comparing the full GGML_MAX_OP_PARAMS bytes, preventing an incorrect cache replay when the parameters differ.
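The two cache fixes above amount to a stricter match between a cached node and an incoming one. A minimal sketch of that idea follows; `NodeProps`, `matches`, and the `MAX_OP_PARAMS` value are hypothetical stand-ins for illustration and are not the actual fields or layout of ggml_graph_node_properties.

```python
from dataclasses import dataclass

MAX_OP_PARAMS = 64  # assumed size in bytes, for illustration only

@dataclass(frozen=True)
class NodeProps:
    op: str
    node_type: str     # output tensor type, e.g. "F16" or "BF16"
    src_types: tuple   # types of the source tensors
    op_params: bytes   # raw parameter block, MAX_OP_PARAMS bytes long

def matches(cached, node):
    """A cached graph may be replayed only if every field agrees."""
    return (cached.op == node.op
            and cached.node_type == node.node_type
            and cached.src_types == node.src_types
            and cached.op_params == node.op_params)  # all bytes compared

f16 = NodeProps("MUL_MAT", "F16", ("F16", "F32"), bytes(MAX_OP_PARAMS))
bf16 = NodeProps("MUL_MAT", "BF16", ("BF16", "F32"), bytes(MAX_OP_PARAMS))
tail = NodeProps("MUL_MAT", "F16", ("F16", "F32"),
                 bytes(MAX_OP_PARAMS - 1) + b"\x01")

print(matches(f16, f16))   # True
print(matches(f16, bf16))  # False: types differ
print(matches(f16, tail))  # False: only the last parameter byte differs
```

Comparing the whole parameter block is what catches the third case, where two nodes agree on everything except a trailing byte.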

Affected Symbols