b8956
📦 llama-cpp
✨ 8 features · 🐛 3 fixes · 🔧 24 symbols
Summary
This release adds seven new CANN operators and optimizes several existing operations, including GLU and CROSS_ENTROPY_LOSS. It also fixes an ACL graph cache bug that could produce incorrect results when F16 and BF16 tensors were mixed, because cached graphs were matched without checking tensor types.
Migration Steps
- L2_NORM: the in-place ClampMin previously clamped the wrong tensor. This is fixed, and the eps clamp is now applied before the division to prevent divide-by-zero.
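The L2_NORM change can be illustrated with a minimal reference sketch. It assumes the usual formulation y = x / max(||x||, eps); the function name `l2_norm` and the default `eps` value are illustrative and are not taken from the CANN backend.

```python
import math

def l2_norm(row, eps=1e-12):
    """Reference L2 normalization of one row (illustrative, not the CANN kernel)."""
    # Compute the Euclidean norm of the row.
    norm = math.sqrt(sum(v * v for v in row))
    # Clamp the norm (not the input row) from below, before dividing.
    norm = max(norm, eps)
    return [v / norm for v in row]

print(l2_norm([3.0, 4.0]))
print(l2_norm([0.0, 0.0]))  # an all-zero row no longer divides by zero
```

Clamping the norm rather than the input keeps the direction of non-zero rows unchanged and only affects rows whose norm falls below `eps`.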
✨ New Features
- Added new CANN operators: GGML_OP_SET (via aclnnInplaceCopy), GGML_OP_CUMSUM (via aclnnCumsum), GGML_OP_FILL (via aclnnInplaceFillScalar), GGML_OP_DIAG (via aclnnInplaceCopy on diagonal strides), GGML_OP_TRI (via aclnnTril/aclnnTriu), GGML_OP_SOLVE_TRI (via aclnnTriangularSolve), and GGML_UNARY_OP_SOFTPLUS (via aclnnSoftplus).
- Optimized GLU variants (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK) by fusing them with aclnnSwiGlu / aclnnGeGluV3 when applicable and moving fallback checks internally.
- Optimized CROSS_ENTROPY_LOSS by replacing a 5-kernel sequence with a single aclnnSoftmaxCrossEntropyWithLogits call.
- Optimized PAD_REFLECT_1D by eliminating a per-ne[3] loop, asserting contiguity, and calling ReflectionPad1d once on the full 4-D view.
- Optimized GET_ROWS by replacing IndexSelect with GatherV2 per batch slice and inlining the batch loop.
- Optimized SET_ROWS by replacing IndexCopy with InplaceIndexCopy per batch slice and inlining the batch loop.
- Optimized OUT_PROD by replacing an O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with a per-slice Matmul loop, improving handling of strided-broadcast batch dims.
- Implemented backend memset_tensor via aclrtMemset (previously NULL).
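The OUT_PROD change rests on a standard identity: accumulating one outer product per vector pair gives the same matrix as a single matrix product. The sketch below shows this for one 2-D slice in plain Python; the function names and the list-of-vectors layout are hypothetical and do not reflect how ggml or CANN store tensors.

```python
def outer_accumulate(a_vecs, b_vecs):
    """Sum of rank-1 updates: one outer product plus one add per vector pair."""
    m, n = len(a_vecs[0]), len(b_vecs[0])
    out = [[0.0] * n for _ in range(m)]
    for a, b in zip(a_vecs, b_vecs):
        for i in range(m):
            for j in range(n):
                out[i][j] += a[i] * b[j]
    return out

def matmul_form(a_vecs, b_vecs):
    """The same result written as one matrix product over the shared index."""
    m, n = len(a_vecs[0]), len(b_vecs[0])
    return [
        [sum(a[i] * b[j] for a, b in zip(a_vecs, b_vecs)) for j in range(n)]
        for i in range(m)
    ]

a_vecs = [[1.0, 2.0], [3.0, 4.0]]
b_vecs = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]
print(outer_accumulate(a_vecs, b_vecs) == matmul_form(a_vecs, b_vecs))
```

On an accelerator the second form means one kernel launch per slice instead of one per vector pair, which is presumably where the speedup described above comes from.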
🐛 Bug Fixes
- Fixed COUNT_EQUAL to use non-inplace EqTensor into a same-type temporary buffer instead of InplaceEqTensor, preventing src0 corruption.
- Fixed ACL graph cache (USE_ACL_GRAPH) by restoring the node_type and src_type[] fields in ggml_graph_node_properties, so cache matching checks tensor types again and graphs that differ only in type (F16 vs BF16) no longer share a cached graph.
- Fixed graph cache op_params matching by comparing all GGML_MAX_OP_PARAMS bytes, so a cached graph is not replayed when any parameter byte differs.
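The two graph cache fixes amount to making the match predicate stricter. The following is a rough sketch of that idea, not the backend's code: `NodeProps`, `make_props`, and `can_replay` are hypothetical names, only a subset of the real fields is modeled, and the 64-byte value used for `GGML_MAX_OP_PARAMS` is an assumption.

```python
from dataclasses import dataclass

GGML_MAX_OP_PARAMS = 64  # assumed size in bytes of the op_params block

@dataclass(frozen=True)
class NodeProps:
    """Hypothetical stand-in for the per-node properties stored in the cache."""
    op: str
    node_type: str
    src_types: tuple
    op_params: bytes

def make_props(op, node_type, src_types, params=b""):
    # Zero-pad to the fixed block size so the comparison covers every byte.
    if len(params) > GGML_MAX_OP_PARAMS:
        raise ValueError("op_params too large")
    return NodeProps(op, node_type, tuple(src_types),
                     params.ljust(GGML_MAX_OP_PARAMS, b"\0"))

def can_replay(cached, current):
    """A cached graph node may be reused only if every field matches."""
    return (cached.op == current.op
            and cached.node_type == current.node_type    # e.g. F16 vs BF16
            and cached.src_types == current.src_types    # type of each source
            and cached.op_params == current.op_params)   # full byte comparison

f16 = make_props("MUL_MAT", "F16", ["F16", "F16"])
bf16 = make_props("MUL_MAT", "BF16", ["BF16", "BF16"])
print(can_replay(f16, f16), can_replay(f16, bf16))
```

Leaving the type fields out of such a predicate, or comparing only a prefix of the parameter bytes, would let two different nodes look identical to the cache, which matches the failure modes described in the fixes above.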