b8417
📦 llama-cpp · View on GitHub →
✨ 2 features · 🐛 1 fix · 🔧 3 symbols
Summary
This release enhances CANN support by enabling Flash Attention for head dimensions that are not multiples of 16, and it fixes an ALiBi slope calculation bug with F16 data types. It also ships updated pre-built binaries for numerous platforms.
✨ New Features
- Enabled FLASH_ATTN_EXT support when head dimension D is not a multiple of 16 by padding Q/K/V and slicing the output.
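The pad-and-slice trick works because zero-padding is a no-op for dot-product attention: zero columns appended to Q and K leave QK^T unchanged, and zero columns appended to V only produce zero output columns, which the final slice discards. A minimal NumPy sketch of the idea (the function name and 2-D shapes are illustrative, not the actual CANN kernel):

```python
import numpy as np

def attention_padded(Q, K, V, pad_to=16):
    """Pad the head dimension D up to a multiple of `pad_to`,
    run plain softmax attention, then slice the output back to D."""
    D = Q.shape[-1]
    pad = (-D) % pad_to
    # Zero-pad the last (head) dimension of Q, K, and V.
    Qp, Kp, Vp = (np.pad(x, ((0, 0), (0, pad))) for x in (Q, K, V))
    scores = (Qp @ Kp.T) * D**-0.5        # scale uses the ORIGINAL head dim
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return (probs @ Vp)[..., :D]          # slice off the padded columns
```

With D = 40, for example, the kernel computes with a padded head dimension of 48 but returns a result identical to unpadded attention.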
- Added support for various new binary distributions for Linux (Vulkan, ROCm 7.2, OpenVINO) and Windows (CUDA 12.4, CUDA 13.1, Vulkan, SYCL, HIP).
🐛 Bug Fixes
- Fixed an ALiBi slope calculation error when the dtype is F16 (e.g., GQA with 48 heads): the second-part offset in aclnn_get_slope is now computed with ggml_type_size(dtype) instead of sizeof(float).
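For context on the "second part": when the head count is not a power of two, ALiBi slopes are built in two pieces, slopes for the closest power of two followed by interpolated slopes drawn from the next power of two. The fixed offset locates that second piece in the slope tensor, so for an F16 tensor it must be scaled by 2 bytes (ggml_type_size) rather than 4 (sizeof(float)). A pure-Python sketch of the standard two-part slope schedule, not taken from the llama.cpp source:

```python
import math

def alibi_slopes(n_heads):
    """Standard two-part ALiBi slope schedule for any head count."""
    def pow2_slopes(n):
        # geometric sequence 2^(-8/n), 2^(-16/n), ... for n heads
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]
    if math.log2(n_heads).is_integer():
        return pow2_slopes(n_heads)
    closest = 2 ** math.floor(math.log2(n_heads))
    # first part: slopes for the closest power of two
    first = pow2_slopes(closest)
    # second part: every other slope from the next power of two
    second = pow2_slopes(2 * closest)[0::2][: n_heads - closest]
    return first + second
```

For 48 heads (the GQA case mentioned above), the first 32 slopes come from the power-of-two schedule and the remaining 16 from the interpolated second part.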