b8417

📦 llama-cpp
✨ 2 features · 🐛 1 fix · 🔧 3 symbols

Summary

This release enhances CANN support by enabling Flash Attention for head dimensions that are not multiples of 16, and it fixes an ALiBi slope calculation bug triggered by F16 data types. It also ships updated pre-built binaries for numerous platforms.

✨ New Features

  • Enabled FLASH_ATTN_EXT support when the head dimension D is not a multiple of 16, by padding Q/K/V up to the next aligned size and slicing the output back down.
  • Added support for various new binary distributions for Linux (Vulkan, ROCm 7.2, OpenVINO) and Windows (CUDA 12.4, CUDA 13.1, Vulkan, SYCL, HIP).
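The pad-and-slice approach from the first item can be sketched as follows. This is an illustrative Python sketch, not the actual CANN kernel code; `pad_to_multiple` and `pad_row` are hypothetical helper names:

```python
def pad_to_multiple(d: int, align: int = 16) -> int:
    # Smallest multiple of `align` that is >= d.
    return ((d + align - 1) // align) * align

def pad_row(row, align=16):
    # Zero-pad a Q/K/V row along the head dimension so the
    # fused attention kernel sees an aligned size.
    return row + [0.0] * (pad_to_multiple(len(row), align) - len(row))

# Head dim 72 is not a multiple of 16, so it is padded to 80;
# the kernel runs at D=80 and the result is sliced back to 72.
q = [1.0] * 72
q_padded = pad_row(q)
out = q_padded[:len(q)]  # slice the padded output back to D
```

Since the padding is zeros, the extra columns contribute nothing to the attention result, so slicing the output back to D recovers the same values the kernel would have produced at the unaligned size.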

🐛 Bug Fixes

  • Fixed an ALiBi slope calculation error when the dtype is F16 (e.g., GQA with 48 heads) by using ggml_type_size(dtype) instead of sizeof(float) when computing the byte offset of the second slope segment in aclnn_get_slope.
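The offset bug can be illustrated with a small sketch (hypothetical Python, not the actual aclnn_get_slope code): when the slope buffer holds F16 elements, a byte offset computed with sizeof(float) lands past the intended element.

```python
# Bytes per element, standing in for ggml_type_size(dtype).
TYPE_SIZE = {"F32": 4, "F16": 2}

def second_part_offset(n_first: int, dtype: str) -> int:
    # Byte offset of the second slope segment: number of elements in
    # the first segment times the element size of the actual dtype.
    return n_first * TYPE_SIZE[dtype]

# ALiBi computes slopes in two parts when the head count is not a
# power of two; for 48 heads that is a 32-head part plus a 16-head part.
# Correct F16 offset for the second part: 32 elements * 2 bytes = 64.
# The buggy version always used sizeof(float): 32 * 4 = 128 bytes.
```

Hard-coding the 4-byte float size happened to be correct for F32 buffers, which is why the bug only surfaced with F16 data.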

Affected Symbols