b9828
📦 llama-cpp
✨ 8 features · 🐛 1 fix · 🔧 4 symbols
Summary
This release focuses on the OpenCL backend. It reworks the Flash Attention kernels for f16 and f32, adds Flash Attention kernels for the q4_0 and q8_0 quantization formats, and fixes an infinity calculation bug that occurred when the `-cl-finite-math-only` compiler flag was used.
✨ New Features
- OpenCL Flash Attention (FA) kernel reworked for f16 and f32 support.
- Added OpenCL Flash Attention prefill prepass kernels: `flash_attn_kv_pad_f16`, `flash_attn_mask_pad_f16`, and `flash_attn_blk_f16`.
- Added OpenCL FA kernels for q4_0 and q8_0 quantization formats.
- Added OpenCL `set_rows` functionality for f32 to q8_0/q4_0 conversions.
- Added OpenCL dequantization kernels for q4_0 and q8_0.
- Added OpenCL FA tile tuning table with override capability.
- Wired up host-side support for OpenCL Flash Attention.
- OpenCL q4_0 MoE tensors are now stored in an SoA (Structure of Arrays) layout.
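
For readers unfamiliar with the quantized formats named above, here is a minimal Python sketch of q8_0-style block quantization and of what an SoA layout means. It is an illustration, not the OpenCL kernels from this release. It assumes the usual ggml convention of 32 values per block, with one half-precision scale and 32 signed 8-bit values. The function names are hypothetical. q4_0 follows the same per-block idea with 4-bit values.

```python
import struct

QK = 32  # elements per block (assumed, following ggml's q8_0/q4_0 convention)

def to_fp16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_q8_0(values):
    """Quantize floats into (scale, int8 list) blocks; len(values) must be a multiple of QK."""
    assert len(values) % QK == 0
    blocks = []
    for i in range(0, len(values), QK):
        chunk = values[i:i + QK]
        amax = max(abs(v) for v in chunk)   # largest magnitude in the block
        d = amax / 127.0                    # scale mapping amax to 127
        inv = 1.0 / d if d else 0.0
        # Python's round() breaks ties to even; C's roundf() differs only on exact .5 ties.
        quants = [round(v * inv) for v in chunk]
        blocks.append((to_fp16(d), quants))
    return blocks

def to_soa(blocks):
    """Split an array of blocks (AoS) into one scale array and one quant array (SoA)."""
    scales = [d for d, _ in blocks]
    quants = [q for _, qs in blocks for q in qs]
    return scales, quants

def dequantize_soa(scales, quants):
    """Recover approximate floats: each value is its block's scale times its quant."""
    return [scales[i // QK] * q for i, q in enumerate(quants)]
```

In an SoA layout every scale sits next to other scales and every quant next to other quants, which tends to give GPU work-items more regular memory accesses than interleaved blocks.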
🐛 Bug Fixes
- Fixed infinity calculation when using the `-cl-finite-math-only` compiler flag in OpenCL.
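
Some background on why this flag matters: `-cl-finite-math-only` lets the OpenCL compiler assume that no operand or result is infinite or NaN, so code that relies on `-INFINITY` (for example as the initial running maximum in a streaming softmax) may not behave as written. The release notes do not say how the fix was made. The Python sketch below shows one common workaround, a large finite sentinel in place of negative infinity. The sentinel value and the function name are assumptions chosen for illustration.

```python
import math

# Hypothetical stand-in for -inf: the most negative finite half-precision value.
NEG_SENTINEL = -65504.0

def online_softmax_weighted_sum(scores, values, masked=None):
    """Streaming softmax(scores) . values without ever using an infinity.

    Masked positions are skipped instead of being assigned a score of -inf.
    """
    m = NEG_SENTINEL  # running maximum of the scores seen so far
    l = 0.0           # running sum of exp(score - m)
    acc = 0.0         # running weighted sum of values
    for i, (s, v) in enumerate(zip(scores, values)):
        if masked is not None and masked[i]:
            continue
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # rescale earlier terms to the new maximum
        p = math.exp(s - m_new)
        l = l * scale + p
        acc = acc * scale + p * v
        m = m_new
    return acc / l if l > 0.0 else 0.0  # a fully masked row yields 0
```

With a true `-inf` as the starting maximum, a fully masked row would evaluate `exp(-inf - (-inf))`, which is NaN. The finite sentinel keeps every intermediate value finite.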