b9828

📅 Jun 27, 2026 · 📦 llama-cpp

✨ 8 features · 🐛 1 fix · 🔧 4 symbols

Summary

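This release centers on the OpenCL backend. It reworks the Flash Attention kernel for f16 and f32, adds Flash Attention kernels for the q4_0 and q8_0 quantization formats together with supporting prepass, `set_rows`, and dequantization kernels, and fixes an infinity calculation bug that appeared when the `-cl-finite-math-only` compiler flag was in use.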

✨ New Features

Reworked the OpenCL Flash Attention (FA) kernel to support f16 and f32.
Added OpenCL Flash Attention prefill prepass kernels: `flash_attn_kv_pad_f16`, `flash_attn_mask_pad_f16`, and `flash_attn_blk_f16`.
Added OpenCL FA kernels for q4_0 and q8_0 quantization formats.
Added OpenCL `set_rows` functionality for f32 to q8_0/q4_0 conversions.
Added OpenCL dequantization kernels for q4_0 and q8_0.
Added OpenCL FA tile tuning table with override capability.
Wired up host-side support for OpenCL Flash Attention.
OpenCL q4_0 MoE tensors are now stored in a Structure of Arrays (SoA) layout.
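
For readers unfamiliar with q8_0, the f32 to q8_0 conversion and the matching dequantization mentioned above can be sketched on the CPU as follows. This is an illustrative C++ sketch, not the OpenCL kernels from this release: `BlockQ8`, `quantize_row_q8`, and `dequantize_row_q8` are hypothetical names, and the per-block scale is kept as a 32-bit float for simplicity, whereas ggml stores it as fp16.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// q8_0 groups values into blocks of 32 that share one scale.
constexpr std::size_t kBlockSize = 32;

// Hypothetical simplified block: the scale is a plain float here
// (ggml stores it as fp16).
struct BlockQ8 {
    float d;                                 // per-block scale
    std::array<std::int8_t, kBlockSize> qs;  // quantized values in [-127, 127]
};

// Quantize f32 values into blocks; x.size() must be a multiple of kBlockSize.
std::vector<BlockQ8> quantize_row_q8(const std::vector<float>& x) {
    assert(x.size() % kBlockSize == 0);
    std::vector<BlockQ8> blocks(x.size() / kBlockSize);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        const float* src = x.data() + b * kBlockSize;
        float amax = 0.0f;  // largest magnitude in this block
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            amax = std::max(amax, std::fabs(src[i]));
        }
        const float d = amax / 127.0f;
        // An all-zero block has d == 0; use 0 as the inverse to avoid dividing by zero.
        const float inv_d = (d != 0.0f) ? 1.0f / d : 0.0f;
        blocks[b].d = d;
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            blocks[b].qs[i] = static_cast<std::int8_t>(std::lround(src[i] * inv_d));
        }
    }
    return blocks;
}

// Expand blocks back to f32: each value is scale * quantized integer.
std::vector<float> dequantize_row_q8(const std::vector<BlockQ8>& blocks) {
    std::vector<float> out(blocks.size() * kBlockSize);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out[b * kBlockSize + i] = blocks[b].d * static_cast<float>(blocks[b].qs[i]);
        }
    }
    return out;
}
```

A round trip through these two functions recovers each value to within roughly half a scale step (`d / 2`), which is the rounding error inherent in the scheme.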

🐛 Bug Fixes

Fixed infinity calculation when using the `-cl-finite-math-only` compiler flag in OpenCL.
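
The release note does not describe how the fix works. As background, `-cl-finite-math-only` lets the OpenCL compiler assume that no value is NaN or infinite, so code that relies on `-INFINITY` (for example as the starting value of a running maximum in a softmax) can misbehave. One common way to sidestep this, shown here as a hedged CPU sketch in C++ rather than as the actual change, is to use a large finite sentinel instead. The names `kNegSentinel` and `online_softmax_weighted_sum` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cmath>
#include <cstddef>
#include <vector>

// Large finite stand-in for -infinity (illustrative choice). Halving -FLT_MAX
// leaves headroom so differences involving the sentinel stay finite.
constexpr float kNegSentinel = -FLT_MAX / 2.0f;

// One-pass ("online") softmax-weighted sum of values, the recurrence used by
// flash-attention-style kernels. Masked positions are skipped outright, so no
// infinite value is ever produced or consumed.
float online_softmax_weighted_sum(const std::vector<float>& scores,
                                  const std::vector<float>& values,
                                  const std::vector<bool>& masked) {
    assert(scores.size() == values.size());
    assert(scores.size() == masked.size());
    float m = kNegSentinel;  // running maximum of the scores seen so far
    float l = 0.0f;          // running softmax denominator
    float acc = 0.0f;        // running softmax-weighted numerator
    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (masked[i]) {
            continue;  // a masked position contributes nothing
        }
        const float m_new = std::max(m, scores[i]);
        const float scale = std::exp(m - m_new);      // rescales earlier terms
        const float p = std::exp(scores[i] - m_new);  // weight of this term
        acc = acc * scale + p * values[i];
        l = l * scale + p;
        m = m_new;
    }
    // If every position was masked, return 0 rather than dividing by zero.
    return (l > 0.0f) ? acc / l : 0.0f;
}
```

With the sentinel, the first unmasked score yields `exp(m - m_new)` equal to zero by underflow instead of evaluating an expression on infinities, so the recurrence stays within finite arithmetic.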

Affected Symbols

`flash_attn_kv_pad_f16`, `flash_attn_mask_pad_f16`, `flash_attn_blk_f16`, `set_rows`