Change8

b9828

📦 llama-cpp
8 features · 🐛 1 fix · 🔧 4 symbols

Summary

This release concentrates on the OpenCL backend. It reworks the Flash Attention kernels for f16 and f32, adds Flash Attention, `set_rows`, and dequantization kernels for the q4_0 and q8_0 quantization formats, and fixes how infinity is computed when kernels are compiled with `-cl-finite-math-only`.

✨ New Features

  • OpenCL Flash Attention (FA) kernel reworked for f16 and f32 support.
  • Added OpenCL flash-attention prefill prepass kernels: flash_attn_kv_pad_f16, flash_attn_mask_pad_f16, and flash_attn_blk_f16.
  • Added OpenCL FA kernels for q4_0 and q8_0 quantization formats.
  • Added OpenCL `set_rows` functionality for f32 to q8_0/q4_0 conversions.
  • Added OpenCL dequantization kernels for q4_0 and q8_0.
  • Added OpenCL FA tile tuning table with override capability.
  • Wired up host-side support for OpenCL Flash Attention.
  • OpenCL q4_0 MoE tensors are now stored in a Structure of Arrays (SoA) layout.
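
To make the quantization items above concrete, here is a minimal Python sketch of the q8_0 block format as ggml defines it (32 values per block, one half-precision scale, signed 8-bit quants), together with a row scatter in the spirit of `set_rows` and a Structure of Arrays split. It is an illustration only, not the OpenCL kernels from this release: the function names are hypothetical, the SoA split is shown on q8_0 blocks for simplicity although the release applies it to q4_0 MoE tensors, and the exact memory layout used by the backend is an assumption.

```python
import struct

QK8_0 = 32  # values per q8_0 block in ggml


def to_f16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_row_q8_0(row):
    """Quantize floats into a list of (scale, quants) blocks.

    Per block: scale d = max|x| / 127, quants q = round(x / d).
    Note: Python rounds halves to even, while C's roundf rounds them
    away from zero, so exact ties may differ by one step.
    """
    assert len(row) % QK8_0 == 0, "row length must be a multiple of 32"
    blocks = []
    for i in range(0, len(row), QK8_0):
        chunk = row[i:i + QK8_0]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0
        inv_d = 1.0 / d if d else 0.0
        qs = [int(round(v * inv_d)) for v in chunk]
        blocks.append((to_f16(d), qs))  # the scale is stored as f16
    return blocks


def dequantize_row_q8_0(blocks):
    """Reverse the mapping: x is approximately d * q."""
    out = []
    for d, qs in blocks:
        out.extend(d * q for q in qs)
    return out


def set_rows_q8_0(dst, src_rows, row_indices):
    """Hypothetical helper: quantize each f32 source row and store it
    at the destination row named by the matching index."""
    for row, idx in zip(src_rows, row_indices):
        dst[idx] = quantize_row_q8_0(row)


def to_soa(blocks):
    """Split interleaved (scale, quants) blocks into two flat arrays:
    all scales first, then all quants."""
    scales = [d for d, _ in blocks]
    quants = [q for _, qs in blocks for q in qs]
    return scales, quants


if __name__ == "__main__":
    row = [0.1 * i - 1.6 for i in range(QK8_0)]
    dst = {}
    set_rows_q8_0(dst, [row], [5])
    restored = dequantize_row_q8_0(dst[5])
    scales, quants = to_soa(dst[5])
    print(len(scales), len(quants))
    print(max(abs(a - b) for a, b in zip(row, restored)))
```

Keeping the scales and the quants in separate contiguous arrays is the usual motivation for an SoA layout on GPUs, since neighbouring work-items can then read neighbouring elements of the same kind.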

🐛 Bug Fixes

  • Fixed infinity calculation when using the `-cl-finite-math-only` compiler flag in OpenCL.
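
The release note does not say how the fix works. As background, `-cl-finite-math-only` lets the OpenCL compiler assume that no operand or result is infinite or NaN, so code that relies on an infinity value (for example, a masked attention score of negative infinity) can be optimized incorrectly. One common workaround, shown here as a Python sketch under that assumption, is to build infinity from its IEEE 754 bit pattern instead of from an arithmetic expression; in OpenCL C the analogous reinterpretation would be `as_float(0x7F800000)`. The function names below are hypothetical.

```python
import math
import struct

F32_POS_INF_BITS = 0x7F800000  # binary32: exponent all ones, mantissa zero


def f32_from_bits(bits):
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def masked_softmax(scores, mask):
    """Softmax in which positions with mask[i] == False get weight 0."""
    neg_inf = -f32_from_bits(F32_POS_INF_BITS)
    masked = [s if keep else neg_inf for s, keep in zip(scores, mask)]
    m = max(masked)
    if m == neg_inf:
        # Every position is masked; return zeros rather than computing
        # exp(-inf - (-inf)), which would be NaN.
        return [0.0] * len(scores)
    exps = [math.exp(v - m) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(masked_softmax([1.0, 2.0, 3.0], [True, False, True]))
```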

Affected Symbols