
b8779

📦 llama-cpp
✨ 3 features · 🐛 3 fixes · 🔧 4 symbols

Summary

This release adds Vulkan Flash Attention DP4A support for quantized KV caches, using integer dot products, and includes several fixes to indexing and quantization checks in the Vulkan backend.

✨ New Features

  • Implemented a Vulkan Flash Attention DP4A shader for quantized KV caches using integer dot products.
  • Added support for more KV quant types in the Vulkan implementation.
  • Re-added fast paths for quantizations smaller than 8-bit.

🐛 Bug Fixes

  • Fixed SHMEM_STAGING indexing in the Vulkan implementation.
  • Fixed the mmq gate and shmem checks in the Vulkan implementation.
  • Added supported-quants checks to the Flash Attention tests.

Affected Symbols