b9452

📦 llama-cpp
2 features · 🐛 2 fixes · 🔧 1 symbol

Summary

This release improves Vulkan backend performance by using block loads for Q3_K/Q6_K block data and performing the subtraction directly on 32-bit integers, which mainly benefits Intel BMG and Xe2 hardware. Several platform-specific builds have been disabled.

✨ New Features

  • Added a block-load optimization for Q3_K/Q6_K block data on Vulkan, with the subtraction performed directly on 32-bit integers.
  • Enabled MMVQ usage for Q quants on Xe2 hardware, aligning with NVIDIA overrides.
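
To illustrate what "subtraction directly on 32-bit integers" can look like, here is a minimal C++ sketch, not the shader code from this release. It assumes four unsigned 6-bit values (0..63) are packed one per byte in a 32-bit word and that the offset to remove is 32, as with Q6_K-style values; the helper names sub32_per_byte and sub32_per_byte_ref are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: subtract 32 from each of the four 8-bit lanes packed
// in `q`, assuming every lane holds an unsigned 6-bit value (0..63).
// Each result lane is a two's-complement int8 value in -32..31.
inline uint32_t sub32_per_byte(uint32_t q) {
    const uint32_t high = 0x80808080u;  // top bit of every lane
    const uint32_t bias = 0x20202020u;  // the value 32 in every lane
    // Setting the top bit adds 128 to each lane, so every lane is >= 32 and
    // the subtraction never borrows from a neighbouring lane. The XOR then
    // removes the extra 128 again (modulo 256) without any carry.
    return ((q | high) - bias) ^ high;
}

// Scalar reference: the same operation done one byte at a time.
inline uint32_t sub32_per_byte_ref(uint32_t q) {
    uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        const int v = static_cast<int>((q >> (8 * i)) & 0xFFu) - 32;
        out |= static_cast<uint32_t>(static_cast<uint8_t>(v)) << (8 * i);
    }
    return out;
}
```

With lanes 0, 31, 32 and 63 packed as 0x3F201F00, both functions return 0x1F00FFE0, i.e. the lanes -32, -1, 0 and 31. The packed version does one subtraction per word instead of one per value, which is the general idea behind operating on whole 32-bit integers.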

🐛 Bug Fixes

  • Improved performance for Q2_K/Q3_K/Q6_K quantization formats when using MMVQ on Intel BMG.
  • Forced block loads on Vulkan to improve performance, because Mesa has limitations in coalescing back-to-back loads from alternating arrays.
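
The difference between alternating loads and block loads can be sketched in C++ as follows. This is only an illustration of the access pattern, under the assumption that "alternating arrays" means two byte arrays inside one quantization block; DemoBlock and both functions are hypothetical and do not mirror the actual ggml structs or Vulkan shaders, and a CPU compiler may optimize both versions identically.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical block layout with two separate byte arrays, loosely modelled
// on a K-quant block. Not the actual ggml struct.
struct DemoBlock {
    uint8_t lo[16];  // low part of each value
    uint8_t hi[16];  // high part of each value
};

// Alternating pattern: every iteration reads one byte from `lo`, then one
// byte from `hi`, so loads from the two arrays are interleaved.
inline uint32_t sum_alternating(const DemoBlock& b) {
    uint32_t acc = 0;
    for (int i = 0; i < 16; ++i) {
        acc += b.lo[i];
        acc += static_cast<uint32_t>(b.hi[i]) << 4;
    }
    return acc;
}

// Block-load pattern: copy each array into local 32-bit words up front,
// then unpack the bytes from those words.
inline uint32_t sum_block_loaded(const DemoBlock& b) {
    uint32_t lo[4];
    uint32_t hi[4];
    std::memcpy(lo, b.lo, sizeof lo);
    std::memcpy(hi, b.hi, sizeof hi);
    uint32_t acc = 0;
    for (int w = 0; w < 4; ++w) {
        for (int s = 0; s < 32; s += 8) {
            acc += (lo[w] >> s) & 0xFFu;
            acc += ((hi[w] >> s) & 0xFFu) << 4;
        }
    }
    return acc;
}
```

Both functions compute the same sum; only the order and width of the memory reads differ. Requesting the wide loads explicitly removes the need for the driver's compiler to merge the narrow, interleaved ones.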

Affected Symbols