b9452
📦 llama-cpp
✨ 2 features, 🐛 2 fixes, 🔧 1 symbol
Summary
This release improves performance in the Vulkan backend by optimizing how Q3_K/Q6_K block data is loaded and how subtraction is performed on it. Intel BMG and Xe2 GPUs benefit in particular. Several platform-specific builds have also been disabled.
✨ New Features
- Implemented a block-load optimization for Q3_K/Q6_K block data on Vulkan, performing the subtraction directly on 32-bit integers.
- Enabled MMVQ (quantized matrix-vector multiplication) for Q quants on Xe2 hardware, matching the existing NVIDIA overrides.
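Subtraction on 32-bit integers can be illustrated with a SWAR ("SIMD within a register") sketch. In ggml's reference Q6_K dequantization, 32 is subtracted from each 6-bit quant. Q3_K's conditional "minus 4" can be rewritten as subtracting 4 from a 3-bit value. If four quants are packed into one 32-bit word, a single subtraction can handle all four lanes, provided no borrow crosses a byte boundary. The high-bit trick and the helper names (`pack4`, `sub32_per_lane`, `unpack4_signed`) are illustrative assumptions written in Python. They are not taken from the Vulkan shaders, which may do this differently.

```python
MASK32 = 0xFFFFFFFF
HIGH_BITS = 0x80808080  # top bit of each 8-bit lane


def pack4(values):
    """Pack four bytes (lane 0 = least significant) into one 32-bit word."""
    word = 0
    for lane, v in enumerate(values):
        word |= (v & 0xFF) << (8 * lane)
    return word


def unpack4_signed(word):
    """Split a 32-bit word into four signed 8-bit lanes."""
    lanes = []
    for lane in range(4):
        b = (word >> (8 * lane)) & 0xFF
        lanes.append(b - 256 if b >= 128 else b)
    return lanes


def sub32_per_lane(word, k):
    """Subtract k (0..127) from every lane of a word whose lanes are all < 128.

    Setting each lane's top bit first guarantees lane + 128 >= k, so no borrow
    leaks into the neighbouring lane. Flipping the top bit back afterwards
    leaves (lane - k) mod 256 in each byte, i.e. the signed 8-bit result.
    """
    k4 = k * 0x01010101  # k replicated into all four lanes
    return (((word | HIGH_BITS) - k4) & MASK32) ^ HIGH_BITS


# Four 6-bit Q6_K-style quants, offset by 32 with one 32-bit subtraction.
quants = [0, 31, 32, 63]
print(unpack4_signed(sub32_per_lane(pack4(quants), 32)))  # [-32, -1, 0, 31]
```

A plain `word - 0x20202020` would be wrong here, because a lane smaller than 32 would borrow from the lane above it. Setting the top bit of every lane first is one standard way to prevent that.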
🐛 Bug Fixes
- Improved performance for the Q2_K/Q3_K/Q6_K quantization formats when using MMVQ on Intel BMG (Battlemage) GPUs.
- Forced block loads on Vulkan to improve performance, because Mesa does not coalesce back-to-back loads that alternate between arrays.
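The block-load fix above can be sketched as follows. The sketch assumes ggml's Q6_K layout, in which the low 4 bits of each quant live in a `ql` array and the high 2 bits live in a separate `qh` array. It follows only the first sub-group of ggml's reference `dequantize_row_q6_K`, where quant `l` is `(ql[l] & 0xF) | ((qh[l] & 3) << 4)`, minus 32.

- The scalar version alternates single-byte reads between the two arrays, which is the access pattern the release note says Mesa fails to coalesce.
- The block version reads four bytes from each array as one 32-bit little-endian word and assembles all four quants with word-wide bit operations.

The function names are hypothetical, and the code is Python rather than GLSL, so it does not reflect the shader's actual indexing.

```python
import struct


def q6_low_group_scalar(ql, qh, l):
    """Per-byte version: alternates reads between ql and qh for each quant."""
    out = []
    for i in range(l, l + 4):
        low = ql[i] & 0x0F   # low 4 bits from ql
        high = qh[i] & 0x03  # high 2 bits from qh
        out.append((low | (high << 4)) - 32)
    return out


def q6_low_group_block(ql, qh, l):
    """Block-load version: one 32-bit read from each array, then word-wide bit ops."""
    (ql_word,) = struct.unpack_from("<I", ql, l)  # 4 bytes of ql in one load
    (qh_word,) = struct.unpack_from("<I", qh, l)  # 4 bytes of qh in one load
    # Each 8-bit lane becomes (ql & 0xF) | ((qh & 3) << 4); the shift by 4
    # moves bits 0-1 to bits 4-5 of the same lane, so nothing crosses lanes.
    q_word = (ql_word & 0x0F0F0F0F) | ((qh_word & 0x03030303) << 4)
    return [((q_word >> (8 * lane)) & 0xFF) - 32 for lane in range(4)]


ql = bytes([0x1F, 0x20, 0xAB, 0x0C])
qh = bytes([0x00, 0x03, 0x01, 0xFE])
print(q6_low_group_block(ql, qh, 0))  # [-17, 16, -5, 12]
```

Both functions return the same values. The only difference is that the block version issues two wide reads instead of eight interleaved byte reads, so the result does not depend on the driver merging them.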