b8809
📦 llama-cpp
✨ 4 features · 🐛 3 fixes · 🔧 8 symbols
Summary
This release focuses heavily on improving SYCL performance and stability, particularly for Q8_0, Q4_K, and Q6_K quantization formats by fixing reordering bugs, adding missing dequantizers, and implementing robust memory fallback mechanisms for device allocation failures.
Migration Steps
- If building with SYCL support and targeting older Linux kernels (< 6.8 on Ubuntu 26.04+), consider setting the CMake option `-DGGML_SYCL_HOST_MEM_FALLBACK=OFF` to disable the host memory fallback path if issues arise.
- When VRAM is full and the temporary reorder buffer falls back to host memory, the reorder step is slower (~21 t/s vs ~38 t/s), but the reorder optimization is preserved for subsequent inference.
✨ New Features
- SYCL Q8_0 reordering now includes a reorder-aware dequantizer for the GEMM path during prompt processing.
- SYCL Q8_0 reordering now handles device memory exhaustion: if the device allocation for the temporary buffer fails, it falls back to host memory so the optimization is preserved, and it skips the reorder entirely if both allocations fail.
- SYCL Q4_K and Q6_K quantization formats now include DMMV kernels that correctly read from the SOA reorder layout.
- Introduced `sycl_reorder_temp_buffer` RAII class to manage temporary buffers for SYCL reordering.
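
To illustrate why a reordered tensor needs its own dequantizer, here is a minimal host-side sketch of an SoA ("structure of arrays") reorder for Q8_0-style blocks. It is not the SYCL backend's code: `BlockQ8`, `reorder_to_soa`, `dequantize_aos`, and `dequantize_soa` are hypothetical names, the scale is stored as `float` rather than fp16 to keep the example self-contained, and the "all quants first, then all scales" layout is an assumption. The Q4_K and Q6_K layouts are more involved and are not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int QK8_0 = 32;  // values per Q8_0 block

// Simplified AoS block (hypothetical; the real format uses an fp16 scale).
struct BlockQ8 {
    float  d;          // per-block scale
    int8_t qs[QK8_0];  // quantized values
};

// Assumed SoA layout: quants of every block first, then every scale.
std::vector<uint8_t> reorder_to_soa(const std::vector<BlockQ8>& blocks) {
    const size_t n = blocks.size();
    std::vector<uint8_t> out(n * QK8_0 + n * sizeof(float));
    uint8_t* qs_base = out.data();
    uint8_t* d_base  = out.data() + n * QK8_0;
    for (size_t i = 0; i < n; ++i) {
        std::memcpy(qs_base + i * QK8_0, blocks[i].qs, QK8_0);
        std::memcpy(d_base + i * sizeof(float), &blocks[i].d, sizeof(float));
    }
    return out;
}

// Dequantizer for the original block-by-block layout.
std::vector<float> dequantize_aos(const std::vector<BlockQ8>& blocks) {
    std::vector<float> y;
    y.reserve(blocks.size() * QK8_0);
    for (const BlockQ8& b : blocks) {
        for (int j = 0; j < QK8_0; ++j) {
            y.push_back(b.d * b.qs[j]);
        }
    }
    return y;
}

// Reorder-aware dequantizer: quants and scales live in separate regions.
std::vector<float> dequantize_soa(const std::vector<uint8_t>& buf, size_t n_blocks) {
    const int8_t*  qs     = reinterpret_cast<const int8_t*>(buf.data());
    const uint8_t* d_base = buf.data() + n_blocks * QK8_0;
    std::vector<float> y(n_blocks * QK8_0);
    for (size_t i = 0; i < n_blocks; ++i) {
        float d;
        std::memcpy(&d, d_base + i * sizeof(float), sizeof(float));
        for (int j = 0; j < QK8_0; ++j) {
            y[i * QK8_0 + j] = d * qs[i * QK8_0 + j];
        }
    }
    return y;
}
```

Feeding the reordered buffer to a dequantizer that still expects the block-by-block layout would interpret quants as scales and vice versa, which is consistent with the garbage output described for the GEMM path.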
🐛 Bug Fixes
- Fixed garbage output during prompt processing after Q8_0 reordering due to a missing dequantize path for GEMM.
- Fixed crash on full VRAM during SYCL Q8_0 reordering by implementing a host memory fallback for temporary allocations.
- Fixed `opt_for_reorder()` marking tensors as reordered even when allocation failed, which caused subsequent kernels to read the wrong data layout and produce garbage/NaN results.
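
The allocation fallback and the "mark as reordered only on success" rule can be sketched as follows. This is an illustration under stated assumptions, not the actual `sycl_reorder_temp_buffer` or `opt_for_reorder()` code: `Allocators`, `ReorderTempBuffer`, `Tensor`, and `try_reorder` are hypothetical names, and the allocator hooks stand in for whatever device and host allocation calls the backend uses.

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <utility>

// Hypothetical allocator hooks standing in for device/host allocation.
struct Allocators {
    std::function<void*(size_t)> alloc_device;
    std::function<void*(size_t)> alloc_host;
    std::function<void(void*)>   release;
};

// RAII owner for the temporary buffer: tries device memory first, then
// host memory. The pointer stays null if both attempts fail.
class ReorderTempBuffer {
public:
    ReorderTempBuffer(Allocators a, size_t size) : alloc_(std::move(a)) {
        ptr_ = alloc_.alloc_device(size);
        if (!ptr_) {
            ptr_     = alloc_.alloc_host(size);
            on_host_ = (ptr_ != nullptr);
        }
    }
    ~ReorderTempBuffer() {
        if (ptr_) alloc_.release(ptr_);  // freed on every exit path
    }
    ReorderTempBuffer(const ReorderTempBuffer&) = delete;
    ReorderTempBuffer& operator=(const ReorderTempBuffer&) = delete;

    void* get() const { return ptr_; }
    bool  on_host() const { return on_host_; }
    explicit operator bool() const { return ptr_ != nullptr; }

private:
    Allocators alloc_;
    void*      ptr_     = nullptr;
    bool       on_host_ = false;
};

struct Tensor {
    size_t nbytes    = 0;
    bool   reordered = false;  // read by later kernels to pick the layout
};

// Returns true only if the reorder actually ran.
bool try_reorder(Tensor& t, const Allocators& a) {
    ReorderTempBuffer tmp(a, t.nbytes);
    if (!tmp) {
        return false;  // both allocations failed: keep the original layout
    }
    // ... copy tensor data into tmp.get() and write it back reordered ...
    t.reordered = true;  // set only after the reorder succeeded
    return true;
}
```

Setting the flag after the allocation check is the key point: a tensor whose reorder was skipped keeps `reordered == false`, so later kernels keep using the original layout.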