Change8

b8809

📦 llama-cpp
4 features · 🐛 3 fixes · 🔧 8 symbols

Summary

This release focuses on SYCL performance and stability for the Q8_0, Q4_K, and Q6_K quantization formats. It fixes reordering bugs, adds missing dequantizers, and adds a host memory fallback for device allocation failures.

Migration Steps

  1. If you build with SYCL support and target older Linux kernels (< 6.8 on Ubuntu 26.04+), consider configuring CMake with `-DGGML_SYCL_HOST_MEM_FALLBACK=OFF` to disable the host memory fallback path if issues arise.
  2. If you previously ran with full VRAM, note that the reorder step is slower when it falls back to host memory (~21 t/s vs ~38 t/s), but the reorder optimization is preserved for subsequent inference.

✨ New Features

  • SYCL Q8_0 reordering now includes a reorder-aware dequantizer for the GEMM path during prompt processing.
  • SYCL Q8_0 reordering now handles device memory exhaustion: if the device allocation for the temporary buffer fails, it falls back to host memory so the optimization is preserved, and it skips the reorder entirely if both allocations fail.
  • SYCL Q4_K and Q6_K quantization formats now include DMMV kernels that correctly read from the SOA reorder layout.
  • Introduced `sycl_reorder_temp_buffer` RAII class to manage temporary buffers for SYCL reordering.
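
The device-then-host fallback described above can be sketched as a small RAII class. This is a minimal illustration in plain C++, not the actual `sycl_reorder_temp_buffer`: the class name `temp_buffer_sketch`, the `alloc_fn`/`free_fn` hooks, and the stand-in allocators are all hypothetical, and the real class presumably works with a SYCL queue and USM allocations instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator hooks. In a SYCL backend these might wrap
// sycl::malloc_device, sycl::malloc_host and sycl::free (assumption).
using alloc_fn = void * (*)(std::size_t);
using free_fn  = void (*)(void *);

enum class buffer_kind { none, device, host };

// Illustrative RAII wrapper; not the actual sycl_reorder_temp_buffer.
class temp_buffer_sketch {
  public:
    temp_buffer_sketch(std::size_t size, alloc_fn dev_alloc, alloc_fn host_alloc, free_fn release)
        : release_(release) {
        assert(dev_alloc && host_alloc && release);
        // 1. Prefer device memory.
        ptr_ = dev_alloc(size);
        if (ptr_) { kind_ = buffer_kind::device; return; }
        // 2. Fall back to host memory if the device is exhausted.
        ptr_ = host_alloc(size);
        if (ptr_) { kind_ = buffer_kind::host; }
        // 3. If both fail, the buffer stays empty and the caller skips the reorder.
    }

    // Release the buffer automatically, whichever path allocated it.
    ~temp_buffer_sketch() { if (ptr_) { release_(ptr_); } }

    // Non-copyable: exactly one owner frees the memory.
    temp_buffer_sketch(const temp_buffer_sketch &) = delete;
    temp_buffer_sketch & operator=(const temp_buffer_sketch &) = delete;

    explicit operator bool() const { return ptr_ != nullptr; }
    void *      get()  const { return ptr_; }
    buffer_kind kind() const { return kind_; }

  private:
    void *      ptr_  = nullptr;
    buffer_kind kind_ = buffer_kind::none;
    free_fn     release_;
};

// Stand-in allocators used only to exercise the three outcomes.
inline void * failing_alloc(std::size_t)   { return nullptr; }
inline void * working_alloc(std::size_t n) { return std::malloc(n); }
inline void   release_mem(void * p)        { std::free(p); }
```

Constructing `temp_buffer_sketch buf(size, failing_alloc, working_alloc, release_mem)` exercises the host fallback, and `if (!buf)` is the point where a caller would skip the reorder. A single release hook is used because SYCL frees both device and host USM allocations through the same `sycl::free` call.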

🐛 Bug Fixes

  • Fixed garbage output during prompt processing after Q8_0 reordering due to a missing dequantize path for GEMM.
  • Fixed crash on full VRAM during SYCL Q8_0 reordering by implementing a host memory fallback for temporary allocations.
  • Fixed an issue where `opt_for_reorder()` marked tensors as reordered even when allocation failed. Subsequent kernels then read the data with the wrong layout and produced garbage/NaN results.
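
To illustrate why the reordered flag must follow the actual layout, here is a minimal sketch in plain C++. Everything in it is hypothetical: `toy_tensor`, `try_reorder`, and the `temp_alloc_ok` parameter stand in for the real tensor metadata and allocation result, the block type uses a `float` scale where ggml stores fp16, and the "all quants first, then all scales" layout is an assumption about what the SOA reorder looks like.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t QK = 32;  // values per block, as in ggml's QK8_0

// Simplified stand-in for a Q8_0 block (ggml stores the scale as fp16).
struct block_q8 {
    float       d;       // per-block scale
    std::int8_t qs[QK];  // quantized values
};

struct toy_tensor {
    std::vector<std::uint8_t> data;  // raw bytes, AoS or SoA depending on the flag
    std::size_t n_blocks  = 0;
    bool        reordered = false;   // must be true only if data really is SoA
};

// Two demo blocks: scales 0.5 and 2.0, quants j and -j.
toy_tensor make_demo_tensor() {
    std::vector<block_q8> blocks(2);
    blocks[0].d = 0.5f;
    blocks[1].d = 2.0f;
    for (std::size_t j = 0; j < QK; ++j) {
        blocks[0].qs[j] = static_cast<std::int8_t>(j);
        blocks[1].qs[j] = static_cast<std::int8_t>(-static_cast<int>(j));
    }
    toy_tensor t;
    t.n_blocks = blocks.size();
    t.data.resize(blocks.size() * sizeof(block_q8));
    std::memcpy(t.data.data(), blocks.data(), t.data.size());
    return t;
}

// Returns true only if the data is in SoA layout afterwards.
// temp_alloc_ok stands in for "the temporary buffer could be allocated".
bool try_reorder(toy_tensor & t, bool temp_alloc_ok) {
    if (t.reordered) { return true; }
    if (!temp_alloc_ok) { return false; }      // leave data AND flag untouched
    const std::vector<std::uint8_t> tmp(t.data);  // temporary copy of the AoS bytes
    std::uint8_t * qs_out = t.data.data();
    std::uint8_t * d_out  = qs_out + t.n_blocks * QK;
    for (std::size_t i = 0; i < t.n_blocks; ++i) {
        const std::uint8_t * src = tmp.data() + i * sizeof(block_q8);
        std::memcpy(qs_out + i * QK, src + offsetof(block_q8, qs), QK);
        std::memcpy(d_out + i * sizeof(float), src + offsetof(block_q8, d), sizeof(float));
    }
    t.reordered = true;  // set only after the layout has actually changed
    return true;
}

// Readers trust the flag to pick the layout, as a kernel would.
std::int8_t read_quant(const toy_tensor & t, std::size_t i, std::size_t j) {
    assert(i < t.n_blocks && j < QK);
    const std::size_t off = t.reordered
        ? i * QK + j
        : i * sizeof(block_q8) + offsetof(block_q8, qs) + j;
    std::int8_t v;
    std::memcpy(&v, t.data.data() + off, sizeof(v));
    return v;
}

float read_scale(const toy_tensor & t, std::size_t i) {
    assert(i < t.n_blocks);
    const std::size_t off = t.reordered
        ? t.n_blocks * QK + i * sizeof(float)
        : i * sizeof(block_q8) + offsetof(block_q8, d);
    float d;
    std::memcpy(&d, t.data.data() + off, sizeof(d));
    return d;
}
```

If `try_reorder` set `reordered = true` on the failure path, `read_quant` and `read_scale` would index the untouched AoS bytes with SoA offsets, which is the kind of mismatch that shows up as garbage or NaN output.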

Affected Symbols