b9519
📦 llama-cpp
✨ 2 features · 🐛 1 fix · 🔧 2 symbols
Summary
This release ports the multi-column MMVQ optimization to the SYCL backend for improved performance and fixes a bug where the reorder kernel was not being triggered correctly in ggml-sycl for multi-column operations.
✨ New Features
- Ported multi-column MMVQ optimization from CUDA backend to SYCL backend, improving weight reading efficiency.
- Bootstrapped weight reordering on small multi-column batches (ne[1] <= 8) in ggml-sycl to ensure the faster reorder kernel is utilized.
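The idea behind a multi-column mat-vec can be sketched in plain C++. This is only an illustration under simplifying assumptions, not the ggml-sycl kernel: it uses unquantized `float` weights on the CPU, and the function name `matvec_multi_col`, the memory layout, and the `kMaxCols` constant are hypothetical. What it shows is the access pattern: each weight is loaded once and reused for every input column, instead of being re-read once per column.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only; not taken from ggml-sycl.
// W: n_rows x n_k weights, row-major.
// X: n_cols input columns of length n_k, stored one column after another.
// Returns Y: n_cols output columns of length n_rows, stored one after another.
std::vector<float> matvec_multi_col(const std::vector<float>& W,
                                    const std::vector<float>& X,
                                    std::size_t n_rows,
                                    std::size_t n_k,
                                    std::size_t n_cols) {
    constexpr std::size_t kMaxCols = 8;  // assumption: mirrors the ne[1] <= 8 bound in the notes
    assert(n_cols >= 1 && n_cols <= kMaxCols);
    assert(W.size() == n_rows * n_k);
    assert(X.size() == n_cols * n_k);

    std::vector<float> Y(n_cols * n_rows, 0.0f);
    for (std::size_t r = 0; r < n_rows; ++r) {
        float acc[kMaxCols] = {0.0f};  // one accumulator per input column
        for (std::size_t k = 0; k < n_k; ++k) {
            const float w = W[r * n_k + k];  // weight is read once...
            for (std::size_t c = 0; c < n_cols; ++c) {
                acc[c] += w * X[c * n_k + k];  // ...and reused for every column
            }
        }
        for (std::size_t c = 0; c < n_cols; ++c) {
            Y[c * n_rows + r] = acc[c];
        }
    }
    return Y;
}
```

With a single-column kernel invoked once per column, the weight matrix would be streamed `n_cols` times; in the loop above it is streamed once, which is the kind of saving the "weight reading efficiency" wording suggests.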
🐛 Bug Fixes
- Fixed an issue in ggml-sycl where weight reordering was triggered only for single-token mat-vec, so multi-column mat-vec during speculative/MTP verification ran on the slower non-reorder kernel.
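The change in trigger condition can be sketched as a pair of predicates. The function names and the constant below are hypothetical and do not come from the ggml-sycl source; only the `ne[1] <= 8` bound and the single-token restriction are taken from the notes above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; only the bound of 8 comes from the release notes.
constexpr std::int64_t kMaxReorderCols = 8;

// Before the fix (as described): reorder only for single-token mat-vec.
constexpr bool triggers_reorder_before(std::int64_t ne1) {
    return ne1 == 1;
}

// After the fix (as described): small multi-column batches also qualify.
constexpr bool triggers_reorder_after(std::int64_t ne1) {
    return ne1 >= 1 && ne1 <= kMaxReorderCols;
}

// A verify step over 4 draft tokens (ne[1] == 4) now reaches the reorder path.
inline void check_examples() {
    assert(!triggers_reorder_before(4));
    assert(triggers_reorder_after(4));
}
```

Under this reading, batches larger than 8 columns are unaffected by the change.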