b8888
📦 llama-cpp
✨ 2 features · 🐛 1 fix · 🔧 6 symbols
Summary
This release focuses on SYCL backend improvements: staging buffers for MoE models are now sized correctly, preventing out-of-memory errors on Level Zero, and a BF16 fast path for matrix multiplication via DNNL avoids allocating large F32 dequantization buffers for large-vocab models.
Migration Steps
- If using SYCL and encountering memory allocation errors with MoE models on Level Zero, the buffer sizing logic has been corrected, which may resolve the issue without code changes.
- If using SYCL with BF16 weights (such as lm_head), matrix multiplication now takes a DNNL fast path, which should improve performance and stability.
✨ New Features
- SYCL backend: Improved memory efficiency for mul_mat_id operations by sizing staging buffers based on routed rows instead of total elements.
- SYCL backend: Added a BF16 fast path for matrix multiplication (mul_mat) using DNNL when the source tensor (src0) is BF16, avoiding large F32 dequantization buffers for large-vocab models.
🐛 Bug Fixes
- SYCL: Fixed potential UR_RESULT_ERROR_OUT_OF_HOST_MEMORY errors on Level Zero for MoE models by sizing ggml_sycl_mul_mat_id staging buffers to the actual number of routed rows (ids->ne[1] * n_ids) instead of the total element count.