Changelog

b8888

📦 llama-cpp
✨ 2 features · 🐛 1 fix · 🔧 6 symbols

Summary

This release focuses on SYCL backend improvements: staging buffers for MoE models are now sized correctly to prevent out-of-memory errors, and a BF16 fast path for matrix multiplication avoids large F32 dequantization buffers on large-vocab models.

Migration Steps

  1. If you use SYCL and hit memory allocation errors with MoE models on Level Zero, the buffer sizing logic has been corrected; this may resolve the issue without any code changes on your side.
  2. If you use SYCL with BF16 weights (such as lm_head), matrix multiplication now takes a DNNL fast path, which should improve performance and memory stability.

✨ New Features

  • SYCL backend: Improved memory efficiency for mul_mat_id operations by sizing staging buffers based on routed rows instead of total elements.
  • SYCL backend: Added a BF16 fast path for matrix multiplication (mul_mat) using DNNL when the source tensor (src0) is BF16, avoiding large F32 dequantization buffers for large-vocab models.

🐛 Bug Fixes

  • SYCL: Fixed potential UR_RESULT_ERROR_OUT_OF_HOST_MEMORY errors on Level Zero for MoE models by correctly sizing ggml_sycl_mul_mat_id staging buffers by the actual number of routed rows (ids->ne[1] * n_ids).

Affected Symbols