b9788

📅 Jun 25, 2026 📦 llama-cpp

✨ 3 features 🐛 3 fixes 🔧 7 symbols

Summary

This release adds tensor parallelism (--split-mode tensor) to the SYCL backend for dual-GPU setups, with separate all-reduce paths optimized for small and large tensors. Several minor fixes and comment updates were also applied to the SYCL backend.

Migration Steps

If you maintain code related to SYCL tensor parallelism, make sure any dev2dev_memcpy (device-to-device memcpy) calls use the upstream 7-parameter variant.

✨ New Features

Added SYCL support for tensor parallelism using the --split-mode tensor flag.
Implemented a backend-specific all-reduce for SYCL dual-GPU tensor parallelism via the comm_init/comm_free/comm_allreduce_tensor trio.
Introduced two dual-GPU (N=2) all-reduce paths: a direct FP32 memcpy plus ADD kernel for small tensors (< 32768 elements), and a BF16-compressed path for large tensors, which halves the bytes sent over PCIe.
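
The two all-reduce paths can be illustrated with a small host-only C++ sketch. It is not the SYCL implementation: the names allreduce_sum_2, transfer_bytes, fp32_to_bf16 and bf16_to_fp32 are hypothetical, plain vectors stand in for the two device buffers, and the way the large path combines local FP32 data with the received BF16 data is an assumption, since the notes do not describe it. Only the 32768-element threshold and the FP32/BF16 split come from the release notes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Threshold from the release notes: tensors below this element count use the FP32 path.
constexpr std::size_t kSmallTensorElems = 32768;

// FP32 -> BF16 with round-to-nearest-even (NaN handling omitted for brevity).
static std::uint16_t fp32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// BF16 -> FP32: the 16 stored bits become the upper half of the float.
static float bf16_to_fp32(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Bytes one device would send to its peer for a tensor of n elements.
static std::size_t transfer_bytes(std::size_t n) {
    return n < kSmallTensorElems ? n * sizeof(float) : n * sizeof(std::uint16_t);
}

// Host-side stand-in for the N=2 all-reduce: afterwards both buffers hold the sum.
static void allreduce_sum_2(std::vector<float>& a, std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    if (n < kSmallTensorElems) {
        // Small path: the peer's FP32 data is copied as-is, then added.
        for (std::size_t i = 0; i < n; ++i) {
            const float sum = a[i] + b[i];
            a[i] = sum;
            b[i] = sum;
        }
        return;
    }
    // Large path (assumed scheme): each side sends a BF16 copy of its data;
    // the receiver expands it to FP32 and adds it to its own FP32 values.
    for (std::size_t i = 0; i < n; ++i) {
        const float recv_on_a = bf16_to_fp32(fp32_to_bf16(b[i]));
        const float recv_on_b = bf16_to_fp32(fp32_to_bf16(a[i]));
        const float new_a = a[i] + recv_on_a;
        const float new_b = b[i] + recv_on_b;
        a[i] = new_a;
        b[i] = new_b;
    }
}
```

At the threshold itself, 32768 elements take 131072 bytes as FP32 but 65536 bytes as BF16, which is where the halving of PCIe traffic comes from. The cost is precision: BF16 keeps only 8 significant bits, so in this sketch the two devices can end up with sums that differ by rounding unless the inputs happen to be exactly representable in BF16.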

🐛 Bug Fixes

Fixed comments in SYCL implementation.
Fixed a typo and removed trailing whitespace in SYCL implementation files.
Switched dev2dev_memcpy calls to the upstream 7-parameter variant.

Affected Symbols

ggml-sycl.h ggml-sycl.cpp comm_init comm_free comm_allreduce_tensor get_proc_address dev2dev_memcpy