Change8

b9788

📦 llama-cpp
3 features · 🐛 3 fixes · 🔧 7 symbols

Summary

This release adds tensor parallelism (--split-mode tensor) to the SYCL backend for dual-GPU setups, with separate all-reduce paths tuned for small and large tensors. It also applies minor comment, typo, and whitespace fixes to the SYCL backend.

Migration Steps

  1. If you maintain code related to SYCL tensor parallelism, update any dev2dev_memcpy (device-to-device memcpy) calls to match the upstream 7-parameter variant.

✨ New Features

  • Added SYCL support for tensor parallelism using the --split-mode tensor flag.
  • Implemented a backend-specific all-reduce for SYCL dual-GPU tensor parallelism via the comm_init/comm_free/comm_allreduce_tensor trio.
  • Introduced two paths for dual-GPU all-reduce (N=2): an FP32 direct memcpy + ADD kernel for small tensors (< 32768 elements), and a BF16-compressed path for large tensors, which halves the bytes sent over PCIe.
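
The two all-reduce paths above can be illustrated with a small CPU-only simulation. This is a sketch under stated assumptions, not the SYCL implementation: the function names are hypothetical, BF16 conversion is done by truncating the low 16 bits of the FP32 encoding (the rounding mode the backend uses is not stated in these notes), and only one direction of the exchange is modeled. The 32768-element threshold is the only value taken from the release notes.

```python
import struct

# Element-count threshold quoted in the release notes for the small-tensor path.
SMALL_TENSOR_LIMIT = 32768


def f32_to_bf16_bits(x: float) -> int:
    """Return the upper 16 bits of the float32 encoding of x.

    Assumption: plain truncation; the real backend may round differently.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_f32(b: int) -> float:
    """Expand 16 BF16 bits back to a float32 value (low mantissa bits are zero)."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x


def allreduce_pair(local, remote, small_limit=SMALL_TENSOR_LIMIT):
    """Hypothetical N=2 all-reduce: sum the local shard with the peer's shard.

    Returns (summed_values, bytes_moved_one_way).
    """
    if len(local) != len(remote):
        raise ValueError("both shards must have the same number of elements")
    n = len(local)
    if n < small_limit:
        # Small path: the peer's FP32 data is copied unchanged (4 bytes/element).
        received = list(remote)
        wire_bytes = 4 * n
    else:
        # Large path: the peer's data crosses the link as BF16 (2 bytes/element)
        # and is expanded back to FP32 before the add.
        wire = [f32_to_bf16_bits(v) for v in remote]
        received = [bf16_bits_to_f32(b) for b in wire]
        wire_bytes = 2 * n
    # Element-wise ADD, standing in for the ADD kernel.
    return [a + b for a, b in zip(local, received)], wire_bytes
```

For a tensor of exactly 32768 elements the sketch moves 65536 bytes instead of 131072, which is the halving mentioned above. The trade-off is precision: BF16 keeps only 7 explicit mantissa bits, so a value such as 1.0009765625 arrives as 1.0 in this truncating model.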

🐛 Bug Fixes

  • Fixed comments in the SYCL implementation.
  • Fixed a typo and removed trailing whitespace in the SYCL implementation files.
  • Moved dev2dev_memcpy calls to the upstream 7-parameter variant.

Affected Symbols