b9788
📦 llama-cpp
✨ 3 features · 🐛 3 fixes · 🔧 7 symbols
Summary
This release adds tensor parallelism (--split-mode tensor) to the SYCL backend for dual-GPU setups, with separate all-reduce paths optimized for small and large tensors. It also includes minor comment, typo, and whitespace fixes in the SYCL backend implementation.
Migration Steps
- If you maintain code related to SYCL tensor parallelism, make sure device-to-device memcpy calls (dev2dev_memcpy) use the upstream 7-parameter variant.
✨ New Features
- Added SYCL support for tensor parallelism using the --split-mode tensor flag.
- Implemented a backend-specific all-reduce for SYCL dual-GPU tensor parallelism via the comm_init/comm_free/comm_allreduce_tensor trio.
- Introduced two paths for dual-GPU all-reduce (N=2): an FP32 path (direct memcpy plus an ADD kernel) for small tensors (< 32768 elements), and a BF16-compressed path for large tensors, which halves the bytes sent over PCIe.
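
The two all-reduce paths can be illustrated with a small host-only sketch. This is not the SYCL backend code: the two "devices" are plain std::vector buffers, the peer copy is an ordinary vector copy, and every name here (choose_path, allreduce_sum_2, fp32_to_bf16, bf16_to_fp32, kSmallTensorElems) is hypothetical. Only the 32768-element threshold and the FP32 vs. BF16 split come from the notes above; the round-to-nearest-even conversion is an assumption about how the compression might be done.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Threshold taken from the release notes: tensors with fewer elements
// than this use the FP32 path. The constant name is hypothetical.
constexpr std::size_t kSmallTensorElems = 32768;

// FP32 -> BF16: keep the upper 16 bits, rounding to nearest even.
// Assumption for illustration only; NaN handling is omitted.
static std::uint16_t fp32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// BF16 -> FP32: place the 16 bits in the upper half of a 32-bit word.
static float bf16_to_fp32(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

enum class ReducePath { Fp32Direct, Bf16Compressed };

// Pick the path from the element count.
static ReducePath choose_path(std::size_t n_elems) {
    return n_elems < kSmallTensorElems ? ReducePath::Fp32Direct
                                       : ReducePath::Bf16Compressed;
}

// Simulated sum all-reduce across exactly two buffers (N=2).
// Returns the number of bytes each side would send to its peer.
static std::size_t allreduce_sum_2(std::vector<float>& dev0,
                                   std::vector<float>& dev1) {
    assert(dev0.size() == dev1.size());
    const std::size_t n = dev0.size();

    if (choose_path(n) == ReducePath::Fp32Direct) {
        // Small tensors: copy the peer's FP32 data as-is, then add.
        const std::vector<float> recv_on_0 = dev1;  // stands in for a device memcpy
        const std::vector<float> recv_on_1 = dev0;
        for (std::size_t i = 0; i < n; ++i) {
            dev0[i] += recv_on_0[i];
            dev1[i] += recv_on_1[i];
        }
        return n * sizeof(float);
    }

    // Large tensors: compress to BF16 before the transfer, so each side
    // sends 2 bytes per element instead of 4.
    std::vector<std::uint16_t> send0(n), send1(n);
    for (std::size_t i = 0; i < n; ++i) {
        send0[i] = fp32_to_bf16(dev0[i]);
        send1[i] = fp32_to_bf16(dev1[i]);
    }
    // In this sketch each side adds the decompressed peer data to its own
    // FP32 values, so the two results can differ by BF16 rounding error.
    for (std::size_t i = 0; i < n; ++i) {
        dev0[i] += bf16_to_fp32(send1[i]);
        dev1[i] += bf16_to_fp32(send0[i]);
    }
    return n * sizeof(std::uint16_t);
}
```

Calling allreduce_sum_2 on two 4-element buffers takes the FP32 path and reports 16 bytes sent per side, while two 32768-element buffers take the BF16 path and report 65536 bytes per side instead of 131072. The trade-off is precision: BF16 keeps only 8 significand bits, which is presumably why small tensors, where transfer size matters less, stay in FP32.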
🐛 Bug Fixes
- Fixed comments in the SYCL implementation.
- Fixed a typo and removed trailing whitespace in the SYCL implementation files.
- Switched dev2dev_memcpy calls to the upstream 7-parameter variant.