b8738
📦 llama-cpp
✨ 9 features · 🐛 16 fixes · 🔧 8 symbols
Summary
This release introduces experimental backend-agnostic tensor parallelism in ggml, supporting models like GPT-OSS and Qwen 3 MoE across multiple GPUs. Numerous bug fixes address stability, quantization handling, and backend-specific issues across Vulkan, Metal, ROCm, and various model implementations.
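Since the tensor-parallel path is experimental, how it is enabled may change; the invocation below is a hypothetical sketch using llama.cpp's existing multi-GPU flags (`-ngl`, `--split-mode`, `--tensor-split`), and the model filename is an assumption:

```shell
# Hypothetical invocation, assuming the existing multi-GPU flags apply to the
# new tensor-parallel path. -ngl 99 offloads all layers to the GPUs,
# --split-mode row splits individual tensors across devices, and
# --tensor-split 1,1,1,1 distributes work evenly across 4 GPUs.
./llama-cli -m gpt-oss.gguf -ngl 99 --split-mode row --tensor-split 1,1,1,1 -p "Hello"
```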
Migration Steps
- Remove any shfl and AllReduce calls from code targeting the backend interface; both have been removed from it.
- If using custom allocation logic, note that the allocation workaround was moved out of ggml-alloc.c.
- If you rely on specific device-detection behavior, note that the logic for determining meta devices in llama is now more robust.
✨ New Features
- Introduced experimental backend-agnostic tensor parallelism in ggml.
- Added support for tensor parallelism for GPT-OSS and Qwen 3 MoE models.
- Added support for tensor parallelism across 4 or 8 GPUs.
- Added NCCL support for tensor parallelism.
- Added RCCL support for GGML HIP backend.
- Added support for tensor dimensions not divisible by the number of devices (n_devs).
- Added support for device-specific host buffer types (e.g., pinned memory for CUDA) if all underlying backends expose the same type.
- Added support for Qwen 3.5.
- Added support for Gemma 4 MoE.
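One of the features above is support for tensor dimensions that are not evenly divisible by the number of devices (n_devs). A minimal sketch of the general remainder-distribution technique, not ggml's actual code (the function name is illustrative):

```python
def split_dim(n: int, n_devs: int) -> list[int]:
    """Split a tensor dimension of size n across n_devs devices.

    Each device gets n // n_devs elements; the first n % n_devs devices
    receive one extra element, so shard sizes differ by at most one.
    """
    base, rem = divmod(n, n_devs)
    return [base + (1 if d < rem else 0) for d in range(n_devs)]
```

For example, splitting a dimension of 10 across 4 devices yields shards of sizes 3, 3, 2, and 2, which sum back to 10.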
🐛 Bug Fixes
- Applied a partial fix for the Vulkan backend.
- Fixed an output-pattern issue.
- Fixed a segmentation fault occurring without NCCL.
- Fixed view_offs scaling.
- Fixed compilation errors.
- Fixed Qwen-30B-A3B Q4_0 issues with uneven GPU splits by choosing the split block size based on the tensor's quantization type.
- Fixed crashes during KV cache serialization by adding support for setting/getting tensors with non-zero offsets in the meta backend.
- Fixed Metal build issues.
- Fixed usage count in static memory allocations.
- Fixed tensor granularity issues.
- Improved memory distribution.
- Fixed a device mismatch during the scatter phase of allReduce that caused synchronous copies.
- Fixed Qwen 3.5 MoE.
- Fixed OpenVINO and SYCL issues.
- Fixed test-llama-archs for CPU-only builds.
- Fixed GPT-OSS issues.
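The Qwen-30B-A3B Q4_0 fix above decides split boundaries from the tensor's quantization type, because quantized formats pack elements into fixed-size blocks that must not straddle devices (Q4_0 and Q8_0 use 32-element blocks in ggml). A hedged sketch of that constraint, with illustrative names, not the actual fix:

```python
# Elements per quantization block for a few ggml types: Q4_0 and Q8_0 pack
# 32 elements per block; F32 is unquantized, so its "block" is one element.
BLOCK_SIZE = {"F32": 1, "Q4_0": 32, "Q8_0": 32}

def aligned_split(n: int, n_devs: int, qtype: str) -> list[int]:
    """Split dimension n across n_devs devices so that every shard is a
    whole number of quantization blocks (no block straddles two devices)."""
    bs = BLOCK_SIZE[qtype]
    n_blocks, tail = divmod(n, bs)
    assert tail == 0, "tensor dimension must be a multiple of the block size"
    base, rem = divmod(n_blocks, n_devs)
    # The first `rem` devices each take one extra block.
    return [(base + (1 if d < rem else 0)) * bs for d in range(n_devs)]
```

For example, splitting a 160-element Q4_0 dimension (5 blocks) across 3 devices gives shards of 64, 64, and 32 elements: every shard is block-aligned, unlike a naive even split of ~53 elements each.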