b8738
📦 llama-cpp
✨ 9 features · 🐛 16 fixes · 🔧 8 symbols
Summary
This release introduces experimental backend-agnostic tensor parallelism in ggml, supporting models like GPT-OSS and Qwen 3 MoE across multiple GPUs. Numerous bug fixes address stability, quantization handling, and backend-specific issues across Vulkan, Metal, ROCm, and various model implementations.
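Since the tensor-parallel path is experimental, how it is enabled may change; the invocation below is a hypothetical sketch using llama.cpp's existing multi-GPU flags (`-ngl`, `--split-mode`, `--tensor-split`), and the model filename is an assumption:

```shell
# Hypothetical invocation, assuming the existing multi-GPU flags apply to the
# new tensor-parallel path. -ngl 99 offloads all layers to the GPUs,
# --split-mode row splits individual tensors across devices, and
# --tensor-split 1,1,1,1 distributes work evenly across 4 GPUs.
./llama-cli -m gpt-oss.gguf -ngl 99 --split-mode row --tensor-split 1,1,1,1 -p "Hello"
```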
Migration Steps
- Remove any shfl and AllReduce calls from code targeting the backend interface; both have been removed from it.
- If using custom allocation logic, note that the allocation workaround was moved out of ggml-alloc.c.
- If you rely on specific device-detection behavior, note that the logic for determining meta devices in llama is now more robust.
✨ New Features
- Introduced experimental backend-agnostic tensor parallelism in ggml.
- Added support for tensor parallelism for GPT-OSS and Qwen 3 MoE models.
- Added support for tensor parallelism across 4 or 8 GPUs.
- Added NCCL support for tensor parallelism.
- Added RCCL support for GGML HIP backend.
- Added support for tensor dimensions not divisible by the number of devices (n_devs).
- Added support for device-specific host buffer types (e.g., pinned memory for CUDA) if all underlying backends expose the same type.
- Added support for Qwen 3.5.
- Added support for Gemma 4 MoE.
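One of the features above is support for tensor dimensions that are not evenly divisible by the number of devices (n_devs). A minimal sketch of the general remainder-distribution technique, not ggml's actual code (the function name is illustrative):

```python
def split_dim(n: int, n_devs: int) -> list[int]:
    """Split a tensor dimension of size n across n_devs devices.

    Each device gets n // n_devs elements; the first n % n_devs devices
    receive one extra element, so shard sizes differ by at most one.
    """
    base, rem = divmod(n, n_devs)
    return [base + (1 if d < rem else 0) for d in range(n_devs)]
```

For example, splitting a dimension of 10 across 4 devices yields shards of sizes 3, 3, 2, and 2, which sum back to 10.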
🐛 Bug Fixes
- Applied a partial fix for the Vulkan backend.
- Fixed an output-pattern issue.
- Fixed a segmentation fault occurring without NCCL.
- Fixed view_offs scaling.
- Fixed compilation errors.
- Fixed Qwen-30B-A3B Q4_0 issues with uneven GPU splits by choosing the split block size based on the tensor's quantization type.
- Fixed crashes during KV cache serialization by adding support for setting/getting tensors with non-zero offsets in the meta backend.
- Fixed Metal build issues.
- Fixed usage count in static memory allocations.
- Fixed tensor granularity issues.
- Improved memory distribution.
- Fixed a device mismatch during the scatter phase of allReduce that caused synchronous copies.
- Fixed Qwen 3.5 MoE.
- Fixed OpenVINO and SYCL issues.
- Fixed test-llama-archs for CPU-only builds.
- Fixed GPT-OSS issues.
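The Qwen-30B-A3B Q4_0 fix above decides split boundaries from the tensor's quantization type, because quantized formats pack elements into fixed-size blocks that must not straddle devices (Q4_0 and Q8_0 use 32-element blocks in ggml). A hedged sketch of that constraint, with illustrative names, not the actual fix:

```python
# Elements per quantization block for a few ggml types: Q4_0 and Q8_0 pack
# 32 elements per block; F32 is unquantized, so its "block" is one element.
BLOCK_SIZE = {"F32": 1, "Q4_0": 32, "Q8_0": 32}

def aligned_split(n: int, n_devs: int, qtype: str) -> list[int]:
    """Split dimension n across n_devs devices so that every shard is a
    whole number of quantization blocks (no block straddles two devices)."""
    bs = BLOCK_SIZE[qtype]
    n_blocks, tail = divmod(n, bs)
    assert tail == 0, "tensor dimension must be a multiple of the block size"
    base, rem = divmod(n_blocks, n_devs)
    # The first `rem` devices each take one extra block.
    return [(base + (1 if d < rem else 0)) * bs for d in range(n_devs)]
```

For example, splitting a 160-element Q4_0 dimension (5 blocks) across 3 devices gives shards of 64, 64, and 32 elements: every shard is block-aligned, unlike a naive even split of ~53 elements each.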