b9095
📦 llama-cpp
✨ 3 features · 🐛 8 fixes · 🔧 7 symbols
Summary
This release introduces a high-performance, NCCL-free internal AllReduce kernel for tensor parallelism on CUDA, along with fixes and improvements to the associated hang-detection watchdog and build system.
Migration Steps
- If using llama-bench, rename the command-line argument `--allreduce` to `--reduction-provider` (short form `-rp`).
- If debugging hangs with the new internal AllReduce provider, set the `GGML_CUDA_ALLREDUCE` environment variable to "internal" or "nccl" to select the provider explicitly (see the sketch after this list).
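As a rough sketch of how selection from this variable might look (the function name `ggml_cuda_select_allreduce_provider()` appears in the fixes below, but the enum, the default, and the fallback on unrecognized values here are assumptions):

```cpp
// Hypothetical sketch of env-var provider selection; not the actual
// llama.cpp implementation. The default provider and the fallback on
// unrecognized values are assumptions.
#include <cstdlib>
#include <cstring>

enum class allreduce_provider { INTERNAL, NCCL };

static allreduce_provider select_allreduce_provider() {
    const char * env = std::getenv("GGML_CUDA_ALLREDUCE");
    // Treat an empty value the same as unset (matches the fix listed below).
    if (env == nullptr || env[0] == '\0') {
        return allreduce_provider::INTERNAL; // assumed default
    }
    return std::strcmp(env, "nccl") == 0 ? allreduce_provider::NCCL
                                         : allreduce_provider::INTERNAL;
}
```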
✨ New Features
- Added an internal, single-phase CUDA AllReduce kernel for tensor parallelism (`LLAMA_SPLIT_MODE_TENSOR`); the core idea is sketched after this list.
- Added support for selecting the AllReduce provider via the `GGML_CUDA_ALLREDUCE` environment variable ("nccl" or "internal").
- Added an AllReduce hang watchdog (`GGML_CUDA_AR_WATCHDOG`) for debugging hangs, enabled when compiling with `-DGGML_CUDA_AR_WATCHDOG=ON`.
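The core idea of a single-phase AllReduce is that every rank reads all peers' input shards directly and computes the full sum itself, rather than running the separate reduce-scatter and all-gather phases of a ring AllReduce. A minimal sketch, assuming peer access is enabled and all inputs have already been published (the arrival-signal synchronization from the fixes below is omitted, and all names are illustrative, not the actual llama.cpp kernel):

```cpp
// Minimal single-phase AllReduce sketch. Assumes cudaDeviceEnablePeerAccess()
// has mapped every rank's input buffer into this device's address space, and
// that a barrier has already guaranteed all inputs are fully written.
__global__ void allreduce_single_phase(
        float * __restrict__ dst,    // this rank's output buffer
        const float * const * srcs,  // device pointers to all ranks' inputs
        int n_ranks, long long n_elems) {
    const long long i = (long long) blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_elems) {
        return;
    }
    // Single phase: sum every rank's contribution for this element locally.
    float sum = 0.0f;
    for (int r = 0; r < n_ranks; ++r) {
        sum += srcs[r][i];
    }
    dst[i] = sum;
}
```

A single fused kernel avoids the extra launches and inter-phase synchronization of multi-phase schemes, which is presumably where the win comes from for the relatively small per-layer reductions in tensor parallelism.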
🐛 Bug Fixes
- Fixed an intermittent AllReduce hang on Blackwell PCIe by adding `__threadfence_system()` before the arrival-signal write in `signal_set` (sketched after this list).
- Redesigned the watchdog from a blocking dispatch-thread poll into a non-blocking background thread, eliminating ~20 ms of per-slot latency (see the watchdog sketch after this list).
- Fixed watchdog shutdown ordering by stopping the watchdog thread before destroying GPU resources.
- Added a `cudaStreamSynchronize` in `pipeline_free` to drain in-flight kernels before freeing pinned host buffers (sketched after this list).
- Changed watchdog polling to sleep before sampling (sleep-first), avoiding spurious +0 ms readings.
- Fixed watchdog `join()` to return promptly by checking `wdog_stop` in both the outer and inner loops.
- Added Phase 3 breadcrumbs to `debug[3]` for hang localization.
- Fixed `ggml_cuda_select_allreduce_provider()` to treat an empty `GGML_CUDA_ALLREDUCE` env var the same as unset.
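The Blackwell PCIe fix is a classic release-fence pattern: the reduced data must be visible system-wide before a peer can observe the arrival flag. A minimal sketch, assuming the flag lives in peer- or host-mapped memory (`signal_set` is named in the notes, but its signature and the wait-side helper here are assumptions):

```cpp
// Sketch of release/acquire-style signaling across GPUs; names and types
// are assumptions, not the actual llama.cpp code.
__device__ void signal_set(volatile unsigned int * flag, unsigned int value) {
    // Release: make all prior writes (the reduced data) visible system-wide
    // before the arrival signal. Without this fence, a peer on PCIe could
    // observe the flag before the data it guards.
    __threadfence_system();
    *flag = value;
}

__device__ void signal_wait(const volatile unsigned int * flag, unsigned int value) {
    while (*flag != value) {
        // spin until the peer signals arrival
    }
    // Acquire: order subsequent reads after the observed flag write.
    __threadfence_system();
}
```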
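Several of the watchdog fixes (non-blocking background thread, sleep-first polling, prompt `join()`, shutdown ordering) fit one pattern. A minimal sketch with assumed names and intervals (`wdog_stop` is mentioned in the notes; everything else is illustrative):

```cpp
// Sketch of the background watchdog pattern; intervals, sleep slicing, and
// the struct layout are assumptions, not the actual llama.cpp implementation.
#include <atomic>
#include <chrono>
#include <thread>

struct ar_watchdog {
    std::atomic<bool> wdog_stop{false};
    std::thread       worker;

    void start() {
        worker = std::thread([this] {
            while (!wdog_stop.load()) {  // outer stop check
                // Sleep-first: sleep before sampling so the first reading is
                // a real interval, not a spurious +0 ms sample. Sleep in
                // short slices so join() returns promptly on shutdown.
                for (int i = 0; i < 20 && !wdog_stop.load(); ++i) {  // inner stop check
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                }
                if (wdog_stop.load()) {
                    break;
                }
                // ... sample progress breadcrumbs and report if stalled ...
            }
        });
    }

    // Must run before GPU resources are destroyed (the shutdown-ordering fix).
    void stop() {
        wdog_stop.store(true);
        if (worker.joinable()) {
            worker.join();
        }
    }
};
```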
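The `pipeline_free` fix follows the usual rule that pinned host memory must not be released while asynchronous work can still touch it. A sketch under an assumed structure (only the function name `pipeline_free` and the added `cudaStreamSynchronize` call come from the notes):

```cpp
// Sketch of draining in-flight work before freeing pinned buffers; the
// struct and field names are assumptions.
#include <cuda_runtime.h>

struct ar_pipeline {
    cudaStream_t stream;
    void *       host_buf; // pinned (page-locked) staging buffer
};

static void pipeline_free(ar_pipeline * p) {
    // In-flight kernels and async copies may still reference host_buf;
    // freeing it while they run is undefined behavior, so drain first.
    cudaStreamSynchronize(p->stream);
    cudaFreeHost(p->host_buf);
    cudaStreamDestroy(p->stream);
}
```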