Changelog

b9095

📦 llama-cpp
✨ 3 features · 🐛 8 fixes · 🔧 7 symbols

Summary

This release introduces a high-performance, NCCL-free internal AllReduce kernel for tensor parallelism on CUDA, along with significant improvements and fixes to the associated hang-detection watchdog and build system.

Migration Steps

  1. If using `llama-bench`, replace the command-line argument `--allreduce` with `--reduction-provider` (or `-rp`).
  2. To select the AllReduce provider (for example, to fall back to NCCL while debugging hangs with the new internal provider), set the `GGML_CUDA_ALLREDUCE` environment variable to `"internal"` or `"nccl"`.
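The provider selection in step 2 can be sketched as follows. This is a minimal illustration, not the actual llama.cpp code: the enum and function names are hypothetical, and the choice of the internal kernel as the default is an assumption, not stated in these notes.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical sketch of AllReduce provider selection. An empty
// GGML_CUDA_ALLREDUCE is treated the same as unset; defaulting to the
// internal kernel is an assumption for illustration.
enum class allreduce_provider { internal, nccl };

static allreduce_provider select_allreduce_provider() {
    const char * env = std::getenv("GGML_CUDA_ALLREDUCE");
    if (env == nullptr || env[0] == '\0') {
        return allreduce_provider::internal;  // unset or empty -> default
    }
    if (std::strcmp(env, "nccl") == 0) {
        return allreduce_provider::nccl;
    }
    return allreduce_provider::internal;      // "internal" or unrecognized
}
```

Treating the empty string like an unset variable matters in practice because wrappers and CI scripts often export variables unconditionally, leaving them empty.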

✨ New Features

  • Added an internal AllReduce implementation for the CUDA backend, using a single-phase CUDA kernel for tensor parallelism (`LLAMA_SPLIT_MODE_TENSOR`).
  • Added support for selecting the AllReduce provider via the `GGML_CUDA_ALLREDUCE` environment variable (`"nccl"` or `"internal"`).
  • Added an AllReduce hang watchdog for debugging hangs, enabled at build time with `-DGGML_CUDA_AR_WATCHDOG=ON`.
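A CPU analogue can illustrate the shape of a single-phase AllReduce: each rank publishes its partial result, raises an arrival signal, then waits for all peers and reduces locally, with no separate gather phase. All names here are hypothetical, and `std::atomic` release/acquire ordering stands in for the device-side fence (`__threadfence_system()`) that orders the data write before the signal.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// CPU analogue of a single-phase AllReduce slot (illustrative only,
// not the actual llama.cpp data structures).
struct allreduce_slot {
    std::vector<float>             partial;  // one partial sum per rank
    std::vector<std::atomic<bool>> arrived;  // arrival signals
    explicit allreduce_slot(int ranks) : partial(ranks, 0.0f), arrived(ranks) {
        for (auto & a : arrived) a.store(false);
    }
};

// Each rank publishes its partial, signals arrival, then spins on its
// peers' signals and reduces locally -- a single phase.
static float allreduce_sum(allreduce_slot & slot, int rank, float value) {
    slot.partial[rank] = value;
    // Release ordering plays the role of __threadfence_system(): the
    // partial must be visible before the arrival flag is.
    slot.arrived[rank].store(true, std::memory_order_release);
    float sum = 0.0f;
    for (size_t peer = 0; peer < slot.partial.size(); ++peer) {
        while (!slot.arrived[peer].load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        sum += slot.partial[peer];
    }
    return sum;
}
```

Without that ordering, a peer can observe the arrival flag before the partial data, which is exactly the class of race behind the Blackwell PCIe hang fixed below.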

🐛 Bug Fixes

  • Fixed an intermittent AllReduce hang on Blackwell PCIe by adding `__threadfence_system()` before the arrival-signal write in `signal_set`.
  • Redesigned the watchdog from a blocking dispatch-thread poll to a non-blocking background thread, eliminating ~20ms of per-slot latency.
  • Fixed watchdog shutdown ordering by stopping the watchdog thread before destroying GPU resources.
  • Added a `cudaStreamSynchronize` in `pipeline_free` to drain in-flight kernels before freeing pinned host buffers.
  • Changed watchdog polling to sleep before sampling, avoiding spurious +0ms entries.
  • Fixed watchdog `join()` to return promptly by checking `wdog_stop` in both the outer and inner loops.
  • Added Phase 3 breadcrumbs to `debug[3]` for hang localization.
  • Fixed `ggml_cuda_select_allreduce_provider()` to treat an empty `GGML_CUDA_ALLREDUCE` env var the same as unset.
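The watchdog fixes above fit a common pattern, sketched below under stated assumptions: it is not the actual llama.cpp code, only `wdog_stop` is a name taken from these notes, and everything else is illustrative. The sketch shows a non-blocking background thread that sleeps before sampling (no +0ms samples) and checks the stop flag in both loops so `join()` returns promptly; `join()` must run before GPU resources are torn down.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Illustrative watchdog sketch: background thread, sleep-first polling,
// stop flag checked in both loops.
class ar_watchdog {
    std::atomic<bool>     wdog_stop{false};
    std::atomic<bool>     stalled{false};
    std::atomic<uint64_t> progress{0};   // bumped by the AllReduce path
    std::thread           thr;
public:
    void start() {
        thr = std::thread([this] {
            uint64_t last = progress.load();
            while (!wdog_stop.load()) {                              // outer loop
                // Sleep first: a sample taken immediately on entry
                // would report a meaningless +0ms interval.
                for (int i = 0; i < 10 && !wdog_stop.load(); ++i) {  // inner loop
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                }
                const uint64_t now = progress.load();
                if (now == last && !wdog_stop.load()) {
                    stalled.store(true);  // a real watchdog would dump breadcrumbs
                }
                last = now;
            }
        });
    }
    void bump()            { progress.fetch_add(1); }
    bool saw_stall() const { return stalled.load(); }
    // Must be called before destroying GPU resources, mirroring the
    // shutdown-ordering fix.
    void join() {
        wdog_stop.store(true);
        if (thr.joinable()) thr.join();
    }
};
```

Because `wdog_stop` is re-checked inside the sleep loop, shutdown latency is bounded by a single short sleep rather than a full polling interval.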

Affected Symbols