b9095
📦 llama-cpp
✨ 3 features · 🐛 8 fixes · 🔧 7 symbols
Summary
This release introduces a high-performance, NCCL-free internal AllReduce kernel for tensor parallelism on CUDA, along with fixes and improvements to the associated hang-detection watchdog and build system.
Migration Steps
- If using llama-bench, rename the command-line argument `--allreduce` to `--reduction-provider` (short form `-rp`).
- If debugging hangs with the new internal AllReduce provider, set the `GGML_CUDA_ALLREDUCE` environment variable to "internal" or "nccl" to select the provider explicitly (see the sketch after this list).
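As a rough sketch of how selection from this variable might look (the function name `ggml_cuda_select_allreduce_provider()` appears in the fixes below, but the enum, the default, and the fallback on unrecognized values here are assumptions):

```cpp
// Hypothetical sketch of env-var provider selection; not the actual
// llama.cpp implementation. The default provider and the fallback on
// unrecognized values are assumptions.
#include <cstdlib>
#include <cstring>

enum class allreduce_provider { INTERNAL, NCCL };

static allreduce_provider select_allreduce_provider() {
    const char * env = std::getenv("GGML_CUDA_ALLREDUCE");
    // Treat an empty value the same as unset (matches the fix listed below).
    if (env == nullptr || env[0] == '\0') {
        return allreduce_provider::INTERNAL; // assumed default
    }
    return std::strcmp(env, "nccl") == 0 ? allreduce_provider::NCCL
                                         : allreduce_provider::INTERNAL;
}
```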
✨ New Features
- Added an internal, single-phase CUDA AllReduce kernel for tensor parallelism (`LLAMA_SPLIT_MODE_TENSOR`); the core idea is sketched after this list.
- Added support for selecting the AllReduce provider via the `GGML_CUDA_ALLREDUCE` environment variable ("nccl" or "internal").
- Added an AllReduce hang watchdog (`GGML_CUDA_AR_WATCHDOG`) for debugging hangs, enabled when compiling with `-DGGML_CUDA_AR_WATCHDOG=ON`.
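The core idea of a single-phase AllReduce is that every rank reads all peers' input shards directly and computes the full sum itself, rather than running the separate reduce-scatter and all-gather phases of a ring AllReduce. A minimal sketch, assuming peer access is enabled and all inputs have already been published (the arrival-signal synchronization from the fixes below is omitted, and all names are illustrative, not the actual llama.cpp kernel):

```cpp
// Minimal single-phase AllReduce sketch. Assumes cudaDeviceEnablePeerAccess()
// has mapped every rank's input buffer into this device's address space, and
// that a barrier has already guaranteed all inputs are fully written.
__global__ void allreduce_single_phase(
        float * __restrict__ dst,    // this rank's output buffer
        const float * const * srcs,  // device pointers to all ranks' inputs
        int n_ranks, long long n_elems) {
    const long long i = (long long) blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_elems) {
        return;
    }
    // Single phase: sum every rank's contribution for this element locally.
    float sum = 0.0f;
    for (int r = 0; r < n_ranks; ++r) {
        sum += srcs[r][i];
    }
    dst[i] = sum;
}
```

A single fused kernel avoids the extra launches and inter-phase synchronization of multi-phase schemes, which is presumably where the win comes from for the relatively small per-layer reductions in tensor parallelism.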
🐛 Bug Fixes
- Fixed an intermittent AllReduce hang on Blackwell PCIe by adding `__threadfence_system()` before the arrival-signal write in `signal_set` (sketched after this list).
- Redesigned the watchdog from a blocking dispatch-thread poll into a non-blocking background thread, eliminating ~20 ms of per-slot latency (see the watchdog sketch after this list).
- Fixed watchdog shutdown ordering by stopping the watchdog thread before destroying GPU resources.
- Added a `cudaStreamSynchronize` in `pipeline_free` to drain in-flight kernels before freeing pinned host buffers (sketched after this list).
- Changed watchdog polling to sleep before sampling (sleep-first), avoiding spurious +0 ms readings.
- Fixed watchdog `join()` to return promptly by checking `wdog_stop` in both the outer and inner loops.
- Added Phase 3 breadcrumbs to `debug[3]` for hang localization.
- Fixed `ggml_cuda_select_allreduce_provider()` to treat an empty `GGML_CUDA_ALLREDUCE` env var the same as unset.
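The Blackwell PCIe fix is a classic release-fence pattern: the reduced data must be visible system-wide before a peer can observe the arrival flag. A minimal sketch, assuming the flag lives in peer- or host-mapped memory (`signal_set` is named in the notes, but its signature and the wait-side helper here are assumptions):

```cpp
// Sketch of release/acquire-style signaling across GPUs; names and types
// are assumptions, not the actual llama.cpp code.
__device__ void signal_set(volatile unsigned int * flag, unsigned int value) {
    // Release: make all prior writes (the reduced data) visible system-wide
    // before the arrival signal. Without this fence, a peer on PCIe could
    // observe the flag before the data it guards.
    __threadfence_system();
    *flag = value;
}

__device__ void signal_wait(const volatile unsigned int * flag, unsigned int value) {
    while (*flag != value) {
        // spin until the peer signals arrival
    }
    // Acquire: order subsequent reads after the observed flag write.
    __threadfence_system();
}
```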
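Several of the watchdog fixes (non-blocking background thread, sleep-first polling, prompt `join()`, shutdown ordering) fit one pattern. A minimal sketch with assumed names and intervals (`wdog_stop` is mentioned in the notes; everything else is illustrative):

```cpp
// Sketch of the background watchdog pattern; intervals, sleep slicing, and
// the struct layout are assumptions, not the actual llama.cpp implementation.
#include <atomic>
#include <chrono>
#include <thread>

struct ar_watchdog {
    std::atomic<bool> wdog_stop{false};
    std::thread       worker;

    void start() {
        worker = std::thread([this] {
            while (!wdog_stop.load()) {  // outer stop check
                // Sleep-first: sleep before sampling so the first reading is
                // a real interval, not a spurious +0 ms sample. Sleep in
                // short slices so join() returns promptly on shutdown.
                for (int i = 0; i < 20 && !wdog_stop.load(); ++i) {  // inner stop check
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                }
                if (wdog_stop.load()) {
                    break;
                }
                // ... sample progress breadcrumbs and report if stalled ...
            }
        });
    }

    // Must run before GPU resources are destroyed (the shutdown-ordering fix).
    void stop() {
        wdog_stop.store(true);
        if (worker.joinable()) {
            worker.join();
        }
    }
};
```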
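The `pipeline_free` fix follows the usual rule that pinned host memory must not be released while asynchronous work can still touch it. A sketch under an assumed structure (only the function name `pipeline_free` and the added `cudaStreamSynchronize` call come from the notes):

```cpp
// Sketch of draining in-flight work before freeing pinned buffers; the
// struct and field names are assumptions.
#include <cuda_runtime.h>

struct ar_pipeline {
    cudaStream_t stream;
    void *       host_buf; // pinned (page-locked) staging buffer
};

static void pipeline_free(ar_pipeline * p) {
    // In-flight kernels and async copies may still reference host_buf;
    // freeing it while they run is undefined behavior, so drain first.
    cudaStreamSynchronize(p->stream);
    cudaFreeHost(p->host_buf);
    cudaStreamDestroy(p->stream);
}
```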