b8210
📦 llama-cpp
✨ 3 features · 🐛 5 fixes · 🔧 4 symbols
Summary
This release improves CUDA performance by reducing host-device synchronizations: copies are performed asynchronously and the synchronization requirements between input copies are relaxed. It also includes internal refactoring that makes backend detection and compilation more robust.
Migration Steps
- Rename `src` to `buf_src` in the affected code paths (suggested by @ggerganov; the rename applies in two places).
✨ New Features
- Added CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async().
- Introduced a function to relax synchronization requirements between input copies on supported backends (initially CUDA).
- Made the opt-in to relax explicit syncs more general, allowing backends like Vulkan to adopt the change for HtoD copies and graph execution.
🐛 Bug Fixes
- Replaced a synchronous copy with the asynchronous copy function in CUDA operations.
- Reworked backend detection in ggml-backend.cpp to avoid linking conflicts.
- Relaxed the checks for async CUDA copies from backend plus buffer type to buffer type only, avoiding linking issues.
- Corrected the initialization of `ggml_backend_sync_mode` when a `ggml_backend_sched_split` is set up.
- Simplified synchronizations to adhere to the `saaasg` pattern.