b9820
📦 llama-cpp
✨ 3 features · 🐛 10 fixes · 🔧 3 symbols
Summary
This release focuses on scheduler improvements: it reduces synchronizations for better performance, especially on CUDA, and hardens backend interactions across hardware targets such as HIP, MUSA, and Vulkan.
Migration Steps
- If you rely on specific backend or buffer type checks for async CUDA copies, note that the check is now relaxed to depend only on buffer type.
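To make the relaxed check concrete, here is a minimal, self-contained sketch. It does not use ggml's real types: `BufferKind`, `BackendKind`, `TensorInfo`, and both predicate functions are hypothetical stand-ins invented for illustration, and the shape of the old check is an assumption. The sketch contrasts a check that inspects both backend and buffer type with one that inspects buffer type only.

```cpp
// Toy model only: these types are hypothetical stand-ins, not ggml's API.
enum class BufferKind  { Host, Cuda };
enum class BackendKind { Cpu, Cuda, Other };

struct TensorInfo {
    BackendKind backend; // backend the tensor is associated with
    BufferKind  buffer;  // type of the buffer holding the tensor data
};

// Assumed shape of the stricter check: backend AND buffer type must match.
bool can_copy_async_strict(const TensorInfo & src, const TensorInfo & dst) {
    const bool backends_ok = src.backend == BackendKind::Cuda &&
                             dst.backend == BackendKind::Cuda;
    const bool buffers_ok  = src.buffer == BufferKind::Cuda &&
                             dst.buffer == BufferKind::Cuda;
    return backends_ok && buffers_ok;
}

// Relaxed check: only the buffer type is inspected.
bool can_copy_async_relaxed(const TensorInfo & src, const TensorInfo & dst) {
    return src.buffer == BufferKind::Cuda && dst.buffer == BufferKind::Cuda;
}
```

In this model, a tensor that lives in a CUDA buffer but is associated with a different backend fails the strict check and passes the relaxed one. Code that depended on the backend half of the check to reject such copies would see different behavior.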
✨ New Features
- Added CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async().
- Added a function to relax sync requirements between input copies on supported backends (currently CUDA only).
- Generalized the opt-in for relaxing explicit syncs, allowing backends such as Vulkan to adopt it for HtoD (host-to-device) copies and graph execution.
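The effect of such an opt-in can be illustrated with a small counting model. This is a sketch under stated assumptions, not the scheduler's actual logic: `SplitModel`, `relaxed_sync`, and `count_syncs` are hypothetical names, and the model assumes that a backend without the opt-in synchronizes after every input copy while an opted-in backend queues its copies and synchronizes once before compute.

```cpp
#include <vector>

// Toy model only: hypothetical names, not part of ggml.
struct SplitModel {
    int  n_inputs;     // number of input tensors copied into this split
    bool relaxed_sync; // true if the backend opted in to relaxed syncs
};

// Count the explicit synchronizations needed to run all splits once.
// Assumption: without the opt-in, each input copy is followed by a sync;
// with it, copies are queued back to back and one sync precedes compute.
int count_syncs(const std::vector<SplitModel> & splits) {
    int syncs = 0;
    for (const SplitModel & s : splits) {
        if (s.n_inputs == 0) {
            continue; // nothing copied, so nothing to wait for in this model
        }
        syncs += s.relaxed_sync ? 1 : s.n_inputs;
    }
    return syncs;
}
```

Under these assumptions, two splits with 3 and 2 inputs need 5 synchronizations without the opt-in and 2 with it. The real saving depends on the backend and on how the graph is split.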
🐛 Bug Fixes
- Reintroduced the reduced number of synchronizations during split compute.
- Improved CUDA performance by using fewer synchronizations between tokens.
- Reworked backend detection in ggml-backend.cpp to avoid linking conflicts.
- Relaxed the checks in async CUDA copies from backend and buffer type to buffer type only, to avoid linking issues.
- Reintroduced stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU.
- Corrected the initialization of ggml_backend_sync_mode when initializing ggml_backend_sched_split.
- Simplified synchronizations to adhere to `saaasg` pattern.
- Added single-GPU synchronizations to multi-GPU settings to fix pipeline-parallel bugs in the HIP backend.
- Excluded HIP/MUSA from the copy_from_host CPU split -> GPU split optimization (Scheduler Hardening).
- Re-added original additional synchronizations for non-async backends (Scheduler Hardening).