b8210
📦 llama-cpp
✨ 3 features · 🐛 5 fixes · 🔧 4 symbols
Summary
This release improves CUDA performance by reducing host-device synchronizations: copies are performed asynchronously and the synchronization requirements between input copies are relaxed. It also includes internal refactoring that makes backend detection and compilation more robust.
Migration Steps
- Rename `src` to `buf_src` in the affected code paths (suggested by @ggerganov; the rename applies in two places).
✨ New Features
- Added CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async().
- Introduced a function to relax synchronization requirements between input copies on supported backends (initially CUDA).
- Made the opt-in to relax explicit syncs more general, allowing backends like Vulkan to adopt the change for HtoD copies and graph execution.
🐛 Bug Fixes
- Replaced a synchronous copy with the asynchronous copy function in CUDA operations.
- Reworked backend detection in ggml-backend.cpp to avoid linking conflicts.
- Relaxed the checks for async CUDA copies from backend plus buffer type to buffer type only, avoiding linking issues.
- Corrected the initialization of `ggml_backend_sync_mode` when a `ggml_backend_sched_split` is set up.
- Simplified synchronizations to adhere to the `saaasg` pattern.