Change8

b9820

📦 llama-cpp
3 features · 🐛 10 fixes · 🔧 3 symbols

Summary

This release centers on the backend scheduler: it reduces the number of synchronizations for better performance, especially on CUDA, and hardens scheduler behavior on other backends, including HIP, MUSA, and backends without async support.

Migration Steps

  1. If your code relies on async CUDA copies checking both the backend and the buffer type, note that this check is now relaxed and depends on the buffer type only.
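
As a rough illustration of what "buffer type only" means, the sketch below models the decision with hypothetical stand-in types. `BackendKind`, `BufferKind`, `CopyEndpoint`, and both functions are invented for this example and are not ggml names; the actual logic in ggml_backend_cuda_cpy_tensor_async() may differ. The strict variant inspects the backend and the buffer, while the relaxed variant looks at the buffer alone.

```cpp
// Toy model only: these types are hypothetical and do not exist in ggml.
enum class BackendKind { Cpu, Cuda, Other };
enum class BufferKind  { Host, Cuda, Other };

struct CopyEndpoint {
    BackendKind backend; // which backend owns the tensor
    BufferKind  buffer;  // which buffer type holds its data
};

// Assumed shape of the older, stricter check: backend AND buffer type must match.
bool async_copy_ok_strict(const CopyEndpoint & src, const CopyEndpoint & dst) {
    const bool src_ok =
        (src.backend == BackendKind::Cuda && src.buffer == BufferKind::Cuda) ||
        (src.backend == BackendKind::Cpu  && src.buffer == BufferKind::Host);
    const bool dst_ok =
        dst.backend == BackendKind::Cuda && dst.buffer == BufferKind::Cuda;
    return src_ok && dst_ok;
}

// Assumed shape of the relaxed check: only the buffer types are inspected.
bool async_copy_ok_relaxed(const CopyEndpoint & src, const CopyEndpoint & dst) {
    const bool src_ok =
        src.buffer == BufferKind::Cuda || src.buffer == BufferKind::Host;
    return src_ok && dst.buffer == BufferKind::Cuda;
}
```

In this model, a host-buffer source paired with a CUDA-buffer destination passes the relaxed check regardless of which backend object is attached, which is the kind of behavior change a caller that depended on the stricter check would want to review.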

✨ New Features

  • Added CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async().
  • Added a function to relax sync requirements between input copies on supported backends (CUDA for now).
  • Generalized the opt-in for relaxing explicit syncs, so that backends such as Vulkan can adopt it for host-to-device (HtoD) copies and graph execution.
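
One way to picture the opt-in is that each backend reports whether it can order input copies on its own, and the scheduler issues an explicit synchronization after every copy only when the backend has not opted in. The following is a toy model under that assumption; `ToyBackend`, `relaxed_input_sync`, and `syncs_for_input_copies` are hypothetical names and not part of the ggml API.

```cpp
#include <cstddef>

// Toy model only: hypothetical names, not part of ggml.
struct ToyBackend {
    bool relaxed_input_sync; // backend opted in (per the notes: CUDA for now)
};

// Count the explicit synchronizations needed to copy n_inputs tensors before
// running a graph. Assumption for this sketch: an opted-in backend needs one
// sync in total, while any other backend syncs after every single copy.
std::size_t syncs_for_input_copies(const ToyBackend & backend, std::size_t n_inputs) {
    if (n_inputs == 0) {
        return 0; // nothing to copy, nothing to wait for
    }
    return backend.relaxed_input_sync ? 1 : n_inputs;
}
```

With four inputs, the opted-in backend in this model performs one sync instead of four, which is the sort of saving the relaxed mode is aiming for.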

🐛 Bug Fixes

  • Reintroduced the reduced number of synchronizations during split compute.
  • Improved CUDA performance by using fewer synchronizations between tokens.
  • Reworked backend detection in ggml-backend.cpp to avoid linking conflicts.
  • Relaxed the checks in async CUDA copies from backend and buffer type to buffer type only, to avoid linking issues.
  • Reintroduced stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU.
  • Corrected the initialization of ggml_backend_sync_mode when ggml_backend_sched_split is initialized.
  • Simplified synchronizations to adhere to `saaasg` pattern.
  • Added the single-GPU synchronizations to multi-GPU settings to fix pipeline-parallelism bugs on the HIP backend.
  • Excluded HIP/MUSA from the copy_from_host CPU split -> GPU split optimization (Scheduler Hardening).
  • Re-added original additional synchronizations for non-async backends (Scheduler Hardening).
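
The two scheduler-hardening items above can be read as a small policy table. The sketch below encodes that reading with hypothetical names (`GpuFlavor`, `SplitTarget`, `use_copy_from_host_fast_path`, `needs_additional_sync` are invented here and do not appear in ggml); requiring async support for the fast path is an extra assumption of this example, not something the notes state.

```cpp
// Toy model only: hypothetical names, not part of ggml.
enum class GpuFlavor { Cuda, Hip, Musa, OtherGpu };

struct SplitTarget {
    GpuFlavor flavor;
    bool      supports_async; // backend implements asynchronous copies
};

// Per the notes, HIP and MUSA are excluded from the CPU split -> GPU split
// copy_from_host optimization. Requiring async support is an assumption.
bool use_copy_from_host_fast_path(const SplitTarget & target) {
    if (target.flavor == GpuFlavor::Hip || target.flavor == GpuFlavor::Musa) {
        return false;
    }
    return target.supports_async;
}

// Per the notes, non-async backends get the original additional syncs back.
bool needs_additional_sync(const SplitTarget & target) {
    return !target.supports_async;
}
```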

Affected Symbols