b9066
📦 llama-cpp
✨ 2 features · 🔧 1 symbol
Summary
This release improves CUDA performance by batching the out_prod inner loop with a single cublasSgemmStridedBatched call, and extends the same optimization to the HIP and MUSA backends.
✨ New Features
- CUDA: Batched out_prod inner loop optimized using cublasSgemmStridedBatched.
- Added the cublasSgemmStridedBatched mapping for the HIP and MUSA backends, so they benefit from the same batched out_prod path.