b9066
📦 llama-cpp
✨ 2 features · 🔧 1 symbol
Summary
This release improves CUDA performance by batching the out_prod inner loop with a single cublasSgemmStridedBatched call, and extends the same optimization to the HIP and MUSA backends.
✨ New Features
- CUDA: Batched out_prod inner loop optimized using cublasSgemmStridedBatched.
- Added the cublasSgemmStridedBatched mapping for the HIP and MUSA backends, so they benefit from the same batched out_prod path.