
b9066

📦 llama-cpp
2 features · 🔧 1 symbol

Summary

This release speeds up the CUDA out_prod operation by batching its inner loop with cublasSgemmStridedBatched, and adds the cublasSgemmStridedBatched mapping that the HIP and MUSA backends need to use the same code path.

✨ New Features

  • CUDA: The out_prod inner loop is now batched with cublasSgemmStridedBatched.
  • Added cublasSgemmStridedBatched mapping support for HIP and MUSA backends.
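
To make the batching idea concrete, here is a minimal CPU sketch of what a strided-batched GEMM computes. It is not the llama.cpp implementation: the function names `sgemm_nt_ref` and `sgemm_nt_strided_batched_ref` are hypothetical, the parameter order only loosely mirrors cublasSgemmStridedBatched, and the choice of C = A * B^T (an outer-product-style accumulation over k) is an assumption made for illustration. Matrices are column-major, as in cuBLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for one column-major GEMM: C = alpha * A * B^T + beta * C.
// A is m x k (leading dimension lda), B is n x k (ldb), C is m x n (ldc).
// Hypothetical helper for illustration only.
void sgemm_nt_ref(int m, int n, int k, float alpha,
                  const float *A, int lda,
                  const float *B, int ldb,
                  float beta, float *C, int ldc) {
    assert(lda >= m && ldb >= n && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + static_cast<std::size_t>(p) * lda] *
                       B[j + static_cast<std::size_t>(p) * ldb];
            }
            float &c = C[i + static_cast<std::size_t>(j) * ldc];
            // When beta is zero, C is treated as write-only.
            c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * c;
        }
    }
}

// Strided-batched form: matrix b of each operand starts b * stride elements
// after matrix 0, so one call describes the whole batch. On a GPU this lets
// the library handle all matrices in a single call instead of one call per
// loop iteration; here the loop is explicit to show the addressing.
void sgemm_nt_strided_batched_ref(int m, int n, int k, float alpha,
                                  const float *A, int lda, long long strideA,
                                  const float *B, int ldb, long long strideB,
                                  float beta,
                                  float *C, int ldc, long long strideC,
                                  int batch_count) {
    for (int b = 0; b < batch_count; ++b) {
        sgemm_nt_ref(m, n, k, alpha,
                     A + b * strideA, lda,
                     B + b * strideB, ldb,
                     beta,
                     C + b * strideC, ldc);
    }
}
```

Called with m = n = 2, k = 1 and two batches stored back to back (strideA = strideB = 2, strideC = 4), the batched function fills both 2 x 2 outputs in one call, producing the same values as two separate calls to `sgemm_nt_ref`.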

Affected Symbols