
b8579

📦 llama-cpp
✨ 4 features · 🐛 2 fixes · 🔧 6 symbols

Summary

This release focuses on performance optimizations for Mixture of Experts (MOE) operations, introducing a new GEMV kernel for batch size (BS) > 1 and refining when the small_k optimization is applied. It also raises the batch size limit for MMVQ kernels that support MUL_MAT_ID.

✨ New Features

  • Optimized MOE GEMV for batch size (BS) > 1 via a new `mul_mat_vec_q_moe` kernel.
  • Introduced a new MOE GEMV kernel configuration selected by GPU architecture and datatype.
  • Increased the maximum batch size for MMVQ kernels supporting MUL_MAT_ID to 8.
  • Cherry-picked changes to enable the small_k optimization only when it is beneficial.

🐛 Bug Fixes

  • Simplified the original GEMV kernel by removing the `is_multi_token_id` specialization, which is no longer needed now that the BS > 1 case has its own kernel.
  • Removed em-dashes from release notes content.

Affected Symbols