b8579
📦 llama-cpp
✨ 4 features · 🐛 2 fixes · 🔧 6 symbols
Summary
This release delivers significant performance optimizations for Mixture of Experts (MoE) operations: it introduces a new GEMV kernel for batch size (BS) > 1 and refines when the small_k optimization is enabled. It also raises the maximum batch size for MMVQ kernels that support MUL_MAT_ID.
✨ New Features
- Optimized MoE GEMV for batch size (BS) > 1 via a new `mul_mat_vec_q_moe` kernel.
- Introduced a new MoE GEMV kernel configuration selected by GPU architecture and data type.
- Increased the maximum batch size for MMVQ kernels supporting MUL_MAT_ID to 8.
- Cherry-picked changes to enable the `small_k` optimization only when it is beneficial.
🐛 Bug Fixes
- Simplified the original GEMV kernel by removing the `is_multi_token_id` specialization.
- Removed em-dashes from release notes content.