b8579
📦 llama-cpp
✨ 4 features · 🐛 2 fixes · 🔧 6 symbols
Summary
This release delivers significant performance optimizations for Mixture of Experts (MoE) operations: it introduces a new GEMV kernel for batch size (BS) > 1 and refines when the small_k optimization is enabled. It also raises the maximum batch size for MMVQ kernels that support MUL_MAT_ID.
✨ New Features
- Optimized MoE GEMV for batch size (BS) > 1 via a new `mul_mat_vec_q_moe` kernel.
- Introduced a new MoE GEMV kernel configuration selected by GPU architecture and data type.
- Increased the maximum batch size for MMVQ kernels supporting MUL_MAT_ID to 8.
- Cherry-picked changes to enable the `small_k` optimization only when it is beneficial.
🐛 Bug Fixes
- Simplified the original GEMV kernel by removing the `is_multi_token_id` specialization.
- Removed em-dashes from release notes content.