Changelog

b8469

📦 llama-cpp
✨ 1 feature · 🐛 1 fix

Summary

This release focuses on performance improvements for CUDA operations, specifically optimizing thread block sizing for small K-dimensions in tensor parallelism, alongside providing numerous pre-compiled binaries.

Migration Steps

  1. No user action is required: the only code change of this kind converted tabs to spaces in the affected code sections.

✨ New Features

  • Improved CUDA tensor-parallelism performance by increasing the number of output elements computed per thread block when the K dimension is small, which particularly benefits FFN-down matrices in Mixture-of-Experts (MoE) models.

🐛 Bug Fixes

  • Fixed an issue in tensor parallelism where the warp group size was fixed regardless of the K dimension, leaving threads idle when K was small.