Changelog

b8469

📦 llama-cpp
✨ 1 feature · 🐛 1 fix

Summary

This release focuses on performance improvements for CUDA operations, specifically optimizing thread block sizing for small K-dimensions in tensor parallelism, alongside providing numerous pre-compiled binaries.

Migration Steps

  1. No user action is required: the only code change of this kind converted tabs to spaces in the affected code sections.

✨ New Features

  • Improved CUDA tensor-parallelism performance by increasing the number of output elements computed per thread block when the K dimension is small, which particularly benefits FFN-down matrices in Mixture-of-Experts (MoE) models.

🐛 Bug Fixes

  • Fixed an issue in tensor parallelism where the warp group size was fixed regardless of the K dimension, leaving threads idle when K was small.