b8469
📦 llama-cpp
✨ 1 feature · 🐛 1 fix
Summary
This release focuses on CUDA performance, specifically optimizing thread-block sizing for small K dimensions in tensor parallelism, and ships numerous pre-compiled binaries.
Migration Steps
- None required; the only housekeeping change under this heading converted tabs to spaces in the relevant code sections.
✨ New Features
- Improved CUDA performance under tensor parallelism by increasing the number of output elements computed per thread block when the K dimension is small; this particularly benefits FFN-down matrices in MoE models.
🐛 Bug Fixes
- Fixed GPU thread underutilization in tensor parallelism caused by using a fixed warp-group size regardless of the K dimension, which left threads idle when K was small.