b8680
📦 llama-cpp
✨ 4 features · 🔧 1 symbol
Summary
This release introduces an optimized `flash_attn_stream_k_fixup` CUDA kernel, specialized for the case where the stream-k block count is a multiple of the destination tile count. It also provides updated pre-compiled binaries for multiple operating systems and hardware targets.
✨ New Features
- Implemented an optimized `flash_attn_stream_k_fixup` kernel for CUDA.
- The new kernel is specialized for the case where `nblocks_stream_k` is a multiple of `ntiles_dst`.
- Added logic to round `nblocks_stream_k` to a multiple of `ntiles_dst` when `nblocks_stream_k > 2 * ntiles_dst`.
- The new kernel is used only when `nblocks_stream_k_raw > 4 * ntiles_dst`, ensuring there are enough blocks to keep the GPU busy.