b8680
📦 llama-cpp
✨ 4 features · 🔧 1 symbol
Summary
This release introduces an optimized `flash_attn_stream_k_fixup` CUDA kernel, specialized for the case where the stream-k block count is a multiple of the destination tile count. It also provides updated pre-compiled binaries for multiple operating systems and hardware targets.
✨ New Features
- Implemented an optimized `flash_attn_stream_k_fixup` kernel for CUDA.
- The new kernel is specialized for the case where `nblocks_stream_k` is a multiple of `ntiles_dst`.
- Added logic to round `nblocks_stream_k` to a multiple of `ntiles_dst` when `nblocks_stream_k > 2 * ntiles_dst`.
- The new kernel is used only when `nblocks_stream_k_raw > 4 * ntiles_dst`, ensuring there are enough blocks to keep the GPU busy.