
b8680

📦 llama-cpp

Summary

This release introduces an optimized flash_attn_stream_k_fixup CUDA kernel that improves performance when the stream-k block count is a multiple of the number of destination tiles. It also ships updated pre-compiled binaries for multiple operating systems and hardware targets.

✨ New Features

  • Implemented an optimized flash_attn_stream_k_fixup kernel for CUDA.
  • The new kernel is specialized for cases where nblocks_stream_k is a multiple of ntiles_dst.
  • Added logic to make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst.
  • The new kernel is conditionally used only when nblocks_stream_k_raw > 4 * ntiles_dst to ensure sufficient GPU concurrency.
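The dispatch conditions above can be sketched as host-side logic. This is an illustrative reconstruction, not the actual llama.cpp code: the function name, the `StreamKConfig` struct, and the assumption that the block count is rounded *down* are all hypothetical; only the two thresholds (`> 2 * ntiles_dst` for rounding, `> 4 * ntiles_dst` for using the specialized kernel) come from the release notes.

```cpp
#include <cassert>

// Hypothetical sketch of the stream-k fixup dispatch described above.
// Names are illustrative; thresholds follow the release notes.
struct StreamKConfig {
    int  nblocks_stream_k;  // possibly rounded to a multiple of ntiles_dst
    bool use_fixup_fast;    // whether the specialized fixup kernel runs
};

StreamKConfig choose_stream_k_config(int nblocks_stream_k_raw, int ntiles_dst) {
    StreamKConfig cfg{nblocks_stream_k_raw, false};
    // Round down (assumed direction) to a multiple of ntiles_dst when there
    // are enough blocks, so each destination tile fixes up an equal number
    // of partial results.
    if (nblocks_stream_k_raw > 2 * ntiles_dst) {
        cfg.nblocks_stream_k = (nblocks_stream_k_raw / ntiles_dst) * ntiles_dst;
    }
    // The specialized kernel is only worthwhile with sufficient GPU
    // concurrency, so it requires a stricter threshold.
    if (nblocks_stream_k_raw > 4 * ntiles_dst) {
        cfg.use_fixup_fast = true;
    }
    return cfg;
}
```

For example, with `ntiles_dst = 8` and a raw block count of 100, the count rounds to 96 and the specialized kernel is eligible; with a raw count of 10, neither condition triggers and the generic path is used.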

Affected Symbols