b7619

📦 llama-cpp

✨ 1 feature · 🔧 3 symbols

Summary

This release introduces a CUDA optimization that reduces memory overhead by allocating the Flash Attention (FA) temporary buffer only when it is required. Pre-built binaries are included for multiple operating systems and hardware architectures.

✨ New Features

  • Optimized CUDA memory allocation by only allocating the Flash Attention (FA) temporary buffer when required.

🔧 Affected Symbols

  • CUDA
  • Flash Attention
  • FA tmp buffer