b7619
📦 llama-cpp
✨ 1 feature 🔧 3 symbols
Summary
This release introduces a CUDA optimization to reduce memory overhead by conditionally allocating the Flash Attention temporary buffer. It includes a wide range of pre-built binaries for multiple operating systems and hardware architectures.
✨ New Features
- Optimized CUDA memory allocation by only allocating the Flash Attention (FA) temporary buffer when required.
🔧 Affected Symbols
- CUDA
- Flash Attention
- FA tmp buffer