b7619
📦 llama-cpp
✨ 1 feature 🔧 3 symbols
Summary
This release introduces a CUDA optimization to reduce memory overhead by conditionally allocating the Flash Attention temporary buffer. It includes a wide range of pre-built binaries for multiple operating systems and hardware architectures.
✨ New Features
- Optimized CUDA memory allocation by only allocating the Flash Attention (FA) temporary buffer when required.
🔧 Affected Symbols
- CUDA
- Flash Attention
- FA tmp buffer