b9274
llama-cpp
Summary
This release fixes a critical VRAM leak that occurred when the server went through sleep/resume cycles with Multi-Token Prediction (MTP) models. The fix improves resource cleanup in the destroy() function. The release also ships pre-compiled binaries for a range of platforms and hardware configurations.
🐛 Bug Fixes
- Fixed a VRAM leak across server sleep/resume cycles for MTP models by ensuring that the speculative decoder (spec), the draft context (ctx_dft), and the draft model (model_dft) are properly freed.
- Enforced a safe cleanup order in destroy(): spec, ctx_dft, and model_dft are now reset before llama_init, so they are released before the objects owned by llama_init, preventing potential use-after-free errors.
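The cleanup order above can be illustrated with a minimal C++ sketch. It assumes that the four members are held in `std::unique_ptr` and that "resetting" means calling `reset()`. `Model`, `Context`, `SpecDecoder`, `InitResult`, `ServerContext`, `load()`, and `g_release_log` are hypothetical stand-ins; only the member names `spec`, `ctx_dft`, `model_dft`, and `llama_init` come from the release note. The real server code uses different types and ownership.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical: records the order in which resources are released.
std::vector<std::string> g_release_log;

// Hypothetical stand-in for a loaded model (weights in VRAM).
struct Model {
    std::string name;
    explicit Model(std::string n) : name(std::move(n)) {}
    ~Model() { g_release_log.push_back("model:" + name); }
};

// Hypothetical stand-in for an inference context; it borrows its model,
// so it must be released before that model.
struct Context {
    const Model* model;  // non-owning
    std::string name;
    Context(const Model* m, std::string n) : model(m), name(std::move(n)) {}
    ~Context() { g_release_log.push_back("ctx:" + name); }
};

// Hypothetical speculative decoder that borrows both the target and draft contexts.
struct SpecDecoder {
    Context* ctx_tgt;  // non-owning
    Context* ctx_dft;  // non-owning
    SpecDecoder(Context* t, Context* d) : ctx_tgt(t), ctx_dft(d) {
        assert(ctx_tgt != nullptr && ctx_dft != nullptr);
    }
    ~SpecDecoder() { g_release_log.push_back("spec"); }
};

// Hypothetical owner of the target model and context. Members are destroyed
// in reverse declaration order, so ctx is freed before model.
struct InitResult {
    std::unique_ptr<Model> model;
    std::unique_ptr<Context> ctx;
};

struct ServerContext {
    std::unique_ptr<InitResult> llama_init;
    std::unique_ptr<Model> model_dft;
    std::unique_ptr<Context> ctx_dft;
    std::unique_ptr<SpecDecoder> spec;

    // Hypothetical load step, run at startup and again on resume.
    void load() {
        llama_init = std::make_unique<InitResult>();
        llama_init->model = std::make_unique<Model>("target");
        llama_init->ctx = std::make_unique<Context>(llama_init->model.get(), "target");
        model_dft = std::make_unique<Model>("draft");
        ctx_dft = std::make_unique<Context>(model_dft.get(), "draft");
        spec = std::make_unique<SpecDecoder>(llama_init->ctx.get(), ctx_dft.get());
    }

    // Called on sleep: free dependents before the objects they borrow from.
    void destroy() {
        spec.reset();        // borrows both contexts, so it goes first
        ctx_dft.reset();     // borrows model_dft
        model_dft.reset();   // draft weights are freed here instead of leaking
        llama_init.reset();  // target context, then target model
    }
};
```

In this sketch, omitting the three draft resets would keep the draft objects, and in the real server their VRAM, alive while the server sleeps. Resetting `llama_init` first would free the target context while `spec` still points at it, so any destructor that touched it would read freed memory.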