v0.11.5
📦 ollama
✨ 6 features · 🐛 2 fixes · 🔧 5 symbols
Summary
This release introduces significant memory management improvements for GPU scheduling and multi-GPU setups, alongside performance optimizations for gpt-oss models and reduced installation sizes.
Migration Steps
- To opt in to the new memory estimation logic, set the OLLAMA_NEW_ESTIMATES environment variable when starting the server: OLLAMA_NEW_ESTIMATES=1 ollama serve
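On Linux installs managed by systemd, the same opt-in can be made persistent rather than set per invocation. A sketch, assuming the standard ollama.service unit created by the install script:

```shell
# Open an override file for the service (launches an editor)
systemctl edit ollama.service
# In the override, add:
#   [Service]
#   Environment="OLLAMA_NEW_ESTIMATES=1"
# Then reload unit files and restart the server:
systemctl daemon-reload
systemctl restart ollama
```

Unsetting the variable (or removing the override) reverts to the previous estimation behavior.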
✨ New Features
- Improved memory management for GPU model scheduling, leading to better VRAM utilization and fewer OOM errors.
- Improved multi-GPU scheduling and reduced VRAM allocation for setups with more than 2 GPUs.
- The Ollama app now persists default selections for model, Turbo, and Web Search across restarts.
- Flash attention can now be enabled for CPU-only inference using OLLAMA_FLASH_ATTENTION=1.
- Performance improvements for gpt-oss models.
- Reduced installation size on Windows and Linux platforms.
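Like the estimator opt-in, the CPU flash-attention path is toggled per server process via an environment variable. A minimal sketch, assuming ollama is on PATH:

```shell
# Enable flash attention for this server instance, including CPU-only inference
OLLAMA_FLASH_ATTENTION=1 ollama serve
```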
🐛 Bug Fixes
- Fixed an error when parsing malformed harmony-format tool calls.
- The OpenAI-compatible API now accepts the reasoning_effort parameter.
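A sketch of exercising the fixed parameter through the OpenAI-compatible endpoint. It assumes a local server on the default port 11434; the model name and prompt are illustrative:

```shell
# Pass reasoning_effort through the OpenAI-compatible chat completions endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "reasoning_effort": "low"
  }'
```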
🔧 Affected Symbols
- gpt-oss
- OLLAMA_NEW_ESTIMATES
- OLLAMA_FLASH_ATTENTION
- OpenAI-compatible API
- reasoning_effort