v0.11.5
📦 ollama
✨ 6 features · 🐛 2 fixes · 🔧 5 symbols
Summary
This release introduces significant memory management improvements for GPU scheduling and multi-GPU setups, alongside performance optimizations for gpt-oss models and reduced installation sizes.
Migration Steps
- To opt in to the new memory estimation logic, set the OLLAMA_NEW_ESTIMATES environment variable when starting the server: OLLAMA_NEW_ESTIMATES=1 ollama serve
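On Linux installs managed by systemd, the same opt-in can be made persistent rather than set per invocation. A sketch, assuming the standard ollama.service unit created by the install script:

```shell
# Open an override file for the service (launches an editor)
systemctl edit ollama.service
# In the override, add:
#   [Service]
#   Environment="OLLAMA_NEW_ESTIMATES=1"
# Then reload unit files and restart the server:
systemctl daemon-reload
systemctl restart ollama
```

Unsetting the variable (or removing the override) reverts to the previous estimation behavior.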
✨ New Features
- Improved memory management for GPU model scheduling, leading to better VRAM utilization and fewer OOM errors.
- Improved multi-GPU scheduling and reduced VRAM allocation for setups with more than 2 GPUs.
- The Ollama app now persists default selections for model, Turbo, and Web Search across restarts.
- Flash attention can now be enabled for CPU-only inference using OLLAMA_FLASH_ATTENTION=1.
- Performance improvements for gpt-oss models.
- Reduced installation size on Windows and Linux platforms.
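Like the estimator opt-in, the CPU flash-attention path is toggled per server process via an environment variable. A minimal sketch, assuming ollama is on PATH:

```shell
# Enable flash attention for this server instance, including CPU-only inference
OLLAMA_FLASH_ATTENTION=1 ollama serve
```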
🐛 Bug Fixes
- Fixed an error when parsing malformed harmony-format tool calls.
- The OpenAI-compatible API now accepts the reasoning_effort parameter.
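A sketch of exercising the fixed parameter through the OpenAI-compatible endpoint. It assumes a local server on the default port 11434; the model name and prompt are illustrative:

```shell
# Pass reasoning_effort through the OpenAI-compatible chat completions endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "reasoning_effort": "low"
  }'
```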
🔧 Affected Symbols
- gpt-oss
- OLLAMA_NEW_ESTIMATES
- OLLAMA_FLASH_ATTENTION
- OpenAI-compatible API
- reasoning_effort