Change8

v0.11.5

📦 ollama
✨ 6 features · 🐛 2 fixes · 🔧 5 symbols

Summary

This release introduces significant memory management improvements for GPU scheduling and multi-GPU setups, alongside performance optimizations for gpt-oss models and reduced installation sizes.

Migration Steps

  1. To opt in to the new memory estimation logic, set the environment variable OLLAMA_NEW_ESTIMATES=1 when starting the server: OLLAMA_NEW_ESTIMATES=1 ollama serve
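
For example, the variable can be set per invocation or exported for the shell session (a minimal sketch; OLLAMA_NEW_ESTIMATES is the opt-in variable named above, and ollama serve is the standard server command):

```shell
# Opt in to the new memory estimation logic for this server process only
OLLAMA_NEW_ESTIMATES=1 ollama serve

# Or export it so every subsequent server start picks it up
export OLLAMA_NEW_ESTIMATES=1
ollama serve
```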

✨ New Features

  • Improved memory management for GPU model scheduling, leading to better VRAM utilization and fewer OOM errors.
  • Improved multi-GPU scheduling and reduced VRAM allocation for setups with more than 2 GPUs.
  • The Ollama app now persists default selections for model, Turbo, and Web Search across restarts.
  • Flash attention can now be enabled for pure-CPU models using OLLAMA_FLASH_ATTENTION=1.
  • Performance improvements for gpt-oss models.
  • Reduced installation size on Windows and Linux platforms.
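
The flash attention toggle mentioned above works like the other server environment variables; a minimal sketch for a CPU-only host (assumes a model has already been pulled):

```shell
# Enable flash attention for models running purely on CPU
OLLAMA_FLASH_ATTENTION=1 ollama serve
```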

🐛 Bug Fixes

  • Fixed an error when parsing malformed harmony tool calls.
  • The OpenAI-compatible API now supports the reasoning_effort parameter.
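
With that fix, reasoning_effort can be passed through the OpenAI-compatible endpoint. A hedged sketch using curl: the localhost:11434 address and the /v1/chat/completions path follow Ollama's documented defaults, the model name gpt-oss is an assumption, and the low/medium/high values follow the OpenAI convention for this parameter.

```shell
# Send a chat completion request with reduced reasoning effort
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "reasoning_effort": "low"
      }'
```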

🔧 Affected Symbols

  • gpt-oss
  • OLLAMA_NEW_ESTIMATES
  • OLLAMA_FLASH_ATTENTION
  • OpenAI-compatible API
  • reasoning_effort