
v0.12.4

📦 ollama · View on GitHub →
2 breaking · 3 features · 🐛 5 fixes · 🔧 5 symbols

Summary

This release enables Flash Attention by default for Qwen 3 models and improves VRAM detection, while dropping support for macOS 12/13 and for AMD gfx900/gfx906 GPUs via ROCm.

⚠️ Breaking Changes

  • macOS 12 Monterey and macOS 13 Ventura are no longer supported.
  • AMD gfx900 and gfx906 GPUs (MI50, MI60, etc.) are no longer supported via ROCm.

Migration Steps

  1. Users on macOS 12 or 13 must upgrade their OS to a supported version (macOS 14+).
  2. Users with AMD gfx900/gfx906 GPUs should monitor for future Vulkan support or remain on a previous version.
  3. If you experience issues with Qwen 3 models, set OLLAMA_FLASH_ATTENTION=0 to disable the new default behavior (a sketch after this list shows one way to set it when starting the server).
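
Because the server reads its environment at startup, the override in step 3 must be set on the `ollama serve` process rather than on the client. A minimal Python sketch, assuming the ollama binary is on PATH (running `OLLAMA_FLASH_ATTENTION=0 ollama serve` in a shell is equivalent):

```python
import os
import subprocess

# Copy the current environment and disable flash attention for the server.
# OLLAMA_FLASH_ATTENTION=0 overrides the new on-by-default behavior for Qwen 3 models.
env = os.environ.copy()
env["OLLAMA_FLASH_ATTENTION"] = "0"

# The variable must be visible to `ollama serve`, not to the client sending requests.
subprocess.run(["ollama", "serve"], env=env)
```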

✨ New Features

  • Flash attention is now enabled by default for Qwen 3 and Qwen 3 Coder.
  • Added an OLLAMA_FLASH_ATTENTION environment variable override: set it to 0 to disable flash attention.
  • Improved VRAM detection reliability and accuracy (the sketch after this list shows one way to check how much of a loaded model is resident in VRAM).
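
With the VRAM detection changes, it can be useful to confirm how much of a loaded model actually ended up in GPU memory. A small sketch against the local API, assuming the default server address http://localhost:11434 and the size / size_vram fields returned by /api/ps:

```python
import json
import urllib.request

# Ask the running server which models are loaded and how much VRAM they occupy.
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    data = json.load(resp)

for model in data.get("models", []):
    size = model.get("size", 0)            # total model size in bytes
    size_vram = model.get("size_vram", 0)  # portion resident in VRAM, in bytes
    pct = (size_vram / size * 100) if size else 0.0
    print(f"{model.get('name')}: {size_vram / 2**30:.1f} GiB of "
          f"{size / 2**30:.1f} GiB in VRAM ({pct:.0f}%)")
```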

🐛 Bug Fixes

  • Fixed minor memory estimation issues when scheduling models on NVIDIA GPUs.
  • Fixed inconsistent acceptance of keep_alive values between the /api/chat and /api/generate endpoints (the sketch after this list sends the same keep_alive to both).
  • Fixed tool calling rendering issues with qwen3-coder.
  • Fixed a crash occurring when templates were not correctly defined.
  • Fixed memory calculations on NVIDIA iGPUs.
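
With the keep_alive fix, both endpoints should accept the same value formats, for example a duration string such as "5m" or a number of seconds. A short sketch that sends an identical keep_alive to /api/generate and /api/chat, assuming a local server and that a qwen3-coder model has been pulled:

```python
import json
import urllib.request

BASE = "http://localhost:11434"
KEEP_ALIVE = "5m"  # same keep_alive format for both endpoints

def post(path: str, payload: dict) -> dict:
    # Minimal JSON POST helper against the local ollama API.
    req = urllib.request.Request(
        f"{BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# /api/generate: prompt-style completion carrying keep_alive.
gen = post("/api/generate", {
    "model": "qwen3-coder",
    "prompt": "Write a haiku about GPUs.",
    "keep_alive": KEEP_ALIVE,
    "stream": False,
})
print(gen["response"])

# /api/chat: chat-style request carrying the same keep_alive value.
chat = post("/api/chat", {
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "keep_alive": KEEP_ALIVE,
    "stream": False,
})
print(chat["message"]["content"])
```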

🔧 Affected Symbols

  • /api/chat
  • /api/generate
  • OLLAMA_FLASH_ATTENTION
  • qwen3-coder
  • ROCm