
v0.12.4

📦 ollama · View on GitHub →
2 breaking · 3 features · 🐛 5 fixes · 🔧 5 symbols

Summary

This release enables Flash Attention by default for Qwen 3 models and improves VRAM detection, while dropping support for macOS 12/13 and for AMD gfx900/gfx906 GPUs via ROCm.

⚠️ Breaking Changes

  • macOS 12 Monterey and macOS 13 Ventura are no longer supported.
  • AMD gfx900 and gfx906 GPUs (MI50, MI60, etc.) are no longer supported via ROCm.

Migration Steps

  1. Users on macOS 12 or 13 must upgrade their OS to a supported version (macOS 14+).
  2. Users with AMD gfx900/gfx906 GPUs should monitor for future Vulkan support or remain on a previous version.
  3. If you experience issues with Qwen 3 models, set OLLAMA_FLASH_ATTENTION=0 to disable the new default behavior (a sketch after this list shows one way to set it when starting the server).
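
Because the server reads its environment at startup, the override in step 3 must be set on the `ollama serve` process rather than on the client. A minimal Python sketch, assuming the ollama binary is on PATH (running `OLLAMA_FLASH_ATTENTION=0 ollama serve` in a shell is equivalent):

```python
import os
import subprocess

# Copy the current environment and disable flash attention for the server.
# OLLAMA_FLASH_ATTENTION=0 overrides the new on-by-default behavior for Qwen 3 models.
env = os.environ.copy()
env["OLLAMA_FLASH_ATTENTION"] = "0"

# The variable must be visible to `ollama serve`, not to the client sending requests.
subprocess.run(["ollama", "serve"], env=env)
```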

✨ New Features

  • Flash attention is now enabled by default for Qwen 3 and Qwen 3 Coder.
  • Added an OLLAMA_FLASH_ATTENTION environment variable override: set it to 0 to disable flash attention.
  • Improved VRAM detection reliability and accuracy (the sketch after this list shows one way to check how much of a loaded model is resident in VRAM).
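
With the VRAM detection changes, it can be useful to confirm how much of a loaded model actually ended up in GPU memory. A small sketch against the local API, assuming the default server address http://localhost:11434 and the size / size_vram fields returned by /api/ps:

```python
import json
import urllib.request

# Ask the running server which models are loaded and how much VRAM they occupy.
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    data = json.load(resp)

for model in data.get("models", []):
    size = model.get("size", 0)            # total model size in bytes
    size_vram = model.get("size_vram", 0)  # portion resident in VRAM, in bytes
    pct = (size_vram / size * 100) if size else 0.0
    print(f"{model.get('name')}: {size_vram / 2**30:.1f} GiB of "
          f"{size / 2**30:.1f} GiB in VRAM ({pct:.0f}%)")
```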

🐛 Bug Fixes

  • Fixed minor memory estimation issues when scheduling models on NVIDIA GPUs.
  • Fixed inconsistent acceptance of keep_alive values between the /api/chat and /api/generate endpoints (the sketch after this list sends the same keep_alive to both).
  • Fixed tool calling rendering issues with qwen3-coder.
  • Fixed a crash occurring when templates were not correctly defined.
  • Fixed memory calculations on NVIDIA iGPUs.
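
With the keep_alive fix, both endpoints should accept the same value formats, for example a duration string such as "5m" or a number of seconds. A short sketch that sends an identical keep_alive to /api/generate and /api/chat, assuming a local server and that a qwen3-coder model has been pulled:

```python
import json
import urllib.request

BASE = "http://localhost:11434"
KEEP_ALIVE = "5m"  # same keep_alive format for both endpoints

def post(path: str, payload: dict) -> dict:
    # Minimal JSON POST helper against the local ollama API.
    req = urllib.request.Request(
        f"{BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# /api/generate: prompt-style completion carrying keep_alive.
gen = post("/api/generate", {
    "model": "qwen3-coder",
    "prompt": "Write a haiku about GPUs.",
    "keep_alive": KEEP_ALIVE,
    "stream": False,
})
print(gen["response"])

# /api/chat: chat-style request carrying the same keep_alive value.
chat = post("/api/chat", {
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "keep_alive": KEEP_ALIVE,
    "stream": False,
})
print(chat["message"]["content"])
```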

🔧 Affected Symbols

  • /api/chat
  • /api/generate
  • OLLAMA_FLASH_ATTENTION
  • qwen3-coder
  • ROCm