ollama v0.12.4
⚠ 2 breaking · ✨ 3 features · 🐛 5 fixes · 🔧 5 symbols
Summary
This release enables Flash Attention by default for Qwen 3 models and improves VRAM detection, while dropping support for older macOS versions and specific AMD GPU architectures.
⚠️ Breaking Changes
- macOS 12 Monterey and macOS 13 Ventura are no longer supported.
- AMD gfx900 and gfx906 (MI50, MI60, etc.) GPUs are no longer supported via ROCm.
Migration Steps
- Users on macOS 12 or 13 must upgrade their OS to a supported version (macOS 14+).
- Users with AMD gfx900/gfx906 GPUs should monitor for future Vulkan support or remain on a previous version.
- If Qwen 3 models misbehave with the new default, set OLLAMA_FLASH_ATTENTION=0 to disable flash attention (see the sketch below).
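
If you manage the server process yourself, the override can be applied at launch. Below is a minimal Python sketch, assuming the `ollama` binary is on your PATH; most setups will instead export the variable in a shell profile or service file.

```python
import os
import subprocess

# Copy the current environment and disable flash attention before
# starting the Ollama server; equivalent to running
# `OLLAMA_FLASH_ATTENTION=0 ollama serve` from a shell.
env = os.environ.copy()
env["OLLAMA_FLASH_ATTENTION"] = "0"

# Launch the server with the override (assumes `ollama` is on PATH).
subprocess.run(["ollama", "serve"], env=env)
```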
✨ New Features
- Flash attention is now enabled by default for Qwen 3 and Qwen 3 Coder.
- Added an OLLAMA_FLASH_ATTENTION environment variable override: set it to 0 to disable flash attention.
- Improved VRAM detection reliability and accuracy.
🐛 Bug Fixes
- Fixed minor memory estimation issues when scheduling models on NVIDIA GPUs.
- Fixed /api/chat and /api/generate accepting keep_alive values inconsistently; both endpoints now accept the same formats (see the sketch after this list).
- Fixed tool calling rendering issues with qwen3-coder.
- Fixed a crash that occurred when templates were not correctly defined.
- Fixed memory calculations on NVIDIA iGPUs.
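
To confirm the keep_alive behavior against a local server, here is a small sketch using the REST API, assuming the default address http://localhost:11434 and a pulled qwen3-coder model (any installed model works):

```python
import requests

BASE = "http://localhost:11434"  # default Ollama server address (assumption)
MODEL = "qwen3-coder"            # any locally pulled model works here

# Both endpoints should now accept keep_alive in the same formats,
# e.g. a duration string such as "10m" or a number of seconds.
generate = requests.post(f"{BASE}/api/generate", json={
    "model": MODEL,
    "prompt": "Say hello.",
    "stream": False,
    "keep_alive": "10m",
})
chat = requests.post(f"{BASE}/api/chat", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": False,
    "keep_alive": "10m",
})
print(generate.status_code, chat.status_code)
```

keep_alive also accepts 0 to unload a model immediately after the request and -1 to keep it loaded indefinitely.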
🔧 Affected Symbols
- /api/chat
- /api/generate
- OLLAMA_FLASH_ATTENTION
- qwen3-coder
- ROCm