August-2025-v2
📦 unsloth · View on GitHub
✨ 7 features · 🐛 9 fixes · 🔧 6 symbols
Summary
This release introduces Unsloth Flex Attention for gpt-oss training, drastically improving context length, VRAM efficiency, and speed. Numerous bug fixes and support for new models/features like QAT + LoRA are also included.
Migration Steps
- Follow the updated vLLM installation instructions for Blackwell GPUs if you are using the latest vLLM release.
✨ New Features
- Introduced Unsloth Flex Attention support for OpenAI gpt-oss training, enabling >8× longer context lengths, >50% less VRAM usage, and >1.5× faster training.
- Unsloth Flex Attention allows training with a 60K context length on 80GB of VRAM for BF16 LoRA (a loading sketch follows this list).
- Added the ability to export/save QLoRA fine-tuned gpt-oss models to llama.cpp, vLLM, or Hugging Face (an export sketch follows this list).
- Added support for Qwen3 Instruct / Thinking chat templates (a chat-template sketch follows this list).
- Added support for Qwen3 4B to `mapper.py`.
- Added support for QAT + LoRA.
- Allowed the `torch.float32` dtype in `FastLanguageModel`.
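
The gpt-oss context-length and dtype changes above go through the usual `FastLanguageModel` entry point. The sketch below is illustrative rather than a verified recipe: the model id `unsloth/gpt-oss-20b` and the LoRA hyperparameters are assumptions, and the 60K sequence length assumes an 80GB GPU with BF16 LoRA as described above.

```python
from unsloth import FastLanguageModel

# Illustrative sketch: the model id and hyperparameters are assumptions, not
# taken from this release's notes. Unsloth applies Flex Attention for gpt-oss
# training automatically, per the feature description above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/gpt-oss-20b",   # assumed model id
    max_seq_length = 60_000,                  # ~60K context on 80GB VRAM (BF16 LoRA)
    dtype          = None,                    # auto-detect; torch.float32 is now accepted too
    load_in_4bit   = False,                   # BF16 LoRA rather than QLoRA
)

# Standard LoRA setup; rank and target modules are illustrative.
model = FastLanguageModel.get_peft_model(
    model,
    r              = 16,
    lora_alpha     = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
)
```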
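For the new gpt-oss export path, a minimal sketch using Unsloth's existing saving helpers is below. The output directories and the `save_method` / `quantization_method` values are assumptions; check the Unsloth docs for the options actually supported for gpt-oss.

```python
# Minimal export sketch for a QLoRA fine-tuned gpt-oss model, continuing from
# the training sketch above. Directory names and method strings are assumptions.

# Merged 16-bit weights: loadable with Hugging Face transformers or vLLM.
model.save_pretrained_merged("gpt-oss-finetuned", tokenizer, save_method = "merged_16bit")

# GGUF export: loadable with llama.cpp.
model.save_pretrained_gguf("gpt-oss-finetuned", tokenizer, quantization_method = "q8_0")

# Or push directly to the Hugging Face Hub (the repo id is a placeholder).
# model.push_to_hub_merged("your-username/gpt-oss-finetuned", tokenizer, save_method = "merged_16bit")
```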
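The Qwen3 chat-template support can be applied with `get_chat_template`. The template key used below is an assumption; check `unsloth.chat_templates` for the exact names shipped in this release.

```python
from unsloth.chat_templates import get_chat_template

# The key "qwen-3" is an assumption; the release may expose separate Instruct /
# Thinking variants, so check unsloth.chat_templates for the exact names.
tokenizer = get_chat_template(tokenizer, chat_template = "qwen-3")

messages = [{"role": "user", "content": "Summarise this release in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)
print(prompt)
```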
🐛 Bug Fixes
- Fixed gpt-oss training losses going to infinity on float16 GPUs (like T4 Colab).
- Fixed gpt-oss implementation issues, ensuring `swiglu_limit = 7.0` is properly applied during MXFP4 inference [in transformers](https://github.com/huggingface/transformers/pull/40197).
- Fixed potential generator exhaustion bug in model loading file detection.
- Fixed the `quantization_method` error type for vision model GGUF export.
- Fixed the `original_push_to_hub` fallback.
- Fixed a typo in the transformers extras in `pyproject.toml`.
- Fixed the `is_causal` setting for Qwen3.
- Fixed gemma-3n issues.
- Handled transformers' move from `torch_dtype` to `dtype` (a version-gated sketch follows this list).
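
The `torch_dtype` to `dtype` handling matters if your own code passes a dtype keyword to transformers. A version-gated sketch is below; the cutoff version is an assumption, so adjust it to your installed transformers release.

```python
import torch
import transformers
from packaging.version import Version
from transformers import AutoModelForCausalLM

# transformers renamed the `torch_dtype` keyword to `dtype`; the cutoff version
# used here is an assumption, so adjust it for your installed release.
kwarg = "dtype" if Version(transformers.__version__) >= Version("4.56.0") else "torch_dtype"
model = AutoModelForCausalLM.from_pretrained("gpt2", **{kwarg: torch.bfloat16})
```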
🔧 Affected Symbols
gpt-oss, FastLanguageModel, swiglu_limit, Qwen3 Instruct, Qwen3 4B, gemma-3n