Changelog

August-2025-v2

📦 unsloth
✨ 7 features · 🐛 9 fixes · 🔧 6 symbols

Summary

This release introduces Unsloth Flex Attention for gpt-oss training, substantially improving context length, VRAM efficiency, and training speed. It also includes numerous bug fixes and support for new models and features such as QAT + LoRA.

Migration Steps

  1. Follow the updated vLLM installation instructions for Blackwell GPUs if using the latest vLLM release.

✨ New Features

  • Introduced Unsloth Flex Attention support for OpenAI gpt-oss training, enabling >8× longer context lengths, >50% less VRAM usage, and >1.5× faster training.
  • Unsloth Flex Attention allows training with 60K context length on 80GB VRAM for BF16 LoRA.
  • Added ability to export/save QLoRA fine-tuned gpt-oss models to llama.cpp, vLLM, or HF (see the sketch after this list).
  • Added support for Qwen3 Instruct / Thinking chat templates.
  • Added support for Qwen3 4B to mapper.py.
  • Added support for QAT + LoRA.
  • Allowed torch.float32 dtype in FastLanguageModel.
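
The items above map onto the usual Unsloth workflow. As a minimal sketch (not the release's verbatim example; the checkpoint name, context length, LoRA settings, and save paths are assumptions), loading gpt-oss for long-context QLoRA training and exporting afterwards looks roughly like this:

```python
import torch
from unsloth import FastLanguageModel

# Load gpt-oss for QLoRA training; Unsloth Flex Attention handles the long context.
# The checkpoint name and context length below are illustrative assumptions.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",   # assumed checkpoint name
    max_seq_length=60_000,              # long context enabled by Flex Attention
    dtype=None,                         # auto-detect; torch.float32 is now also accepted
    load_in_4bit=True,                  # QLoRA
)

# Attach LoRA adapters (rank and target modules are illustrative, not prescribed defaults).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# ... fine-tune with your trainer of choice (e.g. TRL's SFTTrainer) ...

# Export the fine-tuned model; output directory names are placeholders.
model.save_pretrained_merged("gpt-oss-finetuned", tokenizer)     # HF / vLLM
model.save_pretrained_gguf("gpt-oss-finetuned-gguf", tokenizer)  # llama.cpp
```

For the Qwen3 Instruct / Thinking templates, the usual pattern is `get_chat_template`; the template identifier and model id below are placeholders, not confirmed names:

```python
from transformers import AutoTokenizer
from unsloth.chat_templates import get_chat_template

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")  # assumed model id
tokenizer = get_chat_template(tokenizer, chat_template="qwen3-instruct")  # placeholder template name
```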

🐛 Bug Fixes

  • Fixed gpt-oss training losses going to infinity on float16 GPUs (like T4 Colab).
  • Fixed gpt-oss implementation issues, ensuring `swiglu_limit = 7.0` is properly applied during MXFP4 inference [in transformers](https://github.com/huggingface/transformers/pull/40197).
  • Fixed potential generator exhaustion bug in model loading file detection.
  • Fixed the `quantization_method` error type for vision model GGUF export.
  • Fixed the `original_push_to_hub` fallback.
  • Fixed a typo in the transformers extras in pyproject.toml.
  • Fixed the `is_causal` setting for Qwen3.
  • Fixed gemma-3n issues.
  • Handled transformers' move from `torch_dtype` to `dtype` (see the sketch after this list).
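
For context on that last fix: recent transformers releases prefer a `dtype=` argument over the older `torch_dtype=` in `from_pretrained`. A minimal sketch (the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# Newer transformers versions accept `dtype=`; `torch_dtype=` is the legacy spelling.
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",   # placeholder model id
    dtype=torch.bfloat16,    # formerly: torch_dtype=torch.bfloat16
)
```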

🔧 Affected Symbols

gpt-oss · `FastLanguageModel` · `swiglu_limit` · Qwen3 Instruct · Qwen3 4B · gemma-3n