
v0.1.405-beta

📦 unsloth
2 breaking · 16 features · 🐛 23 fixes · 🔧 15 symbols

Summary

This release delivers up to 2x faster GGUF inference via MTP speculative decoding and adds broad API provider support with built-in tools such as web search and code execution. Security is hardened across the Studio platform, and the release includes numerous training and installer reliability fixes.

⚠️ Breaking Changes

  • Removed the `torch.load` fallback on `training_args.bin`. Untrusted pickles can no longer execute on model load; workflows that relied on this fallback may break.
  • Cross-family GGUF projectors are now blocked in flat local directories, which prevents loading the wrong vision tower. Loading cross-family models from a flat directory may now fail.
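
The first breaking change exists because unpickling can run arbitrary code. As a general illustration only (this is not Unsloth's implementation), the sketch below uses the standard-library `pickle.Unpickler.find_class` hook to refuse any global that is not on an allow-list. The `ALLOWED` set, the `safe_loads` helper, and the `Payload` class are hypothetical names introduced for this example.

```python
import io
import pickle

# Hypothetical allow-list: only these (module, name) globals may be resolved.
ALLOWED = {("builtins", "dict"), ("builtins", "list"), ("builtins", "set")}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that rejects every global not explicitly allow-listed."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def safe_loads(data: bytes):
    """Deserialize `data`, failing loudly on any disallowed global."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    """Stands in for a hostile object: a normal unpickle would call os.system."""

    def __reduce__(self):
        import os
        return (os.system, ("echo pwned",))


# Plain containers and numbers need no globals, so they load fine.
plain = pickle.dumps({"lr": 2e-4, "epochs": 3})
# Serializing Payload only records a reference to os.system; nothing runs here.
hostile = pickle.dumps(Payload())
```

Calling `safe_loads(plain)` returns the dictionary, while `safe_loads(hostile)` raises `pickle.UnpicklingError` instead of running the command. PyTorch exposes a related restriction through `torch.load(..., weights_only=True)`.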

Migration Steps

  1. If you relied on the `torch.load` fallback for `training_args.bin`, update your loading mechanism, and only unpickle files you trust.
  2. If you loaded cross-family GGUF models from flat local directories, reorganize the directory structure or make sure the correct projector is loaded for each model.
  3. Make sure you are on `v0.1.405-beta` or newer, not `v0.1.40-beta`.
  4. If you need a custom installation path for Unsloth Studio, set `STUDIO_HOME` or `UNSLOTH_STUDIO_HOME`.
  5. On Windows, CUDA installation should be more reliable: the paired `cudart` bundle and the Torch NVIDIA DLL paths are now added to `PATH`.
  6. On Ubuntu 24.04, when building from HIP source, make sure `--gcc-install-dir` is injected if necessary.
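
Step 4 can be checked from a small launcher script before starting Studio. The sketch below only reads the two environment variables named above. The precedence between them (`UNSLOTH_STUDIO_HOME` first) is an assumption made for illustration, not documented behavior, and `resolve_studio_home` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_studio_home(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the custom install path if either variable is set, else None.

    Assumption: UNSLOTH_STUDIO_HOME is checked before STUDIO_HOME.
    """
    env = os.environ if env is None else env
    for name in ("UNSLOTH_STUDIO_HOME", "STUDIO_HOME"):
        value = env.get(name, "").strip()
        if value:
            # Expand a leading "~" so shell-style paths work as expected.
            return Path(value).expanduser()
    return None  # neither variable set: the installer's default applies
```

For example, `resolve_studio_home({"STUDIO_HOME": "/opt/studio"})` returns `Path("/opt/studio")`, and an empty mapping returns `None`.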

✨ New Features

  • Approximately 2x faster GGUF inference enabled by automatically activated MTP (speculative decoding).
  • API support for providers such as OpenAI and Anthropic, with automatic prompt caching, web search, and code execution.
  • Ability to connect to external inference backends: vLLM, Ollama, and llama-server.
  • Experimental MLX inference support for running quants and models locally on Mac machines.
  • Proper support for non-English languages (e.g., Japanese, Chinese) when composing and sending prompts.
  • Built-in web search functionality for OpenAI, Anthropic, OpenRouter, and Kimi.
  • Built-in code execution for OpenAI and Anthropic (Anthropic containers are persisted and reused across turns).
  • Prompt caching enabled for OpenAI and Anthropic models, potentially saving 50 to 90% of costs.
  • API key is now optional for local providers (llama.cpp / vLLM / Ollama).
  • Auto-loading of models when adding a cloud provider.
  • OpenDocument chat attachments support in Unsloth Studio.
  • New Continued Pretraining (CPT) training method available as a first-class option.
  • Opt-in fused `lm_head` + cross-entropy forward path controlled by `UNSLOTH_RETURN_LOGITS=1` environment variable.
  • Custom install paths supported via `STUDIO_HOME` or `UNSLOTH_STUDIO_HOME` environment variables.
  • CPU-only Linux x86_64 systems are now routed to `ggml-org/llama.cpp` prebuilts.
  • New `unsloth --version` command-line flag.
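
The speed-up in the first bullet comes from speculative decoding: a cheap draft step proposes several tokens and the full model checks them together. The toy below is a minimal greedy sketch of that idea, not llama.cpp's or Unsloth's MTP implementation. `target_next`, `draft_next`, `speculative_decode`, and `greedy_decode` are hypothetical stand-ins that use integers as tokens.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Toy 'full model': the next token is a deterministic function of the context."""
    return (sum(ctx) * 31 + len(ctx)) % 11


def draft_next(ctx: List[Token]) -> Token:
    """Toy 'draft head': agrees with the target unless len(ctx) is a multiple of 4."""
    guess = target_next(ctx)
    return (guess + 1) % 11 if len(ctx) % 4 == 0 else guess


def speculative_decode(prompt: List[Token], n_new: int, k: int,
                       target: NextFn, draft: NextFn) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        # 1) The draft proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target verifies the proposal (one batched pass in a real system).
        rounds += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # bonus token when all k match
        out.extend(accepted)
    return out[: len(prompt) + n_new], rounds


def greedy_decode(prompt: List[Token], n_new: int, target: NextFn) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out
```

Because every accepted token is checked against the target, the output matches plain greedy decoding. The saving comes from needing fewer verification rounds than generated tokens whenever the draft is usually right.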

🐛 Bug Fixes

  • Fixed silent wrong saves by implementing layout-aware MoE LoRA merge with loud-fail on fallback.
  • Fixed `num_logits_to_keep` regression when using transformers >= 4.52.
  • Preserved tokenizer EOS token on merged saves.
  • Fixed PEFT checkpoint resumption under sentence-transformers >= 5.4.
  • Restored Flash > SDPA > Flex attention priority for non-Gemma3 models.
  • Fixed ORPO text-only tokenization when used with processors.
  • Fixed embedding matrix size mismatch issues.
  • Fixed Vicuna chat template.
  • Unified legacy and new logits kwargs in `fast_generate` (fixing Mistral merge site issue).
  • Made `higher_precision_softmax` idempotent.
  • Patched every `LOSS_MAPPING` key aliased to `ForCausalLMLoss` (covering transformers 5.x).
  • Fixed GGUF converter sibling imports.
  • Added UTF-8 encoding to all text-mode file operations.
  • Fixed serialization of GGUF reload and inheritance of `unsloth-run` extra arguments.
  • Fixed `/recommended-folders` 500 error on unreadable model directories under Python 3.12+.
  • Fixed authentication rate-limiting to be proxy-aware.
  • Hardened the IME composer and fixed RTL `dir="auto"` issues.
  • Fixed long log-line truncation.
  • Fixed tool reasoning trace rendering in UI.
  • Fixed Gemma attention mask issues during training.
  • Fixed Gemma-4 MoE LoRA extractor registration to resolve `grouped_mm` contraction crash.
  • Fixed silent CPU fallback when GPU headroom is low; a warning is now shown.
  • Fixed issues caused by a stale bundled llama.cpp prebuilt when MTP is used.
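
The UTF-8 fix above addresses a common portability problem. In current Python versions, text-mode `open()` without an `encoding` argument uses a locale-dependent default, which can fail on non-ASCII content on some systems, notably Windows. The following is a minimal sketch of the explicit form, using a hypothetical file name; it does not reproduce Unsloth's own code.

```python
import tempfile
from pathlib import Path

text = "こんにちは, 你好, héllo"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "prompt.txt"  # hypothetical file name
    # Passing encoding explicitly makes the result independent of the OS locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()
```

With the encoding pinned on both the write and the read, `round_tripped` equals `text` regardless of the platform default.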

Affected Symbols