v0.1.405-beta
📦 unsloth
⚠ 2 breaking · ✨ 16 features · 🐛 23 fixes · 🔧 15 symbols
Summary
This release delivers up to 2x faster GGUF inference via MTP speculative decoding and adds broad API provider support with built-in tools such as web search and code execution. It also hardens security across the Studio platform and fixes numerous training and installer reliability issues.
⚠️ Breaking Changes
- Removed the `torch.load` fallback on `training_args.bin`. Untrusted pickles can no longer execute on model load, which may break workflows relying on this fallback.
- Cross-family GGUF projectors are now blocked in flat local directories, which prevents loading the wrong vision tower. Loading cross-family models from a flat directory may now fail.
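To illustrate why unpickling untrusted files is a risk, here is a minimal, generic sketch using only the Python standard library. It is not Unsloth's loading code: `RestrictedUnpickler`, `safe_loads`, and the `Malicious` payload are hypothetical names invented for this example. The pattern of overriding `find_class` follows the approach described in the Python `pickle` documentation.

```python
import io
import pickle


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global (class or function)."""

    def find_class(self, module, name):
        # Pickled callables are resolved through find_class, so raising here
        # blocks payloads that try to reach functions such as os.system.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def safe_loads(data: bytes):
    """Deserialize plain containers and scalars only."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Malicious:
    """Hypothetical payload: a normal pickle.loads would call print()."""

    def __reduce__(self):
        return (print, ("arbitrary code ran during unpickling",))


plain = pickle.dumps({"learning_rate": 2e-4, "epochs": 3})
payload = pickle.dumps(Malicious())

# Plain data needs no globals, so it loads normally.
loaded = safe_loads(plain)

# The payload references builtins.print and is rejected before it can run.
try:
    safe_loads(payload)
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

A plain `pickle.loads(payload)` would have executed the embedded call, which is the class of behavior the removed fallback could expose when the file comes from an untrusted source.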
Migration Steps
- If you rely on the `torch.load` fallback for `training_args.bin`, update your loading mechanism, since the fallback is no longer available. If you load the file yourself, do so only for pickles you trust.
- If you were loading cross-family GGUF models from flat local directories, reorganize your directory structure or make sure the projector being loaded matches the model family.
- If using older Unsloth versions, ensure you are using `v0.1.405-beta` or newer, not `v0.1.40-beta`.
- If installing or running Unsloth Studio, consider setting `STUDIO_HOME` or `UNSLOTH_STUDIO_HOME` if you require a custom installation path.
- Users on Windows should see a more reliable CUDA installation, because the paired `cudart` bundle and the Torch NVIDIA DLL paths are now added to `PATH`.
- Users on Ubuntu 24.04 building HIP source should ensure `--gcc-install-dir` is injected if necessary.
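The version note above is easy to get wrong when tags are compared as strings, because `v0.1.405-beta` and `v0.1.40-beta` share a prefix. The sketch below compares them numerically. The helper names `parse_version` and `is_new_enough` are hypothetical, the tag format is assumed to be `vMAJOR.MINOR.PATCH` with an optional suffix, and the caller is assumed to supply the tag string; the output format of `unsloth --version` is not assumed here.

```python
import re


def parse_version(tag: str) -> tuple[int, int, int]:
    """Turn a tag such as 'v0.1.405-beta' into (0, 1, 405)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-[0-9A-Za-z.]+)?", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


# Minimum version named in the migration steps.
MINIMUM = parse_version("v0.1.405-beta")


def is_new_enough(tag: str) -> bool:
    """Compare numerically so that patch 405 ranks above patch 40."""
    return parse_version(tag) >= MINIMUM
```

With these helpers, `is_new_enough("v0.1.40-beta")` is false and `is_new_enough("v0.1.405-beta")` is true, because the patch components are compared as the integers 40 and 405. The pre-release suffix is ignored in this sketch.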
✨ New Features
- Approximately 2x faster GGUF inference enabled by automatically activated MTP (speculative decoding).
- API support for providers like OpenAI, Anthropic, etc., featuring auto prompt caching, web search, and code execution.
- Ability to connect to external inference backends: vLLM, Ollama, and llama-server.
- Experimental MLX inference support for running quants and models locally on Mac machines.
- Proper support for non-English languages (e.g., Japanese, Chinese) when composing and sending prompts.
- Built-in web search functionality for OpenAI, Anthropic, OpenRouter, and Kimi.
- Built-in code execution for OpenAI and Anthropic (Anthropic containers are persisted and reused across turns).
- Prompt caching enabled for OpenAI and Anthropic models, potentially saving 50 to 90% of costs.
- API key is now optional for local providers (llama.cpp / vLLM / Ollama).
- Auto-loading of models when adding a cloud provider.
- OpenDocument chat attachments support in Unsloth Studio.
- New Continued Pretraining (CPT) training method available as a first-class option.
- Opt-in fused `lm_head` + cross-entropy forward path controlled by `UNSLOTH_RETURN_LOGITS=1` environment variable.
- Custom install paths supported via `STUDIO_HOME` or `UNSLOTH_STUDIO_HOME` environment variables.
- CPU-only Linux x86_64 systems are now routed to `ggml-org/llama.cpp` prebuilts.
- New `unsloth --version` command-line flag.
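As a rough worked example of the prompt-caching savings mentioned above, the sketch below computes the share of input-token cost saved for a single request. It assumes the quoted 50 to 90% range refers to a discount applied to cached input tokens, and it ignores any cache-write surcharge or output-token cost. The function names and all numbers are hypothetical and do not reflect any provider's actual pricing.

```python
def input_cost(tokens: int, cached_tokens: int,
               price_per_token: float, cache_discount: float) -> float:
    """Input cost when cached tokens are billed at a discounted rate."""
    if not 0 <= cached_tokens <= tokens:
        raise ValueError("cached_tokens must be between 0 and tokens")
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    uncached_tokens = tokens - cached_tokens
    return (uncached_tokens * price_per_token
            + cached_tokens * price_per_token * (1.0 - cache_discount))


def saving_fraction(tokens: int, cached_tokens: int, cache_discount: float) -> float:
    """Fraction of input cost saved relative to sending everything uncached."""
    price = 1.0  # the price cancels out, so any positive value works
    baseline = input_cost(tokens, 0, price, cache_discount)
    actual = input_cost(tokens, cached_tokens, price, cache_discount)
    return 1.0 - actual / baseline


# Hypothetical request: 10,000 input tokens, 8,000 of them served from cache,
# with cached tokens discounted by 90%.
example = saving_fraction(10_000, 8_000, 0.9)
```

Under these assumptions the saving equals the cached share times the discount, so the example comes to about 72%. The full discount is only reached when nearly the whole prompt is a cache hit.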
🐛 Bug Fixes
- Fixed silent wrong saves by implementing layout-aware MoE LoRA merge with loud-fail on fallback.
- Fixed `num_logits_to_keep` regression when using transformers >= 4.52.
- Preserved tokenizer EOS token on merged saves.
- Fixed PEFT checkpoint resumption under sentence-transformers >= 5.4.
- Restored Flash > SDPA > Flex attention priority for non-Gemma3 models.
- Fixed ORPO text-only tokenization when used with processors.
- Fixed embedding matrix size mismatch issues.
- Fixed Vicuna chat template.
- Unified legacy and new logits kwargs in `fast_generate` (fixing Mistral merge site issue).
- Made `higher_precision_softmax` idempotent.
- Patched every `LOSS_MAPPING` key aliased to `ForCausalLMLoss` (covering transformers 5.x).
- Fixed GGUF converter sibling imports.
- Added UTF-8 encoding to all text-mode file operations.
- Fixed serialization of GGUF reload and inheritance of `unsloth-run` extra arguments.
- Fixed `/recommended-folders` 500 error on unreadable model directories under Python 3.12+.
- Fixed authentication rate-limiting to be proxy-aware.
- Hardened the composer for IME input and fixed RTL `dir="auto"` issues.
- Fixed long log-line truncation.
- Fixed tool reasoning trace rendering in UI.
- Fixed Gemma attention mask issues during training.
- Fixed Gemma-4 MoE LoRA extractor registration to resolve `grouped_mm` contraction crash.
- Fixed the silent CPU fallback when GPU headroom is low, which now produces a warning.
- Fixed issues caused by a stale bundled llama.cpp prebuilt when MTP is used.
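The UTF-8 fix above addresses a general Python pitfall: text-mode `open()` without an `encoding` argument uses the locale's preferred encoding, which on some systems is a legacy code page rather than UTF-8. The following is a minimal standard-library sketch of the explicit form. It is an illustration only, not code from the Unsloth repository, and the file name and sample text are made up.

```python
import tempfile
from pathlib import Path

# Sample text mixing Japanese, Chinese, and Arabic characters.
text = "こんにちは, 你好, مرحبا"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "prompt.txt"

    # Passing encoding="utf-8" makes the result independent of the locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

    # The bytes on disk are exactly the UTF-8 encoding of the text.
    raw_bytes = path.read_bytes()
```

Reading and writing with the same explicit encoding is what keeps non-ASCII content intact across platforms.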