v0.1.405-beta


📅 May 18, 2026 · 📦 unsloth

⚠ 2 breaking · ✨ 16 features · 🐛 23 fixes · 🔧 15 symbols

Summary

This release makes GGUF inference up to 2x faster through MTP speculative decoding and adds broad API provider support, including built-in tools such as web search and code execution. It also hardens security across the Studio platform and fixes numerous training and installer reliability issues.

⚠️ Breaking Changes

Removed the `torch.load` fallback on `training_args.bin`. Untrusted pickles can no longer execute on model load, which may break workflows relying on this fallback.
Loading a cross-family GGUF projector is now blocked in flat local directories, which prevents wrong-vision-tower loads. If you were loading cross-family models from flat directories, those loads may now fail.
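For workflows that still need the contents of `training_args.bin`, one option is to load the file explicitly and make the trust decision visible in code. The following is a minimal sketch, not Unsloth's own loader: the helper name `load_training_args` and its `trusted` flag are hypothetical, and it assumes PyTorch is installed. Only `torch.load` and its `weights_only` parameter are real PyTorch API.

```python
from pathlib import Path
from typing import Any


def load_training_args(path: str, *, trusted: bool = False) -> Any:
    """Explicitly unpickle a training_args.bin file (hypothetical helper).

    Unpickling can run arbitrary code, so the caller has to opt in with
    trusted=True. Nothing is read from disk before that check passes.
    """
    if not trusted:
        raise PermissionError(
            f"Refusing to unpickle {path!r}: pass trusted=True only for "
            "files you created yourself or obtained from a trusted source."
        )
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(file_path)

    # Imported lazily so the trust check works without PyTorch installed.
    import torch

    # weights_only=False allows arbitrary pickled objects, which is exactly
    # the risk the removed fallback carried. Use it deliberately.
    return torch.load(file_path, map_location="cpu", weights_only=False)
```

Calling `load_training_args("training_args.bin")` without `trusted=True` raises `PermissionError`, so an untrusted pickle is never executed by accident.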

Migration Steps

If you rely on the `torch.load` fallback for `training_args.bin`, update your loading mechanism to load the file explicitly, and do so only for pickles you trust.
If you were loading cross-family GGUF models from flat local directories, reorganize your directory structure or ensure the projector loading is handled correctly.
Upgrade to `v0.1.405-beta` or newer, and do not confuse it with `v0.1.40-beta`.
If you install or run Unsloth Studio and need a custom installation path, set `STUDIO_HOME` or `UNSLOTH_STUDIO_HOME`.
Windows users should see more reliable CUDA installation, because the paired `cudart` bundle and the Torch NVIDIA DLL paths are now added to `PATH`.
Users building from HIP source on Ubuntu 24.04 should ensure `--gcc-install-dir` is injected if necessary.
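Before installing, it can help to confirm which custom path your environment would supply. The sketch below only inspects the two variable names from these notes. The lookup order (`UNSLOTH_STUDIO_HOME` first) is an assumption, since the notes do not say which variable wins when both are set, and `custom_studio_home` is a hypothetical helper, not part of Unsloth.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Variable names are taken from the release notes.
# The order in which they are checked is an assumption.
CANDIDATE_VARS = ("UNSLOTH_STUDIO_HOME", "STUDIO_HOME")


def custom_studio_home(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the custom install path found in env, or None if none is set."""
    env = os.environ if env is None else env
    for name in CANDIDATE_VARS:
        value = env.get(name, "").strip()
        if value:
            # Expand a leading "~" so the result is usable as a real path.
            return Path(value).expanduser()
    return None
```

For example, `custom_studio_home({"STUDIO_HOME": "/opt/studio"})` returns `Path("/opt/studio")`, and an environment with neither variable set returns `None`.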

✨ New Features

Approximately 2x faster GGUF inference enabled by automatically activated MTP (speculative decoding).
API support for providers like OpenAI, Anthropic, etc., featuring auto prompt caching, web search, and code execution.
Ability to connect to external inference backends: vLLM, Ollama, and llama-server.
Experimental MLX inference support for running quants and models locally on Mac machines.
Proper support for non-English languages (e.g., Japanese, Chinese) when composing and sending prompts.
Built-in web search functionality for OpenAI, Anthropic, OpenRouter, and Kimi.
Built-in code execution for OpenAI and Anthropic (Anthropic containers are persisted and reused across turns).
Prompt caching enabled for OpenAI and Anthropic models, potentially saving 50 to 90% of costs.
API key is now optional for local providers (llama.cpp / vLLM / Ollama).
Auto-loading of models when adding a cloud provider.
OpenDocument chat attachments support in Unsloth Studio.
New Continued Pretraining (CPT) training method available as a first-class option.
Opt-in fused `lm_head` + cross-entropy forward path controlled by `UNSLOTH_RETURN_LOGITS=1` environment variable.
Custom install paths supported via `STUDIO_HOME` or `UNSLOTH_STUDIO_HOME` environment variables.
CPU-only Linux x86_64 systems are now routed to `ggml-org/llama.cpp` prebuilts.
Introduction of `unsloth --version` command flag.
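To illustrate why speculative decoding can speed up generation without changing the output, here is a toy greedy draft-and-verify loop. It is a generic sketch of the idea, not the MTP implementation in llama.cpp or Unsloth. The function names and the toy "models" are hypothetical, and a real system verifies all drafted positions in one batched forward pass, which this code only counts rather than performs.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_target(context: List[int]) -> int:
    """Stand-in for the large model: count upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    """Stand-in for the cheap drafter: agrees with the target except after a 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


def speculative_generate(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy draft-and-verify; returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    passes = 0
    while len(tokens) < goal:
        # 1. The drafter proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks the proposal. This counts as one pass because
        #    a real implementation scores every position in a single batch.
        passes += 1
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's own choice
            if guess != expected:
                break  # first disagreement: discard the rest of the draft
        tokens.extend(accepted)
    return tokens, passes
```

Because every kept token is the target's own greedy choice, the result matches plain greedy decoding with the target alone. The saving comes from needing fewer target passes whenever the drafter guesses several tokens correctly in a row.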

🐛 Bug Fixes

Fixed silent wrong saves by implementing layout-aware MoE LoRA merge with loud-fail on fallback.
Fixed `num_logits_to_keep` regression when using transformers >= 4.52.
Preserved tokenizer EOS token on merged saves.
Fixed PEFT checkpoint resumption under sentence-transformers >= 5.4.
Restored Flash > SDPA > Flex attention priority for non-Gemma3 models.
Fixed ORPO text-only tokenization when used with processors.
Fixed embedding matrix size mismatch issues.
Fixed Vicuna chat template.
Unified legacy and new logits kwargs in `fast_generate` (fixing Mistral merge site issue).
Made `higher_precision_softmax` idempotent.
Patched every `LOSS_MAPPING` key aliased to `ForCausalLMLoss` (covering transformers 5.x).
Fixed GGUF converter sibling imports.
Added UTF-8 encoding to all text-mode file operations.
Fixed serialization of GGUF reload and inheritance of `unsloth-run` extra arguments.
Fixed `/recommended-folders` 500 error on unreadable model directories under Python 3.12+.
Fixed authentication rate-limiting to be proxy-aware.
Hardened the IME composer and fixed RTL `dir="auto"` issues.
Fixed long log-line truncation.
Fixed tool reasoning trace rendering in UI.
Fixed Gemma attention mask issues during training.
Fixed Gemma-4 MoE LoRA extractor registration to resolve `grouped_mm` contraction crash.
Fixed silent CPU fallback warning when GPU headroom is low.
Fixed issues caused by stale or outdated bundled llama.cpp prebuilts when MTP is used.
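The UTF-8 fix above reflects a general Python pitfall: text-mode `open` uses the platform's default encoding unless one is given, so non-English text can be corrupted or raise errors on some systems. A minimal sketch of the explicit pattern follows. The helper names are illustrative and not taken from the Unsloth codebase.

```python
import tempfile
from pathlib import Path


def write_text_utf8(path: Path, text: str) -> None:
    """Write text with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_text_utf8(path: Path) -> str:
    """Read text back using the same explicit encoding."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def roundtrip(text: str) -> str:
    """Write text to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "prompt.txt"
        write_text_utf8(target, text)
        return read_text_utf8(target)
```

With the encoding pinned on both sides, `roundtrip("こんにちは, 你好")` returns the same string regardless of the system locale.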

Affected Symbols

llama.cpp
torch.load
training_args.bin
num_logits_to_keep
fast_generate
LOSS_MAPPING
ForCausalLMLoss
grouped_mm
HF_DATASETS_OFFLINE
HF_HUB_OFFLINE
flash-attn
flash-linear-attention
tilelang
Qwen3.5 family
sentence-transformers