
b9180


Summary

This release adds MTP support to speculative decoding and enables partial rollback across the CPU, Vulkan, and Metal backends. It also includes numerous bug fixes in model conversion, server logic, and memory handling.

Migration Steps

  1. If using speculative decoding, note that MTP draft models are now identified by an `mtp-` name prefix.
  2. If relying on specific internal types, note that `llama_context_type` is now used instead of `llama_graph_type`.
  3. If using server features, be aware that RS-based MTP is now disabled when combined with other speculative decoding types.
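As a rough illustration of step 1, a model name could be checked for the new prefix as below. The helper name is hypothetical; the server's actual detection logic may differ.

```cpp
#include <string>

// Hypothetical helper: the release notes state that MTP draft models are
// now identified by an "mtp-" name prefix. This is an illustrative sketch,
// not the actual server code.
static bool is_mtp_model(const std::string & name) {
    // rfind with pos 0 returns 0 only if `name` starts with the prefix
    return name.rfind("mtp-", 0) == 0;
}
```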

✨ New Features

  • Speculative decoding (spec) now supports MTP (Multi-Token Prediction).
  • Speculative decoding can now roll back up to `draft_max` tokens by storing GDN intermediate states, reducing wasted computation when draft tokens are rejected.
  • Enabled checkpointing with partial rollback for llama memory.
  • Added GDN partial rollback support for the Vulkan backend.
  • Added GDN partial rollback support for the Metal backend, including storage of intermediate states.
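The partial-rollback features above can be sketched as follows; all names here are illustrative, not the llama.cpp API. The core idea is to checkpoint the recurrent (GDN) state after each drafted token so that, when only the first k of n draft tokens are accepted, decoding resumes from checkpoint k instead of recomputing the accepted prefix.

```cpp
#include <vector>

// Hypothetical types for illustration only.
struct state_checkpoint {
    int n_past;                    // tokens processed when snapshot was taken
    std::vector<float> gdn_state;  // copy of the recurrent intermediates
};

struct draft_session {
    // checkpoints[k] = state after k accepted draft tokens
    // (checkpoints[0] is the state before any draft token)
    std::vector<state_checkpoint> checkpoints;

    void save(int n_past, const std::vector<float> & state) {
        checkpoints.push_back({n_past, state});
    }

    // Discard checkpoints past the accepted prefix and return the one to
    // resume from, instead of rolling all the way back to checkpoint 0.
    const state_checkpoint & rollback(int n_accepted) {
        checkpoints.resize(n_accepted + 1);
        return checkpoints.back();
    }
};
```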

🐛 Bug Fixes

  • Fixed batch size calculation/handling.
  • Fixed issues related to MTP conversion.
  • Fixed Python lint (pycheck) failures in the conversion script.
  • Fixed convert issues.
  • Fixed faulty bitwise check in recurrent memory.
  • Fixed compatibility logic for combining n-gram speculation with MTP.
  • Fixed pending state issues.
  • Fixed early exit logic in server context.
  • Fixed compatibility issues between spec and n-gram.
  • Corrected inaccurate code comments.
  • Fixed MTP path in download.
  • Fixed enorm op in llama-arch.
  • Fixed type annotations in conversion.
  • Fixed test case for loading into a dirty context.
  • Cleared `rs_idx` in clear operation for llama-memory-recurrent.

⚡ Deprecations

  • Removed unused symbol `llama_arch`.