Changelog

b8842

📦 llama-cpp
✨ 4 features · 🐛 8 fixes · ⚡ 2 deprecations · 🔧 8 symbols

Summary

This release introduces speculative checkpointing for the server, significantly refactoring the speculative decoding logic in C++ and improving robustness across various continuation scenarios. The deprecated `--spec-use-checkpoints` argument and the `server_prompt_checkpoint_with_size` function have been removed.

Migration Steps

  1. Remove usage of the `--spec-use-checkpoints` argument if present.
  2. Remove calls to the deprecated `server_prompt_checkpoint_with_size` function.
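The two steps above amount to deleting two identifiers from launch scripts and source code. As a quick pre-upgrade sanity check, something like the following sketch can flag files that still reference them (the sample launch script and paths here are purely illustrative, not part of the release):

```shell
# Sketch of a pre-upgrade check: list files that still reference the removed
# flag or function. The sample launch script below is illustrative only.
workdir=$(mktemp -d)
printf '%s\n' './llama-server -m model.gguf --spec-use-checkpoints' > "$workdir/launch.sh"

# grep -rl prints each file containing either removed symbol;
# empty output would mean the tree is already migrated.
matches=$(grep -rl -e '--spec-use-checkpoints' -e 'server_prompt_checkpoint_with_size' "$workdir")
echo "$matches"

rm -rf "$workdir"
```

Run against a real checkout, an empty result means no further migration work is needed.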

✨ New Features

  • Implemented speculative checkpointing in the server.
  • Enabled speculative decoding using checkpoints.
  • Enabled MTMD speculative decoding in continuation logic.
  • Speculative checkpoints now include draft model state and logging.

🐛 Bug Fixes

  • Fixed draft check logic when using speculative checkpoints.
  • Fixed speculative checkpoint logging issues.
  • Fixed ngram-map/begin index calculation for speculative decoding.
  • Fixed speculative checkpoint initialization (ensuring `begin()` is called).
  • Restored sampler state in speculative checkpoints and cleared memory upon restoration.
  • Continuation logic no longer discards partial drafts merely because they are short.
  • Fixed nullptr dereference related to draft checkpoints.
  • Fixed the accepted number count in continuation logic.

⚡ Deprecations

  • Removed argument `--spec-use-checkpoints`.
  • Removed function `server_prompt_checkpoint_with_size`.