b8842
📦 llama-cpp
✨ 4 features · 🐛 8 fixes · ⚡ 2 deprecations · 🔧 8 symbols
Summary
This release introduces speculative checkpointing for the server, significantly refactoring the speculative decoding logic in C++ and improving robustness across various continuation scenarios. Several deprecated functions and arguments related to speculative prompting have also been removed.
Migration Steps
- Remove usage of the `--spec-use-checkpoints` argument if present.
- Remove calls to the deprecated `server_prompt_checkpoint_with_size` function.
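The flag removal only affects how the server is launched; speculative checkpointing no longer needs an opt-in. A before/after sketch (the model paths and surrounding flags below are illustrative, not taken from this release):

```shell
# before (pre-b8842): hypothetical invocation opting in via the removed flag
llama-server -m model.gguf -md draft.gguf --spec-use-checkpoints

# after (b8842): drop the flag; the rest of the invocation is unchanged
llama-server -m model.gguf -md draft.gguf
```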
✨ New Features
- Implemented speculative checkpointing in the server.
- Enabled speculative decoding using checkpoints.
- Enabled MTMD speculative decoding in continuation logic.
- Speculative checkpoints now include draft model state and logging.
🐛 Bug Fixes
- Fixed draft check logic when using speculative checkpoints.
- Fixed speculative checkpoint logging issues.
- Fixed ngram-map/begin index calculation for speculative decoding.
- Fixed speculative checkpoint initialization (ensuring begin() is called).
- Restored sampler state in speculative checkpoints and cleared memory upon restoration.
- Continuation logic no longer discards short partial drafts.
- Fixed nullptr dereference related to draft checkpoints.
- Fixed the accepted number count in continuation logic.
⚡ Deprecations
- Removed argument `--spec-use-checkpoints`.
- Removed function `server_prompt_checkpoint_with_size`.