Changelog

b8842

📦 llama-cpp
✨ 4 features · 🐛 8 fixes · ⚡ 2 deprecations · 🔧 8 symbols

Summary

This release introduces speculative checkpointing for the server, significantly refactoring the speculative decoding logic in C++ and improving robustness across various continuation scenarios. The deprecated `--spec-use-checkpoints` argument and the `server_prompt_checkpoint_with_size` function have been removed.

Migration Steps

  1. Remove usage of the `--spec-use-checkpoints` argument if present.
  2. Remove calls to the deprecated `server_prompt_checkpoint_with_size` function.
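The two steps above amount to deleting two identifiers from launch scripts and source code. As a quick pre-upgrade sanity check, something like the following sketch can flag files that still reference them (the sample launch script and paths here are purely illustrative, not part of the release):

```shell
# Sketch of a pre-upgrade check: list files that still reference the removed
# flag or function. The sample launch script below is illustrative only.
workdir=$(mktemp -d)
printf '%s\n' './llama-server -m model.gguf --spec-use-checkpoints' > "$workdir/launch.sh"

# grep -rl prints each file containing either removed symbol;
# empty output would mean the tree is already migrated.
matches=$(grep -rl -e '--spec-use-checkpoints' -e 'server_prompt_checkpoint_with_size' "$workdir")
echo "$matches"

rm -rf "$workdir"
```

Run against a real checkout, an empty result means no further migration work is needed.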

✨ New Features

  • Implemented speculative checkpointing in the server.
  • Enabled speculative decoding using checkpoints.
  • Enabled MTMD speculative decoding in continuation logic.
  • Speculative checkpoints now include draft model state and logging.

🐛 Bug Fixes

  • Fixed draft check logic when using speculative checkpoints.
  • Fixed speculative checkpoint logging issues.
  • Fixed ngram-map/begin index calculation for speculative decoding.
  • Fixed speculative checkpoint initialization (ensuring `begin()` is called).
  • Restored sampler state in speculative checkpoints and cleared memory upon restoration.
  • Continuation logic no longer discards partial drafts merely because they are short.
  • Fixed nullptr dereference related to draft checkpoints.
  • Fixed the accepted number count in continuation logic.

⚡ Deprecations

  • Removed argument `--spec-use-checkpoints`.
  • Removed function `server_prompt_checkpoint_with_size`.