b9109
📦 llama-cpp
⚠ 2 breaking · ✨ 6 features · 🐛 4 fixes · 🔧 6 symbols
Summary
This release adds parallel drafting, enabling more complex speculative decoding strategies, and refactors context management across the server and spec components. Several internal structure changes require migration steps for speculative decoding configuration.
⚠️ Breaking Changes
- Support for incompatible vocabularies has been dropped. Users must ensure the vocabulary used by the draft model is compatible with the target model before enabling speculative decoding.
- The old `type` field in the `common_params_speculative` struct has been replaced by a vector to allow specifying multiple speculative types.
Migration Steps
- When configuring speculative decoding, replace the single `type` field in `common_params_speculative` with a vector of speculative types.
- Use `common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)` to determine enabled implementations based on user-provided spec types.
- Use `common_speculative_type_from_names(const std::vector<std::string> & names)` to parse user-provided spec types specified as names.
✨ New Features
- Added parallel drafting support.
- Introduced `common_speculative_process()` for speculative processing.
- Enabled support for multiple speculative types (chain of speculators).
- Implemented logic to maximize expected accepted tokens by calculating acceptance probability product.
- Draft prompt cache and checkpoints added to the server component.
- Context processing now handles images through the draft context.
🐛 Bug Fixes
- Fixed multi-turn draft processing issues on the server.
- Corrected the URL for the draft model.
- Fixed the `n_past` type in the spec component.
- Fixed the slot's draft context pointer (`ctx_drft`).