b9109
📦 llama-cpp
⚠ 2 breaking · ✨ 6 features · 🐛 4 fixes · 🔧 6 symbols
Summary
This release adds parallel drafting, enabling more complex speculative decoding strategies, and refactors context management across the server and spec components. Several internal structure changes require migration steps for speculative decoding configuration.
⚠️ Breaking Changes
- Support for incompatible vocabularies has been dropped. Users must ensure the vocabulary used by the draft model is compatible with the target model before enabling speculative decoding.
- The old `type` field in the `common_params_speculative` struct has been replaced by a vector to allow specifying multiple speculative types.
Migration Steps
- When configuring speculative decoding, replace the single `type` field in `common_params_speculative` with a vector of speculative types.
- Use `common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)` to determine enabled implementations based on user-provided spec types.
- Use `common_speculative_type_from_names(const std::vector<std::string> & names)` to parse user-provided spec types specified as names.
✨ New Features
- Added parallel drafting support.
- Introduced `common_speculative_process()` for speculative processing.
- Enabled support for multiple speculative types (chain of speculators).
- Implemented logic to maximize expected accepted tokens by calculating acceptance probability product.
- Draft prompt cache and checkpoints added to the server component.
- Context processing now handles images through the draft context.
🐛 Bug Fixes
- Fixed multi-turn draft processing issues on the server.
- Corrected the URL for the draft model.
- Fixed the `n_past` type in the spec component.
- Fixed the slot's draft context pointer (`ctx_drft`).