b9464
📦 llama-cpp
✨ 2 features · 🐛 2 fixes · 🔧 2 symbols
Summary
This release refactors the speculative decoding logic by introducing the common_speculative_n_max helper, fixes the handling of n_outputs_max, and stops enabling draft-simple mode automatically.
Migration Steps
- If you relied on draft-simple being enabled automatically, enable it explicitly; this release no longer turns it on for you.
- Review any code that depended on server_n_outputs_max; that logic has moved to common_speculative_n_max.
✨ New Features
- Added common_speculative_n_max helper function to centralize speculative max-draft-size logic.
- Draft context now always includes n_parallel outputs.
🐛 Bug Fixes
- Fixed logic related to n_outputs_max in speculative decoding.
- Removed automatic enabling of draft-simple mode.