b9745

📦 llama-cpp
6 features · 🐛 2 fixes · 🔧 8 symbols

Summary

This release adds support for Step3.5/3.7 flash MTP, including the new llama_set_mtp_layer_offset and llama_model_n_nextn_layer APIs and speculative multi-head process() and draft() functions. It also includes internal cleanups and two bug fixes, one of them for multi-sequence processing.

Migration Steps

  1. 'draft_multi_head' has been merged into 'draft()'; replace calls to 'draft_multi_head' with 'draft()'.
  2. 'nextn' has been renamed to 'mtp' in the relevant contexts; update references to the old name accordingly.

✨ New Features

  • Support for the Step3.5/3.7 flash MTP (Multi-Token Prediction) implementation.
  • Added the mtp_layer_offset and include-nextn flags to graph reuse.
  • Introduced the llama_set_mtp_layer_offset and llama_model_n_nextn_layer APIs.
  • Implemented offset head selection; all MTP blocks are now required.
  • Added speculative multi-head process() and draft() functions.
  • Outputs are now gathered via inp_out_ids.

🐛 Bug Fixes

  • Fixes applied to core functionality.
  • Fixed an issue in multi-sequence processing.

Affected Symbols