
b9180


Summary

This release adds MTP support to speculative decoding and enables partial rollback across the CPU, Vulkan, and Metal backends. It also includes numerous bug fixes in model conversion, server logic, and memory handling.

Migration Steps

  1. If using speculative decoding, note that MTP draft models are now identified by an `mtp-` name prefix.
  2. If relying on specific internal types, note that `llama_context_type` is now used instead of `llama_graph_type`.
  3. If using server features, be aware that RS-based MTP is now disabled when combined with other speculative decoding types.
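As a rough illustration of step 1, a model name could be checked for the new prefix as below. The helper name is hypothetical; the server's actual detection logic may differ.

```cpp
#include <string>

// Hypothetical helper: the release notes state that MTP draft models are
// now identified by an "mtp-" name prefix. This is an illustrative sketch,
// not the actual server code.
static bool is_mtp_model(const std::string & name) {
    // rfind with pos 0 returns 0 only if `name` starts with the prefix
    return name.rfind("mtp-", 0) == 0;
}
```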

✨ New Features

  • Speculative decoding (spec) now supports MTP (Multi-Token Prediction).
  • Speculative decoding can now roll back up to `draft_max` tokens by storing GDN intermediate states, reducing wasted computation when draft tokens are rejected.
  • Enabled checkpointing with partial rollback for llama memory.
  • Added GDN partial rollback support for the Vulkan backend.
  • Added GDN partial rollback support for the Metal backend, including storage of intermediate states.
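The partial-rollback features above can be sketched as follows; all names here are illustrative, not the llama.cpp API. The core idea is to checkpoint the recurrent (GDN) state after each drafted token so that, when only the first k of n draft tokens are accepted, decoding resumes from checkpoint k instead of recomputing the accepted prefix.

```cpp
#include <vector>

// Hypothetical types for illustration only.
struct state_checkpoint {
    int n_past;                    // tokens processed when snapshot was taken
    std::vector<float> gdn_state;  // copy of the recurrent intermediates
};

struct draft_session {
    // checkpoints[k] = state after k accepted draft tokens
    // (checkpoints[0] is the state before any draft token)
    std::vector<state_checkpoint> checkpoints;

    void save(int n_past, const std::vector<float> & state) {
        checkpoints.push_back({n_past, state});
    }

    // Discard checkpoints past the accepted prefix and return the one to
    // resume from, instead of rolling all the way back to checkpoint 0.
    const state_checkpoint & rollback(int n_accepted) {
        checkpoints.resize(n_accepted + 1);
        return checkpoints.back();
    }
};
```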

🐛 Bug Fixes

  • Fixed batch size calculation/handling.
  • Fixed issues related to MTP conversion.
  • Fixed Python lint (pycheck) failures in the conversion script.
  • Fixed convert issues.
  • Fixed faulty bitwise check in recurrent memory.
  • Fixed compatibility logic for combining n-gram speculation with MTP.
  • Fixed pending state issues.
  • Fixed early exit logic in server context.
  • Fixed compatibility issues between spec and n-gram.
  • Corrected inaccurate code comments.
  • Fixed MTP path in download.
  • Fixed enorm op in llama-arch.
  • Fixed type annotations in conversion.
  • Fixed test case for loading into a dirty context.
  • Cleared `rs_idx` in clear operation for llama-memory-recurrent.

⚡ Deprecations

  • Removed unused symbol `llama_arch`.