Change 8

b8786

📦 llama-cpp
✨ 2 features · 🐛 1 fix · 🔧 2 symbols

Summary

This release improves performance by creating the reasoning budget sampler only when a token budget is actually set, so backend sampling stays enabled in the default case. Sampler creation is still preserved when grammar is lazy, to keep tool usage working.

✨ New Features

  • Improved performance by skipping the reasoning budget sampler when no budget is requested, which re-enables backend sampling for faster token generation on GPU.
  • Ensured the reasoning budget sampler is preserved when grammar is lazy (grammar_lazy=true) to maintain thinking-block grammar suppression when tools are in use.

🐛 Bug Fixes

  • Fixed an issue where the reasoning budget sampler was unconditionally created even when the budget was the default (-1), leading to unnecessary per-token overhead and potential speed regressions (e.g., 30% slowdown reported on Vulkan).
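The gating logic above can be sketched as a small predicate. This is an illustrative sketch, not llama-cpp's actual API: the struct, field names, and the helper `needs_budget_sampler` are hypothetical, chosen to mirror the two conditions the release describes (a non-default budget, or lazy grammar for tool use).

```cpp
// Hypothetical sketch of the conditional sampler creation described above.
// Names are illustrative and do not match llama-cpp's real symbols.
struct sampler_params {
    int  reasoning_budget = -1;    // -1 = no budget requested (default)
    bool grammar_lazy     = false; // true when tools require lazy grammar
};

// Create the reasoning budget sampler only when it can have an effect:
// either a budget was requested, or lazy grammar needs it to suppress
// thinking-block grammar while tools are in use. Otherwise skip it so
// backend (GPU) sampling stays enabled.
bool needs_budget_sampler(const sampler_params & p) {
    return p.reasoning_budget >= 0 || p.grammar_lazy;
}
```

With the default parameters the predicate is false, which is exactly the case the fix targets: no per-token budget bookkeeping, and the backend sampler path stays active.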

Affected Symbols