b9270
📦 llama-cpp
⚠ 2 breaking · ✨ 2 features · 🐛 2 fixes · ⚡ 2 deprecations · 🔧 18 symbols
Summary
This release adds full support for the Carbon-3B tokenizer by promoting its specialized DNA handling logic to a new top-level vocabulary type, LLAMA_VOCAB_TYPE_HYBRIDDNA. Tokenizer initialization and conversion logic were refactored to follow the conventions of the existing tokenizer families.
⚠️ Breaking Changes
- Carbon-3B tokenization no longer uses a pre-type (LLAMA_VOCAB_PRE_TYPE_CARBON) layered on top of BPE. It is now its own vocabulary type, LLAMA_VOCAB_TYPE_HYBRIDDNA, and tokenizer initialization and dispatch logic changed accordingly.
- The conditional logic in conversion/base.py's get_vocab_base and get_vocab_base_pre that detected HybridDNATokenizer by class name and passed trust_remote_code=True has been removed. Dispatch is now handled via a tokenizer_class check in LlamaModel.set_vocab in conversion/llama.py.
Migration Steps
- If your code relied on the previous Carbon tokenization path (the LLAMA_VOCAB_PRE_TYPE_CARBON pre-type), update it to use LLAMA_VOCAB_TYPE_HYBRIDDNA.
- When converting Hugging Face models, make sure your conversion script dispatches on tokenizer_class == "HybridDNATokenizer" in LlamaModel.set_vocab; the previous short-circuit in get_vocab_base/get_vocab_base_pre has been removed.
- If you were using the previous Carbon pre-type detection mechanism in conversion scripts, note that dispatch is now class-name driven, and the stale chkhsh entry in convert_hf_to_gguf_update.py is removed.
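The class-name check described in the steps above can be sketched as follows. This is a minimal illustration, not the code in conversion/llama.py: the helper name uses_hybrid_dna is hypothetical, and it assumes the tokenizer class is read from the "tokenizer_class" field of the model directory's tokenizer_config.json, as is conventional for Hugging Face checkpoints.

```python
import json
from pathlib import Path


def uses_hybrid_dna(model_dir: str) -> bool:
    """Return True if the checkpoint declares HybridDNATokenizer.

    Hypothetical helper: assumes the class name lives in the
    "tokenizer_class" field of tokenizer_config.json.
    """
    cfg_path = Path(model_dir) / "tokenizer_config.json"
    if not cfg_path.is_file():
        # No tokenizer config: nothing to dispatch on.
        return False
    with cfg_path.open(encoding="utf-8") as f:
        cfg = json.load(f)
    # Dispatch is driven by the class name, not by a pre-tokenizer hash.
    return cfg.get("tokenizer_class") == "HybridDNATokenizer"
```

A conversion script could call a check like this before choosing a vocabulary path, instead of relying on a chkhsh lookup.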
✨ New Features
- Added support for Carbon-3B tokenizer via the new LLAMA_VOCAB_TYPE_HYBRIDDNA.
- The HybridDNATokenizer handles text inside <dna>...</dna> regions by chunking bases into fixed 6-mers (right-padded with 'A') and mapping bases outside ACGT to <oov>.
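The 6-mer chunking described above can be sketched in a few lines. This is an illustration only, not the llm_tokenizer_hybriddna implementation: the function names chunk_dna and dna_regions are hypothetical, and the sketch assumes uppercase input with no whitespace inside a region, and that a 6-mer containing any base outside ACGT is replaced by <oov> as a whole (the release notes do not say whether the mapping is per base or per k-mer).

```python
import re

K = 6  # k-mer width stated in the release notes
DNA_REGION = re.compile(r"<dna>(.*?)</dna>", re.DOTALL)


def chunk_dna(seq: str, k: int = K) -> list[str]:
    """Split a DNA string into fixed k-mers, right-padding the last one with 'A'."""
    pieces = []
    for i in range(0, len(seq), k):
        kmer = seq[i:i + k].ljust(k, "A")  # pad a short final chunk with 'A'
        # Assumption: any base outside ACGT turns the whole k-mer into <oov>.
        pieces.append(kmer if set(kmer) <= set("ACGT") else "<oov>")
    return pieces


def dna_regions(text: str) -> list[list[str]]:
    """Return the k-mer pieces for each <dna>...</dna> region in the text."""
    return [chunk_dna(m.group(1)) for m in DNA_REGION.finditer(text)]
```

For example, chunk_dna("ACGTACGT") returns ["ACGTAC", "GTAAAA"]: the trailing "GT" is padded to six bases with 'A'.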
🐛 Bug Fixes
- Relaxed an assertion in llm_tokenizer_bpe to allow the new HYBRIDDNA vocab type.
- Dropped a local hybriddna fixture, moving it to ggml-org/vocabs.
Affected Symbols
LLAMA_VOCAB_PRE_TYPE_CARBON, HybridDNATokenizer, llm_tokenizer_bpe_session::tokenize, tokenize_carbon, emit_dna_kmers, get_vocab_base, get_vocab_base_pre, convert_hf_to_gguf_update.py, LLAMA_VOCAB_TYPE_HYBRIDDNA, llm_tokenizer_hybriddna, init_tokenizer, tokenize, type_name, byte_to_token, token_to_piece, tokenizer.ggml.model, tokenizer.ggml.pre, LlamaModel.set_vocab
⚡ Deprecations
- The short-lived LLAMA_VOCAB_PRE_TYPE_CARBON has been dropped.
- The stale chkhsh entry and trust_remote_code special-casing in convert_hf_to_gguf_update.py related to Carbon tokenization are dropped as dispatch is now class-name driven.