b9045
📦 llama-cpp
✨ 5 features · 🐛 5 fixes · 🔧 5 symbols
Summary
This release adds full support for the ibm-granite/granite-4.0-1b-speech model, covering its audio preprocessing, encoder architecture, and GGUF conversion handling. Several internal refactorings standardize tensor naming and model structure for the new audio modality.
Migration Steps
- Rename gs_-prefixed tensors to generic, architecture-level names.
- Use tensor_mapping.py for all granite_speech tensors.
- Fold GraniteSpeechTextModel into GraniteModel.
- Replace the n_layer hack with an explicit has_standard_layers flag.
- Replace hardcoded magic numbers with GGUF hparams for granite speech.
- Merge qformer_proj_layer into clip_layer.
- Make the generic audio layer_norm_eps read optional.
- Use filter_tensors instead of modify_tensors to skip tensors in the granite_speech converter.
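The last migration step, skipping unwanted tensors at filter time rather than returning empty results from modify_tensors, can be sketched as below. This is a hypothetical illustration: the function name mirrors the release notes, but the exact signature used by the converter and the skip list are assumptions, not the real code.

```python
# Hypothetical sketch: drop unused granite_speech tensors up front
# instead of special-casing them inside modify_tensors.
from typing import Iterable, List, Tuple

# Illustrative skip list; the real converter's list may differ.
SKIP_PREFIXES = ("encoder.rnn_tr.",)

def filter_tensors(tensors: Iterable[Tuple[str, object]]) -> List[Tuple[str, object]]:
    """Return only the tensors whose names do not match a skip prefix."""
    return [(name, t) for name, t in tensors
            if not name.startswith(SKIP_PREFIXES)]
```

Filtering up front keeps modify_tensors focused on actual tensor transformations (renames, reshapes) rather than mixing in skip logic.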
✨ New Features
- Added support for ibm-granite/granite-4.0-1b-speech model.
- Implemented Conformer encoder with Shaw relative position encoding, QFormer projector, and log-mel spectrogram with frame stacking for Granite Speech.
- Encoder now uses GLU gating, folded batch norm, and SSM depthwise conv.
- QFormer compresses encoder output via windowed cross-attention (window=15, queries=3) into the LLM embedding space.
- Audio preprocessing includes reflect-padded STFT, 80-bin mel filterbank, dynamic range compression, and 2x frame stacking (80->160 mel).
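The 2x frame stacking mentioned above concatenates each pair of adjacent 80-bin log-mel frames into one 160-dimensional frame, halving the time dimension. A minimal numpy sketch, assuming a [T, 80] log-mel input; the function name and the drop-the-remainder padding policy are illustrative assumptions:

```python
import numpy as np

def stack_frames(mel: np.ndarray, factor: int = 2) -> np.ndarray:
    """Concatenate `factor` consecutive mel frames along the feature axis.

    mel: [T, n_mels] log-mel spectrogram.
    Returns: [T // factor, n_mels * factor]; trailing frames that do not
    fill a complete group are dropped (padding policy assumed here).
    """
    t, n_mels = mel.shape
    t = (t // factor) * factor                    # drop the ragged tail
    return mel[:t].reshape(t // factor, factor * n_mels)
```

With factor=2 and 80 mel bins, a [100, 80] input becomes [50, 160], matching the 80->160 stacking described in the feature list.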
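The QFormer windowing described above maps each window of 15 encoder frames to 3 query outputs, a fixed 5x temporal downsampling before projection into the LLM embedding space. The toy sketch below shows only the shape bookkeeping of that windowed cross-attention; the single-head softmax attention here is a simplification of the real projector, and all names are illustrative:

```python
import numpy as np

WINDOW = 15    # encoder frames per window
QUERIES = 3    # learned queries per window -> 5x downsampling

def qformer_downsample(enc: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Toy windowed cross-attention.

    enc:     [T, d] encoder output (ragged tail dropped, an assumption)
    queries: [QUERIES, d] learned query embeddings
    Returns: [(T // WINDOW) * QUERIES, d]
    """
    t, d = enc.shape
    t = (t // WINDOW) * WINDOW
    windows = enc[:t].reshape(-1, WINDOW, d)        # [n_win, WINDOW, d]
    out = []
    for win in windows:                              # per-window attention
        logits = queries @ win.T / np.sqrt(d)        # [QUERIES, WINDOW]
        attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)     # softmax over frames
        out.append(attn @ win)                       # [QUERIES, d]
    return np.concatenate(out, axis=0)
```

For example, 30 encoder frames (two windows) yield 6 output vectors, so sequence length shrinks by 5x regardless of the embedding width.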
🐛 Bug Fixes
- GGUF converter handles batch norm folding at export time, fused K/V split, and Conv1d weight reshaping.
- Fixed type-check for GraniteSpeechMmprojModel registration in converter.
- Removed redundant ggml_build_forward_expand on inputs for granite_speech.
- Hard-coded eps values in the C++ code for granite_speech and removed them from GGUF metadata (the encoder eps is kept).
- Fixed alignment and ordering issues across granite speech files.
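The export-time batch norm folding mentioned in the fixes above is the standard transformation that absorbs a BatchNorm layer into the preceding conv/linear weights, so inference needs no separate normalization pass. This is a generic numpy sketch of that fold, not the converter's actual code:

```python
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm(gamma, beta, running mean/var) into the preceding layer.

    w: [out, ...] weight, b: [out] bias; gamma/beta/mean/var are per-channel.
    Returns folded (w', b') with w' = w * gamma/sqrt(var+eps)
    and b' = (b - mean) * gamma/sqrt(var+eps) + beta.
    """
    scale = gamma / np.sqrt(var + eps)                        # [out]
    w_folded = w * scale.reshape(-1, *([1] * (w.ndim - 1)))   # broadcast per out-channel
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded
```

Folding at export time keeps the GGUF graph smaller and avoids carrying BN statistics as separate tensors.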