Changelog

b9045

📦 llama-cpp
✨ 5 features · 🐛 5 fixes · 🔧 5 symbols

Summary

This release adds full support for the ibm-granite/granite-4.0-1b-speech model, covering its audio preprocessing, encoder architecture, and GGUF conversion handling. Numerous internal refactorings standardize tensor naming and model structure for this new modality.

Migration Steps

  1. Rename gs_-prefixed tensors to generic/architecture names.
  2. Use tensor_mapping.py for all granite_speech tensors.
  3. Fold GraniteSpeechTextModel into GraniteModel.
  4. Replace the n_layer hack with an explicit has_standard_layers flag.
  5. Replace hardcoded magic numbers with GGUF hparams for Granite Speech.
  6. Merge qformer_proj_layer into clip_layer.
  7. Make the generic audio layer_norm_eps read optional.
  8. Use filter_tensors instead of modify_tensors to skip tensors in the granite_speech converter.
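Step 1 amounts to a prefix-based rename pass over the checkpoint. The sketch below is illustrative only: the prefixes in `GS_RENAMES` are hypothetical placeholders, not the actual entries in tensor_mapping.py.

```python
# Hypothetical sketch of migration step 1: renaming gs_-prefixed tensors
# to generic/architecture names. The mapping entries are placeholders,
# not the real tensor_mapping.py table.

GS_RENAMES = {
    "gs_conformer": "a.enc",  # hypothetical audio-encoder prefix
    "gs_qformer": "mm.a",     # hypothetical projector prefix
}

def rename_gs_tensor(name: str) -> str:
    """Map a legacy gs_-prefixed tensor name to its generic form."""
    for old, new in GS_RENAMES.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name  # non-gs_ tensors pass through unchanged
```

A rename like this runs once during conversion, so every downstream consumer only ever sees the generic names.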

✨ New Features

  • Added support for ibm-granite/granite-4.0-1b-speech model.
  • Implemented a Conformer encoder with Shaw relative position encoding, a QFormer projector, and log-mel spectrogram input with frame stacking for Granite Speech.
  • The encoder uses GLU gating, folded batch norm, and an SSM depthwise conv.
  • The QFormer compresses encoder output into the LLM embedding space via windowed cross-attention (window = 15, queries = 3).
  • Audio preprocessing includes a reflect-padded STFT, an 80-bin mel filterbank, dynamic range compression, and 2x frame stacking (80 -> 160 features per frame).
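The preprocessing pipeline above can be sketched end to end in NumPy. This is a minimal reference implementation, not the code shipped in this release: the STFT parameters (`n_fft=512`, `hop=160`, Hann window) and the clip floor for the log compression are assumptions; only the reflect padding, 80 mel bins, log compression, and 2x stacking come from the release notes.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr: int, n_fft: int, n_mel: int) -> np.ndarray:
    """Triangular mel filterbank, shape (n_mel, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sr / 2, n_bins)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mel + 2))
    fb = np.zeros((n_mel, n_bins))
    for i in range(n_mel):
        lo, center, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (freqs - lo) / (center - lo)
        down = (hi - freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb

def log_mel_stacked(audio: np.ndarray, sr: int = 16000, n_fft: int = 512,
                    hop: int = 160, n_mel: int = 80) -> np.ndarray:
    """Reflect-padded STFT -> 80-bin mel -> log compression -> 2x stacking."""
    pad = n_fft // 2
    x = np.pad(audio, (pad, pad), mode="reflect")  # reflect padding
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mel).T
    # Dynamic range compression: clip then log.
    logmel = np.log(np.clip(mel, 1e-10, None))
    # 2x frame stacking: concatenate adjacent frame pairs (80 -> 160 features).
    if logmel.shape[0] % 2:
        logmel = logmel[:-1]
    return logmel.reshape(-1, 2 * n_mel)
```

With these assumed parameters, one second of 16 kHz audio yields 101 frames, the odd frame is dropped, and stacking produces a (50, 160) feature matrix.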

🐛 Bug Fixes

  • The GGUF converter now handles batch norm folding at export time, splitting fused K/V tensors, and Conv1d weight reshaping.
  • Fixed the type-check for GraniteSpeechMmprojModel registration in the converter.
  • Removed a redundant ggml_build_forward_expand on inputs for granite_speech.
  • Hard-coded eps in the C++ code for granite_speech and removed it from GGUF metadata (the encoder eps is kept).
  • Fixed alignment and ordering issues across granite speech files.
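Batch norm folding at export time, mentioned in the first fix above, rescales the preceding layer's weights so the batch norm disappears at inference. A minimal sketch of the standard technique, not the converter's actual code (the weight layout and eps default are assumptions):

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm (gamma, beta, running mean/var) into the preceding
    layer's weight w (out_ch, ...) and bias b (out_ch,).

    bn(w @ x + b) == w_folded @ x + b_folded for the returned tensors.
    """
    scale = gamma / np.sqrt(var + eps)
    # Broadcast the per-channel scale over the remaining weight dims.
    w_folded = w * scale.reshape(-1, *([1] * (w.ndim - 1)))
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded
```

Folding once at export means the runtime graph needs no batch norm op at all, which is why the release can drop the corresponding eps from GGUF metadata.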

Affected Symbols