Change8

b8064

Breaking Changes
📦 llama-cpp
2 breaking · 3 features · 🐛 1 fix · 🔧 3 symbols

Summary

This release focuses on CUDA performance optimizations for iq2xxs/iq2xs/iq3xxs dequantization, including register savings and algorithmic simplification, and fixes a compilation error caused by the undefined identifier "uint".

⚠️ Breaking Changes

  • The type alias "uint" was removed and replaced with "uint32_t" in CUDA code, which will cause compilation errors if "uint" was used directly.
  • IQ2XXS sum scaling logic was simplified: the expression `(sum * scale + sum / 2) / 4` became `(sum * (scale * 2 + 1)) / 8`, and the factor `((aux32 >> 28) * 2 + 1)` is now computed as `(aux32 >> 27 | 1)`. This is an internal implementation change, but it could affect code that inspects or depends on the exact intermediate values.
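
To see how the old and new IQ2XXS expressions relate, the host-side sketch below compares them by brute force. It is not the kernel code: the function names, the plain `int` sum, the 32-bit `aux32`, and the tested range are assumptions made for illustration. Under C++ truncating integer division the two scaling forms agree on every value tried here, so only intermediate values differ, not the final result.

```cpp
#include <cassert>
#include <cstdint>

// Old form of the sum scaling, as quoted in the release notes.
constexpr int scale_sum_old(int sum, int scale) {
    return (sum * scale + sum / 2) / 4;
}

// New form: one multiply by the odd factor (2 * scale + 1), then one divide.
constexpr int scale_sum_new(int sum, int scale) {
    return (sum * (scale * 2 + 1)) / 8;
}

// Old and new ways of deriving the odd factor from the packed 32-bit word.
// `>>` binds tighter than `|`, so the parentheses match `aux32 >> 27 | 1`.
constexpr uint32_t factor_old(uint32_t aux32) { return (aux32 >> 28) * 2 + 1; }
constexpr uint32_t factor_new(uint32_t aux32) { return (aux32 >> 27) | 1; }

// Compare both scaling forms over a hypothetical range of sums and all
// 4-bit scale values (aux32 >> 28 leaves 4 bits of a 32-bit word).
inline bool scaling_forms_agree(int max_abs_sum) {
    for (int sum = -max_abs_sum; sum <= max_abs_sum; ++sum) {
        for (int scale = 0; scale < 16; ++scale) {
            if (scale_sum_old(sum, scale) != scale_sum_new(sum, scale)) return false;
        }
    }
    return true;
}

// Compare both factor forms for every value of the top 5 bits; the low
// 27 bits are set to show that they do not influence the result.
inline bool factor_forms_agree() {
    for (uint32_t top = 0; top < 32; ++top) {
        const uint32_t aux32 = (top << 27) | 0x07FFFFFFu;
        if (factor_old(aux32) != factor_new(aux32)) return false;
    }
    return true;
}

// Convenience check that can be called from a test or from main().
inline void check_iq2xxs_rewrite() {
    assert(scaling_forms_agree(4096));
    assert(factor_forms_agree());
}
```

The shift form works because `(aux32 >> 28) * 2` is `aux32 >> 27` with its lowest bit cleared, so adding 1 and OR-ing with 1 produce the same value.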

Migration Steps

  1. If you were using the type alias "uint" in CUDA-related code, replace all instances with "uint32_t".
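
The replacement is mechanical. The sketch below shows the shape of the change on a made-up helper; `pack_low_bytes` is a hypothetical function used only for illustration and is not part of llama.cpp.

```cpp
#include <cstdint>

// Before (relies on a non-standard `uint` alias that may be undefined):
//   uint pack_low_bytes(uint a, uint b) { ... }

// After: the fixed-width type made available by <cstdint>.
inline uint32_t pack_low_bytes(uint32_t a, uint32_t b) {
    // Keep the low byte of each argument and place b's byte above a's.
    return (a & 0xFFu) | ((b & 0xFFu) << 8);
}
```

Since `uint32_t` has a guaranteed width of 32 bits, the shift and mask arithmetic behaves the same on every compiler that accepts the code.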

✨ New Features

  • Optimized dequantization for iq2xxs, iq2xs, and iq3xxs formats on CUDA by loading all 8 int8 values for a grid position at once.
  • Implemented sign calculation via popcount instead of fetching from the ksigns table in CUDA dequantization.
  • Simplified sum scaling for iq2xxs in CUDA, saving 3 registers in mul_mat_vec_q.
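
The popcount-based sign calculation can be illustrated on the host. The sketch below assumes that the sign table maps each 7-bit index to an 8-bit mask whose eighth bit is the parity of the other seven, so every mask has an even number of set bits; that layout, the function names, and the "bit i set means negate value i" convention are assumptions for illustration, not the kernel's actual code. On the device, the `__popc` intrinsic would be the natural counterpart to the bit count used here.

```cpp
#include <bitset>
#include <cstdint>

// Rebuild an 8-bit sign mask from a 7-bit index without a table lookup:
// the eighth bit is the parity (popcount & 1) of the low seven bits.
inline uint8_t signs_from_index(uint8_t idx7) {
    const uint8_t low = idx7 & 0x7F;
    const unsigned parity = static_cast<unsigned>(std::bitset<8>(low).count() & 1u);
    return static_cast<uint8_t>(low | (parity << 7));
}

// Apply the mask to eight int8 values that were loaded together:
// bit i set means value i is negated before it is added to the sum.
inline int apply_signs_and_sum(const int8_t (&vals)[8], uint8_t signs) {
    int sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += ((signs >> i) & 1) ? -vals[i] : vals[i];
    }
    return sum;
}
```

For example, index 1 has odd parity, so the sketch yields the mask 129 (bits 0 and 7 set), while index 3 has even parity and stays 3.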

🐛 Bug Fixes

  • Fixed compilation error caused by the undefined identifier "uint" by replacing it with "uint32_t".

Affected Symbols