b8064
📦 llama-cpp
⚠ 2 breaking · ✨ 3 features · 🐛 1 fix · 🔧 3 symbols
Summary
This release optimizes CUDA dequantization for the iq2xxs, iq2xs, and iq3xxs formats, saving registers and simplifying the arithmetic, and fixes a compilation error caused by the undefined identifier `uint`.
⚠️ Breaking Changes
- The `uint` type alias was removed from the CUDA code and replaced with `uint32_t`. Code that used `uint` directly will fail to compile.
- The IQ2XXS sum scaling logic was simplified: `(sum * scale + sum / 2) / 4` became `(sum * (scale * 2 + 1)) / 8`, and `((aux32 >> 28) * 2 + 1)` became `(aux32 >> 27 | 1)`. This is an internal implementation change, but it could affect code that inspects or depends on the exact intermediate calculations.
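
To get a feel for how far the rewritten expressions can diverge from the old ones, the following host-side C++ sketch compares both forms directly. It assumes `sum` is a signed `int`, `scale` is a small non-negative integer, and `aux32` is a `uint32_t`. The function names and the tested range are hypothetical and do not come from the llama.cpp sources.

```cpp
#include <cassert>
#include <cstdint>

// Old and new forms of the IQ2XXS sum scaling, as quoted in the release note.
constexpr int scale_sum_old(int sum, int scale) { return (sum * scale + sum / 2) / 4; }
constexpr int scale_sum_new(int sum, int scale) { return (sum * (scale * 2 + 1)) / 8; }

// Old and new forms of the scale factor derived from the top bits of aux32.
constexpr uint32_t scale_bits_old(uint32_t aux32) { return (aux32 >> 28) * 2 + 1; }
constexpr uint32_t scale_bits_new(uint32_t aux32) { return (aux32 >> 27 | 1); }

// Compare the two sum-scaling forms for every sum in [-max_abs_sum, max_abs_sum]
// and every 4-bit scale (the range is an assumption chosen for illustration).
inline bool sum_forms_agree(int max_abs_sum) {
    for (int scale = 0; scale < 16; ++scale) {
        for (int sum = -max_abs_sum; sum <= max_abs_sum; ++sum) {
            if (scale_sum_old(sum, scale) != scale_sum_new(sum, scale)) {
                return false;
            }
        }
    }
    return true;
}

// Compare the two bit expressions for every possible value of the top 5 bits;
// the lower 27 bits are shifted out by both forms.
inline bool bit_forms_agree() {
    for (uint32_t top = 0; top < 32; ++top) {
        const uint32_t aux32 = top << 27;
        if (scale_bits_old(aux32) != scale_bits_new(aux32)) {
            return false;
        }
    }
    return true;
}
```

Under these assumptions both checks pass, which suggests the change is a rearrangement of the same integer arithmetic and not a change in results. Code that depends on the individual intermediate values should still be re-verified against the actual kernels.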
Migration Steps
- If your CUDA-related code used the `uint` type alias, replace every occurrence with `uint32_t`.
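
A minimal sketch of that replacement, written as plain C++ so it can be compiled on the host. The function `top_scale_bits` is a hypothetical example and not a symbol from llama.cpp; the point is only that `uint32_t` comes from `<cstdint>` and has a fixed width on every platform.

```cpp
#include <cstdint>

// Before: relies on a `uint` alias that is not defined on every toolchain.
//     uint top_scale_bits(uint aux32) { return aux32 >> 28; }

// After: the fixed-width type from <cstdint> is always available.
inline uint32_t top_scale_bits(uint32_t aux32) {
    return aux32 >> 28;  // keep only the top 4 bits
}

static_assert(sizeof(uint32_t) == 4, "uint32_t is exactly 32 bits wide");
```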
✨ New Features
- Optimized dequantization for iq2xxs, iq2xs, and iq3xxs formats on CUDA by loading all 8 int8 values for a grid position at once.
- Implemented sign calculation via popcount instead of fetching from the ksigns table in CUDA dequantization.
- Simplified sum scaling for iq2xxs in CUDA, saving 3 registers in mul_mat_vec_q.
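
The popcount-based sign calculation can be illustrated with a small host-side C++ sketch. It rests on an assumption the release note does not confirm: each sign-table entry repeats its 7-bit index in the low bits and sets the eighth bit so that the total number of set bits is even. If that holds, the eighth sign is just the parity of the index and no table lookup is needed. The helper names are hypothetical, and `std::bitset::count` stands in for the device popcount intrinsic.

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical helper: rebuild an 8-bit sign mask from a 7-bit index.
// Assumption: the eighth sign bit equals the parity of the seven stored bits.
inline uint8_t signs_from_popcount(uint8_t idx7) {
    const uint8_t low = idx7 & 0x7F;
    const unsigned parity = static_cast<unsigned>(std::bitset<8>(low).count() & 1u);
    return static_cast<uint8_t>(low | (parity << 7));
}

// Apply a sign mask to 8 int8 values: if bit i is set, value i is negated.
inline void apply_signs(int8_t (&values)[8], uint8_t signs) {
    for (int i = 0; i < 8; ++i) {
        if ((signs >> i) & 1) {
            values[i] = static_cast<int8_t>(-values[i]);
        }
    }
}
```

For example, under the stated assumption index 1 maps to the mask `0x81`, which negates the first and last of the eight values.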
🐛 Bug Fixes
- Fixed compilation error caused by the undefined identifier "uint" by replacing it with "uint32_t".