Change8

b8701

📦 llama-cpp
2 features · 🐛 3 fixes · 🔧 3 symbols

Summary

This release speeds up the q4_0 and q4_1 mmq kernels on AMD GPUs by using ds_read_b128 instructions for LDS loads, and includes several bug fixes and code cleanup in the CUDA implementation.

✨ New Features

  • Implemented ds_read_b128 instructions for q4_0 and q4_1 mmq kernels in ggml-cuda, which saves LDS bandwidth and improves performance on MI50 and RX6800XT.
  • Made the vectorized LDS load generic by implementing it with the ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions.
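
To illustrate the idea behind a vectorized load, here is a minimal host-side C++ sketch. It is not the ggml-cuda code: `max_cpy_bytes`, `memcpy_1`, and `load_tile` are hypothetical stand-ins, and the 16-byte width is an assumption chosen to match a 128-bit read such as ds_read_b128. The sketch only shows how a loop can step by the widest copy size instead of one 32-bit value at a time.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a "widest supported copy" query.
// Assumption: 16 bytes (128 bits), the width of a b128 read.
constexpr int max_cpy_bytes() { return 16; }

// Hypothetical stand-in for a single fixed-width copy.
// A compile-time size lets the compiler emit one wide move
// instead of several narrow ones.
template <int NBYTES>
inline void memcpy_1(void * dst, const void * src) {
    static_assert(NBYTES == 4 || NBYTES == 8 || NBYTES == 16,
                  "unsupported copy width");
    std::memcpy(dst, src, NBYTES);
}

// Copy n_ints 32-bit values in chunks of max_cpy_bytes().
// Returns the number of copy operations issued.
inline int load_tile(int32_t * dst, const int32_t * src, int n_ints) {
    constexpr int ints_per_cpy = max_cpy_bytes() / sizeof(int32_t);
    assert(n_ints % ints_per_cpy == 0); // sketch handles whole chunks only

    int n_ops = 0;
    // The loop stride is the copy width in elements, not 1.
    for (int i = 0; i < n_ints; i += ints_per_cpy) {
        memcpy_1<max_cpy_bytes()>(dst + i, src + i);
        ++n_ops;
    }
    return n_ops;
}
```

With these assumptions, copying eight 32-bit values takes two 16-byte copies rather than eight 4-byte ones, which is where the bandwidth saving comes from.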

🐛 Bug Fixes

  • Fixed max_cpy usage in the loading loop.
  • Fixed a typo in the q4_1 kernel.
  • Removed trailing blank lines and whitespace in mmq.cuh.

Affected Symbols