b8701
📦 llama-cpp
✨ 2 features · 🐛 3 fixes · 🔧 3 symbols
Summary
This release improves the performance of the q4_0 and q4_1 mmq kernels on AMD GPUs by using ds_read_b128 loads, and includes bug fixes and code cleanup in the CUDA implementation.
✨ New Features
- Used ds_read_b128 (128-bit LDS read) instructions in the q4_0 and q4_1 mmq kernels in ggml-cuda, which saves LDS bandwidth and improves performance on MI50 and RX6800XT.
- Updated the vectorized LDS load to use the ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions, making the implementation generic.
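
The idea behind the vectorized load can be sketched in plain host-side C++: ask for the widest copy the target supports, then move a tile row in chunks of that width instead of one 32-bit value at a time. This is only an illustration under stated assumptions, not the code from mmq.cuh. `max_cpy_bytes`, `copy_chunk`, and `load_tile_row` are hypothetical stand-ins; the real `ggml_cuda_get_max_cpy_bytes` and `ggml_cuda_memcpy_1` are device functions whose signatures are not given in these notes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a per-architecture query of the widest single copy.
// 16 bytes matches one 128-bit read such as ds_read_b128 (assumed value).
constexpr int max_cpy_bytes() { return 16; }

// Hypothetical helper: copy exactly NBYTES with one fixed-size memcpy, which
// compilers can usually lower to a single wide load/store pair.
template <int NBYTES>
inline void copy_chunk(void * dst, const void * src) {
    static_assert(NBYTES == 4 || NBYTES == 8 || NBYTES == 16, "unsupported width");
    std::memcpy(dst, src, NBYTES);
}

// Copy n_ints 32-bit values in chunks of the widest supported width.
// Returns the number of copy operations issued.
inline int load_tile_row(int32_t * dst, const int32_t * src, int n_ints) {
    constexpr int ints_per_cpy = max_cpy_bytes() / static_cast<int>(sizeof(int32_t));
    assert(n_ints % ints_per_cpy == 0); // the row must be a whole number of chunks
    int n_ops = 0;
    for (int i = 0; i < n_ints; i += ints_per_cpy) {
        copy_chunk<max_cpy_bytes()>(dst + i, src + i);
        ++n_ops;
    }
    return n_ops;
}
```

With a 16-byte width, a row of eight 32-bit values takes two copies rather than eight, which is the kind of reduction in load instructions the feature above is aiming for. Deriving the loop stride from the queried width, rather than hard-coding it, is what lets the same loop serve hardware with narrower maximum copies.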
🐛 Bug Fixes
- Fixed max_cpy usage in the loading loop.
- Fixed typo in q4_1 kernel.
- Removed trailing blank lines and whitespace in mmq.cuh.