b9827

📅 Jun 27, 2026 📦 llama-cpp

✨ 2 features 🐛 1 fix 🔧 2 symbols

Summary

This release adds a CUDA fast path to ggml_cuda_cpy that performs 2D strided copies with cudaMemcpy2DAsync. The new path also fixes an issue in GDN recurrent snapshot updates. In the OpenVINO backend, strided copies are reported as unsupported for now, pending further fixes.

Migration Steps

The OpenVINO backend now returns unsupported for strided copies, because the new tests for this case are failing there.

✨ New Features

Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy for same-type, same-shape strided copies that are 2D pitched block copies.
Implemented an optimized strided copy path using cudaMemcpy2DAsync for tensors that are not fully contiguous but whose individual rows are contiguous.
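To make the "2D pitched block copy" idea concrete, here is a minimal host-only C++ sketch of the copy semantics. It is not the ggml_cuda_cpy implementation. The names PitchedView, is_pitched_copy, copy_2d and pack_rows_demo are hypothetical and exist only for illustration. copy_2d takes its arguments in the same order as the first six parameters of cudaMemcpy2DAsync (dst, dpitch, src, spitch, width, height), but runs a plain memcpy per row on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical description of a 2D byte layout: `height` rows, each with
// `width` contiguous payload bytes, consecutive rows starting `pitch` bytes apart.
struct PitchedView {
    std::size_t width;
    std::size_t height;
    std::size_t pitch;
};

// Hypothetical eligibility check: both sides have the same shape and every
// row is contiguous (pitch >= width), so any gaps lie only between rows.
bool is_pitched_copy(const PitchedView & src, const PitchedView & dst) {
    return src.width == dst.width && src.height == dst.height &&
           src.pitch >= src.width && dst.pitch >= dst.width;
}

// CPU analogue of a 2D pitched copy: copies `height` rows of `width` bytes,
// advancing the source by `spitch` and the destination by `dpitch` per row.
void copy_2d(void * dst, std::size_t dpitch,
             const void * src, std::size_t spitch,
             std::size_t width, std::size_t height) {
    assert(dpitch >= width && spitch >= width);
    auto * d = static_cast<unsigned char *>(dst);
    const auto * s = static_cast<const unsigned char *>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dpitch, s + row * spitch, width);
    }
}

// Demo: 4 rows of 3 payload bytes stored 8 bytes apart (5 gap bytes per row)
// are packed into a dense 4x3 destination buffer.
std::vector<unsigned char> pack_rows_demo() {
    const std::size_t width = 3, height = 4, spitch = 8;
    std::vector<unsigned char> src(spitch * height, 0xFF); // 0xFF marks gap bytes
    for (std::size_t r = 0; r < height; ++r) {
        for (std::size_t c = 0; c < width; ++c) {
            src[r * spitch + c] = static_cast<unsigned char>(r * 10 + c);
        }
    }
    std::vector<unsigned char> dst(width * height, 0);
    copy_2d(dst.data(), width, src.data(), spitch, width, height);
    return dst;
}
```

On the GPU, the same six arguments would be passed to cudaMemcpy2DAsync together with a cudaMemcpyKind and a stream, so the whole strided region is handled by one pitched copy call.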

🐛 Bug Fixes

Fixed the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps; the new optimized copy path handles this layout.

Affected Symbols

ggml_cuda_cpy
cudaMemcpy2DAsync