b9827
📦 llama-cpp
✨ 2 features · 🐛 1 fix · 🔧 2 symbols
Summary
This release adds a CUDA fast path that performs 2D strided copies with cudaMemcpy2DAsync, which also fixes an issue in GDN recurrent snapshot updates. The OpenVINO backend reports strided copies as unsupported for now, because the new tests fail there.
Migration Steps
- The OpenVINO backend now returns unsupported for strided copies, because the new tests are currently failing on it.
✨ New Features
- Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy for same-type, same-shape strided copies that are 2D pitched block copies.
- The fast path applies when tensors are not fully contiguous but each row is contiguous, so the copy can be issued as a single cudaMemcpy2DAsync call.
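
As a rough illustration of the condition described above, the following host-only C++ sketch checks whether two same-type, same-shape strided tensors can be treated as one 2D pitched block, and derives the width, height, and pitch values that a call such as cudaMemcpy2DAsync(dst, dpitch, src, spitch, width, height, kind, stream) would take. The `TensorView` and `Pitched2D` structs and the functions `rows_uniform`, `as_pitched_2d`, and `copy_pitched` are hypothetical names invented for this example; they are not the actual ggml_cuda_cpy code, and the check is deliberately conservative. `copy_pitched` emulates the pitched copy on the CPU with one memcpy per row so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a tensor view: ne = elements per dimension,
// nb = byte stride per dimension (nb[0] is the stride between elements).
struct TensorView {
    int64_t ne[4];
    size_t  nb[4];
    size_t  elem_size;  // bytes per element
    int     type;       // opaque type id, only compared for equality
};

// Parameters in the shape a 2D pitched copy expects (all in bytes except height).
struct Pitched2D {
    size_t width;   // bytes copied per row
    size_t height;  // number of rows
    size_t spitch;  // source row pitch
    size_t dpitch;  // destination row pitch
};

// True when each row is contiguous and dims 1..3 collapse into rows with one
// uniform pitch. Conservative: some valid layouts are rejected.
static bool rows_uniform(const TensorView & t) {
    return t.nb[0] == t.elem_size
        && t.nb[1] >= (size_t) t.ne[0] * t.elem_size
        && t.nb[2] == t.nb[1] * (size_t) t.ne[1]
        && t.nb[3] == t.nb[2] * (size_t) t.ne[2];
}

// Fills `out` and returns true if src -> dst can be done as one pitched copy.
static bool as_pitched_2d(const TensorView & src, const TensorView & dst, Pitched2D & out) {
    if (src.type != dst.type || src.elem_size != dst.elem_size) return false;
    for (int i = 0; i < 4; ++i) {
        if (src.ne[i] != dst.ne[i]) return false;
    }
    if (!rows_uniform(src) || !rows_uniform(dst)) return false;
    out.width  = (size_t) src.ne[0] * src.elem_size;
    out.height = (size_t) (src.ne[1] * src.ne[2] * src.ne[3]);
    out.spitch = src.nb[1];
    out.dpitch = dst.nb[1];
    return true;
}

// CPU emulation of a pitched copy: one memcpy per row, skipping the padding
// between rows. On the GPU this loop would be a single cudaMemcpy2DAsync call.
static void copy_pitched(void * dst, const void * src, const Pitched2D & p) {
    assert(p.width <= p.spitch && p.width <= p.dpitch);
    for (size_t r = 0; r < p.height; ++r) {
        std::memcpy((char *) dst + r * p.dpitch,
                    (const char *) src + r * p.spitch,
                    p.width);
    }
}
```

For example, a source of 3 rows of 4 four-byte elements stored with a 32-byte row stride, copied into a fully contiguous destination, yields width 16, height 3, source pitch 32, and destination pitch 16. A tensor whose element stride nb[0] differs from the element size is rejected and would need a different copy path.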
🐛 Bug Fixes
- Fixed GDN recurrent snapshot updates with -np 4, where rollback slots were separated by cache stride gaps; the new optimized copy path handles this case.