b8776
📦 llama-cpp
✨ 1 feature · 🐛 1 fix · 🔧 3 symbols
Summary
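This release restricts DeviceSegmentedSort to immediate mode because it cannot be captured in a CUDA graph. When running in CUDA graph mode, sorting now dispatches to DeviceSegmentedRadixSort instead, which keeps graph capture stable. The release also includes performance comparisons between the two sorting methods.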
✨ New Features
- Added a test case that enforces dispatch to DeviceSegmentedRadixSort when running in CUDA graph mode.
🐛 Bug Fixes
- Limited DeviceSegmentedSort to immediate mode because it cannot be captured in a CUDA graph. Graph mode now falls back to the slower DeviceSegmentedRadixSort.
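
The dispatch rule described in this fix can be illustrated with a minimal host-only sketch. This is not the llama.cpp implementation: `ExecMode`, `SortBackend`, and `select_sort_backend` are hypothetical names chosen for illustration, and the code makes no CUDA or CUB calls. In real CUDA code, the capture state of a stream can be queried with `cudaStreamIsCapturing`; whether llama.cpp uses that call for this decision is an assumption not confirmed by these notes.

```cpp
#include <cassert>

// Hypothetical model of the two execution modes named in the release notes.
enum class ExecMode { Immediate, GraphCapture };

// The two CUB segmented sort algorithms the notes refer to.
enum class SortBackend { DeviceSegmentedSort, DeviceSegmentedRadixSort };

// DeviceSegmentedSort is selected only in immediate mode. While a CUDA graph
// is being captured, fall back to the slower DeviceSegmentedRadixSort.
constexpr SortBackend select_sort_backend(ExecMode mode) {
    return mode == ExecMode::GraphCapture
               ? SortBackend::DeviceSegmentedRadixSort
               : SortBackend::DeviceSegmentedSort;
}

// Compile-time checks that mirror the intent of the new test case:
// graph mode must always dispatch to the radix sort.
static_assert(select_sort_backend(ExecMode::GraphCapture) ==
                  SortBackend::DeviceSegmentedRadixSort,
              "graph mode must use DeviceSegmentedRadixSort");
static_assert(select_sort_backend(ExecMode::Immediate) ==
                  SortBackend::DeviceSegmentedSort,
              "immediate mode may use DeviceSegmentedSort");
```

Keeping the choice in one small function means the fallback is decided in a single place, so a test can assert the graph-mode path without launching a sort.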