b9731
📦 llama-cpp
✨ 1 feature · 🔧 1 symbol
Summary
This release speeds up the server component: token probabilities are now partially sorted, so only the requested top-n entries are ordered rather than the full vocabulary. In the reported benchmark this cuts the cost per operation by roughly 12x. The release also ships pre-compiled binaries for a range of hardware and operating system configurations.
✨ New Features
- Optimized token probability retrieval in the server: std::partial_sort now orders only the requested top-n tokens instead of sorting the full vocabulary. The reported benchmark improves from 8555.6 us/op to 704.3 us/op (vocab=128000, n_top=0, iters=100).
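The idea behind the change can be illustrated with a minimal sketch. This is not the server's actual code: the TokenProb struct and the top_n_tokens function are hypothetical names introduced here for illustration. Only std::partial_sort itself is taken from the release note. It places the k largest elements, in order, at the front of the range and leaves the rest in unspecified order, which costs roughly N log k comparisons instead of the N log N of a full sort.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration; not the struct used by the server.
struct TokenProb {
    int   id;    // token index in the vocabulary
    float prob;  // probability assigned to that token
};

// Hypothetical helper: return the n_top most probable tokens, most probable first.
// Only the first n_top positions are sorted; the tail is never fully ordered.
std::vector<TokenProb> top_n_tokens(const std::vector<float>& probs, std::size_t n_top) {
    std::vector<TokenProb> items;
    items.reserve(probs.size());
    for (std::size_t i = 0; i < probs.size(); ++i) {
        items.push_back({static_cast<int>(i), probs[i]});
    }

    // Clamp so that a request larger than the vocabulary stays in range.
    const std::size_t k = std::min(n_top, items.size());

    std::partial_sort(items.begin(),
                      items.begin() + static_cast<std::ptrdiff_t>(k),
                      items.end(),
                      [](const TokenProb& a, const TokenProb& b) {
                          return a.prob > b.prob;  // descending by probability
                      });

    items.resize(k);  // drop the unsorted tail
    return items;
}
```

Calling top_n_tokens(probs, 10) on a 128000-entry vocabulary would order only ten entries, which is where the saving comes from when n_top is much smaller than the vocabulary size.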