Change8

b9731

📦 llama-cpp
1 feature 🔧 1 symbol

Summary

This release focuses on performance optimization within the server component by implementing partial sorting for token probabilities, leading to substantial speed gains. It also provides numerous pre-compiled binaries for diverse hardware and operating system configurations.

✨ New Features

  • Optimized token probability retrieval in the server: std::partial_sort now orders only the requested top-n tokens instead of sorting the full vocabulary. In the reported benchmark (vocab=128000, n_top=0, iters=100), time per operation dropped from 8555.6 us/op to 704.3 us/op, roughly a 12x speedup.
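
The technique can be illustrated with a minimal sketch. This is not the server's actual code: the `TokenProb` struct and the `top_n_probs` function are hypothetical names chosen for illustration, and the only thing assumed from the release note is that `std::partial_sort` is applied to the requested top-n range.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical token/probability pair, for illustration only.
struct TokenProb {
    int   id;
    float prob;
};

// Returns the n_top most probable tokens in descending probability order.
// Only the first n_top positions are put in sorted order; the rest of the
// vocabulary is left in unspecified order and then discarded.
std::vector<TokenProb> top_n_probs(std::vector<TokenProb> cur, std::size_t n_top) {
    n_top = std::min(n_top, cur.size());  // never ask for more than exists
    const auto middle = cur.begin() + static_cast<std::ptrdiff_t>(n_top);
    std::partial_sort(cur.begin(), middle, cur.end(),
                      [](const TokenProb & a, const TokenProb & b) {
                          return a.prob > b.prob;  // higher probability first
                      });
    cur.resize(n_top);
    return cur;
}
```

Calling `top_n_probs(probs, 10)` on a 128000-entry vector would order only ten entries. `std::partial_sort` performs roughly N log n_top comparisons, whereas a full `std::sort` performs on the order of N log N, which is where a gain of this kind would come from when n_top is much smaller than the vocabulary.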

Affected Symbols