b8873
📦 llama-cpp
✨ 5 features · 🐛 4 fixes · 🔧 2 symbols
Summary
This release brings several improvements to the OpenVINO backend: per-request thread-safety guarantees, lower NPU memory usage through weightless caching, and new support for the GELU tanh activation and Imrope. The OpenVINO CI/CD pipelines were also restructured.
Migration Steps
- Use i4/i8 quantization directly for symmetric quantization cases in OpenVINO.
✨ New Features
- Implemented per-request thread-safety guarantees.
- Added support for the GELU tanh activation function.
- Added support for Imrope.
- Added WeightlessCacheAttribute to reduce NPU memory usage in the OpenVINO backend.
- Added GPU and NPU support to the OpenVINO Dockerfile.
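The GELU tanh item refers to the widely used tanh approximation of GELU, 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))). The short Python sketch below compares it with the exact erf-based definition. The function names `gelu_tanh` and `gelu_erf` are illustrative only and are not part of the llama.cpp, ggml, or OpenVINO APIs.

```python
import math

# sqrt(2 / pi), the scale factor used inside the tanh approximation
SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)


def gelu_tanh(x: float) -> float:
    """GELU, tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + math.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)))


def gelu_erf(x: float) -> float:
    """Exact GELU: 0.5*x*(1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    # Print both forms side by side for a few sample inputs.
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"x={x:5.1f}  tanh-approx={gelu_tanh(x):.6f}  exact={gelu_erf(x):.6f}")
```

For these sample inputs the two forms differ by less than 1e-3, so a backend that supports only the exact form cannot simply substitute it for a model trained with the tanh variant without a small numeric drift, and vice versa.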
🐛 Bug Fixes
- Fixed the RoPE YaRN case.
- Fixed sticky stateful configuration issues.
- Fixed explicit ov::Tensor frees in ggml_backend_openvino_free.
- Fixed thread-safety issues related to the shared runtime context.
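The two thread-safety items (per-request guarantees and the shared runtime context) describe a common pattern: state that belongs to one request is never shared, while anything shared across requests is guarded by a lock. The Python sketch below illustrates that general pattern only; it is not the OpenVINO backend's actual code, and the classes `SharedRuntimeContext` and `RequestState` and the `serve` helper are hypothetical names chosen for illustration.

```python
import threading


class SharedRuntimeContext:
    """Hypothetical shared context: caches 'compiled' models by key.
    Every access to the shared cache goes through a single lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._compiled: dict[str, str] = {}
        self.compile_count = 0

    def get_or_compile(self, key: str) -> str:
        with self._lock:
            if key not in self._compiled:
                self.compile_count += 1
                # Stand-in for an expensive model compilation step.
                self._compiled[key] = f"compiled:{key}"
            return self._compiled[key]


class RequestState:
    """Hypothetical per-request state: owned by exactly one thread."""

    def __init__(self, ctx: SharedRuntimeContext, model_key: str) -> None:
        self.model = ctx.get_or_compile(model_key)
        self.outputs: list[int] = []

    def run(self, tokens: list[int]) -> int:
        # Mutates only this request's own buffers, so no lock is needed here.
        self.outputs.append(sum(tokens))
        return self.outputs[-1]


def serve(ctx: SharedRuntimeContext, n_threads: int = 8) -> list:
    """Run one request per thread against the same shared context."""
    results = [None] * n_threads

    def worker(i: int) -> None:
        req = RequestState(ctx, "model-a")
        results[i] = req.run([i, i + 1])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the cache check and insert happen under the same lock, concurrent requests for the same model trigger only one compilation, while each request's outputs stay isolated in its own `RequestState`.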