
b8873

📦 llama-cpp
5 features · 🐛 4 fixes · 🔧 2 symbols

Summary

This release improves the OpenVINO backend: it adds per-request thread-safety guarantees, reduces NPU memory usage through weightless caching, and adds support for GELU tanh and Imrope. The OpenVINO CI/CD pipelines were also restructured.

Migration Steps

  1. Use i4/i8 quantization directly for symmetric quantization cases in OpenVINO.
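For readers unfamiliar with the term, symmetric quantization maps values to signed integers with a zero-point fixed at 0, so only a scale has to be stored. The following is a generic sketch of that idea, not the OpenVINO backend's implementation; `SymQuant`, `quantize_symmetric`, and `dequantize` are hypothetical names chosen for illustration, and the code assumes C++17 for `std::clamp`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container: quantized values plus the single shared scale.
struct SymQuant {
    std::vector<int8_t> q;  // i4 values are stored in int8_t here for simplicity
    float scale;
};

// Symmetric quantization to `bits` (4 or 8): the zero-point is always 0 and
// the integer range is [-qmax, qmax] with qmax = 2^(bits-1) - 1.
inline SymQuant quantize_symmetric(const std::vector<float>& x, int bits) {
    const int qmax = (1 << (bits - 1)) - 1;  // 7 for i4, 127 for i8
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    // Avoid dividing by zero when every input is 0.
    const float scale = amax > 0.0f ? amax / qmax : 1.0f;

    SymQuant out{{}, scale};
    out.q.reserve(x.size());
    for (float v : x) {
        const int r = static_cast<int>(std::lround(v / scale));
        out.q.push_back(static_cast<int8_t>(std::clamp(r, -qmax, qmax)));
    }
    return out;
}

// Dequantization needs only the scale because the zero-point is 0.
inline float dequantize(int8_t q, float scale) { return q * scale; }
```

Because no zero-point is involved, values in this form can in principle be passed through as i4/i8 without an offset correction, which is presumably why the symmetric case can be handled directly.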

✨ New Features

  • Implemented thread safety guarantees per request.
  • Added support for the GELU tanh activation function.
  • Added support for Imrope.
  • Added WeightlessCacheAttribute to reduce NPU memory usage for OpenVINO.
  • Added GPU and NPU support to the OpenVINO Dockerfile.
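
As background for the GELU tanh item above, "GELU tanh" usually refers to the tanh approximation of the GELU activation, 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))). The sketch below shows that formula next to the exact erf-based form; it is a reference illustration only, not the OpenVINO backend's code, and the function names `gelu_tanh` and `gelu_erf` are hypothetical.

```cpp
#include <cmath>

// Tanh approximation of GELU:
//   0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
inline float gelu_tanh(float x) {
    constexpr float kSqrt2OverPi = 0.7978845608028654f;  // sqrt(2/pi)
    constexpr float kCoef = 0.044715f;
    return 0.5f * x * (1.0f + std::tanh(kSqrt2OverPi * (x + kCoef * x * x * x)));
}

// Exact GELU via the error function, for comparison:
//   0.5 * x * (1 + erf(x / sqrt(2)))
inline float gelu_erf(float x) {
    constexpr float kInvSqrt2 = 0.7071067811865476f;  // 1/sqrt(2)
    return 0.5f * x * (1.0f + std::erf(x * kInvSqrt2));
}
```

The two forms agree closely over the usual activation range, so the approximation is often preferred where `tanh` is cheaper than `erf` on the target device.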

🐛 Bug Fixes

  • Fixed the RoPE YaRN case.
  • Fixed sticky stateful configuration issues.
  • Fixed explicit ov::Tensor frees in ggml_backend_openvino_free.
  • Fixed thread-safety issues related to the shared runtime context.

Affected Symbols