
v4.3.0

Breaking Changes
📦 localai
2 breaking, 9 features, 3 fixes, 9 symbols

Summary

LocalAI 4.3.0 hardens security with keyless cosign signatures for backend images and significantly improves performance by enabling the llama-cpp prompt cache by default. This release also introduces detailed per-API-key usage tracking and major stability improvements for Distributed Mode.

⚠️ Breaking Changes

  • With strict backend integrity checking enabled via `--require-backend-integrity` or `LOCALAI_REQUIRE_BACKEND_INTEGRITY=true`, a backend image that lacks the required policy or SHA256 digest now causes a hard failure, whereas previously it might have only produced a warning.
  • In Distributed Mode, model loading logic was changed to ensure per-request routing across replicas. Previously, the first request pinned subsequent traffic to the node that served it. Custom logic that relied on this pinning will no longer work, because traffic is now load-balanced across available replicas.
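
The routing change can be illustrated with a small Python sketch. This is not LocalAI's implementation: the class names, the replica names, and the round-robin strategy are assumptions made for illustration, since the notes only state that traffic is load-balanced per request.

```python
import itertools
import threading
from typing import Iterable, Optional


class PinnedRouter:
    """Old behavior as described: the first pick is cached and reused."""

    def __init__(self, replicas: Iterable[str]) -> None:
        self._replicas = list(replicas)
        if not self._replicas:
            raise ValueError("at least one replica is required")
        self._cached: Optional[str] = None

    def pick(self) -> str:
        # Every call after the first returns the same replica.
        if self._cached is None:
            self._cached = self._replicas[0]
        return self._cached


class PerRequestRouter:
    """New behavior as described: a replica is chosen for each request.

    Round-robin is an assumed strategy, not taken from the release notes.
    """

    def __init__(self, replicas: Iterable[str]) -> None:
        items = list(replicas)
        if not items:
            raise ValueError("at least one replica is required")
        self._cycle = itertools.cycle(items)
        self._lock = threading.Lock()

    def pick(self) -> str:
        # The lock keeps the rotation consistent under concurrent requests.
        with self._lock:
            return next(self._cycle)


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical replica names
    pinned = PinnedRouter(nodes)
    balanced = PerRequestRouter(nodes)
    print([pinned.pick() for _ in range(4)])
    print([balanced.pick() for _ in range(4)])
```

Anything that assumed a model always answers from the same node (for example, node-local state keyed by model name) corresponds to `PinnedRouter` here and is what the breaking change affects.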

Migration Steps

  1. If you rely on strict backend integrity, enable `--require-backend-integrity` or set `LOCALAI_REQUIRE_BACKEND_INTEGRITY=true` and ensure your gallery YAML includes the necessary `verification:` block.
  2. If you see unexpected load-balancing behavior in Distributed Mode, review your configuration, since per-request replica routing is now enforced.
  3. If you need to disable the new default prompt caching for llama.cpp models, set `prompt_cache_all: false` or use `options: ["kv_unified:false"]` in your model YAML.
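
As a rough sketch of step 3, a model YAML that opts out of the new default might look like the following. Only `prompt_cache_all: false` and `options: ["kv_unified:false"]` come from these notes. The model name and file are hypothetical placeholders, and the surrounding fields assume the usual LocalAI model definition layout.

```yaml
# Hypothetical model definition; the name and model file are placeholders.
name: my-model
backend: llama-cpp
parameters:
  model: my-model.gguf

# Option A (from the migration steps): disable prompt caching for this model.
prompt_cache_all: false

# Option B (from the migration steps): turn off the unified KV setting instead.
# options:
#   - "kv_unified:false"
```

The migration step presents the two settings as alternatives, so the sketch enables one and leaves the other commented out.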

✨ New Features

  • Backend OCI images now ship with keyless cosign signatures for enhanced integrity verification.
  • Introduced a per-gallery `verification:` policy in YAML to control signature verification, including an opt-in strict mode.
  • The `llama-cpp` server-side prompt cache is enabled by default, significantly speeding up repeated system prompts.
  • Usage tracking now includes a per-API-key + per-user Sources view for administrators to attribute traffic.
  • Distributed Mode received optimizations including per-request replica routing and cached health probes.
  • Asynchronous installation progress streaming is available for per-node backend installs via the gallery job queue.
  • L4T13 (cu130/aarch64) backends for `vllm`, `sglang`, and `vllm-omni` are restored using PyPI aarch64+cu130 wheels.
  • A Nix Flake (`flake.nix`) is provided for Dockerless setup on NixOS.
  • API and backend trace payloads are capped using `LOCALAI_TRACING_MAX_BODY_BYTES` to maintain UI responsiveness.

🐛 Bug Fixes

  • Fixed an issue in Distributed Mode where model loading cached a client bound to a single node/replica, causing subsequent requests to be pinned to that node instead of load-balanced.
  • Fixed an issue where `probeHealth` checks in Distributed Mode could stall requests due to serialization against in-flight predict calls; health checks are now memoized and coalesced.
  • Fixed `llama-cpp` prompt cache initialization: `kv_unified=true` is now the default, ensuring the host prompt cache is correctly allocated and written across requests.
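
To make "memoized and coalesced" concrete, here is a minimal Python sketch of the general technique behind the health-check fix. It is not the actual `probeHealth` code: the class name, the TTL value, and the injected `probe` callable are assumptions for illustration.

```python
import threading
import time
from typing import Callable, Optional


class CachedHealthProbe:
    """Memoize a probe result for `ttl` seconds and coalesce concurrent callers.

    Hypothetical helper: `probe` stands in for whatever call checks a node.
    """

    def __init__(self, probe: Callable[[], bool], ttl: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._result: Optional[bool] = None
        self._checked_at = 0.0
        self._inflight: Optional[threading.Event] = None

    def healthy(self) -> bool:
        while True:
            with self._lock:
                if (self._result is not None
                        and self._clock() - self._checked_at < self._ttl):
                    return self._result  # memoized: no probe is issued
                if self._inflight is None:
                    # No probe is running, so this caller becomes the leader.
                    done = threading.Event()
                    self._inflight = done
                    leader = True
                else:
                    # A probe is already running; wait for it instead.
                    done = self._inflight
                    leader = False
            if not leader:
                done.wait()
                continue  # re-check the cache filled in by the leader
            try:
                result = self._probe()  # runs outside the lock
            except Exception:
                result = False  # a failing probe counts as unhealthy
            with self._lock:
                self._result = result
                self._checked_at = self._clock()
                self._inflight = None
            done.set()
            return result
```

Because the probe runs outside the lock and followers wait on an event, a slow check delays at most one caller per TTL window rather than serializing every request behind it.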

Affected Symbols