
b9499

📦 llama-cpp
5 features, 2 affected symbols

Summary

This release contains internal refactoring in ggml-webgpu. It begins a FlashAttention refactor and standardizes how quantized types are handled across the affected modules.

✨ New Features

  • Began refactoring FlashAttention in ggml-webgpu.
  • Standardized quantization support across ggml-webgpu components.
  • Split the quantization logic for K (key) and V (value) tensors.
  • Refactored and abstracted quantization logic for flash_attn and mul_mat.
  • Added quantization support to the tile path.
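To illustrate what "split k/v" and a shared quantization abstraction for flash_attn and mul_mat can mean, here is a minimal CPU-side Python sketch. It is not the ggml-webgpu code, which runs as WebGPU shaders. All function names are hypothetical. The block format mirrors ggml's Q8_0 layout (32 signed 8-bit values sharing one scale), but the scale is kept as a Python float rather than fp16. The attention function is plain softmax attention for a single query, not the tiled FlashAttention algorithm. Its purpose is to show one shared dequantization entry point reused by both a matrix-vector product and attention, with K and V allowed to use different types.

```python
import math
from dataclasses import dataclass
from typing import List

QK8_0 = 32  # Q8_0-style block size: 32 int8 values share one scale


@dataclass
class Q8Block:
    scale: float     # per-block scale (fp16 in ggml; a Python float here)
    qs: List[int]    # 32 quantized values in [-127, 127]


def quantize_q8_0(xs: List[float]) -> List[Q8Block]:
    """Quantize a float vector into Q8_0-style blocks (length must be a multiple of 32)."""
    assert len(xs) % QK8_0 == 0
    blocks = []
    for i in range(0, len(xs), QK8_0):
        chunk = xs[i:i + QK8_0]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0                    # largest magnitude maps to +/-127
        inv = 1.0 / d if d else 0.0         # all-zero block: every q is 0
        blocks.append(Q8Block(d, [round(v * inv) for v in chunk]))
    return blocks


def dequantize(row, qtype: str) -> List[float]:
    """Hypothetical shared entry point: every consumer dequantizes through here."""
    if qtype == "f32":
        return list(row)
    if qtype == "q8_0":
        return [b.scale * q for b in row for q in b.qs]
    raise ValueError(f"unsupported type: {qtype}")


def mul_mat_vec(w_rows, w_type: str, x: List[float]) -> List[float]:
    """Matrix-vector product over (possibly quantized) weight rows."""
    return [sum(a * b for a, b in zip(dequantize(r, w_type), x)) for r in w_rows]


def attention_single_query(q, k_rows, k_type: str, v_rows, v_type: str) -> List[float]:
    """Plain softmax attention; K and V carry independent quantization types."""
    keys = [dequantize(r, k_type) for r in k_rows]
    vals = [dequantize(r, v_type) for r in v_rows]
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)                          # subtract max for numerical stability
    ws = [math.exp(s - m) for s in scores]
    z = sum(ws)
    return [sum(w * v[i] for w, v in zip(ws, vals)) / z for i in range(len(vals[0]))]
```

Because both consumers call the same `dequantize`, adding a new type in one place makes it available to both. Keeping separate `k_type` and `v_type` parameters lets, for example, a Q8_0 key cache be paired with an f32 value cache.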

Affected Symbols