b9499
📦 llama-cpp
Summary
This release is an internal refactor of ggml-webgpu: it begins a FlashAttention refactor and standardizes the quantization logic used by the flash_attn and mul_mat code paths.
✨ New Features
- Began refactoring FlashAttention in ggml-webgpu.
- Standardized quantization support across ggml-webgpu components.
- Split the K/V quantization logic.
- Refactored and abstracted the quantization logic used by flash_attn and mul_mat.
- Added quantization support to the tile path.
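To illustrate what sharing quantization logic between a matrix-multiply path and an attention path can look like, here is a minimal CPU-side sketch in C++. It is not the ggml-webgpu implementation, which lives in GPU shaders. `BlockQ8`, `dequant_at`, `dot_quantized`, and `attn_score` are hypothetical names, and the block layout is a simplification (32 int8 values with one float scale, where ggml's Q8_0 stores the scale as fp16).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified quantized block: 32 int8 values sharing one scale.
// Assumption: a float scale is used here for brevity instead of fp16.
constexpr std::size_t kBlockSize = 32;

struct BlockQ8 {
    float scale;
    std::array<std::int8_t, kBlockSize> qs;
};

// Shared helper: dequantize the i-th element of a quantized row.
// Both consumers below go through this single function.
inline float dequant_at(const std::vector<BlockQ8>& blocks, std::size_t i) {
    const BlockQ8& b = blocks[i / kBlockSize];
    return b.scale * static_cast<float>(b.qs[i % kBlockSize]);
}

// mul_mat-style consumer: dot product of a quantized row with a float vector.
// The row must hold at least x.size() elements.
float dot_quantized(const std::vector<BlockQ8>& row, const std::vector<float>& x) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        acc += dequant_at(row, i) * x[i];
    }
    return acc;
}

// flash_attn-style consumer: scaled score of a query against a quantized K row.
float attn_score(const std::vector<BlockQ8>& k_row, const std::vector<float>& q,
                 float softmax_scale) {
    return softmax_scale * dot_quantized(k_row, q);
}
```

Because both consumers call the same `dequant_at`, supporting another block format in this sketch would mean changing one helper rather than each path separately.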