b7678
📦 llama-cpp
✨ 5 features · 🐛 12 fixes · 🔧 3 symbols
Summary
This release introduces an initial FlashAttention implementation for the WebGPU backend and brings notable performance gains through faster matrix and matrix/vector multiplication as well as subgroup matrix multiplication support in WebGPU. Numerous fixes were applied across the WebGPU and Wasm builds.
Migration Steps
- If using the WebGPU backend, note that shader replacements are now passed as a map instead of a pair of strings (see the sketch below).
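As a rough illustration of what this migration implies, the sketch below contrasts the old pair-of-strings style with a map keyed by placeholder name. The helper name `apply_repls` and the `{{KEY}}` placeholder syntax are assumptions for illustration, not the actual ggml-webgpu API.

```cpp
#include <map>
#include <string>

// Hypothetical shader-template substitution. Replacements used to be passed as
// (placeholder, value) string pairs; keying them by placeholder name in a map
// makes each placeholder unique and lookup independent of ordering.
static std::string apply_repls(std::string src,
                               const std::map<std::string, std::string> & repls) {
    for (const auto & [key, value] : repls) {
        const std::string token = "{{" + key + "}}";
        for (size_t pos = src.find(token); pos != std::string::npos; pos = src.find(token, pos)) {
            src.replace(pos, token.size(), value);
            pos += value.size();
        }
    }
    return src;
}

// Usage: fill a WGSL template for a given workgroup configuration.
// std::map<std::string, std::string> repls = {
//     {"WORKGROUP_SIZE", "256"},
//     {"TILE_K",         "32"},
// };
// std::string shader = apply_repls(wgsl_template, repls);
```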
✨ New Features
- Initial FlashAttention implementation for ggml webgpu.
- Added support for fast matrix and matrix/vector multiplication.
- Implemented subgroup matrix multiplication support for WebGPU.
- Added support for q4_0 quantization in WebGPU (see the layout sketch after this list).
- Refactored pipelines and workgroup calculations for better performance/portability.
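For context on the q4_0 feature above, the sketch below shows the q4_0 block layout as defined in ggml's CPU reference code (32 weights per block, one half-precision scale, two 4-bit values per byte); the WebGPU shaders decode this same layout in WGSL. The helper function and its signature are illustrative, not the release's actual kernel code.

```cpp
#include <cstdint>

#define QK4_0 32

// Mirrors ggml's block_q4_0: one half-precision scale plus 32 weights packed
// two-per-byte as 4-bit values.
typedef struct {
    uint16_t d;             // block scale, raw IEEE-754 half bits (ggml_half)
    uint8_t  qs[QK4_0 / 2]; // 4-bit quants; low nibbles fill the first half of the block
} block_q4_0;

// Dequantize one block, following ggml's reference dequantization:
// each 4-bit value q in [0, 15] maps to (q - 8) * d.
static void dequantize_block_q4_0(const block_q4_0 * b, float d, float * out) {
    // `d` is the block scale already converted from half to float by the caller.
    for (int i = 0; i < QK4_0 / 2; ++i) {
        const int q_lo = (b->qs[i] & 0x0F) - 8;
        const int q_hi = (b->qs[i] >> 4)   - 8;
        out[i]             = q_lo * d;
        out[i + QK4_0 / 2] = q_hi * d;
    }
}
```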
🐛 Bug Fixes
- Fixed rms_norm double-declaration bug.
- Fixed autoconfig issues.
- Ensured all operators (including xielu) are working.
- Fixed bug in unary operators kernel related to REPL_Template support.
- Fixed WebGPU build on emscripten.
- Fixed single-thread case for init_tensor_uniform.
- Fixed test-backend-ops emscripten build for f16/quantized types.
- Used emscripten memory64 to support get_memory.
- Moved wasm single-thread logic out of test-backend-ops for cpu backend.
- Disabled multiple threads for emscripten single-thread builds in ggml_graph_plan (see the sketch after this list).
- Avoided error on device destruction and added TODOs for proper cleanup.
- Fixed unused warning.
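The ggml_graph_plan change above can be pictured along these lines; this is only a guess at the shape of the guard (the actual code may differ), relying on emscripten defining __EMSCRIPTEN_PTHREADS__ only when a build has pthread support.

```cpp
// Hypothetical sketch, not the actual patch: clamp the requested thread count
// to 1 on emscripten builds compiled without pthread support.
static int clamp_n_threads(int n_threads) {
#if defined(__EMSCRIPTEN__) && !defined(__EMSCRIPTEN_PTHREADS__)
    // Without -pthread there are no worker threads to schedule onto.
    (void) n_threads;
    return 1;
#else
    return n_threads > 0 ? n_threads : 1;
#endif
}
```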
🔧 Affected Symbols
rms_norm · unary operators kernel · mul_mat