vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (llama/12833) 4b7a407 jeffbolznv committed on Apr 9
vulkan: Use fp16 for the flash attention P*V multiplication (llama/12783) 4e46f41 jeffbolznv committed on Apr 9
llama : fix FA when KV cache is not used (i.e. embeddings) (llama/12825) e7cb2dc ggerganov committed on Apr 8
ggml: don't include arm_neon.h when using CUDA 12 with ARM Neon (ggml/1187) 87f1ea3 cmdr2 committed on Apr 10
ggml : add more generic custom op, remove deprecated custom ops (ggml/1183) ba7a5f8 Diego Devesa committed on Apr 9
Revert "sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (llama/12812) 3d4b079 Neo Zhang Jianyu committed on Apr 8
sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (llama/12734) 7d3e668 jeffzhou2000 committed on Apr 7
vulkan: Use unclamped loads for flash attention mask (llama/12720) a76ef69 jeffbolznv committed on Apr 6
Vulkan: Tune Vulkan mmq int dot shader for performance (llama/12767) b3bf710 OccamRazor committed on Apr 5
sycl: allow ggml-sycl configuration and compilation using Visual Studio project/solution (llama/12625) 27cbcc9 Nicolò Scipione committed on Apr 4
cmake: fix ggml-shaders-gen compiler paths containing spaces (llama/12747) 1c89b7d Ronny Brendel committed on Apr 4
vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (llama/12630) ee422be jeffbolznv committed on Apr 4
vulkan: set cmake minimum and project name in vulkan-shaders (llama/12744) 2459781 jeffbolznv committed on Apr 4
CUDA: Prefer vector flash decoding kernel for Gemma models (llama/12738) 5d7a13f Gaurav Garg JohannesGaessler committed on Apr 3
vulkan: Fix missing cmake logic for dot product extension (llama/12721) 7a1e8f8 jeffbolznv committed on Apr 3
CANN: Support operator SIN COS ARGMAX (llama/12709) 904aaf5 Chenguang Li noemotiovon committed on Apr 3
Simplify and improve CUDA graphs through use of indirect copy pointers (llama/9017) a2fdbe6 Alan Gray slaren committed on Apr 3
opencl: use `max_alloc_size` in backend ctx instead of querying again (llama/12705) 3847456 lhez committed on Apr 3
vulkan: Implement split_k for coopmat2 flash attention. (llama/12627) 5ab06d6 jeffbolznv committed on Apr 2
vulkan: Implement grouped query attention in the coopmat2 FA shader (llama/12559) e7bebe6 jeffbolznv committed on Apr 2
llama : add option to override model tensor buffers (llama/11397) 3d000b6 Diego Devesa committed on Apr 2
examples : add HEAPU8 to exported runtime methods (#3062) 2339555 danbev committed on Apr 20
ruby : make Ruby bindings installed with build options (#3056) 8d0a50d KitaitiMakoto committed on Apr 17
whisper : add no_context parameter to whisper_params (#3045) 0e991f8 sachaarbonel committed on Apr 16
examples : add FFmpeg v7.0 support to ffmpeg-transcode.cpp (#3038) 880d905 fujimotos committed on Apr 15
docs : update README.md to note newer nvidia gpus (#3031) 9401dde Jeff Klassen committed on Apr 11
addon.node : support max_context api for addon.node (#3025) 6c51a9b Lin Xiaodong linxiaodong committed on Apr 11
whisper : reduce delta_min from 1000ms to 100ms (#3028) d3e767a ggerganov committed on Apr 11
docs : document how to use 'WHISPER_FFMPEG' build option (#3029) aa64fa0 fujimotos committed on Apr 10