whisper.cpp / ggml-cuda.cu

Commit History

CUDA: add FP32 FlashAttention vector kernel (llama/7188)
03d4b22

committed by JohannesGaessler

ggml : full ALiBi support (llama/7192)
192bda4

committed by ggerganov

Introduction of CUDA Graphs to LLama.cpp (llama/6766)
08fc76d

committed by agray3, slaren

Add an option to build without CUDA VMM (llama/7067)
38b1143

committed by wtambellini

ggml : group all experts in a single ggml_mul_mat_id (llama/6505)
f0b5c67

committed by slaren, ggerganov

CUDA: fix matrix multiplication logic for tests (llama/6667)
6ccb5a5

committed by JohannesGaessler

feat: implemented sigmoid function (ggml/806)
cd0c122

committed by Justina Cho

llama : add Command R Plus support (llama/6491)
8cf7097

committed by Carolinabanana, slaren, ggerganov

ggml : mul_mat_id use the same tensor for all the experts (llama/6387)
26fdc9f

committed by slaren, ggerganov

ggml: bypass code incompatible with CUDA < 11.1 (#2020)
32f4e35

committed by primenko

sync : ggml (#2001)
cbbfa9e

committed by ggerganov

llama : add pipeline parallelism support (llama/6017)
b5bb3f3

committed by slaren, compilade, ggerganov

ggml : reuse quantum structs across backends (llama/5943)
bb0625f

committed by ggerganov

1.5 bit: we can do even better (llama/5999)
36cc71e

committed by Kawrakow (ikawrakow)

Better 1.5 bit quantization (llama/5971)
f3a62cc

committed by Kawrakow (ikawrakow)

ggml : add ggml-common.h to deduplicate shared code (llama/5940)
0a37735

committed by ggerganov

ggml : introduce ggml_status (ggml/750)
151c676

committed by Michael Podvitskiy, slaren, ggerganov

cuda : fix data race in soft max (llama/5853)
d1b60e4

committed by slaren

ggml : IQ3_S improvements (llama/5829)
06a8e30

committed by Kawrakow (ikawrakow)

add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)
dd8e3f9

committed by leejet, ggerganov, slaren

ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (llama/5760)
9a07f42

committed by Kawrakow (ikawrakow)

IQ4_XS: a 4.25 bpw quantization (llama/5747)
0ee1bfb

committed by Kawrakow (ikawrakow)

cuda : replace remaining shfl_xor with calls to warp_reduce functions (llama/5744)
753b30d

committed by Engininja2

Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (llama/5721)
2b9bb9e

committed by Kawrakow (ikawrakow), ggerganov

CUDA: fix DEBUG_CUDA_MALLOC (llama/5729)
f18f386

committed by JohannesGaessler

code : normalize enum names (llama/5697)
93e0830

committed by ggerganov

IQ3_S: a much better alternative to Q3_K (llama/5676)
32589c9

committed by Kawrakow (ikawrakow)

Introduce backend GUIDs (ggml/743)
a7eb9f6

committed by UEXTM.com, slaren

ggml : always define ggml_fp16_t as uint16_t (llama/5666)
bc567d3

committed by ggerganov

sync : llama.cpp (ggml/0)
f8e8d34

committed by ggerganov

cuda : ignore peer access already enabled errors (llama/5597)
a817d85

committed by slaren

ci : enable -Werror for CUDA builds (llama/5579)
df03a10

committed by ggerganov

cuda, metal : fix nans in soft_max (llama/5574)
44164ac

committed by slaren, ggerganov

1.5 bit quantization (llama/5453)
9c3aa6a

committed by Kawrakow (ikawrakow)

ggml : add ALiBi support for ggml_soft_max_ext (llama/5488)
26c019a

committed by ggerganov

cuda : print message when initialization fails (llama/5512)
1f047ca

committed by slaren

CUDA: mul_mat_vec_q tiling, refactor mul mat logic (llama/5434)
c0cfa9b

committed by JohannesGaessler, slaren

CUDA: more warps for mmvq on NVIDIA (llama/5394)
7ab774c

committed by JohannesGaessler

CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (llama/5386)
3ff7660

committed by JohannesGaessler

CUDA: mul_mat_vec_q max. batch size 8 -> 4 (llama/5370)
7aa3216

committed by JohannesGaessler

CUDA: mul_mat_vec_q for batch sizes > 1 (llama/5351)
ae45b38

committed by JohannesGaessler

cuda : fix LLAMA_CUDA_F16 (llama/5262)
5fd8fb7

committed by slaren

llava : add MobileVLM support (llama/5132)
f17a416

committed by JidongZhang-THU, slaren

sync : ggml (llama/0)
cdb7964

committed by ggerganov

SOTA 3-bit quants (llama/5196)
4649943

committed by Kawrakow (ikawrakow)

`ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (ggml/686)
75d438c

committed by John Balis, slaren

cuda : fix tensor size calculation for non-split buffer (llama/5145)
8f3eb65

committed by slaren

cuda : fix 2-bit quants on amd hip (llama/5105)
aadbd67

committed by Engininja2