JohannesGaessler
CUDA: faster softmax via shared memory + fp16 math (llama/4742)
52c45b9 unverified