Inference Acceleration
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models (arXiv:2401.12522)
- Hydragen: High-Throughput LLM Inference with Shared Prefixes (arXiv:2402.05099)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs (arXiv:2402.04291)
- Shortened LLaMA: A Simple Depth Pruning for Large Language Models (arXiv:2402.02834)
- Batch Prompting: Efficient Inference with Large Language Model APIs (arXiv:2301.08721)
- Recurrent Drafter for Fast Speculative Decoding in Large Language Models (arXiv:2403.09919)
- LLM Agent Operating System (arXiv:2403.16971)
- The Unreasonable Ineffectiveness of the Deeper Layers (arXiv:2403.17887)
- Better & Faster Large Language Models via Multi-token Prediction (arXiv:2404.19737)
- Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge (arXiv:2405.00263)
- LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models (arXiv:2405.18377)
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices (arXiv:2410.00531)
- HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading (arXiv:2502.12574)