---
library_name: transformers
tags:
- custom_generate
---
|
|
|
|
|
# LagKV Cache |
|
|
|
|
|
## Introduction |
|
|
|
|
|
|
|
|
|
|
LagKV is an efficient and robust KV cache compression algorithm. It uses the information of lagging tokens to score and compress the tokens that precede them, which significantly boosts compression performance with little computational overhead.
|
|
|
|
|
[Original Github](https://github.com/AI-Lab-China-Merchants-Bank/LagKV) |
|
|
|
|
|
Details are described in the following work:
|
|
|
|
|
[LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important](https://arxiv.org/abs/2504.04704) |
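
To make the mechanism more concrete, below is a minimal, illustrative PyTorch sketch of lag-relative token scoring. The function name `lag_scores`, the min-max normalization against the lagging partition, and the per-token standard deviation used as the importance score are our reading of the paper, not the reference implementation; see the repository above for the exact method.

```py
import torch

# Illustrative sketch only: scores tokens in each partition using statistics of
# the *next* (lagging) partition. The exact normalization and score combination
# are assumptions; refer to the paper / original repo for the real algorithm.
def lag_scores(keys, values, lag_size=128):
    # keys / values: [num_tokens, head_dim] for a single head, sink tokens excluded
    num_chunks = keys.shape[0] // lag_size
    scores = []
    for i in range(num_chunks - 1):
        cur = slice(i * lag_size, (i + 1) * lag_size)
        lag = slice((i + 1) * lag_size, (i + 2) * lag_size)
        chunk_score = torch.zeros(lag_size)
        for x in (keys, values):
            # normalize the current partition with the min/max of the lagging one
            lo = x[lag].min(dim=0, keepdim=True).values
            hi = x[lag].max(dim=0, keepdim=True).values
            norm = (x[cur] - lo) / (hi - lo + 1e-6)
            # per-token dispersion across channels as an importance proxy
            chunk_score = chunk_score + norm.std(dim=-1)
        scores.append(chunk_score)
    return scores  # higher score -> token more likely to be kept

# Toy usage: three partitions of 4 tokens with a tiny head dimension
k, v = torch.randn(12, 8), torch.randn(12, 8)
print(lag_scores(k, v, lag_size=4))
```

In the full method, low-scoring tokens are evicted, while attention-sink tokens at the start of the sequence (controlled by `lag_sink_size` below) are always kept.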
|
|
|
|
|
## Example usage |
|
|
|
|
|
We can use the custom generation method in this repository just like the base `generate` from `transformers`:
|
|
|
|
|
```py
# requires `transformers>=4.52.0`
from transformers import AutoModelForCausalLM, AutoTokenizer

# Preparing model, tokenizer, and model inputs
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", device_map="auto")
messages = [{"role": "user", "content": "Tell me a story about a cat."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Using lagkv cache
gen_out = model.generate(
    # usual `generate` arguments
    **model_inputs,
    do_sample=False,
    max_new_tokens=100,
    return_dict_in_generate=True,
    # lagkv cache arguments (default `lag_ratio=0.5, lag_size=128, lag_sink_size=16`)
    custom_generate="CMB-AI-LAB/lagkv_cache",
    trust_remote_code=True,
)

print(tokenizer.batch_decode(gen_out.sequences, skip_special_tokens=True))
assert "lagkvcache" in str(type(gen_out.past_key_values)).lower()
```