Commit 30add1f by kashif · 1 Parent(s): cfa4f52

update readme

Files changed (1):
  1. README.md +102 -54

README.md CHANGED
@@ -20,95 +20,144 @@ DeepCONF monitors the confidence of generated tokens and stops generation when c
  - `enable_conf` (bool): Whether to enable the DeepCONF strategy. Defaults to `False`.
  - `window_size` (int): Size of the sliding window for confidence calculation. Defaults to `2048`.
  - `threshold` (float): Confidence threshold for early stopping. Defaults to `17.0`.
- - `conf_topk` (int): Number of top tokens to use for confidence calculation from the full vocabulary. Defaults to `20` (matches official implementation).
  - `output_confidences` (bool): If `True` and `return_dict_in_generate=True`, returns a per-step confidence tensor alongside generated sequences for debugging/visualization.

  ## Usage

  To use this custom generation strategy, you can pass it directly to the `generate` method:

  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer

- model = AutoModelForCausalLM.from_pretrained("your-model")
  tokenizer = AutoTokenizer.from_pretrained("your-model")

- inputs = tokenizer("Hello, world!", return_tensors="pt")

  # Generate with DeepCONF (Hub repo)
  outputs = model.generate(
      **inputs,
-     enable_conf=True,
-     window_size=2048,
-     threshold=17.0,
-     output_confidences=True,       # request confidences
-     return_dict_in_generate=True,  # required to get tensors
-     max_new_tokens=100,
      custom_generate="kashif/DeepConf",  # Hugging Face Hub repo
      trust_remote_code=True
  )
  ```

- ## Calibration (DeepConf-low/high)

- DeepConf’s online stopping threshold is derived from a short warmup phase. You collect warmup trace confidences, then pass them into the generator to auto-derive the threshold for either DeepConf-low (aggressive) or DeepConf-high (permissive).

- 1. Warmup (num_return_sequences): collect per-trace confidences (C_t = min(step_confidences))
  ```python
  from transformers import GenerationConfig

- prompt = "Explain artificial intelligence."
- Ninit = 8  # number of warmup traces
- warmup_C = []
-
- warm_cfg = GenerationConfig.from_model_config(model.config)
- warm_cfg.do_sample = True
- warm_cfg.temperature = 0.7
- warm_cfg.top_p = 0.95
- warm_cfg.max_new_tokens = 64
- warm_cfg.enable_conf = True
- warm_cfg.return_dict_in_generate = True
- warm_cfg.output_confidences = True
- warm_cfg.num_return_sequences = Ninit
- # IMPORTANT: Do not set `warm_cfg.threshold` here. Warmup should not apply online early stopping.
-
- out = model.generate(
-     **tokenizer(prompt, return_tensors="pt"),
-     generation_config=warm_cfg,
      custom_generate="kashif/DeepConf",
      trust_remote_code=True,
  )
- # Per-trace Ct = min over steps
- warmup_C = out.confidences.min(dim=1).values.tolist()
  ```

- 2. Online: pass warmup confidences to auto-derive threshold
  ```python
- gen_cfg = GenerationConfig.from_model_config(model.config)
- gen_cfg.enable_conf = True
- gen_cfg.return_dict_in_generate = True
- gen_cfg.output_confidences = True
-
- # Choose a variant:
- # - DeepConf-low (aggressive): eta=0.1 → 90th percentile threshold
- # - DeepConf-high (permissive): eta=0.9 → 10th percentile threshold
- gen_cfg.deepconf_variant = "low"  # or "high"
- # Optional: override eta explicitly
- # gen_cfg.deepconf_eta = 0.1  # defaults: 0.1 for low, 0.9 for high
-
- # Provide warmup confidences; the threshold will be derived internally
- gen_cfg.deepconf_warmup_confidences = warmup_C
-
- out = model.generate(
-     **tokenizer(prompt, return_tensors="pt"),
      custom_generate="kashif/DeepConf",
      trust_remote_code=True,
-     generation_config=gen_cfg,
-     max_new_tokens=128,
  )
  ```

  ## Technical Details

  ### Confidence Calculation
@@ -123,7 +172,6 @@ This approach:
  - Uses the **full probability distribution** (before any top-k/top-p/temperature filtering)
  - Always considers a **fixed number of tokens** (conf_topk=20)
  - Naturally **includes the sampled token** if it's in the top-k
- - Matches the **official DeepConf implementation** exactly

  ### Online Stopping

  - `enable_conf` (bool): Whether to enable the DeepCONF strategy. Defaults to `False`.
  - `window_size` (int): Size of the sliding window for confidence calculation. Defaults to `2048`.
  - `threshold` (float): Confidence threshold for early stopping. Defaults to `17.0`.
+ - `conf_topk` (int): Number of top tokens to use for confidence calculation from the full vocabulary. Defaults to `20`.
  - `output_confidences` (bool): If `True` and `return_dict_in_generate=True`, returns a per-step confidence tensor alongside generated sequences for debugging/visualization.

  ## Usage

+ ### Basic Usage
+
  To use this custom generation strategy, you can pass it directly to the `generate` method:

  ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

+ # Load model and tokenizer
+ model = AutoModelForCausalLM.from_pretrained(
+     "your-model",
+     torch_dtype="auto",
+     device_map="auto"
+ )
  tokenizer = AutoTokenizer.from_pretrained("your-model")

+ # Prepare your prompt
+ question = "What is the square root of 144?"
+ messages = [{"role": "user", "content": question}]
+ prompt = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ # Configure generation with DeepCONF
+ gen_config = GenerationConfig(
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.95,
+     max_new_tokens=512,
+     enable_conf=True,              # Enable DeepCONF
+     window_size=2048,              # Sliding window size
+     threshold=17.0,                # Confidence threshold
+     conf_topk=20,                  # Top-k for confidence (default: 20)
+     output_confidences=True,       # Return confidence scores
+     return_dict_in_generate=True,  # Required for confidence output
+ )

  # Generate with DeepCONF (Hub repo)
  outputs = model.generate(
      **inputs,
+     generation_config=gen_config,
      custom_generate="kashif/DeepConf",  # Hugging Face Hub repo
      trust_remote_code=True
  )
+
+ # Access results
+ generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
+ print(f"Generated: {generated_text}")
+
+ # Access per-step confidences if requested
+ if hasattr(outputs, 'confidences'):
+     confidences = outputs.confidences  # Shape: (batch_size, num_generated_tokens)
+     print(f"Min confidence: {confidences.min().item():.3f}")
+     print(f"Mean confidence: {confidences.mean().item():.3f}")
  ```

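The earlier revision shown above passed the DeepCONF options directly as keyword arguments to `generate()` rather than through a `GenerationConfig`. Assuming `generate()` still forwards these extra keyword arguments into the generation config (as that revision relied on), a minimal sketch of the same call in that style:

```python
# Minimal sketch (assumption): DeepCONF options passed as plain generate() kwargs,
# mirroring the earlier revision of this README rather than building a GenerationConfig.
outputs = model.generate(
    **inputs,
    enable_conf=True,
    window_size=2048,
    threshold=17.0,
    conf_topk=20,
    output_confidences=True,       # request per-step confidences
    return_dict_in_generate=True,  # required to get the confidences tensor
    max_new_tokens=512,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)
```

Either style should yield the same `outputs` object; the `GenerationConfig` route above is simply more explicit.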
 
+ ### Calibration (DeepConf-low/high)

+ DeepConf's online stopping threshold can be automatically derived from a warmup phase. This allows you to calibrate the threshold based on actual model behavior rather than using a fixed value.
+
+ **Step 1: Warmup Phase** - Generate multiple sequences and collect their minimum confidences:

  ```python
  from transformers import GenerationConfig

+ # Prepare inputs
+ question = "What is 2 + 2?"
+ messages = [{"role": "user", "content": question}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ # Configure warmup generation
+ warmup_cfg = GenerationConfig(
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.95,
+     max_new_tokens=256,
+     enable_conf=True,              # Enable confidence tracking
+     return_dict_in_generate=True,
+     output_confidences=True,
+     num_return_sequences=8,        # Generate 8 warmup sequences
+     # Note: Do NOT set threshold here - warmup should run without early stopping
+ )
+
+ # Generate warmup sequences
+ warmup_out = model.generate(
+     **inputs,
+     generation_config=warmup_cfg,
      custom_generate="kashif/DeepConf",
      trust_remote_code=True,
  )
+
+ # Extract minimum confidence per sequence (C_t = min over all steps)
+ warmup_C = warmup_out.confidences.min(dim=1).values.tolist()
+ print(f"Warmup min confidences: {warmup_C}")
  ```

+ **Step 2: Production Generation** - Use warmup confidences to auto-derive the threshold:
+
  ```python
+ # Configure production generation with a calibrated threshold
+ gen_cfg = GenerationConfig(
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.95,
+     max_new_tokens=512,
+     enable_conf=True,
+     return_dict_in_generate=True,
+     output_confidences=True,
+
+     # Automatic threshold calibration
+     deepconf_variant="low",                # "low" (aggressive, 90th percentile) or "high" (permissive, 10th percentile)
+     deepconf_warmup_confidences=warmup_C,  # Pass warmup confidences
+     # Optional: deepconf_eta=0.1,          # Override eta (defaults: 0.1 for low, 0.9 for high)
+ )
+
+ # Generate with the calibrated threshold
+ outputs = model.generate(
+     **inputs,
+     generation_config=gen_cfg,
      custom_generate="kashif/DeepConf",
      trust_remote_code=True,
  )
+
+ print(f"Generated: {tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)}")
  ```

+ **Variant Explanation** (see the sketch below):
+ - **DeepConf-low** (eta=0.1): Uses 90th percentile threshold → More aggressive early stopping
+ - **DeepConf-high** (eta=0.9): Uses 10th percentile threshold → More permissive, allows longer generation
+
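The percentile rule above can be made concrete. A minimal sketch, assuming the threshold is simply a percentile of the warmup trace confidences `warmup_C` collected in Step 1 (this illustrates the rule described here; it is not the repository's internal code):

```python
import numpy as np

def derive_threshold(warmup_confidences, variant="low", eta=None):
    # eta = 0.1 ("low")  -> 90th percentile: higher threshold, more traces stopped early
    # eta = 0.9 ("high") -> 10th percentile: lower threshold, fewer traces stopped
    if eta is None:
        eta = 0.1 if variant == "low" else 0.9
    return float(np.percentile(warmup_confidences, 100 * (1 - eta)))

threshold_low = derive_threshold(warmup_C, variant="low")
threshold_high = derive_threshold(warmup_C, variant="high")
```

With `deepconf_warmup_confidences` set, the generator derives this value internally; the sketch only shows where the 90th/10th percentile figures come from.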
  ## Technical Details

  ### Confidence Calculation

  - Uses the **full probability distribution** (before any top-k/top-p/temperature filtering)
  - Always considers a **fixed number of tokens** (conf_topk=20)
  - Naturally **includes the sampled token** if it's in the top-k
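As a rough illustration of the bullets above, a per-step confidence consistent with this description is the mean negative log-probability of the `conf_topk` most likely tokens, computed from the raw distribution before any sampling filters are applied. The exact formula used by the repository may differ; this is a sketch:

```python
import torch
import torch.nn.functional as F

def step_confidence(logits: torch.Tensor, conf_topk: int = 20) -> torch.Tensor:
    """Sketch: mean negative log-probability of the top `conf_topk` tokens,
    taken from the full vocabulary distribution (before temperature/top-k/top-p)."""
    log_probs = F.log_softmax(logits.float(), dim=-1)      # (batch, vocab_size)
    topk_logprobs, _ = log_probs.topk(conf_topk, dim=-1)   # (batch, conf_topk)
    return -topk_logprobs.mean(dim=-1)                     # larger value = more peaked / confident

```

A peaked distribution yields a large value (the runner-up tokens have very negative log-probabilities), while a flat distribution yields a value near log(vocab_size) ≈ 10–11 for a ~50k-token vocabulary, which is why a default threshold around 17.0 is plausible.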
 
  ### Online Stopping

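A minimal sketch of the windowed stopping rule that `window_size` and `threshold` describe, assuming generation stops once the mean confidence over the most recent `window_size` steps falls below `threshold` (an illustration of the behavior, not the repository's internal code):

```python
from collections import deque

def should_stop_early(step_confidences, window_size=2048, threshold=17.0):
    """Return True if the sliding-window mean confidence ever drops below the threshold."""
    window = deque(maxlen=window_size)
    for conf in step_confidences:
        window.append(conf)
        if sum(window) / len(window) < threshold:
            return True  # low-confidence region reached: stop generating
    return False
```

When calibration is used, `threshold` is replaced by the value derived from the warmup confidences above.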