# Model Card for FinetunedLAMAtoR1-001-3B

## Model Details

### Technical Specifications

#### Model Architecture and Objective
- Base Model: Llama-3.2-3B-Instruct
- Architecture: Causal Decoder-Only Transformer
- Hidden Size: 3072
- Layers: 28
- Heads: 24
- Parameters: ~3.21B (loaded in 4-bit quantization)
- Precision: float16 compute (LoRA training and inference)
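
These values match the base model's published configuration. As a quick sanity check, you can inspect the config with transformers (a minimal sketch; only the config file is fetched, no weights are downloaded):

```python
from transformers import AutoConfig

# Inspect the base model's architecture; expected values per the card are noted.
config = AutoConfig.from_pretrained("unsloth/Llama-3.2-3B-Instruct")
print(config.hidden_size)           # 3072
print(config.num_hidden_layers)     # 28
print(config.num_attention_heads)   # 24
```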
#### Compute Infrastructure
- Hardware: Tesla T4 GPU (Google Colab)
- VRAM Usage: ~2.24 GB (model weights) plus training overhead
- Quantization: 4-bit (QLoRA) via `bitsandbytes` (see the sketch below)
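
The exact quantization settings used for training are not listed in this card, but a 4-bit QLoRA-style load with bitsandbytes typically looks like the following sketch (the NF4 and double-quantization values are common defaults, not a verbatim record of this run):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit (NF4) quantization config in the style used for QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load the base model in 4-bit; this is what keeps VRAM usage around ~2.2 GB.
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```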
#### Model Weights

- Type: LoRA adapter (PEFT)
- Adapter File Size: ~92 MB
- Total Saved Size: ~108 MB
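
The adapter is not a standalone model; it is applied on top of the base weights. A minimal sketch with peft, reusing `base_model` from the quantization example above:

```python
from peft import PeftModel

# Attach the ~92 MB LoRA adapter to the quantized base model loaded above;
# the wrapped model can then be used like any causal LM.
model = PeftModel.from_pretrained(
    base_model,
    "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
)
```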
### Model Description
This model is a fine-tuned version of unsloth/Llama-3.2-3B-Instruct designed to mimic reflective, human-like stream-of-consciousness reasoning. It was trained using Unsloth on the ServiceNow-AI/R1-Distill-SFT dataset.
The model uses a specific system prompt to trigger a "thinking" (chain-of-thought) phase before it gives the final answer, aiming to replicate the reasoning behaviour seen in models like DeepSeek-R1.
- Developed by: Muhammad Shaheer Khan
- Model type: Causal Language Model (LoRA Fine-tune)
- Language(s) (NLP): English
- License: Llama 3.2 Community License
- Finetuned from model: unsloth/Llama-3.2-3B-Instruct
## Uses

### Direct Use
The model is intended for reasoning tasks where explainability and step-by-step logic are required. It excels at math problems, logic puzzles, and complex queries requiring iterative thought.
**System Prompt:** To activate the reasoning capabilities, you must use the following system prompt:
"You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer."
## How to Get Started with the Model

You can use the model with the unsloth library for up to 2x faster inference, or with standard Hugging Face transformers (a transformers sketch follows the Unsloth example below).

### Using Unsloth (Recommended)
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the fine-tuned model and its tokenizer in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Apply the Llama 3.1 chat template used during fine-tuning
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

# The reasoning prompt is prepended to the problem and sent as a single user message
sys_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>
"""

message = sys_prompt.format("If a dozen eggs cost $60, how much does one egg cost?")
messages = [{"role": "user", "content": message}]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(
    input_ids = inputs,
    max_new_tokens = 1024,
    use_cache = True,
    temperature = 1.5,
    min_p = 0.1,
)
print(tokenizer.batch_decode(outputs))
```
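
### Using Hugging Face Transformers

If you prefer not to install Unsloth, a roughly equivalent sketch with transformers and peft is shown below. It assumes the tokenizer and chat template are saved alongside the adapter in this repository; if they are not, load the tokenizer from the base model instead.

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the LoRA adapter together with its base model, plus the tokenizer.
model = AutoPeftModelForCausalLM.from_pretrained(
    "Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Muhammad-Shaheer/FinetunedLAMAtoR1-001-3B")

# Same reasoning prompt as in the Unsloth example above.
sys_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>
"""

messages = [{"role": "user", "content": sys_prompt.format("If a dozen eggs cost $60, how much does one egg cost?")}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.5,
    min_p=0.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```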