See axolotl config

axolotl version: 0.13.0.dev0

base_model: kajuma/DiffLlama-1B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

hub_model_id: 
hub_strategy: 
push_dataset_to_hub:
hf_use_auth_token: true

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_cross_entropy: false
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: tokenizer_default

datasets:
  - path: kajuma/Zero_SFT_Ja_v3.5
    type: chat_template
    field_messages: messages
    message_field_role: role
    message_field_content: content

shuffle_merged_datasets: true
dataset_prepared_path: ./output/dataset
val_set_size: 0.002
output_dir: ./output/model

sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

adapter:
lora_model_dir:
lora_r:
lora_alpha:
lora_dropout:
lora_target_linear:
lora_fan_in_fan_out:

wandb_project: diffllama
wandb_entity: tepic
wandb_watch:
wandb_name: diffllama-sft-datapilot
wandb_log_model:

gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
cosine_min_lr_ratio: 0.1
learning_rate: 5e-4

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: false
early_stopping_patience:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: false

save_strategy: steps
save_steps: 100
save_total_limit: 1

warmup_steps: 20
eval_steps: 100
eval_batch_size: 4
eval_table_size:
eval_max_new_tokens:
debug:
deepspeed:
weight_decay: 0.01
fsdp:
fsdp_config:
special_tokens:

output/model

This model is a fine-tuned version of kajuma/DiffLlama-1B on the kajuma/Zero_SFT_Ja_v3.5 dataset. At the last logged evaluation (step 500 of 575 training steps), it achieves the following results on the evaluation set:

Loss: 1.7823
Ppl: 5.9437
Memory/max Active (GiB): 26.29
Memory/max Allocated (GiB): 26.29
Memory/device Reserved (GiB): 27.83
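
The reported perplexity is consistent with the reported loss, since perplexity is the exponential of the mean cross-entropy loss. A minimal check, assuming the loss is measured in nats per token (the usual convention for this kind of training run, but not stated in the card):

```python
import math

# Final evaluation loss reported above (assumed to be mean cross-entropy in nats per token).
eval_loss = 1.7823

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(eval_loss)

# The card reports Ppl = 5.9437; small differences come from rounding the loss to 4 decimals.
print(f"perplexity = {perplexity:.4f}")
```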

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0005
train_batch_size: 1
eval_batch_size: 4
seed: 42
gradient_accumulation_steps: 32
total_train_batch_size: 32
optimizer: adamw_torch (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 20
training_steps: 575
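
The hyperparameters above can be tied together with a little arithmetic. The sketch below derives the effective batch size and an approximate learning-rate schedule. The single-device count is an assumption inferred from total_train_batch_size = 32, and `lr_at` is a hypothetical helper using a common linear-warmup plus cosine-with-floor formula; the scheduler actually used by the training framework may differ in detail.

```python
import math

# Values copied from the hyperparameter list and the axolotl config above.
micro_batch_size = 1
gradient_accumulation_steps = 32
sequence_len = 4096
peak_lr = 5e-4
warmup_steps = 20
training_steps = 575
min_lr_ratio = 0.1  # cosine_min_lr_ratio in the config

# Assumption: one device, which is what total_train_batch_size = 32 implies.
num_devices = 1

effective_batch = micro_batch_size * gradient_accumulation_steps * num_devices
# With sample packing and pad_to_sequence_len, each sequence holds up to sequence_len tokens.
max_tokens_per_step = effective_batch * sequence_len


def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay down to min_lr_ratio * peak_lr (assumed formula)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (training_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


print(effective_batch, max_tokens_per_step)
for step in (0, 20, 300, 575):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks at 5e-4 once warmup ends and decays to 5e-5 by the final step.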

Training results

Training Loss	Epoch	Step	Validation Loss	Ppl	Max Active (GiB)	Max Allocated (GiB)	Device Reserved (GiB)
No log	0	0	2.5499	12.8055	19.52	19.52	19.89
2.221	0.1739	100	2.1053	8.2094	26.29	26.29	27.82
2.0187	0.3477	200	1.9684	7.1593	26.29	26.29	27.83
1.8819	0.5216	300	1.8712	6.4960	26.29	26.29	27.83
1.7977	0.6955	400	1.8093	6.1060	26.29	26.29	27.83
1.7511	0.8693	500	1.7823	5.9437	26.29	26.29	27.83
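
The Epoch and Step columns can be cross-checked against the reported training_steps. A small sketch, assuming each epoch value is simply the step divided by the number of optimizer steps per epoch, rounded to four decimals:

```python
# (step, epoch) pairs copied from the training results table above.
rows = [(100, 0.1739), (200, 0.3477), (300, 0.5216), (400, 0.6955), (500, 0.8693)]

# Each row gives an estimate of optimizer steps per epoch: step / epoch fraction.
estimates = [step / epoch for step, epoch in rows]
steps_per_epoch = round(sum(estimates) / len(estimates))

# With num_epochs = 1, this should match the reported training_steps.
print(steps_per_epoch)

# Evaluation runs every 100 steps, so the last logged evaluation is at step 500.
last_eval_step = max(step for step, _ in rows)
steps_after_last_eval = steps_per_epoch - last_eval_step
print(steps_after_last_eval)
```

The estimate agrees with training_steps: 575, which means the final 75 optimizer steps ran after the last evaluation shown in the table.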

Framework versions

Transformers 4.57.1
Pytorch 2.8.0+cu128
Datasets 4.4.1
Tokenizers 0.22.1

Model size: 1B params
Tensor type: F32, BF16

Model tree for kajuma/diffllama-1B-sft-5e4

Base model: kajuma/DiffLlama-1B
Fine-tuned: this model (kajuma/diffllama-1B-sft-5e4)
