databricks/databricks-dolly-15k
pip install -r requirements.txt
Place your text corpus at data/raw/english.md.
python utils/clean_wiki.py
python data/download_sft.py
Outputs:
data/raw/english_clean.txt, data/sft_data.jsonl
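Before moving on, it can help to sanity-check the downloaded SFT file. The sketch below assumes data/sft_data.jsonl is in JSON Lines form (one JSON object per line), which the extension suggests but the steps above do not confirm. The helper name summarize_jsonl and the sample field names are hypothetical; the function only counts records and tallies whatever top-level keys it finds.

```python
import json
from collections import Counter


def summarize_jsonl(lines):
    """Count JSON objects and tally their top-level keys from an iterable of lines."""
    count = 0
    keys = Counter()
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # raises json.JSONDecodeError on malformed input
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        count += 1
        keys.update(record.keys())  # count each key once per record
    return count, keys


# Tiny in-memory sample; these field names are illustrative, not the real schema.
sample = ['{"prompt": "Hi", "response": "Hello"}', "", '{"prompt": "Bye"}']
n, seen = summarize_jsonl(sample)
print(n, dict(seen))
```

To inspect the real file, pass an open file handle instead of the sample list, for example `summarize_jsonl(open("data/sft_data.jsonl", encoding="utf-8"))`. A key that appears in fewer records than the total is a quick hint that some rows are incomplete.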
python tokenizer/train_tokenizer.py
Outputs:
tokenizer/spm.model, tokenizer/spm.vocab
python training/dataset.py --prepare
Outputs:
data/processed/train.bin, data/processed/val.bin
Prints token count and train/val split.
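For readers curious what a prepare step like this typically does, here is a minimal sketch under stated assumptions. It is not the code in training/dataset.py. It assumes token ids are stored as a flat run of unsigned 16-bit integers in native byte order (a common layout when the vocabulary has at most 65,536 entries) and that the validation set is the trailing 10% of the stream. The function names and the 10% figure are hypothetical.

```python
from array import array


def split_tokens(token_ids, val_fraction=0.1):
    """Split a token sequence into contiguous train and validation parts."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to split")
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    n_val = max(1, int(len(token_ids) * val_fraction))
    cut = len(token_ids) - n_val
    return token_ids[:cut], token_ids[cut:]


def to_bin(token_ids):
    """Pack ids as unsigned 16-bit integers; raises OverflowError above 65535."""
    return array("H", token_ids).tobytes()


def from_bin(data):
    """Inverse of to_bin: unpack raw bytes back into a list of token ids."""
    ids = array("H")
    ids.frombytes(data)
    return ids.tolist()


tokens = list(range(1000))  # stand-in for real tokenizer output
train, val = split_tokens(tokens, val_fraction=0.1)
print(f"total={len(tokens)} train={len(train)} val={len(val)}")
```

Splitting on a contiguous boundary, instead of sampling tokens at random, keeps the validation text from being interleaved with training text.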
python training/pretrain.py
Expected: val loss should drop below ~3.5. Checkpoints are saved to checkpoints/ when val loss improves.
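The "save only when val loss improves" rule can be sketched as a small tracker. This is an illustration of the general pattern, not the logic in training/pretrain.py. The class name BestCheckpoint, the min_delta parameter, and the loss values are all made up for the example.

```python
import math


class BestCheckpoint:
    """Track the lowest validation loss seen and report when to save."""

    def __init__(self, min_delta=0.0):
        self.best = math.inf
        self.min_delta = min_delta  # required improvement before saving again

    def should_save(self, val_loss):
        """Return True and record val_loss if it beats the best by more than min_delta."""
        if math.isnan(val_loss):
            return False  # never treat a diverged run as an improvement
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            return True
        return False


tracker = BestCheckpoint()
history = [5.2, 4.1, 4.3, 3.6, 3.4]  # made-up validation losses
saved_at = [step for step, loss in enumerate(history) if tracker.should_save(loss)]
print(saved_at)  # evaluation indices whose loss set a new best
```

In a real training loop, the branch where should_save returns True is where the model state would be written to checkpoints/.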
python training/sft.py
Outputs:
checkpoints/sft_final.pt
python inference/chat.py --checkpoint checkpoints/sft_final.pt
Set BATCH_SIZE to 4 or context_len to 256 in scripts/config.
Check that tokenizer/spm.model exists.
pip install torch --index-url https://download.pytorch.org/whl/cu124
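Many failures in a pipeline like this come down to an earlier step not having produced its outputs. The sketch below checks for the output paths listed in the steps above and reports which step to rerun. The step labels, the EXPECTED table, and the function name missing_artifacts are hypothetical; only the file paths come from this document.

```python
from pathlib import Path

# Output files named in the steps above, keyed by an informal step label.
EXPECTED = {
    "clean/download": ["data/raw/english_clean.txt", "data/sft_data.jsonl"],
    "tokenizer": ["tokenizer/spm.model", "tokenizer/spm.vocab"],
    "prepare": ["data/processed/train.bin", "data/processed/val.bin"],
    "sft": ["checkpoints/sft_final.pt"],
}


def missing_artifacts(root="."):
    """Return {step: [paths]} for every step with absent or empty outputs."""
    root = Path(root)
    report = {}
    for step, paths in EXPECTED.items():
        missing = [
            p for p in paths
            if not (root / p).is_file() or (root / p).stat().st_size == 0
        ]
        if missing:
            report[step] = missing
    return report
```

Running `missing_artifacts()` from the repository root returns an empty dict when every listed file is present and non-empty. Otherwise the keys show which steps still need to run.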