About
This repository contains quantized GGUF builds of Hito 2B, a 2-billion-parameter language model fine-tuned from Qwen3.5-2B with Hitonet's Progressive LoRA Merging (PLM) and GRPO training pipelines, featuring the Cognitive Framework for structured nested reasoning.
GGUF is the format used by llama.cpp, Ollama, LM Studio, and most local inference stacks. Use these files for on-device or server deployment without requiring Python or Transformers.
For the full model card, methodology, benchmarks, and the Cognitive Framework specification, see the main repository: hitonet/hito-2b.
Quantizations
| File | Bits (information) | Storage | Size | Quality | Recommended For |
|---|---|---|---|---|---|
| hito-2b-F16.gguf | 16 | 16 bpw | 3.6 GB | Reference | Research, benchmarking, publication (recommended) |
| hito-2b-Q8_0.gguf | 8 | 8.5 bpw | 1.9 GB | Near-lossless | Production deployment (recommended) |
| hito-2b-Q6_K.gguf | 6 | 6.5 bpw | 1.5 GB | Excellent | Quality-focused local use |
| hito-2b-Q5_K_M.gguf | 5 | 5.7 bpw | 1.4 GB | Good | Local use on modest hardware |
| hito-2b-TQ1_0.gguf | 1.58 (ternary) | 1.7 bpw | 687 MB | Research only | BitNet-style ternary experiments |
A note on quantization and Hito's reasoning quality
Hito 2B's behavior depends on the integrity of the nested cognitive structure the model was trained to produce. Quantization affects small models like this one disproportionately, because each weight carries a greater share of the model's overall capability than it would in a larger model. The structured reasoning traces, the self-correction loop, and the committed answer can all drift subtly at lower precision. For faithful reproduction of the benchmark numbers and the example transcripts reported in the main repository, please consider the following guidance:
- F16 is the reference. It contains the same weights used during training and evaluation. Every result quoted in the main model card and in the example transcripts was measured on this precision. This is the correct choice for research, publication, and benchmark comparisons.
- Q8_0 is effectively indistinguishable from F16 in practice. Perplexity overhead is negligible, and the Cognitive Framework's self-correction loop is preserved intact. This is our recommended choice for any production deployment where storage permits it.
- Q6_K is an excellent compromise when Q8_0 is too large for your environment. Reasoning structure is preserved; very minor vocabulary-level drift is possible on long generations but will not change conclusions on the benchmark tasks.
- Q5_K_M is an acceptable daily driver for local chat on modest hardware. Most users will not notice a quality difference relative to Q6_K in casual conversation, though subtle reasoning errors may appear more frequently on complex multi-step problems.
- TQ1_0 (1.58-bit ternary) is not recommended for normal use. It causes visible degradation in the cognitive trace and is provided for researchers investigating whether structured reasoning scaffolds survive extreme quantization.
If you are evaluating Hito 2B and plan to report or publish results, please use F16 or Q8_0 to stay aligned with our reported numbers. If a specific output differs from what you expect or from what is shown in our example transcripts, try Q8_0 or F16 before concluding there is a model issue. This is especially important for the structured reasoning examples, where quantization-induced drift can alter the tag sequence and the committed answer.
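The guidance above can be turned into a small selection helper. The following is only a sketch: the file names and sizes are copied from the Quantizations table, while `pick_quant`, the `for_publication` flag, and the `overhead_gb` allowance for the KV cache and runtime are hypothetical names and values introduced for illustration, not part of any Hitonet tooling.

```python
# File names and sizes come from the Quantizations table above.
# The list is ordered from highest to lowest precision.
QUANTS = [
    # (file name, size in GB, research_only)
    ("hito-2b-F16.gguf", 3.6, False),
    ("hito-2b-Q8_0.gguf", 1.9, False),
    ("hito-2b-Q6_K.gguf", 1.5, False),
    ("hito-2b-Q5_K_M.gguf", 1.4, False),
    ("hito-2b-TQ1_0.gguf", 0.687, True),
]

# Precisions this card asks for when results will be reported or published.
PUBLICATION_OK = {"hito-2b-F16.gguf", "hito-2b-Q8_0.gguf"}


def pick_quant(budget_gb, for_publication=False, overhead_gb=1.0):
    """Return the highest-precision file that fits the memory budget, or None.

    overhead_gb is an assumed allowance for the KV cache and runtime buffers;
    tune it for your context length and backend.
    """
    usable = budget_gb - overhead_gb
    for name, size_gb, research_only in QUANTS:
        if research_only:
            continue  # TQ1_0 is never picked automatically
        if for_publication and name not in PUBLICATION_OK:
            continue
        if size_gb <= usable:
            return name
    return None
```

For example, `pick_quant(3.0)` selects the Q8_0 file under these assumptions, and `pick_quant(2.0, for_publication=True)` returns `None` because neither F16 nor Q8_0 fits.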
Quick Start
Ollama
Pull and run any quantization directly from Hugging Face. The repository includes template, system, and params files that Ollama auto-detects, so the chat template, stop tokens, and sampling parameters are applied out of the box.
# Research and benchmarking (reference quality)
ollama run hf.co/hitonet/hito-2b-GGUF:F16 # 3.6 GB, reference
# Production deployment (near-lossless, recommended)
ollama run hf.co/hitonet/hito-2b-GGUF:Q8_0 # 1.9 GB
# Quality-focused local use
ollama run hf.co/hitonet/hito-2b-GGUF:Q6_K # 1.5 GB
ollama run hf.co/hitonet/hito-2b-GGUF:Q5_K_M # 1.4 GB
# Research-only extreme quantization
ollama run hf.co/hitonet/hito-2b-GGUF:TQ1_0 # 687 MB, 1.58-bit ternary
Pull without running:
ollama pull hf.co/hitonet/hito-2b-GGUF:Q5_K_M
If you pulled this model before 2026-04-22 and want the current template, refresh your local copy:
ollama rm hf.co/hitonet/hito-2b-GGUF:Q5_K_M
ollama pull hf.co/hitonet/hito-2b-GGUF:Q5_K_M
llama.cpp
# Download the preferred quantization, then:
./llama-cli -m hito-2b-Q5_K_M.gguf --interactive --n-predict 4000 \
--temp 0.7 --top-p 0.95 --top-k 20 -c 8192
# Or run as an OpenAI-compatible server:
./llama-server -m hito-2b-Q5_K_M.gguf -c 8192 --host 0.0.0.0 --port 8080
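Once the server is running, any OpenAI-style client can talk to it. Below is a minimal standard-library sketch that posts to the `/v1/chat/completions` endpoint on port 8080, matching the command above. The helper names `build_chat_request` and `chat` are hypothetical, and the `"model"` value is a placeholder label (an assumption: a server started with a single `-m` file answers with that file whatever name is sent).

```python
import json
import urllib.request

# Matches the --port 8080 setting in the llama-server command above.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, base_url=BASE_URL, max_tokens=4000, temperature=0.7):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": "hito-2b",  # placeholder label, see the note above
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=300):
    """Send the prompt to the local server and return the reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("If x + 1/x = 3, what is x^3 + 1/x^3?")` returns the reply text. Building the request separately from sending it keeps the payload easy to inspect or log.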
LM Studio
- Open LM Studio
- Search for hitonet/hito-2b-GGUF
- Download your preferred quantization
- Load and chat
Python (llama-cpp-python)
from llama_cpp import Llama

llm = Llama(
    model_path="hito-2b-Q5_K_M.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # all layers on GPU if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "If x + 1/x = 3, what is x^3 + 1/x^3?"}],
    max_tokens=4000,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
Notes on the Cognitive Framework
Hito 2B emits reasoning inside a <think>...</think> block using structured cognitive tags (<understand>, <verify>, <commit>, etc.). After the closing </think>, the committed answer is produced as the user-facing output.
For deployment:
- Default chat UIs will show the full <think> block. Users who want only the answer can render content after </think>.
- Ollama with the included Modelfile correctly separates thinking from the reply when using think=true in the chat API.
- Wrapper tools can parse the nested tags to surface reasoning stages, confidence signals, and self-correction events.
See COGNITIVE_FRAMEWORK.md in the main repo for the full tag taxonomy and integration patterns.
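For wrappers that only need the final answer, splitting on the closing `</think>` tag is usually enough. The sketch below assumes the raw text contains at most one `<think>...</think>` block; `split_think` and `stage_tags` are hypothetical helper names, and the sample string is a hand-written illustration using only the tags named above, not real model output. The full tag taxonomy lives in COGNITIVE_FRAMEWORK.md and is not reproduced here.

```python
import re


def split_think(text):
    """Return (reasoning, answer) from raw model output.

    If no closing </think> tag is present, the whole text is treated
    as the answer and the reasoning is empty.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


def stage_tags(reasoning):
    """List the opening tag names inside the reasoning, in order of appearance."""
    return re.findall(r"<([a-z_]+)>", reasoning)


# Hand-written sample for illustration only (not actual Hito 2B output).
sample = (
    "<think><understand>Need x^3 + 1/x^3 given x + 1/x = 3.</understand>"
    "<verify>3^3 - 3*3 = 18</verify><commit>18</commit></think>\n"
    "The answer is 18."
)

reasoning, answer = split_think(sample)
print(answer)
print(stage_tags(reasoning))
```

A regular expression over opening tags is enough to list the stages, but it does not validate nesting; a wrapper that needs the full tree would want a proper parser driven by the taxonomy in the main repository.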
License
Released under the Hitonet Community License. Non-commercial use is permitted with attribution. Commercial use requires written permission from Hitonet.
- Full license text: LICENSE
- Commercial licensing: legal@hitonet.com
Links
- Main model (safetensors): hitonet/hito-2b
- Website: hitonet.com
- Chat interface: chat.hitonet.com
- API platform: platform.hitonet.com
Structured reasoning for small language models