Instructions for using hitonet/hito-2b-GGUF with libraries, inference providers, notebooks, and local apps.

Libraries

How to use hitonet/hito-2b-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="hitonet/hito-2b-GGUF",
	filename="hito-2b-F16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use hitonet/hito-2b-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf hitonet/hito-2b-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf hitonet/hito-2b-GGUF:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf hitonet/hito-2b-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf hitonet/hito-2b-GGUF:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf hitonet/hito-2b-GGUF:F16
# Run inference directly in the terminal:
./llama-cli -hf hitonet/hito-2b-GGUF:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf hitonet/hito-2b-GGUF:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf hitonet/hito-2b-GGUF:F16

Use Docker

docker model run hf.co/hitonet/hito-2b-GGUF:F16

LM Studio
Jan

vLLM

How to use hitonet/hito-2b-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "hitonet/hito-2b-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hitonet/hito-2b-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/hitonet/hito-2b-GGUF:F16

Ollama
How to use hitonet/hito-2b-GGUF with Ollama:
```
ollama run hf.co/hitonet/hito-2b-GGUF:F16
```

Unsloth Studio

How to use hitonet/hito-2b-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for hitonet/hito-2b-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for hitonet/hito-2b-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for hitonet/hito-2b-GGUF to start chatting

Pi

How to use hitonet/hito-2b-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf hitonet/hito-2b-GGUF:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "hitonet/hito-2b-GGUF:F16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use hitonet/hito-2b-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf hitonet/hito-2b-GGUF:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default hitonet/hito-2b-GGUF:F16

Run Hermes

hermes

Docker Model Runner
How to use hitonet/hito-2b-GGUF with Docker Model Runner:
```
docker model run hf.co/hitonet/hito-2b-GGUF:F16
```

Lemonade

How to use hitonet/hito-2b-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull hitonet/hito-2b-GGUF:F16

Run and chat with the model

lemonade run user.hito-2b-GGUF-F16

List all available models

lemonade list

Hito 2B · GGUF

Quantized builds of Hito 2B for Ollama, llama.cpp, and LM Studio

About

This repository contains quantized GGUF builds of Hito 2B, a 2-billion-parameter language model fine-tuned from Qwen3.5-2B with Hitonet's Progressive LoRA Merging (PLM) and GRPO training pipelines. The model uses the Cognitive Framework for structured, nested reasoning.

GGUF is the format used by llama.cpp, Ollama, LM Studio, and most local inference stacks. Use these files for on-device or server deployment without Python or Transformers.

For the full model card, methodology, benchmarks, and the Cognitive Framework specification, see the main repository: hitonet/hito-2b.

Quantizations

| File | Information bits per weight | Storage (bpw) | Size | Quality | Recommended for |
|---|---|---|---|---|---|
| `hito-2b-F16.gguf` | 16 | 16 | 3.6 GB | Reference | Research, benchmarking, publication (recommended) |
| `hito-2b-Q8_0.gguf` | 8 | 8.5 | 1.9 GB | Near-lossless | Production deployment (recommended) |
| `hito-2b-Q6_K.gguf` | 6 | 6.5 | 1.5 GB | Excellent | Quality-focused local use |
| `hito-2b-Q5_K_M.gguf` | 5 | 5.7 | 1.4 GB | Good | Local use on modest hardware |
| `hito-2b-TQ1_0.gguf` | 1.58 (ternary) | 1.7 | 687 MB | Research only | BitNet-style ternary experiments |
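
The files listed above can also be fetched with nothing but Python's standard library. The sketch below assumes the files sit at the root of the repository's `main` branch and can be downloaded without authentication; neither is confirmed here. `resolve_url` and `download` are illustrative helper names, not part of any Hitonet or Hugging Face library.

```python
import shutil
import urllib.request

REPO_ID = "hitonet/hito-2b-GGUF"


def resolve_url(filename, repo_id=REPO_ID, revision="main"):
    """Build the Hugging Face 'resolve' URL for one file in a model repo.

    Assumption: the file lives at the repository root on `revision`.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download(filename, dest=None):
    """Stream a GGUF file to disk and return the local path.

    Assumption: the repository allows anonymous downloads.
    """
    dest = dest or filename
    with urllib.request.urlopen(resolve_url(filename)) as response:
        with open(dest, "wb") as out:
            # Copy in chunks so a multi-gigabyte file is never held in memory.
            shutil.copyfileobj(response, out)
    return dest

# Example (downloads about 1.9 GB):
# download("hito-2b-Q8_0.gguf")
```

The resulting path can be passed to `llama-cli -m` or to `Llama(model_path=...)`.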

A note on quantization and Hito's reasoning quality

Hito 2B's behavior depends on the integrity of the nested cognitive structure the model was trained to produce. Quantization affects small models like this one more than large ones, because each weight carries a greater share of the model's overall capability. The structured reasoning traces, the self-correction loop, and the committed answer can all drift subtly at lower precision. To faithfully reproduce the benchmark numbers and example transcripts reported in the main repository, follow the guidance below:

F16 is the reference. It contains the same weights used during training and evaluation. Every result quoted in the main model card and in the example transcripts was measured on this precision. This is the correct choice for research, publication, and benchmark comparisons.
Q8_0 is effectively indistinguishable from F16 in practice. Perplexity overhead is negligible, and the Cognitive Framework's self-correction loop is preserved intact. This is our recommended choice for any production deployment where storage permits it.
Q6_K is an excellent compromise when Q8_0 is too large for your environment. Reasoning structure is preserved; very minor vocabulary-level drift is possible on long generations but will not change conclusions on the benchmark tasks.
Q5_K_M is an acceptable daily driver for local chat on modest hardware. Most users will not notice a quality difference relative to Q6_K in casual conversation, though subtle reasoning errors may appear more frequently on complex multi-step problems.
TQ1_0 (1.58-bit ternary) is not recommended for normal use. It causes visible degradation in the cognitive trace and is provided for researchers investigating whether structured reasoning scaffolds survive extreme quantization.

If you are evaluating Hito 2B and plan to report or publish results, please use F16 or Q8_0 to stay aligned with our reported numbers. If a specific output differs from what you expect or from what is shown in our example transcripts, try Q8_0 or F16 before concluding there is a model issue. This is especially important for the structured reasoning examples, where quantization-induced drift can alter the tag sequence and the committed answer.
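
The guidance above can be condensed into a small selection helper. This is a sketch only: the sizes are copied from the Quantizations table, while `pick_quant`, the `headroom_gb` allowance for the KV cache and runtime overhead, and its 1 GB default are assumptions introduced for illustration. TQ1_0 is left out on purpose because it is described as research-only.

```python
# File sizes in GB from the Quantizations table, highest fidelity first.
# TQ1_0 is omitted because it is not recommended for normal use.
QUANTS = [
    ("F16", 3.6),
    ("Q8_0", 1.9),
    ("Q6_K", 1.5),
    ("Q5_K_M", 1.4),
]


def pick_quant(budget_gb, for_publication=False, headroom_gb=1.0):
    """Return the highest-fidelity tag that fits the memory budget, or None.

    budget_gb: memory available for the model, in GB.
    for_publication: restrict the choice to F16 and Q8_0, the two
        precisions recommended for reported results.
    headroom_gb: hypothetical allowance for KV cache and runtime overhead.
    """
    candidates = QUANTS[:2] if for_publication else QUANTS
    usable_gb = budget_gb - headroom_gb
    for tag, size_gb in candidates:
        if size_gb <= usable_gb:
            return tag
    return None


print(pick_quant(8.0))                        # plenty of room
print(pick_quant(2.6))                        # tight budget
print(pick_quant(2.0, for_publication=True))  # nothing suitable fits
```

The returned tag can be appended to the repository name, as in `hf.co/hitonet/hito-2b-GGUF:Q8_0`.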

Quick Start

Ollama

Pull and run any quantization directly from Hugging Face. The repository includes template, system, and params files that Ollama auto-detects, so the chat template, stop tokens, and sampling parameters are applied out of the box.

# Research and benchmarking (reference quality)
ollama run hf.co/hitonet/hito-2b-GGUF:F16       # 3.6 GB, reference

# Production deployment (near-lossless, recommended)
ollama run hf.co/hitonet/hito-2b-GGUF:Q8_0      # 1.9 GB

# Quality-focused local use
ollama run hf.co/hitonet/hito-2b-GGUF:Q6_K      # 1.5 GB
ollama run hf.co/hitonet/hito-2b-GGUF:Q5_K_M    # 1.4 GB

# Research-only extreme quantization
ollama run hf.co/hitonet/hito-2b-GGUF:TQ1_0     # 687 MB, 1.58-bit ternary

Pull without running:

ollama pull hf.co/hitonet/hito-2b-GGUF:Q5_K_M

If you pulled this model before 2026-04-22 and want the current template, refresh your local copy:

ollama rm hf.co/hitonet/hito-2b-GGUF:Q5_K_M
ollama pull hf.co/hitonet/hito-2b-GGUF:Q5_K_M
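
Once a tag has been pulled, it can also be queried from code through Ollama's local HTTP API. The sketch below assumes Ollama is listening on its default port (11434) and that the installed release accepts the `think` flag on `/api/chat` and returns the reasoning in a separate `thinking` field. `build_chat_request` and `chat` are illustrative names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default endpoint


def build_chat_request(prompt, tag="Q8_0", think=True):
    """Build the JSON body for a single-turn, non-streaming chat request."""
    return {
        "model": f"hf.co/hitonet/hito-2b-GGUF:{tag}",
        "messages": [{"role": "user", "content": prompt}],
        "think": think,   # ask Ollama to separate reasoning from the reply
        "stream": False,  # return one JSON object instead of a stream
    }


def chat(prompt, tag="Q8_0"):
    """Send the prompt to a running Ollama instance.

    Returns (thinking, answer); thinking is "" if the server omits it.
    """
    body = json.dumps(build_chat_request(prompt, tag)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        message = json.loads(response.read())["message"]
    return message.get("thinking", ""), message["content"]

# Example (requires a running Ollama with the tag already pulled):
# thinking, answer = chat("What is the capital of France?")
```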

llama.cpp

# Download the preferred quantization, then:
./llama-cli -m hito-2b-Q5_K_M.gguf --interactive --n-predict 4000 \
            --temp 0.7 --top-p 0.95 --top-k 20 -c 8192

# Or run as an OpenAI-compatible server:
./llama-server -m hito-2b-Q5_K_M.gguf -c 8192 --host 0.0.0.0 --port 8080
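
With the server running, any OpenAI-style client can talk to it. The following is a minimal standard-library sketch, assuming the server was started on port 8080 as in the command above. `build_payload` and `ask` are illustrative names, and the `model` value is a placeholder, on the assumption that a server started with `-m` answers with the single model it loaded.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_payload(prompt, max_tokens=4000, temperature=0.7, top_p=0.95):
    """Build an OpenAI-style chat completion body.

    The sampling defaults mirror the llama-cli command line shown above.
    """
    return {
        "model": "hito-2b",  # placeholder name, see the assumption above
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def ask(prompt):
    """POST the prompt to the local server and return the reply text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        SERVER_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    return reply["choices"][0]["message"]["content"]

# Example (requires llama-server to be running):
# print(ask("If x + 1/x = 3, what is x^3 + 1/x^3?"))
```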

LM Studio

Open LM Studio
Search for hitonet/hito-2b-GGUF
Download your preferred quantization
Load and chat

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="hito-2b-Q5_K_M.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # all layers on GPU if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "If x + 1/x = 3, what is x^3 + 1/x^3?"}],
    max_tokens=4000,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])

Notes on the Cognitive Framework

Hito 2B emits reasoning inside a <think>...</think> block using structured cognitive tags (<understand>, <verify>, <commit>, etc.). After the closing </think>, the committed answer is produced as the user-facing output.

For deployment:

Default chat UIs will show the full <think> block. To show only the answer, render the content that follows </think>.
Ollama with the included Modelfile correctly separates thinking from the reply when using think=true in the chat API.
Wrapper tools can parse the nested tags to surface reasoning stages, confidence signals, and self-correction events.
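
The deployment options above can be sketched with a small parser that separates the `<think>` block from the answer and lists the outermost cognitive tags. This is an illustration, not the reference parser: `split_reply` and `top_level_stages` are hypothetical helpers, the sample string is hand-written rather than real model output, and the regex approach assumes a tag never directly contains another tag of the same name. COGNITIVE_FRAMEWORK.md remains the authority on the actual tag taxonomy.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
# Matches <tag>...</tag> pairs; the backreference keeps open and close names equal.
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def split_reply(text):
    """Return (thinking, answer).

    If no complete <think>...</think> block is found, thinking is "" and
    the whole text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def top_level_stages(thinking):
    """List (tag, body) pairs for the outermost tags, in order of appearance.

    Nested tags are left untouched inside the body of their parent.
    """
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(thinking)]


# Hand-written sample in the shape described above (not real model output).
sample = (
    "<think><understand>x + 1/x = 3</understand>"
    "<verify>27 - 9 = 18</verify></think>\n"
    "The answer is 18."
)
thinking, answer = split_reply(sample)
print(answer)
print(top_level_stages(thinking))
```

A chat UI would display only `answer`, while a wrapper tool could render each stage from `top_level_stages` separately.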

See COGNITIVE_FRAMEWORK.md in the main repo for the full tag taxonomy and integration patterns.

License

Released under the Hitonet Community License. Non-commercial use is permitted with attribution. Commercial use requires written permission from Hitonet.

Full license text: LICENSE
Commercial licensing: legal@hitonet.com

Model tree for hitonet/hito-2b-GGUF

Base model: Qwen/Qwen3.5-2B-Base
Finetuned: Qwen/Qwen3.5-2B
Finetuned: hitonet/hito-2b
Quantized (3): this model
