Hitonet

Hito 2B · GGUF

Quantized builds of Hito 2B for Ollama, llama.cpp, and LM Studio


About

This repository contains quantized GGUF builds of Hito 2B, a 2-billion-parameter language model fine-tuned from Qwen3.5-2B with Hitonet's Progressive LoRA Merging (PLM) and GRPO training pipelines. The model uses the Cognitive Framework, a tag-based scheme for structured, nested reasoning.

GGUF is the file format used by llama.cpp, Ollama, LM Studio, and most other local inference stacks. Use these files for on-device or server deployment when you do not want a Python or Transformers dependency.

For the full model card, methodology, benchmarks, and the Cognitive Framework specification, see the main repository: hitonet/hito-2b.


Quantizations

| File | Bits (information) | Storage | Size | Quality | Recommended for |
| --- | --- | --- | --- | --- | --- |
| hito-2b-F16.gguf | 16 | 16 bpw | 3.6 GB | Reference | Research, benchmarking, publication (recommended) |
| hito-2b-Q8_0.gguf | 8 | 8.5 bpw | 1.9 GB | Near-lossless | Production deployment (recommended) |
| hito-2b-Q6_K.gguf | 6 | 6.5 bpw | 1.5 GB | Excellent | Quality-focused local use |
| hito-2b-Q5_K_M.gguf | 5 | 5.7 bpw | 1.4 GB | Good | Local use on modest hardware |
| hito-2b-TQ1_0.gguf | 1.58 (ternary) | 1.7 bpw | 687 MB | Research only | BitNet-style ternary experiments |
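
As a rough sanity check on the table, file size should scale approximately with bits per weight (bpw). The sketch below scales the listed F16 size by the bpw ratio and prints the estimate next to the listed size. The helper name `estimate_size_gb` is hypothetical, and the scaling rule is a simplifying assumption: it ignores file metadata and any tensors that a quantization type keeps at higher precision.

```python
# Values copied from the Quantizations table: (bits per weight, listed size in GB).
LISTED = {
    "F16": (16.0, 3.6),
    "Q8_0": (8.5, 1.9),
    "Q6_K": (6.5, 1.5),
    "Q5_K_M": (5.7, 1.4),
    "TQ1_0": (1.7, 0.687),
}


def estimate_size_gb(bpw, reference_bpw=16.0, reference_size_gb=3.6):
    """Scale the listed F16 file size by the ratio of bits per weight.

    Assumption: every tensor is stored at the stated bpw, which is not
    exactly true for mixed-precision quantization types.
    """
    return reference_size_gb * bpw / reference_bpw


for name, (bpw, listed_gb) in LISTED.items():
    estimate = estimate_size_gb(bpw)
    print(f"{name:8s} estimated {estimate:.2f} GB, listed {listed_gb:.2f} GB")
```

The estimates track the listed sizes closely for Q8_0 and Q6_K. The gap is larger for Q5_K_M and especially TQ1_0, likely because some tensors in those files are stored at more bits than the headline figure.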

A note on quantization and Hito's reasoning quality

Hito 2B's behavior depends on the integrity of the nested cognitive structure the model was trained to produce. Quantization affects small models like this one more than large models, because each weight carries a greater share of the model's overall capability. The structured reasoning traces, the self-correction loop, and the committed answer can all drift subtly at lower precision. To reproduce the benchmark numbers and example transcripts reported in the main repository faithfully, please follow this guidance:

  • F16 is the reference. It contains the same weights used during training and evaluation. Every result quoted in the main model card and in the example transcripts was measured on this precision. This is the correct choice for research, publication, and benchmark comparisons.
  • Q8_0 is effectively indistinguishable from F16 in practice. Perplexity overhead is negligible, and the Cognitive Framework's self-correction loop is preserved intact. This is our recommended choice for any production deployment where storage permits it.
  • Q6_K is an excellent compromise when Q8_0 is too large for your environment. Reasoning structure is preserved; very minor vocabulary-level drift is possible on long generations but will not change conclusions on the benchmark tasks.
  • Q5_K_M is an acceptable daily driver for local chat on modest hardware. Most users will not notice a quality difference relative to Q6_K in casual conversation, though subtle reasoning errors may appear more frequently on complex multi-step problems.
  • TQ1_0 (1.58-bit ternary) is not recommended for normal use. It causes visible degradation in the cognitive trace and is provided for researchers investigating whether structured reasoning scaffolds survive extreme quantization.

If you are evaluating Hito 2B and plan to report or publish results, please use F16 or Q8_0 to stay aligned with our reported numbers. If a specific output differs from what you expect or from what is shown in our example transcripts, try Q8_0 or F16 before concluding there is a model issue. This is especially important for the structured reasoning examples, where quantization-induced drift can alter the tag sequence and the committed answer.
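
The guidance above can be condensed into a small selection helper. This is only a sketch: the function name, the two purpose labels, and the idea of comparing file size against a storage budget are illustrative assumptions. The budget covers the file alone and leaves out runtime memory such as the context cache.

```python
# File sizes in GB, copied from the Quantizations table.
SIZES_GB = {"F16": 3.6, "Q8_0": 1.9, "Q6_K": 1.5, "Q5_K_M": 1.4}

# Preference order per purpose, following the guidance in this section.
# TQ1_0 is left out on purpose because it is described as research only.
PREFERENCE = {
    "research": ["F16", "Q8_0"],               # results meant for publication
    "deployment": ["Q8_0", "Q6_K", "Q5_K_M"],  # production and local use
}


def pick_quantization(budget_gb, purpose="deployment"):
    """Return the first preferred tag whose file fits the budget, else None."""
    for tag in PREFERENCE[purpose]:
        if SIZES_GB[tag] <= budget_gb:
            return tag
    return None


print(pick_quantization(2.0))               # Q8_0
print(pick_quantization(1.45))              # Q5_K_M
print(pick_quantization(2.0, "research"))   # Q8_0
print(pick_quantization(1.45, "research"))  # None
```

Returning None for a research budget that fits neither F16 nor Q8_0 is deliberate: it signals that results at that precision should not be compared with the reported numbers.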


Quick Start

Ollama

Pull and run any quantization directly from Hugging Face. The repository includes template, system, and params files that Ollama detects automatically, so the chat template, stop tokens, and sampling parameters are applied out of the box.

# Research and benchmarking (reference quality)
ollama run hf.co/hitonet/hito-2b-GGUF:F16       # 3.6 GB, reference

# Production deployment (near-lossless, recommended)
ollama run hf.co/hitonet/hito-2b-GGUF:Q8_0      # 1.9 GB

# Quality-focused local use
ollama run hf.co/hitonet/hito-2b-GGUF:Q6_K      # 1.5 GB

# Local use on modest hardware
ollama run hf.co/hitonet/hito-2b-GGUF:Q5_K_M    # 1.4 GB

# Research-only extreme quantization
ollama run hf.co/hitonet/hito-2b-GGUF:TQ1_0     # 687 MB, 1.58-bit ternary

Pull without running:

ollama pull hf.co/hitonet/hito-2b-GGUF:Q5_K_M

If you pulled this model before 2026-04-22 and want the current template, refresh your local copy:

ollama rm hf.co/hitonet/hito-2b-GGUF:Q5_K_M
ollama pull hf.co/hitonet/hito-2b-GGUF:Q5_K_M
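
Once a model is pulled, it can also be queried through Ollama's local HTTP chat API. The sketch below builds a non-streaming request with thinking enabled and splits the reply into reasoning and answer. It assumes Ollama is listening on its default address (http://localhost:11434) and is recent enough to support the think option. The function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumption: default local address


def build_chat_request(prompt, model="hf.co/hitonet/hito-2b-GGUF:Q8_0"):
    """Build a POST request for a single-turn, non-streaming chat."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": True,    # ask Ollama to return reasoning separately
        "stream": False,  # one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reply(response):
    """Return (thinking, answer) from a decoded chat response."""
    message = response.get("message", {})
    return message.get("thinking", ""), message.get("content", "")


def chat(prompt):
    """Send the prompt to a running Ollama instance."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return split_reply(json.load(resp))
```

With Ollama running, `thinking, answer = chat("If x + 1/x = 3, what is x^3 + 1/x^3?")` returns the reasoning trace and the committed answer as separate strings.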

llama.cpp

# Download the preferred quantization, then:
./llama-cli -m hito-2b-Q5_K_M.gguf --interactive --n-predict 4000 \
            --temp 0.7 --top-p 0.95 --top-k 20 -c 8192

# Or run as an OpenAI-compatible server:
./llama-server -m hito-2b-Q5_K_M.gguf -c 8192 --host 0.0.0.0 --port 8080
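
With llama-server running as above, any OpenAI-style client can talk to it. The following is a minimal standard-library sketch against the /v1/chat/completions endpoint. It assumes the server is reachable on localhost port 8080. The function names and the "hito-2b" model label are placeholders, on the assumption that the server answers with whichever file was loaded via -m.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8080"):
    """Build an OpenAI-style chat completion request for llama-server."""
    payload = {
        "model": "hito-2b",  # placeholder label, see the assumption above
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 4000,
        "temperature": 0.7,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response):
    """Pull the assistant text out of a decoded completion response."""
    return response["choices"][0]["message"]["content"]


def ask_server(prompt):
    """Send the prompt to a running llama-server and return the reply text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        return extract_text(json.load(resp))
```

Calling `ask_server("If x + 1/x = 3, what is x^3 + 1/x^3?")` returns the reply text. Depending on server settings, that text may still contain the <think> block.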

LM Studio

  1. Open LM Studio
  2. Search for hitonet/hito-2b-GGUF
  3. Download your preferred quantization
  4. Load and chat

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="hito-2b-Q5_K_M.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # all layers on GPU if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "If x + 1/x = 3, what is x^3 + 1/x^3?"}],
    max_tokens=4000,
    temperature=0.7,
    top_p=0.95,  # same sampling settings as the llama.cpp example
    top_k=20,
)
print(out["choices"][0]["message"]["content"])

Notes on the Cognitive Framework

Hito 2B emits reasoning inside a <think>...</think> block using structured cognitive tags (<understand>, <verify>, <commit>, etc.). After the closing </think>, the committed answer is produced as the user-facing output.

For deployment:

  • Default chat UIs will show the full <think> block. Users who want only the answer can render content after </think>.
  • Ollama, using the template files included in this repository, separates thinking from the reply when think is set to true in the chat API request.
  • Wrapper tools can parse the nested tags to surface reasoning stages, confidence signals, and self-correction events.

See COGNITIVE_FRAMEWORK.md in the main repo for the full tag taxonomy and integration patterns.
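
For wrappers that receive the raw text, answer-only rendering and a basic view of the reasoning stages can be sketched with two regular expressions. This is an illustrative sketch only. The helper names are hypothetical, the sample string is made up, and the code assumes nothing about the tag taxonomy beyond lowercase tag names inside a single <think> block.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
OPEN_TAG = re.compile(r"<([a-z_]+)>")  # opening tags only, e.g. <verify>


def split_think(text):
    """Return (reasoning, answer).

    If no complete <think>...</think> block is found (for example when the
    output was truncated), the reasoning is empty and the whole text is
    returned as the answer.
    """
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def stage_names(reasoning):
    """List the opening tag names in the order they appear."""
    return OPEN_TAG.findall(reasoning)


sample = (
    "<think><understand>x + 1/x = 3</understand>"
    "<verify>27 - 9 = 18</verify><commit>18</commit></think>\n"
    "The answer is 18."
)
reasoning, answer = split_think(sample)
print(answer)                  # The answer is 18.
print(stage_names(reasoning))  # ['understand', 'verify', 'commit']
```

A production wrapper would probably want a real parser for nested tags. The regular expressions here only list opening tags and do not check that they are balanced.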


License

Released under the Hitonet Community License. Non-commercial use is permitted with attribution. Commercial use requires written permission from Hitonet.


Hitonet
Structured reasoning for small language models