Instructions to use opensota/deepseek-v4-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use opensota/deepseek-v4-gguf with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="opensota/deepseek-v4-gguf",
    filename="DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use opensota/deepseek-v4-gguf with llama.cpp:
Install from brew
```bash
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf opensota/deepseek-v4-gguf:F32

# Run inference directly in the terminal:
llama-cli -hf opensota/deepseek-v4-gguf:F32
```
Install from WinGet (Windows)
```bash
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf opensota/deepseek-v4-gguf:F32

# Run inference directly in the terminal:
llama-cli -hf opensota/deepseek-v4-gguf:F32
```
Use pre-built binary
```bash
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf opensota/deepseek-v4-gguf:F32

# Run inference directly in the terminal:
./llama-cli -hf opensota/deepseek-v4-gguf:F32
```
Build from source code
```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf opensota/deepseek-v4-gguf:F32

# Run inference directly in the terminal:
./build/bin/llama-cli -hf opensota/deepseek-v4-gguf:F32
```
Use Docker
docker model run hf.co/opensota/deepseek-v4-gguf:F32
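Once llama-server is running (via any of the install methods above), it exposes an OpenAI-compatible API. A minimal Python sketch, assuming the default port 8080 and the `openai` client package (`pip install openai`):

```python
# Minimal sketch: query a running llama-server through its OpenAI-compatible API.
# Assumes the server was started as shown above and listens on the default port 8080.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # key is ignored locally

response = client.chat.completions.create(
    model="opensota/deepseek-v4-gguf:F32",  # llama-server serves one model; this field is mostly informational
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```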
- LM Studio
- Jan
- vLLM
How to use opensota/deepseek-v4-gguf with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "opensota/deepseek-v4-gguf"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "opensota/deepseek-v4-gguf",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
Use Docker
docker model run hf.co/opensota/deepseek-v4-gguf:F32
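However the server is started, the same request as the curl call above can be made from Python. A minimal sketch with the `requests` package, assuming vLLM's default port 8000:

```python
# Minimal sketch: the same chat-completion request as the curl example above,
# sent to vLLM's OpenAI-compatible endpoint (default port 8000).
import requests

payload = {
    "model": "opensota/deepseek-v4-gguf",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])
```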
- Ollama
How to use opensota/deepseek-v4-gguf with Ollama:
ollama run hf.co/opensota/deepseek-v4-gguf:F32
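If you prefer calling Ollama from code instead of the CLI, a minimal sketch with the `ollama` Python package (`pip install ollama`), assuming the daemon is running and the model has already been pulled with the command above:

```python
# Minimal sketch: chat with the model through a local Ollama daemon.
# Assumes Ollama is running and the model was already pulled via `ollama run`/`ollama pull`.
import ollama

response = ollama.chat(
    model="hf.co/opensota/deepseek-v4-gguf:F32",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response["message"]["content"])
```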
- Unsloth Studio
How to use opensota/deepseek-v4-gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```bash
curl -fsSL https://unsloth.ai/install.sh | sh

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for opensota/deepseek-v4-gguf to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for opensota/deepseek-v4-gguf to start chatting
```
Using HuggingFace Spaces for Unsloth
```
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for opensota/deepseek-v4-gguf to start chatting
```
- Pi
How to use opensota/deepseek-v4-gguf with Pi:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf opensota/deepseek-v4-gguf:F32
```
Configure the model in Pi
```
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "opensota/deepseek-v4-gguf:F32" }
      ]
    }
  }
}
```
Run Pi
```
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use opensota/deepseek-v4-gguf with Hermes Agent:
Start the llama.cpp server
```bash
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf opensota/deepseek-v4-gguf:F32
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default opensota/deepseek-v4-gguf:F32
```
Run Hermes
hermes
- Docker Model Runner
How to use opensota/deepseek-v4-gguf with Docker Model Runner:
docker model run hf.co/opensota/deepseek-v4-gguf:F32
- Lemonade
How to use opensota/deepseek-v4-gguf with Lemonade:
Pull the model
```bash
# Download Lemonade from https://lemonade-server.ai/
lemonade pull opensota/deepseek-v4-gguf:F32
```
Run and chat with the model
lemonade run user.deepseek-v4-gguf-F32
List all available models
lemonade list
DeepSeek V4 Flash — GGUF for ds4
These quants are specific to the DS4 inference engine. They may or may not work with other inference engines (they should, except for the MTP model, which requires a specific loader).
https://github.com/antirez/ds4
Files
| File | Size | Routed experts (ffn_{gate,up,down}_exps) | Everything else |
|---|---|---|---|
| DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf | 80.8 GiB | IQ2_XXS (gate, up) + Q2_K (down) | Q8_0 attn proj / shared experts / output, F16 router + embed + indexer + compressor + HC, F32 norms / sinks / bias |
| DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf | 153.3 GiB | Q4_K (all three) | same as above |
| DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone) | |
Use the q2 file on 128 GB Mac machines and the q4 file on machines with ≥ 256 GB RAM; pair either with the MTP file for optional speculative decoding.
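If you prefer to fetch a single file from the Hub directly (for example to try it with another loader), a minimal sketch with `huggingface_hub` (`pip install huggingface_hub`); the filename comes from the table above:

```python
# Minimal sketch: download one GGUF from this repo into the local HF cache.
# Pick the filename from the Files table above (q2 variant shown here).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="opensota/deepseek-v4-gguf",
    filename="DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf",
)
print(path)  # local path of the downloaded file
```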
Quantization recipe
The filename is the spec. In detail, for the q2 file:
| Tensor class | Quant | Notes |
|---|---|---|
| blk.*.ffn_gate_exps, blk.*.ffn_up_exps | IQ2_XXS | routed-expert up/gate |
| blk.*.ffn_down_exps | Q2_K | routed-expert down (K-quant for quality) |
| blk.*.ffn_{gate,up,down}_shexp | Q8_0 | shared experts |
| blk.*.attn_q_a, attn_q_b, attn_kv, attn_output_a, attn_output_b | Q8_0 | all attention projections (MLA + low-rank output) |
| output.weight | Q8_0 | output head |
| token_embd.weight | F16 | input embedding |
| blk.*.ffn_gate_inp (router) | F16 | learned router |
| blk.*.exp_probs_b (router bias), blk.*.attn_sinks, all *_norm.weight | F32 | |
| blk.*.ffn_gate_tid2eid | I32 | hash-routing tables (first 3 layers only) |
| blk.*.attn_compressor_*, blk.*.indexer_*, blk.*.hc_*, blk.*.output_hc_* | F16 / F32 | DSv4-specific auxiliary blocks |
For the q4 file, only the three routed-expert classes change to Q4_K. Everything else is byte-for-byte identical to the q2 recipe.
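You can check the recipe of any of the files yourself: the per-tensor quant types are stored in the GGUF header and can be read with the `gguf` Python package (`pip install gguf`). A minimal sketch, assuming a downloaded q2 file:

```python
# Minimal sketch: print the quantization type of every tensor in a GGUF file.
# The names and types printed should match the recipe table above.
from gguf import GGUFReader

reader = GGUFReader("DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf")
for tensor in reader.tensors:
    print(f"{tensor.name:60s} {tensor.tensor_type.name}")
```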
The motivation behind the asymmetry: the routed experts are the majority of the parameter count but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of router, projections, or shared experts. Keeping the decision-making components at Q8_0 preserves model behavior; crushing the experts buys the size.
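As a rough illustration of that trade-off: the bits-per-weight values below are the nominal llama.cpp block sizes, while the parameter split is invented for the example, not DeepSeek V4 Flash's actual counts.

```python
# Back-of-the-envelope sketch of why crushing the routed experts buys most of the size.
# The parameter split is ILLUSTRATIVE only; bits-per-weight are nominal llama.cpp values.
BPW = {"IQ2_XXS": 2.0625, "Q4_K": 4.5, "Q8_0": 8.5}  # Q2_K down_exps (~2.6 bpw) add a little more

routed_expert_params = 300e9   # hypothetical: the bulk of an MoE's parameters
everything_else_params = 20e9  # hypothetical: attention, shared experts, router, ...

def gib(params, bpw):
    return params * bpw / 8 / 2**30

q2 = gib(routed_expert_params, BPW["IQ2_XXS"]) + gib(everything_else_params, BPW["Q8_0"])
q4 = gib(routed_expert_params, BPW["Q4_K"]) + gib(everything_else_params, BPW["Q8_0"])
print(f"q2-style: {q2:.0f} GiB, q4-style: {q4:.0f} GiB")
# The "everything else" term barely moves the total, so keeping it at Q8_0 is
# nearly free, while the experts' quant level dominates the file size.
```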
Usage
```bash
git clone https://github.com/antirez/ds4
cd ds4

./download_model.sh q2   # 128 GB RAM machines
./download_model.sh q4   # >= 256 GB RAM machines
./download_model.sh mtp  # optional MTP / speculative decoding

make
./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
```
The download_model.sh script fetches from this repository, resumes partial downloads, and points ./ds4flash.gguf at the selected variant.
License
MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model's release terms.
Model tree for opensota/deepseek-v4-gguf
- Base model: deepseek-ai/DeepSeek-V4-Flash