How to use lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF with libraries and local apps.

Libraries

How to use lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF",
	filename="Qwen__Qwen3-Next-80B-A3B-Instruct-IQ4_NL.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
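
The call above returns an OpenAI-style completion dictionary rather than a plain string. Below is a minimal sketch of pulling the reply text out of it, assuming the usual `choices[0]["message"]["content"]` layout. The helper name `extract_reply` and the sample dictionary are illustrative and not part of llama-cpp-python.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    message = choices[0].get("message") or {}
    return (message.get("content") or "").strip()


# Hand-written stand-in for what create_chat_completion() returns,
# so the helper can be tried without loading the 80B model.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": " Paris. "},
            "finish_reason": "stop",
        }
    ]
}

reply = extract_reply(sample_response)
print(reply)
```

With a loaded model, the same helper would be applied to the return value of `llm.create_chat_completion(...)`.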

Local Apps

llama.cpp

How to use lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M

Use Docker

docker model run hf.co/lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M


vLLM

How to use lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
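
The same request can be sent from Python using only the standard library. This is a sketch rather than an official client: the helper name `build_chat_request` is made up for illustration, and the URL and model name simply mirror the curl call above. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF",
    "What is the capital of France?",
)

# To actually send it (requires the server to be running):
# with urllib.request.urlopen(request) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Pointing `base_url` at `http://localhost:8080` instead should work the same way against a local llama-server, since both expose an OpenAI-compatible endpoint.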

Use Docker

docker model run hf.co/lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M

Ollama
How to use lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF with Ollama:
```
ollama run hf.co/lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M
```

Unsloth Studio

How to use lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF to start chatting

Pi

How to use lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M"
        }
      ]
    }
  }
}
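
If `models.json` already lists other providers, the entry can be merged in rather than overwriting the file. The sketch below assumes the file has the shape shown above. The function name `add_llama_cpp_provider` and the `other` provider in the example are hypothetical.

```python
import copy
import json


def add_llama_cpp_provider(config: dict, base_url: str, model_id: str) -> dict:
    """Return a copy of config with a llama-cpp provider entry for model_id."""
    merged = copy.deepcopy(config)
    providers = merged.setdefault("providers", {})
    provider = providers.setdefault("llama-cpp", {})
    provider.update(
        {"baseUrl": base_url, "api": "openai-completions", "apiKey": "none"}
    )
    models = provider.setdefault("models", [])
    # Avoid adding the same model id twice.
    if not any(entry.get("id") == model_id for entry in models):
        models.append({"id": model_id})
    return merged


# Hypothetical existing config with an unrelated provider.
existing = {"providers": {"other": {"baseUrl": "http://example.invalid/v1"}}}

updated = add_llama_cpp_provider(
    existing,
    "http://localhost:8080/v1",
    "lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
)
print(json.dumps(updated, indent=2))
```

The printed JSON can then be saved to `~/.pi/agent/models.json`.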

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF with Docker Model Runner:
```
docker model run hf.co/lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M
```

Lemonade

How to use lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3-Next-80B-A3B-Instruct-GGUF-Q4_K_M

List all available models

lemonade list

New update: 2025-12-17

The model runs faster on all hardware since the new llama.cpp release.

On Mac M4 Max:

brew upgrade llama.cpp

llama-cli --version

version: 7440 (0e49a7b8b)

llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_0 --prompt 'Write a paragraph about quantum computing' --no-mmap -st -ngl 99

[ Prompt: 88.9 t/s | Generation: 22.4 t/s ] GPU -ngl 99 Mac M4 Max

[ Prompt: 45.4 t/s | Generation: 6.7 t/s ] CPU -ngl 0 Mac M4 Max

On NVIDIA L40S 48GB:

[ Prompt: 308.2 t/s | Generation: 89.4 t/s ] GPU -ngl 99 NVIDIA L40S 48GB
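
When collecting numbers from several runs, the summary line can be parsed instead of copied by hand. This is a small sketch that assumes the line looks exactly like the ones quoted above. The name `parse_perf` is illustrative, and other llama.cpp versions may print timings in a different format.

```python
import re

# Matches e.g. "[ Prompt: 88.9 t/s | Generation: 22.4 t/s ]"
PERF_LINE = re.compile(
    r"Prompt:\s*(?P<prompt>[\d.]+)\s*t/s\s*\|\s*Generation:\s*(?P<gen>[\d.]+)\s*t/s"
)


def parse_perf(line: str) -> tuple[float, float]:
    """Return (prompt_tps, generation_tps) from a llama-cli summary line."""
    match = PERF_LINE.search(line)
    if match is None:
        raise ValueError(f"no timing summary found in: {line!r}")
    return float(match["prompt"]), float(match["gen"])


# The three lines reported above.
lines = [
    "[ Prompt: 88.9 t/s | Generation: 22.4 t/s ] GPU -ngl 99 Mac M4 Max",
    "[ Prompt: 45.4 t/s | Generation: 6.7 t/s ] CPU -ngl 0 Mac M4 Max",
    "[ Prompt: 308.2 t/s | Generation: 89.4 t/s ] GPU -ngl 99 NVIDIA L40S 48GB",
]
results = [parse_perf(line) for line in lines]
for line, (prompt_tps, gen_tps) in zip(lines, results):
    print(f"prompt={prompt_tps} t/s, generation={gen_tps} t/s")
```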

Recent update:

added IQ4_XS

Qwen3-Next-80B-A3B-Instruct ❤️ llama.cpp

The qwen_next PR (Pull Request #16095) was merged into the main branch and is included in llama.cpp release b7186.

Homebrew has been updated, so you can simply run:

brew upgrade llama.cpp

You can also build from source:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
time cmake -B build
time cmake --build build --config Release --parallel $(nproc --all)

The speed in tokens per second is decent and should improve over time.

For the Q4_0 quant:

On MacBook M4 Max:

prompt: 54 t/s gen: 11 t/s (CPU only, i.e. -ngl 0)
prompt: 41 t/s gen: 7 t/s (GPU only, i.e. -ngl 99)

On NVIDIA L40S (CUDA):

prompt: 127 t/s gen: 42 t/s GPU
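
To put these GPU figures next to the ones from the 2025-12-17 update, the ratio of new to old throughput can be computed directly. The sketch below only does the arithmetic on the numbers reported in this document. The runs may not have used the same prompt or settings, so the ratios are indicative rather than a controlled benchmark.

```python
def speedup(old_tps: float, new_tps: float) -> float:
    """Ratio of new to old throughput in tokens per second."""
    if old_tps <= 0:
        raise ValueError("old_tps must be positive")
    return new_tps / old_tps


# (prompt t/s, generation t/s) for the Q4_0 quant, GPU rows only,
# copied from the figures reported in this document.
older = {"M4 Max GPU": (41, 7), "L40S GPU": (127, 42)}
newer = {"M4 Max GPU": (88.9, 22.4), "L40S GPU": (308.2, 89.4)}

for hardware, (old_prompt, old_gen) in older.items():
    new_prompt, new_gen = newer[hardware]
    print(
        f"{hardware}: prompt x{speedup(old_prompt, new_prompt):.1f}, "
        f"generation x{speedup(old_gen, new_gen):.1f}"
    )
```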

Recent update:

added IQ4_NL, Q4_1, Q5_0

added Q3_K_S, Q3_K_L, Q5_K_S

Update:

I tested some of the smaller quants on an NVIDIA L40S GPU, using a default CUDA build of the excellent release from @cturan.

Since the L40S has 48GB of VRAM, I was able to run Q2_K, Q3_K_M, Q4_K_S, Q4_0 and Q4_MXFP4_MOE.

Q4_K_M was too big to fit. It does work with -ngl 45 (partial GPU offload), but it slows down quite a bit.

There may be a better way, but I did not have time to test.
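
A starting value for -ngl can be guessed before running anything. The sketch below is a rough heuristic and not how llama.cpp allocates memory: it assumes every layer is the same size and that a fixed reserve covers the KV cache and compute buffers. The function name, the reserve, and all numbers in the example are made up for illustration. The real file size would come from the .gguf file on disk.

```python
import math


def estimate_gpu_layers(
    model_gib: float, vram_gib: float, n_layers: int, reserve_gib: float = 2.0
) -> int:
    """Rough count of layers that fit in VRAM, assuming equal-sized layers."""
    if n_layers <= 0 or model_gib <= 0:
        raise ValueError("model_gib and n_layers must be positive")
    usable = max(vram_gib - reserve_gib, 0.0)
    per_layer = model_gib / n_layers
    return min(n_layers, math.floor(usable / per_layer))


# Hypothetical numbers: a 50 GiB file, 48 GiB of VRAM, 48 layers,
# and 4 GiB held back for the KV cache and buffers.
suggested = estimate_gpu_layers(50.0, 48.0, n_layers=48, reserve_gib=4.0)
print(f"try -ngl {suggested}")
```

The result is only a first guess. The practical approach is still to lower -ngl until the model loads, as was done with -ngl 45 above.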

I got a good speed of 53 tokens per second for generation and 800 tokens per second for prompt processing.

wget https://github.com/cturan/llama.cpp/archive/refs/tags/test.tar.gz
tar xf test.tar.gz
cd llama.cpp-test

# export PATH=/usr/local/cuda/bin:$PATH

time cmake -B build -DGGML_CUDA=ON
time cmake --build build --config Release --parallel $(nproc --all)

You may need to add /usr/local/cuda/bin to your PATH so that nvcc (the NVIDIA CUDA compiler) can be found.

Building from source took about 7 minutes.

For more detail on CUDA build see: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#cuda

Quantized Models:

These quantized models were generated using the excellent pull request from @pwilkin #16095 on 2025-10-19 with commit 2fdbf16eb.

NOTE: when these files were generated, they only worked with llama.cpp pull request #16095, which was still in development. That pull request has since been merged (see the update above). Speed and quality should improve over time.

How to build and run on macOS

PR=16095
git clone https://github.com/ggml-org/llama.cpp llama.cpp-PR-$PR
cd llama.cpp-PR-$PR

git fetch origin pull/$PR/head:pr-$PR
git checkout pr-$PR

time cmake -B build
time cmake --build build --config Release --parallel $(nproc --all)

Run examples

Run with Hugging Face model:

build/bin/llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF --prompt 'What is the capital of France?' --no-mmap -st

By default, this downloads lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M.

To download a model file directly:

wget https://huggingface.co/lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF/resolve/main/Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf

pip install hf_transfer 'huggingface_hub[cli]'
hf download lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf
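
The direct download URL follows a regular pattern, so it can be built for a given quant. This sketch assumes every file in the repository is named like the two seen in this document (`Qwen__Qwen3-Next-80B-A3B-Instruct-IQ4_NL.gguf` and `...-Q4_0.gguf`). That naming is an assumption for other quants, and the helper names are illustrative.

```python
from urllib.parse import quote

REPO_ID = "lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF"


def gguf_filename(quant: str) -> str:
    """File name for a quant, assuming the naming pattern seen in this repo."""
    return f"Qwen__Qwen3-Next-80B-A3B-Instruct-{quant}.gguf"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the huggingface.co 'resolve' URL for a file in a model repo."""
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


url = resolve_url(REPO_ID, gguf_filename("Q4_0"))
print(url)
```

For the Q4_0 quant this reproduces the wget URL above. For other quants, check the repository file list before relying on the generated name.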

Run with local model file:

build/bin/llama-cli -m Qwen__Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf --prompt 'Write a paragraph about quantum computing' --no-mmap -st

Example prompt and output

User prompt:

Write a paragraph about quantum computing

Assistant output:

Quantum computing represents a revolutionary leap in computational power by harnessing the principles of quantum mechanics, such as superposition and entanglement, to process information in fundamentally new ways. Unlike classical computers, which use bits that are either 0 or 1, quantum computers use quantum bits, or qubits, which can exist in a combination of both states simultaneously. This allows quantum computers to explore vast solution spaces in parallel, making them potentially exponentially faster for certain problems—like factoring large numbers, optimizing complex systems, or simulating molecular structures for drug discovery. While still in its early stages, with challenges including qubit stability, error correction, and scalability, quantum computing holds transformative promise for fields ranging from cryptography to artificial intelligence. As researchers and tech companies invest heavily in hardware and algorithmic development, the race to achieve practical, fault-tolerant quantum machines is accelerating, heralding a new era in computing technology.

[end of text]

Downloads last month: 1,065

Format: GGUF
Model size: 80B params
Architecture: qwen3next
Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, 16-bit

Model tree for lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF

Base model: Qwen/Qwen3-Next-80B-A3B-Instruct
Quantized (69): this model