Instructions to use WilliamSong/qwen3-embedding-0.6b with libraries, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use WilliamSong/qwen3-embedding-0.6b with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="WilliamSong/qwen3-embedding-0.6b",
    filename="qwen3-embedding-0.6b-fix.gguf",
    embedding=True,  # this is an embedding model, not a chat model
)

# Embed a single text (create_chat_completion does not apply here)
embedding = llm.embed("Today is a sunny day and I will get some ice cream.")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use WilliamSong/qwen3-embedding-0.6b with llama.cpp:
Install from brew
```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf WilliamSong/qwen3-embedding-0.6b:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf WilliamSong/qwen3-embedding-0.6b:Q4_K_M
```
Install from WinGet (Windows)
```sh
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf WilliamSong/qwen3-embedding-0.6b:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf WilliamSong/qwen3-embedding-0.6b:Q4_K_M
```
Use pre-built binary
```sh
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf WilliamSong/qwen3-embedding-0.6b:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf WilliamSong/qwen3-embedding-0.6b:Q4_K_M
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf WilliamSong/qwen3-embedding-0.6b:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf WilliamSong/qwen3-embedding-0.6b:Q4_K_M
```
Use Docker
```sh
docker model run hf.co/WilliamSong/qwen3-embedding-0.6b:Q4_K_M
```
- LM Studio
- Jan
- Ollama
How to use WilliamSong/qwen3-embedding-0.6b with Ollama:
```sh
ollama run hf.co/WilliamSong/qwen3-embedding-0.6b:Q4_K_M
```
- Unsloth Studio
How to use WilliamSong/qwen3-embedding-0.6b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for WilliamSong/qwen3-embedding-0.6b to start chatting
```
Install Unsloth Studio (Windows)
```sh
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for WilliamSong/qwen3-embedding-0.6b to start chatting
```
Using HuggingFace Spaces for Unsloth
```sh
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for WilliamSong/qwen3-embedding-0.6b to start chatting
```
- Pi
How to use WilliamSong/qwen3-embedding-0.6b with Pi:
Start the llama.cpp server
```sh
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf WilliamSong/qwen3-embedding-0.6b:Q4_K_M
```
Configure the model in Pi
```sh
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```

Add to `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "WilliamSong/qwen3-embedding-0.6b:Q4_K_M" }
      ]
    }
  }
}
```

Run Pi

```sh
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use WilliamSong/qwen3-embedding-0.6b with Hermes Agent:
Start the llama.cpp server
```sh
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf WilliamSong/qwen3-embedding-0.6b:Q4_K_M
```
Configure Hermes
```sh
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default WilliamSong/qwen3-embedding-0.6b:Q4_K_M
```
Run Hermes
```sh
hermes
```
- Docker Model Runner
How to use WilliamSong/qwen3-embedding-0.6b with Docker Model Runner:
```sh
docker model run hf.co/WilliamSong/qwen3-embedding-0.6b:Q4_K_M
```
- Lemonade
How to use WilliamSong/qwen3-embedding-0.6b with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull WilliamSong/qwen3-embedding-0.6b:Q4_K_M
```
Run and chat with the model
```sh
lemonade run user.qwen3-embedding-0.6b-Q4_K_M
```
List all available models
```sh
lemonade list
```
Qwen3-Embedding-0.6B (GGUF) Models
This directory contains GGUF builds of the Qwen3 0.6B embedding model, produced from the upstream base repository Qwen/Qwen3-0.6B-Base (original Hugging Face layout in ../Qwen3-Embedding-0.6B/).
Contents
| File | Purpose |
|---|---|
| `qwen3-embedding-0.6b.Q4_K_M.gguf` | Quantized (Q4_K_M) GGUF for efficient inference. |
| `qwen3-embedding-0.6b-fix.gguf` | Same model with an explicit `sep_token` / EOS metadata fix applied. |
Special Token Configuration
Extracted from tokenizer_config.json:
"sep_token": "<|endoftext|>",
"sep_token_id": 151643
The model uses `<|endoftext|>` as both the padding token (`pad_token`) and the separator (`sep_token`). For embedding generation, each input text MUST terminate with the separator token (or the converter must auto-append it) to avoid a runtime warning:
[WARNING] At least one last token in strings embedded is not SEP. 'tokenizer.ggml.add_eos_token' should be set to 'true' in the GGUF header
Why the Warning Appears
If the GGUF metadata key `tokenizer.ggml.add_eos_token` is absent or false, llama.cpp will not auto-append the final SEP/EOS token for embedding inputs. Any input string that does not already end with `<|endoftext|>` triggers the warning and may yield sub-optimal embeddings (slightly different token boundary semantics).
Fix Implemented
The file qwen3-embedding-0.6b-fix.gguf was regenerated ensuring:
- `tokenizer.ggml.add_eos_token = true`
- `sep_token` (`<|endoftext|>`) retained with id `151643`
This makes llama.cpp automatically append the SEP/EOS token when missing, silencing the warning and standardizing embeddings.
Rebuilding From Upstream (Recommended Process)
- Obtain the upstream model: clone or download `Qwen/Qwen3-0.6B-Base` (embedding variant directory).
- Convert to GGUF using the current `llama.cpp` conversion script, `convert_hf_to_gguf.py` (it already sets EOS for Qwen tokenizers). Example:
```sh
python3 llama.cpp/convert_hf_to_gguf.py Qwen3-Embedding-0.6B \
  --outfile qwen3-embedding-0.6b-f16.gguf --outtype f16
# The converter does not emit K-quants; quantize in a second step:
./llama.cpp/build/bin/llama-quantize \
  qwen3-embedding-0.6b-f16.gguf qwen3-embedding-0.6b-fix.gguf Q4_K_M
```
If you previously produced a GGUF that shows the warning, just re-run conversion with an up-to-date `llama.cpp` checkout. The script internally writes `tokenizer.ggml.add_eos_token = true` for this tokenizer family.
Post-Conversion Validation
Run a quick embedding call and confirm no warning appears:
```sh
./llama.cpp/build/bin/llama-embedding \
  -m models/qwen3-embedding-0.6b-fix.gguf \
  -p "Hello world"
```
If you still see the warning:
- Confirm the binary was rebuilt after updating sources (`make` or `cmake --build build`).
- Inspect the metadata with a small Python snippet:
```python
from gguf import GGUFReader

r = GGUFReader("models/qwen3-embedding-0.6b-fix.gguf")
# r.fields maps metadata key names to ReaderField entries
field = r.fields["tokenizer.ggml.add_eos_token"]
print("ADD_EOS_TOKEN=", bool(field.parts[-1][0]))
```
Expected output: `ADD_EOS_TOKEN= True`
Manual Patch (Fallback Method)
If re-conversion is inconvenient, you can clone metadata and force the flag:
```python
from gguf import GGUFReader, GGUFWriter
from gguf.constants import Keys, GGUFValueType

src = GGUFReader("qwen3-embedding-0.6b.Q4_K_M.gguf")
arch = src.fields[Keys.General.ARCHITECTURE].contents()
dst = GGUFWriter("qwen3-embedding-0.6b-fix.gguf", arch)

# Copy all existing fields, overriding ADD_EOS below; skip the reader's
# virtual "GGUF.*" fields and the architecture (GGUFWriter adds it itself).
# Note: ReaderField.contents() and add_key_value() need a recent gguf-py.
for name, field in src.fields.items():
    if name.startswith("GGUF.") or name in (Keys.General.ARCHITECTURE, Keys.Tokenizer.ADD_EOS):
        continue
    sub_type = field.types[1] if field.types[0] == GGUFValueType.ARRAY else None
    dst.add_key_value(name, field.contents(), field.types[0], sub_type=sub_type)

dst.add_add_eos_token(True)  # force the flag

# Copy tensors unchanged
for tensor in src.tensors:
    dst.add_tensor(tensor.name, tensor.data,
                   raw_shape=tensor.data.shape, raw_dtype=tensor.tensor_type)

dst.write_header_to_file()
dst.write_kv_data_to_file()
dst.write_tensors_to_file()
dst.close()
```
After patching, re-run the validation step.
Usage Notes for Embeddings
- Always feed raw text; no special wrapping is needed. With the fixed file, the SEP token is appended automatically.
- For batch embeddings, ensure each string ends cleanly (avoid trailing whitespace if you rely on identical hashes downstream); see the sketch after this list.
- The dimensionality matches upstream Qwen3-Embedding-0.6B (refer to upstream docs for exact embedding size).
License & Attribution
The original model weights and tokenizer come from the Qwen project (Qwen/Qwen3-0.6B-Base). Review their license and usage terms before redistribution. This README documents conversion adjustments only (metadata EOS flag addition).
Changelog
- Initial addition: added fixed GGUF with `tokenizer.ggml.add_eos_token = true` to suppress the SEP warning.
For further improvements (FP16 build, alternative quantization tiers, or batching examples), open an issue or PR in this repo.