Instructions to use norallm/normistral-11b-warm with libraries, inference providers, notebooks, and local apps.

Libraries

How to use norallm/normistral-11b-warm with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="norallm/normistral-11b-warm")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-11b-warm")

llama-cpp-python

How to use norallm/normistral-11b-warm with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="norallm/normistral-11b-warm",
	filename="normistral-11b-warm.Q3_K_M.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Local Apps

llama.cpp

How to use norallm/normistral-11b-warm with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf norallm/normistral-11b-warm:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf norallm/normistral-11b-warm:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf norallm/normistral-11b-warm:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf norallm/normistral-11b-warm:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf norallm/normistral-11b-warm:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf norallm/normistral-11b-warm:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf norallm/normistral-11b-warm:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf norallm/normistral-11b-warm:Q4_K_M

Use Docker

docker model run hf.co/norallm/normistral-11b-warm:Q4_K_M

vLLM

How to use norallm/normistral-11b-warm with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "norallm/normistral-11b-warm"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "norallm/normistral-11b-warm",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
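
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style completion response. The helper names `build_completion_request` and `complete` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": "norallm/normistral-11b-warm",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: the response follows the OpenAI completions schema.
    return body["choices"][0]["text"]


# Building the request does not contact the server; it only shows the payload.
request = build_completion_request("Once upon a time,")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Calling `complete("Once upon a time,")` sends the request. Passing a different `base_url` points the same helper at another OpenAI-compatible endpoint.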

Use Docker

docker model run hf.co/norallm/normistral-11b-warm:Q4_K_M

SGLang

How to use norallm/normistral-11b-warm with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "norallm/normistral-11b-warm" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "norallm/normistral-11b-warm",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "norallm/normistral-11b-warm" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "norallm/normistral-11b-warm",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Ollama
How to use norallm/normistral-11b-warm with Ollama:
```
ollama run hf.co/norallm/normistral-11b-warm:Q4_K_M
```

Unsloth Studio

How to use norallm/normistral-11b-warm with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for norallm/normistral-11b-warm to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for norallm/normistral-11b-warm to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for norallm/normistral-11b-warm to start chatting

Docker Model Runner
How to use norallm/normistral-11b-warm with Docker Model Runner:
```
docker model run hf.co/norallm/normistral-11b-warm:Q4_K_M
```

Lemonade

How to use norallm/normistral-11b-warm with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull norallm/normistral-11b-warm:Q4_K_M

Run and chat with the model

lemonade run user.normistral-11b-warm-Q4_K_M

List all available models

lemonade list

NorMistral-11b-warm is a large Norwegian language model initialized from Mistral-Nemo-Base-2407 and continually pretrained on a total of 250 billion subword tokens – using a mix of Scandinavian, Sámi, English and code data (four repetitions of open Norwegian texts). The model is introduced in the paper Small Languages, Big Models: A Study of Continual Training on Languages of Norway by Samuel et al. 2025, and forms part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo (LTG).

Disclaimer: This model is pretrained on raw (mostly web-based) textual data. It is not finetuned to follow instructions, and it can generate harmful completions after inappropriate user prompts. It is primarily intended for research purposes.

License

We release the model under the Apache 2.0 license to indicate that we do not impose any additional constraints on the model weights. However, we do not own the data in the training collection.

Pretraining corpus

The model is pretrained on a combination of publicly available data and a custom web crawl for Sámi. The total training corpus consists of 250 billion tokens from the following sources:

Norwegian text (Bokmål and Nynorsk); this collection was created by the National Library of Norway and it's a prerelease of an update of NCC (codenamed "Mímir core"). It consists of: a) the public part of Norwegian Colossal Corpus (NCC) with permissible licenses (i.e. it doesn't include newspaper texts with the CC BY-NC 2.0 license); b) Bokmål and Nynorsk CulturaX, and c) Bokmål and Nynorsk HPLT corpus v1.2.
Northern Sámi texts are sourced from a) Glot500; b) the SIKOR North Saami free corpus; and c) a custom web crawl (seeded from Sámi Wikipedia external links) published separately as ltg/saami-web.
Additional languages for knowledge/language transfer: a) Danish, Swedish, Icelandic, and Faroese from CulturaX and Glot500; b) high-quality English from FineWeb-edu; and c) programming code from The Stack v2 (the high-quality subset).

The corpus is carefully balanced through strategic upsampling to handle the resource disparity between languages. Following data-constrained scaling laws, the corpus data for target languages is repeated multiple times (up to 16x for low-resource languages) to reach the optimal training budget while avoiding overfitting.

Tokenizer

This model uses a new tokenizer, specially trained on the target languages. Because it splits text in these languages into fewer subword tokens, it offers substantially faster inference than the original Mistral-Nemo-Base-2407 model. The table below lists the subword-to-word split ratios across different languages (lower is better):

Tokenizer	Vocabulary size	Bokmål	Nynorsk	Sámi	Danish	Swedish
Mistral-Nemo-Base-2407	131072	1.79	1.87	2.63	1.82	2.00
NorMistral-11b-warm	51200	1.22	1.28	1.82	1.33	1.39
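
To make the ratios concrete, the short sketch below computes how many fewer tokens the new tokenizer needs for the same text, using only the numbers from the table. It is an illustration, not a benchmark: the actual speed-up at inference time also depends on hardware, batch size, and sequence length. The names `SPLIT_RATIOS` and `token_savings` are introduced here for the example.

```python
# Subword-to-word split ratios copied from the table above:
# language -> (Mistral-Nemo-Base-2407, NorMistral-11b-warm)
SPLIT_RATIOS = {
    "Bokmål": (1.79, 1.22),
    "Nynorsk": (1.87, 1.28),
    "Sámi": (2.63, 1.82),
    "Danish": (1.82, 1.33),
    "Swedish": (2.00, 1.39),
}


def token_savings(original_ratio, new_ratio):
    """Fraction of tokens saved when the same text is encoded with the new tokenizer."""
    return 1.0 - new_ratio / original_ratio


savings = {
    language: token_savings(original, new)
    for language, (original, new) in SPLIT_RATIOS.items()
}

for language, saved in savings.items():
    print(f"{language}: {saved:.1%} fewer tokens for the same text")
```

Since a causal model generates one token per decoding step, fewer tokens for the same text translates into proportionally fewer steps.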

Evaluation

More details about the evaluation setup and the new Norwegian benchmarks will be described in upcoming papers.

Model details

Model Developers: Language Technology Group at the University of Oslo in collaboration with NORA.LLM.

Architecture: NorMistral-11B uses the Mistral architecture based on an improved Llama design, featuring:

Pre-normalization with RMSNorm
SwiGLU activation function
Rotary positional embeddings
Grouped-query attention
40 transformer layers
Hidden dimension: 5,120
Intermediate dimension: 14,336
32 query heads and 8 key & value heads (dimension 128)
Vocabulary size: 51,200 tokens
Total parameters: 11.4 billion
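
As a sanity check, the total parameter count can be reproduced from the dimensions listed above. The sketch below assumes a standard Mistral-style layout that the list does not spell out: no bias terms, separate (untied) input and output embedding matrices, two RMSNorm weight vectors per layer, and one final RMSNorm. Under these assumptions the result rounds to the stated 11.4 billion.

```python
# Dimensions taken from the architecture list above.
layers = 40
hidden = 5_120
intermediate = 14_336
query_heads = 32
kv_heads = 8
head_dim = 128
vocab = 51_200

# Attention projections (assumption: no bias terms).
q_proj = hidden * query_heads * head_dim
k_proj = hidden * kv_heads * head_dim
v_proj = hidden * kv_heads * head_dim
o_proj = query_heads * head_dim * hidden
attention = q_proj + k_proj + v_proj + o_proj

# SwiGLU feed-forward block: gate, up, and down projections.
feed_forward = 3 * hidden * intermediate

# Two RMSNorm weight vectors per layer (before attention and before the MLP).
norms = 2 * hidden

per_layer = attention + feed_forward + norms

# Assumption: input embeddings and output head are separate matrices.
embeddings = 2 * vocab * hidden
final_norm = hidden

total = layers * per_layer + embeddings + final_norm
print(f"Parameters per layer: {per_layer:,}")
print(f"Total parameters:     {total:,} (~{total / 1e9:.1f} billion)")
```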

Training Details:

Training tokens: 250 billion
Batch size: 1,024 × 4,096 tokens (# sequences × sequence length)
Training steps: 60,000
Peak learning rate: 1e-4
Warm-up steps: 1,000
Learning rate decay steps: 10,000
Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
Weight decay: 0.1
Training precision: bfloat16
Hardware: 256 AMD MI250X GPUs (128 GB)
Training time: 8.5 days
Theoretical computation: 2.0e22 FLOP
Model FLOP/s utilization (MFU): 38%
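
The token budget follows directly from the batch size and the number of steps. The sketch below redoes that arithmetic and adds a rough compute estimate using the common 6 × parameters × tokens rule of thumb. That rule is an assumption introduced here, not something the list above states; it ignores the cost of attention over the sequence, so it is expected to land somewhat below the reported theoretical computation of 2.0e22.

```python
# Values taken from the training details above.
sequences_per_batch = 1_024
sequence_length = 4_096
training_steps = 60_000
parameters = 11.4e9

# Tokens processed in one optimizer step and over the whole run.
tokens_per_step = sequences_per_batch * sequence_length
total_tokens = tokens_per_step * training_steps

# Rule-of-thumb estimate (assumption): about 6 FLOP per parameter per token
# for one forward and one backward pass.
approx_flop = 6 * parameters * total_tokens

print(f"Tokens per step: {tokens_per_step:,}")
print(f"Total tokens:    {total_tokens:,} (~{total_tokens / 1e9:.0f} billion)")
print(f"Approx. compute: {approx_flop:.2e} FLOP")
```

The total comes out slightly above 250 billion tokens, consistent with the rounded figure quoted throughout.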

Unique Features:

Hybrid masked-causal training (90% causal LM, 10% masked next-token prediction)
Can be used as both a causal generative model and a bidirectional encoder model
Three-stage continual pretraining:
1. Tokenizer optimization for target languages
2. Embedding weight realignment
3. Full model training

Base Model: Initialized from Mistral-Nemo-Base-2407

License: Apache-2.0

Example usage

Basic Causal Language Model Usage

Here's how to use NorMistral-11B as a standard causal language model for translation:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Import the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-11b-warm").cuda().eval()

# Define zero-shot translation prompt template
prompt = """Engelsk: {0}
Bokmål:"""

# Define tokens that should end the generation (any token with a newline)
eos_token_ids = [
    token_id
    for token_id in range(tokenizer.vocab_size)
    if '\n' in tokenizer.decode([token_id])
]

# Generation function
@torch.no_grad()
def generate(text):
    text = prompt.format(text)
    input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
    prediction = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=eos_token_ids
    )
    return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()

# Example usage
generate("I'm excited to try this new Norwegian language model!")
# > Expected output: 'Jeg er spent på å prøve denne nye norske språkmodellen!'

Memory-Efficient Loading

For systems with limited VRAM, you can load the model in 8-bit or 4-bit quantization:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")

# Load in 8-bit mode (requires ~12GB VRAM and the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    device_map='auto',
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.bfloat16
)

# Or load in 4-bit mode (requires ~8GB VRAM)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    device_map='auto',
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16
)

NorMistral-11b is also a bidirectional masked language model

Having been pretrained on a mixed causal-masked objective, this model knows how to process texts bidirectionally. You can thus finetune this model like any other BERT (or any other prefix language model). The model can also be used directly for masked language modeling:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# First, we will have to import the tokenizer and the language model
# we can use CausalLM instead of MaskedLM just fine
tokenizer = AutoTokenizer.from_pretrained(
    "norallm/normistral-11b-warm"
)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm"
).cuda().eval()

# A partially-masked input text string
text = "En søt lundefugl flyr over de<mask>norske fjorder."
input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()

# An empty (all-zero) attention mask allows unconstrained bidirectional attention
attention_mask = torch.zeros(input_ids.size(0), 1, input_ids.size(1), input_ids.size(1), device=input_ids.device)

output_logits = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    return_dict=True
).logits
predictions = output_logits[0, :, :].argmax(dim=-1)

# Expected output:
# En søt lundefugl flyr over de<mask> norske fjorder. -> En søt lundefugl flyr over de vakre norske fjorder.
print(f"{tokenizer.decode(input_ids[0, 1:])} -> {tokenizer.decode(predictions[:-1])}")

Citation

@inproceedings{samuel-etal-2025-small,
    title = "Small Languages, Big Models: {A} Study of Continual Training on Languages of {Norway}",
    author = "Samuel, David  and
      Mikhailov, Vladislav  and
      Velldal, Erik  and
      {\O}vrelid, Lilja  and
      Charpentier, Lucas Georges Gabriel  and
      Kutuzov, Andrey  and
      Oepen, Stephan",
    editor = "Johansson, Richard  and
      Stymne, Sara",
    booktitle = "Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)",
    month = mar,
    year = "2025",
    address = "Tallinn, Estonia",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2025.nodalida-1.61/",
    pages = "573--608",
    ISBN = "978-9908-53-109-0",
}