Instructions to use BSC-LT/salamandra-2b-base-gptq with libraries and local apps.

Libraries

How to use BSC-LT/salamandra-2b-base-gptq with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="BSC-LT/salamandra-2b-base-gptq")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/salamandra-2b-base-gptq")
model = AutoModelForCausalLM.from_pretrained("BSC-LT/salamandra-2b-base-gptq")
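
The direct-loading snippet above stops once the model is in memory. As a minimal sketch of the next step, the hypothetical helper `generate_completion` below tokenizes a prompt, samples a continuation and decodes it. The defaults (64 new tokens, temperature 0.5) are illustrative. It assumes PyTorch is installed, along with whatever quantization backend this checkpoint needs, which the card does not name.

```python
MODEL_ID = "BSC-LT/salamandra-2b-base-gptq"


def generate_completion(prompt: str, max_new_tokens: int = 64, temperature: float = 0.5) -> str:
    """Load the checkpoint and return the prompt followed by a sampled continuation."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Tokenize the prompt and move the tensors to the device the model lives on.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sampling must be enabled for temperature to take effect
        temperature=temperature,
    )
    # The returned sequence contains the prompt tokens followed by the new ones.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_completion("El mercat del barri ")` would download the checkpoint on first use. Because this is a base model, the result is a plain continuation of the prompt, not a chat reply.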

Local Apps

vLLM

How to use BSC-LT/salamandra-2b-base-gptq with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "BSC-LT/salamandra-2b-base-gptq"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "BSC-LT/salamandra-2b-base-gptq",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
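
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names `build_completion_request` and `complete` are hypothetical, and the response is assumed to follow the OpenAI completions shape, with the generated text under `choices[0].text`.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for the /v1/completions endpoint without sending it."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumed OpenAI-style response: a list of choices, each carrying a "text" field.
    return body["choices"][0]["text"]
```

With the server from the previous step running, `complete("http://localhost:8000", "BSC-LT/salamandra-2b-base-gptq", "Once upon a time,")` would return the continuation. Only the base URL needs to change for a server listening on another port.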

SGLang

How to use BSC-LT/salamandra-2b-base-gptq with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "BSC-LT/salamandra-2b-base-gptq" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "BSC-LT/salamandra-2b-base-gptq",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "BSC-LT/salamandra-2b-base-gptq" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "BSC-LT/salamandra-2b-base-gptq",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use BSC-LT/salamandra-2b-base-gptq with Docker Model Runner:
```
docker model run hf.co/BSC-LT/salamandra-2b-base-gptq
```

Salamandra-2b-gptq Model Card

This model is the GPTQ-quantized version of Salamandra-2b for speculative decoding.

The model weights are quantized from FP16 to W4A16 (4-bit weights and FP16 activations) using the GPTQ algorithm. Inference with this model can be run using vLLM.
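
To get a feel for what W4A16 means for weight storage, here is a back-of-the-envelope sketch. Every number in it is an assumption and none comes from the card: a nominal 2 billion parameters (taken from the "2B" label), a group size of 128, and 20 bits of per-group metadata (a 16-bit scale plus a 4-bit zero point). It also assumes every weight is quantized, whereas in practice some layers may stay in FP16.

```python
def weight_memory_gib(num_params, bits_per_weight, group_size=None, metadata_bits_per_group=0):
    """Estimate weight storage in GiB, optionally adding per-group quantization metadata."""
    total_bits = num_params * bits_per_weight
    if group_size:
        # Each group of weights carries its own metadata (e.g. scale and zero point).
        total_bits += (num_params / group_size) * metadata_bits_per_group
    return total_bits / 8 / 2**30


NOMINAL_PARAMS = 2e9  # assumption: the "2B" label, not an exact parameter count

fp16_gib = weight_memory_gib(NOMINAL_PARAMS, bits_per_weight=16)
w4_gib = weight_memory_gib(
    NOMINAL_PARAMS,
    bits_per_weight=4,
    group_size=128,              # assumption: not stated in the card
    metadata_bits_per_group=20,  # assumption: 16-bit scale + 4-bit zero point
)

print(f"FP16 weights:  {fp16_gib:.2f} GiB")
print(f"W4A16 weights: {w4_gib:.2f} GiB")
print(f"Reduction:     {fp16_gib / w4_gib:.2f}x")
```

Under these assumptions the weights shrink from roughly 3.7 GiB to just under 1 GiB. Activations stay in FP16, so runtime memory for activations and the KV cache is not reduced by this scheme.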

Salamandra is a highly multilingual model pre-trained from scratch. It comes in three sizes (2B, 7B and 40B parameters), each with base and instruction-tuned variants. The project is promoted and financed by the Government of Catalonia through the Aina Project, and by the Ministerio para la Transformación Digital y de la Función Pública, funded by the EU (NextGenerationEU), within the framework of the ILENIA Project with reference 2022/TL22/00215337.

This model card corresponds to the GPTQ-quantized version of Salamandra-2b for speculative decoding.
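
For readers unfamiliar with the term, speculative decoding lets a small, fast draft model propose several tokens that a larger target model then verifies. The toy sketch below shows only the greedy variant of that idea. It uses two made-up next-token functions instead of real models, and checks proposals one position at a time, whereas a real engine would score all proposed positions in a single batched forward pass. All names are hypothetical, and none of this reflects how vLLM implements the technique.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def greedy_decode(next_token: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_greedy_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    num_draft_tokens: int = 4,
) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes up to num_draft_tokens tokens.
        k = min(num_draft_tokens, goal - len(tokens))
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks each proposed token in order.
        for i, guess in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if guess != expected:
                # Keep the agreed prefix, then take the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every proposed token matched the target's choice.
            tokens.extend(proposal)
    return tokens


def toy_target(context: List[int]) -> int:
    """Stand-in for the large model: a fixed arithmetic rule on the last token."""
    return (3 * context[-1] + 1) % 7


def toy_draft(context: List[int]) -> int:
    """Stand-in for the small model: agrees with the target except after tokens divisible by 3."""
    correct = toy_target(context)
    return correct if context[-1] % 3 else (correct + 1) % 7


reference = greedy_decode(toy_target, [2], 10)
speculative = speculative_greedy_decode(toy_target, toy_draft, [2], 10)
print(speculative == reference)
```

The property worth noting is that the output is identical to what the target model would produce on its own. The draft model only affects how many tokens are accepted per verification round, which is why a cheap quantized model is a natural fit for the draft role.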

The entire Salamandra family is released under a permissive Apache 2.0 license.

How to Use

The following example code works under Python 3.9.16, vllm==0.6.3.post1, torch==2.4.0 and torchvision==0.19.0, though it should run on any current version of the libraries. It creates a text completion with the model:

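from vllm import LLM, SamplingParams

model_name = "BSC-LT/salamandra-2b-base-gptq"
llm = LLM(model=model_name)  # downloads the checkpoint on first use

# Sample up to 200 new tokens at a moderate temperature.
sampling_params = SamplingParams(temperature=0.5, max_tokens=200)

# generate() returns one result per prompt; each result holds its completions.
outputs = llm.generate("El mercat del barri ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)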

Author

International Business Machines (IBM).

Copyright

International Business Machines (IBM).

Contact

For further information, please send an email to langtech@bsc.es.

Acknowledgements

We appreciate the collaboration with IBM in this work. Specifically, the IBM team created the GPTQ-quantized version of the Salamandra-2b model for speculative decoding released here.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

Barcelona Supercomputing Center and International Business Machines shall not be held liable for any outcomes resulting from third-party use.