Instructions to use iproskurina/opt-1.3b-GPTQ-4bit-g128 with libraries and local apps.

Libraries

How to use iproskurina/opt-1.3b-GPTQ-4bit-g128 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="iproskurina/opt-1.3b-GPTQ-4bit-g128")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("iproskurina/opt-1.3b-GPTQ-4bit-g128")
model = AutoModelForCausalLM.from_pretrained("iproskurina/opt-1.3b-GPTQ-4bit-g128")

Local Apps

vLLM

How to use iproskurina/opt-1.3b-GPTQ-4bit-g128 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "iproskurina/opt-1.3b-GPTQ-4bit-g128"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "iproskurina/opt-1.3b-GPTQ-4bit-g128",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
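
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are illustrative and not part of vLLM, and the defaults simply mirror the curl example (port 8000, `max_tokens` 512, `temperature` 0.5).

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="iproskurina/opt-1.3b-GPTQ-4bit-g128",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the decoded JSON body (the server must be running)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `complete("Once upon a time,")` requires a running server. For the SGLang examples on this page, pass `base_url="http://localhost:30000"`.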

Use Docker

docker model run hf.co/iproskurina/opt-1.3b-GPTQ-4bit-g128

SGLang

How to use iproskurina/opt-1.3b-GPTQ-4bit-g128 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "iproskurina/opt-1.3b-GPTQ-4bit-g128" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "iproskurina/opt-1.3b-GPTQ-4bit-g128",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "iproskurina/opt-1.3b-GPTQ-4bit-g128" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "iproskurina/opt-1.3b-GPTQ-4bit-g128",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
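
The curl calls above return an OpenAI-style completions object in which the generated text sits under `choices[0].text`. A small sketch of pulling it out in Python follows; `extract_text` is a hypothetical helper, and `sample_response` is a hand-written stand-in for illustration, not real server output.

```python
def extract_text(response):
    """Return the generated text of the first choice in a completions response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"]


# Hand-written placeholder with the shape of a completions response (not real output).
sample_response = {
    "object": "text_completion",
    "model": "iproskurina/opt-1.3b-GPTQ-4bit-g128",
    "choices": [{"index": 0, "text": " <generated text>", "finish_reason": "length"}],
}

print(extract_text(sample_response))
```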

OPT-1.3B - GPTQ

Model creator: Meta AI
Original model: OPT-1.3B

The model published in this repo was quantized to 4-bit using AutoGPTQ.

Quantization details

All quantization parameters were taken from the GPTQ paper.

The GPTQ calibration data consisted of 128 random 2048-token segments from the C4 dataset.
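
The card does not say how the segments were drawn. As an illustration only, the sketch below assumes each segment is a random contiguous window taken from a document that is long enough; `sample_calibration_segments` and its arguments are hypothetical names, and the input is any list of already tokenized documents.

```python
import random


def sample_calibration_segments(token_streams, n_samples=128, seq_len=2048, seed=0):
    """Draw n_samples random windows of seq_len tokens from tokenized documents."""
    rng = random.Random(seed)
    # Only documents with at least seq_len tokens can supply a full window.
    eligible = [tokens for tokens in token_streams if len(tokens) >= seq_len]
    if not eligible:
        raise ValueError("no document is long enough for one segment")
    segments = []
    for _ in range(n_samples):
        tokens = rng.choice(eligible)
        start = rng.randint(0, len(tokens) - seq_len)  # randint bounds are inclusive
        segments.append(tokens[start:start + seq_len])
    return segments


# Toy "documents" made of integer token ids; the 100-token one is too short and is skipped.
toy_docs = [list(range(5000)), list(range(100)), list(range(3000))]
calibration = sample_calibration_segments(toy_docs)
print(len(calibration), len(calibration[0]))
```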

The group size used for quantization is 128.
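
To make the role of the group size concrete, here is a toy sketch of group-wise 4-bit quantization. It uses plain round-to-nearest rather than GPTQ's error-compensating procedure, so it only illustrates how each block of 128 weights gets its own scale and zero point. The function names are illustrative, and the storage estimate assumes one 16-bit scale and one 4-bit zero point per group.

```python
import math


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    levels = 2 ** bits - 1  # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    zero = round(-lo / scale)
    codes = [min(levels, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Quantize a row in consecutive groups, each with its own scale and zero point."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Average storage per weight once per-group metadata is included (assumed layout)."""
    return bits + (scale_bits + zero_bits) / group_size


row = [math.sin(i) for i in range(256)]  # toy weights: two groups of 128
groups = quantize_row(row)
print(len(groups), effective_bits_per_weight())
```

A smaller group size tracks the local weight range more closely but stores more scales and zero points, which is the trade-off that the value 128 settles.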

How to use this GPTQ model from Python code

Install the necessary packages

Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

pip3 install --upgrade transformers optimum
# If using PyTorch 2.1 + CUDA 12.x:
pip3 install --upgrade auto-gptq
# or, if using PyTorch 2.1 + CUDA 11.x:
pip3 install --upgrade auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

If you are using PyTorch 2.0, you will need to install AutoGPTQ from source. If you have problems with the pre-built wheels, building from source is also worth trying:

pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout v0.5.1
pip3 install .
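
After installing, the minimum versions listed above can be checked from Python. This is a rough sketch using `importlib.metadata`; the helper names are illustrative, and the version parser is deliberately simple (it compares only the first three numeric components and ignores local suffixes such as `+cu118`).

```python
import re
from importlib import metadata

# Minimum versions stated in this card, keyed by pip distribution name.
MINIMUM_VERSIONS = {"transformers": "4.33.0", "optimum": "1.12.0", "auto-gptq": "0.4.2"}


def parse_version(text):
    """Turn '0.7.1+cu118' into (0, 7, 1); suffixes after '+' are ignored."""
    return tuple(int(part) for part in re.findall(r"\d+", text.split("+")[0])[:3])


def meets_minimum(installed, minimum):
    """Compare two version strings on their first three numeric components."""
    return parse_version(installed) >= parse_version(minimum)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Report 'ok', 'missing', or 'too old (...)' for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


print(check_requirements())
```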

You can then use the following code:


from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

# Hub repository that holds the quantized weights
pretrained_model_dir = "iproskurina/opt-1.3b-GPTQ-4bit-g128"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
# Load the 4-bit checkpoint onto the first GPU
model = AutoGPTQForCausalLM.from_quantized(pretrained_model_dir, device="cuda:0", model_basename="model")

generator = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(generator("auto-gptq is")[0]["generated_text"])

LICENSE

Run the model with GPTQModel

GPTQModel package: https://github.com/ModelCloud/GPTQModel

# Install GPTQModel from pip:
pip install -v gptqmodel=="1.8.0" --no-build-isolation

# Then, in Python:
from gptqmodel import GPTQModel

model_id = 'iproskurina/opt-1.3b-GPTQ-4bit-g128'
model = GPTQModel.load(model_id)
result = model.generate("Uncovering deep insights")[0] # tokens
print(model.tokenizer.decode(result)) # string output