OPT-1.3B - GPTQ
The model published in this repo was quantized to 4-bit precision using AutoGPTQ.
Quantization details
All quantization parameters were taken from the GPTQ paper.
The GPTQ calibration data consisted of 128 random 2048-token segments from the C4 dataset.
The group size used for quantization is 128.
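To make the group size concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization. It uses plain round-to-nearest with one scale and one offset per group of 128 weights. This is an illustration only, not the GPTQ procedure itself, which additionally uses the calibration data to compensate for rounding error. The function names and the choice to store the group minimum as a floating-point offset are assumptions made for this example.

```python
import random


def quantize_groupwise(weights, bits=4, group_size=128):
    """Round-to-nearest quantization with one (scale, offset) pair per group.

    Illustrative only: GPTQ chooses the integer codes with an
    error-compensating procedure rather than plain rounding.
    """
    levels = 2 ** bits - 1  # 15 distinct steps for 4-bit codes 0..15
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # Fall back to 1.0 so a constant group does not divide by zero.
        scale = (hi - lo) / levels or 1.0
        codes = [min(levels, max(0, round((w - lo) / scale))) for w in group]
        groups.append((scale, lo, codes))
    return groups


def dequantize_groupwise(groups):
    """Rebuild approximate weights from the per-group scale, offset, and codes."""
    return [lo + scale * code for scale, lo, codes in groups for code in codes]


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 0.02) for _ in range(512)]
    groups = quantize_groupwise(weights)
    restored = dequantize_groupwise(groups)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{len(groups)} groups, worst absolute error {worst:.6f}")
```

A smaller group size gives each scale fewer weights to cover, which usually lowers the rounding error at the cost of storing more scales and offsets.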
How to use this GPTQ model from Python code
Install the necessary packages
Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
pip3 install --upgrade transformers optimum
# If using PyTorch 2.1 + CUDA 12.x:
pip3 install --upgrade auto-gptq
# or, if using PyTorch 2.1 + CUDA 11.x:
pip3 install --upgrade auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
If you are using PyTorch 2.0, you will need to install AutoGPTQ from source. Likewise, if you have problems with the pre-built wheels, try building from source:
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout v0.5.1
pip3 install .
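After installing, it can help to confirm that the environment meets the minimum versions listed above. The following is a minimal sketch using only the standard library. The helper names `version_tuple` and `check_requirements` are hypothetical, and the version parsing is deliberately simple (it ignores pre-release ordering), so treat the result as a quick sanity check rather than a full dependency resolver.

```python
import re
from importlib import metadata

# Minimum versions quoted in the requirements line of this model card.
MINIMUMS = {"transformers": "4.33.0", "optimum": "1.12.0", "auto-gptq": "0.4.2"}


def version_tuple(version: str) -> tuple:
    """Turn '4.33.0' or '0.5.1+cu118' into a comparable tuple of three ints."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(match.group()))
    parts = parts[:3]
    return tuple(parts + [0] * (3 - len(parts)))


def check_requirements(minimums=MINIMUMS) -> dict:
    """Map each package name to 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = "ok" if ok else f"too old ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```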
You can then use the following code:
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

# Hub repository that holds the quantized weights and the tokenizer.
pretrained_model_dir = "iproskurina/opt-1.3b-GPTQ-4bit-g128"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
# Load the 4-bit checkpoint onto the first GPU; model_basename is the weight file name without its extension.
model = AutoGPTQForCausalLM.from_quantized(pretrained_model_dir, device="cuda:0", model_basename="model")

# Wrap the model and tokenizer in a text-generation pipeline and print one completion.
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
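Since the requirements above include Transformers and Optimum, the checkpoint can presumably also be loaded through the standard Transformers classes, without calling AutoGPTQ directly. The sketch below assumes a CUDA GPU and the `accelerate` package (needed for `device_map="auto"`); the `generate` helper and its default `max_new_tokens` value are hypothetical names chosen for this example.

```python
MODEL_ID = "iproskurina/opt-1.3b-GPTQ-4bit-g128"


def generate(prompt: str, max_new_tokens: int = 40) -> str:
    """Load the quantized checkpoint through Transformers and continue `prompt`."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # device_map="auto" relies on `accelerate` to place the weights on the available GPU.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    # Tokenize the prompt, move it to the model's device, and decode the continuation.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("auto-gptq is")` downloads the weights on first use. The helper reloads the model on every call to keep the example short; for repeated generation, load the tokenizer and model once and reuse them.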
Run the model with GPTQModel
GPTQModel package: https://github.com/ModelCloud/GPTQModel
pip install -v gptqmodel=="1.8.0" --no-build-isolation
from gptqmodel import GPTQModel
model_id = 'iproskurina/opt-1.3b-GPTQ-4bit-g128'
model = GPTQModel.load(model_id)
result = model.generate("Uncovering deep insights")[0] # tokens
print(model.tokenizer.decode(result)) # string output
Base model: facebook/opt-1.3b