---
quantized_by: QuixiAI
pipeline_tag: text-generation
language:
- en
base_model_relation: quantized
base_model: QuixiAI/Ina-v11.1
license: cc-by-nc-4.0
---
# Llamacpp Quantizations of Ina-v11.1
**Ina** interprets persona definitions as *executable instructions*.
The model follows `<>` blocks with extremely high fidelity even during 10k–15k token erotic or dark-fiction role-play sessions.
Fine-tuned by **BaiAI** and **Eric Hartford (QuixiAI)** using QLoRA + DPO on large volumes of RP logs, creator-voice datasets, and persona modules.
Contributor Credits:
- "Cheshire Cat"
- [FitQueen666](https://huggingface.co/FitQueen666)
- [Jaroslavs Samcuks](https://huggingface.co/yarcat)
- [Eric Hartford](https://huggingface.co/QuixiAI)
---
Quantized using [llama.cpp](https://github.com/ggml-org/llama.cpp).
Original model: https://huggingface.co/QuixiAI/Ina-v11.1
Run them in [LM Studio](https://lmstudio.ai/), or run them directly with [llama.cpp](https://github.com/ggml-org/llama.cpp) or any other llama.cpp-based project.
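Once you have a GGUF file downloaded (see the table below), a minimal invocation with a recent llama.cpp build looks something like this; the quant choice, context size, and prompt are placeholders to adjust for your setup:
```
# Build or install llama.cpp first: https://github.com/ggml-org/llama.cpp
# -m   path to the GGUF file
# -ngl number of layers to offload to the GPU (use 0 for CPU-only)
# -c   context window in tokens
./llama-cli -m ./Ina-v11.1-Q4_K_M.gguf -ngl 99 -c 8192 -p "Hello, Ina."
```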
## Download a file (not the whole branch) from below:
| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| [Ina-v11.1-Q8_0.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/tree/main/Ina-v11.1-Q8_0) | Q8_0 | 70GB | true | Extremely high quality, generally unneeded but max available quant. |
| [Ina-v11.1-Q6_K.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/tree/main/Ina-v11.1-Q6_K) | Q6_K | 54GB | true | Very high quality, near perfect, *recommended*. |
| [Ina-v11.1-Q5_K_M.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-Q5_K_M.gguf) | Q5_K_M | 47GB | false | High quality, *recommended*. |
| [Ina-v11.1-Q5_K_S.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-Q5_K_S.gguf) | Q5_K_S | 38GB | false | High quality, *recommended*. |
| [Ina-v11.1-Q4_K_M.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-Q4_K_M.gguf) | Q4_K_M | 40GB | false | Good quality, default size for most use cases, *recommended*. |
| [Ina-v11.1-Q4_1.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-Q4_1.gguf) | Q4_1 | 41GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| [Ina-v11.1-Q4_K_S.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-Q4_K_S.gguf) | Q4_K_S | 38GB | false | Slightly lower quality with more space savings, *recommended*. |
| [Ina-v11.1-Q4_0.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-Q4_0.gguf) | Q4_0 | 37GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| [Ina-v11.1-IQ4_NL.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-IQ4_NL.gguf) | IQ4_NL | 38GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| [Ina-v11.1-IQ4_XS.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-IQ4_XS.gguf) | IQ4_XS | 36GB | false | Decent quality, smaller than Q4_K_S with similar performance, *recommended*. |
| [Ina-v11.1-Q3_K_L.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-Q3_K_L.gguf) | Q3_K_L | 35GB | false | Lower quality but usable, good for low RAM availability. |
| [Ina-v11.1-Q3_K_M.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-Q3_K_M.gguf) | Q3_K_M | 32GB | false | Low quality. |
| [Ina-v11.1-IQ3_M.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-IQ3_M.gguf) | IQ3_M | 30GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| [Ina-v11.1-Q3_K_S.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-Q3_K_S.gguf) | Q3_K_S | 29GB | false | Low quality, not recommended. |
| [Ina-v11.1-IQ3_XS.gguf](https://huggingface.co/QuixiAI/Ina-v11.1-gguf/blob/main/Ina-v11.1-IQ3_XS.gguf) | IQ3_XS | 27GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
## Downloading using huggingface-cli
First, make sure you have huggingface-cli installed:
```
pip install -U "huggingface_hub[cli]"
```
Then, you can target the specific file you want:
```
huggingface-cli download QuixiAI/Ina-v11.1-gguf --include "Ina-v11.1-Q4_K_M.gguf" --local-dir ./
```
If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run:
```
huggingface-cli download QuixiAI/Ina-v11.1-gguf --include "Ina-v11.1-Q8_0/*" --local-dir ./
```
You can either specify a new local-dir (Ina-v11.1-Q8_0) or download them all in place (./).
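For split quants, point llama.cpp at the first shard and it will load the remaining parts automatically. A sketch, assuming the Q8_0 quant was split into two parts (the actual shard count and filenames depend on the upload):
```
# Given the first shard, llama.cpp picks up -00002-of-00002.gguf and so on automatically
./llama-cli -m ./Ina-v11.1-Q8_0/Ina-v11.1-Q8_0-00001-of-00002.gguf -ngl 99 -p "Hello, Ina."
```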
## Which file should I choose?
A great write-up with charts comparing the performance of various quant types is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9).
The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
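For example, with 48GB of VRAM, the 40GB Q4_K_M fits entirely on the GPU with headroom for context, while the 47GB Q5_K_M leaves almost none. If you're unsure how much memory you have, you can query it from the command line; a quick sketch for Nvidia GPUs on Linux (nvidia-smi ships with the driver, free with the OS):
```
# Total and used VRAM per GPU, in MiB
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
# Total system RAM
free -h
```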
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M.
If you want to get more into the weeds, you can check out this extremely useful feature chart:
[llama.cpp feature matrix](https://github.com/ggml-org/llama.cpp/wiki/Feature-matrix)
But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look at the I-quants. These are in the format IQX_X, like IQ3_M. They are newer and offer better performance for their size.
These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so you'll have to weigh speed against performance.
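Whichever quant you pick, if it doesn't fully fit in VRAM you can still split work between GPU and CPU by offloading only part of the model; a minimal sketch, where the layer count is a placeholder to tune against your available VRAM:
```
# Offload roughly as many layers as fit in VRAM; the rest run on the CPU
./llama-cli -m ./Ina-v11.1-IQ3_M.gguf -ngl 30 -c 8192 -p "Hello, Ina."
```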
## Credits
I copied [Bartowski](https://huggingface.co/bartowski)'s model card and made it my own, cheers!