Instructions for using maddes8cht/tiiuae-falcon-40b-instruct-gguf with libraries, inference providers, notebooks, and local apps.
- Libraries
- llama-cpp-python
How to use maddes8cht/tiiuae-falcon-40b-instruct-gguf with llama-cpp-python:
# !pip install llama-cpp-python huggingface-hub
# (Llama.from_pretrained downloads from the Hub and needs huggingface-hub installed)
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="maddes8cht/tiiuae-falcon-40b-instruct-gguf",
    filename="tiiuae-falcon-40b-instruct-Q2_K.gguf",
)

output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use maddes8cht/tiiuae-falcon-40b-instruct-gguf with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
Use Docker
docker model run hf.co/maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use maddes8cht/tiiuae-falcon-40b-instruct-gguf with Ollama:
ollama run hf.co/maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
- Unsloth Studio
How to use maddes8cht/tiiuae-falcon-40b-instruct-gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for maddes8cht/tiiuae-falcon-40b-instruct-gguf to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for maddes8cht/tiiuae-falcon-40b-instruct-gguf to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for maddes8cht/tiiuae-falcon-40b-instruct-gguf to start chatting
- Docker Model Runner
How to use maddes8cht/tiiuae-falcon-40b-instruct-gguf with Docker Model Runner:
docker model run hf.co/maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
- Lemonade
How to use maddes8cht/tiiuae-falcon-40b-instruct-gguf with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M
Run and chat with the model
lemonade run user.tiiuae-falcon-40b-instruct-gguf-Q4_K_M
List all available models
lemonade list
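Several of the commands above (llama-server/llama-cli `-hf`, `ollama run`, `docker model run`, `lemonade pull`) identify the model as `repo:QUANT`, while the llama-cpp-python snippet needs a `repo_id` plus an explicit `filename`. The sketch below bridges the two. Both helper names are hypothetical, and the `<model>-<QUANT>.gguf` naming pattern is an assumption inferred only from the single filename `tiiuae-falcon-40b-instruct-Q2_K.gguf` shown above. Check the repository's file list before relying on it.

```python
# Hypothetical helpers: turn a "repo:QUANT" tag into (repo_id, filename).
# ASSUMPTION: files are named "<repo name without -gguf>-<QUANT>.gguf",
# inferred from "tiiuae-falcon-40b-instruct-Q2_K.gguf"; not guaranteed.


def split_hf_tag(tag: str) -> tuple[str, str]:
    """Split '[hf.co/]owner/repo:QUANT' into ('owner/repo', 'QUANT')."""
    repo_id, sep, quant = tag.removeprefix("hf.co/").partition(":")
    if not sep or not quant or "/" not in repo_id:
        raise ValueError(f"expected 'owner/repo:QUANT', got {tag!r}")
    return repo_id, quant


def guess_gguf_filename(repo_id: str, quant: str) -> str:
    """Guess the GGUF filename for a quantization level (assumed pattern)."""
    repo_name = repo_id.split("/")[-1]
    # Assumes the repo name is the model name plus a '-gguf' suffix.
    model = repo_name.removesuffix("-gguf")
    return f"{model}-{quant}.gguf"


if __name__ == "__main__":
    repo_id, quant = split_hf_tag(
        "hf.co/maddes8cht/tiiuae-falcon-40b-instruct-gguf:Q4_K_M"
    )
    print(repo_id)                             # maddes8cht/tiiuae-falcon-40b-instruct-gguf
    print(guess_gguf_filename(repo_id, quant))  # tiiuae-falcon-40b-instruct-Q4_K_M.gguf
```

The resulting pair could then be passed as `repo_id=` and `filename=` to `Llama.from_pretrained` in the llama-cpp-python snippet.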
Failure loading in text-generation-webui.
Using the command:
python server.py --model maddes8cht_tiiuae-falcon-40b-instruct-gguf --listen --listen-port 4664 --verbose --api --xformers
... (or when pointing it at a particular gguf file), I get:
2023-09-24 22:26:10 INFO:Loading settings from settings.json...
Traceback (most recent call last):
  File "/home/user/text-generation-webui/server.py", line 216, in
    model_settings = get_model_metadata(model_name)
  File "/home/user/text-generation-webui/modules/models_settings.py", line 31, in get_model_metadata
    for k in settings[pat]:
TypeError: 'NoneType' object is not iterable
I have no problem loading other ggufs.
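The traceback shows `for k in settings[pat]` failing because `settings[pat]` is `None`. One plausible way for that to happen, assumed here and not confirmed anywhere in this thread, is a name-pattern entry that matches the model but has no settings under it, which a YAML loader reads as `None`. The mapping and both functions below are a hypothetical minimal reproduction, not text-generation-webui's actual code. They show the crash and a defensive guard.

```python
# Hypothetical reproduction of "TypeError: 'NoneType' object is not iterable".
# ASSUMPTION: settings map regex patterns -> per-model settings dicts, and an
# empty entry (e.g. "falcon:" with nothing under it in YAML) becomes None.
import re

settings = {
    r".*falcon": None,                   # empty entry, parsed as None
    r".*llama": {"loader": "llama.cpp"},
}


def get_model_metadata_unsafe(model_name: str) -> dict:
    model_settings = {}
    for pat in settings:
        if re.match(pat, model_name.lower()):
            for k in settings[pat]:      # raises TypeError when the entry is None
                model_settings[k] = settings[pat][k]
    return model_settings


def get_model_metadata_safe(model_name: str) -> dict:
    model_settings = {}
    for pat, values in settings.items():
        if re.match(pat, model_name.lower()):
            # Treat an empty (None) entry as "no settings" instead of crashing.
            model_settings.update(values or {})
    return model_settings
```

With the model directory name from the command above, `maddes8cht_tiiuae-falcon-40b-instruct-gguf`, the unsafe version raises the same `TypeError` as in the report. The guarded version returns an empty dict.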
Can you tell me at which quantization levels the error occurs?
All of them?
This was one of the very first models I quantized, and while things were not as smooth in the beginning, I'm pretty sure I tested them at least with llama.cpp's main and server.
Did you get any of my other quantized models working?
All of them, sadly.
I haven't tried any of your other models yet. I was waiting for maddes8cht/ehartford-WizardLM-Uncensored-Falcon-40b, which now appears to be online, so I'll download it and try it out as soon as I can. :)
Thanks for your work, BTW, on actual open-source-licensed models. I wish more people would pay attention to licensing! LLaMA 2's viral license is a lot more insidious than many people realize: it harnesses the open-source community to do development for Meta, only allows the outputs of LLaMA 2 derivatives to be used for training other LLaMA 2-derived models, and retains full commercial rights if any project ever goes big. FYI, if you're looking for more open-source models, I noticed that TigerBot is Apache licensed. :)
I would really like to get this fixed.
So I'm eager to know whether the WizardLM files work; please give me feedback.
It would also help to get links to some examples of gguf files that do work for you.
I've tested one of the wizardLM files (the smallest one), and it does work. :) Will test more later this evening.
Okay, that's strange.
I've tested both of them again, and they both work for me in llama.cpp. I've also checked with a hex viewer whether this file, the 40b instruct, has the same GGUF file version, which it does. In the end, I was almost sure that none of my models would work for you and that it's possibly a problem with Oobabooga.
But glad to hear that it works.
I tested with the files I quantized and then used for the upload.
I may have to upload the model again; maybe something went wrong with the upload. It took me a while to get the uploads working without constantly losing the connection, given these very large files and an asymmetric internet connection whose upload speed is only a fraction of its download speed.
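The hex-viewer check described above, comparing GGUF file versions, can also be done in a few lines of Python. According to the GGUF specification, a file starts with the four ASCII bytes `GGUF` followed by a uint32 format version. The sketch assumes a little-endian file, which is the common case. The function name is illustrative. It reads only the first 8 bytes, so it is cheap even for multi-gigabyte 40B quantizations.

```python
# Sketch: print the GGUF format version of one or more files, replacing a
# manual hex-viewer check. Assumes a little-endian GGUF file.
import struct
import sys


def read_gguf_version(path: str) -> int:
    """Return the GGUF version stored right after the b'GGUF' magic."""
    with open(path, "rb") as f:
        header = f.read(8)  # 4-byte magic + 4-byte version
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path!r} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    for p in sys.argv[1:]:
        print(p, read_gguf_version(p))
```

Saved as a script, it takes one or more `.gguf` paths as arguments. Running it on a model that loads and on one that fails is a quick way to rule a format-version mismatch in or out.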
"FYI, if you're looking for more open-source models, I noticed that TigerBot is Apache licensed. :)"
It's still a Llama 2 based model, and I don't think relicensing it as Apache 2 is entirely OK. They are allowed to license their own work as Apache 2, which should mainly be their training dataset, but not the full resulting model; I think this is exactly what Meta's Llama 2 license prohibits.
So this is a similarly blurry situation as with the early Alpaca models based on Llama (1): they released their dataset as open source, but the status of the resulting models was very unclear.
BTW, TheBloke has already converted the TigerBot models.
My focus is on models he no longer seems to care about.
I don't see a point in duplicating work he has already done.
"It's still a Llama 2 based model" - are you sure about that? I was under the impression that it was a foundational model. Says it was trained on 300B tokens, which would be an awful lot for a non-foundational model.
But yeah, no need to duplicate work :)
It's broken again as of about nine days ago, due to changes in llama.cpp.
This model works for me with the latest llama.cpp. (Did not try in text-generation-webui though.)
Yes, all the Falcon models were updated around the end of October / beginning of November and should work with current llama.cpp versions.
:)
I get this error when loading with the llama.cpp loader in text-generation-webui:
...
llama_model_loader: - type f32: 242 tensors
llama_model_loader: - type q8_0: 1 tensors
llama_model_loader: - type q6_K: 241 tensors
ERROR: byte not found in vocab: '
'
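Background, not confirmed as the cause of this particular error: Falcon uses a GPT-2-style byte-level BPE tokenizer. Such a tokenizer stores every raw byte as a printable stand-in character, so the newline byte appears in the vocabulary as `Ċ` (U+010A), not as a literal `\n`. A converter and a loader have to agree on this mapping when looking bytes up, and mismatched versions are one plausible way a lookup like the one above could fail. The sketch reproduces the standard GPT-2 byte-to-character table. As the next reply shows, the fix in this thread was simply updating text-generation-webui.

```python
# Sketch of the GPT-2-style byte -> printable-character table used by
# byte-level BPE tokenizers. Shown as background only.


def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    codepoints = printable[:]
    n = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (controls, space, etc.) are shifted to U+0100 onward.
            printable.append(b)
            codepoints.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(printable, codepoints)}


if __name__ == "__main__":
    table = bytes_to_unicode()
    print(repr(table[ord("\n")]))  # 'Ċ'
    print(repr(table[ord(" ")]))   # 'Ġ'
```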
I fixed this by reinstalling text-generation-webui with the newest version.
llama.cpp is under heavy development, so third-party software that uses it (like Oobabooga's text-generation-webui) needs to be updated as well to keep up.
There have been several changes regarding Falcon support in llama.cpp, which may have taken some time to be implemented in Oobabooga.