The other day, I was intrigued by the possibility of running small LLMs on my Raspberry Pi 5.
Just for fun, I tried a couple of the most famous ones using ollama, but I became curious about the process people use to convert the raw models to a quantized version suited for lower-resource devices.
What follows is a detailed account of my one-day journey: a step-by-step process for reducing and running the original llama3 with llama.cpp, along with some notes on what I learned from the experience.
Create a folder called llama3 in your home folder (~):
mkdir ~/llama3
Enter this folder:
cd ~/llama3/
From now on, everything will happen from inside the project path (~/llama3).
The next thing you need to do is download llama3 from Meta: https://llama.meta.com/llama-downloads/
The process is straightforward: fill out the form and accept Meta’s policy. Then it’s just a matter of running the download script and providing the authentication URL displayed after you submit the request form.
bash <(curl -s https://raw.githubusercontent.com/meta-llama/llama3/main/download.sh)
The script will ask which models to download. Since we are considering a resource-constrained environment, the 8B model is the best choice.
Enter the list of models to download without spaces (8B,8B-instruct,70B,70B-instruct), or press Enter for all:
Just type 8B-instruct, and the download process will start. This can take some time, depending on your internet connection, as it needs to download around 15GB of data; the downloaded content should look like this:
pascal@starbase:~/llama3$ tree
.
├── LICENSE
├── Meta-Llama-3-8B-Instruct
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ ├── params.json
│ └── tokenizer.model
└── USE_POLICY
2 directories, 6 files
As you can guess, the “model” is the largest file, the one with the .pth extension: a checkpoint saved from PyTorch.
Somewhere in Meta’s pipeline that builds llama3, there is likely a line like this:
torch.save(model.state_dict(), "consolidated.00.pth")
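And you can load that same checkpoint back with torch.load to peek at what is inside. A minimal sketch (this pulls the whole ~15GB checkpoint into RAM, so only try it on a machine with enough memory; the path is the one from the download above):

# peek at the raw PyTorch checkpoint (illustrative only)
import torch

state_dict = torch.load("Meta-Llama-3-8B-Instruct/consolidated.00.pth", map_location="cpu")

# print every tensor's name, shape and dtype
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)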
If you were fine-tuning llama3, this is the file you would be working on. However, since the plan here is just to reduce the model to run it on a commodity CPU, you will need to convert it to a format that llama.cpp can understand.
With llama2, you could use llama.cpp’s convert.py script to get a gguf file directly, but for llama3 that script no longer works; the suggested approach is to use the HuggingFace transformers library as an intermediate step.
Create a virtual environment:
python3 -m venv .venv
Activate the environment:
source .venv/bin/activate
Install the transformers library and its dependencies:
pip install transformers transformers[torch] tiktoken blobfile sentencepiece
Run convert_llama_weights_to_hf.py
The script was installed in the previous step, and to access it, you need to use the full path inside the virtualenv folder (the path may differ slightly if you have a different Python version):
python .venv/lib/python3.11/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir Meta-Llama-3-8B-Instruct --model_size 8B --output_dir hf --llama_version 3
Depending on your hardware, this step can take a few minutes, but when it finishes, you get a new folder called “hf”. The project structure at this point should look like this:
(.venv) pascal@starbase:~/llama3$ tree
.
├── hf
│ ├── config.json
│ ├── generation_config.json
│ ├── model-00001-of-00004.safetensors
│ ├── model-00002-of-00004.safetensors
│ ├── model-00003-of-00004.safetensors
│ ├── model-00004-of-00004.safetensors
│ ├── model.safetensors.index.json
│ ├── special_tokens_map.json
│ ├── tokenizer_config.json
│ └── tokenizer.json
├── LICENSE
├── Meta-Llama-3-8B-Instruct
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ ├── params.json
│ └── tokenizer.model
└── USE_POLICY
3 directories, 16 files
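Before converting again, you can optionally sanity-check the hf folder by loading it with the transformers library you just installed. A minimal sketch, assuming you have enough RAM to hold the full-precision model (around 16GB in bfloat16):

# optional sanity check of the converted HuggingFace folder (illustrative)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf")
model = AutoModelForCausalLM.from_pretrained("hf", torch_dtype=torch.bfloat16)

# generate a few tokens just to confirm the weights and tokenizer line up
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))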
You are very close to the final model format, but llama.cpp can’t run the HF format directly either, so now you need to convert it again, this time to the gguf format.
Clone llama.cpp from github:
git clone https://github.com/ggerganov/llama.cpp.git
Build it (this will take a few minutes):
make -C llama.cpp/
Convert the model from HuggingFace format to gguf:
./llama.cpp/convert-hf-to-gguf.py hf/ --outtype f32 --outfile meta-llama-3-8B-instruct.gguf
Your project should now look like this (directories content omitted for brevity):
(.venv) pascal@starbase:~/llama3$ tree -h -L 1
[4.0K] .
├── [4.0K] hf
├── [7.6K] LICENSE
├── [4.0K] llama.cpp
├── [4.0K] Meta-Llama-3-8B-Instruct
├── [ 30G] meta-llama-3-8B-instruct.gguf
└── [8.6K] USE_POLICY
4 directories, 3 files
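If you are curious about what ended up inside the file, the gguf Python package (the same helper library llama.cpp’s conversion scripts use, installable with pip install gguf) can list its metadata and tensors. A rough sketch:

# list the tensors stored in the gguf file (illustrative; needs `pip install gguf`)
from gguf import GGUFReader

reader = GGUFReader("meta-llama-3-8B-instruct.gguf")

# print every tensor with its shape and quantization type
for tensor in reader.tensors:
    print(tensor.name, tensor.shape, tensor.tensor_type)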
You can already run meta-llama-3-8B-instruct.gguf using llama.cpp or ollama, but this is the full-precision model, and it will be very slow. You need to reduce it to get satisfactory performance on limited hardware with no GPU.
A brief explanation from HuggingFace:
Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32). Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. It also allows to run models on embedded devices, which sometimes only support integer data types.
If you are not very savvy about CS, here is a simpler version:
Quantization simplifies the model by representing its internal data with smaller numbers. This makes mathematical operations easier and faster. However, this simplification can lead to a slight decrease in the model’s accuracy, such as less certainty about the next word it should output.
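To make that concrete, here is a toy sketch of the idea: split the weights into small blocks, store one scale per block, and round each weight to an 8-bit integer. This is not llama.cpp’s actual Q8_0 code, just an illustration of the principle:

# toy blockwise 8-bit quantization (illustration only, not llama.cpp's real layout)
import numpy as np

def quantize_block(block):
    # one scale per block, chosen so the largest magnitude maps to 127
    scale = np.abs(block).max() / 127.0
    q = np.round(block / scale).astype(np.int8)  # 1 byte per weight instead of 4
    return q, scale

def dequantize_block(q, scale):
    # the reverse mapping is only approximate: that is where the accuracy loss comes from
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)  # a block of 32 float32 weights
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
print("max rounding error:", np.abs(block - restored).max())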
The quantization is done using a tool called quantize that was built in an earlier step; you can check the available options by simply calling it:
(.venv) pascal@starbase:~/llama3$ ./llama.cpp/quantize
... content omitted for brevity ...
Allowed quantization types:
2 or Q4_0 : 3.56G, +0.2166 ppl @ LLaMA-v1-7B
3 or Q4_1 : 3.90G, +0.1585 ppl @ LLaMA-v1-7B
8 or Q5_0 : 4.33G, +0.0683 ppl @ LLaMA-v1-7B
9 or Q5_1 : 4.70G, +0.0349 ppl @ LLaMA-v1-7B
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
31 or IQ1_M : 1.75 bpw quantization
10 or Q2_K : 2.63G, +0.6717 ppl @ LLaMA-v1-7B
21 or Q2_K_S : 2.16G, +9.0634 ppl @ LLaMA-v1-7B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
11 or Q3_K_S : 2.75G, +0.5551 ppl @ LLaMA-v1-7B
12 or Q3_K_M : 3.07G, +0.2496 ppl @ LLaMA-v1-7B
13 or Q3_K_L : 3.35G, +0.1764 ppl @ LLaMA-v1-7B
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 3.59G, +0.0992 ppl @ LLaMA-v1-7B
15 or Q4_K_M : 3.80G, +0.0532 ppl @ LLaMA-v1-7B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 4.33G, +0.0400 ppl @ LLaMA-v1-7B
17 or Q5_K_M : 4.45G, +0.0122 ppl @ LLaMA-v1-7B
18 or Q6_K : 5.15G, +0.0008 ppl @ LLaMA-v1-7B
7 or Q8_0 : 6.70G, +0.0004 ppl @ LLaMA-v1-7B
1 or F16 : 14.00G, -0.0020 ppl @ Mistral-7B
32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
Deciding which quantization level you want is not an easy exercise, and it is worth reading up on the trade-offs of each option. On a recent laptop, though, Q8_0 might provide a good balance:
./llama.cpp/quantize meta-llama-3-8B-instruct.gguf meta-llama-3-8B-instruct-Q8_0.gguf Q8_0
After quantize finishes, you will get a new gguf model that is only 8GB, compared to the original 30GB:
(.venv) pascal@starbase:~/llama3$ tree -L 1 -h
[4.0K] .
├── [4.0K] hf
├── [7.6K] LICENSE
├── [4.0K] llama.cpp
├── [4.0K] Meta-Llama-3-8B-Instruct
├── [ 30G] meta-llama-3-8B-instruct.gguf
├── [8.0G] meta-llama-3-8B-instruct-Q8_0.gguf
└── [8.6K] USE_POLICY
4 directories, 4 files
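The new size matches a quick back-of-the-envelope check: with roughly 8 billion parameters, 4 bytes per weight gives about 30GiB for F32, while Q8_0 stores about one byte per weight plus a small per-block scale, which lands around 8GiB:

# rough size arithmetic (assumes ~8 billion parameters)
params = 8.0e9
gib = 1024 ** 3

f32_size = params * 4 / gib         # 4 bytes per weight
q8_size = params * 8.5 / 8 / gib    # ~8 bits per weight plus per-block scale overhead

print(f"F32 : ~{f32_size:.0f} GiB")   # ~30 GiB
print(f"Q8_0: ~{q8_size:.0f} GiB")    # ~8 GiB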
You now have a reduced model that will run very well on modern CPUs. The .gguf file is just like the ones you find on HuggingFace and can be used with ollama, lmstudio, or any other tool you normally use.
But you can test it right away in chat mode using llama.cpp itself:
./llama.cpp/main -m meta-llama-3-8B-instruct-Q8_0.gguf -n 512 --repeat_penalty 1.0 --color -i -r "User:" -f llama.cpp/prompts/chat-with-bob.txt
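If you would rather drive the quantized model from Python than from the interactive CLI, the llama-cpp-python bindings (pip install llama-cpp-python) can load the same gguf file. A minimal sketch; the context size and thread count below are just illustrative values:

# run the Q8_0 model from Python (illustrative; needs `pip install llama-cpp-python`)
from llama_cpp import Llama

llm = Llama(
    model_path="meta-llama-3-8B-instruct-Q8_0.gguf",
    n_ctx=2048,     # context window
    n_threads=4,    # match your CPU core count
)

output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(output["choices"][0]["text"])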