June 2, 2024

Quantize and run the original llama3 8B with llama.cpp

The other day, I was intrigued by the possibility of running small LLMs on my Raspberry Pi 5.

Just for fun, I tried a couple of the most famous ones using ollama, but I became curious about the process people use to convert the raw models to a quantized version suited for lower-resource devices.

What follows is a detailed account of my one-day journey: a step-by-step process for shrinking and running the original llama3 with llama.cpp, along with some notes on what I learned from the experience.

Before you start

Create a folder called llama3 in your home folder (~):

mkdir ~/llama3

Enter this folder:

cd ~/llama3/

From now on, everything will happen from inside the project path (~/llama3).

Download Llama3 models

The next thing you will need to do is download llama3 from Meta: https://llama.meta.com/llama-downloads/

The process is straightforward: fill out the form and accept Meta’s policy. Then it’s just a matter of running the download script and pasting in the authentication URL displayed after you submit the request form.

bash <(curl -s  https://raw.githubusercontent.com/meta-llama/llama3/main/download.sh)

The script will ask which models to download. Since we are considering a resource-constrained environment, the 8B model is the best choice.

Enter the list of models to download without spaces (8B,8B-instruct,70B,70B-instruct), or press Enter for all:

Just type 8B-instruct, and the download process will start. This can take some time, depending on your internet connection, as it needs to download around 15GB of data; the downloaded content should look like this:

pascal@starbase:~/llama3$ tree
.
├── LICENSE
├── Meta-Llama-3-8B-Instruct
│   ├── checklist.chk
│   ├── consolidated.00.pth
│   ├── params.json
│   └── tokenizer.model
└── USE_POLICY

2 directories, 6 files
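
The download also ships a checklist.chk file. If you want to double-check that nothing got corrupted in transit, a small Python sketch like the following works, assuming checklist.chk uses the standard md5sum format (one “<md5>  <filename>” pair per line):

import hashlib
from pathlib import Path

# Assumes checklist.chk is in the standard md5sum format: "<md5>  <filename>"
model_dir = Path.home() / "llama3" / "Meta-Llama-3-8B-Instruct"

for line in (model_dir / "checklist.chk").read_text().splitlines():
    expected, name = line.split()
    digest = hashlib.md5()
    with open(model_dir / name, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read 1MB at a time
            digest.update(chunk)
    status = "OK" if digest.hexdigest() == expected else "MISMATCH"
    print(f"{name}: {status}")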

As you can guess, the “model” is the large file with the .pth extension, which is a checkpoint saved by PyTorch.

It’s likely that somewhere in Meta’s pipeline that builds llama3, there is a line like this:

torch.save(model.state_dict(), "consolidated.00.pth")

If you were fine-tuning llama3, this is the file you would be working on. However, since the plan here is just to shrink the model so it can run on a commodity CPU, you need to convert it to a format that llama.cpp understands.
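
If you are curious about what is actually inside that checkpoint before converting it, you can peek at it with PyTorch. A minimal sketch (careful: torch.load pulls the whole ~15GB into RAM):

import torch

# Load the raw checkpoint on the CPU and list a few tensors.
# Careful: this loads the whole ~15GB file into RAM.
state_dict = torch.load(
    "Meta-Llama-3-8B-Instruct/consolidated.00.pth", map_location="cpu"
)
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)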

Convert llama3 Torch data to HuggingFace

With llama2, you could use llama.cpp’s convert.py script to get a gguf file directly, but it no longer works for llama3; the suggested approach is to use the HuggingFace transformers library as an intermediate step.

Create a virtual environment:

python3 -m venv .venv 

Activate the environment:

source .venv/bin/activate

Install the transformers library and its dependencies:

pip install transformers transformers[torch] tiktoken blobfile sentencepiece

Run convert_llama_weights_to_hf.py

The script was installed in the previous step; to run it, use the full path inside the virtualenv folder (the path may differ slightly if you have another Python version):

python .venv/lib/python3.11/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir Meta-Llama-3-8B-Instruct --model_size 8B --output_dir hf --llama_version 3

Depending on your hardware, this step can take a few minutes, but when it finishes, you get a new folder called “hf”. The project structure at this point should look like this:

(.venv) pascal@starbase:~/llama3$ tree
.
├── hf
│   ├── config.json
│   ├── generation_config.json
│   ├── model-00001-of-00004.safetensors
│   ├── model-00002-of-00004.safetensors
│   ├── model-00003-of-00004.safetensors
│   ├── model-00004-of-00004.safetensors
│   ├── model.safetensors.index.json
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── tokenizer.json
├── LICENSE
├── Meta-Llama-3-8B-Instruct
│   ├── checklist.chk
│   ├── consolidated.00.pth
│   ├── params.json
│   └── tokenizer.model
└── USE_POLICY

3 directories, 16 files
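
As an optional sanity check, the transformers library you just installed can load the converted folder; reading only the config and tokenizer is cheap, since it does not touch the weight shards:

from transformers import AutoConfig, AutoTokenizer

# Only the config and tokenizer are loaded here, not the weight shards.
config = AutoConfig.from_pretrained("hf")
tokenizer = AutoTokenizer.from_pretrained("hf")
print(config.model_type, config.num_hidden_layers, config.vocab_size)
print(tokenizer.encode("Hello llama3"))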

Convert llama3 hf to gguf

You are very close to the final model format, but llama.cpp can’t run the HF format directly either, so you need to convert it once more, this time to the gguf format.

Clone llama.cpp from github:

git clone https://github.com/ggerganov/llama.cpp.git

Build it (this will take a few minutes):

make -C llama.cpp/

Convert the model from HuggingFace format to gguf:

./llama.cpp/convert-hf-to-gguf.py hf/ --outtype f32 --outfile meta-llama-3-8B-instruct.gguf

Your project should now look like this (directory contents omitted for brevity):

(.venv) pascal@starbase:~/llama3$ tree -h -L 1
[4.0K]  .
├── [4.0K]  hf
├── [7.6K]  LICENSE
├── [4.0K]  llama.cpp
├── [4.0K]  Meta-Llama-3-8B-Instruct
├── [ 30G]  meta-llama-3-8B-instruct.gguf
└── [8.6K]  USE_POLICY

4 directories, 3 files

You can already run meta-llama-3-8B-instruct.gguf using llama.cpp or ollama, but this is the full-precision model and it will be very slow. You need to shrink it to run satisfactorily with limited resources and no GPU.

Model quantization

A brief explanation from HuggingFace:

Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).

Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. It also allows to run models on embedded devices, which sometimes only support integer data types.

If you are not very savvy about computer science, here is a simpler version:

Quantization simplifies the model by representing its internal data with smaller numbers. This makes mathematical operations easier and faster. However, this simplification can lead to a slight decrease in the model’s accuracy, such as less certainty about the next word it should output.
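
To make that concrete, here is a toy sketch of the idea in Python. It is only the flavor of the technique, not the exact math llama.cpp uses (formats like Q8_0 quantize the weights in small blocks, each with its own scale):

import numpy as np

# Toy illustration: map float32 weights to int8 with a single scale factor,
# then map them back and see how much precision was lost.
weights = np.array([0.12, -0.53, 0.91, -0.07, 0.33], dtype=np.float32)

scale = np.abs(weights).max() / 127                     # one float32 scale
quantized = np.round(weights / scale).astype(np.int8)   # 1 byte per weight
restored = quantized.astype(np.float32) * scale         # dequantized copy

print(quantized)                         # [ 17 -74 127 -10  46]
print(np.abs(weights - restored).max())  # small rounding error (~0.002)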

The quantization is done with a program called quantize that was built in an earlier step; you can list the available options by calling it without arguments:

(.venv) pascal@starbase:~/llama3$ ./llama.cpp/quantize

... content omitted for brevity ...

Allowed quantization types:
   2  or  Q4_0    :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1    :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0    :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1    :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  10  or  Q2_K    :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  21  or  Q2_K_S  :  2.16G, +9.0634 ppl @ LLaMA-v1-7B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M  :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L  :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M  :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M  :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K    :  5.15G, +0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0    :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16     : 14.00G, -0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing

The exercise of deciding which quantization level you want is a difficult one, and it is worth reading up on the trade-offs before choosing.

On a recent laptop, though, Q8_0 should provide a good balance:

./llama.cpp/quantize meta-llama-3-8B-instruct.gguf meta-llama-3-8B-instruct-Q8_0.gguf Q8_0

After quantize finishes, you get a new gguf model that is only 8GB, compared to the original 30GB:

(.venv) pascal@starbase:~/llama3$ tree -L 1 -h
[4.0K]  .
├── [4.0K]  hf
├── [7.6K]  LICENSE
├── [4.0K]  llama.cpp
├── [4.0K]  Meta-Llama-3-8B-Instruct
├── [ 30G]  meta-llama-3-8B-instruct.gguf
├── [8.0G]  meta-llama-3-8B-instruct-Q8_0.gguf
└── [8.6K]  USE_POLICY

4 directories, 4 files
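
If you are curious about what actually changed inside the file, the gguf Python package that ships with llama.cpp (under gguf-py, also available via pip install gguf) can read the metadata. A small sketch, assuming that package is installed:

from gguf import GGUFReader

# List the name, quantization type and shape of the first few tensors.
reader = GGUFReader("meta-llama-3-8B-instruct-Q8_0.gguf")
for tensor in reader.tensors[:10]:
    print(tensor.name, tensor.tensor_type.name, tensor.shape)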

Testing the model

You now have a reduced model that will run very well on modern CPUs. The .gguf file is just like the ones you find on HuggingFace and can be used with ollama, LM Studio, or any other tool you normally use.

But you can test it in chat mode right away using llama.cpp itself:

./llama.cpp/main -m meta-llama-3-8B-instruct-Q8_0.gguf -n 512 --repeat_penalty 1.0 --color -i -r "User:" -f llama.cpp/prompts/chat-with-bob.txt
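
If you prefer to drive the model from Python instead of the interactive CLI, the llama-cpp-python bindings can load the same file. A minimal sketch, assuming pip install llama-cpp-python (the n_ctx and n_threads values here are just reasonable starting points):

from llama_cpp import Llama

# Load the quantized gguf; n_ctx and n_threads are just starting values
# to tune for your machine.
llm = Llama(
    model_path="meta-llama-3-8B-instruct-Q8_0.gguf",
    n_ctx=4096,
    n_threads=8,
)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])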