Running Llama 2 on a Laptop

Ariya Hidayat
Sep 1, 2023

Llama 2 is the latest large language model (LLM) from Meta. Because its weights are openly available, Llama 2 can also run locally on desktop machines and even laptops with modest specifications, as long as we are prepared for the speed (or slowness) of the inference process.

Compared to its predecessor, Llama 2 promises improved quality and a license update, making it usable for commercial applications.

The most popular way to enjoy Llama 2 lately is through an open-source project called llama.cpp. As the name suggests, this is a C++ implementation of Llama inference. There are two ways to use llama.cpp: compiling the C++ code directly or going through a Python module. The latter is easier, so let's try that first.

Note that the following walkthrough shows how to use Llama 2 on Unix-like systems, such as Linux or macOS. For Windows users, it is recommended to use WSL.

LLM in Python 

First, make sure pip is available for Python 3. For Debian/Ubuntu-based Linux, install it by running:

$ sudo apt install python3-pip python3-venv 

Then check with:

$ pip --version 
pip 23.0.1 from /usr/lib/python3/dist-packages/pip (python 3.11)

For cleanliness, let's create a virtual environment:

$ python3 -m venv ./venv
$ source ./venv/bin/activate

Then we install the Python module named llm:

$ pip install llm

Which can be checked with:

$ llm --version
llm, version 0.6.1

Before using Llama 2, we need to install an extra plugin called llm-gpt4all, which will provide access to a collection of models from the gpt4all project:

$ llm install llm-gpt4all
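
To verify that the plugin was registered (the exact output depends on the installed versions), llm can list the plugins it recognizes:

$ llm plugins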

Next, we can check which models are accessible:

$ llm models list | grep llama
gpt4all: llama-2-7b-chat - Llama-2-7B Chat, 3.53GB download, needs 8GB RAM

Note the model name, llama-2-7b-chat. There's also information that using this model requires at least 8 GB of RAM.
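
To check whether the machine meets that requirement, the available memory can be inspected first, for example on Linux:

$ free -h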

Now we can use that model, for example, to answer a question:

$ llm -m llama-2-7b-chat 'What is the capital of Indonesia?'
The capital of Indonesia is Jakarta.

The first time it's run, the model of course needs to be downloaded first (in the example above, a roughly 3.5 GB file), so the process will take a while, depending on the speed of the internet connection. Fortunately, the model is saved in a cache (usually in $HOME/.cache/gpt4all), so subsequent executions start much faster.
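
To confirm where the model file landed, we can peek at that cache directory (the exact filename may differ):

$ ls -lh ~/.cache/gpt4all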

$ llm -m llama-2-7b-chat 'Apakah ibukota Indonesia?'
I'm glad you asked! The capital city of Indonesia is Jakarta.

Note the limitations of this Llama 2 model, such as answering in English even though the question ("Apakah ibukota Indonesia?", i.e. "What is the capital of Indonesia?") was asked in Indonesian.

For more detailed use, please refer to the official llm site at llm.datasette.io.

llama.cpp

How about using llama.cpp directly? Since this is a C++ project, make sure a usable C++ compiler is available. On Debian- or Ubuntu-based Linux systems, this can be done by installing the build dependencies:

$ sudo apt install make g++

Next, we need to fetch llama.cpp from GitHub and start the compilation process:

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ make

If there are no errors, there should be an executable named main. We can verify that it was built and linked correctly with:

$ ldd ./main 
linux-vdso.so.1 (0x00007fffac9c0000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f9d3e7a6000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f9d3e6c7000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f9d3e6a7000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9d3e4c6000)
/lib64/ld-linux-x86-64.so.2 (0x00007f9d3eb15000)

Next, we need a usable model. The easiest way is to go to Hugging Face and search for Llama 2 in GGML format, for example https://huggingface.co/TheBloke/Llama-2-7B-GGML. For convenience, open the Files tab and download llama-2-7b.ggmlv3.q4_0.bin.
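
For example, the file can be fetched straight from the command line (the URL below follows Hugging Face's usual download pattern; adjust it if the repository layout has changed):

$ wget https://huggingface.co/TheBloke/Llama-2-7B-GGML/resolve/main/llama-2-7b.ggmlv3.q4_0.bin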

After downloading, we can start "chatting" with the model by running a command like this (adjust based on the location of the newly downloaded model file):

$ ./main -m ./models/llama-2-7b.ggmlv3.q4_0.bin \
 -i -f prompts/chat-with-bob.txt -r "User:" \
 -c 512 -b 1024 -n 256 --keep 48 \
 --repeat_penalty 1.0 --color

Because it uses the --color option, what we type (as a question) appears in green, distinct from the model's answer in white. The model's responses can sometimes be a bit slow, since the speed depends heavily on the CPU. If running Llama 2 on a laptop with a not-so-fast CPU, please be patient.
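
For a one-off question without the interactive chat, main can also be given a single prompt directly, for example:

$ ./main -m ./models/llama-2-7b.ggmlv3.q4_0.bin \
  -p "What is the capital of Indonesia?" -n 64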

Again, note that Llama 2 has many limitations when using the Indonesian language. However, for interactions in English, Llama 2 is becoming more competent and can rival human intelligence!