Speeding up Llama with GPU

Ariya Hidayat
Oct 27, 2023

In my previous post, I illustrated how to run an LLM (Large Language Model) using only a CPU. Of course, the performance can be inadequate. This time, let's use a GPU to speed up inference.

In some cases, inference speed can increase tenfold. Why? Because the heavy tensor computations are delegated, via CUDA (Compute Unified Device Architecture), to an NVIDIA GPU (Graphics Processing Unit), which executes them efficiently and in parallel.

How do we do it? Just like before, we first fetch the llama.cpp project:

$ sudo apt install make g++
$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp

Before compiling, make sure the CUDA toolkit and the GPU driver are installed correctly. The easiest first check is nvcc, whose output should look like this:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
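If nvcc is not found, the CUDA toolkit still needs to be installed. On Ubuntu, one possible route (assuming you are not managing CUDA through NVIDIA's own repositories) is:

$ sudo apt install nvidia-cuda-toolkit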

Also, ensure nvidia-smi presents the correct information:
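$ nvidia-smi

With no arguments, nvidia-smi prints the driver version, the CUDA version, and every detected GPU along with its VRAM size and current utilization.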

For this experiment, I used a GTX 1080 Ti, an older GPU (launched in early 2017) but still useful for tinkering with machine learning, especially because of its decent 11 GB of VRAM.

The compilation process can be done as follows:

$ make LLAMA_CUBLAS=1 \
    NVCCFLAGS="--forward-unknown-to-host-compiler -arch=sm_61"

Where does the sm_61 value passed to -arch come from? Visit developer.nvidia.com/cuda-gpus and look up the GPU in use, in this example the GTX 1080 Ti. There you will see that its Compute Capability is 6.1, which we turn into sm_61. For other GPU models, the value will differ.
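As an illustration, the RTX 3080 that appears in the benchmark later has a Compute Capability of 8.6, so the same build would use sm_86 instead:

$ make LLAMA_CUBLAS=1 \
    NVCCFLAGS="--forward-unknown-to-host-compiler -arch=sm_86"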

After a while, llama.cpp should be ready to use. There should now be a binary named server; verify that it is linked against the CUDA libraries with ldd (note the cublas and cudart entries):

$ ldd ./server
linux-vdso.so.1 (0x00007fff45714000)
libcublas.so.11 => /lib/x86_64-linux-gnu/libcublas.so.11 (0x00007fc620e00000)
libcudart.so.11.0 => /lib/x86_64-linux-gnu/libcudart.so.11.0 (0x00007fc620a00000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc620600000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc62ad32000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc62a9e0000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc620200000)
libcublasLt.so.11 => /lib/x86_64-linux-gnu/libcublasLt.so.11 (0x00007fc60ac00000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc62ad2b000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fc62ad26000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc62a9db000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc62ae2f000)

Next, fetch the model you want to use. If you want speed, try Orca Mini 3B, which is small yet competent enough for chatting. For a smarter model, look for the 7B variant of Llama 2. Either way, the model must be in GGML format, preferably quantized to at least 4 bits.
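As an illustration, quantized GGML builds of these models are published by TheBloke on Hugging Face. Here is a sketch of downloading the 4-bit Llama 2 7B Chat file used below (the repository and file names are assumptions, so double-check them on the model page):

$ wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_M.bin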

Once the model is obtained, run the Llama server like this:

./server -m /path/to/llama-2-7b-chat.ggmlv3.q4_K_M.bin -ngl 100

Then open a browser at localhost:8080 to access a simple web interface for chatting with the model.
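The server also exposes an HTTP API, so the model can be queried without the browser. A minimal sketch using curl against the /completion endpoint (the parameter names follow the llama.cpp server README at the time of writing; adjust them if your build differs):

$ curl -s http://localhost:8080/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "How to cook rendang", "n_predict": 128}'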

The key here is -ngl 100, which requests that up to 100 layers be offloaded (via CUDA) to the GPU. In practice, the number of layers actually offloaded depends on the model being run. In the example above, the 7B variant of Llama 2 has about 35 layers, so only those are processed with CUDA.

Without the -ngl parameter, inference runs entirely on the CPU and the GPU is ignored.
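If the model does not fit in VRAM, pass a smaller value to offload only some of the layers and keep the rest on the CPU, trading speed for memory. For example, with the same model file as above:

./server -m /path/to/llama-2-7b-chat.ggmlv3.q4_K_M.bin -ngl 20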

How does the inference speed compare with the CPU? Here are my test results.

To test, I posed a simple question, "How to cook rendang." All parameters such as temperature, top_k, and top_p were left at their defaults. The three models I tested were Llama 2 7B, Llama 2 13B, and Orca Mini 3B. Besides the GTX 1080 Ti, I also included a comparison with the more recent RTX 3080 (released three years ago), which has 10 GB of VRAM.

From the results, it is clear that smaller models run faster. And it turns out that, besides being great for gaming, GPUs can also be very beneficial for artificial intelligence (AI)!