Pre-built binaries are available, but building from source gives you the latest optimizations, full support for your hardware (especially an NVIDIA GPU or Apple Silicon), and the freedom to modify the code for your own needs.
Prerequisites
Install build dependencies
sudo apt update
sudo apt install build-essential cmake libssl-dev libopenblas-dev libmkl-dev pkg-config libglvnd-dev libglvnd0 git
Install NVIDIA drivers
List available drivers
ubuntu-drivers devices
Pick the latest driver version from that list and install it, for example:
sudo apt install nvidia-driver-580-server nvidia-utils-580-server nvidia-cuda-toolkit
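After installing the driver (a reboot is usually needed before the new kernel module loads), it is worth confirming that both the driver utilities and the CUDA compiler are on the PATH before building. A minimal sketch; `check_tool` is a hypothetical helper name, not part of any package:

```shell
# check_tool is a hypothetical helper: reports whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

check_tool nvidia-smi || true   # installed with the driver utilities
check_tool nvcc || true         # installed with nvidia-cuda-toolkit
```

If both are found, running `nvidia-smi` on its own should list the GPU and the driver version.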
Get llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
Build llama.cpp
Look up the CUDA compute capability of your GPU. In my test I used an old GT 1030 2GB, whose compute capability is 6.1, so the architecture value is 61.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build --config Release -j$(nproc)
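The value passed to `-DCMAKE_CUDA_ARCHITECTURES` is the compute capability with the dot removed. A small sketch that derives it, assuming your driver's `nvidia-smi` supports the `compute_cap` query field (recent drivers do, older ones may not); `cc_to_arch` is a hypothetical helper name:

```shell
# cc_to_arch is a hypothetical helper: turns a compute capability such as
# "6.1" into the form CMake expects ("61") by removing the dot.
cc_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

# Assumption: nvidia-smi accepts the compute_cap query field.
# Only the first GPU is considered.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
    case "$cap" in
        [0-9]*.[0-9]*)
            echo "Use -DCMAKE_CUDA_ARCHITECTURES=$(cc_to_arch "$cap")" ;;
        *)
            echo "Could not detect the compute capability; look it up manually" ;;
    esac
fi
```

For the GT 1030 used here, `cc_to_arch 6.1` prints `61`, which matches the value in the cmake command.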
Run llama.cpp
That's all. You should now have a "build" directory containing, under bin/, all the executables, such as llama-server and llama-cli, to run your own models.
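To try the build, point one of the executables at a GGUF model. A minimal sketch, assuming the binaries are in `build/bin` (the usual output location of the CMake build) and that you supply your own model file; `start_server` is a hypothetical wrapper and the example path is made up:

```shell
# start_server is a hypothetical wrapper around llama-server.
# $1: path to a GGUF model file (you must supply your own).
start_server() {
    bin="build/bin/llama-server"   # assumed output location of the CMake build
    if [ ! -x "$bin" ]; then
        echo "llama-server not found at $bin" >&2
        return 1
    fi
    if [ ! -f "$1" ]; then
        echo "model file not found: $1" >&2
        return 1
    fi
    # -ngl 99 offloads up to 99 layers to the GPU; --port sets the HTTP port.
    "$bin" -m "$1" -ngl 99 --port 8080
}

# Example (hypothetical path): start_server models/my-model.gguf
```

The wrapper only adds the existence checks; the last line inside it is the actual command, which you can also run directly.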
Bench
The precompiled binaries for Linux use Vulkan, whereas my source build uses CUDA. The table below shows the difference between the two backends.
| model | size | params | backend | ngl | n_batch | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| qwen3 0.6B Q4_K – Medium | 372.65 MiB | 596.05 M | Vulkan | 99 | 1024 | q8_0 | q8_0 | 1 | pp512 | 494.23 ± 0.55 |
| qwen3 0.6B Q4_K – Medium | 372.65 MiB | 596.05 M | Vulkan | 99 | 1024 | q8_0 | q8_0 | 1 | tg128 | 64.77 ± 0.22 |
| qwen3 0.6B Q4_K – Medium | 372.65 MiB | 596.05 M | CUDA | 99 | 1024 | q8_0 | q8_0 | 1 | pp512 | 834.55 ± 2.12 |
| qwen3 0.6B Q4_K – Medium | 372.65 MiB | 596.05 M | CUDA | 99 | 1024 | q8_0 | q8_0 | 1 | tg128 | 40.72 ± 0.02 |
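CUDA is clearly faster at prompt processing (pp512: 834.55 vs. 494.23 t/s, about 1.7 times), but on this card Vulkan is faster at token generation (tg128: 64.77 vs. 40.72 t/s).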
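A run along these lines with `llama-bench` should produce a table with the same columns. The flags are inferred from the table columns (`ngl`, `n_batch`, `type_k`, `type_v`, `fa`, and the `pp512`/`tg128` tests), so treat them as an assumption rather than the exact command behind the numbers above. `run_bench` is a hypothetical wrapper and the binary location is assumed:

```shell
# run_bench is a hypothetical wrapper around llama-bench.
# $1: path to a GGUF model file (you must supply your own).
run_bench() {
    bench="build/bin/llama-bench"   # assumed output location of the CMake build
    if [ ! -x "$bench" ]; then
        echo "llama-bench not found at $bench" >&2
        return 1
    fi
    if [ ! -f "$1" ]; then
        echo "model file not found: $1" >&2
        return 1
    fi
    # -p 512 / -n 128 give the pp512 and tg128 tests; -b sets the batch size;
    # -ctk / -ctv set the KV cache types; -fa 1 enables flash attention.
    "$bench" -m "$1" -ngl 99 -p 512 -n 128 -b 1024 -ctk q8_0 -ctv q8_0 -fa 1
}

# Example (hypothetical path): run_bench models/my-model.gguf
```

Running the same command against the pre-built Vulkan binary and the CUDA build gives the two halves of the comparison.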