Build llama.cpp from Source – CUDA – Ubuntu Server

While you can download pre-built binaries, building from source is the best way to ensure you have the latest optimizations, full support for your specific hardware (especially if you are using an NVIDIA GPU or Apple Silicon), and the ability to modify the code for your specific needs.

Official docs

Prerequisites

Install build dependencies

sudo apt update
sudo apt install build-essential cmake libssl-dev libopenblas-dev libmkl-dev pkg-config libglvnd-dev libglvnd0 git

Install NVIDIA drivers

List available drivers

ubuntu-drivers devices

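Pick the latest driver version from that list and install it together with the CUDA toolkit. The example below uses version 580; adjust the number to match what is available on your system.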

sudo apt install nvidia-driver-580-server nvidia-utils-580-server nvidia-cuda-toolkit
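Before moving on, it can help to confirm that the driver and the CUDA compiler are actually visible. The sketch below is only a quick sanity check; `check_tool` is a small helper defined here for illustration and is not part of any package. A reboot is usually needed before a freshly installed driver is loaded.

```shell
# Report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

check_tool nvidia-smi   # installed with the NVIDIA utilities package
check_tool nvcc         # installed with nvidia-cuda-toolkit

# Query the tools only when they are present, so the script never aborts.
if command -v nvidia-smi >/dev/null 2>&1; then nvidia-smi || true; fi
if command -v nvcc >/dev/null 2>&1; then nvcc --version || true; fi
```

If `nvidia-smi` lists your GPU and `nvcc --version` prints a CUDA release, the prerequisites are in place.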

Get llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Build llama.cpp

Look up the CUDA compute capability of your GPU. In my test I used an old GT 1030 2GB, whose compute capability is “6.1”; cmake expects it without the dot, as 61.
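If you would rather ask the driver than search online, something like the following may work. It assumes your installed driver supports the `compute_cap` query field in `nvidia-smi`, which older drivers may not. `cap_to_arch` is a helper written for this example to turn a value like 6.1 into the form cmake expects.

```shell
# Convert a compute capability such as "6.1" into the "61" form used by
# CMAKE_CUDA_ARCHITECTURES.
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

# Print each GPU's name and compute capability, if nvidia-smi is available.
# Assumption: the driver is recent enough to know the compute_cap field.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader || true
fi

cap_to_arch 6.1   # prints 61
```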

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build --config Release -j$(nproc)

Run llama.cpp

That's all. You should now have a “build” directory containing all the executables (under build/bin), such as llama-server and llama-cli, ready to run your own models.
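As a minimal sketch of a first run, the snippet below starts llama-server with all layers offloaded to the GPU. The model path is hypothetical, so point `MODEL` at a GGUF file you have actually downloaded. `can_run` is a helper defined here only to avoid a confusing error when a file is missing.

```shell
# Hypothetical locations; override them through the environment if needed.
MODEL="${MODEL:-$HOME/models/model.gguf}"
BIN_DIR="${BIN_DIR:-build/bin}"

# Succeeds only if the binary is executable and the model file exists.
can_run() {
    [ -x "$1" ] && [ -f "$2" ]
}

if can_run "$BIN_DIR/llama-server" "$MODEL"; then
    # -ngl 99 offloads all layers to the GPU, as in the benchmark settings.
    "$BIN_DIR/llama-server" -m "$MODEL" -ngl 99 --port 8080
else
    echo "llama-server or model not found; check $BIN_DIR and $MODEL"
fi
```

Once the server is running, it listens on the chosen port (8080 here) until you stop it with Ctrl+C.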

Bench

The precompiled binaries for Linux use Vulkan, while my build from source uses CUDA. Here you can see the difference between the two backends:

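| model                   | size       | params   | backend | ngl | n_batch | type_k | type_v | fa | test  | t/s           |
| ----------------------- | ---------- | -------- | ------- | --- | ------- | ------ | ------ | -- | ----- | ------------- |
| qwen3 0.6B Q4_K - Medium | 372.65 MiB | 596.05 M | Vulkan  | 99  | 1024    | q8_0   | q8_0   | 1  | pp512 | 494.23 ± 0.55 |
| qwen3 0.6B Q4_K - Medium | 372.65 MiB | 596.05 M | Vulkan  | 99  | 1024    | q8_0   | q8_0   | 1  | tg128 | 64.77 ± 0.22  |
| qwen3 0.6B Q4_K - Medium | 372.65 MiB | 596.05 M | CUDA    | 99  | 1024    | q8_0   | q8_0   | 1  | pp512 | 834.55 ± 2.12 |
| qwen3 0.6B Q4_K - Medium | 372.65 MiB | 596.05 M | CUDA    | 99  | 1024    | q8_0   | q8_0   | 1  | tg128 | 40.72 ± 0.02  |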

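Prompt processing (pp512) is about 1.7x faster with CUDA than with Vulkan (834.55 vs 494.23 t/s). Token generation (tg128), however, was slower with CUDA in this test (40.72 vs 64.77 t/s).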
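The ratios between the two backends can be worked out from the table with a bit of awk. The sketch also shows how a comparable run might be started, assuming llama-bench was built into build/bin and `MODEL` points at your own GGUF file. `speedup` is a helper defined for this example.

```shell
# Ratio of two throughput values, rounded to two decimals.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 834.55 494.23   # pp512, CUDA vs Vulkan: prints 1.69
speedup 40.72 64.77     # tg128, CUDA vs Vulkan: prints 0.63

# Re-run the benchmark with the settings shown in the table
# (ngl 99, batch 1024, q8_0 KV cache, flash attention on).
# pp512 and tg128 are the default llama-bench tests.
if [ -x build/bin/llama-bench ] && [ -f "${MODEL:-}" ]; then
    build/bin/llama-bench -m "$MODEL" -ngl 99 -b 1024 -ctk q8_0 -ctv q8_0 -fa 1
fi
```

Results depend heavily on the GPU, driver, and model, so treat these numbers as one data point from a GT 1030 rather than a general ranking of the backends.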