After downloading, unzip and run from that directory.
Detailed Download Tables by Platform
Use the tables below to determine which llama.cpp binary is best for your environment and download the relevant binary (version b7075), or browse all releases and find the latest version here.
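As a sketch, unpacking a downloaded release archive looks like this (the archive name depends on the release and platform you chose, and the layout inside the archive can vary by platform):

```bash
# Unpack the release archive into its own directory
# (replace <platform> with the asset you downloaded, e.g. a CPU or GPU build)
unzip llama-b7075-bin-<platform>.zip -d llama.cpp-b7075
cd llama.cpp-b7075

# Run the llama.cpp binaries (llama-cli, llama-server, ...) from this directory
```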
Performance Benchmarks

If you are considering investing in hardware, here are profiling results from a variety of machines and inference backends. As it currently stands, AMD Ryzen™ machines generally offer best-in-class performance with relatively standard llama.cpp configuration settings, and custom configurations tend to widen this advantage.
| Device | Prefill speed (tok/s) | Decode speed (tok/s) |
| --- | --- | --- |
| AMD Ryzen™ AI Max+ 395 | 5476 | 143 |
| AMD Ryzen™ AI 9 HX 370 | 2680 | 113 |
| Apple Mac Mini (M4) | 1427 | 122 |
| Qualcomm Snapdragon™ X1E-78-100 | 978 | 125 |
| Intel Core™ Ultra 9 185H | 1310 | 58 |
| Intel Core™ Ultra 7 258V | 1104 | 78 |
Note: for a fair comparison, we ran all benchmarks on the same model (LFM2-1.2B-Q4_0.gguf). For each device, we also tested every publicly available llama.cpp binary with different thread counts (4, 8, 12) for CPU runners, and report the best prefill and decode numbers independently.
llama.cpp uses the GGUF format, which stores quantized model weights for efficient inference. All LFM models are available in GGUF format on Hugging Face; see the Models page for the full list. You can download LFM models in GGUF format from Hugging Face as follows:
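For example, a minimal sketch using the Hugging Face CLI (the repository and file names below are illustrative; substitute the model you want from the Models page):

```bash
# Install the Hugging Face Hub CLI
pip install -U huggingface_hub

# Download a single quantized GGUF file into the current directory
# (repository and file names are examples, not a definitive reference)
huggingface-cli download LiquidAI/LFM2-1.2B-GGUF LFM2-1.2B-Q4_0.gguf --local-dir .
```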
llama.cpp offers three main interfaces for running inference: llama-cpp-python (Python bindings), llama-server (OpenAI-compatible server), and llama-cli (interactive CLI).
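Using llama-server

llama-server exposes an OpenAI-compatible HTTP endpoint. A minimal launch sketch is shown below; the model and projector file names are illustrative, and multimodal flags can vary between llama.cpp builds:

```bash
# Serve a GGUF model on port 8080 with an OpenAI-compatible API
# --mmproj loads the vision projector for multimodal models
# (flag availability depends on your llama.cpp build)
llama-server -m lfm2.5-vl-1.6b-q4_k_m.gguf --mmproj mmproj-model-f16.gguf --port 8080
```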
Once the server is running, you can query it with any OpenAI-compatible client:

```python
from openai import OpenAI
import base64

client = OpenAI(
    base_url="http://localhost:8080/v1",  # The hosted llama-server
    api_key="not-needed"
)

# Encode image to base64
with open("image.jpg", "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model="lfm2.5-vl-1.6b",  # Model name should match your server configuration
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
                {"type": "text", "text": "What's in this image?"}
            ]
        }
    ],
    max_tokens=256
)

print(response.choices[0].message.content)
```

For Python applications that embed the model directly in-process, use the llama-cpp-python package.

Installation:
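A typical installation from PyPI (GPU-accelerated backends may require extra build flags; see the llama-cpp-python documentation):

```bash
pip install llama-cpp-python
```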
Using llama-cpp-python
```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Initialize with vision support
# Note: Use the correct chat handler for your model architecture
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="lfm2.5-vl-1.6b-q4_k_m.gguf",
    chat_handler=chat_handler,
    n_ctx=4096
)

# Generate with image
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=256
)

print(response["choices"][0]["message"]["content"])
```
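Using llama-cli

For quick interactive testing from the terminal, llama-cli loads the same GGUF files. A minimal sketch (the model file name is illustrative, and vision models may need additional multimodal flags depending on your build):

```bash
# Run an interactive session against a local GGUF model
llama-cli -m lfm2-1.2b-q4_0.gguf -p "Explain the GGUF format in one sentence."
```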
If you have a finetuned model or need to create a GGUF from a Hugging Face model:
```bash
# Clone llama.cpp if you haven't already
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Convert the Hugging Face model to GGUF (f16 precision)
python convert_hf_to_gguf.py /path/to/your/model --outfile model-f16.gguf --outtype f16

# Quantize the f16 GGUF to a smaller format (requires the compiled llama-quantize binary)
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
Use --outtype to set the conversion precision (e.g., f16, bf16, q8_0). For k-quant formats such as q4_0, q4_k_m, q5_k_m, and q6_k, run the converted GGUF through llama-quantize as shown above.