After downloading, unzip and run from that directory.
Detailed Download Tables by Platform
Use the tables below to determine which llama.cpp binary is best for your environment and download the relevant binary (version b7075), or browse all releases and find the latest version here.
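As a sketch, unpacking a downloaded release archive looks like this (the archive name depends on the release and platform you chose, and the layout inside the archive can vary by platform):

```bash
# Unpack the release archive into its own directory
# (replace <platform> with the asset you downloaded, e.g. a CPU or GPU build)
unzip llama-b7075-bin-<platform>.zip -d llama.cpp-b7075
cd llama.cpp-b7075

# Run the llama.cpp binaries (llama-cli, llama-server, ...) from this directory
```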
Performance Benchmarks

If you are considering investing in hardware, here are profiling results from a variety of machines and inference backends. As it currently stands, AMD Ryzen™ machines generally offer best-in-class performance with relatively standard llama.cpp configuration settings, and custom configurations tend to widen this advantage.
| Device | Prefill speed (tok/s) | Decode speed (tok/s) |
| --- | --- | --- |
| AMD Ryzen™ AI Max+ 395 | 5476 | 143 |
| AMD Ryzen™ AI 9 HX 370 | 2680 | 113 |
| Apple Mac Mini (M4) | 1427 | 122 |
| Qualcomm Snapdragon™ X1E-78-100 | 978 | 125 |
| Intel Core™ Ultra 9 185H | 1310 | 58 |
| Intel Core™ Ultra 7 258V | 1104 | 78 |
Note: for a fair comparison, we ran all benchmarks on the same model (LFM2-1.2B-Q4_0.gguf). For each device, we also tested every publicly available llama.cpp binary with different thread counts (4, 8, 12) for CPU runners, and report the best prefill and decode numbers independently.
llama.cpp uses the GGUF format, which stores quantized model weights for efficient inference. All LFM models are available in GGUF format on Hugging Face; see the Models page for the full list. You can download LFM models in GGUF format from Hugging Face as follows:
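For example, a minimal sketch using the Hugging Face CLI (the repository and file names below are illustrative; substitute the model you want from the Models page):

```bash
# Install the Hugging Face Hub CLI
pip install -U huggingface_hub

# Download a single quantized GGUF file into the current directory
# (repository and file names are examples, not a definitive reference)
huggingface-cli download LiquidAI/LFM2-1.2B-GGUF LFM2-1.2B-Q4_0.gguf --local-dir .
```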
llama.cpp offers three main interfaces for running inference: llama-cpp-python (Python bindings), llama-server (OpenAI-compatible server), and llama-cli (interactive CLI).
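Using llama-server

llama-server exposes an OpenAI-compatible HTTP endpoint. A minimal launch sketch is shown below; the model and projector file names are illustrative, and multimodal flags can vary between llama.cpp builds:

```bash
# Serve a GGUF model on port 8080 with an OpenAI-compatible API
# --mmproj loads the vision projector for multimodal models
# (flag availability depends on your llama.cpp build)
llama-server -m lfm2.5-vl-1.6b-q4_k_m.gguf --mmproj mmproj-model-f16.gguf --port 8080
```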
Once the server is running, you can query it with any OpenAI-compatible client:

```python
from openai import OpenAI
import base64

client = OpenAI(
    base_url="http://localhost:8080/v1",  # The hosted llama-server
    api_key="not-needed"
)

# Encode image to base64
with open("image.jpg", "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model="lfm2.5-vl-1.6b",  # Model name should match your server configuration
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
                {"type": "text", "text": "What's in this image?"}
            ]
        }
    ],
    max_tokens=256
)

print(response.choices[0].message.content)
```

For Python applications that embed the model directly in-process, use the llama-cpp-python package.

Installation:
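A typical installation from PyPI (GPU-accelerated backends may require extra build flags; see the llama-cpp-python documentation):

```bash
pip install llama-cpp-python
```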
Using llama-cpp-python
```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Initialize with vision support
# Note: Use the correct chat handler for your model architecture
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="lfm2.5-vl-1.6b-q4_k_m.gguf",
    chat_handler=chat_handler,
    n_ctx=4096
)

# Generate with image
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=256
)

print(response["choices"][0]["message"]["content"])
```
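Using llama-cli

For quick interactive testing from the terminal, llama-cli loads the same GGUF files. A minimal sketch (the model file name is illustrative, and vision models may need additional multimodal flags depending on your build):

```bash
# Run an interactive session against a local GGUF model
llama-cli -m lfm2-1.2b-q4_0.gguf -p "Explain the GGUF format in one sentence."
```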
If you have a finetuned model or need to create a GGUF from a Hugging Face model:
```bash
# Clone llama.cpp if you haven't already
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Convert the Hugging Face model to GGUF (f16 precision)
python convert_hf_to_gguf.py /path/to/your/model --outfile model-f16.gguf --outtype f16

# Quantize the f16 GGUF to a smaller format (requires the compiled llama-quantize binary)
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
Use --outtype to set the conversion precision (e.g., f16, bf16, q8_0). For k-quant formats such as q4_0, q4_k_m, q5_k_m, and q6_k, run the converted GGUF through llama-quantize as shown above.