Deploying on NVIDIA Jetson
LlamaFarm runs natively on NVIDIA Jetson devices (Orin, Xavier, TX2, Nano) with CUDA acceleration. This guide covers setup, optimization, and troubleshooting for Jetson deployments.
Supported Devices
| Device | Compute Capability | Memory | Performance |
|---|---|---|---|
| Jetson Orin Nano | 8.7 | 8GB shared | 35+ tok/s |
| Jetson Orin NX | 8.7 | 8-16GB shared | 35+ tok/s |
| Jetson AGX Orin | 8.7 | 32-64GB shared | 40+ tok/s |
| Jetson Xavier NX | 7.2 | 8-16GB shared | 25+ tok/s |
| Jetson AGX Xavier | 7.2 | 16-32GB shared | 30+ tok/s |
Quick Start (Automated)
LlamaFarm includes setup scripts that automate the Jetson configuration:
# 1. Start the Jetson container
docker run -it --rm --runtime=nvidia --network host \
-v $HOME/.cache:/root/.cache \
-v /path/to/llamafarm:/data/llamafarm \
dustynv/l4t-pytorch:r36.2.0 \
bash
# 2. Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.cargo/env
# 3. Run the automated setup
cd /data/llamafarm
./deployment/jetson/setup.sh
# 4. Start the server
./deployment/jetson/start.sh
The setup script automatically:
- Detects Jetson hardware
- Finds system Python with CUDA PyTorch
- Creates venv with
--system-site-packagesto inherit CUDA torch - Installs dependencies and removes conflicting PyPI packages
- Creates
.env.jetsonwith optimized environment variables
For custom model or context size:
./deployment/jetson/start.sh 'unsloth/Qwen3-0.6B-GGUF:Q4_K_M' 4096
Prerequisites
- JetPack SDK installed (includes CUDA, cuDNN, TensorRT)
- Python 3.10+ with UV package manager
- Custom llama.cpp build (see below)
Building llama.cpp for Jetson
Pre-built llama.cpp binaries don't include ARM64 CUDA. You must build from source:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
# CRITICAL: Disable CUDA graphs for Tegra stability
cmake -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_GRAPHS=OFF \
-DCMAKE_CUDA_ARCHITECTURES="87" \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Why GGML_CUDA_GRAPHS=OFF?
CUDA graphs provide performance benefits on discrete GPUs but cause stability issues on Jetson's unified memory architecture:
- Graph compilation overhead exceeds benefits for small batch sizes
- Memory pressure from graph captures
- Unpredictable behavior on unified memory systems
Architecture Flags
| Device | Flag |
|---|---|
| Orin (Nano/NX/AGX) | -DCMAKE_CUDA_ARCHITECTURES="87" |
| Xavier (NX/AGX) | -DCMAKE_CUDA_ARCHITECTURES="72" |
| TX2 | -DCMAKE_CUDA_ARCHITECTURES="62" |
| Nano | -DCMAKE_CUDA_ARCHITECTURES="53" |
Docker Container Setup (Recommended)
The easiest way to run LlamaFarm on Jetson is using NVIDIA's PyTorch container which includes CUDA-enabled PyTorch:
docker run -it --rm --runtime=nvidia --network host \
-v /home/$USER/.cache:/root/.cache \
-v /path/to/llamafarm:/data/llamafarm \
dustynv/l4t-pytorch:r36.2.0 \
bash
Install UV Package Manager
Inside the container:
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.cargo/env
Setup Universal Runtime with CUDA PyTorch
The container has CUDA-enabled PyTorch (Python 3.10), but the venv must be configured to use it:
cd /data/llamafarm/runtimes/universal
# Create venv with system site packages (inherits CUDA torch)
uv venv .venv --python /usr/local/bin/python3 --system-site-packages
# Install dependencies (this pulls CPU torch from PyPI)
uv pip install -r pyproject.toml --python .venv/bin/python
# Remove CPU torch, fall back to system CUDA torch
uv pip uninstall torch torchvision --python .venv/bin/python
# Downgrade numpy (system torch needs numpy 1.x)
uv pip install "numpy<2" --python .venv/bin/python
Verify CUDA PyTorch
.venv/bin/python -c "import torch; print('CUDA:', torch.cuda.is_available())"
# Should print: CUDA: True
This enables GPU acceleration for:
- llama.cpp inference (GGUF models) - 35+ tok/s
- Transformers models (classifiers, embeddings) - GPU accelerated
- Anomaly detection (sklearn) - CPU only
Installing Built Libraries
Copy the built libraries to LlamaFarm's cache:
CACHE_DIR="$HOME/.cache/llamafarm-llama/jetson"
mkdir -p "$CACHE_DIR"
cp build/src/libllama.so "$CACHE_DIR/"
cp build/ggml/src/libggml*.so* "$CACHE_DIR/"
# Point LlamaFarm to the custom build
export LLAMAFARM_LLAMA_LIB_DIR="$CACHE_DIR"
Running the Server
cd /path/to/llamafarm/runtimes/universal
# Start with conservative settings for 8GB memory
uv run python server.py \
--model 'unsloth/Qwen3-1.7B-GGUF:Q4_K_M' \
--ctx-len 2048
Test the API
curl -s -X POST http://localhost:11540/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "unsloth/Qwen3-1.7B-GGUF:Q4_K_M",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100,
"n_ctx": 2048
}'
Important: Always pass n_ctx in requests to avoid defaulting to the model's full context (40960 for Qwen3), which causes OOM.
Environment Variables
| Variable | Description | Default |
|---|---|---|
LLAMAFARM_SYNC_INFERENCE | Force sync (1) or async (0) inference | Auto-detect |
LLAMAFARM_LLAMA_LIB_DIR | Path to custom llama.cpp libraries | Auto |
CUDA_CACHE_DISABLE | Disable CUDA cache (reduces fragmentation) | 0 |
Memory Optimization
Jetson uses unified memory shared between CPU and GPU. Optimize with:
1. Use Quantized Models
Q4_K_M provides the best balance of quality and memory:
Model Size (Qwen3-1.7B):
- Q4_K_M: ~1.0 GB
- Q8_0: ~1.8 GB
- FP16: ~3.4 GB
2. Reduce Context Size
Start with n_ctx=2048 and increase only if needed:
KV Cache Memory (Qwen3-1.7B):
- n_ctx=2048: ~224 MB
- n_ctx=4096: ~448 MB
- n_ctx=8192: ~896 MB
3. Enable KV Cache Quantization
For even lower memory, use quantized KV cache:
# In your model configuration
cache_type_k="q4_0" # 4-bit keys
cache_type_v="q4_0" # 4-bit values
This reduces KV cache memory by ~4x.
Troubleshooting
"double free or corruption" crash
Cause: CUDA backend initialization from worker thread.
Solution: LlamaFarm automatically initializes the backend from the main thread. If you see this error, ensure you're using the latest version.
Slow performance (< 25 tok/s)
Check:
- Rebuild llama.cpp with
GGML_CUDA_GRAPHS=OFF - Verify all layers on GPU: look for "offloaded 29/29 layers" in logs
- Check
n_ctxisn't too large (causes memory pressure)
Out of memory
Solutions:
- Reduce
n_ctx(e.g., 2048 instead of 4096) - Use smaller quantization (Q4_K_M instead of Q8_0)
- Enable KV cache quantization
- Monitor with
tegrastats
Model loads on CPU
Check:
- Verify CUDA:
nvcc --version - Ensure libggml-cuda.so exists in library directory
- Check logs for "CUDA0" device detection
Performance Benchmarks
Tested on Jetson Orin Nano 8GB with Qwen3-1.7B Q4_K_M:
| Metric | Value |
|---|---|
| Tokens/second | 35-36 tok/s |
| Time to first token | ~70-100ms |
| Memory usage | ~2.5 GB |
| Power consumption | ~15W |
systemd Service
Create /etc/systemd/system/llamafarm.service:
[Unit]
Description=LlamaFarm Universal Runtime
After=network.target
[Service]
Type=simple
User=root
WorkingDirectory=/data/llamafarm/runtimes/universal
Environment=LLAMAFARM_LLAMA_LIB_DIR=/root/.cache/llamafarm-llama/jetson
Environment=LLAMAFARM_SYNC_INFERENCE=1
ExecStart=/root/.local/bin/uv run python server.py --model unsloth/Qwen3-1.7B-GGUF:Q4_K_M --ctx-len 2048
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable llamafarm
sudo systemctl start llamafarm