Models & Runtime

LlamaFarm focuses on inference rather than fine-tuning. The runtime section of llamafarm.yaml describes how chat completions are executed—whether against local Ollama, Lemonade, a vLLM gateway, or a remote hosted provider.

Multi-Model Support

LlamaFarm supports configuring multiple models in a single project. You can switch between models via CLI or API:

runtime:
  default_model: fast  # Which model to use by default

  models:
    fast:
      description: "Fast Ollama model for quick responses"
      provider: ollama
      model: gemma3:1b
      prompt_format: unstructured

    powerful:
      description: "More capable model"
      provider: ollama
      model: qwen3:8b

Using multi-model:

CLI: lf chat --model powerful "your question"
CLI: lf models list (shows all available models)
API: POST /v1/projects/{ns}/{id}/chat/completions with {"model": "powerful", ...}

Legacy single-model configs are still supported and automatically converted internally.

Runtime Responsibilities

Route chat requests to the configured provider.
Respect instructor modes (tools, json, md_json, etc.) when available.
Surface provider errors directly (incorrect model name, missing API key).
Cooperate with agent handlers (simple chat, structured output, RAG-aware prompts).

Choosing a Provider

Use Case	Configuration
Local models (Ollama)	`provider: ollama` (omit API key). Supports models pulled via `ollama pull`.
Local models (Lemonade)	`provider: lemonade` with GGUF models. Hardware-accelerated (NPU/GPU). See Lemonade Setup below.
Self-hosted vLLM / OpenAI-compatible	`provider: openai`, set `base_url` to your gateway, `api_key` as required.
Hosted APIs (OpenAI, Anthropic via proxy, Together, LM Studio)	`provider: openai`, set `base_url` if not using api.openai.com, provide API key.

Example using vLLM locally:

runtime:
  models:
    - name: vllm-model
      provider: openai
      model: mistral-small
      base_url: http://localhost:8000/v1
      api_key: sk-local-placeholder
      instructor_mode: json
      default: true

Agent Handlers

LlamaFarm selects an agent handler based on configuration:

Simple chat – direct user/system prompts, suitable for models without tool support.
Structured chat – uses instructor modes (tools, json) for models that support function/tool calls.
RAG chat – augments prompts with retrieved context, citations, and guardrails.
Classifier / Custom – future handlers for specialized workflows.

Choose handler behaviour in your project configuration (e.g., advanced agents defined by the server). Ensure the model supports the required features—some small models (TinyLlama) don’t handle tools, so stick with simple chat.

Lemonade Runtime

Lemonade is a high-performance local runtime that runs GGUF models with NPU/GPU acceleration. It's an alternative to Ollama with excellent performance on Apple Silicon and other hardware.

Quick Setup

1. Install Lemonade SDK:

uv pip install lemonade-sdk

2. Download a model:

# Balanced 4B model (recommended)
uv run lemonade-server-dev pull user.Qwen3-4B \
  --checkpoint unsloth/Qwen3-4B-GGUF:Q4_K_M \
  --recipe llamacpp

3. Start Lemonade server:

# From the llamafarm project root
LEMONADE_MODEL=user.Qwen3-4B nx start lemonade

Note: The nx start lemonade command automatically picks up configuration from your llamafarm.yaml. Currently, Lemonade must be manually started. In the future, Lemonade will run as a container and be auto-started by the LlamaFarm server.

4. Configure your project:

runtime:
  models:
    - name: lemon
      description: "Lemonade local model"
      provider: lemonade
      model: user.Qwen3-4B
      base_url: "http://127.0.0.1:11534/v1"
      default: true
      lemonade:
        backend: llamacpp  # llamacpp (GGUF), onnx, or transformers
        port: 11534
        context_size: 32768

Key Features

Hardware acceleration: Automatically detects and uses Metal (macOS), CUDA (NVIDIA), Vulkan (AMD/Intel), or CPU
Multiple backends: llamacpp (GGUF models), ONNX, or Transformers (PyTorch)
OpenAI-compatible API: Drop-in replacement for OpenAI-compatible endpoints
Port 11534: Default port (different from LlamaFarm's 8000 and Ollama's 11434)

Multi-Model with Lemonade

Run multiple Lemonade instances on different ports:

runtime:
  models:
    - name: lemon-fast
      provider: lemonade
      model: user.Qwen3-0.6B
      base_url: "http://127.0.0.1:11534/v1"
      lemonade:
        port: 11534

    - name: lemon-powerful
      provider: lemonade
      model: user.Qwen3-8B
      base_url: "http://127.0.0.1:11535/v1"
      lemonade:
        port: 11535

Start each instance in a separate terminal (from the llamafarm project root):

# Terminal 1
LEMONADE_MODEL=user.Qwen3-0.6B LEMONADE_PORT=11534 nx start lemonade

# Terminal 2
LEMONADE_MODEL=user.Qwen3-8B LEMONADE_PORT=11535 nx start lemonade

Note: In the future, Lemonade instances will run as containers and be auto-started by the LlamaFarm server.

More Information

For detailed setup instructions, model recommendations, and troubleshooting, see:

Lemonade Quickstart
Lemonade Runtime overview

Universal Runtime

The Universal Runtime is LlamaFarm's most versatile runtime provider, supporting any HuggingFace model through PyTorch Transformers and Diffusers. Unlike Ollama (GGUF-only) or Lemonade (optimized quantized models), Universal Runtime provides access to the entire HuggingFace Hub ecosystem.

Supported Model Formats

Current Support (Production):

HuggingFace Transformers – All PyTorch text models (GPT-2, Llama, Mistral, Qwen, Phi, BERT, etc.)
HuggingFace Diffusers – All PyTorch diffusion models (Stable Diffusion, SDXL, FLUX)
Model Types: Text Generation, Embeddings, Image Generation, Vision Classification, Audio Processing, Multimodal

Coming Soon:

ONNX Runtime – 2-5x faster inference with automatic model conversion
TensorRT – GPU-optimized inference for NVIDIA hardware

Quick Setup

1. Start Universal Runtime server:

# From project root (recommended)
nx start universal-runtime

# Or with custom port
LF_RUNTIME_PORT=8080 nx start universal-runtime

2. Configure your project:

runtime:
  models:
    - name: phi-2
      description: "Fast small language model"
      provider: universal
      model: microsoft/phi-2
      base_url: http://127.0.0.1:11540
      transformers:
        device: auto              # auto, cuda, mps, cpu
        dtype: auto               # auto, fp16, fp32, bf16
        trust_remote_code: true
        model_type: text          # text, embedding, image

3. Start chatting:

lf chat --model phi-2 "Explain quantum computing"

Configuration Examples

Example 1: Multi-Model Setup (Chat + Embeddings + Images)

runtime:
  default_model: balanced

  models:
    # Fast chat for quick responses
    - name: fast
      provider: universal
      model: Qwen/Qwen2.5-0.5B-Instruct
      base_url: http://127.0.0.1:11540
      transformers:
        device: auto
        dtype: auto

    # Balanced chat for quality
    - name: balanced
      provider: universal
      model: microsoft/phi-2
      base_url: http://127.0.0.1:11540
      transformers:
        device: auto
        dtype: auto

    # Embeddings for RAG
    - name: embedder
      provider: universal
      model: sentence-transformers/all-MiniLM-L6-v2
      base_url: http://127.0.0.1:11540
      transformers:
        device: auto
        dtype: auto
        model_type: embedding

    # Image generation
    - name: image-gen
      provider: universal
      model: stabilityai/stable-diffusion-2-1
      base_url: http://127.0.0.1:11540
      transformers:
        device: auto
        dtype: auto
        model_type: image
      diffusion:
        default_steps: 30
        default_guidance: 7.5
        default_size: "512x512"
        scheduler: euler
        enable_optimizations: true

Example 2: RAG with Universal Embeddings

runtime:
  models:
    - name: chat
      provider: universal
      model: microsoft/phi-2
      base_url: http://127.0.0.1:11540

    - name: embedder
      provider: universal
      model: nomic-ai/nomic-embed-text-v1.5
      base_url: http://127.0.0.1:11540
      transformers:
        model_type: embedding

rag:
  databases:
    - name: main_database
      type: ChromaStore
      embedding_strategies:
        - name: default_embeddings
          type: UniversalEmbedder
          config:
            model: nomic-ai/nomic-embed-text-v1.5
            dimension: 768
            batch_size: 16

Hardware Acceleration

Universal Runtime automatically detects and optimizes for your hardware:

Detection Priority:

NVIDIA CUDA – Best performance on NVIDIA GPUs
Apple Metal (MPS) – Optimized for Apple Silicon (M1/M2/M3)
CPU – Fallback for all platforms

Device Configuration:

transformers:
  device: auto    # Recommended: auto-detect best device
  device: cuda    # Force CUDA (NVIDIA)
  device: mps     # Force Metal (Apple Silicon)
  device: cpu     # Force CPU (compatible everywhere)

Data Type Configuration:

transformers:
  dtype: auto     # auto (fp16 on GPU, fp32 on CPU)
  dtype: fp16     # Half precision (faster, less memory)
  dtype: fp32     # Full precision (highest quality)
  dtype: bf16     # BFloat16 (NVIDIA Ampere+)

Environment Variables

# Server configuration
UNIVERSAL_RUNTIME_HOST=127.0.0.1
UNIVERSAL_RUNTIME_PORT=11540

# Model caching
TRANSFORMERS_CACHE=~/.cache/huggingface
HF_TOKEN=hf_xxxxx  # For gated models (Llama, etc.)

# Device control
TRANSFORMERS_FORCE_CPU=1     # Force CPU mode
TRANSFORMERS_SKIP_MPS=1      # Skip MPS, use CPU instead

API Usage

Universal Runtime provides an OpenAI-compatible API:

Chat Completions:

curl -X POST http://localhost:11540/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/phi-2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

Embeddings:

curl -X POST http://localhost:11540/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "input": "Hello world"
  }'

Image Generation:

curl -X POST http://localhost:11540/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a serene mountain lake at sunset",
    "model": "stabilityai/stable-diffusion-2-1",
    "size": "512x512"
  }'

Supported Model Categories

Text Generation (CausalLM) – GPT-2, Llama, Mistral, Qwen, Phi
Embeddings (Encoder) – BERT, sentence-transformers, BGE, nomic-embed
Image Generation (Diffusion) – Stable Diffusion, SDXL, FLUX
Vision Classification – ViT, CLIP, DINOv2
Audio Processing – Whisper, Wav2Vec2
Multimodal (Vision-Language) – BLIP, LLaVA, Florence

Key Differences: Universal vs Other Runtimes

Feature	Universal Runtime	Ollama	Lemonade
Model Format	PyTorch (Transformers)	GGUF (llama.cpp)	GGUF (llama.cpp)
Model Source	HuggingFace Hub	Ollama library	HuggingFace GGUF
Model Types	6 types (text, image, audio, etc.)	Text only	Text only
Optimization	PyTorch native	Quantized CPU/GPU	NPU/GPU accelerated
Memory Usage	Higher (full precision)	Lower (quantized)	Lower (quantized)
Setup	Auto-download from HF	`ollama pull`	`lemonade-server-dev pull`
Default Port	11540	11434	11534
Best For	Flexibility, multimodal	CPU speed, ease of use	NPU/GPU acceleration

When to Use Universal Runtime

Choose Universal Runtime when you need:

Access to any HuggingFace model (latest research models, domain-specific fine-tunes)
Multimodal capabilities (images, audio, vision)
Embedding generation for RAG
Custom model configurations (LoRA, adapters)

Choose Ollama when you need:

Fast CPU inference with quantized models
Simplest setup experience
Established model library

Choose Lemonade when you need:

Maximum performance on Apple Silicon
NPU/GPU acceleration
GGUF model optimization

Troubleshooting

Model not found:

# Models auto-download from HuggingFace on first use
# Check internet connection and HF_TOKEN for gated models
export HF_TOKEN=hf_your_token_here

Out of memory:

# Use smaller models or force CPU mode
TRANSFORMERS_FORCE_CPU=1 nx start universal-runtime

# Or reduce batch size in config
transformers:
  batch_size: 1

Slow inference:

# Ensure GPU is detected
# Check device selection in logs
# Consider using GGUF models with Ollama/Lemonade instead

Extending Provider Support

To add a new provider enum:

Update config/schema.yaml (runtime.provider enum).
Regenerate datamodels via config/generate_types.py.
Map the provider to an execution path in the server runtime service.
Update CLI defaults or additional flags if needed.
Document usage in this guide.

Upcoming Roadmap

Advanced agent handler configuration – choose handlers per command and dataset.
Fine-tuning pipeline integration – track status in the roadmap.

Next Steps

Configuration Guide – runtime schema details.
Extending runtimes – step-by-step provider integration.
Prompts – control how system prompts interact with runtime capabilities.

Multi-Model Support​

Runtime Responsibilities​

Choosing a Provider​

Agent Handlers​

Lemonade Runtime​

Quick Setup​

Key Features​

Multi-Model with Lemonade​

More Information​

Universal Runtime​

Supported Model Formats​

Quick Setup​

Configuration Examples​

Hardware Acceleration​

Environment Variables​

API Usage​

Supported Model Categories​

Key Differences: Universal vs Other Runtimes​

When to Use Universal Runtime​

Troubleshooting​

Extending Provider Support​

Upcoming Roadmap​

Next Steps​

Multi-Model Support

Runtime Responsibilities

Choosing a Provider

Agent Handlers

Lemonade Runtime

Quick Setup

Key Features

Multi-Model with Lemonade

More Information

Universal Runtime

Supported Model Formats

Quick Setup

Configuration Examples

Hardware Acceleration

Environment Variables

API Usage

Supported Model Categories

Key Differences: Universal vs Other Runtimes

When to Use Universal Runtime

Troubleshooting

Extending Provider Support

Upcoming Roadmap

Next Steps