Evaluation
Judge model outputs against explicit criteria and iterate toward better prompts.
Criteria
Define objective metrics and rubrics. Example CLI usage:
# Evaluate AI responses
uv run python -m prompts.cli evaluate "AI is machine learning" \
--query "What is AI?" \
--criteria "accuracy,clarity,completeness" \
--output-format detailed
A/B testing
Compare prompt variants with the same inputs.
# Create a variant template from the baseline
cp templates/basic/qa_basic.json templates/basic/qa_basic_v2.json
# Run both variants on the same input
uv run python -m prompts.cli execute "Explain machine learning" --template qa_basic > response_a.txt
uv run python -m prompts.cli execute "Explain machine learning" --template qa_basic_v2 > response_b.txt
# Evaluate both
uv run python -m prompts.cli evaluate "$(cat response_a.txt)" \
--query "Explain machine learning" \
--criteria "clarity,completeness" \
--output-format score > score_a.txt
uv run python -m prompts.cli evaluate "$(cat response_b.txt)" \
--query "Explain machine learning" \
--criteria "clarity,completeness" \
--output-format score > score_b.txt
Automation
Wire evaluation into CI with your preferred runner. Examples:
- Validate templates and run evaluation on PRs
- Fail builds on regression thresholds (e.g., a clarity score dropping below an agreed minimum); see the sketch below
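A minimal gate script you could call from any CI runner, assuming --output-format score emits a single numeric value; the script name, threshold, query, and response file are placeholders to adapt:
# ci_eval_gate.sh (hypothetical name) -- fail the build on score regression
set -euo pipefail
THRESHOLD=7.0   # placeholder minimum acceptable score
score=$(uv run python -m prompts.cli evaluate "$(cat response_a.txt)" \
  --query "Explain machine learning" \
  --criteria "clarity,completeness" \
  --output-format score)
echo "Evaluation score: $score (threshold: $THRESHOLD)"
# Exit non-zero so the CI job fails when the score drops below the threshold
awk -v s="$score" -v t="$THRESHOLD" 'BEGIN { exit !(s >= t) }' || {
  echo "Score $score is below threshold $THRESHOLD" >&2
  exit 1
}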