Pharmaceutical & Therapeutics: FDA Document Analysis

Overview

Pharmaceutical and therapeutics companies navigating the FDA approval process face a critical challenge: tracking hundreds of questions and answers across multiple rounds of FDA correspondence. LlamaFarm provides an automated solution to identify unanswered questions, validate existing answers, and maintain compliance throughout the approval process.

This guide provides complete step-by-step instructions with code examples and configuration. All steps are self-contained and can be followed using only this documentation.

Video Walkthrough Available

A full video demonstration of this use case is available as a supplement to this guide: FDA Document Analysis with LlamaFarm

Business Problem

During the FDA approval process, companies must:

  • Track regulatory questions across multiple document types (Complete Response Letters, Information Requests, Meeting Minutes)
  • Ensure all FDA questions have been adequately answered
  • Validate answers against historical correspondence (COFUS database)
  • Maintain confidence levels for answer authenticity
  • Avoid missing critical unanswered questions that could delay approval

Manual review is time-consuming, error-prone, and doesn't scale as document volumes increase.

Solution Architecture

LlamaFarm's agent-based approach automates this workflow:

FDA Documents → Vector Database → Agent Analysis → Question Extraction →
Answer Validation → Confidence Scoring → Summary Report

Key Components

  1. RAG-Enabled Document Store: All FDA correspondence ingested into a vector database
  2. Document Analysis Agent: Recursively processes chunks to extract questions
  3. Answer Validation Agent: Cross-references extracted questions against COFUS database
  4. Batch Orchestrator: Manages processing of large document sets
  5. Confidence Scoring: Provides reliability metrics for each answer
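
Conceptually, these components form a single pass over the corpus: extraction feeds validation, and validation feeds scoring. The sketch below illustrates that data flow in miniature; the function bodies are toy stand-ins (real extraction and validation are LLM and RAG calls), and none of these names are LlamaFarm APIs.

# Toy sketch of the pipeline's data flow; not LlamaFarm internals.
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    category: str
    status: str = "unanswered"
    confidence: float = 0.0

def extract_questions(chunk: str) -> list[Question]:
    # Stand-in for the Document Analysis Agent (an LLM call in practice).
    return [Question(s.strip() + "?", "clinical")
            for s in chunk.split("?") if s.strip()]

def validate(q: Question, cofus: dict[str, float]) -> Question:
    # Stand-in for the Answer Validation Agent (a RAG query in practice).
    q.confidence = cofus.get(q.text, 0.0)
    q.status = "answered" if q.confidence >= 0.7 else "unanswered"
    return q

chunks = ["What dose was tested? Were pediatric endpoints evaluated?"]
cofus_scores = {"What dose was tested?": 0.95}
report = [validate(q, cofus_scores) for c in chunks for q in extract_questions(c)]
print(report)   # one answered (0.95), one unanswered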

Standard Operating Procedure (SOP)

Prerequisites

  • LlamaFarm installed and configured
  • FDA documents in supported formats (PDF, Word, etc.)
  • Access to COFUS or equivalent answer database

Step 1: Ingest FDA Documents into Vector Database

Create a dataset and ingest all FDA correspondence:

# Create dataset for FDA documents
lf datasets create fda_correspondence -s universal_processor -b fda_db

# Upload documents (supports glob patterns)
lf datasets upload fda_correspondence ./fda_documents/*.pdf

# Process into vector database
lf datasets process fda_correspondence

Best Practice: Organize documents by submission cycle (e.g., cycle_1/, cycle_2/) for easier tracking.

Step 2: Start Recursive Script for Document Analysis

Configure and run the FDA document analyzer agent:

# Run the FDA document analyzer
lf agents run fda_document_analyzer --input-file ./config/fda_input.json

The agent will:

  • Break documents into manageable chunks
  • Process each chunk independently
  • Track progress across the entire corpus

Configuration Example (fda_input.json):

{
  "database": "fda_db",
  "document_types": ["complete_response", "information_request", "meeting_minutes"],
  "chunk_size": 4000,
  "chunk_overlap": 200
}
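
The chunk_size and chunk_overlap values control how documents are windowed before extraction: each chunk shares its last 200 characters with the start of the next, so a question that straddles a boundary still appears whole in at least one chunk. A minimal character-based chunker (an assumption for illustration; the actual processor may split on tokens or sentences) behaves like this:

def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> list[str]:
    """Sliding window: consecutive chunks overlap by `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 10,000-character document yields windows starting at 0, 3800, 7600:
print([len(c) for c in chunk_text("x" * 10_000)])   # [4000, 4000, 2400]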

Step 3: Extract Questions from Documents

The agent sends document chunks to the LLM with specialized prompts to identify regulatory questions:

System Prompt Example:

You are analyzing FDA correspondence. Extract all regulatory questions
from the provided text. Focus on substantive questions about:
- Clinical data requirements
- Safety/efficacy concerns
- Manufacturing/quality controls
- Labeling requirements

Exclude administrative questions (meeting scheduling, contact info, etc.)

Return questions in this format:
{
  "question": "...",
  "category": "clinical|safety|manufacturing|labeling",
  "document_section": "..."
}
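
Everything downstream assumes the model actually returns objects in this shape, so it is worth validating each response before writing it to the results file. A defensive parser (field names taken from the format above; everything else is illustrative) might look like:

import json

VALID_CATEGORIES = {"clinical", "safety", "manufacturing", "labeling"}

def parse_extracted_question(raw: str) -> dict:
    """Check one model response against the expected question schema."""
    obj = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    missing = {"question", "category", "document_section"} - obj.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if obj["category"] not in VALID_CATEGORIES:
        raise ValueError(f"unknown category: {obj['category']!r}")
    return obj

raw = ('{"question": "Has the API impurity profile been characterized?", '
       '"category": "manufacturing", "document_section": "3.2.S"}')
print(parse_extracted_question(raw)["category"])   # manufacturing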

Step 4: Validate Answers Against COFUS Database

For each extracted question, the agent:

  1. Queries the COFUS database using RAG
  2. Retrieves relevant passages
  3. Validates whether the question has been adequately answered
  4. Assesses answer authenticity

# Query specific question against COFUS
lf rag query --database cofus_db \
  --score-threshold 0.7 \
  "Has clinical endpoint XYZ been addressed?"

Step 5: Save Results and Confidence Scores

The agent generates structured output with confidence metrics:

{
  "question_id": "Q_001",
  "question": "What additional clinical data is required for endpoint validation?",
  "status": "answered",
  "confidence_score": 0.95,
  "answer_source": "COFUS Letter 2024-03-15",
  "answer_summary": "Two additional Phase 3 studies required...",
  "validation_method": "semantic_match"
}

Confidence Threshold: Focus on scores ≥ 0.90 for reliable answers. Questions with lower scores may require manual review.
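
Operationally, that threshold splits the results file into an auto-accept bucket and a manual-review queue. Assuming the results are stored as a JSON list of records shaped like the example above (an assumption about the output layout), the triage is a few lines:

import json

with open("results/analysis_results.json") as f:
    results = json.load(f)   # assumed: list of records as shown above

reliable, needs_review = [], []
for r in results:
    if r["status"] == "answered" and r.get("confidence_score", 0.0) >= 0.90:
        reliable.append(r)
    else:
        needs_review.append(r)

print(f"{len(reliable)} auto-accepted, {len(needs_review)} queued for manual review")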

Step 6: Review Summary of Findings

Generate an executive summary report:

lf agents run fda_summary_generator --input-file ./results/analysis_results.json

Sample Summary Output:

FDA Document Analysis Summary
=============================
Total Questions Identified: 47
Answered Questions: 42 (89%)
Unanswered Questions: 5 (11%)
Average Confidence Score: 0.93

High Priority Unanswered Questions:
1. [Clinical] What is the required duration for long-term safety follow-up?
2. [Manufacturing] Has the API impurity profile been fully characterized?
3. [Labeling] Are pediatric use restrictions required in Section 8?

Next Actions:
- Review 3 low-confidence answers (0.70-0.85 range)
- Prepare responses for 5 unanswered questions
- Submit supplemental information package
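
The headline numbers in this report are simple aggregates over the per-question records. Under the same assumed results layout as in Step 5, they reduce to:

import json
from statistics import mean

with open("results/analysis_results.json") as f:
    results = json.load(f)

answered = [r for r in results if r["status"] == "answered"]
unanswered = [r for r in results if r["status"] != "answered"]

print(f"Total Questions Identified: {len(results)}")
print(f"Answered Questions: {len(answered)} ({len(answered) / len(results):.0%})")
print(f"Unanswered Questions: {len(unanswered)} ({len(unanswered) / len(results):.0%})")
print(f"Average Confidence Score: {mean(r['confidence_score'] for r in answered):.2f}")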

Step 7: Utilize Batch Orchestrator for Processing

For large document sets, use the batch orchestrator:

# Process multiple documents in parallel
lf agents run batch_orchestrator \
--agent fda_document_analyzer \
--input-dir ./fda_documents/ \
--concurrency 5 \
--output-dir ./results/

Monitoring:

# Check processing status
lf agents status batch_orchestrator

# View logs
lf agents logs batch_orchestrator --tail 100
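
Conceptually, --concurrency 5 is a bounded worker pool over the input files. The sketch below reproduces that pattern by shelling out to the analyzer once per document; the lf invocation mirrors Step 2 (assuming one input file per document), while the pool logic is illustrative rather than the orchestrator's actual implementation.

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def analyze(input_file: Path) -> tuple[str, int]:
    """Run the analyzer on one document's input file; return (name, exit code)."""
    proc = subprocess.run(
        ["lf", "agents", "run", "fda_document_analyzer",
         "--input-file", str(input_file)],
        capture_output=True, text=True)
    return input_file.name, proc.returncode

inputs = sorted(Path("fda_documents").glob("*.json"))
with ThreadPoolExecutor(max_workers=5) as pool:   # --concurrency 5
    for name, code in pool.map(analyze, inputs):
        print(f"{name}: {'ok' if code == 0 else 'failed'}")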

Step 8: Adjust Agents and System Prompts as Needed

Customize agents based on your specific regulatory focus:

Example: Prioritize Safety Questions

Edit llamafarm.yaml:

agents:
  fda_document_analyzer:
    system_prompt: |
      You are analyzing FDA correspondence with EXTRA FOCUS on safety concerns.
      Prioritize questions related to:
      - Adverse events
      - Safety signal monitoring
      - Risk mitigation strategies

      Mark safety questions with HIGH priority.
    parameters:
      temperature: 0.2   # Lower for consistency
      top_k: 5
      score_threshold: 0.85

Test configurations on a subset before scaling:

# Test on 10 documents first
lf agents run fda_document_analyzer \
--input-file ./test_config.json \
--limit 10 \
--dry-run

Cautionary Notes

⚠️ Document Formatting: Ensure documents are properly formatted before ingestion. Scanned PDFs may require OCR preprocessing.

⚠️ Process Interruption: The recursive process saves checkpoints. If stopped, it can resume from the last checkpoint, but plan for uninterrupted runs of 2-4 hours for large document sets.

⚠️ Confidence Thresholds: Don't rely solely on high-confidence scores. Critical questions should undergo manual review regardless of score.

⚠️ Model Selection: Use more powerful models (e.g., GPT-4, Claude) for regulatory analysis. Smaller models may miss nuanced questions.

Tips for Efficiency

💡 Overnight Processing: Start the analysis at the end of the workday and let agents run independently for several hours. Review results the next morning.

💡 Staged Rollout: Test on a subset (10-20 documents) first to validate prompts and configuration before processing the full corpus.

💡 Model Switching: Use fast models for initial question extraction, then switch to powerful models for answer validation:

# Fast model for extraction
lf chat --model fast --agent question_extractor

# Powerful model for validation
lf chat --model powerful --agent answer_validator

💡 Incremental Processing: Process documents as they arrive rather than batch-processing at the end:

# Add to existing dataset
lf datasets upload fda_correspondence ./new_documents/*.pdf
lf datasets process fda_correspondence
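
If documents arrive continuously, those two commands can be wrapped in a small watcher that polls a drop folder and ingests anything it has not seen before. The lf calls below are the same as above; the polling loop itself is an illustrative sketch.

import subprocess
import time
from pathlib import Path

DROP_DIR = Path("new_documents")
seen: set[str] = set()

while True:
    fresh = [p for p in DROP_DIR.glob("*.pdf") if p.name not in seen]
    if fresh:
        # Same two commands as above: upload new files, then reprocess.
        subprocess.run(["lf", "datasets", "upload", "fda_correspondence",
                        *map(str, fresh)], check=True)
        subprocess.run(["lf", "datasets", "process", "fda_correspondence"],
                       check=True)
        seen.update(p.name for p in fresh)
    time.sleep(300)   # poll every five minutes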

Results & ROI

Organizations using this workflow report:

  • Time Savings: 80-90% reduction in manual document review time
  • Accuracy: 95%+ question identification rate (when using appropriate models)
  • Risk Mitigation: Earlier identification of unanswered questions
  • Audit Trail: Complete tracking of all questions and answers for regulatory inspections

Getting Started

  1. Review the example: Check out examples/fda_rag/ in the LlamaFarm repository
  2. Start small: Begin with one submission cycle (5-10 documents)
  3. Iterate: Refine prompts and configuration based on results
  4. Scale: Expand to full document corpus once validated

Optional: Watch the full video walkthrough for a visual demonstration of these steps.

Questions?

If you're implementing this workflow and need assistance, please reach out through our GitHub Discussions.