Skip to main content

Local only extractors

The RAG system includes 5 local-only extractors that enhance documents with metadata without requiring external LLMs.

Available extractors

  • YAKE: Advanced keyword extraction considering position and context
  • RAKE: Fast phrase extraction using stop words as delimiters
  • TF-IDF: Term frequency analysis for finding unique terms
  • Entities: Person, organization, date, email, phone extraction (spaCy + regex fallbacks)
  • DateTime: Date, time, and relative date extraction
  • Statistics: Readability metrics, vocabulary analysis, content statistics

Extractor commands

# List all available extractors
uv run python cli.py extractors list --detailed

# Test an extractor on sample text
uv run python cli.py extractors test --extractor yake

# Test with your own text
uv run python cli.py extractors test --extractor entities --text "Contact John Doe at john@company.com"

# Test with a file
uv run python cli.py extractors test --extractor statistics --file samples/document.txt

Using extractors during ingestion

# Apply extractors during CSV ingestion
uv run python cli.py ingest samples/small_sample.csv --extractors yake entities statistics

# Apply with custom configuration
uv run python cli.py ingest document.pdf --extractors rake entities \
--extractor-config '{"rake": {"max_keywords": 20}, "entities": {"entity_types": ["PERSON", "ORG"]}}'

# Use configuration-based extractors (see config_examples/extractors_demo_config.json)
uv run python cli.py --config config_examples/extractors_demo_config.json ingest samples/document.pdf

Extractor output example

{
"yake_keywords": [
"machine learning",
"artificial intelligence",
"data analysis"
],
"entities_person": ["John Smith", "Sarah Johnson"],
"entities_email": ["contact@company.com"],
"word_count": 1250,
"reading_time_minutes": 6.25,
"sentiment_classification": "positive",
"flesch_reading_ease": 65.2
}