Extractors Reference

Extractors enrich document chunks with metadata, enabling filtered searches and better retrieval. They run after parsing and add structured data to each chunk.

Quick Start

Extractors are defined per strategy under `data_processing_strategies`, next to that strategy's parsers:

rag:
  data_processing_strategies:
    - name: my_processor
      parsers:
        - type: PDFParser_LlamaIndex
          config:
            chunk_size: 1000
      extractors:
        - type: EntityExtractor
          config:
            entity_types: [PERSON, ORG, DATE]
        - type: KeywordExtractor
          config:
            max_keywords: 10

Common Extractor Properties

Property	Required	Description
`type`	Yes	Extractor type (e.g., `EntityExtractor`)
`config`	No	Extractor-specific configuration
`priority`	No	Execution order (lower numbers run first)
`file_include_patterns`	No	Glob patterns for files to apply to
`condition`	No	Condition expression for when to run
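
To make the interaction of `priority` and `file_include_patterns` concrete, here is a minimal Python sketch of the selection logic those two properties describe. It is an illustration only, not LlamaFarm's implementation: the `applicable_extractors` helper is hypothetical, and two behaviors are assumptions (an extractor without patterns applies to every file, and an extractor without a priority sorts last).

```python
from fnmatch import fnmatch

# Hypothetical extractor entries mirroring the YAML properties above.
extractors = [
    {"type": "KeywordExtractor", "priority": 20},
    {"type": "EntityExtractor", "priority": 10},
    {"type": "PatternExtractor", "priority": 30, "file_include_patterns": ["*.pdf"]},
]


def applicable_extractors(extractors, filename):
    """Return the extractor types that apply to `filename`, lowest priority number first."""
    selected = [
        e for e in extractors
        # Assumption: no file_include_patterns means "apply to every file".
        if not e.get("file_include_patterns")
        or any(fnmatch(filename, p) for p in e["file_include_patterns"])
    ]
    # Assumption: a missing priority sorts last; the real default is not documented here.
    ordered = sorted(selected, key=lambda e: e.get("priority", float("inf")))
    return [e["type"] for e in ordered]


print(applicable_extractors(extractors, "report.pdf"))
print(applicable_extractors(extractors, "notes.txt"))
```

With these entries, the `*.pdf` pattern keeps `PatternExtractor` away from `notes.txt`, and the remaining extractors run in ascending priority order.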

EntityExtractor

Extracts named entities using NER models with regex fallback.

Extracts: People, organizations, dates, emails, phone numbers, URLs, products

- type: EntityExtractor
  config:
    entity_types: [PERSON, ORG, DATE, EMAIL, PHONE]
    use_fallback: true
    confidence_threshold: 0.7

Options

Option	Type	Default	Description
`model`	string	`en_core_web_sm`	spaCy NER model
`entity_types`	array	See below	Entity types to extract
`use_fallback`	boolean	`true`	Use regex fallback
`min_entity_length`	integer	2	Minimum entity length
`merge_entities`	boolean	`true`	Merge adjacent entities
`confidence_threshold`	number	0.7	Minimum confidence (0-1)

Supported Entity Types: PERSON, ORG, GPE, DATE, TIME, MONEY, EMAIL, PHONE, URL, LAW, PERCENT, PRODUCT, EVENT, VERSION, FAC, LOC
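
The options `entity_types`, `confidence_threshold`, and `min_entity_length` all act as filters on candidate entities. The sketch below shows one plausible way they combine. The candidate dictionaries and the `filter_entities` helper are hypothetical, since this reference does not show the extractor's internal data format.

```python
def filter_entities(candidates, entity_types, confidence_threshold=0.7, min_entity_length=2):
    """Keep candidates of a requested type that are confident and long enough."""
    return [
        c for c in candidates
        if c["label"] in entity_types
        and c["confidence"] >= confidence_threshold
        and len(c["text"]) >= min_entity_length
    ]


# Hypothetical model output for one chunk.
candidates = [
    {"text": "Acme Corp", "label": "ORG", "confidence": 0.92},
    {"text": "J", "label": "PERSON", "confidence": 0.95},       # shorter than min_entity_length
    {"text": "Paris", "label": "GPE", "confidence": 0.88},      # type not requested
    {"text": "next week", "label": "DATE", "confidence": 0.41}, # below confidence_threshold
]

kept = filter_entities(candidates, ["PERSON", "ORG", "DATE"])
print(kept)
```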

KeywordExtractor

Extracts important keywords using various algorithms.

Extracts: Key terms, phrases, n-grams

- type: KeywordExtractor
  config:
    algorithm: rake
    max_keywords: 10
    language: en

Options

Option	Type	Default	Description
`algorithm`	string	`rake`	`rake`, `yake`, `tfidf`, `textrank`
`max_keywords`	integer	10	Maximum keywords (1-100)
`min_length`	integer	1	Minimum word length
`max_length`	integer	4	Maximum word length
`min_frequency`	integer	1	Minimum frequency
`stop_words`	array	-	Custom stop words
`language`	string	`en`	Language for YAKE
`max_ngram_size`	integer	3	Max n-gram size for YAKE
`deduplication_threshold`	number	0.9	Dedup threshold for YAKE

DateTimeExtractor

Extracts dates, times, and durations with fuzzy parsing.

Extracts: Dates, times, relative expressions, durations

- type: DateTimeExtractor
  config:
    fuzzy_parsing: true
    extract_relative: true
    extract_times: true
    default_timezone: UTC

Options

Option	Type	Default	Description
`fuzzy_parsing`	boolean	`true`	Enable fuzzy date parsing
`extract_relative`	boolean	`true`	Extract relative dates
`extract_times`	boolean	`true`	Extract time expressions
`extract_durations`	boolean	`true`	Extract durations
`default_timezone`	string	`UTC`	Default timezone
`date_format`	string	`ISO`	Output date format
`prefer_dates_from`	string	`current`	`past`, `future`, `current`
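
The table does not spell out what `prefer_dates_from` does. A common meaning for a setting with this name is choosing a year for dates written without one, such as "March 5". The sketch below illustrates that interpretation only. It is an assumption about the semantics, the `resolve_month_day` helper is hypothetical, and it ignores the February 29 edge case.

```python
from datetime import date


def resolve_month_day(month, day, today, prefer_dates_from="current"):
    """Pick a year for a date written without one, relative to `today`."""
    candidate = date(today.year, month, day)
    if prefer_dates_from == "past" and candidate > today:
        # The date has not happened yet this year, so use last year's.
        return candidate.replace(year=today.year - 1)
    if prefer_dates_from == "future" and candidate < today:
        # The date has already passed this year, so use next year's.
        return candidate.replace(year=today.year + 1)
    # "current": stay in the current year either way.
    return candidate


today = date(2024, 6, 1)
print(resolve_month_day(3, 5, today, "current").isoformat())
print(resolve_month_day(3, 5, today, "future").isoformat())
print(resolve_month_day(9, 1, today, "past").isoformat())
```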

HeadingExtractor

Extracts document headings and builds outline structure.

Extracts: Headings, hierarchy, document outline

- type: HeadingExtractor
  config:
    max_level: 6
    include_hierarchy: true
    extract_outline: true

Options

Option	Type	Default	Description
`max_level`	integer	6	Maximum heading level (1-6)
`include_hierarchy`	boolean	`true`	Include hierarchy structure
`extract_outline`	boolean	`true`	Generate document outline
`min_heading_length`	integer	3	Minimum heading length
`enabled`	boolean	`true`	Enable extractor
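
As a rough illustration of what "hierarchy" means here, the following sketch collects Markdown-style `#` headings and records each heading's chain of ancestors. It assumes Markdown input and a hypothetical output shape (`level`, `title`, `path`). The real extractor may recognize other heading formats and return different fields.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*#*\s*$")


def extract_headings(text, max_level=6, min_heading_length=3):
    """Return headings with their ancestor path, skipping deep or very short ones."""
    headings = []
    stack = []  # (level, title) pairs for the currently open ancestors
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        if level > max_level or len(title) < min_heading_length:
            continue
        # Close every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        path = [t for _, t in stack] + [title]
        headings.append({"level": level, "title": title, "path": path})
        stack.append((level, title))
    return headings


doc = "# Guide\n\nIntro text\n\n## Install\n\n### Linux\n\n## Usage\n"
for heading in extract_headings(doc):
    print(heading["level"], " > ".join(heading["path"]))
```

Lowering `max_level` to 2 in this sketch drops the "Linux" entry while leaving the rest of the outline unchanged.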

LinkExtractor

Extracts URLs, emails, and domain information.

Extracts: URLs, email addresses, domains

- type: LinkExtractor
  config:
    extract_urls: true
    extract_emails: true
    extract_domains: true

Options

Option	Type	Default	Description
`extract_urls`	boolean	`true`	Extract URLs
`extract_emails`	boolean	`true`	Extract email addresses
`extract_domains`	boolean	`true`	Extract unique domains
`validate_urls`	boolean	`false`	Validate URL format
`resolve_redirects`	boolean	`false`	Resolve URL redirects
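
The sketch below shows how URLs, email addresses, and a deduplicated domain list could be derived from plain text with the standard library. The regular expressions are deliberately simple and are assumptions for illustration. They are not the patterns the extractor uses, and the `extract_links` helper is hypothetical.

```python
import re
from urllib.parse import urlparse

# Simplified, hypothetical patterns; real-world URL and email matching is more involved.
URL_RE = re.compile(r"https?://[^\s<>\"')]+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_links(text):
    """Collect URLs, emails, and the unique domains they refer to."""
    urls = [u.rstrip(".,;") for u in URL_RE.findall(text)]  # trim sentence punctuation
    emails = EMAIL_RE.findall(text)
    domains = {urlparse(u).hostname for u in urls}
    domains |= {e.split("@", 1)[1].lower() for e in emails}
    domains.discard(None)
    return {"urls": urls, "emails": emails, "domains": sorted(domains)}


sample = "See https://docs.example.com/guide and http://example.org. Mail ops@example.org."
print(extract_links(sample))
```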

PathExtractor

Extracts file paths, URLs, and S3 paths.

Extracts: File paths, URL paths, cloud storage paths

- type: PathExtractor
  config:
    extract_file_paths: true
    extract_s3_paths: true
    normalize_paths: true

Options

Option	Type	Default	Description
`extract_file_paths`	boolean	`true`	Extract file paths
`extract_urls`	boolean	`true`	Extract URL paths
`extract_s3_paths`	boolean	`true`	Extract S3 paths
`validate_paths`	boolean	`false`	Validate path existence
`normalize_paths`	boolean	`true`	Normalize path formats
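
To show what path normalization can look like, this sketch finds Unix-style absolute paths and `s3://` URIs and collapses `..` segments with `posixpath.normpath`. The patterns and the `extract_paths` helper are assumptions made for the example. Windows paths and other cloud schemes are left out to keep it short.

```python
import posixpath
import re

# Hypothetical, simplified patterns for illustration.
S3_RE = re.compile(r"s3://[A-Za-z0-9.\-]+/[^\s\"']+")
UNIX_PATH_RE = re.compile(r"(?<![\w:/])/(?:[\w.\-]+/)+[\w.\-]+")


def extract_paths(text, normalize_paths=True):
    """Return Unix file paths (optionally normalized) and S3 URIs found in `text`."""
    file_paths = UNIX_PATH_RE.findall(text)
    if normalize_paths:
        # normpath collapses "." and ".." segments and duplicate slashes.
        file_paths = [posixpath.normpath(p) for p in file_paths]
    return {"file_paths": file_paths, "s3_paths": S3_RE.findall(text)}


sample = "Logs are in /var/log/../log/app.log; backups go to s3://my-bucket/data/2024/file.csv daily."
print(extract_paths(sample))
```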

PatternExtractor

Extracts data using predefined or custom regex patterns.

Extracts: Emails, phones, IPs, SSNs, credit cards, versions, custom patterns

- type: PatternExtractor
  config:
    predefined_patterns:
      - email
      - phone
      - ip_address
      - version
    custom_patterns:
      - name: order_id
        pattern: "ORD-[0-9]{6}"
        description: "Order identifier"

Options

Option	Type	Default	Description
`predefined_patterns`	array	`[]`	Built-in patterns to use
`custom_patterns`	array	`[]`	Custom regex patterns
`case_sensitive`	boolean	`false`	Case-sensitive matching
`return_positions`	boolean	`false`	Return match positions
`include_context`	boolean	`false`	Include surrounding context
`max_matches_per_pattern`	integer	100	Max matches per pattern
`deduplicate_matches`	boolean	`true`	Remove duplicates

Predefined Patterns: email, phone, url, ip, ip_address, ssn, credit_card, zip_code, file_path, version, date
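
Before adding a custom pattern to the configuration, it can help to try it against sample text. The sketch below runs the `order_id` pattern from the example with Python's `re` module and mimics `case_sensitive`, `deduplicate_matches`, and `max_matches_per_pattern`. The `run_patterns` helper is hypothetical, and applying deduplication before the match limit is an assumption.

```python
import re


def run_patterns(text, custom_patterns, case_sensitive=False,
                 max_matches_per_pattern=100, deduplicate_matches=True):
    """Apply each named regex to `text` and return matches per pattern name."""
    flags = 0 if case_sensitive else re.IGNORECASE
    results = {}
    for spec in custom_patterns:
        matches = [m.group() for m in re.finditer(spec["pattern"], text, flags)]
        if deduplicate_matches:
            matches = list(dict.fromkeys(matches))  # drop repeats, keep first-seen order
        results[spec["name"]] = matches[:max_matches_per_pattern]
    return results


patterns = [{"name": "order_id", "pattern": "ORD-[0-9]{6}"}]
text = "Refund ORD-123456, then reship ord-654321 and ORD-123456."

print(run_patterns(text, patterns))
print(run_patterns(text, patterns, case_sensitive=True))
```

With the default `case_sensitive: false`, the lowercase `ord-654321` also matches. If identifiers are always uppercase, turning case sensitivity on avoids such matches.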

ContentStatisticsExtractor

Calculates readability scores, vocabulary stats, and text structure.

Extracts: Word count, readability scores, vocabulary metrics

- type: ContentStatisticsExtractor
  config:
    include_readability: true
    include_vocabulary: true
    include_structure: true

Options

Option	Type	Default	Description
`include_readability`	boolean	`true`	Calculate readability scores
`include_vocabulary`	boolean	`true`	Analyze vocabulary
`include_structure`	boolean	`true`	Analyze text structure
`include_sentiment_indicators`	boolean	`false`	Include sentiment indicators
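
For a sense of the kind of numbers such an extractor produces, here is a small sketch that computes word count, sentence count, average sentence length, and a type-token ratio (unique words divided by total words). The field names and the `content_statistics` helper are hypothetical, and readability formulas are left out. The actual metadata keys are not listed in this reference.

```python
import re


def content_statistics(text):
    """Compute a few basic structure and vocabulary metrics for `text`."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }


stats = content_statistics("The cat sat. The cat ran!")
print(stats)
```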

SummaryExtractor

Generates extractive summaries using text ranking algorithms.

Extracts: Summary sentences, key phrases, text statistics

- type: SummaryExtractor
  config:
    summary_sentences: 3
    algorithm: textrank
    include_key_phrases: true

Options

Option	Type	Default	Description
`summary_sentences`	integer	3	Number of sentences (1-10)
`algorithm`	string	`textrank`	`textrank`, `lsa`, `luhn`, `lexrank`
`include_key_phrases`	boolean	`true`	Extract key phrases
`include_statistics`	boolean	`true`	Include text statistics
`min_sentence_length`	integer	10	Minimum sentence length
`max_sentence_length`	integer	500	Maximum sentence length
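
Extractive summarization selects existing sentences instead of writing new ones. The sketch below uses plain word-frequency scoring, which is a much simpler stand-in for the listed algorithms (it is not TextRank, LSA, Luhn, or LexRank), to show how `summary_sentences`, `min_sentence_length`, and `max_sentence_length` could shape the result. The `summarize` helper is hypothetical, and measuring sentence length in characters is an assumption.

```python
import re
from collections import Counter


def summarize(text, summary_sentences=3, min_sentence_length=10, max_sentence_length=500):
    """Return the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    scored = []
    for index, sentence in enumerate(sentences):
        # Assumption: the length limits are measured in characters.
        if not (min_sentence_length <= len(sentence) <= max_sentence_length):
            continue
        tokens = re.findall(r"[a-z]+", sentence.lower())
        if tokens:
            # Score = average corpus frequency of the sentence's words.
            scored.append((sum(freq[t] for t in tokens) / len(tokens), index, sentence))
    top = sorted(scored, key=lambda item: -item[0])[:summary_sentences]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]


text = "Extractors add metadata. Metadata helps search. Ok. Cats sleep."
print(summarize(text, summary_sentences=2))
```

This sketch does not remove stop words, so on longer texts common words would dominate the scores. A practical implementation would filter them first.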

TableExtractor

Extracts tabular data from documents.

Extracts: Tables, headers, cell data

- type: TableExtractor
  config:
    output_format: dict
    extract_headers: true
    merge_cells: true

Options

Option	Type	Default	Description
`output_format`	string	`dict`	`dict`, `list`, `csv`, `markdown`
`extract_headers`	boolean	`true`	Extract table headers
`merge_cells`	boolean	`true`	Handle merged cells
`min_rows`	integer	2	Minimum rows for table
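
The difference between the `dict` and `list` output formats is easiest to see on a small table. The sketch below parses a Markdown pipe table into both shapes. It is an illustration under assumptions: the `parse_pipe_table` helper is hypothetical, the input is taken to be a pipe table, and `min_rows` is taken to count the header row. The real extractor's output keys and detection rules may differ.

```python
def parse_pipe_table(lines, output_format="dict", min_rows=2):
    """Parse Markdown pipe-table lines into a list of dicts or a list of rows."""
    rows = []
    for line in lines:
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the separator row, e.g. |------|:---:|.
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    # Assumption: min_rows counts the header row as well.
    if len(rows) < min_rows:
        return None
    if output_format == "list":
        return rows
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


table = [
    "| Name | Qty |",
    "|------|-----|",
    "| Bolt | 4 |",
    "| Nut | 9 |",
]
print(parse_pipe_table(table))
print(parse_pipe_table(table, output_format="list"))
```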

YAKEExtractor

YAKE (Yet Another Keyword Extractor) - unsupervised keyword extraction.

Extracts: Keywords using statistical features

- type: YAKEExtractor
  config:
    max_keywords: 10
    language: en
    max_ngram_size: 3

Options

Option	Type	Default	Description
`max_keywords`	integer	10	Maximum keywords (1-100)
`language`	string	`en`	Language code
`max_ngram_size`	integer	3	Max n-gram size (1-5)
`deduplication_threshold`	number	0.9	Dedup threshold (0-1)
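
The `deduplication_threshold` option controls how similar two keywords may be before the lower-ranked one is dropped. The sketch below illustrates the idea with `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. YAKE's own similarity function may differ, and the `deduplicate` helper is hypothetical.

```python
from difflib import SequenceMatcher


def deduplicate(ranked_keywords, deduplication_threshold=0.9, max_keywords=10):
    """Walk keywords from best to worst, dropping near-duplicates of ones already kept."""
    kept = []
    for keyword in ranked_keywords:
        is_duplicate = any(
            SequenceMatcher(None, keyword, existing).ratio() >= deduplication_threshold
            for existing in kept
        )
        if not is_duplicate:
            kept.append(keyword)
        if len(kept) == max_keywords:
            break
    return kept


ranked = ["data science", "data sciences", "machine learning"]
print(deduplicate(ranked))
print(deduplicate(ranked, deduplication_threshold=0.99))
```

In this sketch a lower threshold removes more near-duplicates, and a value close to 1 keeps almost everything.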

RAKEExtractor

RAKE (Rapid Automatic Keyword Extraction) algorithm.

Extracts: Keywords based on word co-occurrence

- type: RAKEExtractor
  config:
    max_keywords: 10
    min_length: 1
    max_length: 4

The same algorithm is available through `KeywordExtractor` by setting `algorithm: rake`.
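
For readers unfamiliar with RAKE, the following is a compact sketch of the algorithm: split text into candidate phrases at stop words and punctuation, score each word by degree divided by frequency, and score each phrase as the sum of its word scores. The tiny stop-word list and the `rake` helper are hypothetical, and treating `min_length` and `max_length` as phrase lengths in words is an assumption. It is not the extractor's implementation.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def rake(text, max_keywords=10, min_length=1, max_length=4):
    """Return candidate phrases ranked by RAKE score (sum of word degree / frequency)."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):  # punctuation ends a phrase
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:  # stop words end a phrase too
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    # Assumption: min_length / max_length bound the number of words in a phrase.
    phrases = [p for p in phrases if min_length <= len(p) <= max_length]

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores, key=scores.get, reverse=True)[:max_keywords]


text = "Keyword extraction is the task of finding important terms in a document."
print(rake(text, max_keywords=2))
```

Because longer phrases accumulate more word scores, RAKE tends to favor multi-word phrases, which is why a `max_length` cap is useful.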

TFIDFExtractor

TF-IDF based keyword extraction.

Extracts: Keywords based on term frequency-inverse document frequency

- type: TFIDFExtractor
  config:
    max_keywords: 10
    min_frequency: 1

The same algorithm is available through `KeywordExtractor` by setting `algorithm: tfidf`.
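
TF-IDF ranks a term highly when it is frequent in one document but rare across the collection. The sketch below uses the plain formulation (term frequency times `log(N / document frequency)`) to show how `max_keywords` and `min_frequency` could apply. The `tfidf_keywords` helper is hypothetical, and the extractor may use a different weighting or smoothing variant.

```python
import math
import re
from collections import Counter


def tfidf_keywords(documents, doc_index, max_keywords=10, min_frequency=1):
    """Rank the terms of documents[doc_index] by TF-IDF against all documents."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    counts = Counter(tokenized[doc_index])
    total = sum(counts.values())
    scores = {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
        if count >= min_frequency
    }
    return sorted(scores, key=scores.get, reverse=True)[:max_keywords]


docs = ["the merger closed the deal", "the cat sat", "the dog ran"]
print(tfidf_keywords(docs, 0, max_keywords=3))
```

Here "the" appears in every document, so its inverse document frequency is zero and it drops to the bottom of the ranking even though it is the most frequent word.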

Complete Example

Combine multiple extractors for rich metadata:

extractors:
  # Entities first (lower priority numbers run first)
  - type: EntityExtractor
    priority: 10
    config:
      entity_types: [PERSON, ORG, DATE, PRODUCT]
      use_fallback: true

  # Keywords for searchability
  - type: KeywordExtractor
    priority: 20
    config:
      algorithm: yake
      max_keywords: 15

  # Statistics for filtering
  - type: ContentStatisticsExtractor
    priority: 30
    config:
      include_readability: true

  # Patterns for specific data
  - type: PatternExtractor
    priority: 40
    file_include_patterns: ["*.pdf"]
    config:
      predefined_patterns: [email, phone, date]
      custom_patterns:
        - name: case_number
          pattern: "CASE-[A-Z]{2}-[0-9]{6}"

Using Extracted Metadata

Query with metadata filters:

# Filter by entity
lf rag query --database main_db --filter "entities.ORG:Acme Corp" "contracts"

# Filter by keyword
lf rag query --database main_db --filter "keywords:merger" "recent news"

# Filter by date
lf rag query --database main_db --filter "dates:2024" "quarterly reports"

Next Steps

Databases Reference - Configure vector stores
Retrieval Strategies - Configure retrieval
RAG Guide - Full RAG overview
