# Extractors Reference

Extractors enrich document chunks with metadata, enabling filtered searches and better retrieval. They run after parsing and add structured data to each chunk.

## Quick Start

Extractors are defined in `data_processing_strategies`:

```yaml
rag:
  data_processing_strategies:
    - name: my_processor
      parsers:
        - type: PDFParser_LlamaIndex
          config:
            chunk_size: 1000
      extractors:
        - type: EntityExtractor
          config:
            entity_types: [PERSON, ORG, DATE]
        - type: KeywordExtractor
          config:
            max_keywords: 10
```

## Common Extractor Properties

| Property | Required | Description |
|---|---|---|
| `type` | Yes | Extractor type (e.g., `EntityExtractor`) |
| `config` | No | Extractor-specific configuration |
| `priority` | No | Execution order (higher values run first) |
| `file_include_patterns` | No | Glob patterns for files to apply to |
| `condition` | No | Condition expression for when to run |

## EntityExtractor

Extracts named entities using NER models with regex fallback.

Extracts: People, organizations, dates, emails, phone numbers, URLs, products

```yaml
- type: EntityExtractor
  config:
    entity_types: [PERSON, ORG, DATE, EMAIL, PHONE]
    use_fallback: true
    confidence_threshold: 0.7
```

### Options

| Option | Type | Default | Description |
|---|---|---|---|
| `model` | string | `en_core_web_sm` | spaCy NER model |
| `entity_types` | array | See below | Entity types to extract |
| `use_fallback` | boolean | `true` | Use regex fallback |
| `min_entity_length` | integer | `2` | Minimum entity length |
| `merge_entities` | boolean | `true` | Merge adjacent entities |
| `confidence_threshold` | number | `0.7` | Minimum confidence (0-1) |

Supported Entity Types: PERSON, ORG, GPE, DATE, TIME, MONEY, EMAIL, PHONE, URL, LAW, PERCENT, PRODUCT, EVENT, VERSION, FAC, LOC
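
The regex fallback path can be sketched as follows. This is illustrative only — the pattern set and the `extract_entities_fallback` name are assumptions, not the extractor's actual implementation:

```python
import re

# Hypothetical sketch: when no NER model is available, simple regexes can
# still recover pattern-shaped entity types such as EMAIL and URL.
FALLBACK_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "URL": re.compile(r"https?://[^\s]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def extract_entities_fallback(text, min_entity_length=2):
    """Return {entity_type: [matches]} using regex only."""
    results = {}
    for etype, pattern in FALLBACK_PATTERNS.items():
        matches = [m for m in pattern.findall(text) if len(m) >= min_entity_length]
        if matches:
            results[etype] = matches
    return results
```

Types such as PERSON and ORG have no reliable surface pattern, which is why the real fallback is best treated as a safety net rather than a replacement for the NER model.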


## KeywordExtractor

Extracts important keywords using various algorithms.

Extracts: Key terms, phrases, n-grams

```yaml
- type: KeywordExtractor
  config:
    algorithm: rake
    max_keywords: 10
    language: en
```

### Options

| Option | Type | Default | Description |
|---|---|---|---|
| `algorithm` | string | `rake` | `rake`, `yake`, `tfidf`, `textrank` |
| `max_keywords` | integer | `10` | Maximum keywords (1-100) |
| `min_length` | integer | `1` | Minimum word length |
| `max_length` | integer | `4` | Maximum word length |
| `min_frequency` | integer | `1` | Minimum frequency |
| `stop_words` | array | - | Custom stop words |
| `language` | string | `en` | Language for YAKE |
| `max_ngram_size` | integer | `3` | Max n-gram size for YAKE |
| `deduplication_threshold` | number | `0.9` | Dedup threshold for YAKE |
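
To make the default algorithm concrete, here is a compact sketch of the RAKE idea: candidate phrases are runs of words between stop words, each word is scored by degree/frequency, and a phrase scores the sum of its word scores. The tiny stop word list and the `rake_keywords` name are stand-ins, not the library's implementation:

```python
import re
from collections import defaultdict

# Stand-in stop word list for illustration; real RAKE uses a full list.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for", "on"}

def rake_keywords(text, max_keywords=10, max_length=4):
    words = re.findall(r"[a-zA-Z]+", text.lower())
    # Break the word stream into candidate phrases at stop words
    phrases, current = [], []
    for w in words:
        if w in STOP_WORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    phrases = [p for p in phrases if len(p) <= max_length]
    # Word score = degree (co-occurrence weight) / frequency
    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)
    scores = {w: degree[w] / freq[w] for w in freq}
    ranked = sorted(phrases, key=lambda p: sum(scores[w] for w in p), reverse=True)
    return [" ".join(p) for p in ranked[:max_keywords]]
```

The degree term is what makes RAKE favor multi-word phrases over isolated frequent words.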

## DateTimeExtractor

Extracts dates, times, and durations with fuzzy parsing.

Extracts: Dates, times, relative expressions, durations

```yaml
- type: DateTimeExtractor
  config:
    fuzzy_parsing: true
    extract_relative: true
    extract_times: true
    default_timezone: UTC
```

### Options

| Option | Type | Default | Description |
|---|---|---|---|
| `fuzzy_parsing` | boolean | `true` | Enable fuzzy date parsing |
| `extract_relative` | boolean | `true` | Extract relative dates |
| `extract_times` | boolean | `true` | Extract time expressions |
| `extract_durations` | boolean | `true` | Extract durations |
| `default_timezone` | string | `UTC` | Default timezone |
| `date_format` | string | `ISO` | Output date format |
| `prefer_dates_from` | string | `current` | `past`, `future`, `current` |
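
The `extract_relative` behavior can be sketched like this: relative expressions are resolved against a reference date and emitted in ISO format. The `extract_relative_dates` helper and its tiny expression table are hypothetical, showing the resolution step only:

```python
import re
from datetime import date, timedelta

# Illustrative subset of relative expressions; the real extractor
# recognizes a much wider range.
RELATIVE = {"today": 0, "yesterday": -1, "tomorrow": 1}

def extract_relative_dates(text, reference=None):
    reference = reference or date.today()
    found = {}
    for word, offset in RELATIVE.items():
        if re.search(rf"\b{word}\b", text, re.IGNORECASE):
            found[word] = (reference + timedelta(days=offset)).isoformat()
    # Resolve "in N days" expressions against the reference date
    for m in re.finditer(r"in (\d+) days", text, re.IGNORECASE):
        found[m.group(0)] = (reference + timedelta(days=int(m.group(1)))).isoformat()
    return found
```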

## HeadingExtractor

Extracts document headings and builds outline structure.

Extracts: Headings, hierarchy, document outline

```yaml
- type: HeadingExtractor
  config:
    max_level: 6
    include_hierarchy: true
    extract_outline: true
```

### Options

| Option | Type | Default | Description |
|---|---|---|---|
| `max_level` | integer | `6` | Maximum heading level (1-6) |
| `include_hierarchy` | boolean | `true` | Include hierarchy structure |
| `extract_outline` | boolean | `true` | Generate document outline |
| `min_heading_length` | integer | `3` | Minimum heading length |
| `enabled` | boolean | `true` | Enable extractor |
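
For Markdown-style input, outline building can be sketched as follows; the `extract_outline` function here is an assumption that only handles `#` prefixes, whereas the real extractor also covers other heading formats:

```python
import re

def extract_outline(text, max_level=6, min_heading_length=3):
    """Collect Markdown headings as {level, title} records in document order."""
    outline = []
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.+)$", line)
        if m:
            level, title = len(m.group(1)), m.group(2).strip()
            if level <= max_level and len(title) >= min_heading_length:
                outline.append({"level": level, "title": title})
    return outline
```

The flat list of `(level, title)` pairs is enough to reconstruct the hierarchy: each heading nests under the nearest preceding heading of a lower level.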

## LinkExtractor

Extracts URLs, emails, and domain information.

Extracts: URLs, email addresses, domains

```yaml
- type: LinkExtractor
  config:
    extract_urls: true
    extract_emails: true
    extract_domains: true
```

### Options

| Option | Type | Default | Description |
|---|---|---|---|
| `extract_urls` | boolean | `true` | Extract URLs |
| `extract_emails` | boolean | `true` | Extract email addresses |
| `extract_domains` | boolean | `true` | Extract unique domains |
| `validate_urls` | boolean | `false` | Validate URL format |
| `resolve_redirects` | boolean | `false` | Resolve URL redirects |
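
A minimal sketch of URL, email, and unique-domain extraction, assuming only the Python standard library (the `extract_links` name and regexes are illustrative):

```python
import re
from urllib.parse import urlparse

def extract_links(text):
    # Strip common trailing punctuation that regexes tend to swallow
    urls = [u.rstrip(".,;:)") for u in re.findall(r"https?://\S+", text)]
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
    # Unique domains come from the netloc of each URL
    domains = sorted({urlparse(u).netloc for u in urls})
    return {"urls": urls, "emails": emails, "domains": domains}
```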

## PathExtractor

Extracts file paths, URLs, and S3 paths.

Extracts: File paths, URL paths, cloud storage paths

```yaml
- type: PathExtractor
  config:
    extract_file_paths: true
    extract_s3_paths: true
    normalize_paths: true
```

### Options

| Option | Type | Default | Description |
|---|---|---|---|
| `extract_file_paths` | boolean | `true` | Extract file paths |
| `extract_urls` | boolean | `true` | Extract URL paths |
| `extract_s3_paths` | boolean | `true` | Extract S3 paths |
| `validate_paths` | boolean | `false` | Validate path existence |
| `normalize_paths` | boolean | `true` | Normalize path formats |
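
The S3 branch with `normalize_paths` can be sketched as follows; `extract_s3_paths` is a hypothetical helper showing one way normalization could work, not the extractor's actual code:

```python
import re
import posixpath

def extract_s3_paths(text, normalize_paths=True):
    """Find s3:// URIs; optionally collapse '.' segments and duplicate slashes."""
    paths = re.findall(r"s3://[^\s]+", text)
    if normalize_paths:
        # Normalize only the bucket/key portion so the scheme stays intact
        paths = ["s3://" + posixpath.normpath(p[len("s3://"):]) for p in paths]
    return paths
```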

## PatternExtractor

Extracts data using predefined or custom regex patterns.

Extracts: Emails, phones, IPs, SSNs, credit cards, versions, custom patterns

```yaml
- type: PatternExtractor
  config:
    predefined_patterns:
      - email
      - phone
      - ip_address
      - version
    custom_patterns:
      - name: order_id
        pattern: "ORD-[0-9]{6}"
        description: "Order identifier"
```

### Options

| Option | Type | Default | Description |
|---|---|---|---|
| `predefined_patterns` | array | `[]` | Built-in patterns to use |
| `custom_patterns` | array | `[]` | Custom regex patterns |
| `case_sensitive` | boolean | `false` | Case-sensitive matching |
| `return_positions` | boolean | `false` | Return match positions |
| `include_context` | boolean | `false` | Include surrounding context |
| `max_matches_per_pattern` | integer | `100` | Max matches per pattern |
| `deduplicate_matches` | boolean | `true` | Remove duplicates |

Predefined Patterns: email, phone, url, ip, ip_address, ssn, credit_card, zip_code, file_path, version, date
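
How a custom pattern like `order_id` interacts with `case_sensitive`, `max_matches_per_pattern`, and `deduplicate_matches` can be sketched as follows; the `match_custom_pattern` helper is an illustration of these options, not the extractor's internals:

```python
import re

def match_custom_pattern(text, pattern, case_sensitive=False,
                         max_matches=100, deduplicate=True):
    flags = 0 if case_sensitive else re.IGNORECASE
    matches = re.findall(pattern, text, flags)[:max_matches]
    if deduplicate:
        matches = list(dict.fromkeys(matches))  # order-preserving dedup
    return matches
```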


## ContentStatisticsExtractor

Calculates readability scores, vocabulary stats, and text structure.

Extracts: Word count, readability scores, vocabulary metrics

```yaml
- type: ContentStatisticsExtractor
  config:
    include_readability: true
    include_vocabulary: true
    include_structure: true
```

### Options

| Option | Type | Default | Description |
|---|---|---|---|
| `include_readability` | boolean | `true` | Calculate readability scores |
| `include_vocabulary` | boolean | `true` | Analyze vocabulary |
| `include_structure` | boolean | `true` | Analyze text structure |
| `include_sentiment_indicators` | boolean | `false` | Include sentiment indicators |
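
A rough sketch of the statistics layer: word, sentence, and vocabulary counts plus a crude average-words-per-sentence indicator. This stands in for the real readability formulas; the `content_statistics` name and metric set are assumptions:

```python
import re

def content_statistics(text):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    unique = {w.lower() for w in words}
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "vocabulary_size": len(unique),
        # Simple proxy: longer sentences usually mean harder text
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
    }
```

Numbers like these are what make metadata filters such as "only chunks above N words" possible downstream.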

## SummaryExtractor

Generates extractive summaries using text ranking algorithms.

Extracts: Summary sentences, key phrases, text statistics

```yaml
- type: SummaryExtractor
  config:
    summary_sentences: 3
    algorithm: textrank
    include_key_phrases: true
```

### Options

| Option | Type | Default | Description |
|---|---|---|---|
| `summary_sentences` | integer | `3` | Number of sentences (1-10) |
| `algorithm` | string | `textrank` | `textrank`, `lsa`, `luhn`, `lexrank` |
| `include_key_phrases` | boolean | `true` | Extract key phrases |
| `include_statistics` | boolean | `true` | Include text statistics |
| `min_sentence_length` | integer | `10` | Minimum sentence length |
| `max_sentence_length` | integer | `500` | Maximum sentence length |
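
The extractive idea is easiest to see with word-frequency scoring, which is closer to the `luhn` option than to full TextRank: score each sentence by the corpus frequency of its words, keep the top N, and return them in original order. The `summarize` function is a sketch, not the library's implementation:

```python
import re
from collections import Counter

def summarize(text, summary_sentences=3, min_sentence_length=10):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if len(s.strip()) >= min_sentence_length]
    freq = Counter(w.lower() for w in re.findall(r"[A-Za-z]+", text))
    # Rank sentences by the summed frequency of their words
    scored = sorted(
        enumerate(sentences),
        key=lambda kv: sum(freq[w.lower()] for w in re.findall(r"[A-Za-z]+", kv[1])),
        reverse=True,
    )
    # Re-sort the chosen indices so the summary reads in document order
    chosen = sorted(idx for idx, _ in scored[:summary_sentences])
    return [sentences[i] for i in chosen]
```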

## TableExtractor

Extracts tabular data from documents.

Extracts: Tables, headers, cell data

```yaml
- type: TableExtractor
  config:
    output_format: dict
    extract_headers: true
    merge_cells: true
```

### Options

| Option | Type | Default | Description |
|---|---|---|---|
| `output_format` | string | `dict` | `dict`, `list`, `csv`, `markdown` |
| `extract_headers` | boolean | `true` | Extract table headers |
| `merge_cells` | boolean | `true` | Handle merged cells |
| `min_rows` | integer | `2` | Minimum rows for table |
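
To illustrate the `output_format` choices, here is a sketch of the same extracted rows rendered as `dict` records and as a `markdown` table (the helper names are hypothetical):

```python
def table_to_markdown(headers, rows):
    """Render header + rows as a Markdown pipe table."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)

def table_to_dicts(headers, rows):
    """Render each row as a {header: cell} record."""
    return [dict(zip(headers, row)) for row in rows]
```

`dict` output is the most convenient for metadata filtering; `markdown` keeps the table readable when the chunk text is shown to an LLM.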

## YAKEExtractor

YAKE (Yet Another Keyword Extractor) - unsupervised keyword extraction.

Extracts: Keywords using statistical features

```yaml
- type: YAKEExtractor
  config:
    max_keywords: 10
    language: en
    max_ngram_size: 3
```

### Options

| Option | Type | Default | Description |
|---|---|---|---|
| `max_keywords` | integer | `10` | Maximum keywords (1-100) |
| `language` | string | `en` | Language code |
| `max_ngram_size` | integer | `3` | Max n-gram size (1-5) |
| `deduplication_threshold` | number | `0.9` | Dedup threshold (0-1) |

## RAKEExtractor

RAKE (Rapid Automatic Keyword Extraction) algorithm.

Extracts: Keywords based on word co-occurrence

```yaml
- type: RAKEExtractor
  config:
    max_keywords: 10
    min_length: 1
    max_length: 4
```

Can also be used via `KeywordExtractor` with `algorithm: rake`.


## TFIDFExtractor

TF-IDF based keyword extraction.

Extracts: Keywords based on term frequency-inverse document frequency

```yaml
- type: TFIDFExtractor
  config:
    max_keywords: 10
    min_frequency: 1
```

Can also be used via `KeywordExtractor` with `algorithm: tfidf`.
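
The TF-IDF intuition, sketched over a small corpus of chunks: a term ranks high when it is frequent in one chunk but rare across the others. The `tfidf_keywords` function is an illustration of the scoring, not the extractor's code:

```python
import math
import re
from collections import Counter

def tfidf_keywords(chunks, max_keywords=10):
    """Return the top-scoring terms per chunk: tf * log(N / df)."""
    tokenized = [re.findall(r"[a-z]+", c.lower()) for c in chunks]
    # Document frequency: in how many chunks does each term appear?
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(chunks)
    results = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {w: (tf[w] / len(toks)) * math.log(n / df[w]) for w in tf}
        results.append(sorted(scores, key=scores.get, reverse=True)[:max_keywords])
    return results
```

Terms appearing in every chunk get `log(N/N) = 0` and drop to the bottom, which is exactly the filtering effect TF-IDF is chosen for.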


## Complete Example

Combine multiple extractors for rich metadata:

```yaml
extractors:
  # High priority - entities first
  - type: EntityExtractor
    priority: 100
    config:
      entity_types: [PERSON, ORG, DATE, PRODUCT]
      use_fallback: true

  # Keywords for searchability
  - type: KeywordExtractor
    priority: 90
    config:
      algorithm: yake
      max_keywords: 15

  # Statistics for filtering
  - type: ContentStatisticsExtractor
    priority: 80
    config:
      include_readability: true

  # Patterns for specific data
  - type: PatternExtractor
    priority: 70
    file_include_patterns: ["*.pdf"]
    config:
      predefined_patterns: [email, phone, date]
      custom_patterns:
        - name: case_number
          pattern: "CASE-[A-Z]{2}-[0-9]{6}"
```

## Using Extracted Metadata

Query with metadata filters:

```shell
# Filter by entity
lf rag query --database main_db --filter "entities.ORG:Acme Corp" "contracts"

# Filter by keyword
lf rag query --database main_db --filter "keywords:merger" "recent news"

# Filter by date
lf rag query --database main_db --filter "dates:2024" "quarterly reports"
```

## Next Steps