Parsers Reference

Parsers transform raw documents into text chunks suitable for embedding and retrieval. LlamaFarm provides multiple parsers for each file format, allowing you to choose based on your performance and capability needs.

Parser Configuration

Parsers are defined within data_processing_strategies in your llamafarm.yaml:

rag:
  data_processing_strategies:
    - name: my_processor
      description: "Process various document types"
      parsers:
        - type: PDFParser_LlamaIndex
          file_include_patterns:
            - "*.pdf"
          priority: 0 # Lower = try first
          config:
            chunk_size: 1000
            chunk_overlap: 100

Common Parser Properties

| Property | Required | Description |
| --- | --- | --- |
| type | Yes | Parser type identifier (e.g., PDFParser_PyPDF2) |
| file_include_patterns | No | Glob patterns for files to process (e.g., ["*.pdf"]) |
| file_extensions | No | File extensions this parser handles (e.g., [".pdf"]) |
| priority | No | Parser priority (lower = try first, default: 50) |
| mime_types | No | MIME types this parser handles |
| fallback_parser | No | Parser to use if this one fails |
| config | Yes | Parser-specific configuration |
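
As an illustration of the optional properties, the fragment below matches PDFs by extension and MIME type and names a fallback parser. It is a sketch rather than a verified configuration: the value formats for mime_types (standard MIME strings) and fallback_parser (a parser type name) are assumptions inferred from the descriptions in the table.

```yaml
parsers:
  - type: PDFParser_LlamaIndex
    file_extensions:
      - ".pdf"
    mime_types:
      - "application/pdf"                # assumed format: standard MIME type strings
    priority: 10                         # below the default of 50, so tried earlier
    fallback_parser: PDFParser_PyPDF2    # assumed: referenced by parser type name
    config:
      chunk_size: 1000
      chunk_overlap: 100
```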

Auto Parser

The auto parser automatically detects file types and applies the appropriate parser.

Configuration

parsers:
  - type: auto
    config:
      chunk_size: 1000 # Chunk size for text splitting (100-10000, default: 1000)
      chunk_overlap: 200 # Overlap between chunks (0-500, default: 200)

Options

| Option | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| chunk_size | integer | 1000 | 100-10000 | Chunk size for text splitting |
| chunk_overlap | integer | 200 | 0-500 | Overlap between chunks |

PDF Parsers

PDFParser_PyPDF2

Enhanced PDF parser using PyPDF2 with comprehensive text and metadata extraction.

Best for: Simple PDFs, form extraction, annotation extraction

parsers:
  - type: PDFParser_PyPDF2
    file_include_patterns:
      - "*.pdf"
      - "*.PDF"
    config:
      chunk_size: 1000
      chunk_overlap: 100
      chunk_strategy: paragraphs
      extract_metadata: true
      preserve_layout: true

Options

| Option | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| chunk_size | integer | 1000 | 100-50000 | Chunk size in characters |
| chunk_overlap | integer | 100 | 0-5000 | Overlap between chunks |
| chunk_strategy | string | paragraphs | paragraphs, sentences, characters | Chunking strategy |
| extract_metadata | boolean | true | - | Extract PDF metadata |
| preserve_layout | boolean | true | - | Use layout-preserving extraction |
| extract_page_info | boolean | true | - | Extract page numbers and rotation |
| extract_annotations | boolean | false | - | Extract PDF annotations |
| extract_links | boolean | false | - | Extract hyperlinks |
| extract_form_fields | boolean | false | - | Extract form fields |
| extract_outlines | boolean | false | - | Extract document outlines/bookmarks |
| extract_images | boolean | false | - | Extract embedded images |
| extract_xmp_metadata | boolean | false | - | Extract XMP metadata |
| clean_text | boolean | true | - | Clean extracted text |
| combine_pages | boolean | false | - | Combine all pages (must be false for chunking) |
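
For annotated or form-heavy PDFs, the extraction flags that default to false can be switched on. The fragment below is a minimal sketch that uses only options listed for PDFParser_PyPDF2; which combination is worthwhile depends on your documents.

```yaml
parsers:
  - type: PDFParser_PyPDF2
    file_include_patterns:
      - "*.pdf"
    config:
      chunk_size: 1000
      chunk_overlap: 100
      chunk_strategy: sentences
      extract_annotations: true    # off by default
      extract_links: true          # off by default
      extract_form_fields: true    # off by default
      extract_outlines: true       # off by default
      combine_pages: false         # keep false so chunking still applies
```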

PDFParser_LlamaIndex

Advanced PDF parser using LlamaIndex with multiple fallback strategies and semantic chunking.

Best for: Complex PDFs, scanned documents, semantic chunking

parsers:
  - type: PDFParser_LlamaIndex
    file_include_patterns:
      - "*.pdf"
    priority: 0 # Lower = try first (preferred parser)
    config:
      chunk_size: 1200
      chunk_overlap: 150
      chunk_strategy: semantic
      extract_metadata: true
      extract_tables: true
      fallback_strategies:
        - llama_pdf_reader
        - llama_pymupdf_reader
        - direct_pymupdf
        - pypdf2_fallback

Options

| Option | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| chunk_size | integer | 1000 | 100-50000 | Chunk size in characters |
| chunk_overlap | integer | 100 | 0-5000 | Overlap between chunks |
| chunk_strategy | string | sentences | sentences, paragraphs, pages, semantic | Chunking strategy |
| extract_metadata | boolean | true | - | Extract PDF metadata |
| extract_images | boolean | false | - | Extract images from PDF |
| extract_tables | boolean | true | - | Extract tables from PDF |
| fallback_strategies | array | All strategies | See below | Fallback strategies in order |

Fallback Strategies:

  • llama_pdf_reader - LlamaIndex PDFReader
  • llama_pymupdf_reader - LlamaIndex PyMuPDFReader
  • direct_pymupdf - Direct PyMuPDF
  • pypdf2_fallback - PyPDF2 fallback
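
Because fallback_strategies is an ordered array, it can presumably be narrowed to a subset. The sketch below assumes that the listed strategies are attempted top to bottom and that unlisted ones are skipped; neither behavior is stated explicitly above.

```yaml
parsers:
  - type: PDFParser_LlamaIndex
    file_include_patterns:
      - "*.pdf"
    config:
      chunk_strategy: pages
      extract_tables: true
      fallback_strategies:         # assumed: tried in this order, others skipped
        - llama_pymupdf_reader
        - pypdf2_fallback
```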

CSV Parsers

CSVParser_Pandas

Advanced CSV parser using Pandas with data analysis capabilities.

Best for: Data analysis, large CSV files, complex data handling

parsers:
  - type: CSVParser_Pandas
    file_include_patterns:
      - "*.csv"
    config:
      chunk_size: 1000
      chunk_strategy: rows
      extract_metadata: true
      encoding: utf-8
      delimiter: ","
      na_values:
        - ""
        - "NA"
        - "N/A"
        - "null"

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | integer | 1000 | Number of rows per chunk |
| chunk_strategy | string | rows | rows, columns, full |
| extract_metadata | boolean | true | Extract data statistics |
| encoding | string | utf-8 | File encoding |
| delimiter | string | , | CSV delimiter |
| na_values | array | ["", "NA", "N/A", "null", "None"] | Values to treat as NaN |
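
Exports from spreadsheet tools often use a semicolon delimiter and a non-UTF-8 encoding. The fragment below sketches how the listed options might be combined for such a file; the latin-1 encoding name and the custom na_values entries are illustrative assumptions, not values taken from the reference.

```yaml
parsers:
  - type: CSVParser_Pandas
    file_include_patterns:
      - "*.csv"
    config:
      chunk_size: 200          # rows per chunk
      chunk_strategy: rows
      delimiter: ";"
      encoding: latin-1        # assumption: any encoding name Pandas accepts
      na_values:               # illustrative placeholders for missing data
        - ""
        - "-"
        - "n/a"
```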

CSVParser_Python

Simple CSV parser using Python's built-in csv module.

Best for: Simple CSV files, minimal dependencies

parsers:
  - type: CSVParser_Python
    file_include_patterns:
      - "*.csv"
    config:
      chunk_size: 1000
      encoding: utf-8
      delimiter: ","
      quotechar: '"'

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | integer | 1000 | Number of rows per chunk |
| encoding | string | utf-8 | File encoding |
| delimiter | string | , | CSV delimiter |
| quotechar | string | " | Quote character |

CSVParser_LlamaIndex

CSV parser using LlamaIndex with Pandas backend for advanced processing.

Best for: Semantic chunking, field mapping, integration with LlamaIndex

parsers:
  - type: CSVParser_LlamaIndex
    file_include_patterns:
      - "*.csv"
      - "*.tsv"
    config:
      chunk_size: 1000
      chunk_strategy: rows
      field_mapping:
        title: name
        content: description
      combine_fields: true
      skiprows: 0

Options

| Option | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| chunk_size | integer | 1000 | 100-50000 | Number of rows per chunk |
| chunk_strategy | string | rows | rows, semantic, full | Chunking strategy |
| field_mapping | object | - | - | Map CSV columns to standard fields |
| extract_metadata | boolean | true | - | Extract metadata from CSV |
| combine_fields | boolean | true | - | Combine fields into text content |
| skiprows | integer | 0 | 0+ | Number of rows to skip |
| na_values | array | ["", "NA", "N/A", "null", "None"] | - | Values to treat as missing |

Excel Parsers

ExcelParser_OpenPyXL

Excel parser using OpenPyXL for XLSX files with formula support.

Best for: XLSX files, formula extraction, workbook metadata

parsers:
  - type: ExcelParser_OpenPyXL
    file_include_patterns:
      - "*.xlsx"
    config:
      chunk_size: 1000
      extract_formulas: false
      extract_metadata: true
      data_only: true
      sheets: null # Process all sheets

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | integer | 1000 | Number of rows per chunk |
| extract_formulas | boolean | false | Extract cell formulas |
| extract_metadata | boolean | true | Extract workbook metadata |
| sheets | array/null | null | Specific sheets to process (null = all) |
| data_only | boolean | true | Extract values instead of formulas |
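
In OpenPyXL itself, loading a workbook in data-only mode returns cached cell values rather than formula text, so extracting formulas likely requires turning data_only off. The sketch below assumes the parser passes this flag through to OpenPyXL; the sheet names are hypothetical.

```yaml
parsers:
  - type: ExcelParser_OpenPyXL
    file_include_patterns:
      - "*.xlsx"
    config:
      chunk_size: 500
      extract_formulas: true
      data_only: false       # assumed necessary so formulas, not cached values, are read
      sheets:
        - "Summary"          # hypothetical sheet names
        - "Q1"
```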

ExcelParser_Pandas

Excel parser using Pandas with data analysis capabilities.

Best for: Data analysis, statistical processing

parsers:
  - type: ExcelParser_Pandas
    file_include_patterns:
      - "*.xlsx"
      - "*.xls"
    config:
      chunk_size: 1000
      sheets: null
      extract_metadata: true
      skiprows: null
      na_values:
        - ""
        - "NA"

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | integer | 1000 | Number of rows per chunk |
| sheets | array/null | null | Specific sheets to process |
| extract_metadata | boolean | true | Extract data statistics |
| skiprows | integer/null | null | Rows to skip at beginning |
| na_values | array | ["", "NA", "N/A", "null", "None"] | Values to treat as NaN |

ExcelParser_LlamaIndex

Excel parser using LlamaIndex with Pandas backend for advanced processing.

Best for: Semantic chunking, combining sheets, advanced data handling

parsers:
  - type: ExcelParser_LlamaIndex
    file_include_patterns:
      - "*.xlsx"
      - "*.xls"
    config:
      chunk_size: 1000
      chunk_strategy: rows
      sheets: null
      combine_sheets: false
      extract_metadata: true
      extract_formulas: false
      header_row: 0

Options

| Option | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| chunk_size | integer | 1000 | 100-50000 | Number of rows per chunk |
| chunk_strategy | string | rows | rows, semantic, full | Chunking strategy |
| sheets | array/null | null | - | Specific sheets to parse |
| combine_sheets | boolean | false | - | Combine all sheets into one document |
| extract_metadata | boolean | true | - | Extract metadata |
| extract_formulas | boolean | false | - | Extract formulas instead of values |
| header_row | integer | 0 | 0+ | Row index for headers |
| skiprows | integer | 0 | 0+ | Number of rows to skip |
| na_values | array | ["", "NA", "N/A", "null", "None"] | - | Values to treat as missing |
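
Workbooks with a title row above the column headers are a common case for header_row and combine_sheets. The fragment below is a sketch under the assumption that header_row is zero-based (the default of 0 suggests this, but it is not stated above).

```yaml
parsers:
  - type: ExcelParser_LlamaIndex
    file_include_patterns:
      - "*.xlsx"
    config:
      chunk_size: 500
      chunk_strategy: rows
      combine_sheets: true     # one document for the whole workbook
      header_row: 1            # assumed zero-based: headers are on the second row
      extract_metadata: true
```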

Word Document Parsers

DocxParser_PythonDocx

Word document parser using python-docx library.

Best for: Simple DOCX files, table extraction

parsers:
  - type: DocxParser_PythonDocx
    file_include_patterns:
      - "*.docx"
    config:
      chunk_size: 1000
      chunk_strategy: paragraphs
      extract_metadata: true
      extract_tables: true
      extract_headers: true
      extract_footers: false
      extract_comments: false

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | integer | 1000 | Chunk size in characters |
| chunk_strategy | string | paragraphs | paragraphs, sentences, characters |
| extract_metadata | boolean | true | Extract document metadata |
| extract_tables | boolean | true | Extract tables |
| extract_headers | boolean | true | Extract headers |
| extract_footers | boolean | false | Extract footers |
| extract_comments | boolean | false | Extract comments |
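
For reviewed documents where footers and reviewer comments carry useful context, the two flags that default to false can be enabled. A minimal sketch using only the options listed above; the chunk_size value is an arbitrary illustration.

```yaml
parsers:
  - type: DocxParser_PythonDocx
    file_include_patterns:
      - "*.docx"
    config:
      chunk_size: 1500
      chunk_strategy: sentences
      extract_headers: true
      extract_footers: true      # off by default
      extract_comments: true     # off by default
```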

DocxParser_LlamaIndex

Advanced DOCX parser using LlamaIndex with enhanced chunking.

Best for: Semantic chunking, image extraction, complex documents

parsers:
  - type: DocxParser_LlamaIndex
    file_include_patterns:
      - "*.docx"
    config:
      chunk_size: 1000
      chunk_overlap: 100
      chunk_strategy: paragraphs
      extract_metadata: true
      extract_tables: true
      extract_images: false
      preserve_formatting: true
      include_header_footer: false

Options

| Option | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| chunk_size | integer | 1000 | 100-50000 | Chunk size in characters |
| chunk_overlap | integer | 100 | 0-5000 | Overlap between chunks |
| chunk_strategy | string | paragraphs | paragraphs, sentences, semantic | Chunking strategy |
| extract_metadata | boolean | true | - | Extract document metadata |
| extract_tables | boolean | true | - | Extract tables |
| extract_images | boolean | false | - | Extract images |
| preserve_formatting | boolean | true | - | Preserve text formatting |
| include_header_footer | boolean | false | - | Include header and footer |

Markdown Parsers

MarkdownParser_Python

Markdown parser using native Python with regex parsing.

Best for: Simple Markdown files, minimal dependencies

parsers:
  - type: MarkdownParser_Python
    file_include_patterns:
      - "*.md"
      - "*.markdown"
    config:
      chunk_size: 1000
      chunk_strategy: sections
      extract_metadata: true
      extract_code_blocks: true
      extract_links: true

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | integer | 1000 | Chunk size in characters |
| chunk_strategy | string | sections | sections, paragraphs, characters |
| extract_metadata | boolean | true | Extract YAML frontmatter |
| extract_code_blocks | boolean | true | Extract code blocks |
| extract_links | boolean | true | Extract markdown links |

MarkdownParser_LlamaIndex

Advanced markdown parser using LlamaIndex with semantic chunking.

Best for: Complex Markdown, semantic chunking, table extraction

parsers:
  - type: MarkdownParser_LlamaIndex
    file_include_patterns:
      - "*.md"
      - "*.markdown"
    config:
      chunk_size: 800
      chunk_overlap: 80
      chunk_strategy: headings
      extract_metadata: true
      extract_code_blocks: true
      extract_tables: true
      extract_links: true
      preserve_structure: true

Options

| Option | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| chunk_size | integer | 1000 | 100-50000 | Chunk size in characters |
| chunk_overlap | integer | 100 | 0-5000 | Overlap between chunks |
| chunk_strategy | string | headings | headings, paragraphs, sentences, semantic | Chunking strategy |
| extract_metadata | boolean | true | - | Extract frontmatter metadata |
| extract_code_blocks | boolean | true | - | Extract code blocks separately |
| extract_tables | boolean | true | - | Extract markdown tables |
| extract_links | boolean | true | - | Extract links and references |
| preserve_structure | boolean | true | - | Preserve heading hierarchy |

Text Parsers

TextParser_Python

Text parser using native Python with encoding detection.

Best for: Plain text files, minimal dependencies

parsers:
  - type: TextParser_Python
    file_include_patterns:
      - "*.txt"
      - "*.log"
    config:
      chunk_size: 1000
      chunk_overlap: 100
      chunk_strategy: sentences
      encoding: utf-8
      clean_text: true
      extract_metadata: true

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| chunk_size | integer | 1000 | Chunk size in characters |
| chunk_overlap | integer | 100 | Overlap between chunks |
| chunk_strategy | string | sentences | sentences, paragraphs, characters |
| encoding | string | utf-8 | Text encoding (or auto-detect) |
| clean_text | boolean | true | Remove excessive whitespace |
| extract_metadata | boolean | true | Extract file statistics |

TextParser_LlamaIndex

Advanced text parser using LlamaIndex with semantic splitting and code parsing.

Best for: Semantic chunking, code files, multi-format support

parsers:
  - type: TextParser_LlamaIndex
    file_include_patterns:
      - "*.txt"
      - "*.py"
      - "*.js"
    config:
      chunk_size: 800
      chunk_overlap: 80
      chunk_strategy: semantic
      encoding: utf-8
      clean_text: true
      extract_metadata: true
      preserve_code_structure: true
      detect_language: true
      include_prev_next_rel: true

Options

| Option | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| chunk_size | integer | 1000 | 100-50000 | Chunk size in characters |
| chunk_overlap | integer | 100 | 0-5000 | Overlap between chunks |
| chunk_strategy | string | semantic | characters, sentences, paragraphs, tokens, semantic, code | Chunking strategy |
| encoding | string | utf-8 | - | Text encoding |
| clean_text | boolean | true | - | Clean extracted text |
| extract_metadata | boolean | true | - | Extract comprehensive metadata |
| semantic_buffer_size | integer | 1 | 1-10 | Buffer size for semantic chunking |
| semantic_breakpoint_percentile_threshold | integer | 95 | 50-99 | Percentile threshold for breakpoints |
| token_model | string | gpt-3.5-turbo | - | Tokenizer model for token chunking |
| preserve_code_structure | boolean | true | - | Preserve code syntax and structure |
| detect_language | boolean | true | - | Auto-detect programming language |
| include_prev_next_rel | boolean | true | - | Include relationships between chunks |
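
Prose and source code usually benefit from different strategies, so one option is to declare the parser twice with separate file patterns. The fragment below is a sketch using only options from the table above. It assumes that a lower semantic_breakpoint_percentile_threshold produces more, smaller chunks (as in LlamaIndex's semantic splitter) and that two entries of the same parser type may coexist in one parsers list.

```yaml
parsers:
  # Semantic chunking for prose, with explicit tuning values
  - type: TextParser_LlamaIndex
    file_include_patterns:
      - "*.txt"
    config:
      chunk_strategy: semantic
      semantic_buffer_size: 2                         # range 1-10, default 1
      semantic_breakpoint_percentile_threshold: 90    # range 50-99, default 95

  # Code-aware chunking for source files
  - type: TextParser_LlamaIndex
    file_include_patterns:
      - "*.py"
      - "*.js"
    config:
      chunk_strategy: code
      preserve_code_structure: true
      detect_language: true
```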

Email Parser

MsgParser_ExtractMsg

Outlook MSG file parser using extract-msg library.

Best for: Outlook emails, attachment extraction

parsers:
  - type: MsgParser_ExtractMsg
    file_include_patterns:
      - "*.msg"
    config:
      chunk_size: 1000
      chunk_overlap: 100
      chunk_strategy: email_sections
      extract_metadata: true
      extract_attachments: true
      extract_headers: true
      include_attachment_content: true
      clean_text: true
      preserve_formatting: false
      encoding: utf-8

Options

| Option | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| chunk_size | integer | 1000 | 100-50000 | Chunk size in characters |
| chunk_overlap | integer | 100 | 0-5000 | Overlap between chunks |
| chunk_strategy | string | email_sections | sentences, paragraphs, characters, email_sections | Chunking strategy |
| extract_metadata | boolean | true | - | Extract email metadata |
| extract_attachments | boolean | true | - | Extract attachments |
| extract_headers | boolean | true | - | Extract email headers |
| include_attachment_content | boolean | true | - | Include attachment content |
| clean_text | boolean | true | - | Clean text |
| preserve_formatting | boolean | false | - | Preserve HTML formatting |
| encoding | string | utf-8 | - | Text encoding |
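
When only the message body matters, attachment handling can be turned off to keep the index focused on email text. A minimal sketch using options from the table above; it assumes that disabling extract_attachments makes include_attachment_content irrelevant, which is set to false here anyway for clarity.

```yaml
parsers:
  - type: MsgParser_ExtractMsg
    file_include_patterns:
      - "*.msg"
    config:
      chunk_size: 800
      chunk_overlap: 80
      chunk_strategy: paragraphs
      extract_headers: true
      extract_attachments: false          # skip attachments entirely
      include_attachment_content: false
```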

Complete Example

Here's a comprehensive example processing multiple file types:

rag:
  data_processing_strategies:
    - name: universal_processor
      description: "Handles multiple document formats"
      parsers:
        # PDF files with LlamaIndex (try first - lower priority value)
        - type: PDFParser_LlamaIndex
          file_include_patterns:
            - "*.pdf"
            - "*.PDF"
          priority: 0
          config:
            chunk_size: 1200
            chunk_overlap: 150
            chunk_strategy: semantic
            extract_metadata: true
            extract_tables: true

        # PDF fallback with PyPDF2 (try second - higher priority value)
        - type: PDFParser_PyPDF2
          file_include_patterns:
            - "*.pdf"
          priority: 50
          fallback_parser: null
          config:
            chunk_size: 1000
            chunk_overlap: 100
            chunk_strategy: paragraphs

        # Markdown files
        - type: MarkdownParser_LlamaIndex
          file_include_patterns:
            - "*.md"
            - "*.markdown"
          priority: 0
          config:
            chunk_size: 800
            chunk_overlap: 80
            chunk_strategy: headings
            extract_code_blocks: true

        # CSV files
        - type: CSVParser_Pandas
          file_include_patterns:
            - "*.csv"
          priority: 0
          config:
            chunk_size: 500
            chunk_strategy: rows
            extract_metadata: true

        # Excel files
        - type: ExcelParser_LlamaIndex
          file_include_patterns:
            - "*.xlsx"
            - "*.xls"
          priority: 0
          config:
            chunk_size: 500
            chunk_strategy: rows
            combine_sheets: false

        # Word documents
        - type: DocxParser_LlamaIndex
          file_include_patterns:
            - "*.docx"
          priority: 0
          config:
            chunk_size: 1000
            chunk_overlap: 100
            chunk_strategy: paragraphs
            extract_tables: true

        # Plain text and code
        - type: TextParser_LlamaIndex
          file_include_patterns:
            - "*.txt"
            - "*.py"
            - "*.js"
            - "*.html"
          priority: 0
          config:
            chunk_size: 800
            chunk_overlap: 80
            chunk_strategy: semantic
            preserve_code_structure: true

        # Outlook emails
        - type: MsgParser_ExtractMsg
          file_include_patterns:
            - "*.msg"
          priority: 0
          config:
            chunk_strategy: email_sections
            extract_attachments: true

Chunking Strategy Guidelines

| Strategy | Best For | Considerations |
| --- | --- | --- |
| sentences | General text, documentation | Good balance of granularity |
| paragraphs | Articles, reports | Preserves natural breaks |
| characters | Fixed-size needs | Predictable chunk sizes |
| semantic | Technical docs, varied content | Content-aware splitting |
| sections / headings | Markdown, structured docs | Respects document structure |
| pages | PDFs, page-based docs | One chunk per page |
| rows | CSV, Excel | Data-oriented chunking |
| code | Source code files | Preserves syntax |
| email_sections | Emails | Headers/body/signature |

Next Steps