Document Extraction Fixtures

PDF and TXT fixtures for layout parsing, OCR-style extraction, protected-document handling, and text normalization workflows.

Why This Workflow Matters

  • Mix clean layout PDFs, scan-style pages, protected files, and damaged documents in one extraction suite.
  • Pair PDF extraction cases with TXT encoding fixtures to validate plain-text fallback and normalization.
  • Use the extraction pack to set up parser, OCR, and field-mapping tests in a repeatable way.

Recommended Packs

Document Extraction Fixture Pack

Bundle of real PDF and TXT fixtures for extraction, layout parsing, OCR-style validation, protected-document handling, and damaged-file workflows.

document_extraction_fixture_pack.zip · 16.8 KB

Image Extraction Fixture Pack

Bundle of real PNG, JPEG, TIFF, and scan-style PDF fixtures for OCR, scan ingestion, and document-photo extraction workflows.

image_extraction_fixture_pack.zip · 382.3 KB

Fixture Matrices

PDF Extraction Fixture Matrix

Use the PDF matrix to choose between text-heavy, layout-driven, form-like, and damaged fixtures for preview and extraction pipelines.
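As a rough illustration of how a pipeline might sort PDF fixtures before extraction, the sketch below labels a file as damaged, protected, or readable using byte-level markers only. The function names (triage_pdf_bytes, triage_pdf) are hypothetical, and the checks are heuristics rather than a full PDF parse: a real extraction library should make the final call.

```python
from pathlib import Path


def triage_pdf_bytes(data: bytes) -> str:
    """Return a coarse label: 'damaged', 'protected', or 'readable'.

    Heuristic only (an assumption of this sketch): it inspects raw bytes
    and does not parse the PDF object structure.
    """
    # A PDF is expected to carry its "%PDF-" header near the start of the file.
    if b"%PDF-" not in data[:1024]:
        return "damaged"
    # A complete file normally ends with an "%%EOF" marker near the tail.
    if b"%%EOF" not in data[-1024:]:
        return "damaged"
    # Encrypted PDFs reference an /Encrypt dictionary from the trailer.
    if b"/Encrypt" in data:
        return "protected"
    return "readable"


def triage_pdf(path: str) -> str:
    """Read a fixture from disk and triage it."""
    return triage_pdf_bytes(Path(path).read_bytes())
```

A test suite could call triage_pdf("pdf_password_protected_sample.pdf") and route the file to a password-handling branch when the label is "protected", while files labeled "damaged" go to error-handling checks instead of the normal extraction path.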

TXT Encoding Fixture Matrix

Choose TXT fixtures for smoke tests, encoding detection, newline handling, long-line stress, and text-processing validation.
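Encoding detection and newline handling can be exercised with a small helper like the one below. This is a minimal sketch using only the Python standard library: it picks a codec from a byte-order mark (BOM) when one is present and otherwise falls back to a caller-supplied encoding. The names decode_text and normalize_newlines are hypothetical, and whether a given fixture carries a BOM is an assumption to verify against the actual file.

```python
import codecs

# BOM prefixes mapped to codecs that consume the BOM while decoding.
# UTF-32 entries come first because the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_text(data: bytes, fallback: str = "utf-8") -> tuple[str, str]:
    """Decode bytes to text and report which codec was used.

    Raises UnicodeDecodeError if the chosen codec cannot decode the data.
    """
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return data.decode(codec), codec
    # No BOM found: rely on the caller's fallback encoding.
    return data.decode(fallback), fallback


def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

For a file such as txt_utf16le_sample.txt, a test might read the bytes, call decode_text, and then compare normalize_newlines(text) against an expected string. If the file has no BOM, passing fallback="utf-16-le" explicitly is one option, since BOM-less UTF-16 cannot be identified by this sketch.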

Suggested Fixtures

Filename                              Format  Size
pdf_invoice_layout_sample.pdf         PDF     774 B
pdf_scan_like_image_sample.pdf        PDF     3.7 KB
pdf_ocr_noise_sample.pdf              PDF     7.9 KB
pdf_multi_column_report_sample.pdf    PDF     3.3 KB
pdf_password_protected_sample.pdf     PDF     3.2 KB
txt_utf8_multilingual_sample.txt      TXT     94 B
txt_utf16le_sample.txt                TXT     176 B