Document Extraction Fixtures

PDF and TXT fixtures for layout parsing, OCR-style extraction, protected-document handling, and text normalization workflows.

Use workflow pages to move from a job to the exact fixtures, packs, and supporting references.
Why This Workflow Matters

About This Workflow

  • Mix clean layout PDFs, scan-style pages, protected files, and damaged documents in one extraction suite.
  • Pair PDF extraction cases with TXT encoding fixtures to validate plain-text fallback and normalization.
  • Use the extraction pack for repeatable parser, OCR, and field-mapping setup.
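
As a rough illustration of the plain-text fallback and normalization step mentioned above, the sketch below decodes raw TXT bytes by sniffing a byte-order mark, then normalizes newlines and Unicode form. The function names `decode_fixture` and `normalize_text` are hypothetical, and the page does not say whether the UTF-16LE fixture carries a BOM; if it does not, the caller would need to pass `fallback="utf-16-le"` explicitly.

```python
import codecs
import unicodedata

# BOMs are checked longest-first so a UTF-32 LE mark (FF FE 00 00)
# is not mistaken for a UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_fixture(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode fixture bytes using a BOM if present, else the fallback encoding."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    return raw.decode(fallback)


def normalize_text(text: str) -> str:
    """Convert CRLF and bare CR to LF, then apply Unicode NFC normalization."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFC", text)
```

A test could then compare `normalize_text(decode_fixture(raw))` for the UTF-8 and UTF-16LE fixtures against the text the PDF extractor is expected to produce.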
Recommended Packs

Fixture Packs

Document Extraction Fixture Pack

Bundle of real PDF and TXT fixtures for extraction, layout parsing, OCR-style validation, protected-document handling, and damaged-file workflows.

document_extraction_fixture_pack.zip · 18.9 KB

Image Extraction Fixture Pack

Bundle of real PNG, JPEG, TIFF, and scan-style PDF fixtures for OCR, scan ingestion, and document-photo extraction workflows.

image_extraction_fixture_pack.zip · 382.3 KB

Fixture Matrices

PDF Extraction Fixture Matrix

Use the PDF matrix to choose between text-heavy, layout-driven, form-like, and damaged fixtures for preview and extraction pipelines.
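
Before routing a fixture into a pipeline, a coarse byte-level pre-check can separate damaged and protected files from ordinary ones. The sketch below is a heuristic only, not a PDF parser: `triage_pdf_bytes` is a hypothetical helper, and the three labels it returns are assumptions chosen for illustration rather than categories defined by the matrix.

```python
def triage_pdf_bytes(raw: bytes) -> str:
    """Return a coarse label for PDF bytes: 'damaged', 'protected', or 'ok'.

    Heuristic only: it inspects the header, the end-of-file marker, and the
    presence of an /Encrypt key, without parsing the document structure.
    """
    # A well-formed PDF starts with a "%PDF-" header.
    if not raw.lstrip().startswith(b"%PDF-"):
        return "damaged"
    # The "%%EOF" marker is expected near the end of the file.
    if b"%%EOF" not in raw[-1024:]:
        return "damaged"
    # Encrypted PDFs reference an encryption dictionary via /Encrypt.
    if b"/Encrypt" in raw:
        return "protected"
    return "ok"
```

A real extraction suite would still need a full parser to confirm these labels; the pre-check only decides which test path a file is sent down first.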

TXT Encoding Fixture Matrix

Choose TXT fixtures for smoke tests, encoding detection, newline handling, long-line stress, and text-processing validation.
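
For the newline-handling and long-line checks, a small profiling helper is often enough to assert on. The following is a minimal sketch with hypothetical function names; it works on already-decoded text and makes no assumption about which fixtures use which newline style.

```python
import re
from collections import Counter


def newline_profile(text: str) -> Counter:
    """Count each newline style; CRLF is matched first so it is not split into CR + LF."""
    return Counter(re.findall(r"\r\n|\r|\n", text))


def longest_line(text: str) -> int:
    """Length of the longest line in characters (0 for empty text).

    Note: str.splitlines() also breaks on a few other separators,
    such as form feed and U+2028.
    """
    return max((len(line) for line in text.splitlines()), default=0)
```

A smoke test might assert that a fixture uses a single newline style, or that `longest_line` stays under whatever buffer limit the parser under test imposes.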

Suggested Fixtures

Files

Filename                            Format  Size    SHA256 (truncated)
pdf_invoice_layout_sample.pdf       PDF     774 B   45c10f35ba18...
pdf_scan_like_image_sample.pdf      PDF     3.7 KB  22a2cb26d64c...
pdf_ocr_noise_sample.pdf            PDF     7.9 KB  19097c94fe1a...
pdf_multi_column_report_sample.pdf  PDF     3.3 KB  6c5d36e07e3d...
pdf_password_protected_sample.pdf   PDF     3.2 KB  37f22291ff8b...
txt_utf8_multilingual_sample.txt    TXT     94 B    1e219cd0bddf...
txt_utf16le_sample.txt              TXT     176 B   9033cba7c418...
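
The SHA256 values listed for each file are truncated, so a downloaded fixture can only be checked against the visible prefix. The sketch below shows one way to do that; `sha256_hex` and `matches_prefix` are hypothetical helper names, and a prefix match is weaker evidence than a comparison against the full digest.

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path: Path, expected_prefix: str) -> bool:
    """True if the file's digest starts with the given hex prefix.

    Strip any trailing "..." from the listed value before passing it in.
    """
    return sha256_hex(path).startswith(expected_prefix.lower())
```

For example, `matches_prefix(Path("pdf_invoice_layout_sample.pdf"), "45c10f35ba18")` would check the first listed fixture, assuming the file sits in the current directory.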