pdf2struct extracts structured JSON from PDF documents.
Supports:
- text extraction
- metadata extraction
- table extraction
- OCR fallback (optional)
- invoice-like key-value extraction

Install:

    pip install pdf2struct

OCR support:

    pip install pdf2struct[ocr]

Extract PDF:

    pdf2struct input.pdf --out output.json

Extract with OCR:

    pdf2struct input.pdf --ocr --out output.json

Output:

- metadata
- pages (text + tables)
- detected_fields (invoice key-value pairs)
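
The commands and output keys above can be tied together from Python. The following is a minimal sketch, not part of pdf2struct itself: it shells out to the CLI using only the `--ocr` and `--out` flags shown above, then reads the resulting JSON. The helper names (`build_command`, `extract`, `summarize`) are hypothetical, and the exact JSON layout is an assumption: `pages` is taken to be a list of objects with `text` and `tables` keys, and `detected_fields` a mapping of field names to values. Adjust the key access if the real output differs.

```python
import json
import subprocess
from pathlib import Path


def build_command(pdf_path, out_path, ocr=False):
    """Build the pdf2struct command line using only the flags documented above."""
    cmd = ["pdf2struct", str(pdf_path)]
    if ocr:
        cmd.append("--ocr")
    cmd += ["--out", str(out_path)]
    return cmd


def extract(pdf_path, out_path, ocr=False):
    """Run the CLI and return the parsed JSON (requires pdf2struct on PATH)."""
    subprocess.run(build_command(pdf_path, out_path, ocr), check=True)
    return json.loads(Path(out_path).read_text(encoding="utf-8"))


def summarize(doc):
    """Summarize a result dict. ASSUMED layout: see the paragraph above."""
    pages = doc.get("pages", [])
    return {
        "page_count": len(pages),
        # Assumes each page carries a "tables" list; missing keys count as zero.
        "table_count": sum(len(page.get("tables", [])) for page in pages),
        # Assumes detected_fields is a mapping of field name to value.
        "fields": dict(doc.get("detected_fields", {})),
    }
```

With pdf2struct installed, `summarize(extract("input.pdf", "output.json", ocr=True))` would run the OCR variant and report page, table, and field counts under the assumptions stated above.
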

License: MIT