Kubenew/pdf2struct


pdf2struct


pdf2struct extracts structured JSON from PDF documents.

Supports:

  • text extraction
  • metadata extraction
  • table extraction
  • OCR fallback (optional)
  • invoice-like key-value extraction

Install

pip install pdf2struct

With optional OCR support (the quotes keep shells such as zsh from expanding the brackets):

pip install "pdf2struct[ocr]"

CLI

Extract a PDF to JSON:

pdf2struct input.pdf --out output.json

Extract with OCR:

pdf2struct input.pdf --ocr --out output.json
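
For batch runs, the CLI calls above can be scripted. The following is a minimal sketch, not part of pdf2struct itself: it assumes the `pdf2struct` command is on your PATH and uses only the flags shown above (`--ocr`, `--out`). The helper names `build_command` and `convert_folder` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_path, ocr=False):
    """Build the argument list for one pdf2struct CLI call.

    Hypothetical helper: it only assembles the flags documented in this README.
    """
    cmd = ["pdf2struct", str(pdf_path)]
    if ocr:
        cmd.append("--ocr")
    cmd += ["--out", str(out_path)]
    return cmd


def convert_folder(folder, ocr=False):
    """Run the CLI on every *.pdf in `folder`, writing <name>.json beside it."""
    outputs = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        out = pdf.with_suffix(".json")
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_command(pdf, out, ocr=ocr), check=True)
        outputs.append(out)
    return outputs
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.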

Output JSON structure

  • metadata
  • pages (text + tables)
  • detected_fields (invoice key-value pairs)
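
The output file can be inspected with the standard `json` module. The sketch below is illustrative only: the top-level keys `metadata`, `pages`, and `detected_fields` come from the list above, but the exact schema is not documented here, so the code assumes that each page is an object with `text` and `tables` entries and that `detected_fields` is a key-value mapping. The function names `summarize` and `summarize_file` are hypothetical.

```python
import json


def summarize(doc):
    """Return a small summary of a pdf2struct-style result dict.

    Assumed layout (not confirmed by the README): each page is a dict
    with a "text" string and a "tables" list; "detected_fields" is a dict.
    """
    pages = doc.get("pages", [])
    return {
        "metadata_keys": sorted(doc.get("metadata", {})),
        "page_count": len(pages),
        "table_count": sum(len(p.get("tables", [])) for p in pages),
        "char_count": sum(len(p.get("text", "")) for p in pages),
        "detected_fields": dict(doc.get("detected_fields", {})),
    }


def summarize_file(path):
    """Load a JSON file written with --out and summarize it."""
    with open(path, encoding="utf-8") as fh:
        return summarize(json.load(fh))
```

Using `.get` with defaults keeps the summary from failing if a key is missing, for example when a PDF has no tables or no invoice-like fields were detected.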

License

MIT