pdf2struct

pdf2struct extracts structured JSON from PDF documents.

Supports:

text extraction
metadata extraction
table extraction
OCR fallback (optional)
invoice-like key-value extraction

Install

pip install pdf2struct

OCR support:

pip install "pdf2struct[ocr]"

CLI

Extract PDF:

pdf2struct input.pdf --out output.json

Extract with OCR:

pdf2struct input.pdf --ocr --out output.json
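
For batch jobs, the two commands above can be driven from Python. This is a minimal sketch that assumes only the `--ocr` and `--out` flags shown above and a `pdf2struct` executable on the PATH. The helper names `build_command` and `extract` are hypothetical and not part of pdf2struct.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_path, ocr=False):
    """Build the argument list for the pdf2struct CLI invocation shown above."""
    cmd = ["pdf2struct", str(pdf_path)]
    if ocr:
        cmd.append("--ocr")
    cmd += ["--out", str(out_path)]
    return cmd


def extract(pdf_path, out_dir, ocr=False):
    """Run pdf2struct on one PDF and write <stem>.json into out_dir."""
    out_path = Path(out_dir) / (Path(pdf_path).stem + ".json")
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    subprocess.run(build_command(pdf_path, out_path, ocr), check=True)
    return out_path
```

Calling `extract(p, "out")` for each `p` in `Path("invoices").glob("*.pdf")` would process a whole folder. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.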

Output JSON structure

metadata
pages (text + tables)
detected_fields (invoice key-value pairs)
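
The three top-level keys listed above can be read back with the standard `json` module. The sketch below is illustrative only: the inner layout (each page as an object with `text` and `tables`, and `detected_fields` as a flat mapping) is an assumption, and the sample data is made up rather than real tool output. Check an actual `output.json` before relying on these names.

```python
import json


def summarize(doc):
    """Return a small summary of a pdf2struct-style result dict."""
    pages = doc.get("pages", [])
    return {
        "page_count": len(pages),
        # Assumed: each page is a dict with "text" (str) and "tables" (list).
        "char_count": sum(len(p.get("text", "")) for p in pages),
        "table_count": sum(len(p.get("tables", [])) for p in pages),
        # Assumed: detected_fields maps field names to extracted values.
        "fields": sorted(doc.get("detected_fields", {})),
    }


# Made-up sample in the assumed shape, not real tool output.
sample = json.loads("""
{
  "metadata": {"title": "Example"},
  "pages": [
    {"text": "Invoice 42", "tables": [[["Item", "Qty"], ["Widget", "2"]]]},
    {"text": "Thank you", "tables": []}
  ],
  "detected_fields": {"invoice_number": "42", "total": "10.00"}
}
""")

print(summarize(sample))
```

For a real run, replace the sample with `json.load(open("output.json"))`. The `.get` calls with defaults keep the summary from failing if a key is missing.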

License

MIT
