PDF content extractor.
Uses pdf-extract to extract text content and lopdf for embedded images.
Structs
- PdfExtractor - Extractor for PDF files.
- PdfOxideExtractor - Alternative PDF extractor using the pdf_oxide library.
Constants
- MAX_IMAGES 🔒 - Configuration for image extraction limits.
- MAX_TOTAL_BYTES 🔒
- MIN_DIMENSION 🔒
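How the three limits might interact can be sketched as follows. This is only an illustration under stated assumptions: the page does not show the constants' types or values, so the numbers below are placeholders, and `Candidate` and `select_images` are hypothetical names that do not appear in the module.

```rust
// Placeholder values and types: the real constants are not shown on this page.
const MAX_IMAGES: usize = 50;
const MAX_TOTAL_BYTES: usize = 20 * 1024 * 1024;
const MIN_DIMENSION: u32 = 32;

/// Hypothetical descriptor for an image found in a PDF.
struct Candidate {
    width: u32,
    height: u32,
    len: usize, // encoded size in bytes
}

/// Return the indices of the candidates that fit within the limits.
fn select_images(candidates: &[Candidate]) -> Vec<usize> {
    let mut kept = Vec::new();
    let mut total = 0usize;
    for (i, c) in candidates.iter().enumerate() {
        // Stop once the image-count cap is reached.
        if kept.len() >= MAX_IMAGES {
            break;
        }
        // Skip tiny images (icons, rules, spacer pixels).
        if c.width < MIN_DIMENSION || c.height < MIN_DIMENSION {
            continue;
        }
        // Skip anything that would push the running total over the byte budget.
        if total.saturating_add(c.len) > MAX_TOTAL_BYTES {
            continue;
        }
        total += c.len;
        kept.push(i);
    }
    kept
}
```

In this sketch an oversized image is skipped rather than ending the scan, so smaller images later in the document can still be kept. Whether the real extractor skips or stops is not stated here.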
Functions
- build_elements 🔒 - Build ContentElements from extracted text.
- cmyk_to_rgb 🔒 - Convert CMYK bytes to RGB.
- decode_flate_image 🔒 - Decode FlateDecode compressed image to PNG.
- decode_pdf_image 🔒 - Decode a PDF image into ExtractedImage format.
- estimate_page_count 🔒 - Estimate page count from text.
- extract_pdf_images 🔒 - Extract images from PDF document using lopdf.
- extract_pdf_text 🔒 - Extract text from PDF bytes using pdf-extract.
- extract_with_pdf_oxide 🔒
- looks_like_heading 🔒 - Heuristic to detect if text looks like a heading.
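As an illustration of the CMYK to RGB step listed above, here is a minimal sketch of the common naive device conversion, R = 255 * (1 - C) * (1 - K), with no ICC profile applied. The function name `cmyk_to_rgb_sketch` and its signature are assumptions; the module's private `cmyk_to_rgb` may use a different formula or byte layout.

```rust
/// Convert interleaved 8-bit CMYK samples (4 bytes per pixel) to
/// interleaved RGB (3 bytes per pixel).
/// Assumes 0 means "no ink" and 255 means "full ink"; trailing bytes that
/// do not form a complete pixel are ignored.
fn cmyk_to_rgb_sketch(cmyk: &[u8]) -> Vec<u8> {
    let mut rgb = Vec::with_capacity(cmyk.len() / 4 * 3);
    for px in cmyk.chunks_exact(4) {
        // Light remaining after the black channel is applied.
        let k = 255 - px[3] as u32;
        // C, M, Y map to R, G, B respectively.
        for &ink in &px[..3] {
            let v = (255 - ink as u32) * k / 255;
            rgb.push(v as u8);
        }
    }
    rgb
}
```

For example, `cmyk_to_rgb_sketch(&[255, 0, 0, 0])` yields `[0, 255, 255]` (pure cyan), and any pixel with K = 255 maps to black.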