Module pdf

PDF content extractor.

Uses pdf-extract to extract text content and lopdf for embedded images.

Structs

PdfExtractor
Extractor for PDF files.
PdfOxideExtractor
Alternative PDF extractor using the pdf_oxide library.

Constants

MAX_IMAGES 🔒
Configuration for image extraction limits.
MAX_TOTAL_BYTES 🔒
MIN_DIMENSION 🔒
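
The three constants above are private and their values and types are not shown on this page. As a rough illustration of how such limits might gate image extraction, here is a minimal sketch. The numeric values, the `ImageInfo` struct, and `select_images_sketch` are all hypothetical and are not taken from the module.

```rust
// ASSUMPTION: these values and types are illustrative only; the module's
// real constants are private and not documented on this page.
const MAX_IMAGES: usize = 50;
const MAX_TOTAL_BYTES: usize = 20 * 1024 * 1024;
const MIN_DIMENSION: u32 = 32;

/// Hypothetical summary of a candidate image found in a PDF.
struct ImageInfo {
    width: u32,
    height: u32,
    bytes: usize,
}

/// Return the indices of the candidates that pass all three limits.
fn select_images_sketch(candidates: &[ImageInfo]) -> Vec<usize> {
    let mut kept = Vec::new();
    let mut total = 0usize;
    for (i, img) in candidates.iter().enumerate() {
        // Stop once the image-count cap is reached.
        if kept.len() >= MAX_IMAGES {
            break;
        }
        // Skip tiny images (icons, spacer pixels, rules).
        if img.width < MIN_DIMENSION || img.height < MIN_DIMENSION {
            continue;
        }
        // Skip images that would push the running total over the byte budget.
        if total + img.bytes > MAX_TOTAL_BYTES {
            continue;
        }
        total += img.bytes;
        kept.push(i);
    }
    kept
}

fn main() {
    let candidates = vec![
        ImageInfo { width: 100, height: 100, bytes: 1_000 },
        ImageInfo { width: 10, height: 200, bytes: 1_000 },
        ImageInfo { width: 64, height: 64, bytes: 30 * 1024 * 1024 },
        ImageInfo { width: 64, height: 64, bytes: 500 },
    ];
    println!("{:?}", select_images_sketch(&candidates));
}
```

Whether an oversized image should be skipped or should end extraction entirely is a design choice; this sketch skips it so that later, smaller images can still be kept.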

Functions

build_elements 🔒
Build ContentElements from extracted text.
cmyk_to_rgb 🔒
Convert CMYK bytes to RGB.
decode_flate_image 🔒
Decode FlateDecode compressed image to PNG.
decode_pdf_image 🔒
Decode a PDF image into ExtractedImage format.
estimate_page_count 🔒
Estimate page count from text.
extract_pdf_images 🔒
Extract images from PDF document using lopdf.
extract_pdf_text 🔒
Extract text from PDF bytes using pdf-extract.
extract_with_pdf_oxide 🔒
looks_like_heading 🔒
Heuristic to detect if text looks like a heading.
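
The signatures and bodies of these private helpers are not shown on this page. To make two of the one-line descriptions concrete, the sketch below shows one common way to convert CMYK to RGB and one plausible heading heuristic. Both functions, their names (`cmyk_to_rgb_sketch`, `looks_like_heading_sketch`), their signatures, and the specific rules are assumptions for illustration, not the module's actual implementation. The color conversion uses the naive formula R = (255 - C) * (255 - K) / 255, with the same form for G and B, and ignores color profiles.

```rust
/// Hypothetical sketch: convert interleaved 8-bit CMYK samples to RGB
/// using the naive, profile-free formula. Trailing bytes that do not
/// form a full 4-byte pixel are ignored.
fn cmyk_to_rgb_sketch(cmyk: &[u8]) -> Vec<u8> {
    let mut rgb = Vec::with_capacity(cmyk.len() / 4 * 3);
    for px in cmyk.chunks_exact(4) {
        let k = 255 - u32::from(px[3]);
        for &ink in &px[..3] {
            // (255 - ink) * (255 - K) / 255 is at most 255, so the cast is safe.
            let v = (255 - u32::from(ink)) * k / 255;
            rgb.push(v as u8);
        }
    }
    rgb
}

/// Hypothetical sketch: treat a short line as a heading when it is
/// all caps or starts with a digit (e.g. a section number), and does
/// not end like a sentence.
fn looks_like_heading_sketch(line: &str) -> bool {
    let t = line.trim();
    if t.is_empty() || t.chars().count() > 80 {
        return false;
    }
    if t.ends_with('.') || t.ends_with(',') {
        return false;
    }
    let letters: Vec<char> = t.chars().filter(|c| c.is_alphabetic()).collect();
    if letters.is_empty() {
        return false;
    }
    let all_caps = letters.iter().all(|c| c.is_uppercase());
    let numbered = t.starts_with(|c: char| c.is_ascii_digit())
        && t.split_whitespace().count() <= 10;
    all_caps || numbered
}

fn main() {
    // One pure-cyan pixel followed by one pure-black pixel.
    println!("{:?}", cmyk_to_rgb_sketch(&[255, 0, 0, 0, 0, 0, 0, 255]));
    println!("{}", looks_like_heading_sketch("1.2 Methods"));
}
```

A heuristic like this will misclassify some lines (for example, a short sentence beginning with a number), which is the usual trade-off when the text extractor exposes no font or layout information.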