Module pdf

PDF content extractor.

Uses pdf-extract to extract text content and lopdf to extract embedded images.

Structs§

PdfExtractor: Extractor for PDF files.
PdfOxideExtractor: Alternative PDF extractor using the pdf_oxide library.

Constants§

MAX_IMAGES 🔒: Configuration for image extraction limits.
MAX_TOTAL_BYTES 🔒
MIN_DIMENSION 🔒

Functions§

build_elements 🔒: Build ContentElements from extracted text.
cmyk_to_rgb 🔒: Convert CMYK bytes to RGB.
decode_flate_image 🔒: Decode FlateDecode compressed image to PNG.
decode_pdf_image 🔒: Decode a PDF image into ExtractedImage format.
estimate_page_count 🔒: Estimate page count from text.
extract_pdf_images 🔒: Extract images from PDF document using lopdf.
extract_pdf_text 🔒: Extract text from PDF bytes using pdf-extract.
extract_with_pdf_oxide 🔒
looks_like_heading 🔒: Heuristic to detect if text looks like a heading.
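
As an illustration of the kind of conversion cmyk_to_rgb performs, here is a minimal sketch of a naive CMYK-to-RGB conversion. The signature, the interleaved 8-bit layout, and the formula are assumptions made for this example; the module's private implementation is not shown in this listing and may differ (for instance in how it treats inverted CMYK data or colour profiles).

```rust
/// Hypothetical sketch: convert interleaved 8-bit CMYK samples
/// (C, M, Y, K, C, M, Y, K, ...) to interleaved 8-bit RGB.
///
/// Uses the naive device formula `channel = (255 - ink) * (255 - K) / 255`
/// and ignores ICC colour profiles. Trailing bytes that do not form a
/// complete 4-byte pixel are dropped.
fn cmyk_to_rgb(cmyk: &[u8]) -> Vec<u8> {
    let mut rgb = Vec::with_capacity(cmyk.len() / 4 * 3);
    for px in cmyk.chunks_exact(4) {
        // Amount of light left after the black (K) ink is applied.
        let k = 255 - u32::from(px[3]);
        // Cyan controls red, magenta controls green, yellow controls blue.
        for &ink in &px[..3] {
            let value = (255 - u32::from(ink)) * k / 255;
            rgb.push(value as u8); // value is always in 0..=255
        }
    }
    rgb
}
```

With this sketch, a pixel with no ink, `[0, 0, 0, 0]`, maps to white, and a pixel with full black, `[0, 0, 0, 255]`, maps to black. Integer arithmetic in `u32` avoids overflow in the intermediate product.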