ragfs_extract/lib.rs
//! # ragfs-extract
//!
//! Content extraction from various file formats for the RAGFS indexing pipeline.
//!
//! This crate provides the extraction layer that reads files and produces
//! [`ExtractedContent`](ragfs_core::ExtractedContent) for downstream chunking and embedding.
//!
//! ## Supported Formats
//!
//! | Extractor | Formats | Features |
//! |-----------|---------|----------|
//! | [`TextExtractor`] | `.txt`, `.md`, `.rs`, `.py`, `.js`, `.ts`, `.go`, `.java`, `.json`, `.yaml`, `.toml`, `.xml`, `.html`, `.css`, and 30+ more | UTF-8 text extraction |
//! | [`PdfExtractor`] | `.pdf` | Text extraction + embedded images (JPEG, PNG, JPEG2000) |
//! | [`ImageExtractor`] | `.png`, `.jpg`, `.gif`, `.webp`, `.bmp` | Metadata extraction, optional vision captioning |
//!
//! ## Usage
//!
//! ```rust,ignore
//! use ragfs_extract::{ExtractorRegistry, TextExtractor, PdfExtractor, ImageExtractor};
//! use std::path::Path;
//!
//! // Create a registry with all extractors
//! let mut registry = ExtractorRegistry::new();
//! registry.register("text", TextExtractor);
//! registry.register("pdf", PdfExtractor::new());
//! registry.register("image", ImageExtractor::new(None));
//!
//! // Extract content from a file
//! let content = registry.extract(Path::new("document.pdf"), "application/pdf").await?;
//! println!("Extracted {} bytes", content.text.len());
//! ```
//!
//! ## PDF Image Extraction
//!
//! The [`PdfExtractor`] can extract embedded images from PDF documents:
//!
//! - **Supported formats**: JPEG (`DCTDecode`), PNG (`FlateDecode`), JPEG2000 (`JPXDecode`)
//! - **Color spaces**: RGB, Grayscale, CMYK (auto-converted to RGB)
//! - **Limits**: 100 images max, 50 MB total, 50 px minimum dimension
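//!
//! As a rough illustration of how limits like these could be applied, the sketch below
//! filters candidate images in document order. It is not this crate's implementation:
//! `Candidate`, `select_images`, and the constants are hypothetical names. It also assumes
//! that 50 MB means 50 MiB, that the 50 px minimum applies to both width and height, and
//! that an image that would exceed the byte budget is skipped rather than ending extraction.
//!
//! ```rust
//! /// Hypothetical limits mirroring the numbers listed above.
//! const MAX_IMAGES: usize = 100;
//! const MAX_TOTAL_BYTES: usize = 50 * 1024 * 1024; // assumption: 50 MB read as 50 MiB
//! const MIN_DIMENSION: u32 = 50;
//!
//! /// Hypothetical stand-in for an image found inside a PDF.
//! pub struct Candidate {
//!     pub width: u32,
//!     pub height: u32,
//!     pub bytes: usize,
//! }
//!
//! /// Returns the indices of the candidates that fit within the limits.
//! pub fn select_images(candidates: &[Candidate]) -> Vec<usize> {
//!     let mut kept = Vec::new();
//!     let mut total_bytes = 0usize;
//!     for (index, candidate) in candidates.iter().enumerate() {
//!         // Stop once the image-count limit is reached.
//!         if kept.len() >= MAX_IMAGES {
//!             break;
//!         }
//!         // Assumption: an image is too small if either side is below the minimum.
//!         if candidate.width < MIN_DIMENSION || candidate.height < MIN_DIMENSION {
//!             continue;
//!         }
//!         // Assumption: an image that would exceed the byte budget is skipped.
//!         if total_bytes + candidate.bytes > MAX_TOTAL_BYTES {
//!             continue;
//!         }
//!         total_bytes += candidate.bytes;
//!         kept.push(index);
//!     }
//!     kept
//! }
//! ```
//!
//! Under these assumptions, a 10x200 image is dropped for its width, and a 20 MiB image
//! that follows an accepted 40 MiB image is dropped for exceeding the byte budget.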
//!
//! ## Vision Captioning
//!
//! The [`ImageExtractor`] supports optional vision-based captioning via the
//! [`ImageCaptioner`] trait. A [`PlaceholderCaptioner`] is provided as a no-op default.
//!
//! ## Components
//!
//! | Type | Description |
//! |------|-------------|
//! | [`ExtractorRegistry`] | Routes files to appropriate extractors by MIME type |
//! | [`TextExtractor`] | Handles text-based files (40+ types) |
//! | [`PdfExtractor`] | PDF text and image extraction |
//! | [`ImageExtractor`] | Image metadata and optional captioning |
//! | [`ImageCaptioner`] | Trait for vision model integration |
//! | [`PlaceholderCaptioner`] | No-op captioner implementation |

pub mod image;
pub mod pdf;
pub mod registry;
pub mod text;
pub mod vision;

pub use image::ImageExtractor;
pub use pdf::PdfExtractor;
#[cfg(feature = "pdf_oxide")]
pub use pdf::PdfOxideExtractor;
pub use registry::ExtractorRegistry;
pub use text::TextExtractor;
#[cfg(feature = "vision")]
pub use vision::BlipCaptioner;
pub use vision::{CaptionConfig, CaptionError, ImageCaptioner, PlaceholderCaptioner};