Crate ragfs_extract

§ragfs-extract

Content extraction from various file formats for the RAGFS indexing pipeline.

This crate provides the extraction layer: it reads files and produces `ExtractedContent` for downstream chunking and embedding.

§Supported Formats

Extractor	Formats	Features
`TextExtractor`	`.txt`, `.md`, `.rs`, `.py`, `.js`, `.ts`, `.go`, `.java`, `.json`, `.yaml`, `.toml`, `.xml`, `.html`, `.css`, and 30+ more	UTF-8 text extraction
`PdfExtractor`	`.pdf`	Text extraction + embedded images (JPEG, PNG, JPEG2000)
`ImageExtractor`	`.png`, `.jpg`, `.gif`, `.webp`, `.bmp`	Metadata extraction, optional vision captioning
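
The usage example below passes a MIME type alongside the path, so something has to turn a file extension into a MIME type first. How RAGFS itself does this is not shown on this page; the following is only a minimal sketch covering a handful of the extensions from the table. The function name `guess_mime` and the choice of MIME strings are assumptions made for illustration.

```rust
use std::path::Path;

/// Hypothetical helper (not part of ragfs_extract): maps a few of the
/// extensions listed in the table to commonly used MIME types.
/// Returns `None` for unknown or missing extensions.
pub fn guess_mime(path: &Path) -> Option<&'static str> {
    // Compare case-insensitively so `PHOTO.JPG` and `photo.jpg` agree.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    let mime = match ext.as_str() {
        "txt" => "text/plain",
        "md" => "text/markdown",
        "html" => "text/html",
        "css" => "text/css",
        "json" => "application/json",
        "pdf" => "application/pdf",
        "png" => "image/png",
        "jpg" | "jpeg" => "image/jpeg",
        "gif" => "image/gif",
        "webp" => "image/webp",
        "bmp" => "image/bmp",
        _ => return None,
    };
    Some(mime)
}
```

Calling `guess_mime(Path::new("document.pdf"))` yields `Some("application/pdf")`, which is the kind of value the registry's `extract` call expects as its second argument.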

§Usage

use ragfs_extract::{ExtractorRegistry, TextExtractor, PdfExtractor, ImageExtractor};
use std::path::Path;

// Create a registry with all extractors
let mut registry = ExtractorRegistry::new();
registry.register("text", TextExtractor);
registry.register("pdf", PdfExtractor::new());
registry.register("image", ImageExtractor::new(None));

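// Extract content from a file. `extract` is async and fallible, so this call
// must sit inside an async function whose return type is a `Result`.
let content = registry.extract(Path::new("document.pdf"), "application/pdf").await?;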
println!("Extracted {} bytes", content.text.len());
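
To make the routing idea concrete, here is a minimal, self-contained sketch of a registry that dispatches on MIME type. It is not the crate's implementation: `SketchRegistry`, `SketchExtractor`, `SketchContent`, and `FakePdf` are hypothetical names, the interface is synchronous for brevity (the real `extract` is async), and errors are plain strings.

```rust
use std::path::Path;

/// Hypothetical stand-in for the extracted-content type.
#[derive(Debug, PartialEq)]
pub struct SketchContent {
    pub text: String,
}

/// Hypothetical synchronous extractor interface.
pub trait SketchExtractor {
    /// MIME types this extractor is willing to handle.
    fn mime_types(&self) -> &[&str];
    fn extract(&self, path: &Path) -> Result<SketchContent, String>;
}

/// Hypothetical registry: tries extractors in registration order.
pub struct SketchRegistry {
    entries: Vec<(String, Box<dyn SketchExtractor>)>,
}

impl SketchRegistry {
    pub fn new() -> Self {
        Self { entries: Vec::new() }
    }

    pub fn register<E: SketchExtractor + 'static>(&mut self, name: &str, extractor: E) {
        self.entries.push((name.to_string(), Box::new(extractor)));
    }

    /// Routes to the first extractor that claims `mime`.
    pub fn extract(&self, path: &Path, mime: &str) -> Result<SketchContent, String> {
        for (_name, extractor) in &self.entries {
            if extractor.mime_types().iter().any(|m| *m == mime) {
                return extractor.extract(path);
            }
        }
        Err(format!("no extractor registered for MIME type {mime}"))
    }
}

/// Toy extractor used only to exercise the routing; it reads no file.
pub struct FakePdf;

impl SketchExtractor for FakePdf {
    fn mime_types(&self) -> &[&str] {
        &["application/pdf"]
    }

    fn extract(&self, path: &Path) -> Result<SketchContent, String> {
        Ok(SketchContent { text: format!("pdf:{}", path.display()) })
    }
}
```

With `FakePdf` registered, a request for `"application/pdf"` is served by it, while an unregistered type such as `"application/octet-stream"` produces an error. First-match-wins is a design choice of this sketch; the real registry may resolve conflicts differently.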

§PDF Image Extraction

The `PdfExtractor` can extract embedded images from PDF documents:

Supported formats: JPEG (DCTDecode), PNG (FlateDecode), JPEG2000 (JPXDecode)
Color spaces: RGB, Grayscale, CMYK (auto-converted to RGB)
Limits: 100 images max, 50 MB total, 50 px minimum dimension
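
The limits and the CMYK conversion above can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: `ImageBudget`, `ImageInfo`, and `cmyk_to_rgb` are hypothetical names, 50 MB is read as 50 MiB, the minimum dimension is applied to both width and height, and the color conversion is the naive profile-free formula rather than an ICC-aware one.

```rust
/// Limits as quoted in the text. Interpreting "50MB" as 50 MiB is an assumption.
pub const MAX_IMAGES: usize = 100;
pub const MAX_TOTAL_BYTES: usize = 50 * 1024 * 1024;
pub const MIN_DIMENSION: u32 = 50;

/// Hypothetical description of one embedded image.
pub struct ImageInfo {
    pub width: u32,
    pub height: u32,
    pub byte_len: usize,
}

/// Hypothetical running tally of accepted images for one document.
pub struct ImageBudget {
    count: usize,
    total_bytes: usize,
}

impl ImageBudget {
    pub fn new() -> Self {
        Self { count: 0, total_bytes: 0 }
    }

    /// Records the image and returns true only if every limit still holds.
    pub fn try_accept(&mut self, img: &ImageInfo) -> bool {
        // Assumption: an image is too small if either side is under the minimum.
        if img.width < MIN_DIMENSION || img.height < MIN_DIMENSION {
            return false;
        }
        if self.count >= MAX_IMAGES {
            return false;
        }
        if self.total_bytes.saturating_add(img.byte_len) > MAX_TOTAL_BYTES {
            return false;
        }
        self.count += 1;
        self.total_bytes += img.byte_len;
        true
    }
}

/// Naive CMYK to RGB conversion for 8-bit channels, ignoring color profiles:
/// each RGB channel is (255 - ink) * (255 - k) / 255, rounded to nearest.
pub fn cmyk_to_rgb(c: u8, m: u8, y: u8, k: u8) -> [u8; 3] {
    let convert = |ink: u8| -> u8 {
        let scaled = (255 - ink as u32) * (255 - k as u32);
        ((scaled + 127) / 255) as u8
    };
    [convert(c), convert(m), convert(y)]
}
```

Under these assumptions a 60x60 image of 1,000 bytes is accepted, a 49x100 image is rejected for being too narrow, and `cmyk_to_rgb(255, 0, 0, 0)` gives pure cyan, `[0, 255, 255]`.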

§Vision Captioning

The `ImageExtractor` supports optional vision-based captioning via the `ImageCaptioner` trait. A `PlaceholderCaptioner` is provided as a no-op default.
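
The shape of "optional captioning with a no-op default" can be illustrated with a small stand-alone sketch. The real `ImageCaptioner` signature is not shown on this page, so everything here is hypothetical: the `SketchCaptioner` trait, its synchronous `caption` method, `NoOpCaptioner`, `FixedCaptioner`, and the `describe` helper are all invented for illustration.

```rust
/// Hypothetical, synchronous stand-in for a captioning trait.
/// `Ok(None)` means "no caption available", which is not an error.
pub trait SketchCaptioner {
    fn caption(&self, image_bytes: &[u8]) -> Result<Option<String>, String>;
}

/// No-op captioner: never produces a caption.
pub struct NoOpCaptioner;

impl SketchCaptioner for NoOpCaptioner {
    fn caption(&self, _image_bytes: &[u8]) -> Result<Option<String>, String> {
        Ok(None)
    }
}

/// Toy captioner that always returns the same text; useful in tests.
pub struct FixedCaptioner(pub String);

impl SketchCaptioner for FixedCaptioner {
    fn caption(&self, _image_bytes: &[u8]) -> Result<Option<String>, String> {
        Ok(Some(self.0.clone()))
    }
}

/// Builds a text description from image metadata, appending a caption
/// only when a captioner is supplied and it actually returns one.
pub fn describe(
    width: u32,
    height: u32,
    image_bytes: &[u8],
    captioner: Option<&dyn SketchCaptioner>,
) -> String {
    let mut text = format!("image {width}x{height}");
    if let Some(c) = captioner {
        // Captioning failures are ignored here so metadata is still returned.
        if let Ok(Some(caption)) = c.caption(image_bytes) {
            text.push_str(": ");
            text.push_str(&caption);
        }
    }
    text
}
```

Passing `None` or a `NoOpCaptioner` yields metadata only, which mirrors how `ImageExtractor::new(None)` is used in the usage example. Swallowing captioner errors is a choice made for this sketch; a real integration may prefer to surface them.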

§Components

Type	Description
`ExtractorRegistry`	Routes files to appropriate extractors by MIME type
`TextExtractor`	Handles text-based files (40+ types)
`PdfExtractor`	PDF text and image extraction
`ImageExtractor`	Image metadata and optional captioning
`ImageCaptioner`	Trait for vision model integration
`PlaceholderCaptioner`	No-op captioner implementation

Re-exports§

pub use image::ImageExtractor;
pub use pdf::PdfExtractor;
pub use pdf::PdfOxideExtractor;
pub use registry::ExtractorRegistry;
pub use text::TextExtractor;
pub use vision::BlipCaptioner;
pub use vision::CaptionConfig;
pub use vision::CaptionError;
pub use vision::ImageCaptioner;
pub use vision::PlaceholderCaptioner;

Modules§

image: Image content extractor.
pdf: PDF content extractor.
registry: Extractor registry for managing content extractors.
text: Text content extractor.
vision: Vision model captioning for images.
