Crate ragfs_extract

§ragfs-extract

Content extraction from various file formats for the RAGFS indexing pipeline.

This crate provides the extraction layer that reads files and produces ExtractedContent for downstream chunking and embedding.

§Supported Formats

| Extractor | Formats | Features |
|-----------|---------|----------|
| TextExtractor | .txt, .md, .rs, .py, .js, .ts, .go, .java, .json, .yaml, .toml, .xml, .html, .css, and 30+ more | UTF-8 text extraction |
| PdfExtractor | .pdf | Text extraction + embedded images (JPEG, PNG, JPEG2000) |
| ImageExtractor | .png, .jpg, .gif, .webp, .bmp | Metadata extraction, optional vision captioning |

§Usage

use ragfs_extract::{ExtractorRegistry, TextExtractor, PdfExtractor, ImageExtractor};
use std::path::Path;

// Create a registry with all extractors
let mut registry = ExtractorRegistry::new();
registry.register("text", TextExtractor);
registry.register("pdf", PdfExtractor::new());
registry.register("image", ImageExtractor::new(None));

// Extract content from a file (`.await?` needs an enclosing async fn that returns a Result)
let content = registry.extract(Path::new("document.pdf"), "application/pdf").await?;
println!("Extracted {} bytes", content.text.len());

§PDF Image Extraction

The PdfExtractor can extract embedded images from PDF documents:

  • Supported formats: JPEG (DCTDecode), PNG (FlateDecode), JPEG2000 (JPXDecode)
  • Color spaces: RGB, Grayscale, CMYK (auto-converted to RGB)
  • Limits: 100 images max, 50 MB total, 50 px minimum dimension
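
The color-space conversion and the limits listed above can be illustrated with a small standalone sketch. It is not the crate's implementation: the constant names, the `Candidate` struct, and `select_images` are hypothetical, the CMYK conversion is the naive profile-free formula, and it assumes that "50 MB" means 50 MiB, that the minimum dimension applies to both width and height, and that an image which would exceed the byte budget is skipped rather than ending the scan.

```rust
// Limits from the list above; names are illustrative, not the crate's.
const MAX_IMAGES: usize = 100;
const MAX_TOTAL_BYTES: usize = 50 * 1024 * 1024; // assumption: 50 MiB
const MIN_DIMENSION_PX: u32 = 50;

/// Naive CMYK -> RGB conversion for one pixel (no ICC profile involved).
fn cmyk_to_rgb(c: u8, m: u8, y: u8, k: u8) -> [u8; 3] {
    // channel = (255 - ink) * (255 - k) / 255, computed in u32 to avoid overflow
    let channel = |ink: u8| -> u8 { ((255 - ink as u32) * (255 - k as u32) / 255) as u8 };
    [channel(c), channel(m), channel(y)]
}

/// Hypothetical description of an embedded image found in a PDF.
struct Candidate {
    width: u32,
    height: u32,
    byte_len: usize,
}

/// Returns the indices of the candidates that pass all three limits.
fn select_images(candidates: &[Candidate]) -> Vec<usize> {
    let mut kept = Vec::new();
    let mut total_bytes = 0usize;
    for (index, image) in candidates.iter().enumerate() {
        if kept.len() >= MAX_IMAGES {
            break; // image-count limit reached
        }
        if image.width < MIN_DIMENSION_PX || image.height < MIN_DIMENSION_PX {
            continue; // too small (assumed to apply to both dimensions)
        }
        if total_bytes + image.byte_len > MAX_TOTAL_BYTES {
            continue; // would exceed the byte budget (assumed: skip, keep scanning)
        }
        total_bytes += image.byte_len;
        kept.push(index);
    }
    kept
}
```

Under these assumptions, a 10x100 image is rejected for its width, while a 60x60 image is kept as long as the count and byte budgets allow it.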

§Vision Captioning

The ImageExtractor supports optional vision-based captioning via the ImageCaptioner trait. A PlaceholderCaptioner is provided as a no-op default.
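
The "optional captioner with a no-op default" pattern can be sketched in isolation as follows. The signature of the real ImageCaptioner trait is not shown on this page, so the `Captioner` trait, `NoOpCaptioner`, `FixedCaptioner`, and `describe` below are hypothetical stand-ins and should not be read as the crate's API.

```rust
/// Hypothetical captioning interface; the real `ImageCaptioner` may differ.
trait Captioner {
    fn caption(&self, image_bytes: &[u8]) -> Option<String>;
}

/// No-op captioner: never produces a caption.
struct NoOpCaptioner;

impl Captioner for NoOpCaptioner {
    fn caption(&self, _image_bytes: &[u8]) -> Option<String> {
        None
    }
}

/// Toy captioner that always returns the same text, to show the non-default path.
struct FixedCaptioner(&'static str);

impl Captioner for FixedCaptioner {
    fn caption(&self, _image_bytes: &[u8]) -> Option<String> {
        Some(self.0.to_string())
    }
}

/// Builds the text for an image: metadata always, a caption only when one is produced.
fn describe(width: u32, height: u32, bytes: &[u8], captioner: Option<&dyn Captioner>) -> String {
    let metadata = format!("image {}x{}", width, height);
    // Fall back to the no-op captioner when none was supplied.
    let caption = match captioner {
        Some(c) => c.caption(bytes),
        None => NoOpCaptioner.caption(bytes),
    };
    match caption {
        Some(text) => format!("{}: {}", metadata, text),
        None => metadata,
    }
}
```

With this shape, callers that pass no captioner still get metadata-only output, and a vision model can be plugged in later without changing the extraction path.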

§Components

| Type | Description |
|------|-------------|
| ExtractorRegistry | Routes files to appropriate extractors by MIME type |
| TextExtractor | Handles text-based files (40+ types) |
| PdfExtractor | PDF text and image extraction |
| ImageExtractor | Image metadata and optional captioning |
| ImageCaptioner | Trait for vision model integration |
| PlaceholderCaptioner | No-op captioner implementation |
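
To make MIME-based routing concrete, here is a minimal standalone sketch of a registry that picks the first registered extractor claiming a MIME type. It does not reproduce ExtractorRegistry: the `Extract` trait, `PlainText`, `PdfStub`, and `Registry` are hypothetical, synchronous, and use `String` errors purely for brevity.

```rust
use std::path::Path;

/// Hypothetical extractor interface for this sketch; not the crate's own trait.
trait Extract {
    /// Whether this extractor handles the given MIME type.
    fn supports(&self, mime: &str) -> bool;
    /// Returns the extracted text, or an error message.
    fn extract(&self, path: &Path) -> Result<String, String>;
}

/// Reads any `text/*` file as UTF-8.
struct PlainText;

impl Extract for PlainText {
    fn supports(&self, mime: &str) -> bool {
        mime.starts_with("text/")
    }
    fn extract(&self, path: &Path) -> Result<String, String> {
        std::fs::read_to_string(path).map_err(|e| e.to_string())
    }
}

/// Stub standing in for a PDF extractor; it only reports what it would do.
struct PdfStub;

impl Extract for PdfStub {
    fn supports(&self, mime: &str) -> bool {
        mime == "application/pdf"
    }
    fn extract(&self, path: &Path) -> Result<String, String> {
        Ok(format!("(pdf text of {})", path.display()))
    }
}

/// Keeps named extractors in registration order.
#[derive(Default)]
struct Registry {
    extractors: Vec<(String, Box<dyn Extract>)>,
}

impl Registry {
    fn register(&mut self, name: &str, extractor: impl Extract + 'static) {
        self.extractors.push((name.to_string(), Box::new(extractor)));
    }

    /// Name of the first extractor that supports `mime`, if any.
    fn find(&self, mime: &str) -> Option<&str> {
        self.extractors
            .iter()
            .find(|(_, e)| e.supports(mime))
            .map(|(name, _)| name.as_str())
    }

    /// Routes the file to the first matching extractor.
    fn extract(&self, path: &Path, mime: &str) -> Result<String, String> {
        let (_, extractor) = self
            .extractors
            .iter()
            .find(|(_, e)| e.supports(mime))
            .ok_or_else(|| format!("no extractor for {}", mime))?;
        extractor.extract(path)
    }
}
```

First-match-wins in registration order keeps the lookup simple; a design with overlapping extractors would need an explicit priority instead.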

Re-exports§

pub use image::ImageExtractor;
pub use pdf::PdfExtractor;
pub use pdf::PdfOxideExtractor;
pub use registry::ExtractorRegistry;
pub use text::TextExtractor;
pub use vision::BlipCaptioner;
pub use vision::CaptionConfig;
pub use vision::CaptionError;
pub use vision::ImageCaptioner;
pub use vision::PlaceholderCaptioner;

Modules§

image
Image content extractor.
pdf
PDF content extractor.
registry
Extractor registry for managing content extractors.
text
Text content extractor.
vision
Vision model captioning for images.