Crate ragfs_chunker


§ragfs-chunker

Document chunking strategies for the RAGFS indexing pipeline.

This crate splits ExtractedContent into smaller chunks suitable for embedding. Different strategies optimize for different content types.

§Chunking Strategies

| Chunker | Best For | Method |
|---|---|---|
| FixedSizeChunker | General text | Token-based splitting with overlap |
| CodeChunker | Source code | AST-aware splitting via tree-sitter |
| SemanticChunker | Documents | Structure-aware (headings, paragraphs) |

§Usage

use ragfs_chunker::{ChunkerRegistry, FixedSizeChunker, CodeChunker, SemanticChunker};
use ragfs_core::ChunkConfig;

// Create a registry with all chunkers
let mut registry = ChunkerRegistry::new();
registry.register("fixed", FixedSizeChunker::new());
registry.register("code", CodeChunker::new());
registry.register("semantic", SemanticChunker::new());
registry.set_default("fixed");

// Configure chunking parameters
let config = ChunkConfig {
    target_size: 512,    // Target tokens per chunk
    max_size: 1024,      // Maximum tokens
    overlap: 64,         // Overlap between chunks
    hierarchical: true,  // Enable parent/child relationships
    max_depth: 2,        // Maximum hierarchy depth
};

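// Chunk content. `content` (the ExtractedContent) and `content_type` are
// assumed to come from an earlier extraction step. Because of `.await?`,
// this call must sit inside an async function that returns a Result.
let chunks = registry.chunk(&content, &content_type, &config).await?;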

§Fixed-Size Chunking

The FixedSizeChunker splits text into chunks of approximately equal size:

  • Token-based sizing (not character-based)
  • Configurable overlap for context preservation
  • Smart break detection (prefers newlines, sentence boundaries)
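The overlap behavior listed above can be illustrated with a minimal sketch. This is not the crate's implementation: the function name `chunk_with_overlap` is hypothetical, whitespace-separated words stand in for real tokens, and smart break detection is left out to keep the sliding window easy to follow.

```rust
/// Hypothetical helper, not part of ragfs_chunker.
/// Splits `text` into windows of at most `target_size` tokens, where each
/// window repeats the last `overlap` tokens of the previous one.
/// Assumption: whitespace-separated words approximate tokens.
pub fn chunk_with_overlap(text: &str, target_size: usize, overlap: usize) -> Vec<String> {
    assert!(target_size > 0, "target_size must be positive");
    assert!(overlap < target_size, "overlap must be smaller than target_size");

    let tokens: Vec<&str> = text.split_whitespace().collect();
    let step = target_size - overlap; // distance the window advances each time
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < tokens.len() {
        let end = (start + target_size).min(tokens.len());
        chunks.push(tokens[start..end].join(" "));
        if end == tokens.len() {
            break; // this window already reached the end of the input
        }
        start += step;
    }
    chunks
}
```

With a target size of 4 and an overlap of 1, each chunk begins with the last token of the previous chunk, which is the context-preservation effect the overlap setting is meant to provide.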

§Code-Aware Chunking

The CodeChunker uses tree-sitter for syntax-aware splitting:

  • Respects function/class boundaries
  • Preserves complete code constructs
  • Supports Rust, Python, JavaScript, TypeScript, Go, Java, and more
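To make the idea of respecting construct boundaries concrete, here is a deliberately simplified sketch. It does not use tree-sitter and is not how CodeChunker works internally: the hypothetical function `split_top_level_items` only counts braces line by line, so it would be misled by braces inside strings or comments, and it only suits brace-delimited languages. It shows the goal (never cut a function in half), not the AST-based method.

```rust
/// Hypothetical helper, not part of ragfs_chunker.
/// Groups source lines into top-level items by tracking brace depth.
/// Assumption: braces appear only as block delimiters (no braces in
/// string literals or comments), which a real parser would not need.
pub fn split_top_level_items(source: &str) -> Vec<String> {
    let mut items = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut depth: i32 = 0;

    for line in source.lines() {
        // Skip blank lines that sit between two items.
        if depth == 0 && current.is_empty() && line.trim().is_empty() {
            continue;
        }
        current.push(line);
        for ch in line.chars() {
            match ch {
                '{' => depth += 1,
                '}' => depth -= 1,
                _ => {}
            }
        }
        // The item is complete once its braces are balanced again.
        if depth == 0 && line.contains('}') {
            items.push(current.join("\n"));
            current.clear();
        }
    }
    // Keep any trailing lines that never opened a block.
    if !current.is_empty() {
        items.push(current.join("\n"));
    }
    items
}
```

Lines at depth zero that contain no braces, such as attributes or doc comments, stay attached to the item that follows them, so a chunk keeps a function together with its annotations.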

§Semantic Chunking

The SemanticChunker understands document structure:

  • Splits on headings and sections
  • Preserves paragraph integrity
  • Maintains hierarchical relationships
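The three properties above can be sketched for Markdown-style input. This is an illustration under stated assumptions, not SemanticChunker's actual code: the `Section` type and `split_on_headings` function are hypothetical, only `#`-style headings are recognized, and the hierarchy is recorded as a trail of enclosing heading titles.

```rust
/// Hypothetical type, not part of ragfs_chunker.
#[derive(Debug, PartialEq)]
pub struct Section {
    /// Heading trail from the outermost heading down to this section's own.
    pub path: Vec<String>,
    /// Body text of the section, with its paragraphs left intact.
    pub body: String,
}

/// Returns `(level, title)` for lines such as `## Title`.
fn parse_heading(line: &str) -> Option<(usize, &str)> {
    let level = line.chars().take_while(|c| *c == '#').count();
    if level == 0 || level > 6 {
        return None;
    }
    // A heading needs a space after the run of '#' characters.
    let title = line[level..].strip_prefix(' ')?;
    Some((level, title.trim()))
}

/// Emits the collected body lines as one section, if there is any text.
fn flush(stack: &[(usize, String)], body: &mut Vec<&str>, out: &mut Vec<Section>) {
    let text = body.join("\n").trim().to_string();
    body.clear();
    if !text.is_empty() {
        let path = stack.iter().map(|(_, title)| title.clone()).collect();
        out.push(Section { path, body: text });
    }
}

/// Splits a document at headings and records each section's ancestors.
pub fn split_on_headings(doc: &str) -> Vec<Section> {
    let mut sections = Vec::new();
    let mut stack: Vec<(usize, String)> = Vec::new(); // open headings: (level, title)
    let mut body: Vec<&str> = Vec::new();

    for line in doc.lines() {
        if let Some((level, title)) = parse_heading(line) {
            flush(&stack, &mut body, &mut sections);
            // Close headings at the same or a deeper level.
            while stack.last().map_or(false, |(l, _)| *l >= level) {
                stack.pop();
            }
            stack.push((level, title.to_string()));
        } else {
            body.push(line);
        }
    }
    flush(&stack, &mut body, &mut sections);
    sections
}
```

Each returned section carries its parent headings in `path`, which is one simple way to represent the parent/child relationships between chunks.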

§Components

| Type | Description |
|---|---|
| ChunkerRegistry | Routes content to appropriate chunkers |
| FixedSizeChunker | Token-based chunking with overlap |
| CodeChunker | AST-aware code chunking |
| SemanticChunker | Document structure-aware chunking |
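The routing role of the registry can be pictured with a small stand-alone sketch. Everything in it is an assumption made for illustration: the `Chunker` trait, `SketchRegistry`, `LineChunker`, and `ParagraphChunker` are hypothetical names, the API is synchronous, and lookup is by a plain string key rather than by content type as the real ChunkerRegistry appears to do.

```rust
use std::collections::HashMap;

/// Hypothetical, synchronous stand-in for a chunking strategy.
pub trait Chunker {
    fn chunk(&self, text: &str) -> Vec<String>;
}

/// Splits on line breaks, dropping empty lines.
pub struct LineChunker;
impl Chunker for LineChunker {
    fn chunk(&self, text: &str) -> Vec<String> {
        text.lines()
            .filter(|l| !l.trim().is_empty())
            .map(str::to_string)
            .collect()
    }
}

/// Splits on blank lines, keeping each paragraph whole.
pub struct ParagraphChunker;
impl Chunker for ParagraphChunker {
    fn chunk(&self, text: &str) -> Vec<String> {
        text.split("\n\n")
            .map(str::trim)
            .filter(|p| !p.is_empty())
            .map(str::to_string)
            .collect()
    }
}

/// Hypothetical registry: named strategies plus a fallback.
pub struct SketchRegistry {
    chunkers: HashMap<String, Box<dyn Chunker>>,
    default: Option<String>,
}

impl SketchRegistry {
    pub fn new() -> Self {
        SketchRegistry { chunkers: HashMap::new(), default: None }
    }

    pub fn register(&mut self, name: &str, chunker: Box<dyn Chunker>) {
        self.chunkers.insert(name.to_string(), chunker);
    }

    pub fn set_default(&mut self, name: &str) {
        self.default = Some(name.to_string());
    }

    /// Uses the chunker registered under `name`, or the default if the
    /// name is unknown. Returns `None` when neither is available.
    pub fn chunk(&self, name: &str, text: &str) -> Option<Vec<String>> {
        let chunker = self
            .chunkers
            .get(name)
            .or_else(|| self.default.as_ref().and_then(|d| self.chunkers.get(d)))?;
        Some(chunker.chunk(text))
    }
}
```

Storing strategies as boxed trait objects lets callers register new chunkers without changing the registry, and the default entry gives unknown inputs a predictable fallback.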

§Re-exports

pub use code::CodeChunker;
pub use fixed::FixedSizeChunker;
pub use registry::ChunkerRegistry;
pub use semantic::SemanticChunker;

§Modules

code
Code-aware chunking strategy.
fixed
Fixed-size chunking strategy with overlap.
registry
Chunker registry for managing chunking strategies.
semantic
Semantic chunking strategy.