ragfs_chunker/
lib.rs

//! # ragfs-chunker
//!
//! Document chunking strategies for the RAGFS indexing pipeline.
//!
//! This crate splits [`ExtractedContent`](ragfs_core::ExtractedContent) into smaller
//! chunks suitable for embedding. Different strategies optimize for different content types.
//!
//! ## Chunking Strategies
//!
//! | Chunker | Best For | Method |
//! |---------|----------|--------|
//! | [`FixedSizeChunker`] | General text | Token-based splitting with overlap |
//! | [`CodeChunker`] | Source code | AST-aware splitting via tree-sitter |
//! | [`SemanticChunker`] | Documents | Structure-aware (headings, paragraphs) |
//!
//! ## Usage
//!
//! ```rust,ignore
//! use ragfs_chunker::{ChunkerRegistry, FixedSizeChunker, CodeChunker, SemanticChunker};
//! use ragfs_core::ChunkConfig;
//!
//! // Create a registry with all chunkers
//! let mut registry = ChunkerRegistry::new();
//! registry.register("fixed", FixedSizeChunker::new());
//! registry.register("code", CodeChunker::new());
//! registry.register("semantic", SemanticChunker::new());
//! registry.set_default("fixed");
//!
//! // Configure chunking parameters
//! let config = ChunkConfig {
//!     target_size: 512,    // Target tokens per chunk
//!     max_size: 1024,      // Maximum tokens
//!     overlap: 64,         // Overlap between chunks
//!     hierarchical: true,  // Enable parent/child relationships
//!     max_depth: 2,        // Maximum hierarchy depth
//! };
//!
//! // Chunk content
//! let chunks = registry.chunk(&content, &content_type, &config).await?;
//! ```
//!
//! ## Fixed-Size Chunking
//!
//! The [`FixedSizeChunker`] splits text into chunks of approximately equal size:
//!
//! - Token-based sizing (not character-based)
//! - Configurable overlap for context preservation
//! - Smart break detection (prefers newlines, sentence boundaries)
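//!
//! The overlap behaviour described above can be pictured as a sliding window.
//! The following is only a minimal sketch, not this crate's implementation:
//! `chunk_with_overlap` is a hypothetical helper, whitespace-separated words
//! stand in for real tokens, and break detection is left out.
//!
//! ```rust
//! // Hypothetical helper, not part of this crate's API.
//! fn chunk_with_overlap<'a>(
//!     tokens: &[&'a str],
//!     target_size: usize,
//!     overlap: usize,
//! ) -> Vec<Vec<&'a str>> {
//!     // The window must advance by at least one token per step.
//!     assert!(overlap < target_size, "overlap must be smaller than target_size");
//!     let step = target_size - overlap;
//!     let mut chunks = Vec::new();
//!     let mut start = 0;
//!     while start < tokens.len() {
//!         let end = (start + target_size).min(tokens.len());
//!         chunks.push(tokens[start..end].to_vec());
//!         if end == tokens.len() {
//!             break; // The last window reached the end of the input.
//!         }
//!         start += step;
//!     }
//!     chunks
//! }
//!
//! fn main() {
//!     let tokens: Vec<&str> = "a b c d e f g".split_whitespace().collect();
//!     // Windows of 4 tokens that share 1 token with their predecessor.
//!     let chunks = chunk_with_overlap(&tokens, 4, 1);
//!     assert_eq!(chunks, vec![vec!["a", "b", "c", "d"], vec!["d", "e", "f", "g"]]);
//! }
//! ```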
//!
//! ## Code-Aware Chunking
//!
//! The [`CodeChunker`] uses tree-sitter for syntax-aware splitting:
//!
//! - Respects function/class boundaries
//! - Preserves complete code constructs
//! - Supports Rust, Python, JavaScript, TypeScript, Go, Java, and more
//!
57//!
58//! ## Semantic Chunking
59//!
60//! The [`SemanticChunker`] understands document structure:
61//!
62//! - Splits on headings and sections
63//! - Preserves paragraph integrity
64//! - Maintains hierarchical relationships
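//!
//! As a rough illustration of heading-based splitting, the sketch below starts
//! a new section at every line that begins with `#`. It assumes Markdown-style
//! headings and is not how [`SemanticChunker`] is implemented:
//! `split_on_headings` is a hypothetical helper that ignores hierarchy and
//! would mis-split `#` lines inside fenced code.
//!
//! ```rust
//! // Hypothetical helper, not part of this crate's API.
//! fn split_on_headings(text: &str) -> Vec<String> {
//!     let mut sections = Vec::new();
//!     let mut current = String::new();
//!     for line in text.lines() {
//!         // A heading closes the section collected so far.
//!         if line.starts_with('#') && !current.trim().is_empty() {
//!             sections.push(current.trim_end().to_string());
//!             current.clear();
//!         }
//!         current.push_str(line);
//!         current.push('\n');
//!     }
//!     if !current.trim().is_empty() {
//!         sections.push(current.trim_end().to_string());
//!     }
//!     sections
//! }
//!
//! fn main() {
//!     let doc = "# Intro\nFirst paragraph.\n\n## Details\nSecond paragraph.\n";
//!     let sections = split_on_headings(doc);
//!     // Each heading stays together with the paragraphs that follow it.
//!     assert_eq!(sections.len(), 2);
//!     assert_eq!(sections[0], "# Intro\nFirst paragraph.");
//! }
//! ```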
//!
//! ## Components
//!
//! | Type | Description |
//! |------|-------------|
//! | [`ChunkerRegistry`] | Routes content to appropriate chunkers |
//! | [`FixedSizeChunker`] | Token-based chunking with overlap |
//! | [`CodeChunker`] | AST-aware code chunking |
//! | [`SemanticChunker`] | Document structure-aware chunking |

pub mod code;
pub mod fixed;
pub mod registry;
pub mod semantic;

pub use code::CodeChunker;
pub use fixed::FixedSizeChunker;
pub use registry::ChunkerRegistry;
pub use semantic::SemanticChunker;