Chunking is the process of segmenting text into smaller, manageable portions based on length, structure or semantic meaning. It allows vector search to focus on precise information rather than entire documents. Understanding different chunking methods helps improve retrieval accuracy and model performance in Retrieval-Augmented Generation (RAG) pipelines.
Common chunking methods include:
1. Fixed-Size Chunking: Splits text into equal-sized segments based on characters or tokens.
2. Recursive Character Splitter: Splits text using multiple fallback rules to preserve structure.
3. Token-Based Chunking: Splits text based on model token limits.
4. Sentence or Semantic Chunking: Groups text based on meaning or sentence boundaries.
5. Document-Based Chunking: Breaks structured documents into logical sections.
Chunk overlap refers to the technique of including a small portion of text from the end of one chunk at the beginning of the next chunk. This helps maintain continuity between chunks and prevents important information from being lost when text is split. It is especially useful when sentences or ideas span across multiple chunks.
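To make the idea concrete, here is a minimal sketch of character-based chunking with overlap in plain Python. It does not use LangChain; the function name `chunk_with_overlap` and its parameters are illustrative assumptions, not part of any library.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into character windows of chunk_size that share `overlap` characters.

    Hypothetical helper for illustration only, not a LangChain API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "Machine learning is a branch of artificial intelligence."
chunks = chunk_with_overlap(sample, chunk_size=20, overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk after the first begins with the last five characters of the previous one, so a phrase cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.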
Choosing the right chunk size depends on the type of document and the use case. If chunks are too large, retrieval may pull in text that is irrelevant to the query. If chunks are too small, they may lose the context needed to carry their meaning. Some recommended chunk sizes in LangChain are:
Installing LangChain for chunking utilities.
Reading the input text file.
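A minimal sketch of this step using only the standard library. The helper name `load_text` and the file name `sample.txt` in the usage note are assumptions for illustration.

```python
from pathlib import Path


def load_text(path: str) -> str:
    """Read a UTF-8 text file and return its contents as a single string."""
    # Hypothetical helper; the encoding is assumed to be UTF-8.
    return Path(path).read_text(encoding="utf-8")
```

Assuming the input is saved as `sample.txt`, it could be loaded with `text = load_text("sample.txt")`.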
You can download the document from here.
1. Fixed-Size Chunking
Output:
Machine learning is a branch of artificial intelligence focused on building systems that learn from data. These systems improve their performance over time without being explicitly programmed. There are many applications of machine learning, such as image classification, speech recognition, recommendation systems, and autonomous driving.
2. Recursive Character Chunking
Output:
Machine learning is a branch of artificial intelligence focused on building systems that learn from data. These systems improve their performance over time without being explicitly programmed. There are many applications of machine learning, such as image classification, speech recognition, recommendation systems, and autonomous driving.
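The fallback idea behind recursive splitting can be sketched in plain Python: try the coarsest separator first, and only fall back to a finer one for pieces that are still too large. This is a simplified illustration under assumed separators (`"\n\n"`, `"\n"`, `" "`), not LangChain's implementation; `recursive_split` is a hypothetical name.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ")) -> List[str]:
    """Split on the coarsest separator first, falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


long_para = "Para two is a much longer paragraph that needs finer splitting."
parts = recursive_split("Para one is short.\n\n" + long_para, chunk_size=30)
for part in parts:
    print(repr(part))
```

The short first paragraph survives intact because the paragraph separator was enough for it, while the long second paragraph is broken at word boundaries instead of mid-word.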
3. Token-Based Chunking
Output:
Token-Based Chunks: 1
Machine learning is a branch of artificial intelligence focused on building systems that learn from data. These systems improve their performance over time without being explicitly programmed. There are many applications of machine learning, such as image classification, speech recognition, recommendation systems, and autonomous driving.
Supervised learning uses labeled data to train predictive models. It is commonly used for tasks like spam detection and sentiment analysis. Unsupervised learning, on the other hand, discovers hidden patterns in unlabeled data, such as customer clustering or anomaly detection.
Reinforcement learning involves agents making decisions by interacting with an environment. They receive rewards or penalties based on their actions and learn optimal behaviors through continuous feedback. This approach is widely used in robotics, game playing, and resource optimization.
Although machine learning is powerful, it also comes with challenges such as data bias, overfitting, model interpretability, and computational complexity. It is important to choose the right algorithms, preprocess data correctly, and validate models properly to ensure reliable results.
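A rough sketch of token-based chunking follows. As a loudly stated assumption, whitespace-separated words stand in for model tokens here; a real pipeline would pass in the tokenizer of the target model instead. The function name and the `tokenize`/`detokenize` parameters are hypothetical.

```python
from typing import Callable, List


def chunk_by_tokens(text: str, max_tokens: int,
                    tokenize: Callable[[str], List[str]] = str.split,
                    detokenize: Callable[[List[str]], str] = " ".join) -> List[str]:
    """Group tokens into chunks of at most max_tokens tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # ASSUMPTION: the default tokenizer splits on whitespace, which only
    # approximates how a model tokenizer would count tokens.
    tokens = tokenize(text)
    return [detokenize(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


text = "one two three four five six seven"
print(chunk_by_tokens(text, max_tokens=3))
# ['one two three', 'four five six', 'seven']
```

When the token budget is larger than the whole text, the function returns a single chunk containing everything.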
4. Sentence / Semantic Chunking
Output:
Total Chunks Created: 4
Machine learning is a branch of artificial intelligence focused on building systems that learn from data. These systems improve their performance over time without being explicitly programmed. There are many applications of machine learning, such as image classification, speech recognition, recommendation systems, and autonomous driving.
Note: Semantic chunking depends on embedding models and may require external APIs, so it is not included as a runnable example here.
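Sentence-based grouping, by contrast, needs no embedding model. The sketch below uses a naive regular expression to find sentence boundaries and packs whole sentences into chunks up to a character limit. The function names and the boundary rule are assumptions for illustration; abbreviations such as "e.g." would be split incorrectly by this rule.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_sentences(text: str, max_chars: int) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = sentence if not current else current + " " + sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)  # adding this sentence would overflow
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = ("Supervised learning uses labeled data. "
        "It is used for spam detection. "
        "Unsupervised learning finds hidden patterns.")
for chunk in chunk_sentences(text, max_chars=80):
    print(repr(chunk))
```

Because chunks only ever break between sentences, no sentence is cut in half, at the cost of chunk lengths that vary more than with fixed-size splitting.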
5. Document-Based Chunking
Output:
Document Chunks: 4
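For a plain-text document, one simple way to approximate document-based chunking is to treat each blank-line-separated paragraph as a logical section. The sketch below assumes that convention; for Markdown or HTML input the boundaries would more naturally be headings or tags. `split_paragraphs` is a hypothetical helper, not a LangChain function.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Treat each paragraph (separated by one or more blank lines) as a chunk."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


doc = "First paragraph.\nStill first.\n\nSecond paragraph.\n\n\nThird paragraph."
doc_chunks = split_paragraphs(doc)
print(f"Document Chunks: {len(doc_chunks)}")
```

Line breaks inside a paragraph are preserved, so only blank lines act as section boundaries.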
You can download the complete code from here.