
Different Types of Chunking Strategies

[Diagram: structured and unstructured data are split into chunks, stored in a vector DB, and retrieved to generate a response]

• Fixed-length Chunking
• Dynamic Chunking
• Semantic Chunking
• Sentence-based Chunking
• Overlapping Chunking
• Paragraph-based Chunking
• Sliding Window Chunking
• Token-based Chunking
• Hierarchical Chunking
• Contextual Chunking


What is Chunking?
Chunking is the process of breaking down a large document into
smaller, manageable pieces, or chunks. Each such piece is known
as one chunk.

This helps in efficiently retrieving and generating relevant
information from a vast amount of data.

[Diagram: a large document split into Chunk 1, Chunk 2, …, Chunk n]
Why is Chunking required?

Memory Limitation
Large documents can exceed memory capacity. Chunking breaks
the data down into manageable chunks.

Processing Efficiency
Smaller chunks are faster to process and reduce computational
cost.

Improved Retrieval Accuracy
Focuses on relevant sections and enhances context-specific
responses.

Simplifies Information Management
Easier to navigate and search; facilitates quick access to
specific data.

Scalability
Allows the system to handle larger datasets and makes it more
robust and scalable.
Fixed-Length Chunking
As the name suggests, we create chunks of data of a fixed size
from an existing document.

It is a method of splitting text into chunks of a specific size,
optionally maintaining an overlap to ensure continuity between
the chunks.

The CharacterTextSplitter in LangChain achieves this by splitting
the text by character count.

Steps

Input Document
A long document or text that needs to be divided into smaller
chunks.

Define Chunk Size
Determine the fixed size of each chunk (e.g., 100 words, 500
characters, etc.).

Chunking Process
The text is split into non-overlapping segments based on the
defined size. This can be done at the word level, token level,
or character level. In fixed-length chunking, chunks are created
strictly based on the fixed size, regardless of sentence
boundaries.

Output
A list of chunks is generated, where each chunk contains the
specified number of words, tokens, or characters.

Considerations

Loss of Context: Since chunking does not consider sentence or
paragraph boundaries, it may break sentences in the middle,
potentially leading to incomplete ideas or a loss of context.

Memory Efficiency: Fixed-length chunking helps reduce memory
overhead by processing long texts in small, consistent blocks.
Fixed-Length Chunking in Python

Breakdown of the Code

Tokenization: The text is split into individual words using the
.split() method.

Chunk Creation: Using a loop, we iterate over the list of words
and create chunks of the specified size (20 words in this
example).

Output: The chunks are printed as separate segments of text.
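The code itself did not survive this copy; a minimal sketch matching the breakdown above might look like the following (the sample text is a stand-in, as the original text is not shown):

```python
def fixed_length_chunks(text, chunk_size=20):
    # Tokenization: split the text into individual words
    words = text.split()
    # Chunk creation: iterate over the word list in steps of chunk_size
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# Hypothetical sample text; any long string works here.
text = " ".join(f"word{n}" for n in range(1, 46))  # 45 words

# Output: each chunk printed as a separate segment of text
for i, chunk in enumerate(fixed_length_chunks(text, chunk_size=20), start=1):
    print(f"Chunk {i}: {chunk}")
```

With 45 words and a chunk size of 20, this yields three chunks of 20, 20, and 5 words; note how the last chunk is simply whatever remains.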


Fixed-Length Chunking in LangChain

Breakdown of the Code

CharacterTextSplitter:
The CharacterTextSplitter is used here to split the text into
chunks based on a fixed number of characters. It also allows you
to specify an overlap, meaning that consecutive chunks will
share a few characters, which can help retain context.

Parameters
chunk_size: Number of characters each chunk will contain (200
characters in this example).
chunk_overlap: Number of overlapping characters between chunks
to preserve context across splits (20 characters in this
example).
separator: Ensures that we don't split in the middle of a word.
In this case, it's set to a space to split between words.
When to use Fixed-Length Chunking

• When the data is uniform in structure: Ideal for tasks like processing log files or
structured datasets where the content length is predictable.

• When memory and processing efficiency are a priority: Useful when handling large
datasets or models with strict token limits where predictability of chunk size helps
with system optimization.

• When maintaining simple processing pipelines: Beneficial when minimizing
complexity in chunking logic is important, especially for early-stage prototyping.

Advantages of Fixed-Length Chunking

• Simple Implementation: Easy to implement, as the chunk size is predetermined and
doesn't require complex processing.

• Efficient for Uniform Data: Works well for text where content length is consistent
and uniform, allowing predictable chunk sizes.

• Scalable: Fixed-length chunks are scalable for systems that need predictable
resource allocation and processing limits (e.g., token limits in LLMs).
Srinivas Mahakud
Cloud & AI Leader @EY

https://www.linkedin.com/in/srinivasmahakud/