
Different Types of Chunking Strategies

[Diagram: structured and unstructured data are split into chunks, stored in a vector DB, and retrieved to generate a response]

• Fixed-length Chunking
• Dynamic Chunking
• Semantic Chunking
• Sentence-based Chunking
• Overlapping Chunking
• Paragraph-based Chunking
• Sliding Window Chunking
• Token-based Chunking
• Hierarchical Chunking
• Contextual Chunking


What is Chunking?
Chunking is the process of breaking down a large document into
smaller, manageable pieces, or chunks. Each such piece is known
as one chunk.

This helps in efficiently retrieving and generating relevant
information from a vast amount of data.

[Diagram: a large document split into Chunk 1, Chunk 2, …, Chunk n]
Why is Chunking required?

Memory Limitation
Large documents can exceed memory capacity. Chunking breaks
the data down into manageable chunks.

Processing Efficiency
Smaller chunks are faster to process and reduce computational
cost.

Improved Retrieval Accuracy
Focuses on relevant sections and enhances context-specific
responses.

Simplifies Information Management
Easier to navigate and search; facilitates quick access to
specific data.

Scalability
Allows the system to handle larger datasets and makes it more
robust and scalable.
Fixed-Length Chunking
As the name suggests, we create chunks of data of a fixed size
from an existing document.

It is a method of splitting text into chunks of a specific size,
optionally maintaining an overlap to ensure continuity between
the chunks.

The CharacterTextSplitter in LangChain achieves this by splitting
the text by character count.

Steps

Input Document
A long document or text that needs to be divided into smaller
chunks.

Define Chunk Size
Determine the fixed size of each chunk (e.g., 100 words, 500
characters, etc.).

Chunking Process
The text is split into non-overlapping segments based on the
defined size. This can be done at the word level, token level,
or character level. In fixed-length chunking, chunks are created
strictly based on the fixed size, regardless of sentence
boundaries.

Output
A list of chunks is generated, where each chunk contains the
specified number of words, tokens, or characters.

Considerations

Loss of Context: Since chunking does not consider sentence or
paragraph boundaries, it may break sentences in the middle,
potentially leading to incomplete ideas or a loss of context.

Memory Efficiency: Fixed-length chunking helps reduce memory
overhead by processing long texts in small, consistent blocks.
Fixed-Length Chunking in Python

Breakdown of the Code

Tokenization: The text is split into individual words using the
.split() method.

Chunk Creation: Using a loop, we iterate over the list of words
and create chunks of the specified size (20 words in this
example).

Output: The chunks are printed as separate segments of text.
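The code itself did not survive this copy; a minimal sketch matching the breakdown above might look like the following (the sample text is a stand-in, as the original text is not shown):

```python
def fixed_length_chunks(text, chunk_size=20):
    # Tokenization: split the text into individual words
    words = text.split()
    # Chunk creation: iterate over the word list in steps of chunk_size
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# Hypothetical sample text; any long string works here.
text = " ".join(f"word{n}" for n in range(1, 46))  # 45 words

# Output: each chunk printed as a separate segment of text
for i, chunk in enumerate(fixed_length_chunks(text, chunk_size=20), start=1):
    print(f"Chunk {i}: {chunk}")
```

With 45 words and a chunk size of 20, this yields three chunks of 20, 20, and 5 words; note how the last chunk is simply whatever remains.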


Fixed-Length Chunking in LangChain

Breakdown of the Code

CharacterTextSplitter:
The CharacterTextSplitter is used here to split the text into
chunks based on a fixed number of characters. It also allows you
to specify an overlap, meaning that consecutive chunks will
share a few characters, which can help retain context.

Parameters
chunk_size: Number of characters each chunk will contain (200
characters in this example).
chunk_overlap: Number of overlapping characters between chunks
to preserve context across splits (20 characters in this
example).
separator: Ensures that we don't split in the middle of a word.
In this case, it's set to a space to split between words.
When to use Fixed-Length Chunking

• When the data is uniform in structure: Ideal for tasks like processing log files or
structured datasets where the content length is predictable.

• When memory and processing efficiency are a priority: Useful when handling large
datasets or models with strict token limits where predictability of chunk size helps
with system optimization.

• When maintaining simple processing pipelines: Beneficial when minimizing
complexity in chunking logic is important, especially for early-stage prototyping.

Advantages of Fixed-Length Chunking

• Simple Implementation: Easy to implement, as the chunk size is predetermined and
doesn't require complex processing.

• Efficient for Uniform Data: Works well for text where content length is consistent
and uniform, allowing predictable chunk sizes.

• Scalable: Fixed-length chunks are scalable for systems that need predictable
resource allocation and processing limits (e.g., token limits in LLMs).
Srinivas Mahakud
Cloud & AI Leader @EY

https://www.linkedin.com/in/srinivasmahakud/