Text Chunking

Most embedding models accept a maximum of 256–512 tokens. Longer documents must be split into overlapping passages before embedding so each chunk is semantically coherent and retrievable independently.

Chunk Size and Overlap

| Parameter  | Typical value        | Effect                                                            |
|------------|----------------------|-------------------------------------------------------------------|
| Chunk size | 256–512 tokens       | Larger = more context per chunk; smaller = more precise retrieval |
| Overlap    | 10–20% of chunk size | Prevents relevant content from being cut across chunk boundaries  |

For most RAG use cases, 512 tokens with 50–100 token overlap is a reasonable starting point.
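To make the size/overlap trade-off concrete, here is a minimal sketch (not part of any library's API) of a sliding-window chunker over a pre-tokenized list, stepping by `chunk_size - overlap` so consecutive chunks share their boundary tokens:

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into overlapping windows.

    Consecutive chunks share `overlap` tokens, so content near a
    boundary always appears intact in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = "one two three four five six seven eight nine ten".split()
for c in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(c)
```

With ten tokens, a window of 4 and overlap of 1, this yields three chunks, each repeating the last token of the previous chunk.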


Chunking with LangChain

LangChain's text splitters handle common document formats; `RecursiveCharacterTextSplitter` tries to split on paragraph, then line, then word boundaries before falling back to raw characters:

pip install -U aperturedb langchain langchain-community sentence-transformers

from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from aperturedb.CommonLibrary import create_connector

client = create_connector()
model = SentenceTransformer("all-MiniLM-L6-v2")

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

document = """
ApertureDB is a multimodal database for images, video, embeddings, and metadata.
It supports KNN vector search with metadata filters applied server-side during traversal.
Embeddings are stored as Descriptors in a DescriptorSet and linked to source objects via graph edges.
"""

chunks = splitter.split_text(document)

client.query([{
    "AddDescriptorSet": {
        "name": "doc_chunks",
        "dimensions": 384,  # matches the all-MiniLM-L6-v2 output size
        "engine": "HNSW",
        "metric": "CS"      # cosine similarity
    }
}])

for i, chunk in enumerate(chunks):
    emb = model.encode(chunk, normalize_embeddings=True).astype("float32")
    client.query(
        [{"AddDescriptor": {
            "set": "doc_chunks",
            "properties": {"text": chunk, "chunk_index": i}
        }}],
        [emb.tobytes()]
    )

Retrieve the most relevant chunks for a query:

query_emb = model.encode("how does vector search work", normalize_embeddings=True).astype("float32")

response, _ = client.query(
    [{"FindDescriptor": {
        "set": "doc_chunks",
        "k_neighbors": 3,
        "distances": True,
        "results": {"all_properties": True}
    }}],
    [query_emb.tobytes()]
)

for entity in response[0]["FindDescriptor"].get("entities", []):
    print(f"[{entity['_distance']:.4f}] {entity['text'][:120]}")
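In a RAG pipeline, the retrieved chunks are typically concatenated, nearest first, into the context portion of an LLM prompt. A minimal sketch, assuming result dictionaries shaped like the `FindDescriptor` entities above and that a smaller `_distance` means a closer match (flip the sort if your metric reports similarity instead):

```python
def build_context(entities, max_chars=2000):
    """Join retrieved chunk texts into one prompt context, nearest first.

    Stops adding chunks once the character budget is exhausted, so the
    assembled context stays within the prompt limit.
    """
    parts, used = [], 0
    for e in sorted(entities, key=lambda e: e["_distance"]):
        text = e["text"]
        if used + len(text) > max_chars:
            break
        parts.append(text)
        used += len(text)
    return "\n\n".join(parts)

# Hypothetical results in the shape returned above:
hits = [
    {"_distance": 0.12, "text": "Vector search uses KNN over descriptors."},
    {"_distance": 0.34, "text": "Chunks are linked to source documents."},
]
print(build_context(hits))
```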

PDF Chunking via the Workflows UI

The Embeddings Extraction workflow handles PDF text extraction, chunking, and embedding without writing code:

  • Configurable chunk size and overlap
  • PDF text extraction with OCR fallback
  • Embedding with any supported model
  • Parallel ingestion into ApertureDB

What's Next