# Text Chunking
Most embedding models accept at most 256–512 tokens per input. Longer documents must be split into overlapping passages before embedding, so that each chunk is semantically coherent and can be retrieved independently.
## Runnable Notebooks
- PDF notebook — store a PDF blob, chunk, embed, and search
- Website Chatbot — crawl, chunk, and answer questions with a local LLM
## Chunk Size and Overlap
| Parameter | Typical value | Effect |
|---|---|---|
| Chunk size | 256–512 tokens | Larger = more context per chunk; smaller = more precise retrieval |
| Overlap | 10–20% of chunk size | Prevents relevant content from being cut across chunk boundaries |
For most RAG use cases, 512 tokens with 50–100 token overlap is a reasonable starting point.
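To make the size/overlap interaction concrete, here is a minimal sliding-window chunker (a sketch in pure Python, with whitespace tokens standing in for model tokens; `chunk_tokens` is a hypothetical helper, not part of any library):

```python
def chunk_tokens(text, chunk_size=512, overlap=64):
    """Split text into overlapping windows of whitespace tokens."""
    tokens = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

doc = " ".join(f"tok{i}" for i in range(1000))
chunks = chunk_tokens(doc, chunk_size=512, overlap=64)
# 1000 tokens with step 448 → windows starting at 0, 448, 896, i.e. 3 chunks;
# the last 64 tokens of each chunk repeat as the first 64 of the next.
```

Because each boundary region appears in two chunks, a sentence that straddles a cut still lands whole in at least one of them.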
## Chunking with LangChain
LangChain's text splitters handle common document formats and respect sentence boundaries:
```shell
pip install -U aperturedb langchain langchain-community sentence-transformers
```
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

from aperturedb.CommonLibrary import create_connector

client = create_connector()
model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim embeddings
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

document = """
ApertureDB is a multimodal database for images, video, embeddings, and metadata.
It supports KNN vector search with metadata filters applied server-side during traversal.
Embeddings are stored as Descriptors in a DescriptorSet and linked to source objects via graph edges.
"""

chunks = splitter.split_text(document)

# Create the descriptor set that will hold the chunk embeddings
client.query([{"AddDescriptorSet": {
    "name": "doc_chunks",
    "dimensions": 384,
    "engine": "HNSW",
    "metric": "CS"
}}])

# Embed each chunk and store it along with its text and position
for i, chunk in enumerate(chunks):
    emb = model.encode(chunk, normalize_embeddings=True).astype("float32")
    client.query(
        [{"AddDescriptor": {
            "set": "doc_chunks",
            "properties": {"text": chunk, "chunk_index": i}
        }}],
        [emb.tobytes()]
    )
```
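The ingestion code above passes `normalize_embeddings=True`, which pairs naturally with the `CS` (cosine similarity) metric: on unit-length vectors, cosine similarity reduces to a plain dot product. A quick pure-Python sketch of that identity (no ApertureDB or model required):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two arbitrary vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)
# On normalized vectors, the dot product equals the cosine
# similarity of the original vectors (24/25 = 0.96 here).
assert abs(dot(ua, ub) - cosine(a, b)) < 1e-12
```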
Retrieve the most relevant chunks for a query:
```python
query_emb = model.encode("how does vector search work", normalize_embeddings=True).astype("float32")

response, _ = client.query(
    [{"FindDescriptor": {
        "set": "doc_chunks",
        "k_neighbors": 3,
        "distances": True,
        "results": {"all_properties": True}
    }}],
    [query_emb.tobytes()]
)

for entity in response[0]["FindDescriptor"].get("entities", []):
    print(f"[{entity['_distance']:.4f}] {entity['text'][:120]}")
```
## PDF Chunking via the Workflows UI
The Embeddings Extraction workflow handles PDF text extraction, chunking, and embedding without writing code:
- Configurable chunk size and overlap
- PDF text extraction with OCR fallback
- Embedding with any supported model
- Parallel ingestion into ApertureDB
## What's Next
- Building RAG Pipelines — LangChain and LlamaIndex retrieval chains
- Hybrid Search — filter chunks by metadata during KNN traversal
- Text Embedding Models — sentence-transformers, Cohere, model comparison