# Text Chunking
Most embedding models accept at most 256–512 tokens per input. Longer documents must be split into overlapping passages before embedding, so that each chunk is semantically coherent and can be retrieved independently.
## Runnable Notebooks
- PDF notebook — store a PDF blob, chunk, embed, and search
- Website Chatbot — crawl, chunk, and answer questions with a local LLM
## Chunk Size and Overlap
| Parameter | Typical value | Effect |
|---|---|---|
| Chunk size | 256–512 tokens | Larger = more context per chunk; smaller = more precise retrieval |
| Overlap | 10–20% of chunk size | Prevents relevant content from being cut across chunk boundaries |
For most RAG use cases, 512 tokens with 50–100 token overlap is a reasonable starting point.
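To make the size/overlap interaction concrete, here is a minimal sliding-window chunker (a sketch in pure Python, with whitespace tokens standing in for model tokens; `chunk_tokens` is a hypothetical helper, not part of any library):

```python
def chunk_tokens(text, chunk_size=512, overlap=64):
    """Split text into overlapping windows of whitespace tokens."""
    tokens = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

doc = " ".join(f"tok{i}" for i in range(1000))
chunks = chunk_tokens(doc, chunk_size=512, overlap=64)
# 1000 tokens with step 448 → windows starting at 0, 448, 896, i.e. 3 chunks;
# the last 64 tokens of each chunk repeat as the first 64 of the next.
```

Because each boundary region appears in two chunks, a sentence that straddles a cut still lands whole in at least one of them.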
## Chunking with LangChain
LangChain's text splitters handle common document formats and respect sentence boundaries:
```shell
pip install -U aperturedb langchain langchain-community sentence-transformers
```
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

from aperturedb.CommonLibrary import create_connector

client = create_connector()
model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim embeddings
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

document = """
ApertureDB is a multimodal database for images, video, embeddings, and metadata.
It supports KNN vector search with metadata filters applied server-side during traversal.
Embeddings are stored as Descriptors in a DescriptorSet and linked to source objects via graph edges.
"""

chunks = splitter.split_text(document)

# Create the descriptor set that will hold the chunk embeddings
client.query([{"AddDescriptorSet": {
    "name": "doc_chunks",
    "dimensions": 384,
    "engine": "HNSW",
    "metric": "CS"
}}])

# Embed each chunk and store it along with its text and position
for i, chunk in enumerate(chunks):
    emb = model.encode(chunk, normalize_embeddings=True).astype("float32")
    client.query(
        [{"AddDescriptor": {
            "set": "doc_chunks",
            "properties": {"text": chunk, "chunk_index": i}
        }}],
        [emb.tobytes()]
    )
```
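The ingestion code above passes `normalize_embeddings=True`, which pairs naturally with the `CS` (cosine similarity) metric: on unit-length vectors, cosine similarity reduces to a plain dot product. A quick pure-Python sketch of that identity (no ApertureDB or model required):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two arbitrary vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)
# On normalized vectors, the dot product equals the cosine
# similarity of the original vectors (24/25 = 0.96 here).
assert abs(dot(ua, ub) - cosine(a, b)) < 1e-12
```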
Retrieve the most relevant chunks for a query:
```python
query_emb = model.encode("how does vector search work", normalize_embeddings=True).astype("float32")

response, _ = client.query(
    [{"FindDescriptor": {
        "set": "doc_chunks",
        "k_neighbors": 3,
        "distances": True,
        "results": {"all_properties": True}
    }}],
    [query_emb.tobytes()]
)

for entity in response[0]["FindDescriptor"].get("entities", []):
    print(f"[{entity['_distance']:.4f}] {entity['text'][:120]}")
```
## PDF Chunking via the Workflows UI
The Embeddings Extraction workflow handles PDF text extraction, chunking, and embedding without writing code:
- Configurable chunk size and overlap
- PDF text extraction with OCR fallback
- Embedding with any supported model
- Parallel ingestion into ApertureDB
## What's Next
- Building RAG Pipelines — LangChain and LlamaIndex retrieval chains
- Hybrid Search — filter chunks by metadata during KNN traversal
- Text Embedding Models — sentence-transformers, Cohere, model comparison