Work with PDFs
PDFs are stored as Blobs in ApertureDB. This notebook shows how to:
- Store a PDF with `document_type: "pdf"` so the ApertureDB UI can render it inline
- Extract text, chunk it, and generate embeddings with sentence-transformers
- Store embeddings as Descriptors linked to the PDF blob via a graph edge
- Run semantic search that returns matching chunks and their source PDF in one query
For large-scale PDF ingestion with OCR and parallel processing, see the Embeddings Extraction workflow.
Connect to ApertureDB
Option A: ApertureDB Cloud (recommended)
Sign up for a free 30-day trial. Get your key from Connect > Generate API Key and add it to a `.env` file in this directory:
APERTUREDB_KEY=your_key_here
Option B: Community Edition (local Docker)
Run this in a terminal before starting the notebook:
docker run -d --name aperturedb \
-p 55555:55555 -e ADB_MASTER_KEY=admin -e ADB_FORCE_SSL=false \
aperturedata/aperturedb-community
See client configuration options for all connection methods and server setup options for deployment choices.
%pip install --upgrade --quiet aperturedb python-dotenv sentence-transformers pypdf
Note: you may need to restart the kernel to use updated packages.
# Option A: ApertureDB Cloud
from dotenv import load_dotenv
load_dotenv() # loads APERTUREDB_KEY from .env into the environment
True
# Option B: Community Edition (local Docker)
# !adb config create localdb --active \
# --host localhost --port 55555 \
# --username admin --password admin \
# --no-use-ssl --no-interactive
from aperturedb.CommonLibrary import create_connector
client = create_connector()
response, _ = client.query([{"GetStatus": {}}])
client.print_last_response()
[
{
"GetStatus": {
"info": "OK",
"status": 0,
"system": "ApertureDB",
"version": "0.19.6"
}
}
]
Download the Sample Recipe PDF
We use a Butter Chicken recipe PDF from the Cookbook repo. Point `pdf_path` at any PDF from your own collection to use a different document.
import os
os.makedirs("data", exist_ok=True)
pdf_path = "data/ButterChicken.pdf"
!wget -q -O "{pdf_path}" \
https://raw.githubusercontent.com/aperture-data/Cookbook/main/notebooks/simple/data/ButterChicken.pdf
print(f"Downloaded {pdf_path} ({os.path.getsize(pdf_path)} bytes)")
Downloaded data/ButterChicken.pdf (141324 bytes)
Store the PDF as a Blob
Setting document_type: "pdf" lets the ApertureDB UI render the PDF inline when you browse blobs.
We also capture a _ref so we can link embeddings to this blob in the next step.
query = [{
"AddBlob": {
"_ref": 1,
"properties": {
"name": "butter_chicken_recipe",
"document_type": "pdf",
"dish_name": "Butter Chicken",
"cuisine": "Indian",
"format": "pdf"
},
"if_not_found": {"name": ["==", "butter_chicken_recipe"]},
}
}]
with open(pdf_path, "rb") as f:
    pdf_bytes = f.read()
response, _ = client.query(query, [pdf_bytes])
client.print_last_response()
[
{
"AddBlob": {
"status": 0
}
}
]
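To confirm the blob landed, or to stream the PDF back out later, a FindBlob query with `blobs: true` returns the raw bytes alongside the metadata. A minimal sketch (the query dict can be passed straight to `client.query`; the commented lines assume the live connection from above):

```python
# Sketch: retrieve the stored PDF bytes back from ApertureDB.
q = [{
    "FindBlob": {
        "constraints": {"name": ["==", "butter_chicken_recipe"]},
        "blobs": True,  # return the raw PDF bytes, not just properties
        "results": {"all_properties": True},
    }
}]

# With a live connection this would be:
#   response, blobs = client.query(q)
#   open("roundtrip.pdf", "wb").write(blobs[0])
print(q[0]["FindBlob"]["constraints"])
```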
Extract Text, Chunk, and Embed
We extract the PDF text with pypdf, split it into overlapping chunks, and embed each chunk with sentence-transformers. Each chunk is stored as a Descriptor linked to the source blob.
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
import numpy as np
# Extract full text
reader = PdfReader(pdf_path)
full_text = " ".join(page.extract_text() for page in reader.pages)
print(f"Extracted {len(full_text)} characters")
# Chunk by words with overlap
def chunk_text(text, chunk_size=60, overlap=15):
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[start : start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
chunks = chunk_text(full_text)
print(f"Created {len(chunks)} chunks")
# Embed all chunks
model = SentenceTransformer("all-MiniLM-L6-v2") # 384-dimensional, CPU-friendly
embeddings = model.encode(chunks, normalize_embeddings=True)
print(f"Embedding shape: {embeddings.shape}")
Extracted 1183 characters
Created 5 chunks
Embedding shape: (5, 384)
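The overlap in `chunk_text` means consecutive chunks share their boundary words, so a sentence split at a chunk edge still appears whole in at least one chunk. A quick self-contained check of that behavior, re-defining the same function on a toy input:

```python
def chunk_text(text, chunk_size=60, overlap=15):
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[start : start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Toy input: 10 "words", chunk_size=4, overlap=2 -> the start index
# advances by 2 words per chunk.
toy = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_text(toy, chunk_size=4, overlap=2)
print(toy_chunks)
# The last 2 words of each chunk reappear as the first 2 words of the next.
```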
Store Embeddings Linked to the PDF
First create a DescriptorSet as the vector index, then add each chunk embedding with a connection back to the source blob.
SET_NAME = "recipe_pdf_chunks"
client.query([{"AddDescriptorSet": {
"name": SET_NAME,
"dimensions": 384,
"engine": "HNSW",
"metric": "CS",
}}])
client.print_last_response()
[
{
"AddDescriptorSet": {
"status": 0
}
}
]
for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
    client.query([
        {
            "FindBlob": {
                "_ref": 1,
                "constraints": {"name": ["==", "butter_chicken_recipe"]},
                "results": {"all_properties": False},
            }
        },
        {
            "AddDescriptor": {
                "_ref": 2,
                "set": SET_NAME,
                "properties": {
                    "chunk_index": i,
                    "chunk_text": chunk[:200],
                    "source_name": "butter_chicken_recipe",
                },
            }
        },
        {
            "AddConnection": {
                "src": 2,
                "dst": 1,
                "class": "source_pdf",
            }
        }
    ], [emb.astype("float32").tobytes()])
print(f"Stored {len(chunks)} chunk embeddings linked to the PDF blob")
Stored 5 chunk embeddings linked to the PDF blob
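One query per chunk is fine for five chunks, but each call is a round trip. The same work can be batched into a single transaction: one FindBlob, then one AddDescriptor/AddConnection pair per chunk with distinct `_ref`s, and all embedding buffers passed in order. A hedged sketch of how such a transaction could be assembled; `demo_chunks` and `demo_embeddings` below are hypothetical stand-ins for the real `chunks`/`embeddings` computed earlier:

```python
import numpy as np

# Hypothetical stand-ins for the chunks/embeddings computed above.
demo_chunks = ["chunk one text", "chunk two text"]
demo_embeddings = np.random.rand(len(demo_chunks), 384).astype("float32")

SET_NAME = "recipe_pdf_chunks"
query = [{
    "FindBlob": {
        "_ref": 1,
        "constraints": {"name": ["==", "butter_chicken_recipe"]},
        "results": {"all_properties": False},
    }
}]
blobs = []
for i, (chunk, emb) in enumerate(zip(demo_chunks, demo_embeddings)):
    ref = i + 2  # refs 2, 3, ...; ref 1 is the source blob
    query.append({"AddDescriptor": {
        "_ref": ref,
        "set": SET_NAME,
        "properties": {"chunk_index": i, "chunk_text": chunk[:200]},
    }})
    query.append({"AddConnection": {"src": ref, "dst": 1, "class": "source_pdf"}})
    blobs.append(emb.tobytes())  # blobs are matched to AddDescriptor commands in order

# With a live connection: client.query(query, blobs)
print(len(query), len(blobs))
```

This trades per-chunk error isolation for a single round trip; for very large PDFs, the parallel loaders in the aperturedb Python package are the better fit.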
Semantic Search Over the PDF
Find the most relevant chunks, then traverse the graph edge to retrieve the source PDF blob in the same query.
query_text = "spices and marinade"
query_emb = model.encode([query_text], normalize_embeddings=True)[0].astype("float32")
q = [{
"FindDescriptor": {
"set": SET_NAME,
"k_neighbors": 3,
"distances": True,
"_ref": 1,
"results": {"all_properties": True},
}
}, {
"FindBlob": {
"is_connected_to": {"ref": 1},
"results": {"all_properties": True},
}
}]
response, _ = client.query(q, [query_emb.tobytes()])
print(f"Query: '{query_text}'\n")
chunks_found = response[0]["FindDescriptor"].get("entities", [])
for r in chunks_found:
    score = 1 - r["_distance"]
    print(f" score={score:.3f} chunk #{r['chunk_index']}")
    print(f" {r['chunk_text']}")
    print()
blobs_found = response[1]["FindBlob"].get("entities", [])
if blobs_found:
    print(f"Source PDF: {blobs_found[0]['name']} ({blobs_found[0]['cuisine']})")
Query: 'spices and marinade'
score=0.472 chunk #2
1. "Marinate chicken in yogurt, turmeric, and chili powder for 30 minutes.", 2. "Sear chicken in butter over high heat until lightly browned. Set aside.", 3. "In the same pan, add tomato puree and all
score=0.533 chunk #3
for 5 minutes until the sauce thickens.", 5. "Return chicken to the pan and simmer for 15 minutes.", 6. "Garnish with fresh coriander and serve with naan or basmati rice." Tips: For a smokier flavor,
score=0.559 chunk #1
the most recognized Indian dishes worldwide. Ingredients: "500g boneless chicken thighs, cubed", "2 tbsp butter", "1 cup tomato puree", "1/2 cup heavy cream", "1 tsp garam masala", "1 tsp cumin", "1 t
Source PDF: butter_chicken_recipe (Indian)
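These hits are exactly what a RAG pipeline consumes: concatenate the matched `chunk_text` values into a context string and hand it to an LLM prompt alongside the question. A minimal sketch on a mocked `hits` list of the same shape as the FindDescriptor entities above:

```python
# Mocked entities in the shape FindDescriptor returned above.
hits = [
    {"_distance": 0.528, "chunk_index": 2, "chunk_text": "Marinate chicken in yogurt..."},
    {"_distance": 0.467, "chunk_index": 3, "chunk_text": "Return chicken to the pan..."},
]

question = "spices and marinade"
context = "\n\n".join(
    f"[chunk {r['chunk_index']}] {r['chunk_text']}" for r in hits
)
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(prompt)
```

Keeping `chunk_index` in the citation tags lets an answer be traced back through the `source_pdf` edge to the original PDF.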
Cleanup
# Delete the descriptor set (removes all chunk embeddings)
client.query([{"DeleteDescriptorSet": {"with_name": SET_NAME}}])
client.print_last_response()
# Delete the PDF blob
client.query([{"DeleteBlob": {"constraints": {"name": ["==", "butter_chicken_recipe"]}}}])
client.print_last_response()
[
{
"DeleteDescriptorSet": {
"count": 1,
"status": 0
}
}
]
[
{
"DeleteBlob": {
"count": 1,
"status": 0
}
}
]
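To confirm the cleanup removed everything, a follow-up FindBlob with the same constraint should match zero entities. A sketch of the verification query (run it with `client.query` as above; the `count` result field is an assumption about the response shape):

```python
# Sketch: verify the PDF blob is gone after cleanup.
q = [{
    "FindBlob": {
        "constraints": {"name": ["==", "butter_chicken_recipe"]},
        "results": {"count": True},  # only count matches; expect 0 after cleanup
    }
}]
# With a live connection:
#   response, _ = client.query(q)
#   assert response[0]["FindBlob"]["count"] == 0
print(q[0]["FindBlob"]["results"])
```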
What's Next?
- Embeddings Extraction workflow: production-scale PDF ingestion with OCR, configurable chunking, and parallel loading
- Work with Blobs: other binary formats (text, audio)
- Vector Search: full vector search guide with filtering and RAG
- LangChain Integration: connect ApertureDB to a LangChain RAG pipeline