Work with PDFs
PDFs are stored as Blobs in ApertureDB. This notebook shows how to:
- Store a PDF with `document_type: "pdf"` so the ApertureDB UI can render it inline
- Extract text, chunk it, and generate embeddings with sentence-transformers
- Store embeddings as Descriptors linked to the PDF blob via a graph edge
- Run semantic search that returns matching chunks and their source PDF in one query
For large-scale PDF ingestion with OCR and parallel processing, see the Embeddings Extraction workflow.
Connect to ApertureDB
Option A: ApertureDB Cloud (recommended)
Sign up for a free 30-day trial. Get your key from Connect > Generate API Key and add it to a `.env` file in this directory:
APERTUREDB_KEY=your_key_here
Option B: Community Edition (local Docker)
Run this in a terminal before starting the notebook:
docker run -d --name aperturedb \
-p 55555:55555 -e ADB_MASTER_KEY=admin -e ADB_FORCE_SSL=false \
aperturedata/aperturedb-community
See client configuration options for all connection methods and server setup options for deployment choices.
%pip install --upgrade --quiet aperturedb python-dotenv sentence-transformers pypdf
Note: you may need to restart the kernel to use updated packages.
# Option A: ApertureDB Cloud
from dotenv import load_dotenv
load_dotenv() # loads APERTUREDB_KEY from .env into the environment
True
# Option B: Community Edition (local Docker)
# !adb config create localdb --active \
# --host localhost --port 55555 \
# --username admin --password admin \
# --no-use-ssl --no-interactive
from aperturedb.CommonLibrary import create_connector
client = create_connector()
response, _ = client.query([{"GetStatus": {}}])
client.print_last_response()
[
{
"GetStatus": {
"info": "OK",
"status": 0,
"system": "ApertureDB",
"version": "0.19.6"
}
}
]
Download the Sample Recipe PDF
We use a Butter Chicken recipe PDF from the Cookbook repo. Point `pdf_path` at any PDF from your own collection to use a different document.
import os
os.makedirs("data", exist_ok=True)
pdf_path = "data/ButterChicken.pdf"
!wget -q -O "{pdf_path}" \
https://raw.githubusercontent.com/aperture-data/Cookbook/main/notebooks/simple/data/ButterChicken.pdf
print(f"Downloaded {pdf_path} ({os.path.getsize(pdf_path)} bytes)")
Downloaded data/ButterChicken.pdf (141324 bytes)
Store the PDF as a Blob
Setting document_type: "pdf" lets the ApertureDB UI render the PDF inline when you browse blobs.
We also capture a _ref so we can link embeddings to this blob in the next step.
query = [{
"AddBlob": {
"_ref": 1,
"properties": {
"name": "butter_chicken_recipe",
"document_type": "pdf",
"dish_name": "Butter Chicken",
"cuisine": "Indian",
"format": "pdf"
},
"if_not_found": {"name": ["==", "butter_chicken_recipe"]},
}
}]
with open(pdf_path, "rb") as f:
    pdf_bytes = f.read()
response, _ = client.query(query, [pdf_bytes])
client.print_last_response()
[
{
"AddBlob": {
"status": 0
}
}
]
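To confirm the blob landed, or to stream the PDF back out later, a FindBlob query with `blobs: true` returns the raw bytes alongside the metadata. A minimal sketch (the query dict can be passed straight to `client.query`; the commented lines assume the live connection from above):

```python
# Sketch: retrieve the stored PDF bytes back from ApertureDB.
q = [{
    "FindBlob": {
        "constraints": {"name": ["==", "butter_chicken_recipe"]},
        "blobs": True,  # return the raw PDF bytes, not just properties
        "results": {"all_properties": True},
    }
}]

# With a live connection this would be:
#   response, blobs = client.query(q)
#   open("roundtrip.pdf", "wb").write(blobs[0])
print(q[0]["FindBlob"]["constraints"])
```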
Extract Text, Chunk, and Embed
We extract the PDF text with pypdf, split it into overlapping chunks, and embed each chunk with sentence-transformers. Each chunk is stored as a Descriptor linked to the source blob.
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
import numpy as np
# Extract full text
reader = PdfReader(pdf_path)
full_text = " ".join(page.extract_text() for page in reader.pages)
print(f"Extracted {len(full_text)} characters")
# Chunk by words with overlap
def chunk_text(text, chunk_size=60, overlap=15):
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[start : start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
chunks = chunk_text(full_text)
print(f"Created {len(chunks)} chunks")
# Embed all chunks
model = SentenceTransformer("all-MiniLM-L6-v2") # 384-dimensional, CPU-friendly
embeddings = model.encode(chunks, normalize_embeddings=True)
print(f"Embedding shape: {embeddings.shape}")
Extracted 1183 characters
Created 5 chunks
Embedding shape: (5, 384)
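The overlap in `chunk_text` means consecutive chunks share their boundary words, so a sentence split at a chunk edge still appears whole in at least one chunk. A quick self-contained check of that behavior, re-defining the same function on a toy input:

```python
def chunk_text(text, chunk_size=60, overlap=15):
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[start : start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Toy input: 10 "words", chunk_size=4, overlap=2 -> the start index
# advances by 2 words per chunk.
toy = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_text(toy, chunk_size=4, overlap=2)
print(toy_chunks)
# The last 2 words of each chunk reappear as the first 2 words of the next.
```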
Store Embeddings Linked to the PDF
First create a DescriptorSet as the vector index, then add each chunk embedding with a connection back to the source blob.
SET_NAME = "recipe_pdf_chunks"
client.query([{"AddDescriptorSet": {
"name": SET_NAME,
"dimensions": 384,
"engine": "HNSW",
"metric": "CS",
}}])
client.print_last_response()
[
{
"AddDescriptorSet": {
"status": 0
}
}
]
for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
    client.query([
        {
            "FindBlob": {
                "_ref": 1,
                "constraints": {"name": ["==", "butter_chicken_recipe"]},
                "results": {"all_properties": False},
            }
        },
        {
            "AddDescriptor": {
                "_ref": 2,
                "set": SET_NAME,
                "properties": {
                    "chunk_index": i,
                    "chunk_text": chunk[:200],
                    "source_name": "butter_chicken_recipe",
                },
            }
        },
        {
            "AddConnection": {
                "src": 2,
                "dst": 1,
                "class": "source_pdf",
            }
        }
    ], [emb.astype("float32").tobytes()])
print(f"Stored {len(chunks)} chunk embeddings linked to the PDF blob")
Stored 5 chunk embeddings linked to the PDF blob
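One query per chunk is fine for five chunks, but each call is a round trip. The same work can be batched into a single transaction: one FindBlob, then one AddDescriptor/AddConnection pair per chunk with distinct `_ref`s, and all embedding buffers passed in order. A hedged sketch of how such a transaction could be assembled; `demo_chunks` and `demo_embeddings` below are hypothetical stand-ins for the real `chunks`/`embeddings` computed earlier:

```python
import numpy as np

# Hypothetical stand-ins for the chunks/embeddings computed above.
demo_chunks = ["chunk one text", "chunk two text"]
demo_embeddings = np.random.rand(len(demo_chunks), 384).astype("float32")

SET_NAME = "recipe_pdf_chunks"
query = [{
    "FindBlob": {
        "_ref": 1,
        "constraints": {"name": ["==", "butter_chicken_recipe"]},
        "results": {"all_properties": False},
    }
}]
blobs = []
for i, (chunk, emb) in enumerate(zip(demo_chunks, demo_embeddings)):
    ref = i + 2  # refs 2, 3, ...; ref 1 is the source blob
    query.append({"AddDescriptor": {
        "_ref": ref,
        "set": SET_NAME,
        "properties": {"chunk_index": i, "chunk_text": chunk[:200]},
    }})
    query.append({"AddConnection": {"src": ref, "dst": 1, "class": "source_pdf"}})
    blobs.append(emb.tobytes())  # blobs are matched to AddDescriptor commands in order

# With a live connection: client.query(query, blobs)
print(len(query), len(blobs))
```

This trades per-chunk error isolation for a single round trip; for very large PDFs, the parallel loaders in the aperturedb Python package are the better fit.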
Semantic Search Over the PDF
Find the most relevant chunks, then traverse the graph edge to retrieve the source PDF blob in the same query.
query_text = "spices and marinade"
query_emb = model.encode([query_text], normalize_embeddings=True)[0].astype("float32")
q = [{
"FindDescriptor": {
"set": SET_NAME,
"k_neighbors": 3,
"distances": True,
"_ref": 1,
"results": {"all_properties": True},
}
}, {
"FindBlob": {
"is_connected_to": {"ref": 1},
"results": {"all_properties": True},
}
}]
response, _ = client.query(q, [query_emb.tobytes()])
print(f"Query: '{query_text}'\n")
chunks_found = response[0]["FindDescriptor"].get("entities", [])
for r in chunks_found:
    score = 1 - r["_distance"]
    print(f" score={score:.3f} chunk #{r['chunk_index']}")
    print(f" {r['chunk_text']}")
    print()
blobs_found = response[1]["FindBlob"].get("entities", [])
if blobs_found:
    print(f"Source PDF: {blobs_found[0]['name']} ({blobs_found[0]['cuisine']})")
Query: 'spices and marinade'
score=0.472 chunk #2
1. "Marinate chicken in yogurt, turmeric, and chili powder for 30 minutes.", 2. "Sear chicken in butter over high heat until lightly browned. Set aside.", 3. "In the same pan, add tomato puree and all
score=0.533 chunk #3
for 5 minutes until the sauce thickens.", 5. "Return chicken to the pan and simmer for 15 minutes.", 6. "Garnish with fresh coriander and serve with naan or basmati rice." Tips: For a smokier flavor,
score=0.559 chunk #1
the most recognized Indian dishes worldwide. Ingredients: "500g boneless chicken thighs, cubed", "2 tbsp butter", "1 cup tomato puree", "1/2 cup heavy cream", "1 tsp garam masala", "1 tsp cumin", "1 t
Source PDF: butter_chicken_recipe (Indian)
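These hits are exactly what a RAG pipeline consumes: concatenate the matched `chunk_text` values into a context string and hand it to an LLM prompt alongside the question. A minimal sketch on a mocked `hits` list of the same shape as the FindDescriptor entities above:

```python
# Mocked entities in the shape FindDescriptor returned above.
hits = [
    {"_distance": 0.528, "chunk_index": 2, "chunk_text": "Marinate chicken in yogurt..."},
    {"_distance": 0.467, "chunk_index": 3, "chunk_text": "Return chicken to the pan..."},
]

question = "spices and marinade"
context = "\n\n".join(
    f"[chunk {r['chunk_index']}] {r['chunk_text']}" for r in hits
)
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(prompt)
```

Keeping `chunk_index` in the citation tags lets an answer be traced back through the `source_pdf` edge to the original PDF.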
Cleanup
# Delete the descriptor set (removes all chunk embeddings)
client.query([{"DeleteDescriptorSet": {"with_name": SET_NAME}}])
client.print_last_response()
# Delete the PDF blob
client.query([{"DeleteBlob": {"constraints": {"name": ["==", "butter_chicken_recipe"]}}}])
client.print_last_response()
[
{
"DeleteDescriptorSet": {
"count": 1,
"status": 0
}
}
]
[
{
"DeleteBlob": {
"count": 1,
"status": 0
}
}
]
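To confirm the cleanup removed everything, a follow-up FindBlob with the same constraint should match zero entities. A sketch of the verification query (run it with `client.query` as above; the `count` result field is an assumption about the response shape):

```python
# Sketch: verify the PDF blob is gone after cleanup.
q = [{
    "FindBlob": {
        "constraints": {"name": ["==", "butter_chicken_recipe"]},
        "results": {"count": True},  # only count matches; expect 0 after cleanup
    }
}]
# With a live connection:
#   response, _ = client.query(q)
#   assert response[0]["FindBlob"]["count"] == 0
print(q[0]["FindBlob"]["results"])
```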
What's Next?
- Embeddings Extraction workflow: production-scale PDF ingestion with OCR, configurable chunking, and parallel loading
- Work with Blobs: other binary formats (text, audio)
- Vector Search: full vector search guide with filtering and RAG
- LangChain Integration: connect ApertureDB to a LangChain RAG pipeline