Bulk Embedding Ingestion
Load embeddings at scale using ApertureDB's ParallelLoader. This notebook downloads the Cookbook dataset (20 dishes), generates text embeddings with sentence-transformers, and ingests them in parallel.
Connect to ApertureDB
Option A: ApertureDB Cloud (recommended)
Sign up for a free 30-day trial. Get your key from Connect > Generate API Key and add it to a .env file in this directory:
APERTUREDB_KEY=your_key_here
Option B: Community Edition (local Docker)
Run this in a terminal before starting the notebook:
docker run -d --name aperturedb \
  -p 55555:55555 -e ADB_MASTER_KEY=admin -e ADB_FORCE_SSL=false \
  aperturedata/aperturedb-community
See client configuration options for all connection methods and server setup options for deployment choices.
%pip install --upgrade --quiet aperturedb python-dotenv sentence-transformers pandas
# Option A: ApertureDB Cloud
from dotenv import load_dotenv
load_dotenv() # loads APERTUREDB_KEY from .env into the environment
True
# Option B: Community Edition (local Docker)
# !adb config create localdb --active \
# --host localhost --port 55555 \
# --username admin --password admin \
# --no-use-ssl --no-interactive
from aperturedb.CommonLibrary import create_connector
client = create_connector()
response, _ = client.query([{"GetStatus": {}}])
client.print_last_response()
[
{
"GetStatus": {
"info": "OK",
"status": 0,
"system": "ApertureDB",
"version": "0.19.6"
}
}
]
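The JSON above is the shape of every command response: a `status` of 0 means success. A small helper of our own (not part of the aperturedb package) can check this across a whole response before you rely on its results:

```python
def all_succeeded(response):
    """Return True if every command in an ApertureDB response has status 0."""
    for entry in response:
        for command, result in entry.items():
            if result.get("status", -1) != 0:
                return False
    return True

# The GetStatus response shown above passes the check
sample = [{"GetStatus": {"info": "OK", "status": 0,
                         "system": "ApertureDB", "version": "0.19.6"}}]
print(all_succeeded(sample))  # True
```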
Load Dataset and Generate Embeddings
We combine dish_name and caption into a single description, then embed with all-MiniLM-L6-v2 (384-dimensional, CPU-friendly).
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
dishes = pd.read_csv(
    "https://raw.githubusercontent.com/aperture-data/Cookbook/refs/heads/main/images.adb.csv"
)
dishes["description"] = dishes["dish_name"] + " - " + dishes["caption"]
print(f"Loaded {len(dishes)} dishes")
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(dishes["description"].tolist(), normalize_embeddings=True)
print(f"Embedding shape: {embeddings.shape}")
Loaded 20 dishes
Embedding shape: (20, 384)
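Because the descriptor set below will use the CS (cosine similarity) metric, we pass `normalize_embeddings=True` so every vector has unit length, which makes cosine similarity a plain dot product. A quick sanity check with stand-in vectors (so it runs without downloading the model):

```python
import numpy as np

# Stand-in for the (20, 384) matrix produced above; any rows work once normalized.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(4, 8)).astype("float32")
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # what normalize_embeddings=True does

# Every row now has unit length...
assert np.allclose(np.linalg.norm(vecs, axis=1), 1.0, atol=1e-5)

# ...so cosine similarity between any two rows is just a dot product.
cos = vecs @ vecs.T
print(cos.shape)  # (4, 4); the diagonal is ~1.0 (each vector vs. itself)
```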
Create the DescriptorSet
SET_NAME = "cookbook_bulk"
client.query([{
    "AddDescriptorSet": {
        "name": SET_NAME,
        "dimensions": 384,
        "engine": "HNSW",
        "metric": "CS",
    }
}])
client.print_last_response()
[
{
"AddDescriptorSet": {
"status": 0
}
}
]
Bulk Ingest with ParallelLoader
ParallelLoader ingests data concurrently across multiple threads. It accepts any Subscriptable; here we build a small class that yields one (query, blobs) pair per descriptor.
from aperturedb.ParallelLoader import ParallelLoader
class DescriptorGenerator:
    """Subscriptable generator of (query, blobs) pairs for ParallelLoader."""

    def __init__(self, dishes, embeddings, set_name):
        self.dishes = dishes
        self.embeddings = embeddings
        self.set_name = set_name

    def __len__(self):
        return len(self.dishes)

    def __getitem__(self, idx):
        # ParallelLoader calls __getitem__ with a slice for each batch;
        # return a list of (query, blobs) pairs in that case.
        if isinstance(idx, slice):
            return [self[i] for i in range(*idx.indices(len(self)))]
        row = self.dishes.iloc[idx]
        emb = self.embeddings[idx].astype("float32")
        query = [{
            "AddDescriptor": {
                "set": self.set_name,
                "properties": {
                    "dish_name": row["dish_name"],
                    "cuisine": row["food_tags"],
                    "caption": row["caption"],
                },
                "if_not_found": {"dish_name": ["==", row["dish_name"]]},
            }
        }]
        return query, [emb.tobytes()]
generator = DescriptorGenerator(dishes, embeddings, SET_NAME)
loader = ParallelLoader(client)
loader.ingest(generator, batchsize=5, numthreads=4, stats=True)
Progress: 100%|██████████| 20.0/20.0 [00:02<00:00, 9.95items/s]
============ ApertureDB Loader Stats ============
Total time (s): 2.011025905609131
Total queries executed: 4
Avg Query time (s): 1.3486077785491943
Query time std: 0.12229802807468174
Avg Query Throughput (q/s): 2.966021747481778
Overall insertion throughput (element/s): 9.945172732094711
Total inserted elements: 20
Total successful commands: 20
=================================================
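The slice branch in `DescriptorGenerator.__getitem__` is what lets ParallelLoader pull whole batches at once. A stripped-down stand-in (pure Python, no database; the class name is ours) shows the mechanics:

```python
class ToyGenerator:
    """Minimal Subscriptable: integers stand in for (query, blobs) pairs."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            # slice.indices clamps the bounds to len(self), mirroring the
            # batch handling in DescriptorGenerator above
            return [self[i] for i in range(*idx.indices(len(self)))]
        return idx * 10

gen = ToyGenerator(7)
print(gen[2])      # a single item: 20
print(gen[5:100])  # an out-of-range slice is clamped to a batch: [50, 60]
```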
Verify the Ingestion
response, _ = client.query([{
    "FindDescriptorSet": {
        "with_name": SET_NAME,
        "results": {"count": True},
    }
}])
client.print_last_response()
[
{
"FindDescriptorSet": {
"count": 1,
"returned": 0,
"status": 0
}
}
]
Search the Bulk-Loaded Descriptors
query_text = "creamy tomato curry"
query_emb = model.encode([query_text], normalize_embeddings=True)[0].astype("float32")
response, _ = client.query([{
    "FindDescriptor": {
        "set": SET_NAME,
        "k_neighbors": 3,
        "distances": True,
        "results": {"all_properties": True},
    }
}], [query_emb.tobytes()])
for entity in response[0]["FindDescriptor"].get("entities", []):
    # With the CS metric, _distance is the cosine similarity itself
    # (higher means closer), so it can be used as the score directly.
    score = entity["_distance"]
    print(f" {entity['dish_name']:<30} [{entity['cuisine']}] score={score:.3f}")
Butter chicken [Indian] score=0.577
paneer bhurji [Indian] score=0.433
waffle, smoothie [American] score=0.427
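Under the CS metric with unit-normalized vectors, FindDescriptor's ranking can be reproduced locally with a brute-force dot-product search. The sketch below uses random stand-in vectors and our own `top_k` helper rather than the live set, so it runs on its own:

```python
import numpy as np

def top_k(query_vec, matrix, k):
    """Brute-force cosine KNN over unit-normalized rows, best match first."""
    sims = matrix @ query_vec      # dot product == cosine for unit vectors
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return order, sims[order]

rng = np.random.default_rng(1)
corpus = rng.normal(size=(20, 384)).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Querying with a stored vector should return that vector first, at similarity ~1.0
query = corpus[3]
idx, sims = top_k(query, corpus, k=3)
print(idx[0], round(float(sims[0]), 3))  # 3 1.0
```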
Cleanup
client.query([{"DeleteDescriptorSet": {"with_name": SET_NAME}}])
client.print_last_response()
[
{
"DeleteDescriptorSet": {
"count": 1,
"status": 0
}
}
]
What's Next
- Hybrid Search: combine KNN with metadata filters
- Recipe Text Search: single-item embedding flow with sentence-transformers
- Work with Descriptors: Add, Find, Update, Delete for individual descriptors
- Embeddings Extraction workflow: production-scale ingestion via the Workflows UI