
Video Embedding Models

ApertureDB stores video frame embeddings linked to the source video via graph edges. A single query retrieves matched frame metadata and the parent clip — no join required.
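As a sketch of what that graph looks like at ingestion time, the transaction below adds a video, creates a descriptor set, and stores one frame embedding connected back to the clip. This is a hedged example, not the exact workflow output: the set name `wf_embeddings_clip_video` matches the search examples later on this page, but the property names (`video_id`, `frame_number`), the connection class, and the `FaissFlat` engine choice are illustrative assumptions.

```python
# Hypothetical ingestion sketch: one transaction that creates the video,
# the descriptor set, and a frame embedding linked to the parent clip.
# Property names and the connection class are illustrative, not a schema.
ingest = [
    {"AddVideo": {
        "_ref": 1,
        "properties": {"video_id": "cooking-tutorial-001"}
    }},
    {"AddDescriptorSet": {
        "name": "wf_embeddings_clip_video",
        "dimensions": 512,      # CLIP ViT-B/32 output size
        "metric": "CS",         # cosine similarity
        "engine": "FaissFlat"
    }},
    {"AddDescriptor": {
        "set": "wf_embeddings_clip_video",
        "connect": {"ref": 1, "class": "frame_of"},  # graph edge to the video
        "properties": {"frame_number": 0}
    }}
]
# client.query(ingest, [video_bytes, frame_embedding_bytes]) would execute it,
# with the video file and the float32 embedding passed as blobs in order.
```

Because the descriptor is connected to the video in the same transaction, the later `FindDescriptor` → `FindClip` search can hop that edge without any application-side join.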

Runnable Notebooks

For setup and client configuration, see Client Configuration. For server setup options, see Server Setup.


CLIP Frame Embeddings

CLIP encodes individual video frames as 512-dimensional vectors. Once frame embeddings are stored in a DescriptorSet linked to the source video, a text query finds the most visually relevant moments across your video library.
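Before embedding, you need to decide which frames to sample. A minimal sketch of that step, assuming a fixed sampling interval: compute the frame indices up front, then feed each sampled frame through CLIP's `model.encode_image` (after `preprocess`) to get the 512-dimensional vector. The function name and the idea of storing the index as a descriptor property are illustrative choices, not part of the notebook's API.

```python
def sample_frame_indices(fps, duration_s, every_n_seconds=1.0):
    """Indices of frames to embed: one frame every `every_n_seconds`.

    Storing the frame index (or index / fps as a timestamp) as a
    descriptor property lets a search hit map back to a moment in
    the clip.
    """
    step = max(1, round(fps * every_n_seconds))
    total_frames = int(fps * duration_s)
    return list(range(0, total_frames, step))

# A 10-second clip at 30 fps, sampled once per second -> 10 frames.
indices = sample_frame_indices(fps=30.0, duration_s=10.0)
```

Each index would then be decoded (e.g. with OpenCV), run through CLIP, and stored with `AddDescriptor` as shown in the ingestion sketch.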

The VideoVectorSearch notebook walks through the complete flow: add a video, extract frames, embed with CLIP via sentence-transformers, and run text-to-frame search.

For large video libraries, the Embeddings Extraction workflow automates frame sampling, CLIP embedding, and parallel ingestion via the ApertureDB Workflows UI.

Figure: Generate and search video embeddings using the ApertureDB Workflows UI.

Search Video Frames by Text Query

Once frame embeddings are stored, use CLIP text encoding to find matching moments:

pip install -U aperturedb torch
pip install git+https://github.com/openai/CLIP.git

import clip
import torch
from aperturedb.CommonLibrary import create_connector

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
client = create_connector()

# Encode the text query with the same CLIP model used for the frame embeddings
text = clip.tokenize(["chef plating a dish"]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text).squeeze().cpu().numpy()

q = [
{"FindDescriptor": {
"set": "wf_embeddings_clip_video",
"k_neighbors": 5,
"distances": True,
"_ref": 1
}},
{"FindClip": {
"is_connected_to": {"ref": 1},
"blobs": False,
"results": {"all_properties": True}
}}
]

response, _ = client.query(q, [text_embedding.astype("float32").tobytes()])
client.print_last_response()

Filter by descriptor metadata to scope the search to a specific video or time range. This assumes each frame descriptor was stored with a video_id property at ingestion time:

q = [
{"FindDescriptor": {
"set": "wf_embeddings_clip_video",
"k_neighbors": 10,
"constraints": {"video_id": ["==", "cooking-tutorial-001"]},
"distances": True,
"_ref": 1
}},
{"FindClip": {
"is_connected_to": {"ref": 1},
"blobs": False,
"results": {"all_properties": True}
}}
]
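To scope by time range as well, the same `constraints` block can carry a numeric range. This is a hedged sketch: it assumes each descriptor was stored with a `timestamp_s` property (seconds from the start of the clip), which is an illustrative name, not a property the workflow is guaranteed to write.

```python
# Hypothetical: restrict the k-NN search to minutes 2-5 of one video,
# assuming video_id and timestamp_s were stored on each descriptor.
q = [
    {"FindDescriptor": {
        "set": "wf_embeddings_clip_video",
        "k_neighbors": 10,
        "constraints": {
            "video_id": ["==", "cooking-tutorial-001"],
            "timestamp_s": [">=", 120, "<=", 300]  # 2:00 to 5:00
        },
        "distances": True,
        "_ref": 1
    }},
    {"FindClip": {
        "is_connected_to": {"ref": 1},
        "blobs": False,
        "results": {"all_properties": True}
    }}
]
```

The constraint is applied on the server, so only descriptors inside the window compete for the k nearest slots.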

Twelve Labs

For richer multimodal video understanding — scene detection, action recognition, natural language descriptions — Twelve Labs provides state-of-the-art video embeddings that go beyond frame-level CLIP similarity. ApertureDB stores and serves these embeddings alongside your raw video and metadata.
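Storing Twelve Labs embeddings follows the same pattern as the CLIP examples, just in a separate descriptor set sized for the model. A minimal sketch, with assumptions to flag: the set name is invented, the 1024-dimension size reflects Twelve Labs Marengo video embeddings and should be verified against your model version, and the segment properties are illustrative.

```python
import numpy as np

# Stand-in for a vector returned by the Twelve Labs embed API.
twelve_labs_vector = np.random.rand(1024).astype("float32")

q = [
    {"AddDescriptorSet": {
        "name": "wf_embeddings_twelvelabs_video",   # assumed name
        "dimensions": 1024,                         # verify for your model
        "metric": "CS",
        "engine": "FaissFlat"
    }},
    {"AddDescriptor": {
        "set": "wf_embeddings_twelvelabs_video",
        "properties": {"video_id": "cooking-tutorial-001",
                       "segment_start_s": 0.0,
                       "segment_end_s": 6.0}
    }}
]
# client.query(q, [twelve_labs_vector.tobytes()]) would persist the vector.
```

Segment-level descriptors like this can coexist with the frame-level CLIP set, letting you query either granularity against the same videos.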


What's Next