
Video Embedding Models

ApertureDB stores video frame embeddings linked to the source video via graph edges. A single query retrieves matched frame metadata and the parent clip — no join required.
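As a sketch of what that graph looks like at ingestion time, the transaction below adds a video, creates a descriptor set, and stores one frame embedding connected back to the clip. This is a hedged example, not the exact workflow output: the set name `wf_embeddings_clip_video` matches the search examples later on this page, but the property names (`video_id`, `frame_number`), the connection class, and the `FaissFlat` engine choice are illustrative assumptions.

```python
# Hypothetical ingestion sketch: one transaction that creates the video,
# the descriptor set, and a frame embedding linked to the parent clip.
# Property names and the connection class are illustrative, not a schema.
ingest = [
    {"AddVideo": {
        "_ref": 1,
        "properties": {"video_id": "cooking-tutorial-001"}
    }},
    {"AddDescriptorSet": {
        "name": "wf_embeddings_clip_video",
        "dimensions": 512,      # CLIP ViT-B/32 output size
        "metric": "CS",         # cosine similarity
        "engine": "FaissFlat"
    }},
    {"AddDescriptor": {
        "set": "wf_embeddings_clip_video",
        "connect": {"ref": 1, "class": "frame_of"},  # graph edge to the video
        "properties": {"frame_number": 0}
    }}
]
# client.query(ingest, [video_bytes, frame_embedding_bytes]) would execute it,
# with the video file and the float32 embedding passed as blobs in order.
```

Because the descriptor is connected to the video in the same transaction, the later `FindDescriptor` → `FindClip` search can hop that edge without any application-side join.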

Runnable Notebooks

For setup and client configuration, see Client Configuration. For server setup options, see Server Setup.


CLIP Frame Embeddings

CLIP encodes individual video frames as 512-dimensional vectors. Once frame embeddings are stored in a DescriptorSet linked to the source video, a text query finds the most visually relevant moments across your video library.
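Before embedding, you need to decide which frames to sample. A minimal sketch of that step, assuming a fixed sampling interval: compute the frame indices up front, then feed each sampled frame through CLIP's `model.encode_image` (after `preprocess`) to get the 512-dimensional vector. The function name and the idea of storing the index as a descriptor property are illustrative choices, not part of the notebook's API.

```python
def sample_frame_indices(fps, duration_s, every_n_seconds=1.0):
    """Indices of frames to embed: one frame every `every_n_seconds`.

    Storing the frame index (or index / fps as a timestamp) as a
    descriptor property lets a search hit map back to a moment in
    the clip.
    """
    step = max(1, round(fps * every_n_seconds))
    total_frames = int(fps * duration_s)
    return list(range(0, total_frames, step))

# A 10-second clip at 30 fps, sampled once per second -> 10 frames.
indices = sample_frame_indices(fps=30.0, duration_s=10.0)
```

Each index would then be decoded (e.g. with OpenCV), run through CLIP, and stored with `AddDescriptor` as shown in the ingestion sketch.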

The VideoVectorSearch notebook walks through the complete flow: add a video, extract frames, embed with CLIP via sentence-transformers, and run text-to-frame search.

For large video libraries, the Embeddings Extraction workflow automates frame sampling, CLIP embedding, and parallel ingestion via the ApertureDB Workflows UI.

Figure: Generate and search video embeddings using the ApertureDB Workflows UI.

Search Video Frames by Text Query

Once frame embeddings are stored, use CLIP text encoding to find matching moments:

pip install -U aperturedb torch
pip install git+https://github.com/openai/CLIP.git

import clip
import torch
from aperturedb.CommonLibrary import create_connector

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
client = create_connector()

# Encode the text query with the same CLIP model used for the frame embeddings
text = clip.tokenize(["chef plating a dish"]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text).squeeze().cpu().numpy()

q = [
{"FindDescriptor": {
"set": "wf_embeddings_clip_video",
"k_neighbors": 5,
"distances": True,
"_ref": 1
}},
{"FindClip": {
"is_connected_to": {"ref": 1},
"blobs": False,
"results": {"all_properties": True}
}}
]

response, _ = client.query(q, [text_embedding.astype("float32").tobytes()])
client.print_last_response()

Filter by descriptor metadata to scope the search to a specific video or time range. This assumes each frame descriptor was stored with a video_id property at ingestion time:

q = [
{"FindDescriptor": {
"set": "wf_embeddings_clip_video",
"k_neighbors": 10,
"constraints": {"video_id": ["==", "cooking-tutorial-001"]},
"distances": True,
"_ref": 1
}},
{"FindClip": {
"is_connected_to": {"ref": 1},
"blobs": False,
"results": {"all_properties": True}
}}
]
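To scope by time range as well, the same `constraints` block can carry a numeric range. This is a hedged sketch: it assumes each descriptor was stored with a `timestamp_s` property (seconds from the start of the clip), which is an illustrative name, not a property the workflow is guaranteed to write.

```python
# Hypothetical: restrict the k-NN search to minutes 2-5 of one video,
# assuming video_id and timestamp_s were stored on each descriptor.
q = [
    {"FindDescriptor": {
        "set": "wf_embeddings_clip_video",
        "k_neighbors": 10,
        "constraints": {
            "video_id": ["==", "cooking-tutorial-001"],
            "timestamp_s": [">=", 120, "<=", 300]  # 2:00 to 5:00
        },
        "distances": True,
        "_ref": 1
    }},
    {"FindClip": {
        "is_connected_to": {"ref": 1},
        "blobs": False,
        "results": {"all_properties": True}
    }}
]
```

The constraint is applied on the server, so only descriptors inside the window compete for the k nearest slots.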

Twelve Labs

For richer multimodal video understanding — scene detection, action recognition, natural language descriptions — Twelve Labs provides state-of-the-art video embeddings that go beyond frame-level CLIP similarity. ApertureDB stores and serves these embeddings alongside your raw video and metadata.
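Storing Twelve Labs embeddings follows the same pattern as the CLIP examples, just in a separate descriptor set sized for the model. A minimal sketch, with assumptions to flag: the set name is invented, the 1024-dimension size reflects Twelve Labs Marengo video embeddings and should be verified against your model version, and the segment properties are illustrative.

```python
import numpy as np

# Stand-in for a vector returned by the Twelve Labs embed API.
twelve_labs_vector = np.random.rand(1024).astype("float32")

q = [
    {"AddDescriptorSet": {
        "name": "wf_embeddings_twelvelabs_video",   # assumed name
        "dimensions": 1024,                         # verify for your model
        "metric": "CS",
        "engine": "FaissFlat"
    }},
    {"AddDescriptor": {
        "set": "wf_embeddings_twelvelabs_video",
        "properties": {"video_id": "cooking-tutorial-001",
                       "segment_start_s": 0.0,
                       "segment_end_s": 6.0}
    }}
]
# client.query(q, [twelve_labs_vector.tobytes()]) would persist the vector.
```

Segment-level descriptors like this can coexist with the frame-level CLIP set, letting you query either granularity against the same videos.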


What's Next