Video Embedding Models
ApertureDB stores video frame embeddings linked to the source video via graph edges. A single query retrieves matched frame metadata and the parent clip — no join required.
- Video Frame Search — CLIP on extracted frames, text-to-frame search
- Work with Videos — add, find, update, and delete videos
- Clips and Frames — extract clips, sample frames, apply operations
For setup and client configuration, see Client Configuration. For server setup options, see Server Setup.
CLIP Frame Embeddings
CLIP encodes individual video frames as 512-dimensional vectors. Once frame embeddings are stored in a DescriptorSet linked to the source video, a text query finds the most visually relevant moments across your video library.
The VideoVectorSearch notebook walks through the complete flow: add a video, extract frames, embed with CLIP via sentence-transformers, and run text-to-frame search.
For large video libraries, the Embeddings Extraction workflow automates frame sampling, CLIP embedding, and parallel ingestion via the ApertureDB Workflows UI.
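For readers wiring up ingestion by hand rather than through the workflow, the flow can be sketched as a single transaction: find the source video, then attach one descriptor per frame embedding to it. The helper below builds that transaction; the `video_id` and `frame_number` property names are illustrative assumptions, not a required schema.

```python
import struct

def build_frame_ingest_query(video_id, frame_embeddings,
                             set_name="wf_embeddings_clip_video"):
    """Build an ApertureDB transaction that finds the source video and
    attaches one descriptor per frame embedding to it via a graph edge.
    `frame_embeddings` is a list of (frame_number, vector) pairs, where
    each vector is a list of floats (512 for CLIP ViT-B/32)."""
    query = [
        # Reference the already-ingested video (property name is illustrative)
        {"FindVideo": {
            "_ref": 1,
            "constraints": {"video_id": ["==", video_id]}
        }}
    ]
    blobs = []
    for frame_number, vector in frame_embeddings:
        query.append({"AddDescriptor": {
            "set": set_name,
            "properties": {"video_id": video_id, "frame_number": frame_number},
            "connect": {"ref": 1}  # edge back to the source video
        }})
        blobs.append(struct.pack(f"{len(vector)}f", *vector))  # float32 bytes
    return query, blobs
```

Because the descriptors are connected to the video at ingest time, the later `FindDescriptor` → `FindClip` search can follow that edge in one query.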
Search Video Frames by Text Query
Once frame embeddings are stored, use CLIP text encoding to find matching moments:
```shell
pip install -U aperturedb torch
pip install git+https://github.com/openai/CLIP.git
```
```python
import clip
import torch

from aperturedb.CommonLibrary import create_connector

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
client = create_connector()

# Encode the text query with the same CLIP model used for the frames
text = clip.tokenize(["chef plating a dish"]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text).squeeze().cpu().numpy()

q = [
    {"FindDescriptor": {
        "set": "wf_embeddings_clip_video",
        "k_neighbors": 5,
        "distances": True,
        "_ref": 1
    }},
    {"FindClip": {
        "is_connected_to": {"ref": 1},
        "blobs": False,
        "results": {"all_properties": True}
    }}
]

response, _ = client.query(q, [text_embedding.astype("float32").tobytes()])
client.print_last_response()
```
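The two-command query above returns the matched frame descriptors (with distances) and the connected clips in one response. A minimal sketch of pairing them back up, assuming the response shape of that exact query and that `distances: True` surfaces a `_distance` property on each descriptor entity:

```python
def pair_matches(response):
    """Pair each matched frame descriptor with the clip entities returned
    by the second command. Assumes the two-command query shape above:
    FindDescriptor first, FindClip second."""
    descriptors = response[0]["FindDescriptor"].get("entities", [])
    clips = response[1]["FindClip"].get("entities", [])
    return [
        {"distance": d.get("_distance"), "clip": c}
        for d, c in zip(descriptors, clips)
    ]
```

This keeps result handling independent of how the search query was built: anything that returns the same two-command response can reuse it.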
Filter by video metadata to scope search to a specific video or time range:
```python
q = [
    {"FindDescriptor": {
        "set": "wf_embeddings_clip_video",
        "k_neighbors": 10,
        "constraints": {"video_id": ["==", "cooking-tutorial-001"]},
        "distances": True,
        "_ref": 1
    }},
    {"FindClip": {
        "is_connected_to": {"ref": 1},
        "blobs": False,
        "results": {"all_properties": True}
    }}
]
```
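When the scope varies at runtime, it can help to build the query programmatically. A sketch of a small builder, assuming the same `FindDescriptor`/`FindClip` shape as above and illustrative `video_id` and `frame_number` descriptor properties for the video and frame-window filters:

```python
def frame_search_query(set_name, k, video_id=None,
                       start_frame=None, end_frame=None):
    """Build a FindDescriptor + FindClip query, optionally scoped to one
    video and/or a frame-number window. Property names are illustrative."""
    constraints = {}
    if video_id is not None:
        constraints["video_id"] = ["==", video_id]
    bounds = []
    if start_frame is not None:
        bounds += [">=", start_frame]
    if end_frame is not None:
        bounds += ["<=", end_frame]
    if bounds:
        constraints["frame_number"] = bounds

    find_descriptor = {
        "set": set_name,
        "k_neighbors": k,
        "distances": True,
        "_ref": 1
    }
    if constraints:
        find_descriptor["constraints"] = constraints
    return [
        {"FindDescriptor": find_descriptor},
        {"FindClip": {
            "is_connected_to": {"ref": 1},
            "blobs": False,
            "results": {"all_properties": True}
        }}
    ]
```

The returned list is passed to `client.query()` together with the text embedding blob, exactly as in the unfiltered example.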
Twelve Labs
For richer multimodal video understanding — scene detection, action recognition, natural language descriptions — Twelve Labs provides state-of-the-art video embeddings that go beyond frame-level CLIP similarity. ApertureDB stores and serves these embeddings alongside your raw video and metadata.
- Twelve Labs + ApertureDB blog post — using Twelve Labs video embeddings with ApertureDB for semantic video search
What's Next
- Video Frame Search notebook — CLIP frame embeddings end-to-end
- Embeddings Extraction workflow — no-code frame embedding for large video libraries
- Bulk Embedding Ingestion — parallel ingestion with `ParallelLoader`