
Ingest Data and Metadata

ApertureDB allows users to store metadata as well as data such as images, videos, audio, documents, or other unstructured types. Several methods are available for data ingestion via the Python SDK.

Ultimately, each of the following methods generates a set of JSON queries that are then run against a specified ApertureDB instance.

Let's go over these scenarios and the subtle nuances of each one. They are ordered by the increasing level of familiarity with ApertureDB that they require.

Ingestion via Croissant URL

If the dataset already exists and is described via Croissant, the task of storing it in ApertureDB is much simpler than with any of the following methods.

Objects Added to the Database

After a successful Croissant ingestion, the following types of objects are typically added to ApertureDB:

  • DatasetModel: Represents the ingested dataset as a whole.
  • RecordsetModel: Groups records belonging to the dataset.
  • Entity: Each data item (e.g., row, record) is stored as an entity.
  • Image: If the dataset contains images, these are stored as image objects.
  • Blob: Any binary data (e.g., files, documents) is stored as blobs.
info

The caveat is that the representation tries to mimic the way the original author published the dataset. This can mean non-normalized data, and not all referenced artifacts are automatically ingested.

Ingestion based on the data model

If the schema can be expressed as a Pydantic model, one possibility is to reuse that existing schema with a different base class, or with one added as a mixin.

Pros:

  • It may be the quickest option to get started.

Cons:

  • This is the least tested option as of now; some things might not work out of the box.

Example Usage:
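A minimal sketch of what this could look like, assuming Pydantic v2 and the create_connector helper from the SDK. The Person class and its fields are illustrative, and the SDK's dedicated model base class (see the SDK reference) would replace the manual AddEntity fallback shown here:

```python
from pydantic import BaseModel
from aperturedb.CommonLibrary import create_connector

# Illustrative schema; swap in (or mix in) the SDK's model base class
# to get ApertureDB-aware persistence instead of the manual query below.
class Person(BaseModel):
    name: str
    email: str

db = create_connector()  # connection details come from the environment / adb config
person = Person(name="Ada", email="ada@example.com")

# Fallback shown for clarity: store the model instance as an entity
# using a raw AddEntity command built from the Pydantic field values.
db.query([{
    "AddEntity": {
        "class": "Person",
        "properties": person.model_dump(),
    }
}])
```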

Ingestion using data from CSV files

This is the most thoroughly tested way of getting data into ApertureDB. There is a specific CSV format corresponding to each of the object types that ApertureDB natively understands.

These are the currently supported implementations of the Parsers:

ADB Object type    CSV format to be used
BLOB               BlobDataCSV
BOUNDING_BOX       BBoxDataCSV
CONNECTION         ConnectionDataCSV
DESCRIPTOR         DescriptorDataCSV
DESCRIPTORSET      DescriptorSetDataCSV
ENTITY             EntityDataCSV
IMAGE              ImageDataCSV
POLYGON            PolygonDataCSV
VIDEO              VideoDataCSV

Pros:

  • Most reliable
  • Only an understanding of the relevant CSV format is required.

Cons:

  • This introduces a need to generate a CSV as an intermediate step.
  • Users need to flatten their data even when it is hierarchical.

Example Usage:
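A minimal sketch of loading an image CSV with the SDK's parallel loader. The file name images.csv and the batching parameters are placeholders, and the CSV must follow the ImageDataCSV column conventions:

```python
from aperturedb.CommonLibrary import create_connector
from aperturedb.ImageDataCSV import ImageDataCSV
from aperturedb.ParallelLoader import ParallelLoader

# images.csv must follow the ImageDataCSV format (a filename column
# pointing at the image files, plus any property columns).
data = ImageDataCSV("images.csv")

db = create_connector()  # connection details from the environment / adb config
loader = ParallelLoader(db)
loader.ingest(data, batchsize=100, numthreads=4, stats=True)
```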

Custom defined data generators

These are the most free-form generators supported. They are simply subclasses of Subscriptable and implement a very simple interface, but the flexibility they offer makes them suitable for plugging arbitrary sources into ApertureDB and performing bespoke customizations.

Example implementations:
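As a sketch, a generator over an in-memory list of dictionaries might look like the following; the class name, entity class, and record layout are illustrative:

```python
from aperturedb.Subscriptable import Subscriptable

class PersonGenerator(Subscriptable):
    """Turns a list of dictionaries into AddEntity queries,
    one (query, blobs) pair per item."""

    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def getitem(self, idx):
        query = [{
            "AddEntity": {
                "class": "Person",
                "properties": self.records[idx],
            }
        }]
        blobs = []  # no binary payload for plain entities
        return query, blobs
```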

Pros:

  • Most flexible in terms of the queries that are generated.
  • Can be used to plug arbitrary data sources into ApertureDB.

Cons:

  • Needs a good understanding of ApertureDB's query language.

Example Usage:
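Continuing the sketch above, the generator can be handed to the parallel loader like any other data source (names and parameters are illustrative):

```python
from aperturedb.CommonLibrary import create_connector
from aperturedb.ParallelLoader import ParallelLoader

people = [{"name": "Ada"}, {"name": "Grace"}]

loader = ParallelLoader(create_connector())
loader.ingest(PersonGenerator(people), batchsize=50, numthreads=2, stats=True)
```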

tip

The ApertureDB command line tool, adb, provides subcommands to ingest data from CSV files or from data generators.

Ingest data using JSON commands

The JSON-based native query commands provide Add and Update methods for all the object types supported in ApertureDB. These can be used from any of our Python / C++ clients, Jupyter notebooks, or even our Web UI (custom query tab) to ingest small amounts of data. However, achieving high ingestion throughput and handling all the response types would require a lot of work, which is why we offer the methods above, along with ParallelQuery and the generator option.

Pros:

  • Most flexible in terms of the queries that are generated.

Cons:

  • Needs a good understanding of ApertureDB's query language.
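
Example Usage:

A minimal sketch of adding a single entity with a raw JSON command from the Python client; the class name and properties are illustrative:

```python
from aperturedb.CommonLibrary import create_connector

db = create_connector()  # connection details from the environment / adb config

query = [{
    "AddEntity": {
        "class": "Person",
        "properties": {"name": "Ada", "email": "ada@example.com"},
    }
}]

response, blobs = db.query(query)
print(response)  # e.g. [{"AddEntity": {"status": 0}}]
```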

Matrix of choices vs implications

Option             QL familiarity    DataCSV familiarity    Maturity
Croissant URL      low               low                    low
Model              low               low                    low
CSV Parser         medium            high                   high
Query Generator    high              low                    high
JSON Commands      high              low                    high
tip

Talk to us about cron jobs or Airflow ingestion pipelines for setting up periodic data loading or updates.