
Ingest Data and Metadata

ApertureDB allows users to store metadata as well as data such as images, videos, audio, documents, or other unstructured types. Several methods are available for data ingestion via the Python SDK.

Ultimately, each of the following methods generates a set of JSON queries that are then run against a specified ApertureDB instance.

Let's go over these scenarios and the subtle nuances of each one. They are ordered by the increasing level of familiarity with ApertureDB that they require.

Ingestion via Croissant URL

If the dataset already exists and is described via Croissant, the task of storing it in ApertureDB is much simpler than with any of the following methods.

Objects Added to the Database

After a successful Croissant ingestion, the following types of objects are typically added to ApertureDB:

  • DatasetModel: Represents the ingested dataset as a whole.
  • RecordsetModel: Groups records belonging to the dataset.
  • Entity: Each data item (e.g., row, record) is stored as an entity.
  • Image: If the dataset contains images, these are stored as image objects.
  • Blob: Any binary data (e.g., files, documents) is stored as blobs.
info

The caveat is that the representation tries to mimic the way the original author published the dataset. This can mean non-normalized data, and not all referenced artifacts are automatically ingested.

Ingestion based on the data model

If the schema can be expressed as a Pydantic model, one possibility is to reuse that existing schema with a different base class, or with one added as a mixin.

Pros:

  • It may be the quickest option to get started.

Cons:

  • This is the least tested option as of now; some things might not work out of the box.

Example Usage:
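A minimal sketch of what this could look like, assuming Pydantic v2 and the create_connector helper from the SDK. The Person class and its fields are illustrative, and the SDK's dedicated model base class (see the SDK reference) would replace the manual AddEntity fallback shown here:

```python
from pydantic import BaseModel
from aperturedb.CommonLibrary import create_connector

# Illustrative schema; swap in (or mix in) the SDK's model base class
# to get ApertureDB-aware persistence instead of the manual query below.
class Person(BaseModel):
    name: str
    email: str

db = create_connector()  # connection details come from the environment / adb config
person = Person(name="Ada", email="ada@example.com")

# Fallback shown for clarity: store the model instance as an entity
# using a raw AddEntity command built from the Pydantic field values.
db.query([{
    "AddEntity": {
        "class": "Person",
        "properties": person.model_dump(),
    }
}])
```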

Ingestion using data from CSV files

This is the most thoroughly tested way of getting data into ApertureDB. There is a specific CSV format corresponding to each of the object types that ApertureDB natively understands.

These are the currently supported implementations of the Parsers:

ADB Object type    CSV format to be used
BLOB               BlobDataCSV
BOUNDING_BOX       BBoxDataCSV
CONNECTION         ConnectionDataCSV
DESCRIPTOR         DescriptorDataCSV
DESCRIPTORSET      DescriptorSetDataCSV
ENTITY             EntityDataCSV
IMAGE              ImageDataCSV
POLYGON            PolygonDataCSV
VIDEO              VideoDataCSV

Pros:

  • Most reliable
  • Only an understanding of the relevant CSV format is required.

Cons:

  • This introduces a need to generate a CSV as an intermediate step.
  • Users need to flatten their data even when it is hierarchical.

Example Usage:
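A minimal sketch of loading an image CSV with the SDK's parallel loader. The file name images.csv and the batching parameters are placeholders, and the CSV must follow the ImageDataCSV column conventions:

```python
from aperturedb.CommonLibrary import create_connector
from aperturedb.ImageDataCSV import ImageDataCSV
from aperturedb.ParallelLoader import ParallelLoader

# images.csv must follow the ImageDataCSV format (a filename column
# pointing at the image files, plus any property columns).
data = ImageDataCSV("images.csv")

db = create_connector()  # connection details from the environment / adb config
loader = ParallelLoader(db)
loader.ingest(data, batchsize=100, numthreads=4, stats=True)
```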

Custom defined data generators

These are the most free-form generators supported. They are simply subclasses of Subscriptable and implement a very simple interface, but the flexibility they offer makes them suitable for plugging arbitrary sources into ApertureDB and performing bespoke customizations.

Example implementations:
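As a sketch, a generator over an in-memory list of dictionaries might look like the following; the class name, entity class, and record layout are illustrative:

```python
from aperturedb.Subscriptable import Subscriptable

class PersonGenerator(Subscriptable):
    """Turns a list of dictionaries into AddEntity queries,
    one (query, blobs) pair per item."""

    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def getitem(self, idx):
        query = [{
            "AddEntity": {
                "class": "Person",
                "properties": self.records[idx],
            }
        }]
        blobs = []  # no binary payload for plain entities
        return query, blobs
```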

Pros:

  • Most flexible in terms of the queries that are generated.
  • Can be used to plug arbitrary data sources into ApertureDB.

Cons:

  • Needs a good understanding of ApertureDB's query language.

Example Usage:
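Continuing the sketch above, the generator can be handed to the parallel loader like any other data source (names and parameters are illustrative):

```python
from aperturedb.CommonLibrary import create_connector
from aperturedb.ParallelLoader import ParallelLoader

people = [{"name": "Ada"}, {"name": "Grace"}]

loader = ParallelLoader(create_connector())
loader.ingest(PersonGenerator(people), batchsize=50, numthreads=2, stats=True)
```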

tip

The ApertureDB command line tool, adb, provides subcommands to ingest data from CSV files or from data generators.

Ingest data using JSON commands

The JSON-based native query commands provide Add and Update methods for all the object types supported in ApertureDB. These can be used from any of our Python / C++ clients, Jupyter notebooks, or even our Web UI (custom query tab) to ingest small amounts of data. However, achieving high ingestion throughput and handling all the response types would require a lot of work, which is why we offer the methods above, along with ParallelQuery and the generator option.

Pros:

  • Most flexible in terms of the queries that are generated.

Cons:

  • Needs a good understanding of ApertureDB's query language.
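
Example Usage:

A minimal sketch of adding a single entity with a raw JSON command from the Python client; the class name and properties are illustrative:

```python
from aperturedb.CommonLibrary import create_connector

db = create_connector()  # connection details from the environment / adb config

query = [{
    "AddEntity": {
        "class": "Person",
        "properties": {"name": "Ada", "email": "ada@example.com"},
    }
}]

response, blobs = db.query(query)
print(response)  # e.g. [{"AddEntity": {"status": 0}}]
```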

Matrix of choices vs implications

Option             QL familiarity    DataCSV familiarity    Maturity
Croissant URL      low               low                    low
Model              low               low                    low
CSV Parser         medium            high                   high
Query Generator    high              low                    high
JSON Commands      high              low                    high
tip

Talk to us about cron jobs or Airflow ingestion pipelines for setting up periodic data loading or updates.