Relational
The aindo.rdml.relational module allows you to:
- Load relational data structures consisting of one or multiple tables;
- Define column types and the relational structure, utilizing primary and foreign keys.
Schema
A Schema object is a collection of named Table objects, also containing the information about the relations among tables.
Each Table object contains the columns of interest of that table.
There are two primary types of columns:
- PrimaryKey’s and ForeignKey’s, which define the relational structure of the data.
- Feature Column’s, namely columns that are not keys.
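To make the role of the two key types concrete, the following sketch checks the constraints they imply using plain pandas. It is not part of aindo.rdml: the helper name check_keys and the toy rows are assumptions made up for illustration.

```python
import pandas as pd


def check_keys(parent: pd.DataFrame, child: pd.DataFrame, key: str) -> bool:
    """Return True if `key` is a valid primary key of `parent` and every
    non-null value of `key` in `child` refers to an existing parent row."""
    pk = parent[key]
    # A primary key has no missing values and no duplicates.
    pk_ok = pk.notna().all() and pk.is_unique
    # A foreign key may only contain values that exist in the parent table.
    fk = child[key].dropna()
    fk_ok = fk.isin(pk).all()
    return bool(pk_ok and fk_ok)


# Toy rows invented for illustration only.
players = pd.DataFrame({"playerID": ["p1", "p2", "p3"]})
season = pd.DataFrame({"playerID": ["p1", "p1", "p3"], "points": [10, 12, 7]})
orphan = pd.DataFrame({"playerID": ["p1", "p9"]})  # "p9" has no parent row

print(check_keys(players, season, "playerID"))  # consistent parent/child pair
print(check_keys(players, orphan, "playerID"))  # child refers to a missing parent
```

A check of this kind can be useful before declaring keys, since a child row whose key does not match any parent row cannot be placed in the relational structure.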
Feature columns
When building a Schema, each feature column is assigned a Column type.
The assigned type tells the various routines of the library how to treat the data in that column.
For example, a Column.CATEGORICAL column is preprocessed differently from a Column.INTEGER column before being fed to the generative model during training (more info in the ColumnPreproc section).
It also appears differently in the evaluation report (more info in the Synthetic data report section).
The available Column types are: BOOLEAN, CATEGORICAL, NUMERIC, INTEGER, DATE, TIME, DATETIME, COORDINATES, ITAFISCALCODE, and TEXT.
Building a Schema
To illustrate how to build a Schema from scratch, let us work with (a subset of) the BasketballMen dataset, which consists of the following tables:
- players: The root table, with the primary key playerID.
- season: A child table of players, linked via the foreign key playerID.
- all_star: Another child table of players, connected by the foreign key playerID.
Let us load the tables with pandas and gather the pandas.DataFrame objects into a dictionary:
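A minimal sketch of this loading step, assuming each table is stored as a CSV file named after the table (players.csv, season.csv, all_star.csv) in a single directory. The file names, the helper load_tables and the directory path are hypothetical and should be adapted to wherever the dataset actually lives.

```python
from pathlib import Path

import pandas as pd

# Table names taken from the dataset description above.
TABLE_NAMES = ("players", "season", "all_star")


def load_tables(data_dir: Path) -> dict[str, pd.DataFrame]:
    """Read one CSV file per table and return the frames keyed by table name.

    Assumes files named `<table>.csv` inside `data_dir` (hypothetical layout).
    """
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in TABLE_NAMES}


# Example call (placeholder path, replace with the real data directory):
# data = load_tables(Path("path/to/data/dir"))
```

The dictionary keys are the table names, so they should match the names later used for the tables in the Schema.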
To build a Schema, users must import the Column, PrimaryKey, ForeignKey, Table, and Schema objects from the aindo.rdml.relational module.
Tables and columns that are present in the data but are not included in the Schema will be ignored.
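The effect of ignoring undeclared tables and columns can be illustrated with plain pandas. The sketch below is not how the library does it internally: the dictionary declared is a hypothetical stand-in for a Schema, and the column names and rows are invented for the example.

```python
import pandas as pd

# Hypothetical stand-in for a schema: table name -> columns of interest.
declared = {
    "players": ["playerID", "pos"],
    "season": ["playerID", "points"],
}


def restrict(
    data: dict[str, pd.DataFrame], declared: dict[str, list[str]]
) -> dict[str, pd.DataFrame]:
    """Keep only the declared tables and, within them, the declared columns."""
    return {name: data[name][cols] for name, cols in declared.items()}


# Toy rows invented for illustration only.
data = {
    "players": pd.DataFrame({"playerID": ["p1"], "pos": ["G"], "notes": ["x"]}),
    "season": pd.DataFrame({"playerID": ["p1"], "points": [10]}),
    "all_star": pd.DataFrame({"playerID": ["p1"]}),
}

kept = restrict(data, declared)
print(sorted(kept))                   # the undeclared all_star table is dropped
print(list(kept["players"].columns))  # the undeclared notes column is dropped
```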
RelationalData
A RelationalData object is defined by combining the loaded data and a Schema object:
The RelationalData.split() method splits the data into train, test, and optionally validation sets, while preserving the consistency of the relational data structure.
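To illustrate what a split that respects the relational structure means, here is a sketch in plain pandas. It is not the implementation of RelationalData.split(): the function split_relational, its parameters and the toy rows are assumptions, and it covers only the simple case where every table carries the same key column, as in the example dataset.

```python
import pandas as pd


def split_relational(
    data: dict[str, pd.DataFrame],
    root: str,
    key: str,
    test_frac: float,
    seed: int = 0,
) -> tuple[dict[str, pd.DataFrame], dict[str, pd.DataFrame]]:
    """Split the root table by rows, then send every row of every table
    to the same side as the root row it refers to."""
    # Sample the root rows that go to the test set.
    test_ids = data[root].sample(frac=test_frac, random_state=seed)[key]
    train, test = {}, {}
    for name, df in data.items():
        in_test = df[key].isin(test_ids)
        train[name] = df[~in_test]
        test[name] = df[in_test]
    return train, test


# Toy rows invented for illustration only.
data = {
    "players": pd.DataFrame({"playerID": ["p1", "p2", "p3", "p4"]}),
    "season": pd.DataFrame({"playerID": ["p1", "p1", "p2", "p3", "p4", "p4"]}),
    "all_star": pd.DataFrame({"playerID": ["p2", "p4"]}),
}

train, test = split_relational(data, root="players", key="playerID", test_frac=0.5)
print(len(train["players"]), len(test["players"]))
```

Splitting on the root table first guarantees that no child row ends up in a different set from its parent, which a naive row-wise split of each table would not.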