Relational module
This module allows you to:
- Load relational data structures consisting of one or multiple tables;
- Define column types and the relational structure, utilizing primary and foreign keys.
Column types
The aindo.rdml library uses pd.DataFrame objects as inputs and outputs.
Each column of the dataset must be transformed before feeding it to the generative model.
These transformations depend on how we want to treat the column during the training and generation process
(categorical, numerical, datetime, coordinates, …).
However, this cannot be inferred solely from the pandas data types of the corresponding columns.
Therefore, each column in the dataset needs to be declared as a specific column type.
The available Column types are: BOOLEAN, CATEGORICAL, COORDINATES, DATE, DATETIME, INTEGER, NUMERIC, TEXT and TIME.
For relational datasets, columns can also be declared as PrimaryKey() or ForeignKey(parent=...).
Notice that there is a fundamental distinction between pandas data types and aindo.rdml column types, since two columns with the same pandas data type could be interpreted by a model as different Column types depending on the context.
For instance, a column with an integer pandas data type could be interpreted either as a Column.INTEGER or as a Column.CATEGORICAL.
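The distinction can be seen with a small pandas-only sketch. The column names games_played and team_code are hypothetical, and the plain strings in the last dictionary merely stand in for Column.INTEGER and Column.CATEGORICAL; no aindo.rdml call is made here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "games_played": [82, 79, 65, 82],  # a count: naturally treated as a number
        "team_code": [3, 1, 3, 2],         # an arbitrary label: naturally a category
    }
)

# pandas reports the same data type for both columns.
dtypes = df.dtypes.astype(str).to_dict()
print(dtypes)

# The intended interpretation therefore has to be declared separately.
# Plain strings are used here as stand-ins for the Column types.
intended = {"games_played": "INTEGER", "team_code": "CATEGORICAL"}
print(intended)
```

Both columns share one pandas data type, so nothing in the DataFrame itself tells the two interpretations apart.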
The following table reports the compatibility between pandas data types and Column types:
Column type / pandas type | bool | int | float | datetime | str | obj |
---|---|---|---|---|---|---|
BOOLEAN/CATEGORICAL | ✓ | ✓ | ✓ | ✓ | ✓ | str |
COORDINATES | ✗ | ✗ | ✗ | ✗ | ✓ | str |
INTEGER/NUMERIC | ✓ | ✓ | ✓ | ✓ | ✓* | ✓* |
DATE/TIME/DATETIME | ✗ | ✗ | ✗ | ✓ | ✓* | ✓* |
TEXT | ✗ | ✗ | ✗ | ✗ | ✓ | str |
✓* = may raise an error depending on column content
str = internally converted to a string (does not preserve the object in the synthesized data)
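As an analogy for the ✓* entries, the sketch below uses pandas' own pd.to_numeric to show how the same str column can convert cleanly or fail depending on its content. This is only an illustration of content-dependent conversion; it is not the conversion that aindo.rdml performs internally, and the sample values are made up.

```python
import pandas as pd

clean = pd.Series(["1.5", "2", "3.25"])
dirty = pd.Series(["1.5", "n/a", "3.25"])

# Every entry parses as a number, so the conversion succeeds.
converted = pd.to_numeric(clean)

# One entry does not parse; with the default errors="raise" pandas raises ValueError.
try:
    pd.to_numeric(dirty)
    dirty_ok = True
except ValueError:
    dirty_ok = False

print(converted.tolist(), dirty_ok)
```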
Schema
A Schema object contains two primary components:
- PrimaryKey’s and ForeignKey’s, defining the relational data structure.
- Column types, assigned to each column that is not a key.
To illustrate, let us work with the BasketballMan dataset, which consists of the following tables:
- players: The root table with the primary key ‘playerID’.
- season: A child table of players linked via the foreign key ‘playerID’.
- all_star: Another child table of players connected by the foreign key ‘playerID’.
Let us load the tables with pandas and gather the resulting pd.DataFrame objects into a dictionary keyed by table name.
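A minimal sketch of this loading step follows. The directory, the file names and every column other than playerID are assumptions made for illustration; the toy CSV files are written to a temporary directory only so that the snippet runs on its own. With the real dataset, data_dir would point to wherever the BasketballMan files are stored.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data location: a temporary directory filled with tiny toy tables.
data_dir = Path(tempfile.mkdtemp())
toy = {
    "players": pd.DataFrame({"playerID": ["p1", "p2"], "pos": ["G", "C"]}),
    "season": pd.DataFrame({"playerID": ["p1", "p1", "p2"], "year": [2001, 2002, 2001]}),
    "all_star": pd.DataFrame({"playerID": ["p2"], "points": [12]}),
}
for name, df in toy.items():
    df.to_csv(data_dir / f"{name}.csv", index=False)

# Load each table and gather the DataFrames into a dictionary keyed by table name.
table_names = ["players", "season", "all_star"]
dfs = {name: pd.read_csv(data_dir / f"{name}.csv") for name in table_names}

for name, df in dfs.items():
    print(name, df.shape)
```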
To build a Schema from scratch, users must import Column, PrimaryKey, ForeignKey, Table, and Schema objects from the aindo.rdml package.
Columns that are present in the data but are not included in the Schema will be ignored.
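To make concrete what a schema expresses, the sketch below describes the layout with a plain dictionary and checks it against toy data using pandas only. It deliberately does not use the Schema, Table, PrimaryKey or ForeignKey objects, whose exact constructor signatures are not shown here; the layout dictionary, the role strings and all columns other than playerID are hypothetical.

```python
import pandas as pd

dfs = {
    "players": pd.DataFrame(
        {"playerID": ["p1", "p2"], "pos": ["G", "C"], "note": ["x", "y"]}
    ),
    "season": pd.DataFrame(
        {"playerID": ["p1", "p1", "p2"], "year": [2001, 2002, 2001]}
    ),
}

# Plain-dict stand-in for a schema: table name -> {column name: declared role}.
layout = {
    "players": {"playerID": "primary key", "pos": "CATEGORICAL"},
    "season": {"playerID": "foreign key -> players", "year": "INTEGER"},
}

# Columns present in the data but not declared: these are the ones that get ignored.
undeclared = {
    table: sorted(set(df.columns) - set(layout[table])) for table, df in dfs.items()
}

# Basic relational checks implied by the key declarations.
pk_unique = bool(dfs["players"]["playerID"].is_unique)
fk_valid = bool(dfs["season"]["playerID"].isin(dfs["players"]["playerID"]).all())

print(undeclared, pk_unique, fk_valid)
```

Here the note column of players is not declared, so it is the only column that would be left out.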
RelationalData
A RelationalData object is defined by combining the loaded data and a Schema object.
The split() method of the RelationalData class splits the data into train, test and, optionally, validation sets, while preserving the consistency of the relational data structure.
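What "consistency of the relational data structure" means for a split can be sketched with pandas alone: the root table is split first, and each child row follows its parent. This is an illustration of the idea under toy data, not the implementation of split(); the split fraction, the seed and the year column are assumptions.

```python
import pandas as pd

players = pd.DataFrame({"playerID": ["p1", "p2", "p3", "p4"]})
season = pd.DataFrame(
    {
        "playerID": ["p1", "p1", "p2", "p3", "p4", "p4"],
        "year": [2001, 2002, 2001, 2003, 2002, 2003],
    }
)

# Split the root table first (hypothetical 25% test fraction, fixed seed).
test_players = players.sample(frac=0.25, random_state=0)
train_players = players.drop(test_players.index)

# Child rows follow their parent, so no foreign key ends up pointing
# to a player that lives in the other split.
in_test = season["playerID"].isin(test_players["playerID"])
train_season, test_season = season[~in_test], season[in_test]

print(len(train_players), len(test_players), len(train_season), len(test_season))
```

Splitting the child table independently of the root table would instead leave some season rows referring to players in the other set.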