Relational module

This module allows you to:

  1. Load relational data structures consisting of one or multiple tables;
  2. Define column types and the relational structure, utilizing primary and foreign keys.

Column types

The aindo.rdml library uses pd.DataFrame objects as inputs and outputs. Each column of the dataset must be transformed before it is fed to the generative model. These transformations depend on how the column should be treated during training and generation (categorical, numerical, datetime, coordinates, …). This cannot be inferred from the pandas data type of the column alone, so each column in the dataset needs to be declared as a specific column type. The available Column types are: BOOLEAN, CATEGORICAL, COORDINATES, DATE, DATETIME, INTEGER, NUMERIC, TEXT and TIME. For relational datasets, columns can also be declared as PrimaryKey() or ForeignKey(parent=...).

Note that pandas data types and aindo.rdml column types are fundamentally distinct: two columns with the same pandas data type may be interpreted by a model as different Column types, depending on the context. For instance, a column with an integer pandas type could be interpreted either as Column.INTEGER or as Column.CATEGORICAL. The following table reports the compatibility between pandas data types and Column types:

Column type / pandas type: bool, int, float, datetime, str, obj
BOOLEAN/CATEGORICAL: str
COORDINATES: str
INTEGER/NUMERIC: ✓*, ✓*
DATE/TIME/DATETIME: ✓*, ✓*
TEXT: str

✓* = may raise an error depending on column content

str = internally converted to a string (does not preserve the object in the synthesized data)
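The distinction between pandas data types and Column types can be seen with plain pandas. The sketch below uses hypothetical toy data: two columns share the same pandas dtype even though one is a quantity and the other is a label, and a column of arbitrary Python objects keeps only a textual representation once cast to strings. The `astype(str)` call is just an analogy for the "str" entries in the table, not a description of how aindo.rdml performs the conversion internally.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from any real dataset).
df = pd.DataFrame(
    {
        "points": [12, 30, 7, 21],   # a quantity: arithmetic is meaningful
        "jersey": [23, 33, 23, 45],  # a label: only equality is meaningful
        "extra": [{"a": 1}, [1, 2], (3, 4), 3.5],  # arbitrary Python objects
    }
)

# pandas stores both integer columns with the same dtype, so the dtype
# alone cannot tell a numeric column from a categorical one.
same_dtype = df["points"].dtype == df["jersey"].dtype

# Casting objects to strings keeps their textual form only; the original
# Python objects are not recoverable from the result.
as_text = df["extra"].astype(str)

print(same_dtype)
print(as_text.tolist())
```

In a Schema, a column like `points` would typically be declared as Column.INTEGER and one like `jersey` as Column.CATEGORICAL, even though pandas reports the same dtype for both.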

Schema

A Schema object contains two primary components:

  1. PrimaryKey and ForeignKey declarations, defining the relational data structure.
  2. Column types, assigned to each column that is not a key.

To illustrate, let us work with the BasketballMan dataset, which consists of the following tables:

  • players: The root table with the primary key ‘playerID’.
  • season: A child table of players linked via the foreign key ‘playerID’.
  • all_star: Another child table of players connected by the foreign key ‘playerID’.

Let us load the tables with pandas and gather the resulting pd.DataFrame objects into a dictionary:

import pandas as pd

# Load each table of the dataset from its CSV file.
df_players = pd.read_csv('path/to/basket/dir/players.csv')
df_season = pd.read_csv('path/to/basket/dir/season.csv')
df_all_star = pd.read_csv('path/to/basket/dir/all_star.csv')

# Map table names to the corresponding DataFrames.
dfs = {
    'players': df_players,
    'season': df_season,
    'all_star': df_all_star,
}
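Before declaring keys, it can help to check that the loaded tables actually support the intended structure: the primary key should be unique and non-missing, and every foreign key value should appear in the parent table. The following is a minimal pandas sketch under stated assumptions: the toy frames stand in for the real tables, and `key_is_unique` and `orphan_rows` are hypothetical helpers, not part of aindo.rdml.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the players and season tables.
players = pd.DataFrame({"playerID": ["p1", "p2", "p3"], "pos": ["G", "F", "C"]})
season = pd.DataFrame({"playerID": ["p1", "p1", "p3"], "year": [1990, 1991, 1990]})


def key_is_unique(parent: pd.DataFrame, key: str) -> bool:
    """True if `key` has no missing values and no duplicates in `parent`."""
    return bool(parent[key].notna().all() and parent[key].is_unique)


def orphan_rows(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> pd.DataFrame:
    """Rows of `child` whose `key` value does not appear in `parent`."""
    return child[~child[key].isin(parent[key])]


pk_ok = key_is_unique(players, "playerID")
orphans = orphan_rows(season, players, "playerID")

print(pk_ok)
print(len(orphans))
```

With the real data, the same helpers could be applied to `dfs['players']`, `dfs['season']` and `dfs['all_star']` using the ‘playerID’ column.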

To build a Schema from scratch, import the Column, PrimaryKey, ForeignKey, Table and Schema objects from the aindo.rdml.relational module. Columns that are present in the data but not included in the Schema will be ignored.

from aindo.rdml.relational import Column, ForeignKey, PrimaryKey, Schema, Table

schema = Schema(
    # Root table, identified by its primary key.
    players=Table(
        playerID=PrimaryKey(),
        pos=Column.CATEGORICAL,
        height=Column.NUMERIC,
        weight=Column.NUMERIC,
        college=Column.CATEGORICAL,
        race=Column.CATEGORICAL,
        birthCity=Column.CATEGORICAL,
        birthState=Column.CATEGORICAL,
        birthCountry=Column.CATEGORICAL,
    ),
    # Child table of players.
    season=Table(
        playerID=ForeignKey(parent='players'),
        year=Column.INTEGER,
        stint=Column.INTEGER,
        tmID=Column.CATEGORICAL,
        lgID=Column.CATEGORICAL,
        GP=Column.INTEGER,
        points=Column.INTEGER,
        GS=Column.INTEGER,
        assists=Column.INTEGER,
        steals=Column.INTEGER,
        minutes=Column.INTEGER,
    ),
    # Child table of players.
    all_star=Table(
        playerID=ForeignKey(parent='players'),
        conference=Column.CATEGORICAL,
        league_id=Column.CATEGORICAL,
        points=Column.INTEGER,
        rebounds=Column.INTEGER,
        assists=Column.INTEGER,
        blocks=Column.INTEGER,
    ),
)
print(schema)
Out:
Schema:
  players:Table
    Primary key: playerID
    Feature columns:
      pos:<Column.CATEGORICAL: 'Categorical'>
      height:<Column.NUMERIC: 'Numeric'>
      weight:<Column.NUMERIC: 'Numeric'>
      college:<Column.CATEGORICAL: 'Categorical'>
      race:<Column.CATEGORICAL: 'Categorical'>
      birthCity:<Column.CATEGORICAL: 'Categorical'>
      birthState:<Column.CATEGORICAL: 'Categorical'>
      birthCountry:<Column.CATEGORICAL: 'Categorical'>
    Foreign keys:
  season:Table
    Primary key: None
    Feature columns:
      year:<Column.INTEGER: 'Integer'>
      stint:<Column.INTEGER: 'Integer'>
      tmID:<Column.CATEGORICAL: 'Categorical'>
      lgID:<Column.CATEGORICAL: 'Categorical'>
      GP:<Column.INTEGER: 'Integer'>
      points:<Column.INTEGER: 'Integer'>
      GS:<Column.INTEGER: 'Integer'>
      assists:<Column.INTEGER: 'Integer'>
      steals:<Column.INTEGER: 'Integer'>
      minutes:<Column.INTEGER: 'Integer'>
    Foreign keys:
      playerID:ForeignKey(parent=players)
  all_star:Table
    Primary key: None
    Feature columns:
      conference:<Column.CATEGORICAL: 'Categorical'>
      league_id:<Column.CATEGORICAL: 'Categorical'>
      points:<Column.INTEGER: 'Integer'>
      rebounds:<Column.INTEGER: 'Integer'>
      assists:<Column.INTEGER: 'Integer'>
      blocks:<Column.INTEGER: 'Integer'>
    Foreign keys:
      playerID:ForeignKey(parent=players)
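Since columns that are not included in the Schema are ignored, it can be useful to list them explicitly before training. The sketch below is a plain pandas illustration: `declared` is a hypothetical dictionary of column names written by hand to mirror a schema (it is not an aindo.rdml object), and the toy frames and their extra columns are invented for the example.

```python
import pandas as pd

# Hypothetical stand-in: the column names declared for each table,
# written as a plain dict rather than read from a Schema object.
declared = {
    "players": {"playerID", "pos", "height"},
    "season": {"playerID", "year", "points"},
}

# Toy frames (hypothetical): each carries one column that is not declared.
toy_dfs = {
    "players": pd.DataFrame(
        {"playerID": ["p1"], "pos": ["G"], "height": [78.0], "nickname": ["Ace"]}
    ),
    "season": pd.DataFrame(
        {"playerID": ["p1"], "year": [1990], "points": [512], "note": ["ok"]}
    ),
}

# Columns present in the data but not declared (these would be ignored).
undeclared = {
    name: sorted(set(df.columns) - declared[name]) for name, df in toy_dfs.items()
}
# Columns declared but absent from the data (likely a typo in a name).
missing = {
    name: sorted(declared[name] - set(df.columns)) for name, df in toy_dfs.items()
}

print(undeclared)
print(missing)
```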

RelationalData

A RelationalData object is defined by combining the loaded data and a Schema object:

from aindo.rdml.relational import RelationalData, Schema

dfs = {
    'players': ...,
    'season': ...,
    'all_star': ...,
}
schema = Schema(...)
data = RelationalData(data=dfs, schema=schema)

The split() method of the RelationalData class splits the data into train, test and, optionally, validation sets, while preserving the consistency of the relational data structure.

from aindo.rdml.relational import RelationalData
data = RelationalData(data=..., schema=...)
data_train_valid, data_test = data.split(ratio=0.1)
data_train, data_valid = data_train_valid.split(ratio=0.1)
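To give an intuition of what a split that respects the relational structure involves, the following pandas sketch samples rows of the root table and then routes each child row to the side that holds its parent. It is an illustration under stated assumptions, not the implementation of split(): the function `split_by_root`, its `fraction` and `seed` parameters, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a root table and one child table.
players = pd.DataFrame({"playerID": [f"p{i}" for i in range(10)]})
season = pd.DataFrame(
    {
        "playerID": ["p0", "p0", "p1", "p4", "p4", "p7", "p9"],
        "year": [1990, 1991, 1990, 1992, 1993, 1990, 1991],
    }
)


def split_by_root(root, child, key, fraction, seed=0):
    """Hold out `fraction` of the root rows together with their child rows."""
    rng = np.random.default_rng(seed)
    ids = root[key].to_numpy()
    n_held = int(round(fraction * len(ids)))
    # Sample root keys without replacement; children follow their parent.
    held_ids = list(rng.choice(ids, size=n_held, replace=False))
    root_held = root[key].isin(held_ids)
    child_held = child[key].isin(held_ids)
    kept = {"root": root[~root_held], "child": child[~child_held]}
    held = {"root": root[root_held], "child": child[child_held]}
    return kept, held


kept, held = split_by_root(players, season, "playerID", fraction=0.2)
print(len(kept["root"]), len(held["root"]))
```

Because the sampling is done on the root keys only, no child row ends up in a set that lacks its parent row.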

Documentation

aindo.rdml.relational