
Relational

The aindo.rdml.relational module allows you to:

  1. Load relational data structures consisting of one or multiple tables;
  2. Define column types and the relational structure, utilizing primary and foreign keys.

Schema

A Schema object is a collection of named Table objects, together with the information about the relations among the tables. Each Table object contains the columns of interest of that table. There are two primary types of columns:

  1. PrimaryKey and ForeignKey columns, which define the relational structure of the data.
  2. Feature columns (Column), namely columns that are not keys.

Feature columns

When building a Schema, each feature column is assigned a Column type. The assigned type tells the various routines of the library how to treat the data in the column. For example, a Column.CATEGORICAL is preprocessed differently from a Column.INTEGER before being fed to the generative model during training (more info in the ColumnPreproc section). It also appears differently in the evaluation report (more info in the Synthetic data report section).

The available Column types are: BOOLEAN, CATEGORICAL, NUMERIC, INTEGER, DATE, TIME, DATETIME, COORDINATES, ITAFISCALCODE, and TEXT.
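When many columns need a type, a first guess can be drawn from the pandas dtypes and then reviewed by hand. The following is a minimal sketch using pandas only. The helper suggest_type, its max_categories threshold, and the sample data are hypothetical and not part of aindo.rdml; it returns plain strings that match some of the type names listed above, and it makes no attempt to detect DATE, TIME, COORDINATES, or ITAFISCALCODE columns.

```python
import pandas as pd
from pandas.api import types as ptypes


def suggest_type(series: pd.Series, max_categories: int = 20) -> str:
    """Suggest a type name for a column (hypothetical helper, heuristic only)."""
    if ptypes.is_bool_dtype(series):
        return "BOOLEAN"
    if ptypes.is_datetime64_any_dtype(series):
        return "DATETIME"
    if ptypes.is_integer_dtype(series):
        return "INTEGER"
    if ptypes.is_float_dtype(series):
        return "NUMERIC"
    # Remaining columns: few distinct values suggests a categorical column,
    # otherwise fall back to free text. The threshold is an arbitrary assumption.
    if series.nunique() <= max_categories:
        return "CATEGORICAL"
    return "TEXT"


# Made-up sample data, used only to exercise the helper.
df = pd.DataFrame(
    {
        "pos": ["G", "F", "C"],
        "height": [78.0, 80.5, 84.0],
        "GP": [82, 70, 65],
        "birthDate": pd.to_datetime(["1960-01-01", "1962-05-17", "1965-11-30"]),
    }
)

suggestions = {name: suggest_type(df[name]) for name in df.columns}
print(suggestions)
```

The suggestions are only a starting point: an integer column that encodes a category (for instance a team code) should still be declared as categorical by hand.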

Building a Schema

To illustrate how to build a Schema from scratch, let us work with (a subset of) the BasketballMen dataset, which consists of the following tables:

  • players: The root table with the primary key playerID.
  • season: A child table of players linked via the foreign key playerID.
  • all_star: Another child table of players connected by the foreign key playerID.

Let us load the tables with pandas and gather the resulting pandas.DataFrame objects into a dictionary:

import pandas as pd
df_players = pd.read_csv('path/to/basket/dir/players.csv')
df_season = pd.read_csv('path/to/basket/dir/season.csv')
df_all_star = pd.read_csv('path/to/basket/dir/all_star.csv')
dfs = {
    'players': df_players,
    'season': df_season,
    'all_star': df_all_star,
}
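Before declaring primary and foreign keys, it can help to check that the key columns actually behave like keys. The sketch below uses pandas only; the helper check_keys and the tiny in-memory tables are hypothetical stand-ins for the CSV files, and the check is not part of aindo.rdml.

```python
import pandas as pd


def check_keys(parent: pd.DataFrame, child: pd.DataFrame, key: str) -> dict:
    """Report basic key problems between a parent table and one child table."""
    return {
        # A primary key must identify each parent row uniquely.
        "duplicate_parent_keys": int(parent[key].duplicated().sum()),
        # Rows with a missing key cannot be linked to anything.
        "missing_parent_keys": int(parent[key].isna().sum()),
        # Child rows whose key has no match in the parent table are orphans.
        "orphan_child_rows": int((~child[key].isin(parent[key])).sum()),
    }


# Made-up tables standing in for players.csv and season.csv.
players = pd.DataFrame(
    {"playerID": ["a01", "b02", "c03"], "pos": ["G", "F", "C"]}
)
season = pd.DataFrame(
    {"playerID": ["a01", "a01", "b02", "z99"], "year": [1990, 1991, 1990, 1990]}
)

report = check_keys(players, season, key="playerID")
print(report)
```

Here the season row with playerID 'z99' has no matching player, so it is counted as one orphan row.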

To build a Schema, users must import the Column, PrimaryKey, ForeignKey, Table, and Schema objects from the aindo.rdml.relational module. Tables and columns that are present in the data but that are not included in the Schema will be ignored.

from aindo.rdml.relational import Column, ForeignKey, PrimaryKey, Schema, Table

schema = Schema(
    players=Table(
        playerID=PrimaryKey(),
        pos=Column.CATEGORICAL,
        height=Column.NUMERIC,
        weight=Column.NUMERIC,
        college=Column.CATEGORICAL,
        race=Column.CATEGORICAL,
        birthCity=Column.CATEGORICAL,
        birthState=Column.CATEGORICAL,
        birthCountry=Column.CATEGORICAL,
    ),
    season=Table(
        playerID=ForeignKey(parent='players'),
        year=Column.INTEGER,
        stint=Column.INTEGER,
        tmID=Column.CATEGORICAL,
        lgID=Column.CATEGORICAL,
        GP=Column.INTEGER,
        points=Column.INTEGER,
        GS=Column.INTEGER,
        assists=Column.INTEGER,
        steals=Column.INTEGER,
        minutes=Column.INTEGER,
    ),
    all_star=Table(
        playerID=ForeignKey(parent='players'),
        conference=Column.CATEGORICAL,
        league_id=Column.CATEGORICAL,
        points=Column.INTEGER,
        rebounds=Column.INTEGER,
        assists=Column.INTEGER,
        blocks=Column.INTEGER,
    ),
)
print(schema)
Out:
Schema:
    players:Table
        Primary key: playerID
        Feature columns:
            pos:<Column.CATEGORICAL: 'Categorical'>
            height:<Column.NUMERIC: 'Numeric'>
            weight:<Column.NUMERIC: 'Numeric'>
            college:<Column.CATEGORICAL: 'Categorical'>
            race:<Column.CATEGORICAL: 'Categorical'>
            birthCity:<Column.CATEGORICAL: 'Categorical'>
            birthState:<Column.CATEGORICAL: 'Categorical'>
            birthCountry:<Column.CATEGORICAL: 'Categorical'>
        Foreign keys:
    season:Table
        Primary key: None
        Feature columns:
            year:<Column.INTEGER: 'Integer'>
            stint:<Column.INTEGER: 'Integer'>
            tmID:<Column.CATEGORICAL: 'Categorical'>
            lgID:<Column.CATEGORICAL: 'Categorical'>
            GP:<Column.INTEGER: 'Integer'>
            points:<Column.INTEGER: 'Integer'>
            GS:<Column.INTEGER: 'Integer'>
            assists:<Column.INTEGER: 'Integer'>
            steals:<Column.INTEGER: 'Integer'>
            minutes:<Column.INTEGER: 'Integer'>
        Foreign keys:
            playerID:ForeignKey(parent=players)
    all_star:Table
        Primary key: None
        Feature columns:
            conference:<Column.CATEGORICAL: 'Categorical'>
            league_id:<Column.CATEGORICAL: 'Categorical'>
            points:<Column.INTEGER: 'Integer'>
            rebounds:<Column.INTEGER: 'Integer'>
            assists:<Column.INTEGER: 'Integer'>
            blocks:<Column.INTEGER: 'Integer'>
        Foreign keys:
            playerID:ForeignKey(parent=players)

RelationalData

A RelationalData object is defined by combining the loaded data and a Schema object:

from aindo.rdml.relational import RelationalData, Schema
dfs = {
    'players': ...,
    'season': ...,
    'all_star': ...,
}
schema = Schema(...)
data = RelationalData(data=dfs, schema=schema)

The RelationalData.split() method splits the data into train, test, and optionally validation sets, while respecting the consistency of the relational data structure.

from aindo.rdml.relational import RelationalData
data = RelationalData(data=..., schema=...)
data_train_valid, data_test = data.split(ratio=0.1)
data_train, data_valid = data_train_valid.split(ratio=0.1)
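To make concrete what a split that respects the relational structure involves, here is a minimal sketch in plain pandas. It is not the library's implementation: the helper split_by_root, its parameters, and the sample tables are hypothetical, and the sketch assumes that ratio is the fraction of root rows that goes to the second (held-out) part. The root table is split by rows, and every child row follows its parent.

```python
import pandas as pd


def split_by_root(dfs, root, key, children, ratio, seed=0):
    """Split the root table by rows, then send each child row along with its parent."""
    held_out_root = dfs[root].sample(frac=ratio, random_state=seed)
    kept_root = dfs[root].drop(held_out_root.index)
    kept, held_out = {root: kept_root}, {root: held_out_root}
    for name in children:
        # A child row belongs to the held-out part if and only if its parent does.
        in_held_out = dfs[name][key].isin(held_out_root[key])
        kept[name] = dfs[name][~in_held_out]
        held_out[name] = dfs[name][in_held_out]
    return kept, held_out


# Made-up data: 10 players, each with 2 season rows.
ids = [f"p{i}" for i in range(10)]
tables = {
    "players": pd.DataFrame({"playerID": ids}),
    "season": pd.DataFrame({"playerID": ids * 2, "year": [1990] * 10 + [1991] * 10}),
}

train, test = split_by_root(
    tables, root="players", key="playerID", children=["season"], ratio=0.2
)
print(len(train["players"]), len(test["players"]))
print(len(train["season"]), len(test["season"]))
```

Splitting each table independently by rows would instead leave child rows pointing at players that ended up in the other part, which is the inconsistency a relational split has to avoid.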