Relational Tabular Data

Automatic data type detection

aindo.synth library uses Pandas DataFrames as inputs and outputs. However, the Pandas column type is not sufficient to determine how each column will be used during the synthesis process. Indeed, aindo.synth performs a different preprocessing step on each input data column, and transforms them into distributions that can be fed into the neural generative models. Therefore, each column in the dataset needs to be assigned the correct column type. This issue is easily addressed by aindo.synth, which integrates and automates column type detection.

aindo.synth supports the following Column objects:

  • Text for columns containing free text;

  • Categorical for columns with categorical values (e.g. job title, gender, zip code);

  • Boolean for categorical columns with True/False/NA values;

  • Float for columns containing floating point numbers;

  • Integer for columns containing integer numbers. Behaves as a column of floats with the number of precision digits fixed to 0;

  • DateTime for datetime instances with timestamp format specified by the parameter fmt;

  • Time for columns with Unix time values. Extends DateTime class.

Notice that there is a fundamental distinction between Pandas data types and aindo.synth data types since the same data type could be fed to an aindo.synth model as multiple column types depending on the context. For instance, an integer Pandas column could be either interpreted as an integer (when it has an ordinal meaning) or as a category (when it doesn’t). The following table reports the compatibility between Pandas data types and aindo.synth column types:

Aindo type / Pandas type

bool

int

float

datetime

str

obj

integer / float

✅ *

✅ *

datetime

✅ *

✅ *

categorical / boolean

str

text

str

✅ * = may raise an error depending on column content

str = internally converted to a string (does not preserve the object in the synthesized data)

Schema

By default, columns are preprocessed independently and their transformations are fitted on the input relational data. Aindo columns are instantiated automatically when the schema is detected.

1import pandas as pd
2from aindo.synth import Categorical, Integer
3from aindo.synth import auto_schema, overwrite_columns
4
5adult = {'adult': pd.DataFrame(...)}
6schema_detected = auto_schema(adult)

It is possible to automatically detect the Schema and then change specific columns. Suppose the column education-num has been detected as an Integer but we want to treat it as Categorical:

1import pandas as pd
2from aindo.synth import Categorical, Integer
3from aindo.synth import auto_schema, overwrite_columns
4
5adult = {'adult': pd.DataFrame(...)}
6schema_detected = auto_schema(adult)
7assert isinstance(schema_detected.tables['adult'].columns['education-num'], Integer)
8schema_custom = overwrite_columns(schema_detected, {'adult': {'education-num': Categorical()}})

Columns can also be explicitly instantiated by manually building the Schema. To build a Schema from scratch, you need to import the proper objects from aindo.synth and the instantiation is pretty self-explanatory.

 1from aindo.synth import Integer, Categorical
 2from aindo.synth import Schema, Table
 3
 4schema = Schema(
 5    adult=Table(
 6        age=Integer(),
 7        workclass=Categorical(),
 8        fnlwgt=Integer(),
 9        education=...,
10    )
11)

For relational datasets, columns can also be declared as PrimaryKey() or ForeignKey(parent=...) objects.

RelationalData class

To be used in aindo.synth, each dataset must be stored in a RelationalData object, which is the basic data object. As the name implies, it is aimed at storing multiple tables but can store a single table as well. It contains the data tables and the information about their relations, e.g. the foreign keys. A relational database is composed of one or more tables, whose rows are uniquely identified by primary keys. Two tables can be linked by a column of foreign keys, i.e. a set of attributes in a table that refers to the primary keys in the other table. The table containing foreign keys is a child of the parent table containing the primary key. A table might be linked to multiple parent tables through different sets of foreign keys. A RelationalData class can be instantiated by one of the loader utils:

  • RelationalData.from_data() takes as input a dictionary with tables’ names as keys and Pandas DataFrames as values;

  • RelationalData.from_dir() loads all csv and Excel files from a directory as separate tables in the relational structure;

  • RelationalData.from_sql() loads Pandas DataFrames from a database connection.

The above utility constructors require the following parameters:

  • primary_keys: A dictionary with the name of the primary keys for each table. If a table does not have a primary key, it should be omitted.

  • foreign_keys: A dictionary with the foreign keys for each table. Foreign keys are specified as a dictionary of the foreign key column names and the corresponding parent table. Again, if a table does not have foreign keys, that table should be omitted.

Upon instantiation, the RelationalData object will perform automatic detection of the column types, invoking the function auto_schema(). The detected types are then mapped to the following aindo.synth Column objects.

Alternatively, a RelationalData object can be instantiated from a custom Schema and a dictionary of DataFrames as follows:

1from aindo.synth import RelationalData
2
3adult = {'adult': ...}
4schema = ...
5data = RelationalData(adult, schema)

Currently, aindo does not support composite primary keys and relational data structures with loops.

Documentation

class aindo.synth.relational.data.RelationalData(data: dict[str, pandas.core.frame.DataFrame], schema: Schema, fit: bool = True)
__init__(data: dict[str, pandas.core.frame.DataFrame], schema: Schema, fit: bool = True) None

Relational Data structure.

Parameters:
  • data – A dictionary with the data as pandas.DataFrame’s.

  • schema – A Schema object.

  • fit – Optional, whether to fit the Schema to the data. Default: True.

fit() None

Fit the Schema to the data.

split(test_fraction: float = 0.2) tuple[RD, RD]

Split the data in training and test sets.

Parameters:

test_fraction – Fraction of data to be used as test set for all tables.

Returns:

Two RelationalData instances on train and test sets.

classmethod from_data(data: dict[str, pandas.core.frame.DataFrame], primary_keys: Optional[dict[str, str]] = None, foreign_keys: Optional[dict[str, dict[str, str]]] = None, fit: bool = True) RD

Build the Relational Data structure by automatically detecting the Schema.

Parameters:
  • data – A dictionary containing table names as keys and Pandas dataframes as values.

  • primary_keys – A dictionary containing the primary key for each table (if present). It must be provided in the case of multiple tables.

  • foreign_keys – A dictionary containing the foreign keys for each table. Keys are child tables names and values are dictionaries containing the foreign key name as keys and the parent table as values. It must be provided in the case of multiple tables.

  • fit – Optional, whether to fit the Schema to the data. Default: True.

classmethod from_dir(data_dir: pathlib.Path | str, primary_keys: Optional[dict[str, str]] = None, foreign_keys: Optional[dict[str, dict[str, str]]] = None, use_cols: Optional[dict[str, list[str]]] = None, fit: bool = True) RD

Load all csv files in data_dir. Optionally choose columns to be loaded from each csv file.

Parameters:
  • data_dir – Directory containing csv files of all tables in the relational dataset.

  • primary_keys – A dictionary containing the primary key for each table (if present). It must be provided in the case of multiple tables.

  • foreign_keys – A dictionary containing the foreign keys for each table. Keys are child tables names and values are dictionaries containing the foreign key name as keys and the parent table as values. It must be provided in the case of multiple tables.

  • use_cols – Dictionary containing table names as keys and column names as values.

  • fit – Optional, whether to fit the Schema to the data. Default: True.

Returns:

An instance of RelationalData.

classmethod from_sql(con: sqlalchemy.engine.Connectable | str, tables_names: list[str], primary_keys: dict[str, str] | None = None, foreign_keys: dict[str, dict[str, str]] | None = None, use_cols: dict[str, list[str]] = None, fit: bool = True) RD

Load Pandas dataframes listed in tables_names from the database connection con.

Parameters:
  • con – Connection to the database.

  • tables_names – List of tables names in the relational structure.

  • primary_keys – A dictionary containing the primary key for each table (if present). It must be provided in the case of multiple tables.

  • foreign_keys – A dictionary containing the foreign keys for each table. Keys are child tables names and values are dictionaries containing the foreign key name as keys and the parent table as values. It must be provided in the case of multiple tables.

  • use_cols – Dictionary containing table names as keys and column names as values.

  • fit – Optional, whether to fit the Schema to the data. Default: True.

Returns:

An instance of RelationalData.

to_csv(out_dir: pathlib.Path | str = PosixPath('out/synth')) None

Save the RelationalData tables into a directory as csv files.

Parameters:

out_dir – Directory where data is saved. Default: “out/synth/”.

to_sql(con: sqlalchemy.engine.Connectable | str, suffix: str) None

Save the RelationalData tables into a database using the SQL connection con.

Parameters:
  • con – Connection to the database.

  • suffix – Suffix used to rename tables in the dictionary. Adding a suffix is necessary to avoid overwriting the original tables in the SQL database.

aindo.synth.relational.detect.auto_schema(data: dict[str, pandas.core.frame.DataFrame], primary_keys: Optional[dict[str, str]] = None, foreign_keys: Optional[dict[str, dict[str, str]]] = None) Schema

Automatic detect the Schema from data.

Parameters:
  • data – A dictionary containing table names as keys and Pandas dataframes as values.

  • primary_keys – A dictionary containing the primary key for each table (if present). It must be provided in the case of multiple tables.

  • foreign_keys – A dictionary containing the foreign keys for each table. Keys are child tables names and values are dictionaries containing the foreign key name as keys and the parent table as values. It must be provided in the case of multiple tables.