Aindo Synth Overview

aindo.synth is a library for creating synthetic tabular and relational data using neural generative models. It offers a high-level API for users who want the simplest possible workflow, and a low-level API that gives greater flexibility and is aimed at developers and researchers. This documentation covers the high-level API.

Data loading

Data must be loaded into main memory. aindo.synth offers several ways to load data from different sources and organize it into RelationalData objects. First, import the RelationalData class. Then, to load a single-table dataset from a file (in CSV or Excel format) contained in a directory, use the from_dir constructor. For example, to load the UCI Adult dataset:

from aindo.synth import RelationalData
data = RelationalData.from_dir(data_dir='path/to/adult/dir')

Alternatively, we can load tabular data with the pandas library and instantiate a RelationalData object from a dictionary with table names as keys and pandas DataFrames as values. Tabular data is organized in rows and columns, where columns represent attributes and rows represent observations of those attributes. For example, let’s take a look at the single-table Adult dataset:

import pandas as pd
df = pd.read_csv('path/to/adult/dir/adult.csv')
df
[Out]:
       age   workclass  fnlwgt  ... hours-per-week  native-country       y
0       43   Local-gov  169203  ...             35   United-States   <=50K
1       36     Private  184112  ...             45   United-States    >50K
2       35     Private  338611  ...             40   United-States   <=50K
3       64           ?  208862  ...             50   United-States    >50K
4       25   State-gov  129200  ...             40   United-States   <=50K
    ...         ...     ...  ...            ...             ...     ...
29300   25     Private  290528  ...             40   United-States   <=50K
29301   51     Private  306108  ...             40   United-States    >50K
29302   33     Private  182792  ...             40   United-States   <=50K
29303   46     Private  175925  ...             40   United-States   <=50K
29304   34     Private  191291  ...             40   United-States   <=50K
[29305 rows x 15 columns]
data = RelationalData.from_data({'adult': df})

Similarly, we can load relational data structures containing multiple tables. In this setting, each foreign key in a child table refers to the primary key of a parent table, thus defining a connection between the parent and the child table. Let’s load the BasketballMan dataset, composed of the following tables:

  • players is the root table with primary key playerID

  • season is a child of players with foreign key playerID

  • all_star is a child of players with foreign key playerID
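
The parent/child relations listed above imply that every playerID appearing in season or all_star should also appear in players. As a minimal sketch of that constraint, in plain Python and independent of aindo.synth, the hypothetical helper orphan_keys below reports foreign key values that have no matching parent row. The rows are invented for illustration and are not taken from the real dataset.

```python
def orphan_keys(parent_rows, child_rows, key):
    """Return the foreign key values in child_rows that have no match in parent_rows."""
    parent_keys = {row[key] for row in parent_rows}
    return {row[key] for row in child_rows} - parent_keys


# Toy rows, invented for illustration only.
players = [{"playerID": "p1"}, {"playerID": "p2"}]
season = [{"playerID": "p1", "year": 1990}, {"playerID": "p2", "year": 1991}]
all_star = [{"playerID": "p1", "points": 12.0}, {"playerID": "p9", "points": 7.0}]

print(orphan_keys(players, season, "playerID"))    # set()
print(orphan_keys(players, all_star, "playerID"))  # {'p9'}
```

An empty result means the child table is consistent with its parent; a non-empty one points at rows worth inspecting before loading.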

We can still use the RelationalData.from_dir() method, but in this case the primary keys and the foreign keys describing the relations between tables must be specified. The primary_keys parameter is a dictionary specifying one primary key for each table that has one, while the foreign_keys parameter is a dictionary that assigns to each child table a dictionary with the foreign key columns as keys and the corresponding parent tables as values.

pks = {'players': 'playerID'}
fks = {'season': {'playerID': 'players'}, 'all_star': {'playerID': 'players'}}
data = RelationalData.from_dir(data_dir='path/to/basketballman/dir', primary_keys=pks, foreign_keys=fks)
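
Because the two dictionaries are written by hand, a quick sanity check before loading can catch typos. The following is a sketch in plain Python, not part of aindo.synth: the hypothetical helper root_tables verifies that every foreign key points at a table with a declared primary key, and returns the tables that have a primary key and no foreign keys of their own.

```python
def root_tables(primary_keys, foreign_keys):
    """Check the key dictionaries and return the tables with no foreign keys."""
    for child, fks_of_child in foreign_keys.items():
        for column, parent in fks_of_child.items():
            if parent not in primary_keys:
                raise ValueError(
                    f"{child}.{column} refers to {parent!r}, which has no primary key"
                )
    # A root table declares a primary key and has no foreign keys of its own.
    return sorted(t for t in primary_keys if not foreign_keys.get(t))


pks = {'players': 'playerID'}
fks = {'season': {'playerID': 'players'}, 'all_star': {'playerID': 'players'}}
print(root_tables(pks, fks))  # ['players']
```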

Optionally, a subset of the available columns can be loaded via the use_cols argument, a dictionary that maps each table name to the list of columns to load.

data = RelationalData.from_dir(
    data_dir='path/to/basketballman/dir',
    primary_keys=pks,
    foreign_keys=fks,
    use_cols={
        'players': ['playerID', 'pos', 'height', 'weight', 'race'],
        'all_star': ['playerID', 'conference', 'points', 'rebounds', 'assists', 'blocks'],
        'season': ['playerID', 'year', 'GP', 'minutes', 'ppg', 'rpg', 'apg', 'bpg'],
    },
)

The resulting RelationalData object contains the loaded tabular data, organized according to the specified relational structure:

data
Schema:

players:Table
Primary key: playerID
Feature columns:
  pos:Categorical
  height:Float
  weight:Float
  race:Categorical
Foreign keys:

all_star:Table
Primary key: None
Feature columns:
  conference:Categorical
  points:Float
  rebounds:Float
  assists:Float
  blocks:Float
Foreign keys:
  playerID:ForeignKey(parent=players)

season:Table
Primary key: None
Feature columns:
  year:Integer
  GP:Integer
  minutes:Integer
  ppg:Float
  rpg:Float
  apg:Float
  bpg:Float
Foreign keys:
  playerID:ForeignKey(parent=players)

(Optional) Train/test data splitting

aindo.synth offers a utility function to split the data into train, test and possibly validation sets, while preserving the consistency of the relational data structure. More precisely, for each root table it samples a fraction of the rows for each set and assigns all of their child rows to the same set.

from aindo.synth import RelationalData

data = RelationalData.from_dir(data_dir='path/to/data/dir', primary_keys=..., foreign_keys=...)
train_data, test_data = data.split(test_fraction=0.2)

The optional parameter test_fraction, when given as a single number such as 0.2, assigns the same split ratio to all root tables. To set different split ratios, pass a dictionary keyed by root table name, e.g. test_fraction={'root_1':0.2, 'root_2':0.5}.
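
To illustrate what a split that respects the relational structure means, here is a minimal sketch in plain Python. It is not the library's implementation: the function split_with_children, its parameters and the toy rows are all hypothetical. Root rows are sampled at random, and every child row follows its parent into the same set.

```python
import random


def split_with_children(root_name, root_rows, children, key, test_fraction, seed=0):
    """Split root rows at random and route each child row to its parent's side."""
    rng = random.Random(seed)
    shuffled = list(root_rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test_ids = {row[key] for row in shuffled[:n_test]}

    def side(rows, in_test):
        # Keep the rows whose key falls on the requested side of the split.
        return [r for r in rows if (r[key] in test_ids) == in_test]

    train = {root_name: side(root_rows, False)}
    test = {root_name: side(root_rows, True)}
    for name, rows in children.items():
        train[name] = side(rows, False)
        test[name] = side(rows, True)
    return train, test


# Toy data: 10 players with 3 seasons each, invented for illustration.
players = [{"playerID": f"p{i}"} for i in range(10)]
season = [{"playerID": f"p{i % 10}", "year": 1990 + i} for i in range(30)]

train, test = split_with_children(
    "players", players, {"season": season}, "playerID", test_fraction=0.2
)
print(len(train["players"]), len(test["players"]))  # 8 2
print(len(train["season"]), len(test["season"]))    # 24 6
```

No child row ends up in a different set from its parent, so both halves remain valid relational datasets.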

Model training

aindo.synth offers various models to generate synthetic data. They all share a similar interface and can therefore be used interchangeably, depending on which one is best suited to the task at hand. The complexity is hidden from the user: a GraphSynth model can be instantiated and trained in a few lines of code. To be instantiated, the model requires the schema of the data, which can be retrieved from the RelationalData object. To train the model, we must also specify the number of training epochs.

from aindo.synth import GraphSynth, RelationalData

data = RelationalData.from_dir(data_dir='path/to/data/dir', primary_keys=..., foreign_keys=...)
model = GraphSynth(schema=data.schema)
model.train(train_data=data, n_epochs=50)

More customization parameters are available via the optional arguments described in the documentation.

Synthetic data generation

Once the model is trained, synthetic data can be generated by calling its Model.sample() method, which returns a RelationalData object containing the synthetic data.

from aindo.synth import GraphSynth, RelationalData

data = RelationalData.from_dir(data_dir='path/to/data/dir', primary_keys=..., foreign_keys=...)
model = GraphSynth(schema=data.schema)
model.train(train_data=data, n_epochs=50)
data_synth = model.sample()

By default, Model.sample() generates the same number of rows as the original dataset, but an arbitrary number of samples can be specified via the optional argument n_samples, as a dictionary containing the desired number of samples to be generated for each table.

data = RelationalData.from_dir(data_dir='path/to/basketballman/dir', primary_keys=pks, foreign_keys=fks)
data_synth = model.sample(n_samples={'players': 4856, 'all_star': 1588, 'season': 23585})
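
When a dataset of a different size is wanted, the n_samples dictionary can be derived from a set of reference counts rather than typed by hand. The sketch below is plain Python and makes no calls to aindo.synth. The helper scaled_counts is hypothetical, and the reference counts simply reuse the numbers from the example above.

```python
def scaled_counts(reference_counts, factor):
    """Scale every table's row count by the same factor, keeping at least one row."""
    return {table: max(1, round(n * factor)) for table, n in reference_counts.items()}


# Reference counts reused from the example above.
reference_counts = {'players': 4856, 'all_star': 1588, 'season': 23585}
n_samples = scaled_counts(reference_counts, factor=2)
print(n_samples)  # {'players': 9712, 'all_star': 3176, 'season': 47170}
```

The resulting dictionary has the shape expected by the n_samples argument, with one entry per table.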

Evaluation

All models expose the Model.report() method, which writes a PDF report of the key metrics to the out_dir folder.

Part of the original data should be held out of the training phase and used to evaluate the quality of the synthetic data. In the following example, we split the data into a training set and a test set, train the model on the training set, and evaluate it on the test set.

from aindo.synth import GraphSynth, RelationalData

data = RelationalData.from_dir(data_dir='path/to/data/dir', primary_keys=..., foreign_keys=...)
train_data, test_data = data.train_test_split()
model = GraphSynth(schema=data.schema)
model.train(train_data=train_data, n_epochs=50)
model.report(original_data=test_data, out_dir='path/to/out/dir')

Exporting data

RelationalData objects can be saved to disk or to a database. For instance, the RelationalData.to_csv() method saves the data to disk:

from aindo.synth import RelationalData

data_synth: RelationalData = ...
data_synth.to_csv(out_dir='path/to/out/dir')

In this case, the synthetic tables are saved in the out_dir folder as separate CSV files, each named after the corresponding original table.
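
The resulting layout, one file per table named after the table, can be mimicked with the standard library alone. The sketch below is not how to_csv() is implemented. The helper write_tables and the toy rows are hypothetical, and each table is assumed to be a non-empty list of dictionaries.

```python
import csv
import tempfile
from pathlib import Path


def write_tables(tables, out_dir):
    """Write each table (a non-empty list of dicts) to <out_dir>/<table name>.csv."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, rows in tables.items():
        target = out_path / f"{name}.csv"
        with target.open("w", newline="") as handle:
            # Column names are taken from the first row.
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        written.append(target.name)
    return sorted(written)


# Toy tables, invented for illustration only.
tables = {
    "players": [{"playerID": "p1", "pos": "G"}],
    "season": [{"playerID": "p1", "year": 1990}],
}
with tempfile.TemporaryDirectory() as tmp:
    print(write_tables(tables, tmp))  # ['players.csv', 'season.csv']
```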