Aindo Synth Overview¶
aindo.synth
is a library for creating synthetic tabular and relational data
using neural generative models.
It offers a high-level API, suitable for users who want to use the library in the simplest
possible way, and a low-level API, meant for greater flexibility and aimed at developers and researchers.
In this documentation, we explore the high-level API.
Data loading¶
Data must be loaded into main memory. aindo.synth
offers various options to load data from different sources and organize
it into RelationalData
objects. First of all, the RelationalData
class must be imported.
Then, to load a single-table dataset from a file (in CSV or Excel format) contained in a directory,
we use the constructor from_dir
. For example, to load the
UCI Adult dataset, we can use the following instructions:
from aindo.synth import RelationalData
data = RelationalData.from_dir(data_dir='path/to/adult/dir')
Optionally, we can load tabular data using the pandas library and instantiate a
RelationalData
object from a dictionary with table names as keys and pandas DataFrames as values.
Tabular data is organized in rows and columns, where columns represent attributes and rows represent
observations of those attributes. For example, let's take a look at the Adult single-table dataset:
import pandas as pd
df = pd.read_csv('path/to/adult/dir/adult.csv')
df
[Out]:
age workclass fnlwgt ... hours-per-week native-country y
0 43 Local-gov 169203 ... 35 United-States <=50K
1 36 Private 184112 ... 45 United-States >50K
2 35 Private 338611 ... 40 United-States <=50K
3 64 ? 208862 ... 50 United-States >50K
4 25 State-gov 129200 ... 40 United-States <=50K
... ... ... ... ... ... ...
29300 25 Private 290528 ... 40 United-States <=50K
29301 51 Private 306108 ... 40 United-States >50K
29302 33 Private 182792 ... 40 United-States <=50K
29303 46 Private 175925 ... 40 United-States <=50K
29304 34 Private 191291 ... 40 United-States <=50K
[29305 rows x 15 columns]
data = RelationalData.from_data({'adult': df})
Similarly, we can load relational data structures containing multiple tables. In this setting, each foreign key in a child table refers to the primary key of a parent table, thus defining a connection between parent and child tables. Let's load the BasketballMan dataset, composed of the following tables:
players is the root table, with primary key playerID
season is a child of players, with foreign key playerID
all_star is a child of players, with foreign key playerID
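As an aside, this kind of parent-child relation can be verified with plain pandas. The following sketch uses made-up stand-ins for the BasketballMan tables; it is purely illustrative and not part of the aindo.synth API:

```python
import pandas as pd

# Toy stand-ins for the BasketballMan tables (values are made up).
players = pd.DataFrame({
    "playerID": ["p1", "p2", "p3"],
    "pos": ["G", "F", "C"],
})
season = pd.DataFrame({
    "playerID": ["p1", "p1", "p3"],  # foreign key into players
    "year": [2001, 2002, 2001],
})

# The foreign key is consistent when every value in the child column
# appears among the parent table's primary keys.
fk_valid = season["playerID"].isin(players["playerID"]).all()
print(fk_valid)  # True
```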
We can still use the RelationalData.from_dir()
method, but in this case the primary keys and the foreign keys
describing the relations between tables need to be specified.
The primary_keys
parameter is a dictionary specifying one primary key for each table, while the foreign_keys
parameter is a dictionary which assigns to each table a dictionary with the foreign keys as keys and the
corresponding parent tables as values.
pks = {'players': 'playerID'}
fks = {'season': {'playerID': 'players'}, 'all_star': {'playerID': 'players'}}
data = RelationalData.from_dir(data_dir='path/to/basketballman/dir', primary_keys=pks, foreign_keys=fks)
Optionally, it is possible to load a subset of the available columns via the argument use_cols
.
data = RelationalData.from_dir(
    data_dir='path/to/basketballman/dir',
    primary_keys=pks,
    foreign_keys=fks,
    use_cols={
        'players': ['playerID', 'pos', 'height', 'weight', 'race'],
        'all_star': ['playerID', 'conference', 'points', 'rebounds', 'assists', 'blocks'],
        'season': ['playerID', 'year', 'GP', 'minutes', 'ppg', 'rpg', 'apg', 'bpg'],
    },
)
The resulting RelationalData
object contains the loaded tabular data, organized according to the specified
relational structure:
data
Schema:
players:Table
    Primary key: playerID
    Feature columns:
        pos:Categorical
        height:Float
        weight:Float
        race:Categorical
    Foreign keys:
all_star:Table
    Primary key: None
    Feature columns:
        conference:Categorical
        points:Float
        rebounds:Float
        assists:Float
        blocks:Float
    Foreign keys:
        playerID:ForeignKey(parent=players)
season:Table
    Primary key: None
    Feature columns:
        year:Integer
        GP:Integer
        minutes:Integer
        ppg:Float
        rpg:Float
        apg:Float
        bpg:Float
    Foreign keys:
        playerID:ForeignKey(parent=players)
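The type tags in the schema (Categorical, Float, Integer) describe how each column is treated. As a rough, purely illustrative analogue, a naive pandas-based type inference might look like the sketch below; naive_column_type is a hypothetical helper, not an aindo.synth function:

```python
import pandas as pd

def naive_column_type(s: pd.Series) -> str:
    """Rough analogue of the schema's type tags (illustrative only)."""
    if pd.api.types.is_float_dtype(s):
        return "Float"
    if pd.api.types.is_integer_dtype(s):
        return "Integer"
    return "Categorical"

df = pd.DataFrame({
    "pos": ["G", "F", "C"],
    "height": [1.91, 2.01, 2.11],
    "year": [2001, 2002, 2003],
})
print({c: naive_column_type(df[c]) for c in df.columns})
# {'pos': 'Categorical', 'height': 'Float', 'year': 'Integer'}
```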
(Optional) Train/test data splitting¶
aindo.synth
offers a utility function to split the data into train, test, and possibly validation sets, while
respecting the consistency of the relational data structure. Precisely, for each root table it samples a fraction of
rows to be used in the training/testing phase, together with all their child rows.
from aindo.synth import RelationalData

data = RelationalData.from_dir(data_dir='path/to/data/dir', primary_keys=..., foreign_keys=...)
train_data, test_data = data.split(test_fraction=0.2)
The optional parameter test_fraction
(0.2 in the example above), when given as a single number, assigns the same split ratio to all root tables.
Different split ratios can be set by passing a dictionary keyed on the root tables, e.g.
test_fraction={'root_1': 0.2, 'root_2': 0.5}
.
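To illustrate what a relationally consistent split does, here is a minimal pandas sketch: root rows are sampled, and each child row follows its parent onto the same side of the split. The function and table names are hypothetical; this is not the library's implementation:

```python
import pandas as pd

def split_root_with_children(root, child, key, test_fraction=0.2, seed=0):
    """Sample a fraction of the root rows for the test set, and keep
    each child row on the same side as its parent (illustrative only)."""
    test_root = root.sample(frac=test_fraction, random_state=seed)
    train_root = root.drop(test_root.index)
    test_child = child[child[key].isin(test_root[key])]
    train_child = child[child[key].isin(train_root[key])]
    return (train_root, train_child), (test_root, test_child)

players = pd.DataFrame({"playerID": [f"p{i}" for i in range(10)]})
season = pd.DataFrame({"playerID": ["p0", "p0", "p5", "p9"],
                       "year": [2001, 2002, 2001, 2001]})
(train_p, train_s), (test_p, test_s) = split_root_with_children(players, season, "playerID")

# No season row is ever separated from its player.
assert set(test_s["playerID"]) <= set(test_p["playerID"])
assert set(train_s["playerID"]) <= set(train_p["playerID"])
```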
Model training¶
aindo.synth
offers various models to generate the synthetic data.
They all share a similar interface and thus can be used interchangeably,
depending on which one is most suited to the task at hand.
All the complexities are hidden from the user: instantiation and training of a GraphSynth
model can be done in
a few lines of code.
To be instantiated, the model requires the schema of the data, which can easily be retrieved from the RelationalData
object. To train the model, we must also specify the number of training epochs.
from aindo.synth import GraphSynth, RelationalData

data = RelationalData.from_dir(data_dir='path/to/data/dir', primary_keys=..., foreign_keys=...)
model = GraphSynth(schema=data.schema)
model.train(train_data=data, n_epochs=50)
More customization parameters are available via the optional arguments described in the documentation.
Synthetic data generation¶
Once the model is trained, generating synthetic data is as simple as calling the Model.sample()
method of the
trained model, which returns a RelationalData
object containing the synthetic data.
from aindo.synth import GraphSynth, RelationalData

data = RelationalData.from_dir(data_dir='path/to/data/dir', primary_keys=..., foreign_keys=...)
model = GraphSynth(schema=data.schema)
model.train(train_data=data, n_epochs=50)
data_synth = model.sample()
By default, Model.sample()
generates the same number of rows as the original dataset, but an arbitrary number of samples
can be specified via the optional argument n_samples
, a dictionary containing the desired number
of rows to generate for each table.
data = RelationalData.from_dir(data_dir='path/to/basketballman/dir', primary_keys=pks, foreign_keys=fks)
data_synth = model.sample(n_samples={'players': 4856, 'all_star': 1588, 'season': 23585})
Evaluation¶
All models expose the Model.report()
method, which writes a PDF displaying the key evaluation metrics to
the out_dir
folder.
Part of the original data should be kept out of the training phase and used to evaluate the quality of the synthetic data. In the following example, we split the data into a training set and a test set, train the model on the training set, and evaluate it on the test set.
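As a toy illustration of the kind of fidelity check such an evaluation can be based on, one can compare the categorical marginals of a real and a synthetic column, for instance via the total variation distance. This is a hand-rolled example metric, not the one computed by Model.report():

```python
import pandas as pd

def tvd(real: pd.Series, synth: pd.Series) -> float:
    """Total variation distance between the categorical marginals of
    a real and a synthetic column (a toy fidelity metric)."""
    p = real.value_counts(normalize=True)
    q = synth.value_counts(normalize=True)
    cats = p.index.union(q.index)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cats)

real = pd.Series(["A", "A", "B", "B"])
synth = pd.Series(["A", "B", "B", "B"])
print(tvd(real, synth))  # 0.25
```

A value of 0 means the two marginals coincide; 1 means they have disjoint support.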
from aindo.synth import GraphSynth, RelationalData

data = RelationalData.from_dir(data_dir='path/to/data/dir', primary_keys=..., foreign_keys=...)
train_data, test_data = data.split(test_fraction=0.2)
model = GraphSynth(schema=data.schema)
model.train(train_data=train_data, n_epochs=50)
model.report(original_data=test_data, out_dir='path/to/out/dir')
Exporting data¶
RelationalData
objects allow saving the data to disk or to a database.
For instance, we can use the RelationalData.to_csv()
method to save to disk:
from aindo.synth import RelationalData

data_synth: RelationalData = ...
data_synth.to_csv(out_dir='path/to/out/dir')
In this case, the synthetic tables are saved in the out_dir
folder as separate CSV files with the same
names as the original tables.
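For reference, the equivalent operation with plain pandas, writing one CSV file per table into an output directory, can be sketched as follows (an illustration, not the library's implementation):

```python
import tempfile
from pathlib import Path

import pandas as pd

# A dictionary of tables standing in for the synthetic RelationalData.
tables = {
    "players": pd.DataFrame({"playerID": ["p1", "p2"], "pos": ["G", "F"]}),
    "season": pd.DataFrame({"playerID": ["p1"], "year": [2001]}),
}

out_dir = Path(tempfile.mkdtemp())
# One CSV per table, named after the table.
for name, df in tables.items():
    df.to_csv(out_dir / f"{name}.csv", index=False)

print(sorted(p.name for p in out_dir.glob("*.csv")))  # ['players.csv', 'season.csv']
```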