Quick start

The aindo.rdml library uses pandas.DataFrame objects as inputs and outputs. The data must first be loaded as one or more pandas.DataFrame objects, and is then preprocessed before being fed to the generative model training routines. Finally, the trained models generate the synthetic data, which is again returned as pandas.DataFrame objects.

Data loading

To get started with the library, the data must first be loaded into memory. In this example, we show how to do this with the pandas library, using a single-table dataset stored in a CSV file. Tabular data is organized into rows and columns, where columns represent attributes and rows represent observations of those attributes. For example, let’s examine the first four columns of the UCI Adult single-table dataset.

import pandas as pd
df = pd.read_csv('path/to/adult.data', usecols=['age', 'workclass', 'fnlwgt', 'education'])
print(df)
[Out]:
       age         workclass  fnlwgt   education
0       39         State-gov   77516   Bachelors
1       50  Self-emp-not-inc   83311   Bachelors
2       38           Private  215646     HS-grad
3       53           Private  234721        11th
4       28           Private  338409   Bachelors
...    ...               ...     ...         ...
32556   27           Private  257302  Assoc-acdm
32557   40           Private  154374     HS-grad
32558   58           Private  151910     HS-grad
32559   22           Private  201490     HS-grad
32560   52      Self-emp-inc  287927     HS-grad

[32561 rows x 4 columns]

To use the aindo.rdml library, each dataset must be stored in a RelationalData object, which serves as the basic data structure. As the name suggests, this data structure can store both a single table and relational data involving multiple tables. A RelationalData object consists of two main attributes:

  1. A Data object, which is a dictionary with table names as keys and pandas.DataFrame objects as values;
  2. A Schema object, which describes the relations between tables (e.g. primary and foreign keys) and the type of each column.

Let us define a RelationalData object for the Adult dataset, with a reduced number of columns, for simplicity. All the needed classes can be found in the aindo.rdml.relational module.

import pandas as pd
from aindo.rdml.relational import Column, Table, Schema, RelationalData

dfs = {'adult': pd.read_csv(...)}
schema = Schema(
    adult=Table(
        age=Column.INTEGER,
        workclass=Column.CATEGORICAL,
        fnlwgt=Column.INTEGER,
        education=Column.TEXT,
    )
)
data = RelationalData(data=dfs, schema=schema)
print(data)
[Out]:
Schema:
adult:Table
Primary key: None
Feature columns:
age:<Column.INTEGER: 'Integer'>
workclass:<Column.CATEGORICAL: 'Categorical'>
fnlwgt:<Column.INTEGER: 'Integer'>
education:<Column.TEXT: 'Text'>
Foreign keys:

Note that in the above example the categorical column education has been declared as a Column.TEXT just for the sake of showing an example of how a text column is treated in aindo.rdml. More correctly, we should have declared it as a Column.CATEGORICAL.

An example with a more complex, multi-table data structure can be found in the Relational module section.
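Before building a RelationalData object, it can be useful to check that every column declared for a table was actually loaded. The following is a minimal plain-Python sketch of such a check. It is not part of aindo.rdml: the helper `missing_columns` and the dictionary layout it expects are hypothetical, chosen only for illustration.

```python
def missing_columns(declared, loaded):
    """Return {table: [columns]} for declared columns absent from the loaded data.

    declared: dict mapping table name -> list of declared column names
    loaded:   dict mapping table name -> list of column names actually loaded
    Both layouts are hypothetical and independent of the library's classes.
    """
    missing = {}
    for table, columns in declared.items():
        present = set(loaded.get(table, []))
        absent = [c for c in columns if c not in present]
        if absent:
            missing[table] = absent
    return missing


declared = {"adult": ["age", "workclass", "fnlwgt", "education"]}
loaded = {"adult": ["age", "workclass", "fnlwgt"]}  # 'education' was not loaded
print(missing_columns(declared, loaded))  # {'adult': ['education']}
```

With pandas, the `loaded` dictionary could be built from the loaded tables as `{name: list(df.columns) for name, df in dfs.items()}`.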

Train / test data splitting (optional)

The RelationalData class offers a utility function to split the data into train, test and possibly validation sets, while respecting the consistency of the relational data structure.

from aindo.rdml.relational import RelationalData
data = RelationalData(data=..., schema=...)
data_train_valid, data_test = data.split(ratio=0.1)
data_train, data_valid = data_train_valid.split(ratio=0.1)
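To illustrate what respecting the consistency of the relational structure means, here is a minimal sketch in plain Python. It is not the library's implementation: the function `split_relational`, the table layout and the `host_id` key are hypothetical. The idea is that parent rows are assigned to one of the two splits at random and each child row follows the parent it references, so that no foreign key points across splits.

```python
import random


def split_relational(parents, children, fk, ratio, seed=0):
    """Split parent rows at random; child rows follow their parent.

    parents:  dict mapping primary key -> parent row (hypothetical layout)
    children: list of child rows, each a dict holding the foreign key `fk`
    ratio:    fraction of parent rows assigned to the second split
    """
    rng = random.Random(seed)
    keys = sorted(parents)
    rng.shuffle(keys)
    n_second = round(len(keys) * ratio)
    second_keys = set(keys[:n_second])

    first = {"parents": {}, "children": []}
    second = {"parents": {}, "children": []}
    for key, row in parents.items():
        target = second if key in second_keys else first
        target["parents"][key] = row
    for row in children:
        target = second if row[fk] in second_keys else first
        target["children"].append(row)
    return first, second


# Toy data: 10 hosts, each with 2 listings.
hosts = {i: {"name": f"host_{i}"} for i in range(10)}
listings = [{"host_id": i, "listing": j} for i in range(10) for j in range(2)]

train, test = split_relational(hosts, listings, fk="host_id", ratio=0.2)
print(len(train["parents"]), len(test["parents"]))    # 8 2
print(len(train["children"]), len(test["children"]))  # 16 4
```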

Data preprocessing

Data preprocessing involves transforming data columns before feeding them into the model. Preprocessing is performed through a TabularPreproc object, which can be found in the aindo.rdml.synth module. TabularPreproc objects can be built with the TabularPreproc.from_schema method, which will build a default preprocessor based on the column types found in the provided Schema. After the instantiation, a TabularPreproc object needs to be fitted on a RelationalData object.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularPreproc
data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)

The preprocessing phase may also include additional operations to reduce the risk of privacy leaks, i.e. the risk of revealing personally identifiable information or sensitive data that was present in the original data. While the generative model does not copy individual data records, it could still potentially expose information if it generates data points containing rare categories or outlier numerical values.

To reduce this risk, it is possible to define a custom preprocessing of the columns through the argument preprocessors. This argument is expected to be a dictionary where the keys are the names of the tables, and the values are dictionaries containing ColumnPreproc objects for each column within the respective table. Instances of the ColumnPreproc class allow users to define custom preprocessing operations for individual columns.

For instance, to prevent the model from generating age “35” during data synthesis one would proceed as follows:

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import ColumnPreproc, TabularPreproc
data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(
    schema=data.schema,
    preprocessors={'adult': {'age': ColumnPreproc(non_sample_values=[35])}},
)
preproc.fit(data=data)
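Which values to protect is left to the user. As one possible heuristic, not part of aindo.rdml, the sketch below lists the values of a column that occur fewer than a chosen number of times. Such a list could then be reviewed and passed as non_sample_values. The helper `rare_values` and the threshold `min_count` are hypothetical.

```python
from collections import Counter


def rare_values(values, min_count):
    """Return the sorted distinct values that occur fewer than `min_count` times."""
    counts = Counter(values)
    return sorted(v for v, c in counts.items() if c < min_count)


# Toy 'age' column: 90 appears once, 17 twice, the other ages at least three times.
ages = [39, 39, 39, 50, 50, 50, 50, 38, 38, 38, 17, 17, 90]
print(rare_values(ages, min_count=3))  # [17, 90]
```

With pandas, the input list could be obtained from a column with `df['age'].tolist()`.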

The preprocessing of text columns is managed by TextPreproc objects, one for each table containing text. Since in our example we already built a TabularPreproc, we can start from it to build the TextPreproc, using the TextPreproc.from_tabular method and providing also the name of the table to consider.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularPreproc, TextPreproc
data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)
preproc_text = TextPreproc.from_tabular(preproc=preproc, table='adult')
preproc_text.fit(data=data)

Note that custom preprocessing of text columns is not supported.

Further details on preprocessing functionalities are provided in the Data preprocessing section.

Model training

The aindo.rdml library uses generative models that are trained to infer patterns and distributions of the original data.

The aindo.rdml.synth module offers two generative models for synthetic data generation, each with its own trainer:

  • A TabularModel, trained by a TabularTrainer, that generates all the relational data excluding columns that contain text.
  • A TextModel, trained by a TextTrainer, that generates only text columns. Users must define a TextModel for each table containing text columns.

To instantiate and build a TabularModel, the user has to provide a TabularPreproc object along with a string indicating the desired model size (small, medium, or large). Larger models generally learn the patterns in the data more faithfully, but they may take longer to converge.

To train the model, it is necessary to:

  • Instantiate a TabularTrainer.
  • Build a TabularDataset from the training data and the TabularPreproc.

The TabularTrainer.train() method is used to train the model, and it takes as input:

  • The TabularDataset containing the training data.
  • The maximum desired number of either training epochs (n_epochs) or training steps (n_steps).
  • Either the size of each batch of data with the batch_size argument, or alternatively the available memory (on CPU or GPU, depending on the chosen device) through the memory parameter. The latter is used to automatically estimate an optimal batch_size.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularDataset, TabularModel, TabularPreproc, TabularTrainer

data = RelationalData(data=..., schema=...)
data_train, data_test = data.split(ratio=0.1)
# Build the default preprocessor and fit it once
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)
model = TabularModel.build(preproc=preproc, size='small')
dataset_train = TabularDataset.from_data(data=data_train, preproc=preproc)
trainer = TabularTrainer(model=model)
trainer.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=256,
)

The syntax is similar for TextModel instances, but in this case the user must also specify a block_size, corresponding to the maximum text length that the model can process in a single forward step. A reasonable value for the block_size can be recovered from the TextDataset.max_text_len attribute of the training dataset.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextDataset, TextModel, TextPreproc, TextTrainer
data = RelationalData(data=..., schema=...)
data_train, data_test = data.split(ratio=0.1)
preproc_text = TextPreproc.from_schema_table(schema=data.schema, table='adult').fit(data=data)
dataset_train = TextDataset.from_data(data=data_train, preproc=preproc_text)
model_text = TextModel.build(
    preproc=preproc_text,
    size='small',
    block_size=dataset_train.max_text_len,
)
trainer_text = TextTrainer(model=model_text)
trainer_text.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
)

More customization parameters are available via the optional arguments described in the Model training section.
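The two ways of bounding the training length, n_epochs and n_steps, are related through the batch size. As a rough sketch, assuming that one epoch is a full pass over the training set and that one step processes one batch (the usual convention, although the library's exact accounting is not stated here), the number of steps corresponding to a given number of epochs can be estimated as follows. The helper `estimate_steps` is hypothetical.

```python
import math


def estimate_steps(n_rows, batch_size, n_epochs):
    """Estimate training steps, counting a final partial batch as one step."""
    steps_per_epoch = math.ceil(n_rows / batch_size)
    return steps_per_epoch * n_epochs


# Example: 29,305 training rows, roughly what remains after holding out
# about 10% of the 32,561 rows of the Adult table.
print(estimate_steps(n_rows=29_305, batch_size=256, n_epochs=100))  # 11500
```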

Synthetic data generation

Once the generative model is trained, it can generate synthetic data that closely mirrors the original without copying individual records, which supports both privacy and utility for a range of applications.

To generate synthetic data using a TabularModel it is enough to call the TabularModel.generate() method, which returns a RelationalData object containing the synthetic data. It is necessary to provide the number of samples to be generated through the n_samples parameter. Optionally, the user can specify:

  • batch_size, the batch size used during generation. Defaults to 0, which means that all the data is generated in a single batch.
  • temp, a strictly positive real number describing the amount of noise used in generation. The default value is 1. Larger values will introduce more variance, lower values will decrease it.
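The effect of temp can be illustrated with a generic temperature-scaled softmax. This is a sketch under the assumption that temp behaves like a standard sampling temperature; how the library applies it internally is not described here. The function name and the scores are made up for illustration.

```python
import math


def softmax_with_temperature(scores, temp):
    """Turn raw scores into probabilities; `temp` must be strictly positive."""
    if temp <= 0:
        raise ValueError("temp must be strictly positive")
    scaled = [s / temp for s in scores]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]  # hypothetical scores for three categories
for temp in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(scores, temp)
    print(temp, [round(p, 3) for p in probs])
```

In this sketch, a temperature below 1 concentrates probability on the highest-scoring category, while a temperature above 1 spreads it more evenly, which matches the description of lower and higher variance above.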

For instance, let’s generate the same number of rows as in the original adult table, with a batch size of 1024.

import pandas as pd
from aindo.rdml.synth import TabularModel
df = pd.read_csv(...)
model = TabularModel.build(preproc=..., size=...)
# Train the tabular model
...
data_synth = model.generate(
    n_samples=df.shape[0],
    batch_size=1024,
)

A TabularModel only generates non-text columns. An example of the output of the previous generation is the following:

{'adult':
   age  workclass  fnlwgt
0   31    Private  108501
1   39  Local-gov  228490
2   11    Private  187810
3   47    Private  113026
4   26    Private  465070
...}

To generate the text column we need to use the TextModel and provide the tabular data that we just generated.

import pandas as pd
from aindo.rdml.synth import TabularModel, TextModel
df = pd.read_csv(...)
model = TabularModel.build(preproc=..., size=...)
model_text = TextModel.build(preproc=..., size=..., block_size=...)
# Train the tabular and text models
...
data_synth = model.generate(
    n_samples=df.shape[0],
    batch_size=1024,
)
data_synth = model_text.generate(
    data=data_synth,
    batch_size=512,
)

The output data_synth is a RelationalData object containing the synthetic version of the original data, including the previously missing text column.

[Out]:
{'adult':
   age  workclass  fnlwgt                               education
0   31    Private  108501  EntityItem B-grad HS-Flagscollegeachel
1   39  Local-gov  228490                       achelachel-school
2   11    Private  187810                                 HS-grad
3   47    Private  113026                    itu Kara-assycollege
4   26    Private  465070                               8achelors
...}

The Airbnb script shows a more realistic example of text generation, with relational data and text columns in two different tables.

Evaluation

The aindo.rdml library also includes some tools to evaluate the generated synthetic data. These are found in the aindo.rdml.eval module.

The report() function outputs a PDF displaying the key metrics for the evaluation of the generated synthetic data in terms of both data quality and privacy protection. This function needs training and test data splits, the generated synthetic data and an output path for the PDF file.

The compute_privacy_stats() function performs a more detailed analysis of the privacy metrics. On top of the privacy score, it provides an estimate of its standard deviation, and the estimated fraction of real data points at risk.

from aindo.rdml.eval import report, compute_privacy_stats
from aindo.rdml.relational import RelationalData
data = RelationalData(data=..., schema=...)
data_train, data_test = data.split(ratio=0.1)
# Generate synthetic data
...
data_synth = ...
report(
    data_train=data_train,
    data_test=data_test,
    data_synth=data_synth,
    path='./report.pdf',
)
privacy_stats = compute_privacy_stats(
    data_train=data_train,
    data_synth=data_synth,
)
for t in data.schema.tables:
    print(f"Table: {t}")
    print(f"Privacy score: {privacy_stats[t].privacy_score:.2%} ({privacy_stats[t].privacy_score_std:.3%})")
    print(f"% training points at risk: {privacy_stats[t].risk * 100:.1f}")