
Introduction

Welcome to the documentation for the aindo.rdml library.

aindo.rdml is a library for generating synthetic tabular and relational data using neural generative models. With its API, users can:

  1. Preprocess tabular and relational data;
  2. Train generative models for synthetic data generation;
  3. Generate synthetic data and assess its quality.

This page serves as a comprehensive guide to the functionalities offered by the aindo.rdml library, organized into three main modules:

  • The relational module is designed to transform and preprocess data organized in tabular and relational structures;
  • The synth module is responsible for the training of generative models and the generation of synthetic data using the trained models;
  • The eval module evaluates the quality of the synthetic data. It provides metrics to assess the similarity between synthetic and real data and to measure the level of privacy protection.

Quick start

Data Loading

To get started with the library, the data must first be loaded into memory. In this example, we load a single-table dataset from a CSV file with the pandas library. Tabular data is organized into rows and columns, where columns represent attributes and rows represent observations of those attributes. For example, let’s examine the first four columns of the UCI Adult single-table dataset:

import pandas as pd
df = pd.read_csv('path/to/adult/dir/adult.csv', usecols=['age', 'workclass', 'fnlwgt', 'education'])
print(df)
[Out]:
age workclass fnlwgt education
0 39 State-gov 77516 Bachelors
1 50 Self-emp-not-inc 83311 Bachelors
2 38 Private 215646 HS-grad
3 53 Private 234721 11th
4 28 Private 338409 Bachelors
... ... ... ... ...
32556 27 Private 257302 Assoc-acdm
32557 40 Private 154374 HS-grad
32558 58 Private 151910 HS-grad
32559 22 Private 201490 HS-grad
32560 52 Self-emp-inc 287927 HS-grad
[32561 rows x 4 columns]

To use the aindo.rdml library, each dataset must be stored in a RelationalData object, which serves as the basic data structure. As the name suggests, this data structure can store both a single table and relational data involving multiple tables. A RelationalData object consists of two main attributes:

  1. A Data object, which is a dictionary with table names as keys and pandas.DataFrame objects as values;
  2. A Schema object, which describes the relations between tables (e.g. primary and foreign keys) and the type of each column.

Let’s define a RelationalData object for the Adult dataset:

import pandas as pd
from aindo.rdml.relational import Column, Table, Schema, RelationalData
dfs = {'adult': pd.read_csv(...)}
schema = Schema(
    adult=Table(
        age=Column.INTEGER,
        workclass=Column.CATEGORICAL,
        fnlwgt=Column.INTEGER,
        education=Column.TEXT,
    )
)
data = RelationalData(data=dfs, schema=schema)
print(data)
[Out]:
Schema:
adult:Table
Primary key: None
Feature columns:
age:<Column.INTEGER: 'Integer'>
workclass:<Column.CATEGORICAL: 'Categorical'>
fnlwgt:<Column.INTEGER: 'Integer'>
education:<Column.TEXT: 'Text'>
Foreign keys:

An example with a more complex, multi-table data structure can be found in the Relational module section.
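
Before declaring a schema, it can help to confirm that every column named in it is actually present in the source file. The following is a minimal sketch using only the Python standard library; the helper `missing_columns` and the inline CSV text are hypothetical and are not part of aindo.rdml.

```python
import csv
import io

# Hypothetical CSV content standing in for the Adult CSV file.
csv_text = "age,workclass,fnlwgt,education\n39,State-gov,77516,Bachelors\n"

# Column names declared for the 'adult' table in the schema example.
declared = ["age", "workclass", "fnlwgt", "education"]


def missing_columns(csv_file, declared_columns):
    """Return the declared columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file))
    return [c for c in declared_columns if c not in header]


missing = missing_columns(io.StringIO(csv_text), declared)
print(missing)  # → []
```

An empty list means that every declared column has a matching header in the file.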

Train / test data splitting (optional)

The RelationalData class offers a utility function to split the data into train, test and possibly validation sets, while respecting the consistency of the relational data structure.

from aindo.rdml.relational import RelationalData
data = RelationalData(data=..., schema=...)
data_train_valid, data_test = data.split(ratio=0.1)
data_train, data_valid = data_train_valid.split(ratio=0.1)
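
Applying two successive splits compounds the ratios. The sketch below works out the resulting fractions, assuming that `ratio` is the fraction of rows assigned to the second returned set (an assumption based on the variable names in the example, not a documented guarantee). The helper name `nested_split_fractions` is hypothetical.

```python
def nested_split_fractions(test_ratio, valid_ratio):
    """Fractions of the full dataset ending up in train, valid and test.

    Assumes each split sends `ratio` of its input to the second output.
    """
    train_valid = 1.0 - test_ratio      # what remains after removing the test set
    valid = train_valid * valid_ratio   # validation share of the full dataset
    train = train_valid - valid         # what is left for training
    return train, valid, test_ratio


train, valid, test = nested_split_fractions(0.1, 0.1)
print(f"train={train:.2f} valid={valid:.2f} test={test:.2f}")
# → train=0.81 valid=0.09 test=0.10
```

Under this assumption, the two calls with ratio=0.1 leave roughly 81% of the rows for training, 9% for validation and 10% for testing.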

Data preprocessing

Data preprocessing involves transforming data columns before feeding them into the model. Preprocessing is performed through a TabularPreproc object, which requires a Schema object as an argument. This will instantiate a default preprocessor based on column types provided in the Schema. After the instantiation, a TabularPreproc object needs to be fitted on a RelationalData object:

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularPreproc
data = RelationalData(data=..., schema=...)
preproc = TabularPreproc(schema=data.schema)
preproc.fit(data=data)

The preprocessing phase may also include additional operations to reduce the risk of privacy leaks, i.e. the risk of revealing personally identifiable information or sensitive data that was present in the original data. While the generative model does not copy individual data records, it could still potentially expose information if it generates data points containing rare categories or outlier numerical values.

To reduce the risk of privacy leaks, it is possible to define a custom preprocessing of the columns through the preprocessors argument. This argument is expected to be a dictionary where the keys are table names and the values are dictionaries mapping each column of that table to a ColumnPreproc object. Instances of the ColumnPreproc class allow users to define custom preprocessing operations for individual columns.

For instance, to prevent the model from generating the age “35” during data synthesis, one would proceed as follows:

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import ColumnPreproc, TabularPreproc
data = RelationalData(data=..., schema=...)
preproc = TabularPreproc(
    schema=data.schema,
    preprocessors={'adult': {'age': ColumnPreproc(non_sample_values=[35])}},
)
preproc.fit(data=data)
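
One way to find candidate values for non_sample_values is to look for values that occur only rarely in a column. The sketch below uses only the standard library; the helper `rare_values`, the threshold and the sample ages are hypothetical, and choosing which values to exclude remains a judgment call that depends on the dataset.

```python
from collections import Counter


def rare_values(values, min_count):
    """Return the sorted values that appear fewer than `min_count` times."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n < min_count)


# Hypothetical sample of an 'age' column.
ages = [39, 50, 38, 53, 28, 39, 50, 90, 38, 39]
print(rare_values(ages, min_count=2))  # → [28, 53, 90]
```

The resulting list could then be passed as non_sample_values, in the same way as the single value 35 in the example above.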

The preprocessing of tables containing text columns is managed by individual TextPreproc objects, one for each table containing text. However, custom preprocessing of text columns is not supported.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextPreproc
data = RelationalData(data=..., schema=...)
preproc_text = TextPreproc(schema=data.schema, table='adult')
preproc_text.fit(data=data)

Further details on preprocessing functionalities are provided in the Data preprocessing section.

Model training

The aindo.rdml library uses generative models that learn the patterns and distributions of the original data during training.

The aindo.rdml library offers two generative models for synthetic data generation, each with its own trainer:

  • A TabularModel, trained by a TabularTrainer, that generates all the relational data excluding columns that contain text.
  • A TextModel, trained by a TextTrainer, that generates only text columns. Users must define a TextModel for each table containing text columns.

To instantiate and build a TabularModel, the user provides a TabularPreproc object and a string indicating the desired model size (small, medium, or large). Larger models generally learn the patterns in the data more accurately, but they may take longer to converge. The model is then used to instantiate a TabularTrainer. The train() method of the TabularTrainer requires two arguments: the training data and the number of training epochs. The batch size can be set with the optional batch_size argument. If it is omitted, the user must instead specify the available memory (on CPU or GPU, depending on the chosen device) through the memory parameter, and a suitable batch_size is estimated automatically.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel, TabularPreproc, TabularTrainer
data = RelationalData(data=..., schema=...)
data_train, data_test = data.split(ratio=0.1)
preproc = TabularPreproc(schema=data.schema).fit(data=data)
model_tabular = TabularModel.build(preproc=preproc, size='small')
trainer_tabular = TabularTrainer(model=model_tabular)
trainer_tabular.train(
    data=data_train,
    n_epochs=100,
    batch_size=256,
)
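
To get a feel for how n_epochs and batch_size interact, the following sketch estimates the number of optimization steps for a single table. It assumes that one epoch is a single pass over the training rows and that a final partial batch still counts as a step; both are assumptions for illustration, not statements about the trainer's internals. The helper `total_training_steps` is hypothetical.

```python
import math


def total_training_steps(n_rows, batch_size, n_epochs):
    """Estimated number of optimization steps over the whole training run."""
    steps_per_epoch = math.ceil(n_rows / batch_size)  # last batch may be partial
    return steps_per_epoch * n_epochs


print(total_training_steps(n_rows=1000, batch_size=256, n_epochs=100))  # → 400
```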

The syntax is similar for TextModel instances, but in this case, the user must also specify a block_size, corresponding to the maximum text length that the model can process in a single forward step.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextModel, TextPreproc, TextTrainer
data = RelationalData(data=..., schema=...)
data_train, data_test = data.split(ratio=0.1)
preproc_text = TextPreproc(schema=data.schema, table='adult').fit(data=data)
model_text = TextModel.build(
    preproc=preproc_text,
    size='small',
    block_size=1024,
)
trainer_text = TextTrainer(model=model_text)
trainer_text.train(
    data=data_train,
    n_epochs=100,
    batch_size=32,
)
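
When choosing a block_size, it may help to check how many texts in a column would fit within a given limit. The sketch below measures length in whitespace-separated words purely for illustration; how the library actually tokenizes text and measures length is not described here, so the numbers are only a rough proxy. The helper `fraction_within_block` and the sample texts are hypothetical.

```python
def fraction_within_block(texts, block_size):
    """Fraction of texts whose word count does not exceed `block_size`."""
    if not texts:
        return 0.0
    lengths = [len(t.split()) for t in texts]
    return sum(length <= block_size for length in lengths) / len(lengths)


# Hypothetical text values.
texts = ["HS-grad", "Some college", "Bachelors degree in computer science"]
coverage = fraction_within_block(texts, block_size=2)
print(f"{coverage:.0%} of the texts fit within the limit")
```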

More customization parameters are available via the optional arguments described in the Model training section.

Synthetic data generation

Once the generative model is trained, it can generate synthetic data that closely mirrors the original without containing any specific identifiable information, ensuring both privacy and utility for various applications.

To generate synthetic data using a TabularModel it is enough to call the generate() method of the model, which returns a RelationalData object containing the synthetic data. It is necessary to provide the number of samples to be generated. Optionally, the user can specify:

  • batch_size is the batch size used during generation. Defaults to 0, which means that all the data is generated in a single batch.
  • temp is a strictly positive real number describing the amount of noise used in generation. The default value is 1, while larger values will introduce more variance and lower values will decrease the variance.
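
The effect of a temperature parameter can be illustrated with the usual temperature-scaled softmax, where scores are divided by the temperature before being turned into probabilities. This is a generic sketch of that technique; whether aindo.rdml applies temp in exactly this way is an assumption. The function name and the scores are hypothetical.

```python
import math


def softmax_with_temperature(logits, temp):
    """Convert scores to probabilities after dividing them by `temp`."""
    if temp <= 0:
        raise ValueError("temp must be strictly positive")
    scaled = [x / temp for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three categories.
logits = [2.0, 1.0, 0.0]
for temp in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

With a lower temperature the most likely category takes a larger share of the probability, while a higher temperature flattens the distribution and so increases the variance of the samples.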

For instance, let’s generate the same number of rows as in the original adult table, with a batch size of 1024:

import pandas as pd
from aindo.rdml.synth import TabularModel
df = pd.read_csv(...)
model_tabular = TabularModel.build(preproc=..., size=...)
# Train the tabular model
...
data_synth = model_tabular.generate(
    n_samples=df.shape[0],
    batch_size=1024,
)

A TabularModel only generates non-text columns. An example of the output of the previous generation is the following:

{'adult':
age workclass fnlwgt
0 31 Private 108501
1 39 Local-gov 228490
2 11 Private 187810
3 47 Private 113026
4 26 Private 465070
...}

To generate the text column we need to use the TextModel and provide the tabular data that we just generated:

import pandas as pd
from aindo.rdml.synth import TabularModel, TextModel
df = pd.read_csv(...)
model_tabular = TabularModel.build(preproc=..., size=...)
model_text = TextModel.build(preproc=..., size=..., block_size=...)
# Train the tabular and text models
...
data_synth = model_tabular.generate(
    n_samples=df.shape[0],
    batch_size=1024,
)
synth_adult = model_text.generate(
    data=data_synth,
    batch_size=512,
)

The output synth_adult is a RelationalData object containing the synthetic version of the original data, including the previously missing text column.

[Out]:
{'adult':
age workclass fnlwgt education
0 31 Private 108501 EntityItem B-grad HS-Flagscollegeachel
1 39 Local-gov 228490 achelachel-school
2 11 Private 187810 HS-grad
3 47 Private 113026 itu Kara-assycollege
4 26 Private 465070 8achelors
...}

The Synthetic data generation section shows an example of text generation on relational data.

Evaluation

The aindo.rdml library also includes tools to evaluate the generated synthetic data. These are found in the eval module.

The report() function outputs a PDF displaying the key metrics for the evaluation of the generated synthetic data in terms of both data quality and privacy protection. This function needs training and test data splits, the generated synthetic data and an output path for the PDF file.

The compute_privacy_stats() function performs a more detailed analysis of the privacy metrics. On top of the privacy score, it provides an estimate of its standard deviation, and the estimated fraction of real data points at risk.

from aindo.rdml.eval import report, compute_privacy_stats
from aindo.rdml.relational import RelationalData
data = RelationalData(data=..., schema=...)
data_train, data_test = data.split(ratio=0.1)
# Generate synthetic data
...
data_synth = ...
report(
    data_train=data_train,
    data_test=data_test,
    data_synth=data_synth,
    path='./report.pdf',
)
privacy_stats = compute_privacy_stats(
    data_train=data_train,
    data_synth=data_synth,
)
for t in data.schema.tables:
    print(f"Table: {t}")
    print(f"Privacy score: {privacy_stats[t].privacy_score:.2%} ({privacy_stats[t].privacy_score_std:.3%})")
    print(f"% training points at risk: {privacy_stats[t].risk * 100:.1f}")
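
The number formatting used when printing such statistics can be checked in isolation. In Python's format mini-language, a spec such as `.2%` multiplies the value by 100 and appends a percent sign, while `.1f` prints a fixed number of decimals without scaling. The values below are hypothetical and assume that the scores are fractions between 0 and 1.

```python
# Hypothetical values standing in for the fields of a privacy stats object.
privacy_score = 0.9234
privacy_score_std = 0.0123
risk = 0.0456

# '%' scales by 100 and appends the percent sign.
line_score = f"Privacy score: {privacy_score:.2%} ({privacy_score_std:.3%})"
# 'f' does not scale, so the multiplication by 100 is done explicitly.
line_risk = f"% training points at risk: {risk * 100:.1f}"

print(line_score)  # → Privacy score: 92.34% (1.230%)
print(line_risk)   # → % training points at risk: 4.6
```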

Modules

  1. Relational
  2. Synth
  3. Eval
