Introduction
Welcome to the documentation for the aindo.rdml library.
aindo.rdml is a library for the generation of synthetic tabular and relational data using neural generative models.
With its intuitive interface, users can:
- Preprocess tabular and relational data;
- Train generative models for synthetic data generation;
- Generate synthetic data and assess its quality.
This page serves as a guide to the functionalities offered by the aindo.rdml library, which is organized into three main modules:
- The relational module transforms and preprocesses data organized in tabular and relational structures;
- The synth module trains generative models and generates synthetic data using the trained models;
- The eval module evaluates the quality of the synthetic data. It provides metrics to assess the similarity between synthetic and real data and to measure the level of privacy protection.
Quick start
Data Loading
To get started with the library, data must first be loaded into main memory.
In this example, we demonstrate how to do this with the pandas library and a single-table dataset from a CSV file.
Tabular data is organized into rows and columns, where columns represent attributes
and rows represent observations of those attributes.
For example, let’s examine the first four columns of the UCI Adult single-table dataset:
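A minimal sketch of this loading step with pandas. To keep it self-contained, it reads three sample rows in the Adult column layout from an in-memory string rather than from a file; the variable names `csv_text` and `adult` are illustrative assumptions.

```python
import io

import pandas as pd

# Three sample rows with the first four Adult columns, used here instead of a file.
# With a file on disk, pass its path to pd.read_csv instead of the StringIO buffer.
csv_text = """age,workclass,fnlwgt,education
39,State-gov,77516,Bachelors
50,Self-emp-not-inc,83311,Bachelors
38,Private,215646,HS-grad
"""

adult = pd.read_csv(io.StringIO(csv_text))

# Columns are attributes, rows are observations.
print(adult.head())
print(adult.dtypes)
```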
To use the aindo.rdml library, each dataset must be stored in a RelationalData object, which serves as the basic data structure.
As the name suggests, this data structure can store both a single table and relational data involving multiple tables.
A RelationalData object consists of two main attributes:
- A Data object, which is a dictionary with table names as keys and pandas.DataFrame objects as values;
- A Schema object, which contains the structure of the relations between tables, e.g. primary and foreign keys, and a description of the column types.
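To picture these two parts, here is a sketch built from plain Python containers. It does not use the aindo.rdml classes: `tables`, `schema_sketch`, the type labels, and `dangling_foreign_keys` are hypothetical stand-ins, and rows are plain dictionaries rather than pandas.DataFrame objects.

```python
# Hypothetical stand-in for Data: table name -> rows.
tables = {
    "customers": [
        {"customer_id": 1, "age": 39},
        {"customer_id": 2, "age": 50},
    ],
    "orders": [
        {"order_id": 10, "customer_id": 1, "amount": 12.5},
        {"order_id": 11, "customer_id": 1, "amount": 40.0},
        {"order_id": 12, "customer_id": 2, "amount": 7.2},
    ],
}

# Hypothetical stand-in for Schema: the role or type of each column, per table.
schema_sketch = {
    "customers": {"customer_id": "primary_key", "age": "numeric"},
    "orders": {
        "order_id": "primary_key",
        "customer_id": ("foreign_key", "customers"),
        "amount": "numeric",
    },
}


def dangling_foreign_keys(tables, schema):
    """Return (table, column, value) triples whose foreign key has no parent row."""
    problems = []
    for table, columns in schema.items():
        for column, kind in columns.items():
            if isinstance(kind, tuple) and kind[0] == "foreign_key":
                parent = kind[1]
                parent_pk = next(
                    c for c, k in schema[parent].items() if k == "primary_key"
                )
                parent_ids = {row[parent_pk] for row in tables[parent]}
                for row in tables[table]:
                    if row[column] not in parent_ids:
                        problems.append((table, column, row[column]))
    return problems


print(dangling_foreign_keys(tables, schema_sketch))
```

Keeping the relations in a separate structure is what allows checks like this one to run on any set of tables.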
Let’s define a RelationalData object for the Adult dataset:
An example with a more complex, multi-table data structure can be found in the Relational module section.
Train / test data splitting (optional)
The RelationalData
class offers a utility function to split the data into train, test and possibly validation sets,
while respecting the consistency of the relational data structure.
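The idea of a split that respects the relational structure can be sketched in plain Python: parent rows are split at random, and every child row follows its parent. This illustrates the concept only, not the library's implementation; `split_relational` and the two-table dataset are hypothetical.

```python
import random


def split_relational(parents, children, key, test_fraction, seed=0):
    """Split parent rows at random and send each child row to its parent's side."""
    rng = random.Random(seed)
    ids = [row[key] for row in parents]
    rng.shuffle(ids)
    test_ids = set(ids[: round(len(ids) * test_fraction)])

    def side(rows, in_test):
        return [row for row in rows if (row[key] in test_ids) == in_test]

    train = {"parents": side(parents, False), "children": side(children, False)}
    test = {"parents": side(parents, True), "children": side(children, True)}
    return train, test


# Hypothetical two-table dataset: 10 customers with 3 orders each.
customers = [{"customer_id": i} for i in range(10)]
orders = [
    {"order_id": 3 * i + j, "customer_id": i} for i in range(10) for j in range(3)
]

train, test = split_relational(customers, orders, key="customer_id", test_fraction=0.2)
print(len(train["parents"]), len(test["parents"]))
print(len(train["children"]), len(test["children"]))
```

Splitting the child table independently would leave some orders pointing to customers in the other set, which is the inconsistency this kind of split avoids.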
Data preprocessing
Data preprocessing involves transforming data columns before feeding them into the model.
Preprocessing is performed through a TabularPreproc object, which requires a Schema object as an argument.
This will instantiate a default preprocessor based on the column types provided in the Schema.
After instantiation, a TabularPreproc object needs to be fitted on a RelationalData object:
The preprocessing phase may also include additional operations to reduce the risk of privacy leaks, i.e. the risk of revealing personally identifiable information or sensitive data that was present in the original data. While the generative model does not copy individual data records, it could still potentially expose information if it generates data points containing rare categories or outlier numerical values.
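Two generic ways to limit this kind of exposure are to group rare categories under a placeholder and to clip numerical values to a fixed range. The sketch below illustrates both in plain Python; it is not the library's preprocessing, and `mask_rare_categories`, `clip_outliers`, the thresholds, and the sample values are assumptions made for illustration.

```python
from collections import Counter


def mask_rare_categories(values, min_count, placeholder="OTHER"):
    """Replace categories seen fewer than min_count times with a placeholder."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]


def clip_outliers(values, low, high):
    """Clamp numerical values to the closed interval [low, high]."""
    return [min(max(v, low), high) for v in values]


# Illustrative column values, not taken from the real dataset.
workclass = ["Private"] * 5 + ["State-gov"] * 3 + ["Never-worked"]
ages = [17, 25, 39, 50, 90]

print(mask_rare_categories(workclass, min_count=2))
print(clip_outliers(ages, low=18, high=80))
```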
To reduce the risk of privacy leaks, it is possible to define a custom preprocessing of the columns through the preprocessors argument.
This argument is expected to be a dictionary where the keys are the names of tables, and the values are dictionaries containing ColumnPreproc objects for each column within the respective table.
Instances of the ColumnPreproc class allow users to define custom preprocessing operations for individual columns.
For instance, to prevent the model from generating age “35” during data synthesis, one would proceed as follows:
The preprocessing of tables containing text columns is managed by individual TextPreproc objects, one for each table containing text.
However, custom preprocessing of text columns is not supported.
Further details on preprocessing functionalities are provided in the Data preprocessing section.
Model training
The aindo.rdml library uses generative models that learn the patterns and distributions of the original data during training.
It offers two generative models for synthetic data generation, each with its own trainer:
- A TabularModel, trained by a TabularTrainer, that generates all the relational data excluding columns that contain text.
- A TextModel, trained by a TextTrainer, that generates only text columns. Users must define a TextModel for each table containing text columns.
To instantiate and build a TabularModel, the user has to provide a TabularPreproc object along with a string indicating the desired model size (small, medium, or large). Larger models generally learn the patterns in the data more accurately, but they may take longer to converge.
Then, the model is used to instantiate a TabularTrainer.
To train the model, the train() method of the TabularTrainer needs two arguments: the training data and the desired number of training epochs.
The user can specify the size of a batch of data during training with the optional batch_size argument.
Otherwise, the user must specify the available memory (on CPU or GPU, depending on the chosen device) through the memory parameter, and an optimal batch_size will be automatically estimated.
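As a rough picture of how the number of epochs and batch_size interact, the sketch below counts model updates under two assumptions that the text does not state: one update per batch, and a final partial batch that is kept. `optimizer_steps` is a hypothetical helper, not part of aindo.rdml, and the figures are arbitrary.

```python
import math


def optimizer_steps(n_samples, n_epochs, batch_size):
    """Number of updates when every batch, including a final partial one, is used."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * n_epochs


# Arbitrary example: 10,000 training samples, 100 epochs.
for batch_size in (64, 256, 1024):
    print(batch_size, optimizer_steps(10_000, n_epochs=100, batch_size=batch_size))
```

Under these assumptions, a larger batch gives fewer and larger updates per epoch, which is why the batch size is tied to the memory available on the device.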
The syntax is similar for TextModel instances, but in this case, the user must also specify a block_size, corresponding to the maximum text length that the model can process in a single forward step.
More customization parameters are available via the optional arguments described in the Model training section.
Synthetic data generation
Once the generative model is trained, it can generate synthetic data that closely mirrors the statistical properties of the original without reproducing individual records.
To generate synthetic data using a TabularModel, it is enough to call the generate() method of the model, which returns a RelationalData object containing the synthetic data.
It is necessary to provide the number of samples to be generated.
Optionally, the user can specify:
- batch_size: the batch size used during generation. Defaults to 0, which means that all the data is generated in a single batch.
- temp: a strictly positive real number describing the amount of noise used in generation. The default value is 1; larger values increase the variance and lower values decrease it.
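The text does not say how temp is applied internally. A common convention in generative models is to divide the logits by the temperature before the softmax, and the sketch below shows the effect of that convention on a toy distribution. `softmax_with_temperature` and the logits are assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, temp):
    """Softmax of logits / temp; temp must be strictly positive."""
    if temp <= 0:
        raise ValueError("temp must be strictly positive")
    scaled = [x / temp for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy scores for three categories
for temp in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Under this convention, values below 1 concentrate probability on the most likely category, and values above 1 flatten the distribution, which matches the lower and higher variance described above.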
For instance, let’s generate the same number of rows as in the original adult table, with a batch size of 1024:
A TabularModel only generates non-text columns.
An example of the output of the previous generation is the following:
To generate the text column, we need to use the TextModel and provide the tabular data that we just generated:
The output synth_adult is a RelationalData object containing the synthetic version of the original data, including the previously missing text column.
The Synthetic data generation section shows an example of text generation on relational data.
Evaluation
The aindo.rdml library also includes some tools to evaluate the generated synthetic data.
These are found in the eval package.
The report() function outputs a PDF displaying the key metrics for the evaluation of the generated synthetic data in terms of both data quality and privacy protection.
This function needs training and test data splits, the generated synthetic data and an output path for the PDF file.
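To give a sense of what a data-quality comparison can look like, the sketch below computes the total variation distance between the category frequencies of a real and a synthetic column. This is a generic check chosen for illustration; the text does not say which metrics the report contains, and `total_variation_distance` and the toy columns are assumptions.

```python
from collections import Counter


def total_variation_distance(real, synth):
    """Half the sum of absolute differences between category frequencies (0 to 1)."""
    real_counts, synth_counts = Counter(real), Counter(synth)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synth))
        for c in categories
    )


# Toy columns: the synthetic one over-represents category "b".
real_column = ["a", "a", "b", "b"]
synth_column = ["a", "b", "b", "b"]
print(total_variation_distance(real_column, synth_column))
```

A value of 0 means the two columns have identical category frequencies, and a value of 1 means they share no category.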
The compute_privacy_stats() function performs a more detailed analysis of the privacy metrics.
On top of the privacy score, it provides an estimate of its standard deviation, and the estimated fraction of real data points at risk.