Quick start
The aindo.rdml library uses pandas.DataFrame objects as inputs and outputs.
This means that the data must be loaded as one or more pandas.DataFrame objects.
It is then preprocessed before being fed to the generative model training routines.
Finally, the trained models generate the synthetic data and output it as pandas.DataFrame objects.
Data Loading
To get started with the library, the data must first be loaded into main memory.
In this example, we demonstrate how to do this with the pandas library and a single-table dataset stored in a CSV file.
Tabular data is organized into rows and columns, where columns represent attributes
and rows represent observations of those attributes.
For example, let’s examine the first four columns of the UCI Adult single-table dataset.
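As a minimal sketch of this loading step, the snippet below reads a small CSV into a pandas.DataFrame and keeps its first four columns. The inline CSV text is an illustrative stand-in so that the example is self-contained; with the real dataset one would pass the path of the downloaded file to pandas.read_csv instead.

```python
import io

import pandas as pd

# Illustrative stand-in for the real CSV file: the column names follow the
# UCI Adult dataset, and the few rows below are only meant as an example.
csv_text = """age,workclass,fnlwgt,education,education-num
39,State-gov,77516,Bachelors,13
50,Self-emp-not-inc,83311,Bachelors,13
38,Private,215646,HS-grad,9
"""

# With a file on disk, pass its path here instead of the in-memory buffer.
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the first four columns, selected by position.
first_four = df.iloc[:, :4]
print(first_four)
```

Each row is one observation and each column one attribute, as described above.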
To use the aindo.rdml library, each dataset must be stored in a RelationalData object,
which serves as the basic data structure.
As the name suggests, this data structure can store both a single table and relational data involving multiple tables.
A RelationalData object consists of two main attributes:
- A Data object, which is a dictionary with the table names as keys and pandas.DataFrame objects as values;
- A Schema object, which contains the structure of the relations between tables, e.g. primary and foreign keys, and a description of the column types.
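To make these two ingredients concrete, here is a plain-pandas sketch of what a multi-table dataset looks like before it is wrapped in a RelationalData object: a dictionary of tables, plus the key relationships that a schema would describe. The table and column names (users, orders, user_id, order_id) are hypothetical, and the checks are ordinary pandas calls, not part of the aindo.rdml API.

```python
import pandas as pd

# Dictionary with table names as keys and DataFrames as values.
tables = {
    "users": pd.DataFrame({"user_id": [1, 2, 3], "age": [34, 51, 27]}),
    "orders": pd.DataFrame(
        {
            "order_id": [10, 11, 12, 13],
            "user_id": [1, 1, 3, 2],  # foreign key referring to users.user_id
            "amount": [9.5, 20.0, 3.25, 14.0],
        }
    ),
}

# A primary key must identify each row of its table uniquely.
pk_is_unique = tables["users"]["user_id"].is_unique

# Every foreign key value must exist in the referenced primary key column.
fk_is_valid = tables["orders"]["user_id"].isin(tables["users"]["user_id"]).all()

print(pk_is_unique, fk_is_valid)
```

Checking these two properties before building the relational structure is a cheap way to catch inconsistent keys early.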
Let us define a RelationalData object for the Adult dataset, with a reduced number of columns for simplicity.
All the needed classes can be found in the aindo.rdml.relational module.
Note that in the above example the categorical column education has been declared as Column.TEXT only to show how a text column is treated in aindo.rdml.
Strictly speaking, it should have been declared as Column.CATEGORICAL.
An example with a more complex, multi-table data structure can be found in the Relational module section.
Train / test data splitting (optional)
The RelationalData class offers a utility function to split the data into train, test and, optionally, validation sets,
while respecting the consistency of the relational data structure.
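The idea of a split that respects the relational structure can be sketched without the library: the rows of the parent table are split first, and each child row follows its parent. This is only an illustration of the concept, with hypothetical table contents and a hypothetical helper name; it is not how the RelationalData utility is implemented.

```python
import random

# Hypothetical parent and child tables, as lists of row dictionaries.
users = [{"user_id": i} for i in range(10)]
orders = [{"order_id": 100 + i, "user_id": i % 10} for i in range(25)]


def split_by_parent(parents, children, key, test_fraction, seed=0):
    """Split parent rows at random, then send each child row with its parent."""
    rng = random.Random(seed)
    ids = [row[key] for row in parents]
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    test_ids = set(ids[:n_test])

    def part(rows, in_test):
        # Select the rows whose key falls on the requested side of the split.
        return [row for row in rows if (row[key] in test_ids) == in_test]

    train = {"parents": part(parents, False), "children": part(children, False)}
    test = {"parents": part(parents, True), "children": part(children, True)}
    return train, test


train, test = split_by_parent(users, orders, key="user_id", test_fraction=0.2)

# No child row is separated from the parent row it refers to.
train_ids = {row["user_id"] for row in train["parents"]}
assert all(row["user_id"] in train_ids for row in train["children"])
print(len(train["parents"]), len(test["parents"]))
```

Splitting the child table independently would instead leave some child rows pointing at parents in the other set.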
Data preprocessing
Data preprocessing involves transforming data columns before feeding them into the model.
Preprocessing is performed through a TabularPreproc object, which can be found in the aindo.rdml.synth module.
TabularPreproc objects can be built with the TabularPreproc.from_schema method,
which builds a default preprocessor based on the column types found in the provided Schema.
After instantiation, a TabularPreproc object needs to be fitted on a RelationalData object.
The preprocessing phase may also include additional operations to reduce the risk of privacy leaks, i.e. the risk of revealing personally identifiable information or sensitive data that was present in the original data. While the generative model does not copy individual data records, it could still potentially expose information if it generates data points containing rare categories or outlier numerical values.
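To see why rare categories and outliers matter, the following sketch flags values that appear very few times in a column and clips numerical values to a range. It uses only the standard library and made-up values; the threshold and the clipping bounds are arbitrary choices for illustration and do not reflect the defaults or the mechanism of the library's preprocessors.

```python
from collections import Counter

# Made-up column values for illustration.
occupations = ["clerk"] * 40 + ["sales"] * 35 + ["astronaut"] * 1
incomes = [21_000, 35_000, 42_000, 38_000, 2_500_000]

# Categories seen fewer than `min_count` times may single out an individual.
min_count = 5
counts = Counter(occupations)
rare = sorted(cat for cat, n in counts.items() if n < min_count)

# One possible mitigation: merge rare categories into a catch-all value.
masked = [cat if cat not in rare else "other" for cat in occupations]

# Outlier numerical values can be clipped to a chosen range.
lower, upper = 0, 100_000
clipped = [min(max(x, lower), upper) for x in incomes]

print(rare, clipped[-1])
```

A synthetic record carrying the single rare category or the extreme income would point back to one real individual, which is the kind of exposure the paragraph above describes.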
To reduce this risk, it is possible to define a custom preprocessing of the columns through the preprocessors argument.
This argument is expected to be a dictionary where the keys are the names of the tables,
and the values are dictionaries containing ColumnPreproc objects for each column within the respective table.
Instances of the ColumnPreproc class allow users to define custom preprocessing operations for individual columns.
For instance, to prevent the model from generating age “35” during data synthesis one would proceed as follows:
The preprocessing of text columns is managed by TextPreproc objects,
one for each table containing text.
Since in our example we have already built a TabularPreproc, we can start from it to build the TextPreproc,
using the TextPreproc.from_tabular method and also providing the name of the table to consider.
Note that custom preprocessing of text columns is not supported.
Further details on preprocessing functionalities are provided in the Data preprocessing section.
Model training
The aindo.rdml library uses generative models that are trained to infer the patterns and distributions of the original data.
The aindo.rdml.synth module offers two generative models for synthetic data generation, each with its own trainer:
- A TabularModel, trained by a TabularTrainer, that generates all the relational data excluding columns that contain text.
- A TextModel, trained by a TextTrainer, that generates only text columns. Users must define a TextModel for each table containing text columns.
To instantiate and build a TabularModel, the user has to provide a TabularPreproc object along with a string
indicating the desired model size (small, medium, or large).
Larger models generally learn the patterns in the data more accurately,
but they may require more time to reach convergence.
To train the model, it is necessary to:
- Instantiate a TabularTrainer.
- Build a TabularDataset from the training data and the TabularPreproc.
The TabularTrainer.train() method is used to train the model, and it takes as input:
- The TabularDataset containing the training data.
- The maximum desired number of either training epochs (n_epochs) or training steps (n_steps).
- Either the size of each batch of data with the batch_size argument, or alternatively the available memory (on CPU or GPU, depending on the chosen device) through the memory parameter. The latter is used to automatically estimate an optimal batch_size.
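The relation between the two stopping criteria can be made explicit with a little arithmetic. The sketch below assumes the usual convention that one training step processes one batch and one epoch is a full pass over the training data; the function name and the numbers are illustrative and not part of the library.

```python
import math


def steps_for_epochs(n_rows: int, batch_size: int, n_epochs: int) -> int:
    """Number of training steps needed to complete `n_epochs` full passes."""
    steps_per_epoch = math.ceil(n_rows / batch_size)  # last batch may be partial
    return steps_per_epoch * n_epochs


# Example: 10,000 training rows, batches of 256 rows, 5 epochs.
print(steps_for_epochs(n_rows=10_000, batch_size=256, n_epochs=5))
```

Under this assumption, a larger batch size means fewer steps per epoch, so a limit expressed in steps and one expressed in epochs are not interchangeable unless the batch size is fixed.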
The syntax is similar for TextModel instances, but in this case the user must also specify a block_size,
corresponding to the maximum text length that the model can process in a single forward step.
A reasonable value for the block_size can be recovered from the TextDataset.max_text_len attribute
of the training dataset.
More customization parameters are available via the optional arguments described in the Model training section.
Synthetic data generation
Once the generative model is trained, it can generate synthetic data that closely mirrors the original without containing any personally identifiable information, ensuring both privacy and utility for various applications.
To generate synthetic data using a TabularModel, it is enough to call the TabularModel.generate() method,
which returns a RelationalData object containing the synthetic data.
The number of samples to be generated must be provided through the n_samples parameter.
Optionally, the user can specify:
- batch_size, the batch size used during generation. Defaults to 0, which means that all the data is generated in a single batch.
- temp, a strictly positive real number describing the amount of noise used in generation. The default value is 1. Larger values introduce more variance, lower values decrease it.
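How temp enters the sampling procedure is not specified here. A common convention in generative models, assumed here purely for illustration, is to divide the model's unnormalized scores (logits) by the temperature before normalizing them into probabilities. Under that assumption, the sketch below shows why larger values add variance and smaller values reduce it.

```python
import math


def softmax_with_temperature(logits, temp=1.0):
    """Turn scores into probabilities, after dividing them by `temp`."""
    if temp <= 0:
        raise ValueError("temp must be strictly positive")
    scaled = [x / temp for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - shift) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate values
for temp in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

With a temperature below 1 the most likely value dominates even more, while a temperature above 1 flattens the distribution and makes less likely values appear more often.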
For instance, let’s generate the same number of rows as in the original adult table, with a batch size of 1024.
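How a total number of samples is divided into batches can be sketched as follows. The helper is hypothetical and only mirrors the behavior described above, where a batch_size of 0 means a single batch; the sample count is an arbitrary example.

```python
def batch_sizes(n_samples: int, batch_size: int = 0) -> list[int]:
    """Sizes of the batches needed to produce `n_samples` samples."""
    if batch_size == 0:
        return [n_samples]  # default: everything in a single batch
    full, rest = divmod(n_samples, batch_size)
    # `full` complete batches, plus one smaller batch for the remainder.
    return [batch_size] * full + ([rest] if rest else [])


print(batch_sizes(2500, 1024))
print(batch_sizes(2500))
```

Generating in batches bounds the memory needed at any one time, at the cost of more generation rounds.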
A TabularModel only generates non-text columns.
An example of the output of the previous generation is the following:
To generate the text column we need to use the TextModel and provide the tabular data that we just generated.
The output data_synth is a RelationalData object containing the synthetic version of the original data,
including the previously missing text column.
The Airbnb script shows a more realistic example of text generation, with relational data and text columns in two different tables.
Evaluation
The aindo.rdml library also includes some tools to evaluate the generated synthetic data.
These are found in the aindo.rdml.eval module.
The report() function outputs a PDF displaying the key metrics for the evaluation of the generated synthetic data
in terms of both data quality and privacy protection.
This function needs the training and test data splits, the generated synthetic data and an output path for the PDF file.
The compute_privacy_stats() function performs a more detailed analysis of the privacy metrics.
In addition to the privacy score, it provides an estimate of its standard deviation and the estimated fraction of real
data points at risk.