Synth
The aindo.rdml.synth module allows users to:
- Preprocess columns within each table;
- Train generative models on relational tabular data;
- Generate synthetic data.
To illustrate the full process, from preprocessing to synthetic data generation, we will use the Airbnb Open Data dataset. The original dataset consists of a single table, but upon further inspection, it is clear that we can rearrange it in a more “natural” form, by splitting it into two tables:
- A table host, with primary key host_id.
- A table listings, with primary key id and foreign key host_id, referring to the primary key of host.
Both tables have a text column: host_name in host, and name in listings.
For simplicity, we will focus here on the latter; the same operations can, however, be performed on the host table as well.
The Airbnb script lays out a full end-to-end example using the Airbnb dataset, in which both text columns are taken into account.
Let us start by defining the Schema and loading the data, as follows:
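A minimal sketch of this step is given below. The Schema and RelationalData classes are named in this guide, but the import path aindo.rdml.relational, the Table/Column/PrimaryKey/ForeignKey spellings, the file paths, and the exact constructor signatures are assumptions used only for illustration:

```python
import pandas as pd

# NOTE: the import path and the constructor signatures below are assumptions.
from aindo.rdml.relational import Column, ForeignKey, PrimaryKey, RelationalData, Schema, Table

# Two-table layout of the Airbnb Open Data dataset (only a few columns are listed).
schema = Schema(
    host=Table(
        host_id=PrimaryKey(),
        host_name=Column.TEXT,
    ),
    listings=Table(
        id=PrimaryKey(),
        host_id=ForeignKey(parent="host"),
        name=Column.TEXT,
        neighbourhood_group=Column.CATEGORICAL,
        neighbourhood=Column.CATEGORICAL,
        price=Column.NUMERIC,
        minimum_nights=Column.INTEGER,
    ),
)

# Load the two tables (the paths are placeholders) and wrap them in a RelationalData object.
data = RelationalData(
    data={
        "host": pd.read_csv("data/host.csv"),
        "listings": pd.read_csv("data/listings.csv"),
    },
    schema=schema,
)
```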
Data preprocessing
Data preprocessing means transforming the data columns to make them suitable for model training. This process can include optional steps to reduce the risk of privacy breaches and guarantee data anonymization.
Preprocessing is performed with a TabularPreproc object.
To instantiate the default preprocessor, users can pass a Schema object to the TabularPreproc.from_schema method.
After instantiation, the TabularPreproc object must be fitted on a RelationalData object.
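For instance, building on the schema and data defined above (the fit() method name and the keyword spellings are assumptions based on the description of the fitting step):

```python
from aindo.rdml.synth import TabularPreproc

# Default preprocessing: build the preprocessor from the schema and fit it on the data.
preproc = TabularPreproc.from_schema(schema=schema)
preproc.fit(data)  # the method name `fit` is assumed here
```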
Users also have the option to specify custom preprocessing for each column.
This is achieved by passing the preprocessors argument to the TabularPreproc.from_schema method.
The preprocessors parameter is a dictionary whose keys are table names and whose values are dictionaries mapping column names to one of the following:
- A ColumnPreproc object, enabling users to define a custom behavior for that column during the preprocessing step;
- A None value, telling the preprocessor to ignore that column;
- A custom column instance. This option is designed for advanced users seeking access to lower-level functionalities.
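The sketch below combines the first two options for the listings table; the preprocessors layout follows the description above, while the column choices are purely illustrative:

```python
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=schema,
    preprocessors={
        "listings": {
            "price": ColumnPreproc(impute_nan=True),  # custom behavior for this column
            "license": None,  # hypothetical column, ignored by the preprocessor
        },
    },
)
preproc.fit(data)
```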
The preprocessing of text data is managed by TextPreproc objects, one for each table containing text.
The TextPreproc objects also need to preprocess the tabular part of the data, which is used to condition the text during training and generation.
In most cases, the text columns are generated in addition to the rest of the tabular data, and therefore a TabularPreproc object is already available.
Each TextPreproc object can then be built from it with the TextPreproc.from_tabular method, which also requires the name of the table to consider.
In case no TabularPreproc object is available, the text preprocessor can also be built from scratch, using the TextPreproc.from_schema_table method, which requires the Schema and the name of the table containing the text columns.
To ensure consistency, the first method is recommended when both tabular and text data need to be generated.
It is important to note that custom preprocessing of text columns is not supported.
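A sketch of both construction routes for the listings table follows; the keyword names passed to the two methods are assumptions:

```python
from aindo.rdml.synth import TabularPreproc, TextPreproc

# Recommended route: build the text preprocessor from the fitted tabular preprocessor.
preproc = TabularPreproc.from_schema(schema=schema)
preproc.fit(data)
preproc_text = TextPreproc.from_tabular(preproc=preproc, table="listings")
preproc_text.fit(data)

# Alternative route, when no TabularPreproc object is available:
# preproc_text = TextPreproc.from_schema_table(schema=schema, table="listings")
```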
ColumnPreproc (advanced user)
A ColumnPreproc object offers four optional parameters designed to customize the preprocessing of a column:
- special_values: Provide a set of special values that will be treated as separate from the other values of the column, for example in a column with mixed-type values.
- impute_nan: Force the model to avoid generating missing values in the synthetic data.
- non_sample_values: Provide a set of values that will not be generated in the synthetic data.
- protection: Add extra protection from potential privacy leaks coming from rare or extremal values present in the original column data.
In the next subsections, we describe in detail the effect of these parameters.
Special values
The parameter special_values
takes a list of values that are considered special or unique within the dataset,
such as special characters occurring in a numeric column or outliers within a distribution.
For instance, in the Airbnb dataset, let us assume that the numerical column price can sometimes take the non-numerical value 'missing'.
In that case, we might mark this value as special:
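A possible sketch, reusing the schema defined earlier (the special_values keyword follows the parameter described above; the rest is illustrative):

```python
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

# Treat the string 'missing' in the numeric `price` column as a special value.
preproc = TabularPreproc.from_schema(
    schema=schema,
    preprocessors={"listings": {"price": ColumnPreproc(special_values=["missing"])}},
)
```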
Imputation of missing values
The parameter impute_nan is a boolean flag that determines whether NaN values within the column should be sampled.
When set to True, NaN values are imputed, ensuring that the synthetic data does not include any NaN values.
For instance, to avoid sampling NaN values in the price column:
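For example (again a sketch against the schema defined earlier, with the impute_nan flag as described above):

```python
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

# Impute NaN values in `price`, so that no NaN appears in the synthetic data.
preproc = TabularPreproc.from_schema(
    schema=schema,
    preprocessors={"listings": {"price": ColumnPreproc(impute_nan=True)}},
)
```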
Avoid sampling certain values
The parameter non_sample_values
allows the user to set a list of values that will not be sampled during generation,
e.g. 'Manhattan'
and 'Brooklyn'
in the neighbourhood_group
column:
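A sketch of this configuration (the non_sample_values keyword follows the parameter described above):

```python
from aindo.rdml.synth import ColumnPreproc, TabularPreproc

# Never generate 'Manhattan' or 'Brooklyn' in the `neighbourhood_group` column.
preproc = TabularPreproc.from_schema(
    schema=schema,
    preprocessors={
        "listings": {
            "neighbourhood_group": ColumnPreproc(non_sample_values=["Manhattan", "Brooklyn"]),
        },
    },
)
```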
In place of these values, some other plausible values of the same column will be sampled when generating synthetic data.
Protection of rare values
The aindo.rdml library provides a range of options to ensure additional privacy protection for extremal or rare values that might be present in the columns.
Indeed, even though the model does not learn from individual data subjects, it does learn rare categories and the ranges of numerical values, which might in some cases disclose sensitive data from the original dataset.
Consider, for example, a dataset with a range of information about the employees of a company, including their salaries. Let us say the CEO has the highest salary in the dataset.
| Employee ID | Name | Age | Role | Salary |
|---|---|---|---|---|
| 001 | Alice Johnson | 60 | CEO | $100,000 |
| 002 | John Smith | 32 | HR | $55,000 |
| 003 | Emily Davis | 35 | Finance | $65,000 |
A model trained on this dataset will learn the range of values that the Salary column can take.
When generating synthetic data, the model may (rarely) generate employees with salaries as high as the CEO's.
Such an extremal value found in the synthetic dataset in fact reveals the salary of the CEO in the original dataset.
Another example is a dataset containing patients with a particular pathology. Being able to infer that a specific individual was in the original dataset would constitute a privacy leak for that individual.
| Patient ID | Age | ZIP code | Systolic blood pressure (mm Hg) |
|---|---|---|---|
| 001 | 21 | 34016 | 116 |
| 002 | 45 | 38068 | 125 |
| 003 | 72 | 00154 | 110 |
The ZIP code 34016 is the ZIP code of Monrupino, a small but charming village near Trieste, with fewer than 1000 inhabitants.
If the ZIP code column is defined as categorical, the generative model will memorize the possible values that the column can take, even rare ones like the Monrupino ZIP code.
During the generation of synthetic data, a rare ZIP code will not be generated often; however, when it is generated, it reveals that somebody from Monrupino was in the original dataset.
Even if this information does not explicitly disclose who that person is, if other publicly accessible information can be cross-referenced with the generated synthetic data, the identity of that person may ultimately be revealed.
In any case, it is clear that the mere presence of a rare category in the generated dataset can disclose more private information than intended.
The aindo.rdml library contains a series of tools to remove or mitigate the possibility of these kinds of privacy leaks, adding an extra layer of protection for specific values present in a column.
The problematic values can be detected and masked in the original dataset, so that the model will never be able to learn them.
When generating synthetic data, the sensitive values may either be generated masked, or be replaced by other viable, non-sensitive values.
All these behaviors can be tuned with the protection parameter of the ColumnPreproc object.
The protection parameter can be either the boolean flag True, indicating the default protection (ColumnPreproc(protection=True)), or a Protection object, with which the user can customize the protection measures.
When configuring a Protection object, three optional arguments can be provided:
- detectors, a sequence of Detector objects that perform a detection of the values that should be protected, based on the column type and a chosen detection strategy. The full list of the available detectors is provided in the API reference.
- default, a boolean flag indicating whether the default protection for that column type should be enabled.
- type, a string or a ProtectionType object that describes the protection strategy. This can be either imputation ('impute', ProtectionType.IMPUTE) or masking ('mask', ProtectionType.MASK). Imputation means replacing sensitive values with plausible alternatives within the column. Masking is achieved by replacing sensitive values with placeholders.
For instance, we could use a RareCategoryDetector, which determines the rare categories based on their number of occurrences, together with a masking strategy, on the neighbourhood column as follows:
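A sketch of this protection setup (the detectors and type keywords follow the description above, while the no-argument RareCategoryDetector() call and the from_schema keywords are assumptions):

```python
from aindo.rdml.synth import ColumnPreproc, Protection, RareCategoryDetector, TabularPreproc

# Protect rare categories in `neighbourhood`: detect them by their number of
# occurrences and mask them, rather than imputing plausible alternatives.
preproc = TabularPreproc.from_schema(
    schema=schema,
    preprocessors={
        "listings": {
            "neighbourhood": ColumnPreproc(
                protection=Protection(
                    detectors=[RareCategoryDetector()],
                    type="mask",
                ),
            ),
        },
    },
)
```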
Custom column preprocessors (expert user)
To each Column type presented in this section, the library associates an internal default column preprocessor, which in turn defines how the column data will be preprocessed before being fed to the generative model.
The user may prefer a different preprocessor than the default one, specified by means of the preprocessors parameter of the TabularPreproc.from_schema method.
The available column preprocessors are: Categorical, Coordinates, Date, Datetime, Time, Integer, Numeric, ItaFiscalCode and Text.
The table below illustrates the default mappings from column types to column preprocessors:

| Column type | Default column preprocessor |
|---|---|
| BOOLEAN / CATEGORICAL | Categorical |
| NUMERIC / INTEGER | Numeric |
| DATE | Date |
| TIME | Time |
| DATETIME | Datetime |
| COORDINATES | Coordinates |
| ITAFISCALCODE | ItaFiscalCode |
| TEXT | Text |
Not all column preprocessors are compatible with all kinds of input data.
For example, while the Categorical
preprocessor can deal with virtually any type of column data,
the Datetime
preprocessor will raise an error if the input data cannot be interpreted as datetime.
Other similar limitations apply to the other column preprocessors.
Column preprocessors may be configured using the arguments special_values, impute_nan, non_sample_values and protection, which are common to all columns, plus the specific arguments available to each one.
All the parameters available to each column preprocessor are listed in the API reference.
For instance, the user might want to preprocess the minimum_nights column with a Categorical preprocessor, instead of the default Numeric:
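A sketch of this override (the no-argument Categorical() constructor is an assumption):

```python
from aindo.rdml.synth import Categorical, TabularPreproc

# Use a Categorical preprocessor for `minimum_nights` instead of the default Numeric one.
preproc = TabularPreproc.from_schema(
    schema=schema,
    preprocessors={"listings": {"minimum_nights": Categorical()}},
)
```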
Model training
The aindo.rdml
library offers two generative models for synthetic data generation:
- A TabularModel that generates all the relational data, excluding columns that contain text.
- A TextModel that generates only text columns. Users must specify a TextModel for each table containing text columns.
Tabular Model
To instantiate and build a TabularModel, the user needs to provide a preproc, which is a TabularPreproc object, and a size, denoting the desired model dimensions.
The size argument can be defined in one of the following formats:
- A TabularModelSize object containing the integer attributes n_layers, h and d;
- A string or a Size object, internally mapping to a default configuration of TabularModelSize. The options are: 'small'/Size.SMALL, 'medium'/Size.MEDIUM, or 'large'/Size.LARGE.
The user may specify the type of layer used by the model with the block parameter.
The available blocks are 'free' (the default) and 'lstm'.
Optionally, the user may also provide a dropout value for the dropout layers in the model.
The model is trained using a TabularTrainer object, which is built from the TabularModel.
The trainer has an optional parameter dp_budget, which, if provided, must be a DpBudget object containing the (epsilon, delta)-budget for differentially private (DP) training.
If not provided, the training will have no differential privacy guarantees.
Notice that DP training is available only for single-table datasets.
To train a model, the user also needs to build a TabularDataset
object containing the preprocessed training data.
The TabularDataset
is built from the raw training data and the same TabularPreproc
object used to build the model.
There are three options to instantiate a TabularDataset
object:
- From the raw data, storing the processed data in RAM. In this case, the TabularDataset.from_data() method should be invoked.
- From the raw data, storing the processed data on disk. In this case, the TabularDataset.from_data() method should again be invoked, but with the on_disk parameter set to True. Moreover, the path parameter can be used to provide a directory where the processed data will be stored. By default, the data is stored in a temporary directory and deleted at the end of the process. When stored on disk, the data will be loaded one batch at a time during training. This may slightly slow down the training, but will reduce memory consumption.
- From data already processed and stored on disk. When the TabularDataset.from_data() method is used with on_disk set to True and a path is provided, the data is stored in the given directory, and can be accessed again later with the TabularDataset.from_disk() method, providing the TabularPreproc and the path to the directory.
The TabularDataset has another optional argument, block_size, which is an integer fixing the maximum length of the internal representation of the input used during training.
A smaller block_size will reduce the time of a single training epoch, but will introduce approximations that may compromise the quality of the generated synthetic data.
The given block_size should be larger than the maximal internal representation of each table in the dataset.
For this reason, this parameter is available only for multi-table datasets.
Once the trainer and the training dataset are ready, the TabularTrainer.train()
method is used to train the model.
The method requires:
- The training dataset (dataset);
- The desired number of training epochs (n_epochs), or alternatively of training steps (n_steps);
- Either the batch size (batch_size) or the available memory in MB (memory), which is in turn used to compute an optimal batch size.
Additionally, users can provide the optional arguments:
- lr: The learning rate, whose optimal value is otherwise automatically determined.
- valid: A Validation object that configures validation during training. The validation dataset must be provided as a TabularDataset object via the argument dataset, and various functionalities can be activated with the dedicated arguments, including learning rate scheduling and early stopping. To protect the validation data with DP guarantees, a DpValid object should be provided through the dp parameter. For further information, please refer to the API reference.
- hooks: A sequence of custom training hooks crafted by the user, described in the next section.
- accumulate_grad: The number of gradient accumulation steps. By default, it is set to 1, meaning the model is updated at each step.
- dp_step: A DpStep object containing the data needed for the differentially private step. It should be provided if and only if the trainer was equipped with a DP budget, and therefore only for single-table datasets. For the available settings, please refer to the API reference.
Here is an example of training of the tabular model, with a validation step at the end of each epoch:
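The sketch below puts the pieces together. The class and method names come from this section, but the import locations, the build() constructor, the trainer and Validation signatures, and the train/validation split of the data are assumptions, chosen here to validate once per epoch:

```python
from aindo.rdml.synth import (
    TabularDataset,
    TabularModel,
    TabularPreproc,
    TabularTrainer,
    Validation,
)

# `data_train` and `data_valid` are assumed to be two RelationalData objects
# obtained by splitting the original data (the split itself is not shown).
preproc = TabularPreproc.from_schema(schema=schema)
preproc.fit(data_train)

# Build the tabular model and its trainer.
model = TabularModel.build(preproc=preproc, size="medium")
trainer = TabularTrainer(model=model)

# Training and validation datasets, processed and kept in RAM.
dataset_train = TabularDataset.from_data(data=data_train, preproc=preproc)
dataset_valid = TabularDataset.from_data(data=data_valid, preproc=preproc)

# Train for a fixed number of epochs, validating at the end of each one.
# The `each`/`trigger` keywords of Validation are assumptions.
trainer.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
    valid=Validation(dataset=dataset_valid, each=1, trigger="epoch"),
)
```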
Custom hooks (expert user)
The experienced user might opt to specify personalized training hooks using the hooks parameter of the TabularTrainer.train() method.
These hooks must extend the TrainHook class, whose __init__() method takes at least two arguments defining how often the hook is activated: an integer each, and a trigger, which may be 'epoch' or 'step'.
A custom hook must implement the _hook(n) method, which is invoked when the hook is triggered according to the each and trigger arguments, and receives as an argument the current epoch or step number, depending on the value of trigger.
A custom hook may also override the following methods:
- setup(trainer, hooks): Invoked before the training begins; takes as input the trainer and the previously defined hooks.
- hook(): Called at each training step. The default behavior is to check whether the trigger is activated and, in that case, to call the _hook() method.
- _cleanup(): Called at the end of the training; it should return the status of the current hook.
- cleanup(hook_status): Called at the end of the training; receives as input the statuses of the previous hooks and should return the status of the current hook. Its default behavior is to check the statuses of the previous hooks and to call the _cleanup() method.
Text Model
As for the TabularModel, to instantiate and build a TextModel instance the user is required to provide a preproc, which in this case is a TextPreproc, and a size, which is a TextModelSize, a Size, or a string representation of the latter.
For a TextModel, the user is also required to provide a block_size, corresponding to the maximum text length that the model can process in a single forward step.
Finally, the user may provide the optional dropout parameter.
Alternatively, the user may build a TextModel from a pretrained model, with the constructor TextModel.build_from_pretrained(), providing a TextPreproc and a path to the pretrained model.
The optional block_size argument is also available, fixing the maximum text length that the model can process during fine-tuning.
To build the training (and validation) dataset, the user must instantiate a TextDataset
object.
The options are similar to those for the TabularDataset; however, in this case the max_block_size parameter is not available.
To reduce the block size, it is possible to set the block_size parameter in TextModel.build, or the TextModel.max_block_size attribute.
A reasonable value for the block size can be obtained from the TextDataset.max_text_len
attribute
of the training dataset.
The associated trainer is a TextTrainer object, which is built from a TextModel.
At the moment, DP training is not supported for TextTrainer models, and therefore the dp_budget option is not available.
The TextTrainer.train()
method has the same arguments as the TabularTrainer.train()
method,
except for the dp_step
option which is not active.
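A sketch of the text-model workflow for the listings table is shown below; the TextDataset.from_data() and constructor keywords, as well as the chosen size and number of epochs, are assumptions (preproc and data_train are the fitted tabular preprocessor and training data from the previous sketch):

```python
from aindo.rdml.synth import TextDataset, TextModel, TextPreproc, TextTrainer

# Text preprocessor for the `listings` table, built from the fitted tabular preprocessor.
preproc_text = TextPreproc.from_tabular(preproc=preproc, table="listings")
preproc_text.fit(data_train)

# Training dataset for the text model, processed in RAM.
dataset_text_train = TextDataset.from_data(data=data_train, preproc=preproc_text)

# Build the text model; block_size is required and is derived here from the training data.
model_text = TextModel.build(
    preproc=preproc_text,
    size="small",
    block_size=dataset_text_train.max_text_len,
)

# Train the text model; the arguments mirror TabularTrainer.train(), except dp_step.
trainer_text = TextTrainer(model=model_text)
trainer_text.train(
    dataset=dataset_text_train,
    n_epochs=50,
    batch_size=32,
)
```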
Synthetic data generation
After training the TabularModel, synthetic data can be generated in a straightforward way with its TabularModel.generate() method.
This method takes as input one (and only one) of the following:
- n_samples: The number of samples to generate in the root table.
- ctx: A pandas.DataFrame including the first columns of the root table, from which to start a conditional generation. The column names must match the names in the original table. Note that, due to the nature of the generation, the user cannot provide an arbitrary combination of the columns of the root table, but only the first n consecutive ones; the model will start from these to generate the following ones. The number of generated rows of the root table will be exactly the number of rows provided, and the first columns will match the provided ones.
Optionally, the user can also specify:
- batch_size: The number of samples generated in parallel. Defaults to 0, which means that all the data is generated in a single batch.
- max_block_size: This parameter limits the length of each generated sample (in terms of its internal representation). It is active only for multi-table datasets. The default value is 0, meaning no limit is enforced. A reasonable value for this parameter can be obtained from the TabularDataset.block_size attribute of the dataset.
- temp: A strictly positive number describing the amount of noise used in generation. The default value is 1. Larger values will introduce more variance, lower values will decrease it.
This model generates only non-text columns.
The missing text columns are generated by the trained TextModels, by means of the TextModel.generate() method.
The optional arguments are:
- batch_size: The batch size used during generation. Defaults to 0, which means that all the data is generated in a single batch.
- max_text_len: The maximum length of the generated text for each table row. The default value is 0, meaning the maximum possible value is used, namely the value of the TextModel.max_block_size attribute.
- temp: As for the tabular model, this parameter controls the amount of noise used in generation; the default value is 1.
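Putting the two generation steps together might look like the following sketch, where model and model_text are the trained models from the previous sketches; the n_samples value is arbitrary, and the keyword used to pass the tabular synthetic data to TextModel.generate() is an assumption:

```python
# Generate the tabular part of the synthetic dataset with the trained tabular model.
data_synth = model.generate(
    n_samples=10_000,  # illustrative number of rows for the root table `host`
    batch_size=512,
)

# Fill in the `name` text column of the `listings` table with the trained text model.
data_synth = model_text.generate(
    data=data_synth,  # passing the partially generated data via `data=` is assumed
    batch_size=512,
)
```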
At the end of the procedure, data_synth
is a RelationalData
object containing the synthetic version of the
Airbnb dataset, including the text column name
in the listings
table.
Note that, in order to also generate the host_name text column present in the host table, we should build and train a second TextModel, and then generate the host_name column in a similar fashion to what was done for the name column in the listings table.
A full example of that can be found in the Airbnb script.