Synth module
This module allows you to:
- Preprocess columns within each table;
- Train generative models on relational tabular data;
- Generate synthetic data.
Data preprocessing
Data preprocessing means transforming data columns to make them suitable for model training. This process can include optional steps to reduce the risk of privacy breaches and guarantee data anonymization.
Preprocessing is performed with a TabularPreproc object.
The default preprocessor only needs a Schema object to be instantiated.
After instantiation, a TabularPreproc object is fitted on a RelationalData object.
To illustrate the preprocessing steps we load the UCI Adult single table dataset, containing both text and non-text columns:
Users can also specify custom preprocessing for each column by passing the preprocessors argument to the TabularPreproc.
This argument is a dictionary whose keys are table names and whose values are dictionaries with column names as keys and one of the following values:
- A ColumnPreproc object, enabling users to define a custom behavior for that column during the preprocessing step;
- A None value, which tells the preprocessor to ignore that column;
- A custom column instance. This option is designed for advanced users seeking access to lower-level functionalities.
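The nested shape of the preprocessors dictionary can be sketched in plain Python. The snippet below does not import the library: ColumnPreprocStub is a hypothetical stand-in for ColumnPreproc, the table name adult is a hypothetical label, and describe_preprocessors is an illustrative helper. Only the layout (table name, then column name, then a preprocessor or None) is taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnPreprocStub:
    """Hypothetical stand-in for the library's ColumnPreproc (not the real class)."""
    special_values: list = field(default_factory=list)
    impute_nan: bool = False


# Outer keys: table names. Inner keys: column names.
# Inner values: a column preprocessor, or None to ignore the column.
preprocessors = {
    "adult": {  # hypothetical table name
        "age": ColumnPreprocStub(impute_nan=True),
        "fnlwgt": None,  # this column would be ignored
    },
}


def describe_preprocessors(config):
    """Return one (table, column, action) tuple per configured column."""
    rows = []
    for table, columns in config.items():
        for column, preproc in columns.items():
            action = "ignore" if preproc is None else "custom"
            rows.append((table, column, action))
    return rows


summary = describe_preprocessors(preprocessors)
```

Columns that do not appear in the dictionary are simply absent from summary; in the library they would fall back to the default preprocessing.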
The preprocessing of text data is managed by TextPreproc objects, one for each table containing text.
Note, however, that custom preprocessing of text columns is not supported.
ColumnPreproc (advanced user)
A ColumnPreproc object offers four optional parameters designed to customize the preprocessing of a column: special_values, impute_nan, non_sample_values and protection:
- special_values is a list of values that are considered special or unique within the dataset, such as special characters occurring in a numeric column or outliers within a distribution. For instance, in the Adult dataset, let us assume that the number ‘64’ is an outlier in the distribution of the age column and that the numeric column fnlwgt contains occurrences of the string 'unknown'. In that case, we would mark those values as special.
- impute_nan is a boolean flag that determines whether NaN values within the column are imputed instead of being sampled. When set to True, NaN values are imputed, ensuring that the synthetic data does not include any NaN values. This can be used, for instance, to avoid sampling NaN values in the age column.
- non_sample_values is a list of values that will not be sampled during generation, e.g. ‘Local-gov’ and ‘State-gov’ in the workclass column.
- protection refers to a range of options to ensure privacy protection of the original column. This step is crucial because, although the model cannot learn from individual data subjects, it can still generate instances featuring rare categories or outlier numerical values, which might disclose sensitive data in the original dataset.

  A privacy leak can occur when personally identifiable information or sensitive data present in the original dataset is revealed. For instance, let us consider a table containing information about employees in a company:

  Employee ID | Name | Age | Department | Salary |
  ---|---|---|---|---|
  001 | Alice Johnson | 60 | Marketing | $80,000 |
  002 | John Smith | 32 | HR | $55,000 |
  003 | Emily Davis | 35 | Finance | $65,000 |

  Even without explicitly displaying names, it is possible to identify Alice Johnson. By recognizing she is the eldest employee in the dataset, one could deduce her salary. This is a trivial example of a privacy leak.

  The protection parameter can be either the boolean flag True, indicating the default protection (ColumnPreproc(protection=True)), or a Protection object, which provides several protection measures.

  When configuring a Protection object, three optional arguments can be provided:
  - detectors, a sequence of Detector objects that detect the values that should be protected, based on the column type and a chosen detection strategy. The full list of the available detectors is provided in the documentation;
  - default, a boolean flag indicating whether the default protection should be enabled;
  - type, a string or a ProtectionType object that describes the protection strategy. This can be either imputation ('impute', ProtectionType.IMPUTE) or masking ('mask', ProtectionType.MASK). Imputation means replacing sensitive values with plausible alternatives within the column. Masking is achieved by replacing sensitive values with placeholders.

  For instance, we could use a RareCategoryDetector, which determines the rare categories based on the number of occurrences, together with a masking strategy on the workclass column.
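To make the difference between masking and imputation concrete, here is a minimal plain-Python sketch of count-based rare-category protection. It is not the library's implementation: the function names, the min_occurrences threshold, the "<MASKED>" placeholder and the choice of the most frequent category as the "plausible alternative" are all assumptions made for illustration.

```python
from collections import Counter


def rare_categories(values, min_occurrences):
    """Return the categories that appear fewer than min_occurrences times."""
    counts = Counter(values)
    return {cat for cat, n in counts.items() if n < min_occurrences}


def protect(values, min_occurrences, strategy="mask", placeholder="<MASKED>"):
    """Replace rare categories by a placeholder (mask) or by the most
    frequent non-rare category (impute). Both choices are illustrative."""
    rare = rare_categories(values, min_occurrences)
    if strategy == "mask":
        replacement = placeholder
    elif strategy == "impute":
        common = [v for v in values if v not in rare]
        if not common:
            raise ValueError("no non-rare category available for imputation")
        replacement = Counter(common).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [replacement if v in rare else v for v in values]


# Toy workclass column: one category occurs only once.
workclass = ["Private"] * 5 + ["Local-gov"] * 3 + ["Never-worked"]
masked = protect(workclass, min_occurrences=2, strategy="mask")
imputed = protect(workclass, min_occurrences=2, strategy="impute")
```

With masking, the single rare entry becomes the placeholder; with imputation, it becomes a value that already occurs often in the column, so the rare category never reaches the training data in either case.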
Custom column preprocessors (expert user)
For each Column type presented in this section, the default preprocessor defines the internal default column preprocessor.
The user may choose a preprocessor other than the default one by means of the preprocessors parameter of the TabularPreproc object.
The available column preprocessors are: Categorical, Coordinates, Date, DateTime, Time, Integer, Numeric and Text.
The table below illustrates the default mappings from column types to column preprocessors,
along with their compatibility:
Column type / Column Preprocessor | Categorical | Coordinates | Date / DateTime / Time | Integer | Numeric | Text |
---|---|---|---|---|---|---|
BOOLEAN/CATEGORICAL | default | ✗ | ✗ | ✗ | ✗ | ✗ |
COORDINATES | ✓ | default | ✗ | ✗ | ✗ | ✗ |
INTEGER | ✓ | ✗ | ✓ | ✓ | default | ✗ |
NUMERIC | ✓ | ✗ | ✗ | ✓ | default | ✗ |
DATE/TIME/DATETIME | ✓ | ✗ | default | ✗ | ✗ | ✗ |
TEXT | ✓ | ✓ | ✓ | ✗ | ✓ | default |
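The table can also be read programmatically. The sketch below transcribes it cell by cell into a dictionary and adds two small lookups; the names PREPROCESSORS, TABLE, default_preprocessor and is_compatible are hypothetical helpers for this page and are not part of the library.

```python
PREPROCESSORS = ["Categorical", "Coordinates", "Date/DateTime/Time",
                 "Integer", "Numeric", "Text"]

# One row per column type, transcribed from the table above:
# "default" = default preprocessor, "yes" = compatible, "no" = incompatible.
TABLE = {
    "BOOLEAN/CATEGORICAL": ["default", "no", "no", "no", "no", "no"],
    "COORDINATES": ["yes", "default", "no", "no", "no", "no"],
    "INTEGER": ["yes", "no", "yes", "yes", "default", "no"],
    "NUMERIC": ["yes", "no", "no", "yes", "default", "no"],
    "DATE/TIME/DATETIME": ["yes", "no", "default", "no", "no", "no"],
    "TEXT": ["yes", "yes", "yes", "no", "yes", "default"],
}


def default_preprocessor(column_type):
    """Return the preprocessor marked as default for a column type."""
    row = TABLE[column_type]
    return PREPROCESSORS[row.index("default")]


def is_compatible(column_type, preprocessor):
    """True when the table marks the pair as default or compatible."""
    cell = TABLE[column_type][PREPROCESSORS.index(preprocessor)]
    return cell in ("default", "yes")
```

For example, is_compatible("NUMERIC", "Categorical") is True, which corresponds to overriding the default Numeric preprocessor of a numeric column with a Categorical one.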
Column preprocessors may be configured using the arguments special_values, impute_nan, non_sample_values and protection, common to all columns, plus the specific arguments available to each one.
For instance, the user might want to preprocess the hours-per-week column with a Categorical preprocessor, instead of the default Numeric one.
Model training
The aindo.rdml user API offers two generative models for synthetic data generation:
- A TabularModel that generates all the relational data excluding columns that contain text.
- A TextModel that generates only text columns. Users must specify a TextModel for each table containing text columns.
Tabular Model
To instantiate and build a TabularModel, the user needs to provide a preproc, which is a TabularPreproc object, and a size, denoting the desired model dimensions.
The size argument can be defined in one of the following formats:
- A TabularModelSize object containing the integer attributes n_layers, h and d;
- A string or a Size object, internally mapping to a default configuration of TabularModelSize. The options are: small/Size.SMALL, medium/Size.MEDIUM, or large/Size.LARGE.
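The way a size argument in any of these formats could be resolved to a concrete configuration is sketched below. SizeStub and TabularModelSizeStub are hypothetical stand-ins for Size and TabularModelSize, resolve_size is an illustrative helper, and the numbers in PRESETS are placeholders chosen for this example, not the library's actual defaults.

```python
from dataclasses import dataclass
from enum import Enum


class SizeStub(Enum):
    """Hypothetical stand-in for the library's Size options."""
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


@dataclass(frozen=True)
class TabularModelSizeStub:
    """Hypothetical stand-in holding the three integer attributes."""
    n_layers: int
    h: int
    d: int


# Placeholder numbers for illustration only; NOT the library defaults.
PRESETS = {
    SizeStub.SMALL: TabularModelSizeStub(n_layers=2, h=2, d=64),
    SizeStub.MEDIUM: TabularModelSizeStub(n_layers=4, h=4, d=128),
    SizeStub.LARGE: TabularModelSizeStub(n_layers=8, h=8, d=256),
}


def resolve_size(size):
    """Accept an explicit size object, an enum member, or its string name."""
    if isinstance(size, TabularModelSizeStub):
        return size
    if isinstance(size, str):
        size = SizeStub(size.lower())  # raises ValueError on unknown names
    return PRESETS[size]


custom = TabularModelSizeStub(n_layers=3, h=4, d=96)
```

Passing custom returns it unchanged, while "small" and SizeStub.SMALL both resolve to the same preset, which mirrors the equivalence between the string and enum forms described above.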
Optionally, the user may also specify a dropout value for the dropout layers in the model and a block_size parameter, which fixes the maximum length of the internal representation of the input.
The model is trained using the train() method of a TabularTrainer object.
This method requires the training data (data) and the desired number of training epochs (n_epochs).
Additionally, users can provide the optional arguments:
- batch_size, i.e. the size of a batch of data during training. When it is not specified, the user must provide the argument memory, which is the available memory in MB that is used to automatically set the optimal batch size value;
- lr, the learning rate, whose optimal value is otherwise automatically determined;
- valid, a Validation object that configures validation during training. The validation data must be provided via the argument data, and various functionalities can be activated with the dedicated arguments, including learning rate scheduling and early stopping. For further information, please refer to the documentation;
- hooks, a sequence of custom training hooks crafted by the user, described in the next section.
Here is an example of training with a validation step at the end of each epoch:
Custom hooks (expert user)
The experienced user might opt to specify personalized training hooks using the hooks parameter of the train() method.
These hooks must extend the TrainHook class, whose __init__() method takes at least two arguments: each (integer) and trigger, which may be epoch or step. Together they define how often the hook is activated.
A custom hook must implement the _hook(n) method, which is invoked when the hook is triggered by the each and trigger arguments and receives as an argument the number of the current epoch or the current step, depending on the value of trigger.
A custom hook may also override the following methods:
- setup(trainer, hooks), invoked before the training begins, takes as input the trainer and the previously defined hooks.
- hook(), called at each training step. The default behavior is to check if the trigger is activated and, if so, to call the _hook() method.
- _cleanup(), called at the end of training, should return the status of the current hook.
- cleanup(hook_status), called at the end of training, receives as input the status of the previous hooks and should return the status of the current hook. Its default behavior is to check the statuses of the previous hooks and call the _cleanup() method.
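The triggering logic described in this section can be sketched without the library. TrainHookStub below is a hypothetical stand-in that mimics only the each/trigger mechanism: its hook() signature is simplified to take the current epoch, step and an end-of-epoch flag, and setup() and the cleanup methods are omitted. RecordingHook and run_training are illustrative names as well.

```python
class TrainHookStub:
    """Hypothetical stand-in for TrainHook; mimics only the each/trigger logic."""

    def __init__(self, each=1, trigger="epoch"):
        if trigger not in ("epoch", "step"):
            raise ValueError("trigger must be 'epoch' or 'step'")
        self.each = each
        self.trigger = trigger

    def hook(self, epoch, step, end_of_epoch):
        """Called at every training step; fires _hook() when it is due."""
        if self.trigger == "step" and step % self.each == 0:
            self._hook(step)
        elif self.trigger == "epoch" and end_of_epoch and epoch % self.each == 0:
            self._hook(epoch)

    def _hook(self, n):
        raise NotImplementedError


class RecordingHook(TrainHookStub):
    """Custom hook that records the epochs or steps at which it fired."""

    def __init__(self, each=1, trigger="epoch"):
        super().__init__(each=each, trigger=trigger)
        self.fired = []

    def _hook(self, n):
        self.fired.append(n)


def run_training(hooks, n_epochs, steps_per_epoch):
    """Toy loop: no model is trained, only the hooks are driven."""
    step = 0
    for epoch in range(1, n_epochs + 1):
        for i in range(steps_per_epoch):
            step += 1
            for h in hooks:
                h.hook(epoch, step, end_of_epoch=(i == steps_per_epoch - 1))


every_two_epochs = RecordingHook(each=2, trigger="epoch")
every_five_steps = RecordingHook(each=5, trigger="step")
run_training([every_two_epochs, every_five_steps], n_epochs=4, steps_per_epoch=3)
```

In this toy run of 4 epochs with 3 steps each, the first hook fires at the end of epochs 2 and 4 and the second at steps 5 and 10. A real hook would place its logic (logging, checkpointing, and so on) in _hook(n) in the same way.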
Text Model
To instantiate and build a TextModel instance, the user is also required to define a block_size, corresponding to the maximum text length that the model can process in a single forward step.
The associated trainer is a TextTrainer object.
Synthetic data generation
After training the TabularModel, generating synthetic data becomes straightforward by using its generate() method.
This method takes as input the number of samples to generate and returns a RelationalData object containing the synthetic data.
Optionally, the user can specify:
- batch_size, the batch size used during generation. Defaults to 0, which means that all the data is generated in a single batch.
- temp, a parameter describing the amount of noise used in generation. The default value is 1; larger values introduce more variance and lower values decrease it.
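A common way for a temperature-like parameter to control variance in generative models is to divide the logits by it before sampling. Whether temp is implemented exactly this way in the library is an assumption; the sketch below only illustrates the general effect, and the function names and logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temp=1.0):
    """Turn logits into probabilities after dividing them by temp."""
    if temp <= 0:
        raise ValueError("temp must be positive")
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three categories
sharp = softmax_with_temperature(logits, temp=0.5)
base = softmax_with_temperature(logits, temp=1.0)
flat = softmax_with_temperature(logits, temp=2.0)
```

Under this assumption, a value below 1 concentrates probability on the most likely category, while a value above 1 spreads it out, which matches the description of lower and higher variance above.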
Let’s load the Airbnb dataset, containing:
- The parent table host with primary key host_id;
- The child table listings with primary key id and foreign key host_id.
We define the following schema:
Suppose that our goal is to generate a synthetic table with 100 rows in the parent table:
The TabularModel only generates non-text columns.
When the original dataset includes text columns, it is necessary to train a separate TextModel for each table containing text.
Each trained model is then used to generate the missing text columns.
In our example, we supply the tabular data generated previously to the generate() method of the TextModel.
At the end of the procedure, data_synth is a RelationalData object containing the full synthetic version of Airbnb, including the previously missing text columns in the host and listings tables.