Synth

The aindo.rdml.synth module allows users to:

  1. Preprocess columns within each table;
  2. Train generative models on relational tabular data;
  3. Generate synthetic data.

To illustrate the full process, from preprocessing to synthetic data generation, we will use the Airbnb Open Data dataset. The original dataset consists of a single table, but upon further inspection, it is clear that we can rearrange it in a more “natural” form, by splitting it into two tables:

  1. A table host, with primary key host_id.
  2. A table listings, with primary key id and foreign key host_id, referring to the primary key of host.

Both tables have a text column: host_name in host and name in listings. For simplicity, we will focus here on the latter; however, the same operations can be performed on the host table too. The Airbnb script lays out a full end-to-end example using the Airbnb dataset, in which both text columns are taken into account.

Let us start by defining the Schema and loading the data, as follows:

import pandas as pd
from aindo.rdml.relational import Schema, Table, Column, PrimaryKey, ForeignKey, RelationalData

schema = Schema(
    host=Table(
        host_id=PrimaryKey(),
        host_name=Column.TEXT,
        calculated_host_listings_count=Column.NUMERIC,
    ),
    listings=Table(
        id=PrimaryKey(),
        host_id=ForeignKey(parent='host'),
        name=Column.TEXT,
        neighbourhood_group=Column.CATEGORICAL,
        neighbourhood=Column.CATEGORICAL,
        latitude=Column.NUMERIC,
        longitude=Column.NUMERIC,
        room_type=Column.CATEGORICAL,
        price=Column.INTEGER,
        minimum_nights=Column.INTEGER,
        number_of_reviews=Column.INTEGER,
        last_review=Column.DATETIME,
        reviews_per_month=Column.NUMERIC,
        availability_365=Column.INTEGER,
    ),
)

df = pd.read_csv('path/to/airbnb.csv')
dfs = {
    'host': df.loc[:, list(schema.tables['host'].all_columns)].drop_duplicates(),
    'listings': df.loc[:, list(schema.tables['listings'].all_columns)],
}
data = RelationalData(data=dfs, schema=schema)

Data preprocessing

Data preprocessing means transforming the data columns to make them suitable for model training. This process can include optional steps to reduce the risk of privacy breaches and guarantee data anonymization.

Preprocessing is performed with a TabularPreproc object. To instantiate the default preprocessor, users can pass a Schema object to the TabularPreproc.from_schema method. After instantiation, the TabularPreproc object must be fitted on a RelationalData object.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularPreproc
data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)

Users also have the option to specify custom preprocessing for each column. This can be achieved by passing the preprocessors argument to the TabularPreproc.from_schema method. The preprocessors parameter is a dictionary whose keys are table names and whose values are dictionaries mapping column names to one of the following (a combined sketch follows the list below):

  1. A ColumnPreproc object, enabling users to define a custom behavior for that column during the preprocessing step;
  2. A None value, which tells the preprocessor to ignore that column;
  3. A custom column instance. This option is designed for advanced users seeking access to lower-level functionalities.
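
For instance, a preprocessors dictionary combining the three options might look like the following sketch. It is purely illustrative: the specific choices are not recommendations, and Categorical is the custom column preprocessor described later in this section.

from aindo.rdml.synth import Categorical, ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        'listings': {
            'price': ColumnPreproc(impute_nan=True),   # custom behavior for this column
            'neighbourhood_group': None,               # ignore this column
            'minimum_nights': Categorical(),           # custom column preprocessor (advanced)
        },
    },
)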

The preprocessing of text data is managed by TextPreproc objects, one for each table containing text. A TextPreproc object also needs to preprocess the tabular part of the data, in order to condition the text on it during training and generation. In most cases, the text columns are generated in addition to the rest of the tabular data, and therefore a TabularPreproc object is already available. Each TextPreproc object can then be built from the latter with the TextPreproc.from_tabular method, also providing the name of the table to consider.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularPreproc, TextPreproc
data = RelationalData(data=..., schema=...)
preproc = TabularPreproc.from_schema(schema=data.schema)
preproc.fit(data=data)
preproc_text = TextPreproc.from_tabular(preproc=preproc, table='listings')
preproc_text.fit(data=data)

In case no TabularPreproc object is available, the text preprocessor can also be built from scratch, using the TextPreproc.from_schema_table method, which requires the Schema and the name of the table containing the text columns.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextPreproc
data = RelationalData(data=..., schema=...)
preproc_text = TextPreproc.from_schema_table(schema=data.schema, table='listings')
preproc_text.fit(data=data)

To ensure consistency, the first method is recommended when both tabular and text data need to be generated.

It is important to note that custom preprocessing of text columns is not supported.

ColumnPreproc (advanced user)

A ColumnPreproc object offers four optional parameters designed to customize the preprocessing of a column:

  1. special_values: Provide a set of special values that will be treated as separate from the other values of the column, for example in a column with mixed type values.
  2. impute_nan: Force the model to avoid generating missing values in the synthetic data.
  3. non_sample_values: Provide a set of values that will not be generated in the synthetic data.
  4. protection: Add an extra protection from potential privacy leaks coming from rare or extremal values present in the original column data.

In the next subsections, we describe in detail the effect of these parameters.

Special values

The parameter special_values takes a list of values that are considered special or unique within the dataset, such as special characters occurring in a numeric column or outliers within a distribution. For instance, in the Airbnb dataset, let us assume that the numerical column price can sometimes assume the non-numerical value 'missing'. In such a case, we might mark this value as special:

from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        'listings': {
            'price': ColumnPreproc(special_values=['missing']),
        },
    },
)

Imputation of missing values

The parameter impute_nan is a boolean flag that determines whether NaN values within the column should be sampled. When set to True, NaN values are imputed, ensuring that the synthetic data does not include any NaN values. For instance, to avoid sampling NaN values in the price column:

from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        'listings': {
            'price': ColumnPreproc(impute_nan=True),
        },
    },
)

Avoid sampling certain values

The parameter non_sample_values allows the user to set a list of values that will not be sampled during generation, e.g. 'Manhattan' and 'Brooklyn' in the neighbourhood_group column:

from aindo.rdml.synth import ColumnPreproc, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        'listings': {
            'neighbourhood_group': ColumnPreproc(non_sample_values=['Manhattan', 'Brooklyn']),
        },
    },
)

In place of these values, some other plausible values of the same column will be sampled when generating synthetic data.

Protection of rare values

The aindo.rdml library provides a range of options to ensure additional privacy protection for extremal or rare values that might be present in the columns. Even though the model does not learn from individual data subjects, it does learn rare categories and the ranges of numerical values, which might in some cases disclose sensitive data from the original dataset.

Consider for example a dataset with a range of information about the employees of a company, including their salaries. Let us say the CEO has the highest salary in the dataset.

Employee ID   Name            Age   Role      Salary
001           Alice Johnson   60    CEO       $100,000
002           John Smith      32    HR        $55,000
003           Emily Davis     35    Finance   $65,000

A model trained on this dataset will learn the range of values that the Salary column can take. When generating synthetic data, the model may (rarely) generate employees with salaries as high as the CEO's. Such an extremal value found in the synthetic dataset in fact reveals the salary of the CEO in the original dataset.

Another example is a dataset containing patients with a particular pathology. Being able to infer that a specific individual was in the original dataset would constitute a privacy leak for that individual.

Patient ID   Age   ZIP code   Systolic blood pressure (mm Hg)
001          21    34016      116
002          45    38068      125
003          72    00154      110

The ZIP code 34016 is the ZIP code of Monrupino, a small but charming village near Trieste with fewer than 1000 inhabitants. If the ZIP code column is defined as categorical, the generative model will memorize the possible values that the column can take, even rare ones like the Monrupino ZIP code. During the generation of synthetic data, a rare ZIP code will not be generated often; however, when it is generated, it reveals the fact that somebody from Monrupino was in the original dataset. Even if this information does not explicitly disclose who that person is, if some other publicly accessible information can be cross-referenced with the generated synthetic data, the identity of that person may ultimately be revealed. In any case, the mere presence of a rare category in the generated dataset can disclose more private information than intended.

The aindo.rdml library contains a series of tools to remove or mitigate the possibility of these kinds of privacy leaks, adding an extra layer of protection to specific values present in a column. Problematic values can be detected and masked in the original dataset, so that the model never learns them. When generating synthetic data, the sensitive values may either be generated masked, or be replaced by other viable, non-sensitive values. All these behaviors can be tuned with the protection parameter of the ColumnPreproc object.

The protection parameter can be either the boolean flag True, indicating the default protection (ColumnPreproc(protection=True)), or a Protection object, with which the user can customize the protection measures.

When configuring a Protection object, three optional arguments can be provided:

  • detectors, a sequence of Detector objects that perform a detection of values that should be protected, based on the column type and a chosen detection strategy. The full list of the available detectors is provided in the API reference.
  • default, a boolean flag indicating whether the default protection for that column type should be enabled.
  • type, a string or a ProtectionType object that describes the protection strategy. This can be either imputation ('impute', ProtectionType.IMPUTE) or masking ('mask', ProtectionType.MASK). Imputation replaces sensitive values with plausible alternatives within the column; masking replaces them with placeholders.

For instance, we could use a RareCategoryDetector, which detects rare categories based on their number of occurrences, together with a masking strategy on the neighbourhood column, as follows:

from aindo.rdml.synth import ColumnPreproc, Protection, RareCategoryDetector, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        'listings': {
            'neighbourhood': ColumnPreproc(
                protection=Protection(
                    detectors=(RareCategoryDetector(),),
                    type='mask',
                ),
            ),
        },
    },
)

Custom column preprocessors (expert user)

To each Column type presented in this section, the library associates a default internal column preprocessor, which in turn defines how the column data is preprocessed before being fed to the generative model. The user may specify a preprocessor different from the default one by means of the preprocessors parameter of the TabularPreproc.from_schema method.

The available column preprocessors are: Categorical, Coordinates, Date, Datetime, Time, Integer, Numeric, ItaFiscalCode, and Text. The table below illustrates the default mappings from column types to column preprocessors.

Column type             Default column preprocessor
BOOLEAN / CATEGORICAL   Categorical
NUMERIC / INTEGER       Numeric
DATE                    Date
TIME                    Time
DATETIME                Datetime
COORDINATES             Coordinates
ITAFISCALCODE           ItaFiscalCode
TEXT                    Text

Not all column preprocessors are compatible with all kinds of input data. For example, while the Categorical preprocessor can deal with virtually any type of column data, the Datetime preprocessor will raise an error if the input data cannot be interpreted as datetime. Other similar limitations apply to the other column preprocessors.

Column preprocessors may be configured using the arguments special_values, impute_nan, non_sample_values, and protection, which are common to all columns, plus the arguments specific to each preprocessor. All the parameters available for each column preprocessor are listed in the API reference.

For instance, the user might want to preprocess the minimum_nights column with a Categorical preprocessor, instead of the default Numeric:

from aindo.rdml.synth import Categorical, TabularPreproc

preproc = TabularPreproc.from_schema(
    schema=...,
    preprocessors={
        'listings': {
            'minimum_nights': Categorical(),
        },
    },
)

Model training

The aindo.rdml library offers two generative models for synthetic data generation:

  • A TabularModel that generates all the relational data excluding columns that contain text.
  • A TextModel that generates only text columns. Users must specify a TextModel for each table containing text columns.

Tabular Model

To instantiate and build a TabularModel the user needs to provide a preproc, which is a TabularPreproc object, and a size, denoting the desired model dimensions. The size argument can be defined in one of the following formats:

  • A TabularModelSize object containing the integer attributes n_layers, h and d;
  • A string or a Size object, internally mapping to a default configuration of TabularModelSize. The options are: 'small'/Size.SMALL, 'medium'/Size.MEDIUM, or 'large'/Size.LARGE.

The user may specify the type of layer used by the model with the block parameter. The available blocks are 'free' (the default) and 'lstm'. Optionally, the user may also provide a dropout value for the dropout layers in the model.
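
For example, a model might be built either with a named size or with explicit dimensions, as in the following sketch. It assumes that TabularModelSize is importable from aindo.rdml.synth and that a fitted TabularPreproc preproc is available; the dimensions, block, and dropout values are purely illustrative.

from aindo.rdml.synth import TabularModel, TabularModelSize

# Named size with the default 'free' block:
model_small = TabularModel.build(preproc=preproc, size='small')

# Explicit dimensions, 'lstm' block, and dropout (illustrative values):
model_custom = TabularModel.build(
    preproc=preproc,
    size=TabularModelSize(n_layers=4, h=8, d=256),
    block='lstm',
    dropout=0.1,
)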

The model is trained using a TabularTrainer object, which is built from the TabularModel. The trainer has an optional parameter dp_budget, which, if provided, must be a DpBudget object containing the (epsilon, delta)-budget for differentially private (DP) training. If not provided, the training will have no differential privacy guarantees. Notice that DP training is available only for single-table datasets.
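
For instance, a trainer with a DP budget might be built as in the following sketch, assuming that DpBudget is importable from aindo.rdml.synth and accepts the epsilon and delta values as keyword arguments; the budget values are illustrative.

from aindo.rdml.synth import DpBudget, TabularTrainer

# Trainer without differential privacy guarantees:
trainer = TabularTrainer(model=model_tabular)

# Trainer with an (epsilon, delta) DP budget (single-table datasets only):
trainer_dp = TabularTrainer(
    model=model_tabular,
    dp_budget=DpBudget(epsilon=1.0, delta=1e-5),  # illustrative budget values
)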

To train a model, the user also needs to build a TabularDataset object containing the preprocessed training data. The TabularDataset is built from the raw training data and the same TabularPreproc object used to build the model. There are three options to instantiate a TabularDataset object:

  • From the raw data, and storing the processed data in RAM. In this case the TabularDataset.from_data() method should be invoked.
  • From the raw data, but storing the processed data on disk. In this case, again the TabularDataset.from_data() method should be invoked, but the on_disk parameter should be set to True. Moreover, the path parameter can be used to provide a directory where the processed data should be stored. By default, the data is stored in a temporary directory and deleted at the end of the process. When the data is stored on disk, it is loaded one batch at a time during training. This may slightly slow down the training, but reduces memory consumption.
  • From data already processed and stored on disk. When the TabularDataset.from_data() method is used with on_disk set to True and a path is provided, the data is stored in the given directory and can be accessed again later with the TabularDataset.from_disk() method, providing the TabularPreproc and the path to the directory.

The TabularDataset has another optional argument, block_size, an integer fixing the maximum length of the internal representation of the input used during training. A smaller block_size will reduce the time of a single training epoch, but will introduce approximations that may compromise the quality of the generated synthetic data. The given block_size should be larger than the maximal length of the internal representation of each single table in the dataset. For this reason, this parameter is available only for multi-table datasets.
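
For instance, the three instantiation options and the block_size argument might be used as in the following sketch; the paths and numeric values are illustrative, and the argument names of TabularDataset.from_disk() are assumed.

from aindo.rdml.synth import TabularDataset

# 1. Process the data and keep it in RAM:
dataset = TabularDataset.from_data(data=data_train, preproc=preproc)

# 2. Process the data and store it on disk; batches are loaded from disk during training:
dataset = TabularDataset.from_data(
    data=data_train,
    preproc=preproc,
    on_disk=True,
    path='processed/train',  # illustrative directory
    block_size=1024,         # illustrative value, multi-table datasets only
)

# 3. Reload previously processed data from disk:
dataset = TabularDataset.from_disk(preproc=preproc, path='processed/train')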

Once the trainer and the training dataset are ready, the TabularTrainer.train() method is used to train the model. The method requires:

  • The training dataset (dataset);
  • The desired number of training epochs (n_epochs), or alternatively of training steps (n_steps);
  • Either the batch size (batch_size) or the available memory in MB (memory), which is in turn used to compute an optimal batch size.

Additionally, users can provide the optional arguments:

  • lr: The learning rate, whose optimal value is otherwise automatically determined.
  • valid: A Validation object that configures validation during training. The validation dataset must be provided as a TabularDataset object via the argument dataset, and various functionalities can be activated with the dedicated arguments, including learning rate scheduling and early stopping. To protect the validation data with DP guarantees, a DpValid object should be provided through the dp parameter. For further information, please refer to the API reference.
  • hooks: A sequence of custom training hooks crafted by the user, described in the next section.
  • accumulate_grad: The number of gradient accumulation steps. By default, it is set to 1, meaning the model is updated at each step.
  • dp_step: A DpStep object containing the data needed for the differentially private step. It should be provided if and only if the trainer was equipped with a DP-budget, and therefore only for single-table datasets. For the available settings, please refer to the API reference.
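
As a sketch, a DP training run using a memory budget instead of an explicit batch size might look as follows. It assumes a trainer built with a dp_budget and that DpStep is importable from aindo.rdml.synth; the numeric values are illustrative and the DpStep settings are left as placeholders.

from aindo.rdml.synth import DpStep

trainer_dp.train(
    dataset=dataset_train,
    n_epochs=50,          # illustrative
    memory=4_000,         # available memory in MB, used to compute the batch size
    accumulate_grad=4,    # illustrative number of gradient accumulation steps
    dp_step=DpStep(...),  # settings described in the API reference
)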

Here is an example of training the tabular model, with a validation step at the end of each epoch:

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularDataset, TabularModel, TabularPreproc, TabularTrainer, Validation

data = RelationalData(data=..., schema=...)
data_train, data_valid = data.split(ratio=0.1)
preproc = TabularPreproc.from_schema(schema=data.schema).fit(data=data)
model_tabular = TabularModel.build(preproc=preproc, size='small')
dataset_train = TabularDataset.from_data(data=data_train, preproc=preproc)
dataset_valid = TabularDataset.from_data(data=data_valid, preproc=preproc)
trainer_tabular = TabularTrainer(model=model_tabular)
trainer_tabular.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
    valid=Validation(dataset=dataset_valid, each=1, trigger='epoch'),
)

Custom hooks (expert user)

The experienced user might opt to specify personalized training hooks using the hooks parameter of the TabularTrainer.train() method. These hooks must extend the TrainHook class, whose __init__() method takes at least two arguments to define the frequency of the activation of the hook: an integer each, and a trigger, which may be 'epoch' or 'step'. A custom hook must implement the _hook(n) method, which is invoked when the hook is triggered according to the each and trigger arguments, and receives as an argument the current epoch or step number, depending on the value of trigger.

A custom hook may also override the following methods (a minimal sketch follows the list):

  • setup(trainer, hooks), invoked before the training begins; it takes as input the trainer and the previously defined hooks.
  • hook(), called at each training step. The default behavior is to check whether the trigger is activated and, in that case, call the _hook() method.
  • _cleanup(), called at the end of the training; it should return the status of the current hook.
  • cleanup(hook_status), called at the end of the training; it receives as input the statuses of the previous hooks and should return the status of the current hook. Its default behavior is to check the statuses of the previous hooks and to call the _cleanup() method.
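
A minimal sketch of a custom hook, assuming TrainHook can be imported from aindo.rdml.synth and subclassed as described above; the logging logic is purely illustrative.

from aindo.rdml.synth import TrainHook

class LogHook(TrainHook):
    """Illustrative hook that prints a message every other epoch."""

    def __init__(self) -> None:
        super().__init__(each=2, trigger='epoch')

    def _hook(self, n: int) -> None:
        # n is the current epoch (or step) number, depending on `trigger`
        print(f'Reached epoch {n}')

# Illustrative usage: trainer_tabular.train(..., hooks=[LogHook()])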

Text Model

As for the TabularModel, to instantiate and build a TextModel instance, the user is required to provide a preproc, which in this case is a TextPreproc, and a size, which is a TextModelSize, a Size, or a string representation of the latter. For a TextModel, the user is also required to provide a block_size, corresponding to the maximum text length that the model can process in a single forward step. Finally, the user may provide the optional dropout parameter.

Alternatively, the user may build a TextModel from a pretrained model, with the constructor TextModel.build_from_pretrained(), providing a TextPreproc and a path to the pretrained model. The optional block_size argument is also available, to fix the maximum text length that the model can process during fine-tuning.
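
A hedged sketch of building from a pretrained model, given a fitted TextPreproc preproc_text as in the previous examples; the path argument name and its value are assumptions.

from aindo.rdml.synth import TextModel

model_text = TextModel.build_from_pretrained(
    preproc=preproc_text,
    path='path/to/pretrained',  # hypothetical argument name and location
    block_size=512,             # optional, illustrative value
)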

To build the training (and validation) dataset, the user must instantiate a TextDataset object. The options are similar to those of the TabularDataset; however, in this case the block_size parameter is not available. To reduce the block size, it is possible to set the block_size parameter in TextModel.build(), or the TextModel.max_block_size attribute. A reasonable value for the block size can be obtained from the TextDataset.max_text_len attribute of the training dataset.
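
For instance, the block size might be chosen from the training data, as in the following sketch, which reuses data_train and preproc_text from the previous examples.

from aindo.rdml.synth import TextDataset, TextModel

# Build the dataset first, then size the model block from the longest text
# observed in the training data.
dataset_train = TextDataset.from_data(data=data_train, preproc=preproc_text)
model_text = TextModel.build(
    preproc=preproc_text,
    size='small',
    block_size=dataset_train.max_text_len,
)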

The associated trainer is a TextTrainer object, which is built from a TextModel. At the moment, DP training is not available for the TextTrainer; therefore, the dp_budget option is not available. The TextTrainer.train() method has the same arguments as the TabularTrainer.train() method, except for the dp_step option, which is not active.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextDataset, TextModel, TextPreproc, TextTrainer

data = RelationalData(data=..., schema=...)
data_train, data_valid = data.split(ratio=0.1)
preproc_text = TextPreproc.from_schema_table(schema=data.schema, table='listings').fit(data=data)
model_text = TextModel.build(
    preproc=preproc_text,
    size='small',
    block_size=1024,
)
dataset_train = TextDataset.from_data(data=data_train, preproc=preproc_text)
trainer_text = TextTrainer(model=model_text)
trainer_text.train(
    dataset=dataset_train,
    n_epochs=100,
    batch_size=32,
)

Synthetic data generation

After training the TabularModel, generating synthetic data becomes straightforward by using its TabularModel.generate() method. This method takes as input one (and only one) of the following:

  • n_samples: The number of samples to generate in the root table.
  • ctx: A pandas.DataFrame containing the first columns of the root table, from which to start a conditional generation. The column names must match the names in the original table. Note that, due to the nature of the generation, the user cannot provide an arbitrary combination of the columns of the root table, but only the first n consecutive ones; the model starts from these to generate the following ones. The number of generated rows of the root table will be exactly the number of rows provided, and the first columns will match the provided ones.

Optionally, the user can also specify:

  • batch_size: The number of samples generated in parallel. Defaults to 0, which means that all the data is generated in a single batch.
  • max_block_size: This parameter limits the length of each generated sample (in terms of its internal representation). It is active only for multi-table datasets. The default value is 0, meaning no limit is enforced. A reasonable value for this parameter can be obtained from the TabularDataset.block_size attribute of the dataset.
  • temp: A strictly positive number describing the amount of noise used in generation. The default value is 1. Larger values will introduce more variance, lower values will decrease it.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel

data = RelationalData(data=..., schema=...)
model_tabular = TabularModel.build(preproc=..., size=...)
# Train the tabular model as shown above
...
data_synth = model_tabular.generate(
    n_samples=data['host'].shape[0],
    batch_size=32,
)
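
Conditional generation via the ctx argument follows the same pattern. The sketch below is illustrative: it assumes the first column of the root table host may be used as context, and the temp argument is optional.

# Illustrative conditional generation: provide the first column(s) of the
# root table as context; the column names must match the original ones.
ctx = data['host'].loc[:, ['host_id']]
data_synth_cond = model_tabular.generate(
    ctx=ctx,
    batch_size=32,
    temp=0.9,  # optional: slightly less generation noise than the default temp=1
)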

This model only generates non-text columns. The missing text columns are generated by the trained TextModel objects, by means of the TextModel.generate() method, which takes the previously generated tabular data as input.

The optional arguments are:

  • batch_size: The batch size used during generation. Defaults to 0, which means that all the data is generated in a single batch.
  • max_text_len: The maximum length of the generated text for each table row. The default value is 0, meaning the maximum possible value is used, namely the value of the TextModel.max_block_size attribute.
  • temp: As for the tabular model, this parameter controls the amount of noise used in generation, and the default value is 1.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel, TextModel

data = RelationalData(data=..., schema=...)
model_tabular = TabularModel.build(preproc=..., size=...)
model_text = TextModel.build(preproc=..., size=..., block_size=...)
# Train the tabular and text models as shown above
...
data_synth = model_tabular.generate(
    n_samples=data['host'].shape[0],
    batch_size=32,
)
data_synth = model_text.generate(
    data=data_synth,
    batch_size=32,
)

At the end of the procedure, data_synth is a RelationalData object containing the synthetic version of the Airbnb dataset, including the text column name in the listings table.

Note that in order to also generate the host_name text column in the host table, we should build and train a second TextModel and then generate the host_name column in the same fashion as the name column in the listings table. A full example can be found in the Airbnb script.