Synth module

This module allows you to:

  1. Preprocess columns within each table;
  2. Train generative models on relational tabular data;
  3. Generate synthetic data.

Data preprocessing

Data preprocessing means transforming data columns to make them suitable for model training. This process can include optional steps to reduce the risk of privacy breaches and guarantee data anonymization.

Preprocessing is performed with a TabularPreproc object. The default preprocessor only needs a Schema object to be instantiated. After instantiation, the TabularPreproc object is fitted on a RelationalData object.

To illustrate the preprocessing steps we load the UCI Adult single table dataset, containing both text and non-text columns:

import pandas as pd
df = pd.read_csv('path/to/adult/dir/adult.csv')
print(df)
[Out]:
age workclass fnlwgt ... hours-per-week native-country y
0 43 Local-gov 169203 ... 35 United-States <=50K
1 36 Private 184112 ... 45 United-States >50K
2 35 Private 338611 ... 40 United-States <=50K
3 64 ? 208862 ... 50 United-States >50K
4 25 State-gov 129200 ... 40 United-States <=50K
... ... ... ... ... ... ...
29300 25 Private 290528 ... 40 United-States <=50K
29301 51 Private 306108 ... 40 United-States >50K
29302 33 Private 182792 ... 40 United-States <=50K
29303 46 Private 175925 ... 40 United-States <=50K
29304 34 Private 191291 ... 40 United-States <=50K
[29305 rows x 15 columns]
from aindo.rdml.relational import Schema, Table, Column, RelationalData
from aindo.rdml.synth import TabularPreproc

dfs = {'adult': ...}
schema = Schema(
    adult=Table(
        age=Column.INTEGER,
        workclass=Column.CATEGORICAL,
        fnlwgt=Column.INTEGER,
        education=Column.TEXT,
        ...
    )
)
data = RelationalData(data=dfs, schema=schema)
preproc = TabularPreproc(schema=schema)
preproc.fit(data=data)

Users also have the option to specify custom preprocessing for each column. This is achieved by passing the preprocessors argument to the TabularPreproc. This argument is a dictionary whose keys are table names and whose values are dictionaries mapping column names to one of the following values:

  1. A ColumnPreproc object, enabling users to define custom behavior for that column during the preprocessing step;
  2. A None value, telling the preprocessor to ignore that column (see the sketch after this list);
  3. A custom column preprocessor instance. This option is designed for advanced users seeking access to lower-level functionalities.
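
For example, here is a minimal sketch that excludes a column from preprocessing by mapping it to None (the choice of the fnlwgt column is purely illustrative):

from aindo.rdml.synth import TabularPreproc

preproc = TabularPreproc(
    schema=...,
    preprocessors={
        'adult': {
            'fnlwgt': None,  # ignore this column during preprocessing
        },
    },
)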

The preprocessing of text data is managed by TextPreproc objects, one for each table containing text. Note, however, that custom preprocessing of text columns is not supported.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextPreproc
data = RelationalData(data=..., schema=...)
preproc_text = TextPreproc(schema=data.schema, table='adult')
preproc_text.fit(data=data)

ColumnPreproc (advanced user)

A ColumnPreproc object offers four optional parameters designed to customize the preprocessing of a column: special_values, impute_nan, non_sample_values and protection:

  1. special_values is a list of values that are considered special or unique within the dataset, such as special characters occurring in a numeric column or outliers within a distribution. For instance, in the Adult dataset, let us assume that the number 64 is an outlier in the distribution of the age column and that the numeric column fnlwgt contains occurrences of the string 'unknown'. In that case, we would mark those values as special:

    from aindo.rdml.synth import ColumnPreproc, TabularPreproc

    preproc = TabularPreproc(
        schema=...,
        preprocessors={
            'adult': {
                'age': ColumnPreproc(special_values=[64]),
                'fnlwgt': ColumnPreproc(special_values=['unknown']),
            },
        },
    )
  2. impute_nan is a boolean flag that determines whether NaN values in the column should be imputed. When set to True, NaN values are imputed, ensuring that the synthetic data does not include any NaN values. For instance, to avoid sampling NaN values in the age column:

    from aindo.rdml.synth import ColumnPreproc, TabularPreproc

    preproc = TabularPreproc(
        schema=...,
        preprocessors={
            'adult': {
                'age': ColumnPreproc(impute_nan=True),
            },
        },
    )
  3. non_sample_values is a list of values that will not be sampled during generation, e.g. ‘Local-gov’ and ‘State-gov’ in the workclass column:

    from aindo.rdml.synth import ColumnPreproc, TabularPreproc

    preproc = TabularPreproc(
        schema=...,
        preprocessors={
            'adult': {
                'workclass': ColumnPreproc(non_sample_values=['Local-gov', 'State-gov']),
            },
        },
    )
  4. protection refers to a range of options that ensure privacy protection of the original column. This step is crucial: even though the model cannot learn from individual data subjects, it retains the capacity to generate instances featuring rare categories or outlier numerical values, which might disclose sensitive data from the original dataset.

    A privacy leak can occur when personally identifiable information or sensitive data present in the original dataset is revealed. For instance, let us consider a table containing information about employees in a company:

    Employee ID | Name          | Age | Department | Salary
    ----------- | ------------- | --- | ---------- | -------
    001         | Alice Johnson | 60  | Marketing  | $80,000
    002         | John Smith    | 32  | HR         | $55,000
    003         | Emily Davis   | 35  | Finance    | $65,000

    Even without explicitly displaying names, it is possible to identify Alice Johnson. By recognizing she is the eldest employee in the dataset, one could deduce her salary. This is a trivial example of a privacy leak.

    The protection parameter can be either the boolean flag True, indicating the default protection (ColumnPreproc(protection=True)), or a Protection object, which provides several protection measures.

    When configuring a Protection object, three optional arguments can be provided:

    • detectors, a sequence of Detector objects that detect the values to be protected, based on the column type and a chosen detection strategy. The full list of available detectors is provided in the documentation;
    • default, a boolean flag indicating whether the default protection should be enabled;
    • type, a string or a ProtectionType object that describes the protection strategy. This can be either imputation ('impute', ProtectionType.IMPUTE) or masking ('mask', ProtectionType.MASK). Imputation replaces sensitive values with plausible alternatives within the column, while masking replaces them with placeholders.

    For instance, we could use a RareCategoryDetector, which detects rare categories based on their number of occurrences, together with the masking strategy, on the workclass column:

    from aindo.rdml.synth import ColumnPreproc, Protection, RareCategoryDetector, TabularPreproc

    preproc = TabularPreproc(
        schema=...,
        preprocessors={
            'adult': {
                'workclass': ColumnPreproc(
                    protection=Protection(
                        detectors=(RareCategoryDetector(),),
                        type='mask',
                    ),
                ),
            },
        },
    )

Custom column preprocessors (expert user)

Each Column type is mapped internally to a default column preprocessor. The user may replace the default with a different preprocessor by means of the preprocessors parameter of the TabularPreproc object.

The available column preprocessors are: Categorical, Coordinates, Date, DateTime, Time, Integer, Numeric and Text. The table below shows the default mapping from column types to column preprocessors:

Column type            | Default column preprocessor
---------------------- | ---------------------------
BOOLEAN / CATEGORICAL  | Categorical
COORDINATES            | Coordinates
INTEGER                | Integer
NUMERIC                | Numeric
DATE / TIME / DATETIME | Date / DateTime / Time
TEXT                   | Text

Column preprocessors may be configured using the arguments special_values, impute_nan, non_sample_values and protection, which are common to all columns, plus the specific arguments available to each preprocessor.

For instance, the user might want to preprocess the hours-per-week column with a Categorical preprocessor, instead of the default Numeric:

from aindo.rdml.synth import Categorical, TabularPreproc

preproc = TabularPreproc(
    schema=...,
    preprocessors={
        'adult': {
            'hours-per-week': Categorical(),
        },
    },
)

Model training

The aindo.rdml user API offers two generative models for synthetic data generation:

  • A TabularModel that generates all the relational data excluding columns that contain text.
  • A TextModel that generates only text columns. Users must specify a TextModel for each table containing text columns.

Tabular Model

To instantiate and build a TabularModel the user needs to provide a preproc, which is a TabularPreproc object, and a size, denoting the desired model dimensions. The size argument can be defined in one of the following formats:

  • A TabularModelSize object containing the integer attributes n_layers, h and d;
  • A string or a Size object, internally mapping to a default configuration of TabularModelSize. The options are: small/Size.SMALL, medium/Size.MEDIUM, or large/Size.LARGE.

Optionally, the user may also specify a dropout value for the dropout layers in the model and a block_size parameter, which fixes the maximum length of the internal representation of the input.
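
For instance, one might build a model with explicit dimensions instead of a named size. Below is a minimal sketch; the numeric values and the dropout are illustrative, and the assumption that TabularModelSize is importable from aindo.rdml.synth is ours:

from aindo.rdml.synth import TabularModel, TabularModelSize

preproc = ...  # a fitted TabularPreproc
model = TabularModel.build(
    preproc=preproc,
    size=TabularModelSize(n_layers=8, h=8, d=512),  # illustrative dimensions
    dropout=0.1,  # optional dropout for the model's dropout layers
)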

The model is trained using the train() method of a TabularTrainer object. This method requires the training data (data) and the desired number of training epochs (n_epochs). Additionally, users can provide the optional arguments:

  • batch_size, the size of a batch of data during training. When it is not specified, the user must provide the memory argument, i.e. the available memory in MB, which is used to automatically set the optimal batch size;
  • lr, the learning rate, whose optimal value is otherwise automatically determined;
  • valid, a Validation object that configures validation during training. The validation data must be provided via the argument data, and various functionalities can be activated with the dedicated arguments, including learning rate scheduling and early stopping. For further information, please refer to the documentation;
  • hooks, a sequence of custom training hooks crafted by the user, described in the next section.

Here is an example of training with a validation step at the end of each epoch:

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel, TabularPreproc, TabularTrainer, Validation

data = RelationalData(data=..., schema=...)
data_train, data_valid = data.split(ratio=0.1)
preproc = TabularPreproc(schema=data.schema).fit(data=data)
model_tabular = TabularModel.build(preproc=preproc, size='small')
trainer_tabular = TabularTrainer(model=model_tabular)
trainer_tabular.train(
    data=data_train,
    n_epochs=100,
    batch_size=32,
    valid=Validation(data=data_valid, each=1, trigger='epoch'),
)
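
Alternatively, instead of fixing batch_size, one can pass the memory argument and let the trainer set the batch size automatically (a sketch; the value is illustrative and, as noted above, expressed in MB):

trainer_tabular.train(
    data=data_train,
    n_epochs=100,
    memory=4096,  # available memory in MB, used to pick an optimal batch size
)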

Custom hooks (expert user)

The experienced user may specify personalized training hooks via the hooks parameter of the train() method. These hooks must extend the TrainHook class, whose __init__() method takes at least two arguments: each (an integer) and trigger (either 'epoch' or 'step'), which together define how often the hook is activated. A custom hook must implement the _hook(n) method, which is invoked when the hook is triggered by the each and trigger arguments and receives the current epoch or step number, depending on the value of trigger.

A custom hook may also override the following methods:

  • setup(trainer, hooks), invoked before the training begins, takes as input the trainer and the previously defined hooks.
  • hook(), called at each training step. The default behavior is to check whether the trigger is activated and, if so, to call the _hook() method.
  • _cleanup(), called at the end of training, should return the status of the current hook.
  • cleanup(hook_status), called at the end of training, receives as input the statuses of the previous hooks and should return the status of the current hook. Its default behavior is to check the statuses of the previous hooks and then call the _cleanup() method.
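
As an illustration, here is a minimal sketch of a custom hook that prints a message every time it fires. The class name is hypothetical, and the import location of TrainHook is an assumption based on the interface described above:

from aindo.rdml.synth import TrainHook  # import location is an assumption

class PrintHook(TrainHook):
    # Hypothetical hook that logs a message every `each` epochs or steps
    def __init__(self, each: int = 1, trigger: str = 'epoch'):
        super().__init__(each=each, trigger=trigger)

    def _hook(self, n: int) -> None:
        # `n` is the current epoch or step, depending on `trigger`
        print(f'Hook fired at {n}')

Such a hook could then be passed to the trainer, e.g. trainer.train(..., hooks=[PrintHook(each=1)]).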

Text Model

To instantiate and build a TextModel instance, the user is also required to define a block_size, corresponding to the maximum text length that the model can process in a single forward step. The associated trainer is a TextTrainer object.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextModel, TextPreproc, TextTrainer

data = RelationalData(data=..., schema=...)
data_train, data_test = data.split(ratio=0.1)
preproc_text = TextPreproc(schema=data.schema, table='adult').fit(data=data)
model_text = TextModel.build(
    preproc=preproc_text,
    size='small',
    block_size=1024,
)
trainer_text = TextTrainer(model=model_text)
trainer_text.train(
    data=data_train,
    n_epochs=100,
    batch_size=32,
)

Synthetic data generation

After training the TabularModel, synthetic data can be generated with its generate() method. This method takes as input the number of samples to generate and returns a RelationalData object containing the synthetic data. Optionally, the user can specify:

  • batch_size, the batch size used during generation. It defaults to 0, which means that all the data is generated in a single batch;
  • temp, a parameter controlling the amount of noise used in generation. The default value is 1; larger values introduce more variance, while lower values decrease it.
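
For instance, here is a sketch generating lower-variance data in batches (the parameter values are illustrative):

data_synth = model_tabular.generate(
    n_samples=1000,  # number of samples to generate
    batch_size=64,   # generate in batches rather than all at once
    temp=0.8,        # below 1: less variance in the generated data
)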

Let’s load the Airbnb dataset, containing:

  1. The parent table host with primary key host_id;
  2. The child table listings with primary key id and foreign key host_id.
import pandas as pd

airbnb = pd.read_csv('path/to/airbnb/dir/airbnb.csv')
host_cols = ['host_id', 'host_name', 'calculated_host_listings_count']
host = airbnb.loc[:, host_cols].drop_duplicates()
listings = airbnb.drop(['host_name', 'calculated_host_listings_count'], axis=1)
dfs = {
    'host': host,
    'listings': listings,
}

We define the following schema:

from aindo.rdml.relational import Column, ForeignKey, PrimaryKey, RelationalData, Schema, Table

dfs = {'host': ..., 'listings': ...}
schema = Schema(
    host=Table({
        'host_id': PrimaryKey(),
        'host_name': Column.TEXT,
        'calculated_host_listings_count': Column.NUMERIC,
    }),
    listings=Table({
        'id': PrimaryKey(),
        'host_id': ForeignKey(parent='host'),
        'name': Column.TEXT,
        'neighbourhood_group': Column.CATEGORICAL,
        'neighbourhood': Column.CATEGORICAL,
        'latitude': Column.NUMERIC,
        'longitude': Column.NUMERIC,
        'room_type': Column.CATEGORICAL,
        'price': Column.INTEGER,
        'minimum_nights': Column.INTEGER,
        'number_of_reviews': Column.INTEGER,
        'last_review': Column.DATETIME,
        'reviews_per_month': Column.NUMERIC,
        'availability_365': Column.INTEGER,
    }),
)
data = RelationalData(data=dfs, schema=schema)

Suppose that our goal is to generate synthetic data with 100 rows in the parent table:

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TabularModel, TabularPreproc, TabularTrainer, Validation

data = RelationalData(data=..., schema=...)
data_train_valid, data_test = data.split(ratio=0.1)
data_train, data_valid = data_train_valid.split(ratio=0.1)
preproc = TabularPreproc(schema=data.schema)
preproc.fit(data=data)
model_tabular = TabularModel.build(
    preproc=preproc,
    size='small',
)
trainer_tabular = TabularTrainer(model=model_tabular)
trainer_tabular.train(
    data=data_train,
    n_epochs=100,
    batch_size=32,
    valid=Validation(data=data_valid, each=200, trigger='step'),
)
data_synth = model_tabular.generate(
    n_samples=100,
    batch_size=32,
)

This model only generates non-text columns. When the original dataset includes text columns, it is necessary to train a separate TextModel for each table containing text. Each trained model is then used to generate the missing text columns. In our example, we supply the tabular data generated previously to the generate() method of the TextModel.

from aindo.rdml.relational import RelationalData
from aindo.rdml.synth import TextModel, TextPreproc, TextTrainer

data = RelationalData(data=..., schema=...)
data_train_valid, data_test = data.split(ratio=0.1)
data_train, data_valid = data_train_valid.split(ratio=0.1)
# Build a tabular model, train it,
# and generate the synthetic tabular data
data_synth = ...
preproc_text_host = TextPreproc(
    schema=data.schema,
    table='host',
)
preproc_text_host.fit(data=data)
model_text_host = TextModel.build(
    preproc=preproc_text_host,
    size='small',
    block_size=1024,
)
trainer_text_host = TextTrainer(model=model_text_host)
trainer_text_host.train(
    data=data_train,
    n_epochs=100,
    batch_size=32,
)
data_synth = model_text_host.generate(
    data=data_synth,
    batch_size=32,
)
# Build a text model for 'listings', train it,
# and generate the text columns as was done for the 'host' table
preproc_text_listings = ...
model_text_listings = ...
trainer_text_listings = ...
trainer_text_listings.train(...)
data_synth = model_text_listings.generate(
    data=data_synth,
    batch_size=32,
)

At the end of the procedure, data_synth is a RelationalData object containing the full synthetic version of Airbnb, including the previously missing text columns in the host and listings tables.

Documentation

aindo.rdml.synth