Generative Models

aindo.synth offers several generative models for producing synthetic data. To handle the relational structure of the input, two flavors of these models are available:

  • MultiLevelSynth assumes a hierarchical dependency between tables: child tables are generated conditionally on all their parent tables. This implementation is most efficient when the relational structure is a tree, that is, when each table has at most one parent table.

  • GraphSynth does not assume a hierarchical dependency between tables and generates the data of all tables in the structure at the same time. It is based on a graph representation of the data contained in the RelationalData object and uses Graph Neural Networks to build the encoders and decoders. For further details, see this article describing the technique.

Regardless of the specific implementation, all models share the same interface, defined by the base class Model.

The base class Model implements the following methods:

  • Model.train() trains the model on the input data for n_epochs training epochs, with other optional arguments such as the batch size (default batch_size=1024), the learning rate (default lr=1e-3), and the training device (device);

  • Model.sample() generates synthetic data samples, with optional arguments mode and n_samples to specify the generation strategy and the number of samples. By default, the generated data have the same relational structure as the original data, including the number of samples for each table;

  • Model.report() generates a PDF report containing several metrics useful for evaluating the quality of the generated data, such as univariate and bivariate column statistics of real and synthetic samples, correlation matrix similarity, and nearest-neighbor statistics;

  • Model.save() saves the model to a checkpoint.
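
A minimal end-to-end sketch of this workflow is shown below. Here data is a placeholder for a RelationalData object and schema for the corresponding Schema, both prepared as described elsewhere in the documentation; the choice of GraphSynth and the number of epochs are purely illustrative.

    from aindo.synth.models.graph import GraphSynth

    # `data` is a RelationalData object and `schema` its Schema (placeholders).
    model = GraphSynth(schema=schema)

    # Train with the default batch size and learning rate.
    model.train(train_data=data, n_epochs=100)

    # Generate synthetic data with the same relational structure as the original.
    synth_data = model.sample(original_data=data)

    # Evaluate the synthetic data and persist the model.
    model.report(original_data=data, out_dir="out/report")
    model.save(ckpt_path="out/model.pt")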

The user can set additional hyperparameters in the model constructors that define the architecture, such as the latent dimensions of the tables or the type and size of the modules used. In the case of GraphSynth models, the tunable hyperparameters include the strategy used to pass messages to each node's neighbors (messenger), the aggregation of the received messages (aggregator), and the number of encoder/decoder graph layer blocks in the architecture (n_graph_layers), as illustrated in the sketch below. A detailed description of all the optional arguments is provided in the documentation.
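
For instance, a GraphSynth model with some of these hyperparameters overridden could be built as follows; the specific values are arbitrary and only show how the arguments are passed (schema is a placeholder for the Schema of the data).

    from aindo.synth.models.graph import GraphSynth

    model = GraphSynth(
        schema=schema,
        messenger="graph_conv",      # message passing strategy
        aggregator="dot_attention",  # aggregation of the received messages
        n_graph_layers=2,            # encoder/decoder graph layer blocks
        d_latent=16,                 # latent dimension of each table
    )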

To load a saved model, use the load_model() function, specifying the path to the checkpoint.
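
For example, assuming a checkpoint was previously saved to out/model.pt:

    from aindo.synth.load import load_model

    model = load_model(ckpt_path="out/model.pt")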

Documentation

class aindo.synth.models.model.Model(encoder: Encoder, decoder: Decoder, sampler: MultiLevelSampler, trainer=None, dz: Optional[Union[dict[str, int], int]] = None)

Base Model class for synthetic data generation.

property device: Optional[Union[str, device]]

The device of the model parameters.

reset() None

Reset the model’s parameters.

train(train_data: RelationalData, n_epochs: int, batch_size: int = 1024, lr: float = 0.001, beta: float = 1.0, optimizer: str = 'Adam', optimizer_params: Optional[dict[str, Any]] = None, reset: bool = False, valid_data: Optional[RelationalData] = None, tensorboard_dir: str | pathlib.Path | None = 'logs', device: Optional[Union[str, device]] = None) None

Train the model on the data.

Parameters:
  • train_data – The training data.

  • n_epochs – Number of training epochs.

  • batch_size – Batch size for the root tables.

  • lr – Learning rate.

  • beta – Weight of the regularization term for each table.

  • optimizer – Name of the optimizer to be used, retrieved from torch.optim.

  • optimizer_params – Dictionary of parameters to initialize the optimizer.

  • reset – Whether to reset all trainer and model parameters. If False and training was already performed, the optimizer and optimizer_params arguments are ignored.

  • valid_data – Data used for validation.

  • tensorboard_dir – Output directory for TensorBoard log files. If None, no logs are saved.

  • device – The device on which to perform the training. If None, it is selected automatically.
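
A sketch of a call using the optional arguments, assuming train_data and valid_data are RelationalData objects prepared beforehand:

    model.train(
        train_data=train_data,
        n_epochs=200,
        batch_size=512,
        lr=5e-4,
        valid_data=valid_data,    # optional validation set
        tensorboard_dir="logs",   # TensorBoard logs; None disables logging
        device="cuda",            # or None for automatic selection
    )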

sample(original_data: Optional[RelationalData] = None, n_samples: Optional[dict[str, int]] = None, mode: aindo.synth.utils.enum.SynthMode | str = SynthMode.SAMPLE_FKS, device: Optional[Union[str, device]] = None) RelationalData

Synthesize data.

Parameters:
  • original_data – The original data, used to retrieve the original tables when the foreign keys are kept fixed and/or when some tables should not be generated.

  • n_samples – Dictionary with the number of samples to generate for each table to be synthesized. If some tables are missing from the dictionary, the corresponding original tables are loaded; note, however, that if a table is not synthesized, none of its parent tables can be synthesized either. If mode = ‘original_fks’, the values of n_samples are ignored. If None, the numbers of samples of the original data are used.

  • mode – One of:
      - sample_fks: sample the foreign keys with a hierarchical algorithm, from the roots to the leaves;
      - original_fks: use the same structure as the original data for the foreign keys.

  • device – Optional device on which to perform the computation. If None, the device is selected automatically.

Returns:

A RelationalData object containing a pandas DataFrame for each table.
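
For illustration, assuming a schema with a root table "customers" and a child table "orders" (hypothetical names):

    # Sample 1000 synthetic customers and 5000 synthetic orders,
    # drawing the foreign keys hierarchically (the default mode).
    synth = model.sample(n_samples={"customers": 1000, "orders": 5000})

    # Alternatively, keep the foreign-key structure of the original data.
    synth = model.sample(original_data=original_data, mode="original_fks")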

report(original_data: RelationalData, out_dir: str | pathlib.Path, device: Optional[Union[str, device]] = None) None

Generate the PDF report.

Parameters:
  • original_data – The original data to compare the synthetic data with.

  • out_dir – Output directory where the report is saved.

  • device – Device on which to perform the computation. If None, the device is selected automatically.
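
For example, assuming original_data is the RelationalData used for training:

    model.report(original_data=original_data, out_dir="out/report")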

save(ckpt_path: pathlib.Path | str = PosixPath('out/model.pt')) None

Save a checkpoint on disk in .pt format.

Parameters:
  • ckpt_path – Path to the checkpoint.
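
For example, to save the checkpoint to a custom location:

    from pathlib import Path

    model.save(ckpt_path=Path("out") / "graph_model.pt")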

class aindo.synth.models.graph.GraphSynth(schema: Schema, hidden_units: Sequence[int] = (32, 32), messenger: str | aindo.synth.utils.enum.GraphMsgTypes = GraphMsgTypes.GRAPH_CONV, aggregator: str | aindo.synth.utils.enum.GraphAggTypes = GraphAggTypes.MEAN, updater: str | aindo.synth.utils.enum.GraphUpdtTypes = GraphUpdtTypes.GRU, n_graph_layers: int = 3, d: int = 32, d_latent: int = 8, d_link: int = 10, temp: float = 1.0, hidden_units_fks: Sequence[int] = (128,))

Graph Synth Model

__init__(schema: Schema, hidden_units: Sequence[int] = (32, 32), messenger: str | aindo.synth.utils.enum.GraphMsgTypes = GraphMsgTypes.GRAPH_CONV, aggregator: str | aindo.synth.utils.enum.GraphAggTypes = GraphAggTypes.MEAN, updater: str | aindo.synth.utils.enum.GraphUpdtTypes = GraphUpdtTypes.GRU, n_graph_layers: int = 3, d: int = 32, d_latent: int = 8, d_link: int = 10, temp: float = 1.0, hidden_units_fks: Sequence[int] = (128,)) None

Graph Synth model.

Parameters:
  • schema – Schema of the data.

  • hidden_units – Number of hidden units in the Fully Connected blocks of tables.

  • messenger – Name of the messenger. Options: copy, graph_conv.

  • aggregator – Name of the aggregator. Options: sum, mean, add_attention, dot_attention.

  • updater – Name of the updater. Options: msg, gru.

  • n_graph_layers – Number of Graph Neural Network layers in encoder/decoder.

  • d – Graph Neural Network internal dimension.

  • d_latent – Latent dimension of a table.

  • d_link – Latent dimension for the linking function between nodes.

  • temp – Temperature for sampling the foreign keys.

  • hidden_units_fks – Number of hidden units for each layer in the fully connected linker.
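
A sketch of a constructor call that overrides a few of these defaults (schema is a placeholder for the Schema of the data; the values are illustrative):

    from aindo.synth.models.graph import GraphSynth

    model = GraphSynth(
        schema=schema,
        hidden_units=(64, 64),  # fully connected blocks of the tables
        messenger="copy",       # options: copy, graph_conv
        aggregator="mean",      # options: sum, mean, add_attention, dot_attention
        updater="msg",          # options: msg, gru
        temp=0.5,               # temperature for foreign-key sampling
    )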

class aindo.synth.models.multilevel.MultiLevelSynth(schema: Schema, hidden_units: Sequence[int] = (32, 32), aggregator: str | aindo.synth.utils.enum.MultiLevelAggTypes = MultiLevelAggTypes.ADD_ATTENTION, d_latent: int = 8, d_link: int = 10, temp: float = 1.0, hidden_units_fks: Sequence[int] = (128,))

Multilevel Synth Model

__init__(schema: Schema, hidden_units: Sequence[int] = (32, 32), aggregator: str | aindo.synth.utils.enum.MultiLevelAggTypes = MultiLevelAggTypes.ADD_ATTENTION, d_latent: int = 8, d_link: int = 10, temp: float = 1.0, hidden_units_fks: Sequence[int] = (128,)) None

Multilevel Synth Model

Parameters:
  • schema – Schema of the data.

  • hidden_units – Number of hidden units in the Fully Connected blocks of nodes.

  • aggregator – Name of the aggregator used to aggregate the encoded vectors in nodes. Options: leaf, gaussian, add_attention.

  • d_latent – Latent dimension of a table.

  • d_link – Latent dimension for the linking function between nodes.

  • temp – Temperature for sampling the foreign keys.

  • hidden_units_fks – Number of hidden units for each layer in the fully connected linker.
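
A sketch of a constructor call (schema is a placeholder for the Schema of the data; the values are illustrative):

    from aindo.synth.models.multilevel import MultiLevelSynth

    model = MultiLevelSynth(
        schema=schema,
        hidden_units=(64, 64),
        aggregator="gaussian",  # options: leaf, gaussian, add_attention
        d_latent=16,
        temp=0.5,
    )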

aindo.synth.load.load_model(ckpt_path: pathlib.Path | str = PosixPath('out/model.pt'), device: Optional[Union[str, device]] = None) Model

Load a trained model from a .pt checkpoint.

Parameters:
  • ckpt_path – Path to the saved model.

  • device – The device onto which the model is loaded. If None, it is selected automatically.
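
For example, to load a previously saved checkpoint onto the CPU and keep sampling from it:

    from aindo.synth.load import load_model

    model = load_model(ckpt_path="out/model.pt", device="cpu")
    synth = model.sample(original_data=original_data)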