Create a Semi-Synthetic Generator

To create a semi-synthetic generator, you first need to select a source to be used as its basis. This process can be initiated from the generator creation dialog, where you can also optionally specify a name. You will also need to select the type of generator you want to create, which in this case is ‘Semi-Synthesis’.

Semi-synthetic generators follow the same initial configuration steps as synthesis generators, with the key difference being the ability to designate context columns during the data treatment step.

Configuration steps

Before creating the semi-synthetic generator, the configurator allows you to review and modify some properties of the tabular data that will be used.

1. Choose data

The ‘Choose data’ step allows you to include or exclude specific tables and/or columns from the selected source.

This step works identically to synthesis generators.

2. Set primary keys

The ‘Set primary keys’ step allows you to set and unset primary keys for every table of the selected source.

This step works identically to synthesis generators.

3. Set foreign keys

The ‘Set foreign keys’ step allows you to set and unset foreign keys for every table of the selected source.

This step works identically to synthesis generators.

4. Data treatment

The ‘Data Treatment’ step is where semi-synthetic generators differ from fully synthetic generators.

Context Column Configuration

For semi-synthetic generators, you’ll see an additional Context toggle for each column:

Context enabled: The column from the original dataset (real data) will be used to build the context
Context disabled: The column will be synthetically generated based on the context columns

By default, primary and foreign keys are considered as context columns.

Visual Indicators

Context columns: Show “Or” (original) instead of “Sy” (synthesized)
Generated columns: Show “Sy” (synthesized) as in fully synthetic generators

Marking a table as “original data”

Similar to synthesis generators, you can designate certain tables as “original data” for lookup tables and reference data that should remain unchanged.

This functionality works identically to synthesis generators.

5. Generator settings

The ‘Generator Settings’ step allows you to configure various options regarding the training process and the generative model that will produce the synthetic data.

Semi-synthetic generators use the same training parameters as synthesis generators, with the model learning the relationships between context columns and target columns rather than generating all columns from scratch.

This step works identically to synthesis generators.

6. Generation settings

The ‘Generation settings’ step allows you to enable or disable the data generation once the generator is ready.

Semi-synthetic generators share most settings with synthesis generators, with the addition of context-specific parameters:

Resample: Controls how context data is selected from the original dataset for generation:
- When enabled: The context is resampled (with replacement) from the original data. This allows you to specify a custom number of records
- When disabled: The context is taken as-is from the original data, and the number of records is fixed to the original dataset size

Configuration validations

Semi-synthetic generators have specific validation requirements for context configuration:

For the case of a single table, the context must be enabled on at least one column in your dataset.

Note that in the multi-table case, primary keys and foreign keys are considered context columns by default.

Other validations work identically to synthesis generators.

Configuration complexity assessment

Semi-synthetic generators undergo the same complexity assessment as synthesis generators, with additional considerations for context-target relationships.

This works identically to synthesis generators.

Choose a destination

After selecting the configuration, you will be prompted with a dialog where you can choose a destination for writing the semi-synthetic data.

This step works identically to synthesis generators.

Generator status and timeline

The creation process for semi-synthetic generators follows the same steps as synthesis generators:

Data loading
Preprocessing (with context column identification)
Training the Generative AI model (learning context-target relationships)

Status monitoring works identically to synthesis generators.

Troubleshooting

Semi-Synthetic Specific Issues:

No context columns selected: In the case of a single table, ensure at least one column is marked as context

Other troubleshooting follows synthesis generators.

FAQ

Q: How many columns should I mark as context?
A: This depends on your use case. More context provides better conditioning but less privacy. Start with the minimum context needed for your business requirements.

Q: Can I change context column selection after creation?
A: No, context column selection is fixed during generator creation. You’ll need to create a new generator to change the context configuration.

Q: What happens if context columns have missing values?
A: Context columns may contain missing values. As for fully synthetic data, they will be treated as special values and the generated columns will maintain the statistical correlation with the presence of these missing values found in the real data.