Create a Semi-Synthetic Generator
To create a semi-synthetic generator, first select the source it will be based on. Then, in the generator creation dialog, choose ‘Semi-Synthesis’ as the generator type and, optionally, give the generator a name.
Semi-synthetic generators follow the same initial configuration steps as synthesis generators, with the key difference being the ability to designate context columns during the data treatment step.
Configuration steps
Before creating the semi-synthetic generator, the configurator allows you to review and modify some properties of the tabular data that will be used.
1. Choose data
The ‘Choose data’ step allows you to include or exclude specific tables and/or columns from the selected source.
This step works identically to synthesis generators.
2. Set primary keys
The ‘Set primary keys’ step allows you to set and unset primary keys for every table of the selected source.
This step works identically to synthesis generators.
3. Set foreign keys
The ‘Set foreign keys’ step allows you to set and unset foreign keys for every table of the selected source.
This step works identically to synthesis generators.
4. Data treatment
The ‘Data treatment’ step is where semi-synthetic generators differ from fully synthetic generators.
Context Column Configuration
For semi-synthetic generators, you’ll see an additional Context toggle for each column:
- Context enabled: The column from the original dataset (real data) will be used to build the context
- Context disabled: The column will be synthetically generated based on the context columns
By default, primary and foreign keys are treated as context columns.
Visual Indicators
- Context columns: Show “Or” (original) instead of “Sy” (synthesized)
- Generated columns: Show “Sy” (synthesized) as in fully synthetic generators
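The context toggle, its default, and the indicators above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the product's API: `ColumnSpec`, `default_context_columns`, and `column_badge` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical description of one column in a table."""
    name: str
    is_primary_key: bool = False
    is_foreign_key: bool = False


def default_context_columns(columns):
    """Return the names of the columns that are context by default.

    Assumption: mirrors the documented default, where primary and
    foreign keys start with the Context toggle enabled.
    """
    return [c.name for c in columns if c.is_primary_key or c.is_foreign_key]


def column_badge(column_name, context_columns):
    """Return "Or" for a context column and "Sy" for a generated one."""
    return "Or" if column_name in context_columns else "Sy"


columns = [
    ColumnSpec("order_id", is_primary_key=True),
    ColumnSpec("customer_id", is_foreign_key=True),
    ColumnSpec("amount"),
]
context = default_context_columns(columns)
for col in columns:
    print(col.name, column_badge(col.name, context))
```

In this sketch, the two key columns keep their original values ("Or") and `amount` is generated ("Sy") until you enable its Context toggle.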
Marking a table as “original data”
Similar to synthesis generators, you can designate certain tables as “original data” for lookup tables and reference data that should remain unchanged.
This functionality works identically to synthesis generators.
5. Generator settings
The ‘Generator settings’ step lets you configure the training process and the generative model that will produce the synthetic data.
Semi-synthetic generators use the same training parameters as synthesis generators, with the model learning the relationships between context columns and target columns rather than generating all columns from scratch.
This step works identically to synthesis generators.
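To make the idea of learning relationships between context and target columns more concrete, here is a toy sketch. It is not the product's generative model: a simple table of observed values stands in for it, and the function names and sample data are invented for illustration. A real model generalizes rather than replaying observed values as this toy does.

```python
import random
from collections import defaultdict


def fit_conditional(rows, context_col, target_col):
    """Collect the target values observed for each context value."""
    observed = defaultdict(list)
    for row in rows:
        observed[row[context_col]].append(row[target_col])
    return dict(observed)


def generate(rows, model, context_col, target_col, rng):
    """Keep each row's real context value and sample a new target for it."""
    return [
        {
            context_col: row[context_col],
            target_col: rng.choice(model[row[context_col]]),
        }
        for row in rows
    ]


# Invented sample data: "country" is the context, "plan" is the target.
real = [
    {"country": "FR", "plan": "basic"},
    {"country": "FR", "plan": "premium"},
    {"country": "US", "plan": "basic"},
]
model = fit_conditional(real, "country", "plan")
semi_synthetic = generate(real, model, "country", "plan", random.Random(0))
```

The context column is copied through unchanged, while each target value is drawn conditionally on the context of its row.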
6. Generation settings
The ‘Generation settings’ step allows you to enable or disable the data generation once the generator is ready.
This step works identically to synthesis generators.
Configuration validations
Semi-synthetic generators have specific validation requirements for context configuration:
For a single-table source, context must be enabled on at least one column.
Note that in the multi-table case, primary keys and foreign keys are considered context columns by default.
Other validations work identically to synthesis generators.
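The context validation described above can be sketched as a small check. The function name and the shape of its input (a mapping from table name to a `{column name: context enabled}` dictionary) are assumptions made for this example, not the product's interface.

```python
def validate_context(tables):
    """Check the single-table context rule.

    `tables` maps each table name to a dict of
    {column name: True if the Context toggle is enabled}.
    Returns a list of error messages, empty when the check passes.
    """
    errors = []
    # The documented rule applies to the single-table case only.
    if len(tables) == 1:
        table_name, columns = next(iter(tables.items()))
        if not any(columns.values()):
            errors.append(
                f"Table '{table_name}': enable context on at least one column."
            )
    return errors


print(validate_context({"customers": {"age": False, "city": False}}))
print(validate_context({"customers": {"age": True, "city": False}}))
```

The first call reports an error because no column has context enabled; the second passes.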
Configuration complexity assessment
Semi-synthetic generators undergo the same complexity assessment as synthesis generators, with additional considerations for context-target relationships.
Apart from that, the assessment works identically to synthesis generators.
Choose a destination
After selecting the configuration, you will be prompted with a dialog where you can choose a destination for writing the semi-synthetic data.
This step works identically to synthesis generators.
Generator status and timeline
The creation process for semi-synthetic generators follows the same steps as synthesis generators:
- Data loading
- Preprocessing (with context column identification)
- Training the Generative AI model (learning context-target relationships)
Status monitoring works identically to synthesis generators.
Troubleshooting
Semi-Synthetic Specific Issues:
- No context columns selected: For a single-table source, make sure at least one column is marked as context.
Other troubleshooting steps are the same as for synthesis generators.
FAQ
Q: How many columns should I mark as context?
A: This depends on your use case. More context provides better conditioning but less privacy. Start with the minimum context needed for your business requirements.
Q: Can I change context column selection after creation?
A: No, context column selection is fixed during generator creation. You’ll need to create a new generator to change the context configuration.
Q: What happens if context columns have missing values?
A: Context columns may contain missing values. As with fully synthetic data, missing values are treated as special values, and the generated columns preserve the statistical correlation with the presence of missing values observed in the real data.
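As an illustration only, one simple way to treat a missing context value as a special value is to map it to its own category before conditioning on it. The product's actual encoding is not described here; the `MISSING` sentinel, the function names, and the sample data below are all hypothetical.

```python
from collections import Counter, defaultdict

MISSING = "<MISSING>"  # hypothetical sentinel, not a product constant


def encode_missing(value):
    """Map a missing value (None) to a dedicated category."""
    return MISSING if value is None else value


def target_counts_by_context(rows, context_col, target_col):
    """Count target values per context value, with missing as a category."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[encode_missing(row[context_col])][row[target_col]] += 1
    return {key: dict(value) for key, value in counts.items()}


# Invented sample data: "income" is the context, "loan" is the target.
real = [
    {"income": None, "loan": "rejected"},
    {"income": None, "loan": "rejected"},
    {"income": "high", "loan": "approved"},
]
counts = target_counts_by_context(real, "income", "loan")
print(counts)
```

Because missing income is a category of its own, its association with the target stays visible to whatever model conditions on the context.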