Introduction
Semi-synthetic generators take a hybrid approach that combines real data (the context) with synthetic data generation. Unlike fully synthetic generators, which generate everything from scratch, semi-synthetic generators use some of the original columns as a “context” and conditionally generate the remaining columns.
This approach offers a middle ground between maximum privacy and maximum utility, which makes it particularly valuable in scenarios that require both privacy protection and the preservation of business logic.
This section guides you on creating, using, and managing semi-synthetic generators.
When to use semi-synthetic generators
Adding more columns to the context typically increases the fidelity of the generated data with respect to the original dataset. The downside is that “injecting” too much original data may lead to a greater risk of leaking sensitive personal information.
The primary motivation for using semi-synthetic data generators is to control how much original data is incorporated into the synthetic output. By selecting which columns to include as context, the user can balance the trade-off between data fidelity and privacy.
The choice of context columns ultimately depends on the specific use case. As a general rule, when privacy is a concern, include as few context columns as possible: only those required by the application.
Conversely, if privacy is less of a concern and higher fidelity is desired, additional context columns can be included to improve the realism of the synthetic data.
For example, consider a relational dataset consisting of:
- A patient table (the root), containing personal information.
- One or more child tables with details on diagnoses, prescriptions, interventions, and other medical events.
Suppose we’re particularly interested in a set of rare diseases that occur infrequently in the original data. In a fully synthetic version of the dataset, some of these rare conditions might not be adequately represented. To address this, we could use a semi-synthetic generator and include the columns containing diagnosed diseases as part of the context. This ensures that the disease distribution in the generated dataset mirrors the original one.
In this scenario, the privacy loss is expected to be minimal.
To further improve fidelity, we might also add the columns describing prescribed medications and dosages to the context. This would help maintain consistency between diagnoses and prescriptions, since both would be present in the context. Again, while this may not significantly compromise privacy, the more columns we add, the more information from the original data we embed in the synthetic output.
To avoid major privacy breaches, one should never include columns containing explicit personal information in the context. In our example, this would typically refer to fields in the patient table.
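The rule above can be enforced with a simple check before a generator is configured. The following is a minimal sketch, not part of any product API: the function name `validate_context_columns` and the column names are hypothetical, and it assumes you maintain your own list of columns that hold explicit personal information.

```python
def validate_context_columns(context_columns, personal_columns):
    """Reject a context selection that overlaps with known personal columns.

    Both arguments are plain lists of column names (hypothetical helper,
    not a library function).
    """
    overlap = sorted(set(context_columns) & set(personal_columns))
    if overlap:
        raise ValueError(
            f"Personal columns must not be used as context: {overlap}"
        )
    return list(context_columns)


# Hypothetical column names for the patient example.
personal_columns = ["name", "address", "date_of_birth"]
safe_context = validate_context_columns(["diagnosis", "medication"], personal_columns)
```

Running the check when the context is chosen makes an accidental selection of a personal column fail early, rather than being noticed after the data has been generated.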
How the context is created
To use a semi-synthetic generator, the user selects the columns that form the context. The selected columns are extracted from the original data to create a sampling pool for the context, in which each row of the root table is one sample. During the generation phase, the context is resampled with replacement from this pool. The user may specify the number of samples in the final context.
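The pool-and-resample procedure described above can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the generator's actual implementation: it uses a single flat table rather than a root table with child tables, and the function names `build_context_pool` and `sample_context`, as well as the column names, are hypothetical.

```python
import random


def build_context_pool(root_rows, context_columns):
    """Keep only the selected context columns; each root row is one pool sample."""
    return [{col: row[col] for col in context_columns} for row in root_rows]


def sample_context(pool, n_samples, seed=None):
    """Draw n_samples rows from the pool with replacement."""
    rng = random.Random(seed)
    return rng.choices(pool, k=n_samples)


# Hypothetical original data: one row per patient.
root_rows = [
    {"patient_id": 1, "age_band": "30-39", "diagnosis": "asthma"},
    {"patient_id": 2, "age_band": "60-69", "diagnosis": "diabetes"},
    {"patient_id": 3, "age_band": "40-49", "diagnosis": "rare_disease_x"},
]

pool = build_context_pool(root_rows, ["diagnosis"])
context = sample_context(pool, n_samples=5, seed=0)
```

Because sampling is done with replacement, the requested number of context samples can exceed the number of rows in the pool, and the same original row may appear more than once in the final context.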