Create a synthesis

To generate synthetic data, you need to run a synthesis on a specific source. This process can be initiated from the synthesis dialog.

Configuration steps

Before creating the synthesis, the source configurator allows you to review and modify some properties of the tabular data that will be used.

1. Choose data

The ‘Choose data’ step allows the user to include or exclude certain tables and/or columns from the data previously uploaded.

2. Set primary keys

The ‘Set primary keys’ step allows the user to set and unset primary keys for every table of the data previously uploaded.

3. Set foreign keys

The ‘Set foreign keys’ step allows the user to set and unset foreign keys for every table of the data previously uploaded.

4. Data treatment

The ‘Data treatment’ step allows the user to configure synthesis options at both the table and column levels. A green shield marks a column with sensitive data that will be synthesized; a gray shield marks a column with sensitive data that will not be synthesized. Some options are specific to the column type.

Marking a table as “original data”

In this step, you can designate certain tables as “original data”. The data in such a table is included in the output in its original form, without any transformation or synthesis. This is particularly useful for lookup tables, which often contain reference data such as country codes, product categories, or status types that do not need to be altered.

  • Purpose: Lookup tables often contain fixed, non-sensitive data that should remain consistent across different datasets. By including the original data, you ensure that these reference values remain accurate and usable in synthesized datasets.
  • Requirements: Tables set as “original data” must have at least one inbound foreign key relationship from another table. This ensures that the lookup table is properly referenced within the relational structure, which is crucial for maintaining referential integrity across the synthesized dataset.
  • Exclusion from Root Table Validation: Tables marked as “original data” do not count towards the validation of the number of root tables. The validation ensures that there is only one root table in the relational structure. Lookup tables are considered auxiliary and are excluded from this count, as they only serve to support relationships between other tables.
  • How It Works: When a table is marked as “original data,” it bypasses any synthetic processing. The data remains exactly as it was in the source, while other tables are synthesized. This is critical for preserving lookup or reference data that should not change, while still generating synthetic data for other parts of the schema.
  • Visual Indicators: Tables marked as “original data” are indicated with a specific icon (e.g., an asterisk icon) to distinguish them from tables that will undergo synthesis.

For example, if you have a table containing product categories like “Electronics,” “Clothing,” or “Home Goods,” you may not want this information to be synthesized. By marking the table as “original data,” you preserve its values exactly as they were in the original dataset. Because it is a lookup table with foreign key relationships, it remains part of the overall relational structure but does not interfere with the root table validation.
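
The two rules above (an inbound foreign key for every “original data” table, and a single root table once those tables are excluded) can be sketched as a small check. This is only an illustration, not the platform's implementation: the `Table` class and `validate_schema` function are hypothetical names, and the sketch assumes that a root table is one that references no other synthesized table.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical in-memory description of one table in the source."""
    name: str
    original_data: bool = False  # True if the table is marked as "original data"
    # Names of the tables this table points to through foreign keys.
    references: list[str] = field(default_factory=list)


def validate_schema(tables: list[Table]) -> list[str]:
    """Return a list of problems; an empty list means both checks pass."""
    problems: list[str] = []
    original = {t.name for t in tables if t.original_data}
    # Every table that is the target of at least one foreign key.
    referenced = {target for t in tables for target in t.references}

    # Rule 1: an "original data" table needs at least one inbound foreign key.
    for t in tables:
        if t.original_data and t.name not in referenced:
            problems.append(f"'{t.name}' is original data but no table references it")

    # Rule 2: exactly one root table, not counting "original data" tables.
    # Assumption: a root is a synthesized table whose foreign keys (if any)
    # point only to "original data" tables.
    roots = [
        t.name
        for t in tables
        if not t.original_data
        and not any(target not in original for target in t.references)
    ]
    if len(roots) != 1:
        problems.append(f"expected exactly one root table, found {len(roots)}: {roots}")
    return problems


schema = [
    Table("customers"),
    Table("orders", references=["customers", "order_status"]),
    Table("order_status", original_data=True),
]
print(validate_schema(schema))  # prints []
```

Here `order_status` plays the role of the lookup table: it is referenced by `orders`, so it satisfies the first rule, and it is left out of the root count, so `customers` is the only root.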

5. Synth settings

The ‘Synth settings’ step allows you to configure options for the generative model that will produce the synthetic data.

Configuration validations

Each configuration step may have constraints that need to be satisfied to create a fully functional synthesis. The configuration process guides you through meeting these constraints by displaying alerts when they are not satisfied.

Choose a destination

After you complete the configuration, a dialog prompts you to choose a destination for the synthesized data. The available options are:

  • Aindo Cloud. Secure storage managed by Aindo, either on the cloud or locally for on-premise deployments.
  • An existing database. A database controlled locally by the user on their own device(s).

By default, the Aindo Cloud option is selected. If you prefer more control over your data, you can provide connection details to an existing database. All databases that are available when configuring a data source are also available as destinations, including PostgreSQL, MySQL, MariaDB and Google BigQuery.

Note: The provided database must allow write access, otherwise the synthetic dataset cannot be saved.
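
Before pointing a synthesis at an existing database, it can help to confirm write access yourself. The following is a minimal sketch, assuming a Python DB-API 2.0 connection; `can_write` and the scratch table name are hypothetical, and `sqlite3` is used only so the example runs without a server (it is not one of the destinations listed above).

```python
import sqlite3


def can_write(conn) -> bool:
    """Try to create, fill, and drop a scratch table on the given connection."""
    probe = "write_probe_tmp"  # hypothetical scratch table name
    try:
        cur = conn.cursor()
        cur.execute(f"CREATE TABLE {probe} (id INTEGER)")
        cur.execute(f"INSERT INTO {probe} (id) VALUES (1)")
        cur.execute(f"DROP TABLE {probe}")
        conn.commit()
        return True
    except Exception:
        # Drivers raise different error types; any failure here is treated
        # as "no write access" for the purpose of this sketch.
        conn.rollback()
        return False


conn = sqlite3.connect(":memory:")
print(can_write(conn))  # prints True
```

With a real destination, you would pass the connection object returned by that database's own driver. Permission models differ between database systems, so a successful probe in one schema does not guarantee write access everywhere.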

Download a synthetic dataset

Whether you choose to save the synthetic dataset on Aindo Cloud or on an existing database, you still have the option to download the entire dataset and save it locally on your computer. You can download any execution of a synthetic dataset from the preview page once its processing is finished.

The supported download formats are: .csv, .tsv, .xlsx, .ods, .parquet.
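
After downloading, a quick sanity check of the file can catch a truncated or mislabeled download. The sketch below covers only the two delimited text formats and uses the Python standard library; the `summarize` helper is hypothetical, and it assumes the file is UTF-8 encoded with a header row, which may not hold for your export.

```python
import csv
from pathlib import Path

# Delimiter for each delimited text format handled by this sketch.
DELIMITERS = {".csv": ",", ".tsv": "\t"}


def summarize(path):
    """Return (column_names, row_count) for a downloaded .csv or .tsv file."""
    path = Path(path)
    delimiter = DELIMITERS.get(path.suffix.lower())
    if delimiter is None:
        raise ValueError(f"unsupported extension for this sketch: {path.suffix}")
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader, [])  # assumption: first row holds column names
        rows = sum(1 for _ in reader)
    return header, rows
```

Calling `summarize` on a downloaded table returns its column names and number of data rows, which you can compare against what the preview page shows. The .xlsx, .ods, and .parquet formats need third-party libraries and are left out here.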

Execute the synthesis (timeline)

Once you have finished configuring the synthesis, clicking the final ‘create’ button will start the execution. Each execution consumes resources, defined in the platform as quotas. After the execution starts, you will be redirected to the view page, where the timeline dialog will automatically open to show the various steps of the synthesis process.

The steps involved in generating a synthesis are:

  • Data loading
  • Preprocessing
  • Building the Generative AI model
  • Training the Generative AI model
  • Synthesis (Output of the Generative AI model)
  • Report generation
  • Storing the newly synthesized data
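
As a way to picture how the timeline presents these steps, the sketch below models them as an ordered enumeration and labels each one as done, running, or pending. The `Step` enum and `timeline` function are hypothetical and are not part of the platform; the sketch assumes the steps run strictly in the order listed.

```python
from __future__ import annotations

from enum import Enum


class Step(Enum):
    """The synthesis steps, in the order they appear in the timeline."""
    DATA_LOADING = "Data loading"
    PREPROCESSING = "Preprocessing"
    BUILD_MODEL = "Building the Generative AI model"
    TRAIN_MODEL = "Training the Generative AI model"
    SYNTHESIS = "Synthesis"
    REPORT = "Report generation"
    STORE = "Storing the synthesized data"


def timeline(current: Step) -> list[str]:
    """Label every step relative to the one currently running."""
    steps = list(Step)  # Enum members iterate in definition order
    position = steps.index(current)
    lines = []
    for i, step in enumerate(steps):
        if i < position:
            state = "done"
        elif i == position:
            state = "running"
        else:
            state = "pending"
        lines.append(f"[{state}] {step.value}")
    return lines


for line in timeline(Step.TRAIN_MODEL):
    print(line)
```

With `Step.TRAIN_MODEL` as the current step, the first three steps are labeled done, training is labeled running, and the last three are labeled pending.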

Multiple executions

You can execute the same synthesis multiple times, which is useful if the source data changes over time or if you need to generate additional data. This is possible through the execution history side panel, where you can create, rename, and delete executions of a specific synthesis.

Troubleshooting

If you encounter any issues, here are some common problems and their solutions:

  • Create button returns configuration error: If there are no validation errors during configuration but the synthesis creation still fails, there may be other unsatisfied constraints. Carefully read the error message for hints about the failed configuration, or contact support for assistance.
  • Preview of data in configurator fails: The source data used to configure the synthesis may no longer be available or may have changed connection settings.
  • Dataset too small: If a dataset has too few data points, synthesis is not possible.
  • Quota exhausted: See quota errors.

FAQ

Q: What should I do if the “Create” button returns a configuration error?
A: If there are no visible validation errors but the synthesis creation still fails, check the error message for hints about the failed configuration. Ensure all constraints are satisfied. If the problem persists, contact Aindo support for assistance.

Q: Why does the preview of data in the configurator fail?
A: This issue may occur if the source data is no longer available or if its connection settings have changed. Verify that the source data is accessible and that connection settings are correct.

Q: What can I do if my dataset is too small to generate a synthesis?
A: If the dataset has too few data points, synthesis may not be possible. Consider gathering more data to increase the dataset size.

Q: What are quotas and how do they affect synthesis runs?
A: Quotas are resource limits defined on the platform. Each synthesis run consumes resources, which count against your quotas. If you run out of quota, you won’t be able to start new synthesis runs. Check your quota status and manage your usage accordingly.

Q: How do I know which columns are marked as sensitive?
A: In the ‘Data treatment’ step, a shield icon marks columns that contain sensitive data. A green shield means the column will be synthesized; a gray shield means it will not. To protect sensitive information, make sure these columns are set to be synthesized.