Create a synthetic dataset
To generate a synthetic dataset, first select a generator. You then start the process from the synthetic dataset creation dialog, where you can optionally specify a name.
Configuration steps
Before creating the synthetic dataset, you can review and modify some generation properties.
Generation settings
Number of records
: the number of records of the root table to be generated
Rebalance
: configure rebalancing for categorical columns. This option is available only when the rebalance data treatment is enabled for a column.
Tabular model parameters
Generation batch size
: the number of records that are synthesized in parallel
Text model parameters
The settings of the text model are customizable only when at least one column contains textual data.
Generation batch size
: the number of records that are synthesized in parallel
PDF report generation
In this section, you can adjust parameters related to PDF report generation, such as setting limits on the number of columns displayed in the output file.
Univariate distributions
: limit the number of columns that will be included in the ‘Univariate distributions’ section
Bivariate distributions
: limit the number of columns that will be included in the ‘Bivariate distributions’ section
k-NN analysis
: limit the number of columns that will be included in the ‘k-NN analysis’ section
PhiK analysis
: limit the number of columns that will be included in the ‘PhiK analysis’ section
Configuration validations
This configuration step may include constraints that must be satisfied to create a synthetic dataset. The configuration process guides you through meeting these constraints by displaying alerts when they are not satisfied.
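The idea of validating a configuration and surfacing one alert per unsatisfied constraint can be sketched as follows. This is an illustration only: the `GenerationSettings` class, its field names, and the specific constraints are assumptions made for the example, not the application's actual API or rules.

```python
from dataclasses import dataclass


@dataclass
class GenerationSettings:
    """Hypothetical container for the generation settings described above."""
    n_records: int            # number of records of the root table
    batch_size: int           # records synthesized in parallel
    report_max_columns: int   # column limit for a PDF report section


def validate(settings: GenerationSettings) -> list[str]:
    """Return one alert message per unsatisfied constraint (empty if valid).

    The constraints below are assumed for illustration.
    """
    alerts = []
    if settings.n_records < 1:
        alerts.append("Number of records must be at least 1.")
    if settings.batch_size < 1:
        alerts.append("Generation batch size must be at least 1.")
    if settings.report_max_columns < 0:
        alerts.append("PDF report column limits cannot be negative.")
    return alerts


if __name__ == "__main__":
    for alert in validate(GenerationSettings(0, 32, 10)):
        print("Alert:", alert)
```

An empty list means every assumed constraint is satisfied and creation could proceed.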
Choose a destination
After selecting the configuration, you will be prompted with a dialog where you can choose a destination for writing the synthesized data.
See the sections below for more details about each choice.
Application storage
A storage managed by the application, either on the cloud or locally for on-premises deployments. This is the default destination.
Remote database destination
A remote relational database for which you can provide connection details.
All databases that are available when configuring a data source are also available as destinations. This includes PostgreSQL, MySQL, MariaDB, Google BigQuery, Microsoft SQL Server, and Oracle Database.
Note: The provided connection details must allow write access; otherwise, the synthetic dataset cannot be saved.
Remote object storage destination
A remote object storage for which you can provide connection details.
Note: The provided connection details must allow write access; otherwise, the synthetic dataset cannot be saved.
All object storages that are available when configuring a data source are also available as destinations. This includes S3 and Google Cloud Storage.
When using object storage as a destination, object keys will correspond to the names of contained tables: these do not include file extensions such as .csv unless they are also present in the table name.
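The key naming rule can be illustrated with a small sketch. The function names and the example table names are hypothetical; the code only mirrors the rule stated above (the key is the table name, unchanged) and does not call any storage service.

```python
from pathlib import PurePosixPath


def object_keys(table_names):
    """Map each table name to its object key: the table name, unchanged."""
    return {name: name for name in table_names}


def has_extension(key):
    """Report whether a key ends with a file extension such as .csv."""
    return PurePosixPath(key).suffix != ""


if __name__ == "__main__":
    # Hypothetical table names: only the second one carries an extension.
    for table, key in object_keys(["customers", "orders.csv"]).items():
        print(table, "->", key, "| extension:", has_extension(key))
```

A downstream tool that infers the file type from the key would therefore find no extension on `customers`, so the type may need to be specified explicitly.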
Download a synthetic dataset
Whether you choose to save the synthetic dataset in the application storage or in an existing database, you still have the option to download the entire dataset and save it locally on your computer. You can download any synthetic dataset from its view page.
The supported file extensions you can choose from are: .csv, .tsv, .xlsx, .ods, .parquet.
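As a minimal sketch of working with a downloaded file, the snippet below reads a .csv or .tsv table using only the Python standard library. It assumes the file is UTF-8 encoded, has a header row, and uses a comma (.csv) or tab (.tsv) delimiter; these are assumptions about the export, not documented guarantees. The other formats need third-party libraries and are not handled here.

```python
import csv
from pathlib import Path

# Extensions listed above as download options.
SUPPORTED_EXTENSIONS = {".csv", ".tsv", ".xlsx", ".ods", ".parquet"}
# Assumed delimiters for the two plain-text formats.
DELIMITERS = {".csv": ",", ".tsv": "\t"}


def read_delimited(path):
    """Read a downloaded .csv or .tsv file into a list of row dictionaries."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    if suffix not in DELIMITERS:
        raise NotImplementedError(f"{suffix} requires a third-party reader")
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=DELIMITERS[suffix]))
```

For example, `read_delimited("customers.csv")` would return one dictionary per record, keyed by column name, where `customers.csv` is a hypothetical file name.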
Multiple synthetic datasets
You can generate as many synthetic datasets as you want from a single generator, which lets you test different configurations. All synthetic datasets belonging to a generator are listed on the generator page.
Troubleshooting
If you encounter any issues, here are some common problems and their solutions:
- Create button returns configuration error: If there are no validation errors during configuration but the synthesis creation still fails, there may be other unsatisfied constraints. Carefully read the error message for hints about the failed configuration, or contact support for assistance.
- Quota limit reached: See quota errors
FAQ
Q: What should I do if the “Create” button returns a configuration error?
A: If there are no visible validation errors but the synthesis creation still fails, check the error message for hints about the failed configuration. Ensure all constraints are satisfied. If the problem persists, contact Aindo support for assistance.
Q: What are quotas and how do they affect the generation?
A: Quotas are resource limits defined on the platform. Each synthetic data generation consumes resources, which count against your quotas. If you run out of quota, you won’t be able to generate new synthetic datasets. Check your quota status and manage your usage accordingly.