Data preparation
High-quality, well-formatted real data are essential for creating valuable synthetic data. The structure of the real data can impact the difficulty of learning its underlying statistical distribution. Fortunately, simple procedures can drastically improve the quality of data sources, and thereby also that of generated synthetic data. In what follows, we will present some best practices for data preparation to help you achieve the best synthetization process.
Subjects and components
The first step in setting up a good data synthetization is understanding the concept of independent and identically distributed (IID) components. Aindo’s models infer the distributions of such IID components. In relational data, rows may not be IID. For example, consider a database consisting of a table of client data and a table of purchase data. One client may be linked to multiple purchases, so purchases made by the same client are generally not independent of one another. The rows in the table of purchases are therefore not IID. In such cases, Aindo’s synthetic data generator will identify IID components: rather than looking at individual records, the generator looks at bigger components that do satisfy the IID condition.
Independent Identically Distributed Components
- Components: A component is the set of all rows in the relational tabular data that refer to the same entity or subject, linked by a series of relations (foreign keys) either directly or indirectly through other rows. A component thus includes all rows that can be reached starting from one row by iteratively traversing the foreign keys in any direction.
- Independent: Each component does not depend on any other component. Therefore, there is no relation between the attributes of two different components.
- Identically Distributed: All components come from the same underlying probability distribution.
A component represents a single entity or subject and should contain all the available information in the data referring to it.
When there is only one root table, the case currently supported by the Aindo platform, each component consists of one row of the root table and all the rows of the other tables that can reach it by iteratively navigating the foreign keys. Each row of the root table uniquely identifies a component.
We will illustrate the above concepts with a practical example. Let’s consider a relational tabular dataset with three tables “Accounts”, “Transactions”, and “Investments”.
- Accounts Table (Root): This table contains information about individual customer accounts, such as the account ID, account holder’s name, and account type. Each row in the “Accounts” table is uniquely identified by the account ID.
- Transactions Table: This table records financial transactions, including details like transaction ID, date, amount, and the associated account ID. The account ID in the “Transactions” table connects each transaction to the respective account in the “Accounts” table.
- Investments Table: Here, data related to customer investments is stored, including investment ID, type of investment, quantity, and associated account ID. The account ID links each investment to the corresponding account in the “Accounts” table.
In this case, each component comprises a row of the “Accounts” table and all rows in the “Transactions”
and “Investments” tables connected to it by a foreign key.
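As a rough illustration of how components are formed in this example, the following sketch groups the child rows under the root row they reference. It is not how the Aindo platform builds components internally; the column names (`account_id`, `transaction_id`, `investment_id`), the toy rows, and the `build_components` helper are assumptions made up for this example.

```python
# Hypothetical toy data: "Accounts" is the root table, the other two are child tables.
accounts = [{"account_id": 1}, {"account_id": 2}]
transactions = [
    {"transaction_id": 10, "account_id": 1},
    {"transaction_id": 11, "account_id": 1},
    {"transaction_id": 12, "account_id": 2},
]
investments = [{"investment_id": 100, "account_id": 2}]


def build_components(accounts, transactions, investments):
    """Group every child row under the root row it points to via its foreign key."""
    components = {
        row["account_id"]: {"account": row, "transactions": [], "investments": []}
        for row in accounts
    }
    # Child rows keep their original order inside each component.
    for row in transactions:
        components[row["account_id"]]["transactions"].append(row)
    for row in investments:
        components[row["account_id"]]["investments"].append(row)
    return components


components = build_components(accounts, transactions, investments)
```

Each key of `components` is one row of the root table, so the number of components equals the number of accounts.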
Components are independent if none of the attributes of rows belonging to one component depends on the values of the
attributes of other components. For example, if a customer had more than one
account, the components would NOT be independent, as the transactions and investments of one
account could influence, directly or indirectly, those of the other accounts of the same customer.
Components are identically distributed if they originate from the same statistical distribution, meaning they are
all equivalent a priori (i.e. without knowing any of their attributes) and there is no trend among the customers.
For instance, if each customer had a different fee plan for transactions and investments based on some external factor
(which is neither recorded nor inferable from the attributes), the components would not be identically distributed,
as this would impact the kind and number of transactions and investments of each account.
Best practices: obtaining IID components
The synthetic data created using the Aindo platform will consist of independent and identically
distributed components. This may create discrepancies with real data if the latter contains non-IID components. For
example, interdependencies between components will not be reproduced in the synthetic data.
Since components depend on the relational structure of the tabular data, it is often possible to
change the relational structure to make the components IID.
Here are some best practices to ensure your real relational tabular data are composed of IID components:
- Focus on data integrity. Ensure that all rows containing information about a single entity or subject are reachable from one another by recursively traversing the foreign keys, so that they belong to the same component. If they are not, consider adding a foreign key to connect the two components containing information about the same entity.
- Use root tables to connect interdependent components. When there is only one root table, components consist of one row of the root and all rows hierarchically connected to it via foreign keys. If the components are not independent, it is possible to add a table with just a primary key (which will become the new root table) and connect to the same key all rows of the former root table belonging to components that are not independent.
- Pay attention to temporal and sequential information. Date and/or time (or other sequential information) columns can help identify components that are not independent. Ensure that the specified date/time provides all necessary information about the component or that other components do not add to this information. For example, in a time series of transactions, each transaction likely depends on previous ones. Make sure that all rows belonging to the same time series are reachable from one another by traversing the foreign keys.
- Pay attention to the row order for time series. When dealing with time series or other ordered sequential information (i.e. there is a “past” and a “future”), it is important that the corresponding tables are ordered according to the natural temporal order. In fact, when constructing components, the order of rows belonging to the same component will be preserved. Notice that the generated synthetic tables will not necessarily preserve this order, and they may need to be sorted afterward.
- Add extra columns to account for trends. When the components are not identically distributed, consider adding new attributes or tables to account for some of the overall trends. For instance, a trend can be captured by adding the time and/or date of the events. When doing so, make sure that the components remain independent, i.e. in the previous example, that the trend is completely captured by the date/time of the event and not by previous occurrences.
- Separate entities/subjects from events. Oftentimes, data contain information about entities or subjects and some recorded events for each of them. It is better to store the general “static” information about entities/subjects and the “dynamic” information introduced by the events in separate tables. In general, entity/subject information should be contained in the root table, while event information should be collected in child tables. For example, a dataset could comprise general information about a patient and a series of medical examinations they underwent. In this situation, ensure that the patient information is completely contained in the root table and create a child table for the medical examinations, each pointing to the corresponding patient in the root table.
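The practice of adding a new root table to connect interdependent components can be sketched as follows. This is only an illustration under stated assumptions: the `holder` column is a hypothetical attribute that identifies which accounts belong to the same customer, and `add_root_table` is a helper invented for this example, not part of the Aindo platform.

```python
def add_root_table(accounts, group_key):
    """Create a key-only root table and link every account row to it.

    Accounts sharing the same value of `group_key` receive the same
    `customer_id`, so they end up in the same component.
    """
    customer_ids = {}
    linked_accounts = []
    for row in accounts:
        group = row[group_key]
        # Assign a new surrogate key the first time a group is seen.
        customer_id = customer_ids.setdefault(group, len(customer_ids) + 1)
        linked_accounts.append({**row, "customer_id": customer_id})
    # The new root table contains only the primary key.
    customers = [{"customer_id": cid} for cid in customer_ids.values()]
    return customers, linked_accounts


# Hypothetical former root table: "alice" holds two accounts.
accounts = [
    {"account_id": 1, "holder": "alice"},
    {"account_id": 2, "holder": "bob"},
    {"account_id": 3, "holder": "alice"},
]
customers, linked_accounts = add_root_table(accounts, "holder")
```

After this change, "Customers" acts as the root table and the two accounts held by the same customer belong to a single component.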
Information redundancy
Information redundancy can impact the quality of the generated synthetic data, forcing the Aindo model to learn unnecessary information. It is thus important to ensure that the real data contain only the necessary information to replicate its statistical properties. As a rule of thumb, any information deterministically derived from one or more attributes of the data should not be part of the synthetization process but should be reconstructed from the synthetic data using only the necessary columns.
Use only the necessary information when training the Aindo model. Derived information can be reconstructed later from the synthetic data.
Best practices
Here are some best practices to minimize information redundancy:
- Remove duplicated columns.
- Define lookup tables. If there are columns whose values only appear in fixed combinations, store the possible combinations in a lookup table. For example, if a table contains a column with country names and another with the corresponding country codes, create a lookup table containing these two columns and reference it from the original table.
- Remove all the columns that are a known function of one or more other columns. These can be reconstructed after synthetization. For example, if you have a “total spending” column and an “average spending per month” column obtained by dividing “total spending” by 12, only use the “total spending” column for synthetization and create the “average spending per month” column by dividing the synthetic “total spending” column by 12.
Some of these best practices also apply when the columns belong to different tables, as long as the information redundancy occurs among rows of the same component. For instance, the “total spending” and “average spending per month” columns from the last example could appear in two different tables while preserving their relationship. However, it is generally better to collect all strongly correlated columns in the same table, which also makes it easier to identify columns related by functional relations or whose values appear in fixed combinations.
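The two examples above (a derived column and a fixed combination of columns) could be handled along the following lines. This is a minimal sketch using plain Python dictionaries as rows; the column names (`total_spending`, `avg_monthly_spending`, `country_code`, `country`) and the helper functions are hypothetical and chosen only for illustration.

```python
def drop_derived(rows):
    """Remove the column that is a known function of another column."""
    return [
        {k: v for k, v in row.items() if k != "avg_monthly_spending"}
        for row in rows
    ]


def restore_derived(rows):
    """Rebuild the derived column from the (synthetic) total spending."""
    return [{**row, "avg_monthly_spending": row["total_spending"] / 12} for row in rows]


def extract_lookup(rows, key_col, value_col):
    """Move a fixed (key, value) combination into a lookup table."""
    lookup = {}
    slim_rows = []
    for row in rows:
        known = lookup.setdefault(row[key_col], row[value_col])
        if known != row[value_col]:
            raise ValueError(f"{key_col}={row[key_col]!r} maps to several values")
        # Keep only the key column in the original table.
        slim_rows.append({k: v for k, v in row.items() if k != value_col})
    return lookup, slim_rows


# Hypothetical real data with two kinds of redundancy.
real = [
    {"country_code": "IT", "country": "Italy", "total_spending": 1200.0, "avg_monthly_spending": 100.0},
    {"country_code": "FR", "country": "France", "total_spending": 600.0, "avg_monthly_spending": 50.0},
    {"country_code": "IT", "country": "Italy", "total_spending": 240.0, "avg_monthly_spending": 20.0},
]
lookup, training_rows = extract_lookup(drop_derived(real), "country_code", "country")
```

Here `training_rows` would be used for synthetization, while `lookup` and `restore_derived` would be applied to the synthetic output to rebuild the removed columns.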
Data cleaning and formatting
Standard data cleaning can benefit both synthetization quality and training speed. Here are some best practices for cleaning and formatting your data to get the most out of the Aindo platform:
- Fix typos. Typos can impact both synthetization quality and processing/training speed.
- Keep only the necessary precision for numerical variables. High precision for numerical variables can increase training time and force the model to learn irrelevant details. Limit the precision of numerical variables to the number of significant digits. Ensure data manipulations do not artificially increase the number of digits of numerical data and double-check the true number of significant digits for your data.
- Ensure correct coordinates format. The coordinates column should contain a string with the format ‘latitude, longitude’. Latitude and longitude should be numbers specifying angles in degrees. Don’t insert the ° symbol. Latitude must be in the range [-90, 90], while longitude must be in the range [-180, 180), i.e. 180 itself is excluded.
- Pay attention to the date and time format. Valid formats are strings specifying the date and time (the format is inferred automatically) or Unix timestamps with string, integer, or float types. If using Unix timestamps, set the appropriate precision when configuring the source. Standard datetime formats are also accepted when connecting to a database.
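As an illustration of the precision and coordinates guidelines above, the following sketch shows one possible way to prepare such values before loading the data. The helper names and the choice of two decimal places are assumptions made for this example; the appropriate precision depends on your data.

```python
def format_coordinates(latitude, longitude):
    """Return a 'latitude, longitude' string in degrees, without the degree symbol."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range [-90, 90]: {latitude}")
    if not -180 <= longitude < 180:
        raise ValueError(f"longitude out of range [-180, 180): {longitude}")
    return f"{latitude}, {longitude}"


def limit_precision(value, decimals=2):
    """Drop digits beyond the assumed significant precision (2 decimals here)."""
    return round(value, decimals)


coordinates = format_coordinates(45.4642, 9.19)
amount = limit_precision(12.3400000001)
```

A longitude of exactly 180 is rejected by this check, so such values would have to be mapped to -180 beforehand.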