Supported data formats and types
In the Aindo platform, data manipulation is central to your workflow. Understanding different data formats and types is essential for efficient data handling and analysis.
Supported data formats
This section explains the supported data formats in the Aindo platform.
Tabular data
A tabular dataset is a structured collection of data organized in rows and columns, like a table. Tabular datasets offer a convenient way to store and analyze structured data thanks to their organized and easy-to-read format.
In a tabular dataset:
- Each row contains a single observation or data point. For example, in a dataset of customer information, each row might represent a different customer.
- Each column represents a particular attribute or feature of the data. For instance, in a dataset of sales transactions, columns might include attributes such as date, product name, quantity sold, and total revenue.
Tabular datasets are often used for tasks such as data exploration, visualization, modeling, and predictive analytics. They are versatile and can accommodate different types of data, including numerical, categorical, and textual information. The section Supported data types details the column types that can be processed in Aindo’s platform.
Relational tabular data
Information in a relational tabular dataset is stored in multiple interconnected tables.
For example, a financial institution might have data stored in two tables: Accounts and Transactions. Each transaction belongs to a single account, so that there are clear connections between the tables.
In a relational tabular dataset:
- Each table typically has a primary key, which is a unique identifier for each row in the table. This key is used to establish relationships with other tables.
- In tables that have relationships with other tables, foreign keys are used to reference the primary key of another table. These keys create links between related rows in different tables.
Together, these primary and foreign key relationships define the associations between the entities represented in the dataset.
In the finance example with two tables:
- The Accounts table contains information about individual customer accounts, such as the account ID, account holder’s name, and account type. Each row in the Accounts table is uniquely identified by the account ID, which serves as a primary key. The Accounts table does not refer back to any other table, making it a so-called root table.
- The Transactions table contains information about financial transactions, including details like transaction ID, date, amount, and the associated account ID. This account ID connects each transaction to its corresponding account in the Accounts table, serving as a foreign key.
In this setup, the primary and foreign key relationships enable connections between the Accounts and Transactions tables. For instance, one can track all transactions associated with a specific account.
Relational tabular databases can have any number of interconnected tables.
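As a minimal sketch of the Accounts and Transactions example, the snippet below builds the two tables in an in-memory SQLite database using Python's standard library. The table names, column names, and sample rows are illustrative assumptions and are not part of the Aindo platform; the point is only to show how a primary key and a foreign key link the two tables.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Root table: it has a primary key and references no other table.
conn.execute("""
    CREATE TABLE accounts (
        account_id   INTEGER PRIMARY KEY,
        holder_name  TEXT,
        account_type TEXT
    )
""")

# Child table: account_id is a foreign key pointing at the root table.
conn.execute("""
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        account_id     INTEGER NOT NULL REFERENCES accounts(account_id),
        date           TEXT,
        amount         REAL
    )
""")

conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "Alice", "checking"), (2, "Bob", "savings")],
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (10, 1, "2024-01-05", 120.0),
        (11, 1, "2024-01-09", -40.5),
        (12, 2, "2024-01-07", 300.0),
    ],
)


def transactions_for(account_id):
    """Follow the foreign key to list all transactions of one account."""
    rows = conn.execute(
        "SELECT transaction_id, amount FROM transactions "
        "WHERE account_id = ? ORDER BY transaction_id",
        (account_id,),
    )
    return rows.fetchall()


account_1_transactions = transactions_for(1)  # [(10, 120.0), (11, -40.5)]
```

With foreign key enforcement enabled, inserting a transaction whose account ID does not exist in the accounts table raises `sqlite3.IntegrityError`, which is what keeps the relationship between the tables consistent.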
Lookup tables
Databases often contain lookup tables. Unlike dynamic data tables, lookup tables typically contain predefined data that other tables can reference using foreign keys. These reference data usually remain constant over time, such as lists of countries, states, or product categories. They essentially serve as dictionaries within the database.
For instance, consider a financial database containing a CurrencyCodes lookup table. This table stores a list of standardized currency codes (e.g., USD for US Dollar, EUR for Euro, GBP for British Pound). Other tables, such as the Transactions or Accounts tables, can reference this lookup table using foreign keys to indicate the currency associated with each transaction or account. This ensures consistency in representing currency across the dataset, facilitating accurate financial analysis and reporting.
Given the static, non-personal nature of lookup tables and their foundational role in the database, it’s generally advisable not to synthesize them. Users can easily specify that these tables are exempt from synthesis, ensuring that the reference data remain unchanged.
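The role of a lookup table as a dictionary can be sketched in a few lines of plain Python. The data, the field names, and the helper `unknown_currency_rows` are hypothetical and only illustrate the idea; they do not reflect how the Aindo platform stores or validates lookup tables.

```python
# Lookup table: static reference data keyed by currency code (illustrative values).
currency_codes = {
    "USD": "US Dollar",
    "EUR": "Euro",
    "GBP": "British Pound",
}

# Rows of a hypothetical Transactions table; "currency" acts as a foreign key
# into the lookup table above.
transactions = [
    {"transaction_id": 1, "amount": 25.0, "currency": "USD"},
    {"transaction_id": 2, "amount": 99.9, "currency": "EUR"},
    {"transaction_id": 3, "amount": 12.5, "currency": "XXX"},  # not in the lookup table
]


def unknown_currency_rows(rows, lookup):
    """Return the IDs of rows whose currency code is missing from the lookup table."""
    return [row["transaction_id"] for row in rows if row["currency"] not in lookup]


invalid_ids = unknown_currency_rows(transactions, currency_codes)  # [3]
```

Because the lookup table is kept as-is rather than synthesized, a check like this one behaves the same on original and synthetic transactions.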
Time series
Time series data consists of a sequence of data points where each point depends on the previous ones, forming interdependencies within the dataset.
In the Aindo platform, each time series is treated as a single sample. To generate synthetic data from time series effectively, the Aindo platform requires a significant number of independent samples, each representing a distinct time series. This approach ensures that the generative model captures the underlying patterns and dependencies within the time series data, facilitating the creation of realistic synthetic time series.
To generate synthetic time series data with the Aindo platform, the data must be provided in a relational format. Specifically, the data should be structured such that there is a root table containing information about individuals (e.g., customers) and other tables containing information about transactions linked to these individuals.
For example, consider credit card transaction data. Each transaction forms part of a time series where the transaction amount, time, and location may depend on previous transactions made by the same user. The Aindo platform treats each user’s transaction history as a separate time series sample. By analyzing multiple independent samples of transaction histories, the Aindo platform can accurately model the patterns and dependencies within the data, enabling the generation of synthetic transaction data that mirrors real-world behaviors.
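To make the notion of "one time series per user" concrete, the sketch below groups the rows of a hypothetical transactions table by user and orders each history chronologically. The row layout `(user_id, timestamp, amount)` and the function name are assumptions for illustration; the Aindo platform derives this grouping from the relational structure itself.

```python
from collections import defaultdict

# Hypothetical child-table rows: (user_id, ISO 8601 timestamp, amount).
transactions = [
    ("u1", "2024-03-02T10:00:00", 12.0),
    ("u2", "2024-03-01T09:30:00", 80.0),
    ("u1", "2024-03-01T08:15:00", 5.5),
    ("u2", "2024-03-03T18:45:00", 42.0),
    ("u1", "2024-03-05T12:00:00", 30.0),
]


def group_into_series(rows):
    """Group rows by user so that each user's history is one time series sample."""
    series = defaultdict(list)
    for user_id, timestamp, amount in rows:
        series[user_id].append((timestamp, amount))
    for history in series.values():
        # ISO 8601 strings in the same format sort chronologically.
        history.sort()
    return dict(series)


series_by_user = group_into_series(transactions)
number_of_samples = len(series_by_user)  # 2 independent samples, one per user
```

Here the five transaction rows yield only two samples, which illustrates why a large number of distinct users, rather than a large number of rows, is what matters for time series generation.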
Supported data types
This section explains the different data types in the Aindo platform.
Numeric
- Content: A numeric column is used to store numerical data that can be either integers or real numbers (numbers with decimal points).
- Example: Suppose you have a dataset containing information about various products, including their prices. The price column would typically be a numeric column because prices can contain decimal values.
- Accepted format: floating point.
Integer
- Content: An integer column is used to store whole numbers without any decimal places.
- Example: Continuing with the product dataset example, you might have another column representing the quantity of each product in stock. Since you can’t have fractional quantities of products, this column would be an integer column.
- Distinguishing from Numeric: Integer columns do not include decimal points. If the data you’re dealing with consist solely of whole numbers, it’s likely an integer column.
- Accepted format: integer.
Categorical
- Content: A categorical column is used to store data that represent categories or groups.
- Example: In the product dataset, you might have a column representing the product categories, such as “Electronics”, “Clothing”, “Books”, etc. This column would be categorical because it represents distinct categories rather than numerical values.
- Distinguishing from Integer: Categorical columns contain values that represent categories or groups, not numerical values. Even if a categorical variable is encoded using numbers (e.g., ordinal encoding), it’s important to recognize whether these numbers represent quantitative values or serve as labels or identifiers for different categories. If the set of values does not have an inherent ordinal or numerical meaning (e.g. if we cannot claim that “3>1”), it should be considered categorical.
- Privacy Considerations: Synthetic data generated for categorical columns will retain the same categories as the original data. Therefore, personally identifiable information (e.g., names, phone numbers, social security numbers) should never be marked as categorical, as this may inadvertently lead to privacy risks or re-identification of individuals.
- Accepted format: strings or numbers uniquely identifying the categories.
Boolean
- Content: A boolean column stores data of an attribute that can only take one of two values, either “true” or “false”.
- Example: In the product dataset, you might have a column indicating if a product is available in stock. This column would be boolean because it has only two possible states.
- Accepted format:
  - a float: interpreted as True if the value is greater than 0 and False otherwise,
  - a string (case-insensitive):
    - “true”, “t”, “yes”, “y”, “vero”, “v”, “si”, “s”: True,
    - “false”, “f”, “no”, “n”, “falso”: False,
  - when connecting to a database, the boolean type is also accepted.
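The accepted boolean formats listed above can be summarized with a small parser. This is a sketch that mirrors the documented rules, not the platform's actual implementation: the function name `parse_boolean`, the stripping of surrounding whitespace, the handling of integers like floats, and the `ValueError` for unrecognized values are all assumptions.

```python
# String values taken from the accepted formats listed above.
TRUE_STRINGS = {"true", "t", "yes", "y", "vero", "v", "si", "s"}
FALSE_STRINGS = {"false", "f", "no", "n", "falso"}


def parse_boolean(value):
    """Interpret a raw value as a boolean following the documented rules."""
    if isinstance(value, bool):
        # Native booleans (e.g., coming from a database) pass through unchanged.
        return value
    if isinstance(value, (int, float)):
        # Numbers: True if greater than 0, False otherwise.
        # Treating integers like floats is an assumption of this sketch.
        return value > 0
    if isinstance(value, str):
        # Case-insensitive match; stripping whitespace is an assumption.
        text = value.strip().lower()
        if text in TRUE_STRINGS:
            return True
        if text in FALSE_STRINGS:
            return False
    # Rejecting anything else is an assumption of this sketch.
    raise ValueError(f"Cannot interpret {value!r} as a boolean")


examples = [parse_boolean(v) for v in ("Vero", "N", 0.5, 0.0, -1.0)]
# examples == [True, False, True, False, False]
```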
Date
- Content: A date column is used to store calendar dates.
- Example: In a sales dataset, you might have a column representing the date of each transaction.
- Distinguishing from Time and Datetime: Date columns only contain information about the date itself, without any reference to time. They are distinct from time and datetime columns, which include time information.
- Accepted format: strings specifying the date (the format is inferred automatically) or unix timestamps with string, integer, or float types. When connecting to a database, standard date formats are also accepted.
Time
- Content: A time column is used to store time of day data.
- Example: In a scheduling application, you might have a column representing the start time of each appointment.
- Distinguishing from Date and Datetime: Time columns only contain information about the time of day, without any reference to date. They are distinct from date and datetime columns, which include date information.
- Accepted format: strings specifying the time (the format is inferred automatically) or unix timestamps with string, integer, or float types. When connecting to a database, standard time formats are also accepted.
Datetime
- Content: A datetime column is used to store both date and time information.
- Example: In a log file, you might have a column representing the timestamp of each log entry.
- Distinguishing from Date and Time: Datetime columns contain both date and time information, unlike date and time columns which only contain one of these components.
- Accepted format: strings specifying the date and time (the format is inferred automatically) or unix timestamps with string, integer, or float types. When connecting to a database, standard datetime formats are also accepted.
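As an illustration of how the same Unix timestamp relates to the Date, Time, and Datetime types, the sketch below converts a timestamp given as a string, integer, or float with Python's standard library. Interpreting the timestamp in UTC is an assumption made here for reproducibility; this documentation does not state which time zone the platform applies, and the helper name is hypothetical.

```python
from datetime import datetime, timezone


def from_unix_timestamp(value):
    """Convert a Unix timestamp (str, int, or float) to a UTC datetime.

    UTC is an assumption of this sketch, not documented platform behavior.
    """
    return datetime.fromtimestamp(float(value), tz=timezone.utc)


moment = from_unix_timestamp("1700000000")

as_datetime = moment.isoformat()     # '2023-11-14T22:13:20+00:00' (date and time)
as_date = moment.date().isoformat()  # '2023-11-14' (date only)
as_time = moment.time().isoformat()  # '22:13:20' (time of day only)
```

The string, integer, and float forms of the timestamp all map to the same instant, so `from_unix_timestamp(1700000000)` and `from_unix_timestamp(1700000000.0)` return equal values.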
Geolocation
- Content: A geolocation column is used to store geographical coordinates, i.e., latitude and longitude values.
- Example: In a dataset containing information about store locations, you might have a column representing the latitude and longitude of each store.
- Accepted format: a string with the format ‘latitude, longitude’. Latitude and longitude should be floating-point numbers specifying angles in degrees. Latitude must be in the range [-90, 90] and longitude in the range [-180, 180].
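The geolocation format described above can be checked with a short validation function. This is a sketch under stated assumptions: the name `parse_geolocation` and the choice to raise `ValueError` on malformed or out-of-range input are illustrative and do not describe the platform's own parser.

```python
def parse_geolocation(text):
    """Parse a 'latitude, longitude' string into a (latitude, longitude) tuple.

    Raises ValueError if the string is malformed or a coordinate is out of range.
    """
    parts = text.split(",")
    if len(parts) != 2:
        raise ValueError(f"Expected 'latitude, longitude', got {text!r}")
    # float() tolerates surrounding whitespace, e.g. ' 9.19'.
    latitude, longitude = float(parts[0]), float(parts[1])
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"Latitude {latitude} is outside [-90, 90]")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"Longitude {longitude} is outside [-180, 180]")
    return latitude, longitude


coordinates = parse_geolocation("45.4642, 9.1900")  # (45.4642, 9.19)
```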
Text
- Content: A text column is used to store textual data.
- Example: In a customer feedback dataset, you might have a column representing the comments left by customers.
- Distinguishing from Categorical: Text columns contain free-form text data, whereas categorical columns contain predefined categories or groups. Upon generation, text columns may create strings not present in the original dataset, while categorical columns will only contain strings that were present in the original dataset.
- Accepted format: a string.