Similarity score

The Similarity score is an aggregate metric that summarizes how closely the synthetic data matches the real data. It serves as an indicator of the quality and fidelity of the synthetic data generated by the Aindo platform.

Calculation methodology

The Similarity score for a table is the mean of the Bivariate Similarity scores calculated across all pairs of variables within the table.

The Bivariate Similarity score measures how close the bivariate distribution of a pair of variables in the synthetic data is to the bivariate distribution of the same pair in the real data. For more on bivariate distributions, see the section about bivariance. The score is computed as 1 minus the Total Variation Distance. The Total Variation Distance is defined as one half of the sum, over all histogram bins, of the absolute differences between the relative frequencies observed in the real data and in the synthetic data.
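The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the Aindo platform's implementation: the function name `bivariate_similarity` is hypothetical, and the sketch assumes the values are categorical or already binned, since how numerical variables are binned is not specified here.

```python
from collections import Counter


def bivariate_similarity(real_pairs, synth_pairs):
    """Return 1 - Total Variation Distance for two lists of (x, y) pairs.

    Assumption: each distinct (x, y) pair is treated as one histogram bin.
    """
    real_counts = Counter(real_pairs)
    synth_counts = Counter(synth_pairs)
    n_real = len(real_pairs)
    n_synth = len(synth_pairs)

    # Every bin that occurs in either dataset contributes to the sum.
    bins = set(real_counts) | set(synth_counts)
    tvd = 0.5 * sum(
        abs(real_counts[b] / n_real - synth_counts[b] / n_synth) for b in bins
    )
    return 1.0 - tvd


# Toy data: two categorical variables, four rows per dataset.
real = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "x")]
synth = [("a", "x"), ("b", "y"), ("b", "y"), ("b", "x")]

score = bivariate_similarity(real, synth)
print(score)  # 0.75
```

In this toy example the relative frequencies differ by 0.25 in two bins and match in the third, so the Total Variation Distance is 0.25 and the score is 0.75.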

Interpretation

The Bivariate Similarity score is calculated for every pair of variables in the dataset and is a number between 0 and 1. A higher score (closer to 1) indicates greater similarity between the real and synthetic data distributions for the corresponding pair of variables. Conversely, a lower score (closer to 0) indicates greater dissimilarity, pointing to potential discrepancies between the generated synthetic data and the real data.

The Similarity score for a table is the mean of all Bivariate Similarity scores. When the Similarity score deviates significantly from 1, refer to the Troubleshooting section to identify potential sources of discrepancy and adjust the synthetic data generation process accordingly.
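As a rough illustration of the table-level aggregation, the sketch below averages the Bivariate Similarity scores over all pairs of columns. The names `table_similarity` and `bivariate_similarity` are hypothetical, and the sketch assumes that "pairs" means unordered pairs of distinct columns and that values are categorical or already binned. The per-pair scores are returned alongside the mean, which may help locate the variables responsible for a low overall score.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def bivariate_similarity(real_pairs, synth_pairs):
    """Return 1 - Total Variation Distance for two lists of (x, y) pairs."""
    real_counts = Counter(real_pairs)
    synth_counts = Counter(synth_pairs)
    bins = set(real_counts) | set(synth_counts)
    tvd = 0.5 * sum(
        abs(real_counts[b] / len(real_pairs) - synth_counts[b] / len(synth_pairs))
        for b in bins
    )
    return 1.0 - tvd


def table_similarity(real, synth):
    """Average the bivariate scores over all unordered column pairs.

    real, synth: dicts mapping a column name to its list of values.
    Assumption: both tables share the same column names.
    """
    pair_scores = {}
    for col_a, col_b in combinations(sorted(real), 2):
        real_pairs = list(zip(real[col_a], real[col_b]))
        synth_pairs = list(zip(synth[col_a], synth[col_b]))
        pair_scores[(col_a, col_b)] = bivariate_similarity(real_pairs, synth_pairs)
    return mean(list(pair_scores.values())), pair_scores


# Toy tables: only the "color" column differs between real and synthetic.
real_table = {
    "color": ["r", "r", "g", "g"],
    "size": ["s", "l", "s", "l"],
    "shape": ["o", "o", "o", "x"],
}
synth_table = {
    "color": ["r", "g", "g", "g"],
    "size": ["s", "l", "s", "l"],
    "shape": ["o", "o", "o", "x"],
}

overall, per_pair = table_similarity(real_table, synth_table)
# List the pairs from least to most similar.
for pair, pair_score in sorted(per_pair.items(), key=lambda item: item[1]):
    print(pair, round(pair_score, 3))
print("table score:", round(overall, 3))
```

Here the two pairs that involve the altered column score 0.75, while the untouched pair scores 1, giving a table score of about 0.833.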