Eval module

After generating the synthetic data, the user may call the report() function from the eval module to evaluate the output on two fronts: synthetic data quality, i.e. the similarity between real and synthetic data, and privacy protection of the real data. The function outputs a PDF report displaying the key evaluation metrics:

  • The similarity matrix measures how closely the univariate and bivariate distributions of the synthetic data match those of the real data, on a scale from 0 to 1, with higher values indicating closer alignment. A score near 1 means faithful reproduction, while lower scores pinpoint areas of deviation. This makes the matrix a quick way to assess the quality of the synthetic data.
  • The 1-NN distance distribution plot shows, for each data point of the original (respectively synthetic) dataset, the distance to the closest data point in the original (respectively synthetic) dataset. The resulting histograms give insight into the spatial similarities and dissimilarities between the real and synthetic datasets.
  • The proximity ratios plot shows, for each real data point, the ratio between the distance to the closest synthetic data point and the distance to the closest real data point. When several synthetic data points cluster around real data points, this may indicate a privacy leak. Such a leak can only occur for real data points used during training. It cannot occur for data points excluded from the training set, since the model has never seen them and cannot have memorized them. Therefore, the report shows the proximity ratio histograms separately for the train and test data points of the real dataset. A higher density of train-synth than test-synth ratios in the lower tail suggests that some synthetic data points cluster around training data points more than around test data points.
  • The Phik correlation matrix measures the association between pairs of variables in a dataset, whether categorical or numerical. Unlike traditional correlation measures designed for numerical data, Phik also handles categorical variables and captures both linear and non-linear associations. Its values range from 0 to 1, where 0 means no association and 1 a perfect one.
  • Univariate distributions for each column and bivariate distributions for each pair of columns belonging to the same table.
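
As a rough illustration of the two distance-based plots described above, the following sketch computes 1-NN distances and proximity ratios on toy numeric points. It is not the implementation used by report(): the function names, the Euclidean metric on plain tuples, the exclusion of a point from its own neighbour search, and the use of the training set as the reference for the "closest real data point" are all assumptions made for this example.

```python
import math


def nn_distance(point, others):
    """Euclidean distance from `point` to its nearest neighbour in `others`."""
    return min(math.dist(point, other) for other in others)


def one_nn_distances(dataset):
    """For each point, the distance to the closest *other* point of the same dataset."""
    return [
        nn_distance(p, dataset[:i] + dataset[i + 1:])
        for i, p in enumerate(dataset)
    ]


def proximity_ratios(points, synth, train):
    """For each real point: d(closest synthetic point) / d(closest training point).

    Assumption: the closest real point is searched in `train`, skipping the
    point itself (by identity) when it belongs to the training set.
    Exact duplicates in `train` would give a zero denominator and are not handled.
    """
    ratios = []
    for p in points:
        d_synth = nn_distance(p, synth)
        d_real = nn_distance(p, [q for q in train if q is not p])
        ratios.append(d_synth / d_real)
    return ratios


# Toy data (hypothetical): 2D points standing in for encoded table rows.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
test = [(3.0, 4.0)]
synth = [(0.0, 0.1), (1.0, 1.0)]

train_ratios = proximity_ratios(train, synth, train)
test_ratios = proximity_ratios(test, synth, train)
print("1-NN distances (train):", one_nn_distances(train))
print("train-synth ratios:", train_ratios)
print("test-synth ratios:", test_ratios)
```

In this toy case the first training point has a synthetic point much closer to it than any other training point, so its ratio falls well below 1. A concentration of such small ratios among training points, but not among test points, is the pattern the proximity ratios plot is meant to expose.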

The report() function takes as input (a part of) the data used in training, some holdout test data, (a part of) the generated synthetic data (all in the form of RelationalData objects), and an output path for the PDF file:

from aindo.rdml.eval import report
from aindo.rdml.relational import RelationalData

# Build the relational data object and split it into train and test sets
data = RelationalData(data=..., schema=...)
data_train, data_test = data.split(ratio=0.1)

# Generate synthetic data
data_synth = ...