Eval
The aindo.rdml.eval module allows the user to:
- Generate a PDF report to evaluate the quality of the generated synthetic data, in terms of both similarity and privacy protection.
- Compute additional, in-depth statistics for the privacy metric.
Synthetic data report
After generating the synthetic data, the user may use the report() function from the aindo.rdml.eval module to evaluate the output in terms of both synthetic data quality, i.e. the similarity between real and synthetic data, and real data privacy protection.
This function outputs a PDF displaying the key metrics for the evaluation:
- The similarity matrix measures the similarity of univariate and bivariate distributions between real and synthetic data, on a scale from 0 to 1, with higher values indicating closer alignment. A score near 1 means faithful reproduction, while lower scores pinpoint areas of deviation. This matrix acts as a valuable tool for evaluating the quality of the synthetic data.
- The 1-NN distance distribution plot shows, for each data point of the original (respectively synthetic) dataset, the distance to the closest data point in the original (respectively synthetic) dataset. The resulting histograms give insight into the spatial similarities and dissimilarities between the real and synthetic datasets.
- The proximity ratios plot is a tool for identifying potential privacy leaks in synthetic data. The underlying idea is that synthetic data points clustering close to real data points may indicate a privacy leak. Such clustering can only occur for real data points that were included in the training process, since data points excluded from the training set are never seen and thus cannot be memorized by the model.
The proximity ratio is computed as follows: for each real data point, take the ratio of the distance to the nearest synthetic data point over the distance to the nearest real data point. By analyzing the distribution of these ratios, we can detect whether synthetic data points accumulate significantly around certain real data points. To establish a baseline, a “perfect” synthetic dataset is obtained by splitting the real dataset into two subsets, R1 and R2. We then compute and plot histograms for two proximity ratios: the Train to Synthetic Proximity Ratio (TSPR), i.e. the proximity ratio between R1 and the synthetic dataset S, and the Train to Train Proximity Ratio (TTPR), i.e. the proximity ratio between R1 and R2 (an illustrative computation of both is sketched after this list).
A higher density of TSPR values compared to TTPR values in the left tail of the histogram suggests that some synthetic data points cluster around training data points more than expected, which may indicate a privacy concern. For this reason, these distributions are used to compute the privacy score.
- The Phik correlation matrix evaluates the relationships between categorical and numerical variables in a dataset. Unlike traditional correlation measures for numerical data, Phik also handles categorical variables, capturing both linear and non-linear associations. Its values range from 0 to 1, where 0 implies no association and 1 indicates a perfect one (a sketch with the open-source phik package follows this list).
- Univariate distributions for each column and bivariate distributions for each pair of columns belonging to the same table.
- The Privacy Score assesses numerically the presence of privacy leaks: it is computed by dividing the percentage of TSPR values below a given threshold by the percentage of TTPR values below the same threshold. The threshold is set as the q-quantile of TTPR values, with a default q value of 0.1. The score ranges from 0 to 100, with lower values indicating a higher risk of privacy leaks.
- The Similarity Score is the average of the similarity matrix values, multiplied by a scale factor. It provides a global measure of the similarity between the real and synthetic data. A score of 100 indicates perfect similarity.
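To make the proximity ratios concrete, here is an illustrative sketch that computes TSPR and TTPR for a single table of numeric features using NumPy and SciPy. It is not the library's implementation: the random data, the Euclidean distance, and the 0.1 quantile are assumptions made for the example only.

```python
# Illustrative sketch of the proximity-ratio computation (not the library code).
import numpy as np
from scipy.spatial import cKDTree

def proximity_ratios(reference: np.ndarray, other: np.ndarray) -> np.ndarray:
    """For each point in `reference`, return the distance to its nearest
    neighbour in `other` divided by the distance to its nearest neighbour
    within `reference` itself."""
    d_other, _ = cKDTree(other).query(reference, k=1)
    # k=2 because the nearest neighbour of a point within its own set is the point itself.
    d_self, _ = cKDTree(reference).query(reference, k=2)
    return d_other / d_self[:, 1]

rng = np.random.default_rng(0)
real = rng.normal(size=(2_000, 4))        # stand-in for the real table
synthetic = rng.normal(size=(1_000, 4))   # stand-in for the synthetic table

# Baseline: split the real data into R1 and R2 (the "perfect" synthetic data).
r1, r2 = real[:1_000], real[1_000:]

tspr = proximity_ratios(r1, synthetic)    # Train to Synthetic Proximity Ratio
ttpr = proximity_ratios(r1, r2)           # Train to Train Proximity Ratio

# The threshold used by the privacy metrics is the q-quantile of the TTPR values.
q = 0.1
threshold = np.quantile(ttpr, q)
print("fraction of TSPR below threshold:", np.mean(tspr < threshold))
print("fraction of TTPR below threshold:", np.mean(ttpr < threshold))  # ~= q
```

The Phik correlation matrix can be explored with the open-source phik package, as in the sketch below; this only illustrates the metric on made-up data and does not necessarily mirror how the report computes it.

```python
# Illustrative Phik correlation matrix on toy data, using the phik package.
import numpy as np
import pandas as pd
import phik  # noqa: F401  (registers the .phik_matrix accessor on DataFrames)

rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 80, size=n)
city = rng.choice(["Rome", "Milan", "Turin"], size=n)    # categorical column
income = 1_000 * age + rng.normal(0, 10_000, size=n)     # correlated with age

df = pd.DataFrame({"age": age, "city": city, "income": income})

# `interval_cols` marks the numerical columns; the others are treated as categorical.
corr = df.phik_matrix(interval_cols=["age", "income"])
print(corr)  # values in [0, 1]: 0 = no association, 1 = perfect association
```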
The report function takes as input (a part of) the data used in training, some held-out test data, (a part of) the generated synthetic data (all in the form of RelationalData objects), and an output path for the PDF file.
The user may optionally specify the maximum number of samples to use for each table of the training data (n_max_train) and of the test data (n_max_test), and which columns to use for each section of the report and for each table.
For further information, please refer to the API reference.
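A hypothetical usage sketch is given below. The argument order follows the description above, but the exact signature and the names of the data and path arguments are assumptions and should be checked against the API reference; data_train, data_test, and data_synth stand for RelationalData objects prepared earlier in the workflow.

```python
# Hypothetical usage sketch of report(); only n_max_train and n_max_test are
# documented names, the remaining argument order and values are assumptions.
from aindo.rdml.eval import report

report(
    data_train,           # (part of) the data used for training (RelationalData)
    data_test,            # held-out test data (RelationalData)
    data_synth,           # (part of) the generated synthetic data (RelationalData)
    "eval_report.pdf",    # output path for the PDF report
    n_max_train=50_000,   # optional cap on samples per table of the training data
    n_max_test=10_000,    # optional cap on samples per table of the test data
)
```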
Additional statistics for the privacy metric
The compute_privacy_stats function performs a more refined analysis of the privacy score.
It takes as input (a part of) the training data and (a part of) the generated synthetic data, and returns a dictionary mapping each table to a PrivacyStat object, which contains the following attributes:
- privacy_score: The privacy score, as presented in the Synthetic data report.
- privacy_score_std: An estimate of its standard deviation.
- risk: An estimate of the fraction of training points at risk of re-identification. It uses the same percentages as the privacy score, but is evaluated as the difference between the two values: the fraction of TSPR records below the threshold minus the fraction of TTPR records below the same threshold. For example, if 3% of the TSPR values and 1% of the TTPR values fall below the threshold, the estimated fraction at risk is 2%. If this difference is negative, meaning that the synthetic data are statistically more distant from the training data than the test data are, the fraction at risk is set to 0.
The user can provide some optional parameters to control the output scores:
- q: The quantile used to compute the privacy score and the number of records at risk of re-identification.
- risk_confidence: A confidence parameter for the estimation of the number of records at risk of re-identification. If provided, the estimated number of records below the threshold, n_risk, used to compute the fraction at risk is corrected by a term of -risk_confidence * sqrt(n_risk).
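A hypothetical sketch of how the function and the PrivacyStat attributes described above might be used follows; apart from the documented names (compute_privacy_stats, PrivacyStat, q, risk_confidence, privacy_score, privacy_score_std, risk), the argument order and the illustrative values are assumptions.

```python
# Hypothetical usage sketch of compute_privacy_stats(); the positional
# arguments and the parameter values are illustrative assumptions.
from aindo.rdml.eval import compute_privacy_stats

privacy_stats = compute_privacy_stats(
    data_train,            # (part of) the training data (RelationalData)
    data_synth,            # (part of) the generated synthetic data (RelationalData)
    q=0.1,                 # quantile of TTPR values defining the threshold
    risk_confidence=0.95,  # illustrative value for the records-at-risk correction
)

# The result maps each table name to a PrivacyStat object.
for table, stats in privacy_stats.items():
    print(
        f"{table}: privacy_score={stats.privacy_score:.1f} "
        f"(std={stats.privacy_score_std:.1f}), "
        f"fraction at risk={stats.risk:.2%}"
    )
```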
For more details and the full list of parameters, please refer to the API reference.