Eval
The aindo.rdml.eval module allows the user to:
- Generate a PDF report to evaluate the quality of the generated synthetic data, in terms of both similarity and privacy protection.
- Compute additional, in-depth statistics for the privacy metric.
Synthetic data report
After generating the synthetic data, the user may use the report() function from the aindo.rdml.eval module to evaluate the output in terms of both synthetic data quality, i.e. the similarity between real and synthetic data, and real data privacy protection.
This function outputs a PDF displaying the key metrics for the evaluation:
- A similarity matrix measures the similarity of univariate and bivariate distributions between real and synthetic data, on a scale from 0 to 1, with higher values indicating closer alignment. A score near 1 means faithful reproduction, while lower scores pinpoint areas of deviation. This matrix is a valuable tool for evaluating the quality of the synthetic data.
- The 1-NN distance distribution plot shows, for each data point of the original (respectively synthetic) dataset, the distance to the closest other data point in the original (respectively synthetic) dataset. The resulting histograms give insight into the spatial similarities and dissimilarities of the real and synthetic datasets.
- The proximity ratios plot is useful to catch a potential privacy leak in the synthetic data. The idea is that when several synthetic data points cluster around real data points, we might be in the presence of such a leak. This may only happen for real data points used during training: data points excluded from the training set have never been seen by the model and cannot be memorized by it. Consider the following distribution (proximity ratio): for each real data point, take the ratio between the distance to the closest synthetic data point and the distance to the closest real data point. In order to check whether there is an anomalous accumulation of synthetic data points around some real data points, we compare the proximity ratio to the one we would obtain with a “perfect” synthetic dataset. To do so, we split the real training set into two subsets, T1 and T2. We then plot the histograms of the proximity ratio computed between T1 and the synthetic data points S, called Train to Synthetic Proximity Ratio (TSPR), and between T1 and T2, called Train to Train Proximity Ratio (TTPR). A higher density of the TSPR with respect to the TTPR in the left tail suggests that some synthetic data points are clustering around training data points more than desirable (see the sketch after this list).
- The Phik correlation matrix evaluates the relationships between categorical and numerical variables in a dataset. Unlike traditional correlation measures, which are limited to numerical data, Phik also handles categorical variables and captures both linear and non-linear associations. Its values range from 0 to 1, where 0 implies no association and 1 indicates a perfect one.
- Univariate distributions for each column and bivariate distributions for each pair of columns belonging to the same table.
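The proximity-ratio idea can be illustrated with a small, self-contained sketch. This is not the library's implementation: the nearest-neighbour search, the distance metric, the helper function name, and the toy data below are assumptions made purely for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def proximity_ratios(t1: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Illustrative proximity ratio: for each point in t1, distance to its closest
    point in `reference` divided by the distance to its closest other point in t1."""
    d_ref, _ = cKDTree(reference).query(t1, k=1)   # distance to closest reference point
    d_t1, _ = cKDTree(t1).query(t1, k=2)           # k=2: the first neighbour is the point itself
    return d_ref / np.maximum(d_t1[:, 1], 1e-12)   # guard against division by zero

# Toy stand-ins for the two halves of the training set (T1, T2) and the synthetic data (S).
rng = np.random.default_rng(0)
t1, t2 = rng.normal(size=(500, 4)), rng.normal(size=(500, 4))
s = rng.normal(size=(500, 4))

tspr = proximity_ratios(t1, s)    # Train to Synthetic Proximity Ratio
ttpr = proximity_ratios(t1, t2)   # Train to Train Proximity Ratio
# Comparing the histograms of tspr and ttpr: a heavier left tail in TSPR than in TTPR
# would hint at synthetic points clustering around training points.
```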
The report function takes as input (a part of) the data used in training, some holdout test data, (a part of) the generated synthetic data (all in the form of RelationalData objects), and an output path for the PDF file.
The user may optionally specify the maximum number of samples to use for each table of the training data (n_max_train) and of the test data (n_max_test), and which columns to use for each section of the report and for each table.
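A call could look roughly like the following sketch. The argument order, the placeholder variables, the output file name, and the sample caps are assumptions; only report, n_max_train, and n_max_test are taken from the description above, so check the API reference for the actual signature.

```python
from aindo.rdml.eval import report

# Hedged sketch: data_train, data_test, and data_synth stand for RelationalData
# objects prepared elsewhere; the argument order and values below are illustrative guesses.
report(
    data_train,             # (a part of) the data used in training
    data_test,              # holdout test data, never seen by the model
    data_synth,             # (a part of) the generated synthetic data
    "eval_report.pdf",      # output path for the PDF file
    n_max_train=50_000,     # optional cap on training samples per table
    n_max_test=10_000,      # optional cap on test samples per table
)
```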
For further information, please refer to the API reference.
Additional statistics for the privacy metric
The compute_privacy_stats function performs a more refined analysis of the privacy score.
It takes as input (a part of) the training data and (a part of) the generated synthetic data and returns a dictionary mapping each table to a PrivacyStat object, which contains the following attributes:
- privacy_score: The privacy score.
- privacy_score_std: An estimate of its standard deviation.
- risk: An estimate of the fraction of training points at risk of re-identification.
The user can provide some optional parameters to control the output scores:
- q: The quantile used to compute the privacy score and the number of records at risk of re-identification.
- risk_confidence: A confidence parameter for the estimation of the number of records at risk of re-identification. The estimated number of records at risk (n_risk) is corrected with a factor of -risk_confidence * sqrt(n_risk).
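As a rough sketch, a call and the inspection of the returned PrivacyStat objects might look as follows. The argument order, the placeholder variables, and the specific values of q and risk_confidence are assumptions; only compute_privacy_stats, q, risk_confidence, and the PrivacyStat attributes come from the description above.

```python
from aindo.rdml.eval import compute_privacy_stats

# Hedged sketch: data_train and data_synth are RelationalData objects prepared
# elsewhere; the argument order and the values of q and risk_confidence are guesses.
privacy_stats = compute_privacy_stats(
    data_train,
    data_synth,
    q=0.95,
    risk_confidence=0.95,
)

# The result maps each table to a PrivacyStat object.
for table, stats in privacy_stats.items():
    print(
        f"{table}: score={stats.privacy_score:.3f} "
        f"(+/- {stats.privacy_score_std:.3f}), "
        f"fraction at risk={stats.risk:.4f}"
    )
```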
For more details and the full list of parameters, please refer to the API reference.