Eval
The aindo.rdml.eval module allows the user to:
- Generate a PDF report to evaluate the quality of the generated synthetic data, in terms of both similarity and privacy protection.
- Compute additional, in-depth statistics for the privacy metric.
Synthetic data report
After generating the synthetic data, the user may use the report() function from the aindo.rdml.eval module to evaluate the output in terms of both synthetic data quality, i.e. the similarity between real and synthetic data, and real data privacy protection.
This function outputs a PDF displaying the key metrics for the evaluation:
- The similarity matrix measures the similarity of univariate and bivariate distributions between real and synthetic data, on a scale from 0 to 1, with higher values indicating closer alignment. A score near 1 means faithful reproduction, while lower scores pinpoint areas of deviation. This matrix acts as a valuable tool for evaluating the quality of the synthetic data.
- The 1-NN distance distribution plot shows, for each data point of the original (respectively synthetic) dataset, the distance to the closest data point in the original (respectively synthetic) dataset. The resulting histograms give insight into the spatial similarities and dissimilarities between the real and synthetic datasets.
- The proximity ratios plot is a tool for identifying potential privacy leaks in synthetic data. The underlying idea is that synthetic data points clustering close to real data points may indicate a privacy leak. Such clustering can only occur for real data points that were included in the training process, since data points excluded from the training set are never seen and thus cannot be memorized by the model.
The proximity ratio is computed as follows: for each real data point, take the ratio of the distance to the nearest synthetic data point over the distance to the nearest real data point. By analyzing the distribution of these ratios, we can detect whether synthetic data points accumulate significantly around certain real data points. To establish a baseline, a “perfect” synthetic dataset is obtained by splitting the real dataset into two subsets, R1 and R2. We then compute and plot histograms for two proximity ratios: the Train to Synthetic Proximity Ratio (TSPR), i.e. the proximity ratio between R1 and the synthetic dataset S, and the Train to Train Proximity Ratio (TTPR), i.e. the proximity ratio between R1 and R2 (an illustrative computation of both is sketched after this list).
A higher density of TSPR values compared to TTPR values in the left tail of the histogram suggests that some synthetic data points cluster around training data points more than expected, which may indicate a privacy concern. For this reason, these distributions are used to compute the privacy score.
- The Phik correlation matrix evaluates the relationships between categorical and numerical variables in a dataset. Unlike traditional correlation measures for numerical data, Phik also handles categorical variables, capturing both linear and non-linear associations. Its values range from 0 to 1, where 0 implies no association and 1 indicates a perfect one (a sketch with the open-source phik package follows this list).
- Univariate distributions for each column and bivariate distributions for each pair of columns belonging to the same table.
- The Privacy Score assesses numerically the presence of privacy leaks: it is computed by dividing the percentage of TSPR values below a given threshold by the percentage of TTPR values below the same threshold. The threshold is set as the q-quantile of TTPR values, with a default q value of 0.1. The score ranges from 0 to 100, with lower values indicating a higher risk of privacy leaks.
- The Similarity Score is the average of the similarity matrix values, multiplied by a scale factor. It provides a global measure of the similarity between the real and synthetic data. A score of 100 indicates perfect similarity.
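To make the proximity ratios concrete, here is an illustrative sketch that computes TSPR and TTPR for a single table of numeric features using NumPy and SciPy. It is not the library's implementation: the random data, the Euclidean distance, and the 0.1 quantile are assumptions made for the example only.

```python
# Illustrative sketch of the proximity-ratio computation (not the library code).
import numpy as np
from scipy.spatial import cKDTree

def proximity_ratios(reference: np.ndarray, other: np.ndarray) -> np.ndarray:
    """For each point in `reference`, return the distance to its nearest
    neighbour in `other` divided by the distance to its nearest neighbour
    within `reference` itself."""
    d_other, _ = cKDTree(other).query(reference, k=1)
    # k=2 because the nearest neighbour of a point within its own set is the point itself.
    d_self, _ = cKDTree(reference).query(reference, k=2)
    return d_other / d_self[:, 1]

rng = np.random.default_rng(0)
real = rng.normal(size=(2_000, 4))        # stand-in for the real table
synthetic = rng.normal(size=(1_000, 4))   # stand-in for the synthetic table

# Baseline: split the real data into R1 and R2 (the "perfect" synthetic data).
r1, r2 = real[:1_000], real[1_000:]

tspr = proximity_ratios(r1, synthetic)    # Train to Synthetic Proximity Ratio
ttpr = proximity_ratios(r1, r2)           # Train to Train Proximity Ratio

# The threshold used by the privacy metrics is the q-quantile of the TTPR values.
q = 0.1
threshold = np.quantile(ttpr, q)
print("fraction of TSPR below threshold:", np.mean(tspr < threshold))
print("fraction of TTPR below threshold:", np.mean(ttpr < threshold))  # ~= q
```

The Phik correlation matrix can be explored with the open-source phik package, as in the sketch below; this only illustrates the metric on made-up data and does not necessarily mirror how the report computes it.

```python
# Illustrative Phik correlation matrix on toy data, using the phik package.
import numpy as np
import pandas as pd
import phik  # noqa: F401  (registers the .phik_matrix accessor on DataFrames)

rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 80, size=n)
city = rng.choice(["Rome", "Milan", "Turin"], size=n)    # categorical column
income = 1_000 * age + rng.normal(0, 10_000, size=n)     # correlated with age

df = pd.DataFrame({"age": age, "city": city, "income": income})

# `interval_cols` marks the numerical columns; the others are treated as categorical.
corr = df.phik_matrix(interval_cols=["age", "income"])
print(corr)  # values in [0, 1]: 0 = no association, 1 = perfect association
```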
The report function takes as input (a part of) the data used in training, some held-out test data, (a part of) the generated synthetic data (all in the form of RelationalData objects), and an output path for the PDF file.
The user may optionally specify the maximum number of samples to use for each table of the training data (n_max_train) and of the test data (n_max_test), and which columns to use for each section of the report and for each table.
For further information, please refer to the API reference.
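A hypothetical usage sketch is given below. The argument order follows the description above, but the exact signature and the names of the data and path arguments are assumptions and should be checked against the API reference; data_train, data_test, and data_synth stand for RelationalData objects prepared earlier in the workflow.

```python
# Hypothetical usage sketch of report(); only n_max_train and n_max_test are
# documented names, the remaining argument order and values are assumptions.
from aindo.rdml.eval import report

report(
    data_train,           # (part of) the data used for training (RelationalData)
    data_test,            # held-out test data (RelationalData)
    data_synth,           # (part of) the generated synthetic data (RelationalData)
    "eval_report.pdf",    # output path for the PDF report
    n_max_train=50_000,   # optional cap on samples per table of the training data
    n_max_test=10_000,    # optional cap on samples per table of the test data
)
```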
Additional statistics for the privacy metric
The compute_privacy_stats function performs a more refined analysis of the privacy score.
It takes as input (a part of) the training data and (a part of) the generated synthetic data, and returns a dictionary mapping each table to a PrivacyStat object, which contains the following attributes:
- privacy_score: The privacy score, as presented in the Synthetic data report.
- privacy_score_std: An estimate of its standard deviation.
- risk: An estimate of the fraction of training points at risk of re-identification. It uses the same percentages as the privacy score, but is evaluated as the difference between the two values: the fraction of TSPR records below the threshold minus the fraction of TTPR records below the same threshold. For example, if 3% of the TSPR values and 1% of the TTPR values fall below the threshold, the estimated fraction at risk is 2%. If this difference is negative, meaning that the synthetic data are statistically more distant from the training data than the test data are, the fraction at risk is set to 0.
The user can provide some optional parameters to control the output scores:
- q: The quantile used to compute the privacy score and the number of records at risk of re-identification.
- risk_confidence: A confidence parameter for the estimation of the number of records at risk of re-identification. If provided, the estimated number of records below the threshold, n_risk, used to compute the fraction at risk is corrected by a term of -risk_confidence * sqrt(n_risk).
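A hypothetical sketch of how the function and the PrivacyStat attributes described above might be used follows; apart from the documented names (compute_privacy_stats, PrivacyStat, q, risk_confidence, privacy_score, privacy_score_std, risk), the argument order and the illustrative values are assumptions.

```python
# Hypothetical usage sketch of compute_privacy_stats(); the positional
# arguments and the parameter values are illustrative assumptions.
from aindo.rdml.eval import compute_privacy_stats

privacy_stats = compute_privacy_stats(
    data_train,            # (part of) the training data (RelationalData)
    data_synth,            # (part of) the generated synthetic data (RelationalData)
    q=0.1,                 # quantile of TTPR values defining the threshold
    risk_confidence=0.95,  # illustrative value for the records-at-risk correction
)

# The result maps each table name to a PrivacyStat object.
for table, stats in privacy_stats.items():
    print(
        f"{table}: privacy_score={stats.privacy_score:.1f} "
        f"(std={stats.privacy_score_std:.1f}), "
        f"fraction at risk={stats.risk:.2%}"
    )
```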
For more details and the full list of parameters, please refer to the API reference.