Privacy score

The Privacy score quantifies the level of privacy protection provided by a synthetic dataset. It is a similarity-based privacy metric: the underlying principle is that privacy risks arise when specific synthetic records are very close to specific real records, in which case the synthetic records may leak information about those real records.

Calculation methodology

The Privacy score is computed using the Distance to Closest Record Ratio metric, which involves comparing two distances between synthetic records and real records:

  • Synthetic to Train Distance Ratio (STDR): The STDR measures the ratio of the distance between a synthetic record and the closest real record in the training set used to build the generative model.
  • Synthetic to Holdout Distance Ratio (SHDR): The SHDR measures the ratio of the distance between a synthetic record and the closest real record in a holdout set.

The train set contains real records used during the training of the generative model. The holdout set contains real records not used during that training. We want the synthetic data to be statistically no more similar to the train data than to the holdout data: if the synthetic data is much closer to the train data than to the holdout data, there is a higher risk of privacy leakage. By comparing the STDR to the SHDR, we can assess whether small STDR values are due to chance (similarly small SHDR values occur) or due to the generative model leaking private information (no similarly small SHDR values occur).
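The comparison above can be illustrated with a minimal sketch. It only computes the plain distance from each synthetic record to its closest train record and to its closest holdout record. The Euclidean distance, the purely numeric toy records, and the function names are assumptions made for illustration, and the sketch does not attempt the normalization implied by "ratio", which is not specified here.

```python
import math


def closest_distance(record, reference):
    """Distance from `record` to its nearest record in `reference`.

    Assumption: records are numeric tuples compared with Euclidean distance.
    """
    return min(math.dist(record, other) for other in reference)


def closest_distances(synthetic, reference):
    """Closest-record distance for every synthetic record."""
    return [closest_distance(s, reference) for s in synthetic]


# Hypothetical toy data, for illustration only.
train = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
holdout = [(0.5, 0.5), (3.0, 3.0), (5.0, 5.0)]
synthetic = [(0.0, 0.1), (2.0, 2.0), (4.5, 4.5)]

to_train = closest_distances(synthetic, train)
to_holdout = closest_distances(synthetic, holdout)

for s, dt, dh in zip(synthetic, to_train, to_holdout):
    print(s, "closest train:", round(dt, 3), "closest holdout:", round(dh, 3))
```

In this toy data, the first synthetic record sits much closer to a train record than to any holdout record, which is the kind of asymmetry the score is designed to detect. The other two records are equally close to both sets.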

To calculate the α-privacy score, where α is a user-specified significance level, we first compute the α-quantile of the SHDR distribution. Then, we compute the fraction of records of the train and holdout sets, respectively, that have STDR and SHDR values inside the first α-quantile of the SHDR distribution. Finally, we compute the ratio between the fraction of training records with STDR inside the first α-quantile and the fraction of holdout records whose SHDR values are inside the same quantile. If this value is greater than 1, the score is 100 because the synthetic data is not statistically closer to the train data than to the holdout data. Otherwise, the privacy score is the ratio multiplied by 100.

The default value for α is 0.05.
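The scoring steps can be sketched as follows, given lists of STDR and SHDR values that have already been computed. This is an illustrative sketch rather than the product's implementation, and it rests on several assumptions. The ratio is taken as the holdout fraction divided by the train fraction, since that is the orientation under which a value above 1 means the synthetic data is no closer to the train data. The quantile uses linear interpolation, values equal to the threshold count as inside the quantile, and a train fraction of zero is scored as 100. The function names are hypothetical.

```python
def quantile(values, alpha):
    """Empirical alpha-quantile using linear interpolation (an assumption)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = alpha * (len(ordered) - 1)
    lo = int(pos)  # floor, since pos is non-negative
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def privacy_score(stdr, shdr, alpha=0.05):
    """Sketch of the alpha-privacy score from STDR and SHDR values."""
    threshold = quantile(shdr, alpha)
    frac_train = sum(d <= threshold for d in stdr) / len(stdr)
    frac_holdout = sum(d <= threshold for d in shdr) / len(shdr)
    if frac_train == 0:
        # No STDR value falls inside the quantile: treated as no excess risk.
        return 100.0
    # Assumed orientation: holdout fraction over train fraction.
    ratio = frac_holdout / frac_train
    return 100.0 * min(1.0, ratio)


# Hypothetical toy values, for illustration only.
shdr = [i / 100 for i in range(1, 101)]
similar_stdr = list(shdr)                  # same distribution as SHDR
leaky_stdr = [0.001] * 20 + [0.5] * 80     # 20% of values are very small

score_similar = privacy_score(similar_stdr, shdr)
score_leaky = privacy_score(leaky_stdr, shdr)
print(score_similar, score_leaky)
```

With these toy values, the STDR list that matches the SHDR distribution scores 100. In the second list, 20% of STDR values fall inside a quantile that holds only 5% of SHDR values, so the score is about 25.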

Interpretation

The Privacy score ranges from 0 to 100, with 100 indicating no significant privacy risks and 0 indicating that all synthetic records pose a risk to real records because they are too similar to the train data. Keep in mind that the score has some variance, so the calculated value can differ somewhat between runs, particularly for small datasets.