Privacy score

The Privacy score is a similarity-based metric used to quantify the level of privacy protection provided by a synthetic dataset. The underlying principle of this kind of metric is that a privacy risk arises when synthetic records are statistically too close to the real records used to train the model: in that case, the synthetic records may leak information about the real ones.

Calculation methodology

The score computation involves the following steps:

First, we split the real training set R into two subsets, R1 and R2. For each data point in R1, we compute the distance to its nearest neighbor in R2 and divide it by the distance to its second nearest neighbor in R1 (the second is needed because the first nearest neighbor in R1 would be the data point itself). We call the distribution of these ratios the Train to Train Proximity Ratio (TTPR). We then compute the same quantity substituting R2 with the synthetic dataset S, obtaining the Train to Synthetic Proximity Ratio (TSPR) distribution.
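The sketch below illustrates these proximity ratios on numeric data, using plain Euclidean nearest-neighbor distances; the function and variable names are ours and not part of any actual implementation.

```python
from sklearn.neighbors import NearestNeighbors

def proximity_ratios(r1, other):
    """For each record in R1: distance to its nearest neighbor in `other`
    (R2 or the synthetic data S), divided by the distance to its second
    nearest neighbor within R1."""
    d_other, _ = NearestNeighbors(n_neighbors=1).fit(other).kneighbors(r1)
    d_self, _ = NearestNeighbors(n_neighbors=2).fit(r1).kneighbors(r1)
    # d_self[:, 0] is the point itself (distance 0), so take the next column.
    # Exact duplicate records in R1 would make the denominator 0 and would
    # need separate handling in a real implementation.
    return d_other[:, 0] / d_self[:, 1]

# ttpr = proximity_ratios(r1, r2)          # Train to Train Proximity Ratio
# tspr = proximity_ratios(r1, synthetic)   # Train to Synthetic Proximity Ratio
```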

Next, we compute the α-quantile of the TTPR distribution, where α is a fixed parameter (set to 0.1 by default) representing the portion of the distribution we are interested in. We then calculate, for both the TTPR and the TSPR, the fraction of records whose proximity ratio falls below this threshold; for the TTPR, this fraction is α by definition.
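Continuing the sketch above, the threshold and the two fractions can be computed as follows (names and defaults are again illustrative):

```python
import numpy as np

def fractions_below_threshold(ttpr, tspr, alpha=0.1):
    """Fraction of TTPR and TSPR values falling below the
    alpha-quantile of the TTPR distribution."""
    threshold = np.quantile(ttpr, alpha)
    f_ttpr = float(np.mean(ttpr < threshold))  # ≈ alpha by construction
    f_tspr = float(np.mean(tspr < threshold))
    return f_ttpr, f_tspr
```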

The score is finally derived from the ratio between these two fractions, taking the TTPR fraction over the TSPR fraction: if the ratio is equal to or greater than 1, the privacy risk for the synthetic data is minimal and the resulting Privacy score is 100; otherwise, the score is simply the ratio multiplied by 100.
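A minimal sketch of this final step, with the ratio orientation (TTPR fraction over TSPR fraction) inferred from the interpretation given below:

```python
def privacy_score(f_ttpr, f_tspr):
    """Map the two fractions to a Privacy score between 0 and 100."""
    if f_tspr == 0:
        # No synthetic record falls below the threshold: no measurable risk.
        return 100.0
    ratio = f_ttpr / f_tspr
    return min(100.0, 100.0 * ratio)
```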

Interpretation

The score can be used to detect whether the synthetic data is statistically closer to the real data than in the “ideal” scenario represented by the TTPR. If that is the case, there may be a risk of privacy leakage. The Privacy score ranges from 0 to 100, with 100 indicating the absence of significant privacy risks and near-zero values meaning that most of the real records are at risk, each having a synthetic record too similar to it.

Keep in mind that the score computation has some intrinsic variance, so the obtained value can vary somewhat between different runs. This is particularly true for small datasets.