Evaluation
class ReportColumns
The columns to be used in the different sections of the report.
For each section, if the columns are specified by an integer n, the first n columns of each table are used.
Otherwise, if a dict is provided, it should map each table to either the columns to be used or an integer n specifying the number of columns. For tables missing from the dict, all columns are retained.
Arguments:
univ - The columns to be used for the univariate distributions.
biv - The columns to be used for the bivariate distributions.
knn - The columns to be used for the k-NN (nearest neighbors) analysis.
phik - The columns to be used for the PhiK analysis.
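The selection rule described above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's implementation: the helper name select_columns and the representation of a schema as a mapping from table names to column lists are assumptions made for the example.

```python
from __future__ import annotations

from typing import Mapping, Sequence, Union

# Hypothetical type alias: an int, or a dict mapping table -> columns or int.
ColumnSpec = Union[int, Mapping[str, Union[int, Sequence[str]]]]


def select_columns(
    tables: Mapping[str, Sequence[str]], spec: ColumnSpec
) -> dict[str, list[str]]:
    """Resolve a column specification for one report section (illustrative)."""
    selected: dict[str, list[str]] = {}
    for table, cols in tables.items():
        # An int applies to every table; a dict is looked up per table.
        choice = spec if isinstance(spec, int) else spec.get(table)
        if choice is None:
            # Table missing from the dict: retain all columns.
            selected[table] = list(cols)
        elif isinstance(choice, int):
            # Integer n: keep the first n columns of the table.
            selected[table] = list(cols[:choice])
        else:
            # Explicit list of column names.
            selected[table] = list(choice)
    return selected


tables = {"users": ["id", "age", "city"], "orders": ["id", "amount"]}
first_two = select_columns(tables, 2)
mixed = select_columns(tables, {"users": ["city"]})
```

Here first_two keeps the first two columns of every table, while mixed keeps only city for users and all columns of orders, since orders is missing from the dict.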
report
Collect summary statistics for the evaluation of synthetic data in terms of data quality and privacy protection.
Arguments:
data_train - A RelationalData object containing the original training data.
data_test - A RelationalData object containing the original test data.
data_synth - A RelationalData object containing the generated synthetic data.
path - A path to save the report.
n_max_train - The maximum number of samples per table (for train data) to use in the report.
n_max_test - The maximum number of samples per table (for test data) to use in the report.
columns - The columns to use for the computation of the report sections. It can be an instance of ReportColumns, a data structure containing the columns to be used in each report section, each given as an int or a dict. Otherwise, it can be an int or a dict, in which case the same settings are applied to all sections. For each section, if the columns are specified by an integer n, the first n columns of each table are used. Otherwise, if a dict is provided, it should map each table to either the columns to be used or an integer n specifying the number of columns. For tables missing from the dict, all columns are retained. By default, 100 columns are used for the univariate distributions and the k-NN analysis, and 20 for the bivariate distributions and the PhiK analysis.
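As a rough illustration of how a single int or dict passed as columns spreads across the four report sections, consider the sketch below. The helper expand_columns_setting is hypothetical, and treating None as "use the defaults" is an assumption; only the section names and the default values (100 for univ and knn, 20 for biv and phik) come from the description above.

```python
from __future__ import annotations

# Default number of columns per section, as stated in the documentation.
SECTION_DEFAULTS = {"univ": 100, "biv": 20, "knn": 100, "phik": 20}


def expand_columns_setting(columns=None) -> dict:
    """Return one setting per report section (hypothetical helper).

    Assumption: None stands for "no value given", which selects the defaults.
    """
    if columns is None:
        return dict(SECTION_DEFAULTS)
    # A single int or dict is applied unchanged to every section.
    return {section: columns for section in SECTION_DEFAULTS}


defaults = expand_columns_setting()
same_everywhere = expand_columns_setting(10)
per_table = expand_columns_setting({"users": ["age", "city"]})
```

With an int, every section uses the first 10 columns of each table. With a dict, every section receives the same per-table mapping.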
compute_privacy_stats
Compute privacy statistics for the evaluation of synthetic data.
Arguments:
data_train - A RelationalData object containing the original training data.
data_synth - A RelationalData object containing the generated synthetic data.
q - The quantile used to compute the privacy score and the number of records at risk of re-identification.
risk_confidence - A confidence parameter for the estimation of the number of records at risk of re-identification. The estimated number of records at risk (n_risk) is corrected by a term of -risk_confidence * sqrt(n_risk).
n_folds_std - Number of folds to use in the computation of the standard deviation; must be larger than 1. If None, the computation is not performed.
n_max - The maximum number of samples per table (for both train and synth data) to use in the computation.
Returns:
A dictionary mapping each table to a PrivacyStats object (or None in case of error).
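The effect of risk_confidence on the estimated number of records at risk can be worked through numerically. The sketch below reads the correction as an additive term, n_risk - risk_confidence * sqrt(n_risk). The function name corrected_n_risk and the clamp at zero are assumptions for the example and may differ from what the library does internally.

```python
import math


def corrected_n_risk(n_risk: float, risk_confidence: float) -> float:
    """Apply the -risk_confidence * sqrt(n_risk) correction (illustrative)."""
    corrected = n_risk - risk_confidence * math.sqrt(n_risk)
    # Assumption: a negative estimate is not meaningful, so clamp at zero.
    return max(corrected, 0.0)


# 100 records flagged, confidence 2: 100 - 2 * 10 = 80
moderate = corrected_n_risk(100, 2.0)
# 4 records flagged, confidence 3: 4 - 3 * 2 = -2, clamped to 0
small = corrected_n_risk(4, 3.0)
```

A larger risk_confidence subtracts more from the raw count, so small counts of flagged records can be reduced all the way to zero.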
class PrivacyStats
Data structure containing the privacy statistics for a single table.
Attributes:
privacy_score - The privacy score.
privacy_score_std - An estimate of the standard deviation of the privacy score.
risk - An estimate of the fraction of training points at risk of re-identification.
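Since compute_privacy_stats returns one entry per table, and an entry may be None when the computation fails, downstream code has to handle both cases. The following sketch uses a stand-in dataclass, PrivacyStatsStandIn, that only mirrors the three documented attributes; it is not the library class. That privacy_score_std may be None when the standard deviation is not computed is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class PrivacyStatsStandIn:
    """Stand-in with the three documented attributes (not the library class)."""

    privacy_score: float
    privacy_score_std: Optional[float]  # assumption: None if not computed
    risk: float


def summarize(stats: Mapping[str, Optional[PrivacyStatsStandIn]]) -> list[str]:
    """Build one human-readable line per table, tolerating failed tables."""
    lines = []
    for table, s in stats.items():
        if s is None:
            # The computation failed for this table.
            lines.append(f"{table}: failed")
            continue
        std = "n/a" if s.privacy_score_std is None else f"{s.privacy_score_std:.3f}"
        lines.append(
            f"{table}: score={s.privacy_score:.3f} (std {std}), risk={s.risk:.1%}"
        )
    return lines


example = {"users": PrivacyStatsStandIn(0.5, None, 0.1), "orders": None}
report_lines = summarize(example)
```

The risk attribute is a fraction of training points, so it is formatted as a percentage here.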