Constraints

When working with tabular data, it is common to encounter constraints and patterns that each row follows strictly. This presents a challenge when generating synthetic data that must preserve the same form as the original data, since these patterns must be reproduced exactly and they may be hard to distinguish from stochastic patterns.

For example, consider a table with two columns for systolic and diastolic pressure. By definition, the latter should always be lower than the former. Constraints can also involve two or more categorical columns, for example when only some combinations of categories are allowed: consider a table with two columns for hospital and ward, where the possible wards depend on the hospital. All these constraints must be respected in the generated synthetic data: a record where the diastolic pressure is higher than the systolic pressure, for instance, would not be valid.

While neural generative models are good at learning and reproducing these patterns, they are still probabilistic and may therefore fail. If the training is successful the rate of failure should be small but, generally, it may not be zero. To address this problem and enforce the constraints present in the data, aindo.synth provides the Constraints class.

In many cases, we can ensure that the logical constraints in the original data are preserved in the synthetic data through pre- and post-processing. For example, instead of feeding both systolic and diastolic pressure columns into our model, we can compute a new column as the difference between them. It is then easy for the model to learn that this quantity is always positive. We can then train the model to generate the synthetic diastolic pressure and this new column. In the post-processing phase, we can compute the synthetic systolic pressure by adding the generated new column to the generated diastolic pressure.
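
The pre- and post-processing described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the aindo.synth implementation: the function names, the "gap" column and the hard-coded "generated" row (a stand-in for model output) are all assumptions made for the example.

```python
# Illustrative sketch only: plain Python, not the aindo.synth implementation.

def preprocess(rows):
    """Replace systolic pressure with its (always positive) gap above diastolic."""
    return [
        {"diastolic": r["diastolic"], "gap": r["systolic"] - r["diastolic"]}
        for r in rows
    ]


def postprocess(rows):
    """Rebuild systolic pressure from the diastolic value and the gap."""
    return [
        {"systolic": r["diastolic"] + r["gap"], "diastolic": r["diastolic"]}
        for r in rows
    ]


original = [
    {"systolic": 115, "diastolic": 87},
    {"systolic": 123, "diastolic": 75},
]

# Round trip on the original data: nothing is lost by the transformation.
transformed = preprocess(original)
restored = postprocess(transformed)

# Hypothetical model output in the transformed form. As long as the generated
# gap is positive, the rebuilt systolic value exceeds the diastolic one.
generated = [{"diastolic": 80, "gap": 42}]
synthetic = postprocess(generated)
```

The constraint holds by construction after post-processing, provided the model only produces positive gaps, which is a much simpler pattern to learn than a relation between two columns.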

The Constraints class handles this pre-processing by transforming the original data into a form that is suitable for generation. After the synthetic data is generated, the same Constraints object must be used to perform the post-processing, which transforms the generated data back to the original form with the constraints guaranteed to be satisfied. Users can specify multiple logical constraints that their data should satisfy, and the Constraints class will take care of applying all the necessary transformations and inverse transformations.

Available constraints

Each type of constraint is represented by a class and must specify the table it refers to. The available constraints are:

  • Equality: It represents constraints of type F(..) = G(..), where F and G are two arithmetic functions of the columns of a single table.

    In the following example, the table numbers contains the columns A, B, C and D, which must satisfy the equation: (A - 2) + (B - 4) * C = D / 3

    from aindo.synth import Equality

    constraint = Equality(table='numbers', lhs='(A - 2) + (B - 4) * C', rhs='D / 3')
    

    This logic can be applied with both numerical and datetime columns. In the case of DateTime columns, all operands must be DateTime columns, and they must have the same date format.

  • GreaterThan: It represents constraints of type F(..) > G(..), where F and G are two functions of the columns of a single table. The usage is identical to the Equality constraint: lhs and rhs arguments must be specified, and in the case of DateTime columns all operands must be DateTime columns with the same format.

  • FixedCombinations: It should be used when only some combinations of values are possible across a set of categorical columns. For example, if a table contains the two columns Company and Department, the possible choices of departments may depend on the company, and we need to exclude combinations that are not present in the original data.

    from aindo.synth import FixedCombinations

    constraint = FixedCombinations(table='employees', columns=['Company', 'Department'])
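
To make the fixed-combinations idea concrete, here is a minimal sketch of one way such a constraint could be enforced: the constrained columns are merged into a single column whose values are the combinations seen in the original data, so that a generator choosing among those values cannot produce an unseen pair. This is an assumption about a possible approach, not a description of how aindo.synth works internally; the helper names, the "combo" column and the company and department values are all hypothetical.

```python
# Illustrative sketch only: plain Python, not the aindo.synth implementation.
# The company and department names are hypothetical sample values.

def fit_combinations(rows, columns):
    """Collect every combination of values observed in the original rows."""
    return {tuple(row[c] for c in columns) for row in rows}


def merge_columns(rows, columns, merged_name="combo"):
    """Replace the constrained columns with a single tuple-valued column."""
    merged = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in columns}
        new_row[merged_name] = tuple(row[c] for c in columns)
        merged.append(new_row)
    return merged


def split_columns(rows, columns, merged_name="combo"):
    """Inverse of merge_columns: expand the tuple back into separate columns."""
    restored = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != merged_name}
        new_row.update(zip(columns, row[merged_name]))
        restored.append(new_row)
    return restored


columns = ["Company", "Department"]
original = [
    {"Company": "Acme", "Department": "Sales"},
    {"Company": "Acme", "Department": "R&D"},
    {"Company": "Globex", "Department": "Legal"},
]

allowed = fit_combinations(original, columns)
merged = merge_columns(original, columns)
restored = split_columns(merged, columns)

# A stand-in for model output in the merged form: it can only pick a value
# of "combo" that was seen in the original data.
generated = [{"combo": ("Globex", "Legal")}]
synthetic = split_columns(generated, columns)
```

Because the merged column only takes values from the observed set, a pair such as ("Globex", "Sales") cannot appear after the columns are split again.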
    

Usage

Once the constraints have been identified, they must be used to build an instance of the Constraints class. The resulting object can be used to transform the original RelationalData object to obtain a new RelationalData object that can be used to build the neural generative model. Once the model is trained and the synthetic data has been generated, the resulting synthetic RelationalData can be reverted to the form of the original data.

As an example, consider the following table containing measurements of the blood pressure of patients from various hospitals:

patient_id  Hospital      Monitor  Systolic pressure  Diastolic pressure  Mean Arterial pressure
0           St. James     Mercury  115                87                  96
1           Sacred Heart  Mercury  123                75                  91
2           St. James     Digital  128                81                  97
3           Northwell     Aneroid  117                68                  84
4           Sacred Heart  Digital  132                79                  97
5           Sacred Heart  Aneroid  121                83                  96

The table reports the blood pressure measurements of each patient, together with the type of blood pressure monitor used for the measurement. Notice that hospitals may use different blood pressure monitors: for example, the St. James hospital does not take measurements using the Aneroid monitor. Moreover, the Mean Arterial pressure is by definition computed as (Diastolic pressure + 1/3 * (Systolic pressure - Diastolic pressure)), which constitutes a hard constraint on its value. Finally, we know that the Systolic pressure is always higher than the Diastolic pressure. The following example script shows how to deal with these kinds of constraints with aindo.synth. The individual constraints are built first and then passed to the Constraints class. The resulting object is used to transform the data and schema before building the model, and to transform the generated synthetic data back to the original form.

    from aindo.synth import RelationalData, Constraints, Equality, FixedCombinations, GreaterThan, GraphSynth

    data = RelationalData.from_dir(...)

    # one constraint object per rule that the 'blood' table must satisfy
    constraints = [
        FixedCombinations(table='blood', columns=['Hospital', 'Monitor']),
        Equality(
            table='blood',
            lhs='Mean Arterial pressure',
            rhs='(Diastolic pressure + 1/3 * (Systolic pressure - Diastolic pressure))',
        ),
        GreaterThan(table='blood', lhs='Systolic pressure', rhs='Diastolic pressure'),
    ]

    constraint = Constraints(constraints=constraints)
    # fit the constraints and transform the data to the form used for training
    data_trans = constraint.fit_transform(data)
    model = GraphSynth(schema=data_trans.schema)

    # train the model
    ...
    data_synth_trans = model.sample(n_samples=data.n_samples)
    # bring the synthetic data back to the original form
    data_synth = constraint.inverse_transform(data_synth_trans)

Notice that when the original data is transformed, the fit_transform method is used. This is because the constraints object must also be fitted. The Constraints class has a transform method available too. This can be used if the object was already fitted and transforms the data without fitting the object again.
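
As a sanity check, the three constraints of this example can also be verified directly on plain rows. The sketch below is independent of aindo.synth: the violations helper and the tuple layout are hypothetical, and the rows are copied from the example table above. Since the table stores the Mean Arterial pressure rounded to an integer, the equality is compared within a tolerance, which is an assumption of this sketch rather than documented library behavior.

```python
# Illustrative sketch only: the helper below is hypothetical and not part of
# aindo.synth. Rows are copied from the example table above.

ROWS = [
    # (hospital, monitor, systolic, diastolic, mean_arterial)
    ("St. James", "Mercury", 115, 87, 96),
    ("Sacred Heart", "Mercury", 123, 75, 91),
    ("St. James", "Digital", 128, 81, 97),
    ("Northwell", "Aneroid", 117, 68, 84),
    ("Sacred Heart", "Digital", 132, 79, 97),
    ("Sacred Heart", "Aneroid", 121, 83, 96),
]


def violations(rows, allowed_pairs, tolerance=0.5):
    """Return (row index, constraint name) for every violated constraint."""
    found = []
    for i, (hospital, monitor, sys_p, dia_p, map_p) in enumerate(rows):
        # FixedCombinations: the (hospital, monitor) pair must be a known one.
        if (hospital, monitor) not in allowed_pairs:
            found.append((i, "FixedCombinations"))
        # Equality: the table stores rounded values, so allow a tolerance.
        if abs(map_p - (dia_p + (sys_p - dia_p) / 3)) > tolerance:
            found.append((i, "Equality"))
        # GreaterThan: systolic must be strictly higher than diastolic.
        if not sys_p > dia_p:
            found.append((i, "GreaterThan"))
    return found


# The allowed pairs are exactly those observed in the original rows.
allowed_pairs = {(hospital, monitor) for hospital, monitor, *_ in ROWS}

original_violations = violations(ROWS, allowed_pairs)

# A deliberately invalid record that breaks all three constraints.
bad_row = ("St. James", "Aneroid", 80, 90, 100)
bad_violations = violations([bad_row], allowed_pairs)

print(original_violations)
print(bad_violations)
```

The same kind of check could be applied to the synthetic table after inverse_transform, once it has been exported to rows in whatever way is convenient.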

Documentation

class aindo.synth.relational.constraints.Constraints(*args: Constraints, constraints: Sequence[Constraints] = ())

Specify a set of constraints that the original data satisfy and the synthetic data must satisfy too.

fit_transform(data: RelationalData) → RelationalData

Fit the constraints and transform the input relational data to a form suitable for generation of synthetic data.

transform(data: RelationalData) → RelationalData

Transform the input relational data to a form suitable for generation of synthetic data.

inverse_transform(data: RelationalData) → RelationalData

Inverse transform the input relational data to bring it to the original form.

class aindo.synth.relational.constraints.Equality(table: str, lhs: str | sympy.core.expr.Expr, rhs: str | sympy.core.expr.Expr)
__init__(table: str, lhs: str | sympy.core.expr.Expr, rhs: str | sympy.core.expr.Expr) → None

Constraint of type F(..) = G(..), where F and G are functions of columns in the same table. The lhs and rhs arguments can be either SymPy expressions or strings that can be parsed into SymPy expressions. The allowed operators in the expressions are: ‘+’, ‘-’, ‘*’ and ‘/’.

Parameters:
  • table – Name of the table containing the columns involved in the constraint.

  • lhs – Left side of the constraint.

  • rhs – Right side of the constraint.

class aindo.synth.relational.constraints.FixedCombinations(table: str, columns: Sequence[str])
__init__(table: str, columns: Sequence[str]) → None

Constraint on categorical columns in a single table that can only appear in a fixed set of combinations.

Parameters:
  • table – Name of the table containing the columns involved in the constraint.

  • columns – List of names of columns involved in the constraint.

class aindo.synth.relational.constraints.GreaterThan(table: str, lhs: str, rhs: str)
__init__(table: str, lhs: str, rhs: str) → None

Constraint of type F(..) > G(..), where F and G are functions of columns in the same table. The lhs and rhs arguments can be either SymPy expressions or strings that can be parsed into SymPy expressions. The allowed operators in the expressions are: ‘+’, ‘-’, ‘*’ and ‘/’.

Parameters:
  • table – Name of the table containing the columns involved in the constraint.

  • lhs – Left side of the constraint.

  • rhs – Right side of the constraint.