Constraints¶
When working with tabular data, it is common to encounter constraints and patterns that each row follows strictly. This presents a challenge when generating synthetic data that must preserve the same form as the original data, since these patterns must be reproduced exactly and they may be hard to distinguish from stochastic patterns.
For example, consider a table with two columns for systolic pressure and diastolic pressure.
By definition, the latter value should always be lower than the former.
Constraints can also involve two or more Categorical columns, for example when only some combinations of categories are allowed.
Consider a table with two columns for hospital and ward, where the possible wards depend on the hospital.
All these constraints must be respected in the generated synthetic data: for example, any record in which the diastolic pressure is higher than the systolic pressure would be invalid.
While neural generative models are good at learning and reproducing these patterns, they are still probabilistic and may therefore occasionally generate records that violate them.
If the training is successful, the rate of failure should be small but, in general, it is not guaranteed to be zero.
To address this problem and enforce the constraints present in the data, aindo.synth provides the Constraints class.
In many cases, we can ensure that the logical constraints in the original data are preserved in the synthetic data through pre- and post-processing. For example, instead of feeding both systolic and diastolic pressure columns into our model, we can compute a new column as the difference between them. It is then easy for the model to learn that this quantity is always positive. We can then train the model to generate the synthetic diastolic pressure and this new column. In the post-processing phase, we can compute the synthetic systolic pressure by adding the generated new column to the generated diastolic pressure.
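The difference trick described above can be sketched in plain Python. This is only an illustration of the idea, not the aindo.synth implementation: the function names, the `gap` column and the sample values are assumptions introduced for the example.

```python
# Illustrative sketch only: a plain-Python version of the difference trick,
# not the aindo.synth implementation. All names here are hypothetical.

def to_model_space(records):
    """Replace systolic pressure with its (always positive) gap above diastolic."""
    return [
        {"diastolic": r["diastolic"], "gap": r["systolic"] - r["diastolic"]}
        for r in records
    ]


def to_original_space(records):
    """Rebuild systolic pressure as diastolic + gap."""
    return [
        {"systolic": r["diastolic"] + r["gap"], "diastolic": r["diastolic"]}
        for r in records
    ]


original = [
    {"systolic": 115, "diastolic": 87},
    {"systolic": 123, "diastolic": 75},
]
transformed = to_model_space(original)

# Stand-in for model output: any record with a positive gap
# maps back to a record that satisfies systolic > diastolic.
generated = [{"diastolic": 80, "gap": 41}]
restored = to_original_space(generated)
```

Because the model only ever sees the gap, the constraint reduces to "the gap is positive", which is a property of a single column rather than a relation between two columns.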
The Constraints class handles this pre-processing by transforming the original data into a form that is suitable for generation.
After the synthetic data is generated, the same Constraints instance must be used to perform the post-processing, transforming the generated data back to the original form, with the constraints guaranteed to be satisfied.
Users can specify multiple logical constraints that their data should satisfy, and the Constraints class will take care of applying all the necessary transformations and inverse transformations.
Available constraints¶
Each type of constraint is represented by a class and must specify the table it refers to. The available constraints are:
Equality
: It represents constraints of type F(..) = G(..), where F and G are two arithmetic functions of the columns of a single table.

    In the following example, the table numbers contains the columns A, B, C and D, which must satisfy the equation (A - 2) + (B - 4) * C = D / 3:

        from aindo.synth import Equality

        constraint = Equality(table='numbers', lhs='(A - 2) + (B - 4) * C', rhs='D / 3')

    This logic can be applied to both numerical and datetime columns. In the case of DateTime columns, all operands must be DateTime columns, and they must have the same date format.

GreaterThan
: It represents constraints of type F(..) > G(..), where F and G are two functions of the columns of a single table. The usage is identical to that of the Equality constraint: the lhs and rhs arguments must be specified, and in the case of DateTime columns all operands must be DateTime columns with the same format.

FixedCombinations
: It should be used for a set of categorical columns for which only some combinations are possible. For example, if a table contains the two columns Company and Department, the possible choices of departments may depend on the company, and we need to exclude combinations that are not present in the original data.

        from aindo.synth import FixedCombinations

        constraint = FixedCombinations(table='employees', columns=['Company', 'Department'])
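To make the meaning of a fixed-combinations constraint concrete, the sketch below expresses it as a set-membership check in plain Python. It does not use aindo.synth and says nothing about how the library enforces the constraint internally; the helper names and the company and department values are made up for the example.

```python
# Illustrative sketch only: what "only combinations seen in the original data
# are allowed" means. Helper names and data are hypothetical.

def allowed_combinations(records, columns):
    """Collect every combination of values observed in the original data."""
    return {tuple(r[c] for c in columns) for r in records}


def violations(records, columns, allowed):
    """Return the records whose combination never occurs in the original data."""
    return [r for r in records if tuple(r[c] for c in columns) not in allowed]


original = [
    {"Company": "Acme", "Department": "Sales"},
    {"Company": "Acme", "Department": "R&D"},
    {"Company": "Globex", "Department": "Legal"},
]
columns = ["Company", "Department"]
allowed = allowed_combinations(original, columns)

candidates = [
    {"Company": "Acme", "Department": "Sales"},   # seen in the original data
    {"Company": "Globex", "Department": "R&D"},   # never seen: invalid
]
invalid = violations(candidates, columns, allowed)
```

Note that each value taken alone (Globex, R&D) is valid; only the pair is not, which is why the constraint has to be declared over the set of columns rather than column by column.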
Usage¶
Once the constraints have been identified, they must be used to build an instance of the Constraints class.
The resulting object can be used to transform the original RelationalData object to obtain a new RelationalData object that can be used to build the neural generative model.
Once the model is trained and the synthetic data has been generated, the resulting synthetic RelationalData can be reverted to the form of the original data.
As an example, consider the following table containing measurements of the blood pressure of patients from various hospitals:
| patient_id | Hospital | Monitor | Systolic pressure | Diastolic pressure | Mean Arterial pressure |
|---|---|---|---|---|---|
| 0 | St. James | Mercury | 115 | 87 | 96 |
| 1 | Sacred Heart | Mercury | 123 | 75 | 91 |
| 2 | St. James | Digital | 128 | 81 | 97 |
| 3 | Northwell | Aneroid | 117 | 68 | 84 |
| 4 | Sacred Heart | Digital | 132 | 79 | 97 |
| 5 | Sacred Heart | Aneroid | 121 | 83 | 96 |
The table reports the blood pressure measurements of each patient, together with the type of blood pressure monitor used for the measurement.
Notice that different hospitals may use different blood pressure monitors.
For example, the St. James hospital does not take measurements using the Aneroid monitor.
Moreover, the Mean Arterial pressure is by definition computed as
(Diastolic pressure + 1/3 * (Systolic pressure - Diastolic pressure)).
This is a hard constraint on the value of the Mean Arterial pressure.
Finally, we know that the Systolic pressure is always higher than the Diastolic pressure.
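The three constraints can be checked directly against the example rows with a few lines of plain Python. The values in the table appear to be rounded to the nearest integer, so the sketch rounds the computed Mean Arterial pressure before comparing; that rounding step is an assumption of this example, as are the variable and function names.

```python
# Illustrative check of the example table; names are hypothetical.
rows = [
    # (patient_id, hospital, monitor, systolic, diastolic, mean_arterial)
    (0, "St. James", "Mercury", 115, 87, 96),
    (1, "Sacred Heart", "Mercury", 123, 75, 91),
    (2, "St. James", "Digital", 128, 81, 97),
    (3, "Northwell", "Aneroid", 117, 68, 84),
    (4, "Sacred Heart", "Digital", 132, 79, 97),
    (5, "Sacred Heart", "Aneroid", 121, 83, 96),
]


def mean_arterial(systolic, diastolic):
    """Diastolic pressure + 1/3 * (Systolic pressure - Diastolic pressure)."""
    return diastolic + (systolic - diastolic) / 3


# Greater-than and equality constraints, row by row.
# Rounding to the nearest integer is an assumption about how the table was built.
all_valid = all(
    s > d and round(mean_arterial(s, d)) == m
    for _, _, _, s, d, m in rows
)

# Fixed combinations: which monitors each hospital actually uses.
monitors_by_hospital = {}
for _, hospital, monitor, *_ in rows:
    monitors_by_hospital.setdefault(hospital, set()).add(monitor)
```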
The following example script shows how to deal with these kinds of constraints with aindo.synth.
After the individual constraints are built, they are passed to the Constraints class.
The resulting object is used to transform the data and schema before building the model, and to transform the generated synthetic data back to the original form.
    from aindo.synth import RelationalData, Constraints, Equality, FixedCombinations, GreaterThan, GraphSynth

    data = RelationalData.from_dir(...)

    # Individual constraints on the 'blood' table
    constraints = [
        FixedCombinations(table='blood', columns=['Hospital', 'Monitor']),
        Equality(
            table='blood',
            lhs='Mean Arterial pressure',
            rhs='(Diastolic pressure + 1/3 * (Systolic pressure - Diastolic pressure))',
        ),
        GreaterThan(table='blood', lhs='Systolic pressure', rhs='Diastolic pressure'),
    ]

    # Fit the constraints and transform the data (and its schema) before building the model
    constraint = Constraints(constraints=constraints)
    data_trans = constraint.fit_transform(data)
    model = GraphSynth(schema=data_trans.schema)

    # train the model
    ...
    # Sample in the transformed space, then map back to the original form
    data_synth_trans = model.sample(n_samples=data.n_samples)
    data_synth = constraint.inverse_transform(data_synth_trans)
Notice that the original data is transformed with the fit_transform method, because the Constraints object must be fitted to the data as well.
The Constraints class also provides a transform method.
It can be used once the object has been fitted, and it transforms the data without fitting the object again.
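The difference between fitting and transforming can be illustrated with a toy transformer written in plain Python. The class below is hypothetical and is not part of aindo.synth: it only mimics the fit_transform / transform / inverse_transform pattern, learning a code for each combination of values during fitting and reusing those codes afterwards. As a simplification, it ignores any column not listed in `columns`.

```python
class CombinationEncoder:
    """Toy transformer (hypothetical, not part of aindo.synth)."""

    def __init__(self, columns):
        self.columns = columns
        self.codes = None  # learned by fit_transform

    def fit_transform(self, records):
        # Fitting: learn one integer code per observed combination.
        combos = sorted({tuple(r[c] for c in self.columns) for r in records})
        self.codes = {combo: i for i, combo in enumerate(combos)}
        return self.transform(records)

    def transform(self, records):
        # Transforming reuses the learned codes and never refits.
        if self.codes is None:
            raise RuntimeError("call fit_transform before transform")
        return [self.codes[tuple(r[c] for c in self.columns)] for r in records]

    def inverse_transform(self, codes):
        # Map each code back to the original column values.
        combos = {i: combo for combo, i in self.codes.items()}
        return [dict(zip(self.columns, combos[i])) for i in codes]


encoder = CombinationEncoder(["Hospital", "Monitor"])
train = [
    {"Hospital": "St. James", "Monitor": "Mercury"},
    {"Hospital": "Northwell", "Monitor": "Aneroid"},
]
encoded = encoder.fit_transform(train)        # fits, then transforms
again = encoder.transform(train[:1])          # reuses the fitted codes
decoded = encoder.inverse_transform(encoded)  # back to the original form
```

Calling `transform` on an unfitted `CombinationEncoder` raises an error, which mirrors why the original data goes through `fit_transform` first.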
Documentation¶
- class aindo.synth.relational.constraints.Constraints(*args: Constraint, constraints: Sequence[Constraint] = ())
  Specify a set of constraints that the original data satisfies and that the synthetic data must satisfy too.
  - fit_transform(data: RelationalData) → RelationalData
    Fit the constraints and transform the input relational data to a form suitable for generation of synthetic data.
  - transform(data: RelationalData) → RelationalData
    Transform the input relational data to a form suitable for generation of synthetic data.
  - inverse_transform(data: RelationalData) → RelationalData
    Inverse transform the input relational data to bring it to the original form.
- class aindo.synth.relational.constraints.Equality(table: str, lhs: str | sympy.core.expr.Expr, rhs: str | sympy.core.expr.Expr)
  - __init__(table: str, lhs: str | sympy.core.expr.Expr, rhs: str | sympy.core.expr.Expr) → None
    Constraint of type F(..) = G(..), where F and G are functions of columns in the same table. The two sides of the constraint can be either Sympy expressions or strings that can be cast to Sympy expressions. The allowed operators in the expressions are: ‘+’, ‘-’, ‘*’ and ‘/’.
    - Parameters:
      table – Name of the table containing the columns involved in the constraint.
      lhs – Left side of the constraint.
      rhs – Right side of the constraint.
- class aindo.synth.relational.constraints.FixedCombinations(table: str, columns: Sequence[str])
  - __init__(table: str, columns: Sequence[str]) → None
    Constraint on categorical columns in a single table that can only appear in a fixed set of combinations.
    - Parameters:
      table – Name of the table containing the columns involved in the constraint.
      columns – List of names of columns involved in the constraint.
- class aindo.synth.relational.constraints.GreaterThan(table: str, lhs: str, rhs: str)
  - __init__(table: str, lhs: str, rhs: str) → None
    Constraint of type F(..) > G(..), where F and G are functions of columns in the same table. The two sides of the constraint can be either Sympy expressions or strings that can be cast to Sympy expressions. The allowed operators in the expressions are: ‘+’, ‘-’, ‘*’ and ‘/’.
    - Parameters:
      table – Name of the table containing the columns involved in the constraint.
      lhs – Left side of the constraint.
      rhs – Right side of the constraint.