Anonymization Techniques
This section provides an overview of the available anonymization and pseudonymization techniques supported by the platform. For each technique, you’ll find a brief description along with the key parameters that control its behavior.
Data nulling
Data nulling replaces all values, including missing ones, with a constant value.
Constant Value
: The fixed value that will replace the original data. Defaults to Null.
Character masking
Character masking hides part of the data by replacing characters with a symbol (such as *).
Mask Length
: The number of characters to replace
Symbol
: The character used for masking (e.g. *)
Starting Direction
: Indicates whether masking starts from the left or right
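The parameters above can be illustrated with a minimal sketch. The function name `mask_characters` and its argument names are hypothetical and are not part of the platform; the sketch also assumes that a mask length longer than the value simply masks the whole value.

```python
def mask_characters(value: str, mask_length: int, symbol: str = "*",
                    from_left: bool = True) -> str:
    """Replace up to mask_length characters with symbol, starting from one end.

    Hypothetical helper for illustration only; not the platform's API.
    """
    # Assumption: never mask more characters than the value contains.
    n = max(0, min(mask_length, len(value)))
    if from_left:
        return symbol * n + value[n:]
    return value[: len(value) - n] + symbol * n


# Mask the first 12 characters, keeping the trailing ones visible.
masked_left = mask_characters("4111111111111111", 12)
# Mask the last 5 characters with a different symbol.
masked_right = mask_characters("alice@example.com", 5, symbol="#", from_left=False)
```

Masking from the left suits identifiers whose last characters are needed for recognition, while masking from the right keeps a leading prefix readable.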
Mocking
Mocking generates realistic fake data for different data types like names, emails, and addresses. Missing values are also replaced during this process.
Generator Type
: The type of mock data to generate (e.g., name, email)
Seed
: Optional value (number or string) to initialize the generator for consistent results
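To show how a generator type and a seed interact, here is a small sketch built only on Python's standard library. The name lists, the `mock_column` function, and the two generator types are made up for illustration; the platform's actual generators are not described here and will differ.

```python
import random

# Tiny illustrative vocabularies; a real generator would use far larger ones.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev", "Elif"]
LAST_NAMES = ["Rossi", "Khan", "Meyer", "Silva", "Novak"]


def mock_column(values, generator_type, seed=None):
    """Return one fake value per input row, including rows that were missing.

    Hypothetical sketch: the seed may be a number or a string, and the same
    seed always yields the same sequence.
    """
    rng = random.Random(seed)

    def name():
        return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"

    def email():
        first = rng.choice(FIRST_NAMES).lower()
        last = rng.choice(LAST_NAMES).lower()
        return f"{first}.{last}@example.com"

    generators = {"name": name, "email": email}
    if generator_type not in generators:
        raise ValueError(f"unsupported generator type: {generator_type}")
    generate = generators[generator_type]
    # Original values are ignored entirely, so missing entries are replaced too.
    return [generate() for _ in values]


fake_emails = mock_column(["a@corp.test", None, "c@corp.test"], "email", seed="demo")
```

Because the generator is seeded, rerunning the same configuration on the same number of rows reproduces the same fake values.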
Key hashing
Key hashing replaces data with a hashed value using HMAC and a cryptographic key. The result is encoded in Base64. By default, the SHA-256 algorithm is used.
Key
: Cryptographic key used for hashing
Salt
: Optional value added to the data before hashing for extra randomness
Algorithm
: The hashing algorithm to use (e.g. sha256, sha512, md5)
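The description above maps closely onto Python's standard `hmac`, `hashlib`, and `base64` modules. The sketch below is an illustration, not the platform's implementation: the function name `key_hash` is hypothetical, and appending the salt to the value is an assumption, since the exact way the salt is combined with the data is not specified.

```python
import base64
import hmac


def key_hash(value: str, key: str, salt: str = "", algorithm: str = "sha256") -> str:
    """Return the Base64-encoded HMAC of value under key.

    Hypothetical helper. Assumption: the salt is appended to the value
    before hashing; the platform may combine them differently.
    """
    message = (value + salt).encode("utf-8")
    # hmac.new accepts the name of a hashlib algorithm as digestmod.
    digest = hmac.new(key.encode("utf-8"), message, algorithm).digest()
    return base64.b64encode(digest).decode("ascii")


token = key_hash("alice@example.com", key="secret-key")
salted = key_hash("alice@example.com", key="secret-key", salt="2024")
```

The same value, key, salt, and algorithm always produce the same token, which keeps hashed columns usable as join keys. Note that md5 is generally considered weak and is best avoided for new datasets.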
Swapping
Swapping shuffles data values across rows to break direct associations while keeping values in the dataset. The process is controlled by a probability parameter.
Alpha
: Determines how likely a value is to be swapped; must be between 0 and 1
Seed
: Optional number to initialize the generator for consistent results
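One possible reading of the swapping procedure is sketched below: each row is selected with probability alpha, and the values of the selected rows are then shuffled among themselves. This selection scheme and the name `swap_values` are assumptions made for illustration; the platform's exact algorithm is not documented here.

```python
import random


def swap_values(values, alpha, seed=None):
    """Shuffle a random subset of values across rows.

    Hypothetical sketch. Assumption: each row is picked independently with
    probability alpha, then picked rows exchange values with one another.
    """
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    rng = random.Random(seed)
    selected = [i for i in range(len(values)) if rng.random() < alpha]
    sources = selected[:]
    rng.shuffle(sources)
    result = list(values)
    for target, source in zip(selected, sources):
        result[target] = values[source]
    return result


ages = [23, 35, 41, 52, 67, 29]
swapped = swap_values(ages, alpha=0.5, seed=42)
```

Since values are only moved between rows, column-level statistics such as the mean or the value counts are unchanged, while the link between a value and its original row is weakened.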
Binning
Binning groups numerical values into a set number of ranges (bins). This generalizes the data by replacing exact values with their bin range.
Bin
: Number of bins (or groups) to create
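As an illustration, the sketch below bins a numerical column into equal-width ranges and returns each value's range as a `(lower, upper)` pair. Equal-width binning is an assumption (the platform could also use equal-frequency bins), and the function name `bin_values` is hypothetical. The sketch expects a non-empty list without missing values.

```python
def bin_values(values, bins):
    """Replace each number with the (lower, upper) bounds of its bin.

    Hypothetical sketch assuming equal-width bins between min and max.
    """
    if bins < 1:
        raise ValueError("bins must be at least 1")
    low, high = min(values), max(values)
    width = (high - low) / bins
    result = []
    for v in values:
        if width == 0:
            # All values are identical, so everything falls in one bin.
            index = 0
        else:
            # The maximum value belongs to the last bin, not a new one.
            index = min(int((v - low) / width), bins - 1)
        result.append((low + index * width, low + (index + 1) * width))
    return result


salaries = [30_000, 42_000, 55_000, 61_000, 90_000]
salary_ranges = bin_values(salaries, bins=3)
```

Fewer bins give stronger generalization at the cost of precision, so the bin count is the main privacy and utility trade-off for this technique.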
Top/Bottom coding
Top/bottom coding replaces values that fall outside specified thresholds with capped values. For numerical columns:
- Top-coding limits values above the upper (1 - Q/2) quantile.
- Bottom-coding limits values below the lower (Q/2) quantile.
For categorical data, rare categories (those making up Q or less of the data) are replaced with a new, common label.
Q (Quantile)
: A value between 0 and 1 that controls the extent of top/bottom coding
New Category (categorical only)
: The label used to replace rare categories
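The two behaviors described above can be sketched as follows. The helper names (`quantile`, `top_bottom_code`, `recode_rare`) and the default label "Other" are hypothetical, and the linearly interpolated quantile is an assumption; the platform may compute quantiles differently.

```python
from collections import Counter


def quantile(sorted_values, p):
    """Linearly interpolated quantile of an already sorted, non-empty list."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def top_bottom_code(values, q):
    """Cap numbers below the Q/2 quantile and above the 1 - Q/2 quantile."""
    ordered = sorted(values)
    low = quantile(ordered, q / 2)
    high = quantile(ordered, 1 - q / 2)
    return [min(max(v, low), high) for v in values]


def recode_rare(values, q, new_category="Other"):
    """Replace categories whose share of the rows is Q or less."""
    counts = Counter(values)
    total = len(values)
    return [new_category if counts[v] / total <= q else v for v in values]


capped = top_bottom_code(list(range(11)), q=0.5)
cities = recode_rare(["Rome"] * 8 + ["Oslo", "Lima"], q=0.1)
```

With Q = 0.5 on the values 0 to 10, everything below 2.5 is raised to 2.5 and everything above 7.5 is lowered to 7.5. In the categorical example, each city that accounts for 10% or less of the rows is replaced by the new label.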
Perturbation
Perturbation modifies each value according to the specified perturbation intensity (alpha) and replacement strategy.
It supports two modes of replacement: uniform sampling and distribution-preserving sampling.
Alpha
: The perturbation intensity, from 0 (no change) to 1 (maximum change)
Sampling Mode
: The strategy used to sample replacement values:
- uniform: Random values within a range
- weighted: Values are sampled to match the original data distribution
Seed
: Optional number to initialize the generator for consistent results
Perturbation Range (numerical only)
: Specifies the range for uniform sampling; if not provided, the system uses the data’s min and max.
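The sketch below shows one way these parameters could fit together. It assumes that alpha is the probability that a given value is replaced, which is an interpretation rather than documented behavior, and that "weighted" sampling draws replacements from the observed values so that frequent values stay frequent. The function name `perturb` and its argument names are hypothetical.

```python
import random


def perturb(values, alpha, mode="uniform", seed=None, value_range=None):
    """Randomly replace values according to alpha and the sampling mode.

    Hypothetical sketch. Assumption: each value is replaced independently
    with probability alpha (0 leaves the data unchanged, 1 replaces all).
    """
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    rng = random.Random(seed)

    if mode == "uniform":
        # Fall back to the data's min and max when no range is given.
        low, high = value_range if value_range is not None else (min(values), max(values))

        def draw():
            return rng.uniform(low, high)
    elif mode == "weighted":
        # Sampling from the observed values approximates their distribution.
        def draw():
            return rng.choice(values)
    else:
        raise ValueError(f"unknown sampling mode: {mode}")

    return [draw() if rng.random() < alpha else v for v in values]


heights = [150, 162, 171, 171, 180, 195]
noisy_uniform = perturb(heights, alpha=0.5, mode="uniform", seed=1)
noisy_weighted = perturb(heights, alpha=0.5, mode="weighted", seed=1)
```

The weighted mode also works for categorical columns under this interpretation, since it only reuses values already present in the data, whereas the uniform mode requires a numerical range.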