Anonymization Techniques

This section gives an overview of the anonymization and pseudonymization techniques supported by the platform. Each technique comes with a brief description and the key parameters that control its behavior.

Data nulling

Data nulling replaces all values, including missing ones, with a constant value.

Constant Value: The fixed value that replaces the original data; defaults to Null
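
As a rough illustration of the behavior described above, here is a minimal Python sketch. The function name `null_values` and the `constant` argument are hypothetical and only mirror the Constant Value parameter; they are not the platform's API.

```python
def null_values(values, constant=None):
    """Replace every entry, missing or not, with the same constant."""
    return [constant for _ in values]


column = ["alice", None, "bob"]
print(null_values(column))                       # [None, None, None]
print(null_values(column, constant="REDACTED"))  # ['REDACTED', 'REDACTED', 'REDACTED']
```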

Character masking

Character masking hides part of the data by replacing characters with a symbol (such as *).

Mask Length: The number of characters to replace
Symbol: The character used for masking (e.g. *)
Starting Direction: Indicates whether masking starts from the left or right
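
The three parameters above could interact roughly as in the following Python sketch. The function and argument names (`mask_characters`, `from_left`) are assumptions made for illustration, and clamping the mask length to the length of the value is a choice of this sketch rather than documented platform behavior.

```python
def mask_characters(value, mask_length, symbol="*", from_left=True):
    """Mask `mask_length` characters of `value`, starting from one end."""
    # Clamp so that a mask longer than the value simply hides all of it.
    n = max(0, min(mask_length, len(value)))
    if from_left:
        return symbol * n + value[n:]
    return value[:len(value) - n] + symbol * n


print(mask_characters("4111111111111111", 12))                    # ************1111
print(mask_characters("alice@example.com", 5, from_left=False))   # alice@exampl*****
```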

Mocking

Mocking generates realistic fake data for different data types like names, emails, and addresses. Missing values are also replaced during this process.

Generator Type: The type of mock data to generate (e.g., name, email)
Seed: Optional value (number or string) to initialize the generator for consistent results
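
To show how a seed makes mock data reproducible, here is a toy Python sketch built on the standard `random` module. The name lists, the `mock_values` function, and the two generator types are invented for this example; the platform's generators are presumably far richer.

```python
import random

# Hypothetical sample pools, used only for this illustration.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev", "Elena"]
LAST_NAMES = ["Ito", "Khan", "Lopez", "Meyer", "Novak"]


def mock_values(values, generator_type, seed=None):
    """Return one fake value per input row; missing rows are replaced too."""
    rng = random.Random(seed)  # accepts a number or a string as seed

    def fake_name():
        return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"

    def fake_email():
        return f"{rng.choice(FIRST_NAMES).lower()}{rng.randrange(1000)}@example.com"

    generators = {"name": fake_name, "email": fake_email}
    make = generators[generator_type]
    return [make() for _ in values]


column = ["Jane Doe", None, "John Roe"]
# The same seed yields the same fake values on every run.
print(mock_values(column, "name", seed=42) == mock_values(column, "name", seed=42))  # True
```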

Key hashing

Key hashing replaces data with a hashed value using HMAC and a cryptographic key. The result is encoded in Base64. By default, the SHA-256 algorithm is used.

Key: Cryptographic key used for hashing
Salt: Optional value added to the data before hashing for extra randomness
Algorithm: The hashing algorithm to use (e.g. sha256, sha512, md5)
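
The description above maps closely onto Python's standard `hmac`, `hashlib`, and `base64` modules, as the sketch below suggests. The function name `key_hash`, the UTF-8 encoding, and the choice to prepend the salt to the value are assumptions; the source only states that the salt is added before hashing.

```python
import base64
import hashlib
import hmac


def key_hash(value, key, salt="", algorithm="sha256"):
    """Return the Base64-encoded HMAC of `value` under `key`."""
    # Assumption: the salt is prepended to the value before hashing.
    message = (salt + value).encode("utf-8")
    digest = hmac.new(key.encode("utf-8"), message, algorithm).digest()
    return base64.b64encode(digest).decode("ascii")


token = key_hash("alice@example.com", key="my-secret-key")
print(len(token))  # 44 (Base64 length of a 32-byte SHA-256 digest)
```

Because the output depends only on the value, salt, and key, equal inputs map to equal tokens, which keeps joins across tables possible.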

Swapping

Swapping shuffles data values across rows to break direct associations while keeping values in the dataset. The process is controlled by a probability parameter.

Alpha: Determines how likely a value is to be swapped; must be between 0 and 1
Seed: Optional number to initialize the generator for consistent results
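
One plausible reading of the probability parameter is sketched below in Python: each row is selected with probability Alpha, and the selected values are then shuffled among the selected positions. This selection scheme and the name `swap_values` are assumptions; the platform may pair or shuffle rows differently.

```python
import random


def swap_values(values, alpha, seed=None):
    """Shuffle a random subset of values among their own positions."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    rng = random.Random(seed)
    # Each row takes part in the swap with probability alpha.
    selected = [i for i in range(len(values)) if rng.random() < alpha]
    shuffled = selected[:]
    rng.shuffle(shuffled)
    result = list(values)
    for target, source in zip(selected, shuffled):
        result[target] = values[source]
    return result


ages = [23, 35, 41, 52, 67, 29]
swapped = swap_values(ages, alpha=0.5, seed=7)
print(sorted(swapped) == sorted(ages))  # True: values stay in the dataset
```

Note that in this sketch a selected value can land back in its original position, so the share of rows that actually change can be lower than Alpha.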

Binning

Binning groups numerical values into a set number of ranges (bins). This generalizes the data by replacing exact values with their bin range.

Bin: Number of bins (or groups) to create
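
The source does not say how bin edges are chosen, so the following Python sketch assumes equal-width bins between the column's minimum and maximum and returns each bin as a (lower, upper) pair. The function name `bin_values` is hypothetical, and missing values are not handled here.

```python
def bin_values(values, bins):
    """Replace each number with the (lower, upper) edges of its bin."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    result = []
    for v in values:
        # The maximum value falls into the last bin rather than a new one.
        idx = 0 if width == 0 else min(int((v - lo) / width), bins - 1)
        result.append((lo + idx * width, lo + (idx + 1) * width))
    return result


print(bin_values([0, 10, 25, 50, 100], bins=4))
# [(0.0, 25.0), (0.0, 25.0), (25.0, 50.0), (50.0, 75.0), (75.0, 100.0)]
```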

Top/Bottom coding

Top/bottom coding replaces values that fall outside specified thresholds with capped values. For numerical columns:

Top-coding caps values above the upper (1 - Q/2) quantile at that quantile.
Bottom-coding caps values below the lower (Q/2) quantile at that quantile.

For categorical data, rare categories (those that account for a share of Q or less of the data) are replaced with a single shared label.

Q (Quantile): A value between 0 and 1 that controls the extent of top/bottom coding
New Category (categorical only): The label used to replace rare categories
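
Both variants can be sketched in a few lines of Python. The quantile here uses linear interpolation between sorted values, which is an assumption, as are the function names `top_bottom_code` and `recode_rare` and the default label "Other".

```python
from collections import Counter


def quantile(sorted_values, p):
    """Quantile by linear interpolation (an assumed convention)."""
    pos = p * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def top_bottom_code(values, q):
    """Clamp numbers to the [Q/2, 1 - Q/2] quantile interval."""
    ordered = sorted(values)
    low = quantile(ordered, q / 2)
    high = quantile(ordered, 1 - q / 2)
    return [min(max(v, low), high) for v in values]


def recode_rare(values, q, new_category="Other"):
    """Replace categories whose share of rows is Q or less."""
    counts = Counter(values)
    rare = {c for c, k in counts.items() if k / len(values) <= q}
    return [new_category if v in rare else v for v in values]


salaries = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
print(max(top_bottom_code(salaries, q=0.2)) < 1000)   # True: the outlier is capped
print(recode_rare(["a"] * 8 + ["b", "c"], q=0.1))     # eight 'a' values, then 'Other', 'Other'
```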

Perturbation

Perturbation modifies each value according to the perturbation intensity (Alpha) and a replacement strategy. Two replacement strategies are supported: uniform sampling and distribution-preserving sampling.

Alpha: The perturbation intensity, from 0 (no change) to 1 (maximum change)
Sampling Mode: The strategy used to sample replacement values:
- uniform: Random values within a range
- weighted: Values are sampled to match the original data distribution
Seed: Optional number to initialize the generator for consistent results
Perturbation Range (numerical only): The range for uniform sampling; if not provided, the data’s min and max are used
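
The source does not spell out how Alpha is applied, so the Python sketch below adopts one possible interpretation: each value is replaced with probability Alpha, either by a uniform draw from the range or by a draw from the original values (which follows their empirical distribution). The function name `perturb` and the `value_range` argument are hypothetical.

```python
import random


def perturb(values, alpha, mode="uniform", seed=None, value_range=None):
    """Replace each value with probability alpha (an assumed reading of Alpha)."""
    if mode not in ("uniform", "weighted"):
        raise ValueError("mode must be 'uniform' or 'weighted'")
    rng = random.Random(seed)
    # Fall back to the data's own min and max when no range is given.
    lo, hi = value_range if value_range is not None else (min(values), max(values))
    result = []
    for v in values:
        if rng.random() >= alpha:
            result.append(v)                    # keep the original value
        elif mode == "uniform":
            result.append(rng.uniform(lo, hi))  # any value within the range
        else:
            result.append(rng.choice(values))   # follows the observed distribution
    return result


heights = [150.0, 160.0, 170.0, 180.0, 190.0]
print(perturb(heights, alpha=0.0) == heights)  # True: alpha 0 changes nothing
```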