Generative Engine

The Generative Engine provides tools for creating synthetic data with machine learning models. It includes modules for time series and tabular data that use Variational Autoencoders (VAEs) and Diffusion Models to generate privacy-preserving synthetic datasets.

The TimeSeriesEngine is designed for time series data generation, offering features like:

  • Model training and evaluation.

  • Latent space sampling.

  • Reconstruction error analysis.

The TabularEngine handles tabular data, supporting both VAE and Diffusion Models. Key features include:

  • Tools for training and evaluation.

  • Sampling from latent space.

  • Reconstruction of mixed data types.

  • Configurable architectures for numerical and categorical data.

TabularEngine

class clearbox_synthetic.generation.engine.tabular_engine.TabularEngine(dataset: Dataset, layers_size: Sequence[int] = [50], params: FrozenDict | None = None, train_params: Dict | None = None, diffusion_params: Dict | None = None, privacy_budget: float = 1.0, model_type: str = 'VAE', rules: Dict = {}, cat_labels_threshold: float = 0.02, missing_values_threshold: float = 0.999, n_bins: int = 0, scaling: Literal['none', 'normalize', 'standardize', 'quantile'] = 'quantile', num_fill_null: Literal['interpolate', 'forward', 'backward', 'min', 'max', 'mean', 'zero', 'one'] = 'none', unseen_labels='ignore')

Bases: EngineInterface

This class integrates the TabularVAE and TabularDiffusion models to enable training, evaluation, and inference for tabular datasets.

Parameters:
  • dataset (Dataset) – The dataset used to initialize the generative engine.

  • layers_size (Sequence[int], optional, default=[50]) – The sizes of the hidden layers.

  • params (FrozenDict, optional, default=None) – Model parameters.

  • train_params (Dict, optional, default=None) – Training parameters.

  • diffusion_params (Dict, optional, default=None) – Diffusion model parameters.

  • privacy_budget (float, optional, default=1.0) – The privacy budget.

  • model_type (str, optional, default='VAE') – Type of model (‘VAE’ or ‘Diffusion’).

  • rules (Dict, optional, default={}) – Rules for embedding and transformations.

  • cat_labels_threshold (float, optional, default=0.02) –

    A float between 0 and 1 that sets the minimum relative frequency a label needs in order to be kept as a separate category. If a label accounts for less than cat_labels_threshold * 100% of the values in a categorical column, it is grouped into a generic "other" category.

    For instance, with cat_labels_threshold=0.02, a label that accounts for less than 2% of a column's values is converted to "other".

  • scaling (str, default="quantile") –

    The method used to scale numerical features:

    • "none" : No scaling is applied.

    • "normalize" : Normalizes numerical features to the [0, 1] range.

    • "standardize" : Standardizes numerical features to have a mean of 0 and a standard deviation of 1.

    • "quantile" : Transforms numerical features using quantile information.

    • "kbins" : Converts continuous numerical data into discrete bins. The number of bins is defined by the parameter n_bins.

  • num_fill_null (FillNullStrategy or str, default="mean") –

    Strategy or value used to fill null values in numerical features:

    • "mean" : Fills null values with the mean of the column.

    • "interpolate" : Fills null values using interpolation.

    • "forward" : Fills null values using the previous non-null value.

    • "backward" : Fills null values using the next non-null value.

    • "min" : Fills null values with the minimum value of the column.

    • "max" : Fills null values with the maximum value of the column.

    • "zero" : Fills null values with zeros.

    • "one" : Fills null values with ones.

    • value : Fills null values with the specified value.

  • n_bins (int, default=0) – Number of bins used to discretize numerical features. If set to a value greater than 0 and scaling=="kbins", numerical features are discretized into the specified number of bins using quantile-based binning.

  • unseen_labels (str, default="ignore") –

    • "ignore" : If new data contains labels unseen during fit, their one-hot encoding contains 0 in every column.

    • "error" : Raises an error if new data contains labels unseen during fit.
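
The behaviour described for cat_labels_threshold and unseen_labels can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the library's preprocessing code: the helper names `group_rare_labels` and `one_hot` are hypothetical, and the example column is made up.

```python
from collections import Counter
from typing import List, Sequence


def group_rare_labels(values: Sequence[str], threshold: float = 0.02) -> List[str]:
    """Replace labels whose relative frequency is below `threshold` with "other".

    Hypothetical helper mirroring the cat_labels_threshold description.
    """
    counts = Counter(values)
    total = len(values)
    return [v if counts[v] / total >= threshold else "other" for v in values]


def one_hot(value: str, known_labels: Sequence[str], unseen_labels: str = "ignore") -> List[int]:
    """One-hot encode `value` against the labels seen during fitting."""
    if value not in known_labels:
        if unseen_labels == "error":
            raise ValueError(f"unseen label: {value!r}")
        # "ignore": an unseen label maps to an all-zero vector.
        return [0] * len(known_labels)
    return [1 if value == label else 0 for label in known_labels]


# Illustrative column: "c" accounts for 1% of the rows, below the 2% threshold.
column = ["a"] * 60 + ["b"] * 39 + ["c"]
grouped = group_rare_labels(column, threshold=0.02)
labels = sorted(set(grouped))
```

Here the single "c" row is folded into "other", and `one_hot("z", labels)` yields a vector of zeros because "z" was never seen.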

model

The Variational Autoencoder model.

Type:

TabularVAE

diffusion_model

The Diffusion Model for additional training.

Type:

TabularDiffusion

params

The model parameters.

Type:

FrozenDict

search_params

Training parameters.

Type:

Dict

architecture

The architecture configuration of the model.

Type:

Dict

hashed_architecture

A hashed string representation of the architecture.

Type:

str
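
A hashed architecture string gives a cheap way to check whether two configurations match. The hashing scheme used by the library is not described here, so the following is only a sketch that assumes a JSON-serializable architecture dict and SHA-256. The helper name `hash_architecture` and the example keys are hypothetical.

```python
import hashlib
import json


def hash_architecture(architecture: dict) -> str:
    """Return a deterministic hex digest for a JSON-serializable dict.

    Hypothetical helper: not the library's own hashing scheme.
    """
    # sort_keys makes the digest independent of key insertion order.
    canonical = json.dumps(architecture, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Example keys are illustrative only.
a = hash_architecture({"layers_size": [50], "model_type": "VAE"})
b = hash_architecture({"model_type": "VAE", "layers_size": [50]})
```

With canonical serialization, `a` and `b` are equal even though the keys were supplied in a different order.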

X: Dataset
apply(x: ndarray, y: ndarray | None = None) → Tuple

Applies the model to the input data.

Parameters:
  • x (np.ndarray) – The input data.

  • y (np.ndarray, optional) – The target data. Defaults to None.

Returns:

The model’s output.

Return type:

Tuple

architecture: Dict
decode(z: ndarray, y: ndarray | None = None) → ndarray

Decodes the latent representation back into the original space.

Parameters:
  • z (np.ndarray) – The latent representation.

  • y (np.ndarray, optional) – The target data. Defaults to None.

Returns:

The decoded data.

Return type:

np.ndarray

diffusion_model: TabularDiffusion
encode(x: ndarray, y: ndarray | None = None) → ndarray

Encodes the input data into the latent space.

Parameters:
  • x (np.ndarray) – The input data.

  • y (np.ndarray, optional) – The target data. Defaults to None.

Returns:

The encoded representation.

Return type:

np.ndarray
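
Since encode and decode share the same (data, y) calling convention, a round trip through the latent space can be written as a small helper. This is a sketch only: `reconstruct` and the stand-in class `_EchoEngine` are hypothetical, and the stand-in exists purely so the snippet runs without a trained engine.

```python
from typing import Any, Optional


def reconstruct(engine: Any, x: Any, y: Optional[Any] = None) -> Any:
    """Encode `x` into the latent space, then decode it back."""
    z = engine.encode(x, y)
    return engine.decode(z, y)


class _EchoEngine:
    """Hypothetical stand-in exposing encode/decode with the documented arguments."""

    def encode(self, x, y=None):
        return [row[:1] for row in x]        # keep only the first feature

    def decode(self, z, y=None):
        return [row + [0.0] for row in z]    # pad back to two features


x_hat = reconstruct(_EchoEngine(), [[1.0, 2.0], [3.0, 4.0]])
```

With a real engine, `x` would be the preprocessed array the engine was trained on.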

evaluate(test_ds: ndarray, y_test_ds: ndarray | None = None) → Dict

Evaluates the model on the test dataset.

Parameters:
  • test_ds (np.ndarray) – The test dataset.

  • y_test_ds (np.ndarray, optional) – The target values for the test dataset. Defaults to None.

Returns:

Evaluation metrics.

Return type:

Dict

fit(dataset: Dataset, epochs: int = 20, batch_size: int = 128, learning_rate: float = 0.01, val_ds: ndarray | None = None, y_val_ds: ndarray | None = None, patience: int = 4, diffusion_epochs: int | None = None, diffusion_batch_size: int | None = None, diffusion_learning_rate: float | None = None)

Trains the model on the provided dataset.

Parameters:
  • dataset (Dataset) – The training dataset.

  • epochs (int, optional) – The number of training epochs for the VAE. Defaults to 20.

  • batch_size (int, optional) – The batch size for VAE training. Defaults to 128.

  • learning_rate (float, optional) – The learning rate for the VAE optimizer. Defaults to 1e-2.

  • val_ds (np.ndarray, optional) – The validation dataset. Defaults to None.

  • y_val_ds (np.ndarray, optional) – The target values for the validation dataset. Defaults to None.

  • patience (int, optional) – The number of epochs to wait for improvement before stopping early. Defaults to 4.

  • diffusion_epochs (int, optional) – The number of training epochs for the diffusion model. If None, uses the VAE epochs value.

  • diffusion_batch_size (int, optional) – The batch size for diffusion model training. If None, uses the VAE batch_size value.

  • diffusion_learning_rate (float, optional) – The learning rate for the diffusion model optimizer. If None, uses the VAE learning_rate value.
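
The fallback rule for the three diffusion_* arguments (use the VAE value when None is passed) can be written out as follows. This is a minimal sketch of the documented rule; the function name `resolve_diffusion_settings` is hypothetical and is not part of the library.

```python
from typing import Optional, Tuple


def resolve_diffusion_settings(
    epochs: int = 20,
    batch_size: int = 128,
    learning_rate: float = 0.01,
    diffusion_epochs: Optional[int] = None,
    diffusion_batch_size: Optional[int] = None,
    diffusion_learning_rate: Optional[float] = None,
) -> Tuple[int, int, float]:
    """Return (epochs, batch_size, learning_rate) for the diffusion stage."""
    # Each diffusion setting falls back to its VAE counterpart when None.
    return (
        epochs if diffusion_epochs is None else diffusion_epochs,
        batch_size if diffusion_batch_size is None else diffusion_batch_size,
        learning_rate if diffusion_learning_rate is None else diffusion_learning_rate,
    )


defaults = resolve_diffusion_settings()
custom = resolve_diffusion_settings(epochs=50, diffusion_epochs=200, diffusion_learning_rate=1e-3)
```

In the second call only the batch size is inherited from the VAE settings.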

generate(dataset: Dataset | None = None, n_samples: int = 0, noise: float = 0.0, random_state: int = 42) → ndarray

Generates synthetic data from the model.

Parameters:
  • dataset (Dataset, optional) – The input data to condition the generation on. If None, random samples will be generated.

  • n_samples (int, optional) – The number of samples to generate. Defaults to 0.

  • noise (float, optional) – The amount of noise to add to the latent space. Defaults to 0.0.

  • random_state (int, optional) – The random seed for reproducibility. Defaults to 42.

Returns:

The generated synthetic data.

Return type:

np.ndarray
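
To make the roles of noise and random_state concrete, the sketch below perturbs latent vectors with seeded Gaussian noise. It assumes additive noise scaled by `noise`; how the engine actually applies noise is not specified here, and the helper name `add_latent_noise` is hypothetical.

```python
import random
from typing import List


def add_latent_noise(
    z: List[List[float]], noise: float = 0.0, random_state: int = 42
) -> List[List[float]]:
    """Add Gaussian noise scaled by `noise` to every latent coordinate."""
    # A dedicated generator keeps the result reproducible for a given seed.
    rng = random.Random(random_state)
    return [[value + noise * rng.gauss(0.0, 1.0) for value in row] for row in z]


z = [[0.5, -1.0], [2.0, 0.0]]
unchanged = add_latent_noise(z, noise=0.0)
first = add_latent_noise(z, noise=0.1, random_state=7)
second = add_latent_noise(z, noise=0.1, random_state=7)
```

With noise=0.0 the latent vectors are returned unchanged, and two calls with the same seed produce identical output.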

hashed_architecture: str
model: TabularVAE
params: FrozenDict
reconstruction_error(x: ndarray, y: ndarray | None = None) → ndarray

Computes the reconstruction error for the input data.

Parameters:
  • x (np.ndarray) – The input data.

  • y (np.ndarray, optional) – The target data. Defaults to None.

Returns:

The reconstruction error for each instance.

Return type:

np.ndarray
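
A per-instance reconstruction error is commonly used to spot rows the model reproduces poorly. The metric the engine uses is not stated here, so the sketch below assumes a row-wise mean squared error between the input and its reconstruction. The helper name `rowwise_mse` is hypothetical.

```python
from typing import List, Sequence


def rowwise_mse(
    x: Sequence[Sequence[float]], x_hat: Sequence[Sequence[float]]
) -> List[float]:
    """Mean squared error between each row and its reconstruction."""
    errors = []
    for row, row_hat in zip(x, x_hat):
        squared = [(a - b) ** 2 for a, b in zip(row, row_hat)]
        errors.append(sum(squared) / len(squared))
    return errors


x = [[1.0, 2.0], [3.0, 4.0]]
x_hat = [[1.0, 2.0], [2.0, 6.0]]
errors = rowwise_mse(x, x_hat)
# Index of the row with the largest error.
worst = max(range(len(errors)), key=errors.__getitem__)
```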

sample_from_latent_space(x: ndarray, ds: ndarray, y: ndarray | None = None, y_ds: ndarray | None = None, n_samples: int = 100) → Tuple[ndarray, ndarray]

Samples from the latent space around the given data point.

Parameters:
  • x (np.ndarray) – The data point to sample around.

  • ds (np.ndarray) – The dataset to sample from.

  • y (np.ndarray, optional) – The target values for x. Defaults to None.

  • y_ds (np.ndarray, optional) – The target values for ds. Defaults to None.

  • n_samples (int, optional) – The number of samples to generate. Defaults to 100.

Returns:

The sampled data and the corresponding indices.

Return type:

Tuple[np.ndarray, np.ndarray]
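
One plausible reading of "sampling around a data point" is a nearest-neighbour lookup among encoded dataset rows, returning both the rows and their indices. The sketch below assumes Euclidean distance on points that are already encoded; the method's actual neighbourhood definition is not given here, and the helper name `nearest_in_latent_space` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def nearest_in_latent_space(
    z_x: Sequence[float],
    z_ds: Sequence[Sequence[float]],
    n_samples: int = 100,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Return the `n_samples` encoded rows closest to `z_x` and their indices."""
    # Sort dataset indices by Euclidean distance to the query point.
    order = sorted(range(len(z_ds)), key=lambda i: math.dist(z_x, z_ds[i]))
    indices = order[:n_samples]
    return [z_ds[i] for i in indices], indices


z_query = [0.0, 0.0]
z_dataset = [[3.0, 4.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]]
neighbours, indices = nearest_in_latent_space(z_query, z_dataset, n_samples=2)
```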

save(architecture_filename: str, sd_filename: str)

Saves the model architecture and parameters to files.

Parameters:
  • architecture_filename (str) – The file path to save the model architecture.

  • sd_filename (str) – The file path to save the model parameters.

search_params: Dict

TimeSeriesEngine

class clearbox_synthetic.generation.engine.timeseries_engine.TimeSeriesEngine(dataset: Dataset, time_id: str, layers_size: Sequence[int] = [40], params: FrozenDict | None = None, train_params: Dict | None = None, num_heads: int = 4)

Bases: EngineInterface

Manages training and evaluation of a time series model using a Variational Autoencoder (VAE). It handles model initialization, training, evaluation, and saving functionalities.

model

The Variational Autoencoder model for time series data.

Type:

TimeSeriesVAE

params

The parameters of the model.

Type:

FrozenDict

search_params

The training parameters.

Type:

Dict

architecture

The architecture details of the model.

Type:

Dict

hashed_architecture

A hashed string representation of the model architecture.

Type:

str

apply(x: ndarray, y: ndarray | None = None) → Tuple

Applies the model to the input data.

Parameters:
  • x (np.ndarray) – Input data.

  • y (np.ndarray, optional) – Target data. Defaults to None.

Returns:

The model’s output.

Return type:

Tuple

decode(z: ndarray, y: ndarray | None = None) → ndarray

Decodes the latent space representation into the original space.

Parameters:
  • z (np.ndarray) – Latent space representation.

  • y (np.ndarray, optional) – Target data. Defaults to None.

Returns:

Decoded data.

Return type:

np.ndarray

encode(x: ndarray, y: ndarray | None = None) → ndarray

Encodes the input data into the latent space.

Parameters:
  • x (np.ndarray) – Input data.

  • y (np.ndarray, optional) – Target data. Defaults to None.

Returns:

Encoded data.

Return type:

np.ndarray

evaluate(test_ds: ndarray, y_test_ds: ndarray | None = None) → Dict

Evaluates the model on the test dataset.

Parameters:
  • test_ds (np.ndarray) – Test dataset.

  • y_test_ds (np.ndarray, optional) – Target values for the test dataset. Defaults to None.

Returns:

Evaluation metrics.

Return type:

Dict

fit(dataset: Dataset, epochs: int = 20, batch_size: int = 128, learning_rate: float = 0.01, val_dataset: Dataset | None = None, patience: int = 4)

Trains the model on the provided dataset.

Parameters:
  • dataset (Dataset) – The training dataset.

  • epochs (int, optional) – Number of training epochs. Defaults to 20.

  • batch_size (int, optional) – Size of each training batch. Defaults to 128.

  • learning_rate (float, optional) – Learning rate for the optimizer. Defaults to 1e-2.

  • val_dataset (Dataset, optional) – Validation dataset. Defaults to None.

  • patience (int, optional) – Number of epochs to wait before early stopping. Defaults to 4.
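
The patience argument follows the usual early-stopping pattern: training halts once the monitored loss has failed to improve for that many consecutive epochs. The sketch below assumes a strict-improvement criterion on a validation loss; the exact criterion used by the engine is not stated here, and the helper name `early_stopping_epoch` is hypothetical.

```python
from typing import Sequence


def early_stopping_epoch(val_losses: Sequence[float], patience: int = 4) -> int:
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # A strictly lower loss resets the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Patience never ran out, so every epoch was used.
    return len(val_losses)


# Made-up loss curve: the best value (0.7) is reached at epoch 3.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.75, 0.6]
stopped_at = early_stopping_epoch(losses, patience=4)
```

In this curve the later improvement at epoch 8 is never reached, which is the usual trade-off of a small patience value.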

generate(dataset: Dataset, n_samples: int = 100)

Generates synthetic time series data from the model.

Parameters:
  • dataset (Dataset) – The input dataset.

  • n_samples (int, optional) – Number of samples to generate. Defaults to 100.
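
A typical workflow calls fit and then generate on the same engine. The helper below is a sketch that relies only on the argument names shown in the signatures above; `fit_and_generate` and the stand-in class `_RecordingEngine` are hypothetical, and the stand-in records calls so the snippet runs without the library or any training.

```python
from typing import Any


def fit_and_generate(
    engine: Any, train_dataset: Any, n_samples: int = 100, epochs: int = 20
) -> Any:
    """Train `engine`, then ask it for `n_samples` synthetic series."""
    engine.fit(train_dataset, epochs=epochs)
    return engine.generate(train_dataset, n_samples=n_samples)


class _RecordingEngine:
    """Hypothetical stand-in that records calls instead of training a model."""

    def __init__(self):
        self.calls = []

    def fit(self, dataset, epochs=20, **kwargs):
        self.calls.append(("fit", epochs))

    def generate(self, dataset, n_samples=100):
        self.calls.append(("generate", n_samples))
        return [dataset] * n_samples


stub = _RecordingEngine()
synthetic = fit_and_generate(stub, "train", n_samples=3, epochs=5)
```

With a real TimeSeriesEngine, `train_dataset` would be the Dataset passed to the constructor or a compatible one.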

reconstruction_error(x: ndarray, y: ndarray | None = None) → ndarray

Computes the reconstruction error for the input data.

Parameters:
  • x (np.ndarray) – Input data.

  • y (np.ndarray, optional) – Target data. Defaults to None.

Returns:

Reconstruction error for each instance.

Return type:

np.ndarray

sample_from_latent_space(x: ndarray, ds: ndarray, y: ndarray | None = None, y_ds: ndarray | None = None, n_samples: int = 100) → Tuple[ndarray, ndarray]

Samples data from the latent space close to the given input.

Parameters:
  • x (np.ndarray) – Input data.

  • ds (np.ndarray) – Dataset to sample from.

  • y (np.ndarray, optional) – Target data for the input. Defaults to None.

  • y_ds (np.ndarray, optional) – Target data for the dataset. Defaults to None.

  • n_samples (int, optional) – Number of samples to generate. Defaults to 100.

Returns:

Sampled data and their indices.

Return type:

Tuple[np.ndarray, np.ndarray]

save(architecture_filename: str, sd_filename: str)

Saves the model architecture and parameters.

Parameters:
  • architecture_filename (str) – Filename to save the model architecture.

  • sd_filename (str) – Filename to save the state dictionary.