Generative Engine
The Generative Engine provides tools for creating synthetic data using advanced machine learning models. It includes modules tailored for time series and tabular data, leveraging Variational Autoencoders (VAE) and Diffusion Models to generate high-quality, privacy-preserving synthetic datasets.
The TimeSeriesEngine is designed for time series data generation, offering features like:
Model training and evaluation.
Latent space sampling.
Reconstruction error analysis.
The TabularEngine handles tabular data, supporting both VAE and Diffusion Models. Key features include:
Tools for training and evaluation.
Sampling from latent space.
Reconstruction of mixed data types.
Configurable architectures for numerical and categorical data.
TabularEngine
- class clearbox_synthetic.generation.engine.tabular_engine.TabularEngine(dataset: Dataset, layers_size: Sequence[int] = [50], params: FrozenDict | None = None, train_params: Dict | None = None, diffusion_params: Dict | None = None, privacy_budget: float = 1.0, model_type: str = 'VAE', rules: Dict = {}, cat_labels_threshold: float = 0.02, missing_values_threshold: float = 0.999, n_bins: int = 0, scaling: Literal['none', 'normalize', 'standardize', 'quantile'] = 'quantile', num_fill_null: Literal['interpolate', 'forward', 'backward', 'min', 'max', 'mean', 'zero', 'one'] = 'none', unseen_labels='ignore')
Bases: EngineInterface
This class integrates the TabularVAE and TabularDiffusion models to enable training, evaluation, and inference for tabular datasets.
- Parameters:
dataset (Dataset) – The dataset used to initialize the generative engine.
layers_size (Sequence[int], optional, default=[50]) – The sizes of the hidden layers.
params (FrozenDict, optional, default=None) – Model parameters.
train_params (Dict, optional, default=None) – Training parameters.
diffusion_params (Dict, optional, default=None) – Diffusion model parameters.
privacy_budget (float, optional, default=1.0) – The privacy budget.
model_type (str, optional, default='VAE') – Type of model (‘VAE’ or ‘Diffusion’).
rules (Dict, optional, default={}) – Rules for embedding and transformations.
cat_labels_threshold (float, optional, default=0.02) – A float between 0 and 1 that sets the minimum frequency for keeping a label as a separate category. If a label accounts for less than cat_labels_threshold * 100% of the occurrences in a categorical column, it is grouped into a generic "other" category. For instance, if cat_labels_threshold=0.02 and a label appears in less than 2% of the rows, that label is converted to "other".
scaling (str, default="quantile") –
The method used to scale numerical features:
"none" : No scaling is applied.
"normalize" : Normalizes numerical features to the [0, 1] range.
"standardize" : Standardizes numerical features to have a mean of 0 and a standard deviation of 1.
"quantile" : Transforms numerical features using quantile information.
"kbins" : Converts continuous numerical data into discrete bins. The number of bins is defined by the parameter n_bins.
num_fill_null (FillNullStrategy or str, default="mean") –
Strategy or value used to fill null values in numerical features:
"mean" : Fills null values with the mean of the column.
"interpolate" : Fills null values using interpolation.
"forward" : Fills null values using the previous non-null value.
"backward" : Fills null values using the next non-null value.
"min" : Fills null values with the minimum value of the column.
"max" : Fills null values with the maximum value of the column.
"zero" : Fills null values with zeros.
"one" : Fills null values with ones.
value : Fills null values with the specified value.
n_bins (int, default=0) – Number of bins used to discretize numerical features. If set to a value greater than 0 and scaling == "kbins", numerical features are discretized into that many bins using quantile-based binning.
unseen_labels (str, default="ignore") –
"ignore" : If new data contains labels unseen during fit, the one-hot encoding contains 0 in every column.
"error" : Raises an error if new data contains labels unseen during fit.
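The preprocessing options described above can be illustrated with a small plain-Python sketch. It does not use clearbox_synthetic and is not the library's implementation: the helper names group_rare_labels, scale, and one_hot are hypothetical, and the sketch only mirrors the documented behaviour of cat_labels_threshold, the "normalize" and "standardize" scaling options, and unseen_labels.

```python
from collections import Counter
from statistics import mean, pstdev


def group_rare_labels(values, cat_labels_threshold=0.02, other="other"):
    """Replace labels rarer than the threshold with a generic category."""
    counts = Counter(values)
    total = len(values)
    return [
        v if counts[v] / total >= cat_labels_threshold else other
        for v in values
    ]


def scale(values, scaling="normalize"):
    """Apply the 'none', 'normalize' or 'standardize' option to one column."""
    if scaling == "none":
        return list(values)
    if scaling == "normalize":
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # constant column: avoid dividing by zero
        return [(v - lo) / span for v in values]
    if scaling == "standardize":
        mu = mean(values)
        sigma = pstdev(values) or 1.0  # constant column: avoid dividing by zero
        return [(v - mu) / sigma for v in values]
    raise ValueError(f"unsupported scaling option: {scaling!r}")


def one_hot(value, known_labels, unseen_labels="ignore"):
    """One-hot encode a label; unseen labels become all zeros or raise."""
    if value not in known_labels and unseen_labels == "error":
        raise ValueError(f"label not seen during fit: {value!r}")
    return [1 if value == label else 0 for label in known_labels]


colors = ["red"] * 60 + ["blue"] * 39 + ["teal"]  # "teal" is 1% of the rows
grouped = group_rare_labels(colors, cat_labels_threshold=0.02)
print(Counter(grouped))
print(scale([0, 5, 10], "normalize"))
print(one_hot("green", ["red", "blue"]))
```

In this example "teal" falls below the 2% threshold and is merged into "other", while the unseen label "green" is encoded as all zeros under the "ignore" setting.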
- model
The Variational Autoencoder model.
- Type:
TabularVAE
- diffusion_model
The Diffusion Model for additional training.
- Type:
TabularDiffusion
- params
The model parameters.
- Type:
FrozenDict
- search_params
Training parameters.
- Type:
Dict
- architecture
The architecture configuration of the model.
- Type:
Dict
- hashed_architecture
A hashed string representation of the architecture.
- Type:
str
- apply(x: ndarray, y: ndarray | None = None) -> Tuple
Applies the model to the input data.
- Parameters:
x (np.ndarray) – The input data.
y (np.ndarray, optional) – The target data. Defaults to None.
- Returns:
The model’s output.
- Return type:
Tuple
- architecture: Dict
- decode(z: ndarray, y: ndarray | None = None) -> ndarray
Decodes the latent representation back into the original space.
- Parameters:
z (np.ndarray) – The latent representation.
y (np.ndarray, optional) – The target data. Defaults to None.
- Returns:
The decoded data.
- Return type:
np.ndarray
- diffusion_model: TabularDiffusion
- encode(x: ndarray, y: ndarray | None = None) -> ndarray
Encodes the input data into the latent space.
- Parameters:
x (np.ndarray) – The input data.
y (np.ndarray, optional) – The target data. Defaults to None.
- Returns:
The encoded representation.
- Return type:
np.ndarray
- evaluate(test_ds: ndarray, y_test_ds: ndarray | None = None) -> Dict
Evaluates the model on the test dataset.
- Parameters:
test_ds (np.ndarray) – The test dataset.
y_test_ds (np.ndarray, optional) – The target values for the test dataset. Defaults to None.
- Returns:
Evaluation metrics.
- Return type:
Dict
- fit(dataset: Dataset, epochs: int = 20, batch_size: int = 128, learning_rate: float = 0.01, val_ds: ndarray | None = None, y_val_ds: ndarray | None = None, patience: int = 4, diffusion_epochs: int | None = None, diffusion_batch_size: int | None = None, diffusion_learning_rate: float | None = None)
Trains the model on the provided dataset.
- Parameters:
dataset (Dataset) – The training dataset.
epochs (int, optional) – The number of training epochs for the VAE. Defaults to 20.
batch_size (int, optional) – The batch size for VAE training. Defaults to 128.
learning_rate (float, optional) – The learning rate for the VAE optimizer. Defaults to 1e-2.
val_ds (np.ndarray, optional) – The validation dataset. Defaults to None.
y_val_ds (np.ndarray, optional) – The target values for the validation dataset. Defaults to None.
patience (int, optional) – The number of epochs to wait for improvement before stopping early. Defaults to 4.
diffusion_epochs (int, optional) – The number of training epochs for the diffusion model. If None, uses the VAE epochs value.
diffusion_batch_size (int, optional) – The batch size for diffusion model training. If None, uses the VAE batch_size value.
diffusion_learning_rate (float, optional) – The learning rate for the diffusion model optimizer. If None, uses the VAE learning_rate value.
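Two behaviours in the parameter list above lend themselves to a short sketch: the fallback of the diffusion settings to the VAE values when they are left as None, and early stopping controlled by patience. The functions below are hypothetical helpers written in plain Python, not part of clearbox_synthetic, and the stopping rule shown (stop after patience consecutive epochs without a new best validation loss) is an assumed, common interpretation rather than the library's confirmed criterion.

```python
def resolve_diffusion_settings(epochs, batch_size, learning_rate,
                               diffusion_epochs=None,
                               diffusion_batch_size=None,
                               diffusion_learning_rate=None):
    """Fall back to the VAE value for every diffusion setting left as None."""
    return {
        "epochs": epochs if diffusion_epochs is None else diffusion_epochs,
        "batch_size": (batch_size if diffusion_batch_size is None
                       else diffusion_batch_size),
        "learning_rate": (learning_rate if diffusion_learning_rate is None
                          else diffusion_learning_rate),
    }


def epochs_until_stop(val_losses, patience=4):
    """Return the number of epochs run before early stopping triggers.

    Assumed rule: stop once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


settings = resolve_diffusion_settings(20, 128, 0.01, diffusion_epochs=50)
print(settings)
print(epochs_until_stop([1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.5], patience=4))
```

With these losses, the best value is reached at epoch 2 and the following four epochs do not improve on it, so the sketch stops at epoch 6 and never sees the later, lower loss.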
- generate(dataset: Dataset | None = None, n_samples: int = 0, noise: float = 0.0, random_state: int = 42) -> ndarray
Generates synthetic data from the model.
- Parameters:
dataset (Dataset, optional) – The input data to condition the generation on. If None, random samples will be generated. Defaults to None.
n_samples (int, optional) – The number of samples to generate. Defaults to 0.
noise (float, optional) – The amount of noise to add to the latent space. Defaults to 0.0.
random_state (int, optional) – The random seed for reproducibility. Defaults to 42.
- Returns:
The generated synthetic data.
- Return type:
np.ndarray
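The roles of noise and random_state can be illustrated with a plain-Python sketch. How the engine applies noise internally is not stated here, so the sketch assumes additive Gaussian noise on latent vectors, scaled by noise and drawn from a generator seeded with random_state. The function perturb_latent is a hypothetical helper, not a library call.

```python
import random


def perturb_latent(latents, noise=0.0, random_state=42):
    """Add seeded Gaussian noise, scaled by `noise`, to each latent value."""
    rng = random.Random(random_state)  # same seed gives the same draws
    return [
        [z + noise * rng.gauss(0.0, 1.0) for z in row]
        for row in latents
    ]


latents = [[0.1, -0.3], [1.2, 0.4]]
unchanged = perturb_latent(latents, noise=0.0)
first = perturb_latent(latents, noise=0.5, random_state=42)
second = perturb_latent(latents, noise=0.5, random_state=42)
print(unchanged == latents, first == second)
```

Under this assumption, noise=0.0 leaves the latent vectors untouched, and repeating a call with the same random_state reproduces the same perturbed vectors.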
- hashed_architecture: str
- model: TabularVAE
- params: FrozenDict
- reconstruction_error(x: ndarray, y: ndarray | None = None) -> ndarray
Computes the reconstruction error for the input data.
- Parameters:
x (np.ndarray) – The input data.
y (np.ndarray, optional) – The target data. Defaults to None.
- Returns:
The reconstruction error for each instance.
- Return type:
np.ndarray
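As a rough illustration of a per-instance reconstruction error, the sketch below computes the mean squared error between each input row and its reconstruction. The metric actually used by the engine is not specified in this reference, so mean squared error is an assumption, and per_instance_mse is a hypothetical helper that works on plain lists rather than on the engine's model.

```python
def per_instance_mse(x, x_reconstructed):
    """Mean squared error between each row and its reconstruction."""
    return [
        sum((a - b) ** 2 for a, b in zip(row, rec)) / len(row)
        for row, rec in zip(x, x_reconstructed)
    ]


x = [[1.0, 2.0], [0.0, 0.0]]
x_reconstructed = [[1.0, 2.0], [1.0, 1.0]]
errors = per_instance_mse(x, x_reconstructed)
print(errors)
```

Rows that the model reconstructs poorly receive larger values, which is why this kind of score is often used to flag unusual instances.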
- sample_from_latent_space(x: ndarray, ds: ndarray, y: ndarray | None = None, y_ds: ndarray | None = None, n_samples: int = 100) -> Tuple[ndarray, ndarray]
Samples from the latent space around the given data point.
- Parameters:
x (np.ndarray) – The data point to sample around.
ds (np.ndarray) – The dataset to sample from.
y (np.ndarray, optional) – The target values for x. Defaults to None.
y_ds (np.ndarray, optional) – The target values for ds. Defaults to None.
n_samples (int, optional) – The number of samples to generate. Defaults to 100.
- Returns:
The sampled data and the corresponding indices.
- Return type:
Tuple[np.ndarray, np.ndarray]
- save(architecture_filename: str, sd_filename: str)
Saves the model architecture and parameters to files.
- Parameters:
architecture_filename (str) – The file path to save the model architecture.
sd_filename (str) – The file path to save the model parameters.
- search_params: Dict
TimeSeriesEngine
- class clearbox_synthetic.generation.engine.timeseries_engine.TimeSeriesEngine(dataset: Dataset, time_id: str, layers_size: Sequence[int] = [40], params: FrozenDict | None = None, train_params: Dict | None = None, num_heads: int = 4)
Bases: EngineInterface
Manages training and evaluation of a time series model using a Variational Autoencoder (VAE). It handles model initialization, training, evaluation, and saving.
- model
The Variational Autoencoder model for time series data.
- Type:
TimeSeriesVAE
- params
The parameters of the model.
- Type:
FrozenDict
- search_params
The training parameters.
- Type:
Dict
- architecture
The architecture details of the model.
- Type:
Dict
- hashed_architecture
A hashed string representation of the model architecture.
- Type:
str
- apply(x: ndarray, y: ndarray | None = None) -> Tuple
Applies the model to the input data.
- Parameters:
x (np.ndarray) – Input data.
y (np.ndarray, optional) – Target data. Defaults to None.
- Returns:
The model’s output.
- Return type:
Tuple
- decode(z: ndarray, y: ndarray | None = None) -> ndarray
Decodes the latent space representation into the original space.
- Parameters:
z (np.ndarray) – Latent space representation.
y (np.ndarray, optional) – Target data. Defaults to None.
- Returns:
Decoded data.
- Return type:
np.ndarray
- encode(x: ndarray, y: ndarray | None = None) -> ndarray
Encodes the input data into the latent space.
- Parameters:
x (np.ndarray) – Input data.
y (np.ndarray, optional) – Target data. Defaults to None.
- Returns:
Encoded data.
- Return type:
np.ndarray
- evaluate(test_ds: ndarray, y_test_ds: ndarray | None = None) -> Dict
Evaluates the model on the test dataset.
- Parameters:
test_ds (np.ndarray) – Test dataset.
y_test_ds (np.ndarray, optional) – Target values for the test dataset. Defaults to None.
- Returns:
Evaluation metrics.
- Return type:
Dict
- fit(dataset: Dataset, epochs: int = 20, batch_size: int = 128, learning_rate: float = 0.01, val_dataset: Dataset | None = None, patience: int = 4)
Trains the model on the provided dataset.
- Parameters:
dataset (Dataset) – The training dataset.
epochs (int, optional) – Number of training epochs. Defaults to 20.
batch_size (int, optional) – Size of each training batch. Defaults to 128.
learning_rate (float, optional) – Learning rate for the optimizer. Defaults to 1e-2.
val_dataset (Dataset, optional) – Validation dataset. Defaults to None.
patience (int, optional) – Number of epochs to wait before early stopping. Defaults to 4.
- generate(dataset: Dataset, n_samples: int = 100)
Generates synthetic time series data from the model.
- Parameters:
n_samples (int, optional) – Number of samples to generate. Defaults to 100.
- reconstruction_error(x: ndarray, y: ndarray | None = None) -> ndarray
Computes the reconstruction error for the input data.
- Parameters:
x (np.ndarray) – Input data.
y (np.ndarray, optional) – Target data. Defaults to None.
- Returns:
Reconstruction error for each instance.
- Return type:
np.ndarray
- sample_from_latent_space(x: ndarray, ds: ndarray, y: ndarray | None = None, y_ds: ndarray | None = None, n_samples: int = 100) -> Tuple[ndarray, ndarray]
Samples data from the latent space close to the given input.
- Parameters:
x (np.ndarray) – Input data.
ds (np.ndarray) – Dataset to sample from.
y (np.ndarray, optional) – Target data for the input. Defaults to None.
y_ds (np.ndarray, optional) – Target data for the dataset. Defaults to None.
n_samples (int, optional) – Number of samples to generate. Defaults to 100.
- Returns:
Sampled data and their indices.
- Return type:
Tuple[np.ndarray, np.ndarray]
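One way to picture sampling "close to the given input" is a nearest-neighbour lookup in latent space. The sketch below assumes that both the input and the dataset rows have already been encoded, and that the returned indices point at the dataset rows whose latent vectors are closest to the input by Euclidean distance. That reading of the return value is an assumption, and nearest_in_latent_space is a hypothetical helper, not the engine's method.

```python
import math


def nearest_in_latent_space(z_x, z_ds, n_samples=100):
    """Return the latent rows closest to z_x and their indices in z_ds."""
    # Sort dataset positions by Euclidean distance to the input's latent vector.
    order = sorted(range(len(z_ds)), key=lambda i: math.dist(z_x, z_ds[i]))
    indices = order[:n_samples]
    return [z_ds[i] for i in indices], indices


z_ds = [[0.0, 0.0], [5.0, 5.0], [1.0, 1.0], [0.5, 0.0]]
samples, indices = nearest_in_latent_space([0.0, 0.0], z_ds, n_samples=2)
print(samples, indices)
```

If n_samples exceeds the number of dataset rows, this sketch simply returns every row ordered by distance.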