Preprocessor
The Preprocessor module is a fast and flexible data manipulation component designed for preprocessing tabular and time-series data. It supports transformations such as encoding categorical variables, handling missing values, scaling numerical features, and feature selection, preparing the real dataset for the generation process.
Key Functionalities
The Preprocessor class provides several preprocessing capabilities, including:
Handling Different Data Types
Numerical Features: Standardizes, normalizes, or discretizes numerical columns.
Categorical Features: Encodes categorical variables using one-hot encoding or embedding-based methods.
Datetime Features: Converts datetime columns into numerical representations.
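To make the three data types concrete, here is a minimal sketch using plain pandas rather than the Preprocessor class itself. The column names and the choice of "days since the earliest timestamp" as the numerical datetime representation are assumptions made for illustration; the module may use a different encoding.

```python
import pandas as pd

# Toy frame with one column of each type (illustrative data only).
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "color": ["red", "blue", "red"],
    "signup": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

# Numerical: standardize to zero mean and unit variance (population std).
age_std = (df["age"] - df["age"].mean()) / df["age"].std(ddof=0)

# Categorical: one-hot encode, producing one indicator column per category.
color_onehot = pd.get_dummies(df["color"], prefix="color")

# Datetime: convert to days elapsed since the earliest timestamp (assumed encoding).
signup_days = (df["signup"] - df["signup"].min()).dt.days
```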
Feature Engineering & Selection
Feature Selection: Identifies and removes non-informative features based on data statistics.
Binning: Discretizes numerical features into categorical bins for better processing.
Categorical Embedding Rules: Generates embedding rules for categorical features to enhance model performance.
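Feature selection and binning can be sketched with pandas as follows. The selection rule used here (drop columns with a single distinct value) and the equal-width two-bin split are assumptions chosen for illustration; the statistics and bin edges the Preprocessor actually uses are not specified in this section.

```python
import pandas as pd

df = pd.DataFrame({
    "constant": [1, 1, 1, 1],           # carries no information
    "income": [10.0, 20.0, 30.0, 40.0],
})

# Feature selection: keep only columns with more than one distinct value.
informative = [c for c in df.columns if df[c].nunique(dropna=False) > 1]
selected = df[informative]

# Binning: split income into two equal-width bins and label them.
selected = selected.assign(
    income_bin=pd.cut(selected["income"], bins=2, labels=["low", "high"])
)
```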
Missing Value Handling
Filling Missing Values: Handles missing data by replacing it with a default value (e.g., -0.001 for numerical data or NaN for categorical data).
Inference-Based Imputation: Uses similarity-based methods to infer missing values.
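Both strategies can be sketched with pandas and scikit-learn. The sentinel -0.001 comes from the text above; treating the categorical default as the literal string "NaN" is an assumption, and scikit-learn's KNNImputer is used only as a stand-in for a similarity-based method, not as the imputer the Preprocessor necessarily uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "height": [1.70, np.nan, 1.80],
    "city": ["Paris", None, "Rome"],
})

# Default-value filling: numerical sentinel and categorical placeholder.
filled = df.copy()
filled["height"] = filled["height"].fillna(-0.001)
filled["city"] = filled["city"].fillna("NaN")  # assumed placeholder string

# Similarity-based imputation (stand-in): copy the value from the nearest row.
X = np.array([[1.0, 2.0], [1.1, np.nan], [5.0, 6.0]])
imputed = KNNImputer(n_neighbors=1).fit_transform(X)
```

In the second part, the row with the missing value is closest to the first row on its observed feature, so it inherits that row's value for the missing one.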
Data Transformation
Sklearn ColumnTransformer: Applies multiple transformations in parallel for numerical, categorical, and datetime features.
Custom Transformers:
NumericalTransformer: Processes numerical values with various scaling and binning techniques.
CategoricalTransformer: Encodes categorical features.
DatetimeTransformer: Converts datetime data into useful features.
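The following sketch shows how a scikit-learn ColumnTransformer routes different column groups to different transformers and concatenates the results. StandardScaler and OneHotEncoder stand in for the module's custom transformers, whose constructors and parameters are not documented here; the column names are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "color": ["red", "blue", "red"],
})

ct = ColumnTransformer(
    transformers=[
        # Each entry: (name, transformer, columns it applies to).
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Output columns: scaled age, then one indicator per color category.
X = ct.fit_transform(df)
```

A custom transformer can take the place of either stand-in as long as it implements fit and transform.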
Time-Series Support
Supports Sequential Data: Handles time-indexed datasets and processes time-based features separately.
Normalization & Scaling: Normalizes time-series features based on computed means and standard deviations.
Fixed Sequence Lengths: Reshapes time-series data into sequences of fixed length for deep learning models.
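The normalization and reshaping steps can be sketched with NumPy. The helper name to_fixed_sequences is hypothetical, and the use of non-overlapping windows that drop any trailing remainder is an assumption; the Preprocessor may pad or use sliding windows instead.

```python
import numpy as np

def to_fixed_sequences(values, seq_len):
    """Normalize per feature, then cut into non-overlapping windows of seq_len."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant features
    normalized = (values - mean) / std

    n_seq = len(normalized) // seq_len
    trimmed = normalized[: n_seq * seq_len]  # drop trailing rows that do not fill a window
    return trimmed.reshape(n_seq, seq_len, values.shape[1]), mean, std

# 10 time steps, 2 features.
series = np.arange(20, dtype=float).reshape(10, 2)
sequences, mean, std = to_fixed_sequences(series, seq_len=4)
```

Returning the mean and standard deviation alongside the sequences keeps the statistics available for undoing the normalization later.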
Data Encoding & Decoding
Transformation (transform): Applies preprocessing steps to raw data and converts it into a machine-learning-friendly format.
Reverse Transformation (reverse_transform): Converts the processed data back to its original format, allowing interpretability.
Inverse Preprocessing (inverse_preprocessor): Reverses all applied transformations.
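The encode/decode round trip can be illustrated with a small stand-in class. TinyPreprocessor is hypothetical and is not the real Preprocessor: only the method names transform and reverse_transform are taken from the text above, while the fit method, its arguments, and the specific encodings are assumptions.

```python
import pandas as pd

class TinyPreprocessor:
    """Hypothetical stand-in illustrating transform / reverse_transform."""

    def fit(self, df, num_col, cat_col):
        self.num_col, self.cat_col = num_col, cat_col
        self.mean = df[num_col].mean()
        self.std = df[num_col].std(ddof=0) or 1.0  # fall back to 1.0 for constant columns
        self.categories = sorted(df[cat_col].unique())
        return self

    def transform(self, df):
        out = pd.DataFrame(index=df.index)
        out[self.num_col] = (df[self.num_col] - self.mean) / self.std
        codes = {c: i for i, c in enumerate(self.categories)}
        out[self.cat_col] = df[self.cat_col].map(codes)
        return out

    def reverse_transform(self, encoded):
        out = pd.DataFrame(index=encoded.index)
        out[self.num_col] = encoded[self.num_col] * self.std + self.mean
        out[self.cat_col] = encoded[self.cat_col].map(dict(enumerate(self.categories)))
        return out

raw = pd.DataFrame({"age": [20.0, 30.0, 40.0], "color": ["red", "blue", "red"]})
prep = TinyPreprocessor().fit(raw, num_col="age", cat_col="color")
encoded = prep.transform(raw)
restored = prep.reverse_transform(encoded)
```

The round trip works because every statistic needed to undo a step (mean, standard deviation, category order) is stored at fit time.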