Dataset

This module provides tools and classes for working with tabular datasets, including data manipulation, validation, preprocessing, and analysis. It is designed for flexibility in machine learning workflows, supporting regression and classification tasks, and ensuring dataset integrity through automated checks and validations.

class clearbox_synthetic.utils.dataset.dataset.Dataset(*, data: DataFrame, timestamp: datetime | None = None, name: str | None = None, target_column: int | str | tuple | None = None, sequence_index: int | str | None = None, group_by: int | str | None = None, column_types: Dict[str, str] | None = None, bounds: Dict | None = None, ml_task: Literal['classification', 'regression'] = 'classification')

Bases: BaseModel

A flexible class for tabular dataset manipulation.

data

A tabular dataset with more than one row.

Type:

pandas.DataFrame

timestamp

A datetime timestamp.

Type:

datetime, default=datetime.now()

name

A string name for the dataset.

Type:

str, optional

target_column

The target column (y) name.

Type:

str or int or Tuple, optional

bounds

A dictionary of allowed values for each column except the target one. For a numeric column use ‘column’: {‘max’: max_value, ‘min’: min_value}. For a categorical column use ‘column’: {allowed_value+}.

Type:

dict of dict, optional
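
As an illustration of the bounds format described above, the following sketch builds a bounds dictionary and checks a DataFrame against it with plain pandas. The helper `violations`, the data and the column names are hypothetical and not part of the library; how Dataset itself enforces bounds is not described in this reference.

```python
import pandas as pd

# Hypothetical data and bounds, following the documented format:
# numeric column -> {'min': ..., 'max': ...}, categorical column -> set of allowed values.
df = pd.DataFrame({"age": [25, 40, 130], "country": ["italy", "spain", "mars"]})
bounds = {"age": {"min": 0, "max": 120}, "country": {"italy", "spain"}}


def violations(frame: pd.DataFrame, bounds: dict) -> dict:
    """Return, per column, the row labels whose value falls outside the bounds."""
    out = {}
    for column, allowed in bounds.items():
        if isinstance(allowed, dict):  # numeric: inclusive min/max range
            bad = ~frame[column].between(allowed["min"], allowed["max"])
        else:  # categorical: set of allowed values
            bad = ~frame[column].isin(allowed)
        out[column] = frame.index[bad.to_numpy()].tolist()
    return out


print(violations(df, bounds))  # {'age': [2], 'country': [2]}
```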

ml_task

Indicates whether the dataset is used for a classification or a regression problem.

Type:

str, default “classification”

bounds: Dict | None
categorical_map() -> Dict

Return a map of the categorical feature indices and corresponding values.

Returns:

category_map – A dictionary with keys being the indices of the categorical columns and values being lists of unique values for that column.

Return type:

dict

categorical_to_ordinal() -> None

Encode every categorical column in the dataset to ordinal type. This method transforms the dataset in place.

check_duplicates(columns: List | None = None) -> int

Return number of duplicated rows in the dataset, optionally considering only certain columns.

Parameters:

columns (list, optional) – List of column names, as strings or as tuples in case of multi-level index, to check for duplicates. By default all the columns are used.

Returns:

Number of duplicated rows in the dataset.

Return type:

int
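
The same kind of count can be sketched with plain pandas. This is an illustration, not the library's implementation: it assumes the first occurrence of a row is not counted as a duplicate (the pandas default, keep='first'), and the data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": ["x", "x", "y", "z"]})

# Count rows that repeat an earlier row, considering all columns...
n_all = int(df.duplicated().sum())
# ...or considering only a subset of columns.
n_subset = int(df.duplicated(subset=["a"]).sum())

print(n_all, n_subset)  # 1 2
```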

check_na_values() -> Series | None

Check for columns with missing values in the dataset.

Returns:

A series with the number of missing values for each column that has missing values, or None if there are no missing values in the dataset.

Return type:

pandas.Series or None
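
A minimal pandas sketch of the behavior described above, using hypothetical data; it is not taken from the library source.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None], "c": [1, 2, 3]})

na_counts = df.isna().sum()
na_counts = na_counts[na_counts > 0]  # keep only columns with missing values
result = na_counts if not na_counts.empty else None

print(result)
```

Here `result` holds 1 for column 'a' and 2 for column 'b'; column 'c' is omitted because it has no missing values.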

column_bounds(column: str | Tuple) -> Dict | Set

Return the bounds of a single column of the dataset.

Parameters:

column (str or tuple of str) – Name of a column.

Returns:

Column bounds.

Return type:

dict or set

column_correlation(column: str | Tuple) -> Series

Compute the correlation between a single numeric column and every other column in the dataset.

Parameters:

column (str or tuple of str) – A string name of a single numeric column or tuple of string for a multi-level index.

Returns:

Correlation values sorted in descending order.

Return type:

pandas.Series
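
A comparable result can be sketched with pandas directly. The correlation method is not stated in this reference; the sketch assumes Pearson correlation (the pandas default) and uses hypothetical data.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})

# Correlation of 'x' with every other numeric column, largest first.
corr = df.corr()["x"].drop("x").sort_values(ascending=False)

print(corr)
```

In this example 'y' is perfectly correlated with 'x' and 'z' is perfectly anti-correlated, so 'y' comes first.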

column_types: Dict[str, str] | None
columns(include: int | str | List | None = None) -> List[str]

Return the list of column names of (a subset of) the dataset.

Parameters:

include (scalar or list-like, optional) – A selection of dtypes or strings to be included. To select all numeric types, use ‘number’. To select strings you must use the ‘object’ dtype, but note that this will return all object dtype columns. To select Pandas categorical dtypes, use ‘category’.

Returns:

Names of columns of (a subset of) the dataset.

Return type:

list

columns_number() -> int

Return the number of columns/features of the dataset.

Returns:

Number of columns of the dataset.

Return type:

int

columns_types() -> Dict

Return a dict with the column name as key and the column dtype as value.

Returns:

Columns types.

Return type:

dict

data: DataFrame
describe(include: str = 'all') -> DataFrame

Return descriptive statistics that summarize the central tendency, dispersion and shape of the dataset distribution, excluding NaN values. Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types.

Parameters:

include (str or list-like of dtypes or None, default 'all') – By default all columns of the input will be included in the output. Using a list-like of dtypes limits the results to the provided data types. To limit the result to numeric types submit ‘number’. To limit it to object columns submit ‘object’.

Returns:

Descriptive statistics of the dataset.

Return type:

pandas.DataFrame

discretize(column: str | Tuple, bins: int | List = 4, strategy: str = 'edges', quantiles: int | List[float] = 4, labels: List[str] | None = None, right: bool = True, precision: int = 3) -> None

Bin column values into discrete intervals. Supports binning into an equal number of bins, a pre-specified array of bins, or quantile-based discretization, which splits a variable into equal-sized buckets based on rank or on sample quantiles. It is useful for converting a continuous variable into a categorical one. This method transforms the dataset in place.

Parameters:
  • column (str or tuple of str) – Name of the column to bin as a str or tuple of str in case of multi-level index.

  • bins (int or list of scalars, default 4) –

    • int: defines the number of equal-width bins in the range of column. The range of column is extended by .1% on each side to include the minimum and maximum values of column.

    • list of scalars: defines the bin edges allowing for non-uniform width. No extension of the range of column is done.

  • strategy ({'edges', 'quantile'}, default 'edges') – Strategy to perform the discretization. ‘edges’ for a simple discretization into an equal number of bins or a pre-specified array of bins. ‘quantiles’ for quantile-based discretization.

  • quantiles (int or list of scalars, default 4) – Number of quantiles: 10 for deciles, 4 for quartiles, etc. Alternatively, an array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.

  • labels (list of string, optional) – String labels for the returned bins. Must be the same length as the resulting bins. If None, returns only integer indicators of the bins.

  • right (bool, default True) – Whether the bins include the rightmost edge or not. If right == True, then the bins [1, 2, 3, 4] correspond to (1,2], (2,3], (3,4].

  • precision (int, default 3) – The precision at which to store and display the bin labels.
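
The parameter descriptions above closely mirror pandas.cut and pandas.qcut, so the two strategies can be previewed with pandas alone. The sketch below is an assumption about equivalent behavior, not the library's code, and the data and labels are hypothetical.

```python
import pandas as pd

ages = pd.Series([18, 22, 35, 47, 51, 64, 70, 89], name="age")

# Equal-width binning: 4 bins over the range of the column.
by_width = pd.cut(ages, bins=4, right=True, precision=3, labels=False)

# Explicit bin edges allow non-uniform widths.
by_edges = pd.cut(ages, bins=[0, 30, 60, 90], labels=["young", "adult", "senior"])

# Quantile-based binning: 4 buckets holding the same number of rows here.
by_quantile = pd.qcut(ages, q=4, labels=False)

print(by_width.tolist())     # [0, 0, 0, 1, 1, 2, 2, 3]
print(by_edges.tolist())
print(by_quantile.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Equal-width bins follow the value range, so they can hold very different numbers of rows, while quantile bins balance the row counts at the cost of uneven widths.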

drop_columns(columns: str | List) -> None

Drop one or more columns of the dataset. This method transforms the dataset in place.

Parameters:

columns (str or list) – Column name or list of column names to drop, as str or tuple of str in case of multi-level index.

drop_duplicates(columns: List | None = None) -> None

Remove duplicate rows from the dataset, optionally considering only certain columns. This method transforms the dataset in place.

Parameters:

columns (list, optional) – List of column names as string or tuples of str in case of multi-level index to check for duplicates. By default use all the columns.

drop_na_values(axis: int = 0, how: str = 'any') -> None

Drop all the missing values in the dataset. This method transforms the dataset in place. See also fill_na_values().

Parameters:
  • axis ({0, 1}, default 0) – Axis along which to drop missing values.

  • how ({'any', 'all'}, default 'any') – Whether a row or column is removed when it contains at least one NA value ('any') or only when all of its values are NA ('all').

fill_na_values(fill_with: str | int | float | Dict, columns: List | None = None) -> str | int | float | Dict

Fill missing values in the dataset. You can choose which column(s) to fill and which value(s) to fill them with.

Parameters:
  • fill_with ({'mean', 'median'}, scalar or dict) – Value(s) used to fill the columns. Use ‘mean’ or ‘median’ to fill the missing values of a numeric column with its mean or median. Pass a specific string or scalar to fill all the missing values in the selected columns with that value. Pass a dict mapping column_name -> fill value (e.g. {‘country’: ‘italy’, ‘language’: ‘italian’}) to specify the values to use for a subset of columns.

  • columns (list, optional) – List of column names as string or tuple in case of multi-level index. If None, the behavior depends on fill_with: if it is a dictionary, the method fills the columns given by the dictionary keys; if it is ‘mean’ or ‘median’, the method fills every column containing NaN with its mean/median (an error is raised if at least one of them is an object column); otherwise the method fills all NaN values in the dataframe with the single specified value.

Returns:

The value(s) used to fill the missing values. This is useful if you chose median or mean, because the missing values in the test set must be filled with the same values.

Return type:

str, int, float or dict
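
The reason for returning the fill values can be illustrated with plain pandas: the values are computed on the training data once and then reused unchanged on the test data. This is a sketch with hypothetical data and column names, not the library's implementation.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"income": [10.0, np.nan, 30.0], "country": ["italy", None, "italy"]})
test = pd.DataFrame({"income": [np.nan, 50.0], "country": [None, "spain"]})

# Compute the fill values on the training set only...
fill_with = {"income": train["income"].mean(), "country": "italy"}

# ...and reuse exactly the same values on both sets.
train = train.fillna(value=fill_with)
test = test.fillna(value=fill_with)

print(fill_with["income"])      # 20.0
print(test["income"].tolist())  # [20.0, 50.0]
```

Filling the test set with its own mean would leak information about the test distribution, which is why the training values are carried over.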

classmethod from_csv(csv_file: str | IO, timestamp: datetime | None = None, target_column: int | str | Tuple | None = None, sequence_index: int | str | None = None, group_by: int | str | None = None, column_types: Dict[str, str] | None = None, name: str | None = None, bounds: Dict | None = None, sep: str = ',', header: str | int | List[int] = 'infer', cols_names: list | None = None, index_col: int | str | List | bool | None = None, usecols: List | None = None, dtype: str | Dict | None = None, converters: Dict | None = None, skiprows: int | None = None, nrows: int | None = None, na_values: Any = '?', skip_blank_lines: bool = True, dayfirst: bool = False, thousands: str | None = None, decimal: str = '.', ml_task: Literal['classification', 'regression'] = 'classification', drop_target_na_rows: bool = True) -> Dataset

Create a Dataset object loading the dataset from a csv file.

Parameters:
  • csv_file (string or file-like object) – The csv file path as a string or the csv file. By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.

  • timestamp (datetime, optional) – Timestamp assigned to the dataset.

  • target_column (str or int or Tuple, optional) – The y column of the dataset (Supervised Machine Learning)

  • column_types (dict, optional) – An optional dictionary that indicates for each column the data type.

  • name (string, optional) – A string name for the dataset.

  • bounds (dict of dict, optional) – A dictionary of allowed values. For an ordinal column use ‘column’: {‘max’: max_value, ‘min’: min_value}. For a categorical column use ‘column’: {allowed_value+}.

  • sep (string, default ',') – Delimiter char/string to use.

  • header (int, list of int, default ‘infer’) – Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped).

  • cols_names (list, optional) – List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed.

  • index_col (int, str, sequence of int / str, or False, optional) – Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.

  • usecols (list-like or callable, optional) – Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or [‘foo’, ‘bar’, ‘baz’]. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=[‘foo’, ‘bar’]) [[‘foo’, ‘bar’]] for columns in [‘foo’, ‘bar’] order or pd.read_csv(data, usecols=[‘foo’, ‘bar’]) [[‘bar’, ‘foo’]] for [‘bar’, ‘foo’] order. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in [‘AAA’, ‘BBB’, ‘DDD’]. Using this parameter results in much faster parsing time and lower memory usage.

  • prefix (str, optional) – Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, …

  • dtype (Type name or dict of column -> type, optional) – Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

  • converters (dict, optional) – Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

  • skiprows (int, optional) – Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

  • nrows (int, optional) – Number of rows of file to read. Useful for reading pieces of large files.

  • na_values (scalar, string, list-like, or dict, default '?') – Additional strings to recognize as NA/NaN values.

  • skip_blank_lines (bool, default True) – If True, skip over blank lines rather than interpreting as NaN values.

  • parse_dates (bool or list of int or names or list of lists or dict, default False) –

    The behavior is as follows:
    • boolean. If True -> try parsing the index.

    • list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.

    • list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.

    • dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’.

    If a column or index cannot be represented as an array of datetimes, say because of an unparseable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv. To parse an index or column with a mixture of timezones, specify date_parser to be a partially-applied pandas.to_datetime() with utc=True.

  • infer_datetime_format (bool, default False) – If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

  • keep_date_col (bool, default False) – If True and parse_dates specifies combining multiple columns then keep the original columns.

  • date_parser (function, optional) – Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.

  • dayfirst (bool, default False) – DD/MM format dates, international and European format.

  • thousands (str, optional) – Thousands separator.

  • decimal (str, default ‘.’) – Character to recognize as decimal point (e.g. use ‘,’ for European data).

  • ml_task (str, default "classification") – Indicates whether the dataset is used for a classification or a regression problem.

  • drop_target_na_rows (bool, default True) – If True and target_column is not None (labeled dataset), drop all rows with an NA value in the target column.

Returns:

A new Dataset instance.

Return type:

Dataset
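
Most of the parsing parameters above follow the wording of pandas.read_csv, so their effect can be previewed with pandas alone. The sketch below assumes the arguments behave as in pandas.read_csv; whether from_csv forwards each one unchanged is not confirmed by this reference. The CSV content and the target column name 'label' are hypothetical.

```python
import io

import pandas as pd

csv_text = "age;income;label\n25;1000,5;yes\n?;2000,0;no\n40;?;yes\n"

# sep, na_values and decimal mirror the from_csv parameters of the same name.
df = pd.read_csv(io.StringIO(csv_text), sep=";", na_values="?", decimal=",")

# Rough analogue of drop_target_na_rows=True for the target column 'label'.
df = df.dropna(subset=["label"])

print(df)
```

The '?' cells become NaN, and '1000,5' is parsed as the float 1000.5 because ',' is declared as the decimal separator.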

classmethod from_dataframe(data: DataFrame, timestamp: datetime | None = None, target_column: int | str | Tuple | None = None, sequence_index: int | str | None = None, group_by: int | str | None = None, column_types: Dict[str, str] | None = None, name: str | None = None, bounds: Dict | None = None, ml_task: Literal['classification', 'regression'] = 'classification', drop_target_na_rows: bool = True) -> Dataset

Create a Dataset object from a pandas.DataFrame.

get_group_by() -> Series

Return the group-by column of the dataset.

Returns:

The group-by column of the dataset.

Return type:

pd.Series

get_label_encoded_y() -> Series

Return the target column of the dataset (y), preprocessed with a Label Encoder and the relative labels.

Returns:

The target column (y) of the dataset.

Return type:

pd.Series

get_n_classes() -> int

Return the number of unique values in the target column (y) of the dataset.

Returns:

The number of unique values in the target column (y) of the dataset.

Return type:

int

get_normalized_y() -> Array

Standardize the target column (y) of the dataset when ml_task is 'regression'.

Returns:

The standardized target column (y)

Return type:

array of float

get_one_hot_encoded_y() -> Series

Return the target column of the dataset (y), preprocessed with a One Hot Encoder.

Returns:

The one hot encoded target column (y) of the dataset.

Return type:

pd.Series
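
The difference between label encoding and one-hot encoding of a target column can be sketched with pandas stand-ins. The encoders the library actually uses are not shown in this reference, so `pd.factorize` and `pd.get_dummies` here are assumptions chosen for illustration, and the data is hypothetical.

```python
import pandas as pd

y = pd.Series(["cat", "dog", "cat", "bird"], name="y")

# Label encoding: each class becomes an integer code (classes sorted alphabetically).
codes, classes = pd.factorize(y, sort=True)

# One-hot encoding: one indicator column per class.
one_hot = pd.get_dummies(y, dtype=int)

print(codes.tolist(), list(classes))  # [1, 2, 1, 0] ['bird', 'cat', 'dog']
print(one_hot)
```

Label encoding keeps a single column but imposes an arbitrary order on the classes; one-hot encoding avoids that order at the cost of one column per class.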

get_values()

Return the Dataset as a NumPy array/matrix.

Returns:

Numpy version of the dataset, just the values, no more column names or indices.

Return type:

numpy.ndarray

get_x() -> DataFrame | Series

Return all columns of the dataset except the target column (y).

Returns:

All columns of the dataset except the target column (y) as a pandas DataFrame.

Return type:

pd.DataFrame or pd.Series

get_x_y(n_samples=None)

Return all columns of the dataset except the target column (y), and the target column, separately.

get_y() -> Series

Return the target column of the dataset (y).

Returns:

The target column (y) of the dataset.

Return type:

pd.Series

get_y_mean() -> float

Return the mean of the target column (y) of the dataset, for regression tasks.

Returns:

The mean of the target column (y) of the dataset.

Return type:

float

get_y_std() -> float

Return the standard deviation of the target column (y) of the dataset, for regression tasks.

Returns:

The standard deviation of the target column (y) of the dataset.

Return type:

float

group_by: int | str | None
head(num_rows: int = 5) -> DataFrame

Return the first num_rows rows of the dataset. It is useful for quickly testing if your object has the right type of data in it. If num_rows is not passed, display the first 5 rows.

Parameters:

num_rows (int, optional) – Number of rows to display.

Returns:

Return the first num_rows rows of the dataset.

Return type:

pandas.DataFrame

info() -> None

Display a concise summary of the dataset: information about the pd.DataFrame including the index dtype and columns dtypes, non-null values and memory usage.

map_column(column: str | Tuple[str], dict_map: Dict) -> None

Map the values of a column according to the ‘dict_map’ correspondence, substituting each value in the column with another value. This method transforms the dataset in place. Consider using map_columns() instead.

Parameters:
  • column (str or tuple of str) – Name of the column to map as a str or tuple of str in case of multi-level index.

  • dict_map (dict) – Dictionary containing the correspondences value_to_map -> new_value.

map_columns(mapping_cols: Dict) -> None

Map the values of some columns of the dataset to new values.

Parameters:

mapping_cols (dict) – A dictionary that contains the columns to map as keys and the values_map as values
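
The shape of mapping_cols (column name -> value map) can be illustrated with a plain pandas loop. This is a sketch with hypothetical data, not the library's code; in particular, how the library treats values that are missing from a map is not documented here, whereas `Series.map` turns them into NaN.

```python
import pandas as pd

df = pd.DataFrame({"sex": ["m", "f", "m"], "smoker": ["y", "n", "n"]})

# Keys are the columns to map; values are the old_value -> new_value maps.
mapping_cols = {
    "sex": {"m": "male", "f": "female"},
    "smoker": {"y": 1, "n": 0},
}

for column, values_map in mapping_cols.items():
    df[column] = df[column].map(values_map)

print(df)
```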

ml_task: Literal['classification', 'regression']
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'validate_assignment': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

name: str | None
numerical_encoder(column: str | Tuple) -> None

Encode categorical values of a column to ordinal values. This method transforms the dataset in place.

Parameters:

column (str or tuple of str) – Name of the column to encode as a str or tuple of str in case of multi-level index.

pairwise_correlation() -> DataFrame

Compute pairwise correlation of columns, excluding NA/null values.

Returns:

Correlation matrix of the dataset.

Return type:

pandas.DataFrame

pop_column(column: str | Tuple) -> Series | DataFrame

Return a column and drop it from the dataset.

Parameters:

column (str or tuple of str) – Name of the column to be popped as a str or a tuple of str in case of multi-level index.

Returns:

The column popped out.

Return type:

pd.Series or pd.DataFrame

row_by_index(idx: int) -> Series

Return a row of the dataset given an index.

Parameters:

idx (int) – A single row index value.

Returns:

A single row of the dataset.

Return type:

pandas.Series

rows_number() -> int

Return the number of rows of the dataset.

Returns:

Number of rows of the dataset.

Return type:

int

save(path: str) -> None

Export the Dataset object as a serialized pickle file at the given file path.

Parameters:

path (str) – Filepath of the pickle file to create.

scale_numeric_columns(strategy: str = 'min-max') -> None

Scale every numeric column in the dataset. This method transforms the dataset in place.

Parameters:

strategy ({'min-max', 'standardization'}, default 'min-max') – The scaler strategy. See the scaler() documentation for further information.

scaler(column: str | Tuple[str], strategy: str = 'min-max') -> None

Scale values of a numeric column.

Parameters:
  • column (str or tuple of str) – Name of the column to scale as a str or tuple of str in case of multi-level index.

  • strategy ({'min-max', 'standard'}, default 'min-max') – The scaler strategy.

Notes

Generally, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales. Note that scaling the target values is generally not required.

There are two ways to scale the numeric values:

  • min-max (normalization): values are shifted and rescaled so that they end up ranging from 0 to 1. This is done by subtracting the minimum and dividing by the maximum minus the minimum;

  • standard: the mean value is subtracted first (so standardized values always have a zero mean), and the result is then divided by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms, but it is much less affected by outliers.
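
The two scaling formulas can be written out in a few lines of pandas. This is a sketch of the arithmetic only, with hypothetical data; the use of the population standard deviation (ddof=0) is an assumption, since this reference does not say which estimator the library uses.

```python
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max: (x - min) / (max - min), so the result lies in [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Standard: (x - mean) / std, so the result has zero mean and unit variance.
# ddof=0 (population standard deviation) is an assumption of this sketch.
standard = (values - values.mean()) / values.std(ddof=0)

print(min_max.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standard.round(3).tolist())
```

A single extreme value would compress all other min-max results toward 0, while the standardized values would shift far less, which is the outlier sensitivity mentioned above.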

sequence_index: int | str | None
classmethod set_timestamp_now(v)
shuffle(reset_index: bool = False) -> None

Shuffle the dataset rows in place.

Parameters:

reset_index (bool, default False) – If True reset the rows index after shuffling.

subset(columns: List) -> DataFrame | Series

Return a subset of the dataset given a list of column names.

Parameters:

columns (list) – List of column names as str or tuple in case of multi-level index.

Returns:

Subset Column(s) from the dataset.

Return type:

pandas.DataFrame or pandas.Series

subset_by_type(include: int | str | List) -> DataFrame

Return a subset of the dataset based on the column dtypes.

Parameters:

include (scalar or list-like) – A selection of dtypes or strings to be included. To select all numeric types, use ‘number’. To select strings you must use the ‘object’ dtype, but note that this will return all object dtype columns. To select Pandas categorical dtypes, use ‘category’.

Returns:

A subset of the dataset including the dtypes in include.

Return type:

pd.DataFrame
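
The include values described above match those accepted by pandas' DataFrame.select_dtypes, which the following sketch uses directly. That subset_by_type relies on select_dtypes is an assumption, and the data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40], "city": ["rome", "milan"], "score": [0.5, 0.9]})
df["city"] = df["city"].astype("category")

numeric = df.select_dtypes(include="number")        # int and float columns
categorical = df.select_dtypes(include="category")  # pandas categorical columns

print(numeric.columns.tolist())      # ['age', 'score']
print(categorical.columns.tolist())  # ['city']
```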

target_balance() -> DataFrame

Return a dataframe containing the number of samples and the frequency for each unique value of the target column.

Returns:

Number and frequency of samples in the dataset for each unique value of the target column.

Return type:

pd.DataFrame

target_column: int | str | tuple | None
timestamp: datetime | None
to_csv(path: str)

Generate and save a csv file starting from the dataset.

Parameters:

path (str) – The path where to save the generated csv file.

train_test_split(frac: float = 0.8, random_state: int | None = None) -> Tuple[Dataset, Dataset]

Split the instance dataset into random train and test subsets as two new Dataset instances.

Parameters:
  • frac (float, default 0.8) – Fraction of the rows assigned to the training set; the remaining rows form the test set.

  • random_state (int, optional) – Seed for the random number generator. Use it for reproducibility.

Returns:

The training and the test set as two new Dataset instances.

Return type:

tuple
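
A random split of this kind can be sketched with pandas sampling. The sketch assumes that frac is the fraction of rows that goes to the training set and that rows are drawn without replacement; neither detail is confirmed by this reference, and the data is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# Sample 80% of the rows for training; the remaining rows form the test set.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

print(len(train), len(test))  # 8 2
```

Passing the same random_state again reproduces the same split, which is what makes experiments comparable across runs.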

types_map() -> Dict

Return a map of the features and corresponding type. This is necessary to create a Pydantic model based on dataset features.

Returns:

types_map – A dictionary with keys being the column names and values being the type of that column.

Return type:

dict

unique_values(columns: List | None = None) -> Dict

Return a dictionary of unique values of (a subset of) the dataset.

Parameters:

columns (list, optional) – List of column names as string or tuple in case of multi-level index. If None, return all the unique values for every column.

Returns:

A dictionary {‘column’ -> [unique_value+]}.

Return type:

dict

classmethod validate_bounds(v, info: ValidationInfo)
classmethod validate_column_types(v, info: ValidationInfo)
classmethod validate_group_by(v, values)
classmethod validate_regression(v, info: ValidationInfo)
classmethod validate_sequence_index(v, info: ValidationInfo)
classmethod validate_target_column(v, info: ValidationInfo)
value_counts(column: int | str | Tuple) -> DataFrame

Given a column, return a DataFrame containing the number of samples and the frequency of each unique value of that column. Useful to check whether the dataset is balanced with respect to the y column.

Parameters:

column (str or tuple of str) – A string name of a single column or tuple of string for a multi-indexed column.

Returns:

Number and frequency of samples in the dataset for each unique value of the column.

Return type:

pd.DataFrame
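
A table of counts and frequencies like the one described above can be assembled with pandas as follows. The output column names 'count' and 'frequency' are hypothetical, chosen for this sketch; the names used by the library are not given in this reference.

```python
import pandas as pd

y = pd.Series(["yes", "no", "yes", "yes"], name="label")

balance = pd.DataFrame({
    "count": y.value_counts(),                    # absolute number of samples
    "frequency": y.value_counts(normalize=True),  # share of the total
})

print(balance)
```

Here 'yes' appears 3 times (frequency 0.75) and 'no' once (frequency 0.25), which already signals a mild class imbalance.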

variance(columns: List | None = None, axis: int = 0, skipna: bool = True, numeric_only: bool | None = None) -> Series

Return the unbiased variance over the requested axis of the dataset. Normalized by N-1 by default.

Parameters:
  • columns (list, optional) – List of column names as string or tuples of str in case of multi-level index to check for variance. By default use all the columns.

  • axis ({0, 1}, default 0) – 0 for index, 1 for columns

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA

  • numeric_only (bool, optional) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

Variance over the requested axis as a pandas Series sorted in descending order.

Return type:

pd.Series
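
The difference between the unbiased estimate and the population variance comes down to the divisor, which pandas exposes through the ddof argument of DataFrame.var. The sketch below uses pandas directly with hypothetical data; that the library delegates to DataFrame.var is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 10.0, 10.0, 10.0]})

# ddof=1 divides by N-1 (unbiased estimate); ddof=0 would divide by N.
unbiased = df.var(axis=0, skipna=True, ddof=1).sort_values(ascending=False)
population = df.var(axis=0, ddof=0)

print(unbiased)
print(population["a"])  # 1.25
```

For column 'a' the squared deviations sum to 5, giving 5/3 with the N-1 divisor and 1.25 with the N divisor; the constant column 'b' has zero variance either way.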

x_columns(include: int | str | List | None = None) -> List[str]

Return the list of column names of the X subset of the dataset (no target column).

Parameters:

include (scalar or list-like, optional) – A selection of dtypes or strings to be included. To select all numeric types, use ‘number’. To select strings you must use the ‘object’ dtype, but note that this will return all object dtype columns. To select Pandas categorical dtypes, use ‘category’.

Returns:

Names of columns of the X subset of the dataset (no target column).

Return type:

list