Data Processing#

The DatasetProcessor class takes a Dataset instance and prepares the pandas DataFrame constructed from it. It relies heavily on scikit-learn for its data processing.

Classes#

dataset_processor.DatasetProcessor(dataset)

Processes validated datasets to prepare them for model training and evaluation.

split_data.SplitData(**data)

Validates and stores train-test and feature-target split data.

DatasetProcessor#

class mlcompare.DatasetProcessor(dataset)[source]#

Bases: object

Processes validated datasets to prepare them for model training and evaluation.

Attributes:#

dataset (DatasetType): DatasetType object containing a get_data() method and attributes needed for data processing.

drop_columns()[source]#

Drops the columns specified with the drop parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the columns specified by the drop parameter dropped.
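
As an illustration of the operation described above, here is a minimal pandas sketch. It is not the library's implementation; the frames, the column names, and the drop list are hypothetical.

```python
import pandas as pd

# Hypothetical train/test split; the column names are illustrative only.
train_df = pd.DataFrame({"id": [1, 2, 3], "age": [31, 45, 27], "label": [0, 1, 0]})
test_df = pd.DataFrame({"id": [4, 5], "age": [52, 38], "label": [1, 0]})

# Stand-in for the dataset's drop parameter.
drop = ["id"]

# Drop the same columns from both splits so their schemas stay aligned.
train_df = train_df.drop(columns=drop)
test_df = test_df.drop(columns=drop)

print(list(train_df.columns))
```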

handle_nan(raise_exception=False)[source]#

Handles missing values in the data, including np.nan, None, “”, and “.”, by forward-filling (ffill), backward-filling (bfill), or dropping (drop) them, depending on the nan parameter.

Return type:

tuple[DataFrame, DataFrame]

Parameters:

raise_exception (bool)

Args:#

raise_exception (bool, optional): Whether to raise an exception if missing values are found. Defaults to False.

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the missing values in the specified columns forward-filled, backward-filled, or dropped, or left unchanged if no method is provided for the dataset.

Raises:#

ValueError: If missing values are found and raise_exception is True.
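
The three strategies can be sketched with plain pandas. The helper fill_missing below is hypothetical and only mirrors the behaviour described above (the marker values, the ffill/bfill/drop options, and the optional ValueError); it is not the library's code.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame, method: str, raise_exception: bool = False) -> pd.DataFrame:
    """Hypothetical helper, not part of mlcompare."""
    # Treat "" and "." as missing, alongside np.nan and None.
    cleaned = df.replace(["", "."], np.nan)
    if raise_exception and cleaned.isna().any().any():
        raise ValueError("Missing values found.")
    if method == "ffill":
        return cleaned.ffill()
    if method == "bfill":
        return cleaned.bfill()
    if method == "drop":
        return cleaned.dropna()
    raise ValueError(f"Unknown method: {method!r}")


df = pd.DataFrame({
    "temp": [20.5, None, 22.0, np.nan],
    "city": ["Oslo", "", ".", "Rome"],
})
filled = fill_missing(df, "ffill")   # every gap takes the previous row's value
dropped = fill_missing(df, "drop")   # only rows without any gap survive
```

Note that backward-filling this frame would leave the last temp value missing, because no later row exists to fill it from.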

label_encode_column()[source]#

Applies sklearn.preprocessing.LabelEncoder to the target column.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the target column encoded.
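
A small sketch of label encoding with scikit-learn directly. It assumes the encoder is fit on the training split and reused for the test split, which the source does not state; the data and column names are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_df = pd.DataFrame({
    "petal": [1.4, 4.7, 5.9],
    "species": ["setosa", "versicolor", "virginica"],
})
test_df = pd.DataFrame({"petal": [4.5, 1.3], "species": ["versicolor", "setosa"]})
target = "species"  # hypothetical target column name

# Classes are sorted alphabetically and mapped to 0..n_classes-1.
encoder = LabelEncoder()
train_df[target] = encoder.fit_transform(train_df[target])
test_df[target] = encoder.transform(test_df[target])
```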

max_abs_scale_columns()[source]#

Applies sklearn.preprocessing.MaxAbsScaler to the columns specified by the maxAbsScale parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns scaled.
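
To show what MaxAbsScaler does to a column, the sketch below divides by the largest absolute value seen in the training data. Fitting on the training split only is an assumption here, and the frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

train_df = pd.DataFrame({"x": [-4.0, 2.0, 1.0], "y": [0, 1, 0]})
test_df = pd.DataFrame({"x": [8.0], "y": [1]})
cols = ["x"]  # stand-in for the maxAbsScale parameter

scaler = MaxAbsScaler()
train_df[cols] = scaler.fit_transform(train_df[cols])  # max |x| in train is 4
test_df[cols] = scaler.transform(test_df[cols])        # reuses the train maximum
```

Test values larger than the training maximum end up outside [-1, 1], as the single test row does here.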

min_max_scale_columns()[source]#

Applies sklearn.preprocessing.MinMaxScaler to the columns specified by the minMaxScale parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns scaled.
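
A minimal min-max sketch, again assuming the scaler is fit on the training split and then applied to both splits. The column name and values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train_df = pd.DataFrame({"x": [10.0, 20.0, 30.0]})
test_df = pd.DataFrame({"x": [40.0, 15.0]})
cols = ["x"]  # stand-in for the minMaxScale parameter

# Maps the training minimum to 0 and the training maximum to 1.
scaler = MinMaxScaler()
train_df[cols] = scaler.fit_transform(train_df[cols])
test_df[cols] = scaler.transform(test_df[cols])
```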

normalize_columns()[source]#

Applies sklearn.preprocessing.Normalizer to the columns specified by the normalize parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns normalized.
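
Unlike the scalers, sklearn's Normalizer rescales each row (sample) to unit norm rather than each column. The sketch below uses hypothetical data to make that visible.

```python
import pandas as pd
from sklearn.preprocessing import Normalizer

train_df = pd.DataFrame({"a": [3.0, 0.0], "b": [4.0, 5.0]})
test_df = pd.DataFrame({"a": [6.0], "b": [8.0]})
cols = ["a", "b"]  # stand-in for the normalize parameter

# Normalizer is stateless: fit() learns nothing, each row is divided by its own L2 norm.
normalizer = Normalizer(norm="l2")
train_df[cols] = normalizer.fit_transform(train_df[cols])
test_df[cols] = normalizer.transform(test_df[cols])
```

The rows (3, 4) and (6, 8) both map to (0.6, 0.8), since only the direction of each row is kept.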

one_hot_encode_columns()[source]#

Applies sklearn.preprocessing.OneHotEncoder to the columns specified by the onehotEncode parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns encoded.

ordinal_encode_columns()[source]#

Applies sklearn.preprocessing.OrdinalEncoder to the columns specified by the ordinalEncode parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns encoded.
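
A short ordinal-encoding sketch. Passing an explicit category order is a choice made for this example (by default OrdinalEncoder sorts categories); whether the library does the same is not stated in the source.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_df = pd.DataFrame({"priority": ["low", "high", "medium"]})
test_df = pd.DataFrame({"priority": ["medium", "low"]})
cols = ["priority"]  # stand-in for the ordinalEncode parameter

# Explicit order so that low < medium < high maps to 0 < 1 < 2.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
train_df["priority"] = encoder.fit_transform(train_df[cols]).ravel()
test_df["priority"] = encoder.transform(test_df[cols]).ravel()
```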

power_transform_columns()[source]#

Applies sklearn.preprocessing.PowerTransformer using the Yeo-Johnson method to the columns specified by the powerTransform parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns transformed.
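
The Yeo-Johnson transform can be tried on a skewed column as follows. The data is invented, and fitting on the training split only is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

train_df = pd.DataFrame({"income": [1.0, 2.0, 3.0, 10.0, 50.0, 200.0]})
test_df = pd.DataFrame({"income": [5.0, 100.0]})
cols = ["income"]  # stand-in for the powerTransform parameter

# standardize=True (the default) also centres and scales the transformed output.
transformer = PowerTransformer(method="yeo-johnson")
train_df[cols] = transformer.fit_transform(train_df[cols])
test_df[cols] = transformer.transform(test_df[cols])

train_values = train_df["income"].to_numpy()
```

The transform is monotonic, so the ordering of the values is preserved while the long right tail is compressed.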

process_dataset(writer, save_original=True, save_processed=True, overwrite=True)[source]#

Performs all data processing steps based on the parameters provided to DatasetProcessor. Optionally saves the original and processed data to files.

Return type:

tuple[DataFrame, DataFrame, DataFrame | Series, DataFrame | Series]

Parameters:
  • writer (ResultsWriter)

  • save_original (bool)

  • save_processed (bool)

  • overwrite (bool)

Args:#

save_directory (str | Path): The directory to save the data to.
save_original (bool): Whether to save the original data.
save_processed (bool): Whether to save the processed, unsplit data.

Returns:#

SplitDataTuple:

pd.DataFrame: Training split features.
pd.DataFrame: Testing split features.
pd.DataFrame | pd.Series: Training split target values.
pd.DataFrame | pd.Series: Testing split target values.

quantile_transform_columns()[source]#

Applies sklearn.preprocessing.QuantileTransformer with output_distribution = “uniform” to the columns specified by the quantileTransform parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns transformed.

quantile_transform_normal_columns()[source]#

Applies sklearn.preprocessing.QuantileTransformer with output_distribution = “normal” to the columns specified by the quantileTransformNormal parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns transformed.
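
The difference between the "uniform" and "normal" output distributions of QuantileTransformer can be seen on a toy column. This sketch uses scikit-learn directly with made-up data; n_quantiles is set to the sample count only because the example is tiny.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

train_df = pd.DataFrame({"x": np.arange(1.0, 11.0)})  # values 1.0 .. 10.0
cols = ["x"]
n = len(train_df)

uniform = QuantileTransformer(n_quantiles=n, output_distribution="uniform")
normal = QuantileTransformer(n_quantiles=n, output_distribution="normal")

# Uniform output lies in [0, 1]; normal output is centred on 0.
x_uniform = uniform.fit_transform(train_df[cols]).ravel()
x_normal = normal.fit_transform(train_df[cols]).ravel()
```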

robust_scale_columns()[source]#

Applies sklearn.preprocessing.RobustScaler to the columns specified by the robustScale parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns scaled.
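
RobustScaler subtracts the median and divides by the interquartile range, which limits the influence of outliers on the scaling. A minimal sketch with hypothetical data, assuming the scaler is fit on the training split:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

train_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})  # 100.0 is an outlier
test_df = pd.DataFrame({"x": [5.0]})
cols = ["x"]  # stand-in for the robustScale parameter

# Median of train is 3 and the interquartile range is 4 - 2 = 2.
scaler = RobustScaler()
train_df[cols] = scaler.fit_transform(train_df[cols])
test_df[cols] = scaler.transform(test_df[cols])
```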

save_data(writer, file_format='parquet', file_name_ending='', overwrite=True)[source]#

Recombines the train and test splits and saves the data to a file using the specified format.

Return type:

Path

Parameters:
  • writer (ResultsWriter)

  • file_format (Literal['parquet', 'csv', 'json', 'pkl'])

  • file_name_ending (str)

  • overwrite (bool)

Args:#

save_directory (str | Path): Directory to save the data to.
file_format (Literal[“parquet”, “csv”, “json”, “pkl”], optional): Format to use when saving the data. Defaults to “parquet”.
file_name_ending (str, optional): String to append to the end of the file name in order to save the data multiple times. Defaults to “”.

Returns:#

Path: Path to the saved data.
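
The recombine-then-save step might look like the following with plain pandas. The save_frame helper, the "data" file stem, and the use of a directory argument instead of a ResultsWriter are all assumptions made for this sketch.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frame(train_df, test_df, directory, file_format="csv", file_name_ending=""):
    """Hypothetical helper, not mlcompare's save_data."""
    # Stack the two splits back into one frame with a fresh index.
    combined = pd.concat([train_df, test_df], ignore_index=True)
    path = Path(directory) / f"data{file_name_ending}.{file_format}"
    writers = {
        "parquet": lambda p: combined.to_parquet(p),  # needs pyarrow or fastparquet
        "csv": lambda p: combined.to_csv(p, index=False),
        "json": lambda p: combined.to_json(p),
        "pkl": lambda p: combined.to_pickle(p),
    }
    writers[file_format](path)
    return path


train_df = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})
test_df = pd.DataFrame({"x": [4, 5], "y": [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    saved = save_frame(train_df, test_df, tmp, "csv", "-processed")
    restored = pd.read_csv(saved)
```

A distinct file_name_ending per call lets the same data be saved at several stages without overwriting earlier files.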

split_and_save_data(writer, overwrite=True)[source]#

Splits the data and saves it to a single pickle file as a SplitData object.

Return type:

Path

Parameters:
  • writer (ResultsWriter)

  • overwrite (bool)

Args:#

save_directory (str | Path): Directory to save the SplitData object to.

Returns:#

Path: Path to the saved SplitData object.
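
Saving all four pieces of a split to one pickle file can be sketched as below. A plain dict stands in for the SplitData object, and the file name is hypothetical; this only illustrates the idea of a single-file round trip.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

train_df = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})
test_df = pd.DataFrame({"x": [4, 5], "y": [1, 1]})
target = "y"

# Plain dict standing in for the SplitData object (an assumption for this sketch).
split = {
    "X_train": train_df.drop(columns=[target]),
    "X_test": test_df.drop(columns=[target]),
    "y_train": train_df[target],
    "y_test": test_df[target],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "split_data.pkl"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(split, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)
```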

split_target()[source]#

Separates the target column from the features for both the train and test data.

Return type:

tuple[DataFrame, DataFrame, DataFrame | Series, DataFrame | Series]

Returns:#

SplitDataTuple:

pd.DataFrame: Training split features.
pd.DataFrame: Testing split features.
pd.DataFrame | pd.Series: Training split target values.
pd.DataFrame | pd.Series: Testing split target values.
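
A feature-target separation that returns values in the order listed above could be written as follows. The separate_target helper and the column names are hypothetical.

```python
import pandas as pd


def separate_target(train_df: pd.DataFrame, test_df: pd.DataFrame, target: str):
    """Hypothetical helper returning (X_train, X_test, y_train, y_test)."""
    X_train = train_df.drop(columns=[target])
    X_test = test_df.drop(columns=[target])
    y_train = train_df[target]  # a Series, since a single column is selected
    y_test = test_df[target]
    return X_train, X_test, y_train, y_test


train_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "label": [0, 1, 0]})
test_df = pd.DataFrame({"a": [7], "b": [8], "label": [1]})

X_train, X_test, y_train, y_test = separate_target(train_df, test_df, "label")
```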

standard_scale_columns()[source]#

Applies sklearn.preprocessing.StandardScaler to the columns specified by the standardScale parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns scaled.
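
For comparison with the other scalers, standard scaling subtracts the training mean and divides by the training standard deviation. A sketch with invented data, assuming the scaler is fit on the training split only:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train_df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
test_df = pd.DataFrame({"x": [2.0]})
cols = ["x"]  # stand-in for the standardScale parameter

# StandardScaler uses the population standard deviation (ddof=0).
scaler = StandardScaler()
train_df[cols] = scaler.fit_transform(train_df[cols])
test_df[cols] = scaler.transform(test_df[cols])
```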

target_encode_columns()[source]#

Applies sklearn.preprocessing.TargetEncoder to the columns specified by the targetEncode parameter.

Return type:

tuple[DataFrame, DataFrame]

Returns:#

(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns encoded.

Parameters:

dataset (DatasetType)

SplitData#

class mlcompare.SplitData(**data)[source]#

Bases: BaseModel

Validates and stores train-test and feature-target split data.

Parameters:
X_test: pd.DataFrame#
X_train: pd.DataFrame#
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'X_test': FieldInfo(annotation=DataFrame, required=True), 'X_train': FieldInfo(annotation=DataFrame, required=True), 'y_test': FieldInfo(annotation=Union[DataFrame, Series], required=True), 'y_train': FieldInfo(annotation=Union[DataFrame, Series], required=True)}#

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

y_test: pd.DataFrame | pd.Series#
y_train: pd.DataFrame | pd.Series#
