Data Processing#
The DatasetProcessor class takes a Dataset instance and prepares the pandas DataFrame constructed from it. It relies heavily on scikit-learn for its data processing capabilities.
Classes#
DatasetProcessor | Processes validated datasets to prepare them for model training and evaluation.
SplitData | Validates and stores train-test and feature-target split data.
DatasetProcessor#
- class mlcompare.DatasetProcessor(dataset)[source]#
Bases: object
Processes validated datasets to prepare them for model training and evaluation.
Attributes:#
dataset (DatasetType): DatasetType object containing a get_data() method and attributes needed for data processing.
- drop_columns()[source]#
Drops the columns specified with the drop parameter.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the columns specified by the drop parameter dropped.
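As a rough illustration of the operation described for drop_columns, the sketch below drops the same columns from both splits with plain pandas. The helper name `drop_from_split` and the sample data are hypothetical and are not part of mlcompare.

```python
import pandas as pd


def drop_from_split(train, test, columns):
    """Return copies of both splits without the given columns (hypothetical helper)."""
    return train.drop(columns=columns), test.drop(columns=columns)


train = pd.DataFrame({"id": [1, 2], "age": [30, 41], "label": [0, 1]})
test = pd.DataFrame({"id": [3], "age": [25], "label": [1]})

train_out, test_out = drop_from_split(train, test, ["id"])
```

`DataFrame.drop` returns a new frame, so the original splits are left untouched.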
- handle_nan(raise_exception=False)[source]#
Handles missing values in the data, including np.nan, None, "", and ".", by forward-filling (ffill), backward-filling (bfill), or dropping (drop) them, depending on the nan parameter.
Args:#
raise_exception (bool, optional): Whether to raise an exception if missing values are found. Defaults to False.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the missing values in the specified columns forward-filled, backward-filled, or dropped, or left unchanged if no method is provided for the dataset.
Raises:#
ValueError: If missing values are found and raise_exception is True.
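The three strategies can be sketched with plain pandas. This is a minimal sketch under stated assumptions, not mlcompare's implementation: the helper `handle_missing`, the `MISSING_MARKERS` list, and the order of the checks are all hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed placeholder strings that should count as missing, per the description above.
MISSING_MARKERS = ["", "."]


def handle_missing(df, method, raise_exception=False):
    """Hypothetical helper: normalize placeholders to NaN, then fill or drop."""
    cleaned = df.replace(MISSING_MARKERS, np.nan)
    if raise_exception and cleaned.isna().any().any():
        raise ValueError("Missing values found in the data.")
    if method == "ffill":
        return cleaned.ffill()
    if method == "bfill":
        return cleaned.bfill()
    if method == "drop":
        return cleaned.dropna()
    raise ValueError(f"Unknown method: {method!r}")


df = pd.DataFrame(
    {"city": ["Oslo", "", None, "Rome"], "temp": [1.0, np.nan, 3.0, 4.0]}
)

forward_filled = handle_missing(df, "ffill")
backward_filled = handle_missing(df, "bfill")
dropped = handle_missing(df, "drop")
```

Forward-filling cannot repair a missing value in the first row, and backward-filling cannot repair one in the last row, so a check after filling may still be worthwhile.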
- label_encode_column()[source]#
Applies sklearn.preprocessing.LabelEncoder to the target column.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the target column encoded.
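A minimal sketch of label-encoding a target column with scikit-learn directly. The column name `species` and the choice to fit the encoder on the training split only are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({"f": [1, 2, 3], "species": ["cat", "dog", "cat"]})
test = pd.DataFrame({"f": [4], "species": ["dog"]})

encoder = LabelEncoder()
train_encoded, test_encoded = train.copy(), test.copy()
# LabelEncoder assigns integer codes to the sorted class names: cat -> 0, dog -> 1.
train_encoded["species"] = encoder.fit_transform(train["species"])
test_encoded["species"] = encoder.transform(test["species"])
```

`encoder.inverse_transform` maps the integer codes back to the original labels.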
- max_abs_scale_columns()[source]#
Applies sklearn.preprocessing.MaxAbsScaler to the columns specified by the maxAbsScale parameter.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns scaled.
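For reference, MaxAbsScaler divides each column by its largest absolute value. The sketch below uses scikit-learn directly; the `columns` list stands in for the maxAbsScale parameter, and fitting on the training split only is an assumption rather than documented mlcompare behavior.

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

train = pd.DataFrame({"x": [-4.0, 2.0, 1.0], "y": [0.0, 1.0, 0.0]})
test = pd.DataFrame({"x": [8.0], "y": [1.0]})
columns = ["x"]  # hypothetical stand-in for the maxAbsScale parameter

scaler = MaxAbsScaler()
train_scaled, test_scaled = train.copy(), test.copy()
# Fit on the training split, then reuse the fitted scale on the test split.
train_scaled[columns] = scaler.fit_transform(train[columns])
test_scaled[columns] = scaler.transform(test[columns])
```

Because the scale comes from the training data, test values can fall outside [-1, 1], as the test value 8.0 does here.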
- min_max_scale_columns()[source]#
Applies sklearn.preprocessing.MinMaxScaler to the columns specified by the minMaxScale parameter.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns scaled.
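MinMaxScaler maps the training minimum to 0 and the training maximum to 1 by default. A minimal sketch with scikit-learn, where the `columns` list is a hypothetical stand-in for the minMaxScale parameter and fitting on the training split only is assumed:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({"x": [0.0, 5.0, 10.0]})
test = pd.DataFrame({"x": [20.0]})
columns = ["x"]  # hypothetical stand-in for the minMaxScale parameter

scaler = MinMaxScaler()
train_scaled, test_scaled = train.copy(), test.copy()
train_scaled[columns] = scaler.fit_transform(train[columns])
# The test split reuses the training minimum and range.
test_scaled[columns] = scaler.transform(test[columns])
```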
- normalize_columns()[source]#
Applies sklearn.preprocessing.Normalizer to the columns specified by the normalize parameter.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns normalized.
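Unlike the column-wise scalers, sklearn.preprocessing.Normalizer rescales each row (sample) to unit norm across the selected columns, using the L2 norm by default. The sketch below illustrates this with scikit-learn directly; the column names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import Normalizer

train = pd.DataFrame({"a": [3.0, 0.0], "b": [4.0, 2.0], "label": [0, 1]})
columns = ["a", "b"]  # hypothetical stand-in for the normalize parameter

normalizer = Normalizer()  # norm="l2" by default
train_normalized = train.copy()
# Each row of the selected columns is divided by its own Euclidean length.
train_normalized[columns] = normalizer.fit_transform(train[columns])
```

Normalizer is stateless, so `fit` learns nothing from the data and the same call applies equally to a test split.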
- one_hot_encode_columns()[source]#
Applies sklearn.preprocessing.OneHotEncoder to the columns specified by the onehotEncode parameter.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns encoded.
- ordinal_encode_columns()[source]#
Applies sklearn.preprocessing.OrdinalEncoder to the columns specified by the ordinalEncode parameter.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns encoded.
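A short sketch of ordinal encoding with scikit-learn. By default OrdinalEncoder orders categories by sorted value; the explicit `categories` argument used here is an illustrative choice to make the order meaningful, not something the documentation above states mlcompare does.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"size": ["small", "large", "medium"]})
test = pd.DataFrame({"size": ["medium"]})
columns = ["size"]  # hypothetical stand-in for the ordinalEncode parameter

# Explicit order: small -> 0, medium -> 1, large -> 2.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
train_codes = encoder.fit_transform(train[columns])
test_codes = encoder.transform(test[columns])

train_encoded = train.assign(size=train_codes[:, 0])
test_encoded = test.assign(size=test_codes[:, 0])
```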
- power_transform_columns()[source]#
Applies sklearn.preprocessing.PowerTransformer using the Yeo-Johnson method to the columns specified by the powerTransform parameter.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns transformed.
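The Yeo-Johnson transform is a monotonic power transform that reduces skew; with scikit-learn's default `standardize=True`, the output is also rescaled to zero mean and unit variance. A minimal sketch, with a hypothetical `income` column and the assumption that the transformer is fitted on the training split only:

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# A right-skewed column, sorted so that monotonicity is easy to inspect.
train = pd.DataFrame({"income": [1.0, 2.0, 2.5, 3.0, 4.0, 8.0, 20.0, 50.0]})
test = pd.DataFrame({"income": [5.0]})
columns = ["income"]  # hypothetical stand-in for the powerTransform parameter

transformer = PowerTransformer(method="yeo-johnson")
train_out, test_out = train.copy(), test.copy()
train_out[columns] = transformer.fit_transform(train[columns])
test_out[columns] = transformer.transform(test[columns])
```

The fitted exponent for each column is available afterwards as `transformer.lambdas_`.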
- process_dataset(writer, save_original=True, save_processed=True, overwrite=True)[source]#
Performs all data processing steps based on the parameters provided to DatasetProcessor. Optionally saves the original and processed data to files.
Args:#
save_directory (str | Path): The directory to save the data to.
save_original (bool): Whether to save the original data.
save_processed (bool): Whether to save the processed, nonsplit data.
Returns:#
SplitDataTuple:
- pd.DataFrame: Training split features.
- pd.DataFrame: Testing split features.
- pd.DataFrame | pd.Series: Training split target values.
- pd.DataFrame | pd.Series: Testing split target values.
- quantile_transform_columns()[source]#
Applies sklearn.preprocessing.QuantileTransformer with output_distribution="uniform" to the columns specified by the quantileTransform parameter.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns transformed.
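With a uniform output distribution, each value is replaced by its estimated quantile in the training data, so outliers are pulled into [0, 1]. A minimal sketch using scikit-learn directly; setting `n_quantiles` to the number of training rows and fitting on the training split only are assumptions made for this small example.

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
test = pd.DataFrame({"x": [2.5]})
columns = ["x"]  # hypothetical stand-in for the quantileTransform parameter

transformer = QuantileTransformer(
    n_quantiles=len(train),  # must not exceed the number of samples
    output_distribution="uniform",
    random_state=0,
)
train_out, test_out = train.copy(), test.copy()
train_out[columns] = transformer.fit_transform(train[columns])
# Test values are interpolated between the quantiles learned from the training split.
test_out[columns] = transformer.transform(test[columns])
```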
- quantile_transform_normal_columns()[source]#
Applies sklearn.preprocessing.QuantileTransformer with output_distribution="normal" to the columns specified by the quantileTransformNormal parameter.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns transformed.
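With a normal output distribution, the estimated quantiles are passed through the inverse normal CDF, so the training median maps to roughly 0 and the output is symmetric around it. A minimal sketch with scikit-learn, using a hypothetical column and an `n_quantiles` value chosen to match the tiny sample:

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
columns = ["x"]  # hypothetical stand-in for the quantileTransformNormal parameter

transformer = QuantileTransformer(
    n_quantiles=len(train),
    output_distribution="normal",
    random_state=0,
)
train_out = train.copy()
train_out[columns] = transformer.fit_transform(train[columns])
```

The smallest and largest training values map to large but finite magnitudes, because scikit-learn clips the extreme quantiles instead of returning infinities.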
- robust_scale_columns()[source]#
Applies sklearn.preprocessing.RobustScaler to the columns specified by the robustScale parameter.
Returns:#
(pd.DataFrame, pd.DataFrame): Train and test split with the specified columns scaled.
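RobustScaler centers each column on its median and divides by the interquartile range, which keeps a single outlier from dominating the scale. A minimal sketch with scikit-learn, where the `columns` list is a hypothetical stand-in for the robustScale parameter and fitting on the training split only is assumed:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
test = pd.DataFrame({"x": [5.0]})
columns = ["x"]  # hypothetical stand-in for the robustScale parameter

scaler = RobustScaler()  # median centering, 25th-75th percentile range
train_scaled, test_scaled = train.copy(), test.copy()
# Here the median is 3 and the interquartile range is 4 - 2 = 2.
train_scaled[columns] = scaler.fit_transform(train[columns])
test_scaled[columns] = scaler.transform(test[columns])
```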
- save_data(writer, file_format='parquet', file_name_ending='', overwrite=True)[source]#
Recombines the train and test splits and saves the data to a file in the specified format.
Args:#
save_directory (str | Path): Directory to save the data to.
file_format (Literal["parquet", "csv", "json", "pkl"], optional): Format to use when saving the data. Defaults to "parquet".
file_name_ending (str, optional): String to append to the end of the file name so that the data can be saved multiple times. Defaults to "".
Returns:#
Path: Path to the saved data.
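The recombine-and-save step can be sketched with pandas alone. The helper `save_combined`, the file name, and the index reset are hypothetical; CSV is used here only because it needs no optional dependency, whereas the documented default format is parquet.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_combined(train, test, save_directory, file_name):
    """Hypothetical helper: stack the splits back together and write one CSV file."""
    combined = pd.concat([train, test], ignore_index=True)
    path = Path(save_directory) / file_name
    combined.to_csv(path, index=False)
    return path


train = pd.DataFrame({"x": [1, 2], "y": [0, 1]})
test = pd.DataFrame({"x": [3], "y": [1]})

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_combined(train, test, tmp, "example-processed.csv")
    reloaded = pd.read_csv(saved_path)
```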
- split_and_save_data(writer, overwrite=True)[source]#
Splits the data and saves it to a single pickle file as a SplitData object.
Args:#
save_directory (str | Path): Directory to save the SplitData object to.
Returns:#
Path: Path to the saved SplitData object.
- split_target()[source]#
Separates the target column from the features for both the train and test data.
Returns:#
SplitDataTuple:
- pd.DataFrame: Training split features.
- pd.DataFrame: Testing split features.
- pd.DataFrame | pd.Series: Training split target values.
- pd.DataFrame | pd.Series: Testing split target values.
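Separating the target from the features amounts to the following pandas operations. The function below is a hypothetical sketch that returns values in the same order as the tuple described above; it is not the library's own code.

```python
import pandas as pd


def split_target_sketch(train, test, target):
    """Hypothetical helper returning (X_train, X_test, y_train, y_test)."""
    X_train = train.drop(columns=[target])
    X_test = test.drop(columns=[target])
    return X_train, X_test, train[target], test[target]


train = pd.DataFrame({"a": [1, 2], "b": [3, 4], "label": [0, 1]})
test = pd.DataFrame({"a": [5], "b": [6], "label": [1]})

X_train, X_test, y_train, y_test = split_target_sketch(train, test, "label")
```

Selecting a single column name yields a pd.Series. Selecting a list of column names would yield a pd.DataFrame, which matches the `pd.DataFrame | pd.Series` annotation on the target values.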
- Parameters:
dataset (DatasetType)
SplitData#
- class mlcompare.SplitData(**data)[source]#
Bases: BaseModel
Validates and stores train-test and feature-target split data.
- X_test: pd.DataFrame#
- X_train: pd.DataFrame#
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'X_test': FieldInfo(annotation=DataFrame, required=True), 'X_train': FieldInfo(annotation=DataFrame, required=True), 'y_test': FieldInfo(annotation=Union[DataFrame, Series], required=True), 'y_train': FieldInfo(annotation=Union[DataFrame, Series], required=True)}#
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- y_test: pd.DataFrame | pd.Series#
- y_train: pd.DataFrame | pd.Series#
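The field annotations and the `arbitrary_types_allowed` setting listed above suggest a Pydantic V2 model along the following lines. `SplitDataSketch` is a hypothetical stand-in written for illustration, not the library's class.

```python
from typing import Union

import pandas as pd
from pydantic import BaseModel, ConfigDict, ValidationError


class SplitDataSketch(BaseModel):
    """Hypothetical model mirroring the fields documented for SplitData."""

    # pandas types are not Pydantic-native, so instance checks must be allowed.
    model_config = ConfigDict(arbitrary_types_allowed=True)

    X_train: pd.DataFrame
    X_test: pd.DataFrame
    y_train: Union[pd.DataFrame, pd.Series]
    y_test: Union[pd.DataFrame, pd.Series]


X_train = pd.DataFrame({"a": [1, 2]})
X_test = pd.DataFrame({"a": [3]})
y_train = pd.Series([0, 1])
y_test = pd.Series([1])

split = SplitDataSketch(X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)

# A value of the wrong type is rejected by validation.
try:
    SplitDataSketch(X_train=[1, 2], X_test=X_test, y_train=y_train, y_test=y_test)
    rejected = False
except ValidationError:
    rejected = True
```

With `arbitrary_types_allowed`, Pydantic validates these fields with an `isinstance` check and does not inspect the contents of the frames.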