Datasets#
The Dataset classes primarily turn a data source into a pandas DataFrame. They also pass any user-provided parameters to the DatasetProcessor.
Classes#
| BaseDataset | Base class for datasets, containing attributes related to data cleaning and reformatting. |
| LocalDataset | Represents a locally saved dataset with all the fields required to load and prepare it for model evaluation. |
| KaggleDataset | Represents a Kaggle dataset with all the fields required to download and prepare it for model evaluation. |
| HuggingFaceDataset | |
| OpenMLDataset | |
| DatasetFactory | Creates Dataset objects such as LocalDataset, KaggleDataset, etc. |
BaseDataset#
- class mlcompare.data.BaseDataset(**data)[source]#
Base class for datasets, containing attributes related to data cleaning and reformatting.
Attributes:#
target (str): Column name for the target of the predictions.
save_name (str | None): Name to use for files saved from this dataset. Should be unique across datasets.
drop (list[str] | None): List of column names to be dropped from the dataset.
one_hot_encode (list[str] | None): List of column names to be one-hot encoded in the dataset.
- drop: list[str] | None#
- label_encode: Literal['yes'] | None#
- max_abs_scale: list[str] | None#
- min_max_scale: list[str] | None#
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- model_fields: ClassVar[Dict[str, FieldInfo]] = {
    'drop': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None),
    'label_encode': FieldInfo(annotation=Union[Literal['yes'], NoneType], required=False, default=None, alias='labelEncode', alias_priority=2),
    'max_abs_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='maxAbsScale', alias_priority=2),
    'min_max_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='minMaxScale', alias_priority=2),
    'nan': FieldInfo(annotation=Union[Literal['ffill', 'bfill', 'drop'], NoneType], required=False, default=None),
    'normalize': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None),
    'one_hot_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='oneHotEncode', alias_priority=2),
    'ordinal_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='ordinalEncode', alias_priority=2),
    'power_transform': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='powerTransform', alias_priority=2),
    'quantile_transform': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='quantileTransform', alias_priority=2),
    'quantile_transform_normal': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='quantileTransformNormal', alias_priority=2),
    'robust_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='robustScale', alias_priority=2),
    'save_name': FieldInfo(annotation=Union[str, NoneType], required=False, default=None, alias='saveName', alias_priority=2),
    'standard_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='standardScale', alias_priority=2),
    'target': FieldInfo(annotation=str, required=True),
    'target_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='targetEncode', alias_priority=2),
}#
Metadata about the fields defined on the model, mapping of field names to FieldInfo (pydantic.fields.FieldInfo) objects.
This replaces Model.__fields__ from Pydantic V1.
- abstract model_post_init(Any)[source]#
Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.
- Return type: None
- nan: Literal['ffill', 'bfill', 'drop'] | None#
- normalize: list[str] | None#
- one_hot_encode: list[str] | None#
- ordinal_encode: list[str] | None#
- power_transform: list[str] | None#
- quantile_transform: list[str] | None#
- quantile_transform_normal: list[str] | None#
- robust_scale: list[str] | None#
- save_name: str | None#
- standard_scale: list[str] | None#
- target: str#
- target_encode: list[str] | None#
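The field metadata for BaseDataset shows that most multi-word fields carry a camelCase alias, for example `one_hot_encode` is aliased as `oneHotEncode` and `save_name` as `saveName`. The sketch below only illustrates that naming relationship using the standard library. It is not part of mlcompare, and `to_snake_case` and `normalize_keys` are hypothetical helper names, as are the example column names.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a camelCase key such as 'oneHotEncode' to 'one_hot_encode'."""
    # Insert an underscore before every capital that follows a lowercase
    # letter or digit, then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def normalize_keys(params: dict) -> dict:
    """Return a copy of params with every key converted to snake_case."""
    return {to_snake_case(key): value for key, value in params.items()}


# Hypothetical parameters written with the camelCase aliases.
params = {
    "target": "price",
    "saveName": "housing",
    "oneHotEncode": ["city"],
    "quantileTransformNormal": ["area"],
}

print(normalize_keys(params))
```

Keys that are already lowercase, such as `target`, `drop`, `nan`, and `normalize`, pass through unchanged, which matches the fields that have no alias in the metadata.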
LocalDataset#
- class mlcompare.data.LocalDataset(**data)[source]#
Bases: BaseDataset
Represents a locally saved dataset with all the fields required to load and prepare it for model evaluation.
Attributes:#
file_path (str | Path): Path to the local dataset file.
target (str): Column name for the target of the predictions.
save_name (str | None): Name to use for files saved from this dataset. Should be unique across datasets. If None, the file will be saved with the same name as the original file.
drop (list[str] | None): List of column names to be dropped from the dataset.
one_hot_encode (list[str] | None): List of column names to be one-hot encoded in the dataset.
- file_path: str | Path#
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- model_fields: ClassVar[Dict[str, FieldInfo]] = {
    'drop': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None),
    'file_path': FieldInfo(annotation=Union[str, Path], required=True, alias='path', alias_priority=2),
    'label_encode': FieldInfo(annotation=Union[Literal['yes'], NoneType], required=False, default=None, alias='labelEncode', alias_priority=2),
    'max_abs_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='maxAbsScale', alias_priority=2),
    'min_max_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='minMaxScale', alias_priority=2),
    'nan': FieldInfo(annotation=Union[Literal['ffill', 'bfill', 'drop'], NoneType], required=False, default=None),
    'normalize': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None),
    'one_hot_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='oneHotEncode', alias_priority=2),
    'ordinal_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='ordinalEncode', alias_priority=2),
    'power_transform': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='powerTransform', alias_priority=2),
    'quantile_transform': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='quantileTransform', alias_priority=2),
    'quantile_transform_normal': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='quantileTransformNormal', alias_priority=2),
    'robust_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='robustScale', alias_priority=2),
    'save_name': FieldInfo(annotation=Union[str, NoneType], required=False, default=None, alias='saveName', alias_priority=2),
    'standard_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='standardScale', alias_priority=2),
    'target': FieldInfo(annotation=str, required=True),
    'target_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='targetEncode', alias_priority=2),
}#
Metadata about the fields defined on the model, mapping of field names to FieldInfo (pydantic.fields.FieldInfo) objects.
This replaces Model.__fields__ from Pydantic V1.
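The LocalDataset attributes state that when `save_name` is None, the file is saved under the same name as the original file. A minimal sketch of that fallback rule follows. `resolve_save_name` is a hypothetical helper, not an mlcompare function, and whether the library keeps or strips the file extension is an assumption here.

```python
from pathlib import Path


def resolve_save_name(file_path, save_name=None):
    """Return save_name if given, otherwise fall back to the source file's name."""
    if save_name is not None:
        return save_name
    # Assumption: the fallback keeps the full file name, including its extension.
    return Path(file_path).name


print(resolve_save_name("data/housing.csv"))             # falls back to the file name
print(resolve_save_name("data/housing.csv", "housing"))  # explicit name wins
```

Because the fallback is derived from the file name alone, two local datasets with the same file name in different directories would collide, which is one reason the attribute description asks for unique `save_name` values.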
KaggleDataset#
- class mlcompare.data.KaggleDataset(**data)[source]#
Bases: BaseDataset
Represents a Kaggle dataset with all the fields required to download and prepare it for model evaluation.
Attributes:#
user (str): Username of the Kaggle user who owns the dataset.
dataset (str): Name of the Kaggle dataset.
file (str): Name of the file to be downloaded from the dataset.
target (str): Column name for the target of the predictions.
save_name (str | None): Name to use for files saved from this dataset. Should be unique across datasets. If None, the file will be named user_dataset.
drop (list[str] | None): List of column names to be dropped from the dataset.
one_hot_encode (list[str] | None): List of column names to be one-hot encoded in the dataset.
- dataset: str#
- file: str#
- get_data()[source]#
Downloads a Kaggle dataset. Currently only implemented for CSV files.
- Return type: pd.DataFrame
Returns:#
pd.DataFrame: Downloaded data as a Pandas DataFrame.
Raises:#
ConnectionError: If unable to authenticate with Kaggle.
ValueError: If there are no Kaggle dataset files for the provided user and dataset names.
ValueError: If the file name provided doesn’t match any of the files in the matched dataset.
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- model_fields: ClassVar[Dict[str, FieldInfo]] = {
    'dataset': FieldInfo(annotation=str, required=True),
    'drop': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None),
    'file': FieldInfo(annotation=str, required=True),
    'label_encode': FieldInfo(annotation=Union[Literal['yes'], NoneType], required=False, default=None, alias='labelEncode', alias_priority=2),
    'max_abs_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='maxAbsScale', alias_priority=2),
    'min_max_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='minMaxScale', alias_priority=2),
    'nan': FieldInfo(annotation=Union[Literal['ffill', 'bfill', 'drop'], NoneType], required=False, default=None),
    'normalize': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None),
    'one_hot_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='oneHotEncode', alias_priority=2),
    'ordinal_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='ordinalEncode', alias_priority=2),
    'power_transform': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='powerTransform', alias_priority=2),
    'quantile_transform': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='quantileTransform', alias_priority=2),
    'quantile_transform_normal': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='quantileTransformNormal', alias_priority=2),
    'robust_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='robustScale', alias_priority=2),
    'save_name': FieldInfo(annotation=Union[str, NoneType], required=False, default=None, alias='saveName', alias_priority=2),
    'standard_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='standardScale', alias_priority=2),
    'target': FieldInfo(annotation=str, required=True),
    'target_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='targetEncode', alias_priority=2),
    'user': FieldInfo(annotation=str, required=True),
}#
Metadata about the fields defined on the model, mapping of field names to FieldInfo (pydantic.fields.FieldInfo) objects.
This replaces Model.__fields__ from Pydantic V1.
- model_post_init(Any)[source]#
Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.
- Return type: None
- user: str#
HuggingFaceDataset#
- class mlcompare.data.HuggingFaceDataset(**data)[source]#
Bases: BaseDataset
- file: str#
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- model_fields: ClassVar[Dict[str, FieldInfo]] = {
    'drop': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None),
    'file': FieldInfo(annotation=str, required=True),
    'label_encode': FieldInfo(annotation=Union[Literal['yes'], NoneType], required=False, default=None, alias='labelEncode', alias_priority=2),
    'max_abs_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='maxAbsScale', alias_priority=2),
    'min_max_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='minMaxScale', alias_priority=2),
    'nan': FieldInfo(annotation=Union[Literal['ffill', 'bfill', 'drop'], NoneType], required=False, default=None),
    'normalize': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None),
    'one_hot_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='oneHotEncode', alias_priority=2),
    'ordinal_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='ordinalEncode', alias_priority=2),
    'power_transform': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='powerTransform', alias_priority=2),
    'quantile_transform': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='quantileTransform', alias_priority=2),
    'quantile_transform_normal': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='quantileTransformNormal', alias_priority=2),
    'repo': FieldInfo(annotation=str, required=True),
    'robust_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='robustScale', alias_priority=2),
    'save_name': FieldInfo(annotation=Union[str, NoneType], required=False, default=None, alias='saveName', alias_priority=2),
    'standard_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='standardScale', alias_priority=2),
    'target': FieldInfo(annotation=str, required=True),
    'target_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='targetEncode', alias_priority=2),
}#
Metadata about the fields defined on the model, mapping of field names to FieldInfo (pydantic.fields.FieldInfo) objects.
This replaces Model.__fields__ from Pydantic V1.
- model_post_init(Any)[source]#
Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.
- Return type: None
- repo: str#
OpenMLDataset#
- class mlcompare.data.OpenMLDataset(**data)[source]#
Bases: BaseDataset
- id: int | str#
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- model_fields: ClassVar[Dict[str, FieldInfo]] = {
    'drop': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None),
    'id': FieldInfo(annotation=Union[int, str], required=True),
    'label_encode': FieldInfo(annotation=Union[Literal['yes'], NoneType], required=False, default=None, alias='labelEncode', alias_priority=2),
    'max_abs_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='maxAbsScale', alias_priority=2),
    'min_max_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='minMaxScale', alias_priority=2),
    'nan': FieldInfo(annotation=Union[Literal['ffill', 'bfill', 'drop'], NoneType], required=False, default=None),
    'normalize': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None),
    'one_hot_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='oneHotEncode', alias_priority=2),
    'ordinal_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='ordinalEncode', alias_priority=2),
    'power_transform': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='powerTransform', alias_priority=2),
    'quantile_transform': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='quantileTransform', alias_priority=2),
    'quantile_transform_normal': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='quantileTransformNormal', alias_priority=2),
    'robust_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='robustScale', alias_priority=2),
    'save_name': FieldInfo(annotation=Union[str, NoneType], required=False, default=None, alias='saveName', alias_priority=2),
    'standard_scale': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='standardScale', alias_priority=2),
    'target': FieldInfo(annotation=str, required=True),
    'target_encode': FieldInfo(annotation=Union[list[str], NoneType], required=False, default=None, alias='targetEncode', alias_priority=2),
}#
Metadata about the fields defined on the model, mapping of field names to FieldInfo (pydantic.fields.FieldInfo) objects.
This replaces Model.__fields__ from Pydantic V1.
DatasetFactory#
- class mlcompare.DatasetFactory(params_list)[source]#
Bases: object
Creates Dataset objects such as LocalDataset, KaggleDataset, etc. from a list of dictionaries.
Attributes:#
params_list (list[dict[str, Any]] | Path): List of dictionaries containing dataset parameters, or a path to a .json file containing such a list. The keys required in each dictionary are listed below:
- Required keys for all dataset types:
dataset_type (Literal[“kaggle”, “local”]): Type of dataset. Accepts ‘kaggle’ or ‘local’.
target (str): Name of the target column in the dataset.
- Additional required keys for ‘local’ datasets:
file_path (str | Path): Path to the local dataset file. It can be relative or absolute.
- Additional required keys for ‘kaggle’ datasets:
user (str): Kaggle username of the dataset owner.
dataset (str): Name of the Kaggle dataset.
file (str): Name of the file to download from the dataset.
- Optional keys:
save_name (str): Name to use for files saved from this dataset. Should be unique across datasets.
drop (list[str]): List of column names to drop from the downloaded data.
one_hot_encode (list[str]): List of column names to be one-hot encoded.
Raises:#
AssertionError: If params_list is not a list of dictionaries or a path to a .json file containing one.
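The required keys listed for DatasetFactory can be checked before any dataset is built. The following is a minimal sketch of such a pre-check, covering only the ‘local’ and ‘kaggle’ types named in the key list. `REQUIRED_KEYS` and `missing_keys` are hypothetical names and are not part of mlcompare, and the example values are made up.

```python
# Required keys per dataset type, as listed in the DatasetFactory attributes.
REQUIRED_KEYS = {
    "local": {"dataset_type", "target", "file_path"},
    "kaggle": {"dataset_type", "target", "user", "dataset", "file"},
}


def missing_keys(params: dict) -> set:
    """Return the required keys that are absent from one parameter dictionary."""
    dataset_type = params.get("dataset_type")
    if dataset_type not in REQUIRED_KEYS:
        raise ValueError(f"Unknown dataset type: {dataset_type!r}")
    return REQUIRED_KEYS[dataset_type] - set(params)


# Hypothetical parameter list; the second entry lacks 'file'.
params_list = [
    {"dataset_type": "local", "file_path": "data/housing.csv", "target": "price"},
    {"dataset_type": "kaggle", "user": "some_user", "dataset": "house-prices", "target": "price"},
]

for params in params_list:
    print(params["dataset_type"], sorted(missing_keys(params)))
```

Optional keys such as `save_name`, `drop`, and `one_hot_encode` are ignored by the check, so they can be added to any dictionary without affecting the result.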
- static create(type, **kwargs)[source]#
Factory method to create a dataset instance based on the dataset type.
- Return type: LocalDataset | KaggleDataset | HuggingFaceDataset | OpenMLDataset
- Parameters:
type (Literal['local', 'kaggle', 'hugging face', 'huggingface', 'huggingFace', 'openml'])
Args:#
type (Literal[“local”, “kaggle”, “hugging face”, “huggingface”, “huggingFace”, “openml”]): The type of dataset to create.
**kwargs: Keyword arguments passed to the dataset class constructor.
Returns:#
BaseDataset: An instance of a dataset class (LocalDataset, KaggleDataset, HuggingFaceDataset, or OpenMLDataset).
Raises:#
ValueError: If an unknown dataset type is provided.
- Parameters:
params_list (ParamsInput)
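The `create` method described above maps a type string to a dataset class and raises ValueError for anything else. A minimal sketch of that dispatch pattern is shown here with placeholder classes so it runs without mlcompare installed. The `*StandIn` classes and `DATASET_CLASSES` are hypothetical, and the real method may be implemented differently.

```python
class _StandIn:
    """Placeholder for a dataset class; it only stores its keyword arguments."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class LocalStandIn(_StandIn):
    pass


class KaggleStandIn(_StandIn):
    pass


class HuggingFaceStandIn(_StandIn):
    pass


class OpenMLStandIn(_StandIn):
    pass


# Accepted type strings, taken from the documented Literal for `type`.
DATASET_CLASSES = {
    "local": LocalStandIn,
    "kaggle": KaggleStandIn,
    "hugging face": HuggingFaceStandIn,
    "huggingface": HuggingFaceStandIn,
    "huggingFace": HuggingFaceStandIn,
    "openml": OpenMLStandIn,
}


def create(type: str, **kwargs):
    """Build the stand-in class registered for `type`, or raise ValueError.

    The parameter name shadows the builtin `type` to mirror the documented
    signature.
    """
    try:
        dataset_class = DATASET_CLASSES[type]
    except KeyError:
        raise ValueError(f"Unknown dataset type: {type!r}") from None
    return dataset_class(**kwargs)


dataset = create("huggingFace", repo="some/repo", file="train.csv", target="label")
print(dataset.__class__.__name__, dataset.kwargs)
```

A dictionary lookup keeps the spelling variants for Hugging Face in one place, so adding another accepted spelling or dataset type is a one-line change.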