API

Main modules

Classification

class autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=3600, per_run_time_limit=None, initial_configurations_via_metalearning=25, ensemble_size: int | None = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', ensemble_kwargs: Dict[str, Any] | None = None, ensemble_nbest=50, max_models_on_disc=50, seed=1, memory_limit=3072, include: Optional[Dict[str, List[str]]] = None, exclude: Optional[Dict[str, List[str]]] = None, resampling_strategy='holdout', resampling_strategy_arguments=None, tmp_folder=None, delete_tmp_folder_after_terminate=True, n_jobs: Optional[int] = None, dask_client: Optional[dask.distributed.Client] = None, disable_evaluator_output=False, get_smac_object_callback=None, smac_scenario_args=None, logging_config=None, metadata_directory=None, metric: Scorer | Sequence[Scorer] | None = None, scoring_functions: Optional[List[Scorer]] = None, load_models: bool = True, get_trials_callback: SMACCallback | None = None, dataset_compression: Union[bool, Mapping[str, Any]] = True, allow_string_features: bool = True)[source]¶

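This class implements the classification task.

Parameters

time_left_for_this_task: int, optional (default=3600)

Time limit in seconds for the search for appropriate models. Increasing this value gives auto-sklearn a higher chance of finding better models.

per_run_time_limit: int, optional (default=1/10 of time_left_for_this_task)

Time limit for a single run of a machine learning model. Model fitting is terminated if the machine learning algorithm runs over the time limit. Set this value high enough that typical machine learning algorithms can be fit on the training data.

initial_configurations_via_metalearning: int, optional (default=25)

Initialize the hyperparameter optimization algorithm with this many configurations that worked well on previously seen datasets. Disable this if the hyperparameter optimization algorithm should start from scratch.

ensemble_size: int, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. To change the ensemble size for ensemble selection, pass it via ensemble_kwargs={"ensemble_size": int}.

ensemble_class: Type[AbstractEnsemble] | "default", optional (default="default")

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to obtain only the single best model instead of an ensemble.

If set to "default", EnsembleSelection is used for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.

ensemble_kwargs: Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

ensemble_nbest: int, optional (default=50)

Only consider the ensemble_nbest models when building an ensemble. This is inspired by the concept of library pruning introduced in Getting Most out of Ensemble Selection. It is independent of the ensemble_class argument, and this pruning step is done before the ensemble is constructed.

max_models_on_disc: int, optional (default=50)

Defines the maximum number of models that are kept on disc. Additional models are permanently deleted. Because of the nature of this variable, it sets an upper limit on how many models can be used for an ensemble. It must be an integer greater than or equal to 1. If set to None, all models are kept on disc.

seed: int, optional (default=1)

Used to seed SMAC. Will determine the output file names.

memory_limit: int, optional (3072)

Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.

Important notes:

- If None is provided, no memory limit is set.
- In case of multi-processing, memory_limit is a per-job limit, so the total usage is n_jobs x memory_limit.
- The memory limit also applies to the ensemble creation process.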
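
The per-job nature of memory_limit can be made concrete with a little arithmetic. The sketch below is only an illustration: total_memory_mb is a hypothetical helper, not part of auto-sklearn, and it assumes n_jobs is a positive integer (the special value -1 is not handled).

```python
from typing import Optional


def total_memory_mb(memory_limit: Optional[int], n_jobs: int) -> Optional[int]:
    """Upper bound on total memory use in MB, or None when no limit is set."""
    if memory_limit is None:
        return None  # no per-job limit, so there is no meaningful total
    return n_jobs * memory_limit  # each job gets its own memory_limit


print(total_memory_mb(3072, 4))  # 12288
```

With the default limit of 3072 MB and four jobs, the machine should therefore have roughly 12 GB available for the search.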

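include: Optional[Dict[str, List[str]]] = None

If None, all possible algorithms are used.

Otherwise, specifies the steps and components that are included in the search. See /pipeline/components/<step>/* for available components.

Incompatible with the parameter exclude.

Possible steps:

- "data_preprocessor"
- "balancing"
- "feature_preprocessor"
- "classifier" - only available when using AutoSklearnClassifier
- "regressor" - only available when using AutoSklearnRegressor

Example: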

include = {
    'classifier': ["random_forest"],
    'feature_preprocessor': ["no_preprocessing"]
}

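exclude: Optional[Dict[str, List[str]]] = None

If None, all possible algorithms are used.

Otherwise, specifies the steps and components that are excluded from the search. See /pipeline/components/<step>/* for available components.

Incompatible with the parameter include.

Possible steps:

- "data_preprocessor"
- "balancing"
- "feature_preprocessor"
- "classifier" - only available when using AutoSklearnClassifier
- "regressor" - only available when using AutoSklearnRegressor

Example: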

exclude = {
    'classifier': ["random_forest"],
    'feature_preprocessor': ["no_preprocessing"]
}
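
Because include and exclude are documented as incompatible, it can help to check a configuration before starting a long search. The following is a minimal sketch that reads the notes above literally; check_search_space and KNOWN_STEPS are hypothetical names introduced here, and auto-sklearn's own validation may differ.

```python
from typing import Dict, List, Optional

# Step names as listed in the documentation above.
KNOWN_STEPS = {
    "data_preprocessor",
    "balancing",
    "feature_preprocessor",
    "classifier",
    "regressor",
}

ComponentSpec = Optional[Dict[str, List[str]]]


def check_search_space(include: ComponentSpec = None, exclude: ComponentSpec = None) -> None:
    """Raise ValueError for combinations the documentation rules out."""
    if include is not None and exclude is not None:
        raise ValueError("pass either include or exclude, not both")
    for name, spec in (("include", include), ("exclude", exclude)):
        # A missing spec contributes no step names.
        unknown = set(spec or {}) - KNOWN_STEPS
        if unknown:
            raise ValueError(f"unknown step(s) in {name}: {sorted(unknown)}")


check_search_space(include={"classifier": ["random_forest"]})  # passes silently
```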

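resampling_strategy: str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = "holdout"

How to handle overfitting. If using a "cv"-based method or a Splitter object, you may need to use resampling_strategy_arguments.

Options:
- "holdout" - Use a 67:33 (train:test) split
- "cv" - Perform cross-validation; requires "folds" in resampling_strategy_arguments
- "holdout-iterative-fit" - Same as "holdout" but uses iterative fitting where possible
- "cv-iterative-fit" - Same as "cv" but uses iterative fitting where possible
- "partial-cv" - Same as "cv" but uses intensification
- BaseCrossValidator - any BaseCrossValidator subclass (found in the scikit-learn model_selection module)
- _RepeatedSplits - any _RepeatedSplits subclass (found in the scikit-learn model_selection module)
- BaseShuffleSplit - any BaseShuffleSplit subclass (found in the scikit-learn model_selection module)

If using a Splitter object that relies on the dataset retaining its current size and order, look at the dataset_compression argument and make sure that "subsample" is not included in the applied compression "methods", or disable compression entirely by passing False.

resampling_strategy_arguments: Optional[Dict] = None

Additional arguments for resampling_strategy; this is required if using a cv-based strategy. If left as None, the default arguments are: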

{
    "train_size": 0.67,     # The size of the training set
    "shuffle": True,        # Whether to shuffle before splitting data
    "folds": 5              # Used in 'cv' based resampling strategies
}

If using a custom splitter class that accepts n_splits, such as PredefinedSplit, the value of "folds" is used.
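
To show how resampling_strategy and resampling_strategy_arguments fit together, here is a small sketch that selects k-fold cross-validation. cv_settings and make_cv_classifier are hypothetical helpers introduced for illustration; only the two constructor parameters and the "folds" key come from the documentation above.

```python
from typing import Any, Dict


def cv_settings(folds: int = 5) -> Dict[str, Any]:
    """Constructor keyword arguments selecting k-fold cross-validation."""
    if folds < 2:
        raise ValueError("cross-validation needs at least 2 folds")
    return {
        "resampling_strategy": "cv",
        "resampling_strategy_arguments": {"folds": folds},
    }


def make_cv_classifier(folds: int = 5):
    # Imported lazily so cv_settings can be used without auto-sklearn installed.
    from autosklearn.classification import AutoSklearnClassifier

    return AutoSklearnClassifier(**cv_settings(folds))


print(cv_settings(10))
```

As the description of refit() notes, models found with a cv-based strategy need to be refit before they can predict new data points.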

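tmp_folder: str, optional (None)

Folder to store configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.

delete_tmp_folder_after_terminate: bool, optional (True)

Remove tmp_folder when finished. If tmp_folder is None, tmp_dir will always be deleted.

n_jobs: int, optional, experimental

The number of jobs to run in parallel for fit(). -1 means using all processors.

Important notes:

- By default, Auto-sklearn uses one core.
- Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble.
- predict() is not affected by n_jobs (in contrast to most scikit-learn models).
- If dask_client is None, a new dask client is created.

dask_client: dask.distributed.Client, optional

User-created dask client; can be used to start a dask cluster and then attach auto-sklearn to it.

disable_evaluator_output: bool or list, optional (False)

If True, disable model and prediction output. Cannot be used together with ensemble building. predict() cannot be used when this option is True. This can also be a list, to pass more fine-grained information on what to save. Allowed elements in the list are:

- 'y_optimization' - do not save the predictions for the optimization set, which would later be used to build an ensemble.
- 'model' - do not save any model files.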

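smac_scenario_args: dict, optional (None)

Additional arguments inserted into the SMAC scenario. See the SMAC documentation for a list of available arguments.

get_smac_object_callback: callable

Callback function to create an object of class smac.optimizer.smbo.SMBO. The function must accept the arguments scenario_dict, instances, num_params, runhistory, seed and ta. This is an advanced feature. Use it only if you are familiar with SMAC.

logging_config: dict, optional (None)

Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which is located at util/logging.yaml relative to the installation directory.

metadata_directory: str, optional (None)

Path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.

metric: Scorer, optional (None)

An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the built-in metrics. If None is provided, a default metric is selected depending on the task.

scoring_functions: List[Scorer], optional (None)

List of scorers that will be calculated for each pipeline; the results will be available via cv_results_.

load_models: bool, optional (True)

Whether to load the models after fitting Auto-sklearn.

get_trials_callback: callable

A callable with the following definition.

(smac.SMBO, smac.RunInfo, smac.RunValue, time_left: float) -> bool | None

This is called after SMAC, the underlying optimizer of autosklearn, finishes training each run.

You can use this to record your own information about the optimization process. You can also use it to implement early stopping based on some criterion.

See the example: Early Stopping and Callbacks.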
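
A callback with the signature above might look like the following sketch. It rests on two assumptions that are not stated in this section: that returning False asks the optimizer to stop (with None meaning "carry on"), and that run_value exposes a scalar cost attribute, as SMAC's RunValue does for single-objective runs. make_early_stopping_callback is a hypothetical helper name.

```python
from types import SimpleNamespace


def make_early_stopping_callback(target_cost: float):
    """Build a callback that requests a stop once a run reaches target_cost."""

    def callback(smbo, run_info, run_value, time_left):
        # Assumption: run_value.cost is a single float (single-objective search).
        if run_value.cost <= target_cost:
            return False  # assumption: False signals "stop optimizing"
        return None  # keep going

    return callback


stop_at_2_percent = make_early_stopping_callback(0.02)

# Simulated run results standing in for smac.RunValue objects.
print(stop_at_2_percent(None, None, SimpleNamespace(cost=0.5), 60.0))   # None
print(stop_at_2_percent(None, None, SimpleNamespace(cost=0.01), 30.0))  # False
```

Such a callable would then be passed to the estimator as get_trials_callback=stop_at_2_percent.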

dataset_compression: Union[bool, Mapping[str, Any]] = True

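We compress datasets so that they fit into a predefined amount of memory. Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.

Note - If using a custom resampling_strategy that relies on a specific size or ordering of the data, this must be disabled to preserve those properties.

You can disable this entirely by passing False, or leave the default True to use the configuration below.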

{
    "memory_allocation": 0.1,
    "methods": ["precision", "subsample"]
}

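You can also pass your own configuration with the same keys, choosing from the available "methods".

The available options are described here:

memory_allocation
By default, we attempt to fit the dataset into 0.1 * memory_limit. This float value can be set with "memory_allocation": 0.1. We also allow specifying absolute memory in MB, e.g. 10MB is "memory_allocation": 10.

The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, any further methods listed in "methods" are not performed.

For example, if methods: ["precision", "subsample"] and the "precision" reduction step is enough to make the dataset fit into memory, then the "subsample" reduction step is not performed.

methods
We provide the following methods for reducing the dataset size. These can be provided in a list and are performed in the order given.
- "precision" - We reduce floating point precision as follows:
  * np.float128 -> np.float64
  * np.float96 -> np.float64
  * np.float64 -> np.float32
- "subsample" - We subsample the data such that it fits directly into the memory allocation memory_allocation * memory_limit. Therefore, this should likely be the last method listed in "methods". Subsampling takes classification labels into account and stratifies accordingly. We guarantee that at least one occurrence of each label is included in the sampled set.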
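
The interplay of memory_allocation and methods can be simulated without any data. The sketch below is not auto-sklearn's implementation: allocation_mb and methods_applied are hypothetical helpers, the size check is assumed to happen before each method, and the size reductions are simulated (halving for "precision" is an assumption that holds for a pure float64 array).

```python
from typing import Callable, Dict, List, Union


def allocation_mb(memory_allocation: Union[int, float], memory_limit: int) -> float:
    """Translate "memory_allocation" into a budget in MB."""
    if isinstance(memory_allocation, float):
        return memory_allocation * memory_limit  # fraction of memory_limit
    return float(memory_allocation)  # absolute number of MB


def methods_applied(
    size_mb: float,
    budget_mb: float,
    methods: List[str],
    reducers: Dict[str, Callable[[float], float]],
) -> List[str]:
    """Return the methods that run before the dataset fits the budget."""
    applied = []
    for method in methods:
        if size_mb <= budget_mb:
            break  # dataset already fits, remaining methods are skipped
        size_mb = reducers[method](size_mb)
        applied.append(method)
    return applied


budget = allocation_mb(0.1, memory_limit=3072)
reducers = {
    "precision": lambda mb: mb / 2,           # assumed: float64 -> float32
    "subsample": lambda mb: min(mb, budget),  # subsample straight to the budget
}
print(methods_applied(500.0, budget, ["precision", "subsample"], reducers))
# ['precision']
```

A 500 MB dataset fits the roughly 307 MB budget after the precision step alone, so "subsample" never runs; a 1000 MB dataset would need both steps.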

allow_string_features: bool = True

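Whether autosklearn should process string features. By default, text preprocessing is enabled.

Attributes

cv_results_: dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, which can be imported into a pandas DataFrame.

Not all keys returned by scikit-learn are supported yet.

performance_over_time_: pandas.core.frame.DataFrame

A DataFrame containing the models' performance over time. Can be used directly for plotting. See the example Train and Test Inputs.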

fit(X, y, X_test=None, y_test=None, feat_type=None, dataset_name=None)[source]¶

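Fit auto-sklearn to the given training set (X, y).

Fit both optimizes the machine learning models and builds an ensemble out of them.

Parameters

X: array-like or sparse matrix of shape = [n_samples, n_features]: The training input samples.
y: array-like, shape = [n_samples] or [n_samples, n_outputs]: The target classes.
X_test: array-like or sparse matrix of shape = [n_samples, n_features]: Test data input samples. Will be used to save test predictions for all models. This allows evaluating the performance of Auto-sklearn over time.
y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]: Test data target classes. Will be used to calculate the test error of all models. This allows evaluating the performance of Auto-sklearn over time.
feat_type: list, optional (default=None): List of X.shape[1] strings describing the attribute types. Possible types are Categorical and Numerical. Categorical attributes are automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example with sklearn.preprocessing.LabelEncoder.
dataset_name: str, optional (default=None): Creates nicer output. If None, a string is determined by the md5 hash of the dataset.

Returns

self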
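
The feat_type argument is simply one string per column. As a small illustration, the hypothetical helper make_feat_type below (not part of auto-sklearn) builds such a list from the indices of the categorical columns.

```python
from typing import Iterable, List


def make_feat_type(n_features: int, categorical_columns: Iterable[int]) -> List[str]:
    """Build one type string per column, in column order."""
    categorical = set(categorical_columns)
    out_of_range = [i for i in categorical if not 0 <= i < n_features]
    if out_of_range:
        raise ValueError(f"column indices out of range: {sorted(out_of_range)}")
    return [
        "Categorical" if i in categorical else "Numerical"
        for i in range(n_features)
    ]


# Four columns, of which columns 1 and 3 hold integer-encoded categories.
feat_type = make_feat_type(4, categorical_columns=[1, 3])
print(feat_type)  # ['Numerical', 'Categorical', 'Numerical', 'Categorical']
```

The resulting list would be passed as fit(X, y, feat_type=feat_type), with n_features equal to X.shape[1].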

fit_ensemble(y, task: int = None, precision: Literal[16, 32, 64] = 32, dataset_name: Optional[str] = None, ensemble_size: int | None = None, ensemble_kwargs: Optional[Dict[str, Any]] = None, ensemble_nbest: Optional[int] = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', metric: Scorer | Sequence[Scorer] | None = None)¶

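Fit an ensemble to the models trained during an optimization process.

All parameters are None by default. If no other value is given, the default values set in a call to fit() are used.

Parameters

y: array-like

Target values.

task: int

A constant from the module autosklearn.constants. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).

precision: int

Numeric precision used when loading ensemble data. Can be either 16, 32 or 64.

dataset_name: str

Name of the current data set.

ensemble_size: int, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. To change the ensemble size for ensemble selection, pass it via ensemble_kwargs={"ensemble_size": int}.

ensemble_kwargs: Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

ensemble_nbest: int

ensemble_class: Type[AbstractEnsemble] | "default", optional (default="default")

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use the class SingleBest to obtain only the single best model instead of an ensemble.

If set to "default", EnsembleSelection is used for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.

metric: Scorer | Sequence[Scorer] | None = None

A metric or list of metrics used to evaluate the ensemble.

Returns

self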

fit_pipeline(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], config: Union[ConfigSpace.configuration_space.Configuration, Dict[str, Union[str, float, int]]], dataset_name: Optional[str] = None, X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, feat_type: Optional[List[str]] = None, *args, **kwargs: Dict) → Tuple[Optional[autosklearn.pipeline.base.BasePipeline], smac.runhistory.runhistory.RunInfo, smac.runhistory.runhistory.RunValue]¶

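Fits an individual pipeline configuration and returns the result to the user.

The Estimator constraints are honored, for example the resampling strategy or memory constraints, unless directly provided to the method. By default, this method supports the same signature as fit(), and any extra arguments are redirected to the TAE evaluation function, which allows for further customization while building a pipeline.

Any additional arguments provided are passed directly to the worker that executes the run.

Parameters

X: array-like, shape = (n_samples, n_features): The features used for training.
y: array-like: The labels used for training.
X_test: Optional array-like, shape = (n_samples, n_features): If provided, the testing performance is tracked on these features.
y_test: array-like: If provided, the testing performance is tracked on these labels.
config: Union[Configuration, Dict[str, Union[str, float, int]]]: A configuration object used to define the pipeline steps. If a dictionary is passed, a configuration is created from it.
dataset_name: Optional[str]: Name that will be used to tag and identify the Auto-Sklearn run.
feat_type: list, optional (default=None): List of X.shape[1] strings describing the attribute types. Possible types are Categorical and Numerical. Categorical attributes are automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example with sklearn.preprocessing.LabelEncoder.

Returns

pipeline: Optional[BasePipeline]: The fitted pipeline. If fitting the pipeline fails, None is returned.
run_info: RunInfo: A named tuple that contains the configuration launched.
run_value: RunValue: A named tuple that contains the result of the run.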

get_configuration_space(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, dataset_name: Optional[str] = None, feat_type: Optional[List[str]] = None) → ConfigSpace.configuration_space.ConfigurationSpace¶

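Returns the Configuration Space object, from which Auto-Sklearn will sample configurations and build pipelines.

Parameters

X: array-like or sparse matrix of shape = [n_samples, n_features]: Array with the training features, used to get characteristics like data sparsity.
y: array-like, shape = [n_samples] or [n_samples, n_outputs]: Array with the problem labels.
X_test: array-like or sparse matrix of shape = [n_samples, n_features]: Array with features used for performance estimation.
y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]: Array with the problem labels for the testing split.
dataset_name: Optional[str]: A string used to tag the Auto-Sklearn run.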

get_models_with_weights()¶

Return a list of the final ensemble found by auto-sklearn.

Returns

[(weight_1, model_1), …, (weight_n, model_n)]
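
Since the return value is a plain list of (weight, model) pairs, it can be post-processed with ordinary Python. The sketch below uses a hypothetical helper, heaviest_models, and placeholder strings in place of fitted models, to pick the members that contribute most to the ensemble.

```python
from typing import Any, List, Tuple


def heaviest_models(
    models_with_weights: List[Tuple[float, Any]], k: int = 3
) -> List[Tuple[float, Any]]:
    """Return the k (weight, model) pairs with the largest weights."""
    return sorted(models_with_weights, key=lambda pair: pair[0], reverse=True)[:k]


# Placeholder strings stand in for the fitted pipeline objects.
pairs = [(0.2, "model_a"), (0.5, "model_b"), (0.3, "model_c")]
print(heaviest_models(pairs, k=2))  # [(0.5, 'model_b'), (0.3, 'model_c')]
```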

get_params(deep=True)¶

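Get the parameters for this estimator.

Parameters

deep: bool, default=True: If True, returns the parameters for this estimator and for contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.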

leaderboard(detailed: bool = False, ensemble_only: bool = True, top_k: Union[int, Literal['all']] = 'all', sort_by: str = 'cost', sort_order: Literal['auto', 'ascending', 'descending'] = 'auto', include: Optional[Union[str, Iterable[str]]] = None) → pandas.core.frame.DataFrame¶

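Returns a pandas table of results for all evaluated models.

Gives an overview of all models trained during the search process, along with various statistics about their training.

The available statistics are:

Simple:

- "model_id" - The id given to a model by autosklearn.
- "rank" - The rank of the model based on its "cost".
- "ensemble_weight" - The weight given to the model in the ensemble.
- "type" - The type of classifier/regressor used.
- "cost" - The loss of the model on the validation set.
- "duration" - Length of time the model was optimized for.

Detailed: The detailed view includes all of the simple statistics along with the following.

- "config_id" - The id used by SMAC for optimization.
- "budget" - How much budget was allocated to this model.
- "status" - The return status of training the model with SMAC.
- "train_loss" - The loss of the model on the training set.
- "balancing_strategy" - The balancing strategy used for data preprocessing.
- "start_time" - Time at which the model began being optimized.
- "end_time" - Time at which the model finished being optimized.
- "data_preprocessors" - The preprocessors used on the data.
- "feature_preprocessors" - The preprocessors for feature types.

Parameters

detailed: bool = False

Whether to give detailed information or just a simple overview.

ensemble_only: bool = True

Whether to view only models included in the ensemble or all models trained.

top_k: int or "all" = "all"

How many models to display.

sort_by: str = 'cost'

Which column to sort by. If that column is not present, the sorting defaults to the "model_id" index column.

Defaults to the metric optimized. For a multi-objective optimization problem, sorts by the first objective.

sort_order: "auto" or "ascending" or "descending" = "auto"

Which sort order to apply to the sort_by column. If left as "auto", it sorts by a sensible default where "better" is on top; otherwise, if there is no obvious "better", it falls back to the default behavior of pandas DataFrame.sort_values.

include: Optional[str or Iterable[str]]

Items to include; other items not specified are excluded. The exception is the "model_id" index column, which is always included.

If left as None, it falls back to using the detailed parameter to decide which columns to include.

Returns

pd.DataFrame: A dataframe of statistics for the models, ordered by sort_by.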

predict(X, batch_size=None, n_jobs=1)[source]¶

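Predict classes for X.

Parameters

X: array-like or sparse matrix of shape = [n_samples, n_features]

Returns

y: array of shape = [n_samples] or [n_samples, n_labels]: The predicted classes.

predict_proba(X, batch_size=None, n_jobs=1)[source]¶

Predict probabilities of classes for all samples X.

Parameters

X: array-like or sparse matrix of shape = [n_samples, n_features]
batch_size: int (optional): Number of data points to predict for (predicts all points at once if None).
n_jobs: int

Returns

y: array of shape = [n_samples, n_classes] or [n_samples, n_labels]: The predicted class probabilities.

refit(X, y)¶

Refit all models found with fit to new data.

Necessary when using cross-validation. During training, auto-sklearn fits each model k times on the dataset but does not keep any trained model, and can therefore not be used to predict for new data points. This method fits all models found during a call to fit on the data given. It may also be used together with holdout to avoid using only 67% of the training data to fit the final model.

Parameters

X: array-like or sparse matrix of shape = [n_samples, n_features]: The training input samples.
y: array-like, shape = [n_samples] or [n_samples, n_outputs]: The targets.

Returns

self

score(X, y)¶

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that each label set be correctly predicted for each sample.

Parameters

X: array-like of shape (n_samples, n_features): Test samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs): True labels for X.
sample_weight: array-like of shape (n_samples,), default=None: Sample weights.

Returns

score: float: Mean accuracy of self.predict(X) with respect to y.

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter>, so that each component of a nested object can be updated.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.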

show_models()¶

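Returns a dictionary containing dictionaries of ensemble models.

Each model in the ensemble can be accessed by giving its model_id as key.

A model dictionary contains the following:

- "model_id" - The id given to a model by autosklearn.
- "rank" - The rank of the model based on its "cost".
- "cost" - The loss of the model on the validation set.
- "ensemble_weight" - The weight given to the model in the ensemble.
- "voting_model" - The cv_voting_ensemble model (for 'cv' resampling).
- "estimators" - List of models (dicts) in the cv_voting_ensemble (for 'cv' resampling).
- "data_preprocessor" - The preprocessor used on the data.
- "balancing" - The balancing used on the data (for classification).
- "feature_preprocessor" - The preprocessor for feature types.
- "classifier" / "regressor" - The autosklearn wrapped classifier or regressor.
- "sklearn_classifier" or "sklearn_regressor" - The sklearn classifier or regressor.

Examples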

import sklearn.datasets
import sklearn.metrics
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120
    )
automl.fit(X, y, dataset_name='diabetes')

ensemble_dict = automl.show_models()
print(ensemble_dict)

Output

{
    25: {'model_id': 25.0,
         'rank': 1,
         'cost': 0.43667876507897496,
         'ensemble_weight': 0.38,
         'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
         'feature_preprocessor': <autosklearn.pipeline.components....>,
         'regressor': <autosklearn.pipeline.components.regression....>,
         'sklearn_regressor': SGDRegressor(alpha=0.0006517033225329654,...)
        },
    6: {'model_id': 6.0,
        'rank': 2,
        'cost': 0.4550418898836528,
        'ensemble_weight': 0.3,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': ARDRegression(alpha_1=0.0003701926442639788,...)
        }...
}

Returns

Dict(int, Any): dictionary of length = number of models in the ensemble: A dictionary of the models in the ensemble, where model_id is the key.
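
The nested dictionary can be condensed into a compact overview. The sketch below works on a hand-written dictionary that mimics the keys and values of the example output above; summarize_ensemble is a hypothetical helper, not an auto-sklearn function.

```python
from typing import Any, Dict, List, Tuple


def summarize_ensemble(ensemble_dict: Dict[int, Dict[str, Any]]) -> List[Tuple[int, int, float]]:
    """Return (model_id, rank, ensemble_weight) tuples ordered by rank."""
    members = sorted(ensemble_dict.values(), key=lambda member: member["rank"])
    return [
        (int(member["model_id"]), member["rank"], member["ensemble_weight"])
        for member in members
    ]


# Hand-written stand-in for the result of automl.show_models().
example = {
    6: {"model_id": 6.0, "rank": 2, "cost": 0.4550418898836528, "ensemble_weight": 0.3},
    25: {"model_id": 25.0, "rank": 1, "cost": 0.43667876507897496, "ensemble_weight": 0.38},
}
print(summarize_ensemble(example))  # [(25, 1, 0.38), (6, 2, 0.3)]
```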

sprint_statistics()¶

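Return the following statistics of the training result:

- dataset name
- metric used
- best validation score
- number of target algorithm runs
- number of successful target algorithm runs
- number of crashed target algorithm runs
- number of target algorithm runs that exceeded the memory limit
- number of target algorithm runs that exceeded the time limit

Returns

str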

class autosklearn.experimental.askl2.AutoSklearn2Classifier(time_left_for_this_task: int = 3600, per_run_time_limit=None, ensemble_size: int | None = None, ensemble_class: AbstractEnsemble | None = <class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>, ensemble_kwargs: Dict[str, Any] | None = None, ensemble_nbest: Union[float, int] = 50, max_models_on_disc: int = 50, seed: int = 1, memory_limit: int = 3072, tmp_folder: Optional[str] = None, delete_tmp_folder_after_terminate: bool = True, n_jobs: Optional[int] = None, dask_client: Optional[dask.distributed.Client] = None, disable_evaluator_output: bool = False, smac_scenario_args: Optional[Dict[str, Any]] = None, logging_config: Optional[Dict[str, Any]] = None, metric: Optional[Scorer] = None, scoring_functions: Optional[List[Scorer]] = None, load_models: bool = True, dataset_compression: Union[bool, Mapping[str, Any]] = True, allow_string_features: bool = True)[source]¶

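Parameters

time_left_for_this_task: int, optional (default=3600)

Time limit in seconds for the search for appropriate models. Increasing this value gives auto-sklearn a higher chance of finding better models.

per_run_time_limit: int, optional (default=1/10 of time_left_for_this_task)

ensemble_size: int, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. To change the ensemble size for ensemble selection, pass it via ensemble_kwargs={"ensemble_size": int}.

ensemble_class: Type[AbstractEnsemble], optional (default=EnsembleSelection)

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to obtain only the single best model instead of an ensemble.

ensemble_kwargs: Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

max_models_on_disc: int, optional (default=50)

seed: int, optional (default=1)

Used to seed SMAC. Will determine the output file names.

memory_limit: int, optional (3072)

Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.

Important notes:

- If None is provided, no memory limit is set.
- In case of multi-processing, memory_limit is a per-job limit, so the total usage is n_jobs x memory_limit.
- The memory limit also applies to the ensemble creation process.

tmp_folder: str, optional (None)

Folder to store configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.

delete_tmp_folder_after_terminate: bool, optional (True)

Remove tmp_folder when finished. If tmp_folder is None, tmp_dir will always be deleted.

n_jobs: int, optional, experimental

The number of jobs to run in parallel for fit(). -1 means using all processors.

Important notes:

- By default, Auto-sklearn uses one core.
- Ensemble building is not affected by n_jobs but can be controlled by the number of models in the ensemble.
- predict() is not affected by n_jobs (in contrast to most scikit-learn models).
- If dask_client is None, a new dask client is created.

dask_client: dask.distributed.Client, optional

User-created dask client; can be used to start a dask cluster and then attach auto-sklearn to it.

disable_evaluator_output: bool or list, optional (False)

- 'y_optimization' - do not save the predictions for the optimization/validation set, which would later be used to build an ensemble.
- 'model' - do not save any model files.

smac_scenario_args: dict, optional (None)

Additional arguments inserted into the SMAC scenario. See the SMAC documentation for a list of available arguments.

logging_config: dict, optional (None)

Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which is located at util/logging.yaml relative to the installation directory.

metric: Scorer, optional (None)

An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). These are the built-in metrics. If None is provided, a default metric is selected depending on the task.

scoring_functions: List[Scorer], optional (None)

List of scorers that will be calculated for each pipeline; the results will be available via cv_results_.

load_models: bool, optional (True)

Whether to load the models after fitting Auto-sklearn.

Attributes

cv_results_: dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, which can be imported into a pandas DataFrame.

Not all keys returned by scikit-learn are supported yet.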

fit(X, y, X_test=None, y_test=None, metric=None, feat_type=None, dataset_name=None)[source]¶

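Fit auto-sklearn to the given training set (X, y).

Fit both optimizes the machine learning models and builds an ensemble out of them.

Parameters

X: array-like or sparse matrix of shape = [n_samples, n_features]: The training input samples.
y: array-like, shape = [n_samples] or [n_samples, n_outputs]: The target classes.
X_test: array-like or sparse matrix of shape = [n_samples, n_features]: Test data input samples. Will be used to save test predictions for all models. This allows evaluating the performance of Auto-sklearn over time.
y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]: Test data target classes. Will be used to calculate the test error of all models. This allows evaluating the performance of Auto-sklearn over time.
feat_type: list, optional (default=None): List of X.shape[1] strings describing the attribute types. Possible types are Categorical and Numerical. Categorical attributes are automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example with sklearn.preprocessing.LabelEncoder.
dataset_name: str, optional (default=None): Creates nicer output. If None, a string is determined by the md5 hash of the dataset.

Returns

self

Fit an ensemble to the models trained during an optimization process.

All parameters are None by default. If no other value is given, the default values set in a call to fit() are used.

Parameters

y: array-like

Target values.

task: int

A constant from the module autosklearn.constants. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).

precision: int

Numeric precision used when loading ensemble data. Can be either 16, 32 or 64.

dataset_name: str

Name of the current data set.

ensemble_size: int, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. To change the ensemble size for ensemble selection, pass it via ensemble_kwargs={"ensemble_size": int}.

ensemble_kwargs: Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

ensemble_nbest: int

ensemble_class: Type[AbstractEnsemble] | "default", optional (default="default")

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use the class SingleBest to obtain only the single best model instead of an ensemble.

If set to "default", EnsembleSelection is used for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.

metric: Scorer | Sequence[Scorer] | None = None

A metric or list of metrics used to evaluate the ensemble.

Returns

self

Fits an individual pipeline configuration and returns the result to the user.

Any additional arguments provided are passed directly to the worker that executes the run.

Parameters

X: array-like, shape = (n_samples, n_features): The features used for training.
y: array-like: The labels used for training.
X_test: Optional array-like, shape = (n_samples, n_features): If provided, the testing performance is tracked on these features.
y_test: array-like: If provided, the testing performance is tracked on these labels.
config: Union[Configuration, Dict[str, Union[str, float, int]]]: A configuration object used to define the pipeline steps. If a dictionary is passed, a configuration is created from it.
dataset_name: Optional[str]: Name that will be used to tag and identify the Auto-Sklearn run.
feat_type: list, optional (default=None): List of X.shape[1] strings describing the attribute types. Possible types are Categorical and Numerical. Categorical attributes are automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example with sklearn.preprocessing.LabelEncoder.

Returns

pipeline: Optional[BasePipeline]: The fitted pipeline. If fitting the pipeline fails, None is returned.
run_info: RunInfo: A named tuple that contains the configuration launched.
run_value: RunValue: A named tuple that contains the result of the run.

Returns the Configuration Space object, from which Auto-Sklearn will sample configurations and build pipelines.

Parameters

X: array-like or sparse matrix of shape = [n_samples, n_features]: Array with the training features, used to get characteristics like data sparsity.
y: array-like, shape = [n_samples] or [n_samples, n_outputs]: Array with the problem labels.
X_test: array-like or sparse matrix of shape = [n_samples, n_features]: Array with features used for performance estimation.
y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]: Array with the problem labels for the testing split.
dataset_name: Optional[str]: A string used to tag the Auto-Sklearn run.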

get_models_with_weights()¶

Return a list of the final ensemble found by auto-sklearn.

Returns

[(weight_1, model_1), …, (weight_n, model_n)]

get_params(deep=True)¶

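Get the parameters for this estimator.

Parameters

deep: bool, default=True: If True, returns the parameters for this estimator and for contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

Returns a pandas table of results for all evaluated models.

Gives an overview of all models trained during the search process, along with various statistics about their training.

The available statistics are:

Simple:

- "model_id" - The id given to a model by autosklearn.
- "rank" - The rank of the model based on its "cost".
- "ensemble_weight" - The weight given to the model in the ensemble.
- "type" - The type of classifier/regressor used.
- "cost" - The loss of the model on the validation set.
- "duration" - Length of time the model was optimized for.

Detailed: The detailed view includes all of the simple statistics along with the following.

- "config_id" - The id used by SMAC for optimization.
- "budget" - How much budget was allocated to this model.
- "status" - The return status of training the model with SMAC.
- "train_loss" - The loss of the model on the training set.
- "balancing_strategy" - The balancing strategy used for data preprocessing.
- "start_time" - Time at which the model began being optimized.
- "end_time" - Time at which the model finished being optimized.
- "data_preprocessors" - The preprocessors used on the data.
- "feature_preprocessors" - The preprocessors for feature types.

Parameters

detailed: bool = False

Whether to give detailed information or just a simple overview.

ensemble_only: bool = True

Whether to view only models included in the ensemble or all models trained.

top_k: int or "all" = "all"

How many models to display.

sort_by: str = 'cost'

Which column to sort by. If that column is not present, the sorting defaults to the "model_id" index column.

Defaults to the metric optimized. For a multi-objective optimization problem, sorts by the first objective.

sort_order: "auto" or "ascending" or "descending" = "auto"

include: Optional[str or Iterable[str]]

Items to include; other items not specified are excluded. The exception is the "model_id" index column, which is always included.

If left as None, it falls back to using the detailed parameter to decide which columns to include.

Returns

pd.DataFrame: A dataframe of statistics for the models, ordered by sort_by.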

predict(X, batch_size=None, n_jobs=1)¶

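Predict classes for X.

Parameters

X: array-like or sparse matrix of shape = [n_samples, n_features]

Returns

y: array of shape = [n_samples] or [n_samples, n_labels]: The predicted classes.

predict_proba(X, batch_size=None, n_jobs=1)¶

Predict probabilities of classes for all samples X.

Parameters

X: array-like or sparse matrix of shape = [n_samples, n_features]
batch_size: int (optional): Number of data points to predict for (predicts all points at once if None).
n_jobs: int

Returns

y: array of shape = [n_samples, n_classes] or [n_samples, n_labels]: The predicted class probabilities.

refit(X, y)¶

Refit all models found with fit to new data.

Parameters

X: array-like or sparse matrix of shape = [n_samples, n_features]: The training input samples.
y: array-like, shape = [n_samples] or [n_samples, n_outputs]: The targets.

Returns

self

score(X, y)¶

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that each label set be correctly predicted for each sample.

Parameters

X: array-like of shape (n_samples, n_features): Test samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs): True labels for X.
sample_weight: array-like of shape (n_samples,), default=None: Sample weights.

Returns

score: float: Mean accuracy of self.predict(X) with respect to y.

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter>, so that each component of a nested object can be updated.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.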

show_models()¶

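Returns a dictionary containing dictionaries of ensemble models.

Each model in the ensemble can be accessed by giving its model_id as key.

A model dictionary contains the following:

- "model_id" - The id given to a model by autosklearn.
- "rank" - The rank of the model based on its "cost".
- "cost" - The loss of the model on the validation set.
- "ensemble_weight" - The weight given to the model in the ensemble.
- "voting_model" - The cv_voting_ensemble model (for 'cv' resampling).
- "estimators" - List of models (dicts) in the cv_voting_ensemble (for 'cv' resampling).
- "data_preprocessor" - The preprocessor used on the data.
- "balancing" - The balancing used on the data (for classification).
- "feature_preprocessor" - The preprocessor for feature types.
- "classifier" / "regressor" - The autosklearn wrapped classifier or regressor.
- "sklearn_classifier" or "sklearn_regressor" - The sklearn classifier or regressor.

Examples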

import sklearn.datasets
import sklearn.metrics
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120
    )
automl.fit(X, y, dataset_name='diabetes')

ensemble_dict = automl.show_models()
print(ensemble_dict)

Output

{
    25: {'model_id': 25.0,
         'rank': 1,
         'cost': 0.43667876507897496,
         'ensemble_weight': 0.38,
         'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
         'feature_preprocessor': <autosklearn.pipeline.components....>,
         'regressor': <autosklearn.pipeline.components.regression....>,
         'sklearn_regressor': SGDRegressor(alpha=0.0006517033225329654,...)
        },
    6: {'model_id': 6.0,
        'rank': 2,
        'cost': 0.4550418898836528,
        'ensemble_weight': 0.3,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': ARDRegression(alpha_1=0.0003701926442639788,...)
        }...
}

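Returns

Dict(int, Any): dictionary of length = number of models in the ensemble: A dictionary of the models in the ensemble, where model_id is the key.

sprint_statistics()¶

Return the following statistics of the training result:

- dataset name
- metric used
- best validation score
- number of target algorithm runs
- number of successful target algorithm runs
- number of crashed target algorithm runs
- number of target algorithm runs that exceeded the memory limit
- number of target algorithm runs that exceeded the time limit

Returns

str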

Regression

class autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=3600, per_run_time_limit=None, initial_configurations_via_metalearning=25, ensemble_size: int | None = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', ensemble_kwargs: Dict[str, Any] | None = None, ensemble_nbest=50, max_models_on_disc=50, seed=1, memory_limit=3072, include: Optional[Dict[str, List[str]]] = None, exclude: Optional[Dict[str, List[str]]] = None, resampling_strategy='holdout', resampling_strategy_arguments=None, tmp_folder=None, delete_tmp_folder_after_terminate=True, n_jobs: Optional[int] = None, dask_client: Optional[dask.distributed.Client] = None, disable_evaluator_output=False, get_smac_object_callback=None, smac_scenario_args=None, logging_config=None, metadata_directory=None, metric: Scorer | Sequence[Scorer] | None = None, scoring_functions: Optional[List[Scorer]] = None, load_models: bool = True, get_trials_callback: SMACCallback | None = None, dataset_compression: Union[bool, Mapping[str, Any]] = True, allow_string_features: bool = True)[source]¶

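This class implements the regression task.

Parameters

time_left_for_this_task: int, optional (default=3600)

Time limit in seconds for the search for appropriate models. Increasing this value gives auto-sklearn a higher chance of finding better models.

per_run_time_limit: int, optional (default=1/10 of time_left_for_this_task)

initial_configurations_via_metalearning: int, optional (default=25)

Initialize the hyperparameter optimization algorithm with this many configurations that worked well on previously seen datasets. Disable this if the hyperparameter optimization algorithm should start from scratch.

ensemble_size: int, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. To change the ensemble size for ensemble selection, pass it via ensemble_kwargs={"ensemble_size": int}.

ensemble_class: Type[AbstractEnsemble] | "default", optional (default="default")

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use SingleBest to obtain only the single best model instead of an ensemble.

If set to "default", EnsembleSelection is used for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.

ensemble_kwargs: Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

ensemble_nbest: int, optional (default=50)

max_models_on_disc: int, optional (default=50)

seed: int, optional (default=1)

Used to seed SMAC. Will determine the output file names.

memory_limit: int, optional (3072)

Memory limit in MB for the machine learning algorithm. auto-sklearn will stop fitting the machine learning algorithm if it tries to allocate more than memory_limit MB.

Important notes:

- If None is provided, no memory limit is set.
- In case of multi-processing, memory_limit is a per-job limit, so the total usage is n_jobs x memory_limit.
- The memory limit also applies to the ensemble creation process.

include: Optional[Dict[str, List[str]]] = None

If None, all possible algorithms are used.

Otherwise, specifies the steps and components that are included in the search. See /pipeline/components/<step>/* for available components.

Incompatible with the parameter exclude.

Possible steps:

- "data_preprocessor"
- "balancing"
- "feature_preprocessor"
- "classifier" - only available when using AutoSklearnClassifier
- "regressor" - only available when using AutoSklearnRegressor

Example: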

include = {
    'classifier': ["random_forest"],
    'feature_preprocessor': ["no_preprocessing"]
}

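exclude: Optional[Dict[str, List[str]]] = None

If None, all possible algorithms are used.

Otherwise, specifies the steps and components that are excluded from the search. See /pipeline/components/<step>/* for available components.

Incompatible with the parameter include.

Possible steps:

- "data_preprocessor"
- "balancing"
- "feature_preprocessor"
- "classifier" - only available when using AutoSklearnClassifier
- "regressor" - only available when using AutoSklearnRegressor

Example: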

exclude = {
    'classifier': ["random_forest"],
    'feature_preprocessor': ["no_preprocessing"]
}

resampling_strategy: str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = "holdout"

How to handle overfitting. If using a "cv"-based method or a Splitter object, you may need to use resampling_strategy_arguments.

Options:
- "holdout" - Use a 67:33 (train:test) split
- "cv" - Perform cross-validation; requires "folds" in resampling_strategy_arguments
- "holdout-iterative-fit" - Same as "holdout" but uses iterative fitting where possible
- "cv-iterative-fit" - Same as "cv" but uses iterative fitting where possible
- "partial-cv" - Same as "cv" but uses intensification
- BaseCrossValidator - any BaseCrossValidator subclass (found in the scikit-learn model_selection module)
- _RepeatedSplits - any _RepeatedSplits subclass (found in the scikit-learn model_selection module)
- BaseShuffleSplit - any BaseShuffleSplit subclass (found in the scikit-learn model_selection module)

resampling_strategy_arguments: Optional[Dict] = None

Additional arguments for resampling_strategy; this is required if using a cv-based strategy. If left as None, the default arguments are:

{
    "train_size": 0.67,     # The size of the training set
    "shuffle": True,        # Whether to shuffle before splitting data
    "folds": 5              # Used in 'cv' based resampling strategies
}

如果使用接受 n_splits 的自定义 splitter 类（例如 PredefinedSplit），将使用 "folds" 的值。

tmp_folder字符串，可选 (None)

用于存储配置输出和日志文件的文件夹，如果为 None，则自动使用 /tmp/autosklearn_tmp_$pid_$random_number

delete_tmp_folder_after_terminate: bool, optional (True)

完成时删除 tmp_folder。如果 tmp_folder 为 None，则 tmp_dir 将始终被删除

n_jobs整数，可选，实验性

在 fit() 中并行运行的作业数量。-1 表示使用所有处理器。

重要说明:

默认情况下，Auto-sklearn 使用一个核心。
集成构建不受 n_jobs 的影响，但可以通过集成中的模型数量来控制。
predict() 不受 n_jobs 的影响（与大多数 scikit-learn 模型不同）
如果 dask_client 为 None，则创建新的 dask 客户端。

dask_clientdask.distributed.Client，可选

用户创建的 dask 客户端，可用于启动 dask 集群并将 auto-sklearn 连接到它。

disable_evaluator_output: bool or list, optional (False)

'y_optimization' : 不保存优化集的预测结果，该结果稍后将用于构建集成。
model : 不保存任何模型文件

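smac_scenario_args: dict, optional (None)

Additional arguments inserted into the SMAC scenario. See the SMAC documentation for a list of available arguments.

get_smac_object_callback: callable

logging_config: dict, optional (None)

Dictionary object specifying the logger configuration. If None, the default logging.yaml file is used, which can be found at util/logging.yaml relative to the installation directory.

metadata_directory: str, optional (None)

Path to the metadata directory. If None, the default directory (autosklearn.metalearning.files) is used.

metric: Scorer, optional (None)

An instance of autosklearn.metrics.Scorer as created by autosklearn.metrics.make_scorer(). The available scorers are listed under Built-in metrics. If None is provided, a default metric is selected depending on the task.

scoring_functions: List[Scorer], optional (None)

List of scorers that will be calculated for each pipeline. The results are available via cv_results_.

load_models: bool, optional (True)

Whether to load the models after fitting Auto-sklearn.

get_trials_callback: callable

A callable with the following definition.

(smac.SMBO, smac.RunInfo, smac.RunValue, time_left: float) -> bool | None

It is called each time SMAC, the underlying optimizer of autosklearn, finishes training a run.

You can use it to record your own information about the optimization process. You can also use it to implement early stopping based on some criterion.

See the example: Early stopping and callbacks.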
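A minimal sketch of such a callback follows, written against the four-argument definition given above. It assumes that the run result exposes a `cost` attribute and that returning False asks the optimizer to stop; both are assumptions here, and the `SimpleNamespace` objects are stand-ins for real `smac.RunValue` instances so the sketch runs without SMAC installed. The 0.02 threshold is an arbitrary illustrative value.

```python
from types import SimpleNamespace


def stop_when_good_enough(smbo, run_info, result, time_left, threshold=0.02):
    """Request a stop (return False) once a finished run is cheap enough.

    Assumption: `result.cost` holds the loss of the finished run.
    """
    if result.cost <= threshold:
        return False  # assumed to signal "stop the optimization"
    return None       # keep going


# Stand-ins for smac.RunValue, used only to exercise the callback.
good_run = SimpleNamespace(cost=0.01)
poor_run = SimpleNamespace(cost=0.50)

print(stop_when_good_enough(None, None, good_run, time_left=30.0))
print(stop_when_good_enough(None, None, poor_run, time_left=30.0))
```

Such a function would be passed as get_trials_callback when constructing the estimator.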

dataset_compression: Union[bool, Mapping[str, Any]] = True

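We compress datasets so that they fit into a predefined amount of memory. Currently this does not apply to dataframes or sparse arrays, only to raw numpy arrays.

Note - if you use a custom resampling_strategy that relies on a specific size or ordering of the data, this must be disabled to preserve those properties.

You can disable this entirely by passing False, or leave the default True to use the following configuration.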

{
    "memory_allocation": 0.1,
    "methods": ["precision", "subsample"]
}

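You can also pass your own configuration with the same keys, choosing from the available "methods".

The available options are described here:

memory_allocation
By default, we attempt to fit the dataset into 0.1 * memory_limit. This float value can be set with "memory_allocation": 0.1. We also allow specifying an absolute amount of memory in MB, e.g. 10MB is "memory_allocation": 10.

The memory used by the dataset is checked after each reduction method is performed. If the dataset fits into the allocated memory, none of the subsequent methods listed in "methods" is performed.

For example, if methods: ["precision", "subsample"] and the "precision" reduction step is enough to make the dataset fit into memory, then the "subsample" reduction step is not performed.
methods
We provide the following methods for reducing the dataset size. They can be given as a list and are performed in the order given.
- "precision" - we reduce floating point precision as follows:
  - np.float128 -> np.float64
  - np.float96 -> np.float64
  - np.float64 -> np.float32
- subsample - we subsample the data so that it fits directly into the memory allocation memory_allocation * memory_limit. This should therefore most likely be the last method listed in "methods". Subsampling takes classification labels into account and stratifies accordingly. We guarantee that at least one occurrence of each label is included in the sampled set.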
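The control flow described above (interpret memory_allocation, then apply methods in order until the dataset fits) can be sketched as follows. This is an illustration under stated assumptions, not auto-sklearn's implementation: `target_megabytes`, `reduce_dataset` and the toy `reducers` are hypothetical names, dataset sizes are plain numbers in MB, and "precision" is modelled as simply halving the footprint.

```python
def target_megabytes(memory_allocation, memory_limit_mb):
    """A float is a fraction of memory_limit; an int is an absolute size in MB."""
    if isinstance(memory_allocation, float):
        return memory_allocation * memory_limit_mb
    return float(memory_allocation)


def reduce_dataset(size_mb, methods, memory_allocation, memory_limit_mb, reducers):
    """Apply reduction methods in order, stopping as soon as the data fits."""
    budget = target_megabytes(memory_allocation, memory_limit_mb)
    applied = []
    for name in methods:
        if size_mb <= budget:  # memory is checked before every further step
            break
        size_mb = reducers[name](size_mb, budget)
        applied.append(name)
    return size_mb, applied


# Toy reducers (assumptions): halve the size, or cut down to the budget.
reducers = {
    "precision": lambda size, budget: size / 2,
    "subsample": lambda size, budget: min(size, budget),
}

size, applied = reduce_dataset(500.0, ["precision", "subsample"], 0.1, 3072, reducers)
print(size, applied)
```

With a 500 MB dataset and the default 0.1 * 3072 MB budget, the "precision" step alone is sufficient in this toy model, so "subsample" is never run.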

allow_string_features: bool = True

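Whether autosklearn should process string features. By default, text preprocessing is enabled.

Attributes

cv_results_: dict of numpy (masked) ndarrays

A dict with column headers as keys and columns as values, which can be imported into a pandas DataFrame.

Not all keys returned by scikit-learn are supported yet.

performance_over_time_: pandas.core.frame.DataFrame

A DataFrame containing the models' performance over time. It can be used directly for plotting. See the example Train and Test Inputs.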

fit(X, y, X_test=None, y_test=None, feat_type=None, dataset_name=None)[source]¶

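Fit Auto-sklearn to the given training set (X, y).

Fit both optimizes the machine learning models and builds an ensemble out of them.

Parameters

X: array-like or sparse matrix, shape = [n_samples, n_features]: The training input samples.
y: array-like, shape = [n_samples] or [n_samples, n_targets]: The regression target.
X_test: array-like or sparse matrix, shape = [n_samples, n_features]: Test data input samples. Used to save test predictions for all models, which makes it possible to evaluate the performance of Auto-sklearn over time.
y_test: array-like, shape = [n_samples] or [n_samples, n_targets]: The regression target. Used to calculate the test error of all models, which makes it possible to evaluate the performance of Auto-sklearn over time.
feat_type: list, optional (default=None): List of str of length X.shape[1] describing the attribute types. Possible types are Categorical and Numerical. Categorical attributes are automatically One-Hot encoded.
dataset_name: str, optional (default=None): Creates nicer output. If None, a string is determined from the md5 hash of the dataset.

Returns

self

Fit an ensemble to the models trained during an optimization process.

All parameters are None by default. If no other value is given, the default values set in the call to fit() are used.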

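Parameters

y: array-like

The target values.

task: int

A constant from the module autosklearn.constants. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).

precision: int

Numeric precision used when loading ensemble data. Can be 16, 32 or 64.

dataset_name: str

Name of the current dataset.

ensemble_size: int, optional

Number of models added to the ensemble built by Ensemble selection from libraries of models. Models are drawn with replacement. If set to 0, no ensemble is fit.

Deprecated - will be removed in Auto-sklearn 0.16. To change the ensemble size for ensemble selection, pass this argument via ensemble_kwargs={"ensemble_size": int}.

ensemble_kwargs: Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.

ensemble_nbest: int

ensemble_class: Type[AbstractEnsemble] | "default", optional (default="default")

Class implementing the post-hoc ensemble algorithm. Set to None to disable ensemble building, or use the class SingleBest to use only the single best model instead of an ensemble.

If set to "default", it uses EnsembleSelection for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.

metric: Scorer | Sequence[Scorer] | None = None

A metric or list of metrics used to evaluate the ensemble.

Returns

self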

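Fit a single pipeline configuration and return the result to the user.

Any additional arguments provided are passed directly to the worker that executes the run.

Parameters

X: array-like, shape = (n_samples, n_features): The features used for training
y: array-like: The labels used for training
X_test: Optional array-like, shape = (n_samples, n_features): If provided, the test performance is tracked on these features.
y_test: array-like: If provided, the test performance is tracked on these labels
config: Union[Configuration, Dict[str, Union[str, float, int]]]: A configuration object used to define the pipeline steps. If a dictionary is passed, a configuration is created from it.
dataset_name: Optional[str]: Name used to tag and identify the Auto-Sklearn run
feat_type: list, optional (default=None): List of str of length X.shape[1] describing the attribute types. Possible types are Categorical and Numerical. Categorical attributes are automatically One-Hot encoded. The values used for a categorical attribute must be integers, obtained for example with sklearn.preprocessing.LabelEncoder.

Returns

pipeline: Optional[BasePipeline]: The fitted pipeline. If fitting the pipeline fails, None is returned.
run_info: RunInfo: A named tuple that contains the launched configuration
run_value: RunValue: A named tuple that contains the result of the run

Return the Configuration Space object from which Auto-Sklearn samples configurations and builds pipelines.

Parameters

X: array-like or sparse matrix, shape = [n_samples, n_features]: Array with the training features, used to derive characteristics such as data sparsity
y: array-like, shape = [n_samples] or [n_samples, n_outputs]: Array with the problem labels
X_test: array-like or sparse matrix, shape = [n_samples, n_features]: Array with the features used for performance estimation
y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]: Array with the problem labels for the test split
dataset_name: Optional[str]: A string used to tag the Auto-Sklearn run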

get_models_with_weights()¶

Return the final ensemble found by auto-sklearn as a list of (weight, model) pairs.

Returns

[(weight_1, model_1), …, (weight_n, model_n)]

get_params(deep=True)¶

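Get the parameters of this estimator.

Parameters

deep: bool, default=True: If True, return the parameters of this estimator and of contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

Return a pandas table of results for all evaluated models.

Gives an overview of all models trained during the search process, along with various statistics about their training.

The available statistics are:

Simple:

- "model_id" - The id given to a model by autosklearn.
- "rank" - The rank of the model based on its "cost".
- "ensemble_weight" - The weight given to the model in the ensemble.
- "type" - The type of classifier/regressor used.
- "cost" - The loss of the model on the validation set.
- "duration" - Length of time the model was optimized for.

Detailed: The detailed view includes all of the simple statistics along with the following.

- "config_id" - The id used by SMAC for optimization.
- "budget" - How much budget was allocated to this model.
- "status" - The return status of training the model with SMAC.
- "train_loss" - The loss of the model on the training set.
- "balancing_strategy" - The balancing strategy used for data preprocessing.
- "start_time" - Time at which optimization of the model began
- "end_time" - Time at which optimization of the model ended
- "data_preprocessors" - The preprocessors used on the data
- "feature_preprocessors" - The preprocessors for feature types

Parameters

detailed: bool = False

Whether to give detailed information or just a simple overview.

ensemble_only: bool = True

Whether to view only the models included in the ensemble or all trained models.

top_k: int or "all" = "all"

How many models to display.

sort_by: str = 'cost'

Which column to sort by. If that column is not present, sorting falls back to the "model_id" index column.

Defaults to the metric optimized. In a multi-objective optimization problem, sorts by the first objective.

sort_order: "auto" or "ascending" or "descending" = "auto"

include: Optional[str or Iterable[str]]

Items to include; items not specified are excluded. The exception is the "model_id" index column, which is always included.

If left as None, it falls back to using the detailed parameter to decide which columns to include.

Returns

pd.DataFrame: A dataframe of statistics for the models, ordered by sort_by.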

predict(X, batch_size=None, n_jobs=1)[source]¶

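Predict the regression target for X.

Parameters

X: array-like or sparse matrix, shape = [n_samples, n_features]

Returns

y: array of shape = [n_samples] or [n_samples, n_outputs]: The predicted values.

refit(X, y)¶

Refit all models found with fit to new data.

Parameters

X: array-like or sparse matrix, shape = [n_samples, n_features]: The training input samples.
y: array-like, shape = [n_samples] or [n_samples, n_outputs]: The targets.

Returns

self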

score(X, y)¶

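Return the coefficient of determination $R^2$ of the prediction.

The coefficient of determination $R^2$ is defined as $(1 - \frac{u}{v})$, where $u$ is the residual sum of squares ((y_true - y_pred) ** 2).sum() and $v$ is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and it can be negative (because the model can be arbitrarily bad). A constant model that always predicts the expected value of y, disregarding the input features, gets an $R^2$ score of 0.0.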
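The definition above can be checked with a small worked example. The function `r_squared` below is an illustrative plain-Python transcription of the formula, not the library's implementation, and the sample values are made up.

```python
def r_squared(y_true, y_pred):
    """Compute 1 - u / v as defined above, using plain Python lists."""
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - u / v


y_true = [3.0, -0.5, 2.0, 7.0]
mean_value = sum(y_true) / len(y_true)

perfect = r_squared(y_true, y_true)                       # no residual error
constant = r_squared(y_true, [mean_value] * len(y_true))  # always predicts the mean
worse = r_squared(y_true, [10.0, 10.0, 10.0, 10.0])       # far-off constant model

print(perfect, constant, worse)
```

The three cases illustrate the statements in the text: a perfect prediction scores 1.0, predicting the mean scores 0.0, and a sufficiently bad model scores below zero.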

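Parameters

X: array-like of shape (n_samples, n_features): Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead, with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used to fit the estimator.
y: array-like of shape (n_samples,) or (n_samples, n_outputs): True values for X.
sample_weight: array-like of shape (n_samples,), default=None: Sample weights.

Returns

score: float: $R^2$ of self.predict(X) with respect to y.

Notes

The $R^2$ score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 onwards, to stay consistent with the default value of r2_score(). This influences the score method of all multioutput regressors (except for MultiOutputRegressor).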

set_params(**params)¶

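Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter>, so that each component of a nested object can be updated.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.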

show_models()¶

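Return a dictionary containing one dictionary per ensemble model.

Each model in the ensemble can be accessed by using its model_id as the key.

A model dictionary contains the following:

- "model_id" - The id given to a model by autosklearn.
- "rank" - The rank of the model based on its "cost".
- "cost" - The loss of the model on the validation set.
- "ensemble_weight" - The weight given to the model in the ensemble.
- "voting_model" - The cv_voting_ensemble model (for 'cv' resampling).
- "estimators" - List of models (dicts) in the cv_voting_ensemble (for 'cv' resampling).
- "data_preprocessor" - The preprocessor used on the data.
- "balancing" - The balancing used on the data (for classification).
- "feature_preprocessor" - The preprocessor for feature types.
- "classifier" / "regressor" - The autosklearn-wrapped classifier or regressor.
- "sklearn_classifier" or "sklearn_regressor" - The sklearn classifier or regressor.

Examples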

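import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import autosklearn.regression

X, y = sklearn.datasets.load_diabetes(return_X_y=True)
# Hold out a test split so that X_train and y_train are defined before fitting.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120
    )
automl.fit(X_train, y_train, dataset_name='diabetes')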

ensemble_dict = automl.show_models()
print(ensemble_dict)

Output

{
    25: {'model_id': 25.0,
         'rank': 1,
         'cost': 0.43667876507897496,
         'ensemble_weight': 0.38,
         'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
         'feature_preprocessor': <autosklearn.pipeline.components....>,
         'regressor': <autosklearn.pipeline.components.regression....>,
         'sklearn_regressor': SGDRegressor(alpha=0.0006517033225329654,...)
        },
    6: {'model_id': 6.0,
        'rank': 2,
        'cost': 0.4550418898836528,
        'ensemble_weight': 0.3,
        'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing....>,
        'feature_preprocessor': <autosklearn.pipeline.components....>,
        'regressor': <autosklearn.pipeline.components.regression....>,
        'sklearn_regressor': ARDRegression(alpha_1=0.0003701926442639788,...)
        }...
}

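Returns

Dict(int, Any): dictionary of length = number of models in the ensemble: A dictionary of the models in the ensemble, keyed by model_id.

sprint_statistics()¶

Return the following statistics of the training result:

- dataset name
- metric used
- best validation score
- number of target algorithm runs
- number of successful target algorithm runs
- number of crashed target algorithm runs
- number of target algorithm runs that exceeded the memory limit
- number of target algorithm runs that exceeded the time limit

Returns

str

Metrics¶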

autosklearn.metrics.make_scorer(name: str, score_func: Callable, *, optimum: float = 1.0, worst_possible_result: float = 0.0, greater_is_better: bool = True, needs_proba: bool = False, needs_threshold: bool = False, needs_X: bool = False, **kwargs: Any) → autosklearn.metrics.Scorer[source]¶

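Make a scorer from a performance metric or loss function.

A factory function, inspired by scikit-learn, that wraps scikit-learn scoring functions for use in auto-sklearn.

Parameters

name: str: Descriptive name of the metric
score_func: callable: Score function (or loss function) with signature score_func(y, y_pred, **kwargs).
optimum: int or float, default=1: The best score achievable by the score function, i.e. the maximum for a score function and the minimum for a loss function.
worst_possible_result: int or float, default=0: The worst score achievable by the score function, i.e. the minimum for a score function and the maximum for a loss function.
greater_is_better: boolean, default=True: Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object sign-flips the outcome of score_func.
needs_proba: boolean, default=False: Whether score_func requires predict_proba to get probability estimates from a classifier.
needs_threshold: boolean, default=False: Whether score_func takes a continuous decision certainty. This only works for binary classification.
needs_X: boolean, default=False: Whether score_func requires X in __call__ to compute the metric.
**kwargs: additional arguments: Additional parameters to be passed to score_func.

Returns

scorer: callable: Callable object that returns a scalar score; greater is better, or set greater_is_better to False.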
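To illustrate the sign flip described for greater_is_better, here is a minimal stand-in written without auto-sklearn. `SketchScorer` and `mean_absolute_error` are hypothetical names defined only for this sketch; the real object returned by make_scorer is an autosklearn.metrics.Scorer and carries more state (optimum, worst_possible_result and so on).

```python
class SketchScorer:
    """Hypothetical stand-in for a scorer; not autosklearn.metrics.Scorer."""

    def __init__(self, name, score_func, greater_is_better=True, **kwargs):
        self.name = name
        self._score_func = score_func
        self._sign = 1 if greater_is_better else -1  # losses are negated
        self._kwargs = kwargs

    def __call__(self, y_true, y_pred):
        return self._sign * self._score_func(y_true, y_pred, **self._kwargs)


def mean_absolute_error(y_true, y_pred):
    """Plain-Python loss: lower is better."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mae_scorer = SketchScorer("mae", mean_absolute_error, greater_is_better=False)
value = mae_scorer([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(value)  # negative, so that "greater is better" holds for the wrapped loss
```

Negating losses lets downstream code treat every scorer uniformly as something to maximize.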

Built-in metrics¶

Classification metrics¶

Note: The default autosklearn.metrics.f1, autosklearn.metrics.precision and autosklearn.metrics.recall built-in metrics apply only to binary classification. To apply them to multilabel and multiclass classification, use the corresponding metric with an appropriate averaging mechanism, such as autosklearn.metrics.f1_macro. For more information on how these metrics are used, see the scikit-learn documentation.

autosklearn.metrics.accuracy¶: alias of accuracy

autosklearn.metrics.balanced_accuracy¶: alias of balanced_accuracy

autosklearn.metrics.f1¶: alias of f1

autosklearn.metrics.f1_macro¶: alias of f1_macro

autosklearn.metrics.f1_micro¶: alias of f1_micro

autosklearn.metrics.f1_samples¶: alias of f1_samples

autosklearn.metrics.f1_weighted¶: alias of f1_weighted

autosklearn.metrics.roc_auc¶: alias of roc_auc

autosklearn.metrics.precision¶: alias of precision

autosklearn.metrics.precision_macro¶: alias of precision_macro

autosklearn.metrics.precision_micro¶: alias of precision_micro

autosklearn.metrics.precision_samples¶: alias of precision_samples

autosklearn.metrics.precision_weighted¶: alias of precision_weighted

autosklearn.metrics.average_precision¶: alias of average_precision

autosklearn.metrics.recall¶: alias of recall

autosklearn.metrics.recall_macro¶: alias of recall_macro

autosklearn.metrics.recall_micro¶: alias of recall_micro

autosklearn.metrics.recall_samples¶: alias of recall_samples

autosklearn.metrics.recall_weighted¶: alias of recall_weighted

autosklearn.metrics.log_loss¶: alias of log_loss

Regression metrics¶

autosklearn.metrics.r2¶: alias of r2

autosklearn.metrics.mean_squared_error¶: alias of mean_squared_error

autosklearn.metrics.mean_absolute_error¶: alias of mean_absolute_error

autosklearn.metrics.median_absolute_error¶: alias of median_absolute_error

Extension interfaces¶

class autosklearn.pipeline.components.base.AutoSklearnClassificationAlgorithm[source]¶

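Provides an abstract interface for classification algorithms in auto-sklearn.

See Extending auto-sklearn for more information.

get_estimator()[source]¶

Return the underlying estimator object.

Returns

estimator: the underlying estimator object

predict(X)[source]¶

The predict function calls the predict function of the underlying scikit-learn model and returns an array with the predictions.

Parameters

X: array-like, shape = (n_samples, n_features)

Returns

array, shape = (n_samples,) or shape = (n_samples, n_labels): The predicted values

Notes

See the scikit-learn API documentation for more information.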
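The delegation pattern described for get_estimator() and predict() can be sketched without auto-sklearn as follows. Both classes are hypothetical: `MajorityClassEstimator` stands in for a scikit-learn model, and `SketchClassificationComponent` only mimics the shape of a component. A real component would subclass AutoSklearnClassificationAlgorithm and has additional required methods that are not shown here.

```python
from collections import Counter


class MajorityClassEstimator:
    """Toy stand-in for a scikit-learn classifier."""

    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


class SketchClassificationComponent:
    """Hypothetical wrapper that forwards calls to an underlying estimator."""

    def __init__(self):
        self.estimator = None

    def fit(self, X, y):
        self.estimator = MajorityClassEstimator().fit(X, y)
        return self

    def get_estimator(self):
        return self.estimator  # the underlying estimator object

    def predict(self, X):
        if self.estimator is None:
            raise NotImplementedError("fit must be called before predict")
        return self.estimator.predict(X)  # delegate to the wrapped model


component = SketchClassificationComponent().fit([[0.0], [1.0], [2.0]], [0, 1, 1])
predictions = component.predict([[5.0], [6.0]])
print(predictions)
```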

predict_proba(X)[source]¶

Predict probabilities.

Parameters

X: array-like, shape = (n_samples, n_features)

Returns

array, shape = (n_samples,) if n_classes == 2, else (n_samples, n_classes)

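class autosklearn.pipeline.components.base.AutoSklearnRegressionAlgorithm[source]¶

Provides an abstract interface for regression algorithms in auto-sklearn.

Make a subclass of this class and put it into the directory autosklearn/pipeline/components/regression to make it available.

get_estimator()[source]¶

Return the underlying estimator object.

Returns

estimator: the underlying estimator object

predict(X)[source]¶

The predict function calls the predict function of the underlying scikit-learn model and returns an array with the predictions.

Parameters

X: array-like, shape = (n_samples, n_features)

Returns

array, shape = (n_samples,) or shape = (n_samples, n_targets): The predicted values

Notes

See the scikit-learn API documentation for more information.

class autosklearn.pipeline.components.base.AutoSklearnPreprocessingAlgorithm[source]¶

Provides an abstract interface for preprocessing algorithms in auto-sklearn.

See Extending auto-sklearn for more information.

get_preprocessor()[source]¶

Return the underlying preprocessor object.

Returns

preprocessor: the underlying preprocessor object

transform(X)[source]¶

The transform function calls the transform function of the underlying scikit-learn model and returns the transformed array.

Parameters

X: array-like, shape = (n_samples, n_features)

Returns

X: array: The transformed training data

Notes

See the scikit-learn API documentation for more information.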

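Ensembles¶

Single-objective¶

class autosklearn.ensembles.EnsembleSelection(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, ensemble_size: int = 50, bagging: bool = False, mode: str = 'fast', random_state: int | np.random.RandomState | None = None)[source]¶

An ensemble of selected algorithms.

Fitting an EnsembleSelection generates an ensemble from the models generated during the search process. It can then be used for prediction.

Parameters

task_type: int

An identifier indicating which task is being performed.

metrics: Sequence[Scorer] | Scorer

The metric used to evaluate the models. If multiple metrics are passed, ensemble selection optimizes only for the first.

backend: Backend

Gives access to the backend of Auto-sklearn. Not used by Ensemble Selection.

bagging: bool = False

Whether to use bagging in ensemble selection.

mode: str in ['fast', 'slow'] = 'fast'

Which kind of ensemble generation to use:
- 'slow' - The original method used in Rich Caruana's ensemble selection.
- 'fast' - A faster version of Rich Caruana's ensemble selection.

random_state: int | RandomState | None = None

The random_state used for ensemble selection.

- None - Uses numpy's default RandomState object
- int - Successive calls to fit will produce the same results
- RandomState - Truly random; each call to fit will produce different results, even with the same object.

References

Ensemble selection from libraries of models
Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew and Alex Ksikes
ICML 2004
https://dl.acm.org/doi/10.1145/1015330.1015432
https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml04.icdm06long.pdf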
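For orientation, the greedy, with-replacement selection idea from the referenced paper can be sketched in a few lines of plain Python. This is a simplified illustration, not auto-sklearn's implementation: it uses mean squared error as the loss, ignores bagging and the 'fast'/'slow' modes, and the function names and toy data are made up.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def greedy_ensemble_selection(base_predictions, y_true, ensemble_size):
    """Repeatedly add the model (with replacement) that most lowers the loss."""
    counts = [0] * len(base_predictions)
    running_sum = [0.0] * len(y_true)
    for step in range(1, ensemble_size + 1):
        best_index, best_loss = None, float("inf")
        for i, preds in enumerate(base_predictions):
            # Ensemble prediction if model i were added at this step.
            candidate = [(s + p) / step for s, p in zip(running_sum, preds)]
            loss = mean_squared_error(y_true, candidate)
            if loss < best_loss:
                best_index, best_loss = i, loss
        counts[best_index] += 1
        chosen = base_predictions[best_index]
        running_sum = [s + p for s, p in zip(running_sum, chosen)]
    # A model's weight is the fraction of slots it occupies.
    return [c / ensemble_size for c in counts]


y_true = [1.0, 2.0, 3.0]
base_predictions = [
    [1.5, 2.5, 3.5],  # model 0: biased high by 0.5
    [0.5, 1.5, 2.5],  # model 1: biased low by 0.5
    [3.0, 3.0, 3.0],  # model 2: poor constant model
]
weights = greedy_ensemble_selection(base_predictions, y_true, ensemble_size=4)
print(weights)
```

In this toy case the two oppositely biased models cancel each other out, so they share the weight equally and the constant model is never selected.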

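fit(base_models_predictions: List[np.ndarray], true_targets: np.ndarray, model_identifiers: List[Tuple[int, int, float]], runs: Sequence[Run], X_data: SUPPORTED_FEAT_TYPES | None = None) → EnsembleSelection[source]¶

Fit an ensemble given the predictions of the base models and the targets.

Ensemble building maximizes performance (in contrast to hyperparameter optimization)!

Parameters

base_models_predictions: np.ndarray

shape = (n_base_models, n_data_points, n_targets). n_targets is the number of classes for classification, and 0 or 1 for regression.

Can also be a list of 2D numpy arrays, to avoid copying all predictions into a single, large numpy array.

X_data: list-like or sparse data

true_targets: array of shape [n_targets]

model_identifiers: identifier for each base model.

Can be used for practical text output of the ensemble.

runs: Sequence[Run]

Additional information for each run executed by SMAC that was considered by the ensemble builder.

Returns

self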

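get_identifiers_with_weights() → List[Tuple[Tuple[int, int, float], float]][source]¶

Return (identifier, weight) pairs for all models that were passed to the ensemble builder.

Parameters

models: dict {identifier: model object}: The identifiers are the same as those passed to the fit() method. Models can be used for nice printing.

Returns

List[Tuple[Tuple[int, int, float], float]]

get_models_with_weights(models: Dict[Tuple[int, int, float], autosklearn.pipeline.base.BasePipeline]) → List[Tuple[float, autosklearn.pipeline.base.BasePipeline]][source]¶

List of (weight, model) pairs for all models included in the ensemble.

Parameters

models: dict {identifier: model object}: The identifiers are the same as those passed to the fit() method. Models can be used for nice printing.

Returns

List[Tuple[float, BasePipeline]]

get_selected_model_identifiers() → List[Tuple[int, int, float]][source]¶

Return the identifiers of the models in the ensemble.

This includes models that have a weight of zero!

Returns

list

get_validation_performance() → float[source]¶

Return the validation performance of the ensemble.

Returns

float

predict(base_models_predictions: Union[numpy.ndarray, List[numpy.ndarray]]) → numpy.ndarray[source]¶

Create ensemble predictions from the base model predictions.

Parameters

base_models_predictions: np.ndarray: shape = (n_base_models, n_data_points, n_targets). Same as in the fit method.

Returns

np.ndarray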
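One common way to combine base model predictions of shape (n_base_models, n_data_points, n_targets) is a weighted average. The sketch below assumes that rule, which the text above does not spell out, and uses nested Python lists instead of numpy arrays so that it runs without dependencies. The function name and the example values are illustrative only.

```python
def weighted_ensemble_predict(base_models_predictions, weights):
    """Weighted average over the first axis (one entry per base model)."""
    n_points = len(base_models_predictions[0])
    n_targets = len(base_models_predictions[0][0])
    combined = [[0.0] * n_targets for _ in range(n_points)]
    for model_predictions, weight in zip(base_models_predictions, weights):
        if weight == 0:
            continue  # zero-weight models do not contribute
        for i, row in enumerate(model_predictions):
            for j, value in enumerate(row):
                combined[i][j] += weight * value
    return combined


# Two base models, two data points, two class probabilities each.
base_models_predictions = [
    [[1.0, 0.0], [0.5, 0.5]],  # model 0
    [[0.0, 1.0], [0.5, 0.5]],  # model 1
]
ensemble_prediction = weighted_ensemble_predict(base_models_predictions, [0.75, 0.25])
print(ensemble_prediction)
```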

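Single model classes¶

These classes wrap a single model to provide a unified interface within Auto-sklearn.

class autosklearn.ensembles.SingleBest(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, random_state: int | np.random.RandomState | None = None)[source]¶

Ensemble consisting of the single best model.

Parameters

task_type: int: An identifier indicating which task is being performed.
metrics: Sequence[Scorer] | Scorer: The metric used to evaluate the models.
random_state: int | RandomState | None = None: Not used.
backend: Backend: Gives access to the backend of Auto-sklearn. Not used.

fit(base_models_predictions: np.ndarray | list[np.ndarray], true_targets: np.ndarray, model_identifiers: list[tuple[int, int, float]], runs: Sequence[Run], X_data: SUPPORTED_FEAT_TYPES | None = None) → SingleBest[source]¶

Select the single best model.

Parameters

base_models_predictions: np.ndarray

shape = (n_base_models, n_data_points, n_targets). n_targets is the number of classes for classification, and 0 or 1 for regression.

Can also be a list of 2D numpy arrays, to avoid copying all predictions into a single, large numpy array.

true_targets: array of shape [n_targets]

model_identifiers: identifier for each base model.

Can be used for practical text output of the ensemble.

runs: Sequence[Run]

Additional information for each run executed by SMAC that was considered by the ensemble builder. Not used.

X_data: array-like | sparse matrix | None = None

Returns

self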
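Selecting a single best model amounts to an argmin over per-model losses. The sketch below shows that idea with made-up identifiers in the (int, int, float) form used by the signatures above; `select_single_best` is a hypothetical helper and the loss values are invented for illustration.

```python
def select_single_best(losses, model_identifiers):
    """Return the identifier of the model with the lowest loss."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return model_identifiers[best_index]


# Made-up identifiers and validation losses for three base models.
model_identifiers = [(1, 2, 0.0), (1, 5, 0.0), (1, 9, 0.0)]
losses = [0.31, 0.12, 0.27]

best = select_single_best(losses, model_identifiers)
print(best)
```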

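class autosklearn.ensembles.SingleModelEnsemble(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, model_index: int, random_state: int | np.random.RandomState | None = None)[source]¶

Ensemble consisting of a single model.

This class is used by MultiObjectiveDummyEnsemble to represent ensembles consisting of a single model. It should not be used on its own.

Do not use this class yourself!

Parameters

task_type: int: An identifier indicating which task is being performed.
metrics: Sequence[Scorer] | Scorer: The metric used to evaluate the models.
backend: Backend: Gives access to the backend of Auto-sklearn. Not used.
model_index: int: Index of the model that constitutes the ensemble. This index is used to select the correct predictions passed during fit and predict.
random_state: int | RandomState | None = None: Not used.

Dummy implementation of the fit method.

The actual work of passing the model index is done in the constructor. This method only stores the identifier of the selected model and computes its validation loss.

Parameters

base_models_predictions: np.ndarray

shape = (n_base_models, n_data_points, n_targets). n_targets is the number of classes for classification, and 0 or 1 for regression.

Can also be a list of 2D numpy arrays, to avoid copying all predictions into a single, large numpy array.

true_targets: array of shape [n_targets]

model_identifiers: identifier for each base model.

Can be used for practical text output of the ensemble.

runs: Sequence[Run]

Additional information for each run executed by SMAC that was considered by the ensemble builder. Not used.

X_data: list-like | spmatrix | None = None

X data to give to the metric if required

Returns

self

class autosklearn.ensembles.SingleBestFromRunhistory(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, run_history: RunHistory, seed: int, random_state: int | np.random.RandomState | None = None)[source]¶

In the case of a crash, this class searches for the best individual model.

That model is returned as an ensemble of a single object, to comply with the expected interface of an AbstractEnsemble.

Do not use this class yourself!

get_identifiers_from_run_history() → list[tuple[int, int, float]][source]¶

Parse the run history to identify the best-performing model.

Populates the identifiers attribute, which the backend uses to access the actual model.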

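Multi-objective¶

class autosklearn.ensembles.MultiObjectiveDummyEnsemble(task_type: int, metrics: Sequence[Scorer] | Scorer, backend: Backend, random_state: int | np.random.RandomState | None = None)[source]¶

A dummy implementation of a multi-objective ensemble.

Builds ensembles that are each an individual model on the Pareto front.

Parameters

task_type: int: An identifier indicating which task is being performed.
metrics: Sequence[Scorer] | Scorer: The metrics used to evaluate the models.
backend: Backend: Gives access to the backend of Auto-sklearn. Not used.
random_state: int | RandomState | None = None: Not used.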
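As background for the Pareto front mentioned above, the following sketch finds the non-dominated models among a set of per-model loss tuples, where lower is better for every objective. It is a generic illustration of Pareto dominance, not the code auto-sklearn uses, and the loss values are made up.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(losses):
    """Indices of models whose loss tuple is not dominated by any other model."""
    return [
        i
        for i, candidate in enumerate(losses)
        if not any(
            dominates(other, candidate)
            for j, other in enumerate(losses)
            if j != i
        )
    ]


# One (loss on metric 1, loss on metric 2) tuple per model.
losses = [(0.10, 0.40), (0.20, 0.20), (0.30, 0.30), (0.40, 0.10)]
front = pareto_front(losses)
print(front)
```

Here the third model is dominated by the second (worse on both metrics), so it is left out; each remaining index would correspond to one single-model ensemble.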

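fit(base_models_predictions: list[np.ndarray], true_targets: np.ndarray, model_identifiers: list[tuple[int, int, float]], runs: Sequence[Run], X_data: SUPPORTED_FEAT_TYPES | None = None) → MultiObjectiveDummyEnsemble[source]¶

Select the dummy ensemble given the predictions of the base models and the targets.

Parameters

base_models_predictions: np.ndarray

shape = (n_base_models, n_data_points, n_targets). n_targets is the number of classes for classification, and 0 or 1 for regression.

Can also be a list of 2D numpy arrays, to avoid copying all predictions into a single, large numpy array.

true_targets: array of shape [n_targets]

model_identifiers: identifier for each base model.

Can be used for practical text output of the ensemble.

runs: Sequence[Run]

Additional information for each run executed by SMAC that was considered by the ensemble builder. Not used.

X_data: list-like | sparse matrix | None = None

X data to give to the metric if required

Returns

self

get_identifiers_with_weights() → list[tuple[tuple[int, int, float], float]][source]¶

Return (identifier, weight) pairs for all models that were passed to the ensemble builder, based on the ensemble that is best for the first metric.

Parameters

models: dict {identifier: model object}: The identifiers are the same as those passed to the fit() method. Models can be used for nice printing.

Returns

list[tuple[tuple[int, int, float], float]]

get_models_with_weights(models: dict[tuple[int, int, float], BasePipeline]) → list[tuple[float, BasePipeline]][source]¶

Return a list of (weight, model) pairs for the ensemble that is best for the first metric.

Parameters

models: dict {identifier: model object}: The identifiers are the same as those passed to the fit() method. Models can be used for nice printing.

Returns

list[tuple[float, BasePipeline]]

get_selected_model_identifiers() → list[tuple[int, int, float]][source]¶

Return the identifiers of the models in the ensemble that is best for the first metric.

This includes models that have a weight of zero!

Returns

list

get_validation_performance() → float[source]¶

Validation performance of the ensemble that is best for the first metric.

Returns

float

property pareto_set: Sequence[autosklearn.ensembles.abstract_ensemble.AbstractEnsemble]¶

Get the sequence of ensembles that are on the Pareto front.

Returns

Sequence[AbstractEnsemble]

Raises

SklearnNotFittedError: If fit has not been called and the Pareto set does not exist yet

predict(base_models_predictions: np.ndarray | list[np.ndarray]) → np.ndarray[source]¶

Predict using the ensemble that is best for the first metric.

Parameters

base_models_predictions: np.ndarray: shape = (n_base_models, n_data_points, n_targets). Same as in the fit method.

Returns

np.ndarray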