Resampling Strategies

In auto-sklearn, different resampling strategies can be selected through the arguments resampling_strategy and resampling_strategy_arguments. The following example shows common settings for the AutoSklearnClassifier.

import numpy as np
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

Data Loading

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

Holdout

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder="/tmp/autosklearn_resampling_example_tmp",
    disable_evaluator_output=False,
    # 'holdout' with 'train_size'=0.67 is the default argument setting
    # for AutoSklearnClassifier. It is specified explicitly in this example
    # for demonstration purposes.
    resampling_strategy="holdout",
    resampling_strategy_arguments={"train_size": 0.67},
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")
AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      per_run_time_limit=30,
                      resampling_strategy_arguments={'train_size': 0.67},
                      time_left_for_this_task=120,
                      tmp_folder='/tmp/autosklearn_resampling_example_tmp')

Get the score of the final ensemble

predictions = automl.predict(X_test)
print("Accuracy score holdout: ", sklearn.metrics.accuracy_score(y_test, predictions))
Accuracy score holdout:  0.958041958041958
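To make the train_size=0.67 setting above concrete, here is a minimal sketch of a single holdout split over 426 rows, which is the size of X_train after the 75/25 split of the 569-row dataset. It uses scikit-learn's ShuffleSplit as a stand-in; this is an assumption for illustration, and auto-sklearn's internal splitter may differ (for example by stratifying on the labels). The names n_rows, fit_idx, and val_idx are hypothetical.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Number of rows in X_train in the example above: 569 rows minus 143 test rows.
n_rows = 426

# One random split that keeps 67% of the rows for fitting; the rest is
# held out for validation. This only illustrates the idea of a holdout split.
splitter = ShuffleSplit(n_splits=1, train_size=0.67, random_state=0)
fit_idx, val_idx = next(splitter.split(np.arange(n_rows)))

print("rows used for fitting:   ", len(fit_idx))
print("rows used for validation:", len(val_idx))
```

With a single holdout split, every candidate model is scored on the same validation rows, which is cheap but gives a noisier estimate than cross-validation.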

Cross-validation

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder="/tmp/autosklearn_resampling_example_tmp",
    disable_evaluator_output=False,
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 5},
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

# One can use models trained during cross-validation directly to predict
# for unseen data. For this, all k models trained during k-fold
# cross-validation are considered as a single soft-voting ensemble inside
# the ensemble constructed with ensemble selection.
print("Before re-fit")
predictions = automl.predict(X_test)
print("Accuracy score CV", sklearn.metrics.accuracy_score(y_test, predictions))
Before re-fit
Accuracy score CV 0.965034965034965

Perform a refit

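During fit(), models are fit on individual cross-validation folds. To use all available data, we call refit(), which trains all models in the final ensemble on the whole dataset.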

print("After re-fit")
automl.refit(X_train.copy(), y_train.copy())
predictions = automl.predict(X_test)
print("Accuracy score CV", sklearn.metrics.accuracy_score(y_test, predictions))
After re-fit
Accuracy score CV 0.958041958041958
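The two prediction modes described above, soft voting over the k fold models and a single model refit on all data, can be sketched outside auto-sklearn. The following is a conceptual illustration only, assuming a synthetic dataset and a plain LogisticRegression as the model; it does not reproduce auto-sklearn's internal code, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (not the breast cancer dataset from the example).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
X_fit, y_fit, X_new = X_toy[:150], y_toy[:150], X_toy[150:]

# One model per fold, each fitted only on the training part of its fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_models = []
for train_idx, _ in cv.split(X_fit, y_fit):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_fit[train_idx], y_fit[train_idx])
    fold_models.append(model)

# Soft voting: average the class probabilities of the fold models,
# then pick the most probable class for each new row.
mean_proba = np.mean([m.predict_proba(X_new) for m in fold_models], axis=0)
soft_vote_pred = fold_models[0].classes_[np.argmax(mean_proba, axis=1)]

# Refit: one model with the same configuration, trained on all fitting rows.
refit_model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
refit_pred = refit_model.predict(X_new)

print("share of identical predictions:", np.mean(soft_vote_pred == refit_pred))
```

Each fold model sees only part of the training rows, while the refit model sees all of them, so the two sets of predictions can differ slightly, as they do in the output above.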

scikit-learn splitter objects

It is also possible to use scikit-learn's splitter classes to customize the splits further. If full control over the splitting is needed, scikit-learn's PredefinedSplit can be used.

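Below is an example of using a predefined split. We split the training data by the first feature. In practice, the split should be chosen to suit the specific use case.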

selected_indices = (X_train[:, 0] < np.mean(X_train[:, 0])).astype(int)
resampling_strategy = sklearn.model_selection.PredefinedSplit(
    test_fold=selected_indices
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder="/tmp/autosklearn_resampling_example_tmp",
    disable_evaluator_output=False,
    resampling_strategy=resampling_strategy,
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

print(automl.sprint_statistics())
auto-sklearn results:
  Dataset name: breast_cancer
  Metric: accuracy
  Best validation score: 0.964789
  Number of target algorithm runs: 25
  Number of successful target algorithm runs: 25
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
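To see how a test_fold array like the one passed to PredefinedSplit above turns into splits, the following sketch applies the same "below the mean" rule to a small, made-up feature column. The values in first_feature are hypothetical. It shows scikit-learn's behavior only; how auto-sklearn consumes the resulting splits is not covered here.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Made-up stand-in for the first feature column of X_train.
first_feature = np.array([1.0, 5.0, 2.0, 8.0, 3.0, 9.0])

# Same rule as in the example: 1 if the value is below the mean, else 0.
test_fold = (first_feature < np.mean(first_feature)).astype(int)
print(test_fold)  # [1 0 1 0 1 0]

# Every distinct non-negative value in test_fold is used once as the test set,
# so the labels 0 and 1 yield two splits.
splitter = PredefinedSplit(test_fold=test_fold)
splits = [(train.tolist(), test.tolist()) for train, test in splitter.split()]
print(splitter.get_n_splits())  # 2
for train, test in splits:
    print("train:", train, "test:", test)

# Rows labeled -1 are never placed in a test set. Labeling the above-mean
# rows -1 therefore leaves a single train/validation split.
single_fold = np.where(test_fold == 1, 0, -1)
single_splitter = PredefinedSplit(test_fold=single_fold)
single_splits = [(tr.tolist(), te.tolist()) for tr, te in single_splitter.split()]
print(single_splitter.get_n_splits())  # 1
```

Whether two symmetric splits or one fixed validation set is appropriate depends on the use case; the -1 convention is the way to express the latter with PredefinedSplit.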

For custom resampling strategies (i.e. resampling strategies that Auto-sklearn does not define as strings), it is necessary to perform a refit.

automl.refit(X_train, y_train)
AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      per_run_time_limit=30,
                      resampling_strategy=PredefinedSplit(test_fold=array([0, 0, ..., 1, 1])),
                      time_left_for_this_task=120,
                      tmp_folder='/tmp/autosklearn_resampling_example_tmp')

Get the score of the final ensemble (again)

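One might expect this score to be poor, since we "destroyed" the dataset by splitting it on the first feature. In this run, however, the accuracy printed below equals the holdout result above (0.958041958041958).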

predictions = automl.predict(X_test)
print(
    "Accuracy score custom split", sklearn.metrics.accuracy_score(y_test, predictions)
)
Accuracy score custom split 0.958041958041958
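The accuracy values printed in this example can be read as counts of correctly classified test rows. The sketch below recovers those counts, assuming the breast cancer dataset has 569 rows and that train_test_split, called without explicit sizes, holds out 25% of the rows with the test count rounded up. The dictionary reported simply restates the scores shown in the outputs above.

```python
import math

n_samples = 569        # rows in the breast cancer dataset
test_fraction = 0.25   # train_test_split default when no size is given

n_test = math.ceil(test_fraction * n_samples)  # rows in X_test
n_train = n_samples - n_test                   # rows in X_train

# Accuracy scores as printed in the example outputs.
reported = {
    "holdout": 0.958041958041958,
    "cv, before refit": 0.965034965034965,
    "cv, after refit": 0.958041958041958,
    "custom split": 0.958041958041958,
}

# Convert each accuracy back into a number of correct predictions.
correct_counts = {name: round(score * n_test) for name, score in reported.items()}

print("test rows:", n_test, "training rows:", n_train)
for name, correct in correct_counts.items():
    print(f"{name}: {correct} of {n_test} correct")
print("one test row changes accuracy by", 1 / n_test)
```

Under these assumptions, the runs differ by a single test row, so the differences between the strategies in this example are too small to rank them.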

Total running time of the script: (6 minutes 35.274 seconds)
