Multi-label Classification¶
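This example shows how to format the targets for a multi-label classification problem. Details on multi-label classification can be found in the scikit-learn multiclass documentation.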
import numpy as np
from pprint import pprint
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
from sklearn.utils.multiclass import type_of_target
import autosklearn.classification
Data Loading¶
# Using reuters multilabel dataset -- https://www.openml.org/d/40594
X, y = sklearn.datasets.fetch_openml(data_id=40594, return_X_y=True, as_frame=False)
# fetch_openml returns the targets as a numpy array of "TRUE"/"FALSE" strings.
# Re-map it to an integer dtype with ones and zeros.
# This is to comply with the scikit-learn requirement:
# "Positive classes are indicated with 1 and negative classes with 0 or -1."
# More information on: https://scikit-learn.cn/stable/modules/multiclass.html
y[y == "TRUE"] = 1
y[y == "FALSE"] = 0
y = y.astype(int)
# Using type_of_target is a good way to make sure your data
# is properly formatted
print(f"type_of_target={type_of_target(y)}")

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)
type_of_target=multilabel-indicator
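The remapping above can be tried without downloading the dataset. The following is a minimal, dependency-free sketch: the tiny `raw_labels` matrix and the helper `is_binary_indicator` are hypothetical illustrations, and the helper is a simplified stand-in for the idea behind `type_of_target`, not a copy of scikit-learn's logic.

```python
# Hypothetical toy label matrix: one row per sample, one column per label,
# encoded as "TRUE"/"FALSE" strings like the raw OpenML targets above.
raw_labels = [
    ["TRUE", "FALSE", "TRUE"],
    ["FALSE", "FALSE", "TRUE"],
    ["TRUE", "TRUE", "FALSE"],
]

# Map the strings to the 1/0 encoding that scikit-learn expects.
mapping = {"TRUE": 1, "FALSE": 0}
indicator = [[mapping[value] for value in row] for row in raw_labels]


def is_binary_indicator(matrix):
    """Simplified check (an assumption, not scikit-learn's implementation):
    rectangular, at least two label columns, and only 0/1 entries."""
    if not matrix:
        return False
    width = len(matrix[0])
    if width < 2:
        return False
    return all(
        len(row) == width and all(value in (0, 1) for value in row)
        for row in matrix
    )


print(indicator)
print(is_binary_indicator(indicator))
```

Running the same check on `raw_labels` returns `False`, which mirrors why the string targets have to be converted before fitting.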
Building the classifier¶
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    per_run_time_limit=30,
    # The two flags below are provided to speed up calculations.
    # Not recommended for a real implementation.
    initial_configurations_via_metalearning=0,
    smac_scenario_args={"runcount_limit": 1},
)
automl.fit(X_train, y_train, dataset_name="reuters")
AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      initial_configurations_via_metalearning=0,
                      per_run_time_limit=30,
                      smac_scenario_args={'runcount_limit': 1},
                      time_left_for_this_task=60)
View the models found by auto-sklearn¶
print(automl.leaderboard())
          rank  ensemble_weight           type      cost  duration
model_id
2            1              1.0  random_forest  0.447294  4.031171
Print the final ensemble constructed by auto-sklearn¶
pprint(automl.show_models(), indent=4)
{   2: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f05d248c2e0>,
           'cost': 0.4472941828699525,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f05d0f2e520>,
           'ensemble_weight': 1.0,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f05d248c8b0>,
           'model_id': 2,
           'rank': 1,
           'sklearn_classifier': RandomForestClassifier(max_features=15, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)}}
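The dictionary printed above is keyed by model id, and each entry carries `rank`, `ensemble_weight`, and `cost` fields. A minimal sketch of turning such a dictionary into a compact summary follows. `summarize_models` is a hypothetical helper, and the `models` literal is a hand-copied subset of the output above rather than a live call to `automl.show_models()`.

```python
def summarize_models(models):
    """Return (rank, model_id, ensemble_weight, rounded cost) tuples,
    ordered by rank. Assumes each entry has the keys shown in the output above."""
    ordered = sorted(models.items(), key=lambda item: item[1]["rank"])
    return [
        (info["rank"], model_id, info["ensemble_weight"], round(info["cost"], 4))
        for model_id, info in ordered
    ]


# Hand-copied subset of the printed dictionary (illustrative stand-in).
models = {
    2: {
        "model_id": 2,
        "rank": 1,
        "ensemble_weight": 1.0,
        "cost": 0.4472941828699525,
    }
}

summary = summarize_models(models)
for rank, model_id, weight, cost in summary:
    print(f"rank={rank} model_id={model_id} weight={weight} cost={cost}")
```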
Print statistics about the auto-sklearn run¶
# Print statistics about the auto-sklearn run such as number of
# iterations, number of models failed with a time out.
print(automl.sprint_statistics())
auto-sklearn results:
  Dataset name: reuters
  Metric: f1_macro
  Best validation score: 0.552706
  Number of target algorithm runs: 1
  Number of successful target algorithm runs: 1
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
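The statistics report `f1_macro` as the metric. As a rough illustration of what that metric measures, the sketch below computes a macro-averaged F1 by hand on a hypothetical toy label matrix: F1 is computed per label column, and the per-label values are averaged without weighting. The helper names, the toy data, and the convention of scoring a label with no positives and no predictions as 0.0 are assumptions of this sketch. The last lines check that, for this run, the cost shown in the leaderboard is consistent with 1 minus the best validation score.

```python
def f1_for_label(true_column, pred_column):
    """F1 for a single binary label: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(true_column, pred_column))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    # Assumption: an empty label (nothing true, nothing predicted) scores 0.0.
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    true_columns = zip(*y_true)
    pred_columns = zip(*y_pred)
    scores = [f1_for_label(t, p) for t, p in zip(true_columns, pred_columns)]
    return sum(scores) / len(scores)


# Hypothetical toy data: 4 samples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
toy_score = macro_f1(y_true, y_pred)
print(f"macro F1 on toy data: {toy_score:.4f}")

# Values copied from the run above: cost appears to be 1 - validation score.
best_validation_score = 0.552706
leaderboard_cost = 0.447294
print(abs((1 - best_validation_score) - leaderboard_cost) < 1e-6)
```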
Get the score of the final ensemble¶
predictions = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))
Accuracy score 0.604
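For multilabel targets, scikit-learn's `accuracy_score` computes subset accuracy: a sample counts as correct only if every one of its labels is predicted correctly. The sketch below reimplements that idea by hand, next to a more lenient per-label accuracy, to show how strict the measure is. The toy data and helper names are hypothetical and not part of scikit-learn.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label row matches exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Fraction of individual label entries predicted correctly."""
    total = sum(len(row) for row in y_true)
    correct = sum(
        1
        for t_row, p_row in zip(y_true, y_pred)
        for t, p in zip(t_row, p_row)
        if t == p
    )
    return correct / total


# Hypothetical toy data: 4 samples, 3 labels, two single-label mistakes.
y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

subset = subset_accuracy(y_true, y_pred)
per_label = per_label_accuracy(y_true, y_pred)
print(f"subset accuracy:    {subset:.3f}")
print(f"per-label accuracy: {per_label:.3f}")
```

Two wrong entries out of twelve are enough to halve the subset accuracy here, so a subset accuracy such as the one reported above should be read as an exact-match rate rather than a per-label hit rate.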
Total running time of the script: ( 0 minutes 24.585 seconds)