Feature Types

In auto-sklearn, the feature types of a dataset can be specified by passing the feat_type argument when calling fit(). The example below demonstrates how to do this.

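Additionally, you can provide a properly formatted pandas DataFrame, in which case the feature types are inferred automatically, as demonstrated in the Performance-over-time plot example.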
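
As a minimal sketch of what a "properly formatted" DataFrame might look like, the snippet below marks a categorical column with the pandas `category` dtype, so that each column carries its own type information. The column names and values are made up for illustration. The tagging rule mirrors the list comprehension used later in this example and is not auto-sklearn's internal inference code.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "age": [25, 32, 47],
        "income": [30000.0, 52000.0, 81000.0],
        "occupation": ["clerk", "engineer", "clerk"],
    }
)

# Mark the categorical column explicitly so that its dtype records the type.
df["occupation"] = df["occupation"].astype("category")

# Tag each column from its dtype: "category" means categorical,
# anything else is treated as numerical in this sketch.
labels = [
    "Categorical" if dtype.name == "category" else "Numerical"
    for dtype in df.dtypes
]
print(labels)  # ['Numerical', 'Numerical', 'Categorical']
```

A frame prepared this way would presumably be passed to fit() as-is, without a feat_type argument.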

import numpy as np

import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

import autosklearn.classification

Data Loading

Load the Australian dataset from https://www.openml.org/d/40981

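# Fetch the dataset as a pandas DataFrame so the per-column dtypes are available.
bunch = sklearn.datasets.fetch_openml(data_id=40981, as_frame=True)
y = bunch["target"].to_numpy()
# Use the builtin float: the np.float alias is deprecated since NumPy 1.20.
X = bunch["data"].to_numpy(float)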

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

# Auto-sklearn can automatically recognize categorical/numerical data from a pandas
# DataFrame. This example highlights how the user can provide the feature types,
# when using numpy arrays, as there is no per-column dtype in this case.
# feat_type is a list that tags each column from a DataFrame / numpy array / list
# with the case-insensitive string "categorical" or "numerical", accordingly.
feat_type = [
    "Categorical" if x.name == "category" else "Numerical" for x in bunch["data"].dtypes
]
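
The constraints described in the comments above (one label per column, each a case-insensitive "categorical" or "numerical") can be checked before calling fit(). The helper below is a hypothetical sketch, not part of auto-sklearn; `check_feat_type` and its arguments are names introduced here for illustration only.

```python
def check_feat_type(feat_type, n_columns):
    """Hypothetical helper: return lower-cased labels or raise ValueError."""
    allowed = {"categorical", "numerical"}

    # There must be exactly one label per column.
    if len(feat_type) != n_columns:
        raise ValueError(f"expected {n_columns} labels, got {len(feat_type)}")

    # Labels are compared case-insensitively.
    normalized = [str(label).lower() for label in feat_type]
    unknown = sorted(set(normalized) - allowed)
    if unknown:
        raise ValueError(f"unknown feature types: {unknown}")
    return normalized


print(check_feat_type(["Categorical", "NUMERICAL", "numerical"], n_columns=3))
# ['categorical', 'numerical', 'numerical']
```

In this example it could be called as `check_feat_type(feat_type, X.shape[1])` to catch a mismatch between the label list and the array width early.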

Build and fit a classifier

cls = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    # The two flags below are provided only to speed up calculations.
    # They are not recommended for real use.
    initial_configurations_via_metalearning=0,
    smac_scenario_args={"runcount_limit": 1},
)
cls.fit(X_train, y_train, X_test, y_test, feat_type=feat_type)
AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      initial_configurations_via_metalearning=0,
                      per_run_time_limit=3,
                      smac_scenario_args={'runcount_limit': 1},
                      time_left_for_this_task=30)

Get the score of the final ensemble

predictions = cls.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))
Accuracy score 0.8786127167630058
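
For reference, the accuracy reported above is the fraction of test samples whose predicted label equals the true label. The pure-Python sketch below shows that calculation on made-up labels. The `accuracy` function is an illustrative stand-in for this one use case, not the scikit-learn implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: three of the four predictions are correct.
print(accuracy(["0", "1", "1", "0"], ["0", "1", "0", "0"]))  # 0.75
```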

Total running time of the script: (0 minutes 25.381 seconds)
