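Feature Types
In auto-sklearn, you can specify the feature types of a dataset by passing the feat_type argument when calling fit(). The example below shows how to do this.
Alternatively, you can pass a properly formatted pandas DataFrame, in which case the feature types are inferred automatically, as shown in the Performance-over-time plot example.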
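To make "properly formatted" concrete, here is a minimal sketch of how a categorical column can be marked in a pandas DataFrame so that a per-column dtype is available. The toy data and column names are hypothetical and not part of the original example; the list comprehension mirrors the one used later on this page.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "age": [23.0, 41.0, 35.0],
        "job_code": [1, 3, 1],
        "income": [1800.0, 5200.0, 3100.0],
    }
)

# Integer codes look numerical until the column is cast to the category dtype.
df["job_code"] = df["job_code"].astype("category")

# Derive one tag per column from the dtypes, as the example on this page does.
feat_type = [
    "Categorical" if dtype.name == "category" else "Numerical"
    for dtype in df.dtypes
]
print(feat_type)
```

Without the astype("category") cast, job_code would be tagged as numerical, which is why the dtype has to be set explicitly for columns that hold integer-coded categories.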
import numpy as np
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
import autosklearn.classification
Data Loading
Load the Australian dataset from https://www.openml.org/d/40981
bunch = data = sklearn.datasets.fetch_openml(data_id=40981, as_frame=True)
y = bunch["target"].to_numpy()
X = bunch["data"].to_numpy(float)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)
# Auto-sklearn can automatically recognize categorical/numerical data from a pandas
# DataFrame. This example highlights how the user can provide the feature types,
# when using numpy arrays, as there is no per-column dtype in this case.
# feat_type is a list that tags each column of a DataFrame / numpy array / list
# with the case-insensitive string "categorical" or "numerical", accordingly.
feat_type = [
    "Categorical" if x.name == "category" else "Numerical" for x in bunch["data"].dtypes
]
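Because feat_type must line up with the columns one to one, it can help to check the list before calling fit(). The helper below is a hypothetical sketch, not part of auto-sklearn: normalize_feat_type is an invented name, and it accepts only the two tags named in the comment above, so auto-sklearn's own validation may differ.

```python
def normalize_feat_type(feat_type, n_columns):
    """Check a feat_type list and return it lowercased (hypothetical helper)."""
    # One tag is needed per column.
    if len(feat_type) != n_columns:
        raise ValueError(f"expected {n_columns} entries, got {len(feat_type)}")

    # Assumption: only the two tags mentioned in the example are allowed.
    allowed = {"categorical", "numerical"}
    normalized = [str(tag).lower() for tag in feat_type]

    unknown = sorted(set(normalized) - allowed)
    if unknown:
        raise ValueError(f"unsupported feature types: {unknown}")
    return normalized


example = normalize_feat_type(["Categorical", "numerical", "NUMERICAL"], n_columns=3)
print(example)
```

In the example on this page, a call such as normalize_feat_type(feat_type, X.shape[1]) would catch a list that is too short or contains a misspelled tag before any fitting time is spent.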
Build and fit a classifier
cls = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    # The two flags below are provided to speed up calculations.
    # They are not recommended for a real implementation.
    initial_configurations_via_metalearning=0,
    smac_scenario_args={"runcount_limit": 1},
)
cls.fit(X_train, y_train, X_test, y_test, feat_type=feat_type)
AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
initial_configurations_via_metalearning=0,
per_run_time_limit=3,
smac_scenario_args={'runcount_limit': 1},
time_left_for_this_task=30)
Get the Score of the final ensemble
predictions = cls.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))
Accuracy score 0.8786127167630058
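For reference, the accuracy reported above is the fraction of test samples whose predicted label equals the true label. The sketch below recomputes that quantity in plain Python on hypothetical labels; the accuracy function is an illustrative stand-in, not the scikit-learn implementation.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where prediction and truth agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Hypothetical labels: three of the four predictions match.
print(accuracy(["1", "0", "1", "1"], ["1", "0", "0", "1"]))  # 0.75
```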
Total running time of the script: (0 minutes 25.381 seconds)