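Text preprocessing
The following example shows how to fit a simple NLP problem with auto-sklearn.
- For an introduction to text preprocessing, you can refer to the following links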
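To give a rough sense of what text preprocessing involves, here is a minimal sketch that turns raw strings into numeric features with a bag-of-words TF-IDF encoding, using scikit-learn directly. The toy documents are made up for illustration, and this is not meant as a description of the exact steps auto-sklearn applies internally.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the pitcher threw a fast ball",
    "the motherboard has a new bios",
    "the batter hit the ball",
]

# With default settings, tokens shorter than two characters (such as "a")
# are dropped and no stop-word list is applied.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

print(features.shape)                 # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_)) # the learned vocabulary
```

Each document becomes a row of term weights, which is the kind of numeric representation a classifier can consume.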
from pprint import pprint
import pandas as pd
import sklearn.metrics
from sklearn.datasets import fetch_20newsgroups
import autosklearn.classification
Data loading
cats = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"]
X_train, y_train = fetch_20newsgroups(
    subset="train",  # select the training set
    shuffle=True,  # shuffle the data set for unbiased validation results
    random_state=42,  # set a random seed for reproducibility
    categories=cats,  # select only 2 out of 20 labels
    return_X_y=True,  # the 20NG dataset consists of 2 columns, X: the text data, y: the label
)  # load these two columns separately as numpy arrays
X_test, y_test = fetch_20newsgroups(
    subset="test",  # select the test set for unbiased evaluation
    categories=cats,  # select only 2 out of 20 labels
    return_X_y=True,  # the 20NG dataset consists of 2 columns, X: the text data, y: the label
)  # load these two columns separately as numpy arrays
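Creating a pandas dataframe
Both categorical and text features are often strings. Pandas stores Python strings in the generic object dtype, so make sure each column is given the dtype that matches its role.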
# create a pandas dataframe for training, labeling the "Text" column as string
X_train = pd.DataFrame({"Text": pd.Series(X_train, dtype="string")})
# create a pandas dataframe for testing, labeling the "Text" column as string
X_test = pd.DataFrame({"Text": pd.Series(X_test, dtype="string")})
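The dtype distinction can be sketched on a tiny frame. The "Team" column and its values are hypothetical and only there to contrast a categorical column with a free-text one; the example assumes that, as the note above suggests, the column dtype is what tells the two kinds of string data apart.

```python
import pandas as pd

# Hypothetical raw data: one free-text column and one categorical column.
raw = {
    "Text": ["the pitcher threw a fast ball", "the motherboard has a new bios"],
    "Team": ["red", "blue"],
}

df = pd.DataFrame(
    {
        "Text": pd.Series(raw["Text"], dtype="string"),    # free text
        "Team": pd.Series(raw["Team"], dtype="category"),  # categorical
    }
)

# Both columns hold strings, but they now carry different dtypes.
print(df.dtypes)
```

Setting the dtype explicitly avoids relying on pandas' default inference for string data.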
Build and fit a classifier
# create an autosklearn Classifier or Regressor depending on your task at hand.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    per_run_time_limit=30,
)
automl.fit(X_train, y_train, dataset_name="20_Newsgroups") # fit the automl model
AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
per_run_time_limit=30, time_left_for_this_task=60)
View the models found by auto-sklearn
print(automl.leaderboard())
          rank  ensemble_weight           type      cost   duration
model_id
3            1             0.34            mlp  0.022959  12.225609
2            2             0.56  random_forest  0.040816  12.765663
4            3             0.10    extra_trees  0.079082  11.489445
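The ensemble_weight column can be read as each model's share in the final prediction. The sketch below illustrates a weighted average of class probabilities using the three weights from the leaderboard. The per-model probabilities are invented for illustration, and the averaging rule is an assumption about how such weights are typically combined, not a transcript of auto-sklearn's internals.

```python
# Ensemble weights taken from the leaderboard above (model_id -> weight).
weights = {3: 0.34, 2: 0.56, 4: 0.10}

# Hypothetical class probabilities [P(class 0), P(class 1)] for one sample.
probabilities = {
    3: [0.90, 0.10],
    2: [0.40, 0.60],
    4: [0.70, 0.30],
}


def weighted_average(weights, probabilities):
    """Combine per-model class probabilities using the given weights."""
    n_classes = len(next(iter(probabilities.values())))
    combined = [0.0] * n_classes
    for model_id, weight in weights.items():
        for k, p in enumerate(probabilities[model_id]):
            combined[k] += weight * p
    return combined


combined = weighted_average(weights, probabilities)
# Predict the class with the highest combined probability.
predicted_class = max(range(len(combined)), key=combined.__getitem__)
print(combined, predicted_class)
```

Because the weights sum to 1, the combined values remain a valid probability distribution over the classes.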
Print the final ensemble constructed by auto-sklearn
pprint(automl.show_models(), indent=4)
{ 2: { 'balancing': Balancing(random_state=1),
'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f05d2001d60>,
'cost': 0.04081632653061229,
'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f05d4452cd0>,
'ensemble_weight': 0.56,
'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f05d2001eb0>,
'model_id': 2,
'rank': 1,
'sklearn_classifier': RandomForestClassifier(max_features=10, n_estimators=512, n_jobs=1,
random_state=1, warm_start=True)},
3: { 'balancing': Balancing(random_state=1, strategy='weighting'),
'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f05d684e550>,
'cost': 0.022959183673469385,
'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f05d0fd7e50>,
'ensemble_weight': 0.34,
'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f05d3f33d30>,
'model_id': 3,
'rank': 2,
'sklearn_classifier': MLPClassifier(activation='tanh', alpha=1.103855734598575e-05, beta_1=0.999,
beta_2=0.9, early_stopping=True,
hidden_layer_sizes=(229, 229, 229),
learning_rate_init=0.00014375616988222174, max_iter=32,
n_iter_no_change=32, random_state=1, verbose=0, warm_start=True)},
4: { 'balancing': Balancing(random_state=1),
'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f05d3dfde50>,
'cost': 0.07908163265306123,
'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f05e9b267f0>,
'ensemble_weight': 0.1,
'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f05d198f2b0>,
'model_id': 4,
'rank': 3,
'sklearn_classifier': ExtraTreesClassifier(max_features=9, min_samples_split=4, n_estimators=512,
n_jobs=1, random_state=1, warm_start=True)}}
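The dictionary printed above is verbose, so a compact summary can be easier to scan. The helper below is a hypothetical convenience function, not part of auto-sklearn. It assumes a dictionary shaped like the output above and is demonstrated on a reduced stand-in that keeps only the fields it reads, with classifier names given as plain strings.

```python
def summarize_models(models):
    """Return (model_id, classifier name, weight, cost) tuples, heaviest first."""
    rows = []
    for model_id, info in models.items():
        clf = info["sklearn_classifier"]
        # Accept either an estimator object or, as in this stand-in, a plain name.
        name = clf if isinstance(clf, str) else type(clf).__name__
        rows.append((model_id, name, info["ensemble_weight"], info["cost"]))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Stand-in for automl.show_models(), with values copied from the output above.
models = {
    2: {"sklearn_classifier": "RandomForestClassifier",
        "ensemble_weight": 0.56, "cost": 0.04081632653061229},
    3: {"sklearn_classifier": "MLPClassifier",
        "ensemble_weight": 0.34, "cost": 0.022959183673469385},
    4: {"sklearn_classifier": "ExtraTreesClassifier",
        "ensemble_weight": 0.1, "cost": 0.07908163265306123},
}

rows = summarize_models(models)
for model_id, name, weight, cost in rows:
    print(f"{model_id:>3}  {name:<24} weight={weight:.2f}  cost={cost:.4f}")
```

With a fitted estimator, the same function could presumably be called as summarize_models(automl.show_models()).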
Get the score of the final ensemble
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))
Accuracy score: 0.982256020278834
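Accuracy alone does not show which class the errors fall on. As a small illustration, the sketch below computes accuracy together with a confusion matrix on made-up labels; the label arrays are hypothetical and unrelated to the run above.

```python
import sklearn.metrics

# Hypothetical true and predicted labels for eight samples.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

accuracy = sklearn.metrics.accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes.
matrix = sklearn.metrics.confusion_matrix(y_true, y_pred)

print("Accuracy score:", accuracy)
print(matrix)
```

The same two calls can be applied to y_test and predictions from the example to see how the misclassified posts split between the two newsgroups.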
Total running time of the script: (1 minute 5.383 seconds)