Text preprocessing

The following example shows how to fit a simple NLP problem with auto-sklearn.

For an introduction to text preprocessing, you can refer to the following links
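As a rough illustration of what text preprocessing means, the sketch below turns raw documents into a bag-of-words count matrix using only the standard library. It is not auto-sklearn's internal text pipeline; the helper names `tokenize` and `bag_of_words` and the two sample sentences are hypothetical and exist only for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, count matrix) for a list of raw text documents."""
    tokenized = [tokenize(doc) for doc in documents]
    # A sorted vocabulary gives every word a fixed column index.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical sample documents, one per newsgroup topic.
docs = ["The pitcher threw the ball", "The motherboard has a new BIOS"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Each row of `matrix` is a numeric feature vector that a classifier could consume. Real pipelines usually add steps such as n-grams or TF-IDF weighting on top of this idea.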

from pprint import pprint

import pandas as pd
import sklearn.metrics
from sklearn.datasets import fetch_20newsgroups

import autosklearn.classification

Data Loading

cats = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"]
X_train, y_train = fetch_20newsgroups(
    subset="train",  # select the training set
    shuffle=True,  # shuffle the data set for unbiased validation results
    random_state=42,  # set a random seed for reproducibility
    categories=cats,  # select only 2 out of 20 labels
    return_X_y=True,  # the 20NG data set has 2 columns, X: the text data, y: the label
)  # load the two columns separately (text data and labels)

X_test, y_test = fetch_20newsgroups(
    subset="test",  # select the test set for unbiased evaluation
    categories=cats,  # select only 2 out of 20 labels
    return_X_y=True,  # the 20NG data set has 2 columns, X: the text data, y: the label
)  # load the two columns separately (text data and labels)

Creating a pandas dataframe

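Both categorical and text features are often strings. By default, pandas stores Python strings in the generic object dtype. Please make sure that the correct dtype is applied to the correct column.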
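The dtype distinction can be checked on a toy dataframe. The following is a minimal sketch, assuming a hypothetical "Color" column stands in for a categorical feature next to the "Text" column; neither the sample values nor the "Color" column come from the 20 Newsgroups data.

```python
import pandas as pd

# Hypothetical toy data: one free-text column and one categorical column.
raw = pd.DataFrame(
    {
        "Text": ["first document", "second document"],
        "Color": ["red", "blue"],
    }
)
print(raw.dtypes)  # dtypes as inferred by pandas, before any explicit cast

# Cast each column explicitly so text and categorical data can be told apart.
typed = raw.astype({"Text": "string", "Color": "category"})
print(typed.dtypes)
```

After the cast, the two columns carry different dtypes even though both started out as plain Python strings, which is the information a downstream tool needs to treat them differently.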

# create a pandas dataframe for training, labeling the "Text" column as string
X_train = pd.DataFrame({"Text": pd.Series(X_train, dtype="string")})

# create a pandas dataframe for testing, labeling the "Text" column as string
X_test = pd.DataFrame({"Text": pd.Series(X_test, dtype="string")})

Build and fit a classifier

# create an autosklearn Classifier or Regressor depending on your task at hand.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    per_run_time_limit=30,
)

automl.fit(X_train, y_train, dataset_name="20_Newsgroups")  # fit the automl model

AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      per_run_time_limit=30, time_left_for_this_task=60)

View the models found by auto-sklearn

print(automl.leaderboard())

          rank  ensemble_weight           type      cost   duration
model_id
3            1             0.34            mlp  0.022959  12.225609
2            2             0.56  random_forest  0.040816  12.765663
4            3             0.10    extra_trees  0.079082  11.489445
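To give a feel for the ensemble_weight column, the sketch below combines per-model class probabilities by a weighted average. It assumes the ensemble predicts by weighting each member's class probabilities with its ensemble_weight; the weights are copied from the leaderboard, while the probabilities and the helper `weighted_average` are hypothetical values made up for illustration.

```python
# Ensemble weights copied from the leaderboard (model_id -> ensemble_weight).
weights = {3: 0.34, 2: 0.56, 4: 0.10}

# HYPOTHETICAL class probabilities of each model for a single document,
# in the order [class 0, class 1]. These are not real model outputs.
probabilities = {
    3: [0.90, 0.10],
    2: [0.60, 0.40],
    4: [0.30, 0.70],
}


def weighted_average(weights, probabilities):
    """Average the per-model class probabilities, weighted by model weight."""
    n_classes = len(next(iter(probabilities.values())))
    combined = [0.0] * n_classes
    for model_id, weight in weights.items():
        for k, p in enumerate(probabilities[model_id]):
            combined[k] += weight * p
    return combined


combined = weighted_average(weights, probabilities)
# The predicted class is the one with the highest combined probability.
predicted_class = max(range(len(combined)), key=combined.__getitem__)
print(combined, predicted_class)
```

Because the three weights sum to 1, the combined values remain a valid probability distribution over the classes.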

Print the final ensemble constructed by auto-sklearn

pprint(automl.show_models(), indent=4)

{   2: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f05d2001d60>,
           'cost': 0.04081632653061229,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f05d4452cd0>,
           'ensemble_weight': 0.56,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f05d2001eb0>,
           'model_id': 2,
           'rank': 1,
           'sklearn_classifier': RandomForestClassifier(max_features=10, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)},
    3: {   'balancing': Balancing(random_state=1, strategy='weighting'),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f05d684e550>,
           'cost': 0.022959183673469385,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f05d0fd7e50>,
           'ensemble_weight': 0.34,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f05d3f33d30>,
           'model_id': 3,
           'rank': 2,
           'sklearn_classifier': MLPClassifier(activation='tanh', alpha=1.103855734598575e-05, beta_1=0.999,
              beta_2=0.9, early_stopping=True,
              hidden_layer_sizes=(229, 229, 229),
              learning_rate_init=0.00014375616988222174, max_iter=32,
              n_iter_no_change=32, random_state=1, verbose=0, warm_start=True)},
    4: {   'balancing': Balancing(random_state=1),
           'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f05d3dfde50>,
           'cost': 0.07908163265306123,
           'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7f05e9b267f0>,
           'ensemble_weight': 0.1,
           'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f05d198f2b0>,
           'model_id': 4,
           'rank': 3,
           'sklearn_classifier': ExtraTreesClassifier(max_features=9, min_samples_split=4, n_estimators=512,
                     n_jobs=1, random_state=1, warm_start=True)}}

Get the score of the final ensemble

predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))

Accuracy score: 0.982256020278834
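For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below reimplements that idea in plain Python on hypothetical labels; the helper `accuracy` is illustrative and is not the scikit-learn implementation.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 4 of the 5 predictions are correct.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
score = accuracy(y_true, y_pred)
print("Accuracy score:", score)
```

Assuming the default accuracy metric, the cost column of the leaderboard can be read as 1 minus the accuracy measured on auto-sklearn's internal validation data, which is why it differs from the test accuracy reported here.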

Total running time of the script: (1 minute 5.383 seconds)
