Text preprocessing

The following example shows how to fit a simple NLP problem with auto-sklearn.

For an introduction to text preprocessing, see the following links:
  1. https://scikit-learn.cn/stable/tutorial/text_analytics/working_with_text_data.html
  2. https://machinelearningmastery.com/clean-text-machine-learning-python/

from pprint import pprint

import pandas as pd
import sklearn.metrics
from sklearn.datasets import fetch_20newsgroups

import autosklearn.classification

Data loading

cats = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"]
X_train, y_train = fetch_20newsgroups(
    subset="train",  # select the training set
    shuffle=True,  # shuffle the data set for unbiased validation results
    random_state=42,  # set a random seed for reproducibility
    categories=cats,  # select only 2 out of the 20 labels
    return_X_y=True,  # return the text data (X) and the labels (y) separately
)  # X_train is a list of strings, y_train is a numpy array of labels

X_test, y_test = fetch_20newsgroups(
    subset="test",  # select the test set for unbiased evaluation
    categories=cats,  # select only 2 out of the 20 labels
    return_X_y=True,  # return the text data (X) and the labels (y) separately
)  # X_test is a list of strings, y_test is a numpy array of labels

Create a pandas DataFrame

Both categorical and text features are usually strings. pandas stores Python strings in the generic object dtype, so make sure the correct dtype is applied to the correct column.

# create a pandas DataFrame for training, declaring the "Text" column as string
X_train = pd.DataFrame({"Text": pd.Series(X_train, dtype="string")})

# create a pandas DataFrame for testing, declaring the "Text" column as string
X_test = pd.DataFrame({"Text": pd.Series(X_test, dtype="string")})
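
As a quick check of the dtype advice above, the following minimal sketch builds a small frame and inspects its dtypes. The sample rows and the `Source` column are hypothetical and serve only as an illustration; using the `category` dtype for the categorical column is an assumption, since the example on this page only contains a text column.

```python
import pandas as pd

# Hypothetical sample rows, used only for illustration.
texts = ["The new motherboard arrived today.", "He pitched a perfect game."]
sources = ["hardware", "baseball"]

df = pd.DataFrame(
    {
        # Free text: declare the pandas "string" dtype explicitly.
        "Text": pd.Series(texts, dtype="string"),
        # Categorical feature: declare the pandas "category" dtype explicitly.
        "Source": pd.Series(sources, dtype="category"),
    }
)

# Without an explicit dtype, pandas chooses a generic dtype for strings
# (which one depends on the pandas version).
inferred = pd.Series(texts)

print(df.dtypes)
print("inferred dtype:", inferred.dtype)
```

Printing `df.dtypes` before calling `fit` is a cheap way to confirm that each column carries the dtype you intended.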

Build and fit a classifier

# create an auto-sklearn classifier or regressor, depending on the task at hand
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,  # total time budget for the search, in seconds
    per_run_time_limit=30,  # time limit for a single model fit, in seconds
)

automl.fit(X_train, y_train, dataset_name="20_Newsgroups")  # fit the automl model
AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      per_run_time_limit=30, time_left_for_this_task=60)

View the models found by auto-sklearn

print(automl.leaderboard())
          rank  ensemble_weight           type      cost   duration
model_id
3            1             0.34            mlp  0.022959  12.225609
2            2             0.56  random_forest  0.040816  12.765663
4            3             0.10    extra_trees  0.079082  11.489445
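
To make the `ensemble_weight` column concrete, here is a minimal sketch of how weights like these could combine per-model class probabilities into a single prediction. It assumes the ensemble is a weighted average of the members' predicted probabilities; the probability values are invented for illustration and are not produced by the run above.

```python
import numpy as np

# Ensemble weights taken from the leaderboard above
# (order: mlp, random_forest, extra_trees).
weights = np.array([0.34, 0.56, 0.10])

# Hypothetical class probabilities for one document, one row per model.
# Columns: P(class 0), P(class 1). These numbers are invented.
member_probas = np.array(
    [
        [0.90, 0.10],  # mlp
        [0.60, 0.40],  # random_forest
        [0.30, 0.70],  # extra_trees
    ]
)

# Weighted average over the models (assumed combination rule).
ensemble_proba = weights @ member_probas
predicted_class = int(np.argmax(ensemble_proba))

print("weights sum:", weights.sum())
print("ensemble probabilities:", ensemble_proba)
print("predicted class:", predicted_class)
```

In the leaderboard above, rank follows the cost column rather than the weight: the random forest carries the largest weight although the MLP has the lowest cost.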

Get the score of the final ensemble

predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))
Accuracy score: 0.982256020278834
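
Accuracy alone does not show which class the errors fall in. The sketch below computes accuracy together with a confusion matrix on a tiny set of hypothetical labels (not the newsgroup predictions), using `sklearn.metrics.accuracy_score` and `sklearn.metrics.confusion_matrix`. The same calls should apply to `y_test` and `predictions` from the run above.

```python
import sklearn.metrics

# Hypothetical labels, used only for illustration.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# Fraction of predictions that match the true labels.
accuracy = sklearn.metrics.accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
matrix = sklearn.metrics.confusion_matrix(y_true, y_pred)

print("Accuracy score:", accuracy)
print(matrix)
```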

Total running time of the script: (1 minute 5.383 seconds)
