In [None]:
%matplotlib inline


# Obtain run information

The following example shows how to obtain information from a finished
Auto-sklearn run. In particular, it shows:
* how to query which models were evaluated by Auto-sklearn
* how to query the models in the final ensemble
* how to get general statistics on what Auto-sklearn evaluated

Auto-sklearn is a wrapper on top of scikit-learn models. This example
also illustrates how to interact with the underlying scikit-learn
components directly, in this case a PCA preprocessor.


In [None]:
from pprint import pprint

import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

import autosklearn.classification

## Data Loading



In [None]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

## Build and fit the classifier



In [None]:
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=30,
    per_run_time_limit=10,
    disable_evaluator_output=False,
    # To simplify querying the models in the final ensemble, we
    # restrict auto-sklearn to use only pca as a preprocessor
    include={"feature_preprocessor": ["pca"]},
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

## Predict using the model



In [None]:
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))

## Report the models found by Auto-sklearn

Auto-sklearn uses
[Ensemble Selection](https://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf)
to construct ensembles in a post-hoc fashion. The ensemble is a linear
weighting of all models constructed during the hyperparameter optimization.
The call below prints the final ensemble. It is a dictionary in which the
``model_id`` of each model is the key and the value is a dictionary
describing that model. A model's dict contains its ``'model_id'``,
``'rank'``, ``'cost'``, ``'ensemble_weight'``, and the model itself. The
model is given by the ``'data_preprocessor'``, ``'feature_preprocessor'``,
``'regressor'/'classifier'`` and ``'sklearn_regressor'/'sklearn_classifier'``
entries. For the ``'cv'`` resampling strategy, however, these entries are
stored once per cross-validation model in the ``'estimators'`` list of the
dict, along with the ``'voting_model'``.



In [None]:
pprint(automl.show_models(), indent=4)
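
The dictionary layout described above can be condensed into a short ranking. The following is a minimal sketch that uses a hand-written stand-in dictionary; all values are invented for illustration, and only the key names (`model_id`, `rank`, `cost`, `ensemble_weight`) come from the description above. The helper `summarize_ensemble` is hypothetical and not part of Auto-sklearn.

```python
# Stand-in for the dictionary returned by automl.show_models().
# All values are invented for illustration; only the key names are taken
# from the description of the ensemble dictionary.
example_models = {
    7: {"model_id": 7, "rank": 1, "cost": 0.02, "ensemble_weight": 0.5},
    12: {"model_id": 12, "rank": 2, "cost": 0.03, "ensemble_weight": 0.25},
    3: {"model_id": 3, "rank": 3, "cost": 0.05, "ensemble_weight": 0.25},
}


def summarize_ensemble(models):
    """Return (model_id, ensemble_weight, cost) tuples, heaviest model first."""
    rows = [
        (info["model_id"], info["ensemble_weight"], info["cost"])
        for info in models.values()
    ]
    # Sort by descending weight; break ties by ascending cost.
    return sorted(rows, key=lambda row: (-row[1], row[2]))


summary = summarize_ensemble(example_models)
for model_id, weight, cost in summary:
    print(f"model {model_id}: weight={weight:.2f}, cost={cost:.2f}")
```

With a fitted estimator, passing `automl.show_models()` instead of `example_models` should give the same kind of summary, assuming the keys are present as described.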

## Report statistics about the search

Print statistics about the Auto-sklearn run, such as the number of
iterations and the number of models that failed with a timeout.



In [None]:
print(automl.sprint_statistics())

## Detailed statistics about the search - part 1

Auto-sklearn also keeps detailed statistics of the hyperparameter
optimization procedure, which are stored in a so-called
[run history](https://automl.github.io/SMAC3/main/api/smac.runhistory.runhistory.html#smac.runhistory.runhistory.RunHistory).



In [None]:
print(automl.automl_.runhistory_)

Runs are stored inside an ``OrderedDict`` called ``data``:



In [None]:
print(len(automl.automl_.runhistory_.data))

Let's iterate over all entries



In [None]:
for run_key in automl.automl_.runhistory_.data:
    print("#########")
    print(run_key)
    print(automl.automl_.runhistory_.data[run_key])

and have a detailed look at one entry:



In [None]:
run_key = list(automl.automl_.runhistory_.data.keys())[0]
run_value = automl.automl_.runhistory_.data[run_key]

The ``run_key`` contains all information describing a run:



In [None]:
print("Configuration ID:", run_key.config_id)
print("Instance:", run_key.instance_id)
print("Seed:", run_key.seed)
print("Budget:", run_key.budget)

and the configuration can be looked up in the run history as well:



In [None]:
print(automl.automl_.runhistory_.ids_config[run_key.config_id])

The only other important entry is the budget, which matters if you are
using Auto-sklearn with successive halving (see
`sphx_glr_examples_60_search_example_successive_halving.py`).
The remaining parts of the key can be ignored for Auto-sklearn and are
only there because the underlying optimizer, SMAC, can handle more general
problems, too.



The ``run_value`` contains all output from running the configuration:



In [None]:
print("Cost:", run_value.cost)
print("Time:", run_value.time)
print("Status:", run_value.status)
print("Additional information:", run_value.additional_info)
print("Start time:", run_value.starttime)
print("End time:", run_value.endtime)

The cost is effectively a loss. If the metric to optimize should be
maximized, it is internally transformed into a quantity to minimize.
The status indicates whether the run was successful, and the most
interesting entry of the additional information is the internal training
loss. The start and end times give detailed information on the runtime.
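
To make the transformation from a score to a cost concrete, here is a minimal sketch. It assumes the cost is the distance between the metric's optimum and the achieved score; the helper `score_to_cost` and its parameters are hypothetical and not part of Auto-sklearn's API.

```python
def score_to_cost(score, optimum=1.0, greater_is_better=True):
    """Turn a metric value into a cost where lower is better.

    Hypothetical helper for illustration. For a metric to maximize, the cost
    is the distance to the optimum; for a loss-like metric, the value is
    measured from its optimum in the other direction.
    """
    if greater_is_better:
        return optimum - score
    return score - optimum


# An accuracy of 0.96 (optimum 1.0) corresponds to a cost of about 0.04.
accuracy_cost = score_to_cost(0.96)
# A log loss of 0.3 (optimum 0.0, lower is better) stays at 0.3.
log_loss_cost = score_to_cost(0.3, optimum=0.0, greater_is_better=False)
print(accuracy_cost, log_loss_cost)
```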



As an example, let's find the best configuration evaluated. As
Auto-sklearn solves a minimization problem internally, we need to look
for the entry with the lowest loss:



In [None]:
losses_and_configurations = [
    (run_value.cost, run_key.config_id)
    for run_key, run_value in automl.automl_.runhistory_.data.items()
]
losses_and_configurations.sort()
print("Lowest loss:", losses_and_configurations[0][0])
print(
    "Best configuration:",
    automl.automl_.runhistory_.ids_config[losses_and_configurations[0][1]],
)
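
The same lookup can be written without building and sorting a list. The sketch below runs on stand-in data: the `RunKey` and `RunValue` namedtuples only mirror the attributes printed earlier, the status strings are placeholders for whatever status objects the run history actually stores, and every value is invented for illustration.

```python
from collections import Counter, namedtuple

# Stand-ins mirroring the attributes of the run history entries. The real
# objects come from SMAC; these namedtuples and values are illustrative only.
RunKey = namedtuple("RunKey", ["config_id", "instance_id", "seed", "budget"])
RunValue = namedtuple("RunValue", ["cost", "time", "status"])

example_data = {
    RunKey(1, None, 0, 0.0): RunValue(0.08, 1.2, "SUCCESS"),
    RunKey(2, None, 0, 0.0): RunValue(0.03, 2.5, "SUCCESS"),
    RunKey(3, None, 0, 0.0): RunValue(1.00, 10.0, "TIMEOUT"),
}

# min() with a key function picks the entry with the lowest cost directly.
best_key, best_value = min(example_data.items(), key=lambda item: item[1].cost)
print("Lowest loss:", best_value.cost)
print("Best configuration ID:", best_key.config_id)

# Tally how the runs ended.
status_counts = Counter(value.status for value in example_data.values())
print(status_counts)
```

With a fitted estimator, `automl.automl_.runhistory_.data` would take the place of `example_data`.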

## Detailed statistics about the search - part 2

To maintain compatibility with scikit-learn, Auto-sklearn gives the
same data as
[cv_results_](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).



In [None]:
print(automl.cv_results_)
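
Because the raw dictionary is hard to read when printed, it can help to pair each score with its parameters. The following sketch assumes the dictionary holds parallel lists under the keys `mean_test_score` and `params`, following the scikit-learn convention; check the keys of your own `automl.cv_results_` before relying on them. The stand-in values, including the parameter names, are invented for illustration.

```python
# Stand-in for automl.cv_results_; all values are invented for illustration.
example_cv_results = {
    "mean_test_score": [0.92, 0.97, 0.95],
    "params": [
        {"classifier:__choice__": "random_forest"},
        {"classifier:__choice__": "gradient_boosting"},
        {"classifier:__choice__": "libsvm_svc"},
    ],
}

# Pair each score with its parameters and sort best-first (higher is better).
ranked = sorted(
    zip(example_cv_results["mean_test_score"], example_cv_results["params"]),
    key=lambda pair: pair[0],
    reverse=True,
)
for score, params in ranked:
    print(f"{score:.2f}  {params}")
```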

## Inspect the components of the best model

Iterate over the components of each model in the ensemble and print
the explained variance ratio of its PCA stage.



In [None]:
for i, (weight, pipeline) in enumerate(automl.get_models_with_weights()):
    for stage_name, component in pipeline.named_steps.items():
        if "feature_preprocessor" in stage_name:
            print(
                "Pipeline {} has an explained variance ratio of {}".format(
                    i,
                    # The component is an instance of AutoSklearnChoice.
                    # Access the sklearn object via the choice attribute.
                    # We want the explained variance ratio of
                    # each principal component.
                    component.choice.preprocessor.explained_variance_ratio_,
                )
            )
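
A common follow-up is to ask how many principal components are needed to cover a given share of the variance. The sketch below works on an invented list of ratios; with a fitted pipeline, the `explained_variance_ratio_` array accessed above would take its place. The helper `components_needed` is hypothetical.

```python
from itertools import accumulate

# Invented explained variance ratios for illustration; with a fitted PCA
# these would come from its explained_variance_ratio_ attribute.
example_ratios = [0.60, 0.25, 0.10, 0.05]


def components_needed(ratios, threshold):
    """Return how many leading components are needed to reach the threshold."""
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return count
    return len(ratios)  # Threshold not reached; all components are needed.


print(components_needed(example_ratios, 0.80))
print(components_needed(example_ratios, 0.90))
```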