Example 4 - On the cluster

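This example shows how to run HpBandSter in a cluster environment. The actual Python code differs little from Example 3. The only changes are a shared directory, used to tell every worker where the nameserver is, and communication over the network rather than just the loopback interface.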
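The shared-directory idea can be illustrated with a small standalone sketch. It is not HpBandSter's implementation: the file name `nameserver_<run_id>.json` and the functions `publish_nameserver` and `discover_nameserver` are hypothetical, chosen only to show how a master might record the nameserver's host and port in a directory that every worker can read, and how a worker might wait for that record instead of sleeping for a fixed time.

```python
import json
import os
import tempfile
import time


def publish_nameserver(shared_dir, run_id, host, port):
    """Master side: record where the nameserver listens (hypothetical file layout)."""
    path = os.path.join(shared_dir, f"nameserver_{run_id}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"host": host, "port": port}, fh)
    # Rename into place so a reader does not pick up a half-written file.
    os.replace(tmp, path)
    return path


def discover_nameserver(shared_dir, run_id, timeout=30.0, poll_interval=0.5):
    """Worker side: poll until the record appears, then return (host, port)."""
    path = os.path.join(shared_dir, f"nameserver_{run_id}.json")
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f"no nameserver record at {path}")
        time.sleep(poll_interval)
    with open(path) as fh:
        info = json.load(fh)
    return info["host"], info["port"]


if __name__ == "__main__":
    # A temporary directory stands in for the NFS share in this demo.
    with tempfile.TemporaryDirectory() as shared:
        publish_nameserver(shared, "demo", "10.0.0.5", 9090)
        print(discover_nameserver(shared, "demo"))
```

Including the run ID in the file name keeps several optimization runs that share one directory from overwriting each other's record.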

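To actually run this as a batch job, you usually need a shell script. These scripts differ between schedulers. Here we provide an example script for the Sun Grid Engine (SGE), but adapting it to any other scheduler should be easy. The script only specifies the log files for output (-o) and errors (-e), activates a virtual environment, and then runs the master for the first array task and a worker for every other task. An array job runs the same script multiple times and bundles the runs into one job, in which each task gets a unique task ID. For SGE, those IDs are positive integers, and we simply designate the first task as the master.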

# submit via qsub -t 1-4 -q test_core.q example_4_cluster_submit_me.sh

#$ -cwd
#$ -o $JOB_ID-$TASK_ID.o
#$ -e $JOB_ID-$TASK_ID.e

# enter the virtual environment
source ~sfalkner/virtualenvs/HpBandSter_tests/bin/activate


# the first array task becomes the master, all others become workers
if [ "$SGE_TASK_ID" -eq 1 ]
   then python3 example_4_cluster.py --run_id $JOB_ID --nic_name eth0 --shared_directory .
else
   python3 example_4_cluster.py --run_id $JOB_ID --nic_name eth0 --shared_directory . --worker
fi

Simply copy the code above into a file, say submit_me.sh, and tell SGE to run it with:

qsub -t 1-4 -q your_queue_name submit_me.sh
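The master/worker decision made by the submit script can also be written in Python, which may help when porting it to another scheduler. The following is a minimal sketch, assuming the scheduler exports the task ID in the environment variable `SGE_TASK_ID` as SGE does. The function names `role_for_task` and `extra_cli_args` are hypothetical and not part of HpBandSter.

```python
import os


def role_for_task(environ=None):
    """Return 'master' for the first array task and 'worker' for all others."""
    environ = os.environ if environ is None else environ
    raw = environ.get("SGE_TASK_ID", "1")
    try:
        task_id = int(raw)
    except ValueError:
        # Assumption: a non-numeric value means the job is not an array job,
        # so the single process acts as the master.
        task_id = 1
    return "master" if task_id == 1 else "worker"


def extra_cli_args(environ=None):
    """Command-line flags to append for this task's role."""
    return [] if role_for_task(environ) == "master" else ["--worker"]


if __name__ == "__main__":
    print(role_for_task(), extra_cli_args())
```

For a different scheduler, only the name of the environment variable and the ID of the first task would need to change.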

Now for the actual Python source code:

import logging
logging.basicConfig(level=logging.INFO)

import argparse
import os
import pickle
import time

import hpbandster.core.nameserver as hpns
import hpbandster.core.result as hpres

from hpbandster.optimizers import BOHB as BOHB
from hpbandster.examples.commons import MyWorker



parser = argparse.ArgumentParser(description='Example 4 - execution on a cluster.')
parser.add_argument('--min_budget',   type=float, help='Minimum budget used during the optimization.',    default=9)
parser.add_argument('--max_budget',   type=float, help='Maximum budget used during the optimization.',    default=243)
parser.add_argument('--n_iterations', type=int,   help='Number of iterations performed by the optimizer', default=4)
parser.add_argument('--n_workers', type=int,   help='Number of workers to run in parallel.', default=2)
parser.add_argument('--worker', help='Flag to turn this into a worker process', action='store_true')
parser.add_argument('--run_id', type=str, help='A unique run id for this optimization run. An easy option is to use the job id of the cluster\'s scheduler.')
parser.add_argument('--nic_name',type=str, help='Which network interface to use for communication.')
parser.add_argument('--shared_directory',type=str, help='A directory that is accessible for all processes, e.g. an NFS share.')


args=parser.parse_args()

# Every process has to look up the hostname
host = hpns.nic_name_to_host(args.nic_name)


if args.worker:
    time.sleep(5)   # short artificial delay to make sure the nameserver is already running
    w = MyWorker(sleep_interval = 0.5,run_id=args.run_id, host=host)
    w.load_nameserver_credentials(working_directory=args.shared_directory)
    w.run(background=False)
    exit(0)

# Start a nameserver:
# We now start the nameserver with the host name from above and a random open port (by setting the port to 0)
NS = hpns.NameServer(run_id=args.run_id, host=host, port=0, working_directory=args.shared_directory)
ns_host, ns_port = NS.start()

# Most optimizers are so computationally inexpensive that we can afford to run a
# worker in parallel to it. Note that this one has to run in the background to
# not block!
w = MyWorker(sleep_interval = 0.5,run_id=args.run_id, host=host, nameserver=ns_host, nameserver_port=ns_port)
w.run(background=True)

# Run an optimizer
# We now have to specify the host, and the nameserver information
bohb = BOHB(  configspace = MyWorker.get_configspace(),
                      run_id = args.run_id,
                      host=host,
                      nameserver=ns_host,
                      nameserver_port=ns_port,
                      min_budget=args.min_budget, max_budget=args.max_budget
               )
res = bohb.run(n_iterations=args.n_iterations, min_n_workers=args.n_workers)


# In a cluster environment, you usually want to store the results for later analysis.
# One option is to simply pickle the Result object
with open(os.path.join(args.shared_directory, 'results.pkl'), 'wb') as fh:
    pickle.dump(res, fh)


# Shutdown
# After the optimizer run, we must shut down the master and the nameserver.
bohb.shutdown(shutdown_workers=True)
NS.shutdown()
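Because the listing pickles the Result object to results.pkl in the shared directory, a later analysis session can simply load it back. The helper below is a minimal sketch. `load_results` is a hypothetical name, and unpickling a real HpBandSter result requires that hpbandster can be imported in the analysis environment.

```python
import os
import pickle


def load_results(shared_directory, filename="results.pkl"):
    """Load the object that the master pickled at the end of its run."""
    path = os.path.join(shared_directory, filename)
    with open(path, "rb") as fh:
        # Only unpickle files you trust: pickle can execute arbitrary code.
        return pickle.load(fh)


if __name__ == "__main__":
    # Assumes the optimization used '.' as the shared directory.
    res = load_results(".")
    print(type(res))
```

The loaded object can then be inspected in the same way as in the earlier examples, for instance with `res.get_incumbent_id()` and `res.get_id2config_mapping()`.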
