Example 4 - on the cluster
This example shows how to run HpBandster in a cluster environment. The actual Python code is almost identical to example 3, except that a shared directory is used to communicate the nameserver's location to every worker, and communication happens over the network rather than just the loopback interface.
To actually run this as a batch job, you usually need a shell script. These scripts differ from scheduler to scheduler. Here we provide an example script for the Sun Grid Engine (SGE), but adapting it to any other scheduler should be straightforward. The script only specifies the log files for output (-o) and error (-e), loads a virtual environment, and then executes the master for the first array task and a worker otherwise. An array job executes the same source multiple times and bundles the runs into a single job, where every task gets a unique task ID. For SGE these IDs are positive integers, and we simply declare the first task to be the master.
# submit via qsub -t 1-4 -q test_core.q example_4_cluster_submit_me.sh
#$ -cwd
#$ -o $JOB_ID-$TASK_ID.o
#$ -e $JOB_ID-$TASK_ID.e
# enter the virtual environment
source ~sfalkner/virtualenvs/HpBandSter_tests/bin/activate
if [ $SGE_TASK_ID -eq 1 ]
then python3 example_4_cluster.py --run_id $JOB_ID --nic_name eth0 --shared_directory .
else
python3 example_4_cluster.py --run_id $JOB_ID --nic_name eth0 --shared_directory . --worker
fi
Simply copy the code above into a file, e.g. submit_me.sh, and tell SGE to run it via:
qsub -t 1-4 -q your_queue_name submit_me.sh
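The first-task-is-master dispatch from the shell script can also be done inside Python by inspecting the SGE_TASK_ID environment variable. A minimal sketch, where the two callables passed to dispatch are hypothetical stand-ins for starting the master or a worker:

```python
import os

def dispatch(run_master, run_worker):
    # SGE exports the array task ID as SGE_TASK_ID; task 1 is declared the master.
    task_id = int(os.environ.get('SGE_TASK_ID', '1'))
    if task_id == 1:
        return run_master()
    return run_worker()

# demo with stand-in callables (a real script would start BOHB / MyWorker here)
os.environ['SGE_TASK_ID'] = '3'
role = dispatch(lambda: 'master', lambda: 'worker')  # → 'worker'
```

This keeps the submit script trivial: every array task runs the same command line, and the role decision lives in one place in Python.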
Now to the actual Python source:
import logging
logging.basicConfig(level=logging.INFO)
import argparse
import os
import pickle
import time
import hpbandster.core.nameserver as hpns
import hpbandster.core.result as hpres
from hpbandster.optimizers import BOHB as BOHB
from hpbandster.examples.commons import MyWorker
parser = argparse.ArgumentParser(description='Example 4 - on the cluster.')
parser.add_argument('--min_budget', type=float, help='Minimum budget used during the optimization.', default=9)
parser.add_argument('--max_budget', type=float, help='Maximum budget used during the optimization.', default=243)
parser.add_argument('--n_iterations', type=int, help='Number of iterations performed by the optimizer', default=4)
parser.add_argument('--n_workers', type=int, help='Number of workers to run in parallel.', default=2)
parser.add_argument('--worker', help='Flag to turn this into a worker process', action='store_true')
parser.add_argument('--run_id', type=str, help='A unique run id for this optimization run. An easy option is to use the job id of the cluster\'s scheduler.')
parser.add_argument('--nic_name', type=str, help='Which network interface to use for communication.')
parser.add_argument('--shared_directory', type=str, help='A directory that is accessible for all processes, e.g. an NFS share.')
args=parser.parse_args()
# Every process has to look up the hostname
host = hpns.nic_name_to_host(args.nic_name)
if args.worker:
    time.sleep(5)  # short artificial delay to make sure the nameserver is already running
    w = MyWorker(sleep_interval=0.5, run_id=args.run_id, host=host)
    w.load_nameserver_credentials(working_directory=args.shared_directory)
    w.run(background=False)
    exit(0)
# Start a nameserver:
# We now start the nameserver with the host name from above and a random open port (by setting the port to 0)
NS = hpns.NameServer(run_id=args.run_id, host=host, port=0, working_directory=args.shared_directory)
ns_host, ns_port = NS.start()
# Most optimizers are so computationally inexpensive that we can afford to run a
# worker in parallel to it. Note that this one has to run in the background to
# not block!
w = MyWorker(sleep_interval=0.5, run_id=args.run_id, host=host, nameserver=ns_host, nameserver_port=ns_port)
w.run(background=True)
# Run an optimizer
# We now have to specify the host, and the nameserver information
bohb = BOHB(configspace=MyWorker.get_configspace(),
            run_id=args.run_id,
            host=host,
            nameserver=ns_host,
            nameserver_port=ns_port,
            min_budget=args.min_budget,
            max_budget=args.max_budget)
res = bohb.run(n_iterations=args.n_iterations, min_n_workers=args.n_workers)
# In a cluster environment, you usually want to store the results for later analysis.
# One option is to simply pickle the Result object
with open(os.path.join(args.shared_directory, 'results.pkl'), 'wb') as fh:
    pickle.dump(res, fh)
# Step 4: Shutdown
# After the optimizer run, we must shutdown the master and the nameserver.
bohb.shutdown(shutdown_workers=True)
NS.shutdown()
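Under the hood, the shared directory is what lets workers find the nameserver: the master persists the nameserver's (host, port) to a file there, and load_nameserver_credentials reads it back. A minimal sketch of that handshake; the file name credentials.pkl is an illustrative placeholder (hpbandster uses its own naming scheme), and tempfile stands in for an NFS share:

```python
import os
import pickle
import tempfile

def store_credentials(shared_directory, host, port):
    # master side: persist the nameserver location where every worker can read it
    with open(os.path.join(shared_directory, 'credentials.pkl'), 'wb') as fh:
        pickle.dump((host, port), fh)

def load_credentials(shared_directory):
    # worker side: recover host and port from the shared file
    with open(os.path.join(shared_directory, 'credentials.pkl'), 'rb') as fh:
        return pickle.load(fh)

shared = tempfile.mkdtemp()            # stand-in for an NFS share
store_credentials(shared, '10.0.0.1', 12345)
host, port = load_credentials(shared)  # → ('10.0.0.1', 12345)
```

This is also why the workers sleep briefly before connecting: the file only exists once the master has started the nameserver.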
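Once the job has finished, the pickled results can be loaded again for analysis, e.g. on your workstation, where the Result object offers accessors such as get_incumbent_id(). A minimal sketch of the round-trip using a plain dictionary as a stand-in for the Result object (no cluster required):

```python
import os
import pickle
import tempfile

# stand-in for the Result object returned by bohb.run()
stand_in = {'incumbent': {'x': 0.5}, 'best_loss': 0.1}

path = os.path.join(tempfile.mkdtemp(), 'results.pkl')
with open(path, 'wb') as fh:
    pickle.dump(stand_in, fh)

# later: load the pickle back for analysis
with open(path, 'rb') as fh:
    res = pickle.load(fh)
```

Pickling works for any picklable result object, but note that unpickling the real Result requires hpbandster to be importable on the analysis machine as well.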