A Quick Hands-On Introduction to AutoML Hyperparameter Search (HPO)
This post is a quick introduction to using Ray's Tune module to search for network hyperparameters. A successful model almost always depends on carefully chosen hyperparameters, and those hyperparameters are tightly coupled to the training procedure and the hardware it runs on. Papers usually show you only the final numbers; the countless failed parameter attempts along the way never make it onto the page. People joke that AutoML is taking AI engineers' jobs, but tuning parameters is far from all an AI engineer does (despite the jokes): building an automated tool that finds the best parameter combination for your own model is itself an engineering skill. This post walks you through getting started.
The only tool used here is Ray; see the official site for a full introduction. You can roughly think of it as a general-purpose distributed computing engine for the AI world, and in situations like this it is genuinely handy. We do not need all of its features, only one module: tune. Install it together with alfred-py (which provides the device helper used below):
pip install ray
pip install alfred-py
Hyperparameter Search with Ray
Using Ray for HPO is actually very simple. Let's take a classification task as the example: the model is a very simple ConvNet we define ourselves, optimized with SGD, and we want to know the best lr and momentum on our target hardware, here a single GPU. Note that this is not about blindly plugging in 0.1 and 0.9; those are rules of thumb, and rules of thumb are not guaranteed to be optimal. That is exactly why this step is worth doing.
Without further ado, here is the code:
'''
This project demonstrates how to tune
hyperparameters of a model.
'''
from ray.tune.schedulers import ASHAScheduler
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from ray import tune
from ray.tune.examples.mnist_pytorch import get_data_loaders, train, test
import ray
import sys

from alfred.dl.torch.common import device

# Optionally attach to an existing Ray cluster passed as the first CLI argument
# (older Ray versions used `redis_address`; recent versions use `address`).
if len(sys.argv) > 1:
    ray.init(address=sys.argv[1])


class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
        self.fc = nn.Linear(192, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, 192)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)


def train_mnist(config):
    model = ConvNet()
    model.to(device)
    train_loader, test_loader = get_data_loaders()
    optimizer = optim.SGD(
        model.parameters(), lr=config["lr"], momentum=config["momentum"])
    for i in range(10):
        train(model, optimizer, train_loader, device)
        acc = test(model, test_loader, device)
        # tune.track.log(mean_accuracy=acc) was the pre-1.0 API;
        # tune.report is its replacement.
        tune.report(mean_accuracy=acc)
        if i % 5 == 0:
            # This saves the model to the trial directory
            torch.save(model.state_dict(), "./model.pth")


search_space = {
    "lr": tune.choice([0.001, 0.01, 0.1]),
    "momentum": tune.uniform(0.1, 0.9)
}

analysis = tune.run(
    train_mnist,
    num_samples=30,
    resources_per_trial={'gpu': 1},
    scheduler=ASHAScheduler(metric="mean_accuracy",
                            mode="max", grace_period=1),
    config=search_space)
It really is this simple: with just these few lines you can optimize your hyperparameters. Once the run finishes, you will see output like the following (note the iter column: the ASHA scheduler terminates unpromising trials after as little as one iteration):
Number of trials: 30/30 (30 TERMINATED)
+-------------------------+------------+-------------------+-------+------------+----------+--------+------------------+
| Trial name | status | loc | lr | momentum | acc | iter | total time (s) |
|-------------------------+------------+-------------------+-------+------------+----------+--------+------------------|
| train_mnist_f71cb_00000 | TERMINATED | 10.80.43.74:55628 | 0.001 | 0.338473 | 0.1 | 10 | 3.4553 |
| train_mnist_f71cb_00001 | TERMINATED | 10.80.43.74:55629 | 0.001 | 0.333578 | 0.2375 | 10 | 3.35463 |
| train_mnist_f71cb_00002 | TERMINATED | 10.80.43.74:55632 | 0.1 | 0.765898 | 0.865625 | 10 | 3.3302 |
| train_mnist_f71cb_00003 | TERMINATED | 10.80.43.74:55627 | 0.01 | 0.109677 | 0.290625 | 1 | 2.19001 |
| train_mnist_f71cb_00004 | TERMINATED | 10.80.43.74:55633 | 0.01 | 0.25874 | 0.24375 | 1 | 2.23142 |
| train_mnist_f71cb_00005 | TERMINATED | 10.80.43.74:55630 | 0.01 | 0.586889 | 0.2125 | 1 | 2.24251 |
| train_mnist_f71cb_00006 | TERMINATED | 10.80.43.74:55631 | 0.001 | 0.159626 | 0.128125 | 1 | 2.22076 |
| train_mnist_f71cb_00007 | TERMINATED | 10.80.43.74:55622 | 0.01 | 0.880302 | 0.890625 | 10 | 3.35615 |
| train_mnist_f71cb_00008 | TERMINATED | 10.80.43.74:55620 | 0.001 | 0.824616 | 0.0625 | 1 | 2.24873 |
| train_mnist_f71cb_00009 | TERMINATED | 10.80.43.74:55618 | 0.001 | 0.520118 | 0.025 | 1 | 2.21357 |
| train_mnist_f71cb_00010 | TERMINATED | 10.80.43.74:55621 | 0.001 | 0.374901 | 0.1625 | 1 | 2.21661 |
| train_mnist_f71cb_00011 | TERMINATED | 10.80.43.74:55626 | 0.1 | 0.380379 | 0.840625 | 10 | 3.29623 |
| train_mnist_f71cb_00012 | TERMINATED | 10.80.43.74:55619 | 0.001 | 0.16643 | 0.103125 | 1 | 2.18452 |
| train_mnist_f71cb_00013 | TERMINATED | 10.80.43.74:55624 | 0.01 | 0.83129 | 0.753125 | 4 | 2.62007 |
| train_mnist_f71cb_00014 | TERMINATED | 10.80.43.74:55625 | 0.1 | 0.302737 | 0.76875 | 4 | 2.55607 |
| train_mnist_f71cb_00015 | TERMINATED | 10.80.43.74:55623 | 0.001 | 0.10079 | 0.14375 | 1 | 2.19084 |
| train_mnist_f71cb_00016 | TERMINATED | 10.80.43.74:56210 | 0.01 | 0.310645 | 0.221875 | 1 | 2.15354 |
| train_mnist_f71cb_00017 | TERMINATED | 10.80.43.74:56244 | 0.1 | 0.302933 | 0.8625 | 10 | 3.40914 |
| train_mnist_f71cb_00018 | TERMINATED | 10.80.43.74:56288 | 0.1 | 0.472268 | 0.88125 | 10 | 3.36956 |
| train_mnist_f71cb_00019 | TERMINATED | 10.80.43.74:56344 | 0.01 | 0.368897 | 0.0875 | 1 | 2.17568 |
| train_mnist_f71cb_00020 | TERMINATED | 10.80.43.74:56376 | 0.1 | 0.87095 | 0.959375 | 10 | 3.41518 |
| train_mnist_f71cb_00021 | TERMINATED | 10.80.43.74:56410 | 0.1 | 0.225298 | 0.834375 | 4 | 2.65473 |
| train_mnist_f71cb_00022 | TERMINATED | 10.80.43.74:56447 | 0.001 | 0.385364 | 0.15625 | 1 | 2.2179 |
| train_mnist_f71cb_00023 | TERMINATED | 10.80.43.74:56480 | 0.1 | 0.455948 | 0.846875 | 4 | 2.60914 |
| train_mnist_f71cb_00024 | TERMINATED | 10.80.43.74:56531 | 0.001 | 0.643994 | 0.103125 | 1 | 2.20683 |
| train_mnist_f71cb_00025 | TERMINATED | 10.80.43.74:56565 | 0.01 | 0.525646 | 0.246875 | 1 | 2.19718 |
| train_mnist_f71cb_00026 | TERMINATED | 10.80.43.74:56602 | 0.1 | 0.178208 | 0.7 | 4 | 2.60433 |
| train_mnist_f71cb_00027 | TERMINATED | 10.80.43.74:56635 | 0.001 | 0.557703 | 0.11875 | 1 | 2.27553 |
| train_mnist_f71cb_00028 | TERMINATED | 10.80.43.74:56667 | 0.001 | 0.360534 | 0.046875 | 1 | 2.30704 |
| train_mnist_f71cb_00029 | TERMINATED | 10.80.43.74:56713 | 0.001 | 0.466288 | 0.13125 | 1 | 2.31522 |
+-------------------------+------------+-------------------+-------+------------+----------+--------+------------------+
2022-02-08 16:08:57,847 INFO tune.py:636 -- Total run time: 172.06 seconds (171.83 seconds for the tuning loop).
If you have TensorBoard installed, you can also see curves like this:

[figure: TensorBoard training curves for all trials]

At a glance, the highlighted trials mark your best parameters:

[figure: highlighted best hyperparameter combination]

The result is quite good: the best trial in the table above (lr = 0.1, momentum ≈ 0.87) reaches about 96% accuracy, while most weak configurations were stopped after a single iteration. Finding the hyperparameter configuration that best matches your model this way can buy you real accuracy.
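Rather than reading the winner off the table, you can also query it programmatically. Below is a minimal sketch, assuming a Ray 1.x-era Tune API where tune.run returns an ExperimentAnalysis object; Tune writes trial logs under ~/ray_results by default, which is also where to point TensorBoard (tensorboard --logdir ~/ray_results) to reproduce the curves above.

# Minimal sketch (assumes Ray 1.x): inspect the analysis object from tune.run.
best_config = analysis.get_best_config(metric="mean_accuracy", mode="max")
print("Best config:", best_config)

# All trial results as a pandas DataFrame, handy for custom plots:
print(analysis.results_df.head())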
Can Ray Scale to Hyperparameter Search for Large Detection Networks?
The answer is yes, it is just more involved. My previously open-sourced YOLOv7 series contains many model combinations, and each combination depends on its hyperparameters differently. Here is a reference example of hyperparameter search on a large training framework: detectron2 with PointRend on the balloon dataset. Note that the augmentation-related keys in the search space (hflip, vflip, blur, contrast, brightness) are placeholders; the data mapper that would consume them is not shown in this snippet.
import detectron2
from detectron2.utils.logger import setup_logger
setup_logger()
import detectron2.data.transforms as T
# import some common detectron2 utilities
from detectron2 import model_zoo
from detectron2.data import detection_utils as utils
from detectron2.data import MetadataCatalog, DatasetCatalog, build_detection_train_loader, DatasetMapper, build_detection_test_loader
from detectron2.engine import DefaultTrainer, DefaultPredictor
from detectron2.config import get_cfg
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.engine.hooks import HookBase
from detectron2.data.datasets import register_coco_instances, load_coco_json
from detectron2.utils.visualizer import Visualizer, ColorMode
from detectron2.structures.boxes import BoxMode
import numpy as np
import os, json, cv2, random
# import PointRend project
from detectron2.projects import point_rend
import copy
import torch
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
from torchvision.transforms import functional as F
import albumentations as A

# Hyperparameter tuning
import ray
from functools import partial
from ray import tune
from ray.tune import JupyterNotebookReporter
from ray.tune.schedulers import ASHAScheduler
# %% [markdown]
# # Register the train and test dataset
# %% [code]
def get_balloon_dicts(img_dir):
    json_file = os.path.join(img_dir, "via_region_data.json")
    with open(json_file) as f:
        imgs_anns = json.load(f)

    dataset_dicts = []
    for idx, v in enumerate(imgs_anns.values()):
        record = {}

        filename = os.path.join(img_dir, v["filename"])
        height, width = cv2.imread(filename).shape[:2]

        record["file_name"] = filename
        record["image_id"] = idx
        record["height"] = height
        record["width"] = width

        annos = v["regions"]
        objs = []
        for _, anno in annos.items():
            assert not anno["region_attributes"]
            anno = anno["shape_attributes"]
            px = anno["all_points_x"]
            py = anno["all_points_y"]
            poly = [(x + 0.5, y + 0.5) for x, y in zip(px, py)]
            poly = [p for x in poly for p in x]

            obj = {
                "bbox": [np.min(px), np.min(py), np.max(px), np.max(py)],
                "bbox_mode": BoxMode.XYXY_ABS,
                "segmentation": [poly],
                "category_id": 0,
                "iscrowd": 0,
            }
            objs.append(obj)
        record["annotations"] = objs
        dataset_dicts.append(record)
    return dataset_dicts


for d in ["train", "val"]:
    DatasetCatalog.register("balloon_" + d, lambda d=d: get_balloon_dicts("balloon/" + d))
    MetadataCatalog.get("balloon_" + d).set(thing_classes=["balloon"])
balloon_metadata = MetadataCatalog.get("balloon_train")
# %% [markdown]
# # Generate the custom Trainer class using augmentations and hook
# %% [code]
class RayTuneReporterHook(HookBase):
    """SEMI IMPLEMENTED: hook that reports metrics to Ray Tune."""
    def __init__(self, eval_period):
        self._eval_period = eval_period

    def after_step(self):
        next_iter = self.trainer.iter + 1
        is_final = next_iter == self.trainer.max_iter
        if is_final or (self._eval_period > 0 and next_iter % self._eval_period == 0):
            loss_mask_latest = self.trainer.storage.history('loss_mask').latest()
            tune.report(loss_mask=loss_mask_latest)
            print(f'{self.trainer.iter}: losses and mAPs {loss_mask_latest}')


class CustomTrainer(DefaultTrainer):
    """Custom trainer class."""
    def __init__(self, cfg, hp):
        self._hp = hp
        super().__init__(cfg)

    def build_hooks(self):
        # Build hooks from the super class and add the Ray Tune reporter hook.
        hooks = super().build_hooks()
        hooks.insert(-1, RayTuneReporterHook(self.cfg.TEST.EVAL_PERIOD))
        return hooks
# %% [markdown]
# # Train the PointRend model
# %% [code]
def set_cfg_values(hp):
    """Set the config values for the PointRend model."""
    cfg = get_cfg()
    # Add PointRend-specific config
    point_rend.add_pointrend_config(cfg)
    # Load a config from file
    cfg.merge_from_file("/kaggle/working/detectron2_repo/projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco.yaml")
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set threshold for this model
    cfg.DATASETS.TRAIN = ("balloon_train",)
    cfg.DATASETS.TEST = ("balloon_val",)
    cfg.DATALOADER.NUM_WORKERS = 4
    # Use a model from the PointRend model zoo: https://github.com/facebookresearch/detectron2/tree/master/projects/PointRend#pretrained-models
    cfg.MODEL.WEIGHTS = "detectron2://PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_edd263.pkl"
    cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512  # good enough for this toy dataset (default: 512)
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # only one class (balloon); see https://detectron2.readthedocs.io/tutorials/datasets.html#update-the-config-for-new-datasets
    cfg.MODEL.POINT_HEAD.NUM_CLASSES = 1  # only one class (balloon)
    # NOTE: this config is the number of classes; a few popular unofficial tutorials incorrectly use num_classes + 1 here.
    cfg.SEED = 1
    cfg.INPUT.FORMAT = 'RGB'
    cfg.INPUT.RANDOM_FLIP = 'none'
    cfg.TEST.EVAL_PERIOD = 1

    # Hyperparameters that need to be tuned
    cfg.OUTPUT_DIR = '/kaggle/working/output'
    cfg.SOLVER.IMS_PER_BATCH = hp['batch_size']
    cfg.SOLVER.BASE_LR = hp['lr']  # e.g. 0.00025 is a reasonable fixed default
    cfg.SOLVER.MAX_ITER = hp['epochs']  # note: MAX_ITER counts iterations despite the key name; train longer for a practical dataset
    cfg.SOLVER.STEPS = (3,)  # decay the learning rate at iteration 3; use [] to disable decay
    if hp['anchorboxes']:
        cfg.MODEL.ANCHOR_GENERATOR.SIZES = [[8, 16, 32, 64, 128]]
        cfg.MODEL.ANCHOR_GENERATOR.ASPECT_RATIOS = [[0.25, 0.5, 1.0, 2.0]]
    return cfg
# %% [code]
def train_model(hp, checkpoint_dir=None):
    cfg = set_cfg_values(hp)
    # Train the model
    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = CustomTrainer(cfg, hp)
    trainer.resume_or_load(resume=False)
    trainer.train()
# %% [code]
# This plain (non-Tune) cell works:
hp = {"epochs": 8,
      "lr": 1e-3,
      "batch_size": 2,
      "anchorboxes": False,
      "hflip": 0.5,
      "vflip": 0.5,
      "blur": 0.5,
      "contrast": 0.2,
      "brightness": 0.2}
train_model(hp)
# %% [code]
# This hyperparameter-search cell, however, did not work out of the box;
# one likely cause is discussed after the code.
def main(num_samples=10, max_num_epochs=10, gpus_per_trial=1):
    # Set the hyperparameter configuration dict with viable ranges
    hp = {"epochs": tune.randint(8, 15),
          "lr": tune.loguniform(1e-3, 1e-2),
          "batch_size": tune.choice([2, 4, 8]),
          "anchorboxes": False,
          "hflip": tune.choice([0, 0.5]),
          "vflip": tune.choice([0, 0.5]),
          "blur": tune.choice([0, 0.5]),
          "contrast": tune.choice([0, 0.2]),
          "brightness": tune.choice([0, 0.2])}

    # Initialize the Asynchronous Successive Halving (ASHA) scheduler
    scheduler = ASHAScheduler(
        metric="loss_mask",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=4)

    # Report results inside the notebook
    reporter = JupyterNotebookReporter(overwrite=False,
                                       metric_columns=["loss_mask", "segm_map_small",
                                                       "segm_map_medium", "training_iteration"])

    # Run the search with the ASHA scheduler
    result = tune.run(
        train_model,
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=hp,
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter)

    # Get the best model
    best_trial = result.get_best_trial("loss_mask", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(
        best_trial.last_result["loss_mask"]))
    print("Best trial final validation segm_map_small: {}".format(
        best_trial.last_result["segm_map_small"]))
    print("Best trial final validation segm_map_medium: {}".format(
        best_trial.last_result["segm_map_medium"]))


if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=2, max_num_epochs=3, gpus_per_trial=1)
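A note on the "did not work out of the box" comment above. One plausible cause, which is my reading rather than something the original notebook confirms, is that each Tune trial runs in a fresh Ray worker process, so the balloon datasets registered at notebook scope are not registered inside the trial. A hedged sketch of one workaround is to re-register them inside the trainable:

# Sketch (assumption, not the original notebook's fix): make each trial
# self-contained by registering the balloon datasets inside the trainable,
# since Tune workers are separate processes from the notebook kernel.
def train_model(hp, checkpoint_dir=None):
    for d in ["train", "val"]:
        if "balloon_" + d not in DatasetCatalog.list():
            DatasetCatalog.register(
                "balloon_" + d, lambda d=d: get_balloon_dicts("balloon/" + d))
            MetadataCatalog.get("balloon_" + d).set(thing_classes=["balloon"])
    cfg = set_cfg_values(hp)
    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = CustomTrainer(cfg, hp)
    trainer.resume_or_load(resume=False)
    trainer.train()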
Summary
Ray is a solid tool with a lot to offer for distributed computing and search; if you're interested, it's well worth digging into.
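For example (a sketch assuming a standard multi-node Ray setup), scaling either of the searches above from one machine to a cluster mostly amounts to starting a head node and attaching to it before calling tune.run:

import ray

# Attach to an already-running Ray cluster (started with `ray start --head`
# on the head node, `ray start --address=<head-ip>:6379` on the workers)
# instead of spinning up a local instance; tune.run will then schedule
# trials across every node that has joined the cluster.
ray.init(address="auto")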
A small plug: github.com/jinfagang/y…