Mask RCNN Training Summary


Error as soon as training starts

Epoch 1/2000
Traceback (most recent call last):
  File "multi_type_question_number.py", line 450, in <module>
    train(model)
  File "multi_type_question_number.py", line 281, in train
    layers='heads')
  File "/root/Mask_RCNN/mrcnn/model.py", line 2381, in train
    use_multiprocessing=False,
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 1211, in train_on_batch
    class_weight=class_weight)
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 751, in _standardize_user_data
    exception_prefix='input')
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training_utils.py", line 138, in standardize_input_data
    str(data_shape))
ValueError: Error when checking input: expected input_image_meta to have shape (16,) but got array with shape (22,)

Cause: the class count does not match. Fix: set the number of classes to 1 (background) + the number of object classes to detect.
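The two shapes in the error message can be decoded to see which side is wrong. The sketch below assumes the image-meta layout used by recent versions of the Matterport repository (12 fixed fields plus one flag per class); older versions use a different fixed part, so treat the constant as an assumption. The helper names are hypothetical.

```python
# Sketch: decode the class counts implied by the image-meta shape error.
# Assumed layout: image_id (1) + original_image_shape (3) + image_shape (3)
# + window (4) + scale (1) + one active flag per class (NUM_CLASSES).
FIXED_META_FIELDS = 1 + 3 + 3 + 4 + 1  # 12 fixed fields (assumption)


def image_meta_size(num_classes):
    """Length of the image-meta vector for a given class count."""
    return FIXED_META_FIELDS + num_classes


def classes_from_meta_size(meta_size):
    """Class count (background included) implied by a meta vector length."""
    return meta_size - FIXED_META_FIELDS


expected_by_config = classes_from_meta_size(16)   # 4
provided_by_dataset = classes_from_meta_size(22)  # 10
print(expected_by_config, provided_by_dataset)
```

Under that assumption, the config declared 4 classes while the dataset registered 10 (both counts include the background), so the config's class count is the value to correct.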

Error: the model cannot be saved

100/100 [==============================] - 261s 3s/step - loss: 1.6207 - rpn_class_loss: 0.1146 - rpn_bbox_loss: 0.4441 - mrcnn_class_loss: 0.2516 - mrcnn_bbox_loss: 0.4165 - mrcnn_mask_loss: 0.3940 - val_loss: 0.8447 - val_rpn_class_loss: 0.0514 - val_rpn_bbox_loss: 0.4179 - val_mrcnn_class_loss: 0.0822 - val_mrcnn_bbox_loss: 0.1607 - val_mrcnn_mask_loss: 0.1326
Traceback (most recent call last):
  File "multi_type_question_number.py", line 450, in <module>
    train(model)
  File "multi_type_question_number.py", line 281, in train
    layers='heads')
  File "/root/Mask_RCNN/mrcnn/model.py", line 2381, in train
    use_multiprocessing=False,
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/callbacks.py", line 446, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/network.py", line 1090, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/saving.py", line 382, in save_model
    _serialize_model(model, f, include_optimizer)
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/saving.py", line 83, in _serialize_model
    model_config['config'] = model.get_config()
  File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/network.py", line 931, in get_config
    return copy.deepcopy(config)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 218, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 297, in _reconstruct
    state = deepcopy(state, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 297, in _reconstruct
    state = deepcopy(state, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 174, in deepcopy
    rv = reductor(4)
TypeError: can't pickle SwigPyObject objects

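Fix:

The error comes from the checkpoint settings. The traceback shows model.save() deep-copying the model config and hitting an object that cannot be pickled. In model.py, set the checkpoint callback to save weights only. Mask RCNN does not support saving the full model; the underlying reason is unknown.

Key arguments: save_best_only=True, save_weights_only=True

# Callbacks
callbacks = [
    keras.callbacks.ModelCheckpoint(self.checkpoint_path, monitor='val_loss',
                                    verbose=0, save_best_only=True,
                                    save_weights_only=True),
]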

Loss nan

Epoch 33/2000
100/100 [==============================] - 231s 2s/step - loss: 0.9838 - rpn_class_loss: 0.0180 - rpn_bbox_loss: 0.2431 - mrcnn_class_loss: 0.1373 - mrcnn_bbox_loss: 0.2723 - mrcnn_mask_loss: 0.3130 - val_loss: 0.9683 - val_rpn_class_loss: 0.0417 - val_rpn_bbox_loss: 0.3798 - val_mrcnn_class_loss: 0.0751 - val_mrcnn_bbox_loss: 0.2362 - val_mrcnn_mask_loss: 0.2355
Epoch 34/2000
100/100 [==============================] - 215s 2s/step - loss: 0.8713 - rpn_class_loss: 0.0227 - rpn_bbox_loss: 0.1844 - mrcnn_class_loss: 0.1292 - mrcnn_bbox_loss: 0.2486 - mrcnn_mask_loss: 0.2863 - val_loss: 0.7546 - val_rpn_class_loss: 0.0275 - val_rpn_bbox_loss: 0.2965 - val_mrcnn_class_loss: 0.0583 - val_mrcnn_bbox_loss: 0.1699 - val_mrcnn_mask_loss: 0.2023
Epoch 35/2000
100/100 [==============================] - 221s 2s/step - loss: nan - rpn_class_loss: 0.6370 - rpn_bbox_loss: 0.4525 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0335 - mrcnn_mask_loss: 0.0907 - val_loss: nan - val_rpn_class_loss: 0.7072 - val_rpn_bbox_loss: 0.4715 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 36/2000
100/100 [==============================] - 219s 2s/step - loss: nan - rpn_class_loss: 0.7061 - rpn_bbox_loss: 0.4460 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7072 - val_rpn_bbox_loss: 0.4431 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 37/2000
100/100 [==============================] - 209s 2s/step - loss: nan - rpn_class_loss: 0.7066 - rpn_bbox_loss: 0.4681 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7073 - val_rpn_bbox_loss: 0.4084 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00

.....
Epoch 167/2000
100/100 [==============================] - 210s 2s/step - loss: nan - rpn_class_loss: 0.7062 - rpn_bbox_loss: 0.4489 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7074 - val_rpn_bbox_loss: 0.5061 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 168/2000
 24/100 [======>.......................] - ETA: 1:34 - loss: nan - rpn_class_loss: 0.7069 - rpn_bbox_loss: 0.4690 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00

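mrcnn_class_loss: nan makes the overall loss nan.

Cause and fix: the training data contained values outside the defined class ids. Correcting or removing those labels resolves it.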
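A check like the following can catch such labels before training. It is a minimal sketch, not part of the Mask RCNN code: the function name is hypothetical, and it assumes you can collect the class ids of each image (for example from the class id array your dataset's load_mask returns) and that valid ids run from 0 to num_classes - 1.

```python
import numpy as np


def find_invalid_class_ids(class_ids, num_classes):
    """Return the sorted unique ids that fall outside 0..num_classes-1."""
    ids = np.asarray(class_ids, dtype=np.int64)
    bad = ids[(ids < 0) | (ids >= num_classes)]
    return sorted(set(bad.tolist()))


# Example: a config with 4 classes (background + 3) and one stray label.
print(find_invalid_class_ids([1, 2, 3, 7], num_classes=4))
```

Running this over every image in the training and validation sets and stopping on the first non-empty result is cheaper than finding the problem at epoch 35.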

Notes on class ids

github.com/matterport/…

It takes me two days to running this code on my own data set. I thought there should be more details in the guidance. 1. When using add_image() in the utils.Dataset class, the image_id must be consecutive integer from 1 to some number, because image_id is the index of a list. Class number also should be consecutive integer from 1 to some number, or you will get an nanloss.

  1. class ids must be consecutive integers
  2. they must start from 1, otherwise an error occurs
  3. 0 is already taken as the class id of the background
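One way to satisfy these three rules is to derive the ids from the label names instead of typing them by hand. The helpers below are a hypothetical sketch; the label names are placeholders, not the author's classes.

```python
def build_class_id_map(label_names):
    """Map each distinct label to a consecutive id starting at 1 (0 = background)."""
    unique = sorted(set(label_names))
    return {name: i for i, name in enumerate(unique, start=1)}


def is_consecutive_from_one(class_ids):
    """True if the distinct ids are exactly 1, 2, ..., n with no gaps."""
    ids = sorted(set(class_ids))
    return ids == list(range(1, len(ids) + 1))


# Placeholder labels as they might appear in an annotation file.
id_map = build_class_id_map(["b", "a", "b", "c"])
print(id_map)
print(is_consecutive_from_one(id_map.values()))
```

The resulting ids can then be passed to the dataset's add_class calls, with the config's class count set to the number of labels plus one for the background.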

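Understanding the RPN

rpn class loss: this is not a classification loss over all object categories, but the loss for classifying anchors as foreground or background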
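To make the foreground/background point concrete, here is a NumPy sketch of a two-class cross-entropy over anchors. It illustrates the idea and is not the repository's Keras code; it assumes the convention that each anchor is marked 1 (foreground), -1 (background) or 0 (ignored), and that the two logits are ordered [background, foreground].

```python
import numpy as np


def rpn_class_loss(rpn_match, rpn_logits):
    """Two-class cross-entropy over the non-ignored anchors.

    rpn_match:  (N,) with 1 = foreground, -1 = background, 0 = ignored.
    rpn_logits: (N, 2) raw scores ordered [background, foreground].
    """
    rpn_match = np.asarray(rpn_match)
    rpn_logits = np.asarray(rpn_logits, dtype=np.float64)
    keep = rpn_match != 0               # ignored anchors contribute nothing
    if not keep.any():
        return 0.0
    labels = (rpn_match[keep] == 1).astype(int)  # 1 = foreground, 0 = background
    logits = rpn_logits[keep]
    # Numerically stable log-softmax over the two classes.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# Two undecided anchors and one ignored anchor.
print(rpn_class_loss([1, -1, 0], [[0.0, 0.0], [0.0, 0.0], [5.0, -5.0]]))
```

No object category appears anywhere in this loss; the per-category classification is what mrcnn_class_loss measures.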

CUDA and NVIDIA driver compatibility

Monitoring GPU activity with nvtop

blog.csdn.net/fword/artic…

blog.csdn.net/jiang_xinxi…

github.com/Syllo/nvtop

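Training startup procedure

  • Go to the training directory you created

    1. Activate the training environment: source activate dl
    2. Start training: nohup python xxxx.py train >train.log &
  • Monitoring training: run jupyter notebook and tensorboard in the background on the remote ECS server: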

nohup jupyter notebook --ip 0.0.0.0 --no-browser --allow-root > jupyter.log &

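  • Monitoring loss with tensorboard: nohup tensorboard --logdir=./logs > tensorboard.log &

    • tensorboard fails to start on CentOS 7: Tensorboard could not bind to unsupported address family

This is an IPv6 issue, probably because IPv4 is not the default; specifying host 0.0.0.0 fixes it.  nohup tensorboard --logdir=./logs --host=0.0.0.0 > tensorboard.log &

  • Firewall ports: port 6006 (tensorboard) must also be opened in the firewall; the rules for 8888 cover jupyter notebook.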
iptables -A INPUT -p tcp --dport 6006 -j ACCEPT
iptables -A OUTPUT -p tcp --sport 6006 -j ACCEPT


iptables -A INPUT -p tcp --dport 8888 -j ACCEPT
iptables -A OUTPUT -p tcp --sport 8888 -j ACCEPT

service iptables restart
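After changing the rules, it can help to confirm from your own machine that the ports actually accept connections. This is a small standard-library sketch; the function name is hypothetical and the host in the example is a placeholder for your server's address.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered, timed out, or unresolvable host.
        return False


# Placeholder host: replace with the address of the training server.
for port in (6006, 8888):
    print(port, port_is_open("127.0.0.1", port, timeout=1.0))
```

A False result only says the connection failed; the service not running and the firewall blocking the port look the same from the client side.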

www.jianshu.com/p/586da7c8f…

Training hangs

In the train method of model.py, turn off use_multiprocessing and set workers to 1; that fixes it.

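Understanding training step, batch-size, and epoch

  1. iteration: one iteration (also called a training step); each iteration updates the network's parameters once;
  2. batch-size: the number of samples used in one iteration;
  3. epoch: one epoch means one pass over all samples in the training set.

It is worth noting that deep learning commonly trains deep architectures with mini-batch stochastic gradient descent (Stochastic Gradient Descent, SGD). One advantage is that a parameter update does not require a pass over all samples, which is very effective when the dataset is very large. In that case an epoch can be defined to suit the problem: for example, define 10000 iterations as 1 epoch; if the batch-size of each iteration is 256, then 1 epoch corresponds to processing 2,560,000 training samples.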
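The arithmetic behind these definitions is short enough to write down. The function names below are hypothetical, and the 1000-image dataset in the second call is an illustrative figure only.

```python
import math


def samples_per_epoch(iterations_per_epoch, batch_size):
    """Samples processed in one epoch when the epoch is a fixed iteration count."""
    return iterations_per_epoch * batch_size


def iterations_for_full_pass(dataset_size, batch_size):
    """Iterations needed to see every sample once (last batch may be partial)."""
    return math.ceil(dataset_size / batch_size)


print(samples_per_epoch(10000, 256))        # 2560000
print(iterations_for_full_pass(1000, 256))  # 4
```

The first call reproduces the example in the text; the second shows the other direction, deriving the iteration count from a dataset size when an epoch is meant to be one full pass.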