Error as soon as training starts
Epoch 1/2000
Traceback (most recent call last):
File "multi_type_question_number.py", line 450, in <module>
train(model)
File "multi_type_question_number.py", line 281, in train
layers='heads')
File "/root/Mask_RCNN/mrcnn/model.py", line 2381, in train
use_multiprocessing=False,
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training_generator.py", line 217, in fit_generator
class_weight=class_weight)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 1211, in train_on_batch
class_weight=class_weight)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 751, in _standardize_user_data
exception_prefix='input')
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training_utils.py", line 138, in standardize_input_data
str(data_shape))
ValueError: Error when checking input: expected input_image_meta to have shape (16,) but got array with shape (22,)
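Cause: the configured number of classes does not match the dataset. Fix: set the number of classes (NUM_CLASSES in the config) to 1 (background) + the number of object classes to detect.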
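The two shapes in the error message can be decoded into class counts. The following is a minimal sketch, assuming the image meta layout used by recent matterport Mask_RCNN versions (12 fixed fields plus one flag per class); the helper names are hypothetical, and the layout should be checked against `compose_image_meta` in your own copy of `mrcnn/model.py`.

```python
# Assumed layout of the image meta vector (verify in your mrcnn/model.py):
# image_id (1) + original_image_shape (3) + image_shape (3)
# + window (4) + scale (1) + active_class_ids (NUM_CLASSES)
FIXED_META_FIELDS = 1 + 3 + 3 + 4 + 1  # = 12


def image_meta_size(num_classes):
    """Length of input_image_meta for a given NUM_CLASSES (background included)."""
    return FIXED_META_FIELDS + num_classes


def num_classes_from_meta(meta_size):
    """Invert the formula: the class count implied by a meta vector length."""
    return meta_size - FIXED_META_FIELDS


expected = num_classes_from_meta(16)  # what the model was built with
got = num_classes_from_meta(22)       # what the dataset actually produced
print("NUM_CLASSES in config:", expected)
print("classes in dataset (incl. background):", got)
```

Under that assumption, the model was built for 4 classes while the dataset registers 10 (background included), so NUM_CLASSES would need to be 10.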
Error: the model cannot be saved
100/100 [==============================] - 261s 3s/step - loss: 1.6207 - rpn_class_loss: 0.1146 - rpn_bbox_loss: 0.4441 - mrcnn_class_loss: 0.2516 - mrcnn_bbox_loss: 0.4165 - mrcnn_mask_loss: 0.3940 - val_loss: 0.8447 - val_rpn_class_loss: 0.0514 - val_rpn_bbox_loss: 0.4179 - val_mrcnn_class_loss: 0.0822 - val_mrcnn_bbox_loss: 0.1607 - val_mrcnn_mask_loss: 0.1326
Traceback (most recent call last):
File "multi_type_question_number.py", line 450, in <module>
train(model)
File "multi_type_question_number.py", line 281, in train
layers='heads')
File "/root/Mask_RCNN/mrcnn/model.py", line 2381, in train
use_multiprocessing=False,
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/training_generator.py", line 251, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/callbacks.py", line 79, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/callbacks.py", line 446, in on_epoch_end
self.model.save(filepath, overwrite=True)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/network.py", line 1090, in save
save_model(self, filepath, overwrite, include_optimizer)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/saving.py", line 382, in save_model
_serialize_model(model, f, include_optimizer)
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/saving.py", line 83, in _serialize_model
model_config['config'] = model.get_config()
File "/root/anaconda3/envs/dl/lib/python3.5/site-packages/keras/engine/network.py", line 931, in get_config
return copy.deepcopy(config)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 218, in _deepcopy_list
y.append(deepcopy(a, memo))
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in <listcomp>
y = [deepcopy(a, memo) for a in x]
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 223, in <listcomp>
y = [deepcopy(a, memo) for a in x]
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 182, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 297, in _reconstruct
state = deepcopy(state, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 182, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 297, in _reconstruct
state = deepcopy(state, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/root/anaconda3/envs/dl/lib/python3.5/copy.py", line 174, in deepcopy
rv = reductor(4)
TypeError: can't pickle SwigPyObject objects
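Fix:
The checkpoint is configured to save the wrong thing. The traceback shows model.save() failing while it deep-copies the model config.
In model.py, configure the checkpoint to save weights only. Saving the full Mask R-CNN model is not supported; the cause is unknown.
Set save_best_only=True, save_weights_only=True:
# Callbacks (other entries in the list stay as they are)
callbacks = [
    keras.callbacks.ModelCheckpoint(self.checkpoint_path, monitor='val_loss',
                                    verbose=0, save_best_only=True, save_weights_only=True),
]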
Loss becomes nan
Epoch 33/2000
100/100 [==============================] - 231s 2s/step - loss: 0.9838 - rpn_class_loss: 0.0180 - rpn_bbox_loss: 0.2431 - mrcnn_class_loss: 0.1373 - mrcnn_bbox_loss: 0.2723 - mrcnn_mask_loss: 0.3130 - val_loss: 0.9683 - val_rpn_class_loss: 0.0417 - val_rpn_bbox_loss: 0.3798 - val_mrcnn_class_loss: 0.0751 - val_mrcnn_bbox_loss: 0.2362 - val_mrcnn_mask_loss: 0.2355
Epoch 34/2000
100/100 [==============================] - 215s 2s/step - loss: 0.8713 - rpn_class_loss: 0.0227 - rpn_bbox_loss: 0.1844 - mrcnn_class_loss: 0.1292 - mrcnn_bbox_loss: 0.2486 - mrcnn_mask_loss: 0.2863 - val_loss: 0.7546 - val_rpn_class_loss: 0.0275 - val_rpn_bbox_loss: 0.2965 - val_mrcnn_class_loss: 0.0583 - val_mrcnn_bbox_loss: 0.1699 - val_mrcnn_mask_loss: 0.2023
Epoch 35/2000
100/100 [==============================] - 221s 2s/step - loss: nan - rpn_class_loss: 0.6370 - rpn_bbox_loss: 0.4525 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0335 - mrcnn_mask_loss: 0.0907 - val_loss: nan - val_rpn_class_loss: 0.7072 - val_rpn_bbox_loss: 0.4715 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 36/2000
100/100 [==============================] - 219s 2s/step - loss: nan - rpn_class_loss: 0.7061 - rpn_bbox_loss: 0.4460 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7072 - val_rpn_bbox_loss: 0.4431 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 37/2000
100/100 [==============================] - 209s 2s/step - loss: nan - rpn_class_loss: 0.7066 - rpn_bbox_loss: 0.4681 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7073 - val_rpn_bbox_loss: 0.4084 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
.....
Epoch 167/2000
100/100 [==============================] - 210s 2s/step - loss: nan - rpn_class_loss: 0.7062 - rpn_bbox_loss: 0.4489 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00 - val_loss: nan - val_rpn_class_loss: 0.7074 - val_rpn_bbox_loss: 0.5061 - val_mrcnn_class_loss: nan - val_mrcnn_bbox_loss: 0.0000e+00 - val_mrcnn_mask_loss: 0.0000e+00
Epoch 168/2000
24/100 [======>.......................] - ETA: 1:34 - loss: nan - rpn_class_loss: 0.7069 - rpn_bbox_loss: 0.4690 - mrcnn_class_loss: nan - mrcnn_bbox_loss: 0.0000e+00 - mrcnn_mask_loss: 0.0000e+00
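mrcnn_class_loss: nan makes the total loss nan.
Cause: the training data contains class ids outside the defined classes.
Fix: correct the labels so that every class id in the training data is one of the defined classes.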
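A check like this can be run over the labels before training starts. This is a minimal sketch: the annotation format (a list of dicts with a "class_id" key) and the function name are hypothetical, and it assumes valid foreground ids are 1 to NUM_CLASSES - 1. Flagging id 0 on an annotated object is a choice made here, since 0 is reserved for background.

```python
def find_invalid_class_ids(annotations, num_classes):
    """Return (index, class_id) pairs whose class id is not a valid foreground id.

    `annotations` is a hypothetical list of dicts with a "class_id" key.
    Valid foreground ids are 1 .. num_classes - 1 (0 is background).
    """
    invalid = []
    for index, ann in enumerate(annotations):
        class_id = ann["class_id"]
        is_int = isinstance(class_id, int) and not isinstance(class_id, bool)
        if not is_int or not (1 <= class_id < num_classes):
            invalid.append((index, class_id))
    return invalid


# Made-up labels: with NUM_CLASSES = 4, only ids 1, 2 and 3 are valid.
annotations = [{"class_id": 1}, {"class_id": 3}, {"class_id": 7}, {"class_id": 0}]
print(find_invalid_class_ids(annotations, num_classes=4))
```

Any pair it reports points at an annotation to fix before the labels reach the classifier head.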
Notes on class ids
It took me two days to get this code running on my own dataset. I think the guidance should include more details. 1. When using add_image() in the utils.Dataset class, image_id must be a consecutive integer from 1 up to some number, because image_id is used as the index of a list. Class numbers should also be consecutive integers from 1 up to some number, or you will get a nan loss.
- class ids must be consecutive integers
- they must also start from 1, otherwise an error occurs
- 0 is already taken as the class id of the background
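One way to satisfy these rules is to generate the ids instead of writing them by hand. This is a minimal sketch; the label names and both helper functions are hypothetical.

```python
def build_class_id_map(labels):
    """Map each distinct label to a consecutive integer id starting at 1.

    Id 0 is left free for the background class.
    """
    unique_labels = sorted(set(labels))
    return {label: class_id for class_id, label in enumerate(unique_labels, start=1)}


def is_consecutive_from_one(class_ids):
    """True if the ids are exactly 1, 2, ..., n with no gaps or duplicates."""
    ids = sorted(class_ids)
    return ids == list(range(1, len(ids) + 1))


# Hypothetical label names as they might appear in an annotation file.
labels = ["choice", "blank", "essay", "choice"]
class_id_map = build_class_id_map(labels)
print(class_id_map)
print("NUM_CLASSES =", 1 + len(class_id_map))  # background + object classes
print(is_consecutive_from_one(class_id_map.values()))
```

With matterport's utils.Dataset, each (label, id) pair from such a map would then be registered through add_class(source, class_id, class_name).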
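Understanding the RPN
rpn class loss: this is not the classification loss over all object categories; it is the loss for classifying foreground versus background.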
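To make the foreground/background point concrete, here is a simplified sketch of such a loss in plain Python. It assumes the anchor label encoding I understand the matterport implementation to use (1 = foreground, -1 = background, 0 = neutral and ignored). The function is an illustration, not the library's code.

```python
import math


def rpn_class_loss(anchor_matches, fg_probs):
    """Mean binary cross-entropy over non-neutral anchors.

    anchor_matches: 1 = foreground, -1 = background, 0 = neutral (ignored).
    fg_probs: predicted probability that each anchor is foreground.
    """
    losses = []
    for match, p_fg in zip(anchor_matches, fg_probs):
        if match == 0:
            continue  # neutral anchors do not contribute to the loss
        p_true = p_fg if match == 1 else 1.0 - p_fg
        losses.append(-math.log(p_true))
    return sum(losses) / len(losses) if losses else 0.0


matches = [1, -1, 0, -1]
print(rpn_class_loss(matches, [0.9, 0.2, 0.5, 0.1]))  # confident, mostly right
print(rpn_class_loss(matches, [0.5, 0.5, 0.5, 0.5]))  # pure guessing: ln 2
```

The object category never appears in the inputs: the RPN only decides whether an anchor contains some object, and the category is left to mrcnn_class_loss.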
CUDA and NVIDIA driver compatibility
Use nvtop to monitor changes in GPU usage.
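Training startup procedure
- Go to the training directory you created.
- Activate the training environment: source activate dl
- Start training: nohup python xxxx.py train >train.log &
- Monitor the training process: on the ECS remote server, run jupyter notebook and tensorboard as background services:
nohup jupyter notebook --ip 0.0.0.0 --no-browser --allow-root > jupyter.log &
- Monitor the loss with tensorboard:
nohup tensorboard --logdir=./logs > tensorboard.log &
- tensorboard fails to start on CentOS 7:
Tensorboard could not bind to unsupported address family
This is an IPv6 issue, probably because IPv4 is not the default. Specifying the host as 0.0.0.0 is enough: nohup tensorboard --logdir=./logs --host=0.0.0.0 > tensorboard.log &
- Firewall ports: port 6006 (tensorboard) and port 8888 (jupyter) also need to be opened in the firewall.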
iptables -A INPUT -p tcp --dport 6006 -j ACCEPT
iptables -A OUTPUT -p tcp --sport 6006 -j ACCEPT
iptables -A INPUT -p tcp --dport 8888 -j ACCEPT
iptables -A OUTPUT -p tcp --sport 8888 -j ACCEPT
# save first: a restart reloads the saved rules and drops unsaved ones
service iptables save
service iptables restart
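After opening the ports, it can help to confirm from another machine that they are actually reachable. This is a minimal sketch using only the standard library; the function name is hypothetical and the address is a placeholder to replace with the server's own.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    server = "127.0.0.1"  # placeholder: replace with the server's address
    for name, port in (("tensorboard", 6006), ("jupyter", 8888)):
        state = "reachable" if is_port_open(server, port) else "not reachable"
        print(name, port, state)
```

A port that reports as not reachable may mean the service is not running, not only that the firewall is blocking it, so check tensorboard.log and jupyter.log as well.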
Training hangs
In the train method of model.py, turn off use_multiprocessing and set workers to 1; that resolves it.
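Understanding step, batch-size, and epoch in training
- iteration: one iteration (also called a training step); each iteration updates the network's parameters once;
- batch-size: the number of samples used in one iteration;
- epoch: one epoch means one pass over all samples in the training set.
Note that deep learning commonly trains deep networks with mini-batch stochastic gradient descent (SGD). One advantage is that a parameter update does not require traversing all samples, which is very effective when the dataset is large. In that case an epoch can be defined to suit the problem: for example, define 10000 iterations as 1 epoch; with a batch-size of 256 per iteration, 1 epoch then processes 2,560,000 training samples.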
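The arithmetic relating these three quantities can be sketched as follows. The function names are hypothetical, and the second example uses made-up numbers.

```python
import math


def samples_per_epoch(iterations_per_epoch, batch_size):
    """Samples processed in one epoch when an epoch is a fixed number of iterations."""
    return iterations_per_epoch * batch_size


def iterations_for_full_pass(dataset_size, batch_size):
    """Iterations needed to see every sample once (the last batch may be partial)."""
    return math.ceil(dataset_size / batch_size)


# The example from the text: 10000 iterations of 256 samples each.
print(samples_per_epoch(10000, 256))

# Made-up numbers: 1000 images with a batch size of 32.
print(iterations_for_full_pass(1000, 32))
```

In the logs above, each epoch shows 100/100, so an epoch there is 100 iterations, which is not necessarily a full pass over the training set.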