Notes on YOLO-related issues


Contents

Errors

OpenCV can't augment image: 608 x 608

The size of tensor a (19) must match the size of tensor b (76) at non-singleton dimension 3

NotImplementedError: Create your own 'get_image_id' function

view size is not compatible with input tensor’s size and stride

CUDA error: an illegal memory access was encountered

can't convert cuda:0 device type tensor to numpy. 

Knowledge points

Introduction to YOLOv5 hyperparameters

Introduction to the YOLOv5n6.yaml file

Code snippets

visdrone2yolo.py


Errors

When using the Python version

OpenCV can't augment image: 608 x 608

This is an OpenCV version issue: the installed version is too new. Downgrade it:

pip install opencv_python==3.4.4.19

The size of tensor a (19) must match the size of tensor b (76) at non-singleton dimension 3

In train.py:

Change self.strides = [8, 16, 32] to self.strides = [32, 16]

Change for i in range(3): to for i in range(len(self.strides)):
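
For context, here is a small, hedged illustration (not the repo's actual code) of why the mismatch appears and why iterating over len(self.strides) fixes it: with a 608x608 input, the per-scale grids are 608/8 = 76, 608/16 = 38 and 608/32 = 19, so a hard-coded range(3) can end up pairing a 19x19 prediction with a 76x76 grid.

import torch

# Illustrative only: with two detection scales, iterate over the strides you actually have
strides = [32, 16]      # the strides the model really uses
image_size = 608
for i in range(len(strides)):            # instead of a hard-coded range(3)
    grid = image_size // strides[i]      # 19 for stride 32, 38 for stride 16
    pred = torch.zeros(1, 3, grid, grid)       # dummy prediction at this scale
    offsets = torch.zeros(1, 3, grid, grid)    # grid offsets built for the same scale
    print(strides[i], (pred + offsets).shape)  # shapes agree, so no size-mismatch error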

NotImplementedError: Create your own 'get_image_id' function

In the get_image_id function in dataset.py:

First, comment out this line:

raise NotImplementedError("Create your own 'get_image_id' function")

Then extract the id from the filename according to your own naming convention. For example, for an image named "level1_123.jpg", you could write:

lv, no = os.path.splitext(os.path.basename(filename))[0].split("_")
lv = lv.replace("level", "")
no = f"{int(no):04d}"

view size is not compatible with input tensor’s size and stride

In yolo_layer.py, add contiguous() before view(), e.g.:

det_confs = det_confs.contiguous().view(output.size(0), num_anchors * output.size(2) * output.size(3), 1)

Or simply replace view with reshape (recommended):

det_confs = det_confs.reshape(output.size(0), num_anchors * output.size(2) * output.size(3), 1)
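
A quick standalone illustration of the difference (not the repo's code): view() only works on contiguous tensors, while reshape() copies the data when it has to.

import torch

x = torch.arange(24).reshape(2, 3, 4).permute(0, 2, 1)  # permute makes x non-contiguous
try:
    x.view(2, 12)
except RuntimeError as e:
    print("view failed:", e)             # "view size is not compatible with input tensor's size and stride..."
print(x.contiguous().view(2, 12).shape)  # works after contiguous()
print(x.reshape(2, 12).shape)            # reshape handles it directly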

CUDA error: an illegal memory access was encountered

Upgrade PyTorch. I went straight from 1.8.0 to the latest 1.10.0 and the error disappeared:

pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio===0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

can't convert cuda:0 device type tensor to numpy. 

In utils/plots.py, comment out the check "if isinstance(output, torch.Tensor):" so that this line always executes:

output = output.cpu().numpy()
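
The root cause is that numpy() only works on CPU tensors, so the .cpu() call has to run unconditionally. A minimal standalone illustration:

import torch

t = torch.zeros(3)
if torch.cuda.is_available():
    t = t.cuda()          # on a GPU machine, t.numpy() alone would raise this error
arr = t.cpu().numpy()     # .cpu() makes the conversion safe regardless of device
print(type(arr), arr)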

When using the C++ version

‘dnn’ in namespace ‘cv’ does not name a type

Install the latest version of OpenCV, then include the relevant headers:

#include <opencv2/dnn.hpp>
#include <opencv2/dnn/all_layers.hpp>

Knowledge points

Introduction to YOLOv5 hyperparameters

# Hyperparameter evolution metadata (mutation scale 0-1, lower_limit, upper_limit)
meta = {'lr0': (1, 1e-5, 1e-1),  # initial learning rate (SGD=1E-2, Adam=1E-3)
        'lrf': (1, 0.01, 1.0),  # final cosine-annealing learning rate factor (lr0 * lrf)
        'momentum': (0.3, 0.6, 0.98),  # SGD momentum / Adam beta1
        'weight_decay': (1, 0.0, 0.001),  # optimizer weight decay
        'warmup_epochs': (1, 0.0, 5.0),  # warmup epochs (fractions ok)
        'warmup_momentum': (1, 0.0, 0.95),  # warmup initial momentum
        'warmup_bias_lr': (1, 0.0, 0.2),  # warmup initial bias learning rate
        'box': (1, 0.02, 0.2),  # box (GIoU) loss gain
        'cls': (1, 0.2, 4.0),  # classification loss gain
        'cls_pw': (1, 0.5, 2.0),  # classification BCELoss positive-sample weight
        'obj': (1, 0.2, 4.0),  # objectness loss gain (scales with pixels)
        'obj_pw': (1, 0.5, 2.0),  # objectness BCELoss positive-sample weight
        'iou_t': (0, 0.1, 0.7),  # IoU threshold between labels and anchors
        'anchor_t': (1, 2.0, 8.0),  # label-to-anchor size ratio threshold, i.e. h/h_a and w/w_a must lie within (1/anchor_t, anchor_t)
        'anchors': (2, 2.0, 10.0),  # anchors per output grid (0 to ignore)
        # The remaining entries are data augmentation gains, covering both color space and image space
        'fl_gamma': (0, 0.0, 2.0),  # focal loss gamma (EfficientDet default gamma=1.5)
        'hsv_h': (1, 0.0, 0.1),  # image HSV-Hue augmentation (fraction)
        'hsv_s': (1, 0.0, 0.9),  # image HSV-Saturation augmentation (fraction)
        'hsv_v': (1, 0.0, 0.9),  # image HSV-Value augmentation (fraction)
        'degrees': (1, 0.0, 45.0),  # image rotation (+/- degrees)
        'translate': (1, 0.0, 0.9),  # image horizontal/vertical translation (+/- fraction)
        'scale': (1, 0.0, 0.9),  # image scale (+/- gain)
        'shear': (1, 0.0, 10.0),  # image shear (+/- degrees)
        'perspective': (0, 0.0, 0.001),  # image perspective (+/- fraction), range 0-0.001
        'flipud': (1, 0.0, 1.0),  # image flip up-down (probability)
        'fliplr': (0, 0.0, 1.0),  # image flip left-right (probability)
        'mosaic': (1, 0.0, 1.0),  # image mosaic (probability)
        'mixup': (1, 0.0, 1.0),  # image mixup (probability)
        'copy_paste': (1, 0.0, 1.0)}  # segment copy-paste (probability)
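
For reference, each meta tuple is (mutation scale, lower limit, upper limit). Below is a simplified sketch of how the evolve step mutates a hyp dict within those limits; it loosely follows the logic in train.py but is not the exact implementation, and the example values are illustrative.

import numpy as np

def mutate(hyp, meta, mp=0.8, sigma=0.2):
    # Mutate hyperparameters in place within the (lower, upper) limits from meta.
    # A gain of 0 in meta[k][0] disables mutation for that key.
    npr = np.random
    gains = np.array([meta[k][0] for k in hyp])
    factors = np.ones(len(gains))
    while all(factors == 1):  # resample until at least one value actually changes
        factors = (gains * (npr.random(len(gains)) < mp) * npr.randn(len(gains))
                   * npr.random() * sigma + 1).clip(0.3, 3.0)
    for f, k in zip(factors, hyp):
        hyp[k] = float(hyp[k] * f)            # mutate
        hyp[k] = max(hyp[k], meta[k][1])      # clamp to lower limit
        hyp[k] = min(hyp[k], meta[k][2])      # clamp to upper limit
        hyp[k] = round(hyp[k], 5)
    return hyp

# Example with a few hyperparameters
hyp = {'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005}
meta = {'lr0': (1, 1e-5, 1e-1), 'momentum': (0.3, 0.6, 0.98), 'weight_decay': (1, 0.0, 0.001)}
print(mutate(hyp, meta))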

Introduction to the YOLOv5n6.yaml file

# YOLOv5 🚀 by Ultralytics, GPL-3.0 license

# Parameters
nc: 80  # number of object classes
depth_multiple: 0.33  # model depth multiple: controls the number of module repeats; when a module's number is not 1, repeats = number * depth_multiple (see the scaling sketch after this file)
width_multiple: 0.25  # model width multiple: controls the number of convolution kernels; kernels = number * width_multiple
anchors:
  - [19,27,  44,40,  38,94]  # P3/8 detects small objects; 19,27 is one anchor size, three sizes per scale
  - [96,68,  86,152,  180,137]  # P4/16
  - [140,301,  303,264,  238,542]  # P5/32
  - [436,615,  739,380,  925,792]  # P6/64 detects large objects

# YOLOv5 v6.0 backbone
backbone:
  # from   (column 1) which layer the input comes from: -1 means the previous layer, 4 means layer 4
  # number (column 2) number of module repeats; the final count is scaled by depth_multiple
  # module (column 3) module name, e.g. Conv, Focus, BottleneckCSP, SPP
          # Focus, [64, 3]: slicing operation on the feature map; the module args [64, 3] are parsed into [3, 64*width_multiple, 3], i.e. 3 input channels (RGB), 64*width_multiple = 16 output channels here, and a 3x3 kernel
          # Conv, [512, 3, 2]: Conv = conv + BN + activation. 512 is the number of kernels (scaled by width_multiple), 3 means a 3x3 kernel, 2 is the stride.
          # BottleneckCSP, [1024, False]: borrows the CSPNet structure, built from three conv layers plus X Res-unit modules concatenated; with the False argument the Res units are dropped in favor of conv + BN + activation.
          # SPP, [1024, [5, 9, 13]]: multi-scale fusion via 1x1 conv plus 5x5, 9x9 and 13x13 max pooling.
  # args   (column 4) module arguments
  [[-1, 1, Conv, [64, 6, 2, 2]], # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],   # 1-P2/4 number of kernels = 128 * width_multiple = 32
   [-1, 3, C3, [128]],           # number of modules = 3 * depth_multiple = 3*0.33 = 1
   [-1, 1, Conv, [256, 3, 2]],   # 3-P3/8
   [-1, 6, C3, [256]],           # number of modules = 6 * depth_multiple = 6*0.33 = 2
   [-1, 1, Conv, [512, 3, 2]],   # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [768, 3, 2]],   # 7-P5/32
   [-1, 3, C3, [768]],
   [-1, 1, Conv, [1024, 3, 2]],  # 9-P6/64
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],     # 11
  ]

# YOLOv5 v6.0 head
# consists of two parts: the Neck and the Detect head
head:
  [[-1, 1, Conv, [768, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']], # upsample
   [[-1, 8], 1, Concat, [1]],  # cat backbone P5: concatenate the previous layer with layer 8
   [-1, 3, C3, [768, False]],  # 15 (layer 15)

   [-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4: concatenate the previous layer with layer 6
   [-1, 3, C3, [512, False]],  # 19 (layer 19)

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3: concatenate the previous layer with layer 4
   [-1, 3, C3, [256, False]],  # 23 (P3/8-small, layer 23)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 20], 1, Concat, [1]],  # cat head P4: concatenate the previous layer with layer 20
   [-1, 3, C3, [512, False]],  # 26 (P4/16-medium, layer 26)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 16], 1, Concat, [1]],  # cat head P5: concatenate the previous layer with layer 16
   [-1, 3, C3, [768, False]],  # 29 (P5/32-large, layer 29)

   [-1, 1, Conv, [768, 3, 2]],
   [[-1, 12], 1, Concat, [1]],  # cat head P6: concatenate the previous layer with layer 12
   [-1, 3, C3, [1024, False]],  # 32 (P6/64-xlarge, layer 32)

   [[23, 26, 29, 32], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5, P6): takes layers 23/26/29/32 as inputs
  ]
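
To make the depth_multiple / width_multiple comments above concrete, here is a rough sketch of how the model parser scales module repeats and channel counts. It is simplified from YOLOv5's parse_model; the helper names here are only for illustration.

import math

depth_multiple, width_multiple = 0.33, 0.25  # YOLOv5n6 values from the yaml above

def make_divisible(x, divisor=8):
    # Round channels up to the nearest multiple of divisor, as YOLOv5 does
    return math.ceil(x / divisor) * divisor

def scaled_repeats(n):
    # Module repeats after depth scaling (never drops below 1)
    return max(round(n * depth_multiple), 1) if n > 1 else n

def scaled_channels(c):
    # Output channels after width scaling
    return make_divisible(c * width_multiple)

print(scaled_repeats(3), scaled_repeats(6), scaled_repeats(9))           # 1 2 3
print(scaled_channels(64), scaled_channels(128), scaled_channels(1024))  # 16 32 256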

Code snippets

visdrone2yolo.py

Adapted from the official VisDrone.yaml; place it in the same directory as train.py and run it:

from utils.general import download, os, Path
from glob import glob

def visdrone2yolo(dir):
    from PIL import Image
    from tqdm import tqdm

    def convert_box(size, box):
        # Convert VisDrone box to YOLO xywh box
        dw = 1. / size[0]
        dh = 1. / size[1]
        return (box[0] + box[2] / 2) * dw, (box[1] + box[3] / 2) * dh, box[2] * dw, box[3] * dh

    os.makedirs(dir + '/labels', exist_ok=True)  # make labels directory (ok if it already exists)
    pbar = tqdm(glob(dir + '/annotations/'+'*.txt'), desc=f'Converting {dir}')
    for f in pbar:
        img_size = Image.open((dir + '/images/' + f.split('/')[-1].split('.')[0]) + '.jpg').size
        lines = []
        with open(f, 'r') as file:  # read annotation.txt
            for row in [x.split(',') for x in file.read().strip().splitlines()]:
                if row[4] == '0':  # VisDrone 'ignored regions' class 0
                    continue
                cls = int(row[5]) - 1
                box = convert_box(img_size, tuple(map(int, row[:4])))
                lines.append(f"{cls} {' '.join(f'{x:.6f}' for x in box)}\n")
                with open(str(f).replace(os.sep + 'annotations' + os.sep, os.sep + 'labels' + os.sep), 'w') as fl:
                    fl.writelines(lines)  # write label.txt

# Download
# dir = Path(yaml['path'])  # dataset root dir
# urls = ['https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-train.zip',
#         'https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-val.zip',
#         'https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-test-dev.zip',
#         'https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-test-challenge.zip']
# download(urls, dir=dir)
# print(dir)
# Convert
for d in 'datasets/VisDrone/VisDrone2019-DET-train', 'datasets/VisDrone/VisDrone2019-DET-val', 'datasets/VisDrone/VisDrone2019-DET-test-dev':
    visdrone2yolo(d)  # convert VisDrone annotations to YOLO labels

A version found online; not very convenient to use:

import os
from os import getcwd
from PIL import Image
import xml.etree.ElementTree as ET
import random

#root_dir = "train/"
root_dir = "/home/shizheng/sxf/yolov5/datasets/VisDrone/VisDrone2019-DET-train/"
annotations_dir = root_dir+"annotations/"
image_dir = root_dir + "images/"
label_dir = root_dir + "labels/"
# label_dir = root_dir + "images/"    # in YOLO the labels can also sit next to the images
xml_dir = root_dir+"annotations_voc/"  # note: create this folder first; once the run finishes, the original annotations folder is no longer needed and annotations_voc can be renamed to annotations
data_split_dir = root_dir + "train_namelist/"

sets = ['train', 'test','val']
class_name = ['ignored regions', 'pedestrian','people','bicycle','car', 'van', 'truck', 'tricycle','awning-tricycle', 'bus','motor','others']


def visdrone2voc(annotations_dir, image_dir, xml_dir):
    for filename in os.listdir(annotations_dir):
        fin = open(annotations_dir + filename, 'r')
        image_name = filename.split('.')[0]
        img = Image.open(image_dir + image_name + ".jpg")
        xml_name = xml_dir + image_name + '.xml'
        with open(xml_name, 'w+') as fout:
            fout.write('<annotation>' + '\n')

            fout.write('\t' + '<folder>VOC2007</folder>' + '\n')
            fout.write('\t' + '<filename>' + image_name + '.jpg' + '</filename>' + '\n')

            fout.write('\t' + '<source>' + '\n')
            fout.write('\t\t' + '<database>' + 'VisDrone2018 Database' + '</database>' + '\n')
            fout.write('\t\t' + '<annotation>' + 'VisDrone2018' + '</annotation>' + '\n')
            fout.write('\t\t' + '<image>' + 'flickr' + '</image>' + '\n')
            fout.write('\t\t' + '<flickrid>' + 'Unspecified' + '</flickrid>' + '\n')
            fout.write('\t' + '</source>' + '\n')

            fout.write('\t' + '<owner>' + '\n')
            fout.write('\t\t' + '<flickrid>' + 'Haipeng Zhang' + '</flickrid>' + '\n')
            fout.write('\t\t' + '<name>' + 'Haipeng Zhang' + '</name>' + '\n')
            fout.write('\t' + '</owner>' + '\n')

            fout.write('\t' + '<size>' + '\n')
            fout.write('\t\t' + '<width>' + str(img.size[0]) + '</width>' + '\n')
            fout.write('\t\t' + '<height>' + str(img.size[1]) + '</height>' + '\n')
            fout.write('\t\t' + '<depth>' + '3' + '</depth>' + '\n')
            fout.write('\t' + '</size>' + '\n')

            fout.write('\t' + '<segmented>' + '0' + '</segmented>' + '\n')

            for line in fin.readlines():
                line = line.split(',')
                fout.write('\t' + '<object>' + '\n')
                fout.write('\t\t' + '<name>' + class_name[int(line[5])] + '</name>' + '\n')
                fout.write('\t\t' + '<pose>' + 'Unspecified' + '</pose>' + '\n')
                fout.write('\t\t' + '<truncated>' + line[6] + '</truncated>' + '\n')
                fout.write('\t\t' + '<difficult>' + str(int(line[7])) + '</difficult>' + '\n')
                fout.write('\t\t' + '<bndbox>' + '\n')
                fout.write('\t\t\t' + '<xmin>' + line[0] + '</xmin>' + '\n')
                fout.write('\t\t\t' + '<ymin>' + line[1] + '</ymin>' + '\n')
                # pay attention to this point!(0-based)
                fout.write('\t\t\t' + '<xmax>' + str(int(line[0]) + int(line[2]) - 1) + '</xmax>' + '\n')
                fout.write('\t\t\t' + '<ymax>' + str(int(line[1]) + int(line[3]) - 1) + '</ymax>' + '\n')
                fout.write('\t\t' + '</bndbox>' + '\n')
                fout.write('\t' + '</object>' + '\n')

            fin.close()
            fout.write('</annotation>')

def data_split(xml_dir, data_split_dir):
    trainval_percent = 0.2
    train_percent = 0.9
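    # Note: as written, trainval_percent of the files go to trainval.txt; within that
    # subset train_percent are written to test.txt and the rest to val.txt, while all
    # remaining files go to train.txt -- double-check this split is what you want.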
    total_xml = os.listdir(xml_dir)
    if not os.path.exists(data_split_dir):
        os.makedirs(data_split_dir)
    num = len(total_xml)
    list = range(num)
    tv = int(num * trainval_percent)
    tr = int(tv * train_percent)
    trainval = random.sample(list, tv)
    train = random.sample(trainval, tr)

    ftrainval = open(data_split_dir+'/trainval.txt', 'w+')
    ftest = open(data_split_dir+'/test.txt', 'w+')
    ftrain = open(data_split_dir+'/train.txt', 'w+')
    fval = open(data_split_dir+'/val.txt', 'w+')

    for i in list:
        name = total_xml[i][:-4] + '\n'
        if i in trainval:
            ftrainval.write(name)
            if i in train:
                ftest.write(name)
            else:
                fval.write(name)
        else:
            ftrain.write(name)

    ftrainval.close()
    ftrain.close()
    fval.close()
    ftest.close()


def convert(size, box):
    dw = 1. / size[0]
    dh = 1. / size[1]
    x = (box[0] + box[1]) / 2.0
    y = (box[2] + box[3]) / 2.0
    w = box[1] - box[0]
    h = box[3] - box[2]
    x = x * dw
    w = w * dw
    y = y * dh
    h = h * dh
    return (x, y, w, h)

def convert_annotation_voc(xml_dir, label_dir, image_name):
    in_file = open(xml_dir + '%s.xml' % (image_name))
    out_file = open(label_dir + '%s.txt' % (image_name), 'w+')
    tree = ET.parse(in_file)
    root = tree.getroot()
    size = root.find('size')
    w = int(size.find('width').text)
    h = int(size.find('height').text)

    for obj in root.iter('object'):
        difficult = obj.find('difficult').text
        cls = obj.find('name').text
        if cls not in class_name or int(difficult) == 1:
            continue
        cls_id = class_name.index(cls)
        xmlbox = obj.find('bndbox')
        b = (float(xmlbox.find('xmin').text), float(xmlbox.find('xmax').text), float(xmlbox.find('ymin').text),
             float(xmlbox.find('ymax').text))
        bb = convert((w, h), b)
        if cls_id != 0:  # skip class 0 (ignored regions)
            if cls_id != 11:  # skip class 11 (others)
                out_file.write(str(cls_id - 1) + " " + " ".join([str(a) for a in bb]) + '\n')  # shift the remaining class ids down by 1; adjust to your own needs

def voc2yolo(xml_dir, image_dir, label_dir):
    wd = getcwd()
    print(wd)
    for image_set in sets:
        if not os.path.exists(label_dir):
            os.makedirs(label_dir)
        image_names = open(data_split_dir+'%s.txt' % (image_set)).read().strip().split()
        list_file = open(root_dir + '%s.txt' % (image_set), 'w+')
        for image_name in image_names:
            list_file.write(image_dir+'%s.jpg\n' % (image_name))
            convert_annotation_voc(xml_dir, label_dir, image_name)
        list_file.close()




if __name__ == '__main__':
    visdrone2voc(annotations_dir, image_dir, xml_dir)  # convert VisDrone annotations to VOC XML
    data_split(xml_dir, data_split_dir)                # split the dataset into train/val/test lists
    voc2yolo(xml_dir, image_dir, label_dir)            # convert VOC XML to YOLO-format txt labels