D-FINE TensorRT Inference

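Last week I came across D-FINE, a newly released detection model. I had planned to start this blog post that same week, but other things got in the way and it slipped to today. I won't go through the overall model design or the individual modules here. That said, refining the bbox by predicting a distribution instead of regressing the position directly is a good idea, and the same "knowledge" can also be passed on to the earlier layers. These are what the paper calls FDR and GO-LSD.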

1. Exporting to ONNX

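Many modules in the author's repository use a registry-style design, so you may need to search the code to find where a given module is defined. For export, start with the provided tools/inference/trt_inf.py: given the input image and its size, it returns the bbox, score, and label. All of this happens in the post-processing module, where the author first converts cxcywh to xyxy format and then multiplies by the image size to obtain the final bbox.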
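To make the original post-processing step concrete, here is a minimal plain-Python sketch of the cxcywh to xyxy conversion followed by scaling to the image size. The helper names `cxcywh_to_xyxy` and `scale_to_image` and the sample values are made up for illustration; the repository itself does this with tensor operations.

```python
def cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def scale_to_image(box_xyxy, img_w, img_h):
    """Scale a normalized xyxy box (values in [0, 1]) to pixel coordinates."""
    x1, y1, x2, y2 = box_xyxy
    return (x1 * img_w, y1 * img_h, x2 * img_w, y2 * img_h)


# A normalized box centered in the image, a quarter as wide and half as tall.
normalized = (0.5, 0.5, 0.25, 0.5)
box = scale_to_image(cxcywh_to_xyxy(normalized), img_w=1280, img_h=720)
print(box)  # (480.0, 180.0, 800.0, 540.0)
```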

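However, this input/output layout does not match the YOLO-based inference code I wrote earlier. To avoid changing too much of my local code, I modified the post-processing here instead. The main idea is to keep the bbox in cxcywh format and multiply it directly by 640. to get coordinates at the 640*640 input size, so that the existing YOLO post-processing can be reused afterwards.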

    def forward(self, outputs):
        # boxes are normalized cxcywh, shape [B, num_queries, 4]
        logits, boxes = outputs['pred_logits'], outputs['pred_boxes']

        if self.use_focal_loss:
            scores = F.sigmoid(logits)
            # top-k over the flattened (query, class) scores
            scores, index = torch.topk(scores.flatten(1), self.num_top_queries, dim=-1)
            # TODO for older tensorrt
            # labels = index % self.num_classes
            labels = mod(index, self.num_classes)
            index = index // self.num_classes
            # keep cxcywh, and gather the boxes in the same order as the top-k scores
            boxes = boxes.gather(dim=1, index=index.unsqueeze(-1).repeat(1, 1, boxes.shape[-1]))

        else:
            scores = F.softmax(logits, dim=-1)[:, :, :-1]
            scores, labels = scores.max(dim=-1)
            if scores.shape[1] > self.num_top_queries:
                scores, index = torch.topk(scores, self.num_top_queries, dim=-1)
                labels = torch.gather(labels, dim=1, index=index)
                boxes = torch.gather(boxes, dim=1, index=index.unsqueeze(-1).tile(1, 1, boxes.shape[-1]))

        # TODO for onnx export
        if self.deploy_mode:
            scores = scores.unsqueeze(-1)
            labels = labels.unsqueeze(-1).to(scores.dtype)  # cast so all columns share one dtype
            boxes = boxes * 640.  # cxcywh in pixels of the 640x640 input
            return torch.cat([boxes, scores, labels], dim=-1)  # [B, N, 6]: cx, cy, w, h, score, label
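The focal-loss branch takes the top-k over the flattened (query, class) scores, so each selected index has to be split back into a query index and a class label. Below is a small plain-Python reference of that indexing, meant only as a sketch for sanity-checking the exported output on tiny inputs. The function name `decode_topk` and the sample values are hypothetical, and the output row layout (cx, cy, w, h, score, label) is the one described above.

```python
def decode_topk(scores, boxes, num_top_queries, input_size=640.0):
    """Reference decode for the focal-loss branch.

    scores: list of per-query lists of sigmoid class scores
    boxes:  list of per-query normalized (cx, cy, w, h) tuples
    Returns rows of [cx, cy, w, h, score, label] at `input_size` scale.
    """
    num_classes = len(scores[0])
    flat = [s for row in scores for s in row]
    # Indices of the highest scores over all (query, class) pairs.
    order = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)
    rows = []
    for i in order[:num_top_queries]:
        # flat index = query * num_classes + label
        query, label = divmod(i, num_classes)
        cx, cy, w, h = (v * input_size for v in boxes[query])
        rows.append([cx, cy, w, h, flat[i], float(label)])
    return rows


scores = [[0.1, 0.9], [0.8, 0.2]]           # 2 queries, 2 classes
boxes = [(0.5, 0.5, 0.25, 0.25), (0.25, 0.25, 0.5, 0.5)]
rows = decode_topk(scores, boxes, num_top_queries=2)
for row in rows:
    print(row)
```

Each output row carries the box of the query that produced the score, which is the property the downstream decode kernel relies on.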

2. TensorRT Inference

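First, note that TensorRT 10.x is required; otherwise the inference results are abnormal (in my tests, version 8.6 produced abnormal results). In addition, on Ampere GPUs, adding --fp16 when building the engine causes a large drop in accuracy (see the related issue), so decide whether to add --fp16 based on your GPU.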
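One way to keep the --fp16 decision explicit is to assemble the engine-build command in a small helper. The sketch below assumes the engine is built with trtexec using its --onnx, --saveEngine, and --fp16 options. The helper name `build_trtexec_cmd` and the ONNX file name are hypothetical, and whether to enable FP16 is left to the caller.

```python
import shlex


def build_trtexec_cmd(onnx_path, engine_path, use_fp16=False):
    """Build a trtexec argument list; FP16 is opt-in and off by default."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if use_fp16:
        # Only enable after checking accuracy on the target GPU.
        cmd.append("--fp16")
    return cmd


cmd = build_trtexec_cmd("dfine_s.onnx", "dfine_s_v10.engine", use_fp16=False)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where trtexec is on the PATH.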

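For the TensorRT wrapper you can refer to some open-source libraries, such as github.com/shouxieai/t…, github.com/l-sf/Linfer, and github.com/Melody-Zhou…. Most of them target TensorRT 8.x, so some of the API calls need to be updated for 10.x; the TensorRT documentation covers these changes.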

Post-processing

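Because the modified D-FINE post-processing already returns YOLO-like bbox, score, and label values, the earlier kernel can be reused here, and NMS is no longer needed. The code is roughly as follows: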

    static __global__ void decode_kernel_dfine(float *predict, int num_bboxes, float confidence_threshold, float* invert_affine_matrix, float* parray, int MAX_IMAGE_BOXES){
        int position = blockDim.x * blockIdx.x + threadIdx.x;
        if(position >= num_bboxes)  return;

        float* pitem     = predict + 6 * position;
        float confidence = *(pitem + 4);
        float label      = *(pitem + 5);
        if(confidence < confidence_threshold)
            return;

        int index = atomicAdd(parray, 1);
        if(index >= MAX_IMAGE_BOXES)
            return;

        float cx = *pitem++;
        float cy = *pitem++;
        float width = *pitem++;
        float height = *pitem++;
        float left = cx - width * 0.5f;
        float top = cy - height * 0.5f;
        float right = cx + width * 0.5f;
        float bottom = cy + height * 0.5f;
        affine_project(invert_affine_matrix, left,  top,    &left,  &top);
        affine_project(invert_affine_matrix, right, bottom, &right, &bottom);
        float* pout_item = parray + 1 + index * 6;
        *pout_item++ = left;
        *pout_item++ = top;
        *pout_item++ = right;
        *pout_item++ = bottom;
        *pout_item++ = confidence;
        *pout_item++ = label;
    }
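When debugging the kernel it helps to have a CPU reference to compare against. The following plain-Python sketch mirrors the same steps: threshold on confidence, convert cxcywh to corner coordinates, then map both corners back to the original image. It assumes `affine_project` applies a 2x3 row-major affine matrix, which is not shown in this post, and the function name `decode_dfine_cpu` and the sample values are hypothetical.

```python
def affine_project(matrix, x, y):
    """Apply a 2x3 row-major affine matrix [a, b, tx, c, d, ty] (assumed layout)."""
    ox = matrix[0] * x + matrix[1] * y + matrix[2]
    oy = matrix[3] * x + matrix[4] * y + matrix[5]
    return ox, oy


def decode_dfine_cpu(predict, confidence_threshold, invert_affine_matrix, max_boxes):
    """predict: rows of (cx, cy, w, h, confidence, label) in network-input pixels."""
    out = []
    for cx, cy, w, h, confidence, label in predict:
        if confidence < confidence_threshold:
            continue
        if len(out) >= max_boxes:
            break  # same cap as MAX_IMAGE_BOXES in the kernel
        left, top = affine_project(invert_affine_matrix, cx - w * 0.5, cy - h * 0.5)
        right, bottom = affine_project(invert_affine_matrix, cx + w * 0.5, cy + h * 0.5)
        out.append((left, top, right, bottom, confidence, label))
    return out


# Inverse of a uniform 0.5x resize (1280x1280 image -> 640x640 input): scale by 2.
inv = [2.0, 0.0, 0.0, 0.0, 2.0, 0.0]
preds = [
    (320.0, 320.0, 160.0, 160.0, 0.9, 1.0),  # kept
    (100.0, 100.0, 50.0, 50.0, 0.3, 0.0),    # below threshold, dropped
]
result = decode_dfine_cpu(preds, 0.5, inv, max_boxes=1024)
print(result)  # [(480.0, 480.0, 800.0, 800.0, 0.9, 1.0)]
```

The GPU kernel writes boxes in whatever order threads reach the atomicAdd, so compare the two outputs as sets (or sort them first) rather than row by row.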

Inference test

With the wrapper in place, run a quick test:

void test_dfine() {
    std::string engine_file = "D-FINE/dfine_s_v10.engine";
    auto engine = DFine::create_infer(
            engine_file,                // engine file
            0,                                      // gpu id
            0.5f,                                  // confidence threshold
            0.65f,                                  // nms threshold
            DFine::NMSMethod::FastGPU,               // NMS method, fast GPU / CPU
            false                                   // preprocess use multi stream
    );
    if (engine == nullptr) {
        INFOE("Engine is nullptr");
        return;
    }
    int batch = 1;
    vector <cv::Mat> images{cv::imread("../car.jpg")};
    for (int i = images.size(); i < batch; ++i) images.push_back(images[i % 1]);


    // warmup
    vector <shared_future<DFine::BoxArray>> boxes_array;
    for (int i = 0; i < 10; ++i)
        boxes_array = engine->commits(images);
    boxes_array.back().get();
    boxes_array.clear();

    const int ntest = 100;
    auto begin_timer = iLogger::timestamp_now_float();

    for (int i = 0; i < ntest; ++i)
        boxes_array = engine->commits(images);

    // wait all result
    boxes_array.back().get();
    float inference_average_time = (iLogger::timestamp_now_float() - begin_timer) / ntest / images.size();
    INFO("%s[%s] average: %.2f ms / image, FPS: %.2f", engine_file.c_str(), "DFine", inference_average_time,
         1000 / inference_average_time);


    cv::Mat image = cv::imread("../car.jpg");
    if (image.empty()) {
        INFOE("Image is empty");
        return;
    }

    auto boxes = engine->commit(image).get();
    uint8_t b, g, r;
    for (auto &obj: boxes) {
        tie(b, g, r) = iLogger::random_color(obj.class_label);
        cv::rectangle(image, cv::Point(obj.left, obj.top), cv::Point(obj.right, obj.bottom), cv::Scalar(b, g, r), 5);

        auto name = cocolabels[obj.class_label];
        auto caption = iLogger::format("%s %.2f", name, obj.confidence);
        int width = cv::getTextSize(caption, 0, 1, 2, nullptr).width + 10;
        cv::rectangle(image, cv::Point(obj.left - 3, obj.top - 33), cv::Point(obj.left + width, obj.top),
                      cv::Scalar(b, g, r), -1);
        cv::putText(image, caption, cv::Point(obj.left, obj.top - 5), 0, 1, cv::Scalar::all(0), 2, 16);
    }
    INFO("Save to Result.jpg, %d objects", boxes.size());
    cv::imwrite("Result.jpg", image);
    delete engine;
    engine = nullptr;
}

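Because this runs FP32 inference and does not use a direct end-to-end pipeline, it is somewhat slower. If I find the time, I plan to publish an end-to-end version later.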