[In Practice] An Approach to Exporting Large Volumes of Data

Background

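The business needs to export roughly 60,000 rows covering the whole organization. Some of the values are not stored in any table and have to be computed on the fly. This is a summary feature and it touches a lot of tables (close to 40). That is honestly a ridiculous number of tables, and the number of fields to display is just as bad. I really have no idea what the original design was thinking.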

Approach

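Reworking the architecture or the table structure is not an option, so the only choice is to keep building on these 40 tables. The idea I settled on is to query separately, in batches. First fetch the main data, then write several child queries for the values that need computing. Each child query filters with in and handles a few thousand ids at a time. The main rows are processed batch by batch this way, and the results are merged at the end.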

Implementation

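This is the main query. It returns the primary data, including project_id, which then serves as the filter condition for the child queries. The * here stands in for the actual column list.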

 select * from
 serp_em_project_info a
 left join serp_em_construction_plan b
 on a.project_id = b.project_id
 left join serp_em_project_account_info c
 on a.project_id = c.project_id
 left join js_sys_office o
 on a.office_code = o.office_code
 left join serp_base_project_developer d
 on a.project_developer = d.project_developer_id
 left join js_sys_area e
 on a.construction_area = e.area_code
 left join js_sys_office uo
 on uo.office_code = a.undertaking_office
 left join js_sys_user creator
 on creator.user_code = a.project_manager
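
The child queries themselves are not shown in this post, and neither is how the DAO builds them. As a rough sketch only, a child query keyed on project_id could be assembled with one placeholder per id, as below. The table name hypothetical_borrow_table, the amount column, and the InClauseSketch class are all made up for illustration; they are not the real schema or code.

```java
import java.util.Collections;

final class InClauseSketch {
    /** Builds "?, ?, ?" with one placeholder per id. */
    static String placeholders(int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        return String.join(", ", Collections.nCopies(count, "?"));
    }

    /** Hypothetical child query: the table and the amount column are invented for illustration. */
    static String childQuery(int idCount) {
        return "select project_id, sum(amount) as authorized"
                + " from hypothetical_borrow_table"
                + " where project_id in (" + placeholders(idCount) + ")"
                + " group by project_id";
    }

    public static void main(String[] args) {
        // Print the statement for a batch of three ids
        System.out.println(childQuery(3));
    }
}
```

Some databases cap the number of items in an in list (Oracle, for example, allows at most 1000), so the batch size may have to stay under that limit.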

Next comes the batching. The argument passed to sumAmountHandlerList is the main data returned by the query above.

 public List<Map<String, Object>> sumAmountHandlerList(List<Map<String, Object>> list) {

     // Batch size: up to 2500 project_id values go into one collection
     int batchNumber = 2500;

     // Number of batches, i.e. outer-loop iterations
     int iteratorCount = (int) Math.ceil((double) list.size() / batchNumber);
     for (int i = 0; i < iteratorCount; i++) {
         int startIndex = i * batchNumber;
         // Every batch but the last is full; the last one ends at the final element
         int endIndex = (i < iteratorCount - 1) ? (startIndex + batchNumber - 1) : (list.size() - 1);

         // Collect the project_id values of this batch
         List<String> projectIds = new ArrayList<>();
         for (int j = startIndex; j <= endIndex; j++) {
             projectIds.add((String) list.get(j).get("project_id"));
         }

         // Child query: authorized loan total and authorized loan amount, computed on the fly
         List<Map<String, Object>> individualBorrowList = projectReportDao.individualBorrowCollection(projectIds);

         // Merge the child-query results into the rows of this batch
         for (int j = startIndex; j <= endIndex; j++) {
             Map<String, Object> item = list.get(j);

             // Authorized loan amount (authorized), defaulting to 0 when there is no match
             item.put("authorized", 0D);
             for (Map<String, Object> authorizedItem : individualBorrowList) {
                 if (((String) authorizedItem.get("project_id")).equals(item.get("project_id"))) {
                     item.put("authorized", Double.parseDouble(authorizedItem.get("authorized").toString()));
                     break;
                 }
             }
         }
     }
     return list;
 }

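The part worth explaining is how the batching works. Take the total number of main rows and divide it by the batch size, rounding up. That gives the number of batches needed to cover all the data, which is also the number of outer-loop iterations.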

     // Loop
     int iteratorCount = (int) Math.ceil((double) list.size() / batchNumber);
     for (int i = 0; i < iteratorCount; i++) {
         int startIndex = i * batchNumber;
         int endIndex = (i < iteratorCount - 1) ? (startIndex + batchNumber - 1) : (list.size() - 1);

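The first inner loop starts at index 0, the second at 1 * batchNumber, and so on: int startIndex = i * batchNumber;

The first inner loop ends at batchNumber - 1, because indexes start at 0. The second starts at 2500 and ends at 2500 + batchNumber - 1.

The last inner loop ends at total - 1.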

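This part is a bit convoluted, but once you have worked through it the rest is simple. Plugging in concrete numbers makes it much easier to follow.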
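
To put concrete numbers on it, here is a small standalone sketch that prints the index range of each batch for 6,000 rows and a batch size of 2500. The BatchRanges class and its ranges method are hypothetical helpers written for this illustration and are not part of the export code. The end index is computed with Math.min, so no batch can hold more than batchNumber rows.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchRanges {
    /** Returns inclusive {startIndex, endIndex} pairs covering indexes 0..total-1. */
    static List<int[]> ranges(int total, int batchNumber) {
        List<int[]> result = new ArrayList<>();
        int iteratorCount = (int) Math.ceil((double) total / batchNumber);
        for (int i = 0; i < iteratorCount; i++) {
            int startIndex = i * batchNumber;
            // Full batches end at startIndex + batchNumber - 1; the last one ends at total - 1
            int endIndex = Math.min(startIndex + batchNumber, total) - 1;
            result.add(new int[] {startIndex, endIndex});
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] r : ranges(6000, 2500)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
        // prints:
        // 0 .. 2499
        // 2500 .. 4999
        // 5000 .. 5999
    }
}
```

With 6,000 rows there are ceil(6000 / 2500) = 3 batches: two full ones and a final batch of 1,000 rows.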

Then, inside the inner loop, rows are matched on project_id, and the data from matching rows is merged together.
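
The matching step scans the child-query results once for every main row. One possible alternative, not what the export code does, is to index the child rows by project_id first and look each main row up in that index. This is a minimal sketch under that assumption; the AuthorizedMerge class and its merge method are hypothetical names, and the map keys project_id and authorized follow the ones used above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AuthorizedMerge {
    /** Sets "authorized" on every row of the batch, using 0 when no child row matches. */
    static void merge(List<Map<String, Object>> batch, List<Map<String, Object>> childRows) {
        Map<String, Double> byProjectId = new HashMap<>();
        for (Map<String, Object> child : childRows) {
            String id = (String) child.get("project_id");
            // putIfAbsent keeps the first row per project_id, like the break in a nested loop
            byProjectId.putIfAbsent(id, Double.parseDouble(child.get("authorized").toString()));
        }
        for (Map<String, Object> item : batch) {
            String id = (String) item.get("project_id");
            item.put("authorized", byProjectId.getOrDefault(id, 0D));
        }
    }

    public static void main(String[] args) {
        List<Map<String, Object>> batch = new ArrayList<>();
        Map<String, Object> row = new HashMap<>();
        row.put("project_id", "p1");
        batch.add(row);

        List<Map<String, Object>> childRows = new ArrayList<>();
        Map<String, Object> child = new HashMap<>();
        child.put("project_id", "p1");
        child.put("authorized", "12.5");
        childRows.add(child);

        merge(batch, childRows);
        System.out.println(batch.get(0).get("authorized")); // prints 12.5
    }
}
```

With a batch of 2500 rows this replaces up to 2500 x 2500 comparisons with one pass over each list, though whether that matters next to the query time would need measuring.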

Other options

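At the time I also considered a scheduled job that computes everything in the early hours and writes it to a single table. Querying one table would certainly perform better. But the data would lag: anything approved today would be invisible to the business users until the next night's job had finished. With the batched child-query approach, a query comes back in under 1 s. The export takes five or six minutes because of the data volume (close to 20 MB), which the business users can live with.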

Does anyone have a better approach? Discussion is welcome.