What is the optimal way to import 100 million rows of Excel data into a database?


Importing 100 million rows of Excel data in a Spring Boot application is a classic large-scale batch-processing scenario. It requires careful attention to memory usage, throughput, and system stability. Below is a concrete scenario and a detailed solution.

Scenario

Suppose we run an e-commerce platform and need to import 100 million historical order records into a MySQL database. Each order contains an ID, user ID, product info, amount, order time, and other fields. Note that a single .xlsx sheet holds at most 1,048,576 rows, so a data set this size will in practice span many sheets or files.

The optimal solution

1. Technology choices

  • Excel parsing: Apache POI's SAX-based streaming reader (POI's SXSSF mode is for streaming *writes*, not reads) or EasyExcel (Alibaba's open-source Excel library, which streams reads out of the box)
  • Batch framework: Spring Batch
  • Database: MySQL with JDBC batch inserts (see the data-source sketch right after this list)
  • Parallelism: multi-threaded, partitioned processing
  • Database tuning: disable autocommit, temporarily disable index maintenance, etc.
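
With MySQL, JDBC batches are only rewritten into multi-row INSERT statements when the Connector/J flag rewriteBatchedStatements=true is set on the JDBC URL; without it, the driver still sends one statement per row. A minimal data-source sketch using HikariCP (Spring Boot's default pool); the host, schema, and credentials below are placeholders, not values from this article:

@Configuration
public class DataSourceConfig {

    @Bean
    public DataSource dataSource() {
        HikariConfig config = new HikariConfig();
        // rewriteBatchedStatements=true lets Connector/J collapse a JDBC batch
        // into multi-row INSERTs, which matters a lot for bulk-insert throughput.
        config.setJdbcUrl("jdbc:mysql://localhost:3306/shop"
                + "?rewriteBatchedStatements=true");
        config.setUsername("import_user");   // placeholder credentials
        config.setPassword("change-me");
        config.setMaximumPoolSize(16);       // should cover the writer threads
        return new HikariDataSource(config);
    }
}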

2. Detailed implementation steps

Step 1: Create the entity class

@Data
public class Order {
    private Long id;
    private Long userId;
    private String productInfo;
    private BigDecimal amount;
    private Date orderTime;
    // other fields...
}
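
If the spreadsheet's column order does not match the field declaration order, EasyExcel can map columns explicitly with @ExcelProperty. A sketch, assuming the columns appear in the order shown; the index values are illustrative:

@Data
public class Order {
    @ExcelProperty(index = 0)
    private Long id;
    @ExcelProperty(index = 1)
    private Long userId;
    @ExcelProperty(index = 2)
    private String productInfo;
    @ExcelProperty(index = 3)
    private BigDecimal amount;
    @ExcelProperty(index = 4)
    private Date orderTime;
}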

Step 2: Configure Spring Batch

@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Autowired
    private JobBuilderFactory jobBuilderFactory;
    
    @Autowired
    private StepBuilderFactory stepBuilderFactory;
    
    @Autowired
    private DataSource dataSource;
    
    @Bean
    public Job importOrderJob(Step step1) {
        return jobBuilderFactory.get("importOrderJob")
                .incrementer(new RunIdIncrementer())
                .flow(step1)
                .end()
                .build();
    }
    
    @Bean
    public Step step1(ItemReader<Order> reader, ItemProcessor<Order, Order> processor, ItemWriter<Order> writer) {
        return stepBuilderFactory.get("step1")
                .<Order, Order>chunk(10000) // commit every 10,000 records
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .taskExecutor(taskExecutor()) // process chunks on multiple threads
                .throttleLimit(8) // maximum number of concurrent threads
                .build();
    }
    
    @Bean
    public TaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(8);
        executor.setMaxPoolSize(16);
        executor.setQueueCapacity(500);
        executor.setThreadNamePrefix("order-import-");
        executor.initialize();
        return executor;
    }
}

Step 3: Implement the Excel reader

@Component
@StepScope
public class ExcelOrderReader implements ItemReader<Order> {

    // Rows are handed off from the EasyExcel parsing thread to the batch reader
    // through a bounded queue, so memory stays constant regardless of file size.
    private final BlockingQueue<Order> queue = new LinkedBlockingQueue<>(10_000);

    // Sentinel object that marks the end of the file.
    private static final Order EOF = new Order();

    @Value("#{jobParameters['filePath']}")
    private String filePath;

    @PostConstruct
    public void init() {
        // Parse the file on a background thread. EasyExcel streams rows (SAX-style),
        // so it never loads the whole sheet into memory.
        Thread parser = new Thread(() -> {
            EasyExcel.read(filePath, Order.class, new AnalysisEventListener<Order>() {
                @Override
                public void invoke(Order order, AnalysisContext context) {
                    try {
                        queue.put(order); // blocks when the queue is full (back-pressure)
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new RuntimeException(e);
                    }
                }

                @Override
                public void doAfterAllAnalysed(AnalysisContext context) {
                    try {
                        queue.put(EOF); // signal that parsing has finished
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }).sheet().doRead();
        }, "excel-parser");
        parser.setDaemon(true);
        parser.start();
    }

    @Override
    public Order read() throws InterruptedException {
        Order order = queue.take();
        if (order == EOF) {
            queue.put(EOF); // re-insert so every reader thread sees the end marker
            return null;    // null tells Spring Batch the input is exhausted
        }
        return order;
    }
}

Step 4: Implement the data processor

@Component
public class OrderProcessor implements ItemProcessor<Order, Order> {
    
    @Override
    public Order process(Order order) {
        // Data cleansing, transformation and validation go here
        if (order.getAmount() == null || order.getAmount().compareTo(BigDecimal.ZERO) < 0) {
            return null; // returning null filters the record out of the chunk
        }
        
        // Fill in defaults, normalize date formats, etc.
        if (order.getOrderTime() == null) {
            order.setOrderTime(new Date());
        }
        
        return order;
    }
}

Step 5: Implement the data writer

@Component
public class OrderWriter implements ItemWriter<Order> {

    @Autowired
    private JdbcTemplate jdbcTemplate;
    
    @Override
    public void write(List<? extends Order> orders) throws Exception {
        if (orders.isEmpty()) {
            return;
        }
        
        // Use a JDBC batch insert for throughput
        String sql = "INSERT INTO orders (id, user_id, product_info, amount, order_time) VALUES (?, ?, ?, ?, ?)";
        
        jdbcTemplate.batchUpdate(sql, new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                Order order = orders.get(i);
                ps.setLong(1, order.getId());
                ps.setLong(2, order.getUserId());
                ps.setString(3, order.getProductInfo());
                ps.setBigDecimal(4, order.getAmount());
                ps.setTimestamp(5, new Timestamp(order.getOrderTime().getTime()));
            }
            
            @Override
            public int getBatchSize() {
                return orders.size();
            }
        });
    }
}
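
As an alternative to the hand-rolled writer, Spring Batch ships a JdbcBatchItemWriter that performs the same batched insert. A minimal sketch; the column and property names are assumed to match the entity above:

@Bean
public JdbcBatchItemWriter<Order> orderJdbcWriter(DataSource dataSource) {
    // Named parameters (:id, :userId, ...) are resolved from Order's getters via beanMapped().
    return new JdbcBatchItemWriterBuilder<Order>()
            .dataSource(dataSource)
            .sql("INSERT INTO orders (id, user_id, product_info, amount, order_time) "
                    + "VALUES (:id, :userId, :productInfo, :amount, :orderTime)")
            .beanMapped()
            .build();
}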

Step 6: Database optimizations

@Component
public class DatabaseOptimizer {
    
    @Autowired
    private JdbcTemplate jdbcTemplate;
    
    // Caveats: these callbacks only run if this bean is registered on the step as a
    // listener (see the sketch below). With a pooled DataSource, each execute() may
    // run on a different connection, so session-level settings (autocommit,
    // unique_checks, foreign_key_checks) only affect the connection they happen to
    // use unless applied per connection or globally. ALTER TABLE ... DISABLE KEYS
    // only skips non-unique index maintenance on MyISAM tables and is a no-op on InnoDB.
    
    @BeforeStep
    public void beforeStep() {
        // Relax index maintenance and constraint checks before the import
        jdbcTemplate.execute("SET FOREIGN_KEY_CHECKS = 0");
        jdbcTemplate.execute("ALTER TABLE orders DISABLE KEYS");
        jdbcTemplate.execute("SET autocommit=0");
        jdbcTemplate.execute("SET unique_checks=0");
    }
    
    @AfterStep
    public void afterStep() {
        // Re-enable index maintenance and constraint checks after the import
        jdbcTemplate.execute("SET FOREIGN_KEY_CHECKS = 1");
        jdbcTemplate.execute("ALTER TABLE orders ENABLE KEYS");
        jdbcTemplate.execute("SET autocommit=1");
        jdbcTemplate.execute("SET unique_checks=1");
    }
}
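
For the @BeforeStep/@AfterStep callbacks above to actually fire, the optimizer has to be attached to the step. A sketch of how the step bean from Step 2 could register it; shown only to illustrate the wiring:

@Bean
public Step step1(ItemReader<Order> reader, ItemProcessor<Order, Order> processor,
                  ItemWriter<Order> writer, DatabaseOptimizer optimizer) {
    return stepBuilderFactory.get("step1")
            .<Order, Order>chunk(10000)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .listener(optimizer) // enables the @BeforeStep/@AfterStep callbacks
            .taskExecutor(taskExecutor())
            .throttleLimit(8)
            .build();
}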

Step 7: Controller to launch the import job

@RestController
@RequestMapping("/api/import")
public class ImportController {
    
    @Autowired
    private JobLauncher jobLauncher;
    
    @Autowired
    private Job importOrderJob;
    
    @PostMapping("/orders")
    public ResponseEntity<String> importOrders(@RequestParam("filePath") String filePath) {
        try {
            JobParameters jobParameters = new JobParametersBuilder()
                    .addString("filePath", filePath)
                    .addDate("time", new Date())
                    .toJobParameters();
            
            // The default JobLauncher is synchronous, so this call blocks until the
            // whole import finishes; use an async launcher (see the sketch below this
            // class) if the request should return immediately.
            JobExecution execution = jobLauncher.run(importOrderJob, jobParameters);
            
            return ResponseEntity.ok("Import job started with ID: " + execution.getId());
        } catch (Exception e) {
            return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                    .body("Failed to start import job: " + e.getMessage());
        }
    }
    
    @GetMapping("/status/{jobId}")
    public ResponseEntity<String> getJobStatus(@PathVariable Long jobId) {
        // Look up the JobExecution by ID and return its status
        // ...
        return ResponseEntity.ok("Job status information");
    }
}
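
Because the default JobLauncher executes jobs on the calling thread, the controller's HTTP request would otherwise block for the entire import. A minimal sketch of an asynchronous launcher (Spring Batch 4 style, matching the JobBuilderFactory configuration above):

@Bean
public JobLauncher asyncJobLauncher(JobRepository jobRepository) throws Exception {
    SimpleJobLauncher launcher = new SimpleJobLauncher();
    launcher.setJobRepository(jobRepository);
    // Launch jobs on a separate thread so the HTTP request returns right away.
    launcher.setTaskExecutor(new SimpleAsyncTaskExecutor("job-launcher-"));
    launcher.afterPropertiesSet();
    return launcher;
}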

Key performance optimizations

  1. Partitioned processing: split the 100 million rows (typically spread across many files or sheets) into smaller slices processed independently (see the partitioner sketch after this list)
  2. Parallelism: use multiple threads to increase throughput
  3. Batched commits: commit many rows at a time (e.g. 10,000) instead of row by row
  4. Streaming reads: never load the whole Excel file into memory at once
  5. Database tuning: temporarily relax index maintenance, foreign-key checks and autocommit
  6. Job monitoring: track progress and support restart/retry on failure
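
Since a data set of this size necessarily spans multiple files or sheets, a natural partitioning scheme is one Spring Batch partition per file, each running the reader/processor/writer pipeline above in parallel. A minimal Partitioner sketch; the file list and context key names are assumptions for illustration (a partitioned worker step would read the path from the step execution context rather than from job parameters):

public class OrderFilePartitioner implements Partitioner {

    private final List<String> excelFiles; // one Excel file per partition

    public OrderFilePartitioner(List<String> excelFiles) {
        this.excelFiles = excelFiles;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < excelFiles.size(); i++) {
            ExecutionContext context = new ExecutionContext();
            // Each worker step would pick this up, e.g. via #{stepExecutionContext['filePath']}
            context.putString("filePath", excelFiles.get(i));
            partitions.put("partition-" + i, context);
        }
        return partitions;
    }
}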

With this approach, importing 100 million rows of Excel data becomes manageable while keeping the system stable and performant.