Importing 100 million rows of Excel data in a Spring Boot application is a classic large-scale batch-processing problem: memory usage, throughput, and system stability all have to be considered. Below is a concrete scenario and a detailed solution.
Scenario
Suppose an e-commerce platform needs to import 100 million historical order records into a MySQL database. Each order carries an ID, user ID, product information, amount, timestamp, and other fields. Note that a single .xlsx sheet holds at most 1,048,576 rows, so a dataset this large will in practice be spread across many files or sheets.
Recommended Solution
1. Technology Choices
- Excel parsing: EasyExcel (Alibaba's open-source library, which reads .xlsx as a SAX stream) or Apache POI's streaming read API (note that POI's SXSSF mode is for streaming writes, not reads)
- Batch framework: Spring Batch
- Database: MySQL with batched inserts
- Parallelism: multi-threaded, sharded processing
- Database tuning: relaxing constraint checks, temporarily disabling indexes, and so on
2. Implementation Steps
Step 1: Create the entity class
@Data
public class Order {
private Long id;
private Long userId;
private String productInfo;
private BigDecimal amount;
private Date orderTime;
// other fields...
}
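If the spreadsheet's column headers do not match the Java field names, EasyExcel's mapping annotations can be added to the entity. A sketch with assumed header names and date format (adjust both to the actual file):
import com.alibaba.excel.annotation.ExcelProperty;
import com.alibaba.excel.annotation.format.DateTimeFormat;

@Data
public class Order {
    @ExcelProperty("order_id")             // assumed column header
    private Long id;
    @ExcelProperty("user_id")
    private Long userId;
    @ExcelProperty("product_info")
    private String productInfo;
    @ExcelProperty("amount")
    private BigDecimal amount;
    @ExcelProperty("order_time")
    @DateTimeFormat("yyyy-MM-dd HH:mm:ss") // assumed cell format
    private Date orderTime;
}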
Step 2: Configure Spring Batch
@Configuration
@EnableBatchProcessing
public class BatchConfig {
@Autowired
private JobBuilderFactory jobBuilderFactory;
@Autowired
private StepBuilderFactory stepBuilderFactory;
@Autowired
private DataSource dataSource;
@Bean
public Job importOrderJob(Step step1) {
return jobBuilderFactory.get("importOrderJob")
.incrementer(new RunIdIncrementer())
.flow(step1)
.end()
.build();
}
@Bean
public Step step1(ItemReader<Order> reader, ItemProcessor<Order, Order> processor,
                  ItemWriter<Order> writer, DatabaseOptimizer databaseOptimizer) {
    return stepBuilderFactory.get("step1")
            .<Order, Order>chunk(10000) // commit every 10,000 items per transaction
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .listener(databaseOptimizer) // registers the @BeforeStep/@AfterStep hooks from step 6
            .taskExecutor(taskExecutor()) // run chunks on multiple threads
            .throttleLimit(8) // cap on concurrent worker threads
            .build();
}
@Bean
public TaskExecutor taskExecutor() {
ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(8);
executor.setMaxPoolSize(16);
executor.setQueueCapacity(500);
executor.setThreadNamePrefix("order-import-");
executor.initialize();
return executor;
}
}
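If occasional bad rows or transient database errors should not abort the whole job, the same step can also be made fault tolerant. A sketch of the extra builder calls (the exception classes are illustrative choices, not part of the original configuration):
return stepBuilderFactory.get("step1")
        .<Order, Order>chunk(10000)
        .reader(reader)
        .processor(processor)
        .writer(writer)
        .faultTolerant()
        .skip(DataIntegrityViolationException.class)   // e.g. skip rows that violate a constraint
        .skipLimit(1000)                                // fail the step if more than 1,000 rows are skipped
        .retry(DeadlockLoserDataAccessException.class)  // e.g. retry chunks that lose a deadlock
        .retryLimit(3)
        .taskExecutor(taskExecutor())
        .throttleLimit(8)
        .build();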
Step 3: Implement the Excel reader
EasyExcel pushes rows to a listener callback, while Spring Batch's ItemReader pulls them one at a time. The reader below bridges the two with a bounded blocking queue fed by a background reading thread, so memory stays bounded even at 100 million rows.
@Component
@StepScope
public class ExcelOrderReader implements ItemReader<Order> {

    // Marker object placed on the queue when the producer reaches the end of the file.
    private static final Order END_MARKER = new Order();
    private static final int QUEUE_CAPACITY = 10_000;

    // Bounded queue between the EasyExcel producer thread and the Spring Batch worker threads.
    private final BlockingQueue<Order> queue = new LinkedBlockingQueue<>(QUEUE_CAPACITY);

    @Value("#{jobParameters['filePath']}")
    private String filePath;

    @PostConstruct
    public void startProducer() {
        Thread producer = new Thread(() -> {
            EasyExcel.read(filePath, Order.class, new AnalysisEventListener<Order>() {
                @Override
                public void invoke(Order order, AnalysisContext context) {
                    try {
                        queue.put(order); // blocks when the queue is full, throttling the producer
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new RuntimeException("Excel producer interrupted", e);
                    }
                }

                @Override
                public void doAfterAllAnalysed(AnalysisContext context) {
                    try {
                        queue.put(END_MARKER); // signal the end of the file
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }).sheet().doRead();
        }, "excel-producer");
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public Order read() throws InterruptedException {
        Order next = queue.take();
        if (next == END_MARKER) {
            queue.put(END_MARKER); // re-insert so every worker thread also sees the end marker
            return null;           // null tells Spring Batch that the input is exhausted
        }
        return next;
    }
}
Step 4: Implement the item processor
@Component
public class OrderProcessor implements ItemProcessor<Order, Order> {
@Override
public Order process(Order order) {
// Data cleansing, transformation, and validation can be done here
if (order.getAmount() == null || order.getAmount().compareTo(BigDecimal.ZERO) < 0) {
return null; // filter out invalid records
}
// Fill in defaults, normalize date formats, and so on
if (order.getOrderTime() == null) {
order.setOrderTime(new Date());
}
return order;
}
}
Step 5: Implement the item writer
@Component
public class OrderWriter implements ItemWriter<Order> {
@Autowired
private JdbcTemplate jdbcTemplate;
@Override
public void write(List<? extends Order> orders) throws Exception {
if (orders.isEmpty()) {
return;
}
// Batch insert for throughput; see the note after this class for the MySQL driver flag that makes it a true multi-row insert
String sql = "INSERT INTO orders (id, user_id, product_info, amount, order_time) VALUES (?, ?, ?, ?, ?)";
jdbcTemplate.batchUpdate(sql, new BatchPreparedStatementSetter() {
@Override
public void setValues(PreparedStatement ps, int i) throws SQLException {
Order order = orders.get(i);
ps.setLong(1, order.getId());
ps.setLong(2, order.getUserId());
ps.setString(3, order.getProductInfo());
ps.setBigDecimal(4, order.getAmount());
ps.setTimestamp(5, new Timestamp(order.getOrderTime().getTime()));
}
@Override
public int getBatchSize() {
return orders.size();
}
});
}
}
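JdbcTemplate.batchUpdate only turns into a true multi-row insert on MySQL when the Connector/J flag rewriteBatchedStatements=true is set on the JDBC URL (it can equally be appended to spring.datasource.url). A minimal HikariCP sketch, with placeholder connection details:
@Bean
public DataSource dataSource() {
    HikariConfig config = new HikariConfig();
    // rewriteBatchedStatements=true lets the driver rewrite the JDBC batch
    // into multi-row INSERT statements, which is where most of the speedup comes from.
    config.setJdbcUrl("jdbc:mysql://localhost:3306/shop?rewriteBatchedStatements=true");
    config.setUsername("import_user");
    config.setPassword("secret");
    config.setMaximumPoolSize(16); // at least one connection per worker thread
    return new HikariDataSource(config);
}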
Step 6: Database optimizations
@Component
public class DatabaseOptimizer {
@Autowired
private JdbcTemplate jdbcTemplate;
// This bean is registered on the step via .listener(databaseOptimizer) in step 2,
// otherwise the @BeforeStep/@AfterStep callbacks would never run.
@BeforeStep
public void beforeStep(StepExecution stepExecution) {
    // Relax constraint checking for the duration of the import.
    jdbcTemplate.execute("SET FOREIGN_KEY_CHECKS = 0");
    jdbcTemplate.execute("SET unique_checks = 0");
    // Caveat: DISABLE KEYS only affects non-unique indexes on MyISAM tables; for InnoDB,
    // drop secondary indexes before the load and rebuild them afterwards (see below).
    jdbcTemplate.execute("ALTER TABLE orders DISABLE KEYS");
    // Caveat: SET statements are per-connection. With a connection pool, each JdbcTemplate
    // call may use a different connection, so apply such settings globally or via the pool's
    // connection-init SQL. Autocommit is already managed by Spring Batch's chunk transactions,
    // so it is not toggled here.
}
@AfterStep
public ExitStatus afterStep(StepExecution stepExecution) {
    // Restore constraints and indexes once the import has finished.
    jdbcTemplate.execute("ALTER TABLE orders ENABLE KEYS");
    jdbcTemplate.execute("SET unique_checks = 1");
    jdbcTemplate.execute("SET FOREIGN_KEY_CHECKS = 1");
    return stepExecution.getExitStatus();
}
}
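Because DISABLE KEYS has no effect on InnoDB secondary indexes, a heavier but effective alternative is to drop them before the load and rebuild them afterwards. A sketch assuming a hypothetical index idx_orders_user_id:
// In beforeStep(): drop the secondary index so rows are loaded without maintaining it.
jdbcTemplate.execute("ALTER TABLE orders DROP INDEX idx_orders_user_id");

// In afterStep(): rebuild the index in a single pass, which is usually much faster
// than maintaining it row by row during the import.
jdbcTemplate.execute("ALTER TABLE orders ADD INDEX idx_orders_user_id (user_id)");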
Step 7: A controller to launch the import job
@RestController
@RequestMapping("/api/import")
public class ImportController {
@Autowired
private JobLauncher jobLauncher;
@Autowired
private Job importOrderJob;
@PostMapping("/orders")
public ResponseEntity<String> importOrders(@RequestParam("filePath") String filePath) {
try {
JobParameters jobParameters = new JobParametersBuilder()
.addString("filePath", filePath)
.addDate("time", new Date())
.toJobParameters();
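// Note: jobLauncher.run(...) is synchronous by default; for an import this long,
// configure the launcher with an asynchronous TaskExecutor so this request returns immediately.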
JobExecution execution = jobLauncher.run(importOrderJob, jobParameters);
return ResponseEntity.ok("Import job started with ID: " + execution.getId());
} catch (Exception e) {
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
.body("Failed to start import job: " + e.getMessage());
}
}
@GetMapping("/status/{jobId}")
public ResponseEntity<String> getJobStatus(@PathVariable Long jobId) {
// Query the execution status via Spring Batch's JobExplorer (a sketch follows this class)
// ...
return ResponseEntity.ok("Job status information");
}
}
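A minimal sketch of the status endpoint left open above, assuming a JobExplorer bean is available (the response format is illustrative):
@Autowired
private JobExplorer jobExplorer;

@GetMapping("/status/{jobId}")
public ResponseEntity<String> getJobStatus(@PathVariable Long jobId) {
    // Look up the execution that Spring Batch recorded in its job repository.
    JobExecution execution = jobExplorer.getJobExecution(jobId);
    if (execution == null) {
        return ResponseEntity.notFound().build();
    }
    long written = execution.getStepExecutions().stream()
            .mapToLong(StepExecution::getWriteCount)
            .sum();
    return ResponseEntity.ok("Status: " + execution.getStatus() + ", rows written: " + written);
}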
Performance Tuning Highlights
- Sharding: split the 100 million rows into many smaller partitions and process them independently (see the sketch after this list)
- Multi-threading: process chunks on multiple threads to raise throughput
- Batched commits: commit many rows at a time (e.g. 10,000) instead of row by row
- Streaming reads: never load the entire Excel file into memory at once
- Database tuning: temporarily disable indexes, foreign-key checks, and constraint checks during the load
- Job monitoring: track progress and add skip/retry handling for failures
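The sharding mentioned above maps naturally onto Spring Batch partitioning. Since a single .xlsx sheet holds at most 1,048,576 rows, a 100-million-row dataset will in practice arrive as many files; a sketch that assigns one file per partition (the file location pattern is an assumption):
@Bean
public Step partitionedStep(Step step1, TaskExecutor taskExecutor) throws IOException {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    // Assumed layout: one Excel file per shard, e.g. /data/orders/orders_000.xlsx ... orders_099.xlsx
    partitioner.setResources(new PathMatchingResourcePatternResolver()
            .getResources("file:/data/orders/orders_*.xlsx"));
    return stepBuilderFactory.get("partitionedStep")
            .partitioner("step1", partitioner) // one step execution context per file
            .step(step1)
            .gridSize(8)                       // concurrency hint; MultiResourcePartitioner creates one partition per file
            .taskExecutor(taskExecutor)
            .build();
}
Each partition exposes its file under the fileName key of the step execution context, so the reader would then bind #{stepExecutionContext['fileName']} instead of the filePath job parameter.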
With these measures combined, a 100-million-row Excel import can be processed efficiently while keeping the system stable and performant.