A Lighthearted Moment
Last night, over a late-night snack with a good friend, I was venting about my blind-date situation.
Me: For the past five years I've considered myself a man of upright conduct. I don't smoke or drink, and I don't flirt with girls. My schedule is regular: in bed by 21:00, up at 06:00. My temperament is good too: quiet, honest, and very obedient.
Finally I sighed: with so many good qualities, why does every girl I'm set up with look down on me?
Friend: You've only been out for half a year and you've already forgotten how we got in? Back then you had no money and still took me to PC. I'm too embarrassed to even bring it up.
Buffer Blocking
In 异构数据源同步之数据同步 → datax 改造,有点意思 it was mentioned that Runtime.getRuntime().exec can block, and the root cause is a deadlock caused by a full pipe buffer.
When a Runtime object calls exec(cmd), the JVM starts a child process that is connected to the JVM process through three pipes: standard input (stdin), standard output (stdout) and standard error (stderr). Each pipe has a fixed-size buffer in the operating system. If the child process keeps writing to stdout or stderr while the parent process (the JVM) does not read them in time via Process.getInputStream() and Process.getErrorStream(), the buffer fills up; once it is full, the child's writes block and the child hangs. If the parent process then calls process.waitFor() to wait for the child to exit, we get the classic
deadlock: the parent waits for the child to exit, the child waits for the parent to drain the buffer, and the program is stuck forever.
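To make the failure concrete, here is a minimal sketch of the problematic pattern described above (the command is purely illustrative):

// Anti-pattern sketch: nothing drains stdout/stderr, so once the OS pipe buffer
// fills up, the child process blocks on write and waitFor() may never return.
Process process = Runtime.getRuntime().exec("java -jar some-chatty-app.jar");
int exitCode = process.waitFor(); // parent waits for child, child waits for a reader: deadlock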
To avoid this, we use two threads to read stdout and stderr respectively:
Process process = Runtime.getRuntime().exec(DATAX_COMMAND);
// separate thread to read standard output
new Thread(() -> {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream(), SYSTEM_ENCODING))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}).start();
// separate thread to read error output
new Thread(() -> {
    try (BufferedReader errorReader = new BufferedReader(new InputStreamReader(process.getErrorStream(), SYSTEM_ENCODING))) {
        String line;
        while ((line = errorReader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}).start();
// wait for the command to finish
int i = process.waitFor();
if (i == 0) {
    System.out.println("job finished");
} else {
    System.out.println("job failed");
}
If we do not need to distinguish standard output from error output, we can merge the error stream into the standard output:
List<String> DATAX_COMMNDS = Arrays.asList("java",
        "-server",
        "-Xms4g",
        "-Xmx4g",
        "-Dfile.encoding=GBK",
        "-Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener",
        "-Ddatax.home=E:\\git-project\\datax-home",
        "-Dlogback.configurationFile=E:\\git-project\\datax-home\\conf\\logback.xml",
        "-classpath", "E:\\git-project\\datax-home\\lib\\*",
        "com.alibaba.datax.core.Engine",
        "-mode", "standalone",
        "-job", "E:\\git-project\\datax-home\\job\\mysql2mysql.json");
ProcessBuilder pb = new ProcessBuilder(DATAX_COMMNDS);
// merge the error stream into standard output
pb.redirectErrorStream(true);
Process process = pb.start();
// separate thread to read the merged output
new Thread(() -> {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream(), SYSTEM_ENCODING))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}).start();
// wait for the command to finish
int i = process.waitFor();
if (i == 0) {
    System.out.println("job finished");
} else {
    System.out.println("job failed");
}
Creating a brand-new thread to read the output stream for every task execution is not reasonable: creating and destroying threads costs resources, and the more sensible approach is a thread pool.
How to configure the pool properly depends on the workload (I/O-bound or CPU-bound), so we will not expand on that here; a rough sketch follows below.
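As a sketch only (the pool sizing and queue capacity below are placeholders, not recommendations), the stream-reading work can be handed to a shared ExecutorService instead of a new Thread per task:

// Sketch: a shared, bounded pool (java.util.concurrent) replaces per-task Thread creation;
// tune the sizing to the real workload (I/O-bound vs. CPU-bound).
ExecutorService streamReaderPool = new ThreadPoolExecutor(
        4,                                          // core threads
        8,                                          // max threads
        60L, TimeUnit.SECONDS,                      // keep-alive for idle non-core threads
        new LinkedBlockingQueue<>(100),             // bounded task queue
        new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure instead of outright rejection

// the stdout-reading task from the earlier snippet is then submitted instead of new Thread(...).start()
streamReaderPool.submit(() -> {
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(process.getInputStream(), SYSTEM_ENCODING))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
});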
Synced Row Count
Because we need to know DataX's synced row count in real time, we modified DataX's log output so that the number of rows DataX writes to the target database in each batch is printed to the log; we then read those counts from the log, accumulate them, and update the database in real time. For the concrete implementation see 异源数据同步 → 如何获取 DataX 已同步数据量?, which stresses that the persistence to the database must use an incremental update of the form
update table_name set sync_rows = sync_rows + syncRows;
A concrete implementation looks roughly like this:
private static final List<String> DATAX_COMMNDS = Arrays.asList("java",
        "-server",
        "-Xms4g",
        "-Xmx4g",
        "-Dfile.encoding=GBK",
        "-Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener",
        "-Ddatax.home=E:\\git-project\\datax-home",
        "-Dlogback.configurationFile=E:\\git-project\\datax-home\\conf\\logback.xml",
        "-classpath", "E:\\git-project\\datax-home\\lib\\*",
        "com.alibaba.datax.core.Engine",
        "-mode", "standalone",
        "-job", "E:\\git-project\\datax-home\\job\\mysql2mysql.json");

@Test
public void test() {
    // create the task-execution log record first
    Long jobId = 1L;
    LocalDateTime now = LocalDateTime.now();
    int execStatus = -1;
    String msg = "";
    QslJobLog qslJobLog = new QslJobLog(jobId, execStatus, 0L, now, now);
    qslJobLogDao.insert(qslJobLog);
    try {
        ProcessBuilder pb = new ProcessBuilder(DATAX_COMMNDS);
        // merge the error stream into standard output
        pb.redirectErrorStream(true);
        Process process = pb.start();
        // read the merged output stream asynchronously on a pool thread
        Future<String> streamFuture = readStream(process, qslJobLog.getId());
        msg = streamFuture.get();
        // wait for the command to finish
        int i = process.waitFor();
        if (i == 0) {
            execStatus = 1;
            LOGGER.info("job[{}] finished", jobId);
        } else {
            LOGGER.error("job[{}] failed", jobId);
            execStatus = 0;
        }
    } catch (Exception e) {
        execStatus = 0;
        msg = "task execution exception: " + e.getMessage();
        LOGGER.error("task execution exception:", e);
    }
    now = LocalDateTime.now();
    qslJobLogDao.update(new LambdaUpdateWrapper<QslJobLog>()
            .eq(QslJobLog::getId, qslJobLog.getId())
            .set(QslJobLog::getExecStatus, execStatus)
            .set(QslJobLog::getUpdateTime, now)
            .set(QslJobLog::getRemark, msg));
}

private Future<String> readStream(Process process, Long jobLogId) {
    return executorService.submit(() -> {
        String threadName = Thread.currentThread().getName();
        LOGGER.info("thread[{}] started reading the task log", threadName);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream(), SYSTEM_ENCODING))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("sync rows=")) {
                    long syncRows = Long.parseLong(line.split("=")[1]);
                    qslJobLogDao.update(new LambdaUpdateWrapper<QslJobLog>()
                            .eq(QslJobLog::getId, jobLogId)
                            .setSql("sync_rows = sync_rows + " + syncRows));
                } else {
                    sb.append(line).append(StrPool.CRLF);
                }
                LOGGER.info(line);
            }
        } catch (IOException e) {
            LOGGER.error("failed to read the log:", e);
        }
        LOGGER.info("thread[{}] finished reading the task log", threadName);
        // keep only the last 20000 characters
        return sb.length() > 20000 ? sb.substring(sb.length() - 20000, sb.length()) : sb.toString();
    });
}
The structure of table tbl_qsl_job_log is as follows:
CREATE TABLE `tbl_qsl_job_log` (
  `id` bigint NOT NULL COMMENT 'primary key',
  `job_id` bigint NOT NULL COMMENT 'job id',
  `sync_rows` bigint NOT NULL DEFAULT '0' COMMENT 'number of rows synced',
  `exec_status` tinyint DEFAULT NULL COMMENT 'execution status, -2: waiting, -1: running, 0: failed, 1: succeeded',
  `remark` text COMMENT 'DataX execution log',
  `create_time` datetime DEFAULT NULL COMMENT 'creation time',
  `update_time` datetime DEFAULT NULL COMMENT 'last modified time',
  PRIMARY KEY (`id`)
)
A note on the readStream method: the StringBuilder sb collects DataX's log output (excluding the lines that contain sync rows=), and only the last 20000 characters are persisted, so that DataX's execution log can be viewed conveniently from the platform.
Looking at the code above, where do you see room for optimization? Below are the optimizations I made.
- Drop the logic that persists the DataX log to the database and read the DataX task log file directly
If the DataX log is not persisted to the database, readStream is adjusted as follows:

private Future<Long> readStream(Process process, Long jobLogId) {
    return executorService.submit(() -> {
        String threadName = Thread.currentThread().getName();
        long totalRows = 0;
        LOGGER.info("thread[{}] started reading the task log, jobLogId={}", threadName, jobLogId);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream(), SYSTEM_ENCODING))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("sync rows=")) {
                    long syncRows = Long.parseLong(line.split("=")[1]);
                    totalRows += syncRows;
                    qslJobLogDao.update(new LambdaUpdateWrapper<QslJobLog>()
                            .eq(QslJobLog::getId, jobLogId)
                            .setSql("sync_rows = sync_rows + " + syncRows));
                }
                LOGGER.info(line);
            }
        } catch (IOException e) {
            LOGGER.error("failed to read the log:", e);
        }
        LOGGER.info("thread[{}] finished reading the task log, jobLogId={}", threadName, jobLogId);
        return totalRows;
    });
}

Now that we read the DataX log file directly, the file becomes more important and its layout needs to be managed more strictly, so DataX's logback.xml is adjusted:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property name="log.dir" value="${datax.home}/log/" />
    <property name="job.id" value="${job.id}" />
    <property name="job.log.id" value="${job.log.id}" />
    <property name="ymd" value="${current.day}"/>
    <property name="byMillionSecond" value="${current.time.millis}"/>
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <Encoding>UTF-8</Encoding>
        <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
            <pattern>%msg%n</pattern>
        </encoder>
    </appender>
    <appender name="FILE" class="ch.qos.logback.core.FileAppender">
        <charset>UTF-8</charset>
        <file>${log.dir}/${ymd}/${job.id}/${job.log.id}-${byMillionSecond}.log</file>
        <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{0} - %msg%n</pattern>
        </encoder>
    </appender>
    <root level="${loglevel:-INFO}">
        <appender-ref ref="STDOUT" />
        <appender-ref ref="FILE" />
    </root>
</configuration>

The file involves 5 variables, which can be passed to DataX's logback by setting system properties:

private static final String DATAX_HOME_PATH = "E:\\git-project\\datax-home";

@Test
public void test() {
    // create the task-execution log record first
    Long jobId = 2L;
    LocalDateTime now = LocalDateTime.now();
    int execStatus = -1;
    long totalRows = 0;
    QslJobLog qslJobLog = new QslJobLog(jobId, execStatus, 0L, now, now);
    qslJobLogDao.insert(qslJobLog);
    String currentDay = now.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"));
    String currentTimeMillis = now.format(DateTimeFormatter.ofPattern("HH_mm_ss.SSS"));
    List<String> DATAX_COMMNDS = Arrays.asList("java",
            "-server",
            "-Xms4g",
            "-Xmx4g",
            "-Dfile.encoding=GBK",
            "-Dlogback.statusListenerClass=ch.qos.logback.core.status.NopStatusListener",
            "-Ddatax.home=" + DATAX_HOME_PATH,
            "-Dlogback.configurationFile=" + DATAX_HOME_PATH + "\\conf\\logback.xml",
            "-Djob.id=" + jobId,
            "-Djob.log.id=" + qslJobLog.getId(),
            "-Dcurrent.day=" + currentDay,
            "-Dcurrent.time.millis=" + currentTimeMillis,
            "-classpath", DATAX_HOME_PATH + "\\lib\\*",
            "com.alibaba.datax.core.Engine",
            "-mode", "standalone",
            "-job", DATAX_HOME_PATH + "\\job\\mysql2mysql.json");
    String jobLogPath = DATAX_HOME_PATH + "\\log\\" + currentDay + "\\" + jobId + "\\" + qslJobLog.getId() + "-" + currentTimeMillis + ".log";
    try {
        ProcessBuilder pb = new ProcessBuilder(DATAX_COMMNDS);
        // merge the error stream into standard output
        pb.redirectErrorStream(true);
        Process process = pb.start();
        // read the merged output stream asynchronously on a pool thread
        Future<Long> streamFuture = readStream(process, qslJobLog.getId());
        totalRows = streamFuture.get();
        // wait for the command to finish
        int i = process.waitFor();
        if (i == 0) {
            execStatus = 1;
            LOGGER.info("job[{}] finished, totalRows={}", jobId, totalRows);
        } else {
            LOGGER.error("job[{}] failed", jobId);
            execStatus = 0;
        }
    } catch (Exception e) {
        execStatus = 0;
        LOGGER.error("task execution exception:", e);
    }
    now = LocalDateTime.now();
    qslJobLogDao.update(new LambdaUpdateWrapper<QslJobLog>()
            .eq(QslJobLog::getId, qslJobLog.getId())
            .set(QslJobLog::getExecStatus, execStatus)
            .set(QslJobLog::getUpdateTime, now)
            .set(QslJobLog::getRemark, jobLogPath));
}

- datax.home: DataX's home directory; in the example it is E:\git-project\datax-home
- job.id: the platform job id; 2 in the example
- job.log.id: the id of this execution's log record, generated for each run by MyBatis-Plus's snowflake algorithm (see the sketch after this list)
- current.day: the date the platform job runs, in the format yyyy-MM-dd
- current.time.millis: the hour, minute, second and millisecond the platform job runs, in the format HH_mm_ss.SSS
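As referenced in the job.log.id item above, here is a minimal sketch of how the snowflake id could be declared with MyBatis-Plus; the entity definition below is assumed for illustration, not taken from the project:

// Assumed entity sketch: with MyBatis-Plus, an id of IdType.ASSIGN_ID is filled in
// by the built-in snowflake algorithm when qslJobLogDao.insert(...) is called.
@TableName("tbl_qsl_job_log")
public class QslJobLog {
    @TableId(type = IdType.ASSIGN_ID)
    private Long id;
    // job_id, sync_rows, exec_status, remark, create_time, update_time omitted
}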
In the sample code, jobLogPath is persisted into the remark column of table tbl_qsl_job_log. That is not recommended; a dedicated column (e.g. datax_log_path) should be added to store it. After the job finishes, the log path and file name follow the pattern {datax.home}\log\{current.day}\{job.id}\{job.log.id}-{current.time.millis}.log, and this path is also stored in table tbl_qsl_job_log.
- Remove the asynchronous wait to reduce the gap between the platform job's end time and the DataX job's end time
Let us take a closer look at the following lines of code:
// read the merged output stream asynchronously on a pool thread
Future<Long> streamFuture = readStream(process, qslJobLog.getId());
totalRows = streamFuture.get();
// wait for the command to finish
int i = process.waitFor();

Is it possible that the DataX job has already finished while readStream has not, i.e. streamFuture.get() is still blocking? Absolutely. readStream reads from the standard stream and performs frequent database updates (updating sync_rows), so both the size of the stream buffer and the load on the database directly determine how long streamFuture.get() blocks.
Since the asynchronous wait can block, can we go asynchronous without waiting? (That depends on the business requirements.) If we can, readStream is adjusted as follows:

private void readStream(Process process, Long jobLogId) {
    executorService.execute(() -> {
        String threadName = Thread.currentThread().getName();
        long totalRows = 0;
        LOGGER.info("thread[{}] started reading the task log, jobLogId={}", threadName, jobLogId);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream(), SYSTEM_ENCODING))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("sync rows=")) {
                    long syncRows = Long.parseLong(line.split("=")[1]);
                    totalRows += syncRows;
                    qslJobLogDao.update(new LambdaUpdateWrapper<QslJobLog>()
                            .eq(QslJobLog::getId, jobLogId)
                            .setSql("sync_rows = sync_rows + " + syncRows));
                }
                LOGGER.info(line);
            }
        } catch (IOException e) {
            LOGGER.error("failed to read the log:", e);
        }
        LOGGER.info("thread[{}] finished reading the task log, jobLogId={}, totalRows={}", threadName, jobLogId, totalRows);
    });
}

The three lines we examined above then become two:

// read the merged output stream asynchronously on a pool thread
readStream(process, qslJobLog.getId());
// wait for the command to finish
int i = process.waitFor();

With this change, the platform job's end time is very close to the DataX job's end time.
One thing to note: the platform job may already be finished (failed or succeeded) while sync_rows on its execution record is still being updated!
If you really need the asynchronous wait, you can move it after process.waitFor(); if the business allows, you can even insert other logic between the two:

int i = process.waitFor();
// TODO other processing
totalRows = streamFuture.get();
- Move the synced row count into Redis to reduce the pressure on the platform database
In readStream, the update

qslJobLogDao.update(new LambdaUpdateWrapper<QslJobLog>()
        .eq(QslJobLog::getId, jobLogId)
        .setSql("sync_rows = sync_rows + " + syncRows));

is executed a number of times strongly correlated with the number of rows being synced: the more rows, the more updates. Updates in a relational database involve disk I/O, which is far less efficient than memory access, so we can bring in Redis to optimize:
private void readStream(Process process, Long jobLogId) {
    executorService.execute(() -> {
        String threadName = Thread.currentThread().getName();
        long totalRows = 0;
        LOGGER.info("thread[{}] started reading the task log, jobLogId={}", threadName, jobLogId);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream(), SYSTEM_ENCODING))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("sync rows=")) {
                    long syncRows = Long.parseLong(line.split("=")[1]);
                    totalRows += syncRows;
                    // while the job is running, accumulate the synced row count in Redis
                    redisTemplate.opsForValue().increment("sync:rows:" + jobLogId, syncRows);
                }
                LOGGER.info(line);
            }
        } catch (IOException e) {
            LOGGER.error("failed to read the log:", e);
        }
        LOGGER.info("thread[{}] finished reading the task log, jobLogId={}, totalRows={}", threadName, jobLogId, totalRows);
        // job finished: persist the total to the database and remove the Redis key
        qslJobLogDao.update(new LambdaUpdateWrapper<QslJobLog>()
                .eq(QslJobLog::getId, jobLogId)
                .set(QslJobLog::getSyncRows, totalRows));
        redisTemplate.delete("sync:rows:" + jobLogId);
    });
}

A more rigorous approach is to set an expiry when updating the count in Redis; in other words, we need both the Redis increment and the expiry, and a reasonable way to implement that is a Lua script:

private static final String INCR_LUA =
        "local newVal = redis.call('INCRBY', KEYS[1], ARGV[1])\n" +
        "redis.call('EXPIRE', KEYS[1], ARGV[2])\n" +
        "return newVal";

// while the job is running, increment the synced row count in Redis with a 60-second expiry
redisTemplate.execute(incrementWithExpireScript, Collections.singletonList("sync:rows:" + jobLogId), syncRows, 60);

When the business needs the synced row count, it can branch on the job status: if the job has finished, read it from the database; if it is still running, read it from Redis.
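The incrementWithExpireScript object used above is not shown in the snippet; one way to create it, as a sketch assuming Spring Data Redis, is:

// Assumed wiring (Spring Data Redis): wrap the Lua text in a script object whose
// result type matches what INCRBY returns.
DefaultRedisScript<Long> incrementWithExpireScript = new DefaultRedisScript<>(INCR_LUA, Long.class);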
Be aware of the consistency issues that come with writing to both the database and the cache.
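A minimal sketch of that read path, reusing the getters and status codes from the example above (the selectById lookup and the Redis value handling are assumptions for illustration):

// Finished jobs read the count from the database; running jobs read it from Redis.
// Status codes follow the tbl_qsl_job_log comment: 0 = failed, 1 = succeeded.
public long querySyncRows(Long jobLogId) {
    QslJobLog jobLog = qslJobLogDao.selectById(jobLogId);
    if (jobLog.getExecStatus() == 0 || jobLog.getExecStatus() == 1) {
        return jobLog.getSyncRows();                                    // job finished: the database is authoritative
    }
    Object cached = redisTemplate.opsForValue().get("sync:rows:" + jobLogId);
    return cached == null ? 0L : Long.parseLong(cached.toString());    // still running: read Redis
}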
Summary
- Runtime.getRuntime().exec involves pipe buffers; the buffered data must be read promptly, otherwise a deadlock can occur and the program hangs forever
- Do not persist log data to the database, especially when its volume is unpredictable; reading the log file directly is more reasonable
- When a running task updates the database frequently, introducing a cache is a very suitable optimization