Druid CreateConnectionThread 异常分析

670 阅读4分钟

问题背景

  • 系统业务场景:当前服务消费其他系统 MQ,并将数据进行落库;
  • 系统出现异常报警,线上异常日志情况:
    • Consumer.doStop.png
  • 通过日志可以看出是 CreateConnectionThread 异常,造成了获取不到数据库连接。

CreateConnectionThread 源码中的三种异常情况

1 锁异常中断

  • 源码如下:
try {
    lock.lockInterruptibly();
} catch (InterruptedException e2) {
    break;
}

2 线程异常中断

  • 源码如下:
try {
    boolean emptyWait = true;
    // do something... ...(省略代码)
} catch (InterruptedException e) {
    lastCreateError = e;
    lastErrorTimeMillis = System.currentTimeMillis();

    if (!closing) {
        LOG.error("create connection Thread Interrupted, url: " + jdbcUrl, e);
    }
    break;
} finally {
    lock.unlock();
}

3 获取物理连接异常中断

  • 源码如下:
try {
    connection = createPhysicalConnection();
    setFailContinuous(false);
} catch (SQLException e) {
    LOG.error("create connection SQLException, url: " + jdbcUrl + ", errorCode " + e.getErrorCode()
              + ", state " + e.getSQLState(), e);

    errorCount++;
    if (errorCount > connectionErrorRetryAttempts && timeBetweenConnectErrorMillis > 0) {
        // do something... ...(省略代码)

        try {
            Thread.sleep(timeBetweenConnectErrorMillis);
        } catch (InterruptedException interruptEx) {
            break;
        }
    }
} catch (RuntimeException e) {
    LOG.error("create connection RuntimeException", e);
    setFailContinuous(true);
    continue;
} catch (Error e) {
    LOG.error("create connection Error", e);
    setFailContinuous(true);
    break;
}

CreateConnectionThread 异常原因分析

  • 结合线上报错日志和源码分析,很容易就定位到,线上异常原因是由于 CreateConnectionThread 线程异常中断引起的;
  • 那 CreateConnectionThread 线程异常中断又是因为什么呢?
  • 继续分析报错日志可以看到这样一行记录:
[INFO  2021-10-19 15:33:27.538][][][ClusterManager] c.j.j.c.c.GroupConsumer.doStop consumer xxx to xxx1614:xxx,READ,xxx,NONE is stopped.
  • 发现 MQ 莫名其妙调用了 doStop 方法,猜测问题大概率是出在了 GroupConsumer.doStop 上;
  • 查看该方法的源码实现:
@Override
protected void doStop() {
    super.doStop();
    threadGroup.interrupt();
    threadGroup = null;
    // do something... ...(省略代码)
}
  • 看到这里基本也就破案了,MQ 的 GroupConsumer 调用 doStop 进行退出的时候,调用的是 ThreadGroupinterrupt() 方法,查看该方法的代码实现会发现,该方法会中断当前线程组里的全部线程
/**
 * Interrupts all threads in this thread group.
 * <p>
 * First, the <code>checkAccess</code> method of this thread group is
 * called with no arguments; this may result in a security exception.
 * <p>
 * This method then calls the <code>interrupt</code> method on all the
 * threads in this thread group and in all of its subgroups.
 *
 * @exception  SecurityException  if the current thread is not allowed
 *               to access this thread group or any of the threads in
 *               the thread group.
 * @see        java.lang.Thread#interrupt()
 * @see        java.lang.SecurityException
 * @see        java.lang.ThreadGroup#checkAccess()
 * @since      1.2
 */
public final void interrupt() {
    int ngroupsSnapshot;
    ThreadGroup[] groupsSnapshot;
    synchronized (this) {
        checkAccess();
        for (int i = 0 ; i < nthreads ; i++) {
            threads[i].interrupt();
        }
        ngroupsSnapshot = ngroups;
        if (groups != null) {
            groupsSnapshot = Arrays.copyOf(groups, ngroupsSnapshot);
        } else {
            groupsSnapshot = null;
        }
    }
    for (int i = 0 ; i < ngroupsSnapshot ; i++) {
        groupsSnapshot[i].interrupt();
    }
}
  • 注释里写的很清楚,该方法会在当前线程组及其所有子线程组中调用 interrupt() 方法
  • 到这里可能有人还会很疑惑,“这跟 CreateConnectionThread 有什么关系?”,疑惑点无非就是 MQ 的线程和 Druid 的线程为什么会有关联,上面已经讲了,MQ 的 GroupConsumer 调用 doStop 进行退出的时候,调用的是 ThreadGroup 的 interrupt() 方法,现在只需要搞清楚 Druid 的 ThreadGroup 是如何定义的,就能知道两者的确切关系了;
  • 首先来看一下 Druid 初始化的两种方式(具体的初始化细节,不详细讲了):
    • 手动调用初始化,即代码中配置数据源时,在 init-method 属性直接调用 init 方法 init-method="init"
    • 如果没有配置初始化,那当程序第一次调用 getConnection 方法时也会自动触发 init 方法完成初始化逻辑。
  • 而在 Druid 初始化时,CreateConnectionThread 的创建是调用的父类 Threadstart() 方法进行,所以通过源码可以看出 Druid 的 ThreadGroup 是根据当前调用线程来进行定义的;
private CreateConnectionThread           createConnectionThread;

protected void createAndStartCreatorThread() {
    if (createScheduler == null) {
        String threadName = "Druid-ConnectionPool-Create-" + System.identityHashCode(this);
        createConnectionThread = new CreateConnectionThread(threadName);
        createConnectionThread.start();
        return;
    }

    initedLatch.countDown();
}
/* The group of this thread */
private ThreadGroup group;

public synchronized void start() {
    /**
     * This method is not invoked for the main method thread or "system"
     * group threads created/set up by the VM. Any new functionality added
     * to this method in the future may have to also be added to the VM.
     *
     * A zero status value corresponds to state "NEW".
     */
    if (threadStatus != 0)
        throw new IllegalThreadStateException();

    /* Notify the group that this thread is about to be started
     * so that it can be added to the group's list of threads
     * and the group's unstarted count can be decremented. */
    group.add(this);

    boolean started = false;
    try {
        start0();
        started = true;
    } finally {
        try {
            if (!started) {
                group.threadStartFailed(this);
            }
        } catch (Throwable ignore) {
            /* do nothing. If start0 threw a Throwable then
              it will be passed up the call stack */
        }
    }
}
  • 再来回顾一下业务场景,消费 MQ 并将数据进行落库;
  • 结合 Druid 初始化的两种方式,查看系统 DruidDataSource 配置,发现配置中没有进行 init 操作,那么 CreateConnectionThread 就是由第一次调用 getConnection 时创建的,而创建线程不言而喻,因此 MQ 调用 ThreadGroup.interrupt() 方法时,CreateConnectionThread 会联动退出

解决方案

  • 升级新版本 MQ(已解决问题版本);
  • 在 DruidDataSource 配置文件加上 init 配置或者手动调用 getConnection,避免受到 MQ 线程的影响。