Revisiting the Thread Pool: ThreadPoolExecutor - execute

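An interview question about keepAliveTime

A few days ago I ran into an interview question: how does the Java thread pool keep track of whether a thread has expired?

I had never really thought this through. My first guess was that the pool keeps some kind of timer that watches whether each thread has been idle for too long. But that seems heavy, and designers this clever surely found something better. With that question in mind, let's read the execute method.

The Javadoc of execute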

Executes the given task sometime in the future. The task may execute in a new thread or in an existing pooled thread. If the task cannot be submitted for execution, either because this executor has been shutdown or because its capacity has been reached, the task is handled by the current RejectedExecutionHandler. Params: command – the task to execute Throws: RejectedExecutionException – at discretion of RejectedExecutionHandler, if the task cannot be accepted for execution NullPointerException – if command is null

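In other words: the given task is executed at some point in the future, either in a new thread or in an existing pooled thread. If the task cannot be submitted for execution, because the executor has been shut down or because its capacity has been reached, it is handed to the current RejectedExecutionHandler.

Parameter: command - the task to execute

Throws: RejectedExecutionException - at the discretion of the RejectedExecutionHandler, if the task cannot be accepted for execution

NullPointerException - if command is null

Source code of execute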

public void execute(Runnable command) {
    // Throw a NullPointerException if command is null
    if (command == null)
        throw new NullPointerException();
    /*
     * Proceed in 3 steps:
     *
     * 1. If fewer than corePoolSize threads are running, try to
     * start a new thread with the given command as its first
     * task.  The call to addWorker atomically checks runState and
     * workerCount, and so prevents false alarms that would add
     * threads when it shouldn't, by returning false.
     *
     * 2. If a task can be successfully queued, then we still need
     * to double-check whether we should have added a thread
     * (because existing ones died since last checking) or that
     * the pool shut down since entry into this method. So we
     * recheck state and if necessary roll back the enqueuing if
     * stopped, or start a new thread if there are none.
     *
     * 3. If we cannot queue task, then we try to add a new
     * thread.  If it fails, we know we are shut down or saturated
     * and so reject the task.
     */
     
     /*
      * The same three steps, condensed:
      * 1. workerCount < corePoolSize: start a core worker with this task
      *    (addWorker checks runState and workerCount atomically).
      * 2. Otherwise enqueue, then recheck: if the pool stopped running, roll
      *    back the enqueue and reject; if no worker is left, start one.
      * 3. If the queue is full, start a non-core worker; if that fails too,
      *    the pool is shut down or saturated, so reject.
      */
     
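    // Read ctl, which packs the run state and the worker count
    int c = ctl.get();
    
    // Fewer workers than corePoolSize: try to start a core worker for this task
    if (workerCountOf(c) < corePoolSize) {
        // Worker added, nothing more to do
        if (addWorker(command, true))
            return;
        // Otherwise re-read ctl to get the latest value
        c = ctl.get();
    }
    // No worker was added, or workerCount >= corePoolSize: try the queue.
    // This branch is taken only if the pool is RUNNING and the task was enqueued.
    if (isRunning(c) && workQueue.offer(command)) {
    
        // Re-read the pool state. If it is no longer RUNNING and the task
        // can still be removed from the queue, reject it.
        // Otherwise, if no worker is left, start one without a first task
        // so that the queued task gets picked up.
        int recheck = ctl.get();
        if (! isRunning(recheck) && remove(command))
            reject(command);
        else if (workerCountOf(recheck) == 0)
            addWorker(null, false);
    } 
    // Not RUNNING, or the queue is full: try a non-core worker; if addWorker
    // also returns false, reject the task.
    // (The second argument is false here, unlike the first call. It selects the
    // bound: corePoolSize when true, maximumPoolSize when false. See addWorker.)
    else if (!addWorker(command, false))
        reject(command);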
}
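execute keeps reading ctl and splitting it with workerCountOf and isRunning. As a side note, here is a minimal sketch of how one int can carry both values. It assumes the layout used in the JDK 8 source (run state in the high 3 bits, worker count in the low 29 bits); the class name CtlSketch is made up for illustration and only three of the five states are shown.

```java
class CtlSketch {
    // Assumed layout (as in the JDK 8 source): high 3 bits = run state,
    // low 29 bits = worker count.
    static final int COUNT_BITS = Integer.SIZE - 3;
    static final int CAPACITY   = (1 << COUNT_BITS) - 1;

    static final int RUNNING  = -1 << COUNT_BITS;
    static final int SHUTDOWN =  0 << COUNT_BITS;
    static final int STOP     =  1 << COUNT_BITS;

    // Pack a run state and a worker count into one int
    static int ctlOf(int rs, int wc)   { return rs | wc; }
    // Keep only the high bits
    static int runStateOf(int c)       { return c & ~CAPACITY; }
    // Keep only the low bits
    static int workerCountOf(int c)    { return c & CAPACITY; }
    // RUNNING is the only negative state, so a plain comparison is enough
    static boolean isRunning(int c)    { return c < SHUTDOWN; }

    public static void main(String[] args) {
        int c = ctlOf(RUNNING, 3);
        System.out.println(workerCountOf(c));              // 3
        System.out.println(runStateOf(c) == RUNNING);      // true
        System.out.println(isRunning(ctlOf(SHUTDOWN, 3))); // false
    }
}
```

Because both values live in one AtomicInteger, a single compare-and-set can check the state and bump the count at the same time, which is what addWorker relies on.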

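The point to take away is when a thread is added and when the task is queued:

If fewer than corePoolSize workers are running, a thread is added (via addWorker). Once the worker count has reached corePoolSize, the task goes to the queue. If the queue is full, more threads are added up to maximumPoolSize; these are the non-core threads. If that fails as well, the task is rejected.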
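The order can be observed from the outside. The following is a small sketch, not part of the JDK: the class name ExecuteOrderDemo and the sizes (corePoolSize 1, maximumPoolSize 2, queue capacity 1) are arbitrary choices, and the default AbortPolicy is assumed as the rejection handler.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ExecuteOrderDemo {
    /** Returns {pool size after task 1, queue size after task 2,
     *  pool size after task 3, 1 if task 4 was rejected else 0}. */
    static int[] run() throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        // Every task blocks until released, so workers stay busy
        Runnable blocker = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 2, 1, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));
        int[] seen = new int[4];
        try {
            pool.execute(blocker);            // below corePoolSize: core worker
            seen[0] = pool.getPoolSize();
            pool.execute(blocker);            // core is full: goes to the queue
            seen[1] = pool.getQueue().size();
            pool.execute(blocker);            // queue is full: non-core worker
            seen[2] = pool.getPoolSize();
            try {
                pool.execute(blocker);        // queue full and max reached
            } catch (RejectedExecutionException e) {
                seen[3] = 1;
            }
        } finally {
            release.countDown();
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(run())); // [1, 1, 2, 1]
    }
}
```

One detail worth noticing: the third task starts before the second one, because it is handed directly to the new non-core worker while the second task is still waiting in the queue.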

addWorker

Now let's look at addWorker to see how a thread is added:

/*
 * Methods for creating, running and cleaning up after workers
 */

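The comment heading this part of the source names three concerns: creating workers, running them, and cleaning up after them.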

Checks if a new worker can be added with respect to current pool state and the given bound (either core or maximum). If so, the worker count is adjusted accordingly, and, if possible, a new worker is created and started, running firstTask as its first task. This method returns false if the pool is stopped or eligible to shut down. It also returns false if the thread factory fails to create a thread when asked. If the thread creation fails, either due to the thread factory returning null, or due to an exception (typically OutOfMemoryError in Thread.start()), we roll back cleanly. Params: firstTask – the task the new thread should run first (or null if none). Workers are created with an initial first task (in method execute()) to bypass queuing when there are fewer than corePoolSize threads (in which case we always start one), or when the queue is full (in which case we must bypass queue). Initially idle threads are usually created via prestartCoreThread or to replace other dying workers. core – if true use corePoolSize as bound, else maximumPoolSize. (A boolean indicator is used here rather than a value to ensure reads of fresh values after checking other pool state). Returns: true if successful

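In short: whether a worker can be added depends on the current pool state and the given bound (core or maximum). If it can, the worker count is adjusted and, if possible, a new worker is created and started with firstTask as its first task. The method returns false if the pool is stopped or eligible to shut down, and also if the thread factory fails to create a thread. If thread creation fails, either because the factory returns null or because of an exception (typically an OutOfMemoryError in Thread.start()), everything is rolled back cleanly.

Parameters: firstTask - the task the new thread should run first (null if none). execute() creates workers with a first task in order to bypass the queue, either when fewer than corePoolSize threads exist (a thread is always started then) or when the queue is full. Initially idle threads are usually created via prestartCoreThread or to replace dying workers.

core - if true, corePoolSize is the bound; otherwise maximumPoolSize.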
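The "initially idle threads" mentioned for firstTask can be seen with the public prestart methods. This is only a small sketch: the class name PrestartDemo and the pool sizes are made up, and it relies on prestartAllCoreThreads() returning the number of threads it started.

```java
import java.util.Arrays;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PrestartDemo {
    /** Returns {pool size before, threads started, pool size after}. */
    static int[] run() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 1, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        try {
            int before = pool.getPoolSize();             // threads are created lazily
            int started = pool.prestartAllCoreThreads(); // workers without a first task
            int after = pool.getPoolSize();
            return new int[] {before, started, after};
        } finally {
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(run())); // [0, 2, 2]
    }
}
```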


private boolean addWorker(Runnable firstTask, boolean core) {
    retry:
    for (;;) {
        int c = ctl.get();
        int rs = runStateOf(c);

        // Check if queue empty only if necessary.
        if (rs >= SHUTDOWN &&
            ! (rs == SHUTDOWN &&
               firstTask == null &&
               ! workQueue.isEmpty()))
            return false;

        for (;;) {
            int wc = workerCountOf(c);
            if (wc >= CAPACITY ||
                wc >= (core ? corePoolSize : maximumPoolSize))
                return false;
            if (compareAndIncrementWorkerCount(c))
                break retry;
            c = ctl.get();  // Re-read ctl
            if (runStateOf(c) != rs)
                continue retry;
            // else CAS failed due to workerCount change; retry inner loop
        }
    }

    boolean workerStarted = false;
    boolean workerAdded = false;
    Worker w = null;
    try {
        w = new Worker(firstTask);
        final Thread t = w.thread;
        if (t != null) {
            final ReentrantLock mainLock = this.mainLock;
            mainLock.lock();
            try {
                // Recheck while holding lock.
                // Back out on ThreadFactory failure or if
                // shut down before lock acquired.
                int rs = runStateOf(ctl.get());

                if (rs < SHUTDOWN ||
                    (rs == SHUTDOWN && firstTask == null)) {
                    if (t.isAlive()) // precheck that t is startable
                        throw new IllegalThreadStateException();
                    workers.add(w);
                    int s = workers.size();
                    if (s > largestPoolSize)
                        largestPoolSize = s;
                    workerAdded = true;
                }
            } finally {
                mainLock.unlock();
            }
            if (workerAdded) {
                t.start();
                workerStarted = true;
            }
        }
    } finally {
        if (! workerStarted)
            addWorkerFailed(w);
    }
    return workerStarted;
}

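addWorker: the first for loop

Start with the first for loop, the one labelled retry.

The first if says: when the pool is not RUNNING, and the situation is not (SHUTDOWN && firstTask == null && the workQueue is not empty), return false right away, meaning no worker is added.

Adding a worker while RUNNING needs no explanation. But why is it also allowed when SHUTDOWN && firstTask == null && the workQueue is not empty?

To answer that, recall what the SHUTDOWN state means: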

SHUTDOWN: Don't accept new tasks, but process queued tasks

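No new tasks are accepted, but tasks already in the queue are still processed. So what happens if the pool is already SHUTDOWN and the worker count has dropped to 0 while tasks remain in the queue? A new thread has to be added to drain them, and that is done by calling addWorker(null, false). This is why addWorker contains this condition.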
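The SHUTDOWN contract itself is easy to check from the outside. The sketch below does not force the "zero workers left" corner case; it only shows, under the assumption of a single-thread pool and the default AbortPolicy, that a task submitted after shutdown() is rejected while a task already sitting in the queue still runs. The class name ShutdownDrainDemo is made up.

```java
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class ShutdownDrainDemo {
    /** Returns {new task rejected after shutdown, queued task still ran}. */
    static boolean[] run() throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        AtomicBoolean queuedRan = new AtomicBoolean(false);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        // Task 1 keeps the only worker busy until released
        pool.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Task 2 has to wait in the queue
        pool.execute(() -> queuedRan.set(true));

        pool.shutdown();                       // RUNNING -> SHUTDOWN

        boolean rejected = false;
        try {
            pool.execute(() -> { });           // new task after shutdown
        } catch (RejectedExecutionException e) {
            rejected = true;
        }

        release.countDown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return new boolean[] {rejected, queuedRan.get()};
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(run())); // [true, true]
    }
}
```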

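Next comes another endless for loop. It reads the worker count and returns false if the count has reached CAPACITY or the configured bound (corePoolSize or maximumPoolSize, depending on core).

Otherwise it calls compareAndIncrementWorkerCount(c) to increment the worker count and, on success, leaves both loops with break retry.
If the CAS fails because ctl was changed by another thread in the meantime, ctl is re-read. If the run state now differs from rs, control goes back to the outer loop via continue retry; if only the worker count changed, the inner loop is retried.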

retry:
    for (;;) {
        // Read ctl
        int c = ctl.get();
        
        // Extract the pool's runState
        int rs = runStateOf(c);

        // Check if queue empty only if necessary.
        // If the state is SHUTDOWN or above (that is, anything but RUNNING), and
        // it is not the case that (state is SHUTDOWN and firstTask == null and
        // the workQueue is not empty), return false right away.
        if (rs >= SHUTDOWN &&
            ! (
                rs == SHUTDOWN &&
                firstTask == null &&
                ! workQueue.isEmpty()
               )
           )
            return false;

        for (;;) {
            int wc = workerCountOf(c);
            if (wc >= CAPACITY ||
                wc >= (core ? corePoolSize : maximumPoolSize))
                return false;
            if (compareAndIncrementWorkerCount(c))
                break retry;
            c = ctl.get();  // Re-read ctl
            if (runStateOf(c) != rs)
                continue retry;
            // else CAS failed due to workerCount change; retry inner loop
        }
    }
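The inner loop is an instance of a common pattern: read, check the bound, compare-and-set, retry on contention. Here is a stand-alone sketch of that pattern using a plain AtomicInteger instead of ctl. BoundedCounter and tryIncrement are hypothetical names, and the run-state check of the real loop is left out.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedCounter {
    private final AtomicInteger count = new AtomicInteger();
    private final int bound;

    BoundedCounter(int bound) {
        this.bound = bound;
    }

    /** Tries to reserve one slot; returns false once the bound is reached. */
    boolean tryIncrement() {
        for (;;) {
            int c = count.get();
            if (c >= bound)
                return false;                      // like wc >= corePoolSize / maximumPoolSize
            if (count.compareAndSet(c, c + 1))
                return true;                       // like compareAndIncrementWorkerCount(c)
            // CAS failed because another thread changed the value: re-read and retry
        }
    }

    int get() {
        return count.get();
    }

    /** Eight threads compete for five slots; returns {successes, final count}. */
    static int[] run() throws InterruptedException {
        BoundedCounter counter = new BoundedCounter(5);
        AtomicInteger successes = new AtomicInteger();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 100; j++) {
                    if (counter.tryIncrement())
                        successes.incrementAndGet();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads)
            t.join();
        return new int[] {successes.get(), counter.get()};
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(run())); // [5, 5]
    }
}
```

The count is reserved before any thread is created, which is why addWorker has to undo it (addWorkerFailed) when the thread cannot be started afterwards.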

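addWorker: the body

Once workerCount has been incremented, we reach the core logic of addWorker.

It first declares the booleans workerStarted and workerAdded, plus a local variable Worker w. This Worker is the object that gets added to the workers set; it wraps a thread, and we will look at it shortly.

Here is the structure of the rest of addWorker: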

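try {
    1. create the Worker (and, through the thread factory, its thread)
    lock {
        2. read the current pool state
        3. if RUNNING, or (SHUTDOWN and firstTask == null): // a thread may be added
        4.     add the worker w to workers (a HashSet that holds the Worker objects)
        5.     update largestPoolSize
        6.     mark the add as successful: workerAdded = true
    
    }
    
    if the worker was added, start its thread
} finally {
    if the worker was not started
        run addWorkerFailed(w)
}

return whether the worker was started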

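To summarize, addWorker does the following:

In the RUNNING state, it creates a new thread and starts it.
In the SHUTDOWN state, it creates and starts a new thread only if firstTask == null (and, per the first check, only while the queue is not empty).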

What exactly is Worker

Worker extends AQS (AbstractQueuedSynchronizer) and implements Runnable.

First, what its Javadoc says:

Class Worker mainly maintains interrupt control state for threads running tasks, along with other minor bookkeeping. This class opportunistically extends AbstractQueuedSynchronizer to simplify acquiring and releasing a lock surrounding each task execution. This protects against interrupts that are intended to wake up a worker thread waiting for a task from instead interrupting a task being run. We implement a simple non-reentrant mutual exclusion lock rather than use ReentrantLock because we do not want worker tasks to be able to reacquire the lock when they invoke pool control methods like setCorePoolSize. Additionally, to suppress interrupts until the thread actually starts running tasks, we initialize lock state to a negative value, and clear it upon start (in runWorker)

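In short: Worker mainly maintains the interrupt control state of the thread running tasks, plus some minor bookkeeping. It extends AQS opportunistically, to simplify acquiring and releasing a lock around each task execution. The lock ensures that an interrupt meant to wake up a worker waiting for a task does not instead interrupt a task that is already running. It is a simple non-reentrant mutual exclusion lock rather than a ReentrantLock, because a worker's task must not be able to reacquire the lock when it calls pool control methods such as setCorePoolSize. In addition, to suppress interrupts until the thread actually starts running tasks, the lock state is initialized to a negative value and cleared on start, in runWorker.

So, besides the lock design of Worker, the most important thing to read is the runWorker method.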

private final class Worker
    extends AbstractQueuedSynchronizer
    implements Runnable
{
    /**
     * This class will never be serialized, but we provide a
     * serialVersionUID to suppress a javac warning.
     */
    private static final long serialVersionUID = 6138294804551838833L;

    /** Thread this worker is running in.  Null if factory fails. */
    final Thread thread;
    /** Initial task to run.  Possibly null. */
    Runnable firstTask;
    /** Per-thread task counter */
    volatile long completedTasks;

    /**
     * Creates with given first task and thread from ThreadFactory.
     * @param firstTask the first task (null if none)
     */
    Worker(Runnable firstTask) {
        setState(-1); // inhibit interrupts until runWorker
        this.firstTask = firstTask;
        this.thread = getThreadFactory().newThread(this);
    }

    /** Delegates main run loop to outer runWorker  */
    public void run() {
        runWorker(this);
    }

    // Lock methods
    //
    // The value 0 represents the unlocked state.
    // The value 1 represents the locked state.

    protected boolean isHeldExclusively() {
        return getState() != 0;
    }

    protected boolean tryAcquire(int unused) {
        if (compareAndSetState(0, 1)) {
            setExclusiveOwnerThread(Thread.currentThread());
            return true;
        }
        return false;
    }

    protected boolean tryRelease(int unused) {
        setExclusiveOwnerThread(null);
        setState(0);
        return true;
    }

    public void lock()        { acquire(1); }
    public boolean tryLock()  { return tryAcquire(1); }
    public void unlock()      { release(1); }
    public boolean isLocked() { return isHeldExclusively(); }

    void interruptIfStarted() {
        Thread t;
        if (getState() >= 0 && (t = thread) != null && !t.isInterrupted()) {
            try {
                t.interrupt();
            } catch (SecurityException ignore) {
            }
        }
    }
}
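The difference between this lock and a ReentrantLock fits in a few lines. The sketch below copies only the tryAcquire/tryRelease idea from Worker into a made-up class, NonReentrantMutex, and compares it with java.util.concurrent.locks.ReentrantLock; it is an illustration, not the JDK code.

```java
import java.util.Arrays;
import java.util.concurrent.locks.AbstractQueuedSynchronizer;
import java.util.concurrent.locks.ReentrantLock;

class NonReentrantMutex extends AbstractQueuedSynchronizer {
    // State 0 = unlocked, 1 = locked, the same convention Worker uses
    @Override
    protected boolean tryAcquire(int unused) {
        if (compareAndSetState(0, 1)) {
            setExclusiveOwnerThread(Thread.currentThread());
            return true;
        }
        return false; // also fails when the current thread already holds the lock
    }

    @Override
    protected boolean tryRelease(int unused) {
        setExclusiveOwnerThread(null);
        setState(0);
        return true;
    }

    boolean tryLock() { return tryAcquire(1); }
    void unlock()     { release(1); }

    /** Returns {first tryLock, second tryLock by the same thread,
     *  tryLock after unlock, ReentrantLock re-acquired by its holder}. */
    static boolean[] run() {
        NonReentrantMutex mutex = new NonReentrantMutex();
        boolean first = mutex.tryLock();
        boolean second = mutex.tryLock();   // same thread, but not reentrant
        mutex.unlock();
        boolean third = mutex.tryLock();
        mutex.unlock();

        ReentrantLock reentrant = new ReentrantLock();
        reentrant.lock();
        boolean again = reentrant.tryLock(); // the holder may lock again
        reentrant.unlock();
        reentrant.unlock();
        return new boolean[] {first, second, third, again};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run())); // [true, false, true, true]
    }
}
```

This is what makes tryLock usable as an "is this worker idle?" test: while a task is running the worker holds its own lock, so a tryLock from a pool control method fails even when that method is called from the task itself.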

runWorker, the method Worker.run delegates to

First, the Javadoc of runWorker.

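The main worker run loop: it repeatedly gets tasks from the queue and executes them, while coping with a number of issues. A summary first, followed by the original text:

A worker may start with an initial task and then does not need to fetch the first one. Otherwise, as long as the pool is running, tasks come from getTask. If getTask returns null, the worker exits because the pool state or the configuration parameters changed. Any other exit is caused by an exception thrown from external code; in that case completedAbruptly is true, which usually makes processWorkerExit replace the thread.
Before a task runs, the worker's lock is acquired so that other pool interrupts cannot hit the task while it executes, and the thread's interrupt flag is cleared unless the pool is stopping.
Each task is preceded by a call to beforeExecute, which may throw. In that case the thread dies without processing the task (the loop breaks with completedAbruptly true).
If beforeExecute completes normally, the task runs and anything it throws is collected and passed to afterExecute. RuntimeException, Error and arbitrary Throwables are handled separately. Since a Throwable cannot be rethrown from Runnable.run, it is wrapped in an Error on the way out (to the thread's UncaughtExceptionHandler). Any thrown exception also causes the thread to die.
After task.run completes, afterExecute is called. It may throw as well, which also kills the thread. According to JLS Sec 14.20, that exception is the one in effect even if task.run threw.

The net effect of the exception mechanics is that afterExecute and the thread's UncaughtExceptionHandler get information about problems in user code that is as accurate as possible.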

Main worker run loop. Repeatedly gets tasks from queue and executes them, while coping with a number of issues:

We may start out with an initial task, in which case we don't need to get the first one. Otherwise, as long as pool is running, we get tasks from getTask. If it returns null then the worker exits due to changed pool state or configuration parameters. Other exits result from exception throws in external code, in which case completedAbruptly holds, which usually leads processWorkerExit to replace this thread.

Before running any task, the lock is acquired to prevent other pool interrupts while the task is executing, and then we ensure that unless pool is stopping, this thread does not have its interrupt set.

Each task run is preceded by a call to beforeExecute, which might throw an exception, in which case we cause thread to die (breaking loop with completedAbruptly true) without processing the task.

Assuming beforeExecute completes normally, we run the task, gathering any of its thrown exceptions to send to afterExecute. We separately handle RuntimeException, Error (both of which the specs guarantee that we trap) and arbitrary Throwables. Because we cannot rethrow Throwables within Runnable.run, we wrap them within Errors on the way out (to the thread's UncaughtExceptionHandler). Any thrown exception also conservatively causes thread to die.

After task.run completes, we call afterExecute, which may also throw an exception, which will also cause thread to die. According to JLS Sec 14.20, this exception is the one that will be in effect even if task.run throws.

The net effect of the exception mechanics is that afterExecute and the thread's UncaughtExceptionHandler have as accurate information as we can provide about any problems encountered by user code. Params: w – the worker

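Now the source of runWorker. From the Javadoc above we know that beforeExecute is called before run and afterExecute after it, and that the outermost finally calls processWorkerExit, which handles the exit of the thread.

The main logic hinges on the condition of the while loop: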

while (task != null || (task = getTask()) != null)

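The first source of a task is the one passed in through addWorker (firstTask), which is run first. The second source is task = getTask(), which takes a task from the blocking queue. So the loop keeps going as long as there is a first task or getTask() returns a non-null task, and the thread keeps picking up tasks to execute. That is the whole point of a pool: threads are reused for task after task, which avoids the cost of creating and destroying them.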

final void runWorker(Worker w) {
    Thread wt = Thread.currentThread();
    Runnable task = w.firstTask;
    w.firstTask = null;
    w.unlock(); // allow interrupts
    boolean completedAbruptly = true;
    try {
        while (task != null || (task = getTask()) != null) {
            w.lock();
            // If pool is stopping, ensure thread is interrupted;
            // if not, ensure thread is not interrupted.  This
            // requires a recheck in second case to deal with
            // shutdownNow race while clearing interrupt
            if ((runStateAtLeast(ctl.get(), STOP) ||
                 (Thread.interrupted() &&
                  runStateAtLeast(ctl.get(), STOP))) &&
                !wt.isInterrupted())
                wt.interrupt();
            try {
                beforeExecute(wt, task);
                Throwable thrown = null;
                try {
                    task.run();
                } catch (RuntimeException x) {
                    thrown = x; throw x;
                } catch (Error x) {
                    thrown = x; throw x;
                } catch (Throwable x) {
                    thrown = x; throw new Error(x);
                } finally {
                    afterExecute(task, thrown);
                }
            } finally {
                task = null;
                w.completedTasks++;
                w.unlock();
            }
        }
        completedAbruptly = false;
    } finally {
        processWorkerExit(w, completedAbruptly);
    }
}
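The hook order described in the Javadoc can be traced with a small subclass. This is a sketch under a few assumptions: HookDemo is a made-up class, the pool has a single worker so the event order is deterministic, and a custom ThreadFactory installs an empty UncaughtExceptionHandler only to keep the stack trace of the deliberately failing task off the console.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class HookDemo {
    static List<String> run() throws InterruptedException {
        List<String> events = Collections.synchronizedList(new ArrayList<>());

        // Swallow the uncaught exception so the demo output stays clean
        ThreadFactory quiet = r -> {
            Thread t = new Thread(r);
            t.setUncaughtExceptionHandler((thread, ex) -> { });
            return t;
        };

        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<>(), quiet) {
            @Override
            protected void beforeExecute(Thread t, Runnable r) {
                events.add("before");
            }

            @Override
            protected void afterExecute(Runnable r, Throwable thrown) {
                events.add("after:" + (thrown == null
                        ? "ok" : thrown.getClass().getSimpleName()));
            }
        };

        pool.execute(() -> events.add("run"));
        pool.execute(() -> { throw new IllegalStateException("boom"); });
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run());
        // [before, run, after:ok, before, after:IllegalStateException]
    }
}
```

Note that this uses execute, not submit. With submit the task is wrapped in a FutureTask, which captures the exception itself, so afterExecute would receive null for thrown.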

processWorkerExit(w, completedAbruptly)

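Next, the exit method: the outermost finally of runWorker calls processWorkerExit directly.

Its main job is to remove the Worker from workers, the HashSet<Worker> that holds the pool's workers. Its Javadoc: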

Performs cleanup and bookkeeping for a dying worker. Called only from worker threads. Unless completedAbruptly is set, assumes that workerCount has already been adjusted to account for exit. This method removes thread from worker set, and possibly terminates the pool or replaces the worker if either it exited due to user task exception or if fewer than corePoolSize workers are running or queue is non-empty but there are no workers.

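In short: it performs cleanup and bookkeeping for a dying worker and is called only from worker threads. Unless completedAbruptly is set, it assumes workerCount has already been adjusted to account for the exit. It removes the worker from the worker set, and possibly terminates the pool or replaces the worker: if it exited because of a user task exception, if fewer than corePoolSize workers are running, or if the queue is non-empty but no workers are left.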

private void processWorkerExit(Worker w, boolean completedAbruptly) {
    if (completedAbruptly) // If abrupt, then workerCount wasn't adjusted
        decrementWorkerCount();

    final ReentrantLock mainLock = this.mainLock;
    mainLock.lock();
    try {
        completedTaskCount += w.completedTasks;
        workers.remove(w);
    } finally {
        mainLock.unlock();
    }

    tryTerminate();

    int c = ctl.get();
    if (runStateLessThan(c, STOP)) {
        if (!completedAbruptly) {
            int min = allowCoreThreadTimeOut ? 0 : corePoolSize;
            if (min == 0 && ! workQueue.isEmpty())
                min = 1;
            if (workerCountOf(c) >= min)
                return; // replacement not needed
        }
        addWorker(null, false);
    }
}
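The replacement of a worker that died from a task exception can be made visible by recording which thread runs each task. A sketch with made-up names (WorkerReplacementDemo), a single-worker pool, and the same quiet ThreadFactory trick to suppress the stack trace of the failing task:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class WorkerReplacementDemo {
    /** Returns true if both tasks ran and the second one ran on a different thread. */
    static boolean run() throws InterruptedException {
        List<Thread> ranOn = Collections.synchronizedList(new ArrayList<>());

        ThreadFactory quiet = r -> {
            Thread t = new Thread(r);
            t.setUncaughtExceptionHandler((thread, ex) -> { });
            return t;
        };
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS, new LinkedBlockingQueue<>(), quiet);

        // Task 1 records its thread, then kills it with an exception
        pool.execute(() -> {
            ranOn.add(Thread.currentThread());
            throw new IllegalStateException("boom");
        });
        // Task 2 can only run once some other worker exists
        pool.execute(() -> ranOn.add(Thread.currentThread()));

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return ranOn.size() == 2 && ranOn.get(0) != ranOn.get(1);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run()); // true
    }
}
```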

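Where keepAliveTime is tracked and threads are retired

Back to the original question: we set keepAliveTime, so where does it take effect?

Reading through runWorker, there is no special handling of keepAliveTime. That forces a rethink: once the loop in runWorker ends, the thread exit logic runs immediately, so where is the time measured? A thread is either working or idle, and it is reclaimed once it has been idle for longer than keepAliveTime. At that point its task has finished and the queue is presumably empty, so the place to look is task = getTask():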

private Runnable getTask() {
    boolean timedOut = false; // Did the last poll() time out?

    for (;;) {
        int c = ctl.get();
        int rs = runStateOf(c);

        // Check if queue empty only if necessary.
        if (rs >= SHUTDOWN && (rs >= STOP || workQueue.isEmpty())) {
            decrementWorkerCount();
            return null;
        }

        int wc = workerCountOf(c);

        // Are workers subject to culling?
        boolean timed = allowCoreThreadTimeOut || wc > corePoolSize;

        if ((wc > maximumPoolSize || (timed && timedOut))
            && (wc > 1 || workQueue.isEmpty())) {
            if (compareAndDecrementWorkerCount(c))
                return null;
            continue;
        }

        try {
            Runnable r = timed ?
                workQueue.poll(keepAliveTime, TimeUnit.NANOSECONDS) :
                workQueue.take();
            if (r != null)
                return r;
            timedOut = true;
        } catch (InterruptedException retry) {
            timedOut = false;
        }
    }
}

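The key is this line: Runnable r = timed ? workQueue.poll(keepAliveTime, TimeUnit.NANOSECONDS) : workQueue.take();

workQueue.poll(keepAliveTime, TimeUnit.NANOSECONDS) is where keepAliveTime is used. workQueue is a blocking queue: if it is empty, poll blocks for up to keepAliveTime and then returns null, so timedOut becomes true. On the next iteration, since timed && timedOut holds, getTask decrements the worker count and returns null; the while loop in runWorker ends and processWorkerExit is called. So there is no timer at all: each idle worker measures its own idle time through the timed poll. Note that timed is true only when allowCoreThreadTimeOut is set or the worker count exceeds corePoolSize; otherwise the worker blocks in take() indefinitely.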
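Stripped of everything else, the idea is just a thread whose loop condition is a timed poll. The following sketch is not the JDK implementation; MiniWorkerSketch and startWorker are hypothetical names, and state handling, locking and worker counting are all left out.

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class MiniWorkerSketch {
    /** A stripped-down worker: it exits once poll() times out. */
    static Thread startWorker(BlockingQueue<Runnable> queue, long keepAliveMillis) {
        Thread t = new Thread(() -> {
            try {
                Runnable task;
                // poll() returns null if nothing arrives within keepAliveMillis
                while ((task = queue.poll(keepAliveMillis, TimeUnit.MILLISECONDS)) != null) {
                    task.run();
                }
                // Leaving the loop ends run(), so the thread terminates: no timer needed
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    /** Returns {the queued task ran, the worker exited after being idle}. */
    static boolean[] run() throws InterruptedException {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        CountDownLatch ran = new CountDownLatch(1);
        queue.put(ran::countDown);

        Thread worker = startWorker(queue, 200);
        boolean taskRan = ran.await(5, TimeUnit.SECONDS);
        worker.join(5000);                 // the worker should give up after ~200 ms idle
        return new boolean[] {taskRan, !worker.isAlive()};
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(run())); // [true, true]
    }
}
```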

/**
 * Retrieves and removes the head of this queue, waiting up to the
 * specified wait time if necessary for an element to become available.
 *
 * @param timeout how long to wait before giving up, in units of
 *        {@code unit}
 * @param unit a {@code TimeUnit} determining how to interpret the
 *        {@code timeout} parameter
 * @return the head of this queue, or {@code null} if the
 *         specified waiting time elapses before an element is available
 * @throws InterruptedException if interrupted while waiting
 */
E poll(long timeout, TimeUnit unit)
    throws InterruptedException;
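To observe the effect on a real ThreadPoolExecutor, one can watch getPoolSize() fall back after the idle period. A minimal sketch, assuming a keepAliveTime of 100 ms and allowCoreThreadTimeOut(true) so that timed is true even for the single core thread; KeepAliveDemo is a made-up class name and the 5 second deadline is only a safety margin.

```java
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class KeepAliveDemo {
    /** Returns {pool size while the task runs, pool size after the idle period}. */
    static int[] run() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 100, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        // Makes "timed" true in getTask even for core threads
        pool.allowCoreThreadTimeOut(true);

        AtomicInteger sizeWhileRunning = new AtomicInteger(-1);
        CountDownLatch ran = new CountDownLatch(1);
        try {
            pool.execute(() -> {
                sizeWhileRunning.set(pool.getPoolSize());
                ran.countDown();
            });
            ran.await(5, TimeUnit.SECONDS);

            // Wait for the idle worker to time out in poll() and exit
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (pool.getPoolSize() > 0 && System.nanoTime() < deadline) {
                Thread.sleep(20);
            }
            return new int[] {sizeWhileRunning.get(), pool.getPoolSize()};
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(run())); // [1, 0]
    }
}
```

Without allowCoreThreadTimeOut(true) the single core worker would sit in take() and the pool size would stay at 1.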