Goroutine scheduler 及 Worker thread parking/unparking

Goroutine scheduler

The scheduler's job is to distribute ready-to-run goroutines over worker threads.

The main concepts are:

  • G - goroutine.
  • M - worker thread, or machine.
  • P - processor, a resource that is required to execute Go code. An M must have an associated P to execute Go code; however, it can be blocked or in a syscall without an associated P.
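
As a heavily simplified mental model of how these pieces relate (the real declarations in runtime/runtime2.go carry far more state and use different field types), one can picture:

// Simplified sketch only; not the runtime's actual declarations.
type g struct {
	// stack, program counter, scheduling status, ...
}

type p struct {
	runq []*g // local queue of runnable goroutines owned by this P
	m    *m   // the M currently holding this P, if any
}

type m struct {
	p        *p   // P this M is attached to; required to execute Go code
	curg     *g   // goroutine currently running on this M, if any
	spinning bool // true while this M is searching for work
}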

Worker thread parking/unparking

We need to balance between keeping enough running worker threads to utilize available hardware parallelism and parking excessive running worker threads to conserve CPU resources and power. This is not simple for two reasons:

(1) scheduler state is intentionally distributed (in particular, per-P work queues), so it is not possible to compute global predicates on fast paths;

(2) for optimal thread management we would need to know the future (don't park a worker thread when a new goroutine will be readied in near future).

Three rejected approaches that would work badly:

  1. Centralize all scheduler state (would inhibit scalability).

  2. Direct goroutine handoff. That is, when we ready a new goroutine and there is a spare P, unpark a thread and hand it the P and the goroutine. This would lead to thread state thrashing (see www.geeksforgeeks.org/techniques-…): the thread that readied the goroutine can be out of work the very next moment, and we would then need to park it. It would also destroy locality of computation, since we want to keep dependent goroutines on the same thread, and it would introduce additional latency.

  3. Unpark an additional thread whenever we ready a goroutine and there is an idle P, but don't do handoff. This would lead to excessive thread parking/unparking as the additional threads will instantly park without discovering any work to do.

The current approach:

This approach applies to three primary sources of potential work: readying a goroutine, new/modified-earlier timers, and idle-priority GC. See below for additional details.

We unpark an additional thread when we submit work if (this is wakep()):

  1. There is an idle P, and
  2. There are no "spinning" worker threads.

// Tries to add one more P to execute G's.
// Called when a G is made runnable (newproc, ready).
func wakep() {
	// No idle P: an extra thread would have nothing to run on.
	if atomic.Load(&sched.npidle) == 0 {
		return
	}
	// Be conservative about spinning threads: if another thread is already
	// spinning, or we lose the race to claim the spinning slot, do nothing.
	if atomic.Load(&sched.nmspinning) != 0 || !atomic.Cas(&sched.nmspinning, 0, 1) {
		return
	}
	startm(nil, true)
}

A worker thread is considered spinning if it is out of local work and did not find work in the global run queue or netpoller; the spinning state is denoted in m.spinning and in sched.nmspinning. Threads unparked this way are also considered spinning; we don't do goroutine handoff, so such threads are out of work initially. Before parking, spinning threads spin looking for work in per-P run queues and timer heaps or from the GC. If a spinning thread finds work, it takes itself out of the spinning state and proceeds to execution. If it does not find work, it takes itself out of the spinning state and then parks.

If there is at least one spinning thread (sched.nmspinning > 0), we don't unpark new threads when submitting work. To compensate for that, if the last spinning thread finds work and stops spinning, it must unpark a new spinning thread. This approach smooths out unjustified spikes of thread unparking, but at the same time guarantees eventual maximal CPU parallelism utilization.
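
In code, the compensation rule looks roughly like the sketch below (the function name is hypothetical; compare resetspinning in runtime/proc.go, which performs this step after a spinning M finds work):

// Sketch only: a spinning M that found work leaves the spinning state
// and then lets wakep() (shown above) decide whether a replacement
// spinning thread should be started for the remaining idle Ps.
func foundWorkWhileSpinning(mp *m) {
	mp.spinning = false
	if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 {
		throw("foundWorkWhileSpinning: negative nmspinning")
	}
	// wakep() is cheap when no replacement is needed: it returns
	// immediately unless there is an idle P and no spinning thread.
	wakep()
}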

The main implementation complication is that we need to be very careful during spinning->non-spinning thread transition. This transition can race with submission of new work, and either one part or another needs to unpark another worker thread. If they both fail to do that, we can end up with semi-persistent CPU underutilization.

The general pattern for submission is:

  1. Submit work to the local run queue, timer heap, or GC state.
  2. #StoreLoad-style memory barrier.
  3. Check sched.nmspinning.
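
To make the ordering concrete, here is an illustrative stand-alone model of the submission side, not runtime code: queuedWork, nmspinning and npidle stand in for a per-P run queue, sched.nmspinning and sched.npidle, and the #StoreLoad barrier comes from the fact that Go's sync/atomic operations are sequentially consistent.

package schedmodel // illustrative model, not the runtime

import "sync/atomic"

var (
	queuedWork atomic.Int32 // stands in for a per-P run queue
	nmspinning atomic.Int32 // stands in for sched.nmspinning
	npidle     atomic.Int32 // stands in for sched.npidle
)

func submitWork() {
	// 1. Submit work to the (modeled) local run queue.
	queuedWork.Add(1)
	// 2. #StoreLoad-style barrier: sync/atomic operations are sequentially
	//    consistent, so the store above is ordered before the loads below.
	// 3. Check nmspinning (and npidle), mirroring wakep() above: only wake
	//    a thread if nobody is already searching for work.
	if npidle.Load() == 0 {
		return
	}
	if nmspinning.Load() != 0 || !nmspinning.CompareAndSwap(0, 1) {
		return
	}
	startSpinningThread()
}

// startSpinningThread stands in for startm(nil, true); details omitted.
func startSpinningThread() {}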

The general pattern for spinning->non-spinning transition is:

  1. Decrement nmspinning.
  2. #StoreLoad-style memory barrier.
  3. Check all per-P work queues and GC for new work.
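
The matching transition side of the race, continuing the same illustrative model (parkThread is a placeholder for the real parking path, stopm):

// The spinning->non-spinning transition on a worker thread that has run
// out of work. Names are illustrative, not the runtime's.
func stopSpinningAndMaybePark() {
	// 1. Decrement nmspinning: this thread stops searching for work.
	nmspinning.Add(-1)
	// 2. #StoreLoad-style barrier, again supplied by sync/atomic's
	//    sequentially consistent operations.
	// 3. Re-check all work sources. If work was submitted concurrently,
	//    its submitter may have seen nmspinning != 0 and skipped the
	//    wakeup, so this thread must pick the work up itself.
	if queuedWork.Load() > 0 {
		nmspinning.Add(1) // become spinning again and go take the work
		return
	}
	parkThread() // genuinely nothing to do: park until woken
}

// parkThread stands in for stopm(); details omitted.
func parkThread() {}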

Note that all this complexity does not apply to the global run queue, as we are not sloppy about thread unparking when submitting to the global queue. Also see comments for nmspinning manipulation.
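
For contrast, a hedged sketch of global-queue submission (names like submitToGlobalQueue and globalRunQueuePush are hypothetical; in the runtime, injectglist does something similar): because the submitter serializes on sched.lock, it can simply start a worker thread for every goroutine an idle P could pick up, without the lock-free dance above.

// Hypothetical sketch: submission to the global run queue is serialized
// on sched.lock, so we eagerly start Ms for idle Ps instead of relying
// on the spinning-thread protocol.
func submitToGlobalQueue(gs []*g) {
	lock(&sched.lock)
	globalRunQueuePush(gs) // stands in for globrunqputbatch
	unlock(&sched.lock)

	n := len(gs)
	for ; n > 0 && atomic.Load(&sched.npidle) != 0; n-- {
		startm(nil, false) // start a non-spinning M to run queued work
	}
}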

How these different sources of work behave varies, though it doesn't affect the synchronization approach:

  • Ready goroutine: this is an obvious source of work; the goroutine is immediately ready and must run on some thread eventually.
  • New/modified-earlier timer: The current timer implementation (see time.go) uses netpoll in a thread with no work available to wait for the soonest timer; see the sketch after this list. If there is no thread waiting, we want a new spinning thread to go wait.
  • Idle-priority GC: The GC wakes a stopped idle thread to contribute to background GC work (note: currently disabled per golang.org/issue/19112). Also see golang.org/issue/44313, as this should be extended to all GC workers.
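
A hedged sketch of the timer case above (earliestTimerWhen, pollNetworkUntil and makeRunnable are placeholder names; in the runtime this logic lives in findRunnable, which computes a poll deadline and passes it to netpoll):

// Hypothetical sketch: a thread with no runnable goroutines waits in the
// network poller for at most the delay until the earliest timer, so that
// timer still fires even if no other thread is active.
func waitForWorkOrTimer() {
	delay := int64(-1) // negative: block until I/O arrives or we are woken
	if when := earliestTimerWhen(); when > 0 {
		delay = when - nanotime()
		if delay < 0 {
			delay = 0 // the timer is already due; poll without blocking
		}
	}
	gs := pollNetworkUntil(delay) // stands in for the runtime's netpoll
	makeRunnable(gs)              // ready any goroutines whose I/O completed
}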