GMP, stacks


Gs, Ms, Ps

A "G" is simply a goroutine. It's represented by type g. When a goroutine exits, its g object is returned to a pool of free gs and can later be reused for some other goroutine.

An "M" is an OS thread that can be executing user Go code, runtime code, a system call, or be idle. It's represented by type m. There can be any number of Ms at a time since any number of threads may be blocked in system calls.

A "P" is a processor, a resource that is required to execute Go code, such as scheduler and memory allocator state. It's represented by type p. There are exactly GOMAXPROCS Ps. All Ps are organized into an array; that is a requirement of work stealing. Changing GOMAXPROCS involves stopping and starting the world to resize the array of Ps.

When an M is willing to start executing Go code, it must pop a P from the list. When an M ends executing Go code, it pushes the P back to the list.

A P can be thought of like a CPU in the OS scheduler and the contents of the p type like per-CPU state. This is a good place to put state that needs to be sharded for efficiency, but doesn't need to be per-thread or per-goroutine.

The scheduler's job is to match up a G (the code to execute), an M (where to execute it), and a P (the rights and resources to execute it). When an M stops executing user Go code, for example by entering a system call, it returns its P to the idle P pool (when an M blocks, its P goes back to the idle pool). In order to resume executing user Go code, for example on return from a system call, it must acquire a P from the idle pool.

All g, m, and p objects are heap allocated, but are never freed, so their memory remains type stable. As a result, the runtime can avoid write barriers in the depths of the scheduler.

Why do we need P?

docs.google.com/document/d/…

Before Go 1.1 there were only Gs and Ms, with a single global run queue:

  • Creating, destroying, and scheduling a G all required taking the global run-queue lock, which caused heavy lock contention.
  • Goroutine switching: any M could execute any goroutine. By analogy, it is like every core of a multi-core CPU executing code from different threads in turn; this hurts cache locality and increases switching overhead.

Go 1.1 introduced the Processor (P) concept and implemented a work-stealing scheduler on top of Ps.

Scheduling

When a new G is created or an existing G becomes runnable, it is pushed onto the list of runnable goroutines of the current P. When a P finishes executing a G, it first tries to pop a G from its own list of runnable goroutines; if the list is empty, the P chooses a random victim (another P) and tries to steal half of its runnable goroutines.
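
The toy model below is only a sketch of that policy: pop from the local queue first, otherwise pick a victim and steal half of its queue. It is not the runtime's implementation — the real p (shown later) uses a fixed-size lock-free ring buffer runq plus a runnext slot and atomic operations.

package main

import (
	"fmt"
	"math/rand"
)

// proc is a toy stand-in for a P: an ID plus a slice of runnable
// "goroutine" IDs.
type proc struct {
	id   int
	runq []int
}

// findRunnable pops from the local queue first; if it is empty, it
// visits the other Ps (starting at a random offset, standing in for
// the scheduler's random victim selection) and steals half of the
// first non-empty queue it finds.
func findRunnable(self *proc, all []*proc) (int, bool) {
	if len(self.runq) == 0 {
		start := rand.Intn(len(all))
		for i := 0; i < len(all); i++ {
			victim := all[(start+i)%len(all)]
			if victim == self || len(victim.runq) == 0 {
				continue
			}
			n := (len(victim.runq) + 1) / 2 // steal half, rounded up
			self.runq = append(self.runq, victim.runq[:n]...)
			victim.runq = victim.runq[n:]
			break
		}
	}
	if len(self.runq) == 0 {
		return 0, false // nothing runnable anywhere
	}
	g := self.runq[0]
	self.runq = self.runq[1:]
	return g, true
}

func main() {
	p0 := &proc{id: 0, runq: []int{1, 2, 3, 4, 5, 6}}
	p1 := &proc{id: 1} // empty local queue: must steal
	all := []*proc{p0, p1}

	g, ok := findRunnable(p1, all)
	fmt.Println("p1 runs g:", g, ok)      // p1 steals {1,2,3} and runs 1
	fmt.Println("p0 run queue:", p0.runq) // [4 5 6]
	fmt.Println("p1 run queue:", p1.runq) // [2 3]
}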

numbers of M and P

Number of Ps: determined by the GOMAXPROCS environment variable or by calling runtime.GOMAXPROCS(). The default is the number of CPU cores.
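
For example, both values can be inspected from a program; passing a value below 1 to runtime.GOMAXPROCS reports the current setting without changing it:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) only reports the current number of Ps.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	fmt.Println("NumCPU:    ", runtime.NumCPU())

	// Changing it resizes the array of Ps; the runtime briefly
	// stops the world to do so. The previous value is returned.
	prev := runtime.GOMAXPROCS(2)
	fmt.Println("previous setting:", prev)
}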

The number of Ms can grow up to the default limit of 10,000, but most of them do not execute user code; they are blocked in system calls. At most GOMAXPROCS threads are running user Go code at any moment.

The number of Ms and the number of Ps have no fixed relationship. When an M blocks, its P is handed off to another M (creating one if necessary), so even with GOMAXPROCS set to 1 there may be many Ms.
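
A rough way to observe this, assuming Linux or macOS: syscall.Pipe plus raw syscall.Read bypasses the netpoller, so every blocked reader really pins one M in a system call, and the runtime spins up extra Ms so the single P can keep running the remaining goroutines. The "threadcreate" pprof profile counts the OS threads the runtime has created.

package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // a single P

	// An empty pipe: a raw read on it blocks forever.
	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		panic(err)
	}

	for i := 0; i < 20; i++ {
		go func() {
			buf := make([]byte, 1)
			syscall.Read(fds[0], buf) // the M blocks here; the P is handed to another M
		}()
	}

	time.Sleep(time.Second)
	// Typically prints a number well above GOMAXPROCS.
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
}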

when to create P, M

P: once the number of Ps is determined, they are created by the runtime. M: a new M is created when there are not enough Ms to run the Ps' work; e.g. when all existing Ms are blocked, a new M is created to run a P's runnable goroutines.

p

type p struct {
   id          int32
   status      uint32 // one of pidle/prunning/...
   link        puintptr
   schedtick   uint32     // incremented on every scheduler call
   syscalltick uint32     // incremented on every system call
   sysmontick  sysmontick // last tick observed by sysmon
   m           muintptr   // back-link to associated m (nil if idle)
   mcache      *mcache
   pcache      pageCache
   raceprocctx uintptr

   deferpool    []*_defer // pool of available defer structs (see panic.go)
   deferpoolbuf [32]*_defer

   // Cache of goroutine ids, amortizes accesses to runtime·sched.goidgen.
   goidcache    uint64
   goidcacheend uint64

   // Queue of runnable goroutines. Accessed without lock.
   runqhead uint32
   runqtail uint32
   runq     [256]guintptr
   // runnext, if non-nil, is a runnable G that was ready'd by
   // the current G and should be run next instead of what's in
   // runq if there's time remaining in the running G's time
   // slice. It will inherit the time left in the current time
   // slice. If a set of goroutines is locked in a
   // communicate-and-wait pattern, this schedules that set as a
   // unit and eliminates the (potentially large) scheduling
   // latency that otherwise arises from adding the ready'd
   // goroutines to the end of the run queue.
   //
   // Note that while other P's may atomically CAS this to zero,
   // only the owner P can CAS it to a valid G.
   runnext guintptr

   // Available G's (status == Gdead)
   gFree struct {
      gList
      n int32
   }

   sudogcache []*sudog
   sudogbuf   [128]*sudog

   // Cache of mspan objects from the heap.
   mspancache struct {
      // We need an explicit length here because this field is used
      // in allocation codepaths where write barriers are not allowed,
      // and eliminating the write barrier/keeping it eliminated from
      // slice updates is tricky, moreso than just managing the length
      // ourselves.
      len int
      buf [128]*mspan
   }

   tracebuf traceBufPtr

   // traceSweep indicates the sweep events should be traced.
   // This is used to defer the sweep start event until a span
   // has actually been swept.
   traceSweep bool
   // traceSwept and traceReclaimed track the number of bytes
   // swept and reclaimed by sweeping in the current sweep loop.
   traceSwept, traceReclaimed uintptr

   palloc persistentAlloc // per-P to avoid mutex

   _ uint32 // Alignment for atomic fields below

   // The when field of the first entry on the timer heap.
   // This is updated using atomic functions.
   // This is 0 if the timer heap is empty.
   timer0When uint64

   // The earliest known nextwhen field of a timer with
   // timerModifiedEarlier status. Because the timer may have been
   // modified again, there need not be any timer with this value.
   // This is updated using atomic functions.
   // This is 0 if there are no timerModifiedEarlier timers.
   timerModifiedEarliest uint64

   // Per-P GC state
   gcAssistTime         int64 // Nanoseconds in assistAlloc
   gcFractionalMarkTime int64 // Nanoseconds in fractional mark worker (atomic)

   // gcMarkWorkerMode is the mode for the next mark worker to run in.
   // That is, this is used to communicate with the worker goroutine
   // selected for immediate execution by
   // gcController.findRunnableGCWorker. When scheduling other goroutines,
   // this field must be set to gcMarkWorkerNotWorker.
   gcMarkWorkerMode gcMarkWorkerMode
   // gcMarkWorkerStartTime is the nanotime() at which the most recent
   // mark worker started.
   gcMarkWorkerStartTime int64

   // gcw is this P's GC work buffer cache. The work buffer is
   // filled by write barriers, drained by mutator assists, and
   // disposed on certain GC state transitions.
   gcw gcWork

   // wbBuf is this P's GC write barrier buffer.
   //
   // TODO: Consider caching this in the running G.
   wbBuf wbBuf

   runSafePointFn uint32 // if 1, run sched.safePointFn at next safe point

   // statsSeq is a counter indicating whether this P is currently
   // writing any stats. Its value is even when not, odd when it is.
   statsSeq uint32

   // Lock for timers. We normally access the timers while running
   // on this P, but the scheduler can also do it from a different P.
   timersLock mutex

   // Actions to take at some time. This is used to implement the
   // standard library's time package.
   // Must hold timersLock to access.
   timers []*timer

   // Number of timers in P's heap.
   // Modified using atomic instructions.
   numTimers uint32

   // Number of timerDeleted timers in P's heap.
   // Modified using atomic instructions.
   deletedTimers uint32

   // Race context used while executing timer functions.
   timerRaceCtx uintptr

   // scannableStackSizeDelta accumulates the amount of stack space held by
   // live goroutines (i.e. those eligible for stack scanning).
   // Flushed to gcController.scannableStackSize once scannableStackSizeSlack
   // or -scannableStackSizeSlack is reached.
   scannableStackSizeDelta int64

   // preempt is set to indicate that this P should be enter the
   // scheduler ASAP (regardless of what G is running on it).
   preempt bool

   // Padding is no longer needed. False sharing is now not a worry because p is large enough
   // that its size class is an integer multiple of the cache line size (for any of our architectures).
}

m

type m struct {
   g0      *g     // goroutine with scheduling stack
   morebuf gobuf  // gobuf arg to morestack
   divmod  uint32 // div/mod denominator for arm - known to liblink
   _       uint32 // align next field to 8 bytes

   // Fields not known to debuggers.
   procid        uint64            // for debuggers, but offset not hard-coded
   gsignal       *g                // signal-handling g
   goSigStack    gsignalStack      // Go-allocated signal handling stack
   sigmask       sigset            // storage for saved signal mask
   tls           [tlsSlots]uintptr // thread-local storage (for x86 extern register)
   mstartfn      func()
   curg          *g       // current running goroutine
   caughtsig     guintptr // goroutine running during fatal signal
   p             puintptr // attached p for executing go code (nil if not executing go code)
   nextp         puintptr
   oldp          puintptr // the p that was attached before executing a syscall
   id            int64
   mallocing     int32
   throwing      int32
   preemptoff    string // if != "", keep curg running on this m
   locks         int32
   dying         int32
   profilehz     int32
   spinning      bool // m is out of work and is actively looking for work
   blocked       bool // m is blocked on a note
   newSigstack   bool // minit on C thread called sigaltstack
   printlock     int8
   incgo         bool   // m is executing a cgo call
   freeWait      uint32 // if == 0, safe to free g0 and delete m (atomic)
   fastrand      uint64
   needextram    bool
   traceback     uint8
   ncgocall      uint64      // number of cgo calls in total
   ncgo          int32       // number of cgo calls currently in progress
   cgoCallersUse uint32      // if non-zero, cgoCallers in use temporarily
   cgoCallers    *cgoCallers // cgo traceback if crashing in cgo call
   park          note
   alllink       *m // on allm
   schedlink     muintptr
   lockedg       guintptr
   createstack   [32]uintptr // stack that created this thread.
   lockedExt     uint32      // tracking for external LockOSThread
   lockedInt     uint32      // tracking for internal lockOSThread
   nextwaitm     muintptr    // next m waiting for lock
   waitunlockf   func(*g, unsafe.Pointer) bool
   waitlock      unsafe.Pointer
   waittraceev   byte
   waittraceskip int
   startingtrace bool
   syscalltick   uint32
   freelink      *m // on sched.freem

   // these are here because they are too large to be on the stack
   // of low-level NOSPLIT functions.
   libcall   libcall
   libcallpc uintptr // for cpu profiler
   libcallsp uintptr
   libcallg  guintptr
   syscall   libcall // stores syscall parameters on windows

   vdsoSP uintptr // SP for traceback while in VDSO call (0 if not in call)
   vdsoPC uintptr // PC for traceback while in VDSO call

   // preemptGen counts the number of completed preemption
   // signals. This is used to detect when a preemption is
   // requested, but fails. Accessed atomically.
   preemptGen uint32

   // Whether this is a pending preemption signal on this M.
   // Accessed atomically.
   signalPending uint32

   dlogPerM

   mOS

   // Up to 10 locks held by this m, maintained by the lock ranking code.
   locksHeldLen int
   locksHeld    [10]heldLockInfo
}

m0, g0, mcache0

m0 is the M bound to the main OS thread, the global g0 is m0's scheduling goroutine, and mcache0 is the bootstrap mcache used during early runtime initialization, before the first P (and its own mcache) has been set up.

Stacks

Every non-dead G has a user stack associated with it, which is what user Go code executes on. User stacks start small (e.g., 2K) and grow or shrink dynamically.
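
As a rough illustration of stack growth (not of the runtime's internals): each call frame below holds a 1 KB buffer, so a handful of frames already exceeds the initial ~2 KB stack, and deep recursion only works because the runtime keeps copying the stack to a larger one.

package main

import "fmt"

// deep allocates ~1 KB per frame. Passing &pad down keeps the buffer
// on the goroutine stack (it does not escape), so 10,000 frames need
// roughly 10 MB of stack, all grown on demand from the initial ~2 KB.
func deep(n int, prev *[1024]byte) int {
	var pad [1024]byte
	pad[0] = byte(n)
	if prev != nil {
		pad[1] = prev[0]
	}
	if n == 0 {
		return int(pad[0])
	}
	return deep(n-1, &pad) + int(pad[1])
}

func main() {
	fmt.Println(deep(10000, nil))
}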

Every M has a system stack associated with it (also known as the M’s “g0” stack because it’s implemented as a stub G) and, on Unix platforms, a signal stack (also known as the M’s “gsignal” stack). System and signal stacks cannot grow, but are large enough to execute runtime and cgo code (8K in a pure Go binary; system-allocated in a cgo binary).

Runtime code often temporarily switches to the system stack using systemstack, mcall, or asmcgocall to perform tasks that must not be preempted, that must not grow the user stack, or that switch user goroutines. Code running on the system stack is implicitly non-preemptible and the garbage collector does not scan system stacks. While running on the system stack, the current user stack is not used for execution.

getg() and getg().m.curg

To get the current user g, use getg().m.curg.

getg() alone returns the current g, but when executing on the system or signal stacks, this will return the current M’s “g0” or “gsignal”, respectively. This is usually not what you want.

To determine if you’re running on the user stack or the system stack, use getg() == getg().m.curg.

g

type g struct {
   // Stack parameters.
   // stack describes the actual stack memory: [stack.lo, stack.hi).
   // stackguard0 is the stack pointer compared in the Go stack growth prologue.
   // It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
   // stackguard1 is the stack pointer compared in the C stack growth prologue.
   // It is stack.lo+StackGuard on g0 and gsignal stacks.
   // It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
   stack       stack   // offset known to runtime/cgo
   stackguard0 uintptr // offset known to liblink
   stackguard1 uintptr // offset known to liblink

   _panic    *_panic // innermost panic - offset known to liblink
   _defer    *_defer // innermost defer
   m         *m      // current m; offset known to arm liblink
   sched     gobuf
   syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
   syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
   stktopsp  uintptr // expected sp at top of stack, to check in traceback
   // param is a generic pointer parameter field used to pass
   // values in particular contexts where other storage for the
   // parameter would be difficult to find. It is currently used
   // in three ways:
   // 1. When a channel operation wakes up a blocked goroutine, it sets param to
   //    point to the sudog of the completed blocking operation.
   // 2. By gcAssistAlloc1 to signal back to its caller that the goroutine completed
   //    the GC cycle. It is unsafe to do so in any other way, because the goroutine's
   //    stack may have moved in the meantime.
   // 3. By debugCallWrap to pass parameters to a new goroutine because allocating a
   //    closure in the runtime is forbidden.
   param        unsafe.Pointer
   atomicstatus uint32
   stackLock    uint32 // sigprof/scang lock; TODO: fold in to atomicstatus
   goid         int64
   schedlink    guintptr
   waitsince    int64      // approx time when the g become blocked
   waitreason   waitReason // if status==Gwaiting

   preempt       bool // preemption signal, duplicates stackguard0 = stackpreempt
   preemptStop   bool // transition to _Gpreempted on preemption; otherwise, just deschedule
   preemptShrink bool // shrink stack at synchronous safe point

   // asyncSafePoint is set if g is stopped at an asynchronous
   // safe point. This means there are frames on the stack
   // without precise pointer information.
   asyncSafePoint bool

   paniconfault bool // panic (instead of crash) on unexpected fault address
   gcscandone   bool // g has scanned stack; protected by _Gscan bit in status
   throwsplit   bool // must not split stack
   // activeStackChans indicates that there are unlocked channels
   // pointing into this goroutine's stack. If true, stack
   // copying needs to acquire channel locks to protect these
   // areas of the stack.
   activeStackChans bool
   // parkingOnChan indicates that the goroutine is about to
   // park on a chansend or chanrecv. Used to signal an unsafe point
   // for stack shrinking. It's a boolean value, but is updated atomically.
   parkingOnChan uint8

   raceignore     int8     // ignore race detection events
   sysblocktraced bool     // StartTrace has emitted EvGoInSyscall about this goroutine
   tracking       bool     // whether we're tracking this G for sched latency statistics
   trackingSeq    uint8    // used to decide whether to track this G
   runnableStamp  int64    // timestamp of when the G last became runnable, only used when tracking
   runnableTime   int64    // the amount of time spent runnable, cleared when running, only used when tracking
   sysexitticks   int64    // cputicks when syscall has returned (for tracing)
   traceseq       uint64   // trace event sequencer
   tracelastp     puintptr // last P emitted an event for this goroutine
   lockedm        muintptr
   sig            uint32
   writebuf       []byte
   sigcode0       uintptr
   sigcode1       uintptr
   sigpc          uintptr
   gopc           uintptr         // pc of go statement that created this goroutine
   ancestors      *[]ancestorInfo // ancestor information goroutine(s) that created this goroutine (only used if debug.tracebackancestors)
   startpc        uintptr         // pc of goroutine function
   racectx        uintptr
   waiting        *sudog         // sudog structures this g is waiting on (that have a valid elem ptr); in lock order
   cgoCtxt        []uintptr      // cgo traceback context
   labels         unsafe.Pointer // profiler labels
   timer          *timer         // cached timer for time.Sleep
   selectDone     uint32         // are we participating in a select and did someone win the race?

   // Per-G GC state

   // gcAssistBytes is this G's GC assist credit in terms of
   // bytes allocated. If this is positive, then the G has credit
   // to allocate gcAssistBytes bytes without assisting. If this
   // is negative, then the G must correct this by performing
   // scan work. We track this in bytes to make it fast to update
   // and check for debt in the malloc hot path. The assist ratio
   // determines how this corresponds to scan work debt.
   gcAssistBytes int64
}

Goroutine states

[figure: goroutine state transition diagram]

_Gidle: the goroutine has just been allocated and has not yet been initialized. Its value is 0, the default after a goroutine is created.

_Grunnable: the goroutine is not executing code and does not own a stack for execution; it sits in a run queue, either some P's local queue or the global queue (see the figure above).

_Grunning: the goroutine is executing code and owns its stack (see the figure above).

_Gsyscall: the goroutine is executing a system call and owns its stack; it is detached from any P but bound to an M, and is put back on a run queue when the call returns (see the figure above).

_Gwaiting: the goroutine is blocked, for example on a channel's send or receive queue (see the figure above).

_Gdead: the goroutine is currently unused and not executing code; it may still have an allocated stack and sits on the gFree list. It may be a goroutine that has just been initialized or one that has exited via goexit (see the figure above).
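
These internal statuses are not exposed directly, but a full goroutine dump shows closely related labels such as running, runnable, syscall, and chan receive. A small illustration:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	ch := make(chan int)
	go func() { <-ch }() // sits in _Gwaiting, shown as "[chan receive]"

	time.Sleep(100 * time.Millisecond)

	// Dump all goroutines; each header line includes a state label.
	buf := make([]byte, 1<<16)
	n := runtime.Stack(buf, true)
	fmt.Printf("%s", buf[:n])
}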

P states

const (
	// _Pidle means a P is not being used to run user code or the
	// scheduler. Typically, it's on the idle P list and available
	// to the scheduler, but it may just be transitioning between
	// other states.
	//
	// The P is owned by the idle list or by whatever is
	// transitioning its state. Its run queue is empty.
	_Pidle = iota

	// _Prunning means a P is owned by an M and is being used to
	// run user code or the scheduler. Only the M that owns this P
	// is allowed to change the P's status from _Prunning. The M
	// may transition the P to _Pidle (if it has no more work to
	// do), _Psyscall (when entering a syscall), or _Pgcstop (to
	// halt for the GC). The M may also hand ownership of the P
	// off directly to another M (e.g., to schedule a locked G).
	_Prunning

	// _Psyscall means a P is not running user code. It has
	// affinity to an M in a syscall but is not owned by it and
	// may be stolen by another M. This is similar to _Pidle but
	// uses lightweight transitions and maintains M affinity.
	//
	// Leaving _Psyscall must be done with a CAS, either to steal
	// or retake the P. Note that there's an ABA hazard: even if
	// an M successfully CASes its original P back to _Prunning
	// after a syscall, it must understand the P may have been
	// used by another M in the interim.
	_Psyscall

	// _Pgcstop means a P is halted for STW and owned by the M
	// that stopped the world. The M that stopped the world
	// continues to use its P, even in _Pgcstop. Transitioning
	// from _Prunning to _Pgcstop causes an M to release its P and
	// park.
	//
	// The P retains its run queue and startTheWorld will restart
	// the scheduler on Ps with non-empty run queues.
	_Pgcstop

	// _Pdead means a P is no longer used (GOMAXPROCS shrank). We
	// reuse Ps if GOMAXPROCS increases. A dead P is mostly
	// stripped of its resources, though a few things remain
	// (e.g., trace buffers).
	_Pdead
)

M states

An M is in one of three informal states: running (bound to a P and executing code), spinning (out of work and actively looking for runnable goroutines or a P to acquire; see the spinning field in the m struct above), or idle (parked on a note, waiting to be woken).