Note that escaping to the heap must also be transitive: if a reference to a Go value is written into another Go value that has already been determined to escape, that value must also escape. (In other words, if a value is referenced by data that already lives on the heap, it escapes as well.)
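As a small illustration of this transitive rule (a minimal sketch; the exact compiler diagnostics vary by Go version), the local variable s below escapes only because its address is stored into a struct that itself escapes by being returned to the caller. Running go build -gcflags=-m prints the compiler's escape-analysis decisions.

type holder struct {
    p *int
}

// newHolder returns a heap-allocated holder. Because &s is written into
// that escaping holder, s itself must also be moved to the heap.
func newHolder() *holder {
    s := 42
    return &holder{p: &s}
}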
Together, objects and pointers to other objects form the object graph. To identify live memory, the GC walks the object graph starting at the program's roots, pointers that identify objects that are definitely in-use by the program. Two examples of roots are local variables and global variables. The process of walking the object graph is referred to as scanning.
GC phases
The current Go GC uses tri-color marking combined with a hybrid write barrier.
The Go GC has four phases:
- STW: stop the world so that all Ps enable the hybrid write barrier.
- Concurrent marking: all objects start in the white set, and the root objects are moved into the grey set. The GC repeatedly takes an object from the grey set, marks it black, and moves each of its still-white children into the grey set, looping until the grey set is empty (a minimal sketch follows this list). The objects left white are the garbage to be reclaimed.
- STW: mark termination, during which the hybrid write barrier is turned off.
- The remaining GC work (sweeping) proceeds concurrently in the background.
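A minimal, single-threaded sketch of that marking loop (illustrative only; the node and color types are assumptions, not runtime types, and the real collector runs this work concurrently, with write barriers and per-P work queues):

type color int

const (
    white color = iota
    grey
    black
)

type node struct {
    children []*node
    c        color
}

// mark walks the object graph from the roots: reachable white objects are
// greyed, grey objects are scanned and blackened, until the grey set is
// empty. Whatever is still white afterwards is unreachable and can be swept.
func mark(roots []*node) {
    var greySet []*node
    for _, r := range roots {
        if r != nil && r.c == white {
            r.c = grey
            greySet = append(greySet, r)
        }
    }
    for len(greySet) > 0 {
        obj := greySet[len(greySet)-1]
        greySet = greySet[:len(greySet)-1]
        obj.c = black
        for _, child := range obj.children {
            if child != nil && child.c == white {
                child.c = grey
                greySet = append(greySet, child)
            }
        }
    }
}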
Problems introduced by concurrent mark and sweep
Mutator (user) code keeps updating the object graph concurrently while the collector is running.
Suppose a grey object A points to a white object B (ref2). Concurrently, the mutator makes a black object C point to B (ref3) and then removes A's reference to B (ref2). As the scan continues, B will never be marked black, because the collector does not re-scan black objects, so B is incorrectly reclaimed even though it is still reachable.
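The problematic interleaving can be sketched as ordinary pointer writes (hypothetical types and names, mirroring A, B and C from the text); without a write barrier, nothing records that B has become reachable from a black object:

type obj struct {
    ref *obj
}

// loseB shows the mutator-side interleaving: C is already black, A is grey
// and still holds the only scanned path to the white object B.
func loseB(A, C *obj) {
    B := A.ref  // B is white; the collector has not reached it yet
    C.ref = B   // ref3: black C now points to white B
    A.ref = nil // ref2 removed: the grey path to B disappears
    // The collector never re-scans black C, so B stays white and is swept
    // even though it is still reachable through C.
}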
(go.googlesource.com/proposal/+/…)
Go 1.7 uses a coarsened Dijkstra write barrier [Dijkstra '78], where pointer writes are implemented as follows:
writePointer(slot, ptr):
    shade(ptr)
    *slot = ptr
shade(ptr) marks the object at ptr grey if it is not already grey or black. This ensures the strong tricolor invariant by conservatively assuming that *slot may be in a black object, and ensuring ptr cannot be white before installing it in *slot. (In the figure above, this is what turns B grey.)
It presents a trade-off for pointers on stacks: either writes to pointers on the stack must have write barriers, which is prohibitively expensive, or stacks must be permagrey (permanently grey). Go chooses the latter, which means that many stacks must be re-scanned during STW.
Re-scanning the stacks can take 10s to 100s of milliseconds in an application with a large number of active goroutines.
Hybrid write barrier
- to eliminate stack re-scanning
- combines a Yuasa-style deletion write barrier [Yuasa '90] with a Dijkstra-style insertion write barrier [Dijkstra '78]
writePointer(slot, ptr):
    shade(*slot)
    if current stack is grey: // the current goroutine's stack has not yet been scanned
        shade(ptr)
    *slot = ptr
The write barrier shades the object whose reference is being overwritten, and, if the current goroutine's stack has not yet been scanned, also shades the reference being installed. (In the figure above, this is what turns B grey when A's reference to it is deleted.)
The hybrid barrier combines the best of the Dijkstra barrier and the Yuasa barrier. The Yuasa barrier requires a STW at the beginning of marking to either scan or snapshot stacks, but does not require a re-scan at the end of marking. The Dijkstra barrier lets concurrent marking start right away, but requires a STW at the end of marking to re-scan stacks (though more sophisticated non-STW approaches are possible [Hudson '97]). The hybrid barrier inherits the best properties of both, allowing stacks to be concurrently scanned at the beginning of the mark phase, while also keeping stacks black after this initial scan.
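A Go-flavoured sketch of the writePointer pseudocode above (the gcObj/gcColor types and the shade helper are stand-ins for runtime internals such as those in mbarrier.go, not real exported APIs):

type gcColor int

const (
    gcWhite gcColor = iota
    gcGrey
    gcBlack
)

type gcObj struct {
    ref   *gcObj
    color gcColor
}

// shade greys an object if it is still white, i.e. puts it on the mark queue.
func shade(o *gcObj) {
    if o != nil && o.color == gcWhite {
        o.color = gcGrey
    }
}

// hybridWriteBarrier mirrors writePointer(slot, ptr): the Yuasa half shades
// the overwritten pointee, and the Dijkstra half shades the new pointee while
// the current goroutine's stack has not been scanned yet.
func hybridWriteBarrier(slot **gcObj, ptr *gcObj, stackNotYetScanned bool) {
    shade(*slot)
    if stackNotYetScanned {
        shade(ptr)
    }
    *slot = ptr
}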
Source code
// The GC runs concurrently with mutator threads, is type accurate (aka precise), allows multiple
// GC threads to run in parallel. It is a **concurrent mark and sweep** that uses a **write barrier**. It is
// non-generational and non-compacting.
// 1. GC performs sweep termination.
//
// a. Stop the world. This causes all Ps to reach a GC safe-point.
//
// b. Sweep any unswept spans. There will only be unswept spans if
// this GC cycle was forced before the expected time.
//
// 2. GC performs the mark phase.
//
// a. Prepare for the mark phase by setting gcphase to _GCmark
// (from _GCoff), enabling the write barrier, enabling mutator
// assists, and enqueueing root mark jobs. No objects may be
// scanned until all Ps have enabled the write barrier, which is
// accomplished using STW.
//
// b. Start the world. From this point, GC work is done by mark
// workers started by the scheduler and by assists performed as
// part of allocation. The write barrier shades both the
// overwritten pointer and the new pointer value for any pointer
// writes (see mbarrier.go for details). Newly allocated objects
// are immediately marked black.
//
// c. GC performs root marking jobs. This includes scanning all
// stacks, shading all globals, and shading any heap pointers in
// off-heap runtime data structures. Scanning a stack stops a
// goroutine, shades any pointers found on its stack, and then
// resumes the goroutine.
//
// d. GC drains the work queue of grey objects, scanning each grey
// object to black and shading all pointers found in the object
// (which in turn may add those pointers to the work queue).
//
// e. Because GC work is spread across local caches, GC uses a
// distributed termination algorithm to detect when there are no
// more root marking jobs or grey objects (see gcMarkDone). At this
// point, GC transitions to mark termination.
//
// 3. GC performs mark termination.
//
// a. Stop the world.
//
// b. Set gcphase to _GCmarktermination, and disable workers and
// assists.
//
// c. Perform housekeeping like flushing mcaches.
//
// 4. GC performs the sweep phase.
//
// a. Prepare for the sweep phase by setting gcphase to _GCoff,
// setting up sweep state and disabling the write barrier.
//
// b. Start the world. From this point on, newly allocated objects
// are white, and allocating sweeps spans before use if necessary.
//
// c. GC does concurrent sweeping in the background and in response
// to allocation. See description below.
//
// 5. When sufficient allocation has taken place, replay the sequence
// starting with 1 above. See discussion of GC rate below.
// Concurrent sweep.
//
// The sweep phase proceeds concurrently with normal program execution.
// The heap is swept span-by-span both lazily (when a goroutine needs another span)
// and concurrently in a background goroutine (this helps programs that are not CPU bound).
// At the end of STW mark termination all spans are marked as "needs sweeping".
// The background sweeper goroutine simply sweeps spans one-by-one.
// To avoid requesting more OS memory while there are unswept spans, when a
// goroutine needs another span, it first attempts to reclaim that much memory
// by sweeping. When a goroutine needs to allocate a new small-object span, it
// sweeps small-object spans for the same object size until it frees at least
// one object. When a goroutine needs to allocate large-object span from heap,
// it sweeps spans until it frees at least that many pages into heap. There is
// one case where this may not suffice: if a goroutine sweeps and frees two
// nonadjacent one-page spans to the heap, it will allocate a new two-page
// span, but there can still be other one-page unswept spans which could be
// combined into a two-page span.
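Step 5 above ("when sufficient allocation has taken place") is governed by the GOGC knob: roughly, a new cycle starts once the heap has grown by GOGC percent over the live heap left by the previous cycle. A minimal example of adjusting it at runtime (runtime/debug.SetGCPercent is the programmatic equivalent of the GOGC environment variable):

package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // Trigger the next GC cycle once the heap grows ~50% over the live heap,
    // instead of the default 100%. Equivalent to running with GOGC=50.
    old := debug.SetGCPercent(50)
    fmt.Println("previous GOGC value:", old)
}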
Understanding costs
To begin with, consider this model of GC cost based on three simple axioms.
- The GC involves only two resources: CPU time, and physical memory.
- The GC's memory costs consist of live heap memory, new heap memory allocated before the mark phase, and space for metadata that, even if proportional to the previous costs, are small in comparison.
Note: live heap memory is memory that was determined to be live by the previous GC cycle, while new heap memory is any memory allocated in the current cycle, which may or may not be live by the end.
- The GC's CPU costs are modeled as a fixed cost per cycle, and a marginal cost that scales proportionally with the size of the live heap (a small sketch of this model follows the list).
Note: Asymptotically speaking, sweeping scales worse than marking and scanning, as it must perform work proportional to the size of the whole heap, including memory that is determined to be not live (i.e. "dead"). However, in the current implementation sweeping is so much faster than marking and scanning that its associated costs can be ignored in this discussion.
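These axioms can be written down directly (an illustrative model only; the cost parameters are hypothetical, not runtime values):

// gcCPUCost models the CPU axiom: a fixed per-cycle cost plus a marginal
// cost proportional to the live heap scanned during marking.
func gcCPUCost(liveHeapBytes, fixedCostPerCycle, costPerLiveByte float64) float64 {
    return fixedCostPerCycle + costPerLiveByte*liveHeapBytes
}

// gcMemCost models the memory axiom: live heap plus new heap allocated
// before the mark phase ends (metadata is ignored as comparatively small).
func gcMemCost(liveHeapBytes, newHeapBytes float64) float64 {
    return liveHeapBytes + newHeapBytes
}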
scavenge
bgscavenge
// The current value is chosen assuming a cost of ~10µs/physical page
// (this is somewhat pessimistic), which implies a worst-case latency of
// about 160µs for 4 KiB physical pages. The current value is biased
// toward latency over throughput.
const scavengeQuantum = 64 << 10 // 64 KiB
r := mheap_.pages.scavenge(scavengeQuantum)
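The 160µs figure in the comment follows directly from the quantum size (a small worked check, using the comment's own assumptions; the constant names here are illustrative):

const (
    physPageSize     = 4 << 10  // 4 KiB physical pages, as assumed in the comment
    scavengeQuantumB = 64 << 10 // 64 KiB per call, matching scavengeQuantum above
    costPerPageUS    = 10       // ~10µs per physical page (pessimistic)
)

// 64 KiB / 4 KiB = 16 pages; 16 pages x ~10µs = ~160µs worst-case latency.
const worstCaseLatencyUS = scavengeQuantumB / physPageSize * costPerPageUS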
Proposal
go.googlesource.com/proposal/+/…
- Scavenge at a rate proportional to the rate at which the application is allocating memory.
- Retain some constant times the peak heap goal over the last N GCs (see the sketch after this list).
- Scavenge the unscavenged spans with the highest base addresses first.
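A rough sketch of the second rule (hypothetical names and constant; the proposal leaves the exact retention factor to tuning):

// retainFactor is an illustrative "some constant".
const retainFactor = 1.1

// scavengeTarget returns how much heap memory to keep unscavenged: a constant
// factor times the peak heap goal observed over the last N GC cycles.
func scavengeTarget(recentHeapGoals []uint64) uint64 {
    var peak uint64
    for _, goal := range recentHeapGoals { // heap goals from the last N GCs
        if goal > peak {
            peak = goal
        }
    }
    return uint64(retainFactor * float64(peak))
}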
Rationale
Thus, I propose a more robust alternative: change the Go runtime’s span allocation policy to be first-fit, rather than best-fit. Address-ordered first-fit allocation policies generally perform as well as best-fit in practice when it comes to fragmentation [Johnstone98], a claim which I verified holds true for the Go runtime by simulating a large span allocation trace.
Furthermore, I propose we then scavenge the spans with the highest base address first. The advantage of a first-fit allocation policy here is that we know something about which chunks of memory will actually be chosen, which leads us to a sensible scavenging policy.
Implementing a First-fit Data Structure
First, modify the existing treap implementation to sort by a span's base address (so that every left child's base address is less than its parent's, which in turn is less than its right child's).
Next, attach a new field to each binary tree node called maxPages. This field represents the maximum size, in 8 KiB pages, of any span in the subtree rooted at that node (in effect, a per-subtree max aggregate that lets the search skip subtrees that cannot satisfy a request).
For a leaf node, maxPages is always equal to the node’s span’s length. This invariant is maintained every time the tree changes. For most balanced trees, the tree may change in one of three ways: insertion, removal, and tree rotations.
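A sketch of how that invariant could be maintained (field and type names are assumptions, not the runtime's actual treap types): after any insertion, removal, or rotation, recompute maxPages bottom-up from the node's own span and its children.

type treapNode struct {
    span        *mspanInfo
    left, right *treapNode
    maxPages    uintptr
}

// mspanInfo stands in for the runtime span: its base address orders the tree,
// and its length in 8 KiB pages feeds maxPages.
type mspanInfo struct {
    base  uintptr
    pages uintptr
}

// updateMaxPages restores the invariant for a single node; callers apply it
// along the changed path (bottom-up) after insert, remove, or rotation.
func updateMaxPages(t *treapNode) {
    m := t.span.pages
    if t.left != nil && t.left.maxPages > m {
        m = t.left.maxPages
    }
    if t.right != nil && t.right.maxPages > m {
        m = t.right.maxPages
    }
    t.maxPages = m
}

With maxPages in place, a first-fit lookup walks down the tree, preferring the lowest-addressed subtree that can still satisfy the request, as in the pseudocode below.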
func Find(root, pages):
    t = root
    for t != nil:
        if t.left != nil and t.left.maxPages >= pages:
            t = t.left
        else if t.span.pages >= pages:
            return t.span
        else if t.right != nil and t.right.maxPages >= pages:
            t = t.right
        else:
            return nil
func scavenge
// scavenge scavenges nbytes worth of free pages, starting with the
// highest address first. Successive calls continue from where it left
// off until the heap is exhausted. Call scavengeStartGen to bring it
// back to the top of the heap.
//
// Returns the amount of memory scavenged in bytes.
func (p *pageAlloc) scavenge(nbytes uintptr) uintptr {
    var (
        addrs addrRange
        gen   uint32
    )
    released := uintptr(0)
    for released < nbytes {
        if addrs.size() == 0 {
            if addrs, gen = p.scavengeReserve(); addrs.size() == 0 {
                break
            }
        }
        systemstack(func() {
            r, a := p.scavengeOne(addrs, nbytes-released)
            released += r
            addrs = a
        })
    }
    // Only unreserve the space which hasn't been scavenged or searched
    // to ensure we always make progress.
    p.scavengeUnreserve(addrs, gen)
    return released
}