golang memory management and GC


Unmanaged memory

In general, the runtime tries to use regular heap allocation. However, in some cases the runtime must allocate objects outside of the garbage collected heap, in unmanaged memory. This is necessary if the objects are part of the memory manager itself or if they must be allocated in situations where the caller may not have a P.

There are three mechanisms for allocating unmanaged memory:

  • sysAlloc obtains memory directly from the OS. This comes in whole multiples of the system page size, but it can be freed with sysFree.
  • persistentalloc combines multiple smaller allocations into a single sysAlloc to avoid fragmentation. However, there is no way to free persistentalloced objects (hence the name).
  • fixalloc is a SLAB-style allocator that allocates objects of a fixed size. fixalloced objects can be freed, but this memory can only be reused by the same fixalloc pool, so it can only be reused for objects of the same type.

In general, types that are allocated using any of these should be marked //go:notinheap (see below).
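To make the fixalloc idea above concrete, here is a minimal user-space sketch of a SLAB-style fixed-size allocator with a free list. All names here (fixAllocSketch, chunkBytes, and so on) are invented for illustration; the real fixalloc lives in the runtime and carves objects out of unmanaged memory obtained via persistentalloc.

package main

import "fmt"

// A user-space analogy (not the runtime's code) of fixalloc's design:
// a fixed-size, SLAB-style allocator that carves objects out of large
// chunks and recycles freed objects through a free list.
type fixAllocSketch struct {
    size  int      // fixed object size in bytes
    chunk []byte   // current chunk we are carving objects from
    off   int      // offset of the next unused byte in chunk
    free  [][]byte // freed objects, reused before carving new ones
}

const chunkBytes = 16 << 10 // carve 16 KiB at a time (arbitrary for the sketch)

func newFixAllocSketch(size int) *fixAllocSketch {
    return &fixAllocSketch{size: size}
}

func (f *fixAllocSketch) alloc() []byte {
    // Prefer recycling a freed object of the same size.
    if n := len(f.free); n > 0 {
        obj := f.free[n-1]
        f.free = f.free[:n-1]
        return obj
    }
    // Otherwise carve a new object out of the current chunk,
    // grabbing a fresh chunk when the current one is exhausted.
    if f.off+f.size > len(f.chunk) {
        f.chunk = make([]byte, chunkBytes)
        f.off = 0
    }
    obj := f.chunk[f.off : f.off+f.size : f.off+f.size]
    f.off += f.size
    return obj
}

func (f *fixAllocSketch) freeObj(obj []byte) {
    // Freed memory only ever goes back to this pool, mirroring the
    // "reused only for objects of the same type" property of fixalloc.
    f.free = append(f.free, obj)
}

func main() {
    pool := newFixAllocSketch(128)
    a := pool.alloc()
    pool.freeObj(a)
    b := pool.alloc() // reuses a's memory
    fmt.Println(len(b), &a[0] == &b[0])
}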

Objects that are allocated in unmanaged memory must not contain heap pointers unless the following rules are also obeyed:

  1. Any pointers from unmanaged memory to the heap must be garbage collection roots. More specifically, any pointer must either be accessible through a global variable or be added as an explicit garbage collection root in runtime.markroot.
  2. If the memory is reused, the heap pointers must be zero-initialized before they become visible as GC roots. Otherwise, the GC may observe stale heap pointers. See “Zero-initialization versus zeroing”.

Zero-initialization versus zeroing

There are two types of zeroing in the runtime, depending on whether the memory is already initialized to a type-safe state.

If memory is not in a type-safe state, meaning it potentially contains “garbage” because it was just allocated and it is being initialized for first use, then it must be zero-initialized using memclrNoHeapPointers or non-pointer writes. This does not perform write barriers.

If memory is already in a type-safe state and is simply being set to the zero value, this must be done using regular writes, typedmemclr, or memclrHasPointers. This performs write barriers.

go.dev/src/runtime…

Go internal memory structure

mheap

This is where Go stores dynamic data (any data whose size cannot be determined at compile time). It is the biggest block of memory and the one where garbage collection (GC) takes place.

The resident set is divided into pages of 8KB each and is managed by one global mheap object.

Large objects (size > 32 KB) are allocated directly from mheap. These large requests come at the expense of a central lock, so only one P's request can be served at any given point in time.

mheap manages pages grouped into different constructs as below:

  • mspan: mspan is the most basic structure that manages pages of memory in mheap. It is a node of a doubly linked list and holds the address of the start page, the span size class, and the number of pages in the span. Like TCMalloc, Go divides memory pages into 67 size classes, starting at 8 bytes and going up to 32 kilobytes (see the class_to_size table below).

  • Each span exists twice, one for objects with pointers (scan classes) and one for objects with no pointers (noscan classes). This helps during GC as noscan spans need not be traversed to look for live objects.

  • mcentral: mcentral groups spans of the same size class together. Each mcentral contains two span sets (partial and full):

// partial and full contain two mspan sets: one of swept in-use
// spans, and one of unswept in-use spans. These two trade
// roles on each GC cycle. The unswept set is drained either by
// allocation or by the background sweeper in every GC cycle,
// so only two roles are necessary.
partial [2]spanSet // list of spans with a free object
full    [2]spanSet // list of spans with no free objects
  • arena: The heap grows and shrinks as required within the reserved virtual memory. When more memory is needed, mheap pulls it from virtual memory in 64 MB chunks (on 64-bit architectures) called arenas. Pages are mapped to spans here.

  • mcache: a per-P cache of small-object spans; because it is per-P, no locking is needed (see the mcache struct below).

The mheap struct (abridged):
// Main malloc heap.
// The heap itself is the "free" and "scav" treaps,
// but all the other global data is here too.
//
// mheap must not be heap-allocated because it contains mSpanLists,
// which must not be heap-allocated.
//
//go:notinheap
type mheap struct {
    pages pageAlloc // page allocation data structure

    allspans []*mspan // all spans out there

    arenas [1 << arenaL1Bits]*[1 << arenaL2Bits]*heapArena

    // central free lists for small size classes.
    // the padding makes sure that the mcentrals are
    // spaced CacheLinePadSize bytes apart, so that each mcentral.lock
    // gets its own cache line.
    // central is indexed by spanClass.
    central [numSpanClasses]struct {
       mcentral mcentral
       pad      [cpu.CacheLinePadSize - unsafe.Sizeof(mcentral{})%cpu.CacheLinePadSize]byte
    }
    ...
}

Each arena is 64 MB; heapArena.spans maps from the virtual address page ID within the arena to its *mspan.
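As a rough illustration of how an address is mapped to its heapArena, here is a sketch of the two-level arena index computation. The constants (heapAddrBits, logHeapArenaBytes, arenaL1Bits, arenaBaseOffset) are platform-dependent; the values below merely assume a typical linux/amd64 configuration, and this is a sketch rather than the runtime's code.

package main

import "fmt"

// Illustrative constants (values differ by platform; these match a typical
// linux/amd64 configuration where each heapArena covers 64 MB).
const (
    heapAddrBits      = 48
    logHeapArenaBytes = 26 // 64 MB arenas
    heapArenaBytes    = 1 << logHeapArenaBytes
    arenaL1Bits       = 0 // a single L1 entry on linux/amd64
    arenaL2Bits       = heapAddrBits - logHeapArenaBytes - arenaL1Bits
    arenaBaseOffset   = 0 // simplified; the runtime offsets addresses on some platforms
)

// arenaIndexSketch mirrors how mheap.arenas is indexed: a linear arena
// index split into an L1 and an L2 index of the two-level sparse array.
func arenaIndexSketch(p uintptr) (l1, l2 uintptr) {
    ai := (p + arenaBaseOffset) / heapArenaBytes
    return ai >> arenaL2Bits, ai & (1<<arenaL2Bits - 1)
}

func main() {
    l1, l2 := arenaIndexSketch(0x00c000010000) // a typical Go heap address
    fmt.Println(l1, l2)
}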

mspan

//go:notinheap
type mspan struct {
   next *mspan     // next span in list, or nil if none
   prev *mspan     // previous span in list, or nil if none
   list *mSpanList
   ...
}

A span consists of at least one runtime page (8 KB).

mcentral

// Central list of free objects of a given size.
//go:notinheap
type mcentral struct {
   spanclass spanClass
   partial [2]spanSet // list of spans with a free object
   full    [2]spanSet // list of spans with no free objects
}

mcache

// Per-thread (in Go, per-P) cache for small objects.
// This includes a small object cache and local allocation stats.
// No locking needed because it is per-thread (per-P).
//
// mcaches are allocated from non-GC'd memory, so any heap pointers
// must be specially handled.
//
//go:notinheap
type mcache struct {
...
}
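A highly simplified model of how these three layers cooperate on the small-object allocation path (an mcache refilling from an mcentral, which in turn grows from mheap) might look like the sketch below. All names ending in Sketch are invented; the real refill path does considerably more bookkeeping (sweeping, mark bits, size-class accounting).

package main

import "fmt"

// A toy model of the mcache -> mcentral -> mheap refill path for small
// objects: each P allocates from its mcache without locking; when the
// cached span for a size class runs out, the mcache refills from the
// shared mcentral; when mcentral has nothing, it grows by asking mheap
// for fresh pages.
type spanSketch struct {
    freeObjects int
}

type mcentralSketch struct {
    partial []*spanSketch // spans that still have free objects
}

type mheapSketch struct{}

func (h *mheapSketch) grow() *spanSketch {
    // In the runtime this carves pages out of an arena; here we just
    // invent a span with room for 64 objects.
    return &spanSketch{freeObjects: 64}
}

func (c *mcentralSketch) cacheSpan(h *mheapSketch) *spanSketch {
    if n := len(c.partial); n > 0 {
        s := c.partial[n-1]
        c.partial = c.partial[:n-1]
        return s
    }
    return h.grow()
}

type mcacheSketch struct {
    span *spanSketch // cached span for one size class (the real mcache keeps one per span class)
}

func (m *mcacheSketch) alloc(c *mcentralSketch, h *mheapSketch) {
    if m.span == nil || m.span.freeObjects == 0 {
        m.span = c.cacheSpan(h) // refill: the only step that may need a lock
    }
    m.span.freeObjects--
}

func main() {
    h := &mheapSketch{}
    c := &mcentralSketch{}
    m := &mcacheSketch{}
    for i := 0; i < 70; i++ {
        m.alloc(c, h) // refills once after the first 64 objects are used
    }
    fmt.Println(m.span.freeObjects) // 58 objects left in the second span
}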

Allocation, size classes and span classes

Allocation is done using size segregated per P allocation areas to minimize fragmentation while eliminating locks in the common case.

_NumSizeClasses = 68

numSpanClasses = _NumSizeClasses << 1
var class_to_size = [_NumSizeClasses]uint16{0, 8, 16, 24, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 896, 1024, 1152, 1280, 1408, 1536, 1792, 2048, 2304, 2688, 3072, 3200, 3456, 4096, 4864, 5376, 6144, 6528, 6784, 6912, 8192, 9472, 9728, 10240, 10880, 12288, 13568, 14336, 16384, 18432, 19072, 20480, 21760, 24576, 27264, 28672, 32768}
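To illustrate how a requested size maps to a size class and then to a span class, here is a small sketch. The runtime uses precomputed lookup tables rather than a linear scan, and the helper names ending in Sketch are made up; only the encoding idea (size class in the upper bits, a noscan flag in the lowest bit, hence numSpanClasses = _NumSizeClasses << 1) is taken from the source above.

package main

import "fmt"

// class_to_size as printed above (abridged here for the sketch).
var classToSize = []uint16{0, 8, 16, 24, 32, 48, 64, 80, 96, 112, 128}

// sizeToClassSketch finds the smallest size class that fits size.
// The runtime uses precomputed lookup tables instead of this linear
// scan, but the result is the same.
func sizeToClassSketch(size uintptr) int {
    for c := 1; c < len(classToSize); c++ {
        if uintptr(classToSize[c]) >= size {
            return c
        }
    }
    return -1 // larger than this abridged table; the real table goes up to 32 KiB
}

// makeSpanClassSketch mirrors the span-class encoding: the size class in
// the upper bits and a noscan flag in the lowest bit.
func makeSpanClassSketch(sizeclass uint8, noscan bool) uint8 {
    sc := sizeclass << 1
    if noscan {
        sc |= 1
    }
    return sc
}

func main() {
    c := sizeToClassSketch(100)                      // 100 B rounds up to the 112 B class
    fmt.Println(c, classToSize[c])                   // 9 112
    fmt.Println(makeSpanClassSketch(uint8(c), true)) // noscan span class 19
}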

mallocinit()

Check physPageSize.

mheap_.init()
mcache0 = allocmcache()

//Create initial arena growth hints.

// On a 64-bit machine, we pick the following hints
// because:
//
// 1. Starting from the middle of the address space
// makes it easier to grow out a contiguous range
// without running in to some other mapping.
//
// 2. This makes Go heap addresses more easily
// recognizable when debugging.


// On a 32-bit machine, we're much more concerned
// about keeping the usable heap contiguous.
// Hence:
//
// 1. We reserve space for all heapArenas up front so
// they don't get interleaved with the heap. They're
// ~258MB, so this isn't too bad. (We could reserve a
// smaller amount of space up front if this is a
// problem.)
//
// 2. We hint the heap to start right above the end of
// the binary so we have the best chance of keeping it
// contiguous.

heap init

Call the init methods of the fixallocs such as spanalloc and cachealloc // Initialize f to allocate objects of the given size

h.central[i].mcentral.init(spanClass(i)) // Initialize a single central free list and set mcentral.spanclass

h.pages.init(&h.lock, &memstats.gcMiscSys) // pageAlloc init

spanAllocType

// spanAllocType represents the type of allocation to make, or
// the type of allocation to be freed.
type spanAllocType uint8

const (
   spanAllocHeap          spanAllocType = iota // heap span
   spanAllocStack                              // stack span
   spanAllocPtrScalarBits                      // unrolled GC prog bitmap span
   spanAllocWorkBuf                            // work buf span
)

Page allocator

// The page allocator manages mapped pages (defined by pageSize, NOT
// physPageSize) for allocation and re-use. It is embedded into mheap.
//
// Pages are managed using a bitmap that is sharded into chunks.
// In the bitmap, 1 means in-use, and 0 means free. The bitmap spans the
// process's address space. Chunks are managed in a sparse-array-style structure
// similar to mheap.arenas, since the bitmap may be large on some systems.
//
// The bitmap is efficiently searched by using a radix tree in combination
// with fast bit-wise intrinsics. Allocation is performed using an address-ordered
// first-fit approach.
//
// Each entry in the radix tree is a summary that describes three properties of
// a particular region of the address space: the number of contiguous free pages
// at the start and end of the region it represents, and the maximum number of
// contiguous free pages found anywhere in that region.
//
// Each level of the radix tree is stored as one contiguous array, which represents
// a different granularity of subdivision of the processes' address space. Thus, this
// radix tree is actually implicit in these large arrays, as opposed to having explicit
// dynamically-allocated pointer-based node structures. Naturally, these arrays may be
// quite large for system with large address spaces, so in these cases they are mapped
// into memory as needed. The leaf summaries of the tree correspond to a bitmap chunk.
//
// The root level (referred to as L0 and index 0 in pageAlloc.summary) has each
// summary represent the largest section of address space (16 GiB on 64-bit systems),
// with each subsequent level representing successively smaller subsections until we
// reach the finest granularity at the leaves, a chunk.
//
// More specifically, each summary in each level (except for leaf summaries)
// represents some number of entries in the following level. For example, each
// summary in the root level may represent a 16 GiB region of address space,
// and in the next level there could be 8 corresponding entries which represent 2
// GiB subsections of that 16 GiB region, each of which could correspond to 8
// entries in the next level which each represent 256 MiB regions, and so on.
//
// Thus, this design only scales to heaps so large, but can always be extended to
// larger heaps by simply adding levels to the radix tree, which mostly costs
// additional virtual address space. The choice of managing large arrays also means
// that a large amount of virtual address space may be reserved by the runtime.
type pageAlloc struct {
   // Radix tree of summaries.
   //
   // Each slice's cap represents the whole memory reservation.
   // Each slice's len reflects the allocator's maximum known
   // mapped heap address for that level.
 
   summary [summaryLevels][]pallocSum

   // chunks is a slice of bitmap chunks (a two-level sparse array).
   chunks [1 << pallocChunksL1Bits]*[1 << pallocChunksL2Bits]pallocData

   // searchAddr is the address to start an allocation search from.
   searchAddr offAddr

   // start and end represent the chunk indices
   // which pageAlloc knows about. It assumes
   // chunks in the range [start, end) are
   // currently ready to use.
   start, end chunkIdx

   inUse addrRanges

   // scav stores the scavenger state.
   ...
}
// pallocSum is a packed summary type which packs three numbers: start, max,
// and end into a single 8-byte value. Each of these values are a summary of
// a bitmap and are thus counts, each of which may have a maximum value of
// 2^21 - 1, or all three may be equal to 2^21. The latter case is represented
// by just setting the 64th bit.
type pallocSum uint64
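The packing described in the comment can be illustrated with a small pack/unpack sketch: start, max, and end each occupy a 21-bit field, and the fully-free case (all three equal 2^21) is encoded by setting the top bit. This mirrors the scheme above but is not the runtime's code.

package main

import "fmt"

const (
    logMaxPackedValue = 21
    maxPackedValue    = 1 << logMaxPackedValue
)

type pallocSumSketch uint64

func packSketch(start, max, end uint64) pallocSumSketch {
    if max == maxPackedValue {
        // All three must be 2^21 in this case: the whole region is free.
        return pallocSumSketch(1 << 63)
    }
    return pallocSumSketch(start | max<<logMaxPackedValue | end<<(2*logMaxPackedValue))
}

func (s pallocSumSketch) unpack() (start, max, end uint64) {
    if s&(1<<63) != 0 {
        return maxPackedValue, maxPackedValue, maxPackedValue
    }
    mask := uint64(maxPackedValue - 1)
    return uint64(s) & mask,
        (uint64(s) >> logMaxPackedValue) & mask,
        (uint64(s) >> (2 * logMaxPackedValue)) & mask
}

func main() {
    s := packSketch(3, 1000, 42)
    fmt.Println(s.unpack()) // 3 1000 42
}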

gc

Note that escaping to the heap must also be transitive: if a reference to a Go value is written into another Go value that has already been determined to escape, that value must also escape.

Together, objects and pointers to other objects form the object graph. To identify live memory, the GC walks the object graph starting at the program's roots, pointers that identify objects that are definitely in-use by the program. Two examples of roots are local variables and global variables. The process of walking the object graph is referred to as scanning.

gc phases

The current Go GC uses tri-color marking combined with a hybrid write barrier.

The Go GC goes through the following phases (a minimal marking sketch follows the list):

  • STW: enable the hybrid write barrier.

  • Put all objects in the white set. Starting from the root objects, move them into the grey set. Repeatedly take an object out of the grey set and mark it black, then walk its children, mark them grey, and add them to the grey set.

  • Repeat until the grey set is empty. The remaining white objects are the ones that need to be reclaimed.

  • STW: disable the hybrid write barrier.

  • Sweep in the background (concurrently).
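Below is a minimal worklist-based tri-color mark to make the phases above concrete. It illustrates the algorithm only, not the runtime's implementation, which works on spans, mark bits, and per-P work buffers rather than explicit color sets.

package main

import "fmt"

// A toy object graph and a worklist-based tri-color mark.
type object struct {
    name     string
    children []*object
    color    int // 0 = white, 1 = grey, 2 = black
}

const (
    white = iota
    grey
    black
)

func mark(roots []*object) {
    // Everything starts white; roots are shaded grey and queued.
    var worklist []*object
    for _, r := range roots {
        r.color = grey
        worklist = append(worklist, r)
    }
    // Repeatedly blacken a grey object and shade its white children grey.
    for len(worklist) > 0 {
        obj := worklist[len(worklist)-1]
        worklist = worklist[:len(worklist)-1]
        obj.color = black
        for _, c := range obj.children {
            if c.color == white {
                c.color = grey
                worklist = append(worklist, c)
            }
        }
    }
    // When the worklist (grey set) is empty, anything still white is garbage.
}

func main() {
    b := &object{name: "B"}
    a := &object{name: "A", children: []*object{b}}
    dead := &object{name: "dead"}
    mark([]*object{a})
    fmt.Println(a.color == black, b.color == black, dead.color == white) // true true true
}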

Problems with concurrent mark and sweep

Mutator (user) code updates the object graph concurrently while the collector is running.

Suppose a grey object A points to a white object B. If the mutator concurrently makes a black object C point to B (ref3) and removes A's reference to B (ref2), then as scanning continues B will never be marked black (the collector does not re-scan black objects), so B ends up being incorrectly reclaimed.

[figure: grey object A references white object B (ref2); black object C gains a reference to B (ref3) while ref2 is removed]

(go.googlesource.com/proposal/+/…)

Go 1.7 uses a coarsened Dijkstra write barrier [Dijkstra '78], where pointer writes are implemented as follows:

writePointer(slot, ptr):
    shade(ptr)
    *slot = ptr

shade(ptr) marks the object at ptr grey if it is not already grey or black. This ensures the strong tricolor invariant by conservatively assuming that *slot may be in a black object, and ensuring ptr cannot be white before installing it in *slot. (In the figure above, this corresponds to shading B grey.)

It presents a trade-off for pointers on stacks: either writes to pointers on the stack must have write barriers, which is prohibitively expensive, or stacks must be permagrey (permanently grey). Go chooses the latter, which means that many stacks must be re-scanned during STW.

Re-scanning the stacks can take 10s to 100s of milliseconds in an application with a large number of active goroutines.

hybrid write barrier

  • to eliminate stack re-scanning
  • combines a Yuasa-style deletion write barrier [Yuasa '90] with a Dijkstra-style insertion write barrier [Dijkstra '78]:

writePointer(slot, ptr):
    shade(*slot)
    if current stack is grey: // i.e. the stack has not been fully scanned yet
        shade(ptr)
    *slot = ptr

the write barrier shades the object whose reference is being overwritten, and, if the current goroutine's stack has not yet been scanned, also shades the reference being installed. (In the figure above, removing A's reference to B shades B grey via shade(*slot), so B is not lost.)

The hybrid barrier combines the best of the Dijkstra barrier and the Yuasa barrier. The Yuasa barrier requires a STW at the beginning of marking to either scan or snapshot stacks, but does not require a re-scan at the end of marking. The Dijkstra barrier lets concurrent marking start right away, but requires a STW at the end of marking to re-scan stacks (though more sophisticated non-STW approaches are possible [Hudson '97]). The hybrid barrier inherits the best properties of both, allowing stacks to be concurrently scanned at the beginning of the mark phase, while also keeping stacks black after this initial scan.

Source code

// The GC runs concurrently with mutator threads, is type accurate (aka precise), allows multiple
// GC threads to run in parallel. It is a concurrent mark and sweep that uses a write barrier. It is
// non-generational and non-compacting.
// 1. GC performs sweep termination.
//
//    a. Stop the world. This causes all Ps to reach a GC safe-point.
//
//    b. Sweep any unswept spans. There will only be unswept spans if
//    this GC cycle was forced before the expected time.
//
// 2. GC performs the mark phase.
//
//    a. Prepare for the mark phase by setting gcphase to _GCmark
//    (from _GCoff), enabling the write barrier, enabling mutator
//    assists, and enqueueing root mark jobs. No objects may be
//    scanned until all Ps have enabled the write barrier, which is
//    accomplished using STW.
//
//    b. Start the world. From this point, GC work is done by mark
//    workers started by the scheduler and by assists performed as
//    part of allocation. The write barrier shades both the
//    overwritten pointer and the new pointer value for any pointer
//    writes (see mbarrier.go for details). Newly allocated objects
//    are immediately marked black.
//
//    c. GC performs root marking jobs. This includes scanning all
//    stacks, shading all globals, and shading any heap pointers in
//    off-heap runtime data structures. Scanning a stack stops a
//    goroutine, shades any pointers found on its stack, and then
//    resumes the goroutine.
//
//    d. GC drains the work queue of grey objects, scanning each grey
//    object to black and shading all pointers found in the object
//    (which in turn may add those pointers to the work queue).
//
//    e. Because GC work is spread across local caches, GC uses a
//    distributed termination algorithm to detect when there are no
//    more root marking jobs or grey objects (see gcMarkDone). At this
//    point, GC transitions to mark termination.
//
// 3. GC performs mark termination.
//
//    a. Stop the world.
//
//    b. Set gcphase to _GCmarktermination, and disable workers and
//    assists.
//
//    c. Perform housekeeping like flushing mcaches.
//
// 4. GC performs the sweep phase.
//
//    a. Prepare for the sweep phase by setting gcphase to _GCoff,
//    setting up sweep state and disabling the write barrier.
//
//    b. Start the world. From this point on, newly allocated objects
//    are white, and allocating sweeps spans before use if necessary.
//
//    c. GC does concurrent sweeping in the background and in response
//    to allocation. See description below.
//
// 5. When sufficient allocation has taken place, replay the sequence
// starting with 1 above. See discussion of GC rate below.
// Concurrent sweep.
//
// The sweep phase proceeds concurrently with normal program execution.
// The heap is swept span-by-span both lazily (when a goroutine needs another span)
// and concurrently in a background goroutine (this helps programs that are not CPU bound).
// At the end of STW mark termination all spans are marked as "needs sweeping".

// The background sweeper goroutine simply sweeps spans one-by-one.

// To avoid requesting more OS memory while there are unswept spans, when a
// goroutine needs another span, it first attempts to reclaim that much memory
// by sweeping. When a goroutine needs to allocate a new small-object span, it
// sweeps small-object spans for the same object size until it frees at least
// one object. When a goroutine needs to allocate large-object span from heap,
// it sweeps spans until it frees at least that many pages into heap. There is
// one case where this may not suffice: if a goroutine sweeps and frees two
// nonadjacent one-page spans to the heap, it will allocate a new two-page
// span, but there can still be other one-page unswept spans which could be
// combined into a two-page span.

Understanding costs

To begin with, consider this model of GC cost based on three simple axioms.

  1. The GC involves only two resources: CPU time, and physical memory.

  2. The GC's memory costs consist of live heap memory, new heap memory allocated before the mark phase, and space for metadata that, even if proportional to the previous costs, are small in comparison.

    Note: live heap memory is memory that was determined to be live by the previous GC cycle, while new heap memory is any memory allocated in the current cycle, which may or may not be live by the end.

  3. The GC's CPU costs are modeled as a fixed cost per cycle, and a marginal cost that scales proportionally with the size of the live heap (a small numeric sketch follows this list).

    Note: Asymptotically speaking, sweeping scales worse than marking and scanning, as it must perform work proportional to the size of the whole heap, including memory that is determined to be not live (i.e. "dead"). However, in the current implementation sweeping is so much faster than marking and scanning that its associated costs can be ignored in this discussion.
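As a toy illustration of axiom 3, the per-cycle GC CPU cost can be modeled as a fixed part plus a part proportional to the live heap. The constants below are invented for the example, not measured values.

package main

import "fmt"

// gcCPUPerCycle models axiom 3: per-cycle GC CPU cost is a fixed part
// plus a part proportional to the live heap. The constants are made up
// purely to show the shape of the model.
func gcCPUPerCycle(liveHeapBytes float64) float64 {
    const (
        fixedCostMs  = 0.5  // fixed per-cycle cost, e.g. phase transitions
        costPerMiBMs = 0.02 // marginal cost of marking/scanning per MiB of live heap
    )
    return fixedCostMs + costPerMiBMs*(liveHeapBytes/(1<<20))
}

func main() {
    fmt.Printf("100 MiB live: %.1f ms/cycle\n", gcCPUPerCycle(100<<20))
    fmt.Printf("  1 GiB live: %.1f ms/cycle\n", gcCPUPerCycle(1<<30))
}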

scavenge

bgscavenge

// The current value is chosen assuming a cost of ~10µs/physical page
// (this is somewhat pessimistic), which implies a worst-case latency of
// about 160µs for 4 KiB physical pages. The current value is biased
// toward latency over throughput.
const scavengeQuantum = 64 << 10 // 64 KiB

(At ~10µs per physical page, one 64 KiB quantum of 4 KiB pages is 16 pages, hence the ~160µs worst-case latency quoted above.)

r := mheap_.pages.scavenge(scavengeQuantum)

Proposal

go.googlesource.com/proposal/+/…

  1. Scavenge at a rate proportional to the rate at which the application is allocating memory.
  2. Retain some constant times the peak heap goal over the last N GCs.
  3. Scavenge the unscavenged spans with the highest base addresses first.

Rationale

Thus, I propose a more robust alternative: change the Go runtime’s span allocation policy to be first-fit, rather than best-fit. Address-ordered first-fit allocation policies generally perform as well as best-fit in practice when it comes to fragmentation [Johnstone98], a claim which I verified holds true for the Go runtime by simulating a large span allocation trace.

Furthermore, I propose we then scavenge the spans with the highest base address first. The advantage of a first-fit allocation policy here is that we know something about which chunks of memory will actually be chosen, which leads us to a sensible scavenging policy.

Implementing a First-fit Data Structure

First, modify the existing treap implementation to sort by a span's base address (left child's base address < current node's base address < right child's base address).

Next, attach a new field to each binary tree node called maxPages. This field represents the maximum size, in 8 KiB pages, of a span in the subtree rooted at that node (maxPages is used to track the per-subtree maximum, like a max-heap).

For a leaf node, maxPages is always equal to the node’s span’s length. This invariant is maintained every time the tree changes. For most balanced trees, the tree may change in one of three ways: insertion, removal, and tree rotations.

func Find(root, pages):
  t = root
  for t != nil:
    if t.left != nil and t.left.maxPages >= pages:
      t = t.left
    else if t.span.pages >= pages:
      return t.span
    else if t.right != nil and t.right.maxPages >= pages:
      t = t.right
    else:
      return nil
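A runnable Go version of the Find pseudocode above might look like this. It is a sketch of the proposal's data structure (an address-ordered tree with a cached subtree maxPages), not the runtime's actual treap implementation.

package main

import "fmt"

// An address-ordered binary tree of spans where each node caches maxPages,
// the largest span size (in 8 KiB pages) in its subtree. find returns the
// lowest-address span with at least `pages` pages: address-ordered first fit.
type span struct {
    base  uintptr // base address; the tree is sorted by this
    pages uintptr // span length in 8 KiB pages
}

type node struct {
    span        *span
    maxPages    uintptr // max span.pages in this subtree
    left, right *node
}

func find(root *node, pages uintptr) *span {
    t := root
    for t != nil {
        switch {
        case t.left != nil && t.left.maxPages >= pages:
            // A big-enough span exists at a lower address: go left first.
            t = t.left
        case t.span.pages >= pages:
            // Nothing suitable to the left, and this node fits: first fit.
            return t.span
        case t.right != nil && t.right.maxPages >= pages:
            t = t.right
        default:
            return nil
        }
    }
    return nil
}

func main() {
    // Manually built tree; in practice the maxPages invariant is maintained
    // on insertion, removal, and rotations, as described above.
    s1 := &span{base: 0x1000, pages: 1}
    s2 := &span{base: 0x3000, pages: 4}
    s3 := &span{base: 0x9000, pages: 8}
    root := &node{
        span: s2, maxPages: 8,
        left:  &node{span: s1, maxPages: 1},
        right: &node{span: s3, maxPages: 8},
    }
    fmt.Printf("%#x\n", find(root, 3).base) // 0x3000: lowest-address span with >= 3 pages
}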

func scavenge

// scavenge scavenges nbytes worth of free pages, starting with the
// highest address first. Successive calls continue from where it left
// off until the heap is exhausted. Call scavengeStartGen to bring it
// back to the top of the heap.
//
// Returns the amount of memory scavenged in bytes.
func (p *pageAlloc) scavenge(nbytes uintptr) uintptr {
   var (
      addrs addrRange
      gen   uint32
   )
   released := uintptr(0)
   for released < nbytes {
      if addrs.size() == 0 {
         if addrs, gen = p.scavengeReserve(); addrs.size() == 0 {
            break
         }
      }
      systemstack(func() {
         r, a := p.scavengeOne(addrs, nbytes-released)
         released += r
         addrs = a
      })
   }
   // Only unreserve the space which hasn't been scavenged or searched
   // to ensure we always make progress.
   p.scavengeUnreserve(addrs, gen)
   return released
}