Concurrent Copying GC Overview
Performance and memory improvements in Android Run Time (ART) (Google I/O '17)
At Google I/O '17, Android's ART team presented the design and flow of the CC algorithm. At a high level, CC consists of a Pause Phase, a Copying Phase, and a Reclaim Phase.
The full CC flow is somewhat more involved than that, so let's start from the algorithm's entry point, RunPhases:
void ConcurrentCopying::RunPhases() {
  CHECK(kUseBakerReadBarrier || kUseTableLookupReadBarrier);
  CHECK(!is_active_);
  is_active_ = true;
  Thread* self = Thread::Current();
  thread_running_gc_ = self;
  Locks::mutator_lock_->AssertNotHeld(self);
  {
    ReaderMutexLock mu(self, *Locks::mutator_lock_);
    InitializePhase();
    // In case of forced evacuation, all regions are evacuated and hence no
    // need to compute live_bytes.
    if (use_generational_cc_ && !young_gen_ && !force_evacuate_all_) {
      MarkingPhase();
    }
  }
  if (kUseBakerReadBarrier && kGrayDirtyImmuneObjects) {
    // Switch to read barrier mark entrypoints before we gray the objects. This is required in case
    // a mutator sees a gray bit and dispatches on the entrypoint. (b/37876887).
    ActivateReadBarrierEntrypoints();
    // Gray dirty immune objects concurrently to reduce GC pause times. We re-process gray cards in
    // the pause.
    ReaderMutexLock mu(self, *Locks::mutator_lock_);
    GrayAllDirtyImmuneObjects();
  }
  FlipThreadRoots();
  {
    ReaderMutexLock mu(self, *Locks::mutator_lock_);
    CopyingPhase();
  }
  // Verify no from space refs. This causes a pause.
  if (kEnableNoFromSpaceRefsVerification) {
    TimingLogger::ScopedTiming split("(Paused)VerifyNoFromSpaceReferences", GetTimings());
    ScopedPause pause(this, false);
    CheckEmptyMarkStack();
    if (kVerboseMode) {
      LOG(INFO) << "Verifying no from-space refs";
    }
    VerifyNoFromSpaceReferences();
    if (kVerboseMode) {
      LOG(INFO) << "Done verifying no from-space refs";
    }
    CheckEmptyMarkStack();
  }
  {
    ReaderMutexLock mu(self, *Locks::mutator_lock_);
    ReclaimPhase();
  }
  FinishPhase();
  CHECK(is_active_);
  is_active_ = false;
  thread_running_gc_ = nullptr;
}
The Concurrent Copying GC algorithm revolves around a handful of key functions:
- InitializePhase
- MarkingPhase
- GrayAllDirtyImmuneObjects
- FlipThreadRoots
- CopyingPhase
- ReclaimPhase
The sections below analyze each of these functions in turn.
InitializePhase
The initialization function; it mainly initializes bookkeeping state and calls BindBitmaps to determine the scope of the GC:
void ConcurrentCopying::InitializePhase() {
  TimingLogger::ScopedTiming split("InitializePhase", GetTimings());
  num_bytes_allocated_before_gc_ = static_cast<int64_t>(heap_->GetBytesAllocated());
  ...
  CheckEmptyMarkStack();
  rb_mark_bit_stack_full_ = false;
  mark_from_read_barrier_measurements_ = measure_read_barrier_slow_path_;
  if (measure_read_barrier_slow_path_) {
    rb_slow_path_ns_.store(0, std::memory_order_relaxed);
    rb_slow_path_count_.store(0, std::memory_order_relaxed);
    rb_slow_path_count_gc_.store(0, std::memory_order_relaxed);
  }
  immune_spaces_.Reset();
  bytes_moved_.store(0, std::memory_order_relaxed);
  objects_moved_.store(0, std::memory_order_relaxed);
  bytes_moved_gc_thread_ = 0;
  objects_moved_gc_thread_ = 0;
  bytes_scanned_ = 0;
  GcCause gc_cause = GetCurrentIteration()->GetGcCause();
  force_evacuate_all_ = false;
  if (!use_generational_cc_ || !young_gen_) {
    if (gc_cause == kGcCauseExplicit ||
        gc_cause == kGcCauseCollectorTransition ||
        GetCurrentIteration()->GetClearSoftReferences()) {
      force_evacuate_all_ = true;
    }
  }
  if (kUseBakerReadBarrier) {
    updated_all_immune_objects_.store(false, std::memory_order_relaxed);
    // GC may gray immune objects in the thread flip.
    gc_grays_immune_objects_ = true;
    if (kIsDebugBuild) {
      MutexLock mu(Thread::Current(), immune_gray_stack_lock_);
      DCHECK(immune_gray_stack_.empty());
    }
  }
  if (use_generational_cc_) {
    done_scanning_.store(false, std::memory_order_release);
  }
  BindBitmaps();
  ...
  if (use_generational_cc_ && !young_gen_) {
    region_space_bitmap_->Clear(ShouldEagerlyReleaseMemoryToOS());
  }
  mark_stack_mode_.store(ConcurrentCopying::kMarkStackModeThreadLocal, std::memory_order_release);
  // Mark all of the zygote large objects without graying them.
  MarkZygoteLargeObjects();
}
BindBitmaps deserves a closer look:
void ConcurrentCopying::BindBitmaps() {
  Thread* self = Thread::Current();
  WriterMutexLock mu(self, *Locks::heap_bitmap_lock_);
  for (const auto& space : heap_->GetContinuousSpaces()) {
    // Add the spaces that do not need collecting -- ZygoteSpace and
    // ImageSpace -- to immune_spaces_.
    if (space->GetGcRetentionPolicy() == space::kGcRetentionPolicyNeverCollect ||
        space->GetGcRetentionPolicy() == space::kGcRetentionPolicyFullCollect) {
      CHECK(space->IsZygoteSpace() || space->IsImageSpace());
      immune_spaces_.AddSpace(space);
    } else {
      // Among the continuous spaces, everything other than ZygoteSpace and
      // ImageSpace -- i.e. RegionSpace and NonMovingSpace -- must be collected.
      CHECK(!space->IsZygoteSpace());
      CHECK(!space->IsImageSpace());
      CHECK(space == region_space_ || space == heap_->non_moving_space_);
      if (use_generational_cc_) {
        if (space == region_space_) {
          // Fetch the RegionSpace's mark bitmap.
          region_space_bitmap_ = region_space_->GetMarkBitmap();
        } else if (young_gen_ && space->IsContinuousMemMapAllocSpace()) {
          DCHECK_EQ(space->GetGcRetentionPolicy(), space::kGcRetentionPolicyAlwaysCollect);
          // Copy the live bitmap into the mark bitmap. Young GC uses this
          // to avoid re-tracing objects that survived the previous cycle.
          space->AsContinuousMemMapAllocSpace()->BindLiveToMarkBitmap();
        }
        if (young_gen_) {
          // For a young GC, age all of the cards.
          heap_->GetCardTable()->ModifyCardsAtomic(space->Begin(),
                                                   space->End(),
                                                   AgeCardVisitor(),
                                                   VoidFunctor());
        } else {
          heap_->GetCardTable()->ClearCardRange(space->Begin(), space->Limit());
        }
      } else {
        if (space == region_space_) {
          region_space_bitmap_ = region_space_->GetMarkBitmap();
          region_space_bitmap_->Clear(ShouldEagerlyReleaseMemoryToOS());
        }
      }
    }
  }
  if (use_generational_cc_ && young_gen_) {
    for (const auto& space : GetHeap()->GetDiscontinuousSpaces()) {
      CHECK(space->IsLargeObjectSpace());
      space->AsLargeObjectSpace()->CopyLiveToMarked();
    }
  }
}
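The card-aging step for young GC is worth a small illustration. Below is a minimal sketch of what AgeCardVisitor does to each card byte, using the constant values from ART's CardTable (kCardDirty is 0x70 there; the card size is left out, as it has varied across versions): dirty cards from before this cycle are demoted to an "aged" value, so the GC can distinguish pre-existing dirty cards from cards dirtied while it runs.

#include <cstddef>
#include <cstdint>

// One byte per card of heap; values mirror ART's CardTable constants.
constexpr uint8_t kCardClean = 0x00;
constexpr uint8_t kCardDirty = 0x70;
constexpr uint8_t kCardAged  = kCardDirty - 1;

// Same shape as ART's AgeCardVisitor: dirty cards become aged, everything
// else (including previously aged cards) is cleared.
inline uint8_t AgeCard(uint8_t card) {
  return (card == kCardDirty) ? kCardAged : kCardClean;
}

// ModifyCardsAtomic applies such a visitor to every card byte covering
// [begin, end); a plain loop stands in for the atomic version here.
inline void AgeCards(uint8_t* cards, size_t num_cards) {
  for (size_t i = 0; i < num_cards; ++i) {
    cards[i] = AgeCard(cards[i]);
  }
}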
BindBitmaps() involves an important data structure, MarkBitmap, which is a SpaceBitmap:
using ContinuousSpaceBitmap = SpaceBitmap<kObjectAlignment>;
// Mark bitmap used by the GC.
accounting::ContinuousSpaceBitmap mark_bitmap_;
ART uses a SpaceBitmap to track marks across the entire heap:
// Initialize a space bitmap so that it points to a bitmap large enough to cover a heap at
// heap_begin of heap_capacity bytes, where objects are guaranteed to be kAlignment-aligned.
template<size_t kAlignment>
SpaceBitmap<kAlignment> SpaceBitmap<kAlignment>::Create(
    const std::string& name, uint8_t* heap_begin, size_t heap_capacity) {
  // Round up since `heap_capacity` is not necessarily a multiple of `kAlignment * kBitsPerIntPtrT`
  // (we represent one word as an `intptr_t`).
  const size_t bitmap_size = ComputeBitmapSize(heap_capacity);
  std::string error_msg;
  MemMap mem_map = MemMap::MapAnonymous(name.c_str(),
                                        bitmap_size,
                                        PROT_READ | PROT_WRITE,
                                        /*low_4gb=*/ false,
                                        &error_msg);
  if (UNLIKELY(!mem_map.IsValid())) {
    LOG(ERROR) << "Failed to allocate bitmap " << name << ": " << error_msg;
    return SpaceBitmap<kAlignment>();
  }
  return CreateFromMemMap(name, std::move(mem_map), heap_begin, heap_capacity);
}

template<size_t kAlignment>
SpaceBitmap<kAlignment>::SpaceBitmap(const std::string& name,
                                     MemMap&& mem_map,
                                     uintptr_t* bitmap_begin,
                                     size_t bitmap_size,
                                     const void* heap_begin,
                                     size_t heap_capacity)
    : mem_map_(std::move(mem_map)),
      bitmap_begin_(reinterpret_cast<Atomic<uintptr_t>*>(bitmap_begin)),
      bitmap_size_(bitmap_size),
      heap_begin_(reinterpret_cast<uintptr_t>(heap_begin)),
      heap_limit_(reinterpret_cast<uintptr_t>(heap_begin) + heap_capacity),
      name_(name) {
  CHECK(bitmap_begin_ != nullptr);
  CHECK_NE(bitmap_size, 0U);
}
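The address-to-bit mapping is the heart of SpaceBitmap. Here is a minimal, self-contained sketch of the arithmetic (the helper names mirror ART's OffsetToIndex/OffsetToMask, but the code is illustrative rather than the actual declarations). With 8-byte alignment and 64-bit words, one bitmap word covers 512 bytes of heap, so the bitmap costs about 1/64 of the heap it describes.

#include <cstddef>
#include <cstdint>

constexpr size_t kAlignment = 8;  // objects are kObjectAlignment-aligned
constexpr size_t kBitsPerWord = sizeof(uintptr_t) * 8;

// Which bitmap word covers the object at heap offset `offset`.
constexpr size_t OffsetToIndex(uintptr_t offset) {
  return offset / (kAlignment * kBitsPerWord);
}

// Which bit inside that word corresponds to the object.
constexpr uintptr_t OffsetToMask(uintptr_t offset) {
  return static_cast<uintptr_t>(1) << ((offset / kAlignment) % kBitsPerWord);
}

// Setting the mark bit for an object; ART performs the OR with an atomic
// fetch_or so that concurrent markers do not lose updates.
inline bool SetBit(uintptr_t* bitmap, uintptr_t heap_begin, uintptr_t addr) {
  const uintptr_t offset = addr - heap_begin;
  uintptr_t& word = bitmap[OffsetToIndex(offset)];
  const uintptr_t mask = OffsetToMask(offset);
  const bool was_set = (word & mask) != 0;
  word |= mask;
  return was_set;  // ART's Set() also reports whether it was already marked
}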
The MarkBitmap records every object marked during a GC; the MarkingPhase section below covers it in detail.
MarkingPhase
Concurrent Copying GC is a tracing GC, so it uses the classic tri-color abstraction for tracing:
- White: not yet reached by the scan
- Gray: reached, but the objects it references have not all been scanned yet
- Black: both it and every object it references have been scanned
Concurrent Copying GC implements tri-color marking with two data structures:
- MarkBitmap
- MarkStack
The MarkBitmap records every object the scan reaches; a marked object is live, while an unmarked object is garbage to be reclaimed.
The MarkStack is the set of objects whose outgoing references still need processing (the gray objects). Tracing then works as follows (a sketch follows this list):
- Push the root-set objects onto the MarkStack
- Take the object A on top of the MarkStack and scan all of its references, pushing each referenced object onto the MarkStack
- Recurse on the new top of the stack until A is back on top, which means all of A's references have been processed
- Pop A off the MarkStack and mark it black
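As a concrete illustration, here is a minimal, self-contained sketch of mark-stack-driven tri-color tracing. It uses a hash set in place of the MarkBitmap and pops an object as it is scanned (rather than leaving it on the stack until its subtree finishes); both variants reach the same fixed point.

#include <unordered_set>
#include <vector>

// Hypothetical object type, for illustration only.
struct Obj {
  std::vector<Obj*> refs;
};

void Trace(const std::vector<Obj*>& roots) {
  std::unordered_set<Obj*> marked;  // stands in for the MarkBitmap
  std::vector<Obj*> mark_stack;     // stands in for the MarkStack
  // Roots start out gray: marked and on the stack.
  for (Obj* root : roots) {
    if (root != nullptr && marked.insert(root).second) {
      mark_stack.push_back(root);
    }
  }
  while (!mark_stack.empty()) {
    Obj* obj = mark_stack.back();
    mark_stack.pop_back();
    // Scan obj's references; newly reached objects turn gray.
    for (Obj* ref : obj->refs) {
      if (ref != nullptr && marked.insert(ref).second) {
        mark_stack.push_back(ref);
      }
    }
    // obj is now black: marked, with all of its references scanned.
  }
  // Any object absent from `marked` is white, i.e. unreachable garbage.
}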
The MarkStack actually operates in one of three modes:
- Thread-Local Mark Stack Mode
- Shared Mark Stack Mode
- GC-Exclusive Mark Stack Mode
The reason for these modes is explained in the CopyingPhase section.
Because Concurrent Copying GC is concurrent, mutator threads may keep reading and writing already-marked objects while marking runs. To cope with this, CC distinguishes two black states: black-clean means an object's references have not been modified since it was scanned, while black-dirty means they have been modified, so the object must be re-traced during the CopyingPhase.
Gray has no clean variant: a gray object's references have not been fully scanned yet, so it is necessarily gray-dirty.
Not every GC needs a MarkingPhase: a young Concurrent Copying GC already has its MarkBitmap set up during initialization (via BindLiveToMarkBitmap), so it skips the MarkingPhase entirely.
/* Invariants for two-phase CC
* ===========================
* A) Definitions
* ---------------
* 1) Black: marked in bitmap, rb_state is non-gray, and not in mark stack
* 2) Black-clean: marked in bitmap, and corresponding card is clean/aged
* 3) Black-dirty: marked in bitmap, and corresponding card is dirty
* 4) Gray: marked in bitmap, and exists in mark stack
* 5) Gray-dirty: marked in bitmap, rb_state is gray, corresponding card is
* dirty, and exists in mark stack
* 6) White: unmarked in bitmap, rb_state is non-gray, and not in mark stack
*
* B) Before marking phase
* -----------------------
* 1) All objects are white
* 2) Cards are either clean or aged (cannot be asserted without a STW pause)
* 3) Mark bitmap is cleared
* 4) Mark stack is empty
*
* C) During marking phase
* ------------------------
* 1) If a black object holds an inter-region or white reference, then its
* corresponding card is dirty. In other words, it changes from being
* black-clean to black-dirty
* 2) No black-clean object points to a white object
*
* D) After marking phase
* -----------------------
* 1) There are no gray objects
* 2) All newly allocated objects are in from space
* 3) No white object can be reachable, directly or otherwise, from a
* black-clean object
*
* E) During copying phase
* ------------------------
* 1) Mutators cannot observe white and black-dirty objects
* 2) New allocations are in to-space (newly allocated regions are part of to-space)
* 3) An object in mark stack must have its rb_state = Gray
*
* F) During card table scan
* --------------------------
* 1) Referents corresponding to root references are gray or in to-space
* 2) Every path from an object that is read or written by a mutator during
* this period to a dirty black object goes through some gray object.
* Mutators preserve this by graying black objects as needed during this
* period. Ensures that a mutator never encounters a black dirty object.
*
* G) After card table scan
* ------------------------
* 1) There are no black-dirty objects
* 2) Referents corresponding to root references are gray, black-clean or in
* to-space
*
* H) After copying phase
* -----------------------
* 1) Mark stack is empty
* 2) No references into evacuated from-space
* 3) No reference to an object which is unmarked and is also not in newly
* allocated region. In other words, no reference to white objects.
*/
void ConcurrentCopying::MarkingPhase() {
  TimingLogger::ScopedTiming split("MarkingPhase", GetTimings());
  if (kVerboseMode) {
    LOG(INFO) << "GC MarkingPhase";
  }
  accounting::CardTable* const card_table = heap_->GetCardTable();
  Thread* const self = Thread::Current();
  CHECK_EQ(self, thread_running_gc_);
  // Clear live_bytes_ of every non-free region, except the ones that are newly
  // allocated.
  region_space_->SetAllRegionLiveBytesZero();
  if (kIsDebugBuild) {
    region_space_->AssertAllRegionLiveBytesZeroOrCleared();
  }
  // Scan immune spaces
  {
    TimingLogger::ScopedTiming split2("ScanImmuneSpaces", GetTimings());
    for (auto& space : immune_spaces_.GetSpaces()) {
      DCHECK(space->IsImageSpace() || space->IsZygoteSpace());
      accounting::ContinuousSpaceBitmap* live_bitmap = space->GetLiveBitmap();
      accounting::ModUnionTable* table = heap_->FindModUnionTableFromSpace(space);
      ImmuneSpaceCaptureRefsVisitor visitor(this);
      if (table != nullptr) {
        table->VisitObjects(ImmuneSpaceCaptureRefsVisitor::Callback, &visitor);
      } else {
        WriterMutexLock rmu(Thread::Current(), *Locks::heap_bitmap_lock_);
        card_table->Scan<false>(
            live_bitmap,
            space->Begin(),
            space->Limit(),
            visitor,
            accounting::CardTable::kCardDirty - 1);
      }
    }
  }
  // Scan runtime roots
  {
    TimingLogger::ScopedTiming split2("VisitConcurrentRoots", GetTimings());
    CaptureRootsForMarkingVisitor visitor(this, self);
    Runtime::Current()->VisitConcurrentRoots(&visitor, kVisitRootFlagAllRoots);
  }
  {
    // TODO: don't visit the transaction roots if it's not active.
    TimingLogger::ScopedTiming split2("VisitNonThreadRoots", GetTimings());
    CaptureRootsForMarkingVisitor visitor(this, self);
    Runtime::Current()->VisitNonThreadRoots(&visitor);
  }
  // Capture thread roots
  CaptureThreadRootsForMarking();
  // Process mark stack
  ProcessMarkStackForMarkingAndComputeLiveBytes();
  if (kVerboseMode) {
    LOG(INFO) << "GC end of MarkingPhase";
  }
}
As the code shows, the objects to mark come from several sources:
A) ImmuneSpaces
When an object in an immune space references an object in another space, the corresponding card is marked kCardDirty. Scanning the objects on those dirty cards marks them in the MarkBitmap and pushes them onto gc_mark_stack_.
B) Runtime Roots
- intern_table_, class_linker_, jni_id_manager_, jit_
- resolution_method_, imt_conflict_method_, imt_unimplemented_method_, ArtMethods, etc.
C) Non-Thread Roots
- Global objects in the VM
- sentinel_, pre_allocated_OutOfMemoryError_when_throwing_exception_, pre_allocated_OutOfMemoryError_when_throwing_oome_, pre_allocated_OutOfMemoryError_when_handling_stack_overflow_, pre_allocated_NoClassDefFoundError_
- ImageRoots
- Transaction roots for AOT compilation
D) Thread Roots: per-thread state such as the tlsPtr_ members backing thread-local storage, plus the objects referenced from each thread's call stack.
Finally, ProcessMarkStackForMarkingAndComputeLiveBytes drains everything pushed onto the MarkStack above.
GrayAllDirtyImmuneObjects
Similar to ScanImmuneSpaces in MarkingPhase: this scans the immune-space objects sitting on cards marked dirty, marks them in the MarkBitmap, and pushes them onto gc_mark_stack_.
Note that dirty immune objects are grayed several times: once in MarkingPhase (not needed for young GC), once here, and once more during FlipThreadRoots.
FlipThreadRoots manipulates reference relationships and therefore needs a stop-the-world pause. To keep that pause short, the currently dirty immune objects are handled before the pause, and only the immune-space cards newly dirtied afterwards are re-processed inside it.
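A minimal sketch of the dirty-card scan behind both ScanImmuneSpaces and GrayAllDirtyImmuneObjects, under the assumption of one card byte per fixed-size block of heap (the card size here is illustrative, as ART's value has changed across releases, and the real Scan() walks the live bitmap rather than raw addresses):

#include <cstddef>
#include <cstdint>

constexpr size_t kCardShift = 10;                  // assumed 1 KiB cards
constexpr size_t kCardSize = size_t{1} << kCardShift;
constexpr uint8_t kCardDirty = 0x70;               // ART's CardTable value

// Visit the heap range covered by every sufficiently dirty card;
// `visit_range` would gray/mark each object starting inside the range.
template <typename Visitor>
void ScanDirtyCards(const uint8_t* card_table,
                    uintptr_t heap_begin,
                    uintptr_t heap_end,
                    Visitor&& visit_range) {
  const size_t num_cards = (heap_end - heap_begin + kCardSize - 1) >> kCardShift;
  for (size_t i = 0; i < num_cards; ++i) {
    if (card_table[i] >= kCardDirty) {  // ART scans cards >= a minimum age
      const uintptr_t start = heap_begin + (i << kCardShift);
      visit_range(start, start + kCardSize);
    }
  }
}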
FlipThreadRoots
FlipThreadRoots marks each thread's roots and flips them from from-space references to to-space references; the heavy lifting is done by FlipCallback and a per-thread flip visitor. Since every thread must run the visitor, ThreadList::FlipThreadRoots briefly suspends all threads, installs the visitor on each of them, runs the callback during the suspension, and then lets the threads resume:
// runtime/thread_list.h

// Used to flip thread roots from from-space refs to to-space refs. Used only by the concurrent
// moving collectors during a GC, and hence cannot be called from multiple threads concurrently.
//
// Briefly suspends all threads to atomically install a checkpoint-like thread_flip_visitor
// function to be run on each thread. Run flip_callback while threads are suspended.
// Thread_flip_visitors are run by each thread before it becomes runnable, or by us. We do not
// return until all thread_flip_visitors have been run.
void FlipThreadRoots(Closure* thread_flip_visitor,
                     Closure* flip_callback,
                     gc::collector::GarbageCollector* collector,
                     gc::GcPauseListener* pause_listener)
    REQUIRES(!Locks::mutator_lock_,
             !Locks::thread_list_lock_,
             !Locks::thread_suspend_count_lock_);
FlipCallback's main jobs are:
- SetFromSpace: decide which regions need evacuating and mark them as from-space; the remaining regions become unevacuated from-space
- GrayAllNewlyDirtyImmuneObjects: immune objects may have had their references changed since the concurrent pass, so they are processed once more; because most were already handled, this (Paused)GrayAllNewlyDirtyImmuneObjects step is far shorter than GrayAllDirtyImmuneObjects
- ThreadFlip: copy referenced from-space objects to to-space (see the forwarding sketch after the code below)
// Switch threads that from from-space to to-space refs. Forward/mark the thread roots.
void ConcurrentCopying::FlipThreadRoots() {
  TimingLogger::ScopedTiming split("FlipThreadRoots", GetTimings());
  if (kVerboseMode || heap_->dump_region_info_before_gc_) {
    LOG(INFO) << "time=" << region_space_->Time();
    region_space_->DumpNonFreeRegions(LOG_STREAM(INFO));
  }
  Thread* self = Thread::Current();
  Locks::mutator_lock_->AssertNotHeld(self);
  ThreadFlipVisitor thread_flip_visitor(this, heap_->use_tlab_);
  FlipCallback flip_callback(this);
  Runtime::Current()->GetThreadList()->FlipThreadRoots(
      &thread_flip_visitor, &flip_callback, this, GetHeap()->GetGcPauseListener());
  is_asserting_to_space_invariant_ = true;
  QuasiAtomic::ThreadFenceForConstructor();  // TODO: Remove?
  if (kVerboseMode) {
    LOG(INFO) << "time=" << region_space_->Time();
    region_space_->DumpNonFreeRegions(LOG_STREAM(INFO));
    LOG(INFO) << "GC end of FlipThreadRoots";
  }
}
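Forwarding a root implies copying the referenced object to to-space exactly once, even when several threads race to do it. Below is a minimal, self-contained sketch of the idea with a CAS-installed forwarding pointer (the types are illustrative; ART actually stashes the forwarding address in the object's lock word):

#include <atomic>
#include <cstring>

// Hypothetical object layout, for illustration only.
struct Obj {
  std::atomic<Obj*> forward{nullptr};  // to-space copy, once evacuated
  unsigned char payload[64];
};

Obj* Forward(Obj* from) {
  Obj* to = from->forward.load(std::memory_order_acquire);
  if (to != nullptr) {
    return to;  // another thread already evacuated this object
  }
  Obj* copy = new Obj();  // stands in for the to-space region allocator
  std::memcpy(copy->payload, from->payload, sizeof(from->payload));
  Obj* expected = nullptr;
  // Exactly one thread wins the race to install the forwarding pointer.
  if (from->forward.compare_exchange_strong(expected, copy,
                                            std::memory_order_acq_rel)) {
    return copy;
  }
  delete copy;      // lost the race: discard our copy, use the winner's
  return expected;  // compare_exchange wrote the winner's pointer here
}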
CopyingPhase
The CopyingPhase mainly does the following:
- ScanCardsForSpace
- ScanImmuneSpaces
- VisitConcurrentRoots && VisitNonThreadRoots
- Process mark stacks and References
ScanCardsForSpace:
Scan the objects on dirty cards that live in unevac from-space or the non-moving space (these objects will not be reclaimed this cycle).
ScanImmuneSpaces:
Scan the objects on dirty cards that live in the immune spaces (these objects are never reclaimed).
VisitConcurrentRoots && VisitNonThreadRoots:
Scan the concurrent roots and the non-thread roots.
Process mark stacks and References:
Recursively drain the mark stacks, copying marked objects from from-space to to-space.
This step has to process a complex web of references concurrently with running mutators, so Concurrent Copying uses three modes during reference processing:
- thread-local mark stack mode
- shared mark stack mode
- GC-exclusive mark stack mode
The ART team's comment here explains why three modes are needed:
// We transition through three mark stack modes (thread-local, shared, GC-exclusive). The
// primary reasons are that we need to use a checkpoint to process thread-local mark
// stacks, but after we disable weak refs accesses, we can't use a checkpoint due to a deadlock
// issue because running threads potentially blocking at WaitHoldingLocks, and that once we
// reach the point where we process weak references, we can avoid using a lock when accessing
// the GC mark stack, which makes mark stack processing more efficient.
Processing the shared mark stack requires disabling weak-reference access to keep the reference graph consistent, while processing thread-local mark stacks requires a checkpoint so that each thread drains its own stack. But once weak-reference access is disabled, a checkpoint can no longer be used, because it could deadlock: consider a thread that blocks on a weak global access while running a checkpoint. It waits for the GC thread to re-enable weak-reference access, while the GC thread waits for it to finish the checkpoint.
So the core reason is that draining the thread-local mark stacks and draining the shared mark stack must be kept separate.
Concretely, a checkpoint first drains the thread-local mark stacks; then, after weak-reference access is disabled, the shared and GC-exclusive mark stacks are drained.
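The transition order can be pictured as a small state machine. Here is a sketch under simplified assumptions (the types and helpers are stand-ins, not ART's implementation):

#include <atomic>
#include <mutex>
#include <vector>

enum class MarkStackMode { kThreadLocal, kShared, kGcExclusive };

struct MarkStacks {
  std::atomic<MarkStackMode> mode{MarkStackMode::kThreadLocal};
  std::vector<void*> shared_stack;
  std::mutex shared_lock;

  // Called by mutators (from the read barrier) and by the GC thread.
  void Push(void* obj) {
    switch (mode.load(std::memory_order_acquire)) {
      case MarkStackMode::kThreadLocal: {
        // No contention: each thread owns its stack. Draining these
        // requires a checkpoint that revokes them from every thread.
        thread_local std::vector<void*> tl_stack;
        tl_stack.push_back(obj);
        break;
      }
      case MarkStackMode::kShared: {
        // Weak-ref access is disabled now, so no checkpoint is possible;
        // all threads funnel through one lock-protected stack instead.
        std::lock_guard<std::mutex> guard(shared_lock);
        shared_stack.push_back(obj);
        break;
      }
      case MarkStackMode::kGcExclusive:
        // Mutators no longer push at all, so the GC thread can use the
        // stack without taking any lock.
        shared_stack.push_back(obj);
        break;
    }
  }
};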
ReclaimPhase
ReclaimPhase clears from-space, frees the resulting free regions, and returns unused pages to the kernel.
It also swaps the bitmaps, exchanging mark_bitmap_ and live_bitmap_, so the next young GC can reuse the marks and reduce its tracing cost.
Finally it records statistics for this cycle, such as the RSS peak and how much memory was freed.
Performance
When evaluating a GC, the usual metrics are:
- Throughput: the share of total time spent in GC
- Latency: the length of GC pauses
- Capacity: the extra memory the GC needs
In a concurrent GC, latency tracks pause time directly. GCs on mobile devices usually care most about latency, because one long pause on the UI thread drops frames; that is why Concurrent Copying, with its extremely short pauses, became the default GC in ART.
These three metrics generally cannot all be satisfied at once: raising throughput means collecting less often, which lengthens each pause; lowering latency may mean more frequent GC and trading space for time (CC's from-space and to-space); lowering capacity may mean trading time for space (CMC's two copies).
Concurrent Copying GC's latency is excellent, but it pays for that with read-barrier overhead (throughput) and an RSS cliff (capacity):
- To keep object references correct for application threads while the GC copies, every object access is instrumented with a ReadBarrier, which costs extra memory and CPU (a sketch follows this list).
- Because CC first copies objects from from-space to to-space and only frees from-space in the final phase, the heap watermark spikes during a GC and then drops sharply, producing an RSS cliff. In extreme cases the RSS peak can be twice the size of the live set, which under heavy load can trigger spurious low-memory kills.
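To make the read-barrier cost concrete, here is a hedged sketch of the Baker-style check: every reference-field load is instrumented with a test of the holder's read-barrier state, and only gray holders take the slow path that marks/forwards the referent. The types and the stubbed slow path are illustrative, not ART's actual header encoding.

#include <atomic>
#include <cstdint>

enum class RbState : uint8_t { kNonGray, kGray };

// Hypothetical object layout with a read-barrier state in the header.
struct Obj {
  std::atomic<RbState> rb_state{RbState::kNonGray};
  std::atomic<Obj*> field{nullptr};
};

// Stub: the real slow path marks `ref` and returns its to-space copy.
Obj* MarkSlowPath(Obj* ref) { return ref; }

// Every instrumented reference load pays for this check, which is the
// throughput and code-size overhead discussed above.
Obj* LoadRef(Obj* holder) {
  Obj* ref = holder->field.load(std::memory_order_relaxed);
  if (holder->rb_state.load(std::memory_order_acquire) == RbState::kGray) {
    ref = MarkSlowPath(ref);  // fix up to the to-space reference
  }
  return ref;
}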
For these reasons, Google's ART team switched the default GC to the Concurrent Mark-Compact (CMC) algorithm in Android V, using the kernel's userfaultfd (uffd) mechanism to avoid both the read-barrier instrumentation and the RSS cliff. That said, some of CMC's performance features, such as generational GC, are still incomplete, and there are software-ecosystem issues as well, so CMC seems to have had a somewhat bumpy landing in China. The next article will dig into the implementation details of CMC.