出发点
我们为什么要分析 dialogue 和 trade (不仅仅这两个)服务的 GC 日志?
因为 dialogue 在用户对话链路上(请求来自用户)相当于服务调用中枢,一旦服务出现延迟/调用超时,本来可以回复用户的对话,因为超时被丢弃,可能会造成以下几种情况
- 对商家,造成一定的损失,回复率降低
- 对用户最直观的感受就是,为什么这种问题都没回复
dialogue 服务上实例对象的特点
- 对象较大(富字段的 proto 对象,上下文信息,集合等)
- 缓存了一定量(比如十条)用户上下文信息,所以长期未对话,缓存的旧的上下文信息也就没了意义,可作为垃圾对象进行回收
trade-build
通过消费第三方电商的订单消息来驱动
- 对象多,字段较少,生命周期短
垃圾回收
serial, parallel, CMS(串行,并行,并发标记清理)
堆内存结构
年轻代
分配和老化新对象的区域
老年代
经历过 young 区域 gc 则会计数一次,达到一定次数(默认 15 次)则会复制到该区域
永久代
存储 jvm 元数据,基础库的类信息和方法
cms 并发标记清理五阶段
这边直接贴上官方文档。需要注意的是,被标记的是可以继续存活对象,没标记的是需要释放内存的对象
The CMS collector performs the following phases on the old generation of the heap:
Phase | Description |
---|---|
(1) Initial Mark |
(Stop the World Event) | Objects in old generation are “marked” as reachable including those objects which may be reachable from young generation. Pause times are typically short in duration relative to minor collection pause times. | | (2) Concurrent Marking | Traverse the tenured generation object graph for reachable objects concurrently while Java application threads are executing. Starts scanning from marked objects and transitively marks all objects reachable from the roots. The mutators are executing during the concurrent phases 2, 3, and 5 and any objects allocated in the CMS generation during these phases (including promoted objects) are immediately marked as live. | | (3) Remark
(Stop the World Event) | Finds objects that were missed by the concurrent mark phase due to updates by Java application threads to objects after the concurrent collector had finished tracing that object. | | (4) Concurrent Sweep | Collects the objects identified as unreachable during marking phases. The collection of a dead object adds the space for the object to a free list for later allocation. Coalescing of dead objects may occur at this point. Note that live objects are not moved. | | (5) Resetting | Prepare for next concurrent collection by clearing data structures. |
Garbage First
堆内存分配
除了以上三个区域,还有一个巨型对象存储区。以上的一个个小方格内存分配在 1M-32M, 巨型对象大小超过小方格的 50% 则直接分配到巨型对象区
垃圾回收的多个阶段,这里直接贴官方文档
Phase | Description |
---|---|
(1) Initial Mark |
(Stop the World Event) | This is a stop the world event. With G1, it is piggybacked on a normal young GC. Mark survivor regions (root regions) which may have references to objects in old generation. | | (2) Root Region Scanning | Scan survivor regions for references into the old generation. This happens while the application continues to run. The phase must be completed before a young GC can occur. | | (3) Concurrent Marking | Find live objects over the entire heap. This happens while the application is running. This phase can be interrupted by young generation garbage collections. | | (4) Remark
(Stop the World Event) | Completes the marking of live object in the heap. Uses an algorithm called snapshot-at-the-beginning (SATB) which is much faster than what was used in the CMS collector. | | (5) Cleanup
(Stop the World Event and Concurrent) |
- Performs accounting on live objects and completely free regions. (Stop the world)
- Scrubs the Remembered Sets. (Stop the world)
- Reset the empty regions and return them to the free list. (Concurrent)
|
| (*) Copying
(Stop the World Event) | These are the stop the world pauses to evacuate or copy live objects to new unused regions. This can be done with young generation regions which are logged as [GC pause (young)]
. Or both young and old generation regions which are logged as [GC Pause (mixed)]
. |
线上情况
dialogue gc 日志分析图
可以看到总内存一共分配了近 6G, 年轻代与老年代各分配了 2.24G 和 3.53G. 应该是使用了默认的 NewRatio=2.
但是从使用率来看,年轻代不够用,所以在分配新的对象时,都会尝试年轻代的 gc. 而老年代有 800M 的空闲内存没有使用,这显然不是合理的内存分配
trade-build 也是年轻代不够用,创建新的对象时,经常出现内存分配失败,从而触发年轻代的 gc. 但该服务不在用户对话调用链路上,就算有 gc 延迟,只是影响数据落库时间
trade-build 优化思路
对象特点:占用内存下,数量多,生命周期短
所以大多数对象在年轻代的 eden 就可以进行回收了,可以尝试调整下 **NewRatio **的配置,将其调整为 NewRatio=1, 即年轻代和老年代各占用一半的堆内存
新对象分配失败,触发年轻代 gc 日志如下:
2021-03-18T10:13:44.256+0800: [GC (Allocation Failure) 2021-03-18T10:13:44.257+0800: [ParNew: 406482K->35023K(407424K), 0.0183108 secs] 817272K->450538K(1296256K), 0.0184282 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
2021-03-18T10:14:33.687+0800: [GC (Allocation Failure) 2021-03-18T10:14:33.687+0800: [ParNew: 405455K->36853K(407424K), 0.0212494 secs] 820970K->459797K(1296256K), 0.0213765 secs] [Times: user=0.06 sys=0.00, real=0.03 secs]
2021-03-18T10:15:41.909+0800: [GC (Allocation Failure) 2021-03-18T10:15:41.910+0800: [ParNew: 407285K->32213K(407424K), 0.0177182 secs] 830229K->459058K(1296256K), 0.0178312 secs] [Times: user=0.05 sys=0.01, real=0.01 secs]
调整之后
在启动脚本中增加 -XXNewRatio=1
之后,线程的 GC 日志如下图,可以明显的看出 GC 耗时峰值下降,感兴趣的话可以对比两张 GC 日志解析的截图。
GC 监控打点图
下图中红框标记的实例为金丝雀发布的一台,相对于其它两台,GC 耗时的峰值明显下降
剩余两台发布后,三台 GC 峰值全部下降