JVM GC: a few things worth knowing

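How should you choose JVM startup parameters (heap size, GC, and so on)? To answer that, step back from the question itself: configuration is only a means, not the problem being solved. The actual goals are to reduce GC time and to lower STW (stop-the-world) time.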

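Reducing GC time means more CPU is left for business logic. How can it be reduced? Only by allocating fewer objects. Allocation, however, is application behavior, and there is little the JVM level can do about it, so it is left out of consideration here (in application code, object pools or scalar replacement can cut allocation).
Lowering STW time is what current collectors concentrate on, through concurrent marking and collection. One misconception needs clearing up first: lowering STW time does not reduce the total, it spreads the STW pause out. Total GC time does not shrink, and with the overhead of concurrent/parallel work the total cost actually rises. What concurrency buys is that one pronounced STW spike is flattened into a line, which avoids visible jitter.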

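Take generational collection as the example (CMS or G1 in generational mode; Shenandoah and ZGC, which are non-generational, are a separate discussion). Objects are allocated in the young generation (large objects above a configured threshold go straight to the old generation) and are promoted to the old generation after surviving for a while. The generational hypothesis is that most objects die young, so the question becomes how to make sure most objects are reclaimed in the young generation and throwaway objects never reach the old generation. In the usual request/response model this turns into: how to make sure the objects created during one request/response cycle are not promoted, that is, the number of YGCs within one RT (response time) must not exceed the tenuring threshold.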

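For business cases where an object is known from the start to be long-lived, a trick is to pad it with bytes so that it passes for a large object and is placed directly in the old generation.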

First, how to control YGC frequency. Two quantities are involved: eden size and allocation rate.

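The allocation rate can be read from the GC log, and the eden size follows from it by working backwards. A misconception to clear up here: GC time looks like a function of memory size, but it is really a function of the number of live objects. If few objects are live during traversal, GC is quick. To lower YGC frequency you either enlarge memory or rein in the allocation rate. Enlarging memory affects the duration of each GC, and the allocation rate is hard to control from the business side, so a simple way out is to use G1 or ZGC, or to run multiple instances and spread the request volume and overhead across them.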
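The relationship between eden size, allocation rate, and YGC frequency can be sketched with a little arithmetic. This is a minimal sketch assuming a steady allocation rate; the function names and the example figures (200 MiB/s, 400 MiB eden, 0.5 s response time) are made up for illustration and are not measurements from the article.

```python
import math

MIB = 1024 * 1024

def young_gc_interval_s(eden_bytes, alloc_rate_bytes_per_s):
    """Time to fill eden at a steady allocation rate, i.e. the gap between YGCs."""
    return eden_bytes / alloc_rate_bytes_per_s

def worst_case_gcs_per_request(response_time_s, gc_interval_s):
    """Most YGCs that can fall inside one request of the given duration."""
    return math.floor(response_time_s / gc_interval_s) + 1

def eden_lower_bound_bytes(alloc_rate_bytes_per_s, response_time_s, tenuring_threshold):
    """Eden must be larger than this for the worst-case YGC count per request
    to stay strictly below the tenuring threshold (threshold must be >= 2)."""
    return alloc_rate_bytes_per_s * response_time_s / (tenuring_threshold - 1)

# Hypothetical inputs, for illustration only.
alloc_rate = 200 * MIB          # bytes per second, as read from a GC log
interval = young_gc_interval_s(400 * MIB, alloc_rate)
gcs = worst_case_gcs_per_request(0.5, interval)
bound = eden_lower_bound_bytes(alloc_rate, 0.5, tenuring_threshold=6)

print(f"YGC every {interval:.1f}s, at most {gcs} YGC per request, "
      f"eden must exceed {bound / MIB:.0f} MiB")
```

The bound comes from requiring floor(RT / interval) + 1 to be smaller than the threshold, which holds as long as the interval is longer than RT / (threshold - 1).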

-XX:NewRatio=ratio

Sets the ratio between young and old generation sizes. By default, this option is set to 2.

-XX:SurvivorRatio=ratio

Sets the ratio between eden space size and survivor space size. By default, this option is set to 8.
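As a rough illustration of how these two ratios carve up the heap: NewRatio is old/young and SurvivorRatio is eden/survivor, with two survivor spaces in the young generation. The sketch below ignores the alignment and adaptive sizing a real JVM applies, so actual sizes will differ slightly; the function name and the 3 GiB heap are assumptions for the example.

```python
GIB = 1024 ** 3

def generation_sizes(heap_bytes, new_ratio=2, survivor_ratio=8):
    """Approximate region sizes from -XX:NewRatio and -XX:SurvivorRatio.

    young = heap / (NewRatio + 1)
    survivor = young / (SurvivorRatio + 2)   # two survivor spaces
    eden = young - 2 * survivor
    """
    young = heap_bytes // (new_ratio + 1)
    survivor = young // (survivor_ratio + 2)
    return {
        "old": heap_bytes - young,
        "young": young,
        "eden": young - 2 * survivor,
        "survivor": survivor,
    }

# Hypothetical 3 GiB heap with the default ratios.
sizes = generation_sizes(3 * GIB)
for name, size in sizes.items():
    print(f"{name:9s}{size / (1024 * 1024):8.1f} MiB")
```

With the defaults, one third of the heap is young generation and eden takes eight tenths of that.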

Next, how to control the tenuring threshold.

The option is MaxTenuringThreshold. But is that enough? No! The tenuring threshold also changes dynamically, as this code shows:

uint ageTable::compute_tenuring_threshold(size_t survivor_capacity) {
  size_t desired_survivor_size = (size_t)((((double) survivor_capacity)*TargetSurvivorRatio)/100);
  size_t total = 0;
  uint age = 1;
  assert(sizes[0] == 0, "no objects with age zero should be recorded");
  while (age < table_size) {
    total += sizes[age];
    // check if including objects of age 'age' made us pass the desired
    // size, if so 'age' is the new threshold
    if (total > desired_survivor_size) break;
    age++;
  }
  uint result = age < MaxTenuringThreshold ? age : MaxTenuringThreshold;
  ······ //
  return result;
}

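That is: min( MaxTenuringThreshold, the first age at which the total size of objects of that age or younger exceeds survivor_capacity * TargetSurvivorRatio / 100 ). When major GC becomes frequent, check this value (with GC logging enabled, output like the following appears) to see whether it has dropped too low. If it is too low, objects enter the old generation after only a few GCs, which clearly contradicts "most objects die young" (not an absolute rule, but it fits most business systems). That indicates the survivor space is too small and should be enlarged. Here is a bad example: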

2019-06-13T11:07:15.053+0000: 66.470: [GC (Allocation Failure) 2019-06-13T11:07:15.054+0000: 66.470: [ParNew Desired survivor size 26836992 bytes, new threshold 1 (max 6)
  age 1: 27463464 bytes, 27463464 total
  age 2: 6852984 bytes, 34316448 total
  age 3: 6540928 bytes, 40857376 total : 469311K->52416K(471872K), 0.0739434 secs] 547987K->149997K(4141888K), 0.0742620 secs] [Times: user=0.25 sys=0.00, real=0.08 secs]
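To see why this log reports "new threshold 1", the loop in compute_tenuring_threshold can be replayed on the age table above. The sketch below is a Python transcription of that loop, not HotSpot code. The survivor capacity of 52416K is inferred from the log (the desired size 26836992 is exactly half of it, assuming TargetSurvivorRatio is 50), and the doubled survivor space in the second call is a hypothetical what-if.

```python
TABLE_SIZE = 16  # ages 1..15 are tracked

def compute_tenuring_threshold(age_sizes, survivor_capacity,
                               target_survivor_ratio=50,
                               max_tenuring_threshold=15):
    """Return (threshold, desired_survivor_size) for a per-age size table."""
    desired = survivor_capacity * target_survivor_ratio // 100
    total = 0
    age = 1
    while age < TABLE_SIZE:
        total += age_sizes.get(age, 0)
        # the first age whose cumulative size passes the target becomes the threshold
        if total > desired:
            break
        age += 1
    return min(age, max_tenuring_threshold), desired

# Per-age byte counts taken from the log line above.
ages = {1: 27463464, 2: 6852984, 3: 6540928}
survivor = 52416 * 1024  # inferred from the log, see the note above

observed = compute_tenuring_threshold(ages, survivor, max_tenuring_threshold=6)
what_if = compute_tenuring_threshold(ages, 2 * survivor, max_tenuring_threshold=6)

print("as logged:        threshold", observed[0], "desired", observed[1])
print("survivor doubled: threshold", what_if[0], "desired", what_if[1])
```

Age 1 alone already exceeds the desired survivor size, so the threshold collapses to 1. With twice the survivor space, all three ages fit and the threshold stays at the configured maximum of 6.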

The options involved:

-XX:MaxTenuringThreshold=threshold

Sets the maximum tenuring threshold for use in adaptive GC sizing. The largest value is 15. The default value is 15 for the parallel (throughput) collector, and 6 for the CMS collector.

-XX:TargetSurvivorRatio=50

Desired percentage of survivor space used after scavenge.

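Next, how to handle OGC (old-generation GC). Popular opinion now has CMS in its twilight years, but it was an epoch-making collector: concurrent mark-sweep opened an era of almost no STW. GC roots traversal and the remark phase still require STW, but the expensive marking runs concurrently with application threads thanks to barriers. So how do you avoid FGC (full GC)? FGC happens because collection cannot keep up with allocation. Here too, the promotion rate and the GC rate can be derived from the GC log and used to size the old generation, with special attention to cases where local cache entries expire all at once. Taking CMS as the example, the core parameters are: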

-XX:CMSInitiatingOccupancyFraction=percent

Sets the percentage of the old generation occupancy (0 to 100) at which to start a CMS collection cycle. The default value is set to -1. Any negative value (including the default) implies that -XX:CMSTriggerRatio is used to define the value of the initiating occupancy fraction.

-XX:+UseCMSInitiatingOccupancyOnly

Enables the use of the occupancy value as the only criterion for initiating the CMS collector. By default, this option is disabled and other criteria may be used.

-XX:+CMSScavengeBeforeRemark

Enables scavenging attempts before the CMS remark step. By default, this option is disabled.
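One way to reason about CMSInitiatingOccupancyFraction is to leave enough free old-generation space for the objects promoted while a concurrent cycle is running. The sketch below first derives the promoted volume from the ParNew log line quoted earlier (old-generation usage is heap usage minus young usage), then turns a promotion rate into an upper bound for the fraction. The YGC interval, CMS cycle duration, and safety factor are hypothetical values chosen for the example, and a single log line is only a sample, not a rate.

```python
def promoted_kb(young_before, young_after, heap_before, heap_after):
    """Growth of old-generation usage across one young GC, in KB."""
    old_before = heap_before - young_before
    old_after = heap_after - young_after
    return old_after - old_before

def max_initiating_occupancy(old_capacity_kb, promotion_rate_kb_per_s,
                             cms_cycle_s, safety_factor=2.0):
    """Largest occupancy percentage that still leaves room for the objects
    promoted during one concurrent cycle, padded by a safety factor."""
    headroom_kb = promotion_rate_kb_per_s * cms_cycle_s * safety_factor
    return max(0, int(100 * (1 - headroom_kb / old_capacity_kb)))

# From the log: 469311K->52416K(471872K) young, 547987K->149997K(4141888K) heap.
promoted = promoted_kb(469311, 52416, 547987, 149997)
old_capacity = 4141888 - 471872

# Hypothetical: one such YGC every 2 s, and a CMS cycle that takes 10 s.
promotion_rate = promoted / 2.0
fraction = max_initiating_occupancy(old_capacity, promotion_rate, cms_cycle_s=10)

print(f"promoted {promoted}K per YGC, old capacity {old_capacity}K, "
      f"start CMS at or below {fraction}% occupancy")
```

A bound this high mostly says that the old generation in the example is large relative to its promotion rate. A burst of promotions, such as a local cache expiring at once, would call for a much lower value.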

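With the above in mind, the steps can be summarized:

Run the application with the default configuration and analyze the GC log to get the average YGC interval;
Adjust SurvivorRatio and MaxTenuringThreshold so that the number of YGCs within one request is below the tenuring threshold;
Set CMSInitiatingOccupancyFraction based on the promotion rate and the OGC rate;
Mind the characteristics of the business scenario and avoid mass expiry.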
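The first step, getting the average YGC interval, can be approximated by pulling the JVM uptime stamps out of the young-GC lines of the log. This is a minimal sketch that assumes the JDK 8 style log format shown earlier. Only the first sample line comes from the article; the other two are invented so that there is something to average, and the regular expression is an assumption that may need adjusting for other log formats.

```python
import re

# First line is the prefix of the log entry quoted earlier; the other two are made up.
SAMPLE_LOG = """\
2019-06-13T11:07:15.053+0000: 66.470: [GC (Allocation Failure) ...
2019-06-13T11:07:17.053+0000: 68.470: [GC (Allocation Failure) ...
2019-06-13T11:07:19.553+0000: 70.970: [GC (Allocation Failure) ...
"""

# datestamp, then uptime in seconds, then the young-GC marker
YOUNG_GC = re.compile(r"^\S+: (\d+\.\d+): \[GC \(Allocation Failure\)")

def young_gc_uptimes(log_text):
    """Uptime (seconds since JVM start) of every young GC found in the log."""
    uptimes = []
    for line in log_text.splitlines():
        match = YOUNG_GC.match(line)
        if match:
            uptimes.append(float(match.group(1)))
    return uptimes

def mean_interval_s(uptimes):
    """Average gap between consecutive young GCs; needs at least two samples."""
    return (uptimes[-1] - uptimes[0]) / (len(uptimes) - 1)

uptimes = young_gc_uptimes(SAMPLE_LOG)
mean_gap = mean_interval_s(uptimes)
print(f"{len(uptimes)} young GCs, mean interval {mean_gap:.2f}s")
```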

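Both YGC and OGC have to traverse the GC roots. The difference is that a dedicated data structure, the card table, records which regions of the old generation hold references into the young generation, so the scan set of a YGC is much smaller than that of an OGC, which is why OGC takes longer than YGC. But why is there no card table for references from the young generation into the old one? Because of the purpose and cost of the two generations: the young generation is small and churns constantly, so tracking its outgoing references would cost more than it saves. For that reason, running a YGC before an OGC reduces the amount the OGC has to traverse.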
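The idea behind a card table can be shown with a toy model: the old generation is divided into fixed-size cards, a reference store into the old generation marks the covering card dirty, and a YGC then scans only the dirty cards instead of the whole old generation. This is an illustrative sketch and not HotSpot's implementation; the class and method names are invented, and the 512-byte card size is taken as HotSpot's usual default.

```python
CARD_SIZE = 512  # bytes covered by one card (assumed default)

class CardTable:
    """Toy card table: one byte per card, 1 = dirty."""

    def __init__(self, old_gen_bytes):
        self.cards = bytearray((old_gen_bytes + CARD_SIZE - 1) // CARD_SIZE)

    def record_reference_store(self, offset_in_old_gen):
        """Stand-in for the write barrier: dirty the card covering the field."""
        self.cards[offset_in_old_gen // CARD_SIZE] = 1

    def dirty_cards(self):
        """Card indexes a young GC would have to scan for old-to-young references."""
        return [i for i, dirty in enumerate(self.cards) if dirty]

    def clear(self):
        """Reset after the young GC has processed the dirty cards."""
        self.cards = bytearray(len(self.cards))

ct = CardTable(1024 * 1024)          # 1 MiB old generation -> 2048 cards
for offset in (1000, 1023, 70000):   # three reference stores, two in the same card
    ct.record_reference_store(offset)

dirty = ct.dirty_cards()
print(f"scan {len(dirty)} of {len(ct.cards)} cards: {dirty}")
```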

Side notes:

Why setting -Xms and -Xmx to the same value is recommended for server deployments:

-Xmssize

Sets the initial size (in bytes) of the heap. This value must be a multiple of 1024 and greater than 1 MB. Append the letter k or K to indicate kilobytes, m or M to indicate megabytes, g or G to indicate gigabytes.

-Xmxsize

Specifies the maximum size (in bytes) of the memory allocation pool in bytes. This value must be a multiple of 1024 and greater than 2 MB. Append the letter k or K to indicate kilobytes, m or M to indicate megabytes, g or G to indicate gigabytes. The default value is chosen at runtime based on system configuration. For server deployments, -Xms and -Xmx are often set to the same value

void CardGeneration::compute_new_size() {
  ······ //
  if (capacity_after_gc < minimum_desired_capacity) {
    // If we have less free space than we want then expand
    size_t expand_bytes = minimum_desired_capacity - capacity_after_gc;
    // Don't expand unless it's significant
    if (expand_bytes >= _min_heap_delta_bytes) {
      expand(expand_bytes, 0); // safe if expansion fails
    }
    if (PrintGC && Verbose) {
      gclog_or_tty->print_cr("    expanding:"
                    "  minimum_desired_capacity: %6.1fK"
                    "  expand_bytes: %6.1fK"
                    "  _min_heap_delta_bytes: %6.1fK",
                    minimum_desired_capacity / (double) K,
                    expand_bytes / (double) K,
                    _min_heap_delta_bytes / (double) K);
    }
    return;
  }
  ······ //
  // Don't shrink unless it's significant
  if (shrink_bytes >= _min_heap_delta_bytes) {
    shrink(shrink_bytes);
  }
}

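The reserved memory is grown or shrunk during GC according to how much memory was reclaimed (and GC means a performance cost). When Xms and Xmx differ, the JVM first runs a GC to reclaim memory, then expands the heap if memory is still short and shrinks it if there is surplus (at least that is what the analysis of CMS under default settings shows). The strategy was probably developed when early JVMs ran on small machines (memory was scarce back then).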
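The expand/shrink decision can be modelled in a few lines to show why equal -Xms and -Xmx remove the resizing. This is a simplified sketch of the logic in compute_new_size, not a transcription. It assumes the desired capacity bounds come from MinHeapFreeRatio and MaxHeapFreeRatio (defaults 40 and 70), clamps them between the initial and maximum size, and leaves out the _min_heap_delta_bytes significance check. The function and parameter names are invented, and the sizes are arbitrary units.

```python
def resize_decision(used_after_gc, capacity, initial_size, max_size,
                    min_free_pct=40, max_free_pct=70):
    """Return ('expand' | 'shrink' | 'keep', amount) after a collection."""
    def clamp(value):
        return max(initial_size, min(value, max_size))

    # capacity at which exactly min_free_pct / max_free_pct of the space is free
    minimum_desired = clamp(used_after_gc * 100 // (100 - min_free_pct))
    maximum_desired = clamp(used_after_gc * 100 // (100 - max_free_pct))

    if capacity < minimum_desired:
        return ("expand", minimum_desired - capacity)
    if capacity > maximum_desired:
        return ("shrink", capacity - maximum_desired)
    return ("keep", 0)

# Xms=512, Xmx=4096: the capacity follows usage up and down.
grow = resize_decision(600, 800, initial_size=512, max_size=4096)
shrink = resize_decision(300, 2000, initial_size=512, max_size=4096)
# Xms == Xmx == 2048: both bounds collapse to the fixed size, nothing to resize.
fixed = resize_decision(600, 2048, initial_size=2048, max_size=2048)

print(grow, shrink, fixed)
```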

tips: gceasy.io/ is a good tool for analyzing GC logs.