This article is part of the "Java Theme Month · Java Debug Notes" event; see the event page for details.
1 Background
I have found that with very limited resources you run into a lot of problems quite easily, which is rather interesting.
This time I set up a small Hadoop environment for testing, with very little in the way of resources:
1 machine, 2 cores and 4 GB of RAM.
2 GC Exception
- Java heap space
- Heap overflow, occurring during the map phase
- This usually happens when -Xmx is too small: neither the young generation nor the old generation can fit a newly created object
- I then enabled GC logging
- Eventually several Full GCs ran even though the heap was nearly empty, only about 2% used
- PS: only while writing this article did I notice that the entries are all Allocation Failure, so something repeatedly failed to fit into the heap
- Allocation Failure means the JVM failed to allocate memory for a new object
[GC (Allocation Failure) [PSYoungGen: 14848K->2038K(16896K)] 14848K->2606K(55296K), 0.0026520 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
[GC (Allocation Failure) [PSYoungGen: 16886K->2041K(31744K)] 17454K->4037K(70144K), 0.0034995 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC (Allocation Failure) [PSYoungGen: 31737K->2021K(31744K)] 33733K->7351K(70144K), 0.0052218 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 31717K->2034K(61440K)] 37047K->10238K(99840K), 0.0072265 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
[GC (Metadata GC Threshold) [PSYoungGen: 42160K->2021K(61440K)] 50365K->12918K(99840K), 0.0056580 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
[Full GC (Metadata GC Threshold) [PSYoungGen: 2021K->0K(61440K)] [ParOldGen: 10897K->8623K(35328K)] 12918K->8623K(96768K), [Metaspace: 20839K->20839K(1069056K)], 0.0277290 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
[GC (Allocation Failure) [PSYoungGen: 59392K->5374K(93184K)] 68015K->13997K(128512K), 0.0035573 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
[GC (Allocation Failure) [PSYoungGen: 49468K->4018K(124416K)] 58091K->12649K(474112K), 0.0037595 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC (Allocation Failure) [PSYoungGen: 4018K->4003K(135168K)] 12649K->12634K(484864K), 0.0027440 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
[Full GC (Allocation Failure) [PSYoungGen: 4003K->0K(135168K)] [ParOldGen: 8631K->8546K(39424K)] 12634K->8546K(174592K), [Metaspace: 28225K->28225K(1075200K)], 0.0297214 secs] [Times: user=0.04 sys=0.00, real=0.03 secs]
[GC (Allocation Failure) [PSYoungGen: 0K->0K(135168K)] 8546K->8546K(484864K), 0.0005536 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[Full GC (Allocation Failure) [PSYoungGen: 0K->0K(135168K)] [ParOldGen: 8546K->8057K(49152K)] 8546K->8057K(184320K), [Metaspace: 28225K->28214K(1075200K)], 0.0371439 secs] [Times: user=0.07 sys=0.00, real=0.04 secs]
Heap
PSYoungGen total 135168K, used 9416K [0x00000000f5580000, 0x0000000100000000, 0x0000000100000000)
eden space 128512K, 7% used [0x00000000f5580000,0x00000000f5eb21d8,0x00000000fd300000)
from space 6656K, 0% used [0x00000000fd300000,0x00000000fd300000,0x00000000fd980000)
to space 7168K, 0% used [0x00000000ff900000,0x00000000ff900000,0x0000000100000000)
ParOldGen total 349696K, used 8057K [0x00000000e0000000, 0x00000000f5580000, 0x00000000f5580000)
object space 349696K, 2% used [0x00000000e0000000,0x00000000e07de4f8,0x00000000f5580000)
Metaspace used 28262K, capacity 28534K, committed 28928K, reserved 1075200K
class space used 3154K, capacity 3259K, committed 3328K, reserved 1048576K
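Fragments like `[PSYoungGen: 14848K->2038K(16896K)]` in the log above follow a fixed `before->after(total)` pattern, so they can be extracted mechanically. Here is a minimal standalone sketch for doing that; the class name and regex are my own, not part of any Hadoop or JDK tooling:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal parser for a "[PSYoungGen: before->after(total)]" fragment
// as printed by the Parallel GC with -XX:+PrintGCDetails.
public class GcLineParser {
    private static final Pattern YOUNG =
        Pattern.compile("PSYoungGen: (\\d+)K->(\\d+)K\\((\\d+)K\\)");

    // Returns {beforeK, afterK, totalK}, or null if the line does not match.
    public static long[] parseYoung(String line) {
        Matcher m = YOUNG.matcher(line);
        if (!m.find()) return null;
        return new long[] {
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3))
        };
    }

    public static void main(String[] args) {
        String line = "[GC (Allocation Failure) [PSYoungGen: 14848K->2038K(16896K)] "
            + "14848K->2606K(55296K), 0.0026520 secs]";
        long[] y = parseYoung(line);
        System.out.println("young: " + y[0] + "K -> " + y[1] + "K of " + y[2] + "K");
    }
}
```

Applied to every line of a long log, this quickly shows whether each collection actually reclaims space or keeps failing at the same occupancy.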
3 Key Configuration
- etc/hadoop/yarn-site.xml
- The resources given here are fairly small
- During earlier debugging I had set the minimum allocation too high, and the job got stuck at map 0%
- The logs showed the ApplicationMaster was requesting 4 GB of memory, which could never be granted
<!-- Total memory available for containers on this NodeManager -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3172</value>
</property>
<!-- Minimum memory allocation per container; default: 1024 -->
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
</property>
<!-- Maximum memory allocation per container; default: 8192 -->
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>3172</value>
</property>
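The scheduler's usual behavior is to round each container memory request up to a multiple of `yarn.scheduler.minimum-allocation-mb` and cap it at `yarn.scheduler.maximum-allocation-mb`. Assuming that model, the effect of the values above can be sketched like this (a simplified illustration of my own, not Hadoop code):

```java
// Simplified model of YARN container memory normalization,
// using the values from yarn-site.xml above.
public class YarnMemoryNormalizer {
    static final int MIN_MB = 512;   // yarn.scheduler.minimum-allocation-mb
    static final int MAX_MB = 3172;  // yarn.scheduler.maximum-allocation-mb

    // Round the request up to a multiple of MIN_MB, capped at MAX_MB.
    public static int normalize(int requestMb) {
        int rounded = ((requestMb + MIN_MB - 1) / MIN_MB) * MIN_MB;
        return Math.min(rounded, MAX_MB);
    }

    public static void main(String[] args) {
        System.out.println(normalize(600));  // 1024: rounded up to 2 * 512
        System.out.println(normalize(4096)); // 3172: clamped to the maximum
    }
}
```

Under this model, an oversized minimum allocation makes even small requests consume large chunks of the node's 3172 MB budget, which is consistent with the ApplicationMaster's 4 GB request never being satisfiable on a 2C4G machine.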
- etc/hadoop/mapred-site.xml
- The configuration that caused the OOM
<!-- Maximum memory for a map task; default: 1536 (1.5 GB) -->
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>512</value>
</property>
<!-- JVM options for map tasks; default: -Xmx1024M -->
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx512M</value>
</property>
<!-- Maximum memory for a reduce task; default: 3072 (3 GB) -->
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>1024</value>
</property>
<!-- JVM options for reduce tasks; default: -Xmx2560M -->
<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx1024M</value>
</property>
<!-- I/O sort buffer size in MB; default: 100 -->
<property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>512</value>
</property>
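A quick sanity check tying these values together: the sort buffer is a single `byte[]` of `mapreduce.task.io.sort.mb` MB carved out of the map task's `-Xmx` heap, so it must leave the task some working room. A rough check, where the 75% threshold is my own heuristic and not a Hadoop rule:

```java
public class MapHeapCheck {
    // True if a sort buffer of ioSortMb MB leaves the map task's heap
    // (xmxMb MB) at least ~25% of working room -- my heuristic, not Hadoop's.
    public static boolean sortBufferFits(int xmxMb, int ioSortMb) {
        return ioSortMb <= xmxMb * 3 / 4;
    }

    public static void main(String[] args) {
        // The configuration above: -Xmx512M with a 512 MB sort buffer.
        System.out.println(sortBufferFits(512, 512));  // false: guaranteed OOM
        // After raising the map heap to 1024 MB the buffer at least fits.
        System.out.println(sortBufferFits(1024, 512)); // true
    }
}
```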
4 Locating the Problem
- I modified mapred-site.xml, raising the map task memory and heap to 1024
- Reading the GC log then revealed the problem (I am still not very experienced at GC troubleshooting)
- ParOldGen total 560128K, used 532981K
- The old generation went from 8693K used to 532981K used, a jump of exactly 524288K = 512 MB
- So something, somewhere, had allocated over 500 MB of memory in a single request ...
{Heap before GC invocations=7 (full 1):
PSYoungGen total 61440K, used 59392K [0x00000000eab00000, 0x00000000f2c00000, 0x0000000100000000)
eden space 59392K, 100% used [0x00000000eab00000,0x00000000ee500000,0x00000000ee500000)
from space 2048K, 0% used [0x00000000ee500000,0x00000000ee500000,0x00000000ee700000)
to space 6656K, 0% used [0x00000000f2580000,0x00000000f2580000,0x00000000f2c00000)
ParOldGen total 35328K, used 8685K [0x00000000c0000000, 0x00000000c2280000, 0x00000000eab00000)
object space 35328K, 24% used [0x00000000c0000000,0x00000000c087b598,0x00000000c2280000)
Metaspace used 24829K, capacity 25096K, committed 25472K, reserved 1071104K
class space used 2792K, capacity 2882K, committed 2944K, reserved 1048576K
1.426: [GC (Allocation Failure) [PSYoungGen: 59392K->5394K(95744K)] 68077K->14087K(131072K), 0.0046666 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
Heap after GC invocations=7 (full 1):
PSYoungGen total 95744K, used 5394K [0x00000000eab00000, 0x00000000f2b00000, 0x0000000100000000)
eden space 90112K, 0% used [0x00000000eab00000,0x00000000eab00000,0x00000000f0300000)
from space 5632K, 95% used [0x00000000f2580000,0x00000000f2ac48b0,0x00000000f2b00000)
to space 7168K, 0% used [0x00000000f1d00000,0x00000000f1d00000,0x00000000f2400000)
ParOldGen total 35328K, used 8693K [0x00000000c0000000, 0x00000000c2280000, 0x00000000eab00000)
object space 35328K, 24% used [0x00000000c0000000,0x00000000c087d598,0x00000000c2280000)
Metaspace used 24829K, capacity 25096K, committed 25472K, reserved 1071104K
class space used 2792K, capacity 2882K, committed 2944K, reserved 1048576K
}
Heap
PSYoungGen total 95744K, used 67268K [0x00000000eab00000, 0x00000000f2b00000, 0x0000000100000000)
eden space 90112K, 68% used [0x00000000eab00000,0x00000000ee76ca98,0x00000000f0300000)
from space 5632K, 95% used [0x00000000f2580000,0x00000000f2ac48b0,0x00000000f2b00000)
to space 7168K, 0% used [0x00000000f1d00000,0x00000000f1d00000,0x00000000f2400000)
ParOldGen total 560128K, used 532981K [0x00000000c0000000, 0x00000000e2300000, 0x00000000eab00000)
object space 560128K, 95% used [0x00000000c0000000,0x00000000e087d5a8,0x00000000e2300000)
Metaspace used 29087K, capacity 29334K, committed 29824K, reserved 1075200K
class space used 3248K, capacity 3333K, committed 3456K, reserved 1048576K
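The old-generation numbers in the snapshots above pin down the size of the mystery allocation exactly: before the jump ParOldGen used 8693K, afterwards 532981K. The difference works out to precisely 512 MB:

```java
public class OldGenDelta {
    public static void main(String[] args) {
        long beforeK = 8693;    // ParOldGen used before the jump (log above)
        long afterK  = 532981;  // ParOldGen used after the jump
        long deltaK  = afterK - beforeK;
        System.out.println(deltaK);                // 524288
        System.out.println(deltaK == 512L * 1024); // true: exactly 512 MB
    }
}
```

A delta that clean is a strong hint that some component allocated a single 512 MB buffer, which points straight at a configuration value of 512.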
5 The Problematic Configuration
- etc/hadoop/mapred-site.xml
- This 512 MB setting makes each map task allocate an array of that size on the map node, used later for sorting and merging
- Let's see how the source code does it
<!-- I/O sort buffer size in MB; default: 100 -->
<!-- During the map phase, sorting consumes this much heap; be careful -->
<property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>512</value>
</property>
6 Source Code Analysis
- MRJobConfig.IO_SORT_MB
- The configuration key
public interface MRJobConfig {
    String IO_SORT_MB = "mapreduce.task.io.sort.mb";
    public static final int DEFAULT_IO_SORT_MB = 100;
}
- MapTask
- This looks like the circular (ring) buffer part
// each metadata record takes 16 bytes; "meta" is record metadata
private static final int NMETA = 4;            // num meta ints
private static final int METASIZE = NMETA * 4; // size in bytes
// default is 100 MB
final int sortmb = job.getInt(MRJobConfig.IO_SORT_MB,
    MRJobConfig.DEFAULT_IO_SORT_MB);
// shift left by 20: 100 << 20 = 104857600, the same as 100*1024*1024, converting MB to bytes
int maxMemUsage = sortmb << 20;
// round down to a multiple of METASIZE (16 bytes) so the buffer holds whole metadata records;
// since sortmb << 20 is already a multiple of 16, IDEA warns that
// "Result of 'maxMemUsage % METASIZE' is always '0'" -- the line is purely defensive
maxMemUsage -= maxMemUsage % METASIZE;
// here it is: this array is the cause of the OOM
kvbuffer = new byte[maxMemUsage];
// and this is where "spill at 80% of the ring buffer" comes from
final float spillper = job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8);
softLimit = (int)(kvbuffer.length * spillper);
- Task log
mapreduce.task.io.sort.mb: 512
soft limit at 429496736
bufstart = 0; bufvoid = 536870912
kvstart = 134217724; length = 33554432
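The buffer math can be replayed standalone and checked against the task log values above; the class name is mine, but the constants and formulas mirror the MapTask source shown in this section:

```java
public class SortBufferMath {
    static final int NMETA = 4;
    static final int METASIZE = NMETA * 4; // 16 bytes per metadata record

    // Size of kvbuffer in bytes for a given mapreduce.task.io.sort.mb value.
    public static int bufferBytes(int sortmb) {
        int maxMemUsage = sortmb << 20;              // MB -> bytes
        return maxMemUsage - maxMemUsage % METASIZE; // already a multiple of 16
    }

    // Spill threshold at the default spill percent of 0.8,
    // computed in float exactly as MapTask does.
    public static int softLimit(int sortmb) {
        return (int) (bufferBytes(sortmb) * (float) 0.8);
    }

    public static void main(String[] args) {
        System.out.println(bufferBytes(512)); // 536870912 -- matches "bufvoid" in the log
        System.out.println(softLimit(512));   // 429496736 -- matches "soft limit" in the log
    }
}
```

With the map heap at -Xmx512M, a single 536870912-byte array can never fit, so the task dies with "Java heap space" before doing any work; either lower `mapreduce.task.io.sort.mb` or raise the map task heap.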