Problem Description
The same SQL query runs many times slower against COS than against HDFS.
Problem Analysis
1. Test SQL statement:
SELECT count(1) FROM test WHERE date='20180720';
2. Compare the two jobs:
(1) mapper count when querying HDFS
(2) mapper count when querying COS
The comparison shows that the COS job gets far fewer mappers (the split computation behind this can be reproduced directly, as sketched below).
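The mapper count comes straight from the number of input splits that CombineFileInputFormat produces, so the difference can be reproduced without running the query. Below is a minimal sketch, assuming the mapreduce-API CombineTextInputFormat (Hive's CombineHiveInputFormat delegates to the same split logic analyzed below) and a COS connector on the classpath; the input path argument is a placeholder for the real table location.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CountSplits {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    // Compare e.g. hdfs://.../test/date=20180720 vs cosn://.../test/date=20180720
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // The same 256 MB cap (maxSize) that the analysis below walks through.
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    List<InputSplit> splits = new CombineTextInputFormat().getSplits(job);
    System.out.println("splits (= mappers): " + splits.size());
  }
}

Running this once against the HDFS path and once against the COS path makes the mapper gap visible without launching a Hive job.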
Source Code Analysis
How CombineFileInputFormat creates splits
void createSplits(Map<String, Set<OneBlockInfo>> nodeToBlocks,
                  Map<OneBlockInfo, String[]> blockToNodes,
                  Map<String, List<OneBlockInfo>> rackToBlocks,
                  long totLength,
                  long maxSize,
                  long minSizeNode,
                  long minSizeRack,
                  List<InputSplit> splits) {
  ArrayList<OneBlockInfo> validBlocks = new ArrayList<OneBlockInfo>();
  long curSplitSize = 0;
  int totalNodes = nodeToBlocks.size();
  long totalLength = totLength;
  Multiset<String> splitsPerNode = HashMultiset.create();
  Set<String> completedNodes = new HashSet<String>();

  while (true) {
    // 1. -------------------- First pass: walk nodeToBlocks node by node --------------------
    // Note: if the data sits on HDFS, the nodes are the datanodes holding the
    // file's blocks, so there are many of them and nodeToBlocks maps dn -> blks.
    // If the data sits on COS, the only node is localhost(!), so nodeToBlocks maps
    // localhost -> ALL blks. Because this pass walks nodeToBlocks node by node,
    // on COS every blk is piled onto localhost and never scattered, so a few large
    // splits tend to come out; on HDFS the blks are scattered across many nodes,
    // so many small splits tend to come out. Hence the striking difference in
    // split counts.
    for (Iterator<Map.Entry<String, Set<OneBlockInfo>>> iter = nodeToBlocks
        .entrySet().iterator(); iter.hasNext();) {
      // one node
      Map.Entry<String, Set<OneBlockInfo>> one = iter.next();
      String node = one.getKey();
      // Skip the node if it has previously been marked as completed.
      if (completedNodes.contains(node)) {
        continue;
      }
      // all blks on this node
      Set<OneBlockInfo> blocksInCurrentNode = one.getValue();
      // (1) process all blks on this node; each finished batch is removed
      // for each block, copy it into validBlocks. Delete it from
      // blockToNodes so that the same block does not appear in two different splits.
      Iterator<OneBlockInfo> oneBlockIter = blocksInCurrentNode.iterator();
      while (oneBlockIter.hasNext()) {
        OneBlockInfo oneblock = oneBlockIter.next();
        // Remove all blocks which may already have been assigned to other splits.
        // Note: nodes are handled one after another via nodeToBlocks, and a blk can
        // be replicated on several nodes, so the same blk shows up repeatedly. A blk
        // is removed from blockToNodes once handled, so if blockToNodes no longer
        // contains oneblock it has already been processed; skip it.
        if (!blockToNodes.containsKey(oneblock)) {
          oneBlockIter.remove();
          continue;
        }
        // collect it into validBlocks
        validBlocks.add(oneblock);
        // once a blk is handled, remove it from blockToNodes!
        blockToNodes.remove(oneblock);
        curSplitSize += oneblock.length;
        // once the accumulated size reaches maxSize (256 MB by default), cut a split
        // if the accumulated split size exceeds the maximum, then create this split.
        if (maxSize != 0 && curSplitSize >= maxSize) {
          // create an input split and add it to the splits array
          addCreatedSplit(splits, Collections.singleton(node), validBlocks);
          totalLength -= curSplitSize;
          curSplitSize = 0;
          splitsPerNode.add(node);
          // Remove entries from blocksInNode so that we don't walk these
          // again.
          blocksInCurrentNode.removeAll(validBlocks);
          validBlocks.clear();
          // Done creating a single split for this node. Move on to the next
          // node so that splits are distributed across nodes.
          break;
        }
      }
      // (2) the node's blks total less than maxSize, or some blks are left over after splitting
      if (validBlocks.size() != 0) {
        // This implies that the last few blocks (or all in case maxSize=0)
        // were not part of a split. The node is complete.
        // if there were any blocks left over and their combined size is
        // larger than minSplitNode, then combine them into one split.
        // Otherwise add them back to the unprocessed pool. It is likely
        // that they will be combined with other blocks from the
        // same rack later on.
        // This condition also kicks in when max split size is not set. All
        // blocks on a node will be grouped together into a single split.
        if (minSizeNode != 0 && curSplitSize >= minSizeNode && splitsPerNode.count(node) == 0) {
          // haven't created any split on this machine. so its ok to add a
          // smaller one for parallelism. Otherwise group it in the rack for
          // balanced size create an input split and add it to the splits array
          // If this node has no split yet and curSplitSize is at least minSizeNode,
          // these blks become one split. Note: this split is smaller than 256 MB!
          addCreatedSplit(splits, Collections.singleton(node), validBlocks);
          totalLength -= curSplitSize;
          // record that this node now owns a split
          splitsPerNode.add(node);
          // remove the blks that went into the split from blocksInCurrentNode
          // Remove entries from blocksInNode so that we don't walk this again.
          blocksInCurrentNode.removeAll(validBlocks);
          // The node is done. This was the last set of blocks for this node.
        } else {
          // Put the unplaced blocks back into the pool for later rack-allocation.
          for (OneBlockInfo oneblock : validBlocks) {
            blockToNodes.put(oneblock, oneblock.hosts);
          }
        }
        validBlocks.clear();
        curSplitSize = 0;
        // mark this node as done
        completedNodes.add(node);
      } else { // No in-flight blocks.
        if (blocksInCurrentNode.size() == 0) {
          // Node is done. All blocks were fit into node-local splits.
          completedNodes.add(node);
        } // else Run through the node again.
      }
    }
    // all nodes are done, or all blks have been assigned
    // Note: on COS there is a single localhost node, so completedNodes.size() ==
    // totalNodes is true while totalLength == 0 is false, i.e. some blks are
    // still unassigned.
    // Check if node-local assignments are complete.
    if (completedNodes.size() == totalNodes || totalLength == 0) {
      // All nodes have been walked over and marked as completed or all blocks
      // have been assigned. The rest should be handled via rack-local assignment.
      LOG.info("DEBUG: Terminated node allocation with : CompletedNodes: "
          + completedNodes.size() + ", size left: " + totalLength);
      break;
    }
  }
  // 2. -------------------- Second pass: blks missed above are handled here, now walking rackToBlocks --------------------
  // With no rack topology configured there is a single entry, e.g.
  // /default-rack -> all blks (an ArrayList of 512 blocks in one debugging session).
  // if blocks in a rack are below the specified minimum size, then keep them
  // in 'overflow'. After the processing of all racks is complete, these
  // overflow blocks will be combined into splits.
  ArrayList<OneBlockInfo> overflowBlocks = new ArrayList<OneBlockInfo>();
  Set<String> racks = new HashSet<String>();
  // This pass consumes every unprocessed blk left in blockToNodes (each rack yields
  // at most one split per round; whatever cannot form one ends up in overflowBlocks).
  // Process all racks over and over again until there is no more work to do.
  while (blockToNodes.size() > 0) {
    // Create one split for this rack before moving over to the next rack.
    // Come back to this rack after creating a single split for each of the
    // remaining racks.
    // Process one rack location at a time, Combine all possible blocks that
    // reside on this rack as one split. (constrained by minimum and maximum
    // split size). iterate over all racks
    // walk the blks rack by rack
    for (Iterator<Map.Entry<String, List<OneBlockInfo>>> iter =
        rackToBlocks.entrySet().iterator(); iter.hasNext();) {
      Map.Entry<String, List<OneBlockInfo>> one = iter.next();
      // record the rack
      racks.add(one.getKey());
      // all blks on this rack
      List<OneBlockInfo> blocks = one.getValue();
      // for each block, copy it into validBlocks. Delete it from
      // blockToNodes so that the same block does not appear in
      // two different splits.
      boolean createdSplit = false;
      for (OneBlockInfo oneblock : blocks) {
        // blockToNodes holds the blks that have not been assigned yet!
        if (blockToNodes.containsKey(oneblock)) {
          validBlocks.add(oneblock);
          // once assigned, remove it from blockToNodes
          blockToNodes.remove(oneblock);
          curSplitSize += oneblock.length;
          // if the unassigned blks add up to 256 MB, carve out a split
          // if the accumulated split size exceeds the maximum, then create this split.
          if (maxSize != 0 && curSplitSize >= maxSize) {
            // create an input split and add it to the splits array
            addCreatedSplit(splits, getHosts(racks), validBlocks);
            createdSplit = true;
            // at most one split is created here per rack per round!
            break;
          }
        }
      }
      // if this rack produced a split, continue with the next rack
      // if we created a split, then just go to the next rack
      if (createdSplit) {
        curSplitSize = 0;
        validBlocks.clear();
        racks.clear();
        continue;
      }
      // no split of >= 256 MB could be formed; create a smaller one from these blks
      if (!validBlocks.isEmpty()) {
        if (minSizeRack != 0 && curSplitSize >= minSizeRack) {
          // if there is a minimum size specified, then create a single split
          // otherwise, store these blocks into overflow data structure
          addCreatedSplit(splits, getHosts(racks), validBlocks);
        } else {
          // if these blks are together too small to meet even the rack minimum
          // minSizeRack, stash them in overflowBlocks and combine them into splits
          // at the very end, so that no tiny splits get created
          // There were a few blocks in this rack that remained to be processed.
          // Keep them in 'overflow' block list.
          // These will be combined later.
          overflowBlocks.addAll(validBlocks);
        }
      }
      curSplitSize = 0;
      validBlocks.clear();
      racks.clear();
    }
  }
  // 3. -------------------- After pass 2, only overflowBlocks may still hold unassigned blks; finish them off here --------------------
  assert blockToNodes.isEmpty();
  assert curSplitSize == 0;
  assert validBlocks.isEmpty();
  assert racks.isEmpty();
  // handle the remaining, still-unassigned blks
  // Process all overflow blocks
  for (OneBlockInfo oneblock : overflowBlocks) {
    validBlocks.add(oneblock);
    curSplitSize += oneblock.length;
    // record the racks of oneblock
    // This might cause an existing rack location to be re-added, but it should be ok.
    for (int i = 0; i < oneblock.racks.length; i++) {
      racks.add(oneblock.racks[i]);
    }
    // once 256 MB has been accumulated, create a split
    // if the accumulated split size exceeds the maximum, then create this split.
    if (maxSize != 0 && curSplitSize >= maxSize) {
      // create an input split and add it to the splits array
      addCreatedSplit(splits, getHosts(racks), validBlocks);
      curSplitSize = 0;
      validBlocks.clear();
      racks.clear();
    }
  }
  // 4. -------------------- Finally, whatever survived passes 1-3 is lumped into one split smaller than 256 MB --------------------
  // Process any remaining blocks, if any.
  if (!validBlocks.isEmpty()) {
    addCreatedSplit(splits, getHosts(racks), validBlocks);
  }
}
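To make the difference concrete, here is a toy model (not Hadoop code) of the node-local pass above. Block sizes, node names, and counts are made up; the model assumes each block lives on exactly one node and ignores replication and the rack/overflow passes, so real numbers will differ, but it shows how the block-to-node layout alone drives the split count.

import java.util.*;

public class SplitCountSketch {

  // Simplified: one split per maxSize accumulated on a node, plus one
  // sub-maxSize split per node whose leftovers reach minSizeNode while the
  // node has no split yet (other leftovers would fall through to the rack
  // pass, which this model ignores).
  static int countSplits(Map<String, List<Long>> nodeToBlocks,
                         long maxSize, long minSizeNode) {
    int splits = 0;
    for (List<Long> blocks : nodeToBlocks.values()) {
      long cur = 0;
      boolean nodeHasSplit = false;
      for (long len : blocks) {
        cur += len;
        if (cur >= maxSize) { splits++; cur = 0; nodeHasSplit = true; }
      }
      if (cur >= minSizeNode && !nodeHasSplit) { splits++; }
    }
    return splits;
  }

  public static void main(String[] args) {
    long MB = 1024L * 1024;
    long maxSize = 256 * MB;   // the 256 MB default discussed above
    long minSizeNode = 1;      // effectively "any leftover forms a split"

    // 2 GB of data as 512 blocks of 4 MB each (hypothetical numbers).
    List<Long> all = new ArrayList<>(Collections.nCopies(512, 4 * MB));

    // COS: the connector reports no locality, so everything sits on "localhost".
    Map<String, List<Long>> cos = new HashMap<>();
    cos.put("localhost", all);

    // HDFS: the same blocks scattered across 16 datanodes, 128 MB per node.
    Map<String, List<Long>> hdfs = new HashMap<>();
    for (int i = 0; i < 512; i++) {
      hdfs.computeIfAbsent("dn" + (i % 16), k -> new ArrayList<>()).add(4 * MB);
    }

    System.out.println("COS splits:  " + countSplits(cos, maxSize, minSizeNode));  // 8
    System.out.println("HDFS splits: " + countSplits(hdfs, maxSize, minSizeNode)); // 16
  }
}

Under these assumptions COS gets 8 large node-local splits while HDFS gets 16 smaller ones; with replication and the rack pass included, the real gap is typically larger.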
Summary:
1. If the data is on HDFS, the nodes are the datanodes holding the file's blocks, so there are many nodes and nodeToBlocks maps datanode -> blks, which effectively scatters the blocks.
2. If the data is on COS, the only node is localhost, so nodeToBlocks maps localhost -> all blks, i.e. every block is piled onto a single node. Since createSplits walks nodeToBlocks node by node and nothing is scattered, it tends to produce a few large splits.
3. On HDFS, the blocks are scattered across many nodes, so many smaller splits tend to be produced, hence the striking difference in split counts. To build splits that are as large as possible, the method works in four passes, always preferring "big" splits (with the default maxSize = 256 MB it tries to create splits >= 256 MB) and only lumping the stragglers together into whatever size remains. This is why lowering mapred.max.split.size yields more splits, as the re-run of the sketch below shows.
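As a quick check on point 3: appending the lines below to the main method of the toy model above (64 MB is a hypothetical choice) quadruples the COS split count, which is exactly the "more splits" effect the summary describes.

// 2048 MB of data / 64 MB per split = 32 splits, i.e. 32 mappers instead of 8.
System.out.println(countSplits(cos, 64 * MB, minSizeNode));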
Solution
Lower the mapred.max.split.size value, so that the node-local pass over the single localhost "node" cuts more, smaller splits. More splits mean more mappers and restore parallelism when reading from COS.
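A minimal sketch of where the knob lives, assuming a programmatic Hadoop Configuration; the 64 MB value is only an example and should be sized against the data volume. In a Hive session the equivalent is set mapred.max.split.size=67108864;.

import org.apache.hadoop.conf.Configuration;

public class TuneSplitSize {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Legacy key read on the Hive side (256 MB by default, as noted above).
    conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);
    // Equivalent key for the mapreduce-API CombineFileInputFormat.
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);
    System.out.println(conf.get("mapred.max.split.size"));
  }
}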