## Core Algorithms

### Calculating the number of reducers

• Parameters

mapreduce.job.reduces, default -1 (i.e. not set)
hive.exec.reducers.max, default 1009
hive.exec.reducers.bytes.per.reducer, default 256 MB since Hive 0.14 (1 GB in earlier releases)

• Algorithm

Only one reducer is used, regardless of the settings above, in the following cases:

a) A global aggregation without GROUP BY, for example:

```
-- Anti-pattern: the whole aggregation runs in a single reducer
select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
-- Better: group by the partition column so the work is spread across reducers
select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
```

b) ORDER BY is used
c) A Cartesian product is present

Otherwise the reducer count is determined as follows (see the code below):

1. If the compiled execution plan forces a reducer count, that value is used.
2. Else, if mapreduce.job.reduces is set, that value is used.
3. If neither is set, the reducer count is estimated from the input data size.
• Code

```
if (numReducersFromWork >= 0) { // 1. the compiled execution plan forces a reducer count
  console.printInfo("Number of reduce tasks determined at compile time: "
      + numReducersFromWork);
} else if (job.getNumReduceTasks() > 0) { // 2. mapreduce.job.reduces is set in the job conf
  int reducers = job.getNumReduceTasks();
  console.printInfo("Number of reduce tasks not specified. Defaulting to jobconf value of: "
      + reducers);
} else {
  if (inputSummary == null) {
    inputSummary = Utilities.getInputSummary(driverContext.getCtx(), work.getMapWork(), null);
  }
  // 3. estimate the number of reducers from the input data size
  int reducers = Utilities.estimateNumberOfReducers(conf, inputSummary, work.getMapWork(),
      work.isFinalMapRed());
  console.printInfo("Number of reduce tasks not specified. Estimated from input data size: "
      + reducers);
}
```

The estimate itself is done in Utilities.estimateNumberOfReducers():

```
// bytesPerReducer is hive.exec.reducers.bytes.per.reducer, maxReducers is hive.exec.reducers.max
double bytes = Math.max(totalInputFileSize, bytesPerReducer);
int reducers = (int) Math.ceil(bytes / bytesPerReducer);
reducers = Math.max(1, reducers);
reducers = Math.min(maxReducers, reducers);
```
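Plugging in concrete numbers makes the formula easier to read. The sketch below is a standalone re-statement of the arithmetic above, not Hive code; the input size and parameter values are assumptions for illustration (256 MB per reducer, a cap of 1009 reducers).

```
// Standalone illustration of the estimate; the sizes and limits below are assumed values.
public class ReducerEstimateDemo {
  static int estimate(long totalInputFileSize, long bytesPerReducer, int maxReducers) {
    double bytes = Math.max(totalInputFileSize, bytesPerReducer);
    int reducers = (int) Math.ceil(bytes / bytesPerReducer);
    reducers = Math.max(1, reducers);
    return Math.min(maxReducers, reducers);
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    // 10 GB of input at 256 MB per reducer, capped at 1009 -> prints 40
    System.out.println(estimate(10L * 1024 * mb, 256 * mb, 1009));
  }
}
```

In practice this means lowering hive.exec.reducers.bytes.per.reducer is the usual way to get more reducers, while hive.exec.reducers.max only caps very large inputs.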

### Calculating the number of mappers

#### CombineHiveInputFormat

1. Parameters

2. Source code

```
// Process the normal splits
if (nonCombinablePaths.size() > 0) { // 1. paths whose inputFileFormatClass is not combinable are split by the parent HiveInputFormat
  FileInputFormat.setInputPaths(job, nonCombinablePaths.toArray(
      new Path[nonCombinablePaths.size()]));
  InputSplit[] splits = super.getSplits(job, numSplits);
  for (InputSplit split : splits) {
    // collect the split (body omitted in this excerpt)
  }
}
// Process the combine splits
if (combinablePaths.size() > 0) { // 2.1 combinable paths go through getCombineSplits
  FileInputFormat.setInputPaths(job, combinablePaths.toArray(
      new Path[combinablePaths.size()]));
  Map<String, PartitionDesc> pathToPartitionInfo = this.pathToPartitionInfo != null ?
      this.pathToPartitionInfo : Utilities.getMapWork(job).getPathToPartitionInfo();
  // 2.2 getCombineSplits ultimately delegates to CombineFileInputFormat.getSplits(job, 1)
  InputSplit[] splits = getCombineSplits(job, numSplits, pathToPartitionInfo);
  for (InputSplit split : splits) {
    // collect the split (body omitted in this excerpt)
  }
}
```
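The effect on the mapper count is easiest to see with a toy example. The sketch below is a deliberate simplification, not the CombineFileInputFormat algorithm (which also honors block locations and per-node/per-rack minimum sizes): it simply packs files into a split until the maximum split size would be exceeded.

```
import java.util.ArrayList;
import java.util.List;

// Illustrative only: pack file sizes into splits of at most maxSplitSize bytes.
public class CombineSplitSketch {
  static List<List<Long>> combine(List<Long> fileSizes, long maxSplitSize) {
    List<List<Long>> splits = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0;
    for (long size : fileSizes) {
      if (!current.isEmpty() && currentSize + size > maxSplitSize) {
        splits.add(current);          // close the current split and start a new one
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(size);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      splits.add(current);
    }
    return splits;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    List<Long> files = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      files.add(40 * mb);             // ten 40 MB files
    }
    // With a 256 MB max split size the ten files collapse into 2 splits (2 mappers)
    // instead of the 10 mappers a one-file-per-split input format would start.
    System.out.println(combine(files, 256 * mb).size());
  }
}
```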

#### OrcInputFormat

hive.exec.orc.split.strategy, default HYBRID, controls the strategy used to generate splits when reading ORC tables.

• The BI strategy creates splits at file granularity (one split per file);
• The ETL strategy splits files further, grouping multiple stripes into one split;
• The HYBRID strategy uses ETL when the average file size is greater than Hadoop's maximum split size (default 256 * 1024 * 1024), and BI otherwise.
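A condensed restatement of that HYBRID rule, as a hypothetical sketch following the description above rather than the OrcInputFormat source (whose heuristic also weighs the number of files against the available parallelism):

```
// Illustrative only: strategy choice driven purely by average file size.
enum OrcSplitStrategyKind { BI, ETL }

final class HybridChoiceSketch {
  static OrcSplitStrategyKind choose(long totalSize, int fileCount, long maxSplitSize) {
    long avgFileSize = fileCount == 0 ? 0 : totalSize / fileCount;
    // few large files -> worth cutting by stripe (ETL); many small files -> one split per file (BI)
    return avgFileSize > maxSplitSize ? OrcSplitStrategyKind.ETL : OrcSplitStrategyKind.BI;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    System.out.println(choose(10L * 1024 * mb, 4, 256 * mb));  // ETL: average 2.5 GB per file
    System.out.println(choose(1024 * mb, 1000, 256 * mb));     // BI: average ~1 MB per file
  }
}
```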

### When the merge-file stage is enabled

1. Parameters

hive.merge.mapfiles, default true
hive.merge.mapredfiles, default false
hive.merge.size.per.task, default 256 MB
hive.merge.smallfiles.avgsize, default 16 MB
hive.merge.supports.splittable.combineinputformat, default true
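As a rough illustration of how these parameters interact (a hypothetical sketch of the rule they describe; the actual check appears in the source excerpt below): a merge job is added when the average size of the files a job produced falls below hive.merge.smallfiles.avgsize, and that merge job then targets files of roughly hive.merge.size.per.task.

```
// Illustrative only: the merge trigger is "average output file size is small".
public class MergeDecisionSketch {
  static boolean shouldMerge(long totalOutputSize, int numFiles, long smallfilesAvgSize) {
    long avg = numFiles == 0 ? 0 : totalOutputSize / numFiles;
    return numFiles > 0 && avg < smallfilesAvgSize;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    // 200 output files of 2 MB each: average 2 MB < 16 MB, so a merge job would be
    // added, rewriting them into files of roughly hive.merge.size.per.task (256 MB).
    System.out.println(shouldMerge(200 * 2 * mb, 200, 16 * mb));
  }
}
```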

2. Source code

```
if (dpCtx != null && dpCtx.getNumDPCols() > 0) { // dynamic partition case
  int numDPCols = dpCtx.getNumDPCols();
  int dpLbLevel = numDPCols + lbLevel;
  // the per-partition merge decision is delegated to the dynamic-partition handling
  // (the call is truncated in this excerpt)
      mrAndMvTask, dirPath, inpFs, ctx, work, dpLbLevel);
} else { // no dynamic partitions
  if (lbLevel == 0) {
    // static partition without list bucketing:
    // getMergeSize() returns the total size when the average file size under dirPath
    // is below hive.merge.smallfiles.avgsize, otherwise a negative value
    long totalSz = getMergeSize(inpFs, dirPath, avgConditionSize);
    if (totalSz >= 0) { // add the merge job
      setupMapRedWork(conf, work, trgtSize, totalSz);