Click into appMaster.start() inside the initAndStartAppMaster method of MRAppMaster.java:
public void start() {
  if (isInState(STATE.STARTED)) {
    return;
  }
  //enter the started state
  synchronized (stateChangeLock) {
    if (stateModel.enterState(STATE.STARTED) != STATE.STARTED) {
      try {
        startTime = System.currentTimeMillis();
        // calls serviceStart(), which MRAppMaster overrides
        serviceStart();
        if (isInState(STATE.STARTED)) {
          //if the service started (and isn't now in a later state), notify
          LOG.debug("Service {} is started", getName());
          notifyListeners();
        }
      } catch (Exception e) {
        noteFailure(e);
        ServiceOperations.stopQuietly(LOG, this);
        throw ServiceStateException.convert(e);
      }
    }
  }
}
protected void serviceStart() throws Exception {
  ... ...
  if (initFailed) {
    JobEvent initFailedEvent = new JobEvent(job.getID(), JobEventType.JOB_INIT_FAILED);
    jobEventDispatcher.handle(initFailedEvent);
  } else {
    // All components have started, start the job.
    // Initialization succeeded: hand the job off to the event dispatcher
    startJobs();
  }
}
protected void startJobs() {
  /** create a job-start event to get this ball rolling */
  JobEvent startJobEvent = new JobStartEvent(job.getID(),
      recoveredJobStartTime);
  /** send the job-start event. this triggers the job execution. */
  // the job-start event is placed onto the dispatcher's event queue here
  // dispatcher = AsyncDispatcher
  // getEventHandler() returns a GenericEventHandler
  dispatcher.getEventHandler().handle(startJobEvent);
}
Ctrl + Alt + B to find the implementations of handle(); choose GenericEventHandler.java (an inner class of AsyncDispatcher):
class GenericEventHandler implements EventHandler<Event> {
  public void handle(Event event) {
    ... ...
    try {
      // put the event onto the dispatcher's event queue
      eventQueue.put(event);
    } catch (InterruptedException e) {
      ... ...
    }
  };
}
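GenericEventHandler.handle() only enqueues the event; inside AsyncDispatcher a single event-handling thread drains eventQueue and routes each event to the handler registered for its type, so the JOB_START event eventually reaches the job's event handler. The following is a minimal, self-contained sketch of that producer/consumer pattern; class and member names (MiniAsyncDispatcher, handlers, and so on) are illustrative and not the actual Hadoop source:

import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the AsyncDispatcher idea: handle() only enqueues the event,
// while one dispatcher thread drains the queue and routes each event to the
// handler registered for its type.
class MiniAsyncDispatcher {
  interface Event { }
  interface Handler { void handle(Event event); }

  private final BlockingQueue<Event> eventQueue = new LinkedBlockingQueue<>();
  private final Map<Class<? extends Event>, Handler> handlers = new ConcurrentHashMap<>();
  private volatile boolean stopped = false;

  void register(Class<? extends Event> eventType, Handler handler) {
    handlers.put(eventType, handler);
  }

  // the GenericEventHandler role: producers just put the event on the queue
  void handle(Event event) throws InterruptedException {
    eventQueue.put(event);
  }

  // the event-handling thread's role: take events off the queue and dispatch
  void start() {
    Thread t = new Thread(() -> {
      while (!stopped) {
        try {
          Event event = eventQueue.take();
          Handler h = handlers.get(event.getClass());
          if (h != null) {
            h.handle(event);
          }
        } catch (InterruptedException ie) {
          return;   // shut down the dispatcher thread
        }
      }
    }, "mini-dispatcher");
    t.setDaemon(true);
    t.start();
  }

  void stop() { stopped = true; }
}

In the real AsyncDispatcher the routing key is the event's type enum (event.getType()) rather than its class, but the enqueue-then-dispatch structure is the same.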
5.5 Scheduler Task Execution (YarnChild)
1)Starting the MapTask: Ctrl + N to find YarnChild, then search for its main method.
public static void main(String[] args) throws Throwable {
  Thread.setDefaultUncaughtExceptionHandler(new YarnUncaughtExceptionHandler());
  LOG.debug("Child starting");
  ... ...
  task = myTask.getTask();
  YarnChild.taskid = task.getTaskID();
  ... ...
  // Create a final reference to the task for the doAs block
  final Task taskFinal = task;
  childUGI.doAs(new PrivilegedExceptionAction<Object>() {
    @Override
    public Object run() throws Exception {
      // use job-specified working directory
      setEncryptedSpillKeyIfRequired(taskFinal);
      FileSystem.get(job).setWorkingDirectory(job.getWorkingDirectory());
      // execute the task (a MapTask or a ReduceTask)
      taskFinal.run(job, umbilical); // run the task
      return null;
    }
  });
  ... ...
}
Ctrl + Alt + B to find the implementations of run(); choose MapTask.java:
public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, ClassNotFoundException, InterruptedException {
  this.umbilical = umbilical;
  // check whether this is a MapTask
  if (isMapTask()) {
    // If there are no reducers then there won't be any sort. Hence the map
    // phase will govern the entire attempt's progress.
    // with zero ReduceTasks, the map phase accounts for 100% of the attempt's progress
    if (conf.getNumReduceTasks() == 0) {
      mapPhase = getProgress().addPhase("map", 1.0f);
    } else {
      // If there are reducers then the entire attempt's progress will be
      // split between the map phase (67%) and the sort phase (33%).
      // with at least one ReduceTask, the map phase takes 66.7% and the sort phase 33.3%
      mapPhase = getProgress().addPhase("map", 0.667f);
      sortPhase = getProgress().addPhase("sort", 0.333f);
    }
  }
  ... ...
  if (useNewApi) {
    // run the MapTask through the new API
    runNewMapper(job, splitMetaInfo, umbilical, reporter);
  } else {
    runOldMapper(job, splitMetaInfo, umbilical, reporter);
  }
  done(umbilical, reporter);
}
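As a side note, the 0.667 / 0.333 weights above mean the task's overall progress is a weighted sum of the per-phase progress. A minimal sketch of just that arithmetic (plain Java, not Hadoop's org.apache.hadoop.util.Progress class):

// Sketch of weighted phase progress, mirroring the 0.667 / 0.333 split above.
public class PhaseProgressSketch {
  public static float overallProgress(float mapProgress, float sortProgress,
                                      boolean hasReducers) {
    if (!hasReducers) {
      return mapProgress;                         // map phase governs 100%
    }
    return 0.667f * mapProgress + 0.333f * sortProgress;
  }

  public static void main(String[] args) {
    System.out.println(overallProgress(0.5f, 0.0f, true));   // map half done -> ~0.33
    System.out.println(overallProgress(1.0f, 1.0f, true));   // both phases done -> 1.0
  }
}

With reducers present, a half-finished map phase therefore reports roughly 33% overall progress. run() then delegates to runNewMapper(), shown next: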
void runNewMapper(final JobConf job,
                  final TaskSplitIndex splitIndex,
                  final TaskUmbilicalProtocol umbilical,
                  TaskReporter reporter
                  ) throws IOException, ClassNotFoundException,
                           InterruptedException {
  ... ...
  try {
    input.initialize(split, mapperContext);
    // run the user's Mapper
    mapper.run(mapperContext);
    mapPhase.complete();
    setPhase(TaskStatus.Phase.SORT);
    statusUpdate(umbilical);
    input.close();
    input = null;
    output.close(mapperContext);
    output = null;
  } finally {
    closeQuietly(input);
    closeQuietly(output, mapperContext);
  }
}
Mapper.java (this is where the user's map() comes in):
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
  } finally {
    cleanup(context);
  }
}
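Mapper.run() is a template method: setup() once, then map() once per key/value pair delivered by the RecordReader, then cleanup(). For reference, here is a minimal user Mapper that this loop would drive; it is a standard WordCount-style example, not part of the Hadoop source being traced:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal WordCount-style Mapper: map() is invoked by Mapper.run()
// once for every (offset, line) pair read from the split.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final Text word = new Text();
  private final IntWritable one = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, one);   // emit (word, 1)
      }
    }
  }
}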
2)Starting the ReduceTask: from the main method of YarnChild.java, Ctrl + Alt + B to find the implementations of run(); choose ReduceTask.java:
public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, InterruptedException, ClassNotFoundException {
  job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
  ... ...
  if (useNewApi) {
    // run the reduce through the new API
    runNewReducer(job, umbilical, reporter, rIter, comparator,
        keyClass, valueClass);
  } else {
    runOldReducer(job, umbilical, reporter, rIter, comparator,
        keyClass, valueClass);
  }
  shuffleConsumerPlugin.close();
  done(umbilical, reporter);
}
void runNewReducer(JobConf job,
                   final TaskUmbilicalProtocol umbilical,
                   final TaskReporter reporter,
                   RawKeyValueIterator rIter,
                   RawComparator<INKEY> comparator,
                   Class<INKEY> keyClass,
                   Class<INVALUE> valueClass
                   ) throws IOException, InterruptedException,
                            ClassNotFoundException {
  ... ...
  try {
    // call the user's Reducer via its run() method
    reducer.run(reducerContext);
  } finally {
    trackedRW.close(reducerContext);
  }
}
Reducer.java:
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    while (context.nextKey()) {
      reduce(context.getCurrentKey(), context.getValues(), context);
      // If a back up store is used, reset it
      Iterator<VALUEIN> iter = context.getValues().iterator();
      if (iter instanceof ReduceContext.ValueIterator) {
        ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();
      }
    }
  } finally {
    cleanup(context);
  }
}
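Reducer.run() follows the same template: reduce() is called once per distinct key with all of that key's grouped values. A minimal user Reducer that this loop would drive (again a WordCount-style illustration, not part of the traced source):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal WordCount-style Reducer: reduce() is invoked by Reducer.run()
// once for each distinct key, with all of that key's values.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    total.set(sum);
    context.write(key, total);   // emit (word, count)
  }
}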
Chapter 6 MapReduce Source Code Analysis
6.1 Job Submission Flow and Split Source Code Walkthrough
1)Job submission flow source code
waitForCompletion()
  submit();
    // 1. establish the connection
    connect();
      // 1) create the proxy used to submit the Job
      new Cluster(getConfiguration());
        // (1) decide whether this is a local or a YARN cluster environment
        initialize(jobTrackAddr, conf);
    // 2. submit the job
    submitter.submitJobInternal(Job.this, cluster)
      // 1) create the staging path used to submit data to the cluster
      Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
      // 2) get the jobid and create the job path
      JobID jobId = submitClient.getNewJobID();
      // 3) copy the jar to the cluster
      copyAndConfigureFiles(job, submitJobDir);
        rUploader.uploadFiles(job, jobSubmitDir);
      // 4) compute the splits and generate the split plan files
      writeSplits(job, submitJobDir);
        maps = writeNewSplits(job, jobSubmitDir);
          input.getSplits(job);
      // 5) write the XML configuration file to the staging path
      writeConf(conf, submitJobFile);
        conf.writeXml(out);
      // 6) submit the job and return its submission status
      status = submitClient.submitJob(jobId, submitJobDir.toString(), job.getCredentials());
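For orientation, the chain above is kicked off by an ordinary driver program. Below is a minimal driver sketch; the class name WordCountDriver and the argument paths are illustrative, and the Mapper/Reducer are the WordCount-style sketches shown earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver: waitForCompletion(true) is the entry point of the submission
// chain traced above (submit() -> connect() -> submitJobInternal()).
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "local" runs with LocalJobRunner, "yarn" with YarnRunner (see the diagram below)
    // conf.set("mapreduce.framework.name", "yarn");
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}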
Job submission flow source code analysis (diagram). Summary of the original figure: the driver runs Configuration conf = new Configuration(); Job job = Job.getInstance(conf); ... job.waitForCompletion(true), which calls Job.submit(). Depending on whether the MR program runs locally (simulated) or on YARN, the Cluster member inside JobSubmitter holds a proxy that is either LocalJobRunner or YarnRunner. The staging directory stagingDir is file://..../.staging locally and hdfs://..../.staging on YARN, with a jobid subdirectory created under it (file://..../.staging/jobid or hdfs://..../.staging/jobid). FileInputFormat.getSplits() is called to compute the split plan, which is serialized to job.split; the job parameters are written to job.xml; and when running with YarnRunner the job's jar (xxx.jar) must also be fetched and uploaded, ending up as hdfs://..../.staging/jobid/job.jar.
2)FileInputFormat split source code analysis (input.getSplits(job))
(1) The program first locates the directory where the input data is stored.
(2) It then iterates over each file under that directory and plans the splits per file.
(3) Take the first file, ss.txt:
a) Get the file size: fs.sizeOf(ss.txt).
b) Compute the split size: computeSplitSize(Math.max(minSize, Math.min(maxSize, blocksize))) = blocksize = 128M.
c) So by default, split size = blocksize.
d) Start cutting splits: split 1 is ss.txt 0:128M, split 2 is ss.txt 128:256M, split 3 is ss.txt 256M:300M. (Before each cut, check whether the remaining part is larger than 1.1 times the block size; if it is not, the remainder goes into a single split. See the sketch after this list.)
e) Write the split information into a split plan file.
f) The core of the split process happens in the getSplits() method.
g) An InputSplit records only split metadata, such as the start offset, the length, and the list of hosts where the data resides.
(4) The split plan file is submitted to YARN, where MrAppMaster uses it to decide how many MapTasks to launch.
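The following is a minimal, self-contained sketch of steps b) and d) above: the split-size formula and the 1.1x slop rule used by FileInputFormat.getSplits(). The class SplitPlanSketch and its helpers are illustrative; the real implementation works on FileStatus and block locations, but the arithmetic is the same:

import java.util.ArrayList;
import java.util.List;

// Sketch of FileInputFormat's split planning for a single file:
// splitSize = max(minSize, min(maxSize, blockSize)), and keep cutting
// as long as the remaining bytes exceed 1.1 * splitSize.
public class SplitPlanSketch {
  private static final double SPLIT_SLOP = 1.1;

  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  static List<long[]> planSplits(long fileLength, long blockSize, long minSize, long maxSize) {
    long splitSize = computeSplitSize(blockSize, minSize, maxSize);
    List<long[]> splits = new ArrayList<>();   // each entry: {start, length}
    long bytesRemaining = fileLength;
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      splits.add(new long[] { fileLength - bytesRemaining, splitSize });
      bytesRemaining -= splitSize;
    }
    if (bytesRemaining != 0) {
      splits.add(new long[] { fileLength - bytesRemaining, bytesRemaining });
    }
    return splits;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    // a 300M file with a 128M block size -> splits 0:128M, 128:256M, 256:300M
    for (long[] s : planSplits(300 * mb, 128 * mb, 1, Long.MAX_VALUE)) {
      System.out.println("start=" + s[0] / mb + "M length=" + s[1] / mb + "M");
    }
  }
}

Running main() prints the three splits for the 300M example (0:128M, 128:256M, 256:300M); a 129M file, by contrast, stays in a single split because 129 / 128 ≈ 1.008 ≤ 1.1.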