mapreduce-4. AppMaster


Core classes


org.apache.hadoop.mapreduce.v2.app.MRAppMaster

The Map-Reduce Application Master.

The state machine is encapsulated in the implementation of the Job interface.

All state changes happen via the Job interface. Each event results in a finite state transition in Job.

MR AppMaster is a composition of loosely coupled services. The services interact with each other via events. The components resemble the Actor model: each component acts on the events it receives and sends events out to other components.

This keeps it highly concurrent with no or minimal synchronization needs.

The events are dispatched by a central Dispatch mechanism. All components register to the Dispatcher.

The information is shared across different components using AppContext.

MRAppMaster.serviceList

  • org.apache.hadoop.yarn.event.AsyncDispatcher

Dispatches events in a separate thread. Currently only a single thread does this. Potentially there could be multiple channels, one per event type class, with a thread pool used to dispatch the events.

  • org.apache.hadoop.mapreduce.v2.app.TaskAttemptFinishingMonitor

This class generates TA_TIMED_OUT if the task attempt stays in FINISHING state for too long.

  • org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler

The EventProcessor thread handles CommitterEvents (JOB_SETUP, JOB_COMMIT, JOB_ABORT, TASK_ABORT).

  • org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

    OutputCommitter describes the commit of task output for a Map-Reduce job. The Map-Reduce framework relies on the OutputCommitter of the job to:

    1. Setup the job during initialization. For example, create the temporary output directory for the job during the initialization of the job.
    2. Cleanup the job after the job completion. For example, remove the temporary output directory after the job completion.
    3. Setup the task temporary output.
    4. Check whether a task needs a commit. This is to avoid the commit procedure if a task does not need commit.
    5. Commit of the task output.
    6. Discard the task commit.
  • org.apache.hadoop.mapred.TaskAttemptListenerImpl

This class listens for changes to the state of a Task. It implements the protocol that a task child process uses to contact its parent process. The parent is a daemon which polls the central master for a new map or reduce task and runs it as a child process. All communication between child and parent is via this protocol.

  • org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator

Speculator component. Task Attempts' status updates are sent to this component. Concrete implementation runs the speculative algorithm and sends the TaskEventType.T_ADD_ATTEMPT. An implementation also has to arrange for the jobs to be scanned from time to time, to launch the speculations.

  • org.apache.hadoop.mapreduce.v2.app.MRAppMaster.StagingDirCleaningService

  • org.apache.hadoop.mapreduce.v2.app.MRAppMaster.ContainerAllocatorRouter

proxy for org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator

  • org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator Allocates the container from the ResourceManager scheduler.

  • org.apache.hadoop.mapreduce.v2.app.MRAppMaster.ContainerLauncherRouter

proxy for org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl

  • org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl This class is responsible for launching of containers.

  • org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler

The job history events get routed to this class. This class writes the Job history events to the DFS directly into a staging dir and then moved to a done-dir. JobHistory implementation is in this package to access package private classes.

org.apache.hadoop.yarn.event.AsyncDispatcher


  • AsyncDispatcher$GenericEventHandler

    The generic EventHandler; it enqueues each event for the dispatcher thread.

AsyncDispatcher.eventDispatchers

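| Event type | Handler | Processing logic |
| --- | --- | --- |
| EventType | JobHistoryEventHandler | Writes event logs to HDFS |
| TaskAttemptEventType | MRAppMaster$TaskAttemptEventDispatcher -> TaskAttemptImpl | Triggers a state machine transition and sends new events to the dispatcher |
| TaskEventType | MRAppMaster$TaskEventDispatcher -> TaskImpl | Triggers a state machine transition and sends new events to the dispatcher |
| CommitterEventType | CommitterEventHandler -> FileOutputCommitter | Runs the job's setup and commit operations |
| Speculator$EventType | MRAppMaster$SpeculatorEventDispatcher -> DefaultSpeculator | Evaluates speculative execution |
| JobEventType | MRAppMaster$JobEventDispatcher -> JobImpl | Triggers a state machine transition and sends new events to the dispatcher |
| ContainerLauncher$EventType | MRAppMaster$ContainerLauncherRouter -> ContainerLauncherImpl | Sends startContainers requests to the NM |
| ContainerAllocator$EventType | MRAppMaster$ContainerAllocatorRouter -> RMContainerAllocator | Handles container allocation requests; the RMCommunicator thread services the request queue in heartbeat |
| JobFinishEvent$EventType | JobFinishEventHandler | Stops the job (shutDownJob) |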

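org.apache.hadoop.mapreduce.lib.output.OutputCommitter

Output Committers

Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either succeed completely or fail cleanly. This behaviour is implemented by the job's OutputCommitter, which in the new API is determined by the OutputFormat's getOutputCommitter() method. The default is FileOutputCommitter, which is appropriate for file-based MapReduce. You can customize an existing OutputCommitter, or write a new implementation if you need special setup or cleanup for jobs or tasks.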

public abstract class OutputCommitter {
    public abstract void setupJob(JobContext jobContext) throws IOException;
    public void commitJob(JobContext jobContext) throws IOException { }
    public void abortJob(JobContext jobContext, JobStatus.State state) throws IOException { }
    public abstract void setupTask(TaskAttemptContext taskContext) throws IOException;
    public abstract boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException;
    public abstract void commitTask(TaskAttemptContext taskContext) throws IOException;
    public abstract void abortTask(TaskAttemptContext taskContext) throws IOException;
}
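  • setupJob() is called before the job runs and is typically used to perform initialization. For FileOutputCommitter, it creates the final output directory (mapreduce.output.fileoutputformat.outputdir) and a temporary working space for task output, _temporary, as a subdirectory of it

  • If the job succeeds, commitJob() is called. The default file-based implementation deletes the temporary working space and creates a hidden marker file, _SUCCESS, in the output directory to tell filesystem clients that the job completed successfully.

  • If the job did not succeed, abortJob() is called with a state object indicating whether the job failed or was killed. The default implementation deletes the job's temporary working space

The operations are similar at the task level. setupTask() is called before the task runs; the default implementation does nothing, because the temporary directories named for task outputs are created when the task output is written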
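The commit protocol described above can be illustrated with a small local-filesystem sketch. ToyFileCommitter is a hypothetical class written for this article, not Hadoop's FileOutputCommitter: it only assumes the layout described in the text (a _temporary working space under the output directory and a _SUCCESS marker), and the attempt id and file name are made up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy committer (hypothetical, not Hadoop's FileOutputCommitter). */
final class ToyFileCommitter {
    private final Path outputDir;
    private final Path tempDir;

    ToyFileCommitter(Path outputDir) {
        this.outputDir = outputDir;
        this.tempDir = outputDir.resolve("_temporary");
    }

    /** Creates the output directory and the _temporary working space. */
    void setupJob() throws IOException {
        Files.createDirectories(tempDir);
    }

    /** Returns a path under _temporary/<attemptId>/ for the task to write to. */
    Path taskWorkFile(String attemptId, String fileName) throws IOException {
        Path dir = Files.createDirectories(tempDir.resolve(attemptId));
        return dir.resolve(fileName);
    }

    /** A task only needs a commit if it actually produced a work directory. */
    boolean needsTaskCommit(String attemptId) {
        return Files.isDirectory(tempDir.resolve(attemptId));
    }

    /** Promotes the attempt's files into the final output directory. */
    void commitTask(String attemptId) throws IOException {
        Path dir = tempDir.resolve(attemptId);
        for (Path file : listChildren(dir)) {
            Files.move(file, outputDir.resolve(file.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        deleteTree(dir);
    }

    /** Removes the working space and drops the _SUCCESS marker. */
    void commitJob() throws IOException {
        deleteTree(tempDir);
        Files.createFile(outputDir.resolve("_SUCCESS"));
    }

    private static List<Path> listChildren(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.sorted().collect(Collectors.toList());
        }
    }

    private static void deleteTree(Path root) throws IOException {
        if (!Files.exists(root)) return;
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order visits children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) Files.delete(p);
    }

    /** Runs one successful task and returns the names left in the output directory. */
    static List<String> demo() {
        try {
            Path out = Files.createTempDirectory("toy-commit").resolve("out");
            ToyFileCommitter c = new ToyFileCommitter(out);
            c.setupJob();
            Path part = c.taskWorkFile("attempt_0", "part-00000");
            Files.write(part, "hello".getBytes(StandardCharsets.UTF_8));
            if (c.needsTaskCommit("attempt_0")) c.commitTask("attempt_0");
            c.commitJob();
            List<String> names = new ArrayList<>();
            for (Path p : listChildren(out)) names.add(p.getFileName().toString());
            Collections.sort(names);
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the two-phase layout is that readers of the output directory never see partial task output: files only appear there after commitTask, and _SUCCESS only after commitJob.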

RPC communication classes

  • Client-facing service: client -> MRAppMaster
MRClientService:  
	MRClientProtocol->MRClientService$MRClientProtocolHandler
	AMWebApp: jetty
	service to handle requests from JobClient
  • Task-facing service: Task -> MRAppMaster
TaskAttemptListenerImpl:  
	TaskUmbilicalProtocol->TaskAttemptListenerImpl
    Protocol that the task child process uses to contact the MRAppMaster. The MRAppMaster is a daemon which polls the RM for new map or reduce tasks and runs them as child processes. All communication between the task and the MRAppMaster is via this protocol.

For example, when a Task calls done(TaskAttemptID taskAttemptID), MRAppMaster updates the task's heartbeat information in taskHeartbeatHandler and sends a TA_DONE event to the Dispatcher

  • Registers the ApplicationMaster with the RM and requests container resources
RMContainerAllocator extends RMContainerRequestor extends RMCommunicator: 
	ApplicationMasterProtocol -> ApplicationMasterProtocolPBClientImpl    (ClientRMProxy.createRMProxy(conf, ApplicationMasterProtocol.class))
	registerApplicationMaster(request)
	finishApplicationMaster(request)
	allocate(allocateRequest)
  • Sends container launch requests to the NM
ContainerLauncherImpl:
	ContainerManagementProtocol -> ContainerManagementProtocolProxy
	startContainers(requestList)

Core algorithms

State machine and event-driven mechanism


AsyncDispatcher event routing

  1. JobEventType -> MRAppMaster$JobEventDispatcher: hands job events to JobImpl
  2. TaskEventType -> MRAppMaster$TaskEventDispatcher: hands task events to TaskImpl
  3. TaskAttemptEventType -> MRAppMaster$TaskAttemptEventDispatcher: hands task attempt events to TaskAttemptImpl
  4. ContainerAllocator$EventType -> MRAppMaster$ContainerAllocatorRouter: container allocation events, handled by RMContainerAllocator
  5. ContainerLauncher$EventType -> MRAppMaster$ContainerLauncherRouter: container launch events, handled by ContainerLauncherImpl
  6. CommitterEventType -> CommitterEventHandler: the committer handles job setup and commit
  7. Speculator$EventType -> MRAppMaster$SpeculatorEventDispatcher: speculative execution events
  8. JobFinishEvent -> JobFinishEventHandler: job completion handling
  9. org.apache.hadoop.mapreduce.jobhistory.EventType -> JobHistoryEventHandler: records event logs
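The routing above (one handler registered per event-type enum class, one thread draining a queue) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not YARN's AsyncDispatcher: MiniDispatcher, Event and the demo handlers are hypothetical names, and only the enum constants are borrowed from the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class MiniDispatcherDemo {
    enum JobEventType { JOB_INIT, JOB_START }
    enum TaskEventType { T_SCHEDULE }

    /** Minimal event: the enum constant's declaring class selects the handler. */
    static final class Event {
        final Enum<?> type;
        final String target;
        Event(Enum<?> type, String target) {
            this.type = type;
            this.target = target;
        }
    }

    /** Hypothetical single-threaded dispatcher, loosely modelled on the description above. */
    static final class MiniDispatcher {
        private final Map<Class<?>, Consumer<Event>> handlers = new ConcurrentHashMap<>();
        private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();

        void register(Class<? extends Enum<?>> eventTypeClass, Consumer<Event> handler) {
            handlers.put(eventTypeClass, handler);
        }

        /** Producers only enqueue; they never run handler code themselves. */
        void post(Event event) {
            queue.add(event);
        }

        void start() {
            Thread worker = new Thread(this::dispatchLoop, "mini-dispatcher");
            worker.setDaemon(true);
            worker.start();
        }

        private void dispatchLoop() {
            try {
                while (true) {
                    Event event = queue.take();
                    Consumer<Event> handler = handlers.get(event.type.getDeclaringClass());
                    if (handler != null) handler.accept(event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Posts three events and returns them in the order the dispatcher thread handled them. */
    static List<String> runDemo() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch done = new CountDownLatch(3);
        MiniDispatcher dispatcher = new MiniDispatcher();
        Consumer<Event> recorder = e -> {
            log.add(e.type + "(" + e.target + ")");
            done.countDown();
        };
        dispatcher.register(JobEventType.class, recorder);
        dispatcher.register(TaskEventType.class, recorder);
        dispatcher.start();
        dispatcher.post(new Event(JobEventType.JOB_INIT, "job_1"));
        dispatcher.post(new Event(JobEventType.JOB_START, "job_1"));
        dispatcher.post(new Event(TaskEventType.T_SCHEDULE, "task_1"));
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

Because a single thread drains the queue, handlers see events in posting order and need little synchronization among themselves, which is the property the AppMaster design relies on.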

Key state transitions

  1. Once MRAppMaster has finished initializing and starting, it sends a JOB_INIT event and then a JOB_START event
// create a job event for job initialization
JobEvent initJobEvent = new JobEvent(job.getID(), JobEventType.JOB_INIT);
// Send init to the job (this does NOT trigger job execution)
// This is a synchronous call, not an event through dispatcher. We want
// job-init to be done completely here.
jobEventDispatcher.handle(initJobEvent);
...
startJobs();//new JobStartEvent

1.1 The JOB_INIT event is handled by jobEventDispatcher and triggers the JobImpl$InitTransition transition, which performs job initialization

setup(job);
...
TaskSplitMetaInfo[] taskSplitMetaInfo = createSplits(job, job.jobId);
job.numMapTasks = taskSplitMetaInfo.length;
job.numReduceTasks = job.conf.getInt(MRJobConfig.NUM_REDUCES, 0);

// create the Tasks but don't start them yet
createMapTasks(job, inputLength, taskSplitMetaInfo);
createReduceTasks(job);
return JobStateInternal.INITED;

1.2 The JOB_START event is dispatched to jobEventDispatcher and triggers the JobImpl$StartTransition transition, which sends a CommitterJobSetupEvent (CommitterEventType.JOB_SETUP)

job.eventHandler.handle(new CommitterJobSetupEvent(
      job.jobId, job.jobContext));
  2. The CommitterEventType.JOB_SETUP event is handled by CommitterEventHandler: the OutputCommitter runs setupJob, and a JobSetupCompletedEvent (JobEventType.JOB_SETUP_COMPLETED) is sent
committer.setupJob(event.getJobContext());
context.getEventHandler().handle(
    new JobSetupCompletedEvent(event.getJobID()));
  3. The JOB_SETUP_COMPLETED event is dispatched to jobEventDispatcher and triggers the JobImpl$SetupCompletedTransition transition, which sends TaskEvent (TaskEventType.T_SCHEDULE) events
job.scheduleTasks(job.mapTasks, job.numReduceTasks == 0);//eventHandler.handle(new TaskEvent(taskID, TaskEventType.T_SCHEDULE));
job.scheduleTasks(job.reduceTasks, true);
  4. The T_SCHEDULE event is dispatched to TaskEventDispatcher and triggers the TaskImpl$InitialScheduleTransition transition, which sends a TaskAttemptEvent (TaskAttemptEventType.TA_SCHEDULE)
task.addAndScheduleAttempt(Avataar.VIRGIN);//eventHandler.handle(new TaskAttemptEvent(attempt.getID(), TaskAttemptEventType.TA_SCHEDULE));
  5. The TA_SCHEDULE event is dispatched to TaskAttemptEventDispatcher and triggers the TaskAttemptImpl$RequestContainerTransition transition, which sends a ContainerRequestEvent (ContainerAllocator.EventType.CONTAINER_REQ)
// Tell any speculator that we're requesting a container
taskAttempt.eventHandler.handle
  (new SpeculatorEvent(taskAttempt.getID().getTaskId(), +1));
//request for container
if (rescheduled) {
taskAttempt.eventHandler.handle(
    ContainerRequestEvent.createContainerRequestEventForFailedContainer(
        taskAttempt.attemptId, 
        taskAttempt.resourceCapability));
} else {
taskAttempt.eventHandler.handle(new ContainerRequestEvent(
    taskAttempt.attemptId, taskAttempt.resourceCapability,
    taskAttempt.dataLocalHosts.toArray(
        new String[taskAttempt.dataLocalHosts.size()]),
    taskAttempt.dataLocalRacks.toArray(
        new String[taskAttempt.dataLocalRacks.size()])));
}
  6. The CONTAINER_REQ event is routed by ContainerAllocatorRouter to RMContainerAllocator. Map requests are scheduled immediately and placed in the ask set (a TreeSet of ResourceRequest); reduce requests have to wait for map tasks
ContainerRequestEvent reqEvent = (ContainerRequestEvent) event;
boolean isMap = reqEvent.getAttemptID().getTaskId().getTaskType().
  equals(TaskType.MAP);
if (isMap) {
    handleMapContainerRequest(reqEvent);//scheduledRequests.addMap(reqEvent); //maps are immediately scheduled
} else {
    handleReduceContainerRequest(reqEvent);//pendingReduces.add(ContainerRequest)//reduces are added to pending queue and are slowly ramped up
}
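  7. RMContainerAllocator's allocator thread keeps heartbeating to the RM (AllocatorRunnable.heartbeat()). Each heartbeat both submits any new resource requests and checks which requests have been granted and which containers have completed. It calls getResources(), which calls makeRemoteRequest() to request resources from the RM via allocate(allocateRequest). Once containers are allocated, assignContainers(allocatedContainers) sends a TaskAttemptContainerAssignedEvent (TaskAttemptEventType.TA_ASSIGNED)

heartbeat() method
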
List<Container> allocatedContainers = getResources();
if (allocatedContainers != null && allocatedContainers.size() > 0) {
  scheduledRequests.assign(allocatedContainers);// calls assignContainers(allocatedContainers), which sends the TaskAttemptContainerAssignedEvent
}

makeRemoteRequest() method

AllocateResponse allocateResponse = scheduler.allocate(allocateRequest);// scheduler is the proxy used to talk to the RM
lastResponseID = allocateResponse.getResponseId();
availableResources = allocateResponse.getAvailableResources();
lastClusterNmCount = clusterNmCount;
clusterNmCount = allocateResponse.getNumClusterNodes();
int numCompletedContainers =
    allocateResponse.getCompletedContainersStatuses().size();
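  8. The TA_ASSIGNED event is dispatched to TaskAttemptEventDispatcher and triggers the TaskAttemptImpl$ContainerAssignedTransition transition, which builds the ContainerLaunchContext (environment variables, launch command and so on) and sends a ContainerRemoteLaunchEvent (ContainerLauncher.EventType.CONTAINER_REMOTE_LAUNCH)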
  //launch the container
  //create the container object to be launched for a given Task attempt
  ContainerLaunchContext launchContext = createContainerLaunchContext(
      cEvent.getApplicationACLs(), taskAttempt.conf, taskAttempt.jobToken,
      taskAttempt.remoteTask, taskAttempt.oldJobId, taskAttempt.jvmID,
      taskAttempt.taskAttemptListener, taskAttempt.credentials);
  taskAttempt.eventHandler
    .handle(new ContainerRemoteLaunchEvent(taskAttempt.attemptId,
      launchContext, container, taskAttempt.remoteTask));

  // send event to speculator that our container needs are satisfied
  taskAttempt.eventHandler.handle
      (new SpeculatorEvent(taskAttempt.getID().getTaskId(), -1));
  9. The CONTAINER_REMOTE_LAUNCH event is routed by ContainerLauncherRouter to ContainerLauncherImpl, which sends the start-container request to the NM and then sends a TaskAttemptContainerLaunchedEvent (TaskAttemptEventType.TA_CONTAINER_LAUNCHED)
proxy = getCMProxy(containerMgrAddress, containerID);// proxy for talking to the NM

// Construct the actual Container
ContainerLaunchContext containerLaunchContext =
  event.getContainerLaunchContext();

// Now launch the actual container
StartContainerRequest startRequest =
    StartContainerRequest.newInstance(containerLaunchContext,
      event.getContainerToken());
List<StartContainerRequest> list = new ArrayList<StartContainerRequest>();
list.add(startRequest);
StartContainersRequest requestList = StartContainersRequest.newInstance(list);
StartContainersResponse response =
    proxy.getContainerManagementProtocol().startContainers(requestList);// send the start-container request to the NM
if (response.getFailedRequests() != null
    && response.getFailedRequests().containsKey(containerID)) {
  throw response.getFailedRequests().get(containerID).deSerialize();
}
ByteBuffer portInfo =
    response.getAllServicesMetaData().get(
        ShuffleHandler.MAPREDUCE_SHUFFLE_SERVICEID);// extract the shuffle port from the response
int port = -1;
if(portInfo != null) {
  port = ShuffleHandler.deserializeMetaData(portInfo);
}
LOG.info("Shuffle port returned by ContainerManager for "
    + taskAttemptID + " : " + port);

if(port < 0) {
  this.state = ContainerState.FAILED;
  throw new IllegalStateException("Invalid shuffle port number "
      + port + " returned for " + taskAttemptID);
}

// after launching, send launched event to task attempt to move
// it from ASSIGNED to RUNNING state
context.getEventHandler().handle(
    new TaskAttemptContainerLaunchedEvent(taskAttemptID, port));// send TaskAttemptContainerLaunchedEvent
this.state = ContainerState.RUNNING;
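  10. The TA_CONTAINER_LAUNCHED event is dispatched to TaskAttemptEventDispatcher and triggers the TaskAttemptImpl$LaunchedContainerTransition transition, which records the task attempt's launch time, shuffle port, NM address and so on, and sends a TaskTAttemptEvent (TaskEventType.T_ATTEMPT_LAUNCHED)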
//set the launch time
taskAttempt.launchTime = taskAttempt.clock.getTime();
taskAttempt.shufflePort = event.getShufflePort();

// register it to TaskAttemptListener so that it can start monitoring it.
taskAttempt.taskAttemptListener
.registerLaunchedTask(taskAttempt.attemptId, taskAttempt.jvmID);

//TODO Resolve to host / IP in case of a local address.
InetSocketAddress nodeHttpInetAddr = // TODO: Costly to create sock-addr?
  NetUtils.createSocketAddr(taskAttempt.container.getNodeHttpAddress());
taskAttempt.trackerName = nodeHttpInetAddr.getHostName();
taskAttempt.httpPort = nodeHttpInetAddr.getPort();
taskAttempt.sendLaunchedEvents();
taskAttempt.eventHandler.handle
  (new SpeculatorEvent
      (taskAttempt.attemptId, true, taskAttempt.clock.getTime()));
//make remoteTask reference as null as it is no more needed
//and free up the memory
taskAttempt.remoteTask = null;

//tell the Task that attempt has started
taskAttempt.eventHandler.handle(new TaskTAttemptEvent(
  taskAttempt.attemptId, 
 TaskEventType.T_ATTEMPT_LAUNCHED));
  11. When a task finishes it contacts the AM by calling TaskAttemptListenerImpl.done(), which sends a TaskAttemptEvent (TaskAttemptEventType.TA_DONE)
  public void done(TaskAttemptID taskAttemptID) throws IOException {
    LOG.info("Done acknowledgment from " + taskAttemptID.toString());

    org.apache.hadoop.mapreduce.v2.api.records.TaskAttemptId attemptID =
        TypeConverter.toYarn(taskAttemptID);

    taskHeartbeatHandler.progressing(attemptID);

    context.getEventHandler().handle(
        new TaskAttemptEvent(attemptID, TaskAttemptEventType.TA_DONE));
  }
  12. The TA_DONE event is dispatched to TaskAttemptEventDispatcher and triggers the TaskAttemptImpl$MoveContainerToSucceededFinishingTransition transition, which sends a TaskTAttemptEvent (TaskEventType.T_ATTEMPT_SUCCEEDED)
finalizeProgress(taskAttempt);

// register it to finishing state
taskAttempt.appContext.getTaskAttemptFinishingMonitor().register(
  taskAttempt.attemptId);

// set the finish time
taskAttempt.setFinishTime();

// notify job history
taskAttempt.eventHandler.handle(
  createJobCounterUpdateEventTASucceeded(taskAttempt));
taskAttempt.logAttemptFinishedEvent(TaskAttemptStateInternal.SUCCEEDED);

//notify the task even though the container might not have exited yet.
taskAttempt.eventHandler.handle(new TaskTAttemptEvent(
  taskAttempt.attemptId,
  TaskEventType.T_ATTEMPT_SUCCEEDED));
taskAttempt.eventHandler.handle
  (new SpeculatorEvent
      (taskAttempt.reportedStatus, taskAttempt.clock.getTime()));
  13. The T_ATTEMPT_SUCCEEDED event is a TaskEventType, so it is dispatched to TaskEventDispatcher and triggers the TaskImpl$AttemptSucceededTransition transition, which sends a JobTaskAttemptCompletedEvent (JobEventType.JOB_TASK_ATTEMPT_COMPLETED) and a JobTaskEvent (JobEventType.JOB_TASK_COMPLETED)
  TaskTAttemptEvent taskTAttemptEvent = (TaskTAttemptEvent) event;
  TaskAttemptId taskAttemptId = taskTAttemptEvent.getTaskAttemptID();
  task.handleTaskAttemptCompletion(
      taskAttemptId, 
      TaskAttemptCompletionEventStatus.SUCCEEDED);//JobEventType.JOB_TASK_ATTEMPT_COMPLETED
  task.finishedAttempts.add(taskAttemptId);
  task.inProgressAttempts.remove(taskAttemptId);
  task.successfulAttempt = taskAttemptId;
  task.sendTaskSucceededEvents();//JobTaskEvent
  for (TaskAttempt attempt : task.attempts.values()) {
    if (attempt.getID() != task.successfulAttempt &&
        // This is okay because it can only talk us out of sending a
        //  TA_KILL message to an attempt that doesn't need one for
        //  other reasons.
        !attempt.isFinished()) {
      LOG.info("Issuing kill to other attempt " + attempt.getID());
      task.eventHandler.handle(new TaskAttemptKillEvent(attempt.getID(),
          SPECULATION + task.successfulAttempt + " succeeded first!"));
    }
  }
  task.finished(TaskStateInternal.SUCCEEDED);
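  14. These two events trigger JobImpl's TaskAttemptCompletedEventTransition and TaskCompletedTransition respectively, which update the job's progress statistics. When checkReadyForCommit finds that all tasks have completed, it sends a CommitterJobCommitEvent (CommitterEventType.JOB_COMMIT) so that the committer records job completion.

  15. During RMContainerAllocator's heartbeat with the RM, getResources() finds completed containers and sends a TaskAttemptEvent (TaskAttemptEventType.TA_CONTAINER_COMPLETED) for each of them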

  void processFinishedContainer(ContainerStatus container) {// container comes from response.getCompletedContainersStatuses() in getResources()
    LOG.info("Received completed container " + container.getContainerId());
    TaskAttemptId attemptID = assignedRequests.get(container.getContainerId());
    if (attemptID == null) {
      LOG.error("Container complete event for unknown container "
          + container.getContainerId());
    } else {
      pendingRelease.remove(container.getContainerId());
      assignedRequests.remove(attemptID);

      // Send the diagnostics
      String diagnostic = StringInterner.weakIntern(container.getDiagnostics());
      eventHandler.handle(new TaskAttemptDiagnosticsUpdateEvent(attemptID,
          diagnostic));

      // send the container completed event to Task attempt
      eventHandler.handle(createContainerFinishedEvent(container, attemptID));//new TaskAttemptEvent(attemptId, TaskAttemptEventType.TA_CONTAINER_COMPLETED)

      preemptionPolicy.handleCompletedContainer(attemptID);
    }
  }

  16. The TA_CONTAINER_COMPLETED event triggers ExitFinishingOnContainerCompletedTransition, which sends a ContainerLauncherEvent (ContainerLauncher.EventType.CONTAINER_COMPLETED); ContainerLauncherEvent(CONTAINER_COMPLETED) -> ContainerLauncherImpl.done()
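The job-level part of this flow can be pictured as a transition table keyed by (state, event). The sketch below is a toy under stated assumptions: the State and Event enums are simplified stand-ins chosen for this example (ALL_TASKS_COMPLETED in particular is a hypothetical name), and the real JobImpl has many more states, events and failure arcs.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy job state machine; names are illustrative, not JobImpl's full tables. */
final class JobStateMachineDemo {
    enum State { NEW, INITED, SETUP, RUNNING, COMMITTING, SUCCEEDED }
    enum Event { JOB_INIT, JOB_START, JOB_SETUP_COMPLETED, ALL_TASKS_COMPLETED, JOB_COMMIT_COMPLETED }

    private static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);

    private static void arc(State from, Event on, State to) {
        TABLE.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    static {
        // Happy path only: each event moves the job exactly one step forward.
        arc(State.NEW, Event.JOB_INIT, State.INITED);
        arc(State.INITED, Event.JOB_START, State.SETUP);
        arc(State.SETUP, Event.JOB_SETUP_COMPLETED, State.RUNNING);
        arc(State.RUNNING, Event.ALL_TASKS_COMPLETED, State.COMMITTING);
        arc(State.COMMITTING, Event.JOB_COMMIT_COMPLETED, State.SUCCEEDED);
    }

    private State current = State.NEW;

    State current() {
        return current;
    }

    /** Applies one event; an event with no arc from the current state is rejected. */
    State handle(Event event) {
        Map<Event, State> arcs = TABLE.get(current);
        State next = (arcs == null) ? null : arcs.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        JobStateMachineDemo job = new JobStateMachineDemo();
        for (Event e : Event.values()) {
            System.out.println(e + " -> " + job.handle(e));
        }
    }
}
```

Keeping all legal transitions in one table is what lets every state change go through a single handle() entry point, as the Job interface description at the top of the article requires.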

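Computing the number of maps and reduces

Number of maps

The number of maps equals the number of splits. mapreduce.job.maps is only a hint; the final count is decided by the InputFormat and can be tuned by setting the split size. Looking ahead at the AppMaster code, in JobImpl$InitTransition: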

TaskSplitMetaInfo[] taskSplitMetaInfo = createSplits(job, job.jobId);
...
// create the Tasks but don't start them yet
createMapTasks(job, inputLength, taskSplitMetaInfo);
createReduceTasks(job);
private void createMapTasks(JobImpl job, long inputLength, TaskSplitMetaInfo[] splits) {// one map per split; the i-th split backs the i-th map
  for (int i=0; i < job.numMapTasks; ++i) {
    TaskImpl task =
        new MapTaskImpl(job.jobId, i,
            job.eventHandler, 
            job.remoteJobConfFile, 
            job.conf, splits[i], 
            job.taskAttemptListener, 
            job.jobToken, job.jobCredentials,
            job.clock,
            job.applicationAttemptId.getAttemptId(),
            job.metrics, job.appContext);
    job.addTask(task);
  }
  LOG.info("Input size for job " + job.jobId + " = " + inputLength
      + ". Number of splits = " + splits.length);
}
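Since the map count follows the split count, it helps to see how a split size turns into a number of splits. The sketch below is modelled on the usual FileInputFormat arithmetic (split size = max(minSize, min(maxSize, blockSize)), with a 10% slop on the last split); treat it as an approximation for a single splittable, non-empty file. SplitCountSketch and its method names are hypothetical.

```java
/** Approximate split arithmetic for one splittable, non-empty file (illustrative only). */
final class SplitCountSketch {
    /** The last split may be up to 10% larger than the split size before a new one is cut. */
    static final double SPLIT_SLOP = 1.1;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail (possibly slightly larger than splitSize) is one more split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        System.out.println("300 MB file -> " + countSplits(300 * mb, splitSize) + " maps");
        System.out.println("130 MB file -> " + countSplits(130 * mb, splitSize) + " maps");
    }
}
```

Lowering the maximum split size below the block size increases the number of maps, while raising the minimum split size above the block size decreases it.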

Number of reduces

  • The number of reduces equals mapreduce.job.reduces, which the client-side jobConf sets to 1 by default
private void createReduceTasks(JobImpl job) {
  for (int i = 0; i < job.numReduceTasks; i++) {
    TaskImpl task =
        new ReduceTaskImpl(job.jobId, i,
            job.eventHandler, 
            job.remoteJobConfFile, 
            job.conf, job.numMapTasks, 
            job.taskAttemptListener, job.jobToken,
            job.jobCredentials, job.clock,
            job.applicationAttemptId.getAttemptId(),
            job.metrics, job.appContext);
    job.addTask(task);
  }
  LOG.info("Number of reduces for job " + job.jobId + " = "
      + job.numReduceTasks);
}

Speculative execution algorithm

private long speculationValue(TaskId taskID, long now) {
    Job job = context.getJob(taskID.getJobId());
    Task task = job.getTask(taskID);
    Map<TaskAttemptId, TaskAttempt> attempts = task.getAttempts();
    long acceptableRuntime = Long.MIN_VALUE;
    long result = Long.MIN_VALUE;

    if (!mayHaveSpeculated.contains(taskID)) {
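      //check whether enough history exists to speculate on this task
      //condition: at least 1 task has completed and the completed fraction is no less than 5%
      //only then is there enough history to estimate estimatedReplacementEndTime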
      acceptableRuntime = estimator.thresholdRuntime(taskID);
      if (acceptableRuntime == Long.MAX_VALUE) {
        return ON_SCHEDULE;
      }
    }

    TaskAttemptId runningTaskAttemptID = null;

    int numberRunningAttempts = 0;

    for (TaskAttempt taskAttempt : attempts.values()) {
      if (taskAttempt.getState() == TaskAttemptState.RUNNING
          || taskAttempt.getState() == TaskAttemptState.STARTING) {
        if (++numberRunningAttempts > 1) {//each task may have at most one backup attempt
          return ALREADY_SPECULATING;
        }
        runningTaskAttemptID = taskAttempt.getID();

        //estimatedRunTime is the estimated total run time of this attempt
        //by default, LegacyTaskRuntimeEstimator.updateAttempt keeps this value up to date, computed as:
        // estimate = (long) ((timestamp - start) / Math.max(0.0001, status.progress)); i.e. elapsed time / progress fraction
        long estimatedRunTime = estimator.estimatedRuntime(runningTaskAttemptID);
        

        long taskAttemptStartTime
            = estimator.attemptEnrolledTime(runningTaskAttemptID);//start time of this attempt
        if (taskAttemptStartTime > now) {
          // This background process ran before we could process the task
          //  attempt status change that chronicles the attempt start
          return TOO_NEW;
        }

        long estimatedEndTime = estimatedRunTime + taskAttemptStartTime;//estimated time at which this attempt will finish

        //estimatedReplacementEndTime: the estimated finish time if a replacement attempt were launched right now
        //formula: {now} + {mean run time of the attempts that have already completed successfully}
        //StartEndTimesBase.estimatedNewAttemptRuntime() returns statistics.mean()
        long estimatedReplacementEndTime
            = now + estimator.estimatedNewAttemptRuntime(taskID);

        ...
        
        //the estimated end time is already in the past
        if (estimatedEndTime < now) {
          return PROGRESS_IS_GOOD;
        }

        //if estimatedReplacementEndTime is not earlier than estimatedEndTime, a backup attempt is pointless: even if launched now it would finish no sooner than the running attempt
        if (estimatedReplacementEndTime >= estimatedEndTime) {
          return TOO_LATE_TO_SPECULATE;
        }

        result = estimatedEndTime - estimatedReplacementEndTime;
      }
    }

    // If we are here, there's at most one task attempt.
    if (numberRunningAttempts == 0) {
      return NOT_RUNNING;
    }

    ...
    return result;
  }
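A worked example makes the arithmetic in speculationValue easier to follow. The sketch below isolates just the estimate for a single running attempt, using the formula quoted in the comments above; the class name, method signatures and the use of OptionalLong in place of the sentinel constants are assumptions made for illustration.

```java
import java.util.OptionalLong;

/** Isolated arithmetic of the speculation estimate for one running attempt (illustrative). */
final class SpeculationValueSketch {
    /** Elapsed time divided by progress, with progress clamped away from zero. */
    static long estimatedRuntime(long startTime, long now, double progress) {
        return (long) ((now - startTime) / Math.max(0.0001, progress));
    }

    /**
     * Returns how much earlier a replacement attempt is expected to finish,
     * or empty when speculating is not worthwhile.
     */
    static OptionalLong speculationValue(long startTime, long now, double progress,
                                         long meanCompletedRuntime) {
        long estimatedEndTime = startTime + estimatedRuntime(startTime, now, progress);
        long estimatedReplacementEndTime = now + meanCompletedRuntime;
        if (estimatedEndTime < now) {
            return OptionalLong.empty(); // attempt should already be finishing
        }
        if (estimatedReplacementEndTime >= estimatedEndTime) {
            return OptionalLong.empty(); // a replacement would not finish sooner
        }
        return OptionalLong.of(estimatedEndTime - estimatedReplacementEndTime);
    }

    public static void main(String[] args) {
        // Attempt started at t=0, now t=100s, 25% done; completed attempts averaged 120s.
        System.out.println(speculationValue(0, 100_000, 0.25, 120_000));
        // Same attempt at 50% progress: a replacement is no longer worth launching.
        System.out.println(speculationValue(0, 100_000, 0.5, 120_000));
    }
}
```

In the first case the running attempt is projected to end at 400 s while a fresh attempt would end around 220 s, so the value is 180 000 ms. A larger value means a stronger candidate for a backup attempt.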

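When reduces start

This is configured by mapreduce.job.reduce.slowstart.completedmaps, which defaults to 0.05: reduces become eligible for scheduling once 5% of the map tasks have completed. See the scheduleReduces method in RMContainerAllocator.java: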

    //check for slow start
    if (!getIsReduceStarted()) {//not set yet
      int completedMapsForReduceSlowstart = (int)Math.ceil(reduceSlowStart * 
                      totalMaps);
      if(completedMaps < completedMapsForReduceSlowstart) {
        LOG.info("Reduce slow start threshold not met. " +
              "completedMapsForReduceSlowstart " + 
            completedMapsForReduceSlowstart);
        return;
      } else {
        LOG.info("Reduce slow start threshold reached. Scheduling reduces.");
        setIsReduceStarted(true);
      }
    }
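The threshold check is small enough to try out in isolation. The following sketch reproduces only the ceil(reduceSlowStart * totalMaps) comparison from the snippet above; the class and method names are hypothetical and the rest of scheduleReduces (ramp-up, headroom checks) is deliberately left out.

```java
/** Isolated slow-start threshold check (illustrative; omits the reduce ramp-up logic). */
final class ReduceSlowStartSketch {
    /** Number of maps that must complete before any reduce is scheduled. */
    static int completedMapsForReduceSlowstart(double reduceSlowStart, int totalMaps) {
        return (int) Math.ceil(reduceSlowStart * totalMaps);
    }

    static boolean canScheduleReduces(double reduceSlowStart, int totalMaps, int completedMaps) {
        return completedMaps >= completedMapsForReduceSlowstart(reduceSlowStart, totalMaps);
    }

    public static void main(String[] args) {
        int totalMaps = 10;
        for (double slowStart : new double[] {0.05, 0.25, 1.0}) {
            System.out.println("slowstart=" + slowStart + " -> wait for "
                    + completedMapsForReduceSlowstart(slowStart, totalMaps)
                    + " of " + totalMaps + " maps");
        }
    }
}
```

Setting the value to 1.0 makes reduces wait for every map, which avoids reduce containers sitting idle in the shuffle phase on a busy cluster.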

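Task heartbeat timeout detection

TaskAttemptListenerImpl reports task attempt heartbeats to TaskHeartbeatHandler. A TaskHeartbeatHandler thread periodically checks for task attempts whose heartbeat has timed out and sends a TA_TIMED_OUT event, after which the task attempt is killed. The timeout is set by mapreduce.task.timeout and defaults to 600 secs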

private void checkRunning(long currentTime) {
  Iterator<Map.Entry<TaskAttemptId, ReportTime>> iterator =
      runningAttempts.entrySet().iterator();

  while (iterator.hasNext()) {
    Map.Entry<TaskAttemptId, ReportTime> entry = iterator.next();
    boolean taskTimedOut = (taskTimeOut > 0) &&
        (currentTime > (entry.getValue().getLastProgress() + taskTimeOut));

    if(taskTimedOut) {
      // task is lost, remove from the list and raise lost event
      iterator.remove();
      eventHandler.handle(new TaskAttemptDiagnosticsUpdateEvent(entry
          .getKey(), "AttemptID:" + entry.getKey().toString()
          + " Timed out after " + taskTimeOut / 1000 + " secs"));
      eventHandler.handle(new TaskAttemptEvent(entry.getKey(),
          TaskAttemptEventType.TA_TIMED_OUT));
    }
  }
}
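The same sweep can be exercised without a cluster by passing the clock in explicitly. This is a minimal sketch assuming string attempt ids and millisecond timestamps; HeartbeatMonitorSketch is a hypothetical class, and instead of raising TA_TIMED_OUT events it simply returns the expired ids.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal heartbeat table with a timeout sweep (illustrative stand-in). */
final class HeartbeatMonitorSketch {
    private final Map<String, Long> lastProgress = new ConcurrentHashMap<>();
    private final long timeoutMs;

    HeartbeatMonitorSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called whenever an attempt reports progress. */
    void progressing(String attemptId, long now) {
        lastProgress.put(attemptId, now);
    }

    /** Removes and returns attempts whose last progress is older than the timeout. */
    List<String> checkRunning(long now) {
        List<String> timedOut = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastProgress.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            // A timeout of 0 (or less) disables the check, as in the snippet above.
            if (timeoutMs > 0 && now > entry.getValue() + timeoutMs) {
                timedOut.add(entry.getKey());
                it.remove();
            }
        }
        return timedOut;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(600_000);
        monitor.progressing("attempt_a", 0);
        monitor.progressing("attempt_b", 500_000);
        System.out.println(monitor.checkRunning(600_001));
    }
}
```

Note that the comparison is strict: an attempt whose last report is exactly timeoutMs old is not yet expired.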

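Fault tolerance

ApplicationMaster

  1. The maximum number of attempts for a MapReduce application master is controlled by the mapreduce.am.max-attempts property. The default is 2, so if a MapReduce application master fails twice it is not tried again and the job is marked as failed
  2. YARN imposes a limit on the maximum number of attempts for any application master running on the cluster, and individual applications may not exceed it. The limit is set by yarn.resourcemanager.am.max-attempts and defaults to 2, so to increase the number of MapReduce application master attempts you also have to raise the limit on the YARN cluster

Recovery works as follows:

  1. The application master sends periodic heartbeats to the resource manager. If the application master fails, the resource manager detects the failure and starts a new application master instance in a new container.
  2. The MapReduce application master uses the job history to recover the state of tasks that already ran successfully, so they do not have to be rerun. Recovery is enabled by default and can be disabled by setting yarn.app.mapreduce.am.job.recovery.enable to false

The MapReduce client polls the application master for progress reports, so if the application master fails the client needs to locate the new instance. During job initialization the client asks the resource manager for the application master's address and caches it, so that it does not overload the resource manager with a request every time it polls the application master. If the application master fails, the client experiences a timeout on a status update, at which point it goes back to the resource manager for the new application master's address. This process is transparent to the user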

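Task

  • Failure cases
  1. The most common task failure is user code in a map or reduce task throwing a runtime exception. When this happens, the task JVM reports the error to its parent application master before it exits, and the error ends up in the user logs. The application master marks the task attempt as failed and frees the container so its resources are available to another task

  2. Another failure mode is the sudden exit of the task JVM, for example when user code hits a JVM bug under some particular set of circumstances. In this case the node manager notices that the process has exited and informs the application master, which marks the task attempt as failed

  3. Hanging tasks are handled differently. The application master notices that it has not received a progress update for a while and marks the task as failed; the task JVM process is then killed automatically. By default a task is considered failed after 10 minutes, which is configurable through the mapreduce.task.timeout property. Setting the timeout to 0 disables it, so long-running tasks are never marked as failed. In that case a hanging task never frees its container, and over time this can slow the cluster down. Avoid this setting and make sure tasks report progress periodically

  4. "Too Many fetch failures.Failing the attempt": a reduce task failed to fetch map output. The first phase of a reduce task after it starts is the shuffle (fetching data from the map side). Each fetch can fail because of a connect timeout, read timeout, checksum error and so on, so the reduce task keeps a counter per map recording how many times fetching that map's output has failed. When the count reaches a threshold, the reduce task notifies MRAppMaster that fetching from that map has failed too many times, and the map task is rerun (see mapreduce.reduce.shuffle.max-fetch-failures-fraction)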

JobTaskAttemptFetchFailureEvent fetchfailureEvent = 
    (JobTaskAttemptFetchFailureEvent) event;
  for (org.apache.hadoop.mapreduce.v2.api.records.TaskAttemptId mapId : 
        fetchfailureEvent.getMaps()) {
    Integer fetchFailures = job.fetchFailuresMapping.get(mapId);
    fetchFailures = (fetchFailures == null) ? 1 : (fetchFailures+1);
    job.fetchFailuresMapping.put(mapId, fetchFailures);

    float failureRate = shufflingReduceTasks == 0 ? 1.0f : 
      (float) fetchFailures / shufflingReduceTasks;
    // declare faulty if fetch-failures >= max-allowed-failures
    if (fetchFailures >= job.getMaxFetchFailuresNotifications()
        && failureRate >= job.getMaxAllowedFetchFailuresFraction()) {
      LOG.info("Too many fetch-failures for output of task attempt: " + 
          mapId + " ... raising fetch failure to map");
      job.eventHandler.handle(new TaskAttemptTooManyFetchFailureEvent(mapId,
          fetchfailureEvent.getReduce(), fetchfailureEvent.getHost()));// the attempt for mapId is marked FAILED and the AppMaster schedules a new map attempt
      job.fetchFailuresMapping.remove(mapId);
    }
  }
2021-01-29 02:48:12,735 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures for output of task attempt: attempt_2082859101800_24555422_m_001088_0 ... raising fetch failure to map
2021-01-29 02:48:12,735 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_2082859101800_24555422_m_001088_0 TaskAttempt Transitioned from SUCCEEDED to FAILED
2021-01-29 02:48:12,738 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_2082859101800_24555422_m_001088 Task Transitioned from SUCCEEDED to SCHEDULED
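The decision in the JobImpl snippet above combines an absolute count with a ratio, which is easy to misread. The sketch below isolates just that predicate; FetchFailureSketch is a hypothetical name, and the values 3 and 0.5 used in the example are assumptions about typical settings rather than figures taken from this article.

```java
/** Isolated "too many fetch failures" predicate (illustrative). */
final class FetchFailureSketch {
    /**
     * A map attempt is declared faulty only when BOTH conditions hold:
     * enough failure notifications, and a large enough share of the
     * currently shuffling reduces reporting them.
     */
    static boolean shouldFailMap(int fetchFailures, int shufflingReduceTasks,
                                 int maxFetchFailuresNotifications,
                                 float maxAllowedFetchFailuresFraction) {
        float failureRate = shufflingReduceTasks == 0
                ? 1.0f
                : (float) fetchFailures / shufflingReduceTasks;
        return fetchFailures >= maxFetchFailuresNotifications
                && failureRate >= maxAllowedFetchFailuresFraction;
    }

    public static void main(String[] args) {
        // Assumed thresholds for the example: 3 notifications, fraction 0.5.
        System.out.println(shouldFailMap(3, 4, 3, 0.5f));
        System.out.println(shouldFailMap(3, 10, 3, 0.5f));
    }
}
```

With many reduces shuffling at once, a handful of failures stays below the fraction and the map output is kept; with few reduces left, the same count is enough to rerun the map.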
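  • Retry mechanism: when the application master learns that a task attempt has failed, it reschedules the task and tries to avoid rescheduling it on a node manager where it previously failed. If a task fails 4 times it is not retried again. This value is configurable (mapreduce.map.maxattempts, mapreduce.reduce.maxattempts). By default, if any task fails four times the whole job fails; speculative attempts do not count toward the limit.

For some applications it is undesirable to abort the job because a few tasks failed, since the results may still be usable despite some failures. In this case, the maximum percentage of tasks allowed to fail without triggering job failure can be set (mapreduce.map.failures.maxpercent, mapreduce.reduce.failures.maxpercent)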
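As a rough illustration of how such a percentage could be applied, the sketch below fails the job only when the share of failed tasks exceeds the allowed percentage. The comparison is an assumption modelled on the description above, not a copy of the AppMaster's code, and FailureToleranceSketch is a hypothetical name.

```java
/** Illustrative check for a tolerated-failure percentage (assumed semantics). */
final class FailureToleranceSketch {
    /** True when failed tasks exceed allowedFailuresPercent of all tasks of that type. */
    static boolean jobShouldFail(long failedTasks, long totalTasks, long allowedFailuresPercent) {
        // Integer arithmetic avoids rounding: failed/total > percent/100.
        return failedTasks * 100 > allowedFailuresPercent * totalTasks;
    }

    public static void main(String[] args) {
        System.out.println(jobShouldFail(1, 100, 0)); // no tolerance: one failure is fatal
        System.out.println(jobShouldFail(5, 100, 5)); // exactly at the limit: tolerated
        System.out.println(jobShouldFail(6, 100, 5)); // above the limit: job fails
    }
}
```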

Example run

start_container_request

start_container_request {
  container_launch_context {
    localResources {
      key: "job.xml"
      value {
        resource {
          scheme: "hdfs"
          host: "localhost"
          port: 9000
          file: "/tmp/hadoop-yarn/staging/root/.staging/job_1611734493242_0013/job.xml"
        }
        size: 186198
        timestamp: 1612092384017
        type: FILE
        visibility: APPLICATION
      }
    }
    localResources {
      key: "job.jar"
      value {
        resource {
          scheme: "hdfs"
          host: "localhost"
          port: 9000
          file: "/tmp/hadoop-yarn/staging/root/.staging/job_1611734493242_0013/job.jar"
        }
        size: 316382
        timestamp: 1612092383209
        type: PATTERN
        visibility: APPLICATION
        pattern: "(?:classes/|lib/).*"
      }
    }
    tokens: "HDTS\000\001\bJobToken\027\026job_1611734493242_0013\024b\\\221\330\257Q\330\353\275\222\340XSg\221\300\031\354\353\rmapreduce.job\026job_1611734493242_0013\001\025MapReduceShuffleToken\b\032\271\355\002\351\033\364\001"
    service_data {
      key: "mapreduce_shuffle"
      value: "\027\026job_1611734493242_0013\b\032\271\355\002\351\033\364\001\rmapreduce.job\026job_1611734493242_0013"
    }
    environment {
      key: "STDOUT_LOGFILE_ENV"
      value: "<LOG_DIR>/stdout"
    }
    environment {
      key: "SHELL"
      value: "/bin/bash"
    }
    environment {
      key: "LD_LIBRARY_PATH"
      value: "$PWD:{{HADOOP_COMMON_HOME}}/lib/native"
    }
    environment {
      key: "HADOOP_ROOT_LOGGER"
      value: "INFO,console"
    }
    environment {
      key: "STDERR_LOGFILE_ENV"
      value: "<LOG_DIR>/stderr"
    }
    environment {
      key: "HADOOP_MAPRED_HOME"
      value: "${HADOOP_HOME}"
    }
    environment {
      key: "CLASSPATH"
      value: "$PWD:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:job.jar/*:job.jar/classes/:job.jar/lib/*:$PWD/*"
    }
    environment {
      key: "HADOOP_CLIENT_OPTS"
      value: ""
    }
    command: "$JAVA_HOME/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Xmx1536M -Djava.io.tmpdir=$PWD/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 172.17.0.4 37465 attempt_1611734493242_0013_m_000000_1000 2 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr "
    application_ACLs {
      accessType: APPACCESS_VIEW_APP
      acl: " "
    }
    application_ACLs {
      accessType: APPACCESS_MODIFY_APP
      acl: " "
    }
  }
  container_token {
    identifier: "\n\021\022\r\n\t\b\r\020\272\320\362\226\364.\020\002\030\002\022\rhadoop3:43845\032\004root\"+\b\200\020\020\001\032\024\n\tmemory-mb\020\200\020\032\002Mi \000\032\016\n\006vcores\020\001\032\000 \000(\257\224\314\302\365.0\361\374\235\266\375\377\377\377\377\0018\272\320\362\226\364.B\002\b\024H\351\371\245\302\365.Z\000`\002h\001p\000x\377\377\377\377\377\377\377\377\377\001"
    password: "]J\222\320\234I@w\310\217i\230\217\273\371\367\035\236\236\004"
    kind: "ContainerToken"
    service: "172.17.0.4:43845"
  }
}