HDFS Write Error Handling


Common Error Cases

Assume a file larger than 64 KB (so it spans more than one packet) is being written in the common 3-replica setup, and one of the following errors occurs:

  1. An error occurs while the pipeline is being created

  2. The pipeline was set up successfully, but an error occurs while packets are being sent. Two common sub-cases:

    • The write to the pipeline succeeds, but the response thread receives an ack in which a DN is flagged with an error status
    • The write to the pipeline itself fails; if no other DN has been marked bad yet, the first DN is simply blamed (the first DN is the one closest to the client)
  3. All regular packets have been sent, and while the empty trailing (end-of-block) packet is being sent a DN returns an ack with an error status

    • If the pipeline can be rebuilt without errors, the block's generation stamp is updated and the block completes normally
    • If the rebuilt pipeline still reports an error, the error is ignored and the block is treated as completed normally
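The two sub-cases in item 2 can be sketched as a pair of tiny helpers. This is a hypothetical standalone sketch, not the real ResponseProcessor/DataStreamer code; the class and method names are invented for illustration:

```java
// Hypothetical sketch of the two failure-attribution rules above.
// Not the real Hadoop classes; all names here are invented.
public class FailureAttribution {
    public enum Status { SUCCESS, ERROR }

    /** Sub-case 1: the ack carries one status per DN in the pipeline.
     *  Return the index of a DN that reported failure, or -1 if none did. */
    public static int findErrorIndex(Status[] replies) {
        for (int i = replies.length - 1; i >= 0; i--) {
            if (replies[i] != Status.SUCCESS) {
                return i;
            }
        }
        return -1; // every DN acked successfully
    }

    /** Sub-case 2: the local write to the pipeline failed. If no DN has
     *  been marked bad yet, blame the first DN (closest to the client). */
    public static int blameForWriteFailure(int currentErrorIndex) {
        return currentErrorIndex >= 0 ? currentErrorIndex : 0;
    }
}
```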

Error Handling Flow

  1. ResponseProcessor sets errorIndex and restartingNodeIndex, marks hasError = true, and the response thread is shut down
  2. The packets in ackQueue are moved back into dataQueue
  3. The pipeline is re-initialized

Pipeline Re-initialization Flow

[Figure: pipeline write-recovery flow (写恢复2.png)]
Condition 1 (replace-datanode policy):

  1. DEFAULT: 1) replication >= 3, and 2) current DN count <= replication/2 || isAppend || isHflushed
  2. ALWAYS: always true
  3. NEVER: always false

Condition 2 (pipeline stage):

  1. The pipeline failed while still in a setup (create) stage
    • PIPELINE_SETUP_APPEND: always request a new DN
    • PIPELINE_SETUP_CREATE:
      • If no data has been written yet, do not request a new DN
      • If data has been written, request a new DN (a corner case that can be set aside for now; I never figured out how it is triggered)
  2. The pipeline is in the streaming stage --> always request a new DN (the common case)
  3. PIPELINE_CLOSE (the empty trailing packet got a bad ack) --> ignore the error and let the NameNode make up the missing replica later
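Condition 2 amounts to a decision table keyed on the pipeline stage. A hypothetical sketch (it deliberately ignores the replace-datanode policy from condition 1; Stage mirrors BlockConstructionStage, while Action and decide() are invented names):

```java
// Hypothetical decision table for "condition 2" above.
// Stage mirrors Hadoop's BlockConstructionStage constants;
// Action and decide() are invented for illustration.
public class StageRecovery {
    public enum Stage {
        PIPELINE_SETUP_CREATE, PIPELINE_SETUP_APPEND,
        DATA_STREAMING, PIPELINE_CLOSE
    }
    public enum Action { REQUEST_NEW_DN, NO_NEW_DN, IGNORE_AND_LET_NN_REPLICATE }

    public static Action decide(Stage stage, boolean bytesWritten) {
        switch (stage) {
            case PIPELINE_SETUP_APPEND:
                // append setup always asks for a new DN
                return Action.REQUEST_NEW_DN;
            case PIPELINE_SETUP_CREATE:
                // only request a new DN if some data was already written
                return bytesWritten ? Action.REQUEST_NEW_DN : Action.NO_NEW_DN;
            case DATA_STREAMING:
                // the common case
                return Action.REQUEST_NEW_DN;
            case PIPELINE_CLOSE:
                // ignore; the NameNode re-replicates the missing copy later
                return Action.IGNORE_AND_LET_NN_REPLICATE;
            default:
                throw new IllegalArgumentException("unknown stage");
        }
    }
}
```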

Scenarios

  1. Pipeline creation fails: send an abandonBlock RPC to the NN to give up this block, add the bad DN to the excluded-nodes list, and retry (different from write-failure handling)
  2. 3 replicas, DEFAULT policy, append fails: a new node is added
  3. 2 replicas, DEFAULT policy, append fails: no new node is added
  4. 3 replicas, DEFAULT policy, write fails: no new node is added
  5. 2 replicas, DEFAULT policy, write fails: no new node is added (replication < 3 fails the DEFAULT condition)
  6. PIPELINE_CLOSE: no new node is added
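These outcomes follow from the DEFAULT condition above. A hypothetical standalone re-implementation (shouldAddDatanode is an invented name; the real check is the DEFAULT condition inside ReplaceDatanodeOnFailure):

```java
// Hypothetical re-implementation of the DEFAULT replace-datanode condition.
public class ReplaceDatanodeDefault {
    /** Add a new DN only if replication >= 3 and either fewer than half of
     *  the replicas remain, or the stream was appended / hflushed. */
    public static boolean shouldAddDatanode(int replication, int currentDatanodes,
                                            boolean isAppend, boolean isHflushed) {
        if (replication < 3) {
            return false; // low-replication files never get a replacement DN
        }
        return currentDatanodes <= replication / 2 || isAppend || isHflushed;
    }
}
```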

Source Code Analysis

DataStreamer##run

// Close the response thread
if (hasError && response != null) {
  try {
    response.close();
    response.join();
    response = null;
  } catch (InterruptedException  e) {
    DFSClient.LOG.warn("Caught exception ", e);
  }
}
if (hasError && (errorIndex >= 0 || restartingNodeIndex.get() >= 0)) {
  doSleep = processDatanodeError(); // returns false on failure; no sleep in that case either
}

DataStreamer##processDatanodeError

private boolean processDatanodeError() throws IOException {
  // If the response thread has not been closed yet, wait
  if (response != null) {
    DFSClient.LOG.info("Error Recovery for " + block +
    " waiting for responder to exit. ");
    return true;
  }
  // 1. First close the current pipeline (blockStream, replyStream, socket)
  closeStream();

  // move packets from ack queue to front of the data queue
  // 2. Move everything in ackQueue back to dataQueue
  synchronized (dataQueue) {
    dataQueue.addAll(0, ackQueue);
    ackQueue.clear();
  }

  // Record the new pipeline failure recovery.
  if (lastAckedSeqnoBeforeFailure != lastAckedSeqno) {
     lastAckedSeqnoBeforeFailure = lastAckedSeqno;
     pipelineRecoveryCount = 1;
  } else {
    // If we had to recover the pipeline five times in a row for the
    // same packet, this client likely has corrupt data or corrupting
    // during transmission.
    if (++pipelineRecoveryCount > 5) {
      DFSClient.LOG.warn("Error recovering pipeline for writing " +
          block + ". Already retried 5 times for the same packet.");
      lastException.set(new IOException("Failing write. Tried pipeline " +
          "recovery 5 times without success."));
      streamerClosed = true;
      return false;
    }
  }
  // 3. Re-initialize the pipeline
  boolean doSleep = setupPipelineForAppendOrRecovery();
  // 4. Handle regular-packet errors vs. empty-trailing-packet errors
  if (!streamerClosed && dfsClient.clientRunning) {
    if (stage == BlockConstructionStage.PIPELINE_CLOSE) {

      // If we had an error while closing the pipeline, we go through a fast-path
      // where the BlockReceiver does not run. Instead, the DataNode just finalizes
      // the block immediately during the 'connect ack' process. So, we want to pull
      // the end-of-block packet from the dataQueue, since we don't actually have
      // a true pipeline to send it over.
      //
      // We also need to set lastAckedSeqno to the end-of-block Packet's seqno, so that
      // a client waiting on close() will be aware that the flush finished.
      // 4.1 Pull the end-of-block packet from dataQueue, just like normal block
      // completion; the pipeline is cleared and the next packet allocates a new block
      synchronized (dataQueue) {
        DFSPacket endOfBlockPacket = dataQueue.remove();  // remove the end of block packet
        Span span = endOfBlockPacket.getTraceSpan();
        if (span != null) {
          // Close any trace span associated with this Packet
          TraceScope scope = Trace.continueSpan(span);
          scope.close();
        }
        assert endOfBlockPacket.isLastPacketInBlock();
        assert lastAckedSeqno == endOfBlockPacket.getSeqno() - 1;
        lastAckedSeqno = endOfBlockPacket.getSeqno();
        dataQueue.notifyAll();
      }
      endBlock();
    } else {
      // 4.2 Restart the response thread and move the pipeline to streaming (ready) state
      initDataStreaming();
    }
  }
  
  return doSleep;
}
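The requeue in step 2 (`dataQueue.addAll(0, ackQueue)`) preserves send order: sent-but-unacked packets return to the front of the data queue, ahead of packets that were never sent. A minimal illustration with plain lists (RequeueDemo is an invented name):

```java
import java.util.LinkedList;

public class RequeueDemo {
    /** Same move as processDatanodeError: unacked (sent-but-unconfirmed)
     *  packets go to the FRONT of the data queue, so they are resent
     *  before any packet that was never sent. */
    public static LinkedList<Integer> requeue(LinkedList<Integer> dataQueue,
                                              LinkedList<Integer> ackQueue) {
        dataQueue.addAll(0, ackQueue);
        ackQueue.clear();
        return dataQueue;
    }
}
```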

Re-initialize the pipeline and decide whether a new node should be added.
The condition for adding a datanode:
1. DEFAULT: 1) replication >= 3, and 2) current DN count <= replication/2 || isAppend || isHflushed
2. ALWAYS: always true
3. NEVER: always false

DataStreamer##setupPipelineForAppendOrRecovery

private boolean setupPipelineForAppendOrRecovery() throws IOException {
  // check number of datanodes
  if (nodes == null || nodes.length == 0) {
    String msg = "Could not get block locations. " + "Source file \""
        + src + "\" - Aborting...";
    DFSClient.LOG.warn(msg);
    setLastException(new IOException(msg));
    streamerClosed = true;
    return false;
  }
  
  boolean success = false;
  long newGS = 0L;
  while (!success && !streamerClosed && dfsClient.clientRunning) {
    // Sleep before reconnect if a dn is restarting.
    // This process will be repeated until the deadline or the datanode
    // starts back up.
    // 1. If a DN is restarting, sleep (up to 4s) before reconnecting
    if (restartingNodeIndex.get() >= 0) {
      // 4 seconds or the configured deadline period, whichever is shorter.
      // This is the retry interval and recovery will be retried in this
      // interval until timeout or success.
      long delay = Math.min(dfsClient.getConf().datanodeRestartTimeout,
          4000L);
      try {
        Thread.sleep(delay);
      } catch (InterruptedException ie) {
        lastException.set(new IOException("Interrupted while waiting for " +
            "datanode to restart. " + nodes[restartingNodeIndex.get()]));
        streamerClosed = true;
        return false;
      }
    }
    boolean isRecovery = hasError;
    // remove bad datanode from list of datanodes.
    // If errorIndex was not set (i.e. appends), then do not remove 
    // any datanodes
    // 2. Handle the bad node
    if (errorIndex >= 0) {
      StringBuilder pipelineMsg = new StringBuilder();
      for (int j = 0; j < nodes.length; j++) {
        pipelineMsg.append(nodes[j]);
        if (j < nodes.length - 1) {
          pipelineMsg.append(", "); // separate node names with commas
        }
      }
      // If only one node is left, the bad node cannot be removed from the pipeline; abort
      if (nodes.length <= 1) {
        lastException.set(new IOException("All datanodes " + pipelineMsg
            + " are bad. Aborting..."));
        streamerClosed = true;
        return false;
      }
      DFSClient.LOG.warn("Error Recovery for block " + block +
          " in pipeline " + pipelineMsg + 
          ": bad datanode " + nodes[errorIndex]);
      // Add the bad DN to the failed list
      failed.add(nodes[errorIndex]);

      DatanodeInfo[] newnodes = new DatanodeInfo[nodes.length-1];
      arraycopy(nodes, newnodes, errorIndex);

      final StorageType[] newStorageTypes = new StorageType[newnodes.length];
      arraycopy(storageTypes, newStorageTypes, errorIndex);

      final String[] newStorageIDs = new String[newnodes.length];
      arraycopy(storageIDs, newStorageIDs, errorIndex);
      // Update the pipeline
      setPipeline(newnodes, newStorageTypes, newStorageIDs);

      // Just took care of a node error while waiting for a node restart
      if (restartingNodeIndex.get() >= 0) {
        // If the error came from a node further away than the restarting
        // node, the restart must have been complete.
        if (errorIndex > restartingNodeIndex.get()) {
          restartingNodeIndex.set(-1);
        } else if (errorIndex < restartingNodeIndex.get()) {
          // the node index has shifted.
          restartingNodeIndex.decrementAndGet();
        } else {
          // this shouldn't happen...
          assert false;
        }
      }

      if (restartingNodeIndex.get() == -1) {
        hasError = false;
      }
      lastException.set(null);
      errorIndex = -1;
    }

    // Check if replace-datanode policy is satisfied.
    // 3. Check whether the replace-datanode condition is satisfied:
    //    1) replication >= 3, and
    //    2) current DN count <= replication/2 || isAppend || isHflushed
    // e.g. with the default of 3 replicas, a new DN is requested only when
    // a single DN remains, or on append / after hflush has been called
    if (dfsClient.dtpReplaceDatanodeOnFailure.satisfy(blockReplication,
        nodes, isAppend, isHflushed)) {
      try {
        addDatanode2ExistingPipeline();
      } catch(IOException ioe) {
        if (!dfsClient.dtpReplaceDatanodeOnFailure.isBestEffort()) {
          throw ioe;
        }
        DFSClient.LOG.warn("Failed to replace datanode."
            + " Continue with the remaining datanodes since "
            + DFSConfigKeys.DFS_CLIENT_WRITE_REPLACE_DATANODE_ON_FAILURE_BEST_EFFORT_KEY
            + " is set to true.", ioe);
      }
    }

    // get a new generation stamp and an access token
    // 4. Send an updateBlockForPipeline RPC to the NN to get a new generation stamp and access token for the block
    LocatedBlock lb = dfsClient.namenode.updateBlockForPipeline(block, dfsClient.clientName);
    newGS = lb.getBlock().getGenerationStamp();
    accessToken = lb.getBlockToken();
    
    // set up the pipeline again with the remaining nodes
    // 5. Re-send the writeBlock request via Sender and set up the pipeline; returns true on success.
    // Same initialization as moving the pipeline from its initial state to the ready state
    if (failPacket) { // for testing
      success = createBlockOutputStream(nodes, storageTypes, newGS, isRecovery);
      failPacket = false;
      try {
        // Give DNs time to send in bad reports. In real situations,
        // good reports should follow bad ones, if client committed
        // with those nodes.
        Thread.sleep(2000);
      } catch (InterruptedException ie) {}
    } else {
      success = createBlockOutputStream(nodes, storageTypes, newGS, isRecovery);
    }
    // 6. Check again: declare a restarting DN that missed its deadline a bad node
    if (restartingNodeIndex.get() >= 0) {
      assert hasError == true;
      // check errorIndex set above
      if (errorIndex == restartingNodeIndex.get()) {
        // ignore, if came from the restarting node
        errorIndex = -1;
      }
      // still within the deadline
      if (Time.monotonicNow() < restartDeadline) {
        continue; // with in the deadline
      }
      // expired. declare the restarting node dead
      restartDeadline = 0;
      int expiredNodeIndex = restartingNodeIndex.get();
      restartingNodeIndex.set(-1);
      DFSClient.LOG.warn("Datanode did not restart in time: " +
          nodes[expiredNodeIndex]);
      // Mark the restarting node as failed. If there is any other failed
      // node during the last pipeline construction attempt, it will not be
      // overwritten/dropped. In this case, the restarting node will get
      // excluded in the following attempt, if it still does not come up.
      if (errorIndex == -1) {
        errorIndex = expiredNodeIndex;
      }
      // From this point on, normal pipeline recovery applies.
    }
  } // while

  if (success) {
    // update pipeline at the namenode
    ExtendedBlock newBlock = new ExtendedBlock(
        block.getBlockPoolId(), block.getBlockId(), block.getNumBytes(), newGS);
    dfsClient.namenode.updatePipeline(dfsClient.clientName, block, newBlock,
        nodes, storageIDs);
    // update client side generation stamp
    block = newBlock;
  }
  return false; // do not sleep, continue processing
}
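The `arraycopy(nodes, newnodes, errorIndex)` calls above go through a small generic helper (in Hadoop 2.x it lives in DFSOutputStream) that copies every element except the one at the bad index. A sketch of an equivalent helper:

```java
// Sketch of the skip-one-index copy used when removing a bad DN
// from the pipeline arrays (nodes, storageTypes, storageIDs).
public class PipelineArrays {
    /** Copy all of srcs into dsts except the element at skipIndex.
     *  dsts.length must be srcs.length - 1. */
    static <T> void arraycopy(T[] srcs, T[] dsts, int skipIndex) {
        // elements before the bad node keep their positions
        System.arraycopy(srcs, 0, dsts, 0, skipIndex);
        // elements after the bad node shift left by one
        System.arraycopy(srcs, skipIndex + 1, dsts, skipIndex,
                         dsts.length - skipIndex);
    }
}
```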