Common error cases
Assume the common 3-replica setup and a file larger than 64 KB being transferred. The following errors can occur:
- An error occurs while the pipeline is being created.
- An error occurs while sending packets after the pipeline has been set up. Two common sub-cases:
  - After writing to the pipeline, the response thread sees an error flagged in the ack returned by a DN.
  - The write to the pipeline itself fails; if no other DN is currently marked bad, the first DN (the one closest to the client) is assumed to be the bad one.
- All normal packets have been sent, and the ack the DN returns for the empty trailing (end-of-block) packet reports an error:
  - If the pipeline can be re-established without error, update the block's generation stamp and complete the block normally.
  - If the re-established pipeline still reports an error, ignore it and treat the block as completed normally.
Error handling flow
- The ResponseProcessor sets errorIndex and restartingNodeIndex, marks hasError as true, and the response thread is shut down
- Move the packets in ackQueue back to dataQueue
- Re-initialize the pipeline
Pipeline re-initialization flow
Condition 1 (replace-datanode policy):
- DEFAULT: 1) replication >= 3, and 2) current DN count <= replication/2 || isAppend || isHflushed
- ALWAYS: true
- NEVER: false
Condition 2 (pipeline stage):
- Failure while the pipeline is still in the create stage:
  - PIPELINE_SETUP_APPEND: always request a new DN
  - PIPELINE_SETUP_CREATE:
    - No data written yet: do not request a new DN
    - Data already written: request a new DN (a special case that can be set aside for now; I have not figured out how it is triggered)
- Pipeline in the streaming stage --> always request a new DN (the common case)
- PIPELINE_CLOSE (the ack for the empty trailing packet reported an error) --> ignore entirely and let the NameNode make up the missing replicas
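The stage-based part of the decision (Condition 2) can be sketched as follows. This is a hedged illustration rather than the real HDFS code path: `Stage` mirrors the `BlockConstructionStage` values used later in this note, while `shouldRequestNewDatanode` and `bytesSent` are names invented here for clarity (in the real DataStreamer, the replace-datanode policy of Condition 1 is still checked on top of this).

```java
// Hypothetical sketch of Condition 2 above. Not the real HDFS API:
// shouldRequestNewDatanode and bytesSent are made-up names for this note.
public class StageDecision {
    enum Stage { PIPELINE_SETUP_CREATE, PIPELINE_SETUP_APPEND, DATA_STREAMING, PIPELINE_CLOSE }

    // Decide, from the pipeline stage alone, whether a replacement DN
    // should be requested after a failure (Condition 1 is checked separately).
    static boolean shouldRequestNewDatanode(Stage stage, long bytesSent) {
        switch (stage) {
            case PIPELINE_SETUP_APPEND: return true;            // always request a new DN
            case PIPELINE_SETUP_CREATE: return bytesSent > 0;   // only if data was already written
            case DATA_STREAMING:        return true;            // the common case
            case PIPELINE_CLOSE:        return false;           // NN repairs missing replicas later
            default:                    return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldRequestNewDatanode(Stage.PIPELINE_SETUP_CREATE, 0L));  // false
        System.out.println(shouldRequestNewDatanode(Stage.DATA_STREAMING, 1024L));      // true
    }
}
```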
Scenarios
- Failure while creating the pipeline: send an abandonBlock RPC to the NN to give up this block, add the bad DN to the exclude list, and retry (this differs from write-failure handling)
- 3 replicas, default policy, append fails: add a new node
- 2 replicas, default policy, append fails: do not add a new node
- 3 replicas, default policy, write fails: do not add a new node (2 DNs remain, and 2 > 3/2)
- 2 DNs left in a replication-3 pipeline, default policy, write fails: add a new node (1 DN remains, and 1 <= 3/2)
- PIPELINE_CLOSE: do not add a new node
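The scenarios above follow from the DEFAULT condition. Below is a minimal sketch of that condition, assuming the summary in this note is accurate; `satisfyDefault` and its plain-int `n` (the number of DNs still in the pipeline) are simplifications of `ReplaceDatanodeOnFailure.satisfy`, which takes the `DatanodeInfo[]` array instead.

```java
// Minimal sketch of the DEFAULT replace-datanode condition summarized above.
// satisfyDefault is a name invented for this note; the real logic lives in
// org.apache.hadoop.hdfs.protocol.datatransfer.ReplaceDatanodeOnFailure.
public class ReplacePolicySketch {
    static boolean satisfyDefault(short replication, int n, boolean isAppend, boolean isHflushed) {
        if (n == 0 || n >= replication) {
            return false;                 // nothing to replace, or pipeline already full
        }
        if (replication < 3) {
            return false;                 // DEFAULT only applies when replication >= 3
        }
        return n <= replication / 2 || isAppend || isHflushed;
    }

    public static void main(String[] args) {
        // 3 replicas, append fails (2 DNs left): add a new node
        System.out.println(satisfyDefault((short) 3, 2, true, false));   // true
        // replication 2, append fails: never add
        System.out.println(satisfyDefault((short) 2, 1, true, false));   // false
        // replication 3, write fails once (2 DNs left): 2 > 3/2, do not add
        System.out.println(satisfyDefault((short) 3, 2, false, false));  // false
        // replication 3, write fails again (1 DN left): 1 <= 3/2, add
        System.out.println(satisfyDefault((short) 3, 1, false, false));  // true
    }
}
```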
Source code analysis
DataStreamer##run
// shut down the response (ResponseProcessor) thread
if (hasError && response != null) {
try {
response.close();
response.join();
response = null;
} catch (InterruptedException e) {
DFSClient.LOG.warn("Caught exception ", e);
}
}
if (hasError && (errorIndex >= 0 || restartingNodeIndex.get() >= 0)) {
doSleep = processDatanodeError(); // returns false on failure; in that case we do not sleep either
}
DataStreamer##processDatanodeError
private boolean processDatanodeError() throws IOException {
// if the responder has not been closed yet, keep waiting
if (response != null) {
DFSClient.LOG.info("Error Recovery for " + block +
" waiting for responder to exit. ");
return true;
}
// 1. First close the current pipeline (blockStream, replyStream, socket)
closeStream();
// move packets from ack queue to front of the data queue
// 2. Move the packets in the ack queue back to the data queue
synchronized (dataQueue) {
dataQueue.addAll(0, ackQueue);
ackQueue.clear();
}
// Record the new pipeline failure recovery.
if (lastAckedSeqnoBeforeFailure != lastAckedSeqno) {
lastAckedSeqnoBeforeFailure = lastAckedSeqno;
pipelineRecoveryCount = 1;
} else {
// If we had to recover the pipeline five times in a row for the
// same packet, this client likely has corrupt data or corrupting
// during transmission.
if (++pipelineRecoveryCount > 5) {
DFSClient.LOG.warn("Error recovering pipeline for writing " +
block + ". Already retried 5 times for the same packet.");
lastException.set(new IOException("Failing write. Tried pipeline " +
"recovery 5 times without success."));
streamerClosed = true;
return false;
}
}
// 3. Re-initialize the pipeline
boolean doSleep = setupPipelineForAppendOrRecovery();
// 4. Distinguish errors on normal packets from errors on the empty trailing (end-of-block) packet
if (!streamerClosed && dfsClient.clientRunning) {
if (stage == BlockConstructionStage.PIPELINE_CLOSE) {
// If we had an error while closing the pipeline, we go through a fast-path
// where the BlockReceiver does not run. Instead, the DataNode just finalizes
// the block immediately during the 'connect ack' process. So, we want to pull
// the end-of-block packet from the dataQueue, since we don't actually have
// a true pipeline to send it over.
//
// We also need to set lastAckedSeqno to the end-of-block Packet's seqno, so that
// a client waiting on close() will be aware that the flush finished.
// 4.1 Take the end-of-block packet out of dataQueue; same as the normal block-completion call: the pipeline is reset, and the next packet will allocate a new block
synchronized (dataQueue) {
DFSPacket endOfBlockPacket = dataQueue.remove(); // remove the end of block packet
Span span = endOfBlockPacket.getTraceSpan();
if (span != null) {
// Close any trace span associated with this Packet
TraceScope scope = Trace.continueSpan(span);
scope.close();
}
assert endOfBlockPacket.isLastPacketInBlock();
assert lastAckedSeqno == endOfBlockPacket.getSeqno() - 1;
lastAckedSeqno = endOfBlockPacket.getSeqno();
dataQueue.notifyAll();
}
endBlock();
} else {
// 4.2 Restart the response thread and move the pipeline to streaming (ready state)
initDataStreaming();
}
}
return doSleep;
}
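Step 2 above, moving the contents of ackQueue to the front of dataQueue, preserves sequence-number order, which is why recovery can simply resend from the first unacked packet. A self-contained demo, with `Integer` seqnos standing in for `DFSPacket`s:

```java
// Illustrates the requeue in processDatanodeError: packets that were sent but
// never acked (ackQueue) go back to the FRONT of dataQueue, ahead of packets
// that were never sent, so seqno order is preserved on resend.
import java.util.LinkedList;

public class RequeueDemo {
    public static void main(String[] args) {
        LinkedList<Integer> dataQueue = new LinkedList<>();
        LinkedList<Integer> ackQueue = new LinkedList<>();
        dataQueue.add(5); dataQueue.add(6);   // not yet sent
        ackQueue.add(3);  ackQueue.add(4);    // sent, but never acked

        // the same two operations as in processDatanodeError
        dataQueue.addAll(0, ackQueue);
        ackQueue.clear();

        System.out.println(dataQueue);            // [3, 4, 5, 6]
        System.out.println(ackQueue.isEmpty());   // true
    }
}
```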
Re-initialize the pipeline and decide whether a new node needs to be added
Check whether the condition for adding a datanode is satisfied:
1. DEFAULT: 1) replication >= 3, and 2) current DN count <= replication/2 || isAppend || isHflushed
2. ALWAYS: true
3. NEVER: false
DataStreamer##setupPipelineForAppendOrRecovery
private boolean setupPipelineForAppendOrRecovery() throws IOException {
// check number of datanodes
if (nodes == null || nodes.length == 0) {
String msg = "Could not get block locations. " + "Source file \""
+ src + "\" - Aborting...";
DFSClient.LOG.warn(msg);
setLastException(new IOException(msg));
streamerClosed = true;
return false;
}
boolean success = false;
long newGS = 0L;
while (!success && !streamerClosed && dfsClient.clientRunning) {
// Sleep before reconnect if a dn is restarting.
// This process will be repeated until the deadline or the datanode
// starts back up.
// 1. A node is restarting: sleep up to 4 s before retrying
if (restartingNodeIndex.get() >= 0) {
// 4 seconds or the configured deadline period, whichever is shorter.
// This is the retry interval and recovery will be retried in this
// interval until timeout or success.
long delay = Math.min(dfsClient.getConf().datanodeRestartTimeout,
4000L);
try {
Thread.sleep(delay);
} catch (InterruptedException ie) {
lastException.set(new IOException("Interrupted while waiting for " +
"datanode to restart. " + nodes[restartingNodeIndex.get()]));
streamerClosed = true;
return false;
}
}
boolean isRecovery = hasError;
// remove bad datanode from list of datanodes.
// If errorIndex was not set (i.e. appends), then do not remove
// any datanodes
// 2. Handle the bad node
if (errorIndex >= 0) {
StringBuilder pipelineMsg = new StringBuilder();
for (int j = 0; j < nodes.length; j++) {
pipelineMsg.append(nodes[j]);
if (j < nodes.length - 1) {
pipelineMsg.append(", "); // comma-separate the node list
}
}
// with only one node left, the bad node cannot be removed from the pipeline; abort with an error
if (nodes.length <= 1) {
lastException.set(new IOException("All datanodes " + pipelineMsg
+ " are bad. Aborting..."));
streamerClosed = true;
return false;
}
DFSClient.LOG.warn("Error Recovery for block " + block +
" in pipeline " + pipelineMsg +
": bad datanode " + nodes[errorIndex]);
// add the bad DN to the failed list
failed.add(nodes[errorIndex]);
DatanodeInfo[] newnodes = new DatanodeInfo[nodes.length-1];
arraycopy(nodes, newnodes, errorIndex);
final StorageType[] newStorageTypes = new StorageType[newnodes.length];
arraycopy(storageTypes, newStorageTypes, errorIndex);
final String[] newStorageIDs = new String[newnodes.length];
arraycopy(storageIDs, newStorageIDs, errorIndex);
// update the pipeline
setPipeline(newnodes, newStorageTypes, newStorageIDs);
// Just took care of a node error while waiting for a node restart
if (restartingNodeIndex.get() >= 0) {
// If the error came from a node further away than the restarting
// node, the restart must have been complete.
if (errorIndex > restartingNodeIndex.get()) {
restartingNodeIndex.set(-1);
} else if (errorIndex < restartingNodeIndex.get()) {
// the node index has shifted.
restartingNodeIndex.decrementAndGet();
} else {
// this shouldn't happen...
assert false;
}
}
if (restartingNodeIndex.get() == -1) {
hasError = false;
}
lastException.set(null);
errorIndex = -1;
}
// Check if replace-datanode policy is satisfied.
// 3. condition --> 1. replication >= 3
//                  2. current DN count <= replication/2 || isAppend || isHflushed
// e.g. with the default 3 replicas, a new DN is requested only when down to
// 1 DN in the pipeline, or on append / after an hflush call
if (dfsClient.dtpReplaceDatanodeOnFailure.satisfy(blockReplication,
nodes, isAppend, isHflushed)) {
try {
addDatanode2ExistingPipeline();
} catch(IOException ioe) {
if (!dfsClient.dtpReplaceDatanodeOnFailure.isBestEffort()) {
throw ioe;
}
DFSClient.LOG.warn("Failed to replace datanode."
+ " Continue with the remaining datanodes since "
+ DFSConfigKeys.DFS_CLIENT_WRITE_REPLACE_DATANODE_ON_FAILURE_BEST_EFFORT_KEY
+ " is set to true.", ioe);
}
}
// get a new generation stamp and an access token
// 4. Send the updateBlockForPipeline RPC to the NN to get a new generation stamp and access token for the block
LocatedBlock lb = dfsClient.namenode.updateBlockForPipeline(block, dfsClient.clientName);
newGS = lb.getBlock().getGenerationStamp();
accessToken = lb.getBlockToken();
// set up the pipeline again with the remaining nodes
// 5. Resend the writeBlock request via Sender and set up the pipeline again; returns true on success.
// Same setup as when the pipeline first moves from its initial state to the ready state.
if (failPacket) { // for testing
success = createBlockOutputStream(nodes, storageTypes, newGS, isRecovery);
failPacket = false;
try {
// Give DNs time to send in bad reports. In real situations,
// good reports should follow bad ones, if client committed
// with those nodes.
Thread.sleep(2000);
} catch (InterruptedException ie) {}
} else {
success = createBlockOutputStream(nodes, storageTypes, newGS, isRecovery);
}
// 6. Check again: declare a restarting DN that has exceeded its deadline a bad node
if (restartingNodeIndex.get() >= 0) {
assert hasError == true;
// check errorIndex set above
if (errorIndex == restartingNodeIndex.get()) {
// ignore, if came from the restarting node
errorIndex = -1;
}
// still within the deadline
if (Time.monotonicNow() < restartDeadline) {
continue; // with in the deadline
}
// expired. declare the restarting node dead
restartDeadline = 0;
int expiredNodeIndex = restartingNodeIndex.get();
restartingNodeIndex.set(-1);
DFSClient.LOG.warn("Datanode did not restart in time: " +
nodes[expiredNodeIndex]);
// Mark the restarting node as failed. If there is any other failed
// node during the last pipeline construction attempt, it will not be
// overwritten/dropped. In this case, the restarting node will get
// excluded in the following attempt, if it still does not come up.
if (errorIndex == -1) {
errorIndex = expiredNodeIndex;
}
// From this point on, normal pipeline recovery applies.
}
} // while
if (success) {
// update pipeline at the namenode
ExtendedBlock newBlock = new ExtendedBlock(
block.getBlockPoolId(), block.getBlockId(), block.getNumBytes(), newGS);
dfsClient.namenode.updatePipeline(dfsClient.clientName, block, newBlock,
nodes, storageIDs);
// update client side generation stamp
block = newBlock;
}
return false; // do not sleep, continue processing
}
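The `arraycopy(nodes, newnodes, errorIndex)` calls above drop the bad DN by copying every element except the one at `errorIndex` into an array one element shorter. A standalone sketch of that helper (the generic signature mirrors the private static helper in `DataStreamer`; the `String` entries are stand-ins for `DatanodeInfo`):

```java
// Sketch of DataStreamer's arraycopy helper: copy srcs into dsts, which has
// one fewer slot, skipping the element at skipIndex. Used to drop the bad DN
// from nodes, storageTypes, and storageIDs in one consistent way.
import java.util.Arrays;

public class DropBadNode {
    static <T> void arraycopy(T[] srcs, T[] dsts, int skipIndex) {
        // elements before the bad index
        System.arraycopy(srcs, 0, dsts, 0, skipIndex);
        // elements after the bad index, shifted left by one
        System.arraycopy(srcs, skipIndex + 1, dsts, skipIndex, dsts.length - skipIndex);
    }

    public static void main(String[] args) {
        String[] nodes = {"dn1", "dn2", "dn3"};
        String[] newnodes = new String[nodes.length - 1];
        arraycopy(nodes, newnodes, 1);                  // dn2 was marked bad
        System.out.println(Arrays.toString(newnodes));  // [dn1, dn3]
    }
}
```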