- 源文档:hadoop.apache.org/docs/r2.10.…
- 旨在对照研读
HDFS High Availability Using the Quorum Journal Manager
Purpose 目的
-
This guide provides an overview of the HDFS High Availability (HA) feature and how to configure and manage an HA HDFS cluster, using the Quorum Journal Manager (QJM) feature.
-
本指南概述了HDFS高可用性(HA)功能以及如何使用Quorum Journal Manager(QJM)功能配置和管理HA HDFS群集。
-
This document assumes that the reader has a general understanding of general components and node types in an HDFS cluster. Please refer to the HDFS Architecture guide for details.
-
本文档假定读者对HDFS群集中的常规组件和节点类型有一般的了解。有关详细信息,请参阅HDFS体系结构指南。
Note: Using the Quorum Journal Manager or Conventional Shared Storage 注意:使用Quorum Journal Manager或常规共享存储
- This guide discusses how to configure and use HDFS HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes. For information on how to configure HDFS HA using NFS for shared storage instead of the QJM, please see this alternative guide. For information on how to configure HDFS HA with Observer NameNode, please see this guide.
- 本指南讨论如何使用Quorum Journal Manager(QJM)配置和使用HDFS HA,以在Active和Standby NameNode之间共享编辑日志。有关如何使用NFS(而非QJM)作为共享存储来配置HDFS HA的信息,请参阅相应的替代指南。有关如何使用Observer NameNode配置HDFS HA的信息,请参阅相应指南。
Background 背景
-
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
-
在Hadoop 2.0.0之前,NameNode是HDFS集群中的单点故障(SPOF)。每个群集只有一个NameNode,如果该计算机或进程不可用,则在NameNode重新启动或在另一台计算机上启动之前,整个群集都将不可用。
-
This impacted the total availability of the HDFS cluster in two major ways:
-
这从两个方面影响了HDFS群集的总可用性:
-
In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.
-
如果发生意外事件(例如机器崩溃),则在操作员重新启动NameNode之前,群集将不可用。
-
Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in windows of cluster downtime.
-
计划内的维护事件(例如NameNode计算机上的软件或硬件升级)将导致群集停机的时间窗口。
-
-
The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
-
HDFS高可用性功能通过提供在同一群集中以Active/Passive配置运行两个冗余NameNode(带热备)的选项来解决上述问题。这样,在计算机崩溃时可以快速故障转移到新的NameNode,也可以出于计划维护的目的由管理员发起平滑的故障转移。
Architecture 架构
-
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
-
在典型的HA群集中,将两个单独的计算机配置为NameNode。在任何时间点,恰好其中一个NameNode处于活动状态,而另一个处于Standby状态。Active NameNode负责群集中的所有客户端操作,而Standby只是充当从属,并保持足够的状态以在必要时提供快速故障转移。
-
In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
-
为了使Standby节点保持其状态与Active节点同步,两个节点都与一组称为“JournalNodes”(JN)的独立守护程序通信。当Active节点执行任何名称空间修改时,它会将修改记录持久地写入这些JN中的多数。Standby节点能够从JN读取编辑内容,并持续监视编辑日志的更改。当Standby节点看到这些编辑时,会将其应用到自己的名称空间。发生故障转移时,Standby会确保在将自身提升为Active状态之前,已从JournalNode读取所有编辑内容。这可确保在故障转移发生之前,名称空间状态已完全同步。
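The majority-write rule above can be sketched in a few lines of Python (an illustrative model only, not Hadoop code; the function name is hypothetical): 上面“写入多数JN才算持久”的规则可用几行Python示意(仅为说明性模型,并非Hadoop代码,函数名为假设):

```python
def is_durable(ack_count, total_journalnodes):
    # Illustrative helper, not a Hadoop API: an edit is considered
    # durably logged once a majority (more than half) of the
    # JournalNodes have acknowledged the write.
    return ack_count > total_journalnodes // 2
```

With 3 JNs, 2 acknowledgements suffice; 1 does not. 对于3个JN,2个确认即足够,1个则不够。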
-
In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
-
为了提供快速故障转移,备用节点还必须具有有关集群中块位置的最新信息。为了实现这一点,DataNode被配置了两个NameNode的位置,并向两者发送块位置信息和心跳。
-
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.
-
对于HA群集的正确运行而言,至关重要的是一次只能有一个NameNode处于Active状态。否则,名称空间状态将在两者之间迅速出现分歧,从而带来数据丢失或其他不正确结果的风险。为了确保这一性质并防止所谓的“裂脑”情况,JournalNode在任一时刻只允许单个NameNode作为写入者。在故障转移期间,即将变为Active状态的NameNode将直接接管写入JournalNode的角色,这将有效地阻止另一个NameNode继续处于Active状态,从而使新的Active节点可以安全地完成故障转移。
Hardware resources 硬件资源
-
In order to deploy an HA cluster, you should prepare the following:
-
为了部署高可用性群集,您应该准备以下内容:
-
NameNode machines - the machines on which you run the Active and Standby NameNodes should have equivalent hardware to each other, and equivalent hardware to what would be used in a non-HA cluster.
-
NameNode计算机 - 运行Active和Standby NameNode的计算机应彼此具有等效的硬件,并且与非HA群集中所用的硬件等效。
-
JournalNode machines - the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager. Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally.
-
JournalNode计算机 - 运行JournalNode的计算机。JournalNode守护程序相对轻量,因此可以合理地将这些守护程序与其他Hadoop守护程序(例如NameNode、JobTracker或YARN ResourceManager)并置在同一计算机上。注意:必须至少有3个JournalNode守护程序,因为编辑日志的修改必须写入多数JN。这将允许系统容忍单台计算机的故障。您也可以运行3个以上的JournalNode,但为了实际增加系统可容忍的故障数量,应运行奇数个JN(即3、5、7等)。请注意,当运行N个JournalNode时,系统最多可以容忍(N - 1) / 2个故障并继续正常运行。
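The (N - 1) / 2 tolerance formula above can be checked with a small helper (illustrative only; the function name is an assumption): 上述(N - 1) / 2容错公式可用一个小函数验证(仅作示意,函数名为假设):

```python
def tolerated_failures(n_journalnodes):
    # With N JournalNodes, writes need a majority, so the system
    # tolerates at most (N - 1) // 2 JournalNode failures.
    return (n_journalnodes - 1) // 2
```

Note that 4 JNs tolerate no more failures than 3, which is why an odd count is recommended. 注意4个JN的容错并不比3个多,这正是推荐奇数个JN的原因。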
-
-
Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.
-
请注意,在HA群集中,Standby NameNode还会执行名称空间状态的检查点操作,因此无需在HA群集中运行Secondary NameNode、CheckpointNode或BackupNode。实际上,这样做反而是错误的。这也使得将未启用HA的HDFS群集重新配置为启用HA的用户,可以重用先前专用于Secondary NameNode的硬件。
Deployment 部署方式
Configuration overview 配置概述
-
Similar to Federation configuration, HA configuration is backward compatible and allows existing single NameNode configurations to work without change. The new configuration is designed such that all the nodes in the cluster may have the same configuration without the need for deploying different configuration files to different machines based on the type of the node.
-
与Federation配置类似,HA配置向后兼容,允许现有的单NameNode配置无需更改即可工作。新配置的设计使群集中的所有节点可以具有相同的配置,而无需根据节点类型将不同的配置文件部署到不同的计算机。
-
Like HDFS Federation, HA clusters reuse the nameservice ID to identify a single HDFS instance that may in fact consist of multiple HA NameNodes. In addition, a new abstraction called NameNode ID is added with HA. Each distinct NameNode in the cluster has a different NameNode ID to distinguish it. To support a single configuration file for all of the NameNodes, the relevant configuration parameters are suffixed with the nameservice ID as well as the NameNode ID.
-
与HDFS Federation一样,HA群集重用nameservice ID来标识单个HDFS实例,该实例实际上可能由多个HA NameNode组成。此外,HA还添加了一个名为NameNode ID的新抽象。群集中每个不同的NameNode都有不同的NameNode ID以作区分。为了让所有NameNode共用单个配置文件,相关的配置参数以nameservice ID和NameNode ID作为后缀。
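The suffixing scheme can be sketched as follows; the helper name is hypothetical, not a Hadoop API: 后缀规则可如下示意;函数名为假设,并非Hadoop API:

```python
def ha_conf_key(base, nameservice_id, namenode_id=None):
    # Illustrative: per-NameNode settings are suffixed with both IDs,
    # e.g. dfs.namenode.rpc-address.mycluster.nn1, while
    # nameservice-wide settings carry only the nameservice ID.
    parts = [base, nameservice_id]
    if namenode_id is not None:
        parts.append(namenode_id)
    return ".".join(parts)
```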
Configuration details 配置细节
-
To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml configuration file.
-
要配置HA NameNode,必须将多个配置选项添加到hdfs-site.xml配置文件中。
-
The order in which you set these configurations is unimportant, but the values you choose for dfs.nameservices and dfs.ha.namenodes.[nameservice ID] will determine the keys of those that follow. Thus, you should decide on these values before setting the rest of the configuration options.
-
设置这些配置的顺序并不重要,但您为dfs.nameservices和dfs.ha.namenodes.[nameservice ID]选择的值将决定后续配置项的键。因此,您应在设置其余配置选项之前确定这些值。
-
dfs.nameservices - the logical name for this new nameservice 此新名称服务的逻辑名称
-
Choose a logical name for this nameservice, for example “mycluster”, and use this logical name for the value of this config option. The name you choose is arbitrary. It will be used both for configuration and as the authority component of absolute HDFS paths in the cluster.
-
选择此名称服务的逻辑名称,例如“mycluster”,并将此逻辑名称用作此配置选项的值。您选择的名称是任意的。它既用于配置,也用作群集中绝对HDFS路径的授权(authority)组成部分。
-
Note: If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list.
-
注意:如果您还使用HDFS Federation,则此配置设置还应包括其他名称服务(HA或其他)的列表,以逗号分隔的列表。
-
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
-
dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice 名称服务中每个NameNode的唯一标识符
-
Configure with a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the NameNodes in the cluster. For example, if you used “mycluster” as the nameservice ID previously, and you wanted to use “nn1” and “nn2” as the individual IDs of the NameNodes, you would configure this as such:
-
使用逗号分隔的NameNode ID列表进行配置。DataNode将使用它来确定群集中的所有NameNode。例如,如果您之前使用“mycluster”作为nameservice ID,并且想用“nn1”和“nn2”作为两个NameNode各自的ID,则可以这样配置:
-
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
-
Note: Currently, only a maximum of two NameNodes may be configured per nameservice.
-
注意:目前,每个名称服务最多只能配置两个NameNode。
-
dfs.namenode.rpc-address.[nameservice ID].[name node ID] - the fully-qualified RPC address for each NameNode to listen on 每个NameNode监听的标准RPC地址
-
For both of the previously-configured NameNode IDs, set the full address and IPC port of the NameNode process. Note that this results in two separate configuration options. For example:
-
对于两个先前配置的NameNode ID,请设置NameNode进程的完整地址和IPC端口。请注意,这将导致两个单独的配置选项。例如:
-
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8020</value>
</property>
-
Note: You may similarly configure the “servicerpc-address” setting if you so desire.
-
注意:如果需要,您可以类似地配置“servicerpc-address”设置。
-
dfs.namenode.http-address.[nameservice ID].[name node ID] - the fully-qualified HTTP address for each NameNode to listen on 每个NameNode监听的标准HTTP地址
-
Similarly to rpc-address above, set the addresses for both NameNodes’ HTTP servers to listen on. For example:
-
与上面的rpc-address相似,为两个NameNode的HTTP服务器设置地址以进行侦听。例如:
-
<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>machine1.example.com:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>machine2.example.com:50070</value>
</property>
-
Note: If you have Hadoop’s security features enabled, you should also set the https-address similarly for each NameNode.
-
注意:如果启用了Hadoop的安全功能,则还应该为每个NameNode类似地设置https-address。
-
dfs.namenode.shared.edits.dir - the URI which identifies the group of JNs where the NameNodes will write/read edits 标识NameNode将在其中写入/读取编辑内容的JN组的URI
-
This is where one configures the addresses of the JournalNodes which provide the shared edits storage, written to by the Active nameNode and read by the Standby NameNode to stay up-to-date with all the file system changes the Active NameNode makes. Though you must specify several JournalNode addresses, you should only configure one of these URIs. The URI should be of the form:
qjournal://host1:port1;host2:port2;host3:port3/journalId. The Journal ID is a unique identifier for this nameservice, which allows a single set of JournalNodes to provide storage for multiple federated namesystems. Though not a requirement, it's a good idea to reuse the nameservice ID for the journal identifier.
-
在这里配置提供共享编辑存储的JournalNode的地址;Active NameNode向其写入,Standby NameNode从中读取,以与Active NameNode所做的所有文件系统更改保持同步。尽管您必须指定多个JournalNode地址,但只应配置其中一个URI。URI的格式应为:qjournal://host1:port1;host2:port2;host3:port3/journalId。Journal ID是此名称服务的唯一标识符,它允许一组JournalNode为多个联合名称系统提供存储。虽然不是必需的,但将nameservice ID重用作日志标识符是个好主意。
-
For example, if the JournalNodes for this cluster were running on the machines “node1.example.com”, “node2.example.com”, and “node3.example.com” and the nameservice ID were “mycluster”, you would use the following as the value for this setting (the default port for the JournalNode is 8485):
-
例如,如果此群集的JournalNode运行在计算机“node1.example.com”、“node2.example.com”和“node3.example.com”上,并且nameservice ID为“mycluster”,则应使用以下内容作为此设置的值(JournalNode的默认端口为8485):
-
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>
-
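As a sketch, the URI above can be assembled from its parts (hostnames reuse the doc's examples; the helper function is hypothetical): 作为示意,上面的URI可由其组成部分拼装而成(主机名沿用文中示例,函数为假设):

```python
def qjournal_uri(hosts, journal_id, port=8485):
    # Illustrative: builds qjournal://host1:port;host2:port;.../journalId
    # (8485 is the JournalNode default port per the doc above).
    authority = ";".join("%s:%d" % (h, port) for h in hosts)
    return "qjournal://%s/%s" % (authority, journal_id)
```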
dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients use to contact the Active NameNode HDFS客户端用于联系活动NameNode的Java类
-
Configure the name of the Java class which will be used by the DFS Client to determine which NameNode is the current Active, and therefore which NameNode is currently serving client requests. The two implementations which currently ship with Hadoop are the ConfiguredFailoverProxyProvider and the RequestHedgingProxyProvider (which, for the first call, concurrently invokes all namenodes to determine the active one, and on subsequent requests, invokes the active namenode until a fail-over happens), so use one of these unless you are using a custom proxy provider. For example:
-
配置一个Java类的名称,DFS客户端将用它来确定哪个NameNode当前处于Active状态,从而确定哪个NameNode当前正在服务客户端请求。Hadoop目前附带两个实现:ConfiguredFailoverProxyProvider和RequestHedgingProxyProvider(后者在第一次调用时并发调用所有NameNode以确定活动者,在后续请求中则调用该活动NameNode,直到发生故障转移),因此除非您使用自定义代理提供程序,否则请使用其中之一。例如:
-
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
-
dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover 脚本或Java类的列表,将在故障转移期间用来隔离Active NameNode
-
It is desirable for correctness of the system that only one NameNode be in the Active state at any given time. Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario. However, when a failover occurs, it is still possible that the previous Active NameNode could serve read requests to clients, which may be out of date until that NameNode shuts down when trying to write to the JournalNodes. For this reason, it is still desirable to configure some fencing methods even when using the Quorum Journal Manager. However, to improve the availability of the system in the event the fencing mechanisms fail, it is advisable to configure a fencing method which is guaranteed to return success as the last fencing method in the list. Note that if you choose to use no actual fencing methods, you still must configure something for this setting, for example “shell(/bin/true)”.
-
为了保证系统的正确性,在任何给定时间只应有一个NameNode处于Active状态。重要的是,使用Quorum Journal Manager时,只允许一个NameNode写入JournalNode,因此不会因裂脑情况而损坏文件系统元数据。但是,发生故障转移时,之前的Active NameNode仍可能向客户端提供读取请求,这些数据可能已过期,直到该NameNode在尝试写入JournalNode时自行关闭为止。因此,即使使用Quorum Journal Manager,仍然建议配置一些防护方法。不过,为了在防护机制失败时提高系统可用性,建议将一种保证返回成功的防护方法配置为列表中的最后一种。请注意,如果您选择不使用任何实际的防护方法,仍必须为此设置配置一些内容,例如“shell(/bin/true)”。
-
The fencing methods used during a failover are configured as a carriage-return-separated list, which will be attempted in order until one indicates that fencing has succeeded. There are two methods which ship with Hadoop: shell and sshfence. For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.
-
故障转移期间使用的防护方法配置为以回车符分隔的列表,将按顺序尝试该列表,直到指示防护成功为止。Hadoop附带两种方法:shell和sshfence。有关实现自己的自定义防护方法的信息,请参见org.apache.hadoop.ha.NodeFencer类。
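The try-in-order semantics can be modeled as follows (illustrative only, not the org.apache.hadoop.ha.NodeFencer implementation): 这种按顺序尝试的语义可如下建模(仅作示意,并非org.apache.hadoop.ha.NodeFencer的实现):

```python
def run_fencing(methods):
    # Illustrative: try each configured fencing method in order;
    # fencing succeeds as soon as any one method reports success.
    for fence in methods:
        if fence():
            return True
    return False
```

This also shows why placing a method guaranteed to succeed (such as shell(/bin/true)) last preserves availability. 这也说明了为何把保证成功的方法(如shell(/bin/true))放在最后可保证可用性。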
-
sshfence - SSH to the Active NameNode and kill the process SSH到Active NameNode并终止进程
-
The sshfence option SSHes to the target node and uses fuser to kill the process listening on the service’s TCP port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, one must also configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files. For example:
-
sshfence选项通过SSH连接到目标节点,并使用fuser终止在该服务TCP端口上侦听的进程。为了使该防护选项起作用,它必须能够在不提供密码的情况下SSH到目标节点。因此,还必须配置dfs.ha.fencing.ssh.private-key-files选项,该选项是以逗号分隔的SSH私钥文件列表。例如:
-
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/exampleuser/.ssh/id_rsa</value>
</property>
-
Optionally, one may configure a non-standard username or port to perform the SSH. One may also configure a timeout, in milliseconds, for the SSH, after which this fencing method will be considered to have failed. It may be configured like so:
-
可以选择配置一个非标准的用户名或端口来执行SSH。您还可以为SSH配置一个以毫秒为单位的超时,此后该防护方法将被视为失败。可以这样配置:
-
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence([[username][:port]])</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>
-
shell - run an arbitrary shell command to fence the Active NameNode 运行一个任意的shell命令来隔离Active NameNode
-
The shell fencing method runs an arbitrary shell command. It may be configured like so:
-
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>
-
The string between ‘(’ and ‘)’ is passed directly to a bash shell and may not include any closing parentheses.
-
The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables, with the ‘_’ character replacing any ‘.’ characters in the configuration keys. The configuration used has already had any namenode-specific configurations promoted to their generic forms – for example dfs_namenode_rpc-address will contain the RPC address of the target node, even though the configuration may specify that variable as dfs.namenode.rpc-address.ns1.nn1.
-
shell命令将在一个环境中运行,该环境包含所有当前的Hadoop配置变量,其中配置键里的“.”字符都被替换为“_”字符。所使用的配置已将任何特定于NameNode的配置提升为通用形式:例如,dfs_namenode_rpc-address将包含目标节点的RPC地址,即使配置中可能将该变量指定为dfs.namenode.rpc-address.ns1.nn1。
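The key transformation can be shown in one line (illustrative; the helper name is an assumption): 键名转换可用一行代码演示(仅作示意,函数名为假设):

```python
def env_key(conf_key):
    # Illustrative: Hadoop exports configuration to the fencing
    # script's environment with '.' replaced by '_' in each key.
    return conf_key.replace(".", "_")
```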
-
Additionally, the following variables referring to the target node to be fenced are also available:
-
此外,还提供以下变量,这些变量引用要隔离的目标节点:
- $target_host hostname of the node to be fenced 要隔离的节点的主机名
- $target_port IPC port of the node to be fenced 要隔离的节点的IPC端口
- $target_address the above two, combined as host:port 以上两个,合并为主机:端口
- $target_nameserviceid the nameservice ID of the NN to be fenced 被隔离的NN的名称服务ID
- $target_namenodeid the namenode ID of the NN to be fenced 被隔离的NN的NameNode ID
-
These environment variables may also be used as substitutions in the shell command itself. For example:
-
这些环境变量也可以在shell命令本身中用作替代。例如:
-
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
</property>
-
If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit code, the fencing was not successful and the next fencing method in the list will be attempted.
-
如果shell命令返回退出代码0,则确定防护成功。如果返回任何其他退出代码,则防护不成功,并且将尝试列表中的下一个防护方法。
-
Note: This fencing method does not implement any timeout. If timeouts are necessary, they should be implemented in the shell script itself (eg by forking a subshell to kill its parent in some number of seconds).
-
注意:此防护方法未实现任何超时。如果需要超时,应在shell脚本本身中实现(例如,通过派生一个subshell在若干秒后杀死其父进程)。
-
fs.defaultFS - the default path prefix used by the Hadoop FS client when none is given 当未指定任何路径时,Hadoop FS客户端使用的默认路径前缀
-
Optionally, you may now configure the default path for Hadoop clients to use the new HA-enabled logical URI. If you used “mycluster” as the nameservice ID earlier, this will be the value of the authority portion of all of your HDFS paths. This may be configured like so, in your core-site.xml file:
-
(可选)您现在可以为Hadoop客户端配置默认路径,以使用新的启用HA的逻辑URI。如果之前使用“mycluster”作为nameservice ID,它将成为所有HDFS路径中授权(authority)部分的值。可以在core-site.xml文件中按如下方式配置:
-
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>
-
dfs.journalnode.edits.dir - the path where the JournalNode daemon will store its local state JournalNode守护程序将存储其本地状态的路径
-
This is the absolute path on the JournalNode machines where the edits and other local state used by the JNs will be stored. You may only use a single path for this configuration. Redundancy for this data is provided by running multiple separate JournalNodes, or by configuring this directory on a locally-attached RAID array. For example:
-
这是JournalNode机器上将存储JN使用的编辑和其他本地状态的绝对路径。您只能为此配置使用一条路径。通过运行多个单独的JournalNode或在本地连接的RAID阵列上配置此目录,可以提供此数据的冗余。例如:
-
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/path/to/journal/node/local/data</value>
</property>
-
Deployment details 部署细节
-
After all of the necessary configuration options have been set, you must start the JournalNode daemons on the set of machines where they will run. This can be done by running the command “hadoop-daemon.sh start journalnode” and waiting for the daemon to start on each of the relevant machines.
-
设置完所有必需的配置选项后,必须在将要运行JournalNode守护程序的一组计算机上启动它们。这可以通过运行命令“hadoop-daemon.sh start journalnode”并等待守护程序在每台相关计算机上启动来完成。
-
Once the JournalNodes have been started, one must initially synchronize the two HA NameNodes’ on-disk metadata.
-
一旦启动JournalNode,便必须首先同步两个HA NameNode的磁盘元数据。
-
If you are setting up a fresh HDFS cluster, you should first run the format command (hdfs namenode -format) on one of NameNodes.
-
如果要设置新的HDFS群集,则应首先在其中一个NameNode上运行format命令(hdfs namenode -format)。
-
If you have already formatted the NameNode, or are converting a non-HA-enabled cluster to be HA-enabled, you should now copy over the contents of your NameNode metadata directories to the other, unformatted NameNode by running the command “hdfs namenode -bootstrapStandby” on the unformatted NameNode. Running this command will also ensure that the JournalNodes (as configured by dfs.namenode.shared.edits.dir) contain sufficient edits transactions to be able to start both NameNodes.
-
如果您已经格式化了NameNode,或者正在将未启用HA的群集转换为启用HA的群集,则现在应该在未格式化的NameNode上运行命令“hdfs namenode -bootstrapStandby”,将NameNode元数据目录的内容复制到该未格式化的NameNode。运行此命令还将确保JournalNode(由dfs.namenode.shared.edits.dir配置)包含足够的编辑事务,从而能够启动两个NameNode。
-
If you are converting a non-HA NameNode to be HA, you should run the command “hdfs namenode -initializeSharedEdits”, which will initialize the JournalNodes with the edits data from the local NameNode edits directories.
-
如果要将非HA NameNode转换为HA,则应运行命令“ hdfs namenode -initializeSharedEdits ”,该命令将使用本地NameNode edits目录中的edits数据初始化JournalNodes。
-
-
At this point you may start both of your HA NameNodes as you normally would start a NameNode.
-
此时,您可以像平常启动NameNode一样启动两个HA NameNode。
-
You can visit each of the NameNodes’ web pages separately by browsing to their configured HTTP addresses. You should notice that next to the configured address will be the HA state of the NameNode (either “standby” or “active”.) Whenever an HA NameNode starts, it is initially in the Standby state.
-
您可以通过浏览到它们的已配置HTTP地址来分别访问每个NameNode的网页。您应注意,配置的地址旁边将是NameNode的HA状态(“待机”或“活动”)。无论何时启动HA NameNode,它最初都处于Standby状态。
Administrative commands 行政命令
-
Now that your HA NameNodes are configured and started, you will have access to some additional commands to administer your HA HDFS cluster. Specifically, you should familiarize yourself with all of the subcommands of the “hdfs haadmin” command. Running this command without any additional arguments will display the following usage information:
-
现在您的HA NameNode已配置并启动,您将可以使用一些附加命令来管理HA HDFS群集。具体来说,您应该熟悉“hdfs haadmin”命令的所有子命令。在不带任何附加参数的情况下运行此命令将显示以下用法信息:
-
Usage: haadmin
    [-transitionToActive <serviceId>]
    [-transitionToStandby <serviceId>]
    [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
    [-getServiceState <serviceId>]
    [-getAllServiceState]
    [-checkHealth <serviceId>]
    [-help <command>]
-
This guide describes high-level uses of each of these subcommands. For specific usage information of each subcommand, you should run “hdfs haadmin -help <command>”.
-
本指南描述了每个子命令的高级用法。有关每个子命令的具体用法信息,应运行“hdfs haadmin -help <command>”。
-
transitionToActive and transitionToStandby - transition the state of the given NameNode to Active or Standby 将给定NameNode的状态转换为Active或Standby
-
These subcommands cause a given NameNode to transition to the Active or Standby state, respectively. These commands do not attempt to perform any fencing, and thus should rarely be used. Instead, one should almost always prefer to use the “hdfs haadmin -failover” subcommand.
-
这些子命令使给定的NameNode分别转换为Active或Standby状态。这些命令不会尝试执行任何防护,因此应很少使用。相反,几乎总是应该优先使用“hdfs haadmin -failover”子命令。
-
failover - initiate a failover between two NameNodes 启动两个NameNode之间的故障转移
-
This subcommand causes a failover from the first provided NameNode to the second. If the first NameNode is in the Standby state, this command simply transitions the second to the Active state without error. If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, the fencing methods (as configured by dfs.ha.fencing.methods) will be attempted in order until one succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned.
-
此子命令发起从第一个指定的NameNode到第二个NameNode的故障转移。如果第一个NameNode处于Standby状态,此命令只是将第二个NameNode转换为Active状态,不会报错。如果第一个NameNode处于Active状态,则会尝试将其平滑地转换为Standby状态。如果失败,将按顺序尝试防护方法(由dfs.ha.fencing.methods配置),直到某一方法成功为止。只有在此过程之后,第二个NameNode才会转换为Active状态。如果没有任何防护方法成功,则第二个NameNode不会转换为Active状态,并返回错误。
-
getServiceState - determine whether the given NameNode is Active or Standby 确定给定的NameNode是活动的还是备用的
-
Connect to the provided NameNode to determine its current state, printing either “standby” or “active” to STDOUT appropriately. This subcommand might be used by cron jobs or monitoring scripts which need to behave differently based on whether the NameNode is currently Active or Standby.
-
连接到指定的NameNode以确定其当前状态,并相应地在STDOUT上打印“standby”或“active”。需要根据NameNode当前处于Active还是Standby状态而采取不同行为的cron作业或监视脚本,可以使用此子命令。
-
getAllServiceState - returns the state of all the NameNodes 返回所有NameNode的状态
-
Connect to the configured NameNodes to determine the current state, print either “standby” or “active” to STDOUT appropriately.
-
连接到已配置的NameNode以确定当前状态,并在STDOUT上适当打印“待机”或“活动”。
-
checkHealth - check the health of the given NameNode 检查给定NameNode的运行状况
-
Connect to the provided NameNode to check its health. The NameNode is capable of performing some diagnostics on itself, including checking if internal services are running as expected. This command will return 0 if the NameNode is healthy, non-zero otherwise. One might use this command for monitoring purposes.
-
连接到提供的NameNode来检查其运行状况。NameNode能够对其自身执行一些诊断,包括检查内部服务是否按预期运行。如果NameNode正常,此命令将返回0,否则返回非零。人们可能会使用此命令进行监视。
-
Note: This is not yet implemented, and at present will always return success, unless the given NameNode is completely down.
-
注意:这尚未实现,除非给定的NameNode完全关闭,否则当前将始终返回成功。
-
Load Balancer Setup 负载均衡器设置
- If you are running a set of NameNodes behind a Load Balancer (e.g. Azure or AWS ) and would like the Load Balancer to point to the active NN, you can use the /isActive HTTP endpoint as a health probe. http://NN_HOSTNAME/isActive will return a 200 status code response if the NN is in Active HA State, 405 otherwise.
- 如果您在负载均衡器(例如Azure或AWS)后面运行一组NameNode,并且希望负载均衡器指向活动的NN,则可以将/isActive HTTP端点用作运行状况探测。如果NN处于Active HA状态,http://NN_HOSTNAME/isActive将返回200状态码响应,否则返回405。
In-Progress Edit Log Tailing 进行中的编辑日志拖尾
-
Under the default settings, the Standby NameNode will only apply edits that are present in edit log segments which have been finalized. If it is desirable to have a Standby NameNode which has more up-to-date namespace information, it is possible to enable tailing of in-progress edit segments. This setting will attempt to fetch edits from an in-memory cache on the JournalNodes and can reduce the lag time before a transaction is applied on the Standby NameNode to the order of milliseconds. If an edit cannot be served from the cache, the Standby will still be able to retrieve it, but the lag time will be much longer. The relevant configurations are:
-
在默认设置下,Standby NameNode仅应用已完成(finalized)的编辑日志段中的编辑。如果希望Standby NameNode拥有更及时的名称空间信息,可以启用对进行中编辑段的拖尾。此设置将尝试从JournalNode上的内存缓存中获取编辑,可将事务应用到Standby NameNode之前的延迟时间缩短到毫秒量级。如果某个编辑无法从缓存中获得,Standby仍能检索到它,但延迟时间会长得多。相关配置为:
-
dfs.ha.tail-edits.in-progress - Whether or not to enable tailing on in-progress edit logs. This will also enable the in-memory edit cache on the JournalNodes. Disabled by default. 是否启用对进行中编辑日志的拖尾。这还将在JournalNode上启用内存中编辑缓存。默认禁用。
-
dfs.journalnode.edit-cache-size.bytes - The size of the in-memory cache of edits on the JournalNode. Edits take around 200 bytes each in a typical environment, so, for example, the default of 1048576 (1MB) can hold around 5000 transactions. It is recommended to monitor the JournalNode metrics RpcRequestCacheMissAmountNumMisses and RpcRequestCacheMissAmountAvgTxns, which respectively count the number of requests unable to be served by the cache, and the extra number of transactions which would have needed to have been in the cache for the request to succeed. For example, if a request attempted to fetch edits starting at transaction ID 10, but the oldest data in the cache was at transaction ID 20, a value of 10 would be added to the average. JournalNode上内存中编辑缓存的大小。在典型环境中,每条编辑约占200字节,因此默认值1048576(1MB)大约可容纳5000个事务。建议监视JournalNode指标RpcRequestCacheMissAmountNumMisses和RpcRequestCacheMissAmountAvgTxns,它们分别统计无法由缓存提供服务的请求数,以及请求若要成功还需存在于缓存中的额外事务数。例如,如果某个请求尝试获取从事务ID 10开始的编辑,而缓存中最旧的数据位于事务ID 20,则会将数值10计入该平均值。
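The sizing arithmetic works out as follows (illustrative; ~200 bytes per edit is the doc's rough estimate, and the helper name is an assumption): 容量估算如下(仅作示意;每条编辑约200字节为文中的粗略估计,函数名为假设):

```python
def cache_capacity_txns(cache_bytes=1048576, bytes_per_edit=200):
    # Illustrative: with the default 1 MB cache and ~200 bytes per
    # edit, roughly 5000 transactions fit in the cache.
    return cache_bytes // bytes_per_edit
```

Doubling dfs.journalnode.edit-cache-size.bytes roughly doubles the number of cacheable transactions. 将dfs.journalnode.edit-cache-size.bytes加倍,可缓存的事务数也大致加倍。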
-
-
This feature is primarily useful in conjunction with the Standby/Observer Read feature. Using this feature, read requests can be serviced from non-active NameNodes; thus tailing in-progress edits provides these nodes with the ability to serve requests with data which is much more fresh. See the Apache JIRA ticket HDFS-12943 for more information on this feature.
-
该功能主要与Standby/Observer读取功能结合使用。使用此功能,读取请求可以由非Active的NameNode提供服务;因此,拖尾进行中的编辑使这些节点能够以更新鲜的数据响应请求。有关此功能的更多信息,请参见Apache JIRA工单HDFS-12943。
Automatic Failover 自动故障转移
Introduction 介绍
- The above sections describe how to configure manual failover. In that mode, the system will not automatically trigger a failover from the active to the standby NameNode, even if the active node has failed. This section describes how to configure and deploy automatic failover.
Components
-
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
-
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:
-
Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.
-
Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active.
-
The ZKFailoverController (ZKFC) is a new component: a ZooKeeper client that also monitors and manages the state of the NameNode. Each machine that runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:
-
Health monitoring - the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.
-
ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special “lock” znode. This lock uses ZooKeeper’s support for “ephemeral” nodes; if the session expires, the lock node will be automatically deleted.
-
ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has “won the election”, and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.
-
For more details on the design of automatic failover, refer to the design document attached to HDFS-2185 on the Apache HDFS JIRA.
Deploying ZooKeeper
-
In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper process on the same node as the YARN ResourceManager. It is advisable to configure the ZooKeeper nodes to store their data on separate disk drives from the HDFS metadata for best performance and isolation.
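A minimal three-node ensemble following the layout described above might use a zoo.cfg sketch like the following. All hostnames and paths here are illustrative, and the dataDir deliberately points at a disk that does not hold HDFS metadata:

```
# zoo.cfg sketch; hostnames and paths are illustrative
tickTime=2000
initLimit=10
syncLimit=5
# keep ZooKeeper data off the drives that hold HDFS metadata
dataDir=/var/lib/zookeeper
clientPort=2181
# collocated with the two NameNodes and the YARN ResourceManager
server.1=nn1.example.com:2888:3888
server.2=nn2.example.com:2888:3888
server.3=rm1.example.com:2888:3888
```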
-
The setup of ZooKeeper is out of scope for this document. We will assume that you have set up a ZooKeeper cluster running on three or more nodes, and have verified its correct operation by connecting using the ZK CLI.
Before you begin
- Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.
Configuring automatic failover
- The configuration of automatic failover requires the addition of two new parameters to your configuration. In your hdfs-site.xml file, add:
-
```
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```
- This specifies that the cluster should be set up for automatic failover. In your core-site.xml file, add:
```
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```
-
This lists the host-port pairs running the ZooKeeper service.
-
As with the parameters described earlier in the document, these settings may be configured on a per-nameservice basis by suffixing the configuration key with the nameservice ID. For example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting dfs.ha.automatic-failover.enabled.my-nameservice-id.
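For instance, in a federated cluster with a nameservice named my-nameservice-id (the hypothetical ID used above), automatic failover could be enabled for that nameservice alone:

```
<property>
  <name>dfs.ha.automatic-failover.enabled.my-nameservice-id</name>
  <value>true</value>
</property>
```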
-
There are also several other configuration parameters which may be set to control the behavior of automatic failover; however, they are not necessary for most installations. Please refer to the configuration key specific documentation for details.
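One such optional key is the ZooKeeper session timeout, which bounds how quickly a failed active NameNode is detected. The value shown below is the default (5 seconds), mentioned again in the verification section later in this guide:

```
<property>
  <name>ha.zookeeper.session-timeout.ms</name>
  <value>5000</value>
</property>
```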
Initializing HA state in ZooKeeper
-
After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs zkfc -formatZK
-
This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.
Starting the cluster with start-dfs.sh
- Since automatic failover has been enabled in the configuration, the start-dfs.sh script will now automatically start a ZKFC daemon on any machine that runs a NameNode. When the ZKFCs start, they will automatically select one of the NameNodes to become active.
Starting the cluster manually
-
If you manually manage the services on your cluster, you will need to manually start the zkfc daemon on each of the machines that runs a NameNode. You can start the daemon by running:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --script $HADOOP_PREFIX/bin/hdfs start zkfc
Securing access to ZooKeeper
-
If you are running a secure cluster, you will likely want to ensure that the information stored in ZooKeeper is also secured. This prevents malicious clients from modifying the metadata in ZooKeeper or potentially triggering a false failover.
-
In order to secure the information in ZooKeeper, first add the following to your core-site.xml file:
-
```
<property>
  <name>ha.zookeeper.auth</name>
  <value>@/path/to/zk-auth.txt</value>
</property>
<property>
  <name>ha.zookeeper.acl</name>
  <value>@/path/to/zk-acl.txt</value>
</property>
```
Please note the ‘@’ character in these values – this specifies that the configurations are not inline, but rather point to a file on disk.
-
The first configured file specifies a list of ZooKeeper authentications, in the same format as used by the ZK CLI. For example, you may specify something like:
digest:hdfs-zkfcs:mypassword
-
…where hdfs-zkfcs is a unique username for ZooKeeper, and mypassword is some unique string used as a password.
-
Next, generate a ZooKeeper ACL that corresponds to this authentication, using a command like the following:
[hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=
-
Copy and paste the section of this output after the ‘->’ string into the file zk-acl.txt, prefixed by the string “digest:”. For example:
digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda
-
In order for these ACLs to take effect, you should then rerun the zkfc -formatZK command as described above.
-
After doing so, you may verify the ACLs from the ZK CLI as follows:
[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=
: cdrwa
Verifying automatic failover
-
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces – each node reports its HA state at the top of the page.
-
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 <pid of NN> to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a failover depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.
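A test session might look like the following sketch. The NameNode IDs nn1 and nn2 are illustrative (use the IDs from your own dfs.ha.namenodes configuration), and hdfs haadmin -getServiceState is an alternative to the web UI for locating the active node:

```
# Find the active NameNode (nn1/nn2 are illustrative NameNode IDs)
[hdfs]$ hdfs haadmin -getServiceState nn1
active
[hdfs]$ hdfs haadmin -getServiceState nn2
standby

# Simulate a JVM crash on the active node, then re-check the former standby
[hdfs]$ kill -9 <pid of NN>
[hdfs]$ hdfs haadmin -getServiceState nn2
active
```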
-
If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.
Automatic Failover FAQ
-
Is it important that I start the ZKFC and NameNode daemons in any particular order?
-
No. On any given node you may start the ZKFC before or after its corresponding NameNode.
-
What additional monitoring should I put in place?
-
You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains running. In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit, and should be restarted to ensure that the system is ready for automatic failover.
-
Additionally, you should monitor each of the servers in the ZooKeeper quorum. If ZooKeeper crashes, then automatic failover will not function.
-
What happens if ZooKeeper goes down?
-
If the ZooKeeper cluster crashes, no automatic failovers will be triggered. However, HDFS will continue to run without any impact. When ZooKeeper is restarted, HDFS will reconnect with no issues.
-
Can I designate one of my NameNodes as primary/preferred?
-
No. Currently, this is not supported. Whichever NameNode is started first will become active. You may choose to start the cluster in a specific order such that your preferred node starts first.
-
How can I initiate a manual failover when automatic failover is configured?
-
Even if automatic failover is configured, you may initiate a manual failover using the same hdfs haadmin command. It will perform a coordinated failover.
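For example, with hypothetical NameNode IDs nn1 (currently active) and nn2, a coordinated manual failover from nn1 to nn2 could be requested as:

```
[hdfs]$ hdfs haadmin -failover nn1 nn2
```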
HDFS Upgrade/Finalization/Rollback with HA Enabled
-
When moving between versions of HDFS, sometimes the newer software can simply be installed and the cluster restarted. Sometimes, however, upgrading the version of HDFS you’re running may require changing on-disk data. In this case, one must use the HDFS Upgrade/Finalize/Rollback facility after installing the new software. This process is made more complex in an HA environment, since the on-disk metadata that the NN relies upon is by definition distributed: it lives on both HA NNs in the pair and, when QJM is used for the shared edits storage, on the JournalNodes as well. This section describes the procedure to use the HDFS Upgrade/Finalize/Rollback facility in an HA setup.
-
To perform an HA upgrade, the operator must do the following:
-
Shut down all of the NNs as normal, and install the newer software.
-
Start up all of the JNs. Note that it is critical that all of the JNs be running when performing the upgrade, rollback, or finalization operations. If any of the JNs are down while one of these operations runs, the operation will fail.
-
Start one of the NNs with the '-upgrade' flag.
-
On start, this NN will not enter the standby state as usual in an HA setup. Rather, this NN will immediately enter the active state, perform an upgrade of its local storage dirs, and also perform an upgrade of the shared edit log.
-
At this point the other NN in the HA pair will be out of sync with the upgraded NN. In order to bring it back in sync and once again have a highly available setup, you should re-bootstrap this NameNode by running the NN with the '-bootstrapStandby' flag. It is an error to start this second NN with the '-upgrade' flag.
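Using the command conventions shown elsewhere in this guide, the sequence above might be carried out roughly as follows. This is a sketch, not a complete runbook; each command runs on the host indicated in the comment:

```
# On every NameNode host: stop the NN, then install the new software
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --script $HADOOP_PREFIX/bin/hdfs stop namenode

# On every JournalNode host: ensure the JN is running
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --script $HADOOP_PREFIX/bin/hdfs start journalnode

# On the first NN host only: start with the -upgrade flag
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -upgrade

# On the second NN host: re-sync its state from the upgraded NN
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -bootstrapStandby
```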
-
Note that if at any time you want to restart the NameNodes before finalizing or rolling back the upgrade, you should start the NNs as normal, i.e. without any special startup flag.
-
To finalize an HA upgrade, the operator will use the 'hdfs dfsadmin -finalizeUpgrade' command while the NNs are running and one of them is active. The active NN at the time this happens will perform the finalization of the shared log, and the NN whose local storage directories contain the previous FS state will delete its local state.
-
To perform a rollback of an upgrade, both NNs should first be shut down. The operator should run the rollback command on the NN where they initiated the upgrade procedure, which will perform the rollback on the local dirs there, as well as on the shared log, whether NFS or the JNs. Afterward, this NN should be started and the operator should run '-bootstrapStandby' on the other NN to bring the two NNs in sync with this rolled-back file system state.