Reading the Hadoop 3.1.4 Official Documentation (Part 1): HDFS Architecture


HDFS Architecture

English source: the official documentation (HDFS Architecture)

What follows is not a translation, just a few takeaways distilled after reading the documentation.

Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is hadoop.apache.org/.

Understanding: what makes HDFS powerful is high fault tolerance, low hardware cost, and high throughput on large data sets. To enable streaming access to file system data, HDFS relaxes a few POSIX requirements (POSIX = Portable Operating System Interface; the X reflects its Unix API heritage).

Assumptions and Goals

Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Understanding: hardware failure is the norm. A core architectural goal of HDFS is to detect faults and recover from them quickly and automatically.

Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.

Understanding: HDFS is designed more for batch processing than for interactive use, so the emphasis is on high throughput of data access rather than low latency.

Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

Understanding: HDFS was built for big data.

Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends and truncates. Appending the content to the end of the files is supported but cannot be updated at arbitrary point. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model.

Understanding: HDFS wants a "write-once-read-many" file access model: once a file is created, written, and closed, it cannot be changed except for appends and truncates (a limited form of deletion). Appending content to the end of a file is supported, but updating it at an arbitrary offset is not. This assumption simplifies data coherency and enables high-throughput data access. MapReduce applications and web crawlers fit this model perfectly.

“Moving Computation is Cheaper than Moving Data”

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Understanding: in a word, moving computation is cheaper than moving data. Run the computation near where the data sits; that minimizes network congestion and raises the overall throughput of the system. HDFS provides interfaces that let applications do exactly this. (The pile of scratch paper is too heavy to haul over, so I carry my calculator to the scratch paper instead.)

Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

Understanding: HDFS is highly portable.

NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

Understanding: HDFS has a master/slave architecture. An HDFS cluster has a single NameNode, a master server that manages the file system namespace and regulates client access to files. The cluster also has many DataNodes, usually one per node, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and lets user data be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of DataNodes. The NameNode executes namespace operations such as opening, closing, and renaming files and directories; it also determines the mapping of blocks to DataNodes. The DataNodes serve read and write requests from file system clients, and they also create, delete, and replicate blocks on instruction from the NameNode.

HDFS is built in Java. In a typical deployment, one dedicated machine runs only the NameNode, and each of the other machines in the cluster runs one DataNode. Running multiple DataNodes on the same machine is possible, but rarely seen in practice.

Having a single NameNode in the cluster greatly simplifies the system architecture. The NameNode is the arbitrator and the repository for all HDFS metadata. Thanks to this design, user data never flows through the NameNode.
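
To make the division of labor concrete, here is a minimal sketch against the stock org.apache.hadoop.fs.FileSystem API. Every call below is a pure namespace operation handled by the NameNode; no user data moves. The /demo paths are made up for illustration, and fs.defaultFS is assumed to point at an HDFS cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOps {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from core-site.xml on the classpath.
            FileSystem fs = FileSystem.get(new Configuration());

            fs.mkdirs(new Path("/demo/dir"));                        // create a directory
            fs.rename(new Path("/demo/dir"), new Path("/demo/d2"));  // rename it
            fs.delete(new Path("/demo/d2"), true);                   // remove it recursively

            fs.close();
        }
    }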

The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS supports user quotas and access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.

While HDFS follows naming convention of the FileSystem, some paths and names (e.g. /.reserved and .snapshot) are reserved. Features such as transparent encryption and snapshot use reserved paths.

The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.

Understanding: HDFS supports a traditional hierarchical file organization (think of the directory tree on your PC). HDFS does not support hard links or soft links, though nothing in the architecture precludes implementing them. The NameNode maintains the file system namespace, and it is also where each file's replication factor (how many replicas HDFS should maintain) is stored.
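
The replication factor is per-file metadata held by the NameNode, and an application can pin it at creation time. A small sketch, assuming a hypothetical path and an arbitrary factor of 2 (FileSystem.create(Path, short) is the overload that takes a replication factor):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Ask HDFS to keep 2 replicas of this file instead of the default;
            // the factor is recorded by the NameNode as file metadata.
            FSDataOutputStream out = fs.create(new Path("/demo/twocopies.txt"), (short) 2);
            out.writeUTF("hello hdfs");
            out.close();

            fs.close();
        }
    }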

Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.

All blocks in a file except the last block are the same size, while users can start a new block without filling out the last block to the configured block size after the support for variable length block was added to append and hsync.

An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

Understanding: HDFS stores each file as a sequence of blocks. For fault tolerance, each block is replicated. Both the block size and the replication factor are configurable per file.

All blocks in a file except the last one are the same size. Since append and hsync added support for variable-length blocks, a user can start a new block without filling the last block up to the configured block size.

The replication factor can be specified at file creation time and changed later. Because HDFS files are write-once (as mentioned earlier), there is strictly one writer at any time.

The NameNode periodically receives a Heartbeat and a Blockreport from every DataNode in the cluster. Receiving a Heartbeat means the DataNode is functioning properly; a Blockreport lists all the blocks on a DataNode.

Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.

Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.

The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

If the replication factor is greater than 3, the placement of the 4th and following replicas are determined randomly while keeping the number of replicas per rack below the upper limit (which is basically (replicas - 1) / racks + 2).

Because the NameNode does not allow DataNodes to have multiple replicas of the same block, maximum number of replicas created is the total number of DataNodes at that time.

After the support for Storage Types and Storage Policies was added to HDFS, the NameNode takes the policy into account for replica placement in addition to the rack awareness described above. The NameNode chooses nodes based on rack awareness at first, then checks that the candidate node has the storage required by the policy associated with the file. If the candidate node does not have the storage type, the NameNode looks for another node. If enough nodes to place replicas can not be found in the first path, the NameNode looks for nodes having fallback storage types in the second path.

The current, default replica placement policy described here is a work in progress.

Understanding: I covered replica placement in an earlier post, so see there for details. One quick check of the per-rack cap formula: with 10 replicas across 3 racks, at most (10 - 1) / 3 + 2 = 5 replicas land on any single rack (integer division).

Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.

Understanding: to minimize global bandwidth consumption and read latency, HDFS tries to serve a read request from the replica closest to the reader.

Safemode

On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.

Understanding: on startup, the NameNode enters a special state called Safemode. While the NameNode is in Safemode, no replication of data blocks takes place. The NameNode receives Heartbeat and Blockreport messages from the DataNodes; a Blockreport lists the data blocks a DataNode is hosting. Every block has a specified minimum number of replicas, and a block counts as safely replicated once that minimum number of replicas has checked in with the NameNode. When a configurable percentage of safely replicated blocks has checked in (plus an extra 30 seconds), the NameNode exits Safemode. It then finds the blocks (if any) that still have fewer than the specified number of replicas and replicates them to other DataNodes.
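
Safemode can also be queried programmatically, not just through bin/hdfs dfsadmin. A sketch, assuming the file system really is HDFS (the cast fails otherwise); SAFEMODE_GET only reads the state and changes nothing:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.HdfsConstants;

    public class SafemodeCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            DistributedFileSystem dfs = (DistributedFileSystem) fs;

            // SAFEMODE_GET queries the current state without changing it.
            boolean inSafemode = dfs.setSafeMode(HdfsConstants.SafeModeAction.SAFEMODE_GET);
            System.out.println("NameNode in safemode: " + inSafemode);

            fs.close();
        }
    }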

The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.

The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. When the NameNode starts up, or a checkpoint is triggered by a configurable threshold, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. The purpose of a checkpoint is to make sure that HDFS has a consistent view of the file system metadata by taking a snapshot of the file system metadata and saving it to FsImage. Even though it is efficient to read a FsImage, it is not efficient to make incremental edits directly to a FsImage. Instead of modifying FsImage for each edit, we persist the edits in the Editlog. During the checkpoint the changes from Editlog are applied to the FsImage. A checkpoint can be triggered at a given time interval (dfs.namenode.checkpoint.period) expressed in seconds, or after a given number of filesystem transactions have accumulated (dfs.namenode.checkpoint.txns). If both of these properties are set, the first threshold to be reached triggers a checkpoint.

The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode. The report is called the Blockreport.

Understanding: the NameNode uses a transaction log called the EditLog to persistently record every change to file system metadata. For example, creating a new file in HDFS, or changing a file's replication factor, inserts a record into the EditLog. The NameNode stores the EditLog as a file in its local host OS file system. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called the FsImage, which also lives in the NameNode's local file system.

When the NameNode starts up, or when a checkpoint is triggered by a configurable threshold, it reads the FsImage and EditLog from disk, applies all the EditLog transactions to the in-memory representation of the FsImage, and flushes this new version out to disk as a new FsImage. It can then truncate the old EditLog, since its transactions are now in the persistent FsImage. This process is called a checkpoint. Reading an FsImage is efficient, but making incremental edits directly to it is not; that is why edits are first persisted in the EditLog and only folded into the FsImage at checkpoint time. There are two checkpoint triggers: a time interval in seconds (dfs.namenode.checkpoint.period) and an accumulated transaction count (dfs.namenode.checkpoint.txns). If both properties are set, whichever threshold is reached first triggers the checkpoint.
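
A small sketch of reading the two checkpoint triggers from the client-side configuration. The fallback values (3600 seconds and 1,000,000 transactions) match the defaults I recall from hdfs-default.xml, but verify them against your release:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.HdfsConfiguration;

    public class CheckpointSettings {
        public static void main(String[] args) {
            // HdfsConfiguration also loads hdfs-site.xml from the classpath.
            Configuration conf = new HdfsConfiguration();

            long periodSecs = conf.getLong("dfs.namenode.checkpoint.period", 3600L);
            long txnLimit   = conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000L);

            System.out.println("checkpoint every " + periodSecs + " s or after "
                    + txnLimit + " transactions, whichever comes first");
        }
    }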

The DataNode stores HDFS data as files in its local file system. It knows nothing about HDFS files; it simply stores each block of HDFS data in a separate local file. The DataNode does not create all files in the same directory: it uses a heuristic to determine the optimal number of files per directory and creates subdirectories accordingly, because the local file system might not efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans its local file system, generates a list of all the HDFS data blocks corresponding to those local files, and sends this report to the NameNode. That report is the Blockreport.

The Communication Protocols

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.

Understanding: a client connects to a configurable TCP port on the NameNode machine and speaks the ClientProtocol with it; DataNodes speak the DataNode Protocol to the NameNode. A Remote Procedure Call (RPC) abstraction wraps both protocols. By design, the NameNode never initiates an RPC; it only responds to RPC requests issued by DataNodes or clients.

Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.

Understanding: the primary objective of HDFS is to store data reliably even in the presence of failures. The three common failure types are NameNode failures, DataNode failures, and network partitions.

Data Disk Failure, Heartbeats and Re-Replication

Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.

The time-out to mark DataNodes dead is conservatively long (over 10 minutes by default) in order to avoid replication storm caused by state flapping of DataNodes. Users can set shorter interval to mark DataNodes as stale and avoid stale nodes on reading and/or writing by configuration for performance sensitive workloads.

Understanding: every DataNode periodically sends a Heartbeat message to the NameNode. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode, which detects this through the absence of Heartbeats. The NameNode marks DataNodes without recent Heartbeats as dead and stops forwarding new IO requests to them; any data registered to a dead DataNode is no longer available to HDFS. A DataNode's death may push the replication factor of some blocks below their specified value, so the NameNode constantly tracks which blocks need replication and starts it whenever necessary. Re-replication may become necessary for many reasons: a DataNode becomes unavailable, a replica gets corrupted, a hard disk on a DataNode fails, or a file's replication factor is increased.

To avoid a replication storm caused by DataNodes flapping between states, the time-out for marking a DataNode dead is conservatively long (over 10 minutes by default). For performance-sensitive workloads, users can configure a shorter interval that marks DataNodes as stale and avoids reading from and/or writing to stale nodes.
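
The stale-node behavior is driven by a handful of hdfs-site.xml properties. A hedged sketch with the property names as I know them from hdfs-default.xml (the interval is in milliseconds); double-check the names and defaults for your release:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.HdfsConfiguration;

    public class StaleNodeSettings {
        public static void main(String[] args) {
            Configuration conf = new HdfsConfiguration();

            // Mark a DataNode stale after 30 s without a Heartbeat...
            conf.setLong("dfs.namenode.stale.datanode.interval", 30_000L);
            // ...and steer reads and writes away from stale nodes.
            conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
            conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);

            System.out.println("stale after "
                    + conf.getLong("dfs.namenode.stale.datanode.interval", 30_000L) + " ms");
        }
    }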

Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

Understanding: the HDFS architecture is compatible with data rebalancing schemes. For example, a scheme might automatically move data off a DataNode when its free space falls below a certain threshold; or, if a particular file suddenly sees high demand, a scheme might dynamically create additional replicas and rebalance other data in the cluster. None of these rebalancing schemes is implemented yet, though. (Not sure whether I just read about nothing, or about a lot of room for future work.)

Data Integrity

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.

Understanding: data integrity verification. A block fetched from a DataNode may arrive corrupted, whether from a storage device fault, a network fault, or buggy software. The HDFS client software therefore implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block and stores these checksums in a separate hidden file in the same HDFS namespace. When the client later retrieves file contents, it verifies that the data received from each DataNode matches the stored checksums; if not, it can opt to fetch that block from another DataNode that has a replica of it.
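
The real client checksums small chunks of each block rather than whole files, but the verify-and-fall-back idea fits in a few lines of plain JDK code. Purely illustrative; no HDFS APIs involved:

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class ChecksumIdea {
        // Checksum of one chunk, as the writer would have recorded it.
        static long crcOf(byte[] chunk) {
            CRC32 crc = new CRC32();
            crc.update(chunk, 0, chunk.length);
            return crc.getValue();
        }

        public static void main(String[] args) {
            byte[] chunk = "some block data".getBytes(StandardCharsets.UTF_8);
            long stored = crcOf(chunk);   // checksum kept in the hidden file

            chunk[0] ^= 0x01;             // simulate corruption in transit
            long seen = crcOf(chunk);     // checksum the reader computes

            if (seen != stored) {
                // A real client would now fetch the block from another replica.
                System.out.println("checksum mismatch, try another DataNode");
            }
        }
    }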

Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

Another option to increase resilience against failures is to enable High Availability using multiple NameNodes either with a shared storage on NFS or using a distributed edit log (called Journal). The latter is the recommended approach.

Understanding: the FsImage and EditLog are the central data structures of HDFS, and corrupting them can leave an HDFS instance non-functional. For this reason, the NameNode can be configured to maintain multiple copies of the FsImage and EditLog, with any update applied to every copy synchronously. This synchronous updating may lower the rate of namespace transactions per second the NameNode can support, but the degradation is acceptable: HDFS applications are data intensive by nature, not metadata intensive. When the NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

Another way to increase resilience against failures is High Availability using multiple NameNodes, either with shared storage on NFS or with a distributed edit log (called the Journal). The latter is the recommended approach.

Snapshots

Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time.

Understanding: use snapshots to roll back a corrupted HDFS instance to a previously known good point in time.

Data Organization

Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.

Understanding: a typical HDFS block size is 128 MB, so an HDFS file is chopped into 128 MB chunks and, where possible, each chunk lands on a different DataNode.
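
Since the block size is per-file, a writer can override the 128 MB default at create time. A sketch using the FileSystem.create overload that takes buffer size, replication, and block size; the 256 MB figure and the path are arbitrary:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithBlockSize {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            long blockSize = 256L * 1024 * 1024;  // 256 MB instead of the 128 MB default
            FSDataOutputStream out = fs.create(
                    new Path("/demo/bigfile.dat"),
                    true,         // overwrite if it exists
                    4096,         // io buffer size in bytes
                    (short) 3,    // replication factor
                    blockSize);   // per-file block size
            out.write(new byte[]{1, 2, 3});
            out.close();

            fs.close();
        }
    }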

Replication Pipelining

When a client is writing data to an HDFS file with a replication factor of three, the NameNode retrieves a list of DataNodes using a replication target choosing algorithm. This list contains the DataNodes that will host a replica of that block. The client then writes to the first DataNode. The first DataNode starts receiving the data in portions, writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.

Understanding: replicas are written through a pipeline. DataNode 1 receives the data in portions and, while writing each portion to its local repository, forwards it on to DataNode 2; DataNode 2 does the same toward DataNode 3.

Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a FileSystem Java API for applications to use. A C language wrapper for this Java API and a REST API are also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. By using the NFS gateway, HDFS can be mounted as part of the client's local file system.

Understanding: many ways in. The FileSystem Java API, a C wrapper around that Java API, a REST API, and an HTTP browser. Through the NFS gateway, HDFS can even be mounted into the client's local file system.
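
The native route is the FileSystem Java API mentioned above. A minimal read example; the path is hypothetical and fs.defaultFS is assumed to point at your cluster:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromHdfs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // fs.open returns an FSDataInputStream; wrap it for line-oriented reads.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/demo/myfile.txt"))))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }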

FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a commandline interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:

Action / Command

  Create a directory named /foodir  ->  bin/hadoop dfs -mkdir /foodir
  Remove a directory named /foodir  ->  bin/hadoop fs -rm -R /foodir
  View the contents of a file named /foodir/myfile.txt  ->  bin/hadoop dfs -cat /foodir/myfile.txt

FS shell is targeted for applications that need a scripting language to interact with the stored data.

Understanding: the FS shell command set.

DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:

  Put the cluster in Safemode  ->  bin/hdfs dfsadmin -safemode enter
  Generate a list of DataNodes  ->  bin/hdfs dfsadmin -report
  Recommission or decommission DataNode(s)  ->  bin/hdfs dfsadmin -refreshNodes

Understanding: the DFSAdmin command set, for HDFS administrators.

Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.

Understanding: the browser interface.

Space Reclamation

File Deletes and Undeletes

If trash configuration is enabled, files removed by FS Shell are not immediately removed from HDFS. Instead, HDFS moves them to a trash directory (each user has their own trash directory under /user/<username>/.Trash). A file can be restored quickly as long as it remains in trash.

Most recently deleted files are moved to the current trash directory (/user/<username>/.Trash/Current), and at a configurable interval, HDFS creates checkpoints (under /user/<username>/.Trash/<date>) for files in the current trash directory and deletes old checkpoints when they expire. See the expunge command of FS shell about checkpointing of trash.

After the expiry of its life in trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

Understanding: if trash is enabled, files deleted through the FS Shell are not removed right away but are parked under /user/<username>/.Trash. The most recently deleted files sit in the current trash directory, and a configurable interval controls when HDFS checkpoints the trash and purges expired checkpoints. Once a file expires out of trash, the NameNode deletes it from the HDFS namespace and the blocks associated with it are freed. Note that there can be an appreciable delay between the user deleting a file and the corresponding increase in free space in HDFS.
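
The FS Shell's delete-to-trash behavior can be reproduced in code with org.apache.hadoop.fs.Trash, which moves a path into the caller's .Trash when fs.trash.interval is greater than zero on the cluster. A sketch with a made-up path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    public class DeleteViaTrash {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Move to .Trash instead of deleting; returns false if trash is disabled.
            boolean moved = Trash.moveToAppropriateTrash(fs, new Path("/demo/old.txt"), conf);
            System.out.println(moved ? "moved to trash" : "trash disabled, not moved");

            fs.close();
        }
    }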

Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.

Understanding: when a file's replication factor is reduced, the NameNode selects excess replicas that can be deleted and passes this information to the DataNodes with the next Heartbeat. The DataNodes then remove the corresponding blocks, and the freed space appears in the cluster. Here too there may be a delay between completing the setReplication API call and seeing the free space appear in the cluster.
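
The call in question is FileSystem.setReplication. A sketch lowering a hypothetical file's factor to 2; the call returns quickly, while the excess replicas are reclaimed later via Heartbeats:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LowerReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Drop this file's replication factor to 2; block deletion happens
            // asynchronously, so free space may take a while to show up.
            boolean ok = fs.setReplication(new Path("/demo/twocopies.txt"), (short) 2);
            System.out.println("setReplication accepted: " + ok);

            fs.close();
        }
    }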

English source: the official documentation (HDFS Architecture)