The distributed file system HDFS is one of the core components of Hadoop, and HDFS is an open-source implementation based on this paper. The paper can fairly be called the foundational work behind HDFS and is well worth reading carefully.
Abstract
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
The first sentence of the first paragraph says that GFS is a distributed file system; the second says that GFS runs on inexpensive commodity hardware while still providing fault tolerance. Reading this, a question popped into my head: "inexpensive commodity hardware" surely means ordinary PC-class machines, so what would the expensive, non-commodity hardware be? Clearly, the expensive hardware implicitly contrasted here is the mainframe.
Are mainframes really that expensive?
I looked up a particular IBM z15 mainframe and learned from IBM's customer service that a z15 starts at around 2 million RMB, with higher configurations costing more. In other words, even an entry-level z15 costs 2 million. I downloaded the z15 technical manual from IBM's website: the entry-level z15 has 34 CPUs with 12 cores each, which amounts to a 408-core server. Being a mainframe, the z15's CPUs are no ordinary parts either: the manual claims the cores run at 5.2 GHz. To put 5.2 GHz in perspective, server CPUs typically run at only 2-3 GHz, so on clock frequency alone the mainframe is nearly twice as fast as a commodity machine. A closer look at the specs shows the mainframe is not just higher-clocked; it beats commodity machines on just about every other dimension too, so the price is not without justification.
The second sentence of the abstract's first paragraph mentions an important property of GFS: fault tolerance. How does GFS achieve it? We will come back to this after reading further.
While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
The gist of the abstract's second paragraph: the design assumptions of earlier file systems no longer match today's application workloads and technological environment, so traditional choices need to be re-examined and a radically different design explored. (Another question to note down: which earlier file-system design assumptions no longer meet today's needs?)
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
The gist of the abstract's third paragraph: GFS has successfully met Google's storage needs. It is widely deployed inside Google as the storage platform both for the company's services and for research and development work that needs large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and is accessed concurrently by hundreds of clients.
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
The gist of the abstract's fourth paragraph: the paper presents file-system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real-world use. (What exactly are these "file system interface extensions"?)
Introduction
We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.
The first paragraph of the introduction answers the question I noted at the abstract's second paragraph, and the text that follows gives four reasons.
Which earlier file-system design assumptions no longer meet today's needs?
First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.
First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of inexpensive commodity machines and is accessed by a comparable number of client machines. Given the quantity and quality of these components, it is virtually guaranteed that at any given time some machines are broken and some will never recover from their failures. The authors have seen failures of every kind: application bugs, operating system bugs, human error, and failures of disks, memory, connectors, networking, and power supplies. Therefore constant monitoring, error detection, fault tolerance, and automatic recovery must be built into the system.
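The "failures are the norm" claim becomes intuitive with a rough back-of-envelope calculation. The numbers below are my own assumptions, not from the paper; the point is how quickly the odds of "something is broken right now" approach certainty as the cluster grows.

```python
# Back-of-envelope estimate with assumed numbers (my own, not from the paper):
# if each machine is independently unavailable with probability p at any moment,
# how likely is it that at least one machine in the cluster is down right now?

def p_at_least_one_down(n_machines: int, p_down: float) -> float:
    """Probability that at least one of n_machines is currently down."""
    return 1.0 - (1.0 - p_down) ** n_machines

if __name__ == "__main__":
    for n in (100, 1000):
        for p in (0.001, 0.01):  # assumed per-machine unavailability rates
            print(f"n={n:5d}  p={p:.3f}  P(>=1 down)={p_at_least_one_down(n, p):.3f}")
    # Even at a 0.1% per-machine rate, a 1000-machine cluster has ~63% odds
    # that something is broken at any given instant -- failure is the norm.
```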
Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.
Second, files are huge by traditional standards; multi-GB files are already common. Each file typically contains many application objects, such as web documents. When you routinely work with fast-growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of roughly KB-sized files even if the file system could support it. As a result, earlier design assumptions and parameters, such as I/O operation sizes and block sizes, have to be revisited.
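To see why billions of roughly KB-sized files are unwieldy and why block size has to be revisited, here is a rough metadata estimate. The 64 MB chunk size is what GFS adopts later in the paper; the ~64 bytes of metadata per entry is an assumed round number purely for illustration.

```python
# Rough metadata estimate: the same 100 TB of data stored as ~1 KB files
# versus as 64 MB chunks (64 MB is the chunk size GFS adopts later in the
# paper; 64 bytes of metadata per entry is an assumed round number).

TOTAL_DATA = 100 * 10**12   # 100 TB
META_PER_ENTRY = 64         # assumed bytes of metadata per file/chunk entry

small_files = TOTAL_DATA // (1 * 10**3)     # ~1 KB files
chunks_64mb = TOTAL_DATA // (64 * 10**6)    # 64 MB chunks

print(f"1 KB files  : {small_files:,} entries -> "
      f"{small_files * META_PER_ENTRY / 10**12:.1f} TB of metadata")
print(f"64 MB chunks: {chunks_64mb:,} entries -> "
      f"{chunks_64mb * META_PER_ENTRY / 10**6:.0f} MB of metadata")
# ~6 TB of metadata versus ~100 MB: only the large-block layout keeps the
# bookkeeping small enough to handle comfortably on a single machine.
```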
Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
Third, most files are mutated by appending new data rather than by overwriting existing data; random writes within a file are practically non-existent. Once written, a file is only read, and usually only read sequentially. All sorts of data share these characteristics: large repositories that data-analysis programs scan through, data streams continuously generated by running applications, archival data, and intermediate results produced on one machine and processed on another, either at the same time or later. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
I didn't quite get "caching data blocks in the client loses its appeal". Does that mean clients used to cache data locally to improve performance?
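As a concrete picture of this access pattern, here is a minimal local sketch using plain files (my own illustration, not GFS): writers only ever append length-prefixed records, nothing overwrites existing bytes, and readers scan the file sequentially from the start.

```python
# Minimal local sketch of the workload described above (plain files, not GFS):
# writers only append length-prefixed records; readers scan sequentially.
import struct

def append_record(path: str, payload: bytes) -> None:
    # Append mode: existing bytes are never overwritten.
    with open(path, "ab") as f:
        f.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
        f.write(payload)

def scan_records(path: str):
    # Sequential read from the beginning; no random seeks.
    with open(path, "rb") as f:
        while header := f.read(4):
            (length,) = struct.unpack(">I", header)
            yield f.read(length)

if __name__ == "__main__":
    log = "events.log"  # hypothetical file name
    for i in range(3):
        append_record(log, f"event-{i}".encode())
    print([r.decode() for r in scan_records(log)])
```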
Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility. For example, we have relaxed GFS’s consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more details later in the paper.
Fourth, co-designing the applications and the file-system API benefits the overall system by increasing flexibility. For example, GFS's consistency model is relaxed to vastly simplify the file system without imposing an onerous burden on applications. An atomic append operation is also introduced so that multiple clients can append to the same file concurrently without any extra synchronization among themselves. These details are discussed later in the paper.
"Co-designing" literally means designing the applications and the file system together, which I honestly didn't understand at first; fortunately the authors seem to have anticipated the confusion and added the examples above. What surprised me here is that the consistency model was relaxed. Is consistency really something you can compromise on? I will pay close attention later to exactly what was relaxed.
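To make the "no extra synchronization between clients" point concrete, here is a toy simulation (my own sketch, not the real GFS protocol): the file system itself chooses the offset at which each record lands, so concurrent clients just call record_append and never coordinate with each other.

```python
# Toy simulation of atomic record append (my own sketch, not the GFS protocol):
# the "file system" picks the offset for every append, so clients need no
# locking or coordination among themselves.
import threading

class ToyAppendOnlyFile:
    def __init__(self) -> None:
        self._buf = bytearray()
        self._lock = threading.Lock()  # synchronization lives inside the FS, not in clients

    def record_append(self, data: bytes) -> int:
        """Atomically append data and return the offset the file system chose."""
        with self._lock:
            offset = len(self._buf)
            self._buf.extend(data)
            return offset

    def contents(self) -> bytes:
        return bytes(self._buf)

def client(fs: ToyAppendOnlyFile, client_id: int, results: dict) -> None:
    # Each client appends concurrently, knowing nothing about the others.
    results[client_id] = fs.record_append(f"record-from-{client_id};".encode())

if __name__ == "__main__":
    fs, results = ToyAppendOnlyFile(), {}
    threads = [threading.Thread(target=client, args=(fs, i, results)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)        # every client got its own, non-overlapping offset
    print(fs.contents())  # all records present exactly once, in some order
```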
Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.
Multiple GFS clusters are already deployed at Google for different purposes. The largest ones have over 1000 storage nodes and over 300 TB of disk storage, and are continuously and heavily accessed by hundreds of clients on distinct machines.