Hadoop Storm Spark HDFS


HBase vs Hive

HBase, short for Hadoop database, is a NoSQL database suited to random, real-time queries over massive detail data (billions to tens of billions of rows), such as log details, transaction records, and trajectory/behavior data.

Hive is a Hadoop data warehouse: it processes and computes over HDFS data via SQL, translating the SQL into MapReduce jobs, which makes it suitable for offline batch computation.

How Hive and HBase relate

In a big-data architecture, Hive and HBase are collaborators: they work closely together from data ingestion through data storage to get the job done:

An ETL tool extracts data from source systems into HDFS for storage;

Hive cleans, processes, and computes over the raw data;

If the cleaned results need to serve random queries over massive data, they can be loaded into HBase;

Applications then query the data from HBase.

Hive vs HBase under the hood

Tables in Hive are purely logical: just the table definition, i.e. metadata. Hive itself stores no data; it relies entirely on HDFS and MapReduce. This lets it map structured data files to database tables, provide full SQL query support, and compile SQL statements into MapReduce jobs for execution. HBase tables, by contrast, are physical tables, suited to storing unstructured data.

Hive processes data via MapReduce, which operates in a row-oriented fashion; HBase processes data column-by-column rather than row-by-row, which suits random access over massive datasets.

HBase tables are sparsely stored, so users can define different columns for different rows; Hive tables are dense: however many columns the schema defines, every row stores that fixed number of columns.

Hive uses Hadoop to analyze and process data, and Hadoop is a batch system, so low latency cannot be guaranteed; HBase is a near-real-time system that supports real-time queries.
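The "sparse vs dense" distinction can be illustrated with a small sketch. This is purely illustrative Python (the row keys, column-family names, and schema below are made up), not Hive or HBase storage internals:

```python
# HBase-style "sparse" rows: each row stores only the columns it actually
# has, so different rows can carry different column sets.
hbase_rows = {
    "row1": {"cf:name": "alice", "cf:city": "beijing"},
    "row2": {"cf:name": "bob", "cf:last_login": "2024-01-01"},  # no city column
}

# Hive-style "dense" table: a fixed schema; every row stores a value
# (possibly NULL/None) for each defined column.
hive_schema = ("name", "city", "last_login")
hive_rows = [
    ("alice", "beijing", None),
    ("bob", None, "2024-01-01"),
]

# Sparse rows simply omit missing columns; dense rows must hold a NULL.
print("cf:city" in hbase_rows["row2"])  # False
print(hive_rows[1][hive_schema.index("city")])  # None
```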

Batch Processing (Hadoop)

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

MapReduce

Example: count the occurrences of each word in a large collection of documents.

map: for each document, iterate over its words and emit a (word, count) pair per occurrence

reduce: for each word, sum the counts emitted by the map phase
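The two phases can be sketched as plain Python functions. This is a single-process sketch of the MapReduce word-count idea, not actual Hadoop code; the function names are illustrative:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: for each document, emit a (word, 1) pair per word occurrence."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group the emitted pairs by word and sum the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(map_phase(docs)))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In real Hadoop, the framework shuffles the map output so that all pairs with the same key reach the same reducer; the `defaultdict` grouping above stands in for that shuffle step.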

Stream Processing (Storm)

a free and open source distributed realtime computation system

Storm Concepts

  • Topology: a graph of computation where the nodes represent some individual computations and the edges represent the data being passed between nodes.

  • Tuple: A tuple is an ordered list of values, where each value is assigned a name. Nodes in the topology send data between one another in the form of tuples.

  • Stream: An unbounded sequence of tuples between two nodes in the topology

  • Spout: The source of a stream in the topology. A spout reads data from an external data source and emits tuples into the topology.

  • Bolt: Accepts a tuple from its input stream, performs some computation or transformation on that tuple (filtering, aggregation, or a join, perhaps), and then optionally emits one or more new tuples.
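The concepts above can be wired together in a minimal, Storm-free sketch: a "spout" emits tuples into a stream, and "bolts" transform and aggregate them. The class names (`WordSpout`, `SplitBolt`, `CountBolt`) are illustrative, not the Storm API:

```python
from collections import Counter

class WordSpout:
    """Spout: the stream's source; emits one tuple per input sentence."""
    def __init__(self, sentences):
        self.sentences = sentences
    def emit(self):
        for s in self.sentences:
            yield {"sentence": s}   # a tuple: an ordered list of named values

class SplitBolt:
    """Bolt: consumes sentence tuples, emits one word tuple per word."""
    def process(self, stream):
        for t in stream:
            for w in t["sentence"].split():
                yield {"word": w}

class CountBolt:
    """Bolt: terminal aggregation; keeps a running count per word."""
    def __init__(self):
        self.counts = Counter()
    def process(self, stream):
        for t in stream:
            self.counts[t["word"]] += 1

# Wire the topology: spout -> split bolt -> count bolt
spout = WordSpout(["storm processes streams", "streams of tuples"])
counter = CountBolt()
counter.process(SplitBolt().process(spout.emit()))
print(counter.counts["streams"])  # 2
```

In real Storm the stream is unbounded and the bolts run distributed across worker nodes; the generator pipeline here only mimics the topology's data flow.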


Architecture

  • Master node: runs the Nimbus, a central job master to which topologies are submitted. It is in charge of scheduling, job orchestration, communication, and fault tolerance.
  • Worker nodes: nodes of the cluster on which applications are executed. Each of them runs a Supervisor.
  • Master and workers coordinate through ZooKeeper.


Stateful Computations over Data Streams (Flink)

Flink offers a more expressive programming model, exactly-once processing semantics, and efficient memory management, while Storm provides lower latency, simplicity, and high throughput for large-scale data processing.

Big Data Processing (Spark)

a unified analytics engine for big data


Hadoop


HDFS

HDFS is short for Hadoop Distributed File System.


Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.

Files in HDFS are write-once and have strictly one writer at any time.

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
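The checksum flow described above can be sketched in a few lines: on write, compute a checksum per block and store the checksums separately from the data; on read, recompute each block's checksum, compare, and fall back to another replica on a mismatch. The block size, MD5 choice, and replica layout here are illustrative, not HDFS internals:

```python
import hashlib

BLOCK_SIZE = 8  # toy block size; HDFS blocks are typically 128 MB

def split_blocks(data: bytes):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def write_file(data: bytes):
    """Client write path: store blocks plus a separate per-block checksum list."""
    blocks = split_blocks(data)
    checksums = [hashlib.md5(b).hexdigest() for b in blocks]
    return blocks, checksums

def read_block(replicas, idx, expected_checksum):
    """Client read path: try each replica until a block's checksum matches."""
    for replica in replicas:
        block = replica[idx]
        if hashlib.md5(block).hexdigest() == expected_checksum:
            return block
    raise IOError(f"all replicas corrupt for block {idx}")

blocks, checksums = write_file(b"hello hdfs checksums")
corrupted = list(blocks)
corrupted[0] = b"garbage!"                    # simulate a corrupt replica
good = read_block([corrupted, blocks], 0, checksums[0])
print(good)  # b'hello hd'
```

The key design point mirrors HDFS: the checksums live apart from the data blocks, so a fault that corrupts a block is unlikely to also corrupt its checksum, and the client can detect the mismatch and retry against another DataNode's replica.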