Apache HBase is a distributed, scalable, non-relational, column-oriented database that runs on top of Hadoop and HDFS. It provides random, realtime read/write access to Big Data.
hbase vs hive
Both store their data on HDFS.
hive
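- Hive queries are ultimately translated into MapReduce jobs, so query latency is relatively high
- Hive tables are purely logical: they are only table definitions. Hive itself neither stores nor computes anything and depends entirely on HDFS and MapReduce
use cases
- Mainly used to build data warehouses on the Hadoop platform and to process massive data sets offline
- Hive provides a SQL-like query language (HiveQL), used for analyzing and mining historical data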
hbase
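- HBase answers real-time queries from the database itself instead of running MapReduce jobs
- It has its own primary index, the rowkey; queries go through this index, so lookups are relatively fast
- Under the hood, data is read by scans
use cases
- Suited to real-time queries over big data and to storing massive data sets
- Real-time data warehouses: store dimension tables in HBase (a dimension table mainly consists of a primary key plus various dimension fields)
- Tagging systems: store the tag data in HBase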
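The rowkey index can be pictured as a table kept sorted by key: a point get is a direct lookup, and a range scan is a binary search for the start key followed by a sequential read. A minimal sketch in plain Python, not HBase client code; `RowKeyIndex` and the sample rowkeys are hypothetical names chosen for illustration.

```python
import bisect

class RowKeyIndex:
    """Toy model of a table sorted by rowkey (an illustration, not HBase's API)."""

    def __init__(self):
        self._keys = []   # rowkeys kept in sorted order
        self._rows = {}   # rowkey -> row data

    def put(self, rowkey, row):
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)
        self._rows[rowkey] = row

    def get(self, rowkey):
        """Point lookup by rowkey; returns None if the row does not exist."""
        return self._rows.get(rowkey)

    def scan(self, start, stop):
        """Return (rowkey, row) pairs with start <= rowkey < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

index = RowKeyIndex()
for key in ["user#003", "user#001", "order#042", "user#002"]:
    index.put(key, {"id": key})

# '$' sorts immediately after '#', so this range covers every key with prefix "user#".
users = index.scan("user#", "user$")
```

Because related rows share a rowkey prefix, they sit next to each other in the sorted order, which is why rowkey design largely determines how cheap a scan is.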
data model
Column Family
- Column families physically colocate a set of columns and their values, often for performance reasons. Each **column family** has a set of **storage properties**, such as whether its values should be **cached** in memory, how its data is **compressed** or its row keys are **encoded**, and others. Each row in a table has the same column families, though a given row might not store anything in a given column family.
- Each column family in an HBase table has its own **Memstore**. This means that data modifications for different column families are stored in separate Memstores.
- The Memstore uses a **skip-list**-based data structure, which is optimized for both insertions and lookups. This allows for efficient write and read operations.
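To make the skip-list idea concrete, here is a minimal single-threaded skip list with one instance per column family. It is only a sketch: HBase's real Memstore is written in Java and handles concurrent writers, and the names `SkipList`, `memstores`, and the byte values are assumptions made for this example.

```python
import random

class SkipListNode:
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level  # forward[i] is the next node at level i

class SkipList:
    """Sorted map with expected O(log n) insert and lookup."""
    MAX_LEVEL = 8

    def __init__(self):
        self.head = SkipListNode(None, None, self.MAX_LEVEL)  # sentinel, key never compared
        self.level = 1

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def _find_predecessors(self, key):
        """For each level, find the last node whose key is smaller than `key`."""
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def put(self, key, value):
        update = self._find_predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value          # overwrite an existing key
            return
        new_level = self._random_level()
        self.level = max(self.level, new_level)
        node = SkipListNode(key, value, new_level)
        for i in range(new_level):           # splice the node in at each of its levels
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def get(self, key):
        candidate = self._find_predecessors(key)[0].forward[0]
        if candidate is not None and candidate.key == key:
            return candidate.value
        return None

    def items(self):
        """Iterate over (key, value) pairs in sorted key order."""
        node = self.head.forward[0]
        while node is not None:
            yield node.key, node.value
            node = node.forward[0]

# One in-memory sorted store per column family.
memstores = {"anchor": SkipList(), "contents": SkipList()}
memstores["anchor"].put(("com.cnn.www", "my.look.ca"), b"value-b")
memstores["anchor"].put(("com.cnn.www", "cnnsi.com"), b"value-a")
memstores["contents"].put(("com.cnn.www", "html"), b"value-c")
anchor_keys = [key for key, _ in memstores["anchor"].items()]
```

Entries come back in sorted order no matter what order they were written in, which is what lets an in-memory store be written out as a sorted file without a separate sort step.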
namespace
A namespace is a logical grouping of tables, analogous to a database in relational database systems.
cell
A {row, column, version} tuple exactly specifies a cell in HBase. Cell content is uninterpreted bytes.
operations
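- get
- put
- scan
- delete
- HBase uses the rowkey as its primary index and does not support multi-condition queries. Retrieving data by non-rowkey fields usually means running a distributed framework such as MapReduce, so latency is relatively high.
- To support both efficient lookups and complex queries that filter on conditions, a secondary index has to be built on top of HBase. Elasticsearch can store the HBase index information to enable complex, efficient queries.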
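The secondary-index idea can be sketched as a second mapping from a non-rowkey field back to rowkeys, kept in step with every put and delete. This is a toy model under stated assumptions: the in-memory dict `_by_field` stands in for an external index store such as Elasticsearch, and `IndexedTable`, the `city` field, and the sample rows are hypothetical.

```python
from collections import defaultdict

class IndexedTable:
    """Toy table with a rowkey primary index and one secondary index."""

    def __init__(self, indexed_field):
        self._indexed_field = indexed_field
        self._rows = {}                     # rowkey -> {column: value}
        self._by_field = defaultdict(set)   # field value -> set of rowkeys

    def put(self, rowkey, row):
        self.delete(rowkey)                 # drop stale index entries first
        self._rows[rowkey] = dict(row)
        if self._indexed_field in row:
            self._by_field[row[self._indexed_field]].add(rowkey)

    def delete(self, rowkey):
        old = self._rows.pop(rowkey, None)
        if old is not None and self._indexed_field in old:
            self._by_field[old[self._indexed_field]].discard(rowkey)

    def get(self, rowkey):
        return self._rows.get(rowkey)

    def find_by_full_scan(self, value):
        """Without a secondary index: every row has to be examined."""
        return sorted(k for k, row in self._rows.items()
                      if row.get(self._indexed_field) == value)

    def find_by_index(self, value):
        """With a secondary index: look up the matching rowkeys directly."""
        return sorted(self._by_field.get(value, ()))

users = IndexedTable(indexed_field="city")
users.put("u1", {"name": "Ann", "city": "Berlin"})
users.put("u2", {"name": "Bo", "city": "Paris"})
users.put("u3", {"name": "Cy", "city": "Berlin"})
users.put("u3", {"name": "Cy", "city": "Paris"})   # an update moves u3 in the index
users.delete("u2")

berlin = users.find_by_index("Berlin")
paris = users.find_by_index("Paris")
```

The index returns rowkeys, and the rows themselves are then fetched with ordinary gets. The cost is that every write touches two stores, so the index and the table have to be kept consistent.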
example
two rows (com.cnn.www and com.example.www) and three column families named contents, anchor, and people
- anchor:cnnsi.com, anchor:my.look.ca
- contents:html
- people:author
physically stored by column family
A request for the values of all columns in the row com.cnn.www, with no timestamp specified, returns the newest version of each column: the value of contents:html from timestamp t6, the value of anchor:cnnsi.com from timestamp t9, and the value of anchor:my.look.ca from timestamp t8.
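The per-family layout and the "newest version wins" read can be sketched with nested dicts. Timestamps 6, 9 and 8 mirror t6, t9 and t8 above; the older versions, the com.example.www cell, and all byte values are placeholders invented for this sketch, and `get_row` is a hypothetical helper, not an HBase call.

```python
# family -> {(row, qualifier): {timestamp: value}}
# Each family is a separate mapping, mirroring storage by column family.
store = {
    "contents": {
        ("com.cnn.www", "html"): {6: b"html-v3", 5: b"html-v2", 3: b"html-v1"},
    },
    "anchor": {
        ("com.cnn.www", "cnnsi.com"): {9: b"anchor-a"},
        ("com.cnn.www", "my.look.ca"): {8: b"anchor-b"},
    },
    "people": {
        # com.cnn.www stores nothing in this family; nothing is written for it.
        ("com.example.www", "author"): {5: b"author-a"},
    },
}

def get_row(store, row):
    """Return {column: (timestamp, value)} holding the newest version of each column."""
    result = {}
    for family, cells in store.items():
        for (cell_row, qualifier), versions in cells.items():
            if cell_row != row:
                continue
            newest = max(versions)  # largest timestamp is the latest version
            result[f"{family}:{qualifier}"] = (newest, versions[newest])
    return result

latest = get_row(store, "com.cnn.www")
```

Here `latest` contains three columns and no `people:` entry at all, since an empty cell is simply absent rather than stored as a null.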
bigtable
A Bigtable is a sparse, distributed, persistent multidimensional sorted map.
The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.
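A small sketch of the "sparse, multidimensional sorted map" part of that definition (a plain dict is of course neither distributed nor persistent). The row names, column names, and values are hypothetical, chosen only to show rows with different columns.

```python
table = {}  # (row key, column key, timestamp) -> uninterpreted bytes

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

put("row-b", "info:name", 2, b"\x01\x02")
put("row-a", "info:name", 1, b"alice")
put("row-a", "info:name", 3, b"alice2")
put("row-a", "stats:visits", 1, (7).to_bytes(4, "big"))  # the store never interprets bytes

def sorted_cells(table):
    """Return cells ordered by row, then column, then newest timestamp first."""
    return sorted(table.items(), key=lambda kv: (kv[0][0], kv[0][1], -kv[0][2]))

def columns_of(table, row):
    """Columns that actually hold data for a row; absent columns cost nothing."""
    return sorted({col for (r, col, _ts) in table if r == row})

ordered_keys = [key for key, _ in sorted_cells(table)]
```

`columns_of(table, "row-a")` and `columns_of(table, "row-b")` differ, which is the sparseness: a row only pays for the columns it actually writes.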
architecture
HBase vs Cassandra
zhuanlan.zhihu.com/p/346855324