Apache HBase is a distributed, scalable, non-relational, column-oriented database that runs on top of Hadoop and HDFS. It provides random, realtime read/write access to Big Data.

hbase vs hive

Both store their data on HDFS.

hive

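Hive queries are ultimately translated into MapReduce jobs, so query latency is relatively high.
Hive tables are purely logical: they are only table definitions. Hive itself neither stores nor computes anything and depends entirely on HDFS and MapReduce.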

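Use cases

Mainly used to build data warehouses on the Hadoop platform and to process massive data sets offline.
Hive provides a fairly complete SQL implementation (HiveQL) and is used for analyzing and mining historical data.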

hbase

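HBase serves real-time queries from the database itself rather than running MapReduce.
It has its own primary index, the rowkey. Lookups go through this index, so queries are fast.
Under the hood, data is read with scans.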
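
A minimal sketch of why rowkey lookups and range scans are cheap when rows are kept sorted by rowkey. This is a toy in-memory model, not the HBase client API; the `RowKeyTable` class and the `user#`/`video#` keys are made up for illustration.

```python
import bisect


class RowKeyTable:
    """Toy table that keeps rows sorted by rowkey (hypothetical helper)."""

    def __init__(self):
        self._keys = []  # sorted list of rowkeys (bytes)
        self._rows = {}  # rowkey -> row data

    def put(self, rowkey, data):
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)
        self._rows[rowkey] = data

    def get(self, rowkey):
        # Point lookup through the primary index (the rowkey).
        return self._rows.get(rowkey)

    def scan(self, start, stop):
        # Range scan over [start, stop): binary-search the start, then walk forward.
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            key = self._keys[i]
            yield key, self._rows[key]
            i += 1


table = RowKeyTable()
table.put(b"user#0003", {"name": "carol"})
table.put(b"user#0001", {"name": "alice"})
table.put(b"user#0002", {"name": "bob"})
table.put(b"video#0001", {"title": "intro"})

# "$" is the byte right after "#", so this range covers every "user#" key.
user_rows = list(table.scan(b"user#", b"user$"))
```

The scan touches only the keys inside the requested range, which is why rowkey design (putting the most common query prefix first) matters so much.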

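Use cases

Suited to real-time queries over big data and to storing massive data sets.
Real-time data warehouses: store dimension tables in HBase (a dimension table mainly consists of a primary key plus various dimension fields).
Tagging systems: store tag data in HBase.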

data model

Column Family

-   Column families physically colocate a set of columns and their values, often for performance reasons. Each **column family** has a set of **storage properties**, such as whether its values should be **cached** in memory, how its data is **compressed** or its row keys are **encoded**, and others. Each row in a table has the same column families, though a given row might not store anything in a given column family.


-   Each column family in an HBase table has its own **Memstore**. This means that data modifications for different column families are stored in separate Memstores.

-   The Memstore uses a **skip-list**-based data structure, which is optimized for both insertions and lookups. This allows for efficient write and read operations.
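
To make the skip-list idea concrete, here is a small sketch with one sorted in-memory structure per column family. It is a simplified, single-threaded illustration and not HBase's actual Memstore; the `SkipList` class, the `memstores` dict, and the cell values are assumptions.

```python
import random


class SkipList:
    """Minimal skip list mapping sortable keys to values."""

    MAX_LEVEL = 8

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        # Each node is [key, value, forward_pointers]; the head is a sentinel.
        self._head = [None, None, [None] * self.MAX_LEVEL]
        self._level = 1

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < 0.5:
            level += 1
        return level

    def put(self, key, value):
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        # Descend from the top level, remembering the predecessor at each level.
        for i in range(self._level - 1, -1, -1):
            while node[2][i] is not None and node[2][i][0] < key:
                node = node[2][i]
            update[i] = node
        nxt = node[2][0]
        if nxt is not None and nxt[0] == key:
            nxt[1] = value  # overwrite an existing key
            return
        level = self._random_level()
        self._level = max(self._level, level)
        new = [key, value, [None] * level]
        for i in range(level):
            new[2][i] = update[i][2][i]
            update[i][2][i] = new

    def get(self, key):
        node = self._head
        for i in range(self._level - 1, -1, -1):
            while node[2][i] is not None and node[2][i][0] < key:
                node = node[2][i]
        node = node[2][0]
        return node[1] if node is not None and node[0] == key else None

    def items(self):
        # Walk the bottom level, which holds every key in sorted order.
        node = self._head[2][0]
        while node is not None:
            yield node[0], node[1]
            node = node[2][0]


# One in-memory store per column family, keyed by (row, qualifier).
memstores = {"contents": SkipList(seed=1), "anchor": SkipList(seed=2)}


def put(row, family, qualifier, value):
    memstores[family].put((row, qualifier), value)


put(b"com.cnn.www", "anchor", b"my.look.ca", b"CNN.com")
put(b"com.cnn.www", "anchor", b"cnnsi.com", b"CNN")
put(b"com.cnn.www", "contents", b"html", b"<html>...")

anchor_keys = [key for key, _ in memstores["anchor"].items()]
```

Writes to `anchor` never touch the `contents` store, and iterating a store yields keys already in sorted order, which is what makes writing it out as a sorted file straightforward.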

namespace

A namespace is a logical grouping of tables, analogous to a database in relational database systems.

cell

A {row, column, version} tuple exactly specifies a cell in HBase. Cell content is uninterpreted bytes.

operations

get
put
scan
delete
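HBase uses the rowkey as its primary index and does not support multi-condition queries. Retrieving data by anything other than the rowkey usually requires a distributed framework such as MapReduce, which means relatively high latency.
To support both efficient lookups and complex queries with conditional filtering, build a secondary index on top of HBase. Elasticsearch can store the HBase index information to enable complex, efficient queries.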
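
A rough sketch of the secondary-index pattern: the index maps field values to rowkeys, and the rowkeys are then used for ordinary primary-key lookups. Plain dicts stand in for both HBase and Elasticsearch here; `primary`, `index`, `put`, `query`, and the sample fields are all hypothetical.

```python
from collections import defaultdict

# Primary store: rowkey -> row (stands in for an HBase table).
primary = {}
# Secondary index: (field, value) -> set of rowkeys (stands in for Elasticsearch).
index = defaultdict(set)


def put(rowkey, row):
    # Remove stale index entries before writing the new version of the row.
    old = primary.get(rowkey)
    if old is not None:
        for field, value in old.items():
            index[(field, value)].discard(rowkey)
    primary[rowkey] = row
    for field, value in row.items():
        index[(field, value)].add(rowkey)


def query(**conditions):
    """Return the sorted rowkeys whose rows match every field=value condition."""
    key_sets = [index.get((f, v), set()) for f, v in conditions.items()]
    if not key_sets:
        return []
    return sorted(set.intersection(*key_sets))


put("u1", {"city": "Beijing", "level": "gold"})
put("u2", {"city": "Beijing", "level": "silver"})
put("u3", {"city": "Shanghai", "level": "gold"})

# Step 1: the index resolves the conditions to rowkeys.
rowkeys = query(city="Beijing", level="gold")
# Step 2: fetch the full rows by rowkey from the primary store.
rows = [primary[k] for k in rowkeys]
```

The cost of this design is keeping the index in sync with the table on every write, which the `put` function above does inline for simplicity.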

example

two rows (com.cnn.www and com.example.www) and three column families named contents, anchor, and people

anchor:cnnsi.com, anchor:my.look.ca
contents:html
people:author

physically stored by column family

If no timestamp is specified, a request for the values of all columns in the row com.cnn.www returns the value of contents:html from timestamp t6, the value of anchor:cnnsi.com from timestamp t9, and the value of anchor:my.look.ca from timestamp t8.
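
The version-resolution rule can be sketched as follows. Integers stand in for the timestamps t3..t9; the older versions of contents:html (t5, t3) and all cell values are assumptions added for illustration, and `get_row` is a hypothetical helper, not an HBase call.

```python
# row -> column -> {timestamp: value}
table = {
    "com.cnn.www": {
        "contents:html": {6: "<html>v3", 5: "<html>v2", 3: "<html>v1"},
        "anchor:cnnsi.com": {9: "CNN"},
        "anchor:my.look.ca": {8: "CNN.com"},
    },
}


def get_row(row, max_ts=None):
    """Return {column: (timestamp, value)} using the newest version per column.

    If max_ts is given, only versions with timestamp <= max_ts are considered.
    """
    result = {}
    for column, versions in table.get(row, {}).items():
        eligible = [ts for ts in versions if max_ts is None or ts <= max_ts]
        if eligible:
            ts = max(eligible)
            result[column] = (ts, versions[ts])
    return result


latest = get_row("com.cnn.www")
as_of_t5 = get_row("com.cnn.www", max_ts=5)
```

Without a timestamp, each column resolves independently to its own newest version, which is why the three columns come back from three different timestamps.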

bigtable

dzone.com/articles/un…

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.

The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.
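
One way to picture the "sparse, sorted, multidimensional map" is a flat map keyed by (row, column, timestamp) with byte values. The sketch below assumes that layout and orders newer timestamps first within a column; the `cells` dict, the helper names, and the cell values are illustrative only.

```python
cells = {}  # (row, column, timestamp) -> uninterpreted bytes


def put(row, column, timestamp, value):
    cells[(row, column, timestamp)] = value


def sorted_cells():
    # Sort by row key, then column key, then newest timestamp first.
    return sorted(cells.items(), key=lambda kv: (kv[0][0], kv[0][1], -kv[0][2]))


put("com.example.www", "people:author", 5, b"John Doe")
put("com.cnn.www", "contents:html", 5, b"<html>old")
put("com.cnn.www", "contents:html", 6, b"<html>new")
put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")

ordered_keys = [key for key, _ in sorted_cells()]

# Sparse: only cells that were written exist. com.example.www has no
# contents:html entry at all, rather than an empty placeholder.
columns_by_row = {}
for row, column, _ in cells:
    columns_by_row.setdefault(row, set()).add(column)
```

Because absent cells take no space, two rows in the same table can carry completely different sets of columns without any storage penalty.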

architecture

HBase vs Cassandra

zhuanlan.zhihu.com/p/346855324