hbase


Apache HBase is a distributed, scalable, non-relational, column-oriented database that runs on top of Hadoop and HDFS. It provides random, realtime read/write access to Big Data.

hbase vs hive

Both store their data on HDFS.

hive

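  • Hive queries are ultimately compiled into MapReduce jobs, so query latency is relatively high
  • Hive tables are purely logical: a table is only a definition. Hive itself neither stores nor computes anything and relies entirely on HDFS/MapReduce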

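Use cases

  • Mainly used to build data warehouses on the Hadoop platform for offline processing of massive datasets
  • Hive provides a fairly complete SQL dialect (HiveQL) for analyzing and mining historical data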

hbase

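  • HBase serves real-time queries from the database itself instead of running MapReduce
  • It has its own primary index, the rowkey; queries go through this index, so lookups are fast
  • Under the hood, data is read by scanning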
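
The rowkey lookup and scan described above can be sketched in plain Python. This is a toy model, not the HBase client API: `ToyTable`, `put`, `get`, and `scan` are hypothetical names, and a sorted list with binary search stands in for HBase's sorted storage.

```python
import bisect


class ToyTable:
    """Toy model of a table whose rows are kept sorted by rowkey."""

    def __init__(self):
        self._keys = []   # rowkeys, always kept in sorted order
        self._rows = {}   # rowkey -> {column: value}

    def put(self, rowkey, column, value):
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)  # keep the key list sorted
            self._rows[rowkey] = {}
        self._rows[rowkey][column] = value

    def get(self, rowkey):
        # Point lookup by rowkey; returns None if the row does not exist.
        return self._rows.get(rowkey)

    def scan(self, start, stop):
        # Range scan over [start, stop): binary-search to the first key,
        # then walk the keys in sorted order.
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            key = self._keys[i]
            yield key, self._rows[key]
            i += 1


table = ToyTable()
table.put("user#002", "info:name", "bob")
table.put("user#001", "info:name", "alice")
table.put("order#001", "info:total", "42")

print(table.get("user#001"))
print([key for key, _ in table.scan("user#", "user#~")])
```

Because the keys are sorted, a prefix query such as "all rows starting with `user#`" becomes a contiguous range scan, which is why rowkey design matters so much in practice.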

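Use cases

  • Real-time queries over big data, plus storage of massive datasets
  • Real-time data warehouses: store dimension tables in HBase (a dimension table mainly consists of a primary key and various dimension fields)
  • Tagging systems: store tag data in HBase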

data model

Column Family

-   Column families physically colocate a set of columns and their values, often for performance reasons. Each **column family** has a set of **storage properties**, such as whether its values should be **cached** in memory, how its data is **compressed** or its row keys are **encoded**, and others. Each row in a table has the same column families, though a given row might not store anything in a given column family.


-   Within each region, every column family of an HBase table has its own **Memstore**. This means that data modifications for different column families are buffered in separate Memstores.

-   The Memstore uses a **skip-list**-based data structure, which is optimized for both insertions and lookups. This allows for efficient write and read operations.
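
To make the skip-list idea concrete, here is a minimal single-threaded sketch with one sorted in-memory buffer per column family. It is not HBase's implementation: `SkipList`, `memstores`, and `put` are hypothetical names, the level cap and the fixed random seed are arbitrary choices, and the sample values are illustrative.

```python
import random


class _Node:
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level  # next node at each level


class SkipList:
    MAX_LEVEL = 8

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._head = _Node(None, None, self.MAX_LEVEL)
        self._level = 1

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < 0.5:
            level += 1
        return level

    def _find_predecessors(self, key):
        # For each level, the last node whose key is smaller than `key`.
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in reversed(range(self._level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def insert(self, key, value):
        update = self._find_predecessors(key)
        nxt = update[0].forward[0]
        if nxt is not None and nxt.key == key:
            nxt.value = value  # overwrite an existing key
            return
        level = self._random_level()
        self._level = max(self._level, level)
        node = _Node(key, value, level)
        for i in range(level):
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def get(self, key):
        nxt = self._find_predecessors(key)[0].forward[0]
        return nxt.value if nxt is not None and nxt.key == key else None

    def items(self):
        # Walk the bottom level: entries come out in sorted key order.
        node = self._head.forward[0]
        while node is not None:
            yield node.key, node.value
            node = node.forward[0]


# One sorted buffer per column family (hypothetical layout).
memstores = {"contents": SkipList(), "anchor": SkipList()}


def put(row, family, qualifier, value):
    memstores[family].insert((row, qualifier), value)


put("com.cnn.www", "contents", "html", b"<html>...")
put("com.cnn.www", "anchor", "my.look.ca", b"CNN.com")
put("com.cnn.www", "anchor", "cnnsi.com", b"CNN")

print([key for key, _ in memstores["anchor"].items()])
```

Writes to `anchor` never touch the `contents` buffer, and iterating a buffer yields entries already sorted by key, which is what makes flushing a Memstore to a sorted file cheap.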

namespace

A namespace is a logical grouping of tables, analogous to a database in relational database systems.

cell

A {row, column, version} tuple exactly specifies a cell in HBase. Cell content is uninterpreted bytes.
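
A small sketch of what "uninterpreted bytes" means for the caller. The store below is a plain dictionary keyed by the (row, column, version) tuple; `cells` and `put_cell` are hypothetical names, and the encodings (big-endian 64-bit integer, UTF-8 string) are choices made by the application, not by the store.

```python
import struct

# (row, column, version) -> raw bytes; the store never looks inside the value.
cells = {}


def put_cell(row, column, version, value):
    if not isinstance(value, bytes):
        raise TypeError("cell values must be bytes")
    cells[(row, column, version)] = value


# The application decides how to encode each value.
put_cell("row1", "cf:count", 1, struct.pack(">q", 41))
put_cell("row1", "cf:count", 2, struct.pack(">q", 42))
put_cell("row1", "cf:name", 1, "hello".encode("utf-8"))

# ...and must apply the matching decoding when reading it back.
raw = cells[("row1", "cf:count", 2)]
count = struct.unpack(">q", raw)[0]
name = cells[("row1", "cf:name", 1)].decode("utf-8")

print(count, name)
```

Two versions of `cf:count` coexist because the version is part of the cell's coordinates, not a property of the value.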

operations

  • get

  • put

  • scan

  • delete

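  • HBase uses the rowkey as its primary index and does not support multi-condition queries. Searching or filtering on non-rowkey fields usually requires a distributed framework such as MapReduce, which means high latency

  • To support both efficient lookups and complex queries with condition filters, a secondary index has to be built on top of HBase. One option is to store the HBase index information in Elasticsearch, which enables complex queries that are still fast.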
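
The secondary-index pattern can be sketched without either system. In this toy version a dictionary stands in for the HBase table and an inverted map stands in for the external index (Elasticsearch in the note above); `primary`, `index`, `put`, and `query` are hypothetical names, and unlike HBase this `put` replaces the whole row.

```python
from collections import defaultdict

primary = {}               # rowkey -> {column: value}; stands in for the HBase table
index = defaultdict(set)   # (column, value) -> rowkeys; stands in for the external index


def put(rowkey, columns):
    # Remove stale index entries if the row is being overwritten.
    for col, val in primary.get(rowkey, {}).items():
        index[(col, val)].discard(rowkey)
    primary[rowkey] = dict(columns)
    # Index every column value so it can be searched later.
    for col, val in columns.items():
        index[(col, val)].add(rowkey)


def query(**conditions):
    # Step 1: resolve each condition to a set of rowkeys via the index.
    sets = [index.get((col, val), set()) for col, val in conditions.items()]
    rowkeys = set.intersection(*sets) if sets else set()
    # Step 2: fetch the full rows by rowkey, the fast path in HBase.
    return {rk: primary[rk] for rk in sorted(rowkeys)}


put("u001", {"city": "Beijing", "level": "gold"})
put("u002", {"city": "Shanghai", "level": "gold"})
put("u003", {"city": "Beijing", "level": "silver"})

print(query(city="Beijing", level="gold"))
```

The multi-condition part is answered entirely by the index; the table is only touched for rowkey gets. The cost is that every write has to update both stores and keep them consistent.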

example

A table with two rows (com.cnn.www and com.example.www) and three column families named contents, anchor, and people:

  • anchor:cnnsi.com, anchor:my.look.ca
  • contents:html
  • people:author

image.png

physically stored by column family

If no timestamp is specified, a request for the values of all columns in the row com.cnn.www returns the most recent version of each column: the value of contents:html from timestamp t6, the value of anchor:cnnsi.com from timestamp t9, and the value of anchor:my.look.ca from timestamp t8.
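
The "latest version wins" rule can be reproduced with nested dictionaries. Only the timestamps t6, t9, and t8 come from the example above; the older versions of contents:html, the com.example.www cells, and all cell values are assumed for illustration. `table` and `get_row` are hypothetical names.

```python
# row -> column -> {timestamp: value}
table = {
    "com.cnn.www": {
        "contents:html": {6: b"<html>v3", 5: b"<html>v2", 3: b"<html>v1"},
        "anchor:cnnsi.com": {9: b"CNN"},
        "anchor:my.look.ca": {8: b"CNN.com"},
    },
    "com.example.www": {
        "contents:html": {5: b"<html>"},
        "people:author": {5: b"John Doe"},
    },
}


def get_row(row, max_ts=None):
    """Return {column: (timestamp, value)} with the newest version of each
    column, optionally restricted to versions at or before max_ts."""
    result = {}
    for column, versions in table.get(row, {}).items():
        eligible = [ts for ts in versions if max_ts is None or ts <= max_ts]
        if eligible:
            ts = max(eligible)
            result[column] = (ts, versions[ts])
    return result


latest = get_row("com.cnn.www")
for column, (ts, value) in sorted(latest.items()):
    print(column, "t%d" % ts, value)
```

Calling `get_row("com.cnn.www", max_ts=5)` returns only contents:html at t5, since neither anchor column has a version that old.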

bigtable

dzone.com/articles/un…

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.

The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.
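
A rough sketch of the "sparse, sorted map" idea: every stored cell is one entry keyed by (row key, column key, timestamp), and absent columns simply have no entry. The sample entries and the names `entries`, `sort_key`, and `columns_per_row` are illustrative; ordering versions newest-first within a column is an assumption of this sketch.

```python
# Each entry: ((row key, column key, timestamp), value as raw bytes)
entries = [
    (("com.example.www", "people:author", 5), b"John Doe"),
    (("com.cnn.www", "contents:html", 3), b"v1"),
    (("com.cnn.www", "anchor:cnnsi.com", 9), b"CNN"),
    (("com.cnn.www", "contents:html", 6), b"v2"),
]


def sort_key(entry):
    (row, column, ts), _value = entry
    # Rows and columns ascending; newer timestamps first within a column.
    return (row, column, -ts)


sorted_map = sorted(entries, key=sort_key)

# Sparse: each row only "has" the columns it actually stored.
columns_per_row = {}
for (row, column, _ts), _value in sorted_map:
    columns_per_row.setdefault(row, set()).add(column)

for key, value in sorted_map:
    print(key, value)
print(columns_per_row)
```

Nothing is stored for com.example.www under anchor, and nothing for com.cnn.www under people; the rows differ in shape at no storage cost.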

architecture

image.png

HBase vs Cassandra

zhuanlan.zhihu.com/p/346855324

image.png

image.png