[Armchair Theory] HBase Class Notes

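1. Physical model

HBase stores all data as uninterpreted byte arrays (the shell displays them as strings), and empty (null) cells take up no storage space.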

2. Actual storage layout

Each Region consists of one or more Stores, and each Store holds exactly one column family.
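
To make the layout concrete, here is a minimal Python sketch. It is a toy model, not how HBase is implemented: the `Region` and `Store` classes, the `put` helper, and `stored_cell_count` are names made up for illustration, and values are plain strings rather than byte arrays. It shows one Store per column family, and that a cell that was never written has no entry at all.

```python
class Store:
    """One Store holds the cells of exactly one column family."""

    def __init__(self, family):
        self.family = family
        # (row_key, qualifier) -> value; cells never written have no entry
        self.cells = {}


class Region:
    """A Region groups one Store per column family."""

    def __init__(self, families):
        self.stores = {family: Store(family) for family in families}

    def put(self, row_key, column, value):
        family, qualifier = column.split(":", 1)
        self.stores[family].cells[(row_key, qualifier)] = value

    def stored_cell_count(self):
        return sum(len(store.cells) for store in self.stores.values())


region = Region(["StuInfo", "Grades"])
region.put("001", "StuInfo:Name", "alice")
region.put("001", "Grades:Math", "90")
region.put("002", "StuInfo:Name", "nancy")  # row 002 has no Grades:Math cell
```

Row 002 simply has no `Grades:Math` entry, so only three cells are stored in total.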

3. hbase shell

Start the hbase shell command-line tool:

$ hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.2.6, rUnknown, Mon May 29 02:25:32 CDT 2017

hbase(main):001:0>

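3.1 Data definition

Command	Description
`create`	Create a new table with the given schema
`alter`	Change a table's structure, e.g. add a new column family
`describe`	Show a table's structure, including its column families and their attributes
`list`	List the tables that exist in HBase
`disable`/`enable`	Disable a table before dropping or altering it (`disable`); re-enable it once the change is done (`enable`)
`disable_all`	Disable all tables whose names match a given regular expression
`is_disabled`	Check whether a table is disabled
`drop`	Delete a table
`truncate`	Delete the data but keep the table structure (internally: disable the table > drop it > recreate it)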

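3.1.1 Creating a table

① create:

$ create 'table_name', 'column_family_name'

Notes:

The table name and at least one column family are required;
Multiple column families can be created at once;
Parameters can be specified for each column family;
Parameter names are case-sensitive;
String arguments must be enclosed in single quotes.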

Example 1:

$ create 'Student', 'StuInfo', 'Grades'

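Example 2 (column family names are case-sensitive, so `StuInfo` and `stuInfo` are two distinct column families):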

$ create 'Student', 'StuInfo', 'stuInfo'

Example 3 (set the maximum region file size to 134217728 bytes, i.e. 128 MB):

$ create 'Student', 'StuInfo', 'Grades', MAX_FILESIZE=>'134217728'

Example 4 (keep up to 5 versions per cell in the `Grades` column family):

$ create 'Student', {NAME=>'Grades', VERSIONS=>5, BLOCKCACHE=>true}

3.1.2 Table operations

① exists: check whether a table exists

$ exists 'Student'

# Result
Table Student does exist
0 row(s) in 0.2080 seconds

② list: list the names of all existing tables

$ list

# Result
TABLE
Student
Hbase thrift
Student
Test
8 row(s) in 0.0560 seconds

③ describe: show the column families of a table and their parameters

$ describe 'Student'

# Result
Table Student is ENABLED
Student
COLUMN FAMILIES DESCRIPTION
{NAME => 'Grades', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'StuInfo', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.0340 seconds

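④ alter: change a table's structure

What it can do:

Change the parameters of a column family

# alter 'table_name', {column family name and its new parameters}
$ alter 'Student', {NAME => 'Grades', VERSIONS => 3}

Add a column family

# alter 'table_name', 'new_column_family'
$ alter 'Student', 'hobby'

Remove an existing column family

# alter 'table_name', 'delete' => 'column_family'
$ alter 'Student', 'delete' => 'hobby'

Note: a column family can only be deleted while the table has at least two column families, because a table must always keep at least one.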

⑤ drop: delete a table (equivalent to the first two steps of truncate)

Note: a table must be disabled before it can be dropped!

# Drop the Student table
$ disable 'Student'
$ is_disabled 'Student'
$ drop 'Student'

⑥ truncate: disable the table, drop it, and recreate it with its original structure

$ truncate 'Student'

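3.2 Data manipulation

Command	Description
`put`	Write a value to the specified cell
`delete`	Delete the data of a column family or column in a row
`get`	Fetch a row or cell by table name, row key, and other parameters
`scan`	Scan a table and output the rows that satisfy the given conditions (a row key range or filters can be specified)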

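3.2.1 put

Syntax:

$ put 'table_name', 'row_key', 'column_family:qualifier', 'cell_value', timestamp

Notes:

The table name, row key, column family, and qualifier are required (the timestamp may be omitted);
Arguments are case-sensitive, and strings use single quotes;
Each put writes a single cell;
If the specified cell already exists, put updates it;
A cell keeps up to n versions of its data, where n is the column family's VERSIONS=>n setting.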

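Example (writes the `StuInfo:Name` cell with timestamp 1 that appears in the `get` output of section 3.2.3):

$ put 'Student', '001', 'StuInfo:Name', 'alice', 1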
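
The update and versioning rules for put can be sketched in Python. This is a toy model under stated assumptions: the `Cell` class and its methods are made up for illustration, and it trims old versions eagerly on every write, whereas real HBase cleans up excess versions lazily (during compaction).

```python
class Cell:
    """Toy cell that keeps at most max_versions values."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.versions = {}  # timestamp -> value

    def put(self, value, timestamp):
        # Writing to an existing timestamp overwrites that version.
        self.versions[timestamp] = value
        # Drop the oldest versions beyond the configured limit.
        for ts in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[ts]

    def get(self, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return sorted(self.versions.items(), reverse=True)[:versions]


cell = Cell(max_versions=3)
for timestamp, value in enumerate(["a", "b", "c", "d"], start=1):
    cell.put(value, timestamp)

latest = cell.get()
history = cell.get(versions=3)
```

After four writes with `max_versions=3`, the oldest version (timestamp 1) is gone, and a plain `get()` returns only the newest one.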

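3.2.2 delete

Syntax:

$ delete 'table_name', 'row_key', 'column_family<:qualifier>', <timestamp>

Notes:

The table name, row key, and column family are required; the qualifier and timestamp are optional;
The smallest unit delete can remove is a cell, and a single delete cannot span column families.

Examples:

# Delete all data under column family Grades for row key 001 in table Student
$ delete 'Student', '001', 'Grades'

# Delete only the Math column under column family Grades for row key 001 in table Student
$ delete 'Student', '001', 'Grades:Math'

# In the Grades:Math cell of row key 001 in table Student, every version with timestamp <= 2 is deleted.
# Note that this does not delete only the version whose timestamp equals 2!
$ delete 'Student', '001', 'Grades:Math', 2

# Delete the whole row identified by the row key; multiple rows cannot be deleted at once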
$ deleteall 'Student', '001'

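Note: delete does not remove the data immediately. It only writes a delete marker (tombstone) for the affected data; the data is physically removed later, when the store files are merged in a major compaction!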
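
The tombstone behaviour, together with the "timestamp <= 2" rule from the example above, can be modelled in a few lines of Python. This is only a sketch: the `Column` class and the `major_compact` method are hypothetical names, and a real store keeps its data in files rather than in a dict.

```python
class Column:
    """Toy model of one column: versions plus delete markers (tombstones)."""

    def __init__(self):
        self.versions = {}    # timestamp -> value
        self.tombstones = []  # each marker masks versions with ts <= marker

    def put(self, value, timestamp):
        self.versions[timestamp] = value

    def delete(self, timestamp):
        # Nothing is erased here; only a marker is recorded.
        self.tombstones.append(timestamp)

    def visible(self):
        """Versions a read would return: those not masked by a tombstone."""
        cutoff = max(self.tombstones, default=None)
        return {ts: value for ts, value in self.versions.items()
                if cutoff is None or ts > cutoff}

    def major_compact(self):
        """Physically drop masked versions and the tombstones themselves."""
        self.versions = self.visible()
        self.tombstones = []


math = Column()
math.put("70", 1)
math.put("80", 2)
math.put("90", 3)
math.delete(2)  # like: delete 'Student', '001', 'Grades:Math', 2

visible_before = math.visible()
stored_before = len(math.versions)
math.major_compact()
stored_after = len(math.versions)
```

Before compaction all three versions are still stored even though only the newest is visible; afterwards only the visible one remains.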

3.2.3 get

Fetch one row by its row key. Syntax:

$ get 'table_name', 'row_key', <'column_family:qualifier', timestamp>

Example 1:

$ get 'Student', '001'

# Result
COLUMN                                        CELL
Grades:English                             timestamp=1541039459116, value=80
Grades:Math                                timestamp=1541039459299, value=90
StuInfo:Age                                timestamp=1541039335956, value=18
StuInfo:Name                               timestamp=1, value=alice
StuInfo:Sex                                timestamp=1541039336280, value=female
5 row(s) in 0.0870 seconds

Example 2:

$ get 'Student', '001', {COLUMN=>'Grades'}

# Result
COLUMN                                        CELL
Grades:English                             timestamp=1541039459116, value=80
Grades:Math                                timestamp=1541039459299, value=90
2 row(s) in 0.0870 seconds

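Example 3:

# Return cells of column family Grades whose timestamp falls in the half-open range [1, 2)
$ get 'Student', '001', {COLUMN=>'Grades', TIMERANGE=>[1,2]}
# Unlike TIMERANGE, VERSIONS limits how many versions are returned: here, up to the 2 newest versions of each cell in StuInfo
$ get 'Student', '001', {COLUMN=>'StuInfo', VERSIONS=>2}
```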
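
What TIMERANGE and a version limit each select can be compared with a small Python model. This is a sketch, not the shell's implementation: `get_versions` is a made-up helper, and the three sample versions are hypothetical. TIMERANGE picks versions by timestamp from a half-open interval, while a version limit picks the newest N regardless of timestamp.

```python
def get_versions(versions, time_range=None, max_versions=1):
    """versions: dict timestamp -> value. Returns (ts, value) pairs, newest first."""
    selected = sorted(versions.items(), reverse=True)
    if time_range is not None:
        low, high = time_range
        # The range is half-open: low is included, high is excluded.
        selected = [(ts, value) for ts, value in selected if low <= ts < high]
    return selected[:max_versions]


# Hypothetical history of one cell: timestamp -> value
name_versions = {1: "alice", 2: "alicia", 3: "ali"}

by_range = get_versions(name_versions, time_range=(1, 2))
by_count = get_versions(name_versions, max_versions=2)
```

With this data the time range returns only the version at timestamp 1, while the version limit returns the two newest versions (timestamps 3 and 2).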

3.2.4 scan

Syntax:

$ scan 'table_name', {<'column_family:qualifier', timestamp>}

(1) Scan the whole table

$ scan 'Student'

(2) Restrict to a column family

$ scan 'Student', {COLUMN=>'StuInfo'}

(3) Restrict to a column family and column

$ scan 'Student', {COLUMN=>'StuInfo:Name'}

(4) Limit the number of rows returned

$ scan 'Student', {LIMIT=>1}

(5) Restrict to a row key range

$ scan 'Student', {STARTROW=>'001', ENDROW=>'003'}

(6) Combine several conditions

$ scan 'Student', {COLUMN=>'StuInfo', STARTROW=>'001', ENDROW=>'002'}

4. Filters

Filters select data by row key, column family, column, version, and other conditions, much like the WHERE clause in SQL!

$ show_filters
# Result
DependentColumnFilter
KeyOnlyFilter
ColumnCountGetFilter
SingleColumnValueFilter
PrefixFilter
SingleColumnValueExcludeFilter
FirstKeyOnlyFilter
ColumnRangeFilter
TimestampsFilter
FamilyFilter
QualifierFilter
ColumnPrefixFilter
RowFilter
MultipleColumnPrefixFilter
InclusiveStopFilter
PageFilter
ValueFilter
ColumnPaginationFilter

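Filter syntax:

$ scan/get 'table_name',{FILTER=>"FilterName(compare_operator,'comparator_type:value')"}

Comparison operator	Description
=	Equal to
>	Greater than
>=	Greater than or equal to
<	Less than
<=	Less than or equal to
!=	Not equal to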

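Comparator	Description
`BinaryComparator`	Compares against the complete byte array
`BinaryPrefixComparator`	Compares against a prefix of the byte array
`BitComparator`	Compares bit by bit
`NullComparator`	Matches null values
`RegexStringComparator`	Matches a regular expression
`SubstringComparator`	Matches a substring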
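
In filter strings, a comparator is written as a type prefix plus a value, for example `binary:002` or `substring:0`. The Python sketch below models how four of these prefixes behave under the `=` operator. It is an approximation for illustration only: `matches` is a made-up helper, it works on Python strings rather than byte arrays, and it uses Python's `re` module in place of Java regular expressions.

```python
import re


def matches(comparator, data):
    """Evaluate an '=' comparison for a 'type:value' comparator string."""
    kind, _, value = comparator.partition(":")
    if kind == "binary":
        return data == value                   # whole value must be identical
    if kind == "binaryprefix":
        return data.startswith(value)          # only the beginning is compared
    if kind == "substring":
        return value.lower() in data.lower()   # anywhere, case-insensitive
    if kind == "regexstring":
        return re.search(value, data) is not None
    raise ValueError(f"unsupported comparator type: {kind}")


contains_zero = matches("substring:0", "100")
starts_with_zero = matches("binaryprefix:0", "100")
```

The difference that matters most in the examples that follow: `substring` matches anywhere in the data, `binaryprefix` only at the start, and `binary` only when the whole value is identical.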

The following sections show how to use some common filters!

4.1 `RowFilter`

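⭐Compares and filters on the row key!

Example 1: show key-value pairs whose row key contains 0 (`substring` matches anywhere in the key, not only at the start)

$ scan 'Student', {FILTER=>"RowFilter(=,'substring:0')"}

Example 2: show key-value pairs whose row key is greater than 002 in byte order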

$ scan 'Student', {FILTER=>"RowFilter(>,'binary:002')"}
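
"Greater in byte order" means a lexicographic comparison of the raw bytes, not a numeric one. A small Python sketch of the idea follows; `row_filter_greater` is a made-up helper, and the keys `10` and `9` are hypothetical additions that are not in the Student table.

```python
def row_filter_greater(row_keys, threshold):
    """Keep row keys that sort strictly after threshold, in byte order."""
    return [key for key in sorted(row_keys) if key > threshold]


# b"10" and b"9" are hypothetical row keys added to show the effect.
row_keys = [b"10", b"9", b"002", b"003"]
matched = row_filter_greater(row_keys, b"002")
```

Here `9` sorts after `10` because the first bytes are compared first, which is why row keys holding numbers are usually zero-padded to a fixed width, as in `001`, `002`, `003`.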

4.2 `PrefixFilter`

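⭐Row key prefix filter: a simpler way to match a row key prefix; it only supports equality (prefix) matching!

Example 1: show key-value pairs whose row key starts with 0

$ scan 'Student', {FILTER=>"PrefixFilter('0')"}
# Equivalent to the following RowFilter statement
$ scan 'Student', {FILTER=>"RowFilter(=,'binaryprefix:0')"}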

4.3 `KeyOnlyFilter`

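⭐Returns only the key part of each cell and leaves out the value. Here the key means the row key, column family, column, and timestamp, and the value means the cell value. Because no values are returned, the scan transfers less data than one that returns full cells!

Example 1: show all keys without their values

$ scan 'Student', {FILTER=>"KeyOnlyFilter()"}

# Result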
001                   column=grades:englisg, timestamp=1541485306878, value=
001                   column=grades:math, timestamp=1541485384199, value=
001                   column=stuinfo:age, timestamp=1541485224974, value=
...

4.4 `FirstKeyOnlyFilter`

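⭐Returns only the first cell of each row, with both its key and value, and skips the remaining cells that share the same row key. This makes it a somewhat special filter: it can be used to count the distinct row keys in a table, much like the count command!

Example 1: count the logical rows of a table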

$ scan 'Student',{FILTER=>"FirstKeyOnlyFilter()"}
ROW                    COLUMN+CELL
001                   column=grades:englisg, timestamp=1541485306878, value=80
002                   column=grades:bigdata, timestamp=1541485403649, value=88
003                   column=grades:bigdata, timestamp=1541485412686, value=80
3 row(s) in 0.0400 seconds
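
Why this filter works for counting can be seen in a short Python sketch. `first_key_only` is a made-up helper, and the cell list is a subset of the rows shown above, in scan order.

```python
def first_key_only(cells):
    """cells: (row_key, column, value) tuples in scan order.

    Keeps the first cell of every row and skips the rest of that row.
    """
    seen_rows = set()
    first_cells = []
    for row_key, column, value in cells:
        if row_key in seen_rows:
            continue  # later cells of an already-seen row are skipped
        seen_rows.add(row_key)
        first_cells.append((row_key, column, value))
    return first_cells


cells = [
    ("001", "grades:englisg", "80"),
    ("001", "grades:math", "90"),
    ("002", "grades:bigdata", "88"),
    ("002", "grades:englisg", "85"),
    ("003", "grades:bigdata", "80"),
]
row_count = len(first_key_only(cells))
```

Since exactly one cell survives per row, the number of returned cells equals the number of logical rows.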

4.5 `InclusiveStopFilter`

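⭐Replaces ENDROW and also returns the stop row. A scan bounded by STARTROW and ENDROW includes the STARTROW row but excludes the ENDROW row; InclusiveStopFilter can take the place of ENDROW and includes the stop row as well!

Example 1: show the records from row key 001 through row key 003

$ scan 'Student', {STARTROW=>'001',FILTER=>"InclusiveStopFilter('003')"}
# For this table, where no row key sorts between 003 and 004, this is equivalent to
$ scan 'Student', {STARTROW=>'001', ENDROW=>'004'}

# Result
ROW                    COLUMN+CELL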
001                   column=grades:englisg, timestamp=1541485306878, value=80
001                   column=grades:math, timestamp=1541485384199, value=90
001                   column=stuinfo:age, timestamp=1541485224974, value=18
001                   column=stuinfo:name, timestamp=1541485170696, value=alice
001                   column=stuinfo:sex, timestamp=1541485235855, value=female
002                   column=grades:bigdata, timestamp=1541485403649, value=88
002                   column=grades:englisg, timestamp=1541485319132, value=85
002                   column=grades:math, timestamp=1541485376414, value=78
002                   column=stuinfo:class, timestamp=1541485278646, value=1802
002                   column=stuinfo:name, timestamp=1541485187403, value=nancy
002                   column=stuinfo:sex, timestamp=1541485245291, value=male
003                   column=grades:bigdata, timestamp=1541485412686, value=80
003                   column=grades:englisg, timestamp=1541485328260, value=90
003                   column=grades:math, timestamp=1541485368087, value=80           
003                   column=stuinfo:age, timestamp=1541485209410, value=19           
003                   column=stuinfo:class, timestamp=1541485271479, value=1803       
003                   column=stuinfo:name, timestamp=1541485198223, value=harry
003                   column=stuinfo:sex, timestamp=1541485253075, value=male
3 row(s) in 0.1010 seconds
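
The difference between an exclusive ENDROW and an inclusive stop row can be sketched in Python over a sorted list of row keys. `scan_rows` is a made-up helper, and the key `0035` is a hypothetical row used only to show where the two forms diverge.

```python
from bisect import bisect_left, bisect_right


def scan_rows(sorted_keys, start, stop, inclusive_stop=False):
    """Return the row keys from start up to stop, with stop optionally included."""
    low = bisect_left(sorted_keys, start)  # the start row is always included
    if inclusive_stop:
        high = bisect_right(sorted_keys, stop)
    else:
        high = bisect_left(sorted_keys, stop)
    return sorted_keys[low:high]


keys = ["001", "002", "003", "004"]
exclusive = scan_rows(keys, "001", "003")
inclusive = scan_rows(keys, "001", "003", inclusive_stop=True)
next_key_stop = scan_rows(keys, "001", "004")

# With a hypothetical row "0035", stopping at "004" is no longer the same thing.
keys_with_extra = ["001", "002", "003", "0035", "004"]
extra_exclusive = scan_rows(keys_with_extra, "001", "004")
extra_inclusive = scan_rows(keys_with_extra, "001", "003", inclusive_stop=True)
```

Using the next key as ENDROW gives the same rows as an inclusive stop only as long as no other key sorts in between.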

4.6 `FamilyFilter`

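⭐Compares and filters on the column family!

Example 1: show key-value pairs whose column family starts with stu

$ scan 'Student', {FILTER=>"FamilyFilter(=,'substring:stu')"}
# Or, matching strictly on the prefix
$ scan 'Student', {FILTER=>"FamilyFilter(=,'binaryprefix:stu')"}

# Result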
ROW                    COLUMN+CELL
001                   column=stuinfo:age, timestamp=1541485224974, value=18
001                   column=stuinfo:name, timestamp=1541485170696, value=alice
001                   column=stuinfo:sex, timestamp=1541485235855, value=female
002                   column=stuinfo:class, timestamp=1541485278646, value=1802
002                   column=stuinfo:name, timestamp=1541485187403, value=nancy
002                   column=stuinfo:sex, timestamp=1541485245291, value=male
003                   column=stuinfo:age, timestamp=1541485209410, value=19
003                   column=stuinfo:class, timestamp=1541485271479, value=1803
003                   column=stuinfo:name, timestamp=1541485198223, value=harry
003                   column=stuinfo:sex, timestamp=1541485253075, value=male
3 row(s) in 0.4470 seconds

4.7 `QualifierFilter`

⭐Filters on the column name (qualifier)!

Example 1: show the records whose column name contains name

$ scan 'Student', {FILTER=>"QualifierFilter(=,'substring:name')"}

# Result
ROW                    COLUMN+CELL
001                   column=stuinfo:name, timestamp=1541485170696, value=alice
002                   column=stuinfo:name, timestamp=1541485187403, value=nancy
003                   column=stuinfo:name, timestamp=1541485198223, value=harry
3 row(s) in 0.0630 seconds

4.8 `ColumnPrefixFilter`

⭐Filters on a column name prefix!

Example 1: show the records whose column name starts with name

$ scan 'Student',{FILTER=>"ColumnPrefixFilter('name')"}
# Equivalent to
$ scan 'Student',{FILTER=>"QualifierFilter(=,'binaryprefix:name')"}

4.9 `MultipleColumnPrefixFilter`

⭐Accepts several prefixes at once!

Example 1: show the records whose column name starts with name or age

$ scan 'Student',{FILTER=>"MultipleColumnPrefixFilter('name','age')"}

4.10 `ColumnRangeFilter`

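⭐Filters column names by a lexicographic range!

Example 1: query the records whose column name lies between bi and na

The Student table has the following column families and column names: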

Stuinfo: name
Stuinfo: age
Stuinfo: sex
Stuinfo: class
Grades: math
Grades: english
Grades: bigdata

$ scan 'Student',{FILTER=>"ColumnRangeFilter('bi',true,'na',true)"}

# Result
ROW                    COLUMN+CELL
001                   column=grades:englisg, timestamp=1541485306878, value=80
001                   column=grades:math, timestamp=1541485384199, value=90
002                   column=grades:bigdata, timestamp=1541485403649, value=88
002                   column=grades:englisg, timestamp=1541485319132, value=85
002                   column=grades:math, timestamp=1541485376414, value=78
002                   column=stuinfo:class, timestamp=1541485278646, value=1802
003                   column=grades:bigdata, timestamp=1541485412686, value=80
003                   column=grades:englisg, timestamp=1541485328260, value=90
003                   column=grades:math, timestamp=1541485368087, value=80
003                   column=stuinfo:class, timestamp=1541485271479, value=1803
3 row(s) in 0.0120 seconds
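
Why `name` is missing from the result even though the upper bound `na` is inclusive becomes clear from a Python sketch of the range test. `column_range` is a made-up helper whose four arguments follow the same order as the filter's: lower bound, lower inclusive, upper bound, upper inclusive.

```python
def column_range(qualifiers, min_col, min_inclusive, max_col, max_inclusive):
    """Return the qualifiers inside the lexicographic range, sorted."""
    def in_range(qualifier):
        above = qualifier >= min_col if min_inclusive else qualifier > min_col
        below = qualifier <= max_col if max_inclusive else qualifier < max_col
        return above and below

    return sorted(q for q in qualifiers if in_range(q))


qualifiers = ["name", "age", "sex", "class", "math", "english", "bigdata"]
selected = column_range(qualifiers, "bi", True, "na", True)
```

`name` sorts after `na` because it is longer and shares the same prefix, so it falls outside the range; `age` sorts before `bi`, and `sex` sorts after `na`.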

4.11 `TimestampsFilter`

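⭐Timestamp filter: it only matches timestamps by equality, but several timestamps can be given!

Example 1: query only the key-value pairs whose timestamp is 2 or 4

$ scan 'Student',{FILTER=>"TimestampsFilter(2,4)"}

# Result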
ROW                    COLUMN+CELL
004                   column=stuinfo:age, timestamp=2, value=19
004                   column=stuinfo:name, timestamp=2, value=curry
004                   column=stuinfo:sex, timestamp=4, value=male
1 row(s) in 0.0150 seconds

4.12 `ValueFilter`

⭐Filters on the cell value!

Example 1: query all key-value pairs whose value equals 19 (`binary`) or contains 19 (`substring`)

$ scan 'Student',{FILTER=>"ValueFilter(=,'binary:19')"}
$ scan 'Student',{FILTER=>"ValueFilter(=,'substring:19')"}

4.13 `SingleColumnValueFilter`

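⭐This filter also filters on the value, but the column family and column name must be specified, so the value test applies to one particular column. When that column matches, the whole row is returned!

Example 1: query the rows whose age column in column family Stuinfo equals 19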

$ scan 'Student',{FILTER=>"SingleColumnValueFilter('Stuinfo','age',=,'binary:19')"}

4.14 `SingleColumnValueExcludeFilter`

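⭐Applies the same value test on the specified column family and column as SingleColumnValueFilter, but leaves the tested column itself out of the returned rows!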
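
The two single-column filters can be compared in a Python sketch. `single_column_value` and its `exclude_tested_column` flag are made-up names, and the rows are a reduced version of the Student data. The sketch also models one default behaviour that is easy to miss: a row that does not contain the tested column at all is still returned unless the filter is told to drop such rows.

```python
def single_column_value(rows, column, expected, exclude_tested_column=False):
    """rows: dict row_key -> dict column -> value. Returns the rows that pass."""
    result = {}
    for row_key, columns in rows.items():
        # Rows lacking the column pass by default; rows with a wrong value do not.
        if column in columns and columns[column] != expected:
            continue
        kept = dict(columns)
        if exclude_tested_column:
            kept.pop(column, None)  # the Exclude variant hides the tested column
        result[row_key] = kept
    return result


rows = {
    "001": {"stuinfo:age": "18", "stuinfo:name": "alice"},
    "002": {"stuinfo:name": "nancy"},  # no age column
    "003": {"stuinfo:age": "19", "stuinfo:name": "harry"},
}
plain = single_column_value(rows, "stuinfo:age", "19")
excluded = single_column_value(rows, "stuinfo:age", "19", exclude_tested_column=True)
```

Both variants return rows 002 and 003; the only difference is that the Exclude variant returns row 003 without its `stuinfo:age` cell.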

4.15 `ColumnCountGetFilter`

⭐Limits the number of key-value pairs returned per logical row!

Example 1: return the first 3 key-value pairs of row key 001

$ get 'Student','001',{FILTER=>"ColumnCountGetFilter(3)"}

# Result
COLUMN                    CELL
grades:englisg           timestamp=1541485306878, value=80
grades:math              timestamp=1541485384199, value=90
stuinfo:age              timestamp=1541485224974, value=18
3 row(s) in 0.0950 seconds

4.16 `PageFilter`

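⭐Row-based paging filter that sets the number of rows to return: the filter's argument is the number of rows on the page! Note that the number of rows returned is not necessarily the number requested. The filter is applied separately on each region server, and the servers do not share how many rows they have already let through, so a scan that spans several regions may well return more rows than the configured value.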

$ scan 'Student',{FILTER=>"PageFilter(1)"}

# Result
ROW                       COLUMN+CELL
001                      column=grades:englisg, timestamp=1541485306878, value=80
001                      column=grades:math, timestamp=1541485384199, value=90
001                      column=stuinfo:age, timestamp=1541485224974, value=18
001                      column=stuinfo:name, timestamp=1541485170696, value=alice
001                      column=stuinfo:sex, timestamp=1541485235855, value=female
1 row(s) in 0.0680 seconds
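
Why a scan across several regions can return more rows than the page size can be illustrated with a deliberately simplified Python model. `page_filter_scan` is a made-up helper, and the split of the rows into two regions is hypothetical. The only point modelled is that each region enforces the limit with its own counter.

```python
def page_filter_scan(regions, page_size):
    """regions: list of row-key lists, one per region, in key order."""
    returned = []
    for region_rows in regions:
        # Each region applies the limit independently of the others.
        returned.extend(region_rows[:page_size])
    return returned


single_region = page_filter_scan([["001", "002", "003", "004"]], page_size=1)
two_regions = page_filter_scan([["001", "002"], ["003", "004"]], page_size=1)
```

With one region the page size is respected, but with two regions one row per region comes back. A client that needs an exact page therefore has to cut off the result itself once enough rows have arrived.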

4.17 `ColumnPaginationFilter`

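⭐Column-based paging filter: it takes the number of columns to return per row and an offset (the position of the first column to show in the row); columns are counted from 0!

Example 1: show 2 key-value pairs per row, starting at column offset 1

First scan the whole table:

$ scan 'Student'

# Result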
ROW                       COLUMN+CELL
001                      column=grades:englisg, timestamp=1541485306878, value=80
001                      column=grades:math, timestamp=1541485384199, value=90
001                      column=stuinfo:age, timestamp=1541485224974, value=18
001                      column=stuinfo:name, timestamp=1541485170696, value=alice
001                      column=stuinfo:sex, timestamp=1541485235855, value=female
002                      column=grades:bigdata, timestamp=1541485403649, value=88
002                      column=grades:englisg, timestamp=1541485319132, value=85
002                      column=grades:math, timestamp=1541485376414, value=78
002                      column=stuinfo:class, timestamp=1541485278646, value=1802
002                      column=stuinfo:name, timestamp=1541485187403, value=nancy
002                      column=stuinfo:sex, timestamp=1541485245291, value=male

Then run the column-based paging query:

$ scan 'Student',{FILTER=>"ColumnPaginationFilter(2,1)"}

# Result
ROW                       COLUMN+CELL
001                      column=grades:math, timestamp=1541485384199, value=90
001                      column=stuinfo:age, timestamp=1541485224974, value=18
002                      column=grades:englisg, timestamp=1541485319132, value=85
002                      column=grades:math, timestamp=1541485376414, value=78
003                      column=grades:englisg, timestamp=1541485328260, value=90
003                      column=grades:math, timestamp=1541485368087, value=80
004                      column=stuinfo:name, timestamp=2, value=curry
004                      column=stuinfo:sex, timestamp=4, value=male
4 row(s) in 0.0840 seconds
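
The limit and offset arguments behave like a slice over each row's sorted columns. A minimal Python sketch, where `column_pagination` is a made-up helper and the column lists are those of rows 001 and 002 from the full scan above:

```python
def column_pagination(rows, limit, offset):
    """rows: dict row_key -> list of columns in sorted order."""
    return {row_key: columns[offset:offset + limit]
            for row_key, columns in rows.items()}


rows = {
    "001": ["grades:englisg", "grades:math", "stuinfo:age",
            "stuinfo:name", "stuinfo:sex"],
    "002": ["grades:bigdata", "grades:englisg", "grades:math",
            "stuinfo:class", "stuinfo:name", "stuinfo:sex"],
}
page = column_pagination(rows, limit=2, offset=1)
```

The slice skips the first column of every row (offset 1) and keeps the next two, which matches the shell output for rows 001 and 002.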

4.18 Combining filters

⭐Several filters can be combined in one scan with connectives such as AND and OR!

Example 1: combining filters with AND

$ scan 'Student',{FILTER=>"ColumnPaginationFilter(2,1) AND ValueFilter(=,'substring:ma')"}

# Result
ROW                       COLUMN+CELL
004                      column=stuinfo:sex, timestamp=4, value=male
1 row(s) in 0.1040 seconds

Example 2: combining filters with OR

$ scan 'Student',{FILTER=>"ColumnPaginationFilter(1,1) OR ValueFilter(=,'substring:ma')"}

# Result
ROW                           COLUMN+CELL
001                          column=grades:math, timestamp=1541485384199, value=90
001                          column=stuinfo:sex, timestamp=1541485235855, value=female
002                          column=grades:englisg, timestamp=1541485319132, value=85
002                          column=stuinfo:sex, timestamp=1541485245291, value=male
003                          column=grades:englisg, timestamp=1541485328260, value=90
003                          column=stuinfo:sex, timestamp=1541485253075, value=male
004                          column=stuinfo:name, timestamp=2, value=curry
004                          column=stuinfo:sex, timestamp=4, value=male
4 row(s) in 0.0440 seconds

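Filter summary:

RowFilter: compares and filters on the row key
PrefixFilter: row key prefix filter
KeyOnlyFilter: returns only the key part of each cell, without the value
FirstKeyOnlyFilter: returns only the first cell of each row, with its key and value
InclusiveStopFilter: replaces ENDROW and also returns the stop row
FamilyFilter: compares and filters on the column family
QualifierFilter: filters on the column name (qualifier)
ColumnPrefixFilter: filters on a column name prefix
MultipleColumnPrefixFilter: accepts several column name prefixes
ColumnRangeFilter: filters column names by a lexicographic range
TimestampsFilter: timestamp filter; matches by equality only, but accepts several timestamps
ValueFilter: filters on the cell value
SingleColumnValueFilter: filters on the value of the specified column family and column
SingleColumnValueExcludeFilter: same test as SingleColumnValueFilter, but leaves the tested column out of the result
ColumnCountGetFilter: limits the number of key-value pairs returned per logical row
PageFilter: row-based paging filter that sets the number of rows to return
ColumnPaginationFilter: column-based paging filter that takes a limit and an offset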

🔥To be continued...
