HBase: a non-relational database

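Introduction

A distributed, scalable NoSQL database built to store massive amounts of data.

Logical structure

Storage structure

Architecture diagram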

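Write path

The client contacts ZooKeeper to find out which RegionServer hosts the hbase:meta table.
The client contacts that RegionServer and reads hbase:meta. Using the namespace:table/rowkey of the request, it looks up which Region on which RegionServer holds the target data, then caches the table's region information and the location of the meta table in its Meta Cache for later requests.
The client talks to the target RegionServer.
The data is written sequentially (appended) to the WAL.
The data is written to the corresponding MemStore, where it is kept sorted.
An ack is sent to the client.
When a MemStore flush is triggered, the data is written to an HFile.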
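
The RegionServer side of the write path above can be sketched with plain Java collections. This is a toy model, not HBase code: the class `WritePathSketch`, its fields, and the entry-count `flushThreshold` are hypothetical names chosen for illustration, and the flush runs inline here, whereas in HBase it happens in the background after the ack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of the RegionServer-side write path; none of these are HBase classes.
class WritePathSketch {
    final List<String> wal = new ArrayList<>();                // append-only log
    final TreeMap<String, String> memStore = new TreeMap<>();   // kept sorted by row key
    final List<TreeMap<String, String>> hFiles = new ArrayList<>();
    final int flushThreshold; // hypothetical trigger: flush after this many entries

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Returns true to stand in for the ack sent back to the client.
    boolean put(String rowKey, String value) {
        wal.add(rowKey + "=" + value);   // 1. sequential append to the WAL
        memStore.put(rowKey, value);     // 2. sorted insert into the MemStore
        if (memStore.size() >= flushThreshold) {
            flush();                     // 3. simplified: real flushes are asynchronous
        }
        return true;
    }

    // Each flush turns the current MemStore contents into one new sorted file.
    void flush() {
        if (memStore.isEmpty()) {
            return;
        }
        hFiles.add(new TreeMap<>(memStore));
        memStore.clear();
    }
}
```

Writing `row3` and then `row1` leaves the WAL in arrival order while the MemStore iterates in key order, which is why the flushed file is already sorted.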

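MemStore flush

Read path

The client contacts ZooKeeper to find out which RegionServer hosts the hbase:meta table.
The client contacts that RegionServer and reads hbase:meta. Using the namespace:table/rowkey of the request, it looks up which Region on which RegionServer holds the target data, then caches the table's region information and the location of the meta table in the client-side Meta Cache for later requests.
The client talks to the target RegionServer.
The target data is looked up in the Block Cache (read cache), the MemStore, and the StoreFiles (HFiles), and everything found is merged. "Everything" here means the different versions (timestamps) and types (Put/Delete) of the same piece of data.
Data blocks read from files (a Block is the storage unit of an HFile, 64 KB by default) are cached in the Block Cache.
The merged final result is returned to the client.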
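
The merge step of the read path can be illustrated with a small self-contained sketch. It assumes a simplified rule: the newest Put wins unless a Delete with an equal or newer timestamp exists. The class `ReadMergeSketch` and its nested `Cell` type are hypothetical and only model a single column of a single row; real HBase has several kinds of delete markers and per-family version limits.

```java
import java.util.List;
import java.util.Optional;

// Toy model of merging the versions of one cell found in several sources
// (Block Cache, MemStore, StoreFiles). Not HBase code.
class ReadMergeSketch {
    enum Type { PUT, DELETE }

    static final class Cell {
        final long timestamp;
        final Type type;
        final String value;

        Cell(long timestamp, Type type, String value) {
            this.timestamp = timestamp;
            this.type = type;
            this.value = value;
        }
    }

    // Returns the visible value, or empty if the data is missing or deleted.
    static Optional<String> read(List<List<Cell>> sources) {
        long newestDelete = Long.MIN_VALUE;
        Cell newestPut = null;
        for (List<Cell> source : sources) {
            for (Cell c : source) {
                if (c.type == Type.DELETE) {
                    newestDelete = Math.max(newestDelete, c.timestamp);
                } else if (newestPut == null || c.timestamp > newestPut.timestamp) {
                    newestPut = c;
                }
            }
        }
        // Simplifying assumption: a Delete masks every Put at or before its timestamp.
        if (newestPut == null || newestPut.timestamp <= newestDelete) {
            return Optional.empty();
        }
        return Optional.of(newestPut.value);
    }
}
```

This is why a read cannot stop at the first source that contains the row: a newer Put or a Delete marker may sit in any of the others.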

StoreFile Compaction

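Each MemStore flush produces a new HFile, and the different versions (timestamps) and types (Put/Delete) of the same field may be spread across different HFiles, so a query has to go through all of them. StoreFile Compaction reduces the number of HFiles and cleans out expired and deleted data.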
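
As a rough illustration of what a compaction achieves, the sketch below merges several sorted files into one, keeps only the newest cell per row key, and drops keys whose newest cell is a delete marker. `CompactionSketch`, its `Cell` type, and `majorCompact` are hypothetical names; the sketch assumes timestamps are unique per key and ignores TTL expiry and multiple retained versions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a compaction that merges sorted files. Not HBase code.
class CompactionSketch {
    static final class Cell {
        final long timestamp;
        final boolean delete; // true for a delete marker
        final String value;

        Cell(long timestamp, boolean delete, String value) {
            this.timestamp = timestamp;
            this.delete = delete;
            this.value = value;
        }
    }

    static TreeMap<String, Cell> majorCompact(List<TreeMap<String, Cell>> hFiles) {
        TreeMap<String, Cell> merged = new TreeMap<>();
        for (TreeMap<String, Cell> file : hFiles) {
            for (Map.Entry<String, Cell> e : file.entrySet()) {
                Cell current = merged.get(e.getKey());
                // Keep only the newest cell for each row key.
                if (current == null || e.getValue().timestamp > current.timestamp) {
                    merged.put(e.getKey(), e.getValue());
                }
            }
        }
        // Deleted data is physically removed once everything is in one file.
        merged.values().removeIf(c -> c.delete);
        return merged;
    }
}
```

After the merge, a read only has to consult one file for these keys instead of every file produced by earlier flushes.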

Region Split

Installation and deployment

zookeeper-3.4.10 bin/zkServer.sh start

hadoop-2.7.2 sbin/start-dfs.sh
hadoop-2.7.2 sbin/start-yarn.sh

hbase-env.sh
export JAVA_HOME=/opt/module/jdk1.6.0_144
export HBASE_MANAGES_ZK=false

hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop102:9000/HBase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <!-- Changed after 0.98: earlier versions had no .port property, and the default port was 60000 -->
    <property>
        <name>hbase.master.port</name>
        <value>16000</value>
    </property>
    <property> 
        <name>hbase.zookeeper.quorum</name>
        <value>hadoop102,hadoop103,hadoop104</value>
    </property>
    <property> 
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/opt/module/zookeeper-3.4.10/zkData</value>
    </property>
</configuration>

hbase bin/hbase-daemon.sh start master
hbase bin/hbase-daemon.sh start regionserver

HBase API

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "192.168.5.102");
conf.set("hbase.zookeeper.property.clientPort", "2181");

HBaseAdmin admin = new HBaseAdmin(conf);
admin.tableExists(tableName);

HTableDescriptor descriptor = new HTableDescriptor(TableName.valueOf(tableName));
for (String cf: columnFamily) {
    descriptor.addFamily(new HColumnDescriptor(cf));
}
admin.createTable(descriptor);

admin.disableTable(tableName);
admin.deleteTable(tableName);

HTable htable = new HTable(conf, tableName);
Put put = new Put(Bytes.toBytes(rowKey));
put.add(Bytes.toBytes(columnFamily), Bytes.toBytes(column), Bytes.toBytes(value));
htable.put(put);
htable.close();

HTable htable = new HTable(conf, tableName);
List<Delete> deleteList = new ArrayList<Delete>();
for (String row: rows) {
    Delete delete = new Delete(Bytes.toBytes(row));
    deleteList.add(delete);
}
htable.delete(deleteList);
htable.close();

HTable htable = new HTable(conf, tableName);
Scan scan = new Scan();
ResultScanner scanner = htable.getScanner(scan);
for (Result result: scanner) {
    Cell[] cells = result.rawCells();
    for (Cell cell: cells) {
        System.out.println("Row key: " + Bytes.toString(CellUtil.cloneRow(cell)));
        System.out.println("Column family: " + Bytes.toString(CellUtil.cloneFamily(cell)));
        System.out.println("Column: " + Bytes.toString(CellUtil.cloneQualifier(cell)));
        System.out.println("Value: " + Bytes.toString(CellUtil.cloneValue(cell)));
    }
}
htable.close();

HTable htable = new HTable(conf, tableName);
Get get = new Get(Bytes.toBytes(rowKey));
// get.setMaxVersions();          // return all versions
// get.setTimeStamp(timestamp);   // return only the version with the given timestamp
Result result = htable.get(get);
for(Cell cell : result.rawCells()){
    System.out.println("Row key: " + Bytes.toString(result.getRow()));
    System.out.println("Column family: " + Bytes.toString(CellUtil.cloneFamily(cell)));
    System.out.println("Column: " + Bytes.toString(CellUtil.cloneQualifier(cell)));
    System.out.println("Value: " + Bytes.toString(CellUtil.cloneValue(cell)));
    System.out.println("Timestamp: " + cell.getTimestamp());
}
htable.close();

Optimization

HMaster high availability

touch conf/backup-masters
echo hadoop102 > conf/backup-masters

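Pre-splitting

byte[][] splitKeys = getSplitKeys(); // hypothetical helper: returns split keys derived from hash values
HBaseAdmin hAdmin = new HBaseAdmin(HBaseConfiguration.create());
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf(tableName));
hAdmin.createTable(tableDesc, splitKeys);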
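
One possible way to produce such split keys is sketched below. It assumes row keys start with a two-digit bucket prefix followed by `_` (for example `01_user42`); because `|` sorts after `_` in ASCII, a split key of `01|` places every `01_...` key below it. `SplitKeySketch`, `splitKeys`, and `regionFor` are hypothetical names, and the two-digit format limits this scheme to at most 100 regions.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for generating split keys and checking key placement.
class SplitKeySketch {
    // N regions need N-1 split keys: "00|", "01|", ...
    static byte[][] splitKeys(int regions) {
        byte[][] keys = new byte[regions - 1][];
        for (int i = 0; i < regions - 1; i++) {
            keys[i] = String.format("%02d|", i).getBytes(StandardCharsets.UTF_8);
        }
        return keys;
    }

    // Unsigned lexicographic comparison, the order in which row keys are sorted.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Index of the region whose key range contains rowKey.
    static int regionFor(String rowKey, byte[][] splitKeys) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        int region = 0;
        while (region < splitKeys.length && compare(key, splitKeys[region]) >= 0) {
            region++;
        }
        return region;
    }
}
```

With four regions the split keys are `00|`, `01|`, and `02|`, so keys prefixed `00_` through `03_` land in regions 0 through 3 respectively.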

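RowKey design

A row's unique identifier is its RowKey, and a row is stored in whichever pre-split region's key range contains its RowKey. The goal of RowKey design is to spread rows evenly across regions and prevent data skew.

Generate a random number, hash, or digest value
Reverse the string
Concatenate strings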
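
The three techniques above might look like the following in plain Java. This is only a sketch: `RowKeySketch` and its method names are hypothetical, the bucket prefix uses `String.hashCode()` purely for illustration (a production design would more likely use a stable digest), and the `_` separator is an arbitrary choice.

```java
// Hypothetical helpers illustrating common RowKey design techniques.
class RowKeySketch {
    // Hash prefix: spreads otherwise sequential keys over a fixed number of buckets.
    static String salted(String key, int buckets) {
        int bucket = Math.floorMod(key.hashCode(), buckets);
        return String.format("%02d_%s", bucket, key);
    }

    // Reversal: moves the fast-changing tail (e.g. the end of a phone number) to the front.
    static String reversed(String key) {
        return new StringBuilder(key).reverse().toString();
    }

    // Concatenation: combines several fields into one key.
    static String composite(String... parts) {
        return String.join("_", parts);
    }
}
```

For example, `RowKeySketch.reversed("13800001111")` yields `11110000831`, so numbers that share a common prefix no longer sort next to each other.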

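Memory optimization

HBase needs a lot of memory while running, so it is common to give HBase about 70% of the available memory. Allocating too much is not recommended, because it makes GC pauses too long.

Basic configuration tuning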

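// Allow appending to HDFS files (hdfs-site.xml, hbase-site.xml)
dfs.support.append = true

// Raise the maximum number of files a DataNode may have open at once (hdfs-site.xml)
dfs.datanode.max.transfer.threads = 4096 (default)

// For high-latency transfers, raise the timeout so the socket does not time out (hdfs-site.xml)
dfs.image.transfer.timeout = 60000 (default, in milliseconds)

// Enable compression of map output (mapred-site.xml)
mapreduce.map.output.compress = true
mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.GzipCodec

// Raise the number of RPC handlers when read/write load is heavy (hbase-site.xml)
hbase.regionserver.handler.count = 30 (default)

// Lower this when HBase runs MR jobs, because each Region maps to one map task (hbase-site.xml)
hbase.hregion.max.filesize = 10737418240 (10 GB, default)

// A larger value reduces the number of RPC calls but uses more memory (hbase-site.xml)
hbase.client.write.buffer

// Number of rows fetched from HBase per scan.next call (hbase-site.xml)
hbase.client.scanner.caching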
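
To make the trade-off behind `hbase.client.scanner.caching` concrete, the following back-of-the-envelope sketch estimates how many fetch round trips a scan needs for a given caching value. `ScanCachingSketch` and `fetchCalls` are hypothetical names, and the estimate deliberately ignores result-size limits, region boundaries, and the final call that signals the end of the scan.

```java
// Rough estimate of scan round trips as a function of the caching value.
class ScanCachingSketch {
    // Ceiling division: each call returns at most `caching` rows.
    static long fetchCalls(long totalRows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (totalRows + caching - 1) / caching;
    }
}
```

Under these assumptions, scanning 10,000 rows takes 10,000 calls with a caching value of 1 and 100 calls with a caching value of 100, at the cost of holding up to 100 rows in client memory per call.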