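Compaction merges data files (StoreFiles). When a region grows too large it is split, and when splitting has produced too many files, Merge can be used to combine them.
Avoiding compaction and split (especially when integrating with Hive) depends on the four parameters below.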
//--------------------------------------------------------------------
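hbase.hregion.memstore.flush.size --- related to split: once a memstore reaches this size it is flushed to disk as a StoreFile; raise it to avoid producing too many StoreFiles
hbase.hregion.max.filesize (table_att => MAX_FILESIZE) --- set it very large to avoid automatic splits
hbase.hstore.compactionThreshold --- the number of StoreFiles in an HStore that triggers compaction (roughly 8 x flush.size of data); for example, with a value of 8, compaction starts once there are more than 8 files, so it interacts with flush.size
hbase.hregion.majorcompaction --- the interval between major compactions, which remove expired data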
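The "8 x flush.size" relationship can be checked with some back-of-the-envelope arithmetic. The sketch below assumes every flush writes a StoreFile of exactly flush.size bytes (real StoreFiles are usually smaller); the function name and the example values are illustrative and are not read from any cluster.

```python
def bytes_before_compaction(flush_size_bytes, compaction_threshold):
    """Rough amount of flushed data in one store when compaction triggers.

    Assumes each flush produces a StoreFile of exactly flush_size_bytes.
    """
    return flush_size_bytes * compaction_threshold


# Illustrative values: 128 MiB flush size, threshold of 8 files
flush_size = 128 * 1024 * 1024
threshold = 8

print(bytes_before_compaction(flush_size, threshold))  # 1073741824 (1 GiB)
```

Under these assumptions, a larger flush size means fewer, larger StoreFiles for the same volume of writes, which is why the two settings need to be tuned together.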
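MapReduce approach
//-------------------------------- see the previous article for details
IndexBuilder: builds the index with MapReduce
Pros: indexes can be built concurrently, in batches
Cons: the index cannot be built in real time when a single row is inserted into HBase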
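The batch nature of this approach can be illustrated with a tiny in-memory sketch. It is not the HBase IndexBuilder itself, only a pure-Python imitation of a map phase and a reduce phase that turns rows into a value-to-row-key index; the function names, the column name info:city and the sample rows are all made up for illustration.

```python
from collections import defaultdict


def map_phase(rows, column):
    """Emit (column value, row key) pairs for rows that contain the column."""
    for row_key, columns in rows:
        value = columns.get(column)
        if value is not None:
            yield value, row_key


def reduce_phase(pairs):
    """Group row keys by value, producing the index."""
    index = defaultdict(list)
    for value, row_key in pairs:
        index[value].append(row_key)
    return {value: sorted(keys) for value, keys in index.items()}


# Hypothetical table contents
rows = [
    ("row1", {"info:city": "Beijing"}),
    ("row2", {"info:city": "Shanghai"}),
    ("row3", {"info:city": "Beijing"}),
    ("row4", {"info:name": "x"}),
]

index = reduce_phase(map_phase(rows, "info:city"))
print(index["Beijing"])  # ['row1', 'row3']
```

Because the index is rebuilt from a full pass over the rows, a row inserted after the job has run is not indexed until the next run, which is the drawback noted above.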
Pitfalls:
–you always have to remember to set scan caching
–should the block cache be on or off?
–is the bloom filter of any use?
–jobs die for no apparent reason!
Solution:
Problems caused by Compact and Split when accessing HBase data through the MapReduce framework
//------------------------------------------------------
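–regions go offline and mappers fail to access them
–tune the parameters that control Compact and Split
•hbase.hregion.majorcompaction
–some posts online claim that setting it to 0 prevents compaction altogether (wrong: 0 only disables the periodic major compaction; minor compactions still run)
–major compaction removes expired data; disk space is not reclaimed immediately, and the process is slow and resource-intensive
•hbase.hstore.compactionThreshold
–compaction starts when the number of files in a store exceeds this value
–this may be a minor compaction, which only merges the smaller StoreFiles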
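•hbase.hregion.memstore.flush.size
–when the memstore is flushed to the HStore
•hbase.hregion.max.filesize (table_att => MAX_FILESIZE)
–the maximum file size of a region, and therefore when a region starts to split; exceeding it triggers a split (this produces two regions, each with one file, because a compaction runs automatically)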
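•hbase.hstore.blockingStoreFiles (a potentially disastrous property)
–keep max.filesize < flush.size * blockingStoreFiles so that the region does not get blocked
–once a store reaches this many files, further writes to it are blocked and can resume only after a compaction
–run compactions during off-peak hours
–merge regions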
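The inequality given for hbase.hstore.blockingStoreFiles is easy to verify before changing a configuration. The following is a minimal sketch of that check; the function name and the sample values are assumptions chosen for illustration, not recommended settings.

```python
def split_happens_before_blocking(max_filesize, flush_size, blocking_store_files):
    """Return True when max.filesize < flush.size * blockingStoreFiles."""
    return max_filesize < flush_size * blocking_store_files


MIB = 1024 * 1024
GIB = 1024 * MIB

# Illustrative settings: 128 MiB flush size, 16 blocking store files (2 GiB)
print(split_happens_before_blocking(1 * GIB, 128 * MIB, 16))   # True
print(split_happens_before_blocking(10 * GIB, 128 * MIB, 16))  # False
```

In the second case the region is allowed to grow well past the point at which the store would start blocking writes, which is the situation the inequality is meant to rule out.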
Using the script:
//-------------------------
#!/bin/bash
# Check the region sizes of an HBase table and optionally split the large ones.

# Print an error message and the usage text, then exit.
die () {
    echo >&2 "$@"
    echo "usage:"
    echo "       $0 check|split table_name [split_size]"
    exit 1
}

[[ "$#" -lt 2 ]] && die "at least 2 arguments required, $# provided"

COMMAND=$1
TABLE=$2
SIZE="${3:-1073741824}"   # size threshold in bytes, default 1 GiB

# Look up the region's row key in hbase:meta by its encoded name, then split it.
split() {
    region_key=$(python /home/hduser/hbase/hbase-scan.py -t hbase:meta -f "RowFilter (=, 'substring:$1')")
    echo "split '$region_key'" | hbase shell
}

if [ "$COMMAND" != "check" ] ; then
    for region in $(hadoop fs -ls "/hbase/data/default/$TABLE" | awk '{print $8}')
    do
        # skip entries whose name starts with a dot
        [[ ${region##*/} =~ ^\. ]] && continue
        [[ $(hadoop fs -du -s "$region" | awk '{print $1}') -gt $SIZE ]] && split "${region##*/}"
    done
    # wait before re-checking the region sizes
    sleep 60
fi

# Report every region as huge or small relative to SIZE.
for region in $(hadoop fs -ls "/hbase/data/default/$TABLE" | awk '{print $8}')
do
    [[ ${region##*/} =~ ^\. ]] && continue
    if [[ $(hadoop fs -du -s "$region" | awk '{print $1}') -gt $SIZE ]]; then
        label="huge"
    else
        label="small"
    fi
    echo "${region##*/} ($(hadoop fs -du -s -h "$region" | awk '{print $1 $2}')) is a $label region"
done
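The size test at the heart of the script can be restated in a few lines of Python, which makes the threshold logic easier to try out without a cluster. This is only a sketch: it assumes the first whitespace-separated field of a `hadoop fs -du -s` output line is the size in bytes (the same field the script extracts with awk), and the function names and sample lines are hypothetical.

```python
DEFAULT_SPLIT_SIZE = 1073741824  # same default as SIZE in the script (1 GiB)


def parse_du_size(du_line):
    """Return the size in bytes from one line of `hadoop fs -du -s` output."""
    return int(du_line.split()[0])


def classify(du_line, split_size=DEFAULT_SPLIT_SIZE):
    """Mirror the script's `-gt` test: strictly larger than split_size is huge."""
    return "huge" if parse_du_size(du_line) > split_size else "small"


# Made-up output lines for two regions of a hypothetical table
print(classify("2147483648  /hbase/data/default/mytable/aaa111"))  # huge
print(classify("524288  /hbase/data/default/mytable/bbb222"))      # small
```

A region whose size equals the threshold exactly is reported as small, matching the `-gt` comparison in the script.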
The hbase-scan.py used above queries HBase to read a region's row key from the hbase:meta table
//---------------------------------------------------------
import argparse
import logging
import sys

import happybase


def connect_to_hbase():
    # host of the HBase Thrift server
    return happybase.Connection('itr-hbasetest01')


def main():
    logging.basicConfig(format='%(asctime)s %(name)s %(levelname)s: %(message)s', level=logging.INFO)

    argp = argparse.ArgumentParser(description='EventLog Reader')
    argp.add_argument('-t', '--table', dest='table', default='eventlog')
    argp.add_argument('-p', '--prefix', dest='prefix')
    argp.add_argument('-f', '--filter', dest='filter')
    argp.add_argument('-l', '--limit', dest='limit', type=int, default=10)
    args = argp.parse_args()

    hbase_conn = connect_to_hbase()
    table = hbase_conn.table(args.table)

    logging.info("scan start")
    # scan() is lazy: rows are fetched while iterating below
    scanner = table.scan(row_prefix=args.prefix, batch_size=1000, limit=args.limit, filter=args.filter)

    # row keys are bytes; write them raw so this works on Python 2 and 3
    stdout = getattr(sys.stdout, 'buffer', sys.stdout)
    i = 0
    for key, data in scanner:
        logging.info(key)
        # logging goes to stderr, so only the row keys reach stdout
        stdout.write(key + b'\n')
        i += 1
    logging.info("scan done")
    logging.info('%s rows read in total', i)


if __name__ == '__main__':
    main()
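How the two scripts fit together can be seen by building the strings they exchange. The sketch below reproduces the filter string that the shell script passes with -f and the command it pipes into hbase shell; the helper names, the encoded region name and the row key are made up for illustration.

```python
def build_row_filter(encoded_region_name):
    """Filter string matching hbase:meta rows whose key contains the name."""
    return "RowFilter (=, 'substring:%s')" % encoded_region_name


def build_split_command(region_key):
    """Command line the shell script pipes into `hbase shell`."""
    return "split '%s'" % region_key


# Hypothetical encoded region name (the directory name under the table path)
encoded = "0123456789abcdef0123456789abcdef"
print(build_row_filter(encoded))

# Hypothetical row key as it might be printed by hbase-scan.py
region_key = "mytable,,1400000000000.%s." % encoded
print(build_split_command(region_key))
```

Since the filter is a substring match, it relies on the encoded region name appearing in exactly one hbase:meta row key; if the scan printed several keys, the resulting split command would be malformed.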