HBase 1.x in Practice: BulkLoad Bulk Data Import Code Example

Original article: mp.weixin.qq.com

1. Overview:

HBase offers several ways to load data into a table. The most straightforward is to use the TableOutputFormat class from a MapReduce job, or to write through the client API; however, these are not always the most efficient approaches.

BulkLoad starts by uploading the text file (or an export from another database) to HDFS. The data is then converted into HFiles by a MapReduce job that emits the rowkey as the output key and a Put (or Delete) as the output value; in the output directory, one HFile is created per Region. Finally, LoadIncrementalHFiles moves the HFiles into the corresponding Region directories on the RegionServers. This approach does not tie up Region resources, imports massive amounts of data quickly, and saves memory: it avoids the heavy I/O of frequent flush, split, and compact operations, and combined with MapReduce it is both more efficient and more convenient.
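For reference, HBase also ships a ready-made tool that implements the same two-step flow (generate HFiles with MapReduce, then load them) without writing any code. A minimal sketch, assuming the student.txt format and the input/output paths used later in this article:

# Step 1: ImportTsv with a bulk-output directory writes HFiles instead of issuing Puts
# (the output directory must not already exist)
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=, \
  -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age,info:score \
  -Dimporttsv.bulk.output=/tmp/bulkLoadOutput \
  studentbulk /tmp/bulkLoadInput

# Step 2: move the generated HFiles into the table's Regions
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/bulkLoadOutput studentbulk

The custom MapReduce job below does the same thing, but gives full control over how each input line is parsed into columns.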

Note:

Because BulkLoad does not write data through the HBase API but instead generates HFiles directly on HDFS, nothing is recorded in the WAL (write-ahead log). If the cluster synchronizes data through the Replication mechanism, the bulk-loaded data will therefore not be replicated, since Replication works by shipping the WAL.
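If bulk-loaded data also has to reach a replication peer, the simplest workaround is to re-run the bulk load on the peer cluster. HBase 1.3 and later additionally offer opt-in bulk-load replication (HBASE-13153); a hedged sketch of the hbase-site.xml settings involved, to be verified against your specific release before relying on it:

<!-- Assumption: HBase 1.3+; replicate bulk-loaded HFiles to peers -->
<property>
  <name>hbase.replication.bulkload.enabled</name>
  <value>true</value>
</property>
<!-- A unique id for the source cluster, required when bulk-load replication is enabled -->
<property>
  <name>hbase.replication.cluster.id</name>
  <value>source-cluster-1</value>
</property>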

2. Hands-on:

Data preparation:

Upload the data file student.txt to the /tmp/bulkLoadInput directory:

1,lujs1,11,71
2,lujs2,12,72
3,lujs3,13,73
4,lujs4,14,74
5,lujs5,15,75
6,lujs6,16,76
7,lujs7,17,77
8,lujs8,18,78
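The upload itself can be done with standard HDFS commands (paths as above; adjust to your environment):

hdfs dfs -mkdir -p /tmp/bulkLoadInput
hdfs dfs -put student.txt /tmp/bulkLoadInput/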

Create the studentbulk table in the hbase shell with:

create 'studentbulk','info'
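Because the HFile-generation step produces one HFile per Region (see the overview above) and configureIncrementalLoad sizes the reduce phase by Region count, a table created this way has a single Region and the whole load funnels through one reducer. For larger datasets the table can optionally be pre-split at creation time; a sketch, with the split points '3' and '6' chosen purely to illustrate against this tiny sample:

create 'studentbulk','info', SPLITS => ['3','6']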

Define a custom Mapper class that parses the input data into the key/value pairs used to build the HFiles:

package com.unicom.ljs.hbase125.study;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author: Created By lujisen
 * @company ChinaUnicom Software JiNan
 * @date: 2020-02-01 14:58
 * @version: v1.0
 * @description: com.unicom.ljs.hbase125.study
 */
/* Mapper generics: input key type, input value type, output key type, output value type */
public class HFileMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] lineWords = value.toString().split(",");
        String rowKey = lineWords[0];
        ImmutableBytesWritable row = new ImmutableBytesWritable(Bytes.toBytes(rowKey));
        Put put = new Put(Bytes.toBytes(rowKey));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(lineWords[1]));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(lineWords[2]));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("score"), Bytes.toBytes(lineWords[3]));
        context.write(row, put);
    }
}
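As a side note, HFileOutputFormat2.configureIncrementalLoad() (called in the driver below) selects its sort reducer from the map output value class, so the mapper may also emit KeyValue cells directly instead of Put objects. A minimal sketch of that variant, not part of the original article (the class name HFileKeyValueMapper is illustrative); the driver would then call setMapOutputValueClass(KeyValue.class):

package com.unicom.ljs.hbase125.study;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class HFileKeyValueMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Same input layout as student.txt: id,name,age,score
        String[] fields = value.toString().split(",");
        byte[] row = Bytes.toBytes(fields[0]);
        ImmutableBytesWritable rowKey = new ImmutableBytesWritable(row);
        byte[] family = Bytes.toBytes("info");
        // Emit one KeyValue per column; the framework's sort reducer orders them for the HFile writer
        context.write(rowKey, new KeyValue(row, family, Bytes.toBytes("name"), Bytes.toBytes(fields[1])));
        context.write(rowKey, new KeyValue(row, family, Bytes.toBytes("age"), Bytes.toBytes(fields[2])));
        context.write(rowKey, new KeyValue(row, family, Bytes.toBytes("score"), Bytes.toBytes(fields[3])));
    }
}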

Driver class code example:

package com.unicom.ljs.hbase125.study;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadData {

    public static Configuration conf = null;
    public static Connection conn = null;
    public static Table table = null;
    public static RegionLocator locator = null;
    public static Admin admin = null;

    public static final String tableName = "studentbulk";
    public static final String inputPath = "/tmp/bulkLoadInput/";
    public static final String outputPath = "/tmp/bulkLoadOutput/";

    public static void main(String[] args) {
        try {
            conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "salver158.hadoop.unicom,salver31.hadoop.unicom,salver32.hadoop.unicom");
            conf.set("hbase.zookeeper.property.clientPort", "2181");
            conf.set("zookeeper.znode.parent", "/hbase-unsecure");

            conn = ConnectionFactory.createConnection(conf);
            table = conn.getTable(TableName.valueOf(tableName));
            locator = conn.getRegionLocator(TableName.valueOf(tableName));
            admin = conn.getAdmin();

            Job job = Job.getInstance();
            job.setJarByClass(BulkLoadData.class);

            // Mapper class and its map-side output key/value types
            job.setMapperClass(HFileMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);

            // Input format for the text file and HFile output format
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(HFileOutputFormat2.class);

            // Input and output paths
            FileInputFormat.addInputPath(job, new Path(inputPath));
            FileOutputFormat.setOutputPath(job, new Path(outputPath));

            // Configure the job for bulk load (partitioner, sort reducer, etc.)
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

            boolean result = job.waitForCompletion(true);
            System.out.println("MapReduce result: " + result);

            /* Load the generated HFiles into the HBase table */
            LoadIncrementalHFiles load = new LoadIncrementalHFiles(conf);
            load.doBulkLoad(new Path(outputPath), admin, table, locator);
        } catch (Exception e) {
            System.out.println("Error: " + e);
        } finally {
            System.out.println("BulkLoadData finished!!!");
        }
    }
}

Finally, package the program into a jar together with its dependencies. To build such a fat jar, add the following plugin to the pom:

<plugins>
    <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
            <archive>
                <manifest>
                    <mainClass>com.unicom.ljs.hbase125.study.BulkLoadData</mainClass>
                </manifest>
            </archive>
        </configuration>
        <executions>
            <execution>
                <id>make-assembly</id>
                <phase>package</phase>
                <goals>
                    <goal>assembly</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
</plugins>
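With the plugin in place, a standard Maven build produces the fat jar (the artifact name below assumes the module's artifactId is hbase125 and version is 1.0-SNAPSHOT, matching the submit command in the next step):

mvn clean package
# produces target/hbase125-1.0-SNAPSHOT-jar-with-dependencies.jar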

Upload the exported jar to any node in the cluster and submit it with the following command:

hadoop jar hbase125-1.0-SNAPSHOT-jar-with-dependencies.jar com.unicom.ljs.hbase125.study.BulkLoadData

Once the job finishes, query the data to confirm that the bulk load succeeded.
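A quick check from the hbase shell, counting the rows and scanning the first few:

count 'studentbulk'
scan 'studentbulk', {LIMIT => 3}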