1. Overview:
HBase offers several ways to load data into a table. The most straightforward are to use the TableOutputFormat class in a MapReduce job, or to write through the client API; however, these are not always the most efficient options.
BulkLoad works in stages. First, the source data (text files or exports from another database) is uploaded to HDFS. Next, a MapReduce job converts the data into HFiles: the rowkey is emitted as the output key and a Put (or Delete) as the output value, and the job writes one HFile per target Region into the output directory. Finally, the RegionServers use LoadIncrementalHFiles to move the HFiles into the corresponding Region directories. Because this approach bypasses the normal write path, it does not tie up Region resources or MemStore memory, avoids the heavy I/O of frequent flush, split and compact operations, and, combined with MapReduce, can import massive amounts of data quickly and efficiently.
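As a quick sanity check (assuming the input/output paths used later in this article), the generated HFiles can be inspected on HDFS before they are loaded; they appear under a subdirectory named after the column family, roughly one file per target Region:

hdfs dfs -ls /tmp/bulkLoadOutput/info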
Note:
Because BulkLoad does not write data through the HBase API but instead generates HFiles directly on HDFS, no WAL (write-ahead log) entries are recorded. If the cluster synchronizes data via the Replication mechanism, the bulk-loaded data will therefore not be replicated, since Replication works by shipping WAL entries.
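As a side note (an assumption about cluster versions, not something covered by this article's setup): HBase 1.3 and later can also replicate bulk-loaded HFiles when bulk load replication is enabled on both clusters in hbase-site.xml; on versions without this feature, the bulk load simply has to be repeated on the peer cluster.

<property>
  <name>hbase.replication.bulkload.enabled</name>
  <value>true</value>
</property>
<!-- additional settings such as hbase.replication.cluster.id are also required; see the HBase documentation -->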
2. Hands-on:
Data preparation:
The data file student.txt is uploaded to the /tmp/bulkLoadInput directory on HDFS:
1,lujs1,11,71
2,lujs2,12,72
3,lujs3,13,73
4,lujs4,14,74
5,lujs5,15,75
6,lujs6,16,76
7,lujs7,17,77
8,lujs8,18,78
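Assuming the file sits in the current local directory, it can be uploaded with the standard HDFS commands:

hdfs dfs -mkdir -p /tmp/bulkLoadInput
hdfs dfs -put student.txt /tmp/bulkLoadInput/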
Create the studentbulk table in the hbase shell by running:
create 'studentbulk','info'
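Since the MapReduce phase writes one HFile per target Region, pre-splitting the table when creating it can spread the load across RegionServers for larger datasets. A hypothetical alternative create statement with two split points (not needed for this small sample):

create 'studentbulk','info', SPLITS => ['3','6']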
Define a custom Mapper class that parses the input lines into the key/value pairs used to generate the HFiles:
package com.unicom.ljs.hbase125.study;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author: Created By lujisen
 * @company ChinaUnicom Software JiNan
 * @date: 2020-02-01 14:58
 * @version: v1.0
 * @description: com.unicom.ljs.hbase125.study
 */
/* Mapper generics: input key type, input value type, output key type, output value type */
public class HFileMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] lineWords = value.toString().split(",");
        String rowKey = lineWords[0];
        ImmutableBytesWritable row = new ImmutableBytesWritable(Bytes.toBytes(rowKey));
        Put put = new Put(Bytes.toBytes(rowKey));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(lineWords[1]));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(lineWords[2]));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("score"), Bytes.toBytes(lineWords[3]));
        context.write(row, put);
    }
}
Driver (main) class example:
package com.unicom.ljs.hbase125.study;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadData {

    public static Configuration conf = null;
    public static Connection conn = null;
    public static Table table = null;
    public static RegionLocator locator = null;
    public static Admin admin = null;

    public static final String tableName = "studentbulk";
    public static final String inputPath = "/tmp/bulkLoadInput/";
    public static final String outputPath = "/tmp/bulkLoadOutput/";

    public static void main(String[] args) {
        try {
            conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "salver158.hadoop.unicom,salver31.hadoop.unicom,salver32.hadoop.unicom");
            conf.set("hbase.zookeeper.property.clientPort", "2181");
            conf.set("zookeeper.znode.parent", "/hbase-unsecure");

            conn = ConnectionFactory.createConnection(conf);
            table = conn.getTable(TableName.valueOf(tableName));
            locator = conn.getRegionLocator(TableName.valueOf(tableName));
            admin = conn.getAdmin();

            // Pass the HBase configuration to the job so the tasks see the same settings
            Job job = Job.getInstance(conf);
            job.setJarByClass(BulkLoadData.class);

            // Map-side output key/value types
            job.setMapperClass(HFileMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);

            // Input/output format classes
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(HFileOutputFormat2.class);

            // Input and output paths
            FileInputFormat.addInputPath(job, new Path(inputPath));
            FileOutputFormat.setOutputPath(job, new Path(outputPath));

            // Configure the job for BulkLoad (partitioner, reducer, compression, etc.)
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

            boolean result = job.waitForCompletion(true);
            System.out.println("MapReduce result: " + result);

            /* Move the generated HFiles into the HBase table */
            LoadIncrementalHFiles load = new LoadIncrementalHFiles(conf);
            load.doBulkLoad(new Path(outputPath), admin, table, locator);
        } catch (Exception e) {
            System.out.println("Error: " + e);
        } finally {
            System.out.println("BulkLoadData finished!");
        }
    }
}
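Note that LoadIncrementalHFiles can also be invoked from the command line instead of calling doBulkLoad programmatically; a sketch, assuming the same output path and table name as above:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/bulkLoadOutput studentbulk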
Finally, package the program into a jar that bundles its dependencies. To do so, add the following plugin to the pom:
<plugins>
    <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
            <archive>
                <manifest>
                    <mainClass>com.unicom.ljs.hbase125.study.BulkLoadData</mainClass>
                </manifest>
            </archive>
        </configuration>
        <executions>
            <execution>
                <id>make-assembly</id>
                <phase>package</phase>
                <goals>
                    <goal>single</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
</plugins>
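With the plugin in place, a standard Maven build should produce the *-jar-with-dependencies.jar under target/:

mvn clean package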
Then upload the resulting jar to any node in the cluster and submit it with the following command:
hadoop jar hbase125-1.0-SNAPSHOT-jar-with-dependencies.jar com.unicom.ljs.hbase125.study.BulkLoadData
Once the job completes, inspect the data to confirm that the bulk load succeeded.
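For example, in the hbase shell (a quick check that is only practical for a small sample table like this one):

scan 'studentbulk'
get 'studentbulk', '1'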