Importing MongoDB BSON Files into a Hive Partitioned Table: A Practical Walkthrough


Splitting the BSON File

com.mongodb.hadoop.splitter.BSONSplitter is an executable class, included in mongo-hadoop-core.jar, that can split and compress BSON files. You can run it like this:

hadoop jar mongo-hadoop-core.jar com.mongodb.hadoop.splitter.BSONSplitter
           <fileName> [-c compressionCodec] [-o outputDirectory]
  • fileName: the absolute path of the BSON file
  • compressionCodec: the compression codec to use (optional)
  • outputDirectory: the path where the split output files are written

Example:

hadoop jar mongo-hadoop-core.jar com.mongodb.hadoop.splitter.BSONSplitter \
    file:///home/mongodb/backups/really-big-bson-file.bson \
    -o hdfs://hadoop003:8020/bson05

The output looks like this:

20/09/15 15:51:41 INFO splitter.BSONSplitter: writing hdfs://hadoop003:8020/bson05/test05.bson-1.deflate.
20/09/15 15:51:42 INFO splitter.BSONSplitter: writing hdfs://hadoop003:8020/bson05/test05.bson-2.deflate.
20/09/15 15:51:43 INFO splitter.BSONSplitter: writing hdfs://hadoop003:8020/bson05/test05.bson-3.deflate.
20/09/15 15:51:44 INFO splitter.BSONSplitter: writing hdfs://hadoop003:8020/bson05/test05.bson-4.deflate.
20/09/15 15:51:45 INFO splitter.BSONSplitter: done.

The split files have now been generated on HDFS.
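To confirm the result, you can list the output directory on HDFS (same path as in the example above); each listed .deflate file is one split of the original BSON file:

hdfs dfs -ls hdfs://hadoop003:8020/bson05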

Creating the Hive External Table

Three jars are required: mongo-hadoop-core-2.0.2.jar, mongo-hadoop-hive-2.0.2.jar, and mongo-java-driver-3.2.2.jar.
Add the jars:

add jar /path/to/mongo-hadoop-core.jar;
add jar /path/to/mongo-hadoop-hive.jar;
add jar /path/to/mongo-java-driver.jar;
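As an optional check, you can list the jars registered in the current session from the Hive shell:

list jars;

Note that add jar only lasts for the current session; for a permanent setup the jars can instead be placed on the cluster and referenced via the hive.aux.jars.path property.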

Enable dynamic partitioning (needed for the insert into the partitioned table later) and create the external table:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
CREATE EXTERNAL TABLE ttt05 (
    id string,
    xxtime TIMESTAMP
)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"id","xxtime":"xxtime"}')
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION '/bson05';
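Before loading anything, it is worth a quick sanity check that the SerDe can actually read the split files through the external table (a minimal query against the table defined above):

select id, xxtime from ttt05 limit 10;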

Loading into the Partitioned Table

First create the partitioned table, partitioned by day:

create table partion_table(id string, xxtime timestamp)
    partitioned by(dt string) stored as parquet
    TBLPROPERTIES('parquet.compression'='SNAPPY');

Load the external table into the partitioned table:

INSERT INTO TABLE partion_table PARTITION (dt)
SELECT
    id,
    xxtime,
    from_unixtime(cast(unix_timestamp(xxtime) as bigint), 'yyyy-MM-dd') as dt
FROM ttt05;
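Afterwards, you can verify that the dynamic partitions were created and hold data (example checks against the tables defined above):

show partitions partion_table;
select dt, count(*) from partion_table group by dt;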