Splitting a BSON file
com.mongodb.hadoop.splitter.BSONSplitter is an executable class, shipped in mongo-hadoop-core.jar, that can split a BSON file and compress the resulting chunks.
You can run it like this:
hadoop jar mongo-hadoop-core.jar com.mongodb.hadoop.splitter.BSONSplitter
<fileName> [-c compressionCodec] [-o outputDirectory]
<fileName>: absolute path of the BSON file to split
-c compressionCodec (optional): compression codec for the output
-o outputDirectory: directory where the split files are written
Example:
hadoop jar mongo-hadoop-core.jar com.mongodb.hadoop.splitter.BSONSplitter \
file:///home/mongodb/backups/really-big-bson-file.bson \
-o hdfs://hadoop003:8020/bson05
The output looks like this:
20/09/15 15:51:41 INFO splitter.BSONSplitter: writing hdfs://hadoop003:8020/bson05/test05.bson-1.deflate.
20/09/15 15:51:42 INFO splitter.BSONSplitter: writing hdfs://hadoop003:8020/bson05/test05.bson-2.deflate.
20/09/15 15:51:43 INFO splitter.BSONSplitter: writing hdfs://hadoop003:8020/bson05/test05.bson-3.deflate.
20/09/15 15:51:44 INFO splitter.BSONSplitter: writing hdfs://hadoop003:8020/bson05/test05.bson-4.deflate.
20/09/15 15:51:45 INFO splitter.BSONSplitter: done.
The split files have been generated on HDFS.
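To confirm the result, you can list the output directory (a quick check using the same HDFS path as in the example above):

hadoop fs -ls hdfs://hadoop003:8020/bson05

The .deflate parts under /bson05 are what the external table created in the next step will read.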
Creating a Hive external table
Three jars are required: mongo-hadoop-core-2.0.2.jar, mongo-hadoop-hive-2.0.2.jar, and mongo-java-driver-3.2.2.jar.
Add the jars in the Hive session:
add jar /path/to/mongo-hadoop-core.jar;
add jar /path/to/mongo-hadoop-hive.jar;
add jar /path/to/mongo-java-driver.jar;
Enable dynamic partitioning (needed for the partitioned insert further down), then create the external table:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
CREATE EXTERNAL TABLE ttt05 (
id string,
xxtime TIMESTAMP
)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"id","xxtime":"xxtime"}')
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION '/bson05';
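As a quick sanity check that the SerDe can read the split BSON files, query a few rows (a minimal example using only the columns defined above):

select id, xxtime from ttt05 limit 10;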
Loading into a partitioned table
First create the partitioned table, partitioned by day and stored as Snappy-compressed Parquet:
create table partion_table(id string, xxtime timestamp)
partitioned by(dt string) stored as parquet
TBLPROPERTIES('parquet.compression'='SNAPPY');
Load the data from the external table into the partitioned table (the partition value dt is derived from xxtime):
INSERT INTO TABLE partion_table PARTITION (dt)
SELECT id, xxtime,
       from_unixtime(cast(unix_timestamp(xxtime) as bigint), 'yyyy-MM-dd') as dt
FROM ttt05;
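If the dynamic-partition insert succeeded, each day shows up as its own partition; a simple way to verify:

show partitions partion_table;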