USING MAPREDUCE WITH HBASE
This "chained" use of MapReduce is popular and works well for parsing large files. Once the data has been uploaded to HBase, you can use MapReduce again to run aggregation queries over it.
To use MapReduce with HBase, Java is one choice of programming language, but it is not the only one: you can also write MapReduce jobs in Python, Ruby, or PHP, with HBase as the source and/or sink of the job.
In this example, I create four program elements that need to work together:
- A mapper class that emits key/value pairs.
- A reducer class that takes the values emitted by the mapper and operates on them to create aggregates. In this data-upload example, the reducer simply inserts the data into an HBase table.
- A driver class that ties the mapper and reducer classes together.
- A class that triggers the job in its main method.
You could also combine all four of these elements into a single class, in which case the mapper and reducer become static inner classes. For this example, however, you will create four separate classes, one for each of the elements just mentioned.
I assume Hadoop and HBase are already installed and configured. Add the required Hadoop and HBase jar files to the Java classpath so that you can compile and run the following example:
package com.treasuryofideas.hbasemr;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NyseMarketDataMapper extends
        Mapper<LongWritable, Text, Text, MapWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        final Text EXCHANGE = new Text("exchange");
        final Text STOCK_SYMBOL = new Text("stocksymbol");
        final Text DATE = new Text("date");
        final Text STOCK_PRICE_OPEN = new Text("stockPriceOpen");
        final Text STOCK_PRICE_HIGH = new Text("stockPriceHigh");
        final Text STOCK_PRICE_LOW = new Text("stockPriceLow");
        final Text STOCK_PRICE_CLOSE = new Text("stockPriceClose");
        final Text STOCK_VOLUME = new Text("stockVolume");
        final Text STOCK_PRICE_ADJ_CLOSE = new Text("stockPriceAdjClose");

        try {
            // sample market data csv file
            String strFile = "data/NYSE_daily_prices_A.csv";
            // create a BufferedReader to read the csv file
            BufferedReader br = new BufferedReader(new FileReader(strFile));
            String strLine;
            int lineNumber = 0;
            // read the comma-separated file line by line
            while ((strLine = br.readLine()) != null) {
                lineNumber++;
                // skip the header line
                if (lineNumber > 1) {
                    String[] data_values = strLine.split(",");
                    MapWritable marketData = new MapWritable();
                    marketData.put(EXCHANGE, new Text(data_values[0]));
                    marketData.put(STOCK_SYMBOL, new Text(data_values[1]));
                    marketData.put(DATE, new Text(data_values[2]));
                    marketData.put(STOCK_PRICE_OPEN, new Text(data_values[3]));
                    marketData.put(STOCK_PRICE_HIGH, new Text(data_values[4]));
                    marketData.put(STOCK_PRICE_LOW, new Text(data_values[5]));
                    marketData.put(STOCK_PRICE_CLOSE, new Text(data_values[6]));
                    marketData.put(STOCK_VOLUME, new Text(data_values[7]));
                    marketData.put(STOCK_PRICE_ADJ_CLOSE, new Text(data_values[8]));
                    // composite key: stock symbol plus date
                    context.write(new Text(String.format("%s-%s",
                            data_values[1], data_values[2])), marketData);
                }
            }
            br.close();
        } catch (Exception e) {
            System.err.println("Exception while reading csv file or process interrupted: " + e);
        }
    }
}
The preceding Java code is basic and focuses on demonstrating the key features of the map function. The mapper class extends org.apache.hadoop.mapreduce.Mapper and implements the map method, which takes a key, a value, and a Context object as input parameters. In the call to context.write, notice that I create a composite key by concatenating the stock symbol and the date.
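A symbol-date composite key is a good fit here because HBase keeps rows sorted lexicographically by row key, so all rows for one stock stay contiguous and date-ordered. The following stand-alone sketch illustrates that ordering property; the symbols and dates are purely illustrative:

```java
import java.util.Arrays;

public class CompositeKeyDemo {
    public static void main(String[] args) {
        // build symbol-date composite keys the same way the mapper does
        String[] rowKeys = {
            String.format("%s-%s", "AEA", "2010-02-08"),
            String.format("%s-%s", "AA",  "2010-02-09"),
            String.format("%s-%s", "AEA", "2010-02-05"),
            String.format("%s-%s", "AA",  "2010-02-08"),
        };
        // HBase stores rows in lexicographic row-key order;
        // sorting plain Strings mimics that ordering
        Arrays.sort(rowKeys);
        for (String k : rowKeys) {
            System.out.println(k);
        }
        // all "AA" rows come first (ordered by date), then all "AEA" rows
    }
}
```

Because the dates use the fixed yyyy-MM-dd format, lexicographic order within one symbol is also chronological order.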
The CSV parsing logic itself is simple and may need to be modified to handle commas appearing inside individual data fields. For the current data set, however, it works fine.
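If a field can itself contain commas, the usual CSV convention is to wrap that field in double quotes, and the split must then only break on commas that fall outside quotes. Below is a minimal sketch of that idea using a common regex idiom; it is not part of the example's code, just one way the parsing could be hardened:

```java
public class QuotedCsvSplitDemo {
    // split on commas that are NOT inside double-quoted fields:
    // a comma is a separator only if an even number of quotes follows it
    private static final String SPLIT_REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    public static String[] splitCsvLine(String line) {
        return line.split(SPLIT_REGEX, -1);
    }

    public static void main(String[] args) {
        String[] fields = splitCsvLine("NYSE,\"A, B CORP\",2010-02-08,25.5");
        for (String f : fields) {
            System.out.println(f);
        }
        // yields 4 fields; the quoted field keeps its internal comma
        // (the surrounding quotes are retained and would still need stripping)
    }
}
```

Note that a lookahead-based split rescans the rest of the line for every comma, so for very large files a single-pass parser or a CSV library would be the more efficient choice.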
The second element is a reducer class with a reduce method. The reduce method simply uploads the data into an HBase table. The code for the reducer is as follows:
public class NyseMarketDataReducer extends
        TableReducer<Text, MapWritable, ImmutableBytesWritable> {

    public void reduce(Text arg0, Iterable<MapWritable> arg1, Context context) {
        // since the composite key made up of stock symbol and date is unique,
        // only one value comes in for a key
        MapWritable marketData = null;
        for (MapWritable value : arg1) {
            marketData = value;
            break;
        }
        ImmutableBytesWritable key = new ImmutableBytesWritable(
                Bytes.toBytes(arg0.toString()));
        Put put = new Put(Bytes.toBytes(arg0.toString()));
        // serialize the MapWritable into the cell value
        put.add(Bytes.toBytes("mdata"), Bytes.toBytes("daily"),
                WritableUtils.toByteArray(marketData));
        try {
            context.write(key, put);
        } catch (IOException e) {
            System.err.println("Error writing to the HBase table: " + e);
        } catch (InterruptedException e) {
            System.err.println("Reduce task interrupted: " + e);
        }
    }
}
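An HBase cell stores only raw bytes, which is why the MapWritable must be serialized before it goes into the Put (and deserialized again when the row is read back). The following plain-Java sketch shows the round-trip idea using a HashMap in place of Hadoop's MapWritable, so it runs without Hadoop on the classpath; the toBytes/fromBytes helpers are hypothetical names, not part of the HBase API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class MapSerDemo {

    // serialize a String->String map: write the entry count, then each key/value
    public static byte[] toBytes(Map<String, String> map) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(map.size());
        for (Map.Entry<String, String> e : map.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        out.flush();
        return bos.toByteArray();
    }

    // the reverse of toBytes: read the count, then each key/value pair
    public static Map<String, String> fromBytes(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        int size = in.readInt();
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < size; i++) {
            map.put(in.readUTF(), in.readUTF());
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> marketData = new LinkedHashMap<>();
        marketData.put("stocksymbol", "AEA");
        marketData.put("date", "2010-02-08");
        byte[] cellValue = toBytes(marketData);          // what would go into the Put
        Map<String, String> restored = fromBytes(cellValue);
        System.out.println(restored.equals(marketData)); // prints "true"
    }
}
```

In the real reducer, Hadoop's own Writable machinery (as used by WritableUtils.toByteArray) plays the role of these helpers.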
The map and reduce functions are bound together in the driver class, as follows:
public class NyseMarketDataDriver extends Configured implements Tool {

    @Override
    public int run(String[] arg0) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        Job job = new Job(conf, "NYSE Market Data Sample Application");
        job.setJarByClass(NyseMarketDataSampleApplication.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(NyseMarketDataMapper.class);
        job.setReducerClass(NyseMarketDataReducer.class);
        job.setMapOutputKeyClass(Text.class);
        // the mapper emits MapWritable values, so declare that type here
        job.setMapOutputValueClass(MapWritable.class);
        FileInputFormat.addInputPath(job, new Path(
                "hdfs://localhost/path/to/NYSE_daily_prices_A.csv"));
        TableMapReduceUtil.initTableReducerJob("nysemarketdata",
                NyseMarketDataReducer.class, job);
        boolean jobSucceeded = job.waitForCompletion(true);
        if (jobSucceeded) {
            return 0;
        } else {
            return -1;
        }
    }
}
Finally, the driver needs to be triggered as follows:
package com.treasuryofideas.hbasemr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class NyseMarketDataSampleApplication {
    public static void main(String[] args) throws Exception {
        int m_rc = 0;
        m_rc = ToolRunner.run(new Configuration(),
                new NyseMarketDataDriver(), args);
        System.exit(m_rc);
    }
}
This concludes a simple MapReduce use case with HBase.