Environment
- Development environment: Windows 10
- Development tools:
  - IDEA 2018
  - Gradle 4.10
Setup Steps
- I won't go over installing the development tools; there is plenty of material online, so look it up as needed.
- Extract the same Hadoop package that was installed on the cluster onto Windows. My extraction path is F:\software\hadoop-2.6.5.
- Configure the environment variables:

```
HADOOP_HOME=F:\software\hadoop-2.6.5
# I deployed the Hadoop environment as root throughout. If you are not using root,
# set this to your HDFS superuser (the user that starts the cluster on Linux).
# This one is mandatory: without it, jars submitted to the cluster from local
# development are submitted as the Windows user.
HADOOP_USER_NAME=root
```
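If you would rather keep this setting inside the project than in the Windows environment, Hadoop 2.x's UserGroupInformation also falls back to a JVM system property of the same name when the environment variable is absent. A minimal sketch of that alternative (my addition, not part of the original steps; verify against your Hadoop version):

```java
public class SubmitAsRoot {
    public static void main(String[] args) {
        // Must run before the first FileSystem/Job call triggers a Hadoop login;
        // UserGroupInformation reads this system property when the
        // HADOOP_USER_NAME environment variable is not set.
        System.setProperty("HADOOP_USER_NAME", "root");
        // ... build the Configuration and submit the job as usual
    }
}
```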
- Download the hadoop.dll and winutils.exe files matching your Hadoop version. Copy hadoop.dll to C:\Windows\System32 and winutils.exe to F:\software\hadoop-2.6.5\bin. Download source (found on Gitee): jiashu / winutils.
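Before going further, it can save some head-scratching to verify the native files landed where Hadoop looks for them. A small sanity-check sketch (my addition; the paths are the ones used in this post):

```java
import java.io.File;

public class NativeSetupCheck {
    public static void main(String[] args) {
        String hadoopHome = System.getenv("HADOOP_HOME");
        System.out.println("HADOOP_HOME = " + hadoopHome);
        // winutils.exe must sit under %HADOOP_HOME%\bin
        System.out.println("winutils.exe found: "
                + (hadoopHome != null && new File(hadoopHome, "bin\\winutils.exe").isFile()));
        // hadoop.dll must be on the Windows DLL search path, e.g. System32
        System.out.println("hadoop.dll found: "
                + new File("C:\\Windows\\System32\\hadoop.dll").isFile());
    }
}
```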
- Create a new Gradle project in IDEA (search online if you have not done this before). Then add the Hadoop dependencies for your version to build.gradle; you can look them up on Maven Repository. Here is mine:

```groovy
plugins {
    id 'java'
}

group 'com.bamboos'
version '1.0-SNAPSHOT'

sourceCompatibility = 1.8

repositories {
    maven { url 'http://maven.aliyun.com/nexus/content/groups/public/' }
    mavenCentral()
}

dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.12'
    // https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.6.5'
    // https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs
    compile group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.6.5'
    // https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce
    compile group: 'org.apache.hadoop', name: 'hadoop-mapreduce', version: '2.6.5', ext: 'pom'
    // https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client
    compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '2.6.5'
}

sourceSets {
    main {
        java {
            srcDir 'src/main/java' // source directory
        }
        resources {
            srcDir 'src/main/resources' // resources directory
        }
    }
}

tasks.withType(JavaCompile) {
    options.encoding = "UTF-8"
}
```
- Download core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml from the cluster and put them in the project's resources directory. With that, the environment for connecting to the cluster is ready.
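To confirm the four XML files are actually being picked up, you can run a quick smoke test before writing any MapReduce code: `new Configuration()` loads the *-site.xml files from the classpath, so listing the HDFS root proves the project can reach the cluster. A minimal sketch (my addition, not part of the original steps):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from src/main/resources
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```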
Testing
- The test here is wordcount. First, create a test data file word.txt on the server:

```
hello hadoop
hello java
hello python
hello world
```
- Upload the test data to HDFS:

```shell
hdfs dfs -mkdir -p /test/wc/input
hdfs dfs -put ./word.txt /test/wc/input
```
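If you prefer to stage the test data from the IDE instead of the server shell, the FileSystem API can do the same two steps. A sketch (my addition; the local path F:\tmp\word.txt is an assumption, adjust it to wherever your file lives):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadWordTxt {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path input = new Path("/test/wc/input");
            fs.mkdirs(input); // same as: hdfs dfs -mkdir -p /test/wc/input
            // same as: hdfs dfs -put ./word.txt /test/wc/input
            fs.copyFromLocalFile(new Path("F:\\tmp\\word.txt"), input); // assumed local path
        }
    }
}
```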
- Then write the test code. I will paste it here in full. WordMapper.java:

```java
package com.bamboos.bigdata.demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the line on whitespace and emit (word, 1) for each token
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
```
- WordReducer.java:

```java
package com.bamboos.bigdata.demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts for each word and emit (word, total)
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```
- WordCount.java:

```java
package com.bamboos.bigdata.demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(true);
        System.out.println(conf.get("dfs.nameservices"));
        // Let the framework know the job is submitted from a heterogeneous
        // (Windows) platform
        conf.set("mapreduce.app-submission.cross-platform", "true");
        System.out.println(conf.get("dfs.nameservices"));

        Job job = Job.getInstance(conf);
        // Must match the path of the jar produced by the Gradle jar task
        job.setJar("F:\\workspace\\idea\\bigdata\\build\\libs\\bigdata-1.0-SNAPSHOT.jar");
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        Path infile = new Path(args[0]);
        TextInputFormat.addInputPath(job, infile);

        // Delete the output directory if it already exists, otherwise the job fails
        Path outfile = new Path(args[1]);
        if (outfile.getFileSystem(conf).exists(outfile))
            outfile.getFileSystem(conf).delete(outfile, true);
        TextOutputFormat.setOutputPath(job, outfile);

        job.setMapperClass(WordMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(WordReducer.class);

        job.waitForCompletion(true);
    }
}
```
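One optional refinement: if you want the run configuration to also pass generic Hadoop options such as -D key=value alongside the input/output paths, hadoop-common's GenericOptionsParser can strip them out before you read the paths. A sketch of how it would slot into main (my addition, not in the original driver):

```java
// Requires: import org.apache.hadoop.util.GenericOptionsParser;
// Inside main(), replacing the direct reads of args[0] / args[1]:
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
Path infile = new Path(otherArgs[0]);
Path outfile = new Path(otherArgs[1]);
```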
- Two things must be done before running:
  - Run Gradle's jar task to produce the project jar. Note that the jar path in WordCount.java must match the path of the generated jar.
  - Set the program arguments of the run configuration to /test/wc/input /test/wc/output, i.e. the args parameter of the main method.
- Then run it; you can view the job at http://node1:8088.
- After the job finishes, cat the result data to check that it is correct:

```shell
hdfs dfs -cat /test/wc/output/part-r-00000
hadoop  1
hello   4
java    1
python  1
world   1
```
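The same check can be done from the IDE by streaming the result file straight out of HDFS rather than running hdfs dfs -cat on the server. A sketch (my addition):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CatResult {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/test/wc/output/part-r-00000")), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // each line is "word<TAB>count"
            }
        }
    }
}
```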