Setting Up a Distributed Hadoop Development Environment (Part 3): Development Environment Setup

Environment

  1. Development environment: Windows 10
  2. Development tools:
    1. IntelliJ IDEA 2018
    2. Gradle 4.10

Setup Steps

  1. I won't go over installing the development tools here; there is plenty of material online, so look it up yourself.

  2. Extract the same Hadoop package used for the cluster installation on Windows. My extraction path is F:\software\hadoop-2.6.5.

  3. Configure the environment variables:

    HADOOP_HOME=F:\software\hadoop-2.6.5
    # I deployed the Hadoop cluster as the root user. If you did not use root, set this to your HDFS superuser, i.e. the Linux user that starts the cluster.
    # This must be configured; otherwise, when you submit a jar from local development to the cluster, it will be submitted as your Windows user.
    HADOOP_USER_NAME=root
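
  If you prefer not to set a system-wide variable, Hadoop 2.x also picks up HADOOP_USER_NAME from a JVM system property, so it can be set in code before the first HDFS/YARN call. A minimal sketch (the class name is just for illustration):

    import org.apache.hadoop.security.UserGroupInformation;

    public class SubmitUserCheck {
        public static void main(String[] args) throws Exception {
            // Same effect as the HADOOP_USER_NAME environment variable above;
            // must be set before the first Hadoop security/filesystem call.
            System.setProperty("HADOOP_USER_NAME", "root");
            // Should print "root"
            System.out.println(UserGroupInformation.getCurrentUser().getShortUserName());
        }
    }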
    
  4. Download the hadoop.dll and winutils.exe files matching your Hadoop version. Copy hadoop.dll to C:\Windows\System32 and winutils.exe to F:\software\hadoop-2.6.5\bin. The download source (found on Gitee) is: jiashu / winutils
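
  Similarly, if you cannot or do not want to change the system environment, the Hadoop install directory can be supplied from code through the hadoop.home.dir JVM system property (Hadoop's Shell utilities check this property before falling back to HADOOP_HOME). A small sketch with an illustrative class name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WinutilsCheck {
        public static void main(String[] args) throws Exception {
            // Point Hadoop at the directory whose bin\ folder contains winutils.exe
            System.setProperty("hadoop.home.dir", "F:\\software\\hadoop-2.6.5");
            // Touch the local filesystem; a missing winutils.exe typically shows up here as a warning or error on Windows
            FileSystem local = FileSystem.getLocal(new Configuration());
            System.out.println(local.exists(new Path("F:/software/hadoop-2.6.5/bin/winutils.exe")));
        }
    }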

  5. Create a new Gradle project in IDEA (search online if you are unsure how). Then add the Hadoop dependencies matching your version to build.gradle; you can look them up in the Maven Repository. Here is mine:

    plugins {
        id 'java'
    }
    
    group 'com.bamboos'
    version '1.0-SNAPSHOT'
    
    sourceCompatibility = 1.8
    
    repositories {
    
        maven {
            url 'http://maven.aliyun.com/nexus/content/groups/public/'
        }
        mavenCentral()
    }
    
    dependencies {
        testCompile group: 'junit', name: 'junit', version: '4.12'
    
        // https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common
        compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.6.5'
    
        // https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs
        compile group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.6.5'
    
        // https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce
        compile group: 'org.apache.hadoop', name: 'hadoop-mapreduce', version: '2.6.5', ext: 'pom'
    
        // https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client
        compile group: 'org.apache.hadoop', name: 'hadoop-client', version: '2.6.5'
    
    }
    
    
    sourceSets {
        main {
            java {
                srcDir 'src/main/java' // source code directory
            }
            resources {
                srcDir 'src/main/resources' // resources directory
            }
        }
    
    }
    
    tasks.withType(JavaCompile) {
        options.encoding = "UTF-8"
    }
    
  6. Download the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml files from the cluster and put them in the project's resources directory. With that, the environment for connecting to the cluster is ready.
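
  With the config files in place, a quick way to verify the setup before running any job is to list an HDFS directory from the IDE. A minimal sketch (the class name is just for illustration; it assumes the four *-site.xml files are in src/main/resources):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSmokeTest {
        public static void main(String[] args) throws Exception {
            // true -> load the *-site.xml files from the classpath
            Configuration conf = new Configuration(true);
            try (FileSystem fs = FileSystem.get(conf)) {
                // Should print the cluster address and the contents of the HDFS root
                System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath());
                }
            }
        }
    }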

Testing

  1. We will use WordCount for the test. First, create a test data file word.txt on the server:

    hello hadoop
    hello java
    hello python
    hello world
    
  2. Upload the test data to HDFS:

    hdfs dfs -mkdir -p /test/wc/input
    hdfs dfs -put ./word.txt /test/wc/input
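
  You can also upload from the Windows development machine through the HDFS Java API once the config files from step 6 are on the classpath. A minimal sketch (the class name and the local path are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadWordTxt {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml/hdfs-site.xml from src/main/resources
            Configuration conf = new Configuration(true);
            try (FileSystem fs = FileSystem.get(conf)) {
                fs.mkdirs(new Path("/test/wc/input"));
                // Adjust the local source path to wherever word.txt lives on your machine
                fs.copyFromLocalFile(new Path("F:/tmp/word.txt"), new Path("/test/wc/input/word.txt"));
            }
        }
    }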
    
  3. Now write the test code. I will paste the code directly here. WordMapper.java:

    package com.bamboos.bigdata.demo;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    import java.io.IOException;
    import java.util.StringTokenizer;
    
    public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
    
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
    
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    
  4. WordReducer.java:

    package com.bamboos.bigdata.demo;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    import java.io.IOException;
    
    public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
    
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    
    
  5. WordCount.java

    package com.bamboos.bigdata.demo;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    
    
    public class WordCount {
    
        public static void main(String[] args) throws Exception {
    
            Configuration conf = new Configuration(true);
    
            System.out.println(conf.get("dfs.nameservices"));
    
            // Let the framework know this is a cross-platform submission from Windows (a heterogeneous platform)
            conf.set("mapreduce.app-submission.cross-platform","true");
    
            System.out.println(conf.get("dfs.nameservices"));
    
    
            Job job = Job.getInstance(conf);
    
            // Submit the jar built by the Gradle jar task; keep this path in sync with the build output
            job.setJar("F:\\workspace\\idea\\bigdata\\build\\libs\\bigdata-1.0-SNAPSHOT.jar");
            job.setJarByClass(WordCount.class);
    
            job.setJobName("wordcount");
    
            Path infile = new Path(args[0]);
            TextInputFormat.addInputPath(job, infile);
    
            Path outfile = new Path(args[1]);
            if (outfile.getFileSystem(conf).exists(outfile)) outfile.getFileSystem(conf).delete(outfile, true);
            TextOutputFormat.setOutputPath(job, outfile);
    
            job.setMapperClass(WordMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setReducerClass(WordReducer.class);
    
            job.waitForCompletion(true);
    
        }
    }
    
  6. Two things to do before running:

    1. Run Gradle's jar task to build the project's jar file; make sure the jar path in WordCount.java matches the path of the generated jar.
    2. Configure the run configuration for main with the program arguments /test/wc/input /test/wc/output, i.e. the args parameter of the main method.
  7. Then run it; you can view the job at http://node1:8088.

  8. After it finishes, cat the result data to check that it is correct:

    hdfs dfs -cat /test/wc/output/part-r-00000
    
    hadoop  1
    hello   4
    java    1
    python  1
    world   1
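
  Alternatively, you can read the result straight from the IDE through the HDFS API. A quick sketch (the class name is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadWordCountOutput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(true);
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataInputStream in = fs.open(new Path("/test/wc/output/part-r-00000"))) {
                // Dump the reducer output to the console
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }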