Hello World in Hadoop!

This is day 25 of my participation in the 更文挑战 (post-update challenge). For event details, see: 更文挑战 😄

WordCount is Hadoop's "Hello World". The program is simple, but I will use it here to tie together and discuss several related points.

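The WordCount program uses MapReduce to count how often each word occurs across a set of input documents. The code has three parts: the Mapper, the Reducer, and the main function.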

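Following MapReduce's design principle for parallel programs, the step that splits content into words does not depend on data held anywhere else, so it can run in parallel: each machine that receives raw data only needs to split its own input into words. That work goes to the map side. The reduce side then merges the counts for identical words. In between, a shuffle step has to process the map output before it can become reduce input.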

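So the map phase splits the input into words, and the shuffle gathers identical words and distributes them (always remember that the number of map tasks and the number of reduce tasks need not be equal). Gathering and distribution are MapReduce's default behavior and need no specific configuration. The reduce phase receives the grouped words and computes their frequencies. All data moves through the pipeline as <key, value> pairs, and the shuffle operates on the key. The map output is therefore designed with the word as key and 1 as value. The map input can use Hadoop's default input format, where each line of the file is the value and the key is that line's byte offset in the file (not its line number). The reduce input is the <key, value-list> that the shuffle assembles from the map output, and the reduce output has the word as key and its frequency as value.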
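To make the data flow above concrete, here is a minimal in-memory sketch in plain Java with no Hadoop dependency. The class `WordCountFlowSketch`, the `Pair` helper, and the method names are hypothetical, introduced only for illustration. Everything runs in one process, so it shows the shape of the <key, value> and <key, value-list> data, not how Hadoop actually distributes the work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountFlowSketch {

    // Hypothetical stand-in for one <key, value> pair emitted by the map step.
    static final class Pair {
        final String key;
        final int value;
        Pair(String key, int value) { this.key = key; this.value = value; }
    }

    // Map step: split one line into words and emit (word, 1) for each word.
    static List<Pair> map(String line) {
        List<Pair> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new Pair(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Shuffle step (simulated): group values by key into <key, value-list>, sorted by key.
    static SortedMap<String, List<Integer>> shuffle(List<Pair> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Pair p : pairs) {
            grouped.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value);
        }
        return grouped;
    }

    // Reduce step: sum the value list of one key to get that word's frequency.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Chain the three steps over all input lines.
    static SortedMap<String, Integer> wordCount(List<String> lines) {
        List<Pair> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello world");
        System.out.println(wordCount(lines)); // prints {hadoop=1, hello=2, world=1}
    }
}
```

After the shuffle step, the key `hello` carries the value list `[1, 1]`, which is exactly the <key, value-list> form the reduce step consumes.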

         

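package org.myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: input is (byte offset of the line, line text); output is (word, 1).
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text(); // reused across calls to avoid reallocating

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Split the line on whitespace and emit (word, 1) for every token.
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: input is (word, list of 1s gathered by the shuffle); output is (word, count).
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
        Job job = Job.getInstance(conf, "wordcount");
        // Tells Hadoop which jar to ship to the cluster nodes.
        job.setJarByClass(WordCount.class);

        // Types of the final (reduce) output; the map output uses the same types here.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0]: input path; args[1]: output path (must not exist yet).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Block until the job finishes and report success or failure via the exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}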
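One detail of the Mapper above is easy to miss: `StringTokenizer` with its default delimiters splits on whitespace only, so punctuation stays attached to words and case is preserved. The plain-Java sketch below shows this behavior. The class `TokenizerSketch` and the `normalize` helper are hypothetical and not part of the original program; `normalize` is just one possible cleanup you might apply before emitting a word, shown under the assumption that you want case-insensitive counts without surrounding punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

public class TokenizerSketch {

    // Split a line the same way the Mapper does: on whitespace only.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken());
        }
        return out;
    }

    // Hypothetical cleanup (not in the original code): lowercase the token and
    // strip leading and trailing characters that are neither letters nor digits.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT)
                .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    public static void main(String[] args) {
        // "Hello," and "hello" are different keys for the unmodified Mapper.
        System.out.println(tokens("Hello, world! hello")); // prints [Hello,, world!, hello]
        System.out.println(normalize("Hello,"));           // prints hello
    }
}
```

Without such a cleanup, `Hello,` and `hello` are counted as two separate words, which may or may not be what you want.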

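To run the Mapper and Reducer implemented above, you need to create a MapReduce job, which consists of three basic parts:

(1) The input data, that is, the data to be processed

(2) The MapReduce program, that is, the Mapper and Reducer implemented above

(3) The job's configuration, JobConf (JobConf belongs to the older org.apache.hadoop.mapred API; the code above uses Configuration and Job from the newer API for the same purpose)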

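To configure the job, it helps to understand roughly how Hadoop runs it:

(1) Hadoop divides a job into tasks. There are two kinds of task: map tasks and reduce tasks.

(2) Two kinds of nodes control how a job runs: the JobTracker and the TaskTrackers. (This is the classic MapReduce 1 architecture; under YARN in Hadoop 2 and later, these roles are taken over by the ResourceManager, a per-job ApplicationMaster, and the NodeManagers.)

(3) The JobTracker coordinates the whole job and assigns tasks to the different TaskTrackers.

(4) A TaskTracker runs tasks and reports the results back to the JobTracker.

(5) Hadoop divides the input data into fixed-size pieces called input splits.

(6) Hadoop creates one map task for each input split, and that task processes the records in the split one by one.

(7) Hadoop tries to run each task on the node whose DataNode holds the task's input data (every DataNode also runs a TaskTracker), which improves efficiency. This is why the input split size is usually the HDFS block size.

(8) The input of a reduce task is generally the output of the map tasks, and the output of the reduce tasks is the output of the whole job, stored on HDFS.

In the reduce phase, all records with the same key are guaranteed to be processed by the same reduce task on a single TaskTracker, while different keys may be processed on different TaskTrackers. This is called partitioning.

(9) The partition rule is a function (K2, V2) -> Integer: it derives a partition id from K2, and all K2 keys with the same id enter the same partition and are processed by the same Reducer on the same TaskTracker.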
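The partition rule in (9) can be sketched in a few lines of plain Java. The arithmetic below mirrors Hadoop's default HashPartitioner (mask off the sign bit of the key's hash code, then take it modulo the number of reduce tasks). The class `PartitionSketch` is hypothetical, and `String` stands in for Hadoop's `Text` key here, so the concrete partition numbers will differ from what a real job computes.

```java
public class PartitionSketch {

    // Map a key to a partition id in [0, numReduceTasks).
    // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
    // hash code can never produce a negative partition id.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // assumed number of reduce tasks for this demo
        String[] words = {"hello", "hadoop", "world", "hello"};
        for (String word : words) {
            // Both occurrences of "hello" print the same partition id.
            System.out.println(word + " -> partition " + partition(word, numReduceTasks));
        }
    }
}
```

Because the id depends only on the key, every (word, 1) pair for a given word lands in the same partition regardless of which map task emitted it, and this is what lets a single Reducer see the complete value list for that word.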


If this post contains any mistakes, please point them out. I would be very grateful! ❤️❤️❤️❤️