AWS Elastic MapReduce: converting a pipe (streaming) job to a JAR fails


A user ran into a problem with AWS Elastic MapReduce: the mapper and reducer worked fine when run as the pipe (streaming) version, but after loading the input, output, bootstrap actions, and so on through the Elastic MapReduce wizard, bootstrapping succeeded yet an error still occurred during execution. The error message the user received contained the following:

/etc/init.d/hadoop-state-pusher-control: line 35: /mnt/var/lib/hadoop-state-pusher/run-hadoop-state-pusher: No such file or directory

The user tried to resolve the issue by switching to a larger instance type, without success.


Solution

The solution is, when creating the Elastic MapReduce job, to specify the Streaming JAR file itself as one of the job's inputs, rather than passing the contents of the Streaming JAR file as the job's input.

--input s3://input-bucket/input-data.csv \
--input s3://jar-bucket/streaming-1.0.0.jar \

In the snippet above, streaming-1.0.0.jar is the S3 path to the Streaming JAR file.
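For context, a full job submission with those flags might look like the following. This is only a sketch using the legacy elastic-mapreduce command-line tool; every bucket name, script path, and the JAR filename are placeholders, not values from the original report:

```shell
# Sketch only: legacy elastic-mapreduce CLI; all S3 paths are placeholders.
elastic-mapreduce --create --stream \
  --input s3://input-bucket/input-data.csv \
  --input s3://jar-bucket/streaming-1.0.0.jar \
  --output s3://output-bucket/output-data \
  --mapper s3://script-bucket/mapper.py \
  --reducer s3://script-bucket/reducer.py
```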

Code example

The following is an example of such a Streaming JAR, written in Java:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StreamingWordCount {
  public static class MapClass extends Mapper<LongWritable, Text, Text, IntWritable> {
    private IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split each comma-separated line and emit a count of 1 per token.
      String[] words = value.toString().split(",");
      for (String word : words) {
        context.write(new Text(word), one);
      }
    }
  }

  public static class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts emitted for each word.
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Streaming Word Count");
    job.setJarByClass(StreamingWordCount.class);
    job.setMapperClass(MapClass.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setReducerClass(ReduceClass.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    /*
      Specify the job's input and output paths.
      The input path is the S3 location of the input CSV file;
      the output path is the S3 location (a directory, which must not
      already exist) where the word count results will be written.
    */
    FileInputFormat.addInputPath(job, new Path("s3://input-bucket/input-data.csv"));
    FileOutputFormat.setOutputPath(job, new Path("s3://output-bucket/output-data"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
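As a sanity check outside Hadoop, the comma-split-then-sum logic of the mapper and reducer above can be exercised in plain Java. The WordCountLogic class and the sample input below are illustrative only, not part of the original job:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountLogic {
  // Mirrors the job's logic: split each line on commas, then sum a
  // count of 1 per token, as the mapper and reducer do together.
  static Map<String, Integer> count(String[] lines) {
    Map<String, Integer> counts = new HashMap<>();
    for (String line : lines) {
      for (String word : line.split(",")) {
        counts.merge(word, 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<String, Integer> c = count(new String[] {"a,b,a", "b,c"});
    System.out.println(c.get("a") + " " + c.get("b") + " " + c.get("c")); // prints "2 2 1"
  }
}
```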

Complete steps

  1. Make sure your cluster has enough resources to handle the job.
  2. Create a new Elastic MapReduce job.
  3. On the "Configure" tab, specify the job's name, job type, input and output paths, and any other required settings.
  4. Specify the Streaming JAR file as an input to the job.
  5. Click the "Submit" button to start the job.
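The console steps above could be sketched as a single command, assuming the legacy elastic-mapreduce CLI; the job name, main class, and all S3 paths here are placeholders:

```shell
# Sketch only: run the packaged job as a custom JAR step.
elastic-mapreduce --create --name "StreamingWordCount" \
  --jar s3://jar-bucket/streaming-1.0.0.jar \
  --main-class StreamingWordCount \
  --arg s3://input-bucket/input-data.csv \
  --arg s3://output-bucket/output-data
```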