Offline Weibo Hot Topic Analysis with MapReduce

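Background

We have a set of Weibo posts collected by a crawler. Each record contains the post content, the publication time, a list of keywords, and similar fields.

The goal is to find hot topics by identifying the keywords that appear most often within a specified time window.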

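Sample Data

The crawled data is stored as JSON. A sample record, pretty-printed here for readability (the line-oriented Mapper further down expects each record on a single line):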

{
  "id": "12345",
  "content": "新出的科幻电影真好看！#科幻#电影#推荐",
  "timestamp": "2024-11-18T10:30:00",
  "keywords": ["科幻", "电影", "推荐"]
}

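id: unique identifier of the post.
content: body text of the post.
timestamp: publication time in ISO 8601 format (the sample carries no UTC offset).
keywords: list of keywords extracted at crawl time.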

Objectives

Count the number of occurrences of each keyword.
Within the specified time window, select the 10 most frequent keywords.

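MapReduce Implementation

The hot keyword analysis is built on Hadoop's MapReduce framework.

Mapper Implementation

Read the input, one JSON record per line.
Extract the keywords and timestamp of each post.
Keep only the posts whose timestamp falls within the specified time window.
Emit a <keyword, 1> pair for each keyword.

Mapper sample code: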

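import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class HotTopicMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
    private static final IntWritable ONE = new IntWritable(1);
    private final Text keyword = new Text();
    private long startTime;
    private long endTime;

    @Override
    protected void setup(Context context) {
        // Read the time window (epoch milliseconds) once per task instead of once per record.
        Configuration conf = context.getConfiguration();
        startTime = Long.parseLong(conf.get("startTime"));
        endTime = Long.parseLong(conf.get("endTime"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        WeiboData weibo;
        long timestamp;
        try {
            // Each input line is expected to hold one complete JSON record.
            weibo = OBJECT_MAPPER.readValue(value.toString(), WeiboData.class);
            timestamp = parseTimestamp(weibo.getTimestamp());
        } catch (IOException | RuntimeException e) {
            // Skip malformed records, but count them so dropped data shows up in the job counters.
            context.getCounter("HotTopic", "MalformedRecords").increment(1);
            return;
        }

        // Time filter, inclusive on both ends.
        if (timestamp < startTime || timestamp > endTime || weibo.getKeywords() == null) {
            return;
        }
        for (String word : weibo.getKeywords()) {
            if (word != null && !word.isEmpty()) {
                keyword.set(word);
                context.write(keyword, ONE);
            }
        }
    }

    private static long parseTimestamp(String timestamp) {
        // The sample timestamps carry no UTC offset, which Instant.parse rejects.
        // Assumption: they are in UTC; replace ZoneOffset.UTC if the crawler used another zone.
        return LocalDateTime.parse(timestamp).toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    // Data structure for one Weibo record; fields not listed here are ignored.
    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class WeiboData {
        private String id;
        private String content;
        private String timestamp;
        private String[] keywords;

        // Getters and setters used by Jackson
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }

        public String getContent() { return content; }
        public void setContent(String content) { this.content = content; }

        public String getTimestamp() { return timestamp; }
        public void setTimestamp(String timestamp) { this.timestamp = timestamp; }

        public String[] getKeywords() { return keywords; }
        public void setKeywords(String[] keywords) { this.keywords = keywords; }
    }
}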
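
The time filter depends on turning the record's timestamp into epoch milliseconds. The sample timestamp ("2024-11-18T10:30:00") has no UTC offset, and `java.time.Instant.parse` only accepts strings that include one, so the conversion has to choose a zone. The following is a minimal standalone sketch of that step, assuming the timestamps are in UTC (the data description does not say which zone the crawler used). `TimeWindowFilter` and its methods are hypothetical names for illustration, not part of the job.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical helper for illustration; not part of the original job.
final class TimeWindowFilter {
    // Assumption: crawler timestamps are in UTC. Change this if the data uses another zone.
    static final ZoneId ASSUMED_ZONE = ZoneOffset.UTC;

    /** Converts an offset-less ISO 8601 date-time such as "2024-11-18T10:30:00" to epoch milliseconds. */
    static long toEpochMillis(String isoLocalDateTime) {
        return LocalDateTime.parse(isoLocalDateTime)
                .atZone(ASSUMED_ZONE)
                .toInstant()
                .toEpochMilli();
    }

    /** Inclusive on both ends, matching the >= and <= comparison in the Mapper. */
    static boolean inWindow(long timestampMillis, long startMillis, long endMillis) {
        return timestampMillis >= startMillis && timestampMillis <= endMillis;
    }

    public static void main(String[] args) {
        long start = toEpochMillis("2024-11-18T10:00:00");
        long end = toEpochMillis("2024-11-18T11:00:00");
        long post = toEpochMillis("2024-11-18T10:30:00");
        System.out.println("start=" + start + " end=" + end);
        System.out.println("post in window: " + inWindow(post, start, end));
    }
}
```

The same `toEpochMillis` conversion can be used to derive the start and end values passed to the Driver, so that the window and the record timestamps are interpreted in the same zone.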

Reducer Implementation

Receive the <keyword, 1> pairs emitted by the Mapper.
Sum the counts for each keyword.
Emit <keyword, total count>.

Reducer sample code:

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

import java.io.IOException;

public class HotTopicReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

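Driver Implementation

Configure the job parameters: input path, output path, and the time window used for filtering.
Set the Mapper, the Reducer, and the input/output formats.

Driver sample code: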

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HotTopicAnalysisDriver {
    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.err.println("Usage: HotTopicAnalysisDriver <input_path> <output_path> <start_time> <end_time>");
            System.exit(-1);
        }

        Configuration conf = new Configuration();
        conf.set("startTime", args[2]); // window start, epoch milliseconds
        conf.set("endTime", args[3]);   // window end, epoch milliseconds

        Job job = Job.getInstance(conf, "Weibo Hot Topic Analysis");
        job.setJarByClass(HotTopicAnalysisDriver.class);

        job.setMapperClass(HotTopicMapper.class);
        job.setReducerClass(HotTopicReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        TextInputFormat.addInputPath(job, new Path(args[0])); // input path
        TextOutputFormat.setOutputPath(job, new Path(args[1])); // output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Execution Flow

Prepare the input data
Store the crawled Weibo data in HDFS, for example under /weibo/input/.
Run the job
Submit the job to the Hadoop cluster:
```
hadoop jar HotTopicAnalysis.jar HotTopicAnalysisDriver /weibo/input/ /weibo/output/ 1700000000000 1700003600000
```
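- 1700000000000: start of the time window in epoch milliseconds (2023-11-14T22:13:20 UTC).
- 1700003600000: end of the time window in epoch milliseconds (one hour later).
View the results
When the job finishes, the results are in the HDFS output path /weibo/output/. They can be inspected with: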
```
hdfs dfs -cat /weibo/output/part-*
```
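
The job as written emits a count for every keyword, sorted by key rather than by count, so the top 10 still has to be picked from the output. With `TextOutputFormat`, each output line is the key and the value separated by a tab. The sketch below ranks such lines locally; `OutputRanker` is a hypothetical helper, and it assumes the output is small enough to fit in memory on one machine.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing helper; not part of the original job.
final class OutputRanker {
    /** Parses "keyword<TAB>count" lines and returns the n entries with the highest counts. */
    static List<Map.Entry<String, Integer>> topN(List<String> lines, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab <= 0) {
                continue; // skip blank lines and lines without a key/value separator
            }
            String keyword = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            entries.add(new SimpleEntry<String, Integer>(keyword, count));
        }
        // Highest count first; ties are broken alphabetically so the order is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Sample lines in the shape the job writes; real input would come from the part-* files.
        List<String> sample = Arrays.asList("AI\t700", "推荐\t800", "电影\t1200", "科幻\t1500");
        for (Map.Entry<String, Integer> e : topN(sample, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For larger outputs, the ranking is better done inside the cluster, for example in the reduce phase or in a second job.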

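Extensions and Optimizations

Top N hot keywords:
- A priority queue (PriorityQueue) in the Reducer can keep only the N most frequent keywords. This yields a global ranking only when the job runs a single reducer; with several reducers, each one produces a partial Top N that a further pass has to merge.
Multi-dimensional analysis:
- Break the keyword statistics down by dimensions such as region or user group.
Performance tuning:
- Store the input in a compressed format such as Gzip to reduce I/O. A Gzip file is not splittable, so each file is read by a single map task; keep individual files reasonably small or use a splittable codec such as bzip2.
- Tune the resource allocation of the Hadoop cluster to raise task parallelism.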
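
For the Top N item above, the usual priority-queue technique is a min-heap bounded to N entries: every keyword total is offered to the heap, and whenever the heap grows past N the smallest entry is evicted. The sketch below shows only that data structure in plain Java; `TopNTracker` and `KeywordCount` are hypothetical names. Inside a Hadoop Reducer, `offer` would be called from `reduce()` with `key.toString()` (Hadoop reuses the `Text` object between calls) and the result written from `cleanup()`, under the assumption that the job is configured with a single reducer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper illustrating a bounded min-heap; not part of the original job.
final class TopNTracker {
    static final class KeywordCount implements Comparable<KeywordCount> {
        final String keyword;
        final int count;

        KeywordCount(String keyword, int count) {
            this.keyword = keyword;
            this.count = count;
        }

        @Override
        public int compareTo(KeywordCount other) {
            // Smaller count sorts first; on equal counts the alphabetically later keyword sorts first.
            int byCount = Integer.compare(count, other.count);
            return byCount != 0 ? byCount : other.keyword.compareTo(keyword);
        }
    }

    private final int capacity;
    // PriorityQueue is a min-heap, so the weakest candidate is always at the head.
    private final PriorityQueue<KeywordCount> heap = new PriorityQueue<>();

    TopNTracker(int capacity) {
        this.capacity = capacity;
    }

    void offer(String keyword, int count) {
        heap.add(new KeywordCount(keyword, count));
        if (heap.size() > capacity) {
            heap.poll(); // evict the current smallest so at most `capacity` entries remain
        }
    }

    /** Returns the retained entries, highest count first. */
    List<KeywordCount> descending() {
        List<KeywordCount> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        TopNTracker tracker = new TopNTracker(3);
        tracker.offer("科幻", 1500);
        tracker.offer("AI", 700);
        tracker.offer("电影", 1200);
        tracker.offer("ChatGPT", 600);
        tracker.offer("推荐", 800);
        for (KeywordCount kc : tracker.descending()) {
            System.out.println(kc.keyword + "\t" + kc.count);
        }
    }
}
```

Memory use stays proportional to N regardless of how many distinct keywords the reducer sees, which is the main reason to prefer the heap over sorting the full output.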

Sample Results

Keyword   Count
科幻      1500
电影      1200
推荐      800
AI        700
ChatGPT   600

This analysis quickly surfaces the most popular keywords on Weibo, which supports hot-topic monitoring and content recommendation.