统计youtube影音视频网站的常规指标，各种TopN指标：

--统计视频观看数Top10

--统计视频类别热度Top10

--统计视频观看数Top20所属类别

--统计视频观看数Top50所关联视频的所属类别Rank

--统计每个类别中的视频热度Top10

--统计每个类别中视频流量Top10

--统计上传视频最多的用户Top10以及他们上传的视频

--统计每个类别视频观看数Top10

二、项目表的字段

视频表

字段	备注	详细描述
video id	视频唯一id	11位字符串
uploader	视频上传者	上传视频的用户名String
age	视频年龄	视频在平台上的整数天
category	视频类别	上传视频指定的视频分类
length	视频长度	整形数字标识的视频长度
views	观看次数	视频被浏览的次数
rate	视频评分	满分5分
ratings	流量	视频的流量，整型数字
conments	评论数	一个视频的整数评论数
related ids	相关视频id	相关视频的id，最多20个

用户表

字段	备注	字段类型
uploader	上传者用户名	string
videos	上传视频数	int
friends	朋友数量	int

ETL原始数据

通过观察原始数据形式，可以发现，视频可以有多个所属分类，每个所属分类用&符号分割，且分割的两边有空格字符，同时相关视频也是可以有多个元素，多个相关视频又用“\t”进行分割。为了分析数据时方便对存在多个子元素的数据进行操作，我们首先进行数据重组清洗操作。即：将所有的类别用“&”分割，同时去掉两边空格，多个相关视频id也使用“&”进行分割。

核心要做三件事情

1.长度不够9的删掉

2.视频类别删掉空格

3.该相关视频的分割符

三、进行数据的清洗工作

ETL之ETLUtil : 清理工具类

import org.apache.commons.lang.StringUtils;



public class YoutubeUtil {

    /**

     * ? ? * 这个工具类方法，主要是用于清洗数据

     * ? ? * @param line

     * ? ? * @return

     * ? ?

     */

    public static String checkLine(String line) {

        StringBuilder builder = new StringBuilder();

        //  Video video = new Video();

        if (StringUtils.isBlank(line)) {

            return null;

        }



        //判断数据切割之后，长度小于9，直接丢弃

        String[] split = line.split("\t");



        if (split.length < 9) {

            return null;

        }

        //将第三个字段，也就是视频列表，当中的空格去掉

        split[3] = split[3].replace(" ", ""); //People & Blogs ===> People&Blogs



        //去掉相关视频当中的\t



        for (int i = 0; i < split.length; i++) {

            if (i < 9) {

                builder.append(split[i]).append("\t");

            } else if (i >= 9 && i < split.length - 1) {// 从索引9开始到倒数第二个

                builder.append(split[i]).append("&"); //相关id1&先关id2&相关id3&



            } else if (i == split.length - 1) { //拼接倒数第一个

                builder.append(split[i]);  //相关id1&先关id2&相关id3&相关id4

            }

        }

        return builder.toString();

    }



}

ETL之Mapper

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.NullWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;



import java.io.IOException;

/*

  K2:行文本数据

  V2:NullWritable

 */



public class YoutubeMapper extends Mapper<LongWritable, Text,Text, NullWritable> {

    @Override

    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        //将V1使用checkline方法进行清洗转换，得到一个新的一行数据

        String video = YoutubeUtil.checkLine(value.toString());

        //判断转换后得到的一行文本字符串是否为null

        if(null != video) {

            //将K2和V2写入上下文

          context.write(new Text(video),NullWritable.get());

        }

    }

}

ETL之Runner

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.NullWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;



import java.io.IOException;



public class YoutubeMain  {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

      Configuration configuration = new Configuration();

      Job job = Job.getInstance(configuration, "youtube");



      job.setJarByClass(YoutubeMain.class);



      job.setInputFormatClass(TextInputFormat.class);

      //TextInputFormat.addInputPath(job,new Path("file:///E:\\input\\youtube_video"));

      TextInputFormat.addInputPath(job,new Path("hdfs://node1:8020/input/youtube_video"));



      job.setMapperClass(YoutubeMapper.class);

      job.setMapOutputKeyClass(Text.class);

      job.setMapOutputValueClass(NullWritable.class);



      //本案例中没有Reducer，所以不需要设置Reduce

      //job.setReducerClass(YoutubeReducer.class);

      //job.setOutputKeyClass(Text.class);

      //job.setOutputValueClass(NullWritable.class);





      job.setOutputFormatClass(TextOutputFormat.class);

      //TextOutputFormat.setOutputPath(job,new Path("file:///E:\\output\\youtube_video"));

      TextOutputFormat.setOutputPath(job,new Path("hdfs://node1:8020/output/youtube_video"));

      boolean b = job.waitForCompletion(true);

      System.exit(b?0:1);;



    }

  }

四、准备工作

1、创建 hive 表

创建表：youtubevideo_ori，youtubevideo_user_ori

创建表：youtubevideo_orc，youtubevideo_user_orc

--创建: youtubevideo_ori表

create table youtubevideo_ori(

    videoId string, 

    uploader string, 

    age int, 

    category array<string>, 

    length int, 

    views int, 

    rate float, 

    ratings int, 

    comments int,

    relatedId array<string>)

row format delimited

fields terminated by "\t"

collection items terminated by "&"

stored as textfile;



--创建 youtubevideo_user_ori表:

create table youtubevideo_user_ori(

    uploader string,

    videos int,

    friends int)

row format delimited

fields terminated by "\t" 

stored as textfile;



--创建 youtubevideo_orc表:

create table youtubevideo_orc(

    videoId string, 

    uploader string, 

    age int, 

    category array<string>, 

    length int, 

    views int, 

    rate float, 

    ratings int, 

    comments int,

    relatedId array<string>)

row format delimited fields terminated by "\t" 

collection items terminated by "&" 

stored as orc;



--创建 youtubevideo_user_orc表:

create table youtubevideo_user_orc(

    uploader string,

    videos int,

    friends int)

row format delimited

fields terminated by "\t" 

stored as orc;

2、导入ETL后的数据

youtubevideo_ori：

load data inpath "/output/youtube_video" into table youtubevideo_ori;



youtubevideo_user_ori：

load data inpath "/youtube_video/user" into table youtubevideo_user_ori;

3、向ORC表插入数据

youtubevideo_orc：

insert overwrite table youtubevideo_orc select * from youtubevideo_ori;



youtubevideo_user_orc：

insert into table youtubevideo_user_orc select * from youtubevideo_user_ori;

五、业务分析

1、统计视频观看数 top10

思路：使用order by按照views字段做一个全局排序即可，同时我们设置只显示前10条。

select 

    videoId, 

    uploader, 

    age, 

    category, 

    length, 

    views, 

    rate, 

    ratings, 

    comments

from 

    youtubevideo_orc

order by 

    views

desc 

limit 10;





-- 方式2

select * from(

SELECT 

videoid ,age, category,length,views,

ROW_NUMBER() OVER( ORDER BY views DESC) AS rn

FROM  youtubevideo_orc

)t

where t.rn <= 10;

2、统计视频类别热度Top10

思路：即统计每个类别有多少个视频，显示出包含视频最多的前10个类别。

1)我们需要按照类别group by聚合，然后count组内的videoId个数即可。

2) 因为当前表结构为：一个视频对应一个或多个类别。所以如果要group by类别，需要先将类别进行列转行(展开)，然后再进行count即可。

3) 最后按照热度排序，显示前10条。

select 

    category_name as category, 

    count(t1.videoId) as hot

from (

    select 

        videoId,

        category_name

    from 

        youtubevideo_orc lateral view explode(category) t_catetory as category_name) t1

group by 

    t1.category_name

order by 

    hot

desc limit 

    10;

3、统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数

思路:

1) 先找到观看数最高的20个视频所属条目的所有信息，降序排列

2) 把这20条信息中的category分裂出来(列转行)

3) 最后查询视频分类名称和该分类下有多少个Top20的视频

select 

    category_name as category, 

    count(t2.videoId) as hot_with_views

from (

    select 

        videoId, 

        category_name

    from (

        select 

            * 

        from 

            youtubevideo_orc

        order by 

            views

        desc limit 

            20) t1 lateral view explode(category) t_catetory as category_name) t2

group by 

    category_name

order by 

    hot_with_views

desc;

4、统计每个类别中的视频热度Top10，以Music为例

思路：

1) 要想统计Music类别中的视频热度Top10，需要先找到Music类别，那么就需要将category展开，所以可以创建一张表用于存放categoryId展开的数据。

2) 向category展开的表中插入数据。

3) 统计对应类别（Music）中的视频热度。

--创建表类别表：

create table youtubevideo_category(

    videoId string, 

    uploader string, 

    age int, 

    categoryId string, 

    length int, 

    views int, 

    rate float, 

    ratings int, 

    comments int, 

    relatedId array<string>)

row format delimited

fields terminated by "\t" 

collection items terminated by "&" 

stored as orc;





--向类别表中插入数据：

insert into table youtubevideo_category  

    select 

        videoId,

        uploader,

        age,

        categoryId,

        length,

        views,

        rate,

        ratings,

        comments,

        relatedId

    from 

        youtubevideo_orc lateral view explode(category) catetory as categoryId;

        

--统计Music类别的Top10（也可以统计其他）

select 

    videoId, 

    views

from 

    youtubevideo_category

where 

    categoryId = "Music" 

order by 

    views

desc limit

10;



-- 方式2

select * from(

SELECT 

videoid ,age, categoryid,length,views,

ROW_NUMBER() OVER( ORDER BY views DESC) AS rn

FROM  youtubevideo_category where  categoryId = "Music" 

)t

where t.rn <= 10;

5、统计每个类别中视频流量Top10

思路：

1) 创建视频类别展开表（categoryId列转行后的表）

2) 按照ratings排序即可

select 

 * 

from(

SELECT 

   videoid,

   age, 

   categoryid,

   length,

   ratings,

   ROW_NUMBER() OVER(PARTITION BY categoryid  ORDER BY ratings  DESC) AS rn

 FROM  youtubevideo_category

    )t  

where t.rn <= 10;

6、统计上传视频最多的用户Top10以及他们上传的观看次数在前 1 0的视频

思路:

1) 先找到上传视频最多的10个用户的用户信息

2) 通过uploader字段与youtubevideo_orc表进行join，得到的信息按照views观看次数进行排序即可。

--第一步:

select 

    * 

from 

    youtubevideo_user_orc

order by 

    videos

desc limit 

    10;

    

--第二步:

select 

t2.videoId,

t2.uploader,

t2.views,

t2.ratings,

t1.videos,

t1.friends

from (

    select 

        * 

    from 

        youtubevideo_user_orc

    order by 

        videos desc 

    limit 

        10) t1

join 

    youtubevideo_orc t2

on 

    t1.uploader = t2.uploader

order by 

    views desc 

limit 

    20;

📢博客主页：lansonli.blog.csdn.net
📢欢迎点赞 👍 收藏 ⭐留言 📝 如有错误敬请指正！
📢本文由 Lansonli 原创，首发于 CSDN博客🙉
📢大数据系列文章会每天更新，停下休息的时候不要忘了别人还在奔跑，希望大家抓紧时间学习，全力奔赴更美好的生活✨

2021年大数据Hive（十二）：Hive综合案例！！！

系列历史文章

前言

Hive综合案例

一、需求描述

二、项目表的字段

​​​​​​​三、进行数据的清洗工作

​​​​​​​四、准备工作

1、 创建 hive 表