Film Review Analysis with HQL (a Linux shell scripting case study)

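1. Prerequisites

We already have a crawler program:

After each crawl it writes the scraped data to a log file under the local data/collect/ directory.

There is also a Flume script.

It is a collection plan whose job is to import the contents of that directory into HDFS.

In other words, Flume ships the files at this local location into the cluster.

So overall: the crawler produces the data, and Flume imports it into HDFS (folders named /flume/<year>-<month>-<day>).

Based on this data we build a Hive data warehouse (plan the warehouse and its tables, load the data) and run the analysis.

The instructor's design:

The crawler runs once a day at 1:00.

The Flume agent runs in the background continuously; whenever it detects a matching log file in the directory, it automatically pushes it to HDFS.

Hive runs at 2:00 every day: it loads the newly added data and processes the data set, yesterday's data included.

Create a folder and upload the film_rating script provided by the instructor.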

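2. Ways to operate Hive

Type commands directly in the Hive CLI.
Use a SQL script, then run it with hive -f script_name.sql

However, Linux commands cannot be written inside a SQL script file, for example the command that gets yesterday's date.

Use a shell script.

Inside the script, call hive -e "the actual Hive commands"

[Example]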
[hadoop@hadoop01 1_22]$ hive -e "show databases;"
which: no hbase in (/home/hadoop/.local/bin:/home/hadoop/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/model/jdk1.8/bin:/opt/model/nginx/sbin:/opt/model/hadoop-3.2.1/bin:/opt/model/hadoop-3.2.1/sbin:/opt/model/flume-1.9.0/bin:/opt/model/hive-3.1.2/bin:/opt/model/sqoop2/bin:/opt/model/scala-2.12.8/bin:/opt/model/spark-3.0/sbin:/opt/model/spark-3.0/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/model/hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/model/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 88a6a1fb-dcb1-4a3e-8bc1-419db61ae918

Logging initialized using configuration in file:/opt/model/hive-3.1.2/conf/hive-log4j2.properties Async: true
Hive Session ID = d3af9381-9532-4739-bd5f-ce003a0cb257
OK
apply_service_code
city_db
default
demodb
film_db
film_db2
gmall
marketdb
model_service_code
myschooldb
original_service_code
userdb
weather_db
Time taken: 2.28 seconds, Fetched: 13 row(s)
[hadoop@hadoop01 1_22]$

[Write a shell script]

[Find where the hive executable lives]

[hadoop@hadoop01 hive_dir]$ which hive
/opt/model/hive-3.1.2/bin/hive

[hadoop@hadoop01 film_rating]$ vi test_hive.sh
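
The text typed into vi is not shown in these notes. A minimal sketch, assuming the script simply wraps the hive -e call from the example above and uses the absolute path reported by which hive, could be created as follows (the heredoc stands in for the interactive vi session):

```shell
# Write a hypothetical test_hive.sh; the heredoc replaces the interactive vi session.
cat > test_hive.sh <<'EOF'
#!/bin/bash
# Call hive by absolute path so the script does not depend on PATH.
/opt/model/hive-3.1.2/bin/hive -e "show databases;"
EOF

# Show what was written.
cat test_hive.sh
```

Using the absolute path matters most once the script is started by a scheduler, where PATH is often much shorter than in a login shell.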

[Give the script execute permission]

[hadoop@hadoop01 film_rating]$ chmod +x test_hive.sh

The file name now shows in green, which means it is executable.

[Run it]

[hadoop@hadoop01 film_rating]$ ./test_hive.sh 
which: no hbase in (/home/hadoop/.local/bin:/home/hadoop/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/model/jdk1.8/bin:/opt/model/nginx/sbin:/opt/model/hadoop-3.2.1/bin:/opt/model/hadoop-3.2.1/sbin:/opt/model/flume-1.9.0/bin:/opt/model/hive-3.1.2/bin:/opt/model/sqoop2/bin:/opt/model/scala-2.12.8/bin:/opt/model/spark-3.0/sbin:/opt/model/spark-3.0/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/model/hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/model/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 03e6ee68-0831-4f8c-8aae-453897b4d76e

Logging initialized using configuration in file:/opt/model/hive-3.1.2/conf/hive-log4j2.properties Async: true
Hive Session ID = 2f8675c0-cf62-4eef-afce-b5283d58efee
OK
apply_service_code
city_db
default
demodb
film_db
film_db2
gmall
marketdb
model_service_code
myschooldb
original_service_code
userdb
weather_db
Time taken: 1.946 seconds, Fetched: 13 row(s)

This way, a single script can contain both shell commands and Hive commands.

Declare a variable in the shell: nowdate="2021-01-25"

It can then be referenced further down as ${nowdate}.
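
A small sketch of why the quoting matters here: the shell expands ${nowdate} only inside double quotes, so that is the form to use around the text passed to hive -e. The dfs -ls statement is just an illustrative command chosen for this example; any Hive text behaves the same way.

```shell
#!/bin/bash
nowdate="2021-01-25"

# Double quotes: the shell substitutes the variable before hive sees the text.
expanded="dfs -ls /flume/${nowdate};"
# Single quotes: the text is passed through literally, which is usually not wanted.
literal='dfs -ls /flume/${nowdate};'

echo "${expanded}"
echo "${literal}"

# The expanded text is what would be handed to Hive, for example:
#   hive -e "${expanded}"
```

The first echo prints the path with the date filled in, while the second still contains the unexpanded ${nowdate} placeholder.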

3. Preparation

The analysis works on the previous day's data, so pull down the existing data and pretend it is yesterday's.

[hadoop@hadoop01 film_rating]$ hdfs dfs -get /flume/2021-01-23/* ./ # first pull the data down
2021-01-26 14:53:12,107 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
[hadoop@hadoop01 film_rating]$ ll
总用量 476
-rw-rw-r-- 1 hadoop hadoop   1415 1月  26 14:16 film_rating.sh
-rw-rw-rw- 1 hadoop hadoop 292954 1月  26 14:53 log_file_14.1611384696195.log
-rw-rw-rw- 1 hadoop hadoop 183953 1月  26 14:53 log_file_20.1611403201575.log
-rwxrwxr-x 1 hadoop hadoop     92 1月  26 14:41 test_hive.sh
[hadoop@hadoop01 film_rating]$ hdfs dfs -mkdir /flume/2021-01-24/ # this one was created by mistake
[hadoop@hadoop01 film_rating]$ hdfs dfs -mkdir /flume/2021-01-25/ # create yesterday's directory
[hadoop@hadoop01 film_rating]$ hdfs dfs -put ./*.log /flume/2021-01-25/ # upload every .log file in the current directory to that directory
2021-01-26 14:56:20,034 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-01-26 14:56:20,317 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

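Take a look at the first four records.

4. Expected results

Our goal:

Generate three tables in the database:

The source table film_rating stores the raw review data.

The analysis results:

rating_class: the breakdown of review ratings (how many people chose 力荐 "highly recommended", 推荐 "recommended", and 还行 "okay")
rating_detail: for each review title, how many people found it useful (use_count), how many found it not useful (not_use_count), and how many replied (reply_count)

The analysis results are also derived from the source table.

5. Detailed steps

Delete the files fetched earlier with -get; they are no longer needed.

Use the -getmerge command to pull the two files down as one and check how many records there are in total: a little over 5,000.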
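
A sketch of that check, assuming the files sit in /flume/2021-01-25 as uploaded in the preparation step; merged.log is a hypothetical local file name. The guard is only there so the snippet stays harmless on a machine without the Hadoop client.

```shell
#!/bin/bash
src_dir="/flume/2021-01-25"   # HDFS directory created in the preparation step
merged="merged.log"           # hypothetical local output file

if command -v hdfs >/dev/null 2>&1; then
  # Concatenate every file in the HDFS directory into one local file.
  hdfs dfs -getmerge "${src_dir}" "${merged}"
  # One record per line, so the line count is the record count.
  wc -l < "${merged}"
else
  echo "hdfs not found on PATH; skipping getmerge of ${src_dir}"
fi
```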

The film_rating.sh script

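#!/bin/bash
# The first line is the shebang that marks this as a shell script; it must be present
# Get yesterday's date
yesterday=$(date --date='1 days ago' +%Y-%m-%d)
year=$(date --date='1 days ago' +%Y)
month=$(date --date='1 days ago' +%m)
echo ${yesterday}
echo ${year}
echo ${month}

# Create the ODS layer for film_rating; everything inside the quotes below is Hive code
/opt/model/hive-3.1.2/bin/hive -e "
-- 1. Create the database. Remember to drop any existing film_db first. In a real production environment this database is created only once
create database if not exists film_db;
-- 2. Switch to the database
use film_db;

-- 3. Create the review data table
create table if not exists film_rating(
 spider_time string, -- time the record was crawled
 title string, -- review title
 name string, -- reviewer
 rating_score string, -- rating
 rating_date string, -- time the review was posted
 use_count int, -- number of people who found it useful
 not_use_count int, -- number of people who found it not useful
 reply_count int -- number of replies
)
partitioned by (year int, month int)  -- partition by year and month
row format delimited 
fields terminated by ','
stored as textfile
location '/input/film_ratings';

-- 4. Load the data
-- Without the local keyword, the data is loaded from the given HDFS path
load data inpath '/flume/${yesterday}/log_file*.log'  into table film_rating partition(year=${year},month=${month});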

-- 5. Analyze the review statistics
drop table if exists rating_detail;
create table rating_detail
row format delimited 
fields terminated by ','
stored as textfile
 as
select title,use_count,reply_count,not_use_count
from film_rating
order by use_count desc,reply_count desc,not_use_count asc
limit 10;


-- 6. Count the reviews by rating class
drop table if exists rating_class;
create table rating_class
row format delimited 
fields terminated by ','
stored as textfile
 as
select rating_score,count(*)
from film_rating
where rating_score in('力荐','推荐','还行')
group by rating_score;
"

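When dropping the database, remember to add cascade (drop database film_db cascade), since without it Hive refuses to drop a database that still contains tables.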
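
A side note on the date handling at the top of film_rating.sh, offered as a sketch rather than as part of the instructor's script: the three date calls can be reduced to one, with year and month cut out of the yesterday string. This keeps the three values consistent even if the script happens to run across midnight, and the base-10 arithmetic strips the leading zero so the month is a plain integer. GNU date is assumed, as in the original script.

```shell
#!/bin/bash
# Ask date once, then derive the parts from that single value (GNU date assumed).
yesterday=$(date --date='1 days ago' +%Y-%m-%d)

year=${yesterday%%-*}        # text before the first dash, e.g. 2021
month=${yesterday#*-}        # drop the year, leaving MM-DD
month=${month%-*}            # drop the day, leaving MM
month=$((10#${month}))       # force base 10 so 08 and 09 are valid, and drop the leading zero

echo "${yesterday} ${year} ${month}"
```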

Extension:

Linux has a scheduled-task facility, crontab, which can run this shell script at a fixed time every day (worth exploring on your own).
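
As a starting point for that exploration: a crontab entry has five time fields (minute, hour, day of month, month, day of week) followed by the command. The sketch below builds an entry for 2:00 every day, matching the schedule described in the prerequisites; the script path and log path are assumptions to adjust.

```shell
#!/bin/bash
script_path="/home/hadoop/film_rating/film_rating.sh"   # assumed location of the script
log_path="/tmp/film_rating.log"                          # assumed log file

# minute hour day-of-month month day-of-week command
cron_line="0 2 * * * ${script_path} >> ${log_path} 2>&1"
echo "${cron_line}"

# To install it, append the line to the current user's crontab:
#   (crontab -l 2>/dev/null; echo "${cron_line}") | crontab -
```

Redirecting both stdout and stderr to a log file is worthwhile here because cron jobs have no terminal, so the Hive output would otherwise be lost or mailed to the user.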

Give this shell script execute permission.

1. Build the review data table

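#!/bin/bash
# The first line is the shebang that marks this as a shell script; it must be present
# Get yesterday's date
yesterday=$(date --date='1 days ago' +%Y-%m-%d)
year=$(date --date='1 days ago' +%Y)
month=$(date --date='1 days ago' +%m)
echo ${yesterday}
echo ${year}
echo ${month}

# Create the ODS layer for film_rating; everything inside the quotes below is Hive code
/opt/model/hive-3.1.2/bin/hive -e "
-- 1. Create the database. Remember to drop any existing film_db first. In a real production environment this database is created only once
create database if not exists film_db;
-- 2. Switch to the database
use film_db;

-- 3. Create the review data table
create table if not exists film_rating(
 spider_time string, -- time the record was crawled
 title string, -- review title
 name string, -- reviewer
 rating_score string, -- rating
 rating_date string, -- time the review was posted
 use_count int, -- number of people who found it useful
 not_use_count int, -- number of people who found it not useful
 reply_count int -- number of replies
)
partitioned by (year int, month int)  -- partition by year and month
row format delimited 
fields terminated by ','
stored as textfile
location '/input/film_ratings';

-- 4. Load the data
-- Without the local keyword, the data is loaded from the given HDFS path. Note that load is a cut, a real move: after the load the original directory no longer holds the data
load data inpath '/flume/${yesterday}/log_file*.log'  into table film_rating partition(year=${year},month=${month});
"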

2. Analyze the review statistics

-- 5. Analyze the review statistics
drop table if exists rating_detail;
create table rating_detail
row format delimited 
fields terminated by ','
stored as textfile
 as
select title,use_count,reply_count,not_use_count
from film_rating
order by use_count desc,reply_count desc,not_use_count asc
limit 10;

3. Count the reviews by rating class

-- 6. Count the reviews by rating class
drop table if exists rating_class;
create table rating_class
row format delimited 
fields terminated by ','
stored as textfile
 as
select rating_score,count(*)
from film_rating
where rating_score in('力荐','推荐','还行')
group by rating_score;
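
To check the two result tables from the shell, a sketch along the same hive -e lines; the -S option runs the CLI in silent mode so that mostly just the rows are printed. The guard is only there so the snippet does nothing harmful where hive is not installed at this path.

```shell
#!/bin/bash
hive_bin="/opt/model/hive-3.1.2/bin/hive"
hql="use film_db; select * from rating_class; select * from rating_detail;"

if [ -x "${hive_bin}" ]; then
  # Run all three statements in one Hive session.
  "${hive_bin}" -S -e "${hql}"
else
  echo "hive not found at ${hive_bin}; would run: ${hql}"
fi
```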

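Result screenshots:

1. Screenshot of the script code

2. The three generated tables

Source table film_rating

There is too much data, so only the first 10 rows are shown

rating_class

rating_detail (this result is probably because I crawled the data twice on some day, which duplicated the records)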