如何在GCP上使用Google DataProc和云存储进行HiveDZone>大数据专区>如何在GCP上使用Googl

Google Cloud Dataproc是一个可管理的Spark和Hadoop服务，可以让你利用开源的数据工具进行批处理、查询、流媒体和机器学习。这包括Hadoop生态系统（HDFS、Map/Reduce处理框架，以及一些建立在Hadoop之上的应用程序，如Hive、Mahout、Pig、Spark和Hue）。Hive提供了一个类似于SQL的接口，以查询存储在各种数据库和文件系统中的数据，这些数据库和文件系统与Hadoop集成。通过HIVE提交的查询被转换为访问存储数据的Map/Reduce作业，然后将结果汇总并返回给用户或应用程序。

在这个练习中，我们将使用纽约市2019年积累的黄色和绿色出租车出行数据。黄色出租车是唯一获准在纽约市任何地方接载街头叫车乘客的车辆，而绿色出租车则在曼哈顿北部（E 96街和W 110街以上）和外围区提供街头叫车服务和预先安排的服务。该数据集可在城市门户网站上获得。

主要步骤如下。

步骤1：将TLC原始数据（2019年的绿色和黄色出租车数据）上传到云存储中

首先，创建一个合适的GCP云存储桶，并创建文件夹来存储2019年的绿色出租车、黄色出租车数据集，如下图所示。

注意数据文件的云存储位置。这些将用于在以下步骤中创建外部表格。

黄色出租车: gs://2019ah04024/TLC/data/YellowTaxi/
绿色出租车: gs://2019ah04024/TLC/data/GreenTaxi/

接下来，使用上传文件将城市门户的数据集上传到桶中。所有2019年的黄色出租车数据将被上传到 "YellowTaxi/"，绿色出租车将被上传到 "GreenTaxi/"文件夹。

第二步：使用谷歌DataProc创建Hadoop集群环境

使用DataProc在GCP创建一个Hadoop集群，并将通过CLI访问主节点。你可以用多个主节点和工作节点来创建集群，但在这个练习中，我已经创建了一个单一的节点，既是主节点，也是工作节点。

一旦集群被创建并准备好，进入 "VM INSTANCES "查看主节点和工作节点的细节。

选择主节点。点击SSH图标旁边的向下箭头，从下拉菜单中选择在浏览器窗口中打开。一个新的浏览器窗口将打开，中间会出现一个图标，提醒你正在建立SSH连接。一旦SSH连接建立，将出现shell提示。

使用JDBC HIVE接口运行Beeline shell。Hive在localhost上运行，端口为10000。
，在主节点主机名（本例中为cluster-tlc-m）使用Google Cloud用户名。命令行将是以下内容。

beeline -u jdbc:hive2://localhost:10000/default -n myusername@cluster-tlc-m -d org.apache.hive.jdbc.HiveDriver

步骤3：为EDA创建HIVE外部表（暂存环境）

外部表是一个Hive不管理存储的表。如果你删除一个外部表，只有Hive中的定义被删除。在我们的案例中，即使HIVE的表被删除，数据仍然保留在云存储中。

黄色和绿色出租车的数据字典包含了字段、描述以及要存储的相应数值。从数据字典文件中，我们可以注意到绿色出租车数据集有20列，而黄色出租车数据集有18列。因此，我们将创建以下两个不同的外部表（暂存环境），并通过在云存储中指定各自的 "位置 "来指向绿色、黄色出租车数据集。

stg_ny_yellow_taxi
stg_ny_green_taxi

/*Create External table for yellow taxi*/
CREATE EXTERNAL TABLE stg_ny_yellow_taxi (
              vendor_id int, 
              tpep_pickup_datetime string,
              tpep_dropoff_datetime string,
              passenger_count int,
              trip_distance double,
              rate_code_id smallint,
              store_and_fwd_flag string,
              pu_location_id int,
              do_location_id int,
              payment_type smallint,
              fare_amount double,
              extra_charge double,
              mta_tax double,
              tip_amount double,
              tolls_amount double,
              improvement_surcharge double,
              total_amount double,
              congestion_surcharge double
       )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'gs://2019ah04024/tlc/data/YellowTaxi/';

/*Create External table for green taxi*/
CREATE EXTERNAL TABLE stg_ny_green_taxi (
              vendor_id int, 
              lpep_pickup_datetime string,
              lpep_dropoff_datetime string,
              store_and_fwd_flag string,
              rate_code_id smallint,
              pu_location_id int,
              do_location_id int,
              passenger_count int,
              trip_distance double,
              fare_amount double,
              extra_charge double,
              mta_tax double,
              tip_amount double,
              tolls_amount double,
              ehail_fee double,
              improvement_surcharge double,
              total_amount double,
              payment_type smallint,
              trip_type smallint,
              congestion_surcharge double
       )
ROW FORMAT DELIMITED

第四步：数据输入和探索性数据分析（EDA）

我们将进行探索性数据分析，以获得对黄色和绿色出租车数据集的总体了解，提取基本的统计数据，检查数据的有效性，并突出任何不符合要求的值。

**EDA1:**从数据字典中，有两个供应商（1= Creative Mobile Technologies, LLC；2= VeriFone Inc）。我们将检查是否有供应商ID为空或供应商ID不在（1,2）的记录。

/*Green Taxi*/
SELECT sum(CASE when vendor_id = 1 THEN 1 ELSE 0 END)as Creative_Mobile_Tech,
       sum(CASE when vendor_id = 2 THEN 1 ELSE 0 END)as Verifone_Inc, 
       sum(CASE when vendor_id < 1 or vendor_id > 2  THEN 1 ELSE 0 END)as Others,
       sum(CASE when vendor_id IS NULL THEN 1 ELSE 0 END) as NULL_Values,
count(*) as Total_Vendors
from stg_ny_green_taxi;

绿色出租车的查询结果如下。

数据集中总共有6044062条记录。
Creative Mobile Tech的记录是894041，Verifone Inc是4735902。
有414119条记录的供应商ID为空或不在（1，2）。

/*Yellow Taxi*/
SELECT sum(CASE when vendor_id = 1 THEN 1 ELSE 0 END)as Creative_Mobile_Tech,
       sum(CASE when vendor_id = 2 THEN 1 ELSE 0 END)as Verifone_Inc, 
       sum(CASE when vendor_id < 1 or vendor_id > 2  THEN 1 ELSE 0 END)as Others,
       sum(CASE when vendor_id IS NULL THEN 1 ELSE 0 END) as NULL_Values,
count(*) as Total_Vendors
from stg_ny_yellow_taxi;

对黄色出租车的查询结果如下。

在数据集中总共有84399031条记录。
Creative Mobile Tech的记录是30368157，Verifone Inc是53517181。
有246613条记录的供应商ID为空或不在（1，2）。
有267080条记录不是1或2，也不是NULL。

**EDA2:**我们将进行与供应商行程相关的探索性数据分析。

Distinct Vendors count.
接送乘客的最小和最大时间范围。
一个行程中的最小和最大乘客数。
使用了Distinct的支付类型来计数。
使用了Distinct费率代码。
检查是否有记录暗示只有已经开始和完成的行程是在2019年。只有2019年1月至12月之间的数据将被考虑用于我们的分析。

SQL

/*Green Taxi*/
select count(*) as number_of_records, 
       count(distinct vendor_id) as number_of_tpep_vendors, 
       min(to_date(lpep_pickup_datetime)) as oldest_pickup_timestamp,
       max(to_date(lpep_pickup_datetime)) as recent_pickup_timestamp, 
       min(to_date(lpep_dropoff_datetime)) as oldest_dropoff_timestamp, 
       max(to_date(lpep_dropoff_datetime)) as recent_dropoff_timestamp,  
       min(passenger_count) as min_passengers_pertrip, 
       max(passenger_count) as max_passengers_pertrip, 
       avg(passenger_count) as average_passengers_pertrip, 
       min(trip_distance) as min_trip_distance,
       max(trip_distance) as max_trip_distance, 
       avg(trip_distance) as average_trip_distance, 
       count(distinct rate_code_id) as number_of_rate_codes,
       count(distinct store_and_fwd_flag) as types_of_store_forward_flag, 
       count(distinct pu_location_id) as num_of_pickup_zones,
       count(distinct do_location_id) as num_of_dropoff_zones, 
       count(distinct payment_type) as number_of_payment_types
from stg_ny_green_taxi;

绿色出租车的查询输出如下。

数据集中总共有6044062条记录。
有2个TPEP供应商。
取车时间戳和下车时间戳在2008年8月21日和2062年8月15日之间。这是一个不符合要求的问题。
乘客数量在0到9之间。
数据集中有7个不同的费率代码，而数据字典中限制为6个。这是一个不符合要求的问题。
有262个记录的上车地点和264个记录的下车地点。
数据集中有5个不同的支付类型。

SQL

/*Yellow Taxi*/
select count(*) as number_of_records, 
       count(distinct vendor_id) as number_of_tpep_vendors, 
       min(to_date(tpep_pickup_datetime)) as oldest_pickup_timestamp,
       max(to_date(tpep_pickup_datetime)) as recent_pickup_timestamp, 
       min(to_date(tpep_dropoff_datetime)) as oldest_dropoff_timestamp, 
       max(to_date(tpep_dropoff_datetime)) as recent_dropoff_timestamp,  
       min(passenger_count) as min_passengers_pertrip, 
       max(passenger_count) as max_passengers_pertrip, 
       avg(passenger_count) as average_passengers_pertrip, 
       min(trip_distance) as min_trip_distance,
       max(trip_distance) as max_trip_distance, 
       avg(trip_distance) as average_trip_distance, 
       count(distinct rate_code_id) as number_of_rate_codes,
       count(distinct store_and_fwd_flag) as types_of_store_forward_flag, 
       count(distinct pu_location_id) as num_of_pickup_zones,
       count(distinct do_location_id) as num_of_dropoff_zones, 
       count(distinct payment_type) as number_of_payment_types
from stg_ny_yellow_taxi;

对黄色出租车的查询结果如下。

在数据集中总共有84399031条记录。
有3个TPEP供应商。但是，给定的TPEP供应商是2个。这是一个不符合要求的情况
提取时间戳和删除时间戳的范围在2001年1月1日和2090年12月31日之间。这是一个不符合要求的情况。
乘客数量在0到9之间。
数据集中有7个不同的费率代码，而字典中限制为6个。这是一个不符合要求的问题。
有263个记录在案的上客地点和263个记录在案的下客地点。
数据集中有5个不同的支付类型。

**EDA3:**我们将对与票价详情相关的组件进行探索性的数据分析，如以下。

- 最小、最大和平均票价

- 最小、最大和平均额外费用

- 最低、最高和平均MTA费用

- MTA类型的独特计数

- 最小、最大和平均小费金额

- 最小、最大和平均收费

- 最小、最大和平均的改善附加费用

- 改善附加费类型的明确计数

- 最小、最大和平均总金额

SQL

/*Green Taxi*/
select min(fare_amount) as min_fare_charge, 
       max(fare_amount) as max_fare_charge, 
       avg(fare_amount) as average_fare_charge,
       min(extra_charge) as min_extra_charge, 
       max(extra_charge) as max_extra_charge, 
       avg(extra_charge) as average_extra_charge,
       count(distinct mta_tax) as types_of_mta_tax_charge, 
       min(mta_tax) as min_mta_tax_charge, 
       max(mta_tax) as max_mta_tax_charge, 
       avg(mta_tax) as average_mta_tax_charge,
       min(tip_amount) as min_tip_amount, 
       max(tip_amount) as max_tip_amount, 
       avg(tip_amount) as average_tip_amount,
       min(tolls_amount) as min_toll_charge, 
       max(tolls_amount) as max_toll_charge, 
       avg(tolls_amount) as average_toll_charge,
       count(distinct improvement_surcharge) as types_of_surcharge, 
       min(improvement_surcharge) as min_surcharge, 
       max(improvement_surcharge) as max_surcharge, 
       avg(improvement_surcharge) as average_surcharge,
       min(total_amount) as min_total_charge, 
       max(total_amount) as max_total_charge, 
       avg(total_amount) as average_total_charge
from stg_ny_green_taxi;

绿色出租车的输出结果如下。

SQL

/*Yellow Taxi*/
select min(fare_amount) as min_fare_charge, 
       max(fare_amount) as max_fare_charge, 
       avg(fare_amount) as average_fare_charge,
       min(extra_charge) as min_extra_charge, 
       max(extra_charge) as max_extra_charge, 
       avg(extra_charge) as average_extra_charge,
       count(distinct mta_tax) as types_of_mta_tax_charge, 
       min(mta_tax) as min_mta_tax_charge, 
       max(mta_tax) as max_mta_tax_charge, 
       avg(mta_tax) as average_mta_tax_charge,
       min(tip_amount) as min_tip_amount, 
       max(tip_amount) as max_tip_amount, 
       avg(tip_amount) as average_tip_amount,
       min(tolls_amount) as min_toll_charge, 
       max(tolls_amount) as max_toll_charge, 
       avg(tolls_amount) as average_toll_charge,
       count(distinct improvement_surcharge) as types_of_surcharge, 
       min(improvement_surcharge) as min_surcharge, 
       max(improvement_surcharge) as max_surcharge, 
       avg(improvement_surcharge) as average_surcharge,
       min(total_amount) as min_total_charge, 
       max(total_amount) as max_total_charge, 
       avg(total_amount) as average_total_charge
from stg_ny_yellow_taxi;

黄色出租车的输出结果如下所示。

以类似的方式，可以执行EDA来验证与乘客人数、费率代码、支付类型相关的组件，也可以验证是否有任何记录的取车时间戳在下车时间戳之后。通过分析所有这些情况，我们可以看到不确认的记录，其中。

供应商ID为空或不在（1,2）内
取货或送货日期时间！=2019年
提车日期时间在下车时间戳之后
乘客人数不在（1,2,3,4,5,6,7,8,9）内
行程距离 <= 0.0
费率代码ID不在(1,2,3,4,5,6)内
支付类型不在(1,2,3,4,5,6)内
票价金额 <= 0
额外费用不在(0,0.5,1)内
MTA税不在(0,0.5)内
小费金额小于0
支付类型=2且小费金额不为0
过路费金额小于0
改善附加费不在(0,0.3)内
总金额<=0

所以，我们现在有了绿色出租车记录的总不确认记录。

SQL

/*Green Taxi*/
select vendor_id as Vendor_Id_1CreativeMob_2VeriFone, 
       count(*) as NonConf_Records
from stg_ny_green_taxi
where year(lpep_pickup_datetime) =2019 and year(lpep_dropoff_datetime) =2019 and
       unix_timestamp(lpep_pickup_datetime) < unix_timestamp(lpep_dropoff_datetime) and
       passenger_count in (1,2,3,4,5,6,7,8,9) and
       trip_distance > 0.0 and rate_code_id in (1,2,3,4,5,6) and
       payment_type in (1,2,3,4,5,6) and fare_amount > 0 and
       extra_charge in (0,0.5,1) and mta_tax in (0,0.5) and 
       tip_amount > 0.0 and tolls_amount >= 0.0 and 
       improvement_surcharge in (0,0.3) and 
       total_amount > 0 and vendor_id is not null
group by vendor_id
order by vendor_id;

绿色出租车有近692856条非确认记录，占绿色出租车Y2019年总数据集的11.4% ，黄色出租车记录的总非确认记录。

SQL

/*Yellow Taxi*/
select vendor_id as Vendor_Id_1CreativeMob_2VeriFone, 
       count(*) as NonConf_Records
from stg_ny_yellow_taxi
where year(tpep_pickup_datetime) =2019 and year(tpep_dropoff_datetime) =2019 and
       unix_timestamp(tpep_pickup_datetime) < unix_timestamp(tpep_dropoff_datetime) and
       passenger_count in (1,2,3,4,5,6,7,8,9) and
       trip_distance > 0.0 and rate_code_id in (1,2,3,4,5,6) and
       payment_type in (1,2,3,4,5,6) and fare_amount > 0 and
       extra_charge in (0,0.5,1) and mta_tax in (0,0.5) and 
       tip_amount > 0.0 and tolls_amount >= 0.0 and 
       improvement_surcharge in (0,0.3) and 
       total_amount > 0 and vendor_id is not null
group by vendor_id
order by vendor_id;

有近27019826条非确认记录的黄色出租车，占黄色出租车Y2019年总数据集的32%。

在这一部分，我们已经看到了如何创建谷歌数据proc集群，通过指向位于谷歌云存储的数据在HIVE中创建外部表，执行探索性数据分析并找到非确认记录。在下一部分中，我们将看到剩余的步骤，即用有效的记录创建主表，进行详细的深度驱动分析。