A Look at ON and WHERE in Hive Joins


WeChat official account: Java小咖秀; website: javaxks.com

Author: 一寸HUI, link: www.cnblogs.com/zsql/

Many people who learned MySQL's execution order first (from -> on -> join -> where -> group by -> having -> select -> distinct -> order by -> limit) may carry some misconceptions about ON and WHERE in Hive. There are blog posts about this topic, but I wanted to verify it myself. This article compares ON and WHERE under inner join, left/right join and full join, with and without partition filters. Understanding this really helps when writing HQL: queries can become more concise, or more efficient. Let's get started.

Version: this article was tested on CDH 6.3.2, with Hive 2.1.1+cdh6.3.2.

1. Test tables and data

1.1 Creating the tables

create  table `user`(  -- user table, partitioned
department_id int,
age int,
sex string,
name string
)
PARTITIONED BY (`date` string)
row format delimited
fields terminated by ','
STORED AS TEXTFILE;


create table department(  -- department table
id int,
name string,
count int
)
row format delimited
fields terminated by ','
STORED AS TEXTFILE;

1.2 Data

-- /data/hive/user1.txt
1,34,male,zhangsan
1,31,female,lili
3,14,female,liushen
3,24,female,sasa
4,54,male,liubei
4,36,female,yuji
4,25,male,zhaoyun
8,18,male,zhangfei

-- /data/hive/user2.txt
3,37,male,wangwu
4,38,female,lisi
3,19,female,caocao
2,22,female,guanyu
1,51,male,wzj
6,31,female,zhenji
6,25,male,sunwukong
6,17,male,tangsz

-- /data/hive/department.txt
1,dashuju,8
2,kaifa,9
3,ui,10
4,hr,3
5,shouxiao,12
6,zongjian,3

1.3 Loading the data

load data local inpath '/data/hive/user1.txt'  into table `user` partition (`date`='2020-12-24');
load data local inpath '/data/hive/user2.txt'  into table `user` partition (`date`='2020-12-25');
load data local inpath '/data/hive/department.txt'  into table `department`;

1.4 Querying the data

SELECT * from `user`;

[screenshot: contents of the user table]

SELECT * from department;

[screenshot: contents of the department table]

1.5 Analyzing the tables

ANALYZE TABLE `user` partition(`date`='2020-12-24') COMPUTE STATISTICS;
ANALYZE TABLE `user` partition(`date`='2020-12-25') COMPUTE STATISTICS;
ANALYZE TABLE department COMPUTE STATISTICS;

Without the statements above, the experiments below run into problems: the row counts in the plans are not realistic and the effects cannot be seen. So analyze the tables before running EXPLAIN; the statistics will then be much more accurate (I just hit this pitfall: every EXPLAIN showed only 1 row because the statistics were off).
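
If you want to double-check that the statistics were collected (a hedged addition of mine, not part of the original article), DESCRIBE FORMATTED should show numRows and totalSize among the table/partition parameters:

-- Hedged sketch: verify the collected statistics.
DESCRIBE FORMATTED `user` PARTITION (`date`='2020-12-24');
DESCRIBE FORMATTED department;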

2. ON and WHERE under INNER JOIN

2.1 Without partition filtering

1. First, look at the result of an inner join with no extra conditions:

SELECT  * from `user` u  inner join  department d  on d.id=u.department_id;

[screenshot: query result]

Check the execution plan:

explain SELECT  * from `user` u  inner join  department d  on d.id=u.department_id;
STAGE DEPENDENCIES:
  Stage-4 is a root stage  
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            filterExpr: id is not null (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: department_id is not null (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: department_id is not null (type: boolean)
              Statistics: Num rows: 16 Data size: 2944 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The statement above applies no filter conditions and uses no partitions:

  • First, table d (department) is scanned in full, reading 6 rows, and a default filter id is not null is applied to those rows (an automatic optimization; if rows whose join key is NULL must be kept, they need special handling beforehand, otherwise they are optimized away and data is lost, see the sketch after this list). After filtering, 6 rows take part in the inner join.
  • Then table u (user) is scanned in full, reading 16 rows, and the same default filter department_id is not null is applied; 16 rows take part in the inner join.
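
A minimal sketch (my own addition, not from the original article) of one way to keep rows whose join key is NULL: since an inner join can never match a NULL key anyway, split those rows out and UNION them back:

-- Hedged sketch: preserve user rows with a NULL department_id, which the
-- inner join would otherwise drop.
SELECT u.*, d.name AS dept_name, d.count AS dept_count
FROM `user` u
INNER JOIN department d ON d.id = u.department_id
UNION ALL
SELECT u.*, CAST(NULL AS string) AS dept_name, CAST(NULL AS int) AS dept_count
FROM `user` u
WHERE u.department_id IS NULL;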

2. Next, look at the result with both WHERE and ON conditions:

SELECT  * from `user` u inner join  department d on d.id=u.department_id
and d.count > 9
and u.age > 20
where u.age < 30;

[screenshot: query result]

Now look at the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            filterExpr: ((count > 9) and id is not null) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((count > 9) and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
              Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The result is shown above:

  • First, table d (department) is scanned in full, reading 6 rows, and filtered with ((count > 9) and id is not null); 2 rows remain for the inner join.
  • Then table u (user) is also scanned in full, reading 16 rows, and filtered with ((age > 20) and department_id is not null and (age < 30)); 1 row remains for the inner join (this estimate is off: it should really be 4 rows. A Hive execution plan is only a rough statistical estimate and is not always exact).

**Mini-summary:** without partition filtering, an inner join scans both tables in full, automatically adds an is not null filter on the join keys (on d.id=u.department_id), and then applies the ON and WHERE conditions. In an inner join, WHERE and ON filter at the same time.

2.2 With partition filtering

1. First, a case with only a partition filter:

SELECT  * from `user` u inner join  department d
on d.id=u.department_id
where u.`date`='2020-12-25';

[screenshot: query result]

Check the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            filterExpr: id is not null (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: (department_id is not null and (date = '2020-12-25')) (type: boolean)
            Statistics: Num rows: 8 Data size: 134 Basic stats: COMPLETE Column stats: NONE # only 8 rows here, because a single partition is scanned
            Filter Operator
              predicate: department_id is not null (type: boolean)
              Statistics: Num rows: 8 Data size: 134 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-25' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The result is shown above. Compared with the case without a partition filter, the difference is that table u (user) is no longer scanned in full, which improves efficiency: a partition is stored as a directory, so once the partition filter is applied only the specified partition is scanned rather than the whole table. In other words, a partitioned table is first pruned to the requested partitions, those partitions are scanned in full, and then the ON and WHERE conditions plus the automatically added is not null filter are applied; only the rows that survive go into the inner join.

2. Now a case with both filter conditions and a partition filter:

SELECT  * from `user` u inner join  department d
on d.id=u.department_id
and d.count > 9
and u.age > 20
where u.age < 30  and u.`date`='2020-12-24';

[screenshot: query result]

Execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            filterExpr: ((count > 9) and id is not null) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((count > 9) and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
            Statistics: Num rows: 8 Data size: 136 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
              Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-24' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The result is shown above. Comparing it with the earlier case (ON and WHERE filters but no partition filter), the only difference a partition filter makes is whether the scan covers the whole table or just the specified partitions: with no partition filter the whole table is scanned, otherwise only the specified partitions are.

2.3 Summary of ON and WHERE under inner join

Under an inner join, if the WHERE clause contains a partition filter, only the specified partitions are scanned; the WHERE and ON conditions are then applied together with the automatically added is not null filter on the join keys (on d.id=u.department_id). Note that if rows whose join key is NULL must be kept, they need special handling, otherwise they are filtered out and data is lost. Here ON and WHERE filter at the same time; there is no ordering between them.
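
A small illustration (my own sketch, consistent with the conclusion above, not a query from the original article): for an inner join the same non-key predicate can live in ON or in WHERE and returns the same rows, so pick whichever reads better.

-- Hedged sketch: for an inner join these two forms should produce the same result.
SELECT * from `user` u inner join department d
on d.id=u.department_id and d.count > 9;

SELECT * from `user` u inner join department d
on d.id=u.department_id
where d.count > 9;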

3. ON and WHERE under LEFT/RIGHT JOIN

Since left join and right join are the same kind of join, this article only experiments with left join.

3.1 Filtering the secondary (non-preserved) table in ON vs. WHERE

First, a statement that puts the filter condition on the secondary table in ON (convention: u (main table) left join d (secondary table)):

SELECT * from `user` u left join department d on d.id=u.department_id and d.count > 9

Results:

[screenshot: query result]

And the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            filterExpr: (count > 9) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (count > 9) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The result is shown above. With u (main table) left join d (secondary table), the ON condition on the secondary table, d.count > 9, takes effect before the join: all 6 rows are scanned, the filter leaves 2 rows, and those go into the left join; the main table is scanned in full and joined. Note that under a left join neither join key (on d.id=u.department_id) gets an is not null filter added, so watch out for NULL join keys when doing a left join; that case can be optimized.

Now a statement that puts the filter condition on the secondary table in WHERE (convention: u (main table) left join d (secondary table)):

SELECT  * from `user` u
left join  department d
on d.id=u.department_id
where d.count > 9

The result is as follows (different from putting the condition on the secondary table in ON):

[screenshot: query result]

Execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Filter Operator
                predicate: (_col10 > 9) (type: boolean)
                Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

From the plan above, tables u (user) and d (department) are left-joined with full scans and no filtering at all; only after the left join produces its result is the WHERE condition on the secondary table, d.count > 9, applied to that result.

**Mini-summary (left join):** filtering the secondary table in ON versus WHERE gives different results. A condition in ON filters that table's rows first and then performs the left join; a condition in WHERE performs the left join first and then filters the joined result. Also, a left join does not add a default is not null filter on the join keys.
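
A hedged sketch of my own (not from the original article): when the secondary table should be filtered before the join, put the condition in ON, or make the intent explicit with a subquery; the two forms below should be equivalent.

-- Condition in ON: department is filtered before the left join.
SELECT * from `user` u
left join department d
on d.id=u.department_id and d.count > 9;

-- Pre-filtering subquery: same effect, intent is explicit.
SELECT * from `user` u
left join (SELECT * from department where count > 9) d
on d.id=u.department_id;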

3.2 Filtering the main (preserved) table in ON vs. WHERE

First, a statement that puts the filter condition on the main table in ON (convention: u (main table) left join d (secondary table)):

SELECT  * from `user` u
left join  department d
on d.id=u.department_id
and u.age > 20

[screenshot: query result]

Looking at the result above, all 16 rows (the entire user table) are still kept, but the rows with age <= 20 look as if they did not take part in the left join: all the columns coming from department are NULL.

Execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              filter predicates:
                0 {(age > 20)}
                1 
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              filter predicates:
                0 {(age > 20)}
                1 
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The result is shown above. When processing d (the department table), all 6 rows are scanned and the main-table condition age > 20 is recorded as a filter predicate. Then u (user) is scanned in full and the whole table goes through the left join; during the join, the condition age > 20 is used to decide whether the columns coming from d are kept: if a u row has age <= 20, all the columns sourced from d are set to NULL. (This is my personal understanding and may not be exact. In short: a mark is made first, the left join runs over the full table, and during the join the rows that fail the condition keep the main-table columns but have the secondary-table columns discarded and set to NULL.) So a condition on the main table placed in ON does not affect the number of main-table rows; the left join still runs first with this filtering applied inside it. It is a bit convoluted; compare with the result above to see it.

Now a statement that puts the filter condition on the main table in WHERE (convention: u (main table) left join d (secondary table)):

SELECT  * from `user` u
left join  department d
on d.id=u.department_id
where u.age > 20

[screenshot: query result]

Next, the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: (age > 20) (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: (age > 20) (type: boolean)
              Statistics: Num rows: 5 Data size: 920 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The result is shown above. It is clear that when the main table is filtered in WHERE, the main table's rows are filtered first and then left-joined: 16 rows are scanned, 5 remain after filtering, and the left join then produces the final result.

Mini-summary (left join): filtering the main table in ON versus WHERE also gives different results. A WHERE condition on the main table filters first and then left-joins. An ON condition on the main table does not drop any main-table rows: it is recorded as a filter predicate, and during the full-table left join the rows that fail it have their secondary-table columns set to NULL.
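
A minimal illustration of my own (restating the two queries above side by side, not a new claim): the two placements below are not equivalent when the condition is on the main table.

-- WHERE on the main table: rows with age <= 20 are removed before the join,
-- so only 5 user rows appear in the result.
SELECT * from `user` u
left join department d
on d.id=u.department_id
where u.age > 20;

-- ON on the main table: all 16 user rows are kept; for the rows with
-- age <= 20 the department columns come back as NULL.
SELECT * from `user` u
left join department d
on d.id=u.department_id and u.age > 20;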

3.3 Partition filtering with left/right join

Consider the following statement:

SELECT  * from `user` u
left join  department d
on d.id=u.department_id
where u.age > 20 and u.`date` = '2020-12-24';

Result:

[screenshot: query result]

Execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and (date = '2020-12-24')) (type: boolean)
            Statistics: Num rows: 8 Data size: 136 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (age > 20) (type: boolean)
              Statistics: Num rows: 2 Data size: 34 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-24' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

From the execution plan above, partition filtering prevents a full-table scan: only one partition is scanned here, so just 8 rows are read.

3.4 Summary of ON and WHERE under left/right join

  1. A partition filter prevents a full-table scan and is applied before everything else.
  2. Filtering the main table in ON versus WHERE gives different results: a WHERE condition filters the main table first and then left-joins; an ON condition is recorded as a filter predicate, and while the full-table left join runs, rows that fail it have their secondary-table columns set to NULL.
  3. Filtering the secondary table in ON versus WHERE also gives different results: a condition in ON filters that table first and then left-joins; a condition in WHERE left-joins first and then filters the joined result.
  4. A left/right join does not automatically add an is not null filter on the join keys, so watch out for NULL join keys in left/right joins; this is an optimization opportunity (see the sketch after this list).
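
A hedged sketch (my own, not from the original article) of a common way to deal with NULL join keys in a left join: a NULL key can never match anything, and if there are many NULL keys they all hash to the same reducer, so they are usually spread out with a substitute key that cannot collide with a real id:

-- Hedged sketch: spread NULL department_id values so they do not pile up on
-- one key. The negative random substitutes match no department.id, so the
-- join result is unchanged; the range 100000 is an arbitrary example.
SELECT *
from `user` u
left join department d
on d.id = CASE WHEN u.department_id IS NULL
               THEN CAST(-(rand() * 100000) AS int)
               ELSE u.department_id
          END;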

4. ON and WHERE under FULL JOIN

4.1 FULL JOIN with no filter conditions

Start with a full join that has no conditions at all:

SELECT  * from `user` u
full join  department d
on d.id=u.department_id

[screenshot: query result]

Check the execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

As the plan shows, each table is sorted in ascending order on its join key, no automatic optimization (NULL filtering) is added, and a full-table full join is executed.

4.2 FULL JOIN with a WHERE condition

 SELECT  * from `user` u
full join  department d
on d.id=u.department_id
where u.age > 20 and  d.count > 9

The result is as follows:

[screenshot: query result]

Check the execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: ((_col10 > 9) and (_col1 > 20)) (type: boolean)
            Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The plan shows that when WHERE filtering is used with a full join, both tables are first scanned in full, the full join runs, and only after the full join produces its result are the WHERE conditions applied.

4.3 FULL JOIN with an ON condition (open question)

SELECT  * from `user` u
full join  department d
on d.id=u.department_id
and u.age > 20 and  d.count > 9

[screenshot: query result]

The result above may be a bit surprising (my own ability is limited; readers who know better are welcome to explain). My reading is that it works like an ON condition on the main table of a left join: the filtering happens during the full join itself, yet all rows of both tables are kept, and only when both conditions hold does a row have no NULL values. (If ON conditions under a full join are confusing, prefer putting the conditions in WHERE.)

Check the execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          filter predicates:
            0 {(VALUE._col0 > 20)}
            1 {(VALUE._col1 > 9)}
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

This is left as an open question; readers who understand it are welcome to leave a comment below.

4.4 FULL JOIN with partition filtering

SELECT  * from department d full join `user` u 
on d.id=u.department_id 
where u.`date`= '2020-12-24';

[screenshot: query result]

Check the execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 id (type: int)
            1 department_id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (_col10 = '2020-12-24') (type: boolean)
            Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: int), _col1 (type: string), _col2 (type: int), _col6 (type: int), _col7 (type: int), _col8 (type: string), _col9 (type: string), '2020-12-24' (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The execution plan shows that in a full join, even with a partition filter, the full join result is produced first and only then filtered by the WHERE conditions. So it is recommended to filter with a subquery first and then do the full join.
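
A hedged sketch of my own following that recommendation (not a query from the original article): pushing the partition filter into a subquery limits the scan to one partition before the full join.

-- Hedged sketch: filter the partition before the full join rather than after it.
SELECT * from department d
full join (SELECT * from `user` where `date` = '2020-12-24') u
on d.id=u.department_id;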

4.5 Summary of ON and WHERE in FULL JOIN

  1. Conditions in ON are still an open question here (see 4.3).
  2. Conditions in WHERE, whether or not they include a partition filter, are applied only after the full join, so it is better to filter through a subquery first and then do the full join.
  3. A full join does not automatically add an is not null condition on the join keys.

5. Overall summary

1. inner join

  • An inner join can be restricted by a partition filter first, preventing a full-table scan.
  • An inner join automatically adds an is not null filter on the join keys (on d.id=u.department_id).
  • In an inner join the ON and WHERE conditions filter at the same time; there is no ordering difference between them.

2. left/right join

  • With left/right join, a partition filter prevents a full-table scan and is applied before everything else.
  • Filtering the main table in ON versus WHERE gives different results: a WHERE condition filters the main table first and then left-joins; an ON condition is recorded as a filter predicate, and during the full-table left join the rows that fail it have their secondary-table columns set to NULL.
  • Filtering the secondary table in ON versus WHERE also gives different results: a condition in ON filters that table first and then left-joins; a condition in WHERE left-joins first and then filters the joined result. So to filter the secondary table you can put the condition in ON and avoid writing a subquery.
  • A left/right join does not automatically add an is not null filter on the join keys, so watch out for NULL join keys; this is an optimization opportunity.

3. full join

  • Filtering with ON conditions under a full join is still an open question to explore.
  • Filtering with WHERE under a full join, partition filter or not, happens only after the full join, so it is recommended to filter with a subquery first and then do the full join.
  • No is not null filter is added automatically on the join keys.