为什么我的left join结果和预期不一致?

34 阅读10分钟

为什么我的left join结果和预期不一致?(Spark/Hive)

--建表
create table zzzt1(id string, name string) partitioned by (dt string);
create table zzzt2(id string, class string) partitioned by (dt string);

-- 插入数据
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT overwrite TABLE zzzt1 partition(dt) values
('1','A','2022-08-10'), 
('2','B','2022-08-10'),
('1','A1','2022-08-11'),
('2','B1','2022-08-11')

INSERT overwrite TABLE zzzt2 partition(dt) values
('1','101', '2022-08-10'),
('2','101', '2022-08-10'),
('1','102','2022-08-11'),
('2','102','2022-08-11')

1. left join的几种条件

先来看以下几个示例。

  • 示例1
SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id
WHERE t1.id = '1'
+------+------+-------+--------+-------------+-------------+
| id1  | id2  | name  | class  |     dt1     |     dt2     |
+------+------+-------+--------+-------------+-------------+
| 1    | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
+------+------+-------+--------+-------------+-------------+

执行结果同预期相同。

  • 示例2
SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id AND t1.id = '1'
+------+-------+-------+--------+-------------+-------------+
| id1  |  id2  | name  | class  |     dt1     |     dt2     |
+------+-------+-------+--------+-------------+-------------+
| 1    | 1     | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1     | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1     | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1     | A1    | 102    | 2022-08-11  | 2022-08-11  |
| 2    | NULL  | B     | NULL   | 2022-08-10  | NULL        |
| 2    | NULL  | B1    | NULL   | 2022-08-11  | NULL        |
+------+-------+-------+--------+-------------+-------------+

将左表的t1.id = '1'条件从where移到on中后,为什么该条件没有生效?

  • 示例3
SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id AND t2.id = '1'
+------+-------+-------+--------+-------------+-------------+
| id1  |  id2  | name  | class  |     dt1     |     dt2     |
+------+-------+-------+--------+-------------+-------------+
| 1    | 1     | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1     | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1     | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1     | A1    | 102    | 2022-08-11  | 2022-08-11  |
| 2    | NULL  | B     | NULL   | 2022-08-10  | NULL        |
| 2    | NULL  | B1    | NULL   | 2022-08-11  | NULL        |
+------+-------+-------+--------+-------------+-------------+

右表t2.id = '1'条件写在on中生效。

  • 示例4
SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id
WHERE t2.id = '1'
+------+------+-------+--------+-------------+-------------+
| id1  | id2  | name  | class  |     dt1     |     dt2     |
+------+------+-------+--------+-------------+-------------+
| 1    | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
+------+------+-------+--------+-------------+-------------+

右表t2.id = '1'条件写在where中,结果同inner join。

上述4个示意中,存在两个疑问:

  • 为什么on条件中左表的条件不生效(t1.id = '1')而右表的条件生效(t2.id = '1')?
  • 为什么右表中的条件是写在where中,结果同inner join?

分别查看上述三个SQL的执行计划。

示例1执行计划如下,从执行计划中可知写在where中的左表t1.id = '1'条件分别加入到t1表和t2表的fliter和谓词下推中,EqualTo(id,1)、(id#6 = 1)、(id#9 = 1)。

== Physical Plan ==
*CollectLimit 1000
+- *(3) Project [id#6 AS id1#2, id#9 AS id2#3, name#7, class#10, dt#8 AS dt1#4, dt#11 AS dt2#5]
   +- SortMergeJoin [id#6], [id#9], LeftOuter
      :- *(1) Sort [id#6 ASC NULLS FIRST], false, 0
      :  +- *Exchange hashpartitioning(id#6, 900)
      :     +- *(1) Project [id#6, name#7, dt#8]
      :        +- *(1) Filter (isnotnull(id#6) && (id#6 = 1))
      :           +- *(1) FileScan orc zzzt1[id#6,name#7,dt#8] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt01], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct<id:string,name:string>
      +- *(2) Sort [id#9 ASC NULLS FIRST], false, 0
         +- *Exchange hashpartitioning(id#9, 900)
            +- *(1) Project [id#9, class#10, dt#11]
               +- *(1) Filter ((id#9 = 1) && isnotnull(id#9))
                  +- *(1) FileScan orc zzzt2[id#9,class#10,dt#11] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt2], PartitionFilters: [], PushedFilters: [EqualTo(id,1), IsNotNull(id)], ReadSchema: struct<id:string,class:string>

示例2执行计划如下,从执行计划可知,写在on中的左表t1.id = '1'条件,并没有作用在左表上,而是作用在了右表,EqualTo(id,1)、(id#9 = 1)。

== Physical Plan ==
*CollectLimit 1000
+- *(3) Project [id#6 AS id1#2, id#9 AS id2#3, name#7, class#10, dt#8 AS dt1#4, dt#11 AS dt2#5]
   +- SortMergeJoin [id#6], [id#9], LeftOuter, (id#6 = 1)
      :- *(1) Sort [id#6 ASC NULLS FIRST], false, 0
      :  +- *Exchange hashpartitioning(id#6, 900)
      :     +- *(1) FileScan orc zzzt1[id#6,name#7,dt#8] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt01], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string>
      +- *(2) Sort [id#9 ASC NULLS FIRST], false, 0
         +- *Exchange hashpartitioning(id#9, 900)
            +- *(1) Project [id#9, class#10, dt#11]
               +- *(1) Filter ((id#9 = 1) && isnotnull(id#9))
                  +- *(1) FileScan orc zzzt2[id#9,class#10,dt#11] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt2], PartitionFilters: [], PushedFilters: [EqualTo(id,1), IsNotNull(id)], ReadSchema: struct<id:string,class:string>

示例3执行计划如下,从执行计划可知,写在where中的右表t2.id = '1'条件作用在了右表上,EqualTo(id,1)、(id#9 = 1)。

== Physical Plan ==
*CollectLimit 1000
+- *(3) Project [id#6 AS id1#2, id#9 AS id2#3, name#7, class#10, dt#8 AS dt1#4, dt#11 AS dt2#5]
   +- SortMergeJoin [id#6], [id#9], LeftOuter
      :- *(1) Sort [id#6 ASC NULLS FIRST], false, 0
      :  +- *Exchange hashpartitioning(id#6, 900)
      :     +- *(1) FileScan orc zzzt1[id#6,name#7,dt#8] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt01], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string>
      +- *(2) Sort [id#9 ASC NULLS FIRST], false, 0
         +- *Exchange hashpartitioning(id#9, 900)
            +- *(1) Project [id#9, class#10, dt#11]
               +- *(1) Filter (isnotnull(id#9) && (id#9 = 1))
                  +- *(1) FileScan orc zzzt2[id#9,class#10,dt#11] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt2], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct<id:string,class:string>

示例4执行计划如下,从执行计划可知,写在where中的右表t2.id = '1'条件,同时作用在了左表和右表,EqualTo(id,1)、(id#6 = 1)、(id#9 = 1)

== Physical Plan ==
*CollectLimit 1000
+- *(3) Project [id#6 AS id1#2, id#9 AS id2#3, name#7, class#10, dt#8 AS dt1#4, dt#11 AS dt2#5]
   +- *(3) SortMergeJoin [id#6], [id#9], Inner
      :- *(1) Sort [id#6 ASC NULLS FIRST], false, 0
      :  +- *Exchange hashpartitioning(id#6, 900)
      :     +- *(1) Project [id#6, name#7, dt#8]
      :        +- *(1) Filter ((id#6 = 1) && isnotnull(id#6))
      :           +- *(1) FileScan orc zzzt1[id#6,name#7,dt#8] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt01], PartitionFilters: [], PushedFilters: [EqualTo(id,1), IsNotNull(id)], ReadSchema: struct<id:string,name:string>
      +- *(2) Sort [id#9 ASC NULLS FIRST], false, 0
         +- *Exchange hashpartitioning(id#9, 900)
            +- *(1) Project [id#9, class#10, dt#11]
               +- *(1) Filter (isnotnull(id#9) && (id#9 = 1))
                  +- *(1) FileScan orc zzzt2[id#9,class#10,dt#11] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt2], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct<id:string,class:string>

从以上4个执行计划中还可以发现,id字段做作为条件是,会自动加上 is not null 限制。

1.1. right join

接下来看下右链接中,左右表条件分别写在where和on的结果。

SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
right JOIN zzzt2 AS t2
ON t1.id = t2.id
WHERE t1.id = '1'
-- 左表条件写在where中,同上述示例4
+------+------+-------+--------+-------------+-------------+
| id1  | id2  | name  | class  |     dt1     |     dt2     |
+------+------+-------+--------+-------------+-------------+
| 1    | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
+------+------+-------+--------+-------------+-------------+

SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
right JOIN zzzt2 AS t2
ON t1.id = t2.id AND t1.id = '1'
-- 左表条件写在on中,同上述示例3
+-------+------+-------+--------+-------------+-------------+
|  id1  | id2  | name  | class  |     dt1     |     dt2     |
+-------+------+-------+--------+-------------+-------------+
| 1     | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1     | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1     | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1     | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
| NULL  | 2    | NULL  | 102    | NULL        | 2022-08-11  |
| NULL  | 2    | NULL  | 101    | NULL        | 2022-08-10  |
+-------+------+-------+--------+-------------+-------------+

SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
right JOIN zzzt2 AS t2
ON t1.id = t2.id AND t2.id = '1'
-- 右表条件写在on中,同上述示例2
+-------+------+-------+--------+-------------+-------------+
|  id1  | id2  | name  | class  |     dt1     |     dt2     |
+-------+------+-------+--------+-------------+-------------+
| 1     | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1     | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1     | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1     | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
| NULL  | 2    | NULL  | 101    | NULL        | 2022-08-10  |
| NULL  | 2    | NULL  | 102    | NULL        | 2022-08-11  |
+-------+------+-------+--------+-------------+-------------+

SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
right JOIN zzzt2 AS t2
ON t1.id = t2.id
WHERE t2.id = '1'
-- 右表条件写在where中,同上述示例1
+------+------+-------+--------+-------------+-------------+
| id1  | id2  | name  | class  |     dt1     |     dt2     |
+------+------+-------+--------+-------------+-------------+
| 1    | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
+------+------+-------+--------+-------------+-------------+

结论 在使用left join时最好将右表的条件写在on中。rigth join同理,将左表的条件写在on条件中。

2. 多个join的执行顺序

验证多个left join时得执行顺序

create table zzzt3(id string, des string);
INSERT overwrite TABLE zzzt3 values
('101','班级1'),
('102','班级2')
SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
       ,t3.des
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id and t2.id = '1'
left join zzzt3 t3
on t2.class = t3.id
-- 执行结果为
+------+-------+-------+--------+-------------+-------------+---------+
| id1  |  id2  | name  | class  |     dt1     |     dt2     | t3.des  |
+------+-------+-------+--------+-------------+-------------+---------+
| 1    | 1     | A     | 101    | 2022-08-10  | 2022-08-10  | 班级1     |
| 1    | 1     | A     | 102    | 2022-08-10  | 2022-08-11  | 班级2     |
| 1    | 1     | A1    | 101    | 2022-08-11  | 2022-08-10  | 班级1     |
| 1    | 1     | A1    | 102    | 2022-08-11  | 2022-08-11  | 班级2     |
| 2    | NULL  | B     | NULL   | 2022-08-10  | NULL        | NULL    |
| 2    | NULL  | B1    | NULL   | 2022-08-11  | NULL        | NULL    |
+------+-------+-------+--------+-------------+-------------+---------+

SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
       ,t3.des
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id and t2.id = '1'
join zzzt3 t3
on t2.class = t3.id
+------+------+-------+--------+-------------+-------------+---------+
| id1  | id2  | name  | class  |     dt1     |     dt2     | t3.des  |
+------+------+-------+--------+-------------+-------------+---------+
| 1    | 1    | A     | 101    | 2022-08-10  | 2022-08-10  | 班级1     |
| 1    | 1    | A     | 102    | 2022-08-10  | 2022-08-11  | 班级2     |
| 1    | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  | 班级2     |
| 1    | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  | 班级1     |
+------+------+-------+--------+-------------+-------------+---------+

从结果得知,多表进行join时,执行顺序按照join的前后顺序,先执行前两个表join得到一个虚拟的表,再和第三个表进行join,得到虚拟结果后,继续和第四个表进进行join……