SparkSQL: Cannot have map type columns in DataFrame which calls set operations (intersect, except, etc.)



Running a union in SparkSQL over two Hive tables that both contain a map-type column raises the following error:

org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map is map<string,string>;

1. Reproducing the problem

1) Build a map-type column with str_to_map/map, then union the two queries:

select 1 id, str_to_map("k1:v1,k2:v2") map
union
select 2 id, map("k1","v1","k2","v2") map

2) The error:

org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map is map<string,string>;;
Distinct
+- Union
   :- Project [1 AS id#116, str_to_map(k1:v1,k2:v2, ,, :) AS map#117]
   :  +- OneRowRelation
   +- Project [2 AS id#118, map(k1, v1, k2, v2) AS map#119]
      +- OneRowRelation

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:364)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)

2. Analysis

Following the error message, look at the checkAnalysis method of org.apache.spark.sql.catalyst.analysis.CheckAnalysis; the handling around line 362 of the source produces exactly this message (look familiar?).


The key is how mapColumnInSetOperation pattern-matches the logical plan:

Whenever the output attributes (Attribute) of an Intersect, Except, or Distinct node in the logical plan include a map type, or the keys of a Deduplicate node (also Attributes) include a map-typed column, this error is raised.
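The rule described above can be modeled roughly as follows. This is a simplified Python sketch, not the actual Scala source in CheckAnalysis: the node kinds mirror the real plan operators, but the function names and the string-encoded data types here are purely illustrative.

```python
# Simplified model of Spark's analyzer check (illustrative, not the real source):
# set-operation nodes whose attributes include a map type are rejected.

def contains_map(data_type):
    """True if the (string-encoded) type embeds a map type, e.g. 'map<string,string>'."""
    return "map<" in data_type

def map_column_in_set_operation(node_kind, attributes):
    """Return the name of the first map-typed attribute, or None if the node is fine.

    node_kind  -- logical plan node, e.g. "Intersect", "Except", "Distinct", "Deduplicate"
    attributes -- mapping of attribute name -> type string (the node's output, or
                  the keys in the Deduplicate case)
    """
    if node_kind in ("Intersect", "Except", "Distinct", "Deduplicate"):
        for name, dtype in attributes.items():
            if contains_map(dtype):
                return name  # the analyzer raises AnalysisException for this column
    return None

# The failing plan from section 1: Distinct over Union, with a map<string,string> column.
offender = map_column_in_set_operation(
    "Distinct", {"id": "int", "map": "map<string,string>"})
print(offender)  # -> map
```

Running the model against the section-1 plan flags the column named map, which matches the reported error ("the type of column map is map<string,string>").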

union triggers this error because union deduplicates its result: the analyzer rewrites it as Distinct over Union, as the logical plan in the error output above shows.

3. Solution

After checking with the backend team, the current business scenario does not actually require deduplication.

As we know, the main difference between union and union all is that the former deduplicates the result while the latter does not. Simply changing union to union all resolves the error.
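Applied to the query from section 1, the rewritten statement is:

```sql
select 1 id, str_to_map("k1:v1,k2:v2") map
union all
select 2 id, map("k1","v1","k2","v2") map
```

With union all there is no Distinct node in the logical plan, so the map-type check in mapColumnInSetOperation never fires.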