The Spark CSV Empty-Value Quoting Problem
A user had a job that read data from Hive, ran it through a series of aggregations and transformations, and wrote the result out as a CSV file. The following happened. Source record:
1,Arrays,null,1000
Written record:
1,Arrays,"",1000
The record the user actually expected:
1,Arrays,,1000
After reading the user's description, I found the expectation entirely reasonable; it is Spark's handling that looks odd. Without further ado, here is a simple test case that reproduces the problem.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.junit.Test;

@Test
public void testWriteCSV() {
    SparkSession sparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate();
    String header = "1a,2a,a3,4a";
    List<String> dataList = new ArrayList<>();
    for (int i = 0; i < 4; i++) {
        dataList.add("1b,2b,,4b");  // the third field is empty
    }
    Dataset<String> datasetS = sparkSession.createDataset(dataList, Encoders.STRING());
    Dataset<Row> dataset = sparkSession.read().csv(datasetS).toDF(header.split(","));
    dataset.write().mode(SaveMode.Overwrite).format("csv").option("header", true).option("sep", ",")
            .save("test.csv");
    dataset.show();
}
The output of show:
+---+---+----+---+
| 1a| 2a| a3| 4a|
+---+---+----+---+
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
+---+---+----+---+
The a3 column looks like a normal null. Now let's look at what was actually written to the CSV file:
1a,2a,a3,4a
1b,2b,"",4b
Sure enough, this matches the user's description: Spark wrote the null out as `""`. Digging into the Spark source, I traced the behavior to the CSVOptions.scala class and found:
/**
 * String representation of an empty value in read and in write.
 */
val emptyValue = parameters.get("emptyValue")
/**
 * The string is returned when CSV reader doesn't have any characters for input value,
 * or an empty quoted string `""`. Default value is empty string.
 */
val emptyValueInRead = emptyValue.getOrElse("")
/**
 * The value is used instead of an empty string in write. Default value is `""`
 */
val emptyValueInWrite = emptyValue.getOrElse("\"\"")
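The Scala above can be paraphrased in plain Java: `emptyValue` is a single user-supplied option, and when it is absent, the read path and the write path fall back to different defaults. A minimal sketch of that defaulting logic (the class and method names here are illustrative, not Spark's actual code):

```java
// Hypothetical mimic of CSVOptions' defaulting logic, not Spark's real class.
public class EmptyValueDefaults {

    // On read: an empty or empty-quoted field maps to "" unless emptyValue overrides it.
    public static String emptyValueInRead(String emptyValue) {
        return emptyValue != null ? emptyValue : "";
    }

    // On write: an empty field is rendered as the literal two characters "" unless overridden.
    public static String emptyValueInWrite(String emptyValue) {
        return emptyValue != null ? emptyValue : "\"\"";
    }

    public static void main(String[] args) {
        // Option not set: write default is the two quote characters.
        System.out.println(emptyValueInWrite(null));  // prints ""
        // Option set to the empty string: nothing is emitted for an empty field.
        System.out.println(emptyValueInWrite(""));
    }
}
```

The asymmetry is the whole bug report: with no option set, an empty field survives a read unchanged but comes back out of a write as `""`.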
If emptyValue is left undefined, the default on write is the literal `""`. So the fix is simply to expose the emptyValue option to the user. Let's test it:
dataset.write().mode(SaveMode.Overwrite).format("csv").option("header", true).option("sep", ",")
.option("emptyValue", "")
.save("test.csv");
The show output (unchanged, since the option only affects what is written to the file):
+---+---+----+---+
| 1a| 2a| a3| 4a|
+---+---+----+---+
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
+---+---+----+---+
The CSV file now contains:
1a,2a,a3,4a
1b,2b,,4b
This matches the user's expectation. Problem solved.
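One detail worth spelling out: the value in this column was null, not an empty string, yet it was emptyValue that fixed it. My understanding is that Spark's CSV writer first substitutes nulls with the `nullValue` option (whose default is the empty string), and only then renders an empty result using `emptyValue`. A toy sketch of that two-step substitution (simplified assumption, not Spark's actual writer code):

```java
// Toy sketch of how a CSV writer might render a single field.
// Names and logic are simplified assumptions, not Spark internals.
public class FieldRenderer {

    public static String render(String field, String nullValue, String emptyValue) {
        // Step 1: null fields are replaced by nullValue (assumed default: "").
        String v = (field == null) ? nullValue : field;
        // Step 2: an empty result is rendered as emptyValue.
        return v.isEmpty() ? emptyValue : v;
    }

    public static void main(String[] args) {
        // Default settings: a null ends up as the two-character token "".
        System.out.println(render(null, "", "\"\""));  // prints ""
        // With option("emptyValue", ""): a null ends up as a bare empty field.
        System.out.println(render(null, "", ""));
        // Non-empty values pass through untouched either way.
        System.out.println(render("1b", "", "\"\""));  // prints 1b
    }
}
```

Under this model, setting emptyValue to the empty string makes both nulls and genuinely empty strings come out as a bare empty field, which is exactly what the user asked for.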