Spark写CSV空值引号问题

156 阅读2分钟

Spark写CSV空值引号问题

用户有一个作业是从hive中读取数据,经过一系列聚合转换后,写入CSV文件,发生了这样一个现象。 源记录

1,Arrays,null,1000

目标记录

1,Arrays,"",1000

而用户期望的目标记录是

1,Arrays,,1000

看过用户的描述后,我认为用户的期望很合理,反而spark的处理很奇怪,话不多说,写一个简单的测试用例复现用户的问题。

@Test
public void testWriteCSV() {
    SparkSession sparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate();
    String header = "1a,2a,a3,4a";
    List<String> dataList = new ArrayList<>();

    for (int i = 0; i < 4; i++) {
        String tmp = "1b,2b,,4b";
        dataList.add(tmp);
    }
    Dataset<String> datasetS = sparkSession.createDataset(dataList, Encoders.STRING());
    Dataset<Row> dataset = sparkSession.read().csv(datasetS).toDF(header.split(","));
    dataset.write().mode(SaveMode.Overwrite).format("csv").option("header", true).option("sep", ",")
            .save("test.csv");
    dataset.show();
}

show的结果如下

+---+---+----+---+
| 1a| 2a|  a3| 4a|
+---+---+----+---+
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
+---+---+----+---+

a3的结果看起来是正常的null 我们来看一下写csv的结果

1a,2a,a3,4a
1b,2b,"",4b

果然和用户描述一致,null被spark写为了“”,于是查看spark源码,追踪到CSVOptions.scala类发现

 * String representation of an empty value in read and in write.
 */
val emptyValue = parameters.get("emptyValue")
/**
 * The string is returned when CSV reader doesn't have any characters for input value,
 * or an empty quoted string `""`. Default value is empty string.
 */
val emptyValueInRead = emptyValue.getOrElse("")
/**
 * The value is used instead of an empty string in write. Default value is `""`
 */
val emptyValueInWrite = emptyValue.getOrElse("\"\"")

如果emptyValue为定义则默认值为"",于是问题的解决方案变为给用户开一个emptyValue的option出去,测试一下:

dataset.write().mode(SaveMode.Overwrite).format("csv").option("header", true).option("sep", ",")
        .option("emptyValue", "")
        .save("test.csv");

得到的结果

+---+---+----+---+
| 1a| 2a|  a3| 4a|
+---+---+----+---+
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
| 1b| 2b|null| 4b|
+---+---+----+---+

CSV文件的结果:

1a,2a,a3,4a
1b,2b,,4b

和用户预期一致,问题解决。