The concrete write flow of Iceberg Parquet equality-delete files

Take the following FlinkSink job as the entry point:
FlinkSink
        .forRow(dataStream, SimpleDataUtil.FLINK_SCHEMA)
        .table(table)
        .tableLoader(tableLoader)
        .writeParallelism(1)
        .equalityFieldColumns(ImmutableList.of("data"))
        .rewriteDataTasksParallelism(2)
        .build();

First, in the build() method, the stream writer is created via IcebergStreamWriter<RowData> streamWriter = createStreamWriter(table, flinkRowType, equalityFieldIds):

static IcebergStreamWriter<RowData> createStreamWriter(Table table,
                                                       RowType flinkRowType,
                                                       List<Integer> equalityFieldIds) {
  Map<String, String> props = table.properties();
  long targetFileSize = getTargetFileSizeBytes(props);
  FileFormat fileFormat = getFileFormat(props);

  TaskWriterFactory<RowData> taskWriterFactory = new RowDataTaskWriterFactory(table.schema(), flinkRowType,
      table.spec(), table.locationProvider(), table.io(), table.encryption(), targetFileSize, fileFormat, props,
      equalityFieldIds);

  return new IcebergStreamWriter<>(table.name(), taskWriterFactory);
}

Inside, a TaskWriterFactory is created; within IcebergStreamWriter it is responsible for constructing the object that actually performs the writes:

public RowDataTaskWriterFactory(Schema schema,
                                RowType flinkSchema,
                                PartitionSpec spec,
                                LocationProvider locations,
                                FileIO io,
                                EncryptionManager encryptionManager,
                                long targetFileSizeBytes,
                                FileFormat format,
                                Map<String, String> tableProperties,
                                List<Integer> equalityFieldIds) {
  this.schema = schema;
  this.flinkSchema = flinkSchema;
  this.spec = spec;
  this.locations = locations;
  this.io = io;
  this.encryptionManager = encryptionManager;
  this.targetFileSizeBytes = targetFileSizeBytes;
  this.format = format;
  this.equalityFieldIds = equalityFieldIds;

  if (equalityFieldIds == null || equalityFieldIds.isEmpty()) {
    this.appenderFactory = new FlinkAppenderFactory(schema, flinkSchema, tableProperties, spec);
  } else {
    // TODO provide the ability to customize the equality-delete row schema.
    Schema deleteSchema = TypeUtil.select(schema, new HashSet<>(equalityFieldIds));
    this.appenderFactory = new FlinkAppenderFactory(schema, flinkSchema, tableProperties, spec,
        ArrayUtil.toIntArray(equalityFieldIds), deleteSchema, null);
  }
}

This creates a FlinkAppenderFactory, the factory responsible for the objects that write the data files; it is passed the equalityFieldIds, the deleteSchema, and related information.
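The deleteSchema produced by TypeUtil.select can be pictured with a simplified, self-contained sketch (field ids and names below are invented; a real Iceberg Schema carries typed, possibly nested fields): it is the table schema projected down to just the equality fields, and it becomes the row schema of the equality-delete files.

```java
import java.util.*;

class SchemaProjectSketch {
    // Project a schema (field id -> field name) onto the given equality field
    // ids, preserving field order, the way TypeUtil.select projects a Schema.
    static LinkedHashMap<Integer, String> select(LinkedHashMap<Integer, String> schema,
                                                 Set<Integer> fieldIds) {
        LinkedHashMap<Integer, String> projected = new LinkedHashMap<>();
        schema.forEach((id, name) -> {
            if (fieldIds.contains(id)) {
                projected.put(id, name);
            }
        });
        return projected;
    }
}
```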

(Figure: FlinkAppenderFactory class diagram)

FlinkAppenderFactory implements the FileAppenderFactory interface, providing the three methods newDataWriter, newEqDeleteWriter, and newPosDeleteWriter, which create the writers for data files, equality-delete files, and position-delete files respectively.
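The read-time effect of an equality delete can be illustrated with a small self-contained sketch (this is not Iceberg API; the row layout and method name are invented): the delete file stores only the values of the equality fields, and any data row whose equality fields match a recorded delete is filtered out.

```java
import java.util.*;

class EqualityDeleteSketch {
    // Apply equality deletes at read time: dataRows are (id, data) pairs and
    // the single equality field is the "data" column (row[1]); deletedKeys
    // holds the equality-field values recorded in the equality-delete file.
    static List<String[]> applyEqualityDeletes(List<String[]> dataRows,
                                               Set<String> deletedKeys) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : dataRows) {
            if (!deletedKeys.contains(row[1])) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```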

Then, in the open() method of IcebergStreamWriter, TaskWriterFactory.create() is called to create the TaskWriter; the taskWriter is the object inside IcebergStreamWriter that actually writes the data.

For an unpartitioned table, create() returns new UnpartitionedDeltaWriter(spec, format, appenderFactory, outputFileFactory, io, targetFileSizeBytes, schema, flinkSchema, equalityFieldIds), which performs unpartitioned delta (insert/delete) writes.

The class hierarchy of UnpartitionedDeltaWriter is as follows:

(Figure: UnpartitionedDeltaWriter class hierarchy)

The TaskWriter interface accepts records and exposes the generated files: write() writes a record to a data file; dataFiles() closes the writer and returns the completed data files (checking that no delete files were produced); complete() closes the writer and returns both the completed data files and delete files.

public interface TaskWriter<T> extends Closeable {
  /** Write the record into the data files. */
  void write(T row) throws IOException;

  /** Close the writer and discard the files that have been written. */
  void abort() throws IOException;

  /** Close the writer and return only the completed data files; fails if delete files were produced. */
  default DataFile[] dataFiles() throws IOException {
    WriteResult result = complete();
    Preconditions.checkArgument(result.deleteFiles() == null || result.deleteFiles().length == 0,
        "Should have no delete files in this write result.");

    return result.dataFiles();
  }

  /** Close the writer and return both the completed data files and delete files. */
  WriteResult complete() throws IOException;
}

BaseTaskWriter implements the TaskWriter interface and provides much of the base write logic. It has five inner classes: BaseEqualityDeltaWriter, PathOffset, BaseRollingWriter, RollingFileWriter, and RollingEqDeleteWriter. The inheritance relationship among BaseRollingWriter, RollingFileWriter, and RollingEqDeleteWriter is shown below:

(Figure: BaseRollingWriter / RollingFileWriter / RollingEqDeleteWriter class hierarchy)

BaseRollingWriter is the base class for writing data. It implements rolling writes: it tracks the size of the file being written, and once the file reaches the target size it closes that file and starts writing a new one.
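The rolling behavior can be sketched with a self-contained analogue (the file naming and byte counting here are invented for illustration; the real BaseRollingWriter checks the length of the underlying file writer and opens a new EncryptedOutputFile on rollover):

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

class RollingWriterSketch {
    private final long targetFileSizeBytes;
    private final List<String> completedFiles = new ArrayList<>();
    private StringBuilder current = new StringBuilder();
    private int fileIndex = 0;

    RollingWriterSketch(long targetFileSizeBytes) {
        this.targetFileSizeBytes = targetFileSizeBytes;
    }

    void write(String record) {
        current.append(record).append('\n');
        // Roll over once the current file reaches the target size.
        if (current.toString().getBytes(StandardCharsets.UTF_8).length >= targetFileSizeBytes) {
            closeCurrent();
        }
    }

    private void closeCurrent() {
        completedFiles.add("data-" + (fileIndex++) + ".parquet");
        current = new StringBuilder();
    }

    // Close the writer and return all completed files.
    List<String> close() {
        if (current.length() > 0) {
            closeCurrent();
        }
        return completedFiles;
    }
}
```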

RollingFileWriter and RollingEqDeleteWriter extend BaseRollingWriter and implement its abstract methods:

abstract W newWriter(EncryptedOutputFile file, StructLike partition);

abstract long length(W writer);

abstract void write(W writer, T record);

abstract void complete(W closedWriter);

PathOffset is a small utility class that records a file path together with a row offset. It tracks the current position of each row written to the data file and is used by BaseEqualityDeltaWriter.
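How BaseEqualityDeltaWriter uses PathOffset can be sketched as follows (simplified; the real implementation keys insertedRowMap with a StructLikeMap over the equality fields and delegates to the rolling writers above): rows inserted in the current checkpoint are tracked as key -> PathOffset, so deleting a key that was just inserted emits a precise position delete, while deleting an unseen key falls back to an equality delete.

```java
import java.util.*;

class EqualityDeltaWriterSketch {
    // Analogue of BaseTaskWriter.PathOffset: file path plus row position.
    static final class PathOffset {
        final String path;
        final long rowOffset;
        PathOffset(String path, long rowOffset) {
            this.path = path;
            this.rowOffset = rowOffset;
        }
    }

    private final Map<String, PathOffset> insertedRowMap = new HashMap<>();
    final List<String> posDeletes = new ArrayList<>();
    final List<String> eqDeletes = new ArrayList<>();
    private long nextOffset = 0;

    void write(String key) {
        // Track where this key landed; if the key was already inserted in this
        // checkpoint, the earlier row is dropped with a position delete.
        PathOffset previous =
            insertedRowMap.put(key, new PathOffset("data-0.parquet", nextOffset++));
        if (previous != null) {
            posDeletes.add(previous.path + ":" + previous.rowOffset);
        }
    }

    void delete(String key) {
        PathOffset inserted = insertedRowMap.remove(key);
        if (inserted != null) {
            // The row was written in this checkpoint: a position delete suffices.
            posDeletes.add(inserted.path + ":" + inserted.rowOffset);
        } else {
            // The row may live in an earlier file: record an equality delete by key.
            eqDeletes.add(key);
        }
    }
}
```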