Iceberg delete writes


The v2 write logic lives in BaseDeltaTaskWriter:

@Override
public void write(RowData row) throws IOException {
  RowDataDeltaWriter writer = route(row);

  switch (row.getRowKind()) {
    case INSERT:
    case UPDATE_AFTER:
      writer.write(row);
      break;

    case DELETE:
    case UPDATE_BEFORE:
      writer.delete(row);
      break;
      
  }
}
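To see what this routing means for a changelog stream, here is a minimal, self-contained sketch. The `Kind` enum and `route` helper are hypothetical stand-ins written for this illustration (they are not Iceberg or Flink classes); the sketch only records which writer call each row kind would trigger.

```java
import java.util.ArrayList;
import java.util.List;

class RowKindRoutingSketch {
  // Hypothetical copy of the four changelog kinds handled by the switch above.
  enum Kind { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

  // Returns, in order, the writer call that each changelog entry would trigger.
  static List<String> route(List<Kind> changelog) {
    List<String> calls = new ArrayList<>();
    for (Kind kind : changelog) {
      switch (kind) {
        case INSERT:
        case UPDATE_AFTER:
          calls.add("write");
          break;
        case DELETE:
        case UPDATE_BEFORE:
          calls.add("delete");
          break;
        default:
          throw new UnsupportedOperationException("Unknown row kind: " + kind);
      }
    }
    return calls;
  }
}
```

An update therefore arrives as an UPDATE_BEFORE / UPDATE_AFTER pair and is handled as a delete of the old row followed by a write of the new one.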

The writer here is a RowDataDeltaWriter, a subclass of BaseEqualityDeltaWriter (which is defined inside BaseTaskWriter), so writer.delete(row) calls the delete method of BaseEqualityDeltaWriter:

public void delete(T row) throws IOException {
  internalPosDelete(structProjection.wrap(asStructLike(row)));

  eqDeleteWriter.write(row);
}

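There is a structProjection field here, initialized with this.structProjection = StructProjection.create(schema, deleteSchema);. StructProjection is an important class that maps one struct layout onto another. First, look at the interface it implements: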

public interface StructLike {
  int size();

  <T> T get(int pos, Class<T> javaClass);

  <T> void set(int pos, T value);
}

As you can see, this is an interface for a data structure whose values are read and written by position index.
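To make the positional contract concrete, here is a minimal sketch of an array-backed implementation. `PositionalRow` and `ArrayRow` are hypothetical names that mirror the three methods above so that the example compiles without Iceberg on the classpath; they are not Iceberg classes.

```java
// Hypothetical stand-in that mirrors the three StructLike methods shown above.
interface PositionalRow {
  int size();

  <T> T get(int pos, Class<T> javaClass);

  <T> void set(int pos, T value);
}

// Minimal array-backed implementation, for illustration only.
class ArrayRow implements PositionalRow {
  private final Object[] values;

  ArrayRow(Object... values) {
    this.values = values.clone();
  }

  @Override
  public int size() {
    return values.length;
  }

  @Override
  public <T> T get(int pos, Class<T> javaClass) {
    // Class.cast throws ClassCastException if the stored value has a different type.
    return javaClass.cast(values[pos]);
  }

  @Override
  public <T> void set(int pos, T value) {
    values[pos] = value;
  }
}
```

The caller supplies the expected Java class on every read, which is why the interface needs no per-field type metadata of its own.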

/**
 * Creates a projecting wrapper for {@link StructLike} rows.
 *
 * @param dataSchema schema of rows wrapped by this projection
 * @param projectedSchema result schema of the projected rows
 * @return a wrapper to project rows
 */
public static StructProjection create(Schema dataSchema, Schema projectedSchema) {
  return new StructProjection(dataSchema.asStruct(), projectedSchema.asStruct());
}

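// struct type of the projected result
private final StructType type;
// maps each projected position to the field's position in the source struct
private final int[] positionMap;
// projections for nested struct fields (null for non-struct fields)
private final StructProjection[] nestedProjections;
// the source row currently wrapped
private StructLike struct;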

private StructProjection(StructType structType, StructType projection) {
  this.type = projection;
  this.positionMap = new int[projection.fields().size()];
  this.nestedProjections = new StructProjection[projection.fields().size()];

  // set up the projection positions and any nested projections that are needed
  List<Types.NestedField> dataFields = structType.fields();
  for (int pos = 0; pos < positionMap.length; pos += 1) {
    Types.NestedField projectedField = projection.fields().get(pos);

    boolean found = false;
    for (int i = 0; !found && i < dataFields.size(); i += 1) {
      Types.NestedField dataField = dataFields.get(i);
      if (projectedField.fieldId() == dataField.fieldId()) {
        found = true;
        positionMap[pos] = i;
        switch (projectedField.type().typeId()) {
          case STRUCT:
            nestedProjections[pos] = new StructProjection(
                dataField.type().asStructType(), projectedField.type().asStructType());
            break;
          case MAP:
          case LIST:
            throw new IllegalArgumentException(String.format("Cannot project list or map field: %s", projectedField));
          default:
            nestedProjections[pos] = null;
        }
      }
    }

    if (!found) {
      throw new IllegalArgumentException(String.format("Cannot find field %s in %s", projectedField, structType));
    }
  }
}

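The length of positionMap equals the number of fields in the projected result, and each entry is the position of the corresponding field in the source struct. The constructor loops over the source fields to find the one with the same field ID and fills in positionMap; for nested struct fields it builds a StructProjection recursively. Projecting a list or map field is not supported and throws an IllegalArgumentException.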
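The matching loop can be sketched on its own with plain field IDs. This is a simplified illustration, assuming flat schemas represented as `int[]` arrays of field IDs; `PositionMapSketch` is a hypothetical helper and nested structs are left out.

```java
import java.util.Arrays;

class PositionMapSketch {
  // For each projected field ID, find the index of the same ID among the source field IDs.
  static int[] positionMap(int[] dataFieldIds, int[] projectedFieldIds) {
    int[] map = new int[projectedFieldIds.length];
    for (int pos = 0; pos < projectedFieldIds.length; pos += 1) {
      int found = -1;
      for (int i = 0; found < 0 && i < dataFieldIds.length; i += 1) {
        if (dataFieldIds[i] == projectedFieldIds[pos]) {
          found = i;
        }
      }
      if (found < 0) {
        throw new IllegalArgumentException("Cannot find field id " + projectedFieldIds[pos]);
      }
      map[pos] = found;
    }
    return map;
  }

  public static void main(String[] args) {
    // Source fields have IDs 1, 2, 3; the projected key consists of IDs 3 and 1.
    int[] map = positionMap(new int[] {1, 2, 3}, new int[] {3, 1});
    System.out.println(Arrays.toString(map)); // prints [2, 0]
  }
}
```

Matching is done by field ID rather than by name or position, so the projected schema may list the fields in any order.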

public StructProjection wrap(StructLike newStruct) {
  this.struct = newStruct;
  return this;
}

@Override
public <T> T get(int pos, Class<T> javaClass) {
  if (nestedProjections[pos] != null) {
    return javaClass.cast(nestedProjections[pos].wrap(struct.get(positionMap[pos], StructLike.class)));
  }

  return struct.get(positionMap[pos], javaClass);
}

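After initialization, a projection is applied by calling wrap on a source struct; get then returns the value at the given position of the projected struct.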
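The wrap-then-get pattern can be sketched as follows. `KeyProjectionSketch` is a hypothetical, simplified class that works on `Object[]` rows with a precomputed position map; it is meant to show that wrap only swaps a reference and returns the same instance.

```java
class KeyProjectionSketch {
  private final int[] positionMap; // projected position -> source position
  private Object[] row;            // the source row currently wrapped

  KeyProjectionSketch(int... positionMap) {
    this.positionMap = positionMap.clone();
  }

  // Points the view at another row and returns the same instance; nothing is copied.
  KeyProjectionSketch wrap(Object[] newRow) {
    this.row = newRow;
    return this;
  }

  int size() {
    return positionMap.length;
  }

  <T> T get(int pos, Class<T> javaClass) {
    return javaClass.cast(row[positionMap[pos]]);
  }

  public static void main(String[] args) {
    KeyProjectionSketch projection = new KeyProjectionSketch(2, 0);
    Object[] first = {1L, "alice", "eu"};
    Object[] second = {2L, "bob", "us"};

    KeyProjectionSketch view = projection.wrap(first);
    System.out.println(view.get(0, String.class)); // prints eu

    projection.wrap(second);
    // The earlier reference now reads from the second row because the wrapper is reused.
    System.out.println(view.get(0, String.class)); // prints us
  }
}
```

Because the wrapper is a mutable view, a wrapped row is only safe to keep around after it has been copied.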

Now look at the fields of BaseDeltaTaskWriter and how they are initialized:

abstract class BaseDeltaTaskWriter extends BaseTaskWriter<RowData> {

  private final Schema schema;
  private final Schema deleteSchema;
  private final RowDataWrapper wrapper;

  // Assigned in the constructor (other parameters and the super(...) call omitted):
  //   this.schema = schema;
  //   this.deleteSchema = TypeUtil.select(schema, Sets.newHashSet(equalityFieldIds));
  //   this.wrapper = new RowDataWrapper(flinkSchema, schema.asStruct());
}

Here wrapper is a RowDataWrapper. RowDataWrapper also implements StructLike and wraps Flink RowData so that it can be accessed through the StructLike interface.

In the delete method above, the asStructLike(row) call in internalPosDelete(structProjection.wrap(asStructLike(row))); is implemented in the subclass:

@Override
protected StructLike asStructLike(RowData data) {
  return wrapper.wrap(data);
}

This simply calls wrap on wrapper to turn the RowData into something that can be read through the StructLike interface; structProjection.wrap() then projects it down to the key.

public void write(T row) throws IOException {
  PathOffset pathOffset = PathOffset.of(dataWriter.currentPath(), dataWriter.currentRows());

  // Create a copied key from this row.
  StructLike copiedKey = StructCopy.copy(structProjection.wrap(asStructLike(row)));

  // Adding a pos-delete to replace the old path-offset.
  PathOffset previous = insertedRowMap.put(copiedKey, pathOffset);
  if (previous != null) {
    // TODO attach the previous row if has a positional-delete row schema in appender factory.
    posDeleteWriter.delete(previous.path, previous.rowOffset, null);
  }

  dataWriter.write(row);
}

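In write, StructCopy.copy() creates a new object from the wrapped, projected row. The copy contains only the key fields, which saves space, and it is then put into the map. The copy is also necessary because the wrappers are reused for every row: wrap only swaps the referenced row, so a stored wrapper would change as soon as the next row is wrapped.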
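The interplay between write, delete, and the key map can be sketched as below. This is a simplified model, not the Iceberg implementation: `DeltaWriterSketch` is hypothetical, keys are plain immutable strings (standing in for the copied key struct), and the three "files" are in-memory lists. The body of internalPosDelete is not shown in this article, so the position-delete step inside `delete` is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DeltaWriterSketch {
  final List<String> dataFile = new ArrayList<>();         // rows appended to the current data file
  final List<Integer> positionDeletes = new ArrayList<>(); // offsets of rows deleted by position
  final List<String> equalityDeletes = new ArrayList<>();  // keys deleted by value
  private final Map<String, Integer> insertedRowMap = new HashMap<>();

  void write(String key, String row) {
    // Remember where this key lands. A previous entry means the key was already
    // written to this file, so the older row is removed by position.
    Integer previous = insertedRowMap.put(key, dataFile.size());
    if (previous != null) {
      positionDeletes.add(previous);
    }
    dataFile.add(row);
  }

  void delete(String key) {
    // Assumption: a row written earlier to this file is removed by position.
    Integer previous = insertedRowMap.remove(key);
    if (previous != null) {
      positionDeletes.add(previous);
    }
    // As in the delete method shown earlier, an equality delete is always written.
    equalityDeletes.add(key);
  }
}
```

Writing the same key twice and then deleting another key produces two position deletes and one equality delete, while all three rows remain physically present in the data file.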

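Next, I try to understand what StructLike, StructProjection, RowDataWrapper and related classes do, because I will need to modify this code later.

  • StructLike is a fairly general interface for reading and writing a data structure by index.

  • RowDataWrapper is a wrapper around Flink's RowData. RowData is the internal data structure of Flink Table and can be thought of as one row of a table. It stores only the value of each field, not its type, yet reading a field requires calling the getter that matches the field's type. For example, if the second field is a string, it must be read with rowData.getString(1). Since the row itself cannot tell which getter to call, RowData provides a static method that takes a field type and a field position and returns an accessor for that field. The code is as follows: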

/**
 * Creates an accessor for getting elements in an internal row data structure at the given
 * position.
 *
 * @param fieldType the element type of the row
 * @param fieldPos the element position of the row
 */
static FieldGetter createFieldGetter(LogicalType fieldType, int fieldPos) {
    final FieldGetter fieldGetter;
    // ordered by type root definition
    switch (fieldType.getTypeRoot()) {
        case CHAR:
        case VARCHAR:
            fieldGetter = row -> row.getString(fieldPos);
            break;
        case BOOLEAN:
            fieldGetter = row -> row.getBoolean(fieldPos);
            break;
        case BINARY:
        case VARBINARY:
            fieldGetter = row -> row.getBinary(fieldPos);
            break;
        case DECIMAL:
            final int decimalPrecision = getPrecision(fieldType);
            final int decimalScale = getScale(fieldType);
            fieldGetter = row -> row.getDecimal(fieldPos, decimalPrecision, decimalScale);
            break;
        case TINYINT:
            fieldGetter = row -> row.getByte(fieldPos);
            break;
        case SMALLINT:
            fieldGetter = row -> row.getShort(fieldPos);
            break;
        case INTEGER:
        case DATE:
        case TIME_WITHOUT_TIME_ZONE:
        case INTERVAL_YEAR_MONTH:
            fieldGetter = row -> row.getInt(fieldPos);
            break;
        case BIGINT:
        case INTERVAL_DAY_TIME:
            fieldGetter = row -> row.getLong(fieldPos);
            break;
        case FLOAT:
            fieldGetter = row -> row.getFloat(fieldPos);
            break;
        case DOUBLE:
            fieldGetter = row -> row.getDouble(fieldPos);
            break;
        case TIMESTAMP_WITHOUT_TIME_ZONE:
        case TIMESTAMP_WITH_LOCAL_TIME_ZONE:
            final int timestampPrecision = getPrecision(fieldType);
            fieldGetter = row -> row.getTimestamp(fieldPos, timestampPrecision);
            break;
        case TIMESTAMP_WITH_TIME_ZONE:
            throw new UnsupportedOperationException();
        case ARRAY:
            fieldGetter = row -> row.getArray(fieldPos);
            break;
        case MULTISET:
        case MAP:
            fieldGetter = row -> row.getMap(fieldPos);
            break;
        case ROW:
        case STRUCTURED_TYPE:
            final int rowFieldCount = getFieldCount(fieldType);
            fieldGetter = row -> row.getRow(fieldPos, rowFieldCount);
            break;
        case DISTINCT_TYPE:
            fieldGetter =
                    createFieldGetter(((DistinctType) fieldType).getSourceType(), fieldPos);
            break;
        case RAW:
            fieldGetter = row -> row.getRawValue(fieldPos);
            break;
        case NULL:
        case SYMBOL:
        case UNRESOLVED:
        default:
            throw new IllegalArgumentException();
    }
    if (!fieldType.isNullable()) {
        return fieldGetter;
    }
    return row -> {
        if (row.isNullAt(fieldPos)) {
            return null;
        }
        return fieldGetter.getFieldOrNull(row);
    };
}

/**
 * Accessor for getting the field of a row during runtime.
 *
 * @see #createFieldGetter(LogicalType, int)
 */
interface FieldGetter extends Serializable {
    @Nullable
    Object getFieldOrNull(RowData row);
}

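First there is the FieldGetter interface, whose getFieldOrNull method reads a field. createFieldGetter takes two arguments, a type and a position, and returns a lambda expression that takes a RowData and returns an Object, namely the value of the field. Reading a RowData field can therefore be written as RowData.createFieldGetter(types[pos], pos).getFieldOrNull(rowData);, where pos is the field position, types is the array of field types, and rowData is the row being read.

FieldGetter is a functional interface with a single method, getFieldOrNull, so the parameter and return value of the lambda returned by createFieldGetter match that method. The lambda can be read as an implementation of getFieldOrNull that is returned as a FieldGetter object, much like an anonymous class.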
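The lambda-as-implementation idea can be shown without Flink. In the sketch below, `FieldType`, `Getter`, and `createGetter` are hypothetical stand-ins for Flink's logical type roots, `RowData.FieldGetter`, and `createFieldGetter`, and a row is just a `List<Object>`. The anonymous-class variant is included to show what the lambda is shorthand for.

```java
import java.util.List;

class FieldGetterSketch {
  // Hypothetical stand-in for the type roots switched on in createFieldGetter.
  enum FieldType { STRING, LONG }

  // Hypothetical stand-in for RowData.FieldGetter: a single abstract method.
  interface Getter {
    Object getFieldOrNull(List<Object> row);
  }

  // The type decides which typed access is used; the position is captured by the lambda.
  static Getter createGetter(FieldType type, int pos) {
    switch (type) {
      case STRING:
        return row -> (String) row.get(pos);
      case LONG:
        return row -> (Long) row.get(pos);
      default:
        throw new IllegalArgumentException("Unsupported type: " + type);
    }
  }

  // The same STRING getter written as an anonymous class instead of a lambda.
  static Getter createStringGetterVerbose(int pos) {
    return new Getter() {
      @Override
      public Object getFieldOrNull(List<Object> row) {
        return (String) row.get(pos);
      }
    };
  }
}
```

Creating a getter once per field and reusing it for every row keeps the type switch out of the per-row path.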

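Back to RowDataWrapper: reading RowData requires knowing the field types, so the role of RowDataWrapper is essentially to store the field types of a RowData and expose an interface for reading it.

RowDataWrapper holds a reference to a RowData object; before use, this reference must be pointed at the row to be read.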
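A rough sketch of that shape, under simplifying assumptions: `RowWrapperSketch` is hypothetical, rows are `Object[]` arrays, and every accessor just reads `row[i]`, whereas a real wrapper would choose a typed getter per field. The point is that the accessors are built once while the row reference is swapped for each record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class RowWrapperSketch {
  private final List<Function<Object[], Object>> getters; // one accessor per field, built once
  private Object[] row;                                   // the row currently being read

  RowWrapperSketch(List<Function<Object[], Object>> getters) {
    this.getters = getters;
  }

  // Builds a wrapper whose field i simply reads row[i] (an assumption made for brevity).
  static RowWrapperSketch forWidth(int width) {
    List<Function<Object[], Object>> getters = new ArrayList<>();
    for (int i = 0; i < width; i += 1) {
      final int pos = i;
      getters.add(r -> r[pos]);
    }
    return new RowWrapperSketch(getters);
  }

  // Points the wrapper at the next row; the accessors are reused as-is.
  RowWrapperSketch wrap(Object[] newRow) {
    this.row = newRow;
    return this;
  }

  int size() {
    return getters.size();
  }

  <T> T get(int pos, Class<T> javaClass) {
    if (row == null) {
      throw new IllegalStateException("wrap(row) must be called before get");
    }
    return javaClass.cast(getters.get(pos).apply(row));
  }
}
```

One wrapper instance can then serve a whole stream of rows: call wrap for each incoming row and read its fields through get.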

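  • StructProjection

It also implements the StructLike interface; its role was analyzed above.

Now for the current requirement: prune the columns of the delete data that is written so that only the primary key columns are written. After analyzing the original code, I tried to implement this with deleteKey in BaseTaskWriter.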