Saving JSON Data as S3 Parquet Files


At work we need to use Kafka Connect to move JSON data from Kafka to S3 and store it as Parquet. This post records a demo of the setup.

Older versions of Kafka Connect do not support an S3 Parquet sink; we are using Confluent 5.5.5.

As mentioned in another note (juejin.cn/post/699841…), writing Parquet requires Avro serialization. In that note we used a producer/consumer to send and consume Avro data directly, so the data in Kafka could be saved straight to Parquet files.

In our case, however, the data is originally produced as JSON, so the producer side relies on JsonConverter. Since we cannot modify the producer, we need to convert the JSON data to Avro ourselves before the sink writes Parquet.

There are two ways to handle this:

  • External processing: run a consumer that reads the JSON data, converts it to Avro, and writes it to a new topic. This can be done with KSQL or a custom Kafka Streams job.
  • Implement a custom Converter in Kafka Connect that turns the JSON records into Avro-serialized data that can then be written out as Parquet.

Converting JSON to Avro with KSQL

The structure of the data in our Kafka topic:

{
    "id":"string",
    "name":"string",
    "age":"int"
}
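
For reference, such a record is written to the source topic as plain JSON bytes. Below is a minimal producer sketch; the broker address and sample values are made up, and ByteArraySerializer matches how the JSON is sent in our setup, as noted in the custom-converter section below.

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class JsonProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        // Sample record matching the id/name/age structure above
        String json = "{\"id\":\"u-001\",\"name\":\"alice\",\"age\":30}";
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test_topic", json.getBytes(StandardCharsets.UTF_8)));
        }
    }
}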

Create a stream in KSQL on top of the source topic:

create stream json_table (id varchar, name varchar,age int) WITH (KAFKA_TOPIC='test_topic', VALUE_FORMAT='JSON');

Create a second stream in KSQL that re-serializes the data as Avro:

CREATE STREAM avro_table WITH (KAFKA_TOPIC='test_topic_avro',REPLICAS=2,PARTITIONS=8,VALUE_FORMAT='AVRO') AS SELECT * FROM json_table;

Some useful statements:

-- list all streams
show streams;
-- inspect messages on the Avro topic
print 'test_topic_avro';
-- stop the running stream query
TERMINATE CSAS_AVRO_TABLE_0;
-- drop the Avro stream
drop stream AVRO_TABLE;

The final Kafka connector can then consume the Avro topic directly:

{
  "name": "parquet_sink_test",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "errors.log.include.messages": "true",
    "s3.region": "region",
    "topics.dir": "folder",
    "flush.size": "300",
    "tasks.max": "1",
    "timezone": "UTC",
    "s3.part.size": "5242880",
    "enhanced.avro.schema.support": "true",
    "rotate.interval.ms": "6000",
    "locale": "US",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "s3.part.retries": "18",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "errors.log.enable": "true",
    "s3.bucket.name": "bucket",
    "partition.duration.ms": "3600000",
    "topics": "test_topic_avro",
    "batch.max.rows": "100",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "value.converter.schemas.enable": "true",
    "name": "parquet_sink_test",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "rotate.schedule.interval.ms": "6000",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "schema.registry.url": "http://schema-registry:8081",
    "path.format": "'log_year'=YYYY/'log_month'=MM/'log_day'=dd/'log_hour'=HH"
  }
}
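
The connector is created by POSTing this JSON to the Kafka Connect REST API. Below is a minimal sketch in Java; the Connect host http://kafka-connect:8083 and the file name parquet_sink_test.json are assumptions, so adjust them to your environment or use any HTTP client you prefer.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // The connector config shown above, saved to a local file (hypothetical name)
        String configJson = Files.readString(Path.of("parquet_sink_test.json"));

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://kafka-connect:8083/connectors")) // assumed Connect REST endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(configJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}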

Converting JSON to Avro with a Custom Converter

Tip: we send the JSON data with ByteArraySerializer. A custom converter has to implement the Converter interface:

public interface Converter {
    void configure(Map<String, ?> configs, boolean isKey);
    // Converts a Kafka Connect data object (schema + value) into the bytes written to Kafka
    byte[] fromConnectData(String topic, Schema schema, Object value);
    default byte[] fromConnectData(String topic, Headers headers, Schema schema, Object value) {
        return fromConnectData(topic, schema, value);
    }
    // Converts the bytes read from Kafka into a Kafka Connect data object (schema + value)
    SchemaAndValue toConnectData(String topic, byte[] value);
    default SchemaAndValue toConnectData(String topic, Headers headers, byte[] value) {
        return toConnectData(topic, value);
    }
}

Let's take a look at fromConnectData and toConnectData in the official AvroConverter:

  • fromConnectData
public byte[] fromConnectData(String topic, Schema schema, Object value) {
  try {
    org.apache.avro.Schema avroSchema = avroData.fromConnectSchema(schema);
    return serializer.serialize(
        topic,
        isKey,
        avroData.fromConnectData(schema, avroSchema, value),
        new AvroSchema(avroSchema));
  } catch (SerializationException e) {
    throw new DataException(
        String.format("Failed to serialize Avro data from topic %s :", topic),
        e
    );
  } catch (InvalidConfigurationException e) {
    throw new ConfigException(
        String.format("Failed to access Avro data from topic %s : %s", topic, e.getMessage())
    );
  }
}
  • toConnectData
public SchemaAndValue toConnectData(String topic, byte[] value) {
  try {
    GenericContainerWithVersion containerWithVersion =
        deserializer.deserialize(topic, isKey, value);
    if (containerWithVersion == null) {
      return SchemaAndValue.NULL;
    }
    GenericContainer deserialized = containerWithVersion.container();
    Integer version = containerWithVersion.version();
    if (deserialized instanceof IndexedRecord) {
      return avroData.toConnectData(deserialized.getSchema(), deserialized, version);
    } else if (deserialized instanceof NonRecordContainer) {
      return avroData.toConnectData(
          deserialized.getSchema(), ((NonRecordContainer) deserialized).getValue(), version);
    }
    throw new DataException(
        String.format("Unsupported type returned during deserialization of topic %s ", topic)
    );
  } catch (SerializationException e) {
    throw new DataException(
        String.format("Failed to deserialize data for topic %s to Avro: ", topic),
        e
    );
  } catch (InvalidConfigurationException e) {
    throw new ConfigException(
        String.format("Failed to access Avro data from topic %s : %s", topic, e.getMessage())
    );
  }
}

Because we are moving data from Kafka to S3 with Connect, the method that matters here is toConnectData. In this method the value parameter holds the raw JSON byte array, but the downstream processing only accepts Avro-serialized bytes, so at the start of the method we have to Avro-serialize the JSON data, turning it into something the Kafka Avro deserializer can recognize.

In fact, we can borrow from the fromConnectData method, which is exactly the step that turns external data into Avro bytes the Kafka Avro tooling understands. We only need to reuse that logic:

// Connect schema matching the JSON structure
org.apache.kafka.connect.data.Schema schema = SchemaBuilder.struct().name("TEST_name")
        .field("id", org.apache.kafka.connect.data.Schema.OPTIONAL_STRING_SCHEMA)
        .field("name", org.apache.kafka.connect.data.Schema.OPTIONAL_STRING_SCHEMA)
        .field("age", org.apache.kafka.connect.data.Schema.OPTIONAL_INT32_SCHEMA)
        .build();

// Parse the JSON bytes into a POJO and copy the fields into a Connect Struct
ObjectMapper objectMapper = new ObjectMapper();
User user = objectMapper.readValue(value, User.class);
Struct struct = new Struct(schema)
        .put("id", user.getId())
        .put("name", user.getName())
        .put("age", user.getAge());

// Serialize the Struct to Avro bytes, the same way the official fromConnectData does
org.apache.avro.Schema avroSchema = avroData.fromConnectSchema(schema);
byte[] avroBytes = serializer.serialize(
        topic,
        false,
        avroData.fromConnectData(schema, struct),
        new AvroSchema(avroSchema));

// From here on, the rest of toConnectData is the official AvroConverter code shown above
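
Putting this together, here is a minimal sketch of what the whole converter class could look like. It is only an illustration under a few assumptions: the class name MyCustomAvroConverter matches the value.converter in the configuration below, the id/name/age schema is hard-coded, and instead of copying the official converter's internals it simply wraps an AvroConverter instance and round-trips the parsed Struct through it (Struct to Avro bytes, then Avro bytes to SchemaAndValue).

import java.io.IOException;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.DataException;
import org.apache.kafka.connect.storage.Converter;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import io.confluent.connect.avro.AvroConverter;

public class MyCustomAvroConverter implements Converter {

    // Connect schema for the id/name/age JSON records (assumed fixed for this demo)
    private static final Schema VALUE_SCHEMA = SchemaBuilder.struct().name("TEST_name")
            .field("id", Schema.OPTIONAL_STRING_SCHEMA)
            .field("name", Schema.OPTIONAL_STRING_SCHEMA)
            .field("age", Schema.OPTIONAL_INT32_SCHEMA)
            .build();

    private final ObjectMapper mapper = new ObjectMapper();
    private final AvroConverter avroConverter = new AvroConverter();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // The worker passes the value.converter.* settings here (e.g. schema.registry.url)
        avroConverter.configure(configs, isKey);
    }

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        // Not needed for the S3 sink path; delegate to the wrapped AvroConverter
        return avroConverter.fromConnectData(topic, schema, value);
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        if (value == null) {
            return SchemaAndValue.NULL;
        }
        try {
            // 1. Parse the raw JSON bytes written by the producer
            JsonNode node = mapper.readTree(value);
            String id = node.hasNonNull("id") ? node.get("id").asText() : null;
            String name = node.hasNonNull("name") ? node.get("name").asText() : null;
            Integer age = node.hasNonNull("age") ? node.get("age").asInt() : null;
            Struct struct = new Struct(VALUE_SCHEMA).put("id", id).put("name", name).put("age", age);

            // 2. Serialize the Struct to Avro bytes (this also registers the schema)
            byte[] avroBytes = avroConverter.fromConnectData(topic, VALUE_SCHEMA, struct);

            // 3. Let the official AvroConverter logic turn the Avro bytes into SchemaAndValue
            return avroConverter.toConnectData(topic, avroBytes);
        } catch (IOException e) {
            throw new DataException("Failed to parse JSON from topic " + topic, e);
        }
    }
}

The jar containing this class would have to be on the Connect worker's classpath (or plugin path), and value.converter should reference its fully qualified class name.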

Connect configuration:

{
  "name": "parquet_sink_test",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "errors.log.include.messages": "true",
    "s3.region": "region",
    "topics.dir": "folder",
    "flush.size": "300",
    "tasks.max": "1",
    "timezone": "UTC",
    "s3.part.size": "5242880",
    "enhanced.avro.schema.support": "true",
    "rotate.interval.ms": "6000",
    "locale": "US",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "s3.part.retries": "18",
    "value.converter": "MyCustomAvroConverter",
    "errors.log.enable": "true",
    "s3.bucket.name": "bucket",
    "partition.duration.ms": "3600000",
    "topics": "test_topic_avro",
    "batch.max.rows": "100",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "value.converter.schemas.enable": "true",
    "name": "parquet_sink_test",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "rotate.schedule.interval.ms": "6000",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "schema.registry.url": "http://schema-registry:8081",
    "path.format": "'log_year'=YYYY/'log_month'=MM/'log_day'=dd/'log_hour'=HH"
  }
}