At work we needed to use Kafka Connect to move JSON data from Kafka to S3 and store it as Parquet. This post records a small demo.
Older versions of the Kafka Connect S3 sink do not support writing Parquet; we are using Confluent 5.5.5.
As mentioned in another note (juejin.cn/post/699841…), writing Parquet requires Avro serialization. In that note we used a producer/consumer to send and consume Avro data directly, which let the data in Kafka be saved straight to Parquet files.
In our case, however, the data is produced as JSON (so the sink would normally be configured with JsonConverter), and since we cannot modify the producer, we have to convert the JSON into Avro ourselves.
There are two ways to handle this:
- Handle it outside of Connect: have a consumer read the JSON data, convert it to Avro, and write it to a new topic. This can be done with KSQL or a Kafka Streams job.
- Write a custom Converter for Kafka Connect that turns the JSON data into Avro-serialized data that can then be written as Parquet.
Converting JSON to Avro with KSQL
The data in our Kafka topic has this structure:
{
  "id": "string",
  "name": "string",
  "age": "int"
}
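A concrete message on test_topic would then look like this (the values are made up for illustration):
{"id": "u-001", "name": "alice", "age": 30}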
Create a stream in KSQL on top of the source topic:
CREATE STREAM json_table (id VARCHAR, name VARCHAR, age INT) WITH (KAFKA_TOPIC='test_topic', VALUE_FORMAT='JSON');
Then create a second stream that re-serializes the data as Avro:
CREATE STREAM avro_table WITH (KAFKA_TOPIC='test_topic_avro',REPLICAS=2,PARTITIONS=8,VALUE_FORMAT='AVRO') AS SELECT * FROM json_table;
Some other useful statements:
-- list all streams
SHOW STREAMS;
-- inspect the records flowing into the Avro topic
PRINT 'test_topic_avro';
-- terminate the persistent query behind the Avro stream
TERMINATE CSAS_AVRO_TABLE_0;
-- drop the Avro stream
DROP STREAM AVRO_TABLE;
The final S3 sink connector can then consume the Avro topic:
{
  "name": "parquet_sink_test",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "errors.log.include.messages": "true",
    "s3.region": "region",
    "topics.dir": "folder",
    "flush.size": "300",
    "tasks.max": "1",
    "timezone": "UTC",
    "s3.part.size": "5242880",
    "enhanced.avro.schema.support": "true",
    "rotate.interval.ms": "6000",
    "locale": "US",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "s3.part.retries": "18",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "errors.log.enable": "true",
    "s3.bucket.name": "bucket",
    "partition.duration.ms": "3600000",
    "topics": "test_topic_avro",
    "batch.max.rows": "100",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "value.converter.schemas.enable": "true",
    "name": "parquet_sink_test",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "rotate.schedule.interval.ms": "6000",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "schema.registry.url": "http://schema-registry:8081",
    "path.format": "'log_year'=YYYY/'log_month'=MM/'log_day'=dd/'log_hour'=HH"
  }
}
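With the TimeBasedPartitioner and the path.format above, the objects land in S3 under keys roughly like the following (an assumed example, not taken from a real run; the exact file name depends on the Kafka partition and the starting offset of each file):
s3://bucket/folder/test_topic_avro/log_year=2021/log_month=09/log_day=01/log_hour=12/test_topic_avro+0+0000000000.snappy.parquet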
Converting JSON to Avro with a custom Converter
Tip: the producer sends the JSON data using ByteArraySerializer.
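For reference, a minimal sketch of such a producer (the broker address and the record values are assumptions for illustration):

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class JsonProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        // the value is just the raw JSON string encoded as UTF-8 bytes
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            String json = "{\"id\":\"u-001\",\"name\":\"alice\",\"age\":30}";
            producer.send(new ProducerRecord<>("test_topic", "u-001",
                    json.getBytes(StandardCharsets.UTF_8)));
        }
    }
}

A custom converter then has to implement the Converter interface: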
public interface Converter {

    void configure(Map<String, ?> configs, boolean isKey);

    // serializes Connect data (schema + value) into the bytes written to Kafka
    byte[] fromConnectData(String topic, Schema schema, Object value);

    default byte[] fromConnectData(String topic, Headers headers, Schema schema, Object value) {
        return fromConnectData(topic, schema, value);
    }

    // deserializes the bytes read from Kafka back into Connect data (SchemaAndValue)
    SchemaAndValue toConnectData(String topic, byte[] value);

    default SchemaAndValue toConnectData(String topic, Headers headers, byte[] value) {
        return toConnectData(topic, value);
    }
}
Let's take a look at fromConnectData and toConnectData in the official AvroConverter:
- fromConnectData
public byte[] fromConnectData(String topic, Schema schema, Object value) {
    try {
        org.apache.avro.Schema avroSchema = avroData.fromConnectSchema(schema);
        return serializer.serialize(
            topic,
            isKey,
            avroData.fromConnectData(schema, avroSchema, value),
            new AvroSchema(avroSchema));
    } catch (SerializationException e) {
        throw new DataException(
            String.format("Failed to serialize Avro data from topic %s :", topic),
            e
        );
    } catch (InvalidConfigurationException e) {
        throw new ConfigException(
            String.format("Failed to access Avro data from topic %s : %s", topic, e.getMessage())
        );
    }
}
- toConnectData
public SchemaAndValue toConnectData(String topic, byte[] value) {
    try {
        GenericContainerWithVersion containerWithVersion =
            deserializer.deserialize(topic, isKey, value);
        if (containerWithVersion == null) {
            return SchemaAndValue.NULL;
        }
        GenericContainer deserialized = containerWithVersion.container();
        Integer version = containerWithVersion.version();
        if (deserialized instanceof IndexedRecord) {
            return avroData.toConnectData(deserialized.getSchema(), deserialized, version);
        } else if (deserialized instanceof NonRecordContainer) {
            return avroData.toConnectData(
                deserialized.getSchema(), ((NonRecordContainer) deserialized).getValue(), version);
        }
        throw new DataException(
            String.format("Unsupported type returned during deserialization of topic %s ", topic)
        );
    } catch (SerializationException e) {
        throw new DataException(
            String.format("Failed to deserialize data for topic %s to Avro: ", topic),
            e
        );
    } catch (InvalidConfigurationException e) {
        throw new ConfigException(
            String.format("Failed to access Avro data from topic %s : %s", topic, e.getMessage())
        );
    }
}
Since Connect is moving the data from Kafka to S3, it is toConnectData that gets called. Its value parameter is the raw JSON byte array from the topic, but everything downstream only accepts Avro-serialized bytes, so at the very start of the method we first have to Avro-serialize the JSON, i.e. turn it into data the Avro deserializer can recognize.
In fact, that is exactly what fromConnectData does: it turns Connect data into Avro-serialized bytes that the Avro code path understands, so we only need to call it:
// build the Connect schema matching the JSON structure
org.apache.kafka.connect.data.Schema schema = SchemaBuilder.struct().name("TEST_name")
        .field("id", org.apache.kafka.connect.data.Schema.OPTIONAL_STRING_SCHEMA)
        .field("name", org.apache.kafka.connect.data.Schema.OPTIONAL_STRING_SCHEMA)
        .field("age", org.apache.kafka.connect.data.Schema.OPTIONAL_INT32_SCHEMA)
        .build();

// deserialize the JSON bytes into a POJO (User has id/name/age fields)
ObjectMapper objectMapper = new ObjectMapper();
User user = objectMapper.readValue(value, User.class);

// build a Connect Struct that matches the schema
Struct struct = new Struct(schema)
        .put("id", user.getId())
        .put("name", user.getName())
        .put("age", user.getAge());

// Avro-serialize the Struct; avroData and serializer are the same fields the official AvroConverter uses
org.apache.avro.Schema avroSchema = avroData.fromConnectSchema(schema);
byte[] serialize = serializer.serialize(
        topic,
        false,
        avroData.fromConnectData(schema, struct),
        new AvroSchema(avroSchema));
// from here on, the rest of toConnectData is the official AvroConverter code shown above
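Putting it together, a minimal sketch of such a converter could extend the official AvroConverter: build the Struct from the JSON bytes, Avro-serialize it through the inherited fromConnectData, and hand the Avro bytes back to the parent's toConnectData. The class name JsonToAvroConverter, the fixed schema, and the field handling below are assumptions for illustration, not the exact production code:

import java.io.IOException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import io.confluent.connect.avro.AvroConverter;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.errors.DataException;

public class JsonToAvroConverter extends AvroConverter {

    // fixed value schema, assumed to match the JSON structure shown earlier
    private static final Schema VALUE_SCHEMA = SchemaBuilder.struct().name("TEST_name")
            .field("id", Schema.OPTIONAL_STRING_SCHEMA)
            .field("name", Schema.OPTIONAL_STRING_SCHEMA)
            .field("age", Schema.OPTIONAL_INT32_SCHEMA)
            .build();

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        if (value == null) {
            return SchemaAndValue.NULL;
        }
        try {
            // parse the raw JSON bytes coming from the topic
            JsonNode node = mapper.readTree(value);
            String id = node.hasNonNull("id") ? node.get("id").asText() : null;
            String name = node.hasNonNull("name") ? node.get("name").asText() : null;
            Integer age = node.hasNonNull("age") ? Integer.valueOf(node.get("age").intValue()) : null;

            Struct struct = new Struct(VALUE_SCHEMA)
                    .put("id", id)
                    .put("name", name)
                    .put("age", age);

            // Avro-serialize the Struct with the inherited fromConnectData
            // (this also registers the schema in Schema Registry), then let the
            // official AvroConverter logic turn the Avro bytes into Connect data
            byte[] avroBytes = fromConnectData(topic, VALUE_SCHEMA, struct);
            return super.toConnectData(topic, avroBytes);
        } catch (IOException e) {
            throw new DataException("Failed to parse JSON value from topic " + topic, e);
        }
    }
}

Because the converter still relies on the inherited configure(), value.converter.schema.registry.url has to be set on the connector, as in the configuration below.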
Connect configuration:
{
  "name": "parquet_sink_test",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "errors.log.include.messages": "true",
    "s3.region": "region",
    "topics.dir": "folder",
    "flush.size": "300",
    "tasks.max": "1",
    "timezone": "UTC",
    "s3.part.size": "5242880",
    "enhanced.avro.schema.support": "true",
    "rotate.interval.ms": "6000",
    "locale": "US",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "s3.part.retries": "18",
    "value.converter": "MyCustomAvroConverter",
    "errors.log.enable": "true",
    "s3.bucket.name": "bucket",
    "partition.duration.ms": "3600000",
    "topics": "test_topic",
    "batch.max.rows": "100",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "value.converter.schemas.enable": "true",
    "name": "parquet_sink_test",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "rotate.schedule.interval.ms": "6000",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "schema.registry.url": "http://schema-registry:8081",
    "path.format": "'log_year'=YYYY/'log_month'=MM/'log_day'=dd/'log_hour'=HH"
  }
}
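Note that the custom converter jar (together with its dependencies) has to be available to every Connect worker, for example via the plugin.path or the classpath, and value.converter should normally be set to the fully qualified class name of the converter.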