
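Getting Started With the Kafka Connector for Azure Cosmos DB Using Docker

A local development environment for learning Kafka and Cosmos DB, with no costs involved!

by Abhishek Gupta

Aug. 04, 21 · Big Data Zone · Tutorial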


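Having a local development environment is quite handy when trying out a new service or technology. Docker has emerged as the de facto choice in such cases. It is especially useful when you are integrating multiple services, and it lets you start fresh before each run.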

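This blog post is a getting-started guide for the Kafka Connector for Azure Cosmos DB. All the components (including Azure Cosmos DB) will run on your local machine, thanks to:

The Azure Cosmos DB Linux Emulator, which can be used for local development and testing without creating an Azure subscription or incurring any costs.
Docker Compose, a tool for defining and running multi-container Docker applications. It orchestrates all the components our setup needs, including the Azure Cosmos DB emulator, Kafka, Zookeeper, the Kafka connectors, and so on.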

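To keep things simple, we will pick single-focused scenarios and go step by step:

Step 0 - A simple scenario to check that our setup works.
How to handle streaming JSON data
How to handle streaming JSON data that is not compatible with Azure Cosmos DB
How to handle Avro data using Schema Registry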

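It is assumed that you are comfortable with Kafka and have some understanding of Kafka Connect.

First things first...

...here is a quick overview of the Azure Cosmos DB Emulator and the Kafka Connector.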

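The Azure Cosmos DB connector allows you to move data between Azure Cosmos DB and Kafka. It is available as both a source and a sink. The Azure Cosmos DB Sink connector writes data from a Kafka topic to an Azure Cosmos DB container, and the Source connector writes changes from an Azure Cosmos DB container to a Kafka topic. At the time of writing, the connector is in pre-production mode. You can read more about it in the GitHub repo, or install/download it from Confluent Hub.

The Azure Cosmos DB Linux Emulator provides a local environment that emulates the Azure Cosmos DB service for development purposes (currently, it only supports the SQL API). It provides a high-fidelity emulation of the Azure Cosmos DB service and supports functionality such as creating data, querying data, provisioning and scaling containers, and executing stored procedures and triggers.

At the time of writing, the Azure Cosmos DB Linux Emulator is in preview.

You can dive deeper into how to use the emulator on macOS or Linux, how it differs from the Azure Cosmos DB cloud service, troubleshooting, and more.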

Before you start...

Make sure you have Docker and docker-compose installed.

Also, clone the project from GitHub:

git clone https://github.com/Azure-Samples/cosmosdb-kafka-connect-docker
cd cosmosdb-kafka-connect-docker

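Start all the services

All the components are defined in the docker-compose file:

Azure Cosmos DB emulator
Kafka and Zookeeper
Azure Cosmos DB and Datagen connectors (run as separate Kafka Connect workers)
Confluent Schema Registry

Thanks to Docker Compose, the environment can be brought up with a single command. The first run may take a while because the container images have to be downloaded (subsequent runs will be faster). Optionally, you can pull the images separately before starting Docker Compose: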

(optional)
docker pull confluentinc/cp-zookeeper:latest
docker pull confluentinc/cp-kafka:latest
docker pull confluentinc/cp-schema-registry:latest

To start all the services:

docker-compose -p cosmosdb-kafka-docker up --build

After a couple of minutes, check the containers:

docker-compose -p cosmosdb-kafka-docker ps
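
A running container does not necessarily mean that the service inside it is ready to accept requests. As a minimal sketch (the `retry` helper is hypothetical and not part of the sample repo), you could poll the two Kafka Connect REST endpoints used later in this post until they respond:

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (not executed here): wait for both Kafka Connect workers.
# retry 30 10 curl -sf -o /dev/null http://localhost:8083/connectors
# retry 30 10 curl -sf -o /dev/null http://localhost:8080/connectors
```

The attempt count and delay are arbitrary; adjust them to how long the images take to start on your machine.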

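Once all the services are up and running, the next logical step is to install the connector, right? Well, there are a couple of things we need to take care of first. For a Java application to connect to the Azure Cosmos DB emulator, the emulator's certificate has to be installed in the Java certificate store. In this case, we will copy the certificate from the Azure Cosmos DB emulator container into the Cosmos DB Kafka Connect container.

Although this process can be automated, I am doing it manually to make things clear.

Configure the Azure Cosmos DB Emulator Certificate

Execute this command to store the certificate in the Java certificate store (using docker exec):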

docker exec --user root -it cosmosdb-kafka-docker_cosmosdb-connector_1 /bin/bash

# execute the below command inside the container
curl -k https://cosmosdb:8081/_explorer/emulator.pem > ~/emulatorcert.crt && keytool -noprompt -storepass changeit -keypass changeit -keystore /usr/lib/jvm/zulu11-ca/lib/security/cacerts -importcert -alias emulator_cert -file ~/emulatorcert.crt

You should see this output: Certificate was added to keystore
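
If you would rather script this step, the sketch below wraps the same two commands in a non-interactive `docker exec`. The container name and keystore path are copied from the manual steps above; the function names are made up for illustration, and the whole thing is an assumption about how you might automate it, not something shipped with the sample repo.

```shell
# Container name and keystore path are copied from the manual steps above.
CONNECT_CONTAINER="${CONNECT_CONTAINER:-cosmosdb-kafka-docker_cosmosdb-connector_1}"
CACERTS="${CACERTS:-/usr/lib/jvm/zulu11-ca/lib/security/cacerts}"

# Prints the script that has to run inside the Connect container.
cert_import_script() {
  printf '%s\n' \
    'set -e' \
    'curl -fsk https://cosmosdb:8081/_explorer/emulator.pem > ~/emulatorcert.crt' \
    "keytool -noprompt -storepass changeit -keypass changeit -keystore $CACERTS -importcert -alias emulator_cert -file ~/emulatorcert.crt"
}

# Runs the script as root without -it, so no interactive terminal is needed.
import_emulator_cert() {
  docker exec --user root "$CONNECT_CONTAINER" /bin/bash -c "$(cert_import_script)"
}

# Example (not executed here):
# import_emulator_cert
```

Calling `cert_import_script` on its own prints the commands without touching Docker, which is a cheap way to review them first.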

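And there is one more thing before we move on...

Create the Azure Cosmos DB Database and Containers

Open the Azure Cosmos DB emulator portal at https://localhost:8081/_explorer/index.html and create the following resources:

Database named testdb
Containers: inventory, orders, orders_avro (make sure the partition key for all the containers is /id)

Let's explore all the scenarios

To start off, let's look at the basic scenario. Before trying anything else, we want to make sure everything works.

1. Hello World!

Start the inventory data connector for Cosmos DB: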

curl -X POST -H "Content-Type: application/json" -d @cosmosdb-inventory-connector_1.json http://localhost:8083/connectors

# to check the connector status
curl http://localhost:8083/connectors/inventory-sink/status
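
The contents of cosmosdb-inventory-connector_1.json are not reproduced in this post. As a rough sketch of what a sink connector definition of this kind could look like, the helper below writes a hypothetical stand-in file. The connector name, topic, database, and container come from this walkthrough; the `connect.cosmos.*` property names and the connector class are my assumption based on the connector's documentation, and the key is left as a placeholder. The actual file in the repo may differ.

```shell
# Writes a hypothetical stand-in for cosmosdb-inventory-connector_1.json.
# Property names are assumptions; compare with the file in the repo.
write_inventory_sink_config() {
  local out=$1
  cat > "$out" <<'EOF'
{
  "name": "inventory-sink",
  "config": {
    "connector.class": "com.azure.cosmos.kafka.connect.sink.CosmosDBSinkConnector",
    "tasks.max": "1",
    "topics": "inventory_topic",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "connect.cosmos.connection.endpoint": "https://cosmosdb:8081/",
    "connect.cosmos.master.key": "<emulator primary key>",
    "connect.cosmos.databasename": "testdb",
    "connect.cosmos.containers.topicmap": "inventory_topic#inventory"
  }
}
EOF
}

# Example (not executed here):
# write_inventory_sink_config /tmp/inventory-sink.json
```

The `topic#container` mapping is the piece that ties the Kafka topic to the Cosmos DB container created earlier.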

To test the end-to-end flow, send a few records to the inventory_topic topic in Kafka:

docker exec -it kafka bash -c 'cd /usr/bin && kafka-console-producer --topic inventory_topic --bootstrap-server kafka:29092'

Once the prompt is ready, send the JSON records one by one:

{"id": "5000","quantity": 100,"productid": 42}
{"id": "5001","quantity": 99,"productid": 43}
{"id": "5002","quantity": 98,"productid": 44}

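Check the Cosmos DB container to confirm that the records have been saved. Navigate to the portal at https://localhost:8081/_explorer/index.html and check the inventory container.

OK, it worked! Let's move on and do something slightly more useful. Before moving on, delete the inventory connector: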

curl -X DELETE http://localhost:8083/connectors/inventory-sink/

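2. Sync streaming data (JSON format) from Kafka to Azure Cosmos DB

For the remaining scenarios, we will use a producer component to generate records. The Kafka Connect Datagen connector is our friend here. It is meant for generating mock data for testing, so let's put it to good use!

Start an instance of the Azure Cosmos DB connector: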

curl -X POST -H "Content-Type: application/json" -d @cosmosdb-inventory-connector_2.json http://localhost:8083/connectors

# to check the connector status
curl http://localhost:8083/connectors/inventory-sink/status
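
The status endpoint returns a JSON document, and when scripting it is convenient to pull out just the state. The helpers below are a small sketch that does this with grep and cut so that no extra tools are needed. They assume the response lists the `connector` object before `tasks`, which is how Kafka Connect status responses are usually laid out; the function names are hypothetical.

```shell
# Reads a Kafka Connect status document on stdin and prints the connector's own state.
# Assumes the "connector" object comes before "tasks" in the response.
connector_state() {
  grep -Eo '"state" *: *"[A-Z_]+"' | head -n 1 | cut -d'"' -f4
}

# Prints one line per task state (every state after the connector's own).
task_states() {
  grep -Eo '"state" *: *"[A-Z_]+"' | tail -n +2 | cut -d'"' -f4
}

# Example (not executed here):
# curl -s http://localhost:8083/connectors/inventory-sink/status | connector_state
```

A connector can report RUNNING while one of its tasks has FAILED, so it is worth checking both.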

Once it is ready, go ahead and start the Datagen connector, which will generate mock inventory data in JSON format:

curl -X POST -H "Content-Type: application/json" -d @datagen-inventory-connector.json http://localhost:8080/connectors

# to check the connector status
curl http://localhost:8080/connectors/datagen-inventory/status

Note that we use port 8080 for the Datagen connector, since it runs in a separate Kafka Connect container.

To look at the data produced by the Datagen connector, check the inventory_topic1 Kafka topic:

docker exec -it kafka bash -c 'cd /usr/bin && kafka-console-consumer --topic inventory_topic1 --bootstrap-server kafka:29092'

Notice the data (it may be different in your case):

{"id":5,"quantity":5,"productid":5}
{"id":6,"quantity":6,"productid":6}
{"id":7,"quantity":7,"productid":7}
...

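Note that id has an Integer value.

Check the Azure Cosmos DB container to confirm that the records have been saved. Navigate to the portal at https://localhost:8081/_explorer/index.html and check the inventory container.

The records in Cosmos DB have an id attribute of String data type. The original data in the Kafka topic has an id attribute of Integer type, but that would not have worked, since Azure Cosmos DB requires id to be a unique user-defined string. The conversion is made possible by a Kafka Connect transform: Cast updates fields (or the entire key or value) to a specific type, updating the schema if one is present.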
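
To make the effect concrete, here is a tiny stand-alone illustration. It is not how Kafka Connect implements Cast (the transform works on Connect records, not on text); it only mimics, with sed, what happens to the id field of one of the sample records above.

```shell
# Illustration only: turns a numeric "id" into a string, leaving other fields alone.
cast_id_to_string() {
  sed -E 's/"id":([0-9]+)/"id":"\1"/'
}

echo '{"id":5,"quantity":5,"productid":5}' | cast_id_to_string
# prints {"id":"5","quantity":5,"productid":5}
```

Only id changes type; quantity and productid stay numeric, which matches what you see in the inventory container.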

Here is the part of the connector configuration that does the trick:

"transforms": "Cast",
"transforms.Cast.type": "org.apache.kafka.connect.transforms.Cast$Value",
"transforms.Cast.spec": "id:string"

Before moving on, delete the Cosmos DB and Datagen inventory connectors:

curl -X DELETE http://localhost:8080/connectors/datagen-inventory
curl -X DELETE http://localhost:8083/connectors/inventory-sink/

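3. Push streaming orders data (JSON format) from Kafka to Azure Cosmos DB

Now, let's switch gears and use the same kind of data (JSON format), but with a slight twist. We will use a variant of the Datagen connector to generate mock orders data, and adjust the Cosmos DB connector as well.

Install a different instance of the Azure Cosmos DB connector: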

curl -X POST -H "Content-Type: application/json" -d @cosmosdb-orders-connector_1.json http://localhost:8083/connectors

# to check the connector status
curl http://localhost:8083/connectors/orders-sink/status

Install the Datagen orders connector:

curl -X POST -H "Content-Type: application/json" -d @datagen-orders-connector.json http://localhost:8080/connectors

# to check the connector status
curl http://localhost:8080/connectors/datagen-orders/status

To look at the data produced by the Datagen connector, check the orders_topic Kafka topic:

docker exec -it kafka bash -c 'cd /usr/bin && kafka-console-consumer --topic orders_topic --bootstrap-server kafka:29092'

Notice the data (it may be different in your case):

{"ordertime":1496251410176,"orderid":3,"itemid":"Item_869","orderunits":3.2897805449886226,"address":{"city":"City_99","state":"State_46","zipcode":50570}}

{"ordertime":1500129505219,"orderid":4,"itemid":"Item_339","orderunits":3.6719921257659918,"address":{"city":"City_84","state":"State_55","zipcode":88573}}

{"ordertime":1498873571020,"orderid":5,"itemid":"Item_922","orderunits":8.4506812669258,"address":{"city":"City_48","state":"State_66","zipcode":55218}}

{"ordertime":1513855504436,"orderid":6,"itemid":"Item_545","orderunits":7.82561522361042,"address":{"city":"City_44","state":"State_71","zipcode":87868}}
...

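I chose orders data on purpose, since it is different from the inventory data. Notice that the JSON records produced by the Datagen connector have an orderid attribute (Integer data type) but no id attribute, and we know that Azure Cosmos DB will not work without one.

Check the Cosmos DB container to confirm that the records have been saved. Navigate to the portal at https://localhost:8081/_explorer/index.html and check the orders container.

Notice that there is no orderid attribute in the records stored in Azure Cosmos DB. In fact, it has been replaced by an id attribute (with a String value). This is achieved by the ReplaceField transform.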
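
The stored records suggest two steps: a rename from orderid to id, followed by a cast to string. The snippet below is only a text-level imitation of that outcome using sed, applied to a shortened version of a sample order. It is an illustration, not the way Kafka Connect applies its transforms.

```shell
# Illustration only: step 1 renames orderid to id, step 2 casts the numeric id to a string.
rename_and_cast() {
  sed -E 's/"orderid":/"id":/' | sed -E 's/"id":([0-9]+)/"id":"\1"/'
}

echo '{"ordertime":1496251410176,"orderid":3,"itemid":"Item_869"}' | rename_and_cast
# prints {"ordertime":1496251410176,"id":"3","itemid":"Item_869"}
```

The order of the two steps matters: the cast targets id, which only exists once the rename has happened.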

Here is the part of the connector configuration that makes this possible:

"transforms": "RenameField,Cast",
"transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.RenameField.renames": "orderid:id",
"transforms.Cast.type": "org.apache.kafka.connect.transforms.Cast$Value",
"transforms.Cast.spec": "id:string"

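Depending on your use case, completely removing or renaming a field may not be an ideal solution. However, it is good to know that the option exists. Also, remember that the original data in the Kafka topic is still there, untouched. Other downstream applications can still make use of it.

Before moving on, delete the Cosmos DB and Datagen orders connectors: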

curl -X DELETE http://localhost:8080/connectors/datagen-orders
curl -X DELETE http://localhost:8083/connectors/orders-sink/

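4. Push streaming orders data (Avro format) from Kafka to Azure Cosmos DB

So far we have dealt with JSON, a commonly used data format. But Avro is heavily used in production because its compact format can yield better performance and cost savings. To make it easier to work with Avro, Confluent Schema Registry provides a serving layer for your metadata and a RESTful interface for storing and retrieving Avro (as well as JSON and Protobuf) schemas. In this blog post, we will use the Docker version.

Install a new instance of the Azure Cosmos DB connector that can handle Avro: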

curl -X POST -H "Content-Type: application/json" -d @cosmosdb-orders-connector_2.json http://localhost:8083/connectors

# to check the connector status
curl http://localhost:8083/connectors/orders-sink/status

Install the Datagen connector, which will generate mock orders data in Avro format:

curl -X POST -H "Content-Type: application/json" -d @datagen-orders-connector-avro.json http://localhost:8080/connectors

# to check the connector status
curl http://localhost:8080/connectors/datagen-orders/status

To look at the Avro data produced by the Datagen connector, check the orders_avro_topic Kafka topic:

docker exec -it kafka bash -c 'cd /usr/bin && kafka-console-consumer --topic orders_avro_topic --bootstrap-server kafka:29092'

Since the Avro data is in binary format, it is not human-readable:

�����VItem_185lqf�@City_61State_73��
����WItem_219[�C��@City_74State_77��
�����VItem_7167Ix�dF�?City_53State_53��
���֩WItem_126*���?@City_58State_21��
�����VItem_329X�2,@City_49State_79��
�����XItem_886��>�|�@City_88State_27��
��V Item_956�r#�!@City_45State_96��
�ѼҕW"Item_157E�)$���?City_96State_63��
...
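
If you want to read these records as JSON, one option is the kafka-avro-console-consumer tool that ships with the Confluent Schema Registry image, which looks up the schema and decodes each record. The sketch below assumes the Schema Registry container is named schema-registry (check the output of docker-compose ps for the real name) and reuses the broker address and registry URL that appear elsewhere in this post. Treat it as a starting point rather than a verified command for this setup.

```shell
# Hypothetical container name; check `docker-compose ps` for the real one.
SCHEMA_REGISTRY_CONTAINER="${SCHEMA_REGISTRY_CONTAINER:-schema-registry}"

# Prints the consumer command for a topic (defaults to orders_avro_topic).
avro_consumer_cmd() {
  echo "kafka-avro-console-consumer --topic ${1:-orders_avro_topic} --bootstrap-server kafka:29092 --from-beginning --property schema.registry.url=http://schema-registry:8081"
}

# Runs the consumer inside the Schema Registry container.
read_avro_orders() {
  docker exec -it "$SCHEMA_REGISTRY_CONTAINER" bash -c "$(avro_consumer_cmd "$@")"
}

# Example (not executed here):
# read_avro_orders
```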

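Check the Cosmos DB container to confirm that the records have been saved. Navigate to the portal at https://localhost:8081/_explorer/index.html and check the orders_avro container.

Great, things work as expected! The connector configuration was updated to handle this: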

"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schemas.enable": "true",
"value.converter.schema.registry.url": "http://schema-registry:8081",
...

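The changes include choosing the AvroConverter, enabling schemas, and pointing to Schema Registry (in our case, running locally in Docker).

That is all for the use cases covered in this blog post. We only covered the Sink connector, but feel free to explore and experiment further! For example, you could extend the current setup to include the Source connector and configure it to send records from Azure Cosmos DB containers to Kafka.

Clean up

Once you are done, you can delete the connectors: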

curl -X DELETE http://localhost:8080/connectors/datagen-orders
curl -X DELETE http://localhost:8083/connectors/orders-sink/
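
If you have lost track of which connectors are still registered, you can list them first: a GET on /connectors returns a JSON array of names. The sketch below turns that array into one name per line and deletes each connector in turn. The helper names are made up for illustration, and the parsing is deliberately simple (it assumes connector names contain no commas, quotes, brackets, or spaces).

```shell
# Converts a JSON array of strings on stdin, e.g. ["a","b"], into one name per line.
connector_names() {
  tr -d '"[] ' | tr ',' '\n' | sed '/^$/d'
}

# Deletes every connector registered on the given Kafka Connect worker.
delete_all_connectors() {
  local base_url=$1 name
  curl -s "$base_url/connectors" | connector_names | while read -r name; do
    curl -s -X DELETE "$base_url/connectors/$name"
  done
}

# Example (not executed here):
# delete_all_connectors http://localhost:8080
# delete_all_connectors http://localhost:8083
```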

To stop all the Docker components:

docker-compose -p cosmosdb-kafka-docker down -v

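Wrap up

Although we covered simple scenarios for demonstration purposes, they show how you can leverage off-the-shelf solutions (connectors, transformers, Schema Registry, etc.) and focus on the heavy lifting that your Azure Cosmos DB-based application or data pipeline requires. Since this example adopts a Docker-based approach for local development, it is cost-effective (well, free!) and can be easily customized to suit your requirements.

For production scenarios, you would need to set up, configure, and operate these connectors. Kafka Connect workers are simply JVM processes and are therefore inherently stateless (all the state handling is offloaded to Kafka). There is a lot of flexibility in terms of your overall architecture and orchestration. For example, you could run them in Kubernetes for fault tolerance and scalability.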

Topics:

azure, cloud, nosql, docker, database, tutorial, docker compose, azure cosmos

Published at DZone with permission of Abhishek Gupta, DZone MVB. See the original article here.
