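In disaster recovery or planned migration scenarios, we need to switch message-queue traffic smoothly to another cluster. Because message-queue clients hold long-lived connections, this cannot be done simply by repointing a domain name, as we would for ordinary services. The switch has to be smooth, and no data may be lost or duplicated while it happens. As far as we know, open-source Kafka does not currently provide this capability. Pulsar does support failover, but several components have to cooperate. It can be set up with the following steps.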
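- Configure the global ZK on the brokers
- Enable multi-cluster replication on the namespace
- Configure the failover policy and subscription-progress replication on the client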
1. Configure the global ZK
Topic metadata is shared between the clusters through the global ZK.
| Cluster | Service address | Local ZK | Global ZK |
|---|---|---|---|
| c1 | 10.224.144.74 | 10.212.312.164:2181/c1 | 10.212.312.164:2181/global |
| c2 | 10.236.164.98 | 10.212.312.164:2181/c2 | 10.212.312.164:2181/global |
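Initialize clusters c1 and c2. metadataStore is the local ZK, which stores broker metadata; configurationStore is the global ZK, which stores topic metadata.
# Step 0: initialize clusters c1 and c2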
bin/pulsar initialize-cluster-metadata \
--cluster c1 \
--zookeeper 10.212.312.164:2181/c1 \
--configuration-store 10.212.312.164:2181/global \
--web-service-url http://10.224.144.74:8080/ \
--broker-service-url pulsar://10.224.144.74:6650;
bin/pulsar initialize-cluster-metadata \
--cluster c2 \
--zookeeper 10.212.312.164:2181/c2 \
--configuration-store 10.212.312.164:2181/global \
--web-service-url http://10.236.164.98:8080/ \
--broker-service-url pulsar://10.236.164.98:6650;
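The brokers themselves also have to point at the same two stores. The fragment below is a sketch of what `conf/broker.conf` on a c1 broker might contain, assuming a recent Pulsar release that uses the `metadataStoreUrl` and `configurationMetadataStoreUrl` keys (older releases use `zookeeperServers` and `configurationStoreServers` instead). The addresses are the ones from the table above; c2 brokers would use `clusterName=c2` and the `/c2` chroot.

```properties
# conf/broker.conf on a c1 broker (sketch, not a complete configuration)
clusterName=c1
# Local ZK: broker metadata for this cluster only
metadataStoreUrl=zk:10.212.312.164:2181/c1
# Global ZK: configuration shared by c1 and c2
configurationMetadataStoreUrl=zk:10.212.312.164:2181/global
```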
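2. Enable data replication
Once replication is enabled at the namespace or topic level, the broker asynchronously writes topic data to the backup Pulsar cluster. By default, Pulsar replicates only incremental data, that is, messages produced after replication is enabled.
# Step 1: connect the clusters
# Configure the connection from c1 to c2; run this on c1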
bin/pulsar-admin --admin-url http://10.224.144.74:8080 clusters create \
--broker-url pulsar://10.236.164.98:6650 \
--url http://10.236.164.98:8080 \
c2
# Step 2: grant the tenant access to both clusters
bin/pulsar-admin --admin-url http://10.224.144.74:8080 tenants update public \
--admin-roles my-admin-role \
--allowed-clusters c1,c2
# Step 3: enable geo-replication
bin/pulsar-admin --admin-url http://10.224.144.74:8080 namespaces set-clusters public/default --clusters c1,c2;
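Once replication is configured, data is replicated in both directions between the two clusters. Producers and consumers do not have to be on the same cluster: messages can be consumed from either cluster and none are lost. By default, however, consumer progress is not replicated. This can be verified as follows:
# Verify data replication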
bin/pulsar-client --url pulsar://10.224.144.74:6650 produce public/default/tp-test --messages "produce from c1,1"
bin/pulsar-client --url pulsar://10.236.164.98:6650 produce public/default/tp-test --messages "produce from c2,1"
bin/pulsar-client --url pulsar://10.224.144.74:6650 consume public/default/tp-test -s "s-c1" -n 1000 -p Earliest
bin/pulsar-client --url pulsar://10.236.164.98:6650 consume public/default/tp-test -s "s-c2" -n 1000 -p Earliest
# Subscription s-c1 on c2 starts from the beginning, because its progress on c1 was not replicated
bin/pulsar-client --url pulsar://10.236.164.98:6650 consume public/default/tp-test -s "s-c1" -n 1000 -p Earliest
To replicate consumer progress, you can manually turn on replication for an existing subscription:
bin/pulsar-admin --admin-url http://10.224.144.74:8080 topics set-replicated-subscription-status --subscription=s-c1 persistent://public/default/tp-test --enable
Alternatively, enable subscription-progress replication in the consumer client:
// https://pulsar.apache.org/docs/4.0.x/administration-geo/#replicated-subscriptions
Consumer<String> consumer = client.newConsumer(Schema.STRING)
        .topic("my-topic")
        .subscriptionName("my-subscription")
        .replicateSubscriptionState(true)
        .subscribe();
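3. Enable failover
The client must be configured with a failover policy when it is initialized. There are two modes, automatic and manual (controlled). Controlled failover requires you to provide a urlProvider service.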
/**
 * Controlled (manual) failover
 */
public PulsarClient getControlledFailoverClient() throws IOException {
    // Headers sent with every request to the urlProvider service
    Map<String, String> header = new HashMap<>();
    header.put("service_user_id", "my-user");
    header.put("service_password", "tiger");
    header.put("clusterA", "tokenA");
    header.put("clusterB", "tokenB");
    // Poll the urlProvider once a minute for the service URL to use
    ServiceUrlProvider provider = ControlledClusterFailover.builder()
            .defaultServiceUrl("pulsar://localhost:6650")
            .checkInterval(1, TimeUnit.MINUTES)
            .urlProvider("http://localhost:8080/test")
            .urlProviderHeader(header)
            .build();
    PulsarClient pulsarClient = PulsarClient.builder().serviceUrlProvider(provider).build();
    provider.initialize(pulsarClient);
    return pulsarClient;
}
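The urlProvider service mentioned above is something you have to run yourself. The following is a minimal sketch of such a service using only the JDK's built-in HTTP server. It assumes the client polls the URL with an HTTP GET and expects a JSON object with a `serviceUrl` field; check the Pulsar client documentation for the exact response format and optional fields. The class name `UrlProviderSketch` and the `active` switch are hypothetical, and the cluster addresses are the ones from the table earlier in this article. A real service would also validate the headers the client sends.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical URL provider: tells clients which cluster is currently active. */
class UrlProviderSketch {
    // Assumed broker addresses of c1 and c2
    static final String C1 = "pulsar://10.224.144.74:6650";
    static final String C2 = "pulsar://10.236.164.98:6650";

    // The currently active cluster; an operator flips this to trigger the switch
    static final AtomicReference<String> active = new AtomicReference<>(C1);

    /** Builds the JSON body; only the serviceUrl field is filled in here. */
    static String responseBody(String serviceUrl) {
        return "{\"serviceUrl\": \"" + serviceUrl + "\"}";
    }

    /** Starts an HTTP server that answers requests to /test with the active cluster. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/test", exchange -> {
            byte[] body = responseBody(active.get()).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8080);
        // To fail over, an admin endpoint or config watcher would call:
        // active.set(C2);
    }
}
```

With this design the switch is a single operator action on the provider: once `active` changes, each client picks up the new URL at its next check interval, without a restart.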
/**
 * Automatic failover
 */
private PulsarClient getAutoFailoverClient() throws PulsarClientException {
    String primaryUrl = "pulsar+ssl://localhost:6651";
    String secondaryUrl = "pulsar+ssl://localhost:6661";
    String primaryTlsTrustCertsFilePath = "primary/path";
    String secondaryTlsTrustCertsFilePath = "secondary/path";
    Authentication primaryAuthentication = AuthenticationFactory.create("org.apache.pulsar.client.impl.auth.AuthenticationTls",
            "tlsCertFile:/path/to/primary-my-role.cert.pem," + "tlsKeyFile:/path/to/primary-role.key-pk8.pem");
    Authentication secondaryAuthentication = AuthenticationFactory.create("org.apache.pulsar.client.impl.auth.AuthenticationTls",
            "tlsCertFile:/path/to/secondary-my-role.cert.pem," + "tlsKeyFile:/path/to/secondary-role.key-pk8.pem");
    // Additional failover clusters can be added to these maps, keyed by service URL
    Map<String, String> secondaryTlsTrustCertsFilePaths = new HashMap<>();
    secondaryTlsTrustCertsFilePaths.put(secondaryUrl, secondaryTlsTrustCertsFilePath);
    Map<String, Authentication> secondaryAuthentications = new HashMap<>();
    secondaryAuthentications.put(secondaryUrl, secondaryAuthentication);
    ServiceUrlProvider failover = AutoClusterFailover.builder()
            .primary(primaryUrl)
            .secondary(List.of(secondaryUrl))
            .failoverDelay(30, TimeUnit.SECONDS)
            .switchBackDelay(60, TimeUnit.SECONDS)
            .checkInterval(1000, TimeUnit.MILLISECONDS)
            .secondaryTlsTrustCertsFilePath(secondaryTlsTrustCertsFilePaths)
            .secondaryAuthentication(secondaryAuthentications)
            .build();
    PulsarClient pulsarClient = PulsarClient.builder()
            .serviceUrlProvider(failover)
            .authentication(primaryAuthentication)
            .tlsTrustCertsFilePath(primaryTlsTrustCertsFilePath)
            .build();
    failover.initialize(pulsarClient);
    return pulsarClient;
}
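To make the roles of failoverDelay, switchBackDelay and checkInterval in the automatic mode more concrete, here is a simplified model of the decision logic. It is not Pulsar's implementation: it assumes a single secondary that is always healthy and only tracks how long the primary has been continuously down or continuously recovered. The class `FailoverTimingSketch` and its method names are made up for illustration.

```java
/** Simplified model of delay-based failover between a primary and one secondary. */
class FailoverTimingSketch {
    final long failoverDelayMs;
    final long switchBackDelayMs;
    boolean onPrimary = true;
    long primaryDownSince = -1; // -1: primary not currently observed as down
    long primaryUpSince = -1;   // -1: primary not currently observed as recovered

    FailoverTimingSketch(long failoverDelayMs, long switchBackDelayMs) {
        this.failoverDelayMs = failoverDelayMs;
        this.switchBackDelayMs = switchBackDelayMs;
    }

    /** Feeds one probe result; returns true if traffic should go to the primary. */
    boolean onProbe(long nowMs, boolean primaryHealthy) {
        if (onPrimary) {
            if (primaryHealthy) {
                primaryDownSince = -1; // outage streak broken
            } else {
                if (primaryDownSince < 0) primaryDownSince = nowMs;
                // Fail over only after the primary has been down for the full delay
                if (nowMs - primaryDownSince >= failoverDelayMs) {
                    onPrimary = false;
                    primaryUpSince = -1;
                }
            }
        } else {
            if (!primaryHealthy) {
                primaryUpSince = -1; // recovery streak broken
            } else {
                if (primaryUpSince < 0) primaryUpSince = nowMs;
                // Switch back only after the primary has been healthy for the full delay
                if (nowMs - primaryUpSince >= switchBackDelayMs) {
                    onPrimary = true;
                    primaryDownSince = -1;
                }
            }
        }
        return onPrimary;
    }

    public static void main(String[] args) {
        // Same values as the client above: 30 s failover delay, 60 s switch-back delay
        FailoverTimingSketch f = new FailoverTimingSketch(30_000, 60_000);
        boolean last = true;
        // Probe once per second; the primary is down from t=0 through t=45 s
        for (long t = 0; t <= 120_000; t += 1_000) {
            boolean healthy = t > 45_000;
            boolean now = f.onProbe(t, healthy);
            if (now != last) {
                System.out.println("t=" + (t / 1000) + "s -> " + (now ? "primary" : "secondary"));
                last = now;
            }
        }
    }
}
```

Running `main` prints `t=30s -> secondary` followed by `t=106s -> primary`. The two delays act as hysteresis: a short blip on the primary does not move traffic, and a briefly recovered primary does not pull traffic back.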
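The approach above has to be planned when the clusters are first built, so that every client follows the convention. If failover is not configured at the client level and you simply change the DNS record instead, you run into the following problems:
- Clients hold long-lived connections, so after the DNS record changes their traffic stays on the old cluster.
- Even after the old cluster is taken offline, clients do not re-resolve the domain name and reconnect to the new cluster; every client has to be restarted.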
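The first problem can be reproduced without Pulsar at all. The sketch below uses two local server sockets as stand-ins for the old and new clusters and a plain map as a stand-in for DNS. The class `StickyConnectionSketch` and the name `mq.example.internal` are hypothetical. It only shows that an already established connection keeps pointing at the address it was resolved to; it does not model Pulsar's reconnect behaviour.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Shows that a long-lived connection ignores later changes to name resolution. */
class StickyConnectionSketch {
    // Stand-in for DNS: maps a service name to a local port
    static final Map<String, Integer> dns = new ConcurrentHashMap<>();

    /** Resolves the name once and opens a connection to the result. */
    static Socket connect(String name) throws IOException {
        return new Socket("127.0.0.1", dns.get(name));
    }

    public static void main(String[] args) throws IOException {
        String name = "mq.example.internal";
        try (ServerSocket oldCluster = new ServerSocket(0);
             ServerSocket newCluster = new ServerSocket(0)) {
            dns.put(name, oldCluster.getLocalPort());
            try (Socket longLived = connect(name)) {
                // "Switch the domain" to the new cluster
                dns.put(name, newCluster.getLocalPort());
                System.out.println("existing connection still on old cluster: "
                        + (longLived.getPort() == oldCluster.getLocalPort()));
                // Only a connection opened after the change lands on the new cluster
                try (Socket fresh = connect(name)) {
                    System.out.println("new connection on new cluster: "
                            + (fresh.getPort() == newCluster.getLocalPort()));
                }
            }
        }
    }
}
```

Both lines print `true`: the existing connection is unaffected by the changed record, and only a fresh connection reaches the new cluster. That is why the switch has to be driven from inside the client.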