Pulsar Cluster-Level Failover


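In disaster recovery or planned migration scenarios, we need to switch message queue traffic smoothly to another cluster. Because message queue clients hold long-lived connections, the familiar approach of repointing a domain name does not work on its own: the switch has to be smooth, and no data may be lost or duplicated along the way. As far as I know, open-source Kafka does not currently provide such a feature. Pulsar does support failover, but it needs cooperation between several components and can be set up with the following steps.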

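  1. Configure a global ZK on the brokers
  2. Enable multi-cluster replication on the namespace
  3. Configure a failover policy and consumer-progress replication on the client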

1. Configure the global ZK

Topic metadata is shared between the clusters through the global ZK.

Cluster name   Access address   Local ZK                  Global ZK
c1             10.224.144.74    10.212.312.164:2181/c1    10.212.312.164:2181/global
c2             10.236.164.98    10.212.312.164:2181/c2    10.212.312.164:2181/global

Initialize clusters c1 and c2. metadataStore refers to the local ZK, which stores broker metadata; configurationStore refers to the global ZK, which stores topic metadata.

# Step 0: create clusters c1 and c2
bin/pulsar initialize-cluster-metadata \
--cluster c1 \
--zookeeper 10.212.312.164:2181/c1 \
--configuration-store 10.212.312.164:2181/global \
--web-service-url http://10.224.144.74:8080/ \
--broker-service-url pulsar://10.224.144.74:6650;

bin/pulsar initialize-cluster-metadata \
--cluster c2 \
--zookeeper 10.212.312.164:2181/c2 \
--configuration-store 10.212.312.164:2181/global \
--web-service-url http://10.236.164.98:8080/ \
--broker-service-url pulsar://10.236.164.98:6650;

2. Enable data replication

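Once replication is enabled at the namespace or topic level, the broker asynchronously writes topic data to the backup Pulsar cluster. By default, Pulsar replicates only incremental data, that is, messages published after replication is enabled.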

# Step 1: connect the clusters
# Configure the connection from c1 to c2; run this against c1
bin/pulsar-admin --admin-url http://10.224.144.74:8080 clusters create \
--broker-url pulsar://10.236.164.98:6650 \
--url http://10.236.164.98:8080 \
c2
# Step 2: grant the tenant access to both clusters
bin/pulsar-admin --admin-url http://10.224.144.74:8080 tenants update public \
--admin-roles my-admin-role \
--allowed-clusters c1,c2
# Step 3: enable geo-replication
bin/pulsar-admin --admin-url http://10.224.144.74:8080 namespaces set-clusters public/default --clusters c1,c2;
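
For automation, step 3 can also be issued over the admin REST API instead of the CLI. The following is a minimal sketch using only the JDK. It assumes the v2 admin route `/admin/v2/namespaces/{tenant}/{namespace}/replication` is what `namespaces set-clusters` maps to; the class name `SetClustersRequestSketch` is made up for illustration, and the code only builds the request without sending it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Pulsar: builds the HTTP request that
// sets the replication clusters of a namespace.
final class SetClustersRequestSketch {

    // adminUrl:  e.g. "http://10.224.144.74:8080" (web service URL of c1)
    // namespace: "tenant/namespace", e.g. "public/default"
    // clusters:  cluster names to replicate between, e.g. ["c1", "c2"]
    static HttpRequest build(String adminUrl, String namespace, List<String> clusters) {
        // The request body is a JSON array of cluster names: ["c1","c2"]
        String body = clusters.stream()
                .map(c -> "\"" + c + "\"")
                .collect(Collectors.joining(",", "[", "]"));
        // Assumed route; verify it against the admin REST reference of your Pulsar version.
        URI uri = URI.create(adminUrl + "/admin/v2/namespaces/" + namespace + "/replication");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding())`. Authentication headers are left out here and would be needed on a secured cluster.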

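With replication configured, data is replicated in both directions between the two clusters. Producers and consumers do not have to be on the same cluster: either side can consume all the data and nothing is lost. By default, however, consumer progress is not replicated. This can be verified as follows: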

# Verify data replication
bin/pulsar-client --url pulsar://10.224.144.74:6650 produce public/default/tp-test --messages "produce from c1,1"
bin/pulsar-client --url pulsar://10.236.164.98:6650 produce public/default/tp-test --messages "produce from c2,1"
bin/pulsar-client --url pulsar://10.224.144.74:6650 consume public/default/tp-test -s "s-c1" -n 1000 -p Earliest
bin/pulsar-client --url pulsar://10.236.164.98:6650 consume public/default/tp-test -s "s-c2" -n 1000 -p Earliest
# s-c1 on c2 consumes from the beginning, because its progress on c1 was not replicated
bin/pulsar-client --url pulsar://10.236.164.98:6650 consume public/default/tp-test -s "s-c1" -n 1000 -p Earliest

To replicate consumer progress, you can manually turn on replication for an existing subscription:

bin/pulsar-admin --admin-url http://10.224.144.74:8080 topics set-replicated-subscription-status --subscription=s-c1 persistent://public/default/tp-test --enable

Alternatively, enable subscription-state replication in the consumer client:

// https://pulsar.apache.org/docs/4.0.x/administration-geo/#replicated-subscriptions
Consumer<String> consumer = client.newConsumer(Schema.STRING)
            .topic("my-topic")
            .subscriptionName("my-subscription")
            .replicateSubscriptionState(true)
            .subscribe();

3. Enable failover

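The client must be configured with a failover policy when it is initialized. There are two modes, automatic and manual (controlled); manual failover requires you to provide a urlProvider service.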


    /**
     * Manual (controlled) failover
     */
    public PulsarClient getControlledFailoverClient() throws IOException {
        // Headers sent with each request to the urlProvider service
        Map<String, String> header = new HashMap<>();
        header.put("service_user_id", "my-user");
        header.put("service_password", "tiger");
        header.put("clusterA", "tokenA");
        header.put("clusterB", "tokenB");

        ServiceUrlProvider provider = ControlledClusterFailover.builder()
                .defaultServiceUrl("pulsar://localhost:6650")
                .checkInterval(1, TimeUnit.MINUTES)
                .urlProvider("http://localhost:8080/test")
                .urlProviderHeader(header)
                .build();

        PulsarClient pulsarClient = PulsarClient.builder()
                .serviceUrlProvider(provider)
                .build();

        provider.initialize(pulsarClient);
        return pulsarClient;
    }
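
The urlProvider endpoint that the manual failover client polls has to be supplied by you. Below is a rough sketch of such a service built on the JDK's bundled `com.sun.net.httpserver`. It assumes the client expects a JSON object whose `serviceUrl` field names the cluster to use, as described in the Pulsar cluster-level failover documentation; the class name `UrlProviderSketch`, the `ACTIVE` field and the choice of c1 as the initial target are illustrative only, and header validation is omitted.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical urlProvider service, not part of Pulsar.
final class UrlProviderSketch {

    // Cluster that clients should currently use; an operator changes this to trigger failover.
    static final AtomicReference<String> ACTIVE =
            new AtomicReference<>("pulsar://10.224.144.74:6650"); // c1

    // Minimal response body. Only serviceUrl is filled in here; TLS and
    // authentication settings are left out of this sketch.
    static String responseJson(String serviceUrl) {
        return "{\"serviceUrl\":\"" + serviceUrl + "\"}";
    }

    // Starts an HTTP server that answers requests to /test with the active service URL.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/test", exchange -> {
            byte[] body = responseJson(ACTIVE.get()).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

With `UrlProviderSketch.start(8080)` running, calling `UrlProviderSketch.ACTIVE.set("pulsar://10.236.164.98:6650")` would point clients at c2 the next time they poll, which is at most one checkInterval later.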

    /**
     * Automatic failover
     */
    private PulsarClient getAutoFailoverClient() throws PulsarClientException {
        String primaryUrl = "pulsar+ssl://localhost:6651";
        String secondaryUrl = "pulsar+ssl://localhost:6661";
        String primaryTlsTrustCertsFilePath = "primary/path";
        String secondaryTlsTrustCertsFilePath = "secondary/path";
        Authentication primaryAuthentication = AuthenticationFactory.create(
                "org.apache.pulsar.client.impl.auth.AuthenticationTls",
                "tlsCertFile:/path/to/primary-my-role.cert.pem,"
                        + "tlsKeyFile:/path/to/primary-role.key-pk8.pem");
        Authentication secondaryAuthentication = AuthenticationFactory.create(
                "org.apache.pulsar.client.impl.auth.AuthenticationTls",
                "tlsCertFile:/path/to/secondary-my-role.cert.pem,"
                        + "tlsKeyFile:/path/to/secondary-role.key-pk8.pem");

        // More failover clusters can be added to these maps, keyed by service URL
        Map<String, String> secondaryTlsTrustCertsFilePaths = new HashMap<>();
        secondaryTlsTrustCertsFilePaths.put(secondaryUrl, secondaryTlsTrustCertsFilePath);
        Map<String, Authentication> secondaryAuthentications = new HashMap<>();
        secondaryAuthentications.put(secondaryUrl, secondaryAuthentication);

        ServiceUrlProvider failover = AutoClusterFailover.builder()
                .primary(primaryUrl)
                .secondary(List.of(secondaryUrl))
                .failoverDelay(30, TimeUnit.SECONDS)
                .switchBackDelay(60, TimeUnit.SECONDS)
                .checkInterval(1000, TimeUnit.MILLISECONDS)
                .secondaryTlsTrustCertsFilePath(secondaryTlsTrustCertsFilePaths)
                .secondaryAuthentication(secondaryAuthentications)
                .build();

        PulsarClient pulsarClient = PulsarClient.builder()
                .serviceUrlProvider(failover)
                .authentication(primaryAuthentication)
                .tlsTrustCertsFilePath(primaryTlsTrustCertsFilePath)
                .build();

        failover.initialize(pulsarClient);
        return pulsarClient;
    }
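
To make the roles of failoverDelay and switchBackDelay in the automatic mode more concrete, here is a simplified model of the timing decision with a single secondary cluster. It is a sketch of the idea under stated assumptions, not Pulsar's actual implementation: the class `FailoverTimingSketch` and its `onProbe` method are hypothetical, probe results are passed in by the caller, and the secondary is assumed to be healthy throughout.

```java
import java.time.Duration;

// Simplified model (not Pulsar code) of when a client switches between clusters.
final class FailoverTimingSketch {
    private final long failoverDelayMs;
    private final long switchBackDelayMs;
    private boolean onPrimary = true;
    private long primaryDownSince = -1; // -1: primary is not in a failure streak
    private long primaryUpSince = -1;   // -1: primary is not in a recovery streak

    FailoverTimingSketch(Duration failoverDelay, Duration switchBackDelay) {
        this.failoverDelayMs = failoverDelay.toMillis();
        this.switchBackDelayMs = switchBackDelay.toMillis();
    }

    // Feeds one health-probe result taken at nowMs.
    // Returns true if the client should be using the primary after this probe.
    boolean onProbe(long nowMs, boolean primaryHealthy) {
        if (onPrimary) {
            if (primaryHealthy) {
                primaryDownSince = -1; // failure streak broken
            } else {
                if (primaryDownSince < 0) primaryDownSince = nowMs;
                // Fail over only after the primary has been down for the whole delay.
                if (nowMs - primaryDownSince >= failoverDelayMs) {
                    onPrimary = false;
                    primaryDownSince = -1;
                    primaryUpSince = -1;
                }
            }
        } else {
            if (!primaryHealthy) {
                primaryUpSince = -1; // recovery streak broken
            } else {
                if (primaryUpSince < 0) primaryUpSince = nowMs;
                // Switch back only after the primary has been healthy for the whole delay.
                if (nowMs - primaryUpSince >= switchBackDelayMs) {
                    onPrimary = true;
                    primaryUpSince = -1;
                }
            }
        }
        return onPrimary;
    }

    public static void main(String[] args) {
        // Same delays as the client configuration above: 30 s and 60 s.
        FailoverTimingSketch s =
                new FailoverTimingSketch(Duration.ofSeconds(30), Duration.ofSeconds(60));
        System.out.println(s.onProbe(0, false));       // true: primary just went down
        System.out.println(s.onProbe(30_000, false));  // false: down for 30 s, fail over
        System.out.println(s.onProbe(40_000, true));   // false: primary is back, keep waiting
        System.out.println(s.onProbe(100_000, true));  // true: healthy for 60 s, switch back
    }
}
```

The two delays act as hysteresis: a brief blip on the primary does not move traffic, and a primary that has only just recovered is not trusted immediately.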

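The approach above has to be planned when the clusters are first set up, so that every client follows the convention. If failover is not configured at the client level and you simply change DNS resolution instead, you run into the following problems: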

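  1. Clients hold long-lived connections, so after the domain is repointed their traffic stays on the old cluster.
  2. Even if the old cluster is taken offline, clients do not re-resolve the domain and reconnect to the new cluster; every client has to be restarted.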

References

  1. Geo Replication (Apache Pulsar documentation)
  2. Cluster-level failover (Apache Pulsar documentation)