OVN probe overview

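OVN's databases are also Raft-based, and the default design is single-threaded (multi-threading can be supported, but the default ddlog implementation does not support it). OVN has two databases, NB and SB. In the HA design, the Raft leader and followers have to stay in sync incrementally, but limits on CPU, memory, disk performance, and the network can make that synchronization lag. For that reason a probe interval can be configured between cluster members and between clients and servers, so that the deployment can be adapted to these conditions.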






> > As a summary for the probe setting,
> >
> > +--------------+  driver configuration
> > |  ovn-driver  |
> > +--------------+
> >     ^    |
> >     |    v
> > +--------------+  inactivity_probe in table "Connection"
> > |  ovn-nb-db   |
> > +--------------+
> >     ^    |
> >     |    v
> > +--------------+  options:northd_probe_interval in table "NB_Global"
> > |  ovn-northd  |  in nbdb.
> > +--------------+
> >     ^    |
> >     |    v
> > +--------------+  inactivity_probe in table "Connection"
> > |  ovn-sb-db   |
> > +--------------+
> >     ^    |
> >     |    v
> > +--------------------------------+  in table "Open_vSwitch" in ovsdb-server
> > |        ovn-controller          |  ovn-remote-probe-interval for TCP
> > +--------------------------------+  probe to ovsdb-server,
> >     ^    |            ^    |        ovn-openflow-probe-interval for UNIX
> >     |    v TCP        |    v UNIX   probe to ovs-vswitchd
> > +--------------+  +--------------+
> > | ovsdb-server |  | ovs-vswitchd |
> > +--------------+  +--------------+
> >
> > Is that correct?
>
> Correct. Except that you don't have to use TCP between
> ovn-controller and the local ovsdb-server. Use UNIX and then you
> don't need to worry about the probe between them.


Probes in OVN


Listed in the order they turned up in the search.

1. ovn-ic-nb

Connection table, inactivity_probe column

      <column name="options" key="ic_probe_interval">
        <p>
          The inactivity probe interval of the connection to the OVN IC
          Northbound and Southbound databases from <code>ovn-ic</code>, in
          milliseconds.  If the value is zero, it disables the connection
          keepalive feature.
        </p>


    <group title="Client Failure Detection and Handling">
      <column name="max_backoff"> # this presumably works together with the probe
        Maximum number of milliseconds to wait between connection attempts.
        Default is implementation-specific.
      </column>

      <column name="inactivity_probe">
        Maximum number of milliseconds of idle time on connection to the client
        before sending an inactivity probe message.  If Open vSwitch does not
        communicate with the client for the specified number of seconds, it
        will send a probe.  If a response is not received for the same
        additional amount of time, Open vSwitch assumes the connection has been
        broken and attempts to reconnect.  Default is implementation-specific.
        A value of 0 disables inactivity probes.
      </column>


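In short: inactivity_probe is the maximum idle time on the connection, in milliseconds, before a probe is sent. If no response arrives within the same amount of time again, the connection is considered broken and a reconnect is attempted. The default is implementation-specific, and 0 disables inactivity probes.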
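
The timing rule in the column description can be worked through with a small calculation. The sketch below is only illustrative: probe_detect_ms is a hypothetical helper, not an OVN command, and it assumes the worst case is exactly one idle interval before the probe plus one more interval waiting for the reply.

```shell
#!/usr/bin/env bash
# Worst-case time (ms) to notice a dead peer for a given inactivity_probe (ms).
# Assumption: one idle interval passes before the probe is sent, then one
# more interval passes waiting for the reply.
probe_detect_ms() {
    local interval_ms=$1
    if [ "$interval_ms" -eq 0 ]; then
        echo "disabled"              # 0 turns inactivity probes off
    else
        echo $(( interval_ms * 2 ))
    fi
}

probe_detect_ms 5000     # prints 10000
probe_detect_ms 180000   # prints 360000
probe_detect_ms 0        # prints disabled
```

Raising the interval therefore makes the connection more tolerant of a busy peer, at the cost of noticing a truly dead peer later.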



2. ovn-ic-sb

Connection table, inactivity_probe column

    <group title="Client Failure Detection and Handling">
      <column name="max_backoff">
        Maximum number of milliseconds to wait between connection attempts.
        Default is implementation-specific.
      </column>

      <column name="inactivity_probe">
        Maximum number of milliseconds of idle time on connection to the client
        before sending an inactivity probe message.  If Open vSwitch does not
        communicate with the client for the specified number of seconds, it
        will send a probe.  If a response is not received for the same
        additional amount of time, Open vSwitch assumes the connection has been
        broken and attempts to reconnect.  Default is implementation-specific.
        A value of 0 disables inactivity probes.
      </column>

Same semantics as for ovn-ic-nb: a probe is sent after inactivity_probe milliseconds of idle time, and the connection is dropped and re-established if no reply arrives within the same interval. 0 disables probing.






3. ovn-nb


Connection table, inactivity_probe column

      <column name="options" key="northd_probe_interval">
        <p>
          The inactivity probe interval of the connection to the OVN Northbound
          and Southbound databases from <code>ovn-northd</code>, in milliseconds.
          If the value is zero, it disables the connection keepalive feature.
        </p>

        <p>
          If the value is nonzero, then it will be forced to a value of
          at least 1000 ms.
        </p>
      </column>

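In short: this is the inactivity probe interval, in milliseconds, for the connections from ovn-northd to both the OVN Northbound and Southbound databases. 0 disables the keepalive; any other value is raised to at least 1000 ms.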
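
The "forced to at least 1000 ms" rule can be sketched as a small clamp. effective_probe_ms is a hypothetical helper written for illustration only; it mirrors the wording of the option description and is not taken from the ovn-northd source.

```shell
#!/usr/bin/env bash
# Effective probe interval after applying the "at least 1000 ms" rule.
# effective_probe_ms is a hypothetical helper, not an OVN command.
effective_probe_ms() {
    local requested_ms=$1
    local floor_ms=1000
    if [ "$requested_ms" -eq 0 ]; then
        echo 0                   # zero disables the keepalive entirely
    elif [ "$requested_ms" -lt "$floor_ms" ]; then
        echo "$floor_ms"         # small nonzero values are raised to the floor
    else
        echo "$requested_ms"
    fi
}

effective_probe_ms 0        # prints 0
effective_probe_ms 200      # prints 1000
effective_probe_ms 180000   # prints 180000
```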




      <column name="options" key="ic_probe_interval">
        <p>
          The inactivity probe interval of the connection to the OVN Northbound
          and Southbound databases from <code>ovn-ic</code>, in milliseconds.
          If the value is zero, it disables the connection keepalive feature.
        </p>

      <column name="options" key="nbctl_probe_interval">
        <p>
          The inactivity probe interval of the connection to the OVN Northbound
          database from <code>ovn-nbctl</code> utility, in milliseconds.
          If the value is zero, it disables the connection keepalive feature.
        </p>
        # keepalive from the NB client (ovn-nbctl) to the database



    <group title="Client Failure Detection and Handling">
      <column name="max_backoff">
        Maximum number of milliseconds to wait between connection attempts.
        Default is implementation-specific.
      </column>

      <column name="inactivity_probe">
        Maximum number of milliseconds of idle time on connection to the client
        before sending an inactivity probe message.  If Open vSwitch does not
        communicate with the client for the specified number of seconds, it
        will send a probe.  If a response is not received for the same
        additional amount of time, Open vSwitch assumes the connection has been
        broken and attempts to reconnect.  Default is implementation-specific.
        A value of 0 disables inactivity probes.
      </column>


Same semantics as for ovn-ic-nb: a probe is sent after inactivity_probe milliseconds of idle time, and the connection is dropped and re-established if no reply arrives within the same interval. 0 disables probing.


4. ovn-sb
Connection table, inactivity_probe column


        <column name="options" key="sbctl_probe_interval">
          <p>
            The inactivity probe interval of the connection to the OVN
            Southbound database from <code>ovn-sbctl</code> utility, in
            milliseconds.  If the value is zero, it disables the connection
            keepalive feature.
          </p>

          # keepalive from the SB client (ovn-sbctl) to the SB database


    <group title="Client Failure Detection and Handling">
      <column name="max_backoff">
        Maximum number of milliseconds to wait between connection attempts.
        Default is implementation-specific.
      </column>

      <column name="inactivity_probe">
        Maximum number of milliseconds of idle time on connection to the client
        before sending an inactivity probe message.  If Open vSwitch does not
        communicate with the client for the specified number of seconds, it
        will send a probe.  If a response is not received for the same
        additional amount of time, Open vSwitch assumes the connection has been
        broken and attempts to reconnect.  Default is implementation-specific.
        A value of 0 disables inactivity probes.
      </column>
    </group>


Same semantics as for ovn-ic-nb: a probe is sent after inactivity_probe milliseconds of idle time, and the connection is dropped and re-established if no reply arrives within the same interval. 0 disables probing.


5. ovn-controller

      <dt><code>external_ids:ovn-remote-probe-interval</code></dt>
      <dd>
        <p>
          The inactivity probe interval of the connection to the OVN database,
          in milliseconds.
          If the value is zero, it disables the connection keepalive feature.
        </p>

        <p>
          If the value is nonzero, then it will be forced to a value of
          at least 1000 ms.
        </p>
      </dd>

      <dt><code>external_ids:ovn-openflow-probe-interval</code></dt>
      <dd>
        <p>
          The inactivity probe interval of the OpenFlow connection to the
          OpenvSwitch integration bridge, in seconds.
          If the value is zero, it disables the connection keepalive feature.
        </p>

        <p>
          If the value is nonzero, then it will be forced to a value of
          at least 5s.
        </p>
      </dd>
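
The two ovn-controller keys use different units (milliseconds for ovn-remote-probe-interval, seconds for ovn-openflow-probe-interval), which is easy to get wrong when setting both to the same timeout. The sketch below converts one timeout in seconds into the value each key expects. probe_value is a hypothetical helper; the units are taken from the key descriptions above, and the loop only prints the ovs-vsctl commands rather than running them.

```shell
#!/usr/bin/env bash
# Convert a timeout in seconds into the unit each ovn-controller key expects.
# probe_value is a hypothetical helper, not part of OVN or Open vSwitch.
probe_value() {
    local key=$1 seconds=$2
    case "$key" in
        ovn-remote-probe-interval)   echo $(( seconds * 1000 )) ;;  # milliseconds
        ovn-openflow-probe-interval) echo "$seconds" ;;             # seconds
        *) echo "unknown key: $key" >&2; return 1 ;;
    esac
}

for key in ovn-remote-probe-interval ovn-openflow-probe-interval; do
    # Dry run: only prints the command. Remove "echo" to apply it.
    echo ovs-vsctl set open . "external-ids:${key}=$(probe_value "$key" 180)"
done
```

For a 180-second timeout this prints 180000 for the remote probe and 180 for the OpenFlow probe.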



6. ovn-controller-vtep
      <dt><code>other_config:ovn-remote-probe-interval</code></dt>
      <dd>
        <p>
          The inactivity probe interval of the connection to the OVN Southbound
          database, in milliseconds. If the value is zero, it disables the
          connection keepalive feature.
        </p>

        <p>
          If the value is nonzero, then it will be forced to a value of at
          least 1000 ms.
        </p>
      </dd>
    </dl>
    </p>




# TODO://

* OVN OCF pacemaker script to support Active / Passive HA for OVN dbs provides
  the option to configure the inactivity_probe value. The default 5 seconds
  inactivity_probe value is not sufficient and ovsdb-server drops the client
  IDL connections for openstack deployments when the neutron server is heavily
  loaded.

  We need to find a proper solution to solve this issue instead of increasing
  the inactivity_probe value.




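1. There appear to be quite a few timeouts. Try raising the following values on ovn-central
(these settings should already exist by default):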


ovn-nbctl --no-leader-only set NB_Global . options:northd_probe_interval=180000
ovn-nbctl --no-leader-only set Connection . inactivity_probe=180000
ovn-sbctl --no-leader-only set Connection . inactivity_probe=180000



ovn-nbctl --no-leader-only get Connection . inactivity_probe
ovn-sbctl --no-leader-only get Connection . inactivity_probe



and those on ovn-controller (note that ovn-openflow-probe-interval is in seconds, not milliseconds):

ovs-vsctl set open . external-ids:ovn-remote-probe-interval=180000
ovs-vsctl set open . external-ids:ovn-openflow-probe-interval=180



Try raising these defaults and see whether it helps.




if [[ -z "$NODE_IPS" ]]; then
echo "no node ip"
else
echo "has node ip"
fi




            ovn-nbctl --no-leader-only set NB_Global . options:northd_probe_interval=180000
            ovn-nbctl --no-leader-only set NB_Global . options:use_logical_dp_groups=true



            # inactivity_probe is a column of the Connection table, not an NB_Global/SB_Global option
            ovn-nbctl --no-leader-only set Connection . inactivity_probe=180000
            ovn-sbctl --no-leader-only set Connection . inactivity_probe=180000


Read the values back:

            ovn-nbctl --no-leader-only get NB_Global . options:northd_probe_interval
            ovn-nbctl --no-leader-only get NB_Global . options:use_logical_dp_groups



            ovn-nbctl --no-leader-only get Connection . inactivity_probe
            ovn-sbctl --no-leader-only get Connection . inactivity_probe

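One setting (the backoff) controls the interval at which the client retries its connection to the server, following a backoff algorithm.

The other (the inactivity timeout) controls how the client decides that a connection is dead, which then triggers an immediate reconnect.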
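
To make the backoff side concrete, the sketch below prints a sequence of reconnect delays that grow until they hit a cap such as max_backoff. It is an illustration only: backoff_schedule is a hypothetical helper, and the 1000 ms starting delay and the doubling factor are assumptions, not values read from any particular client.

```shell
#!/usr/bin/env bash
# Print reconnect delays (ms) that double until they reach max_ms.
# Assumptions (illustrative only): first delay of 1000 ms, doubling each time.
backoff_schedule() {
    local max_ms=$1 attempts=$2
    local delay_ms=1000 i
    if [ "$delay_ms" -gt "$max_ms" ]; then
        delay_ms=$max_ms             # never start above the cap
    fi
    for (( i = 0; i < attempts; i++ )); do
        echo "$delay_ms"
        delay_ms=$(( delay_ms * 2 ))
        if [ "$delay_ms" -gt "$max_ms" ]; then
            delay_ms=$max_ms         # cap at max_backoff
        fi
    done
}

backoff_schedule 8000 5   # prints 1000, 2000, 4000, 8000, 8000 (one per line)
```

The inactivity timeout is independent of this schedule: it decides when a connection counts as dead, while the backoff decides how quickly the retries that follow are spaced.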

github.com/ovn-org/lib…



type options struct {
	endpoints             []string
	tlsConfig             *tls.Config
	reconnect             bool
	leaderOnly            bool
	timeout               time.Duration
	backoff               backoff.BackOff
	logger                *logr.Logger
	registry              prometheus.Registerer
	shouldRegisterMetrics bool   // in case metrics are changed after-the-fact
	metricNamespace       string // prometheus metric namespace
	metricSubsystem       string // prometheus metric subsystem
	inactivityTimeout     time.Duration
}

References:

mail.openvswitch.org/pipermail/o…