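The OVN databases are also clustered with Raft, and the default design is single-threaded (multithreading can be supported, but the default ddlog implementation does not support it). OVN has an NB and an SB database. In the HA design, Raft requires incremental synchronization between the leader and the followers, but CPU, memory, disk performance, and network limits mean that this synchronization can lag or run into problems. For that reason a probe interval can be configured between the cluster members and between clients and servers, so that the deployment can tolerate these conditions.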
> > As a summary for the probe setting,
> >
> > +--------------+    driver configuration
> > |  ovn-driver  |
> > +--------------+
> >     ^      |
> >     |      v
> > +--------------+    inactivity_probe in table "Connection"
> > |  ovn-nb-db   |
> > +--------------+
> >     ^      |
> >     |      v
> > +--------------+    options:northd_probe_interval in table "NB_Global"
> > |  ovn-northd  |    in nbdb.
> > +--------------+
> >     ^      |
> >     |      v
> > +--------------+    inactivity_probe in table "Connection"
> > |  ovn-sb-db   |
> > +--------------+
> >     ^      |
> >     |      v
> > +--------------------------------+    in table "Open_vSwitch" in ovsdb-server
> > |         ovn-controller         |    ovn-remote-probe-interval for TCP probe to ovsdb-server,
> > +--------------------------------+    ovn-openflow-probe-interval for UNIX probe to ovs-vswitchd
> >     ^      |          ^      |
> >     |      v  TCP     |      v  UNIX
> > +--------------+  +--------------+
> > | ovsdb-server |  | ovs-vswitchd |
> > +--------------+  +--------------+
> >
> > Is that correct?
>
> Correct. Except that you don't have to use TCP between
> ovn-controller and the local ovsdb-server. Use UNIX and then you
> don't need to worry about the probe between them.
>
Probes in OVN
Listed in the order they turned up when searching
1. ovn-ic-nb
Connection table, inactivity_probe column
<column name="options" key="ic_probe_interval">
<p>
The inactivity probe interval of the connection to the OVN IC
Northbound and Southbound databases from <code>ovn-ic</code>, in
milliseconds. If the value is zero, it disables the connection
keepalive feature.
</p>
<group title="Client Failure Detection and Handling">
<column name="max_backoff"> # this presumably works together with the probe setting
Maximum number of milliseconds to wait between connection attempts.
Default is implementation-specific.
</column>
<column name="inactivity_probe">
Maximum number of milliseconds of idle time on connection to the client
before sending an inactivity probe message. If Open vSwitch does not
communicate with the client for the specified number of seconds, it
will send a probe. If a response is not received for the same
additional amount of time, Open vSwitch assumes the connection has been
broken and attempts to reconnect. Default is implementation-specific.
A value of 0 disables inactivity probes.
</column>
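In short: once the connection has been idle for inactivity_probe milliseconds, a probe is sent; if no reply arrives within the same amount of time again, the connection is treated as broken and a reconnect is attempted. The schema text says "seconds" in one sentence, but the unit of the column is milliseconds. The default is implementation-specific, and a value of 0 disables inactivity probes.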
2. ovn-ic-sb
Connection table, inactivity_probe column
<group title="Client Failure Detection and Handling">
<column name="max_backoff">
Maximum number of milliseconds to wait between connection attempts.
Default is implementation-specific.
</column>
<column name="inactivity_probe">
Maximum number of milliseconds of idle time on connection to the client
before sending an inactivity probe message. If Open vSwitch does not
communicate with the client for the specified number of seconds, it
will send a probe. If a response is not received for the same
additional amount of time, Open vSwitch assumes the connection has been
broken and attempts to reconnect. Default is implementation-specific.
A value of 0 disables inactivity probes.
</column>
3. ovn-nb
Connection table, inactivity_probe column
<column name="options" key="northd_probe_interval">
<p>
The inactivity probe interval of the connection to the OVN Northbound
and Southbound databases from <code>ovn-northd</code>, in milliseconds.
If the value is zero, it disables the connection keepalive feature.
</p>
<p>
If the value is nonzero, then it will be forced to a value of
at least 1000 ms.
</p>
</column>
That is, the keepalive interval ovn-northd uses toward both the OVN Northbound and Southbound databases, in milliseconds; 0 disables it.
<column name="options" key="ic_probe_interval">
<p>
The inactivity probe interval of the connection to the OVN Northbound
and Southbound databases from <code>ovn-ic</code>, in milliseconds.
If the value is zero, it disables the connection keepalive feature.
</p>
<column name="options" key="nbctl_probe_interval">
<p>
The inactivity probe interval of the connection to the OVN Northbound
database from <code>ovn-nbctl</code> utility, in milliseconds.
If the value is zero, it disables the connection keepalive feature.
</p>
# keepalive from the NB client (ovn-nbctl) to the NB database
<group title="Client Failure Detection and Handling">
<column name="max_backoff">
Maximum number of milliseconds to wait between connection attempts.
Default is implementation-specific.
</column>
<column name="inactivity_probe">
Maximum number of milliseconds of idle time on connection to the client
before sending an inactivity probe message. If Open vSwitch does not
communicate with the client for the specified number of seconds, it
will send a probe. If a response is not received for the same
additional amount of time, Open vSwitch assumes the connection has been
broken and attempts to reconnect. Default is implementation-specific.
A value of 0 disables inactivity probes.
</column>
4. ovn-sb
Connection table, inactivity_probe column
<column name="options" key="sbctl_probe_interval">
<p>
The inactivity probe interval of the connection to the OVN
Southbound database from <code>ovn-sbctl</code> utility, in
milliseconds. If the value is zero, it disables the connection
keepalive feature.
</p>
# keepalive from the SB client (ovn-sbctl) to the SB database
<group title="Client Failure Detection and Handling">
<column name="max_backoff">
Maximum number of milliseconds to wait between connection attempts.
Default is implementation-specific.
</column>
<column name="inactivity_probe">
Maximum number of milliseconds of idle time on connection to the client
before sending an inactivity probe message. If Open vSwitch does not
communicate with the client for the specified number of seconds, it
will send a probe. If a response is not received for the same
additional amount of time, Open vSwitch assumes the connection has been
broken and attempts to reconnect. Default is implementation-specific.
A value of 0 disables inactivity probes.
</column>
</group>
5. ovn-controller
<dt><code>external_ids:ovn-remote-probe-interval</code></dt>
<dd>
<p>
The inactivity probe interval of the connection to the OVN database,
in milliseconds.
If the value is zero, it disables the connection keepalive feature.
That is, the keepalive interval toward the OVN database, in milliseconds; 0 disables it.
</p>
<p>
If the value is nonzero, then it will be forced to a value of
at least 1000 ms.
</p>
</dd>
<dt><code>external_ids:ovn-openflow-probe-interval</code></dt>
<dd>
<p>
The inactivity probe interval of the OpenFlow connection to the
OpenvSwitch integration bridge, in seconds.
If the value is zero, it disables the connection keepalive feature.
That is, the keepalive interval of the OpenFlow connection to the integration bridge, in seconds (not milliseconds); 0 disables it.
</p>
<p>
If the value is nonzero, then it will be forced to a value of
at least 5s.
</p>
</dd>
6. ovn-controller vtep
<dt><code>other_config:ovn-remote-probe-interval</code></dt>
<dd>
<p>
The inactivity probe interval of the connection to the OVN Southbound
database, in milliseconds. If the value is zero, it disables the
connection keepalive feature.
That is, the keepalive interval toward the OVN Southbound database, in milliseconds; 0 disables it.
</p>
<p>
If the value is nonzero, then it will be forced to a value of at
least 1000 ms.
</p>
</dd>
</dl>
</p>
# TODO://
* OVN OCF pacemaker script to support Active / Passive HA for OVN dbs provides
the option to configure the inactivity_probe value. The default 5 seconds
inactivity_probe value is not sufficient and ovsdb-server drops the client
IDL connections for openstack deployments when the neutron server is heavily
loaded.
We need to find a proper solution to solve this issue instead of increasing
the inactivity_probe value.
1. There appear to be quite a few timeouts. Try raising the following defaults and see whether it helps. On ovn-central
(this should already exist by default):
ovn-nbctl --no-leader-only set NB_Global . options:northd_probe_interval=180000
ovn-nbctl --no-leader-only set Connection . inactivity_probe=180000
ovn-sbctl --no-leader-only set Connection . inactivity_probe=180000
ovn-nbctl --no-leader-only get Connection . inactivity_probe
ovn-sbctl --no-leader-only get Connection . inactivity_probe
And on ovn-controller (ovn-remote-probe-interval is in milliseconds, ovn-openflow-probe-interval is in seconds):
ovs-vsctl set open . external-ids:ovn-remote-probe-interval=180000
ovs-vsctl set open . external-ids:ovn-openflow-probe-interval=180
if [[ -z "$NODE_IPS" ]]; then
echo "no node ip"
else
echo "has node ip"
fi
ovn-nbctl --no-leader-only set NB_Global . options:northd_probe_interval=180000
ovn-nbctl --no-leader-only set NB_Global . options:use_logical_dp_groups=true
# inactivity_probe is a column of the Connection table; it is not among the NB_Global/SB_Global options quoted above, so the next two commands probably have no effect
ovn-nbctl --no-leader-only set NB_Global . options:inactivity_probe=180000
ovn-sbctl --no-leader-only set SB_Global . options:inactivity_probe=180000
Read the values back:
ovn-nbctl --no-leader-only get NB_Global . options:northd_probe_interval
ovn-nbctl --no-leader-only get NB_Global . options:use_logical_dp_groups
ovn-nbctl --no-leader-only get NB_Global . options:inactivity_probe
ovn-sbctl --no-leader-only get SB_Global . options:inactivity_probe
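Two separate settings are involved. One (the backoff) controls the interval at which the client tries to connect to the server, following a backoff algorithm.
The other (the inactivity timeout) controls how the client decides that an established connection is dead, which then triggers an immediate reconnect.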
type options struct {
endpoints []string
tlsConfig *tls.Config
reconnect bool
leaderOnly bool
timeout time.Duration
backoff backoff.BackOff
logger *logr.Logger
registry prometheus.Registerer
shouldRegisterMetrics bool // in case metrics are changed after-the-fact
metricNamespace string // prometheus metric namespace
metricSubsystem string // prometheus metric subsystem
inactivityTimeout time.Duration
}
References: