Subnet Management (SM) and Subnet Administration (SA) in OpenFabrics

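RDMA: Remote Direct Memory Access.
IB: InfiniBand. A high-performance, low-latency network technology for high-performance computing and enterprise data centers.
IBTA: InfiniBand Trade Association. The organization that defines and maintains the InfiniBand architecture.
SC: Steering Committee.
IBA: InfiniBand Architecture. The specification that defines InfiniBand technology.
SM: Subnet Manager. Responsible for establishing and maintaining the topology of an InfiniBand subnet.
SA: Subnet Administration. Provides additional management functions and access to subnet information beyond what subnet management itself offers.
HCA: Host Channel Adapter. The device that connects a computer to the InfiniBand network.
TCA: Target Channel Adapter. The device that connects a storage or network target to the InfiniBand network.
LID: Local Identifier. An identifier assigned to each port, unique within its subnet.
MAD: Management Datagram. The standard message format for manager-agent communication in InfiniBand management.
UD: Unreliable Datagram.
QP: Queue Pair. The basic communication object in the InfiniBand architecture, consisting of a send queue and a receive queue.
GMP: General Management Packet. A management datagram used for general management tasks outside subnet management.
SMP: Subnet Management Packet. A management datagram used specifically for subnet management tasks.
FDB: Forwarding Database. The database a switch uses to forward packets according to their destination LID.
VL: Virtual Lane. A mechanism in the InfiniBand architecture that provides quality of service by separating traffic into different lanes.
MTU: Maximum Transmission Unit.
P_Key: Partition Key.
BTH: Base Transport Header.
FECN: Forward Explicit Congestion Notification. A mechanism in InfiniBand that controls congestion by notifying senders and receivers of the congestion state.
BECN: Backward Explicit Congestion Notification. A mechanism in InfiniBand that controls congestion by notifying senders and receivers of the congestion state.
CN: Congestion Notification.
QoS: Quality of Service. A network feature that manages data traffic to reduce packet loss and ensure the best performance.
QoSClass: Quality of Service Class.
ULP: Upper Layer Protocol.
EWG: Electro-Mechanical Working Group. A working group within the IBTA, responsible for defining InfiniBand's extended link speeds.
MgtWG: Management Working Group. A working group within the IBTA that focuses on the management aspects of the InfiniBand architecture.
LWG: Link Working Group. A working group within the IBTA that focuses on the physical and link layer specifications of InfiniBand.
LMC: LID Mask Control.
DOR: Dimension Ordered Routing.
gPXE: An open-source PXE network boot loader.
GUID: Globally Unique Identifier. A unique identifier for devices and ports in an InfiniBand network.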

Subnet Management (SM) and Subnet Administration (SA) in OpenFabrics
Hal Rosenstock, 4/19/13

  • IB Management Architecture Overview
  • IB Management Tools Overview
  • OpenSM

IB Management Architecture Overview

  • InfiniBand Trade Association (IBTA)

  • IB Management Architecture and IBTA

  • What is a subnet?

  • Subnet Model

  • Basic Management Concepts

  • Management Model

  • Objectives of Subnet Management

  • Path Information

  • SM Maintenance of Subnet

  • Subnet Administration (SA) Information

  • SA Functions

  • SM/SA Architecture

  • Relationship between SM and SA

  • Performance Management

  • Congestion Management

  • Quality of Service

  • Founded in 1999

  • Actively markets and promotes InfiniBand from an industry perspective through public relations engagements, developer conferences and workshops

  • Steering Committee (SC) Members

  • Cray, HP, IBM, Intel, Mellanox, Oracle

  • InfiniBand Architecture (IBA) is specified by InfiniBand Trade Association (IBTA)

  • Currently at version 1.2.1 – issued November 2007

  • In two volumes (1 & 2)

  • IB Management is specified in chapters 13-16 in volume 1

  • Chapter 13: Management Model

  • Chapter 14: Subnet Management

  • Chapter 15: Subnet Administration

  • Chapter 16: General Services

  • Also, various annexes of interest in volume 1

  • Annex A10: Congestion Control

  • Annex A13: Quality of Service

  • Hierarchy Annex

  • IBA has evolved beyond 1.2.1

  • Errata (primarily MgtWG and LWG)

  • 1.3 Volume 2 – Extended Link Speeds (EWG)

  • Released specs and errata are available to non-members

  • www.infinibandta.org/content/pag… p?pg=technology_download

[Figure: channel adapter, switch, router]

  • Subnet = HCAs and TCAs interconnected through switches
  • Each subnet has its own LID space
  • Each subnet has at least one SM and exactly one (logical) Master SM
  • after initialization, mastership could be a distributed function
  • Fabric = subnets interconnected through routers

[Figure: endnode, switch, router]

  • Node: any managed entity - endnode, switch, router
  • Manager: active entity; sources commands and queries. There are few managers.
  • Agent: passive (mostly) entity, responds to managers (but can source traps). Many agents.
  • Management Datagram (MAD): standard message format for manager-agent communication. Carried in an unreliable datagram (UD).
  • All data formats & actions are defined solely in terms of MAD content. Implementation not defined: hardware, firmware, software, whatever...

Pure InfiniBand Management

  • QP0 (virtualized per port)
  • Always uses VL15
  • MADs called SMPs: LID-routed or direct-routed
  • No flow control

Other Management Features

  • QP1 (virtualized per port)
  • Uses any VL except 15
  • MADs called GMPs: LID-routed
  • Subject to flow control
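The QP and VL rules for the two MAD kinds can be written down as a small check. This is only an illustrative sketch: the function name `classify_mad` is hypothetical and is not part of any OpenFabrics library.

```python
def classify_mad(qp: int, vl: int) -> str:
    """Return 'SMP' or 'GMP' for a management datagram, or raise if the
    QP/VL combination contradicts the rules listed above."""
    if qp == 0:
        # Subnet management traffic: QP0, always on VL15.
        if vl != 15:
            raise ValueError("SMPs on QP0 always use VL15")
        return "SMP"
    if qp == 1:
        # General management traffic: QP1, any VL except 15.
        if vl == 15:
            raise ValueError("GMPs on QP1 never use VL15")
        return "GMP"
    raise ValueError("management datagrams use QP0 or QP1")
```

For example, `classify_mad(0, 15)` yields `"SMP"`, while `classify_mad(1, 15)` is rejected.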

  • Node attributes:

  • NodeInfo (type, numPorts, version, GUIDs, deviceID…)

  • NodeDescription

  • Port attributes:

  • PortInfo (M_Key, LIDs, state, capabilities, VLs, width, speed, MTU, Master…)

  • GUIDInfo

  • SLtoVLMapping

  • VLArbitration

  • Partition table

  • Switch attributes:

  • SwitchInfo

  • LinearFDB

  • RandomFDB

  • MulticastFDB

  • SM attribute:

  • SMInfo

  • Router attribute:

  • RouterInfo (TBD)

  • Initialization and configuration of the subnet elements

  • Establishing paths through the subnet

  • Fault isolation

  • Continue these activities during topology changes

  • Prevent unauthorized subnet managers

  • The Subnet Manager establishes / defines paths through the subnet

  • does so using SMPs that set switch forwarding tables, LIDs, etc.

  • Subnet Administration responds to path resolution requests, using GMPs

  • Path Record returned by SA contains:

  • Local header: DLID, SLID, SL

  • Global header: DGID, SGID, (TClass, FlowLabel, HopLimit)

  • Properties: MTU, Rate, Latency, P_Key

  • Physical subnet establishment

  • Subnet discovery

  • Information gathering

  • Path determination

  • Port configuration

  • Switch configuration

  • Subnet activation

  • Topology changes

  • SM Actions

  • SM state machine

  • Determination of the Master

  • Mastership handover

  • Mastership failover

  • Handling topology changes

  • SA provides access to & storage of three kinds of information:

  • info that endnodes need to operate on the subnet

  • examples: paths, multicast info, services

  • info that is non-algorithmic, typically entered by a network administrator

  • examples: partitioning information, SL to VL mappings, etc.

  • required for level 3 interoperability (move to different manager)

  • info that may be useful to, e.g., standby SMs

  • example: network topology, GUIDs of nodes, etc.

  • this is optional

  • The SM and SA cooperate to provide this information

  • To provide that information, SA has two major functions:

  • A query subsystem to identify type of information sent and received

  • An event reporting subsystem to forward SM traps as Report() MADs to subscribers

  • All subnets must have an SA

  • The SA is part of the SM

  • Effectively, it’s the SM talking in “normal” packets to clients on the subnet

  • “Tightly” coupled to SM in IBA

  • Discussed separately for convenience of description only

  • SM-SA communication is “vendor” specific

  • SA can be on a node different from SM

  • Redirection could be used to accomplish this

  • If SM is master, the (or a) related SA must also be master

  • If an SM ceases to be master, a (or the) related SA must also cease to be master (watch out when SA & SM are on different nodes)

State Records

  • NodeRecord

  • SwitchRecord*

  • PortInfoRecord

  • PartitionRecord

  • SLToVLMappingTableRecord

  • LinearForwardingTableRecord*

  • RandomForwardingTableRecord*

  • MulticastForwardingTableRecord*

  • SmInfoRecord*

  • LinkRecord*

  • GuidInfoRecord*

  • PathRecord

  • MultiPathRecord

  • InformInfoRecord

  • Notice* (Traps and Notices) (not record)

Subscription Records

  • ServiceRecord (Service Advertisement)

  • MCMemberRecord*

  • InformInfo (not record)

  * = optional

  • Purpose:

  • Allows retrieval of performance and error statistics from InfiniBand™ components

  • Provides a means of sampling a specified quantity over a specified interval

  • IBA specifies only the PerfMgt Agent; manager is “beyond the spec”

  • Specified in Annex A10 (updated in LWG errata 8/19/10)

  • Avoid/eliminate congestion spreading

  • Limit flow injection rate at ports which are root cause of congestion

  • Accomplished via FECN/BECN mechanism

  • Formerly reserved bits in BTH header

  • CN (congestion notification) opcode is 0b10000000

  • Transparent to version 1.1 or earlier switches

  • B (BECN): 0 indicates that no congestion was encountered; 1 indicates that the packet indicated by this header was subject to forward congestion. The B bit is set in an ACK or CN BTH.

  • F (FECN): 0 indicates that it probably did not go through a point of congestion; 1 indicates that the packet went through a point of congestion.
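As an illustration of where these bits live, the sketch below decodes a 12-byte BTH. It assumes the layout in which F and B are the two most significant bits of the 32-bit word that also carries the 24-bit destination QP; that placement and the function name are assumptions made for this example and should be checked against Annex A10 before being relied on.

```python
import struct

CN_OPCODE = 0b10000000  # congestion notification opcode from the list above

def parse_bth_congestion_bits(bth: bytes) -> dict:
    """Extract the opcode, FECN, BECN and destination QP from a raw BTH."""
    if len(bth) != 12:
        raise ValueError("a BTH is 12 bytes long")
    # opcode (8) | SE/M/Pad/TVer (8) | P_Key (16) | F,B,rsvd,DestQP (32) | A,rsvd,PSN (32)
    opcode, _flags, pkey, dword1, _dword2 = struct.unpack(">BBHII", bth)
    return {
        "opcode": opcode,
        "is_cn": opcode == CN_OPCODE,
        "pkey": pkey,
        "fecn": bool((dword1 >> 31) & 1),  # assumed bit position
        "becn": bool((dword1 >> 30) & 1),  # assumed bit position
        "dest_qp": dword1 & 0xFFFFFF,
    }
```

A receiver that sees `fecn` set on a data packet would answer with `becn` set in an ACK or CN packet, which is what lets the source throttle its injection rate.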

  • Specified in Annex A13 (updated in LWG errata 8/19/10)

  • QoS scheme providing different relative classes of service

  • Requester specifies priority by specifying the type of priority management requested via a 2 bit QoSClass.Type and 8 bit QoSClass.Priority field in [Multi]PathRecord

  • Currently, the only type of priority management defined is DiffServ compatible

  • All other values for the QoSClass.Type are reserved

  • Requester can also specify ServiceID field in PathRecord/MultiPathRecord

  • Also used in subsequent CM REQ messages

IB Management Tools Overview

  • OpenFabrics

  • OpenFabrics Stack

  • OpenFabrics Software

  • OpenSM

  • Open source InfiniBand software is developed under OpenFabrics Open Source Alliance www.openfabrics.org/index.html

  • Primarily targeted at Linux and Windows

  • Other OS ports have been done

  • Mellanox is current maintainer for OpenSM and libibumad

  • I am current maintainer for OpenSM, libibumad, and ibsim

  • I was former maintainer for infiniband-diags and libibmad

  • Git trees’ location is git.openfabrics.org/git/

  • ~halr for opensm, libibumad, and ibsim

  • ~iraweiny for infiniband-diags and libibmad

  • Released tar balls location is www.openfabrics.org/downloads/m…

  • libibumad, libibmad, opensm, infiniband-diags, ibsim

  • Software is dual licensed

  • BSD or GPL v2

OpenSM

  • What is OpenSM?

  • OpenSM Porting Layers

  • OpenSM “History”

  • OpenSM Gen2 Layering

  • Introducing OpenSM

  • Starting OpenSM

  • OpenSM Command Line Options

  • opensm.conf Parameters

  • Logging

  • Console

  • Stages of Operation

  • Partition Management

  • Quality of Service (QoS)

  • Routing

  • Credit Loops/Deadlock

  • Unicast Routing Stages

  • Unicast Routing Algorithms

  • Up/Down Routing

  • Multicast Routing

  • SM High Availability

  • Some Notable Features

  • Windows OpenSM (and diagnostics)

  • Performance Manager

  • Congestion Manager

  • InfiniBand compliant Subnet Manager and Administration

  • Also contains an (optional) performance manager

  • Now also contains an (optional) (experimental) congestion manager

  • Approximately 2-3 releases/year

  • Latest release is 3.3.16 (February 2013)

  • Component Library (complib)

  • OS portability/abstraction layer

  • threads, timers, locking, events, various data structures

  • Vendor Library (libvendor)

  • IB hardware portability layer

  • osm_vendor_ibumad

  • osm_vendor_ibal

  • Vendor layer for simulator (ibmgtsim not ibsim)

  • osm_vendor_mlx

  • Timeline

  • Gen 1 (IBAL based vendor layer)

  • Intel – 2002 to mid 2003

  • Mellanox – mid 2003 to end 2004

  • Gen 2 (UMAD based vendor layer)

  • Voltaire – 2005 to 2010

  • PathForward “seeded” OpenIB/OpenFabrics

  • Now Mellanox again - 2011

[Figure: OpenSM Gen2 layering across user space and kernel space]

Load OpenSM on a node/server

  • Invoked via command line options and various configuration files (even more options)

  • See extensive man page

  • Default options file

  • Sample from OpenSM 3.3.11

  • Maintains log file

  • Command line

  • Default (no parameters)

  • Scans and initializes the IB fabric and periodically sweeps for changes

  • opensm -h for usage flags

  • E.g. to start with up-down routing: opensm --routing_engine updn

  • Run is logged to two files

  • Windows

  • \Windows\Temp\osm.syslog - opensm messages, registers only general major events

  • \Windows\Temp\osm.log – details of reported errors

  • Linux

  • /var/log/messages – opensm messages, registers only general major events

  • /var/log/opensm.log - details of reported errors

  • Installed as service in Windows

  • /etc/init.d/opensm start (Linux)

  • Start on Boot

  • As a Linux daemon:

  • /etc/init.d/opensmd start|stop|restart|status

  • /etc/opensm.conf for default parameters

  • ONBOOT: to start OpenSM automatically at boot, set ONBOOT=yes

  • SM detection

  • /etc/init.d/opensmd status

  • Shows opensm runtime status on a machine

  • sminfo

  • Shows master (and standby) SMs running on the subnet

  • saquery -s

  • Shows SMs on the subnet

  • A few important command line parameters:

  • -c, --create-config: OpenSM will dump its configuration to the specified file and exit. This is a way to generate an OpenSM configuration file template.

  • -g, --guid: This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. This option is used if the SM needs to bind to Port 2 of an HCA.

  • -R, --routing_engine: This option chooses a routing engine instead of the Min Hop algorithm (default). Supported engines: updn, dnup, file, ftree, lash, dor, torus-2QoS

  • -x, --honor_guid2lid: This option forces OpenSM to honor the guid2lid file when it comes out of Standby state, if such a file exists under \Windows\Temp (in Windows) or /var/cache/opensm (in Linux)

  • -V: This option sets the maximum verbosity level and forces log flushing

Command: Description

  • version: Prints OpenSM version and exits
  • guid: Specifies the local port GUID
  • lmc: The number of LIDs per port (power of 2)
  • priority: Specifies the SM's priority; master = highest
  • smkey: This will affect SM authentication (64 bits)
  • reassign_lids: Causes SM to reassign LIDs to all end nodes
  • routing_engine: Chooses routing algorithm (default minhop)
  • no_default_routing: Prevents SM from falling back to default routing
  • do_mesh_analysis: Enables additional analysis for the lash algorithm
  • lash_start_vl: Sets the VL to use for LASH (default 0)
  • sm_sl: Sets the SL to use for SM/SA (default 0)
  • connect_roots: Forces routing engines to connect between roots
  • ucast_cache: Prevents ucast routing changes per heavy sweep

Command: Description

  • lid_matrix_file: Specifies the name of the lid matrix dump file
  • lfts_file: Specifies the name of the lft file to be loaded
  • sadb_file: Specifies the SA dump file
  • root_guid_file: Sets the root nodes GUID (updn or fat tree)
  • cn_guid_file: Sets compute node GUIDs for Fat Tree
  • io_guid_file: Sets I/O node GUIDs for Fat Tree
  • max_reverse_hops: Sets max hops the wrong way
  • ids_guid_file: Name of map file with a set of IDs to use instead of GUIDs
  • guid_routing_order_file: Sets the order of port GUIDs for Min-Hop and UpDn
  • torus_config: Defines the file name for extra config info needed
  • once: Causes SM to configure the subnet once, then exit
  • sweep: Specifies number of seconds between subnet sweeps
  • timeout (milliseconds): Specifies transaction timeouts
  • retries (default 3): Specifies the number of retries used per transaction
  • maxsmps: Specifies the number of VL15 SMP MADs per wire
  • console: Activates the SM console (default off)
  • ignore-guids: Defines set of port GUIDs to be ignored by the link load algorithm

Command: Description

  • hop_weights_file: Provides the means to define a weighting factor per port
  • dimn_ports_file: Provides the means to define a mapping between ports
  • honor_guid2lid (default false): Forces SM to honor the guid2lid file (when coming out of standby state)
  • log_file: Defines the log to given file name (default /var/log/opensm.log)
  • log_limit: Defines the max file size in MB
  • erase_log_file: Causes deletion of the file if it previously exists (default is accumulative)
  • Pconfig (default partitions.conf): Defines the optional partition configuration file
  • no_part_enforced: Disables partition enforcement on switch external ports
  • ar: Enables Adaptive Routing Manager (ARM) on SM
  • ar_config_file: Specifies optional Adaptive Routing config file
  • qos: Enables QoS setup
  • qos_policy_file: Defines optional QoS policy file (default qos-policy.conf)
  • stay_on_fatal: Will cause SM not to exit on fatal initialization
  • daemon: Will run SM in the background
  • inactive: Start SM in inactive rather than normal init state
  • prefix_routes_file: Specifies how SA responds to path record queries

  • opensm.conf provides full access to all OpenSM internal options, which control various aspects of its operation.

  • Levels control the amount of information logged

  • Error (error messages)

  • Info (basic messages, low volume)

  • Verbose (interesting stuff, moderate volume)

  • Debug (diagnostic, high volume)

  • Functions (function entry/exit, very high volume)

  • Frames (dumps all SMP and GMP frames)

  • Routing (dump FDB routing information)

  • Without the -D option, OpenSM defaults to ERROR + INFO

  • Other logging features

  • Limit log file size

  • Log rotation

  • Erase log file (at startup)

  • Now (at OpenSM 3.3.15): per module logging

  • Optional

  • Local

  • Socket

  • telnet

  • Can limit to loopback IP address

  • SSL not yet supported

  • Commands

  • loglevel

  • priority

  • resweep

  • reroute

  • sweep

  • More commands

  • status

  • logflush

  • querylid

  • portstatus

  • switchbalance

  • lidbalance

  • dump_conf

  • update_desc

  • version

  • perfmgr

  • dump_portguid

  • Discovery (Heavy sweep)

  • If master, configuration

  • PKey/QoS setup

  • SM LID configuration

  • LID configuration

  • Switch configuration

  • Unicast

  • Multicast

  • Link/Port configuration

  • Set state to Armed and then Active

  • Subnet is up if all sets worked

  • Otherwise, another heavy sweep is invoked

  • Once subnet up, then light sweeping

  • Poll SwitchInfo for PortStateChange

  • If topology change, trigger heavy sweep

  • Traps can also indicate significant change triggering heavy sweep

  • Link state change (trap 128)
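The light/heavy sweep decision described above can be summarized in a few lines. This is a simplified sketch of the decision only, not OpenSM's actual state machine; the function and constant names are hypothetical.

```python
from typing import Iterable

TRAP_LINK_STATE_CHANGE = 128  # trap number named in the list above

def next_sweep(subnet_up: bool, port_state_change: bool,
               traps: Iterable[int]) -> str:
    """Decide which kind of sweep to run next."""
    if not subnet_up:
        # Some configuration step failed: discover and configure again.
        return "heavy"
    if port_state_change or TRAP_LINK_STATE_CHANGE in set(traps):
        # Topology change seen via SwitchInfo polling or reported by trap.
        return "heavy"
    # Subnet is up and nothing changed: keep polling SwitchInfo.
    return "light"
```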

  • Define partitions

  • Assign end ports (Port GUID) to one or more partitions

  • Provide level of access control within the fabric

  • Full or Limited membership

  • End ports can be members of multiple partitions

  • If enabled, partition enforcement is set on leaf links

  • Leaf link is between switch external port and xCA (or router) port

  • Partition syntax

  • Full/limited

  • Also, ipoib on partition preconfigures the IPoIB broadcast group

  • Rate, SL, MTU

  • Can now also preconfigure multicast groups

  • New feature at OpenSM 3.3.15

  • See man page section on PARTITION CONFIGURATION or doc/partition-config.txt for syntax

  • Partial Membership

  • Partial members are able to communicate only with full members

  • Partial can talk only with full, full can talk with all (either full or partial)

  • Some “Common” Use Cases

  • IB Storage (File/Block) target: prevent initiator nodes from communicating with each other over the I/O subnet

  • Network management with IPoIB: prevent managed nodes from communicating with each other over the managed IP subnet. The Subnet Manager port is a Full member; all other ports are Partial members

  • PKEY is a 16 bit integer

  • MSb defines the nature of Membership: is it Full or Partial member ?

  • 15 other bits comprise the partition

  • Example: for the partition 0x0003

  • for Full members: 0x8003

  • for Partial members: 0x0003

  • IPoIB uses full membership

  • Note: the child interface name has the partition (low 15 bits) but uses full membership
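The P_Key encoding and the membership rule above fit in a few lines of code. The helper names here are hypothetical and chosen only for this sketch.

```python
FULL_MEMBER_BIT = 0x8000   # most significant bit: 1 = full, 0 = partial
PARTITION_MASK = 0x7FFF    # low 15 bits identify the partition

def make_pkey(partition: int, full: bool) -> int:
    """Combine a 15-bit partition number with the membership bit."""
    if not 0x0001 <= partition <= PARTITION_MASK:
        raise ValueError("partition must be in the range 0x0001 to 0x7fff")
    return (partition | FULL_MEMBER_BIT) if full else partition

def is_full_member(pkey: int) -> bool:
    return bool(pkey & FULL_MEMBER_BIT)

def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Two ports can talk if they share a partition and at least one of
    them is a full member (partial members cannot talk to each other)."""
    same_partition = (pkey_a & PARTITION_MASK) == (pkey_b & PARTITION_MASK)
    return same_partition and (is_full_member(pkey_a) or is_full_member(pkey_b))
```

With partition 0x0003, `make_pkey(3, True)` gives 0x8003 and `make_pkey(3, False)` gives 0x0003; the pair (0x8003, 0x0003) can communicate, while two ports that both hold 0x0003 cannot.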

  • OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric.

  • By default, OpenSM searches for the partitions configuration file under the name /usr/etc/opensm/partitions.conf

  • To change this filename, you can use opensm with the ‘--Pconfig’ or ‘-P’ flags.

  • The default partition is created by OpenSM unconditionally even when a partition configuration file does not exist or cannot be accessed

  • The default partition has a P_Key value of 0x7fff (The partition range is 0x0001 – 0x7fff)

  • The port which “runs” OpenSM is assigned full membership in the default partition. All other end ports are assigned full or partial membership.

File format (partitions.conf):

    <Partition Definition> : <PortGUIDs list> ;

Partition Definition:

    [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]

  • PartitionName: string, used in logging. When omitted, an empty string is used.

  • PKey: P_Key value for this partition. Only the low 15 bits are used. When omitted, it is autogenerated.

  • flag: used to indicate the IPoIB capability of this partition.

  • defmember: specifies the default membership for the port GUID list (default is limited).

Optional flags (continued):

  • ipoib: indicates that this partition may be used for IPoIB; as a result, an IPoIB capable MC group will be created.

  • rate=<val>: specifies the rate for this IPoIB MC group (default is 3 (10 Gbps))

  • mtu=<val>: specifies the MTU for this IPoIB MC group (default is 4 (2048))

  • sl=<val>: specifies the SL for this IPoIB MC group (default is 0)

  • scope=<val>: specifies the scope for this IPoIB MC group (default is 2 (link local)). Multiple scope settings are permitted for a partition.

  • Note: Values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048)

  • PortGUID: GUID of a partition member end port. Hexadecimal numbers should start with 0x; decimal numbers are accepted too.

  • full or limited: indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed.

  • There are several useful keywords for PortGUID definition:

    • 'ALL' means all end ports in this subnet

    • 'ALL_CAS' means all Channel Adapter end ports in this subnet

    • 'ALL_SWITCHES' means all Switch end ports in this subnet

    • 'ALL_ROUTERS' means all Router end ports in this subnet

    • 'SELF' means the subnet manager port

  • An empty list means no ports in this partition
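To make the file format concrete, the sketch below assembles one partitions.conf line from its parts. It assumes the general layout "partition definition, colon, port GUID list, semicolon" with the definition written as `[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]`; the function `partition_line` and the partition name "storage" are hypothetical, and the authoritative syntax is the PARTITION CONFIGURATION section of the man page.

```python
from typing import Dict, Iterable, Optional, Tuple

def partition_line(name: str,
                   pkey: Optional[int] = None,
                   flags: Optional[Dict[str, Optional[int]]] = None,
                   defmember: Optional[str] = None,
                   members: Iterable[Tuple[str, Optional[str]]] = ()) -> str:
    """Format a single partition definition line."""
    definition = name
    if pkey is not None:
        # Only the low 15 bits of the P_Key are used.
        definition += "=0x{:04x}".format(pkey & 0x7FFF)
    for flag, value in (flags or {}).items():
        definition += ",{}".format(flag) if value is None else ",{}={}".format(flag, value)
    if defmember is not None:
        definition += ",defmember={}".format(defmember)
    # Each member is a port GUID or keyword, optionally with =full or =limited.
    ports = ", ".join(guid if membership is None else "{}={}".format(guid, membership)
                      for guid, membership in members)
    return "{} : {} ;".format(definition, ports)

line = partition_line("storage", 0x0003, {"ipoib": None, "mtu": 4},
                      members=[("ALL", None), ("SELF", "full")])
```

Here `line` is `storage=0x0003,ipoib,mtu=4 : ALL, SELF=full ;`, which gives every end port limited membership (the default) and the SM port full membership.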

  • When Quality of Service (QoS) in OpenSM is enabled (using the ‘-Q’ or ‘--qos’ flags or qos TRUE option), OpenSM looks for a QoS Policy file under /usr/local/etc/opensm/qos-policy.conf

  • During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy, applies its settings to the discovered fabric elements, and enforces the provided policy on SA client path requests

  • The overall flow for SA client path requests is as follows:

  • The SA request is matched against the defined matching rules such that the QoS Level definition is found

  • Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level

There are two ways to define QoS policy:

  • Simple

  • Advanced policy file syntax

  • Provides the administrator various ways to match a PathRecord/MultiPathRecord (PR/MPR) request

  • Provides the administrator QoS constraints on the requested PR/MPR

  • Enables the administrator to match PR/MPR requests by various ULPs and applications running on top of these ULPs

  • Primitive (Simple)

  • See doc/qos-config.txt

  • Policy (Advanced)

  • Per QoS Annex A13

  • See doc/QoS_management_in_OpenSM.txt

  • In opensm configuration/options file

  • Provides for coarse configuration of SL2VL Mapping and VLArbitration tables

  • Max VLs, VL High Limit, VLArb High/Low, and SL2VL configuration

  • Options for All ports, CA ports, Router ports, Switch port 0 ports, Switch external ports

  • qos TRUE

  • qos_max_vls 8

  • qos_high_limit 4

  • qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64

  • qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4

  • qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
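The option values above use two compact formats: comma-separated VL:weight pairs for the VL arbitration tables, and a list of 16 VLs (one per SL) for SL2VL. The sketch below parses both and checks them against qos_max_vls. The function names are hypothetical, and the check assumes that qos_max_vls 8 means data VLs 0 through 7 are usable.

```python
from typing import Dict, List, Tuple

def parse_vlarb(value: str) -> List[Tuple[int, int]]:
    """Parse 'vl:weight,vl:weight,...' into (vl, weight) pairs."""
    pairs = []
    for entry in value.split(","):
        vl, weight = entry.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs

def parse_sl2vl(value: str) -> Dict[int, int]:
    """Parse a 16-entry list; position i holds the VL used for SL i."""
    vls = [int(v) for v in value.split(",")]
    if len(vls) != 16:
        raise ValueError("expected one VL per SL (16 entries)")
    return dict(enumerate(vls))

def vls_out_of_range(sl2vl: Dict[int, int],
                     vlarb: List[Tuple[int, int]],
                     max_vls: int) -> List[int]:
    """Return the VLs referenced by the tables that are >= max_vls."""
    used = set(sl2vl.values()) | {vl for vl, _ in vlarb}
    return sorted(vl for vl in used if vl >= max_vls)

high = parse_vlarb("0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64")
sl2vl = parse_sl2vl("0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7")
```

With the values shown above, SL 8 maps back to VL 0 and no entry falls outside the eight configured VLs.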

  • The QoS policy file has the following sections:

  1. Port Groups (denoted by port-groups). This section defines zero or more port groups that can be referred to later by matching rules (see below). A port group lists ports by:
     a. Port GUID
     b. Port name: a combination of Node Description and IB port number
     c. PKey: all the ports in the subnet that belong to the partition with a given PKey belong to this port group
     d. Partition name: all the ports in the subnet that belong to the partition with a given name belong to this port group
     e. Node type: supported node types are CA, SWITCH, ROUTER, ALL, and SELF (SM port)

  2. QoS Setup (denoted by qos-setup). This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this section is not supported. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.conf)

  3. QoS Levels (denoted by qos-levels). Each QoS Level defines a Service Level (SL) and a few optional fields:
     a. MTU limit
     b. Rate limit
     c. PKey
     d. Packet lifetime

     When a path search is performed, it is done subject to the restrictions that these QoS Level parameters impose. One QoS level that is mandatory to define is a DEFAULT QoS level. It is applied to a PR/MPR query that does not match any existing match rule. Like any other QoS Level, it can also be explicitly referred to by any match rule.

  4. QoS Matching Rules (denoted by qos-match-rules). Each PathRecord/MultiPathRecord query that OpenSM receives is matched against the set of matching rules. Rules are scanned in order of appearance in the QoS policy file, so the first match takes precedence. Each rule has the name of the QoS level that will be applied to the matching query.

     A default QoS level is applied to a query that did not match any rule. Queries can be matched by:
     a. Source port group (whether a source port is a member of a specified group)
     b. Destination port group (same as above, only for destination port)
     c. PKey
     d. QoS class
     e. Service ID
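The first-match rule scan with a DEFAULT fallback can be sketched as follows. The class, its field names, and the level names "storage" and "bulk" are hypothetical; a real policy also matches on source and destination port groups, which are left out here to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

@dataclass
class MatchRule:
    qos_level: str                          # name of the QoS level to apply
    pkeys: Optional[Set[int]] = None        # None means "do not test this field"
    qos_classes: Optional[Set[int]] = None
    service_ids: Optional[Set[int]] = None

    def matches(self, query: Dict[str, int]) -> bool:
        criteria = [(self.pkeys, "pkey"),
                    (self.qos_classes, "qos_class"),
                    (self.service_ids, "service_id")]
        # Every criterion that is set must be satisfied by the query.
        return all(allowed is None or query.get(key) in allowed
                   for allowed, key in criteria)

def select_qos_level(rules: List[MatchRule], query: Dict[str, int]) -> str:
    """Scan rules in file order; the first match wins, else DEFAULT."""
    for rule in rules:
        if rule.matches(query):
            return rule.qos_level
    return "DEFAULT"

rules = [MatchRule("storage", service_ids={0x10}),
         MatchRule("bulk", pkeys={0x0003})]
```

A query carrying both P_Key 0x0003 and Service ID 0x10 gets the "storage" level because that rule appears first; a query matching neither rule falls back to DEFAULT.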

  • QoS policy definition can also consist of single qos-ulps section

  • Has a list of matching rules and their QoS Level

  • A matching rule has only one criteria

  • Rule goal is to match a certain ULP

  • QOS Level has only one constraint - Service Level (SL)

  • As mentioned earlier, any section of the policy file is optional, and the only mandatory part of the policy file is a default QoS Level.

  • Port groups section is missing because there are no match rules, which means that port groups are not referred to anywhere, so there is no need defining them. Also, since this policy file doesn't have any matching rules, PR/MPR query will not match any rule, and OpenSM will enforce default QoS level. Essentially, the above example is equivalent to not having a QoS policy file at all.

  • The example on the next page shows all the possible options and keywords in the policy file and their syntax.

  • InfiniBand forwarding within a subnet is Layer 2, but the term "routing" is still used

  • Topology defines physical connectivity

  • Routing defines paths through topology

  • Unicast routing

  • See ROUTING section in opensm man page or doc/current-routing.txt

  • Multicast routing

  • IBA mandates deadlock free routing

  • C14-62.1.2: When establishing the contents of switch forwarding tables and SL to VL maps, the subnet manager shall ensure that no cyclic flow control dependencies exist in the fabric.

  • Cyclic dependencies in flow control can cause deadlock and subsequent failure of an IBA fabric. There are a number of routing methods that may be employed to prevent these dependencies. These include pruned and fat tree structures, dimension order routing in meshes and hyper cubes, and use of multiple virtual lanes to break the flow control cycle in routing loops. While IBA does not specify a particular routing method, whatever method is utilized must ensure deadlock-free operation.

  • “loss less fabric” = “link level flow control” = packet not sent if there is no receive buffer for it

  • If traffic to DST-1 waits on traffic for DST-2 which in turn depends on traffic to DST-3 which depends on DST-1 we have a dependency loop and the fabric deadlocks
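A dependency loop like the DST-1, DST-2, DST-3 example is a cycle in a directed "waits on" graph, so it can be found with a depth-first search. The sketch below works on an abstract dependency map; it is a generic cycle detector written for illustration, not the credit-loop check that OpenSM itself performs.

```python
from typing import Dict, List, Optional

def find_dependency_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of nodes (first node repeated at the end),
    or None if the dependency graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {node: WHITE for node in depends_on}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in depends_on.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: that closes a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(depends_on):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

loop = find_dependency_cycle({"DST-1": ["DST-2"],
                              "DST-2": ["DST-3"],
                              "DST-3": ["DST-1"]})
```

For the three-destination example, `loop` is `["DST-1", "DST-2", "DST-3", "DST-1"]`; removing any one of the three dependencies makes the function return `None`.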

  • Build LID matrices

  • For each switch, LID matrix is a table of LIDs, port number, and number of hops (using that port number)

  • SL too for QoS based algorithms

  • If there are several possible output ports (“default” policy):

  • the one with minimum number of routes is chosen

  • if there are several with minimum number of routes, then the one with lowest port number is chosen

  • Build unicast forwarding tables
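The stages above can be illustrated with a toy min-hop calculation: compute hop counts to each destination, then pick the output port with the fewest hops, breaking ties by the fewest routes already assigned and then by the lowest port number. This is a simplified sketch that uses node names in place of LIDs; the topology, names, and functions are hypothetical, and it is not OpenSM's implementation.

```python
from collections import deque
from typing import Dict, Iterable, Set

Topology = Dict[str, Dict[int, str]]  # node -> {port number: neighbour}

def hops_to(topology: Topology, switches: Set[str], dest: str) -> Dict[str, int]:
    """Breadth-first hop counts to dest; only switches forward traffic."""
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        for neighbour in topology.get(node, {}).values():
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                if neighbour in switches:
                    queue.append(neighbour)
    return dist

def build_forwarding_table(topology: Topology, switches: Set[str],
                           switch: str, destinations: Iterable[str]) -> Dict[str, int]:
    """Map each destination to an output port of the given switch."""
    routes_per_port = {port: 0 for port in topology[switch]}
    table = {}
    for dest in destinations:
        dist = hops_to(topology, switches, dest)
        candidates = [(dist[nbr] + 1, routes_per_port[port], port)
                      for port, nbr in topology[switch].items()
                      if nbr in dist and (nbr == dest or nbr in switches)]
        if candidates:
            # Fewest hops, then fewest routes on the port, then lowest port.
            _, _, port = min(candidates)
            table[dest] = port
            routes_per_port[port] += 1
    return table

topology = {
    "S1": {1: "S2", 2: "S2", 3: "H1"},
    "S2": {1: "S1", 2: "S1", 3: "H2", 4: "H3"},
    "H1": {1: "S1"}, "H2": {1: "S2"}, "H3": {1: "S2"},
}
table = build_forwarding_table(topology, {"S1", "S2"}, "S1", ["H1", "H2", "H3"])
```

On this two-switch example, H2 and H3 are both two hops away through either inter-switch link, so the tie-break spreads them over ports 1 and 2 of S1.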

  • Pros and cons

    • MinHop
      + any topology
      + LMC support
      + fast
      - not loop-free

    • UpDown
      + any topology
      + LMC support
      + loop-free
      - works only in a topology where there are root or roots are manually set
      - SM must be a leaf
      - roots are hot spots
      - no communication in the spine

    • FatTree
      + optimal performance for fat-tree
      + loop-free
      - only “almost” fat-tree
      - LMC is not supported

    • LASH
      + any topology (without roots)
      + outperforms UpDown
      + loop-free
      - VLs are “wasted”
      - LMC is not supported

    • DOR
      + loop-free
      - unbalanced or under-utilized fabric

  • Min hop

  • Up/down (MLNX)

  • Down/up (PNL)

  • Fat tree (MLNX originated, enhanced by Bull)

  • DOR (Dimension ordered routing) (SGI)

  • Meshes & hypercubes

  • VL Based Routing

  • LASH/mesh (Simula/SystemFabricWorks)

  • SL changes on topology changes

  • torus2qos for 2D/3D torus (Sandia)

  • SL is constant despite many topology changes

  • SSSP/DFSSSP (Dresden University of Technology)

  • Other

  • ucast cache (MLNX)

  • prevents routing recalculation (which is a heavy task in a large cluster) when there was no topology change detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. when one or more CAs/RTRs/leaf switches going down, or one or more of these nodes coming back after being down.

  • Useful for gpxe installations

  • file save/restore

  • Used for experimentation with routing algorithms

  • Doesn’t handle topology changes

  • Activation through OpenSM

  • Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.

  • Use `-a <guid_list_file>' for adding an UPDN guid file that contains the root nodes for ranking.

  • If the `-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.

  • Notes on the guid list file:

    1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
    2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.

  • -X, --guid_routing_order_file

  • Set the order port guids will be routed for the MinHop and Up/Down routing algorithms to the guids provided in the given file (one to a line).

  • -m, --ids_guid_file

  • Name of the map file with a set of IDs which will be used by the Up/Down routing algorithm instead of node GUIDs (format: <guid> <id> per line).


  • Other options

  • Scatter ports

  • Use --scatter-ports » This option randomizes port selection in routing.

  • Port shifting

  • Use --port-shifting » This option enables a feature called port shifting. In some fabrics, particularly cluster environments, routes commonly align and congest with other routes due to algorithmically unchanging traffic patterns. This routing option will "shift" routing around in an attempt to alleviate this problem.


  • Ucast cache

  • Use -A or --ucast_cache » Prevents routing recalculation in simple cases » This option enables the unicast routing cache and prevents routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require a new routing calculation, e.g. when one or more CAs/RTRs/leaf switches go down, or one or more of these nodes come back after being down. A very common case that is handled by the unicast routing cache is host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.


  • Connect roots

  • Use -z or --connect_roots » This option forces the routing engines (up/down and fat-tree) to provide connectivity between root switches, and in this way to be fully IBA compliant. In many cases this can violate the "pure" deadlock-free algorithm, so use it carefully.

  • Addressing

  • MGIDs and MLIDs

  • Limited MLIDs supported in switches

  • consolidate_ipv6_snm_req » IPv6 Solicited Node Multicast on single MLID

  • Spanning tree per MLID

  • Root determination

  • First (by GUID order) switch with the minimal maximal distance to all the group members

  • Root can move

  • Routing is triggered by MC joins/leaves in ULPs or applications

  • IPoIB
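The root determination rule (first switch by GUID order with the minimal maximal distance to all group members) can be sketched as below. The hop-count table is assumed to be available already, and the function name, GUIDs, and member names are hypothetical.

```python
from typing import Dict, Iterable

def pick_mcast_root(distance: Dict[int, Dict[str, int]],
                    members: Iterable[str]) -> int:
    """distance[switch_guid][member] is the hop count from that switch
    to the group member. Returns the GUID of the chosen root switch."""
    members = list(members)
    # min() keeps the first of equal candidates, so iterating in sorted
    # GUID order makes the lowest GUID win ties.
    return min(sorted(distance),
               key=lambda guid: max(distance[guid][m] for m in members))

distance = {
    0x10: {"a": 1, "b": 3},   # worst case 3 hops
    0x20: {"a": 2, "b": 2},   # worst case 2 hops
    0x30: {"a": 2, "b": 2},   # worst case 2 hops, but higher GUID
}
root = pick_mcast_root(distance, ["a", "b"])
```

Here `root` is 0x20. Because the choice depends on the current membership, a join or leave can change the worst-case distances and move the root.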

  • Failover/handover

  • Client reregistration mechanism

  • Need to be able to migrate the following during OpenSM failover / handover:

  1. OpenSM configuration
  2. GUID-2-LID assignment
  3. Full Multicast group membership
  4. ServiceRecord registrations
  5. InformInfo registrations
  • Relying on Client ReRegister support by ULPs/application, the list becomes shorter:
  1. OpenSM configuration
  2. GUID-2-LID assignment
  3. MGID-2-MLID assignment
  • Current solution is hybrid

  • Replicating the basic OpenSM and fabric configuration

  • OpenSM configuration

  • GUID-2-LID assignment

  • Replicating all the SA clients’ info

  • Full Multicast group membership

  • ServiceRecord registrations

  • InformInfo registrations

  • Requiring SA Clients to ReRegister on OpenSM handover/failover

  • Replicating doesn’t lock SA so need to cover the possible holes

  • FDR and FDR-10 support (OpenSM 3.3.11 – Aug 2011)

  • FDR (and EDR) are IBTA standards

  • FDR-10 is MLNX proprietary

  • SRIOV support (OpenSM 3.3.14 – May 2012)

  • Additional GUIDs for virtual machines

  • Minor impact on partition manager

  • SA scalability (future)

  • Distributed SA

  • A step towards Exascale

  • Ported from Upstream Linux Version

  • Now based on OpenSM 3.3.13

  • Switched from use of IBAL vendor layer to UMAD vendor layer so consistent with Linux vendor layer

  • libibumad ported to Windows

  • Also, most of infiniband-diags and libibmad ported to Windows

  • Mainly missing shell script based infiniband-diags tools

  • Fabric discovery

  • Polls each port’s PortCounters periodically

  • Recently enhanced for optional PortCountersExtended

  • 64 bit counters for data

  • Performs data reduction on counters

  • Logs performance data and indicates interesting events

  • See doc/perf-manager-arch.txt and doc/performance-manager-HOWTO.txt

  • Recently added (in OpenSM 3.3.15 released August 2012)

  • Currently experimental status

  • Default mode is disabled

  • Currently lacking congestion log monitoring and SwitchPortCongestionSetting attribute support

  • Also, CongestionKeyInfo and trap logging

Thank You