硬中断亲和性 IRQ Affinity

1,800 阅读4分钟

irqbalance

rqbalance(用户层) 是Linux内核自带的一个中断调节服务,它根据网络负载情况自动将网卡中断和CPU进行亲和请设置。事实上它就是 自动的 smp irq affinity 它有很好的动态调节效果,但是对于大量小包的网络环境,irqbalance几乎是无效的。

特点: 在linux系统网络设备上,使用irqbalance满足普通场景,但对于高负载网络环境,有时它并不能完全利用CPU的处理能力。

SMP IRQ Affinity

多处理器系统上,配置中断和CPU的亲和性,有多个 CPU 处理器 的 系统中, Linux 内核需要处理的问题 :

① 公平共享 : CPU 的负载,需要公平地共享, 不能出现某个 CPU 空闲, 造成资源浪费 ;

② 可设置进程 与 CPU 亲和性 : 可以为 某些类型的 进程 与 指定的 处理器 设置 亲和性 , 可以针对性地匹配 进程 与 处理器 ;

③ 进程迁移 : Linux 内核可以将 进程 在 不同的 CPU 处理器之间进行迁移 ;

Linux内核的 SMP 对称多处理器结构调度, 核心就是将进程迁移到合适的处理器上, 并且可以保持各个处理器的负载均衡 ;

手动配置中断亲和性

备注:需要停掉irqbalance,否则会覆盖掉手动配置。

在网络非常 heavy 的情况下,对于文件服务器、高流量 Web 服务器这样的应用来说,把不同的网卡 IRQ 均衡绑定到不同的 CPU 上将会减轻某个CPU 的负担,提高多个 CPU 整体处理中断的能力。合理的根据自己的生产环境和应用的特点来平衡 IRQ 中断有助于提高系统的整体吞吐能力和性能。

gist.github.com/SaveTheRbtz…

在服务器上执行命令./set_irq_affinity.sh -x local eth0

#!/bin/bash
#
# Copyright (c) 2015, Intel Corporation
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
#     * Redistributions of source code must retain the above copyright notice,
#       this list of conditions and the following disclaimer.
#     * Redistributions in binary form must reproduce the above copyright
#       notice, this list of conditions and the following disclaimer in the
#       documentation and/or other materials provided with the distribution.
#     * Neither the name of Intel Corporation nor the names of its contributors
#       may be used to endorse or promote products derived from this software
#       without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Affinitize interrupts to cores
#
# typical usage is (as root):
# set_irq_affinity -x local eth1 <eth2> <eth3>
#
# to get help:
# set_irq_affinity

usage()
{
	echo
	echo "Usage: $0 [-x|-X] {all|local|remote|one|custom} [ethX] <[ethY]>"
	echo "	options: -x		Configure XPS as well as smp_affinity"
	echo "	options: -X		Disable XPS but set smp_affinity"
	echo "	options: {remote|one} can be followed by a specific node number"
	echo "	Ex: $0 local eth0"
	echo "	Ex: $0 remote 1 eth0"
	echo "	Ex: $0 custom eth0 eth1"
	echo "	Ex: $0 0-7,16-23 eth0"
	echo
	exit 1
}

usageX()
{
	echo "options -x and -X cannot both be specified, pick one"
	exit 1
}

if [ "$1" == "-x" ]; then
	XPS_ENA=1
	shift
fi

if [ "$1" == "-X" ]; then
	if [ -n "$XPS_ENA" ]; then
		usageX
	fi
	XPS_DIS=2
	shift
fi

if [ "$1" == -x ]; then
	usageX
fi

if [ -n "$XPS_ENA" ] && [ -n "$XPS_DIS" ]; then
	usageX
fi

if [ -z "$XPS_ENA" ]; then
	XPS_ENA=$XPS_DIS
fi

num='^[0-9]+$'
# Vars
AFF=$1
shift

case "$AFF" in
    remote)	[[ $1 =~ $num ]] && rnode=$1 && shift ;;
    one)	[[ $1 =~ $num ]] && cnt=$1 && shift ;;
    all)	;;
    local)	;;
    custom)	;;
    [0-9]*)	;;
    -h|--help)	usage ;;
    "")		usage ;;
    *)		IFACES=$AFF && AFF=all ;;	# Backwards compat mode
esac

# append the interfaces listed to the string with spaces
while [ "$#" -ne "0" ] ; do
	IFACES+=" $1"
	shift
done

# for now the user must specify interfaces
if [ -z "$IFACES" ]; then
	usage
	exit 1
fi

# support functions

set_affinity()
{
	VEC=$core
	if [ $VEC -ge 32 ]
	then
		MASK_FILL=""
		MASK_ZERO="00000000"
		let "IDX = $VEC / 32"
		for ((i=1; i<=$IDX;i++))
		do
			MASK_FILL="${MASK_FILL},${MASK_ZERO}"
		done

		let "VEC -= 32 * $IDX"
		MASK_TMP=$((1<<$VEC))
		MASK=$(printf "%X%s" $MASK_TMP $MASK_FILL)
	else
		MASK_TMP=$((1<<$VEC))
		MASK=$(printf "%X" $MASK_TMP)
	fi

	printf "%s" $MASK > /proc/irq/$IRQ/smp_affinity
	printf "%s %d %s -> /proc/irq/$IRQ/smp_affinity\n" $IFACE $core $MASK
	case "$XPS_ENA" in
	1)
		printf "%s %d %s -> /sys/class/net/%s/queues/tx-%d/xps_cpus\n" $IFACE $core $MASK $IFACE $((n-1))
		printf "%s" $MASK > /sys/class/net/$IFACE/queues/tx-$((n-1))/xps_cpus
	;;
	2)
		MASK=0
		printf "%s %d %s -> /sys/class/net/%s/queues/tx-%d/xps_cpus\n" $IFACE $core $MASK $IFACE $((n-1))
		printf "%s" $MASK > /sys/class/net/$IFACE/queues/tx-$((n-1))/xps_cpus
	;;
	*)
	esac
}

# Allow usage of , or -
#
parse_range () {
        RANGE=${@//,/ }
        RANGE=${RANGE//-/..}
        LIST=""
        for r in $RANGE; do
		# eval lets us use vars in {#..#} range
                [[ $r =~ '..' ]] && r="$(eval echo {$r})"
		LIST+=" $r"
        done
	echo $LIST
}

# Affinitize interrupts
#
setaff()
{
	CORES=$(parse_range $CORES)
	ncores=$(echo $CORES | wc -w)
	n=1

	# this script only supports interrupt vectors in pairs,
	# modification would be required to support a single Tx or Rx queue
	# per interrupt vector

	queues="${IFACE}-.*TxRx"

	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
	                         do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
	                         done)
	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE"

	echo "IFACE CORE MASK -> FILE"
	echo "======================="
	for IRQ in $irqs; do
		[ "$n" -gt "$ncores" ] && n=1
		j=1
		# much faster than calling cut for each
		for i in $CORES; do
			[ $((j++)) -ge $n ] && break
		done
		core=$i
		set_affinity
		((n++))
	done
}

# now the actual useful bits of code

# these next 2 lines would allow script to auto-determine interfaces
#[ -z "$IFACES" ] && IFACES=$(ls /sys/class/net)
#[ -z "$IFACES" ] && echo "Error: No interfaces up" && exit 1

# echo IFACES is $IFACES

CORES=$(</sys/devices/system/cpu/online)
[ "$CORES" ] || CORES=$(grep ^proc /proc/cpuinfo | cut -f2 -d:)

# Core list for each node from sysfs
node_dir=/sys/devices/system/node
for i in $(ls -d $node_dir/node*); do
	i=${i/*node/}
	corelist[$i]=$(<$node_dir/node${i}/cpulist)
done

for IFACE in $IFACES; do
	# echo $IFACE being modified

	dev_dir=/sys/class/net/$IFACE/device
	[ -e $dev_dir/numa_node ] && node=$(<$dev_dir/numa_node)
	[ "$node" ] && [ "$node" -gt 0 ] || node=0

	case "$AFF" in
	local)
		CORES=${corelist[$node]}
	;;
	remote)
		[ "$rnode" ] || { [ $node -eq 0 ] && rnode=1 || rnode=0; }
		CORES=${corelist[$rnode]}
	;;
	one)
		[ -n "$cnt" ] || cnt=0
		CORES=$cnt
	;;
	all)
		CORES=$CORES
	;;
	custom)
		echo -n "Input cores for $IFACE (ex. 0-7,15-23): "
		read CORES
	;;
	[0-9]*)
		CORES=$AFF
	;;
	*)
		usage
		exit 1
	;;
	esac

	# call the worker function
	setaff
done

# check for irqbalance running
IRQBALANCE_ON=`ps ax | grep -v grep | grep -q irqbalance; echo $?`
if [ "$IRQBALANCE_ON" == "0" ] ; then
	echo " WARNING: irqbalance is running and will"
	echo "          likely override this script's affinitization."
	echo "          Please stop the irqbalance service and/or execute"
	echo "          'killall irqbalance'"
fi

cat /proc/interrupts | grep eth4

image.png

cat /proc/irq/185/smp_affinity

image.png

计算 SMP IRQ Affinity

“echo 2 > /proc/irq/90/smp_affinity” 中的 ”2“ 是怎么来的,这其实是个二进制数字,代表 00000010,00000001 代表 CPU0 的话,00000010 就代表 CPU1, “echo 2 > /proc/irq/90/smp_affinity” 的意思就是说把 90 中断绑定到 00000010(CPU1)上。所以各个 CPU 用二进制和十六进制表示就是:

               Binary       Hex   
    CPU 0    00000001         1   
    CPU 1    00000010         2  
    CPU 2    00000100         4  
    CPU 3    00001000         8  

如果我想把 IRQ 绑定到 CPU2 上就是 00000100=4:

# echo "4" > /proc/irq/90/smp_affinity  

如果我想把 IRQ 同时平衡到 CPU0 和 CPU2 上就是 00000001+00000100=00000101=5

# echo "5" > /proc/irq/90/smp_affinity