Adapting A-Tune to Ubuntu and Tuning HPL with It

A-Tune

A-Tune Overview

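A-Tune is an AI-based system performance optimization engine. It builds a precise system profile of the workload, detects and infers the workload's characteristics, and then matches and recommends the best combination of system parameters so that the workload runs in its best state. A-Tune's core architecture (see the architecture figure) consists of three layers: intelligent decision, system profiling, and interaction system.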

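Intelligent decision layer: contains an awareness subsystem and a decision subsystem, which detect the application and decide how to tune the system, respectively.
System profiling layer: consists mainly of automatic feature engineering and a two-layer classification model. Feature engineering selects workload features automatically; the two-layer classifier learns and classifies workload models.
Interaction system layer: monitors and configures system resources. Tuning policies are executed in this layer.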

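A-Tune provides the following main functions:

It identifies the current workload type from real-time statistics and supports custom models.
It ships default tuning parameters and tuning scripts for many application scenarios, so they work out of the box.
It provides an offline tuning framework that removes the manual loop of adjusting parameters and evaluating performance. For searching custom parameters, the user defines the search space and A-Tune supplies the tuning algorithms and presents the results.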

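A-Tune Modules

A-Tune consists of four parts: atune-adm, atune Service, atune Rest, and atune AI Engine.

atune-adm is the project's command-line program and the entry point for user interaction. When the user enters a command, atune-adm talks over gRPC to atune Service, which actually provides the content, so presentation and processing are decoupled.

atune Service is the key module: it serves gRPC calls from the layer above and also calls atune Rest and atune AI Engine, handling the results those lower-level modules return.

atune Rest is implemented in Python. It collects system data through the atune_collector library, performs create, read, update, and delete operations on profiles stored in SQLite, and listens on port 8383.

atune AI Engine is the "brain" of A-Tune. It ships with several pre-built application models and can identify the running application from the collected system data. It also supports custom models: users can collect their own data and train a model suited to their scenario.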

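The following walks through the execution of the atune-adm analysis command as an example.

①　atune-adm parses the input arguments into internal variables. It then opens a gRPC channel, calls the Analysis function of atune Service, and keeps listening for returned results, printing them to the console.
②　atune Service validates the arguments it received, then calls atune Rest to collect system information, which is stored in a CSV file.
③　atune Service calls the atune AI Engine interface to identify the current workload.
④　atune Service queries the SQLite database to check whether the identified workload is defined, looks up the preset profile settings, and activates the matching configuration to complete the optimization.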

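Adapting to Ubuntu

A-Tune was developed on openEuler, so running it on Ubuntu requires substitute components. According to the installation guide, the following conditions must be met to install and run it:

Install the packages and Python dependencies as root.
Golang version >= 1.15, added to the environment variables.
Enable the collector and configure the log-writing rules.
Linux kernel >= 5.4.0 (an approximate threshold, but Ubuntu 18.04 Server did not work in testing).

Based on these requirements, I wrote a script that prepares the installation environment on Ubuntu in one step.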
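
The checks implied by the list above can be sketched in a few lines of shell. This is a minimal illustration, not the original setup script: the function names `version_ge` and `check_prereqs` are my own, version comparison relies on GNU `sort -V`, kernel 5.4.0 itself is treated as acceptable, and the collector and logging steps are not covered.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when version A >= version B (needs GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_prereqs: report every unmet requirement, return non-zero if any failed.
check_prereqs() {
    local failed=0 go_ver kernel_ver

    # Packages and Python dependencies must be installed as root.
    [ "$(id -u)" -eq 0 ] || { echo "must run as root" >&2; failed=1; }

    # `go version` prints e.g. "go version go1.15.6 linux/amd64".
    if command -v go >/dev/null 2>&1; then
        go_ver=$(go version | awk '{print $3}' | sed 's/^go//')
        version_ge "$go_ver" "1.15" || { echo "Go >= 1.15 required, found ${go_ver}" >&2; failed=1; }
    else
        echo "go not found in PATH" >&2
        failed=1
    fi

    # `uname -r` prints e.g. "5.4.0-42-generic"; keep only the numeric part.
    kernel_ver=$(uname -r | cut -d- -f1)
    version_ge "$kernel_ver" "5.4.0" || { echo "kernel >= 5.4.0 required, found ${kernel_ver}" >&2; failed=1; }

    return "$failed"
}

# Usage (commented out so sourcing this file has no side effects):
# check_prereqs && echo "environment looks ready"
```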

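Tuning Nginx with A-Tune

Nginx is a lightweight web server and reverse proxy. Because it uses little memory, starts very quickly, and handles high concurrency well, it is widely used in internet projects. Improving Nginx's throughput effectively raises the concurrency of the whole system, so tuning Nginx is a must-have skill for operations engineers. The test environment is shown in the table below.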

Operating system	Ubuntu 18.04.2
Linux kernel version	5.4.0
CPU	Intel(R) Xeon(R) Gold 5118 CPU @ 2.30 GHz × 2
RAM	64 GB DDR4
Storage	1 TB HDD
Benchmark tool	httpress

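A-Tune can already tune Nginx by searching the parameter space for the best combination. Out of the box it tunes the parameters in the following table.

Parameter	Search space	Description
nginx.access_log	/var/log/nginx/access.log off	Path of the nginx access log
nginx.error_log	/var/log/nginx/error.log /dev/null	Path of the nginx error log
net.core.netdev_max_backlog	1000~100000	When the NIC receives packets faster than the kernel can process them, the packets are held in a queue. This parameter is the maximum length of that queue
net.ipv4.tcp_keepalive_time	600~36000	Idle time, in seconds, before TCP starts sending keepalive probes
net.core.rmem_max	1048576~67108864	Maximum size of the kernel socket receive buffer
net.core.wmem_max	1048576~67108864	Maximum size of the kernel socket send buffer
net.ipv4.tcp_tw_reuse	0 1 2	Whether sockets in TIME-WAIT state may be reused for new TCP connections
net.ipv4.ip_local_port_range	32768 60999, 1024 65535, 8192 65535	Range of local ports used for UDP and TCP connections
net.ipv4.tcp_max_tw_buckets	32768~1048576	Maximum number of TIME_WAIT sockets the operating system allows
net.core.somaxconn	128~65536	Upper limit on the backlog of a listening socket, that is, the maximum number of connections waiting to be accepted
net.ipv4.tcp_max_syn_backlog	1024~262144	Maximum length of the queue of received SYN requests during the TCP three-way handshake
net.ipv4.tcp_fin_timeout	1~120	Maximum time a socket stays in FIN-WAIT-2 state when the server actively closes the connection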

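In addition to the default tuning parameters, I extended the list of searchable parameters. The configuration lives under /etc/atuned/tuning/nginx.

Parameter	Search space	Description
nginx.worker_processes	8~32	Number of worker processes
nginx.accept_mutex	on off	Whether worker processes take turns accepting new connections
nginx.multi_accept	on off	Whether a worker accepts all pending new connections at once
nginx.worker_connections	768~65535	Number of connections one worker process handles
net.core.rmem_default	212992~400000	Default size of the kernel socket receive buffer
net.core.wmem_default	212992~400000	Default size of the kernel socket send buffer
fs.file-max	805968~1000000	Maximum number of file handles the system can have open at the same time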

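Single httpress runs showed large jitter, so each configuration was tested five times; the minimum and the maximum were discarded and the remaining three results averaged. The results are shown in the figure below.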

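In the figure, base is the result of tuning with A-Tune's default parameters and pro is the result after extending the parameter space. The figure supports the following conclusions:
1) A-Tune's tuning works and improves Nginx throughput.
2) With a small parameter space, raising the number of iterations brings no gain; after adding new parameters and enlarging the space, raising the number of iterations has a clear effect.
3) Adding suitable new parameters improves the tuning result.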
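
The averaging rule used for each configuration (five runs, drop one minimum and one maximum, average the rest) can be sketched as below. The function name `trimmed_mean` and the sample numbers are made up for illustration; they are not measurements from this experiment.

```shell
#!/usr/bin/env bash
# trimmed_mean V1 V2 ...: drop one minimum and one maximum, average the rest.
trimmed_mean() {
    if [ "$#" -lt 3 ]; then
        echo "need at least 3 values" >&2
        return 1
    fi
    printf '%s\n' "$@" \
        | sort -n \
        | sed '1d;$d' \
        | awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}

# Five hypothetical requests-per-second readings, one of them an outlier.
trimmed_mean 1200.5 1100 1300 5000 900
```

The outlier 5000 and the low reading 900 are both discarded, so a single noisy run no longer dominates the score that the tuner sees.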

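HPL

HPL (The High-Performance Linpack Benchmark) is a benchmark for the floating-point performance of high-performance computing clusters. It solves a dense system of N linear equations in N unknowns by Gaussian elimination and uses the run to rate the cluster's floating-point capability.

The theoretical floating-point peak is the number of floating-point operations the machine could in theory complete per second. It is determined by the CPU clock frequency, the core count, and the per-cycle floating-point throughput.
Theoretical peak = CPU clock frequency × number of CPU cores × floating-point operations per cycle per core (DP FLOPs/cycle).
Further reading: how to calculate the theoretical floating-point peak of your own server.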
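
As a worked example of the formula for the test machine listed earlier (2 × Xeon Gold 5118 at 2.30 GHz), the sketch below assumes 12 cores per socket and 16 double-precision FLOPs per cycle per core (one AVX-512 FMA unit). Both numbers are my assumptions and should be checked against Intel's specifications. Sustained AVX-512 clocks are usually below the nominal frequency, so the result is an upper bound. The helper name `peak_gflops` is hypothetical.

```shell
#!/usr/bin/env bash
# peak_gflops FREQ_GHZ TOTAL_CORES FLOPS_PER_CYCLE
# Prints frequency x cores x FLOPs-per-cycle in GFLOPS.
peak_gflops() {
    awk -v f="$1" -v c="$2" -v p="$3" 'BEGIN { printf "%.1f\n", f * c * p }'
}

# 2.3 GHz x (2 sockets x 12 cores, assumed) x 16 DP FLOPs/cycle (assumed)
peak_gflops 2.3 24 16    # prints 883.2
```

Dividing a measured HPL result by this figure gives the efficiency of a run, which is a convenient way to compare BLAS and MPI combinations on the same machine.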

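Installing the HPL Environment

HPL lets users pick their own MPI and BLAS libraries to get the best performance.

BLAS (Basic Linear Algebra Subprograms) is an API standard that specifies a set of low-level operations such as vector-vector and matrix-matrix multiplication. Many numerical libraries implement it, including OpenBLAS, GotoBLAS, ATLAS, and Intel MKL.

MPI (Message Passing Interface) is the basic software environment for communication across nodes. It provides APIs that let cooperating processes communicate and synchronize. Common MPI libraries are MPICH3, OpenMPI, and Intel MPI.

Commands to install and test HPL: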

sudo apt-get install -y gcc g++ gfortran

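cd ~
wget https://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz
# Unpack the archive before renaming the directory
tar -xzf hpl-2.3.tar.gz
mv hpl-2.3 hpl
cd hpl/setup
sh make_generic
cp Make.UNKNOWN ../Make.linux
cd ../

# Manual step: edit the settings in Make.linux, then build

make arch=linux -j $(nproc)
cd bin/linux

# Manual step: edit HPL.dat, then run

mpiexec -n 8 ./xhpl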

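Installing the BLAS and MPI Libraries

Download link for all the libraries; they worked without problems for me.
Link: pan.baidu.com/s/1S4KgWV44… Extraction code: lcnm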

gotoBLAS

# copy lapack to the project dir
mv ../lapack-3.1.1.tgz ./lapack
# Edit f_check so that the block below reads as shown, then save and quit (:wq)
vim f_check
# if (($linker_l ne "") || ($linker_a ne "")) {
#    print MAKEFILE "FEXTRALIB=$linker_L  -lgfortran -lm -lquadmath $linker_a\n";
# }

make CC=gcc BINARY=64 TARGET=NEHALEM

openBLAS

cd OpenBLAS-0.3.20
make

Atlas

sudo apt-get install -y libatlas-base-dev

INTEL-MKL

www.intel.com/content/www…
Find and download oneMKL, then run:

sh xxx.sh

Installing MPI

MPICH3

sudo ./configure --prefix=/usr/local/mpich
make -j $(nproc)
make install

OpenMPI

./configure --prefix="/root/MPI/openmpi"
make -j $(nproc)
make install

Intel-MPI

www.intel.com/content/www…
Find and download the MPI Library, then run:

sh xxx.sh

Choosing the Best Components

How to determine a baseline HPL.dat:
ulhpc-tutorials.readthedocs.io/en/latest/p… help.aliyun.com/document_de…
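
A common rule of thumb for the baseline is to size the N × N matrix of 8-byte doubles to about 80% of physical memory and then round N down to a multiple of NB. The sketch below applies that rule. The helper name `suggest_n`, the 80% default, and the NB value of 192 are my assumptions for illustration, not values taken from the linked guides.

```shell
#!/usr/bin/env bash
# suggest_n MEM_GIB NB [FRACTION]
# Largest multiple of NB such that an N x N matrix of doubles (8 bytes each)
# fits in FRACTION (default 0.8) of MEM_GIB gibibytes.
suggest_n() {
    awk -v mem="$1" -v nb="$2" -v frac="${3:-0.8}" 'BEGIN {
        n = sqrt(mem * 1024 * 1024 * 1024 * frac / 8)
        printf "%d\n", int(n / nb) * nb
    }'
}

# 64 GiB of RAM as on the test machine, with an assumed NB of 192
suggest_n 64 192    # prints 82752
```

Starting somewhat below this value leaves headroom for the operating system and the MPI runtime; a run that starts swapping loses far more than a slightly smaller N costs.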

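Intel MPI could not be combined with the other BLAS libraries. It appears to need Intel's own compiler suite, which is a commercial product, so I gave up on that and tested Intel MPI only together with Intel MKL, using the following commands:

# Intel MPI installation path
cd xx/xx/benchmark/mp_linpack
# Load the environment from the vendor script; do not set the variables by hand
source /opt/intel/oneapi/mpi/latest/env/vars.sh
./build.sh

The other libraries were tested in pairwise combinations to compare the results: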

#!/usr/bin/env bash
# set -v on
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:~/bin
export PATH

#=================================================
#	System Required: Ubuntu 18.04
#	Description: generate Make.intel for different BLAS and MPI
#       Time: 2022_4_3
#	Version: 1.0.1
#	Author: no_one
#=================================================

Green_font_prefix="\033[32m" && Red_font_prefix="\033[31m" && Green_background_prefix="\033[42;37m" && Red_background_prefix="\033[41;37m" && Font_color_suffix="\033[0m"
Info="${Green_font_prefix}[INFO]${Font_color_suffix}"
Error="${Red_font_prefix}[ERROR]${Font_color_suffix}"
Tip="${Green_font_prefix}[NOTE]${Font_color_suffix}"

MPdir=""
MPinc=""
MPlib=""
LAdir=""
LAinc=""
LAlib=""
BinDir=""
# export LD_LIBRARY_PATH

# activate BLASes
activate_gotoBlas(){
    echo -e "${Info} set gotoblas env..."
    LAdir=""
    LAinc=""
    LAlib="/root/blas/GotoBLAS2-1.13/GotoBLAS2/libgoto2.a -lpthread -lm" 
}

activate_openBlas(){
    echo -e "${Info} set openblas env..."
    LAdir=""
    LAinc=""
    LAlib="/root/blas/OpenBLAS-0.3.20/libopenblas.a -lpthread -lm" 
}

activate_atlas(){
    echo -e "${Info} set atlas env..."
    LAdir=""
    LAinc="-I/usr/include/x86_64-linux-gnu/atlas"
    LAlib="-lblas" 
}

activate_mkl(){
    echo -e "${Info} set mkl env..."
    LAdir="/opt/intel/oneapi/mkl/latest"
    LAinc='-I$(LAdir)/include'
    LAlib='-Wl,--start-group $(LAdir)/lib/intel64/libmkl_intel_lp64.a $(LAdir)/lib/intel64/libmkl_sequential.a $(LAdir)/lib/intel64/libmkl_core.a -Wl,--end-group -ldl -lpthread -lm'
}

# activate MPIs
activate_mpich3(){
    echo -e "${Info} set mpich3 env..."
    export LD_LIBRARY_PATH=/root/mpi/env/mpich/lib/
    MPdir=""
    MPinc="-I/root/mpi/env/mpich/include/"
    MPlib="/root/mpi/env/mpich/lib/libmpich.so -L/root/mpi/env/mpich/lib/"
    BinDir="/root/mpi/env/mpich/bin"
}

activate_openmpi(){
    echo -e "${Info} set openmpi env..."
    MPdir=""
    MPinc="-I/root/mpi/env/openmpi/include/"
    MPlib="/root/mpi/env/openmpi/lib/libmpi.so"
    BinDir="/root/mpi/env/openmpi/bin"   
    export OMPI_ALLOW_RUN_AS_ROOT=1
    export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
}

activate_oneapi(){
    echo -e "${Info} set oneapi-mpi env..."
    MPdir="/opt/intel/oneapi/mpi/latest"
    MPinc='-I$(MPdir)/include/'
    MPlib='$(MPdir)/lib/libmpicxx.a -L/opt/intel/oneapi/mpi/latest/lib/ -L/opt/intel/oneapi/mpi/latest/lib/release'
    BinDir="/opt/intel/oneapi/mpi/latest/bin"
    export LD_LIBRARY_PATH=/opt/intel/oneapi/mpi/latest/lib/release
}

generate_makefile(){
    # The template for Make.intel must be copied into the working directory by hand first
    cp ./tempalte.intel ./Make.intel
    echo -e "${Info} ${BinDir} with ${LAinc}/${LAlib}"
    sed -i "s#^MPdir\s*=.*#MPdir = ${MPdir}#g" ./Make.intel
    sed -i "s#^MPinc\s*=.*#MPinc = ${MPinc}#g" ./Make.intel
    sed -i "s#^MPlib\s*=.*#MPlib = ${MPlib}#g" ./Make.intel

    sed -i "s#^LAdir\s*=.*#LAdir = ${LAdir}#g" ./Make.intel
    sed -i "s#^LAinc\s*=.*#LAinc = ${LAinc}#g" ./Make.intel
    sed -i "s#^LAlib\s*=.*#LAlib = ${LAlib}#g" ./Make.intel

    sed -i "s#^CC\s*=.*#CC = ${BinDir}/mpicc#g" ./Make.intel
    sed -i "s#^LINKER\s*=.*#LINKER = ${BinDir}/mpif77#g" ./Make.intel
}

make_file(){
    make arch=intel clean > /dev/null 2>&1
    # Stop with a non-zero status if the build fails; details are in make.log
    if ! make arch=intel -j "$(nproc)" -w >> make.log 2>&1; then
        echo -e "${Error} build failed, see make.log"
        exit 1
    fi
}

run_test(){
    cp ./bin/HPL.dat ./bin/intel/
    cd ./bin/intel || exit
    ${BinDir}/mpiexec -n 8 ./xhpl
}

run(){
    generate_makefile
    make_file
    run_test
    cd /root/hpl || exit
}
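
The script above only defines functions; nothing calls them yet. A driver that walks through the pairings could look like the sketch below. It assumes the `activate_*` and `run` functions above are already defined in the same shell, and the name `run_all_combos` is my own.

```shell
#!/usr/bin/env bash
# run_all_combos "BLAS names" "MPI names"
# For every pairing, call activate_<blas>, activate_<mpi>, then run.
run_all_combos() {
    local blas_list=$1 mpi_list=$2 blas mpi
    for blas in $blas_list; do
        for mpi in $mpi_list; do
            echo "=== ${blas} + ${mpi} ==="
            # The subshell keeps exported variables (LD_LIBRARY_PATH etc.)
            # and any `exit` inside make_file from affecting the next pairing.
            (
                "activate_${blas}" && "activate_${mpi}" && run
            ) || echo "combination ${blas} + ${mpi} failed" >&2
        done
    done
}

# Example call (commented out so sourcing this file has no side effects):
# run_all_combos "gotoBlas openBlas atlas mkl" "mpich3 openmpi"
```

Running each pairing in a subshell matters here because `activate_mpich3` exports `LD_LIBRARY_PATH`, which would otherwise still be set when the OpenMPI build runs.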

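PS: changing the gcc flags can make a surprising difference, for example:

Flag	Description
-fomit-frame-pointer	Avoids setting up and saving the frame pointer on function calls
-O3	Enables more aggressive optimization, increasing how much of the code can execute in parallel
-funroll-loops	Lets the compiler decide heuristically which loops to unroll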

Tuning HPL with A-Tune

This mainly uses A-Tune's offline tuning commands:

# Start to tuning
atune-adm tuning --project hpl --detail hpl_client.yaml
# Restore the environment
atune-adm tuning --restore --project hpl

Create the client configuration file that is passed on the command line:

$ cat hpl_client.yaml
project: "hpl"
engine : "gbrt"
iterations : 30
random_starts : 10

benchmark : "sh /root/hpl/bin/intel/run.sh"
evaluations :
  -
    name: "Gflops"
    info:
        get: "echo '$out' | grep 'WC11C2R4' | awk '{print $7}'"
        type: "negative"
        weight: 100
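
The `get` command above picks the seventh whitespace-separated field of the HPL result line, which is the Gflops column (after T/V, N, NB, P, Q, and Time). The sketch below runs the same pipeline on a made-up result line; the numbers in `sample` are illustrative only, and the `WC11C2R4` label has to match the T/V string that your own HPL.dat settings produce.

```shell
#!/usr/bin/env bash
# extract_gflops: read HPL output on stdin and print the Gflops column
# (7th field) of the result line labelled WC11C2R4.
extract_gflops() {
    grep 'WC11C2R4' | awk '{print $7}'
}

# Made-up result line in HPL's column order: T/V N NB P Q Time Gflops
sample='WC11C2R4       64880   192     2     4             412.35             4.4155e+02'
echo "$sample" | extract_gflops    # prints 4.4155e+02
```

If the label does not match, the pipeline prints nothing and the tuner receives an empty score, so it is worth running this check by hand once before starting a long tuning session.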

On the server side, create a directory named hpl under /etc/atuned/tuning/ and add the parameter file:

$ cat tuning_params_hpl.yaml
project: hpl
maxiterations: 100
startworkload: ''
stopworkload: ''
object:
- name: problems.N
  info:
    desc: The size of problems
    get: cat /root/hpl/bin/intel/HPL.dat |awk '/Ns$/{print}'|awk '{print $1}'
    set: sed -i 's#.*Ns$#$value    Ns#g' /root/hpl/bin/intel/HPL.dat
    needrestart: 'false'
    type: discrete
    scope:
    - 64880
    - 78776
    step: 2024
    items: null
    dtype: int
- name: problems.NB
  info:
    desc: The minimum granularity of the calculation
    get: cat /root/hpl/bin/intel/HPL.dat |awk '/NBs$/{print}'|awk '{print $1}'
    set: sed -i 's#.*NBs$#$value    NBs#g' /root/hpl/bin/intel/HPL.dat
    needrestart: 'false'
    type: discrete
    scope:
    - 96
    - 256
    step: 16
    items: null
    dtype: int
- name: vm.swappiness
  info:
    desc: A larger value indicates that the swap partition is used more actively.
      A smaller value indicates that the memory is used more actively.
    get: sysctl -n vm.swappiness
    set: sysctl -w vm.swappiness=$value
    needrestart: 'false'
    type: discrete
    scope:
    - 0
    - 60
    step: 10
    items: null
    dtype: int
- name: kernel.randomize_va_space
  info:
    desc: Setting Memory Address Randomization
    get: sysctl -n kernel.randomize_va_space
    set: sysctl -w kernel.randomize_va_space=$value
    needrestart: 'false'
    type: discrete
    scope:
    - 0
    - 2
    step: 1
    items: null
    dtype: int
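
To get a feel for the size of this search space, the sketch below counts the grid points of each parameter, assuming the values are generated as low + k × step without exceeding the upper bound. That generation rule is my assumption; how A-Tune itself treats an upper bound that is not on the grid is not covered here. The helper name `grid_size` is hypothetical.

```shell
#!/usr/bin/env bash
# grid_size LOW HIGH STEP: how many values LOW, LOW+STEP, ... do not exceed HIGH.
grid_size() {
    seq "$1" "$3" "$2" | wc -l | tr -d ' '
}

n=$(grid_size 64880 78776 2024)    # problems.N
nb=$(grid_size 96 256 16)          # problems.NB
sw=$(grid_size 0 60 10)            # vm.swappiness
aslr=$(grid_size 0 2 1)            # kernel.randomize_va_space
echo "N=${n} NB=${nb} swappiness=${sw} randomize_va_space=${aslr} total=$(( n * nb * sw * aslr ))"

# Largest N on the grid: 78776 is not reachable from 64880 in steps of 2024.
seq 64880 2024 78776 | tail -n 1
```

Under this assumption the space has 1617 combinations, while the client file above asks for 30 iterations, so the search samples only a small part of it. The last line also shows that the upper bound for N falls between two grid points.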

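After that, run the offline tuning command shown earlier.
The improvement was small, so I am not posting the results.

Takeaways

VMware is not suited to HPL testing: it "virtualizes" a CPU by time slicing, and the result also depends on the total number of CPUs allocated to the VM.
A-Tune does not support dependent variables: it cannot yet handle tuned parameters that are related to one another, for example linearly.
(All in all this took about two weeks of fiddling. Exhausting.)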
