Environment and Tools

Linux Ubuntu

The lm-sensors package is installed, so sensors can be used to check CPU status

The NVIDIA GPU driver is installed, so nvidia-smi can be used to check GPU status
WXPusher

All you need is to register as a developer, which anyone can do

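An Introduction to WXPusher

Here is my summary. If we want to build the simplest one-way push (application → user), the general idea is as shown in the figure below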

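First, a developer can create applications, and each application has its own API token for making calls.

An application can contain multiple topics. For example, suppose your application monitors when shopping festivals start on several major platforms, but not every user wants to follow all of them: some only want the Taobao updates, some only the JD.com updates, and others only care about Pinduoduo. In that case you can group the users, which in practice means placing each group in its own "topic".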

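If users simply subscribe to the application, the developer typically assembles the message and passes an explicit list of user IDs as a parameter. WXPusher then pushes only to the users in that list. (Yes: even if every user has subscribed, you can still skip some of them, for example based on payment status or a blacklist.)

If users subscribe to a topic instead, sending is much simpler: pick the message and supply a list of topic IDs, and WXPusher pushes it to every user under those topics.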
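To make the two targeting modes concrete, here is a minimal sketch of how the request body might be assembled in each case. The function name build_payload and its arguments are my own invention for illustration; the field names (appToken, content, contentType, topicIds, uids) are the ones used in the request later in this article. It assumes the message text is already JSON-safe (no unescaped quotes or backslashes).

```bash
#!/bin/bash
# Hypothetical helper: build a WXPusher request body for either targeting mode.
# $1 = app token, $2 = message text (assumed JSON-safe),
# $3 = "topic" or "uid", $4 = comma-separated list of IDs
build_payload() {
    local token=$1 content=$2 mode=$3 ids=$4
    local topic_ids="" uids=""
    if [[ $mode == "topic" ]]; then
        # Topic IDs are numeric, so they can be used as-is, e.g. 10607,10608
        topic_ids=$ids
    else
        # User IDs are strings, so wrap each one in double quotes
        uids=$(printf '%s' "$ids" | sed 's/[^,][^,]*/"&"/g')
    fi
    printf '{"appToken":"%s","content":"%s","contentType":1,"topicIds":[%s],"uids":[%s]}\n' \
        "$token" "$content" "$topic_ids" "$uids"
}
```

Calling build_payload "AT_***" "hello" topic 10607 fills only topicIds, while build_payload "AT_***" "hello" uid UID_a,UID_b fills only uids, which is how a developer could skip some subscribers.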

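Temperature Monitoring on Linux

Let's do something fun. At school, research work usually depends on the lab's servers. As students, however, we cannot be on campus all the time, let alone sit next to the server. Although the university often manages lab servers centrally, some labs still keep a server in a corner of their own room. If something goes wrong and the server overheats, or even starts a fire, the consequences could be disastrous.

Nowadays everyone has WeChat. Turning temperature alerts into a push service cannot physically prevent an incident, but it does let us spot the problem right away and report it promptly, before things get worse.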

The sensors Command

This command monitors CPU status. Just run sensors. Its output is explained below:

ucsi_source_psy_1_00081-i2c-1-08
Adapter: NVIDIA GPU I2C adapter
in0:           0.00 V  (min =  +0.00 V, max =  +0.00 V)
curr1:         0.00 A  (max =  +0.00 A)

pch_lewisburg-virtual-0
Adapter: Virtual device
temp1:        +45.0°C  

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +37.0°C  (high = +79.0°C, crit = +89.0°C)
Core 0:        +35.0°C  (high = +79.0°C, crit = +89.0°C)
Core 1:        +35.0°C  (high = +79.0°C, crit = +89.0°C)
Core 2:        +35.0°C  (high = +79.0°C, crit = +89.0°C)
Core 3:        +35.0°C  (high = +79.0°C, crit = +89.0°C)
Core 4:        +34.0°C  (high = +79.0°C, crit = +89.0°C)
Core 5:        +35.0°C  (high = +79.0°C, crit = +89.0°C)
Core 6:        +35.0°C  (high = +79.0°C, crit = +89.0°C)
Core 7:        +36.0°C  (high = +79.0°C, crit = +89.0°C)

ucsi_source_psy_0_00081-i2c-0-08
Adapter: NVIDIA GPU I2C adapter
in0:           0.00 V  (min =  +0.00 V, max =  +0.00 V)
curr1:         0.00 A  (max =  +0.00 A)

coretemp-isa-0001
Adapter: ISA adapter
Package id 1:  +39.0°C  (high = +79.0°C, crit = +89.0°C)
Core 0:        +37.0°C  (high = +79.0°C, crit = +89.0°C)
Core 1:        +38.0°C  (high = +79.0°C, crit = +89.0°C)
Core 2:        +37.0°C  (high = +79.0°C, crit = +89.0°C)
Core 3:        +37.0°C  (high = +79.0°C, crit = +89.0°C)
Core 4:        +38.0°C  (high = +79.0°C, crit = +89.0°C)
Core 5:        +38.0°C  (high = +79.0°C, crit = +89.0°C)
Core 6:        +37.0°C  (high = +79.0°C, crit = +89.0°C)
Core 7:        +39.0°C  (high = +79.0°C, crit = +89.0°C)

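Focus on coretemp-isa-0000 and coretemp-isa-0001. These sections list each CPU's current temperature together with its high-temperature threshold (high) and critical threshold (crit).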
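As a small illustration of how these fields can be pulled out, the sketch below reads sensors output on standard input and prints one line per CPU package. The function name parse_package_temps is hypothetical, and the field positions assume the exact layout shown in the sample output above.

```bash
#!/bin/bash
# Hypothetical helper: read `sensors` output on stdin and print one line per
# CPU package in the form "<package id> <current> <high> <crit>".
# Assumes the "Package id N:  +T°C  (high = +H°C, crit = +C°C)" layout.
parse_package_temps() {
    awk '/^Package id/ {
        id = $3; cur = $4; high = $7; crit = $10
        sub(/:$/, "", id)                  # "0:"        -> "0"
        gsub(/[^0-9.-]/, "", cur)          # "+37.0°C"   -> "37.0"
        gsub(/[^0-9.-]/, "", high)         # "+79.0°C,"  -> "79.0"
        gsub(/[^0-9.-]/, "", crit)         # "+89.0°C)"  -> "89.0"
        print id, cur, high, crit
    }'
}
```

On a real machine it would be used as sensors | parse_package_temps.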

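The nvidia-smi Command

This command shows the GPU's running status. The default panel includes the current temperature, but it mainly shows GPU memory usage and the corresponding process IDs.

In this article I recommend nvidia-smi -q -d TEMPERATURE. Its output is explained below: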

==============NVSMI LOG==============

Timestamp                                 : Fri Jul  7 10:28:54 2023
Driver Version                            : 510.108.03
CUDA Version                              : 11.6

Attached GPUs                             : 2
GPU 00000000:18:00.0
    Temperature
        GPU Current Temp                  : 52 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

GPU 00000000:D8:00.0
    Temperature
        GPU Current Temp                  : 56 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

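The first few lines are the timestamp, driver version, and CUDA version.

Below that we can see two GPUs. Each temperature section has 7 lines; we look at the first 5:

Parameter	Description
GPU Current Temp	Current GPU temperature
GPU Shutdown Temp	Temperature at which the GPU is forcibly shut down (critical temperature)
GPU Slowdown Temp	Temperature at which the GPU reduces performance (warning temperature, comparable to a CPU throttling when it gets hot)
GPU Max Operating Temp	Highest temperature at which the GPU lets itself run at full performance (warning temperature; any hotter and it trades performance for safety)
GPU Target Temperature	Target temperature (the temperature the GPU tries to hold while running at full performance)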
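To show how these thresholds relate to one another, here is a rough sketch that classifies a single reading. The function name gpu_level and the four labels are my own naming for illustration, not NVIDIA terminology, and the inputs are assumed to be whole numbers in degrees Celsius.

```bash
#!/bin/bash
# Hypothetical helper: classify one GPU reading against the thresholds above.
# $1 = current, $2 = max operating, $3 = slowdown, $4 = shutdown (integers, °C)
gpu_level() {
    local cur=$1 max_op=$2 slowdown=$3 shutdown=$4
    if   (( cur >= shutdown )); then echo "critical"     # forced shutdown range
    elif (( cur >= slowdown )); then echo "throttling"   # performance is reduced
    elif (( cur >= max_op ));   then echo "warning"      # above full-speed limit
    else                             echo "normal"
    fi
}
```

With the sample values above, gpu_level 52 89 91 94 falls in the "normal" branch.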

Writing the Script

Extracting the Values

#!/bin/bash

# Monitoring interval (in seconds)
interval=20
# Output log file; change the path to whatever you like
output_file="monitor.log"
while true; do
    final_result=""
    # Get the current time
    current_time=$(date "+%Y-%m-%d %H:%M:%S")

    # Get the CPU temperatures and the matching high/critical thresholds
    cpu_info=$(sensors)
    cpu_temps=$(echo "$cpu_info" | awk '/^Package id/ {print $4}' | tr -d '+')
    core_temps=$(echo "$cpu_info" | awk '/^Core/ {print $3}' | tr -d '+')
    # Extract the threshold temperatures
    high_temps=$(echo "$cpu_info" | awk '/^Package id/ {print $7}' | tr -d ',)+')
    crit_temps=$(echo "$cpu_info" | awk '/^Package id/ {print $10}' | tr -d ',)+')

    # Start the report with the timestamp (the literal "\n" is expanded later by echo -e)
    final_result+="Time: $current_time\n"

    # Split each newline-separated list into an array
    IFS=$'\n' read -d '' -ra cpu_temps <<< "$cpu_temps"
    IFS=$'\n' read -d '' -ra high_temps <<< "$high_temps"
    IFS=$'\n' read -d '' -ra crit_temps <<< "$crit_temps"
    IFS=$'\n' read -d '' -ra core_temps <<< "$core_temps"
    result_cpu=""
    # Walk through the per-CPU temperatures stored in the arrays
    for index in "${!cpu_temps[@]}"; do
        final_result+="\n"
        # Strip the unit so the values can be compared numerically
        temp=$(echo "${cpu_temps[index]}" | tr -d "°C")
        temp=$(printf "%f" "$temp")
        high=$(echo "${high_temps[index]}" | tr -d "°C")
        high=$(printf "%f" "$high")
        result=""
        # Floating-point comparison via bc: below the high threshold is normal
        if (( $(echo "$temp < $high" | bc -l) )); then
            result="normal"
        else
            result="abnormal"
            result_cpu="异常"   # flag value that the alert step tests for
        fi
        final_result+="CPU: $index, status: $result\n"
        final_result+="    Temperature: ${cpu_temps[index]} (high threshold: ${high_temps[index]}, critical threshold: ${crit_temps[index]})\n"
        # Assumes 8 cores per CPU, as in the sensors output above
        for ((i=0; i<8; i++)); do
            final_result+="    Core $((i+1)): ${core_temps[i+8*index]}\n"
        done
    done
    
    # Extract the GPU information
    final_result+="\n"
    gpu_info=$(nvidia-smi -q -d TEMPERATURE)
    gpu_currents=$(echo "$gpu_info" | awk '/GPU Current Temp/ {print $5}')
    gpu_shutdowns=$(echo "$gpu_info" | awk '/GPU Shutdown Temp/ {print $5}')
    gpu_max=$(echo "$gpu_info" | awk '/GPU Max Operating Temp/ {print $6}')

    # Split each newline-separated list into an array
    IFS=$'\n' read -d '' -ra gpu_currents <<< "$gpu_currents"
    IFS=$'\n' read -d '' -ra gpu_shutdowns <<< "$gpu_shutdowns"
    IFS=$'\n' read -d '' -ra gpu_max <<< "$gpu_max"
    result_gpu=""
    for index in "${!gpu_currents[@]}"; do
        temp=$(printf "%d" "${gpu_currents[index]}")
        ope=$(printf "%d" "${gpu_max[index]}")
        result=""
        # Compare against the max operating temperature (reported below as the
        # high threshold) so the alert fires before the GPU throttles or shuts down
        if (( temp < ope )); then
            result="normal"
        else
            result="abnormal"
            result_gpu="异常"   # flag value that the alert step tests for
        fi
        final_result+="GPU: $index, status: $result\n"
        final_result+="Temperature: ${gpu_currents[index]}°C; (high threshold: ${gpu_max[index]}°C, critical threshold: ${gpu_shutdowns[index]}°C)\n"
    done

    # Write the result to the log, overwriting the previous contents
    echo -e "$final_result" > "$output_file"
    sleep "$interval"
done
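The script hard-codes 8 cores per CPU in its inner loop (i<8), which matches the sample sensors output but not every machine. As a rough sketch, the count could instead be derived from the sensors output itself. The function name cores_per_package is hypothetical, and the sketch assumes every CPU package reports the same number of cores.

```bash
#!/bin/bash
# Hypothetical helper: read `sensors` output on stdin and print the number of
# "Core N" lines per "Package id N" line (assumes an equal count per package).
cores_per_package() {
    awk '/^Package id/ { pkgs++ }
         /^Core /      { cores++ }
         END {
             if (pkgs > 0) print int(cores / pkgs)
             else          print 0      # no coretemp data found
         }'
}
```

The result of sensors | cores_per_package could then replace the literal 8 in both the loop bound and the index calculation.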

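Sending the Alert Request

Naturally, we use the mighty curl command.

This article only needs the basic push service, so let's look at the official documentation:

In practice, then, we just add a check and send the request when a temperature is abnormal.

Insert it into the script above right after the GPU loop ends, before the result is written to the log (around line 74):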

# Send an alert if any CPU or GPU is abnormal
if [[ "$result_cpu" == "异常" || "$result_gpu" == "异常" ]]; then
    # Build the list of affected components for the summary line
    result=""
    if [[ "$result_cpu" == "异常" ]]; then
        result+="CPU "
    fi
    if [[ "$result_gpu" == "异常" ]]; then
        result+="GPU "
    fi
    url="https://wxpusher.zjiecode.com/api/send/message"
    # The literal "\n" sequences in $final_result act as JSON newline escapes
    data="{
        \"appToken\": \"AT_*** (fill in your app token; it is shown when you create the app)\",
        \"content\": \"$final_result\",
        \"summary\": \"${result}temperature alert!\",
        \"contentType\": 1,
        \"topicIds\": [10607],
        \"uids\": [],
        \"verifyPay\": false
    }"
    # -s hides curl's progress meter; the response body is kept for the log
    response=$(curl -s -X POST -H "Content-Type: application/json" -d "$data" "$url")
    final_result+="\n"
    final_result+=$response
fi
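The script stores curl's reply in response but never inspects it. Below is a minimal sketch of a success check. It assumes the WXPusher response body is JSON with a "code" field equal to 1000 on success; please verify that against the official documentation before relying on it. The function name push_succeeded is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: succeed if a WXPusher response body reports code 1000.
# Assumption: a successful response contains "code":1000.
push_succeeded() {
    local response=$1
    # Match "code":1000 with optional whitespace, not followed by another digit
    local re='"code"[[:space:]]*:[[:space:]]*1000([^0-9]|$)'
    [[ $response =~ $re ]]
}
```

It could be used as push_succeeded "$response" || echo "alert push failed" >&2. This is only a pattern match on the text; if jq is installed, a proper JSON parser is the more robust choice.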

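Result Demo

We can send the subscription QR code for the topic or the application to our friends so that everyone can follow the alerts together. The result shown here was sent after I changed the condition to force an alert; it is only meant to show what an alert looks like.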

An automatic server monitoring and alerting application built with WXPusher