Notes on troubleshooting a production stutter in a WASM application


Background:

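Users in production reported that the player was stuttering. We got in touch with one of them and recorded a Performance profile, with the result shown below. We also obtained the version of the underlying wasm build.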

image.png

As you can see, every function in the workers shows up as wasm-function.

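How do we investigate this? First we need the project's symbols. The editor's build artifacts are on Jenkins, so we can download them from there. The next step is to combine the symbol file with the wasm frames in the recording to produce a new performance file.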

First, let's get to know the performance file.

"args": {
	"data": {
		"cpuProfile": {
			"nodes": [{
				"callFrame": {
					"codeType": "wasm",
					"columnNumber": 4504070,
					"functionName": "wasm-function[16766]",
					"lineNumber": 0,
					"scriptId": 4,
					"url": "wasm://wasm/07dea386"
				},
				"id": 6,
				"parent": 5
			}, {
				"callFrame": {
					"codeType": "JS",
					"columnNumber": 11,
					"functionName": "process",
					"lineNumber": 211,
					"scriptId": 3,
					"url": "blob:https://xxxxxxxxxxx"
				},
				"id": 2,
				"parent": 1
			}],
			"samples": []
		},
		"lines": [],
		"timeDeltas": []
	}
},
"cat": "disabled-by-default-v8.cpu_profiler",
"id": "0x5",
"name": "ProfileChunk",
"ph": "P",
"pid": 783,
"tid": 85111,
"ts": 106611036180,
"tts": 2473

The key point here is the wasm callFrame in each node: the anonymous wasm-function[xxxx] name has to be replaced with the actual function name from the C++ source.
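
Before replacing anything, it can help to see which anonymous names actually occur in a recording. The following is a minimal sketch, assuming the trace is a list of events shaped like the sample above (or that list wrapped in a "traceEvents" key). The helper names iter_profile_nodes, count_wasm_frames and load_events are made up for this illustration.

```python
import json
from collections import Counter


def iter_profile_nodes(events):
    """Yield every cpuProfile node found in the given trace events."""
    for event in events:
        if not isinstance(event, dict):
            continue
        args = event.get("args")
        data = args.get("data") if isinstance(args, dict) else None
        profile = data.get("cpuProfile") if isinstance(data, dict) else None
        if isinstance(profile, dict):
            yield from profile.get("nodes", [])


def count_wasm_frames(events):
    """Count how many nodes carry each wasm function name."""
    counts = Counter()
    for node in iter_profile_nodes(events):
        frame = node.get("callFrame", {})
        if frame.get("codeType") == "wasm":
            counts[frame.get("functionName", "")] += 1
    return counts


def load_events(path):
    """Load a saved trace; accept a bare event list or a traceEvents wrapper."""
    with open(path, "r", encoding="utf-8") as f:
        trace = json.load(f)
    if isinstance(trace, dict) and "traceEvents" in trace:
        return trace["traceEvents"]
    return trace
```

Calling count_wasm_frames(load_events(path)).most_common(10) on a recording would list the ten most frequent anonymous names, which gives a quick sense of how much of the profile is unreadable.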

We therefore wrote the following code, which takes a name such as wasm[8283] and looks it up in the symbol table.

image.png
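
A minimal sketch of that lookup, assuming the symbol file holds one index:name entry per line, as in the sample line quoted in the script below. Unlike the script, this version keys a dict by the index instead of relying on line position. The names parse_symbols, resolve and unescape are illustrative, not part of the original tooling.

```python
import re

WASM_NAME = re.compile(r"^wasm-function\[(\d+)\]$")
HEX_ESCAPE = re.compile(r"\\([0-9a-fA-F]{2})")


def unescape(name):
    """Restore two-digit hex escapes, so that a backslash followed by 28 becomes '('."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


def parse_symbols(lines):
    """Build {function index: readable name} from 'index:name' lines."""
    table = {}
    for line in lines:
        index, sep, name = line.strip().partition(":")
        if sep and index.isdigit():
            table[int(index)] = unescape(name)
    return table


def resolve(function_name, table):
    """Return the symbol for wasm-function[N], or the input unchanged if unknown."""
    match = WASM_NAME.match(function_name)
    if not match:
        return function_name
    return table.get(int(match.group(1)), function_name)
```

Keying by index means a missing or reordered line in the symbol file produces an unresolved name rather than a wrong one.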

That gives us the complete JSON. The full script is as follows:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import sys
import subprocess
import json
import os
import argparse



class ChromeSymbolParser(object):
    def __init__(self, profile_path, ve_version, symbol_path=None):
        print("{} {} {}".format(ve_version, profile_path, symbol_path))
        self.ve_version = ve_version
        self.profile_path = profile_path
        self.symbol_path = symbol_path
        self.symbol_list = self.__parse_symbol()


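    def start_parse(self):
        with open(self.profile_path, "r") as f:
            j = json.load(f)

        if "traceEvents" in j:      # Windows profile
            j = j["traceEvents"]

        if isinstance(j, list):
            # Trace-event list: skip events without "args" instead of
            # letting one KeyError abort the whole loop
            for event in j:
                if isinstance(event, dict) and "args" in event:
                    self.__find_symbol(event["args"])
        elif "nodes" in j:
            # Single-thread profile with a top-level "nodes" list
            self.__find_symbol2(j)
        self.__save(j)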


    def __find_symbol(self, args):
        if "data" not in args:
            return
        data = args["data"]

        if "cpuProfile" not in data:
            return

        cpuProfile = data["cpuProfile"]

        if "nodes" not in cpuProfile:
            return

        nodes = cpuProfile["nodes"]

        for frame in nodes:
            callFrame = frame["callFrame"]
            # Only wasm frames carry the anonymous wasm-function[N] names
            if callFrame.get("codeType") == "wasm":
                functionName = callFrame["functionName"]
                callFrame["functionName"] = self.__replace_symbol(functionName)


    def __find_symbol2(self, args): # only one thread was recorded
        nodes = args["nodes"]

        for x in range(len(nodes)):
            frame = nodes[x]
            callFrame = frame["callFrame"]
            functionName = callFrame["functionName"]
            callFrame["functionName"] = self.__replace_symbol(functionName)

    def __parse_symbol(self):
        symbol_path = self.symbol_path

        if (not symbol_path) or (not os.path.isfile(symbol_path)):
            symbol_path = self.__getjenkins()


        # 11342:webrtcimported::WeightedAverage\28short*\2c\20short\2c\20short\20const*\29
        # Symbols contain hex-escaped ASCII characters (\28 is "("), which need to be restored
        with open(symbol_path, "r") as f:
            lines = f.readlines()
        for i in range(len(lines)):
            lines[i] = self.__trans_hex_to_ascii(lines[i])
        return lines

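    def __save(self, j):
        strj = json.dumps(j)

        if not self.ve_version:
            # Write next to the input, e.g. trace.json -> traceout.json;
            # splitext also copes with paths that contain more than one dot
            root, ext = os.path.splitext(self.profile_path)
            outname = root + "out" + ext
        else:
            # Write into the current directory with an "out" prefix
            outname = os.path.join(os.getcwd(), "out" + os.path.basename(self.profile_path))

        with open(outname, 'w') as f:
            f.write(strj)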



    def __getjenkins(self):
        # The download URLs are redacted here; the real templates need a %s placeholder for the version
        if len(self.ve_version) > 4:
            TOS_URL = "https://xxxxx" % (self.ve_version)
        else:
            TOS_URL = "https://xxxxx" % (self.ve_version)
        print(TOS_URL)
        cmd = '''
        rm -f xxx.zip
        rm -rf xxx
        wget -O out.zip %s
        mv out.zip xxx.zip
        echo "unziping"
        unzip -o xxxx.zip -d xxxx
        echo "clean and update"
''' % (TOS_URL)
        subprocess.run(cmd, shell=True)

        return os.path.join(os.getcwd(), "xxxx", "xxxx-xxxxx-opt.js.symbols")     # simd build; adjust the path yourself for the non-simd build



    def __trans_hex_to_ascii(self, symbol):
        symbol = symbol.strip()
        i = 0
        last = 0
        final = ''
        while i < len(symbol):
            if symbol[i] == '\\':
                code = symbol[i+1:i+3]
                dec_code = int(code, 16)
                ascii_code = chr(dec_code)
                final += symbol[last:i] + ascii_code
                last = i + 3
                i = last
                continue

            i += 1
        final += symbol[last:]
        return final


    # "wasm-function[5450]"
    def __replace_symbol(self, symbol):
        try:
            symbol_index_str = symbol.split('[')[1][:-1]
            symbol_index = int(symbol_index_str)
        except (IndexError, ValueError):
            # Not of the form wasm-function[N]; leave the name untouched
            return symbol

        # Valid indexes run from 0 to len - 1, so >= is the out-of-range case
        if symbol_index >= len(self.symbol_list):
            print("symbol index out of range; the given npm version %s may not match" %
                  self.ve_version)
            subprocess.run("npm list", shell=True)
            return symbol

        # Each line looks like "5450:name"; drop the "5450:" prefix
        tmp = self.symbol_list[symbol_index]
        cpp_symbol = tmp[len(symbol_index_str)+1:]

        return cpp_symbol

        # NOTE: this fragment sits after the return above and relies on
        # threads, ids, result and j, which this excerpt does not define
        def find_index_of_id(_ids, _tid):
            # Given the id of a Profile entry, find its corresponding index
            for k, v in _ids.items():
                if v[1] == _tid:
                    return k, v[0]
            return None

        for tid, idx_pid in threads.items():
            idx = idx_pid[0]
            if j[idx]['args']['name'] == 'DedicatedWorker thread':
                id_idx = find_index_of_id(ids, tid)
                if id_idx is None or result.get(id_idx[0]) is None:
                    # If the thread has no profile id, or the stack data has nothing for this id,
                    # renaming the thread_name entry pushes the thread to the end of the list
                    j[idx]['args']['name'] = 'DedicatedWorker thread deleted'
                    j[idx]['name'] = 'thread_name_deleted'
                pass

        return
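
The script imports argparse but the listing stops before any entry point. The following is a hedged sketch of what one could look like. The argument names (profile, --version, --symbols) are assumptions for illustration and do not come from the original script.

```python
import argparse


def build_arg_parser():
    """Command-line options for the symbol replacement script (names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Replace wasm-function[N] names in a Chrome profile")
    parser.add_argument("profile", help="path to the recorded profile (JSON)")
    parser.add_argument("--version", dest="ve_version", default="",
                        help="wasm build version used to fetch the symbols")
    parser.add_argument("--symbols", dest="symbol_path", default=None,
                        help="local .symbols file; when it exists, nothing is downloaded")
    return parser


# Intended use together with the class above:
#   args = build_arg_parser().parse_args()
#   ChromeSymbolParser(args.profile, args.ve_version, args.symbol_path).start_parse()
```

The version defaults to an empty string rather than None because the class calls len() on it when it has to download the symbols.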

Download the result and you have the symbolicated performance file.

image.png

Since ours is a stutter problem, the GL thread gets priority. The question is how to find the GL thread. Here are a few methods I use:

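  1. If you know the project well enough, search directly. For example, worker threads usually hand rendering work to the GL thread through one common method name.
  2. Go through the threads one by one, skipping the suspended ones. With some patience you will find it, since its functions usually carry keywords such as texture.
  3. Use the frame status in the performance panel: at the moment of the worst stutter, scan each worker, and the GL thread usually stands out.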
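
The keyword search from the first two methods can also be run over the symbolicated JSON. This is a minimal sketch, assuming events shaped like the ProfileChunk sample earlier and that chunks of one profile share the same "id" field. The function name profiles_matching is made up, and mapping a profile id back to a thread name is left out.

```python
from collections import defaultdict


def profiles_matching(events, keyword):
    """Map profile id -> sorted function names containing keyword (case-insensitive)."""
    needle = keyword.lower()
    hits = defaultdict(set)
    for event in events:
        if not isinstance(event, dict):
            continue
        args = event.get("args")
        data = args.get("data") if isinstance(args, dict) else None
        profile = data.get("cpuProfile") if isinstance(data, dict) else None
        if not isinstance(profile, dict):
            continue
        for node in profile.get("nodes", []):
            name = node.get("callFrame", {}).get("functionName", "")
            if needle in name.lower():
                hits[event.get("id")].add(name)
    return {profile_id: sorted(names) for profile_id, names in hits.items()}
```

Running it with a keyword such as "texture" narrows the candidates to the profiles whose call stacks mention it, which can then be inspected by hand in the performance panel.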

Once it is found, the rest is routine analysis: look at the long tasks and judge whether they are reasonable.
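
To rank candidates before reading the flame chart, the samples can be summed per function. The sketch below assumes each chunk pairs cpuProfile.samples with data.timeDeltas index by index, as in the sample structure earlier, and that the deltas are in microseconds. Charging each delta to the sample it is paired with is an approximation. The function name self_time_by_function is illustrative.

```python
from collections import defaultdict


def self_time_by_function(chunks):
    """Sum sampled time per function name across the chunks of one profile."""
    names = {}                  # node id -> function name, accumulated across chunks
    totals = defaultdict(int)   # function name -> summed time deltas
    for chunk in chunks:
        data = chunk["args"]["data"]
        profile = data.get("cpuProfile", {})
        for node in profile.get("nodes", []):
            names[node["id"]] = node["callFrame"]["functionName"]
        samples = profile.get("samples", [])
        deltas = data.get("timeDeltas", [])
        for node_id, delta in zip(samples, deltas):
            totals[names.get(node_id, "(unknown)")] += delta
    # Largest total first, so the likely long tasks come out on top
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Node names are kept across chunks because a later chunk may reference node ids that were introduced by an earlier one.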

In my case, the problem was that the long task and the codec ran on the same thread, so decoding was blocked.

This was my first time debugging this kind of issue, so I am writing it down.