async-profiler Source Code (Part 1): CPU Profiling


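I. Overview

This chapter walks through how async-profiler implements CPU profiling, covering:

  1. how the javaagent is attached;
  2. how the different CPU profiling engines are implemented.

Versions used:

  1. async-profiler 4.0.0
  2. OpenJDK 17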

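Example

async-profiler supports several profiling modes:

  1. cpu: the default mode, also selectable with -e cpu. It collects call stack samples to find the methods that consume CPU;
  2. lock: -e lock, profiles lock contention;
  3. alloc: -e alloc, profiles heap allocations;
  4. nativemem: -e nativemem, profiles off-heap (native) memory allocations;
  5. wall: -e wall, samples all threads equally at a fixed interval regardless of whether they are running, sleeping, or blocked; useful for cases such as analyzing application startup time.

This chapter focuses on cpu mode. The -d option sets the profiling duration in seconds and starts CPU profiling of the target JVM process.

asprof -d 30 14047    # 14047 is the Java process id

The results are written to standard output in two parts, both sorted by sample count in descending order: the full call stacks, followed by the top frames.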

--- Execution profile ---
Total samples       : 4

--- 20000000 ns (50.00%), 2 samples
  [ 0] __GI___futex_abstimed_wait_cancelable64
  [ 1] pthread_cond_timedwait@@GLIBC_2.17
  [ 2] os::PlatformMonitor::wait
  [ 3] Monitor::wait_without_safepoint_check
  [ 4] WatcherThread::sleep
  [ 5] WatcherThread::run
  [ 6] Thread::call_run
  [ 7] thread_native_entry
  [ 8] start_thread
  [ 9] thread_start

--- 10000000 ns (25.00%), 1 sample
  [ 0] AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<548964ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 548964ul>::oop_access_barrier
  [ 1] JavaThread::sleep
  [ 2] JVM_Sleep
  [ 3] java.lang.Thread.sleep
  [ 4] com.xxx.NativeMemBurnJob.normalJob
  [ 5] jdk.internal.reflect.GeneratedMethodAccessor6.invoke
  [ 6] jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke
  [ 7] java.lang.reflect.Method.invoke
  [ 8] org.springframework.scheduling.support.ScheduledMethodRunnable.runInternal
  [ 9] org.springframework.scheduling.support.ScheduledMethodRunnable.lambda$run$2
  [10] org.springframework.scheduling.support.ScheduledMethodRunnable$$Lambda$859.0x00000002013c32c8.run
  [11] io.micrometer.observation.Observation.observe
  [12] org.springframework.scheduling.support.ScheduledMethodRunnable.run
  [13] org.springframework.scheduling.config.Task$OutcomeTrackingRunnable.run
  [14] org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run
  [15] java.util.concurrent.Executors$RunnableAdapter.call
  [16] java.util.concurrent.FutureTask.runAndReset
  [17] java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run
  [18] java.util.concurrent.ThreadPoolExecutor.runWorker
  [19] java.util.concurrent.ThreadPoolExecutor$Worker.run
  [20] java.lang.Thread.run

--- 10000000 ns (25.00%), 1 sample
  [ 0] java.util.concurrent.ThreadPoolExecutor.getTask
  [ 1] java.util.concurrent.ThreadPoolExecutor.runWorker
  [ 2] java.util.concurrent.ThreadPoolExecutor$Worker.run
  [ 3] java.lang.Thread.run

          ns  percent  samples  top
  ----------  -------  -------  ---
    20000000   50.00%        2  __GI___futex_abstimed_wait_cancelable64
    10000000   25.00%        1  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<548964ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 548964ul>::oop_access_barrier
    10000000   25.00%        1  java.util.concurrent.ThreadPoolExecutor.getTask
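
The figures in the report above are consistent with each sample standing for one sampling interval of CPU time. The sketch below reproduces the "ns" and "percent" columns under that assumption. The 10 ms interval and the helper names `attributedNs` and `percentOf` are illustrative assumptions, not code from async-profiler.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Assumption: every sample accounts for one sampling interval (10 ms here).
static const uint64_t kIntervalNs = 10000000ULL;

// Nanoseconds attributed to a stack that was seen `samples` times.
uint64_t attributedNs(uint64_t samples) {
    return samples * kIntervalNs;
}

// Share of the total, formatted like the report's "percent" column.
std::string percentOf(uint64_t samples, uint64_t total_samples) {
    char buf[16];
    snprintf(buf, sizeof(buf), "%.2f%%", 100.0 * samples / total_samples);
    return std::string(buf);
}
```

With 4 samples in total, a stack seen twice is attributed 20000000 ns and 50.00%, which matches the first entry of the report.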

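The -f option writes a flame graph instead:

asprof -d 30 -f flame.html 14047    # 14047 is the Java process id

The wider a method's frame, the more CPU samples it accounts for, so it may be a performance bottleneck.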


Makefile

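Let's start with the Makefile. The async-profiler build produces two main artifacts:

  1. asprof: the command-line tool, built from the sources under src/main (C++) and src/jattach (C);
  2. libasyncProfiler.so: the javaagent built on JVMTI, implemented in C++.

ASPROF=bin/asprof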
LIB_PROFILER=lib/libasyncProfiler.$(SOEXT)
build/$(ASPROF): src/main/* src/jattach/* src/fdtransfer.h
    $(CC) $(CPPFLAGS) $(CFLAGS) $(DEFS) -o $@ src/main/*.cpp src/jattach/*.c
    $(STRIP) $@
build/$(LIB_PROFILER): $(SOURCES) $(HEADERS) $(RESOURCES) $(JAVA_HELPER_CLASSES)
    for f in src/*.cpp; do echo '#include "'$$f'"'; done |\
    $(CXX) $(CPPFLAGS) $(CXXFLAGS) $(DEFS) $(INCLUDES) -fPIC -g -shared -o $@ -xc++ - $(LIBS)

asprof itself consists of two parts:

  1. main: the entry point of asprof, which drives the whole command;
  2. jattach: the agent attach logic, which talks to the JVM.


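Overall flow

From a process point of view, running asprof involves three processes:

  1. main: runs the asprof command;
  2. jattach: a child process that main creates with fork to talk to the JVM. The agent is attached twice, once to start profiling and once to stop it;
  3. the JVM process: the user process being profiled, which executes libasyncProfiler.so provided by async-profiler.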


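From a JVM thread point of view, two kinds of threads are involved:

  1. Attach Listener: a JVM-internal thread that loads the javaagent dynamically and runs its entry point;
  2. the other JVM threads: async-profiler enables profiling for all JVM threads, and each thread captures its own current call stack.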


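II. main

src/main/main.cpp: the entry point of asprof.

  1. Parse the arguments;
  2. Run jattach to attach the agent with the start command, passing arguments such as: start,quiet,file=<output file>,,log=<log file>;
  3. Register signal handlers (for kill <pid> and Ctrl+C) so that profiling can be stopped early;
  4. Sleep for the duration given by -d;
  5. Run jattach again to attach the agent with the stop command.

static void sigint_handler(int sig) {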
    end_time = 0;
}
int main(int argc, const char** argv) {
  // ... argument parsing
  // start,quiet,file=/tmp/asprof.{asprof pid}.{target pid},,log=/tmp/asprof-log.{asprof pid}.{target pid}
  run_jattach(pid, String("start,quiet,file=") << file << "," << output << format << params << ",log=" << logfile);

  fprintf(stderr, "Profiling for %d seconds\n", duration);
  // compute when profiling should end
  end_time = time_micros() + duration * 1000000ULL; 

  // register signal handlers: if the user terminates asprof, end_time is set to 0 and the wait loop exits early
  signal(SIGINT, sigint_handler);
  signal(SIGTERM, sigint_handler);

  // sleep until the profiling duration has elapsed, checking once per second that the target is still alive
  while (time_micros() < end_time) {
      if (kill(pid, 0) != 0) {
          fprintf(stderr, "Process exited\n");
          if (use_tmp_file) print_file(file, STDOUT_FILENO);
          return 0;
      }
      sleep(1);
  }

  fprintf(stderr, end_time != 0 ? "Done\n" : "Interrupted\n");
  signal(SIGINT, SIG_DFL);

  // stop,file=/tmp/asprof.{asprof pid}.{target pid},,log=/tmp/asprof-log.{asprof pid}.{target pid}
  run_jattach(pid, String("stop,file=") << file << "," << output << format << ",log=" << logfile);
}
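
The wait loop in main can be isolated into a small sketch: a deadline stored in a global, a signal handler that zeroes it, and a polling loop that notices either event. This is a simplified illustration, not the asprof code. The names `waitForDeadline`, `nowMicros`, and `g_end_time` are hypothetical, and the signal number is a parameter so the pattern can be exercised without pressing Ctrl+C.

```cpp
#include <csignal>
#include <cstdint>
#include <ctime>
#include <unistd.h>

// Deadline in microseconds; the signal handler resets it to 0 to end the wait early.
static volatile uint64_t g_end_time = 0;

static uint64_t nowMicros() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000;
}

static void onInterrupt(int) {
    g_end_time = 0;
}

// Waits up to duration_us, polling every poll_us.
// Returns true if the full duration elapsed, false if `signo` cut it short.
bool waitForDeadline(uint64_t duration_us, useconds_t poll_us, int signo) {
    g_end_time = nowMicros() + duration_us;
    signal(signo, onInterrupt);
    while (nowMicros() < g_end_time) {
        usleep(poll_us);  // returns early when a signal arrives
    }
    bool completed = g_end_time != 0;
    signal(signo, SIG_DFL);
    return completed;
}
```

Calling `waitForDeadline(30 * 1000000ULL, 1000000, SIGINT)` would mirror a 30-second run polled once per second; the return value corresponds to the "Done" versus "Interrupted" message.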

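src/main/main.cpp: run_jattach forks a child process that performs the attach, while the parent blocks until the child exits.

Four arguments are passed to jattach:

  1. load: a constant meaning "load a javaagent"; it maps to the load_agent function in the JDK;
  2. libpath: the location of libasyncProfiler.so;
  3. "true" or "false": whether libpath is an absolute path;
  4. cmd.str(): the agent arguments.

static void run_jattach(int pid, String& cmd) {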
    pid_t child = fork();
    if (child == -1) {
        error("fork failed", errno);
    }
    if (child == 0) {
        // child process
        const char* argv[] = {"load", libpath.str(), libpath.str()[0] == '/' ? "true" : "false", cmd.str()};
        exit(jattach(pid, 4, argv, 0));
    } else {
        // parent process
        int ret = wait_for_exit(child);
        if (ret != 0) {
            print_file(logfile, STDERR_FILENO);
            exit(WEXITSTATUS(ret));
        }
        print_file(logfile, STDERR_FILENO);
        // without -f, read /tmp/asprof.{asprof pid}.{target pid} and copy it to standard output:
        // the javaagent writes the tmp file, and the asprof process reads it back and prints it
        if (use_tmp_file) print_file(file, STDOUT_FILENO);
    }
}
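
The fork-and-wait pattern used by run_jattach can be sketched in isolation as follows. wait_for_exit is not shown in the excerpt, so the use of `waitpid` here is an assumption about how such a helper could work, and `runInChild`, `succeed`, and `failWith3` are hypothetical names.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs `task` in a forked child and returns the child's exit code,
// or -1 if fork/waitpid failed or the child was killed by a signal.
int runInChild(int (*task)()) {
    pid_t child = fork();
    if (child == -1) {
        return -1;
    }
    if (child == 0) {
        // Child: _exit avoids flushing stdio buffers inherited from the parent.
        _exit(task());
    }
    // Parent: block until the child has finished.
    int status = 0;
    if (waitpid(child, &status, 0) != child) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

// Two stand-in tasks for illustration.
static int succeed() { return 0; }
static int failWith3() { return 3; }
```

Running the attach in a child keeps any failure or namespace change made during attach isolated from the parent, which only needs the exit code.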

III. jattach

This is the javaagent attach logic written in C. It is essentially the same as what VirtualMachine#attach(pid) does.


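Dynamic attach relies on the JVM process opening a Unix domain socket on demand, through which the attaching process talks to the JVM.

jattach/jattach_hotspot.c: in the attaching process, jattach_hotspot loads the javaagent (libasyncProfiler.so) in five steps.

1) check_socket: check whether the socket already exists, normally the file /tmp/.java_pid{pid};

2) start_attach_mechanism: if the socket does not exist, trigger the attach mechanism;

3) connect_socket: connect to the socket;

4) write_command: send the load command to the JVM, naming the javaagent and its arguments;

5) read_response: once the JVM has run the javaagent, read the result back over the socket.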

int jattach_hotspot(int pid, int nspid, int argc, char** argv, int print_output) {
    // check_socket: check whether the socket exists
    // start_attach_mechanism: if it does not, trigger the attach mechanism
    if (check_socket(nspid) != 0 && start_attach_mechanism(pid, nspid) != 0) {
        perror("Could not start attach mechanism");
        return 1;
    }
    // connect to the socket
    int fd = connect_socket(nspid);
    if (fd == -1) {
        perror("Could not connect to socket");
        return 1;
    }
    if (print_output) {
        printf("Connected to remote JVM\n");
    }
    // write the command and its arguments to the socket
    if (write_command(fd, argc, argv) != 0) {
        perror("Error writing to socket");
        close(fd);
        return 1;
    }
    // read the response from the socket
    int result = read_response(fd, argc, argv, print_output);
    close(fd);
    return result;
}
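
write_command itself is not shown above. As a rough sketch, the request understood by the HotSpot attach listener on Linux is a protocol version "1" followed by the command name and three arguments, each terminated by a NUL byte. The function name `encodeAttachRequest` is hypothetical, and building the request in a string (instead of writing to the socket directly) is a simplification for illustration.

```cpp
#include <string>
#include <vector>

// Encodes an attach request as: "1" NUL, then four NUL-terminated fields
// (command plus three arguments); missing fields are sent as empty strings.
std::string encodeAttachRequest(const std::vector<std::string>& args) {
    std::string request;
    request.append("1");       // protocol version
    request.push_back('\0');
    for (size_t i = 0; i < 4; i++) {
        if (i < args.size()) {
            request.append(args[i]);
        }
        request.push_back('\0');
    }
    return request;
}
```

For the asprof case the four fields would be load, the library path, "true" or "false", and the agent arguments, which is exactly the argv array built in run_jattach.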

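jattach/jattach_hotspot.c: start_attach_mechanism uses an .attach_pid file plus a SIGQUIT signal to ask the JVM process to open the socket.

1) Create the file /proc/{pid}/cwd/.attach_pid{pid};

2) Send SIGQUIT to the target JVM process;

3) Sleep and poll check_socket until the socket file appears or the retry budget runs out;

4) Delete /proc/{pid}/cwd/.attach_pid{pid}.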

static int start_attach_mechanism(int pid, int nspid) {
    char path[MAX_PATH];
    snprintf(path, sizeof(path), "/proc/%d/cwd/.attach_pid%d", mnt_changed > 0 ? nspid : pid, nspid);
    int fd = creat(path, 0660);
    kill(pid, SIGQUIT);
    struct timespec ts = {0, 20000000};
    int result;
    do {
        nanosleep(&ts, NULL);
        result = check_socket(nspid);
    } while (result != 0 && (ts.tv_nsec += 20000000) < 500000000);
    unlink(path);
    return result;
}
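
The polling loop above sleeps a little longer on each round. A short simulation of the same do/while condition shows the worst case when the socket never appears. `PollBudget` and `attachPollBudget` are hypothetical names; only the 20 ms step and the 500 ms bound come from the listing.

```cpp
#include <cstdint>

struct PollBudget {
    int polls;               // number of check_socket attempts
    int64_t total_sleep_ms;  // total time spent sleeping
};

// Replays the loop `do { sleep(ts); } while ((ts += 20ms) < 500ms)`
// assuming the socket check fails every time.
PollBudget attachPollBudget() {
    const int64_t step_ns = 20000000;    // 20 ms
    const int64_t limit_ns = 500000000;  // 500 ms
    PollBudget budget = {0, 0};
    int64_t sleep_ns = step_ns;
    do {
        budget.polls++;
        budget.total_sleep_ms += sleep_ns / 1000000;
    } while ((sleep_ns += step_ns) < limit_ns);
    return budget;
}
```

The sleeps are 20, 40, ..., 480 ms, so the JVM gets 24 attempts and about 6 seconds in total to create the socket before the attach is reported as failed.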

src/hotspot/share/runtime/os.cpp: on the JVM side, the signal dispatcher thread (signal_thread_entry) waits for SIGQUIT and, if the .attach_pid file exists, creates the Attach Listener thread.

static void signal_thread_entry(JavaThread* thread, TRAPS) {
  os::set_priority(thread, NearMaxPriority);
  while (true) {
    int sig;
    {
      sig = os::signal_wait();
    }
    if (sig == os::sigexitnum_pd()) {
       return;
    }
    switch (sig) {
      case SIGBREAK: { // SIGQUIT
        if (!DisableAttachMechanism) {
          // move the state from AL_NOT_INITIALIZED to AL_INITIALIZING
          AttachListenerState cur_state = AttachListener::transit_state(AL_INITIALIZING, AL_NOT_INITIALIZED);
          if (cur_state == AL_INITIALIZING) {
            continue;
          } else if (cur_state == AL_NOT_INITIALIZED) {
             // is_init_trigger creates the Attach Listener thread if the .attach_pid file exists
            if (AttachListener::is_init_trigger()) {
              continue;
            } else {
              AttachListener::set_state(AL_NOT_INITIALIZED);
            }
          } else if (AttachListener::check_socket_file()) {
            continue;
          }
        }
      }
      // ...
    }
  }
}
// src/hotspot/os/linux/attachListener_linux.cpp
bool AttachListener::is_init_trigger() {
  char fn[PATH_MAX + 1];
  int ret;
  struct stat64 st;
  sprintf(fn, ".attach_pid%d", os::current_process_id());
  RESTARTABLE(::stat64(fn, &st), ret);
  if (ret == 0) {
    // the .attach_pid file exists
    if (os::Posix::matches_effective_uid_or_root(st.st_uid)) {
      init(); // create the thread
      return true;
    }
  }
  return false;
}
// src/hotspot/share/services/attachListener.cpp
void AttachListener::init() {
  const char thread_name[] = "Attach Listener";
  // ...
  JavaThread* listener_thread = new JavaThread(&attach_listener_thread_entry);
  // ...
  Thread::start(listener_thread);
}

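src/hotspot/share/services/attachListener.cpp: attach_listener_thread_entry is the body of the Attach Listener thread. It creates the socket, reads a request from it, runs the matching function (for example, the load command maps to load_agent), and finally writes the result back to the socket.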

static AttachOperationFunctionInfo funcs[] = {
  { "agentProperties",  get_agent_properties },
  { "datadump",         data_dump },
  { "dumpheap",         dump_heap },
  { "load",             load_agent },
  { "properties",       get_system_properties },
  { "threaddump",       thread_dump },
  { "inspectheap",      heap_inspection },
  { "setflag",          set_flag },
  { "printflag",        print_flag },
  { "jcmd",             jcmd },
  { NULL,               NULL }
};
static void attach_listener_thread_entry(JavaThread* thread, TRAPS) {
  // create the socket: /tmp/.java_pid{pid}
  if (AttachListener::pd_init() != 0) {
    AttachListener::set_state(AL_NOT_INITIALIZED);
    return;
  }
  // mark the AttachListener state as AL_INITIALIZED
  AttachListener::set_initialized();

  for (;;) {
    // read a request from the socket, wrapped as an AttachOperation
    AttachOperation* op = AttachListener::dequeue();
    AttachOperationFunctionInfo* info = NULL;
    for (int i=0; funcs[i].name != NULL; i++) {
      const char* name = funcs[i].name;
      // match the command name to the function to run
      if (strcmp(op->name(), name) == 0) {
        info = &(funcs[i]);
        break;
      }
    }
    // run the matched function, e.g. load_agent
    res = (info->func)(op, &st);
    // write the function's result back to the socket
    op->complete(res, &st);
  }
}
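
The name-to-function lookup in the listing is a classic NULL-terminated dispatch table. Here is a reduced sketch of the same idea. The types and names (`OperationInfo`, `findOperation`, `loadAgent`, `threadDump`) are hypothetical stand-ins, not HotSpot's, and the handlers do nothing.

```cpp
#include <cstring>

typedef int (*OperationFunc)(const char* arg);

struct OperationInfo {
    const char* name;
    OperationFunc func;
};

// Placeholder handlers; real handlers would act on the request.
static int loadAgent(const char*) { return 0; }
static int threadDump(const char*) { return 0; }

// The table ends with a {nullptr, nullptr} sentinel, so no size is needed.
static OperationInfo kOperations[] = {
    {"load", loadAgent},
    {"threaddump", threadDump},
    {nullptr, nullptr}
};

// Returns the matching entry, or nullptr for an unknown command.
const OperationInfo* findOperation(const char* name) {
    for (int i = 0; kOperations[i].name != nullptr; i++) {
        if (strcmp(kOperations[i].name, name) == 0) {
            return &kOperations[i];
        }
    }
    return nullptr;
}
```

The abbreviated listing calls info->func unconditionally; a caller of this sketch has to check for nullptr first, since an unknown command yields no entry.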

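src/hotspot/share/services/attachListener.cpp: load_agent unpacks the arguments of the load command and hands them to load_agent_library: agent is the agent location, absParam says whether it is an absolute path, and options are the arguments for the agent.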

static jint load_agent(AttachOperation* op, outputStream* out) {
  const char* agent = op->arg(0);
  const char* absParam = op->arg(1);
  const char* options = op->arg(2);
  return JvmtiExport::load_agent_library(agent, absParam, options, out);
}

src/hotspot/share/prims/jvmtiExport.cpp: load_agent_library loads the agent library and calls its Agent_OnAttach function.

jint JvmtiExport::load_agent_library(const char *agent, const char *absParam,
                                     const char *options, outputStream* st) {
  char ebuf[1024] = {0};
  void* library = NULL;
  jint result = JNI_ERR;
  const char *on_attach_symbols[] = AGENT_ONATTACH_SYMBOLS;
  size_t num_symbol_entries = ARRAY_SIZE(on_attach_symbols);
  bool is_absolute_path = (absParam != NULL) && (strcmp(absParam,"true")==0);
  // load the agent
  AgentLibrary *agent_lib = new AgentLibrary(agent, options, is_absolute_path, NULL);
  if (!os::find_builtin_agent(agent_lib, on_attach_symbols, num_symbol_entries)) {
    if (is_absolute_path) {
      library = os::dll_load(agent, ebuf, sizeof ebuf);
    }
  }
  if (library != NULL) {
    agent_lib->set_os_lib(library);
    agent_lib->set_valid();
  }
  if (agent_lib->valid()) {
    OnAttachEntry_t on_attach_entry = NULL;
    on_attach_entry = CAST_TO_FN_PTR(OnAttachEntry_t, os::find_agent_function(agent_lib, false, on_attach_symbols, num_symbol_entries));
    extern struct JavaVM_ main_vm;
    // call the agent's Agent_OnAttach function
    result = (*on_attach_entry)(&main_vm, (char*)options, NULL);
  }
  return result;
}

IV. libasyncProfiler

src/vmEntry.cpp: the entry point of the javaagent provided by async-profiler. From here on, all logic runs inside the target JVM process.

extern "C" DLLEXPORT jint JNICALL
Agent_OnAttach(JavaVM* vm, char* options, void* reserved) {
    Arguments args;
    // parse the arguments
    Error error = args.parse(options);
    // initialize
    if (!VM::init(vm, true)) {
        return COMMAND_ERROR;
    }
    // run the profiler command
    error = Profiler::instance()->run(args);
    if (error) {
        return COMMAND_ERROR;
    }
    return 0;
}

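1. init

src/vmEntry.cpp: init performs one-time initialization and prepares the data needed later.

1) It runs only once, no matter how many times the agent is attached;

2) Through JavaVM it obtains the JVMTI (JVM Tool Interface) environment, which exposes the functions provided by the JVM;

3) It locates libjvm.so with dlopen and looks up the AsyncGetCallTrace function;

4) updateSymbols initializes the non-kernel symbols;

5) It registers several JVMTI event callbacks (hooks);

6) loadAllMethodIDs triggers jmethodID allocation.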

bool VM::init(JavaVM* vm, bool attach) {
    // run only once, even if attached multiple times
    if (_jvmti != NULL) return true;
    _vm = vm;
    // obtain the JVMTI environment
    if (_vm->GetEnv((void**)&_jvmti, JVMTI_VERSION_1_0) != 0) {
        return false;
    }
    // locate libjvm.so
    void* libjvm = RTLD_DEFAULT;
    if (OS::isLinux() && (libjvm = dlopen("libjvm.so", RTLD_LAZY)) == NULL) {
        libjvm = RTLD_DEFAULT;
    }
    // look up AsyncGetCallTrace, used later to capture Java call stacks
    _asyncGetCallTrace = (AsyncGetCallTrace)dlsym(libjvm, "AsyncGetCallTrace");
    Profiler* profiler = Profiler::instance();
    // initialize the non-kernel symbols of the JVM, used later for native call stacks
    if (VMStructs::libjvm() == NULL) {
        profiler->updateSymbols(false);
        VMStructs::init(profiler->findLibraryByAddress((const void*)_asyncGetCallTrace));
    }
    jvmtiCapabilities capabilities = {0};
    capabilities.can_generate_all_class_hook_events = 1;
    capabilities.can_retransform_classes = 1;
    capabilities.can_retransform_any_class = isOpenJ9() ? 0 : 1;
    capabilities.can_generate_vm_object_alloc_events = isOpenJ9() ? 1 : 0;
    capabilities.can_get_bytecodes = 1;
    capabilities.can_get_constant_pool = 1;
    capabilities.can_get_source_file_name = 1;
    capabilities.can_get_line_numbers = 1;
    capabilities.can_generate_compiled_method_load_events = 1;
    capabilities.can_generate_monitor_events = 1;
    capabilities.can_generate_garbage_collection_events = 1;
    capabilities.can_tag_objects = 1;
    _jvmti->AddCapabilities(&capabilities);
    jvmtiEventCallbacks callbacks = {0};
    callbacks.VMInit = VMInit;
    callbacks.VMDeath = VMDeath;
    // 1
    callbacks.ClassLoad = ClassLoad; 
    // 2
    callbacks.ClassPrepare = ClassPrepare;
    callbacks.ClassFileLoadHook = Instrument::ClassFileLoadHook;
    callbacks.CompiledMethodLoad = Profiler::CompiledMethodLoad;
    callbacks.DynamicCodeGenerated = Profiler::DynamicCodeGenerated;
    callbacks.ThreadStart = Profiler::ThreadStart;
    callbacks.ThreadEnd = Profiler::ThreadEnd;
    callbacks.MonitorContendedEnter = LockTracer::MonitorContendedEnter;
    callbacks.MonitorContendedEntered = LockTracer::MonitorContendedEntered;
    callbacks.VMObjectAlloc = J9ObjectSampler::VMObjectAlloc;
    callbacks.SampledObjectAlloc = ObjectSampler::SampledObjectAlloc;
    callbacks.GarbageCollectionStart = ObjectSampler::GarbageCollectionStart;
    callbacks.GarbageCollectionFinish = Profiler::GarbageCollectionFinish;
    _jvmti->SetEventCallbacks(&callbacks, sizeof(callbacks));
    _jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_VM_DEATH, NULL);
    // 1
    _jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_CLASS_LOAD, NULL);
     // 2
    _jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_CLASS_PREPARE, NULL);
    _jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_DYNAMIC_CODE_GENERATED, NULL);
    _jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_GARBAGE_COLLECTION_FINISH, NULL);
    // 3
    loadAllMethodIDs(jvmti(), jni());
    _jvmti->GenerateEvents(JVMTI_EVENT_DYNAMIC_CODE_GENERATED);
    _jvmti->GenerateEvents(JVMTI_EVENT_COMPILED_METHOD_LOAD);
    return true;
}

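Two of these hooks are particularly important; both are implemented in vmEntry.h.

The ClassLoad hook has an empty body. Its only purpose is that AsyncGetCallTrace checks whether ClassLoad events are enabled before capturing a Java call stack, and succeeds only if they are.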

static void JNICALL ClassLoad(jvmtiEnv* jvmti, JNIEnv* jni, jthread thread, jclass klass) {
    // Needed only for AsyncGetCallTrace support
}

See src/hotspot/share/prims/forte.cpp: should_post_class_load returns true only once the ClassLoad event has been enabled, and only then can AsyncGetCallTrace capture Java call stacks.

JNIEXPORT
void AsyncGetCallTrace(ASGCT_CallTrace *trace, jint depth, void* ucontext) {
  JavaThread* thread;

  if (!JvmtiExport::should_post_class_load()) {
    trace->num_frames = ticks_no_class_load; // -1
    return;
  }
  // ....
}

The ClassPrepare hook fires once the class's methods and fields are ready.

Its purpose is the same as loadAllMethodIDs: triggering jmethodID allocation.

static void JNICALL ClassPrepare(jvmtiEnv* jvmti, JNIEnv* jni, jthread thread, jclass klass) {
    loadMethodIDs(jvmti, jni, klass);
}

loadAllMethodIDs in init makes sure jmethodIDs are allocated for the classes that are already loaded;

the ClassPrepare hook makes sure they are allocated for classes loaded later.

void VM::loadAllMethodIDs(jvmtiEnv* jvmti, JNIEnv* jni) {
    jint class_count;
    jclass* classes;
    // iterate over all loaded classes
    if (jvmti->GetLoadedClasses(&class_count, &classes) == 0) {
        for (int i = 0; i < class_count; i++) {
            // touch all methods of this class
            loadMethodIDs(jvmti, jni, classes[i]);
        }
        // release the classes array
        jvmti->Deallocate((unsigned char*)classes);
    }
}
void VM::loadMethodIDs(jvmtiEnv* jvmti, JNIEnv* jni, jclass klass) {
    jint method_count;
    jmethodID* methods;
    // fetch all methods of the class, then release the array right away; the call itself allocates the jmethodIDs
    if (jvmti->GetClassMethods(klass, &method_count, &methods) == 0) {
        jvmti->Deallocate((unsigned char*)methods);
    }
}

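A jmethodID is essentially a pointer to a slot that holds a Method* (see the comment in the code below). jmethodIDs are allocated lazily and cached, so GetClassMethods has to be called first to create and cache them; only then can AsyncGetCallTrace return jmethodIDs, which JVMTI can resolve to method names.

For how a jmethodID is created in the JVM, see src/hotspot/share/oops/method.cpp#make_jmethod_id.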

jmethodID Method::make_jmethod_id(ClassLoaderData* loader_data, Method* m) {
  ClassLoaderData* cld = loader_data;
  if (!SafepointSynchronize::is_at_safepoint()) {
    MutexLocker ml(JmethodIdCreation_lock,  Mutex::_no_safepoint_check_flag);
    if (cld->jmethod_ids() == NULL) {
      cld->set_jmethod_ids(new JNIMethodBlock());
    }
    // jmethodID is a pointer to Method*
    return (jmethodID)cld->jmethod_ids()->add_method(m);
  } else {
    if (cld->jmethod_ids() == NULL) {
      cld->set_jmethod_ids(new JNIMethodBlock());
    }
    // jmethodID is a pointer to Method*
    return (jmethodID)cld->jmethod_ids()->add_method(m);
  }
}
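
The "pointer to Method*" indirection can be modeled with a toy example. This is a conceptual sketch only: `Method`, `MethodId`, `MethodIdBlock`, and `resolve` are hypothetical simplifications and do not reflect how JNIMethodBlock is actually laid out.

```cpp
#include <deque>

// Toy stand-in for the VM's internal method metadata.
struct Method {
    const char* name;
};

// The id handed out to agents: the address of a slot holding a Method*.
typedef Method** MethodId;

class MethodIdBlock {
    // std::deque keeps element addresses stable across push_back,
    // so an id stays valid after more methods are added.
    std::deque<Method*> _slots;

  public:
    MethodId add(Method* m) {
        _slots.push_back(m);
        return &_slots.back();
    }
};

// Resolving an id is one extra dereference.
Method* resolve(MethodId id) {
    return id == nullptr ? nullptr : *id;
}
```

The extra level of indirection means an id only exists once a slot has been created for the method, which is why the profiler forces that creation up front instead of relying on it happening inside a signal handler.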

2. start

profiler.cpp: Profiler is a singleton that lives for as long as the JVM process does and is never destroyed. Every attach works with the same object.

// The instance is not deleted on purpose, since profiler structures
// can be still accessed concurrently during VM termination
Profiler* const Profiler::_instance = new Profiler();

profiler.cpp: run chooses the writer: start reports through a LogWriter, while stop writes to the file that was passed in.

Error Profiler::run(Arguments& args) {
    if (!args.hasOutputFile()) { 
        // start: messages go through LogWriter
        LogWriter out;
        return runInternal(args, out);
    } else {
        // stop: results go to the output file
        MutexLocker ml(_state_lock);
        FileWriter out(args.file());
        if (!out.is_open()) {
            return Error("Could not open output file");
        }
        return runInternal(args, out);
    }
}

profiler.cpp: runInternal dispatches on the action: start calls start, while stop calls stop and then falls through to dump.

Error Profiler::runInternal(Arguments& args, Writer& out) {
    switch (args._action) {
        case ACTION_START:
        case ACTION_RESUME: {
            Error error = start(args, args._action == ACTION_START);
            if (error) {
                return error;
            }
            if (!args._quiet) {
                out << "Profiling started\n";
            }
            break;
        }
        case ACTION_STOP: {
            Error error = stop();
            if (args._output == OUTPUT_NONE) {
                if (error) {
                    return error;
                }
                if (!args._quiet) {
                    out << "Profiling stopped after " << uptime() << " seconds. No dump options specified\n";
                }
                break;
            }
            // Fall through
        }
        case ACTION_DUMP: {
            Error error = dump(out, args);
            if (error) {
                return error;
            }
            break;
        }
        // ... other cases omitted
        default:
            break;
    }
    return Error::OK;
}

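profiler.cpp: start

1) Clear cached data: Profiler is a singleton, so each start first clears what the previous profiling session left behind;

2) Allocate _calltrace_buffer, the memory used to record call stacks. To reduce contention, 16 buffers are allocated, and threads are later spread across them by hash so that concurrent threads tend to record into different buffers;

3) updateSymbols refreshes the symbol table used to translate native (and, for perf_events, kernel) addresses into function names;

4) Select the CPU engine according to the event;

5) Start the CPU engine;

6) If several events are enabled, start the other engines too, for example alloc for memory allocation (multiple events are supported only when exporting in JFR format with -f xxx.jfr).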

Error Profiler::start(Arguments& args, bool reset) {
    // set _event_mask according to the -e argument
    _event_mask = (args._event != NULL ? EM_CPU : 0) |
                  (args._alloc >= 0 ? EM_ALLOC : 0) |
                  (args._lock >= 0 ? EM_LOCK : 0) |
                  (args._wall >= 0 ? EM_WALL : 0) |
                  (args._nativemem >= 0 ? EM_NATIVEMEM : 0);

    // clear cached data
    if (reset || _start_time == 0) {
        _total_samples = 0;
        _total_stack_walk_time = 0;
        memset(_failures, 0, sizeof(_failures));
        lockAll();
        _class_map.clear();
        _thread_filter.clear();
        _call_trace_storage.clear();
        _add_event_frame = args._output != OUTPUT_JFR;
        _add_thread_frame = args._threads && args._output != OUTPUT_JFR;
        _add_sched_frame = args._sched;
        unlockAll();
        MutexLocker ml(_thread_names_lock);
        _thread_names.clear();
        _thread_ids.clear();
    }

    // allocate memory for recording call stacks
    if (_max_stack_depth != args._jstackdepth) {
        _max_stack_depth = args._jstackdepth;
        size_t nelem = _max_stack_depth + MAX_NATIVE_FRAMES + RESERVED_FRAMES;
        for (int i = 0; i < 16; i++) {
            free(_calltrace_buffer[i]);
            _calltrace_buffer[i] = (CallTraceBuffer*)calloc(nelem, sizeof(CallTraceBuffer));
            if (_calltrace_buffer[i] == NULL) {
                _max_stack_depth = 0;
                return Error("Not enough memory to allocate stack trace buffers (try smaller jstackdepth)");
            }
        }
    }

    // refresh the symbol table used to map native addresses to names; kernel symbols are needed only for perf_events
    updateSymbols(_engine == &perf_events && !args._alluser);
    // select the CPU engine according to the event
    _engine = selectEngine(args._event);

    _cstack = args._cstack;

    // each CPU engine implements its own start method
    error = _engine->start(args);

    // several events can be enabled at once, each starting its own collector
    if (_event_mask & EM_ALLOC) {
        _alloc_engine = selectAllocEngine(args._alloc, args._live);
        error = _alloc_engine->start(args);
    }
    if (_event_mask & EM_LOCK) {
        error = lock_tracer.start(args);
    }
    if (_event_mask & EM_WALL) {
        error = wall_clock.start(args);
    }
    if (_event_mask & EM_NATIVEMEM) {
        error = malloc_tracer.start(args);
    }

    _state = RUNNING;
    _start_time = time(NULL);
    _epoch++;

    return Error::OK;
}
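
Step 2 spreads threads over the 16 call trace buffers by hashing. One way such a mapping could look is sketched below. The mixing steps and the names `bufferIndexFor` and `kConcurrencyLevel` are illustrative assumptions; the excerpt does not show the hash function itself.

```cpp
#include <cstdint>

// Number of call trace buffers, as in the loop `for (int i = 0; i < 16; i++)`.
static const uint32_t kConcurrencyLevel = 16;

// Maps a thread id to a buffer index in [0, 16).
// Folding the higher bits down makes nearby thread ids land on
// different buffers even when their low bits collide.
uint32_t bufferIndexFor(int tid) {
    uint32_t h = (uint32_t)tid;
    h ^= h >> 8;
    h ^= h >> 4;
    return h % kConcurrencyLevel;
}
```

Two threads that map to the same index still have to share a buffer, so the hash reduces contention without eliminating it; the per-buffer lock remains necessary.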

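profiler.cpp: selectEngine picks the CPU profiling engine by event name.

1) When no event is specified, EVENT_CPU is used: perf_events is chosen if the system allows it, otherwise the engine falls back to ctimer or wall;

2) -e wall/ctimer/itimer selects one of the other engines explicitly;

3) -e <Java method> selects the instrument engine.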

Engine* Profiler::selectEngine(const char* event_name) {
    if (event_name == NULL) {
        return &noop_engine;
    } else if (strcmp(event_name, EVENT_CPU) == 0) {
        if (FdTransferClient::hasPeer() || PerfEvents::supported()) {
            return &perf_events; // the default
        } else if (CTimer::supported()) {
            return &ctimer;
        } else {
            return &wall_clock;
        }
    } else if (strcmp(event_name, EVENT_WALL) == 0) {
        if (VM::isOpenJ9()) {
            return &j9_wall_clock;
        } else {
            return &wall_clock;
        }
    } else if (strcmp(event_name, EVENT_CTIMER) == 0) {
        return &ctimer;
    } else if (strcmp(event_name, EVENT_ITIMER) == 0) {
        return &itimer;
    } else if (strchr(event_name, '.') != NULL && strchr(event_name, ':') == NULL) {
        return &instrument;
    } else {
        return &perf_events;
    }
}
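
The selection rules can be condensed into a small classifier. This is a simplified sketch: it returns engine names as strings, ignores OpenJ9 and FdTransferClient, and assumes ctimer is always available as the fallback. The event strings "cpu", "wall", "ctimer", and "itimer" are assumed to be the values behind the EVENT_* constants, and `engineFor` is a hypothetical name.

```cpp
#include <cstring>
#include <string>

// Returns the name of the engine that would handle `event`.
std::string engineFor(const char* event, bool perf_supported) {
    if (event == nullptr) return "noop";
    if (strcmp(event, "cpu") == 0) {
        return perf_supported ? "perf_events" : "ctimer";
    }
    if (strcmp(event, "wall") == 0) return "wall";
    if (strcmp(event, "ctimer") == 0) return "ctimer";
    if (strcmp(event, "itimer") == 0) return "itimer";
    // A name containing '.' but no ':' is treated as a Java method to instrument.
    if (strchr(event, '.') != nullptr && strchr(event, ':') == nullptr) {
        return "instrument";
    }
    // Everything else is handed to perf_events as a named event.
    return "perf_events";
}
```

For example, `engineFor("java.lang.Thread.sleep", true)` yields "instrument", while an event name without a dot falls through to perf_events.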

perf_events

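perfEvents_linux.cpp: perf_events is supported on Linux only. Support is probed by actually calling perf_event_open. The perf_event_open system call is subject to permission restrictions:

1) outside containers, sysctl kernel.perf_event_paranoid=1 must be set (the default is usually 2);

2) inside containers, the call must additionally be permitted, for example through the seccomp profile or by other means.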

bool PerfEvents::supported() {
    struct perf_event_attr attr = {0};
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.sample_period = 1000000000;
    attr.sample_type = PERF_SAMPLE_CALLCHAIN;
    attr.disabled = 1;
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd == -1) {
        return false;
    }
    close(fd);
    return true;
}

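perfEvents_linux.cpp: start.

1) If kernel.kptr_restrict=0 is not set, kernel symbol addresses are unavailable, and the engine falls back to collecting user-space stacks only;

2) adjustFDLimit raises the file descriptor rlimit, because perf_events needs one fd per thread;

3) Create the events array, with capacity equal to the system's maximum thread id;

4) Register the signal handler signalHandler;

5) Loop over all threads and set up each one.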

Error PerfEvents::start(Arguments& args) {
    _event_type = PerfEventType::forName(args._event);
    if (!setupThreadHook()) {
        return Error("Could not set pthread hook");
    }
    _target_cpu = args._target_cpu;
    _interval = args._interval ? args._interval : _event_type->default_interval;
    _cstack = args._cstack;
    _signal = args._signal == 0 ? OS::getProfilingSignal(0) : args._signal & 0xff;
    _count_overrun = false;

    // unless kernel.kptr_restrict=0 is set, only user-space stacks can be collected
    _alluser = args._alluser;
    _kernel_stack = !_alluser && _cstack != CSTACK_NO;
    if (_kernel_stack && !Symbols::haveKernelSymbols()) {
        Log::warn("Kernel symbols are unavailable due to restrictions. Try\n"
                  "  sysctl kernel.perf_event_paranoid=1\n"
                  "  sysctl kernel.kptr_restrict=0");
        _kernel_stack = false;
        _alluser = strcmp(args._event, EVENT_CPU) != 0 && !supported();
    }

    // raise the fd rlimit, because each thread needs one fd
    adjustFDLimit();

    // create the events array, with capacity equal to the system's maximum thread id
    int max_events = OS::getMaxThreadId();
    if (max_events != _max_events) {
        free(_events);
        _events = (PerfEvent*)calloc(max_events, sizeof(PerfEvent));
        _max_events = max_events;
    }

    // register the signal handler
    OS::installSignalHandler(_signal, signalHandler);

    // Enable pthread hook before traversing currently running threads
    enableThreadHook();

    // set up every existing thread
    int err = createForAllThreads();
    return Error::OK;
}

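perfEvents_linux.cpp: createForThread enables perf for a single thread.

1) Fill in attr (sampling mode, driven by the CPU clock) and call perf_event_open, which returns an fd;

2) mmap a two-page region on the fd, used later to read the perf data;

3) Use fcntl to make the fd raise a signal that is delivered to, and handled asynchronously by, the thread itself;

4) Use ioctl to enable the perf event.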

int PerfEvents::createForThread(int tid) {
    if (!__sync_bool_compare_and_swap(&_events[tid]._fd, 0, -1)) {
        return -1;
    }

    PerfEventType* event_type = _event_type;
    struct perf_event_attr attr = {0};
    attr.size = sizeof(attr);
    // i.e. attr.type = PERF_TYPE_SOFTWARE for the cpu event
    attr.type = event_type->type;
    // i.e. attr.config = PERF_COUNT_SW_CPU_CLOCK for the cpu event
    attr.config = event_type->config;
    attr.config1 = event_type->config1;
    attr.config2 = event_type->config2;

    attr.precise_ip = 2;
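    // take one sample every _interval units of the event (nanoseconds of CPU time for cpu-clock)
    attr.sample_period = _interval;
    // collect the call chain
    attr.sample_type = PERF_SAMPLE_CALLCHAIN;
    // start disabled; enabled later through ioctl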
    attr.disabled = 1;
    attr.wakeup_events = 1;

    // when collecting user-space only, exclude kernel events
    if (_alluser) {
        attr.exclude_kernel = 1;
    }

    if (!_kernel_stack) {
        attr.exclude_callchain_kernel = 1;
    }

    // perf_event_open system call, targeting thread tid
    int fd = syscall(__NR_perf_event_open, &attr, tid, _target_cpu, -1, PERF_FLAG_FD_CLOEXEC);

    // map a memory region (2 pages) for the fd
    void* page = NULL;
    if (_kernel_stack || _cstack == CSTACK_DEFAULT || _cstack == CSTACK_LBR) {
        page = mmap(NULL, 2 * OS::page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (page == MAP_FAILED) {
            Log::warn("perf_event mmap failed: %s", strerror(errno));
            page = NULL;
        }
    }
    // cache the thread's fd and page in the events array
    _events[tid].reset();
    _events[tid]._fd = fd;
    _events[tid]._page = (struct perf_event_mmap_page*)page;

    // fcntl: have the perf_event signal delivered to the thread itself
    // ioctl: enable the perf_event
    struct f_owner_ex ex;
    ex.type = F_OWNER_TID;
    ex.pid = tid;
    int err;
    if (fcntl(fd, F_SETFL, O_ASYNC) < 0 || fcntl(fd, F_SETSIG, _signal) < 0 || fcntl(fd, F_SETOWN_EX, &ex) < 0) {
        err = errno;
        Log::warn("perf_event fcntl failed: %s", strerror(err));
    } else if (ioctl(fd, PERF_EVENT_IOC_RESET, 0) < 0 || ioctl(fd, PERF_EVENT_IOC_REFRESH, 1) < 0) {
        err = errno;
        Log::warn("perf_event ioctl failed: %s", strerror(err));
    } else {
        return 0;
    }

    // failure handling...
    if (page != NULL) {
        munmap(page, 2 * OS::page_size);
        _events[tid]._page = NULL;
    }
    close(fd);
    _events[tid]._fd = 0;

    return err;
}

ctimer

ctimer.h: ctimer is supported on Linux only. Unlike perf_events, it cannot collect kernel call stacks; it is the fallback when perf_events is unavailable.

#ifdef __linux__

class CTimer : public CpuEngine {
  private:
    static int _max_timers;
    static int* _timers;

    int createForThread(int tid);
    void destroyForThread(int tid);

  public:
    const char* type() {
        return "ctimer";
    }

    Error check(Arguments& args);
    Error start(Arguments& args);
    void stop();

    static bool supported() {
        return true;
    }
};

#else
// ... non-Linux stub omitted
#endif

ctimer_linux.cpp: ctimer creates one timer per thread (timer_create system call) that fires a signal periodically; the signal is handled by signalHandler.

Error CTimer::start(Arguments& args) {
    if (!setupThreadHook()) {
        return Error("Could not set pthread hook");
    }
    _interval = args._interval ? args._interval : DEFAULT_INTERVAL;
    _cstack = args._cstack;
    _signal = args._signal == 0 ? OS::getProfilingSignal(0) : args._signal & 0xff;
    _count_overrun = true;
    int max_timers = OS::getMaxThreadId();
    if (max_timers != _max_timers) {
        free(_timers);
        _timers = (int*)calloc(max_timers, sizeof(int));
        _max_timers = max_timers;
    }

    // register the signal handler
    OS::installSignalHandler(_signal, signalHandler);

    // Enable pthread hook before traversing currently running threads
    enableThreadHook();

    // loop over every thread of the target process and create a timer for it
    int err = createForAllThreads();
    return Error::OK;
}

int CTimer::createForThread(int tid) {
    struct sigevent sev;
    sev.sigev_value.sival_ptr = NULL;
    sev.sigev_signo = _signal;
    sev.sigev_notify = SIGEV_THREAD_ID;
    ((int*)&sev.sigev_notify)[1] = tid;
    clockid_t clock = thread_cpu_clock(tid);
    int timer;
    if (syscall(__NR_timer_create, clock, &sev, &timer) < 0) {
        return -1;
    }

    // Kernel timer ID may start with zero, but we use zero as an empty slot
    if (!__sync_bool_compare_and_swap(&_timers[tid], 0, timer + 1)) {
        // Lost race
        syscall(__NR_timer_delete, timer);
        return -1;
    }

    struct itimerspec ts;
    ts.it_interval.tv_sec = (time_t)(_interval / 1000000000);
    ts.it_interval.tv_nsec = _interval % 1000000000;
    ts.it_value = ts.it_interval;
    syscall(__NR_timer_settime, timer, 0, &ts, NULL);
    return 0;
}
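
The "zero means empty slot" convention used by createForThread can be isolated in a small sketch. It is not async-profiler code: `TimerSlots`, `claim`, and `release` are hypothetical names, and `std::atomic` stands in for the `__sync` builtins used above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration of the "id + 1" slot encoding.
class TimerSlots {
  public:
    explicit TimerSlots(std::size_t max_tid) : _slots(max_tid) {
        for (auto& s : _slots) s.store(0);
    }

    // Stores timer_id + 1 so that a valid kernel timer id of 0 can be
    // told apart from an empty slot. Returns false if the slot is taken.
    bool claim(std::size_t tid, int timer_id) {
        assert(tid < _slots.size() && timer_id >= 0);
        int expected = 0;
        return _slots[tid].compare_exchange_strong(expected, timer_id + 1);
    }

    // Clears the slot and returns the real timer id, or -1 if it was empty.
    int release(std::size_t tid) {
        assert(tid < _slots.size());
        int stored = _slots[tid].exchange(0);
        return stored == 0 ? -1 : stored - 1;
    }

  private:
    std::vector<std::atomic<int>> _slots;
};
```

A caller that loses the race in `claim` would delete the timer it just created, which is what the "Lost race" branch above does.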

itimer

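itimer is supported on both Linux and macOS.

itimer.cpp: itimer also registers signalHandler as the signal handler, then starts a single timer with setitimer that periodically raises SIGPROF. Because the timer targets the whole process rather than a specific thread, the signals cannot be distributed evenly across threads.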

Error ITimer::start(Arguments& args) {
    _interval = args._interval ? args._interval : DEFAULT_INTERVAL;
    _cstack = args._cstack;
    _signal = SIGPROF;
    _count_overrun = false;

    OS::installSignalHandler(SIGPROF, signalHandler);

    time_t sec = _interval / 1000000000;
    suseconds_t usec = (_interval % 1000000000) / 1000;
    struct itimerval tv = {{sec, usec}, {sec, usec}};
    if (setitimer(ITIMER_PROF, &tv, NULL) != 0) {
        return Error("ITIMER_PROF is not supported on this system");
    }
    return Error::OK;
}

wall

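wall is often combined with the -t option to diagnose application startup time.

wall samples threads at a fixed interval regardless of their state (running, sleeping, or blocked).

wallClock.cpp: start registers the signal handler and creates a timer thread with pthread_create.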

Error WallClock::start(Arguments& args) {
    // Wall-clock profiling defaults to WALL_BATCH
    if (args._wall >= 0 || strcmp(args._event, EVENT_WALL) == 0) {
        _mode = args._nobatch ? WALL_LEGACY : WALL_BATCH;
    } else {
        _mode = CPU_ONLY;
    }

    _interval = args._wall >= 0 ? args._wall : args._interval;
    if (_interval == 0) {
        _interval = _mode == CPU_ONLY ? DEFAULT_INTERVAL : DEFAULT_INTERVAL * 5;
    }

    _signal = args._signal == 0 ? OS::getProfilingSignal(1)
                                : ((args._signal >> 8) > 0 ? args._signal >> 8 : args._signal);
    // Register the signal handler
    OS::installSignalHandler(_signal, signalHandler);

    _running = true;

    // Create the timer thread
    if (pthread_create(&_thread, NULL, threadEntry, this) != 0) {
        return Error("Unable to create timer thread");
    }

    return Error::OK;
}

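wallClock.cpp: the timer thread sleeps between ticks and, on each tick, triggers stack sampling for up to n threads. In the default WALL_BATCH mode:

1) a thread is signaled when it has used more than 10000 ns of CPU time since the previous check;

2) a thread that keeps using no CPU is still signaled once its idle checks have accumulated to 1000;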

void WallClock::timerLoop() {
    int self = OS::threadId();
    ThreadFilter* thread_filter = Profiler::instance()->threadFilter();
    bool thread_filter_enabled = thread_filter->enabled();
    Mode mode = _mode;

    ThreadSleepMap thread_sleep_state;
    // List all threads of the current JVM process
    ThreadList* thread_list = OS::listThreads();
    _thread_cpu_time_buf.reset();
    u64 cycle_start_time = OS::nanotime();

    while (_running) {
        bool enabled = _enabled;
        // Signal at most THREADS_PER_TICK = 8 threads per tick
        for (int signaled_threads = 0; signaled_threads < THREADS_PER_TICK && thread_list->hasNext(); ) {
            int thread_id = thread_list->next();
            if (thread_id == self || thread_id <= 0) {
                continue;
            }
            if (thread_filter_enabled && !thread_filter->accept(thread_id)) {
                continue;
            }

            if (mode == CPU_ONLY) {
                if (!enabled || OS::threadState(thread_id) == THREAD_SLEEPING) {
                    continue;
                }
            } else if (mode == WALL_BATCH) {
                // Default mode
                // If the CPU time used since the last check is <= 10000 ns, keep accumulating in ThreadSleepState
                ThreadSleepState& tss = thread_sleep_state[thread_id];
                u64 new_thread_cpu_time = enabled ? OS::threadCpuTime(thread_id) : 0;
                if (new_thread_cpu_time != 0 && new_thread_cpu_time - tss.last_cpu_time <= RUNNABLE_THRESHOLD_NS) { // <= 10000ns
                    if (++tss.counter < MAX_IDLE_BATCH) { // accumulate at most 1000 times
                        if (tss.counter == 1) tss.start_time = TSC::ticks();
                        continue;
                    }
                }
                // JFR-related code omitted
                // ...
            }
            // The thread is due for a sample: send it the signal
            if (enabled && OS::sendSignalToThread(thread_id, _signal)) {
                signaled_threads++;
            }
        }

        // Sleep for a while
        u64 current_time = OS::nanotime();
        if (thread_list->hasNext()) {
            // Not all threads have been visited in this cycle yet
            long long sleep_time = cycle_start_time + (u64)_interval * thread_list->index() / thread_list->count() - current_time;
            OS::sleep(sleep_time < MIN_INTERVAL ? MIN_INTERVAL : sleep_time);
        } else {
            // All threads have been visited in this cycle
            cycle_start_time += (u64)_interval;
            long long sleep_time = cycle_start_time - current_time;
            if (sleep_time < MIN_INTERVAL) {
                cycle_start_time = current_time + MIN_INTERVAL;
                sleep_time = MIN_INTERVAL;
            }
            OS::sleep(sleep_time);
            thread_list->update();
        }

        _thread_cpu_time_buf.drain(thread_sleep_state);
    }

    // ... resource cleanup
}
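
The WALL_BATCH decision in the loop above can be modeled as a pure function. This is a minimal sketch under stated assumptions: `SleepState` and `shouldSignal` are hypothetical names, the constants repeat the values quoted in the text, and resetting the state after a signal is an assumption, because the original does that bookkeeping in code omitted from the excerpt.

```cpp
#include <cassert>
#include <cstdint>

// Values quoted in the text; the names are hypothetical.
constexpr std::uint64_t kRunnableThresholdNs = 10000;
constexpr int kMaxIdleBatch = 1000;

struct SleepState {
    std::uint64_t last_cpu_time = 0;
    int counter = 0;  // consecutive ticks on which the thread looked idle
};

// Returns true when the sampler should signal the thread on this tick.
// Assumption: the state is reset whenever a signal is sent.
bool shouldSignal(SleepState& s, std::uint64_t cpu_time_ns) {
    assert(cpu_time_ns >= s.last_cpu_time);
    bool idle = cpu_time_ns != 0 &&
                cpu_time_ns - s.last_cpu_time <= kRunnableThresholdNs;
    if (idle && ++s.counter < kMaxIdleBatch) {
        return false;  // keep batching idle ticks
    }
    s.counter = 0;
    s.last_cpu_time = cpu_time_ns;
    return true;
}
```

With these rules a busy thread is signaled on every tick, while a sleeping thread is signaled only once per 1000 ticks.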

Java methods

With -e com.x.y.z.XXXmethod, async-profiler profiles a Java method by instrumenting it: a call to Instrument#recordSample is inserted at the start of the original method.

import one.profiler.Instrument;
private void XXXmethod() {
    Instrument.recordSample();
    // original business logic
}

Instrument#recordSample is a native method provided by async-profiler.

public class Instrument {
    private Instrument() {
    }
    public static native void recordSample();
}

The ClassFileLoadHook callback is registered during the init phase in vmEntry.cpp.

bool VM::init(JavaVM* vm, bool attach) {
    // Repeated attaches run initialization only once
    if (_jvmti != NULL) return true;
    _vm = vm;
    // Obtain the JVMTI environment
    if (_vm->GetEnv((void**)&_jvmti, JVMTI_VERSION_1_0) != 0) {
        return false;
    }
    // ...
    jvmtiEventCallbacks callbacks = {0};
    callbacks.VMInit = VMInit;
    callbacks.VMDeath = VMDeath;
    callbacks.ClassLoad = ClassLoad; 
    callbacks.ClassPrepare = ClassPrepare;
    // Class file load callback
    callbacks.ClassFileLoadHook = Instrument::ClassFileLoadHook;

    _jvmti->SetEventCallbacks(&callbacks, sizeof(callbacks));
    // ...
    return true;
}

instrument.cpp: if profiling was not started with -e <Java method>, the class instrumentation here does not take effect.

void JNICALL Instrument::ClassFileLoadHook(jvmtiEnv* jvmti, JNIEnv* jni,
                                           jclass class_being_redefined, jobject loader,
                                           const char* name, jobject protection_domain,
                                           jint class_data_len, const u8* class_data,
                                           jint* new_class_data_len, u8** new_class_data) {
    // No -e <Java method> was specified, skip
    if (!_running) return;

    if (name == NULL || strcmp(name, _target_class) == 0) {
        BytecodeRewriter rewriter(class_data, class_data_len, _target_class);
        rewriter.rewrite(new_class_data, new_class_data_len);
    }
}

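instrument.cpp: as with an ordinary javaagent, if the target class has already been loaded, it is retransformed here. In addition, check defines the Instrument class and registers its recordSample native method through JNI.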

Error Instrument::start(Arguments& args) {
    // Define the Instrument class and register its recordSample native method
    Error error = check(args);

    // Parse the target class from the event
    setupTargetClassAndMethod(args._event);
    _interval = args._interval ? args._interval : 1;
    _calls = 0;
    _running = true;

    // Enable CLASS_FILE_LOAD_HOOK events
    jvmtiEnv* jvmti = VM::jvmti();
    jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_CLASS_FILE_LOAD_HOOK, NULL);

    // If the target class is already loaded, trigger a retransform to instrument it
    retransformMatchedClasses(jvmti);

    return Error::OK;
}
Error Instrument::check(Arguments& args) {
    if (!_instrument_class_loaded) {

        JNIEnv* jni = VM::jni();
        const JNINativeMethod native_method = {(char*)"recordSample", (char*)"()V", (void*)recordSample};

        jclass cls = jni->DefineClass(INSTRUMENT_NAME, NULL, (const jbyte*)INSTRUMENT_CLASS, INCBIN_SIZEOF(INSTRUMENT_CLASS));
        // Register the recordSample implementation through JNI
        if (cls == NULL || jni->RegisterNatives(cls, &native_method, 1) != 0) {
            jni->ExceptionDescribe();
            return Error("Could not load Instrument class");
        }
        _instrument_class_loaded = true;
    }
    return Error::OK;
}

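instrument.cpp: the native method is implemented as follows. Unlike the CPU engines above, no signal is involved: the instrumented method calls recordSample directly, which records the call stack.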

void JNICALL Instrument::recordSample(JNIEnv* jni, jobject unused) {
    if (!_enabled) return;
    if (_interval <= 1 || ((atomicInc(_calls) + 1) % _interval) == 0) {
        ExecutionEvent event(TSC::ticks());
        Profiler::instance()->recordSample(NULL, _interval, INSTRUMENTED_METHOD, &event);
    }
}
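
The interval check in recordSample amounts to "record every N-th call". A minimal sketch, using `std::atomic` in place of atomicInc; `CallSampler` and `shouldRecord` are hypothetical names:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the interval check in Instrument::recordSample.
class CallSampler {
  public:
    explicit CallSampler(std::uint64_t interval) : _interval(interval), _calls(0) {}

    // Returns true for every call when interval <= 1, otherwise for every
    // interval-th call. fetch_add returns the previous value, hence the + 1.
    bool shouldRecord() {
        if (_interval <= 1) return true;
        return (_calls.fetch_add(1, std::memory_order_relaxed) + 1) % _interval == 0;
    }

  private:
    const std::uint64_t _interval;
    std::atomic<std::uint64_t> _calls;
};
```

Because each recorded sample stands for `interval` calls, the excerpt above passes `_interval` as the sample's counter value.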

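3. Signal handling

Each engine raises its signal in a different way, but in every case the signal handler ends up collecting the thread's call stack. The handler runs on whichever JVM thread received the signal.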

perfEvents_linux.cpp:perf_events

void PerfEvents::signalHandler(int signo, siginfo_t* siginfo, void* ucontext) {
    if (siginfo->si_code <= 0) {
        // Looks like an external signal; don't treat as a profiling event
        return;
    }

    ExecutionEvent event(TSC::ticks());
    u64 counter = readCounter(siginfo, ucontext);
    Profiler::instance()->recordSample(ucontext, counter, PERF_SAMPLE, &event);

    // Reset the perf_event counter
    ioctl(siginfo->si_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(siginfo->si_fd, PERF_EVENT_IOC_REFRESH, 1);
}

cpuEngine.cpp:itimer/ctimer

void CpuEngine::signalHandler(int signo, siginfo_t* siginfo, void* ucontext) {
    if (!_enabled) return;

    ExecutionEvent event(TSC::ticks());
    // Count missed samples when estimating total CPU time
    u64 total_cpu_time = _count_overrun ? u64(_interval) * (1 + OS::overrun(siginfo)) : u64(_interval);
    Profiler::instance()->recordSample(ucontext, total_cpu_time, EXECUTION_SAMPLE, &event);
}

wallClock.cpp:wall

void WallClock::signalHandler(int signo, siginfo_t* siginfo, void* ucontext) {
    if (_mode == WALL_BATCH) { // default
        WallClockEvent event;
        event._start_time = TSC::ticks();
        event._thread_state = getThreadState(ucontext);
        event._samples = 1;
        u64 trace = Profiler::instance()->recordSample(ucontext, _interval, WALL_CLOCK_SAMPLE, &event);
        if (event._thread_state == THREAD_SLEEPING && trace != 0) {
            _thread_cpu_time_buf.add(trace);
        }
    } else {
        ExecutionEvent event(TSC::ticks());
        event._thread_state = _mode == CPU_ONLY ? THREAD_UNKNOWN : getThreadState(ucontext);
        Profiler::instance()->recordSample(ucontext, _interval, EXECUTION_SAMPLE, &event);
    }
}

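All CPU profiling engines eventually call Profiler#recordSample and differ only in the event type they pass:

1) perf_events: PERF_SAMPLE

2) ctimer/itimer: EXECUTION_SAMPLE

3) wall: WALL_CLOCK_SAMPLE (EXECUTION_SAMPLE outside WALL_BATCH mode)

4) Java method: INSTRUMENTED_METHOD

profiler.cpp: recordSample

1) acquire a lock to obtain lock_index, then use the _calltrace_buffer[lock_index]->_asgct_frames array to hold the call stack;

2) getNativeTrace collects the native (non-Java) frames;

3) getJavaTraceAsync collects the Java frames;

4) release the lock;

Note: this is the current default stack-walking logic. It is planned to be replaced by --cstack vm (CSTACK_VM); see ISSUE#795.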

u64 Profiler::recordSample(void* ucontext, u64 counter, EventType event_type, Event* event) {
    // Get the current thread id
    int tid = fastThreadId();
    // Acquire a lock
    u32 lock_index = getLockIndex(tid);
    if (!_locks[lock_index].tryLock() &&
        !_locks[lock_index = (lock_index + 1) % 16].tryLock() &&
        !_locks[lock_index = (lock_index + 2) % 16].tryLock())
    {
        return 0;
    }
    // _calltrace_buffer holds the call stack
    ASGCT_CallFrame* frames = _calltrace_buffer[lock_index]->_asgct_frames;
    int num_frames = 0;
    StackContext java_ctx = {0};
    // Native (non-Java) frames
    if (hasNativeStack(event_type)) {
        if (_cstack != CSTACK_NO) {
            num_frames += getNativeTrace(ucontext, frames + num_frames, event_type, tid, &java_ctx);
        }
    }
    // Java frames
    if (_cstack == CSTACK_VMX) {
        num_frames += StackWalker::walkVM(ucontext, frames + num_frames, _max_stack_depth, VM_EXPERT);
    } else if (event_type <= WALL_CLOCK_SAMPLE) {
        if (_cstack == CSTACK_VM) {
            num_frames += StackWalker::walkVM(ucontext, frames + num_frames, _max_stack_depth, VM_NORMAL);
        } else {
            // Use AsyncGetCallTrace (ASGCT) to collect the Java frames
            int java_frames = getJavaTraceAsync(ucontext, frames + num_frames, _max_stack_depth, &java_ctx);
            num_frames += java_frames;
        }
    } else if (event_type >= ALLOC_SAMPLE && event_type <= ALLOC_OUTSIDE_TLAB && _alloc_engine == &alloc_tracer) {
        // ..
    } else if (event_type == MALLOC_SAMPLE) {
       // ..
    } else {
        // -e <Java method> / lock
        int start_depth = event_type == INSTRUMENTED_METHOD ? 1 : 0;
        num_frames += getJavaTraceJvmti(jvmti_frames + num_frames, frames + num_frames, start_depth, _max_stack_depth);
    }

    // Store the collected frames in the call trace storage
    u32 call_trace_id = _call_trace_storage.put(num_frames, frames, counter);

    _locks[lock_index].unlock();
    return (u64)tid << 32 | call_trace_id;
}
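
The lock probing at the top of recordSample and the packing of its return value can be sketched in isolation. `LockPool`, `acquire`, and `packSampleId` are hypothetical names, and `std::atomic<bool>` stands in for the profiler's own lock type.

```cpp
#include <atomic>
#include <cstdint>

constexpr std::uint32_t kConcurrencyLevel = 16;  // matches the "% 16" above

// Hypothetical try-lock pool.
struct LockPool {
    std::atomic<bool> locked[kConcurrencyLevel] = {};

    bool tryLock(std::uint32_t i) {
        bool expected = false;
        return locked[i].compare_exchange_strong(expected, true, std::memory_order_acquire);
    }
    void unlock(std::uint32_t i) { locked[i].store(false, std::memory_order_release); }

    // Probes the home slot, then home + 1, then home + 3 (the third probe
    // adds 2 to the already advanced index). Returns the slot, or -1 if all
    // three are busy, in which case the sample is dropped.
    int acquire(std::uint32_t home) {
        std::uint32_t i = home % kConcurrencyLevel;
        if (tryLock(i) ||
            tryLock(i = (i + 1) % kConcurrencyLevel) ||
            tryLock(i = (i + 2) % kConcurrencyLevel)) {
            return static_cast<int>(i);
        }
        return -1;
    }
};

// Packs the thread id and trace id the way the return statement above does.
std::uint64_t packSampleId(std::uint32_t tid, std::uint32_t call_trace_id) {
    return (static_cast<std::uint64_t>(tid) << 32) | call_trace_id;
}
```

Only try-locks are used because the caller may be a signal handler, where blocking on a lock held by the interrupted thread would deadlock.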

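Collecting the native stack

profiler.cpp: getNativeTrace uses PerfEvents#walk for perf_events and, by default, StackWalker#walkFP for ctimer/itimer to collect native frame addresses into callchain. convertNativeTrace then resolves each address to a function name and stores it in frames.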

int Profiler::getNativeTrace(void* ucontext, ASGCT_CallFrame* frames, EventType event_type, int tid, StackContext* java_ctx) {
    const void* callchain[MAX_NATIVE_FRAMES];
    int native_frames;
    if (event_type == PERF_SAMPLE) {
        // perf_events
        native_frames = PerfEvents::walk(tid, ucontext, callchain, MAX_NATIVE_FRAMES, java_ctx);
    } else if (_cstack >= CSTACK_VM) {
        return 0;
    } else if (_cstack == CSTACK_DWARF) {
        native_frames = StackWalker::walkDwarf(ucontext, callchain, MAX_NATIVE_FRAMES, java_ctx);
    } else {
        // ctimer/itimer
        native_frames = StackWalker::walkFP(ucontext, callchain, MAX_NATIVE_FRAMES, java_ctx);
    }
    return convertNativeTrace(native_frames, callchain, frames, event_type);
}
// Convert callchain into ASGCT_CallFrame entries; for native frames method_id holds the function name and bci is the constant -10
int Profiler::convertNativeTrace(int native_frames, const void** callchain, ASGCT_CallFrame* frames, EventType event_type) {
    int depth = 0;
    for (int i = 0; i < native_frames; i++) {
        const char* current_method_name = findNativeMethod(callchain[i]);
        jmethodID current_method = (jmethodID)current_method_name;
        frames[depth].bci = BCI_NATIVE_FRAME; // -10
        frames[depth].method_id = current_method;
        depth++;
    }
    return depth;
}

perfEvents_linux.cpp: reads the perf data and records each ip (instruction pointer) into the callchain array, which forms the call stack.

int PerfEvents::walk(int tid, void* ucontext, const void** callchain, int max_depth, StackContext* java_ctx) {
    PerfEvent* event = &_events[tid];
    int depth = 0;
    // The first page of the mmap region is the metadata page
    struct perf_event_mmap_page* page = event->_page;
    if (page != NULL) {
        // Valid data lies between data_tail and data_head
        u64 tail = page->data_tail;
        u64 head = page->data_head;
        rmb();
        RingBuffer ring(page);
        // Iterate over perf records
        while (tail < head) {
            // Header of a perf record (64 bits)
            struct perf_event_header* hdr = ring.seek(tail);
            // Found a perf sample record
            if (hdr->type == PERF_RECORD_SAMPLE) {
                // Iterate over the nr instruction pointers
                u64 nr = ring.next();
                while (nr-- > 0) {
                    u64 ip = ring.next();
                    if (ip < PERF_CONTEXT_MAX) {
                        const void* iptr = (const void*)ip;
                        // Reached a Java frame (or the depth limit): stop
                        if (CodeHeap::contains(iptr) || depth >= max_depth) {
                            java_ctx->pc = iptr;
                            goto stack_complete;
                        }
                        // Record the ip in callchain
                        callchain[depth++] = iptr;
                    }
                }
                break;
            }
            tail += hdr->size;
        }
stack_complete:
        page->data_tail = head;
    }
    event->unlock();
    // ...
    return depth;
}

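About the mmap data layout:

1) the mapping size is (1+2^n)*pagesize, where n is set to 0 in createForThread and pagesize is typically 4096;

2) the first page of the mapping is the metadata page; the perf data lies between data_tail and data_head (the data area is actually a ring buffer);

3) each perf record starts with a perf_event_header, whose type field is the record type and whose size field is the record size. What follows depends on sample_type; here it is PERF_SAMPLE_CALLCHAIN, which has two fields: nr, the call stack depth, and ips[nr], the instruction pointer array in which each ip is an instruction address;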
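
The record layout in item 3 can be exercised against an in-memory buffer. The sketch assumes a flat, non-wrapping buffer and a sample_type of PERF_SAMPLE_CALLCHAIN only. `RecordHeader` and `firstCallchain` are hypothetical names, and the two constants repeat the values from <linux/perf_event.h> so that the code compiles without kernel headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Values as defined in <linux/perf_event.h>.
constexpr std::uint32_t kPerfRecordSample = 9;
constexpr std::uint64_t kPerfContextMax = static_cast<std::uint64_t>(-4095);

struct RecordHeader {  // same layout as struct perf_event_header
    std::uint32_t type;
    std::uint16_t misc;
    std::uint16_t size;  // total record size in bytes, header included
};

// Returns the callchain of the first sample record, skipping context
// markers such as PERF_CONTEXT_KERNEL. Stops on malformed input.
std::vector<std::uint64_t> firstCallchain(const std::uint8_t* data, std::size_t len) {
    std::vector<std::uint64_t> ips;
    std::size_t pos = 0;
    while (pos + sizeof(RecordHeader) <= len) {
        RecordHeader hdr;
        std::memcpy(&hdr, data + pos, sizeof(hdr));
        if (hdr.size < sizeof(hdr) || pos + hdr.size > len) break;
        if (hdr.type == kPerfRecordSample) {
            std::size_t payload = hdr.size - sizeof(hdr);
            const std::uint8_t* p = data + pos + sizeof(hdr);
            std::uint64_t nr = 0;
            if (payload >= 8) std::memcpy(&nr, p, 8);
            if (payload < 8 || nr > (payload - 8) / 8) break;  // truncated
            for (std::uint64_t i = 0; i < nr; i++) {
                std::uint64_t ip;
                std::memcpy(&ip, p + 8 * (i + 1), sizeof(ip));
                if (ip < kPerfContextMax) ips.push_back(ip);
            }
            break;
        }
        pos += hdr.size;  // skip records of other types
    }
    return ips;
}
```

PerfEvents::walk additionally stops at the first address inside the JVM code heap, which this sketch leaves out.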

[Figure: layout of the perf_events mmap region]

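stackWalker.cpp: with ctimer/itimer/wall, the walk starts from the current context (ucontext), follows frame pointers (fp) back through the call stack, and stores each frame's code address (pc) in the callchain array.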

int StackWalker::walkFP(void* ucontext, const void** callchain, int max_depth, StackContext* java_ctx) {
    StackFrame frame(ucontext);
    // Program counter
    const void* pc = (const void*)frame.pc();
    // Frame pointer of the current frame
    uintptr_t fp = frame.fp();
    // Current stack pointer
    uintptr_t sp = frame.sp();
    uintptr_t bottom = (uintptr_t)&sp + MAX_WALK_SIZE;
    int depth = 0;
    // Walk until the bottom of the stack or until the first Java frame
    while (depth < max_depth) {
        // Stop at the first Java frame
        if (CodeHeap::contains(pc) && !(depth == 0 && frame.unwindAtomicStub(pc))) {
            java_ctx->set(pc, sp, fp);
            break;
        }
        // Record the instruction address
        callchain[depth++] = pc;
        // Check if the next frame is below on the current stack
        if (fp < sp || fp >= sp + MAX_FRAME_SIZE || fp >= bottom) {
            break;
        }
        // Frame pointer must be word aligned
        if (!aligned(fp)) {
            break;
        }
        // Load the caller's return address via the frame pointer
        pc = stripPointer(SafeAccess::load((void**)fp + FRAME_PC_SLOT));
        if (inDeadZone(pc)) {
            break;
        }
        sp = fp + (FRAME_PC_SLOT + 1) * sizeof(void*);
        fp = *(uintptr_t*)fp;
    }
    return depth;
}
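
Frame-pointer unwinding is easier to follow on a simulated stack. In this sketch "addresses" are indices into a vector: `stack[fp]` holds the caller's frame pointer and `stack[fp + 1]` the return address. The helper name and the null-return-address stop condition are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Walks a simulated stack and returns the collected code addresses,
// innermost frame first. Hypothetical helper, not async-profiler code.
std::vector<std::uintptr_t> walkFramePointers(const std::vector<std::uintptr_t>& stack,
                                              std::uintptr_t pc, std::uintptr_t fp,
                                              std::uintptr_t sp, std::size_t max_depth) {
    std::vector<std::uintptr_t> callchain;
    while (callchain.size() < max_depth) {
        callchain.push_back(pc);
        // The frame pointer must not be below sp and must leave room
        // for the return-address slot inside the stack
        if (fp < sp || fp >= stack.size() || fp + 1 >= stack.size()) break;
        pc = stack[fp + 1];  // return address of the current frame
        if (pc == 0) break;  // assumed end-of-stack marker
        sp = fp + 2;         // caller's sp sits just above the saved pair
        fp = stack[fp];      // caller's frame pointer
    }
    return callchain;
}
```

The real walkFP adds further checks (frame size limit, alignment, safe memory loads) because it runs on an arbitrary interrupted thread.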

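profiler.cpp: findNativeMethod. For native frames, addresses are resolved to function names at sampling time.

The address is first mapped to a library, and a binary search within that library then finds the function name.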

const char* Profiler::findNativeMethod(const void* address) {
    CodeCache* lib = findLibraryByAddress(address);
    return lib == NULL ? NULL : lib->binarySearch(address);
}
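
The lookup inside one library can be sketched with `std::upper_bound` over a table sorted by start address. `Symbol` and `findSymbol` are hypothetical; the actual CodeCache layout and its handling of addresses between symbols may differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct Symbol {  // hypothetical: the range [start, start + size) and its name
    std::uintptr_t start;
    std::uintptr_t size;
    std::string name;
};

// `symbols` must be sorted by start address.
// Returns nullptr when the address falls inside no symbol.
const std::string* findSymbol(const std::vector<Symbol>& symbols, std::uintptr_t address) {
    auto it = std::upper_bound(symbols.begin(), symbols.end(), address,
                               [](std::uintptr_t addr, const Symbol& s) { return addr < s.start; });
    if (it == symbols.begin()) return nullptr;  // below the first symbol
    --it;  // last symbol whose start is <= address
    return address - it->start < it->size ? &it->name : nullptr;
}
```

Sorting once at load time keeps each lookup logarithmic, which matters because the lookup runs for every native frame of every sample.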

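During the profiler's start phase, updateSymbols parses the symbols used by the current process and stores them in the profiler instance's CodeCacheArray. The CodeCacheArray contains n CodeCache objects, each holding the symbols of one library.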

void Symbols::parseLibraries(CodeCacheArray* array, bool kernel_symbols) {
    // Parse kernel symbols
    if (kernel_symbols && !haveKernelSymbols()) {
        CodeCache* cc = new CodeCache("[kernel]");
        parseKernelSymbols(cc);
        if (haveKernelSymbols()) {
            cc->sort();
            array->add(cc);
        } else {
            delete cc;
        }
    }
    std::unordered_map<u64, SharedLibrary> libs;
    // Collect all shared libraries
    collectSharedLibraries(libs, MAX_NATIVE_LIBS - array->count());
    for (auto& it : libs) {
        u64 inode = it.first;
        _parsed_inodes.insert(inode);

        SharedLibrary& lib = it.second;
        CodeCache* cc = new CodeCache(lib.file, array->count(), false, lib.map_start, lib.map_end);
        // Parse the library's functions into the CodeCache ...
        free(lib.file);
        cc->sort();
        applyPatch(cc);
        array->add(cc);
    }
}

Kernel symbols are read from /proc/kallsyms.

cat /proc/kallsyms | head
<address> <symbol type> <symbol name>
ffffc398f9210000 T _stext
ffffc398f9210000 t __pi__stext
ffffc398f9210000 T __irqentry_text_start
...

Other shared libraries are discovered through /proc/{pid}/maps, and each library file is then parsed to obtain its function symbols.

cat /proc/<target pid>/maps
ffff99090000-ffff9a342000 r-xp 00000000 00:25 355243                     /usr/lib/jvm/java-17-openjdk-arm64/lib/server/libjvm.so
ffff9a430000-ffff9a461000 rw-p 01390000 00:25 355243                     /usr/lib/jvm/java-17-openjdk-arm64/lib/server/libjvm.so
ffff9a748000-ffff9a74a000 r--p 0002e000 00:25 3040794                    /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
ffff9a74a000-ffff9a74c000 rw-p 00030000 00:25 3040794                    /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
...
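
One line of the maps output above can be split into its fields with sscanf. This is a sketch only: `MapEntry` and `parseMapsLine` are hypothetical names and do not reflect how collectSharedLibraries is written.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

struct MapEntry {  // hypothetical holder for one /proc/<pid>/maps line
    std::uint64_t start, end, offset, inode;
    bool executable;
    std::string path;
};

// Parses "start-end perms offset dev inode path". Returns false on mismatch.
bool parseMapsLine(const char* line, MapEntry& out) {
    char perms[5] = {0};
    char dev[16] = {0};
    char path[4096] = {0};
    int n = std::sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s %" SCNx64 " %15s %" SCNu64 " %4095[^\n]",
                        &out.start, &out.end, perms, &out.offset, dev, &out.inode, path);
    if (n < 6) return false;  // the path is optional (anonymous mappings)
    out.executable = perms[2] == 'x';
    out.path = path;
    return true;
}
```

The inode field is what the parseLibraries excerpt uses as the map key, so several mappings of the same file collapse into one library.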

Collecting the Java stack

With -e <Java method>, the Java stack is obtained directly through GetStackTrace provided by JVMTI.

int Profiler::getJavaTraceJvmti(jvmtiFrameInfo* jvmti_frames, ASGCT_CallFrame* frames, int start_depth, int max_depth) {
    int num_frames = 0;
    if (VM::jvmti()->GetStackTrace(NULL, start_depth, max_depth, jvmti_frames, &num_frames) == 0 && num_frames > 0) {
        for (int i = 0; i < num_frames; i++) {
            jint bci = jvmti_frames[i].location;
            frames[i].method_id = jvmti_frames[i].method;
            frames[i].bci = bci;
            LP64_ONLY(frames[i].padding = 0;)
        }
    }
    return num_frames;
}

All other engines (perf_events, ctimer, itimer, wall) use AsyncGetCallTrace by default to obtain the Java stack.

profiler.cpp: getJavaTraceAsync calls AsyncGetCallTrace, exported by libjvm, passing an ASGCT_CallTrace. AsyncGetCallTrace writes the Java call stack into the ASGCT_CallFrame array.

int Profiler::getJavaTraceAsync(void* ucontext, ASGCT_CallFrame* frames, int max_depth, StackContext* java_ctx) {
    JNIEnv* jni = VM::jni();
    if (jni == NULL) {
        return 0;
    }
    JitWriteProtection jit(false);
    ASGCT_CallTrace trace = {jni, 0, frames};
    // Call AsyncGetCallTrace in libjvm.so
    VM::_asyncGetCallTrace(&trace, max_depth, ucontext);
    if (trace.num_frames > 0) {
        return trace.num_frames;
    }
    // ... error handling
}

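hotspot/share/prims/forte.cpp: the JVM's entry point for AsyncGetCallTrace.

AsyncGetCallTrace has many constraints: a thread can only obtain its own call stack, should_post_class_load requires that a ClassLoad hook has been enabled through JVMTI, a GC must not be in progress, and so on.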

JNIEXPORT
void AsyncGetCallTrace(ASGCT_CallTrace *trace, jint depth, void* ucontext) {
  JavaThread* thread;

  if (trace->env_id == NULL ||
    (thread = JavaThread::thread_from_jni_environment(trace->env_id)) == NULL ||
    thread->is_exiting()) {
    trace->num_frames = ticks_thread_exit; // -8
    return;
  }

  if (thread->in_deopt_handler()) {
    trace->num_frames = ticks_deopt; // -9
    return;
  }

  assert(JavaThread::current() == thread,
         "AsyncGetCallTrace must be called by the current interrupted thread");

  if (!JvmtiExport::should_post_class_load()) {
    trace->num_frames = ticks_no_class_load; // -1
    return;
  }

  if (Universe::heap()->is_gc_active()) {
    trace->num_frames = ticks_GC_active; // -2
    return;
  }

  switch (thread->thread_state()) {
   // ...
  case _thread_in_Java:
  case _thread_in_Java_trans:
    {
      frame fr;

      // First locate the top frame fr
      if (!thread->pd_get_top_frame_for_signal_handler(&fr, ucontext, true)) {
        trace->num_frames = ticks_unknown_Java;  // -5 unknown frame
      } else {
        trace->num_frames = ticks_not_walkable_Java;  // -6, non walkable frame by default
        // Walk down from the top frame and fill in trace
        forte_fill_call_trace_given_top(thread, trace, depth, fr);
      }
    }
    break;
  default:
    trace->num_frames = ticks_unknown_state; // -7
    break;
  }
}
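
When AsyncGetCallTrace fails, num_frames carries a negative code. The helper below maps only the codes visible in the excerpt above to their names; `describeAsgctResult` is a hypothetical function, and any code not shown in the excerpt is reported as "other".

```cpp
#include <string>

// Maps the num_frames values quoted in the excerpt to their names.
std::string describeAsgctResult(int num_frames) {
    if (num_frames > 0) return "ok";
    switch (num_frames) {
        case -1: return "ticks_no_class_load";
        case -2: return "ticks_GC_active";
        case -5: return "ticks_unknown_Java";
        case -6: return "ticks_not_walkable_Java";
        case -7: return "ticks_unknown_state";
        case -8: return "ticks_thread_exit";
        case -9: return "ticks_deopt";
        default: return "other";
    }
}
```

A profiler can use such a mapping to report why a sample has no Java frames instead of silently dropping it.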

hotspot/os_cpu/linux_aarch64/thread_linux_aarch64.cpp: pd_get_top_frame obtains the top frame. Depending on the CPU architecture, it reads pc (program counter), sp (stack pointer), and fp (frame pointer) from ucontext.

bool JavaThread::pd_get_top_frame(frame* fr_addr, void* ucontext, bool isInJava) {
    // ...
    ucontext_t* uc = (ucontext_t*) ucontext;
    intptr_t* ret_fp;
    intptr_t* ret_sp;
    address addr = os::fetch_frame_from_context(uc, &ret_sp, &ret_fp);
    frame ret_frame(ret_sp, ret_fp, addr);
    *fr_addr = ret_frame;
    return true;
    //...
}

address os::fetch_frame_from_context(const void* ucVoid,
                    intptr_t** ret_sp, intptr_t** ret_fp) {
  const ucontext_t* uc = (const ucontext_t*)ucVoid;
  address epc = os::Posix::ucontext_get_pc(uc);
  *ret_sp = os::Linux::ucontext_get_sp(uc);
  *ret_fp = os::Linux::ucontext_get_fp(uc);
  return epc;
}
typedef struct ucontext_t
  {
    unsigned long __ctx(uc_flags);
    struct ucontext_t *uc_link;
    stack_t uc_stack;
    sigset_t uc_sigmask;
    mcontext_t uc_mcontext;
  } ucontext_t;
typedef struct
  {
    unsigned long long int __ctx(fault_address);
    unsigned long long int __ctx(regs)[31]; // fp is regs[29] (x29)
    unsigned long long int __ctx(sp); // sp
    unsigned long long int __ctx(pc); // pc
    unsigned long long int __ctx(pstate);
    unsigned char __reserved[4096] __attribute__ ((__aligned__ (16)));
  } mcontext_t;

hotspot/share/prims/forte.cpp: forte_fill_call_trace_given_top walks the Java stack and fills the trace array with jmethodIDs.

static void forte_fill_call_trace_given_top(JavaThread* thd,
                                            ASGCT_CallTrace* trace,
                                            int depth,
                                            frame top_frame) {
  NoHandleMark nhm;
  frame initial_Java_frame;
  Method* method;
  int bci = -1;
  int count;
  count = 0;
  // Starting from the top frame, find the first Java frame (initial_Java_frame) and its method
  find_initial_Java_frame(thd, &top_frame, &initial_Java_frame, &method, &bci);
  // ...
  vframeStreamForte st(thd, initial_Java_frame, false);
  // Walk the Java stack
  for (; !st.at_end() && count < depth; st.forte_next(), count++) {
    bci = st.bci();
    method = st.method();
    // ...
    // Store the jmethodID in trace
    trace->frames[count].method_id = method->find_jmethod_id_or_null();
    // ...
  }
  trace->num_frames = count;
  return;
}

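4. stop

During profiling, the jmethodIDs of Java methods are stored in _call_trace_storage. On stop they must be converted to method names and written out in the requested format.

jattach attaches libasyncProfiler again, this time with the stop command. After stop completes, dump is executed.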

Error Profiler::runInternal(Arguments& args, Writer& out) {
    switch (args._action) {
        case ACTION_START:
        case ACTION_RESUME: {
            Error error = start(args, args._action == ACTION_START);
            break;
        }
        case ACTION_STOP: {
            Error error = stop();
            // Fall through
        }
        case ACTION_DUMP: {
            Error error = dump(out, args);
            break;
        }
        // ...
        default:
            break;
    }
    return Error::OK;
}

Stopping the engines

profiler.cpp: stop shuts down the engines.

Error Profiler::stop(bool restart) {
    MutexLocker ml(_state_lock);
    if (_state != RUNNING) {
        return Error("Profiler is not active");
    }
    if (_event_mask & EM_WALL) wall_clock.stop();
    if (_event_mask & EM_LOCK) lock_tracer.stop();
    if (_event_mask & EM_ALLOC) _alloc_engine->stop();
    if (_event_mask & EM_NATIVEMEM) malloc_tracer.stop();
    _engine->stop();
    // ...
    _state = IDLE;
    return Error::OK;
}

perfEvents_linux.cpp: the perf_events engine disables perf for each thread and releases the fd and the mmap region.

void PerfEvents::destroyForThread(int tid) {
    if (tid >= _max_events) {
        return;
    }
    PerfEvent* event = &_events[tid];
    int fd = event->_fd;
    if (fd > 0 && __sync_bool_compare_and_swap(&event->_fd, fd, 0)) {
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        close(fd);
    }
    if (event->_page != NULL) {
        event->lock();
        munmap(event->_page, 2 * OS::page_size);
        event->_page = NULL;
        event->unlock();
    }
}

ctimer_linux.cpp: ctimer deletes each thread's timer.

itimer and wall work similarly and are not covered here.

void CTimer::destroyForThread(int tid) {
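    int timer = _timers[tid];
    // The slot stores id + 1: timer-- passes the stored value to the CAS, then restores the real timer id
    if (timer != 0 && __sync_bool_compare_and_swap(&_timers[tid], timer--, 0)) {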
        syscall(__NR_timer_delete, timer);
    }
}

instrument.cpp: for -e <Java method>, setting _running=false and retransforming the target class restores the original class definition.

void Instrument::stop() {
    _running = false;
    jvmtiEnv* jvmti = VM::jvmti();
    retransformMatchedClasses(jvmti);  // undo transformation
    jvmti->SetEventNotificationMode(JVMTI_DISABLE, JVMTI_EVENT_CLASS_FILE_LOAD_HOOK, NULL);
}

dump

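profiler.cpp: dump supports several output formats; the flame graph (flamegraph) is the most commonly used.

Note: asprof accepts -o to set the output format and can also infer it from the -f file name extension, e.g. -f x.html produces a flame graph and x.jfr produces JFR output.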

Error Profiler::dump(Writer& out, Arguments& args) {
    MutexLocker ml(_state_lock);
    switch (args._output) {
        case OUTPUT_COLLAPSED:
            // -o collapsed
            dumpCollapsed(out, args);
            break;
        case OUTPUT_FLAMEGRAPH:
            // -o flamegraph
            dumpFlameGraph(out, args, false);
            break;
        case OUTPUT_TREE:
            // -o tree
            dumpFlameGraph(out, args, true);
            break;
        case OUTPUT_TEXT:
            // -o text
            dumpText(out, args);
            break;
        case OUTPUT_JFR:
            // -o jfr
            if (_state == RUNNING) {
                lockAll();
                _jfr.flush();
                unlockAll();
            }
            break;
        default:
            return Error("No output format selected");
    }
    return Error::OK;
}

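Whichever format is used, frame identifiers must eventually be converted to method names. The default text format serves as the example.

profiler.cpp: dumpText extracts the call stacks cached in _call_trace_storage into a collection, prints all call stacks first and then the top-of-stack methods, both in descending order of counter value.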

void Profiler::dumpText(Writer& out, Arguments& args) {
    FrameName fn(args, args._style | STYLE_DOTTED, _epoch, _thread_names_lock, _thread_names);
    char buf[1024] = {0};

    // Extract the call stacks cached in _call_trace_storage into samples
    std::vector<CallTraceSample> samples;
    u64 total_counter = 0;
    {
        std::map<u64, CallTraceSample> map;
        _call_trace_storage.collectSamples(map);
        samples.reserve(map.size());

        for (std::map<u64, CallTraceSample>::const_iterator it = map.begin(); it != map.end(); ++it) {
            CallTrace* trace = it->second.trace; // call stack
            u64 counter = it->second.counter; // accumulated counter value (e.g. ns), not the sample count
            if (trace == NULL || counter == 0) continue;

            total_counter += counter;
            if (trace->num_frames == 0 || excludeTrace(&fn, trace)) continue;
            samples.push_back(it->second);
        }
    }

    // Print summary
    snprintf(buf, sizeof(buf) - 1,
            "--- Execution profile ---\n"
            "Total samples       : %lld\n",
            _total_samples);
    out << buf;

    double cpercent = 100.0 / total_counter;
    const char* units_str = activeEngine()->units();

    // Print call stacks in descending order of counter
    if (args._dump_traces > 0) {
        std::sort(samples.begin(), samples.end(), [](const CallTraceSample& a, const CallTraceSample& b) {
            return a.counter > b.counter;
        });

        int max_count = args._dump_traces;
        for (std::vector<CallTraceSample>::const_iterator it = samples.begin(); it != samples.end() && --max_count >= 0; ++it) {
            snprintf(buf, sizeof(buf) - 1, "--- %lld %s (%.2f%%), %lld sample%s\n",
                     it->counter, units_str, it->counter * cpercent,
                     it->samples, it->samples == 1 ? "" : "s");
            out << buf;

            CallTrace* trace = it->trace;
            for (int j = 0; j < trace->num_frames; j++) {

                const char* frame_name = fn.name(trace->frames[j]);
                snprintf(buf, sizeof(buf) - 1, "  [%2d] %s\n", j, frame_name);
                out << buf;
            }
            out << "\n";
        }
    }

    // Print top-of-stack methods in descending order of counter
    if (args._dump_flat > 0) {
        std::map<std::string, MethodSample> histogram;
        for (std::vector<CallTraceSample>::const_iterator it = samples.begin(); it != samples.end(); ++it) {
            const char* frame_name = fn.name(it->trace->frames[0]);
            histogram[frame_name].add(it->samples, it->counter);
        }

        std::vector<NamedMethodSample> methods(histogram.begin(), histogram.end());
        std::sort(methods.begin(), methods.end(), sortByCounter);

        snprintf(buf, sizeof(buf) - 1, "%12s  percent  samples  top\n"
                                       "  ----------  -------  -------  ---\n", units_str);
        out << buf;

        int max_count = args._dump_flat;
        for (std::vector<NamedMethodSample>::const_iterator it = methods.begin(); it != methods.end() && --max_count >= 0; ++it) {
            snprintf(buf, sizeof(buf) - 1, "%12lld  %6.2f%%  %7lld  %s\n",
                     it->second.counter, it->second.counter * cpercent, it->second.samples, it->first.c_str());
            out << buf;
        }
    }
}
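
The flat "top frame" table at the end of dumpText is a group-by followed by a sort. A simplified sketch, with frames already resolved to strings; `TraceSample`, `MethodTotal`, and `flatProfile` are hypothetical names:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct TraceSample {  // hypothetical, simplified CallTraceSample
    std::vector<std::string> frames;  // frames[0] is the top of the stack
    std::uint64_t samples;
    std::uint64_t counter;  // e.g. nanoseconds of CPU time
};

struct MethodTotal {
    std::uint64_t samples = 0;
    std::uint64_t counter = 0;
};

using Row = std::pair<std::string, MethodTotal>;

// Groups samples by top frame and returns rows in descending counter order.
std::vector<Row> flatProfile(const std::vector<TraceSample>& traces) {
    std::map<std::string, MethodTotal> histogram;
    for (const TraceSample& t : traces) {
        if (t.frames.empty()) continue;
        MethodTotal& m = histogram[t.frames[0]];
        m.samples += t.samples;
        m.counter += t.counter;
    }
    std::vector<Row> rows(histogram.begin(), histogram.end());
    std::sort(rows.begin(), rows.end(),
              [](const Row& a, const Row& b) { return a.second.counter > b.second.counter; });
    return rows;
}
```

Two different call stacks that end in the same top frame are merged into one row, which is why the flat table is usually shorter than the list of traces.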

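Whichever format is used, FrameName handles the conversion from frames to method names.

frameName.cpp:

for native frames, method_id already stores the function name, so it can be output directly (decodeNativeSymbol only post-processes the name, for example by prefixing the library name; details are skipped here);

for Java frames, there is a cache layer;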

const char* FrameName::name(ASGCT_CallFrame& frame, bool for_matching) {
    switch (frame.bci) {
        case BCI_NATIVE_FRAME:
            // -10: native frame
            return decodeNativeSymbol((const char*)frame.method_id);
        // ...
        default: {
            // Java method; type_suffix can be ignored here
            const char* type_suffix = typeSuffix(FrameType::decode(frame.bci));
            // Cache lookup
            JMethodCache::iterator it = _cache.lower_bound(frame.method_id);
            if (it != _cache.end() && it->first == frame.method_id) {
                it->second[0] = _cache_epoch;
                const char* name = it->second.c_str() + 1;
                if (type_suffix != NULL) {
                    return _str.assign(name).append(type_suffix).c_str();
                }
                return name;
            }
            // Cache miss: resolve method_id to a name
            javaMethodName(frame.method_id);
            _cache.insert(it, JMethodCache::value_type(frame.method_id, std::string(1, _cache_epoch) + _str));
            if (type_suffix != NULL) {
                _str += type_suffix;
            }
            return _str.c_str();
        }
    }
}

frameName.cpp: resolves the class name and method name from a jmethodID through JVMTI.

void FrameName::javaMethodName(jmethodID method) {
    jclass method_class = NULL;
    char* class_name = NULL;
    char* method_name = NULL;
    char* method_sig = NULL;
    jvmtiEnv* jvmti = VM::jvmti();
    jvmtiError err;
    if ((err = jvmti->GetMethodName(method, &method_name, &method_sig, NULL)) == 0 &&
        (err = jvmti->GetMethodDeclaringClass(method, &method_class)) == 0 &&
        (err = jvmti->GetClassSignature(method_class, &class_name, NULL)) == 0) {
        // Trim 'L' and ';' off the class descriptor like 'Ljava/lang/Object;'
        javaClassName(class_name + 1, strlen(class_name) - 2, _style);
        _str.append(".").append(method_name);
        if (_style & STYLE_SIGNATURES) {
            if (_style & STYLE_NO_SEMICOLON) {
                for (char* s = method_sig; *s; s++) {
                    if (*s == ';') *s = '|';
                }
            }
            _str.append(method_sig);
        }
    } else {
        char buf[32];
        snprintf(buf, sizeof(buf), "[jvmtiError %d]", err);
        _str.assign(buf);
    }

    if (method_class) {
        _jni->DeleteLocalRef(method_class);
    }
    jvmti->Deallocate((unsigned char*)class_name);
    jvmti->Deallocate((unsigned char*)method_sig);
    jvmti->Deallocate((unsigned char*)method_name);
}
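
The string handling in this function can be tried without a JVM. The sketch below is a simplified illustration, assuming the dotted output style seen in the sample profile; dottedClassName and qualifiedMethodName are hypothetical helpers, and the real javaClassName supports further styles that are not modeled here.

```cpp
#include <string>

// Converts a JVM class descriptor such as "Ljava/lang/Object;" to "java.lang.Object".
// Anything that is not of the form L...; is returned unchanged in this sketch.
std::string dottedClassName(const std::string& descriptor) {
    size_t len = descriptor.size();
    if (len < 2 || descriptor[0] != 'L' || descriptor[len - 1] != ';') {
        return descriptor;
    }
    // Trim the leading 'L' and the trailing ';'
    std::string name = descriptor.substr(1, len - 2);
    for (size_t i = 0; i < name.size(); i++) {
        if (name[i] == '/') name[i] = '.';
    }
    return name;
}

// Joins class name and method name, optionally appending the JVM method signature.
std::string qualifiedMethodName(const std::string& class_descriptor,
                                const std::string& method_name,
                                const std::string& method_sig,
                                bool with_signature) {
    std::string result = dottedClassName(class_descriptor);
    result.append(".").append(method_name);
    if (with_signature) {
        result.append(method_sig);
    }
    return result;
}
```

For example, the descriptor Ljava/lang/Thread; with method name sleep yields java.lang.Thread.sleep, matching the frame names in the text output.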

Summary

This chapter walked through how async-profiler implements CPU profiling.

Three processes are involved:

[Figure: the three processes involved in async-profiler]

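1) asprof process: the asprof command-line tool, which parses the user's command and initiates the agent attach;

2) jattach process: each time asprof attaches the agent, it forks a jattach process to do so;

3) JVM process: the process being profiled, which runs the agent logic.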

jattach attaches the agent as follows:

[Figure: jattach agent attach flow]

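The bulk of the logic lives in the agent itself, libasyncProfiler.so.

The agent is attached twice: the first attach issues start to begin profiling, and the second issues stop to output the results.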

[Figure: agent workflow across the start and stop attaches]

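Initialization runs first. No matter how many times libasyncProfiler is attached, it initializes only once:

1) obtain the jvmti environment from the JavaVM that the JVM passes in;

2) locate the JVM shared library libjvm.so via dlopen and look up AsyncGetCallTrace, used later to walk Java stacks;

3) updateSymbols loads the non-kernel symbols, used later to resolve native stacks;

4) register several hooks through jvmti: the ClassLoad hook exists so that AsyncGetCallTrace can be called normally, and the ClassPrepare hook keeps jmethodIDs up to date;

5) loadAllMethodIDs triggers jmethodID allocation.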

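Next, the logic branches on the CPU engine selected:

1) by default perf_events is preferred, falling back to ctimer, with wall as the last resort;

2) an engine can also be chosen explicitly with -e, for example itimer or Java method-level profiling.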
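
The default fallback order described above can be written as a small decision function. This is only a sketch of the ordering: the Capabilities struct and selectCpuEngine are hypothetical names, and how the profiler actually probes for each capability is not shown.

```cpp
#include <string>

// Hypothetical summary of what the current environment supports.
struct Capabilities {
    bool perf_events;  // perf_event_open is usable
    bool ctimer;       // per-thread timers via timer_create are usable
};

// Picks the default CPU engine: perf_events first, then ctimer, then wall.
std::string selectCpuEngine(const Capabilities& caps) {
    if (caps.perf_events) {
        return "perf_events";
    }
    if (caps.ctimer) {
        return "ctimer";
    }
    return "wall";
}
```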

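perf_events:

1) calls perf_event_open for every thread of the target process so that the kernel delivers signals;

2) is subject to permission constraints: outside containers, set sysctl kernel.perf_event_paranoid=1 (the default is 2); inside containers, perf_event_open has to be permitted through seccomp or some other mechanism; sysctl kernel.kptr_restrict=0 is additionally required to resolve kernel symbols.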
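
A quick way to check the first setting from code is to read it from /proc. The sketch below is an illustration, not part of async-profiler: readSysctlInt and kernelProfilingAllowedForUnprivileged are hypothetical helpers, and the threshold assumes the semantics in the Linux kernel sysctl documentation, where a value of 2 or higher disallows kernel profiling for unprivileged users.

```cpp
#include <fstream>
#include <string>

// Reads an integer from a /proc/sys file; returns fallback if it cannot be read.
int readSysctlInt(const std::string& path, int fallback) {
    std::ifstream in(path.c_str());
    int value = fallback;
    if (in >> value) {
        return value;
    }
    return fallback;
}

// Assumption: perf_event_paranoid >= 2 blocks kernel profiling for unprivileged users.
bool kernelProfilingAllowedForUnprivileged(int paranoid) {
    return paranoid < 2;
}
```

A caller could pass "/proc/sys/kernel/perf_event_paranoid" as the path and treat an unreadable file as the most restrictive case by choosing a high fallback value.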

ctimer:

1) creates one timer per thread (the timer_create system call) that fires signals periodically;

2) unlike perf_events, cannot capture kernel frames.

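itimer:

1) starts a single timer with setitimer that fires signals periodically;

2) unlike ctimer, the signals are sent to the process as a whole, so they cannot be distributed evenly across threads.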
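
The itimer mechanism is easy to reproduce in a few lines. The following is a minimal sketch, assuming a POSIX system: it arms ITIMER_PROF and counts SIGPROF deliveries. The function names are hypothetical, and the handler only increments a counter, whereas a real profiler would capture a call stack at that point.

```cpp
#include <signal.h>
#include <sys/time.h>

// Number of SIGPROF signals received so far.
static volatile sig_atomic_t g_samples = 0;

// Signal handler: a profiler would record a stack trace here.
static void onSigprof(int) {
    g_samples = g_samples + 1;
}

// Installs the handler and arms a process-wide CPU-time timer.
// Returns 0 on success, -1 on failure.
int startItimer(long interval_us) {
    struct sigaction sa;
    sa.sa_handler = onSigprof;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, NULL) != 0) {
        return -1;
    }

    struct itimerval tv;
    tv.it_interval.tv_sec = interval_us / 1000000;
    tv.it_interval.tv_usec = interval_us % 1000000;
    tv.it_value = tv.it_interval;
    return setitimer(ITIMER_PROF, &tv, NULL);
}

// Disarms the timer by setting both fields to zero.
void stopItimer() {
    struct itimerval tv = {{0, 0}, {0, 0}};
    setitimer(ITIMER_PROF, &tv, NULL);
}

// Burns CPU until at least 'wanted' signals have arrived.
// ITIMER_PROF only advances while the process is consuming CPU time.
long burnUntil(int wanted) {
    volatile long spins = 0;
    while (g_samples < wanted) {
        spins = spins + 1;
    }
    return spins;
}
```

Because the timer belongs to the process, the kernel chooses which thread receives each SIGPROF, which is the uneven distribution noted above.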

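wall:

1) creates a sampling thread with pthread_create that periodically scans n threads and signals them, whatever state they are in (running, sleeping, or blocked).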

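Java methods:

1) uses bytecode instrumentation to insert a call to one.profiler.Instrument#recordSample before the target method runs;

2) Instrument#recordSample sends no signal; it captures the Java stack directly.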

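Setting up signal delivery, in whichever way the chosen engine does it, is the main job of the agent attach.

While profiling runs, signals keep arriving and each one records a call stack.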

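Collecting native stacks:

1) perf_events: instruction addresses are read straight from the perf data;

2) ctimer/itimer/wall: each thread that receives a signal unwinds its stack using the frame pointer (fp) to obtain instruction addresses (pc);

3) once native instruction addresses are collected, they are translated to method names using the symbols loaded during initialization, and the names are cached.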
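
The address-to-name step can be illustrated with a sorted symbol table and a binary search. This is a simplified sketch, not the profiler's own data structure: Symbol and SymbolTable are hypothetical names, and each symbol is assumed to cover the half-open address range [start, end).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One native symbol covering the address range [start, end).
struct Symbol {
    unsigned long start;
    unsigned long end;
    std::string name;
};

static bool byStart(const Symbol& a, const Symbol& b) {
    return a.start < b.start;
}

static bool pcBeforeStart(unsigned long pc, const Symbol& s) {
    return pc < s.start;
}

class SymbolTable {
  public:
    SymbolTable() : _sorted(true) {}

    void add(unsigned long start, unsigned long size, const std::string& name) {
        Symbol s = {start, start + size, name};
        _symbols.push_back(s);
        _sorted = false;
    }

    // Returns the name of the symbol containing pc, or "[unknown]".
    std::string find(unsigned long pc) {
        if (!_sorted) {
            std::sort(_symbols.begin(), _symbols.end(), byStart);
            _sorted = true;
        }
        // First symbol that starts after pc; the candidate is the one before it.
        std::vector<Symbol>::iterator it =
            std::upper_bound(_symbols.begin(), _symbols.end(), pc, pcBeforeStart);
        if (it == _symbols.begin()) {
            return "[unknown]";
        }
        --it;
        return pc < it->end ? it->name : "[unknown]";
    }

  private:
    std::vector<Symbol> _symbols;
    bool _sorted;
};
```

Sorting once and searching in logarithmic time keeps lookups cheap when the same table is queried for every sampled frame.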

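Collecting Java stacks:

1) the Java method engine uses jvmti's GetStackTrace;

2) the other engines default to AsyncGetCallTrace, a non-standard function exported by libjvm.so;

3) a Java stack is a list of jmethodIDs; in HotSpot a jmethodID is effectively a handle that resolves to the VM's internal Method structure.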

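When profiling finishes, the agent is attached again to run the stop command:

1) stop the active CPU engine, for example perf_events closes the perf event of every thread;

2) dump writes the results in the format given by -o or -f; the default is text. Native frames already have their method names from the profiling phase, while Java frames still need the jmethodID passed to jvmti functions such as GetMethodName to obtain their names.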