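Android Process Management 3: Memory Reclaim with LMKD
Android's response to memory pressure falls into three layers. When the device runs short of memory, LMKD starts killing processes; processes that cannot be killed are notified through TrimMemory callbacks so that they release memory themselves; and in the most extreme case a process simply dies with an OOM. This article focuses on LMKD.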
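1. LMKD and Lowmemorykiller
Starting with Android 9, the system dropped the traditional Lowmemorykiller in favor of LMKD (Low Memory Killer Daemon) for low-memory kills.
Starting with Android 10, lmkd switched its memory-pressure detection from vmpressure to PSI.
1.1 How do LMKD and Lowmemorykiller differ?
- Lowmemorykiller runs inside the Linux kernel, while LMKD runs as a standalone daemon, which makes it more extensible and flexible.
- Lowmemorykiller relies on oom_score_adj only; LMKD monitors and weighs more dimensions.
- Lowmemorykiller kills processes more slowly than LMKD; the logic behind this is visible in the code.
Before Android 9, the decision relied mainly on the following files: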
# /sys/module/lowmemorykiller/parameters/minfree
18432,23040,27648,32256,55296,80640
# /sys/module/lowmemorykiller/parameters/adj
0,100,200,300,900,906
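To make the pairing of the two files concrete, here is a minimal sketch of how such thresholds can be read: the i-th minfree value (in pages) is paired with the i-th adj value, and the first threshold that both free memory and file cache fall below decides the lowest oom_score_adj that may be killed. The helper names `parse_csv` and `pick_min_score_adj` are made up for illustration; the comparison mirrors the minfree loop shown later in mp_event_common.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parse a comma-separated list such as "18432,23040,27648" into integers.
std::vector<long> parse_csv(const std::string& text) {
    std::vector<long> values;
    std::stringstream ss(text);
    std::string item;
    while (std::getline(ss, item, ',')) {
        values.push_back(std::stol(item));
    }
    return values;
}

// Hypothetical helper: return the lowest oom_score_adj that may be killed,
// or 1001 (OOM_SCORE_ADJ_MAX + 1, nothing eligible) when no threshold is crossed.
// other_free and other_file are page counts, like the minfree values.
int pick_min_score_adj(const std::vector<long>& minfree,
                       const std::vector<long>& adj,
                       long other_free, long other_file) {
    for (size_t i = 0; i < minfree.size() && i < adj.size(); ++i) {
        if (other_free < minfree[i] && other_file < minfree[i]) {
            return static_cast<int>(adj[i]);
        }
    }
    return 1001;
}
```

With the values above, 60000 free pages and 60000 file pages only cross the last threshold (80640), so only processes with adj 906 or higher are eligible.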
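1.2 Why does PSI replace vmpressure?
vmpressure signals (generated by the kernel for memory-pressure detection and consumed by lmkd) often contain many false positives, so lmkd has to filter them to determine whether memory is really under pressure. This causes unnecessary lmkd wakeups and burns extra CPU. PSI monitors detect memory pressure more precisely and keep the filtering overhead to a minimum.
Kernel build configuration required for LMKD: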
CONFIG_ANDROID_LOW_MEMORY_KILLER=n
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
Kernel build configuration required for PSI:
CONFIG_PSI=y
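So on Android 10 and later, once the system has finished initializing and lmkd has started, lmkd first checks use_inkernel_interface (false on newer releases), then checks whether PSI is supported and falls back to vmpressure if it is not. It then picks a strategy depending on whether the device is low-RAM and whether use_minfree_levels is set.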
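The decision order in the paragraph above can be sketched as a small pure function. This is only an illustration under the assumptions stated in the text: the struct `LmkdEnv` and the function `describe_mode` are hypothetical names, and the sketch ignores the ro.lmk.use_new_strategy override that appears later in init_psi_monitors.

```cpp
#include <cassert>
#include <string>

// Inputs the text says lmkd looks at; the struct itself is hypothetical.
struct LmkdEnv {
    bool use_inkernel_interface;  // kernel lowmemorykiller driver is used
    bool psi_available;           // ro.lmk.use_psi is true and PSI monitors initialized
    bool low_ram_device;          // ro.config.low_ram
    bool use_minfree_levels;      // ro.lmk.use_minfree_levels
};

// Return a label for the detection mechanism and handler that would be chosen.
std::string describe_mode(const LmkdEnv& env) {
    if (env.use_inkernel_interface) return "in-kernel lowmemorykiller";
    if (!env.psi_available) return "vmpressure + mp_event_common";
    bool use_new_strategy = env.low_ram_device || !env.use_minfree_levels;
    return use_new_strategy ? "psi + mp_event_psi" : "psi + mp_event_common";
}
```

A high-end device with PSI and without minfree levels therefore ends up in mp_event_psi, which is the path most of this article follows.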
2. LMKD code flow
2.1 lmkd startup
lmkd.rc
service lmkd /system/bin/lmkd
class core // core service class; started by "class_start core" in init.rc (on boot)
user lmkd
group lmkd system readproc
capabilities DAC_OVERRIDE KILL IPC_LOCK SYS_NICE SYS_RESOURCE
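critical // if it crashes 4 times within 4 minutes, the device reboots into the bootloader
socket lmkd seqpacket 0660 system system // create the lmkd control socket
writepid /dev/cpuset/system-background/tasks // place lmkd in this cpuset
For how critical is handled, see system/core/init/service.cpp.
2.2 LMKD main()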
int main(int argc, char **argv) {
update_props(); // refresh the lmk-related system properties
ctx = create_android_logger(KILLINFO_LOG_TAG); // event log
if (!init()) {
if (!use_inkernel_interface) { // normally false
/*
* MCL_ONFAULT pins pages as they fault instead of loading
* everything immediately all at once. (Which would be bad,
* because as of this writing, we have a lot of mapped pages we
* never use.) Old kernels will see MCL_ONFAULT and fail with
* EINVAL; we ignore this failure.
*
* N.B. read the man page for mlockall. MCL_CURRENT | MCL_ONFAULT
* pins ⊆ MCL_CURRENT, converging to just MCL_CURRENT as we fault
* in pages.
*/
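// Lock the whole address space of this real-time process into physical memory. This stops Linux
// from moving these pages to swap space, even if the process has not touched them for a while.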
/* CAP_IPC_LOCK required */
if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT) && (errno != EINVAL)) {
ALOGW("mlockall failed %s", strerror(errno));
}
/* CAP_NICE required */
struct sched_param param = {
.sched_priority = 1,
};
if (sched_setscheduler(0, SCHED_FIFO, &param)) { // real-time scheduling
ALOGW("set SCHED_FIFO failed %s", strerror(errno));
}
}
// event loop: handles incoming socket messages and events
mainloop();
}
android_log_destroy(&ctx);
ALOGI("exiting");
return 0;
}
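The properties it refreshes are listed below:
Property | Use | Default |
---|---|---|
ro.config.low_ram | Specifies whether the device is a low-RAM device or a high-performance device. | false |
ro.lmk.use_psi | Use PSI monitors (instead of vmpressure events). | true |
ro.lmk.use_minfree_levels | Use free memory and file cache thresholds to make kill decisions (that is, match the behavior of the in-kernel LMK driver). | false |
ro.lmk.low | Minimum oom_adj score of processes that may be killed at the low vmpressure level. | 1001 (disabled) |
ro.lmk.medium | Minimum oom_adj score of processes that may be killed at the medium vmpressure level. | 800 (cached or non-essential services) |
ro.lmk.critical | Minimum oom_adj score of processes that may be killed at the critical vmpressure level. | 0 (any process) |
ro.lmk.critical_upgrade | Enable upgrading to the critical level. | false |
ro.lmk.upgrade_pressure | Maximum mem_pressure at which the level is upgraded because the system is swapping too much. | 100 (disabled) |
ro.lmk.downgrade_pressure | Minimum mem_pressure at which a vmpressure event is ignored because enough free memory is still available. | 100 (disabled) |
ro.lmk.kill_heaviest_task | Kill the heaviest eligible task (best decision) versus any eligible task (fast decision). | true |
ro.lmk.kill_timeout_ms | Duration in milliseconds after a kill during which no additional kill is performed. | 0 (disabled) |
ro.lmk.debug | Enable lmkd debug logs. | false |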
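2.3 LMKD init()
lmkd probes the kernel lmk module path (/sys/module/lowmemorykiller/parameters/minfree) to find out whether the system contains the in-kernel lmk module. If the module exists and the user has set enable_userspace_lmk to false, the in-kernel lmk is used directly; otherwise the userspace lmkd is used.
init_monitors() then decides whether memory is monitored through PSI or vmpressure; on my devices, everything running Android 10 or later uses PSI.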
static int init(void) {
static struct event_handler_info kernel_poll_hinfo = { 0, kernel_event_handler };
struct reread_data file_data = { // /proc/zoneinfo
.filename = ZONEINFO_PATH,
.fd = -1,
};
epollfd = epoll_create(MAX_EPOLL_EVENTS); // create the global epoll file descriptor
/*
* MAX_EPOLL_EVENTS:
* 1 ctrl listen socket, 3 ctrl data socket, 3 memory pressure levels,
* 1 lmk events + 1 fd to wait for process death
*/
ctrl_sock.sock = android_get_control_socket("lmkd"); // open the lmkd socket file descriptor
···
ret = listen(ctrl_sock.sock, MAX_DATA_CONN);
···
epev.events = EPOLLIN;
// called back when a client connects to the lmkd socket; logs "lmkd data connection established"
ctrl_sock.handler_info.handler = ctrl_connect_handler;
epev.data.ptr = (void *)&(ctrl_sock.handler_info);
has_inkernel_module = !access(INKERNEL_MINFREE_PATH, W_OK); // "/sys/module/lowmemorykiller/parameters/minfree"
use_inkernel_interface = has_inkernel_module;
if (use_inkernel_interface) { // false: newer releases no longer take this path
ALOGI("Using in-kernel low memory killer interface");
if (init_poll_kernel()) {
···
}
} else {
if (!init_monitors()) { // initmonitor
return -1;
}
/* let the others know it does support reporting kills */
property_set("sys.lmk.reportkills", "1");
}
return 0;
}
#### **init_monitors**
ro.lmk.use_psi decides whether PSI is requested at all; init_psi_monitors() then reports whether the PSI monitors (psi.cpp) were initialized successfully.
```cpp
static bool init_monitors() {
/* Try to use psi monitor first if kernel has it */
use_psi_monitors = property_get_bool("ro.lmk.use_psi", true) &&
init_psi_monitors();
/* Fall back to vmpressure */
if (!use_psi_monitors &&
(!init_mp_common(VMPRESS_LEVEL_LOW) ||
!init_mp_common(VMPRESS_LEVEL_MEDIUM) ||
!init_mp_common(VMPRESS_LEVEL_CRITICAL))) {
ALOGE("Kernel does not support memory pressure events or in-kernel low memory killer");
return false;
}
if (use_psi_monitors) {
ALOGI("Using psi monitors for memory pressure detection");
} else {
ALOGI("Using vmpressure for memory pressure detection");
}
return true;
}
```
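init_psi_monitors
The kill strategy chosen here depends mainly on the following two properties:
- ro.config.low_ram: marks the device as a low-RAM device.
- ro.lmk.use_minfree_levels: make kill decisions the same way as the in-kernel LMK driver, that is, from free memory and file cache thresholds.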
static bool init_psi_monitors() {
/*
* When PSI is used on low-ram devices or on high-end devices without memfree levels
* use new kill strategy based on zone watermarks, free swap and thrashing stats
*/
// low_ram_device is ro.config.low_ram; use_minfree_levels is ro.lmk.use_minfree_levels
// the new strategy is used on low-RAM devices, or whenever the legacy minfree levels are not used
bool use_new_strategy =
property_get_bool("ro.lmk.use_new_strategy", low_ram_device || !use_minfree_levels);
/* In default PSI mode override stall amounts using system properties */
if (use_new_strategy) {
/* Do not use low pressure level */
psi_thresholds[VMPRESS_LEVEL_LOW].threshold_ms = 0;
// ro.lmk.psi_partial_stall_ms: 200 ms on low-RAM devices, otherwise 70 ms
psi_thresholds[VMPRESS_LEVEL_MEDIUM].threshold_ms = psi_partial_stall_ms;
// ro.lmk.psi_complete_stall_ms: 700 ms
psi_thresholds[VMPRESS_LEVEL_CRITICAL].threshold_ms = psi_complete_stall_ms;
}
// init_mp_psi is the part to look at closely
if (!init_mp_psi(VMPRESS_LEVEL_LOW, use_new_strategy)) {
return false;
}
if (!init_mp_psi(VMPRESS_LEVEL_MEDIUM, use_new_strategy)) {
destroy_mp_psi(VMPRESS_LEVEL_LOW);
return false;
}
if (!init_mp_psi(VMPRESS_LEVEL_CRITICAL, use_new_strategy)) {
destroy_mp_psi(VMPRESS_LEVEL_MEDIUM);
destroy_mp_psi(VMPRESS_LEVEL_LOW);
return false;
}
return true;
}
/* memory pressure levels */
enum vmpressure_level {
VMPRESS_LEVEL_LOW = 0,
VMPRESS_LEVEL_MEDIUM,
VMPRESS_LEVEL_CRITICAL,
VMPRESS_LEVEL_COUNT
};
static struct psi_threshold psi_thresholds[VMPRESS_LEVEL_COUNT] = {
{ PSI_SOME, 70 }, /* 70ms out of 1sec for partial stall */
{ PSI_SOME, 100 }, /* 100ms out of 1sec for partial stall */
{ PSI_FULL, 70 }, /* 70ms out of 1sec for complete stall */
};
init_mp_psi
- The new strategy is skipped only when the device is not low-RAM and minfree levels are in use.
static bool init_mp_psi(enum vmpressure_level level, bool use_new_strategy) {
int fd;
/* Do not register a handler if threshold_ms is not set */
if (!psi_thresholds[level].threshold_ms) {
return true;
}
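// writes stall_type, threshold_ms and PSI_WINDOW_SIZE_MS to the /proc/pressure/memory node
// via psi.cpp; the window size is 1000 ms, and a PSI monitor generates at most one event
// per window, so lmkd polls the memory state for the duration of the PSI window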
fd = init_psi_monitor(psi_thresholds[level].stall_type,
psi_thresholds[level].threshold_ms * US_PER_MS,
PSI_WINDOW_SIZE_MS * US_PER_MS);
···
vmpressure_hinfo[level].handler = use_new_strategy ? mp_event_psi : mp_event_common; // handler depends on use_new_strategy
vmpressure_hinfo[level].data = level;
if (register_psi_monitor(epollfd, fd, &vmpressure_hinfo[level]) < 0) { // implemented in psi.cpp
destroy_psi_monitor(fd);
return false;
}
···
return true;
}
Calls into psi
PSI exposes three indicators under /proc/pressure: io, memory and cpu. lmkd uses two helpers from psi.cpp:
- init_psi_monitor
- register_psi_monitor
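The article does not list the bodies of these two helpers, so the following is a minimal sketch of the general mechanism, based on the kernel's PSI trigger interface rather than on lmkd's actual source: a trigger line of the form `some|full <threshold_us> <window_us>` is written to /proc/pressure/memory, and the resulting file descriptor is watched with epoll for EPOLLPRI. The function names `format_psi_trigger`, `open_psi_monitor` and `register_psi_fd` are illustrative only.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

// Build the trigger line "<some|full> <threshold_us> <window_us>".
std::string format_psi_trigger(bool full, int threshold_us, int window_us) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%s %d %d", full ? "full" : "some",
                  threshold_us, window_us);
    return buf;
}

// Open /proc/pressure/memory and arm a trigger. Returns the fd, or -1 on failure
// (for example when the kernel lacks CONFIG_PSI or the caller lacks permission).
int open_psi_monitor(bool full, int threshold_us, int window_us) {
    int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK | O_CLOEXEC);
    if (fd < 0) return -1;
    std::string trigger = format_psi_trigger(full, threshold_us, window_us);
    // The terminating NUL is written as well, as in the kernel's PSI example.
    if (write(fd, trigger.c_str(), trigger.size() + 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Register the monitor with an epoll instance; PSI events arrive as EPOLLPRI.
int register_psi_fd(int epollfd, int fd, void* data) {
    struct epoll_event ev;
    std::memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLPRI;
    ev.data.ptr = data;
    return epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev);
}
```

For the medium level in the default table (PSI_SOME, 100 ms, 1 s window) the trigger line would be `some 100000 1000000`, matching the `threshold_ms * US_PER_MS` and `PSI_WINDOW_SIZE_MS * US_PER_MS` arguments in init_mp_psi.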
2.4 LMKD receives socket messages from SystemServer
lmkd's client is ActivityManager, which talks to lmkd over the socket /dev/socket/lmkd. When a client connects, the callback chain is ctrl_connect_handler > ctrl_data_handler > ctrl_command_handler.
Here we go straight to ctrl_command_handler.
ctrl_command_handler
static void ctrl_command_handler(int dsock_idx) {
......
switch(cmd) {
case LMK_TARGET:
// parse the data carried by the socket packet into the lowmem_minfree and lowmem_adj arrays,
// which control the low-memory behavior;
// also sets sys.lmk.minfree_levels, for example:
// [sys.lmk.minfree_levels]: [18432:0,23040:100,27648:200,85000:250,191250:900,241920:950]
cmd_target(targets, packet);
break;
case LMK_PROCPRIO:
// set the process's oomadj by writing it to the node /proc/pid/oom_score_adj,
// and store the oomadj in a hash table:
// pidhash is keyed by pid, while proc_slot inserts the struct proc into procadjslot_list, which is keyed by oomadj
cmd_procprio(packet);
break;
case LMK_PROCREMOVE:
// parse the pid sent over the socket and
// use pid_remove to drop that pid's struct proc from pidhash and procadjslot_list
cmd_procremove(packet);
break;
case LMK_PROCPURGE:
cmd_procpurge();
break;
case LMK_GETKILLCNT:
kill_cnt = cmd_getkillcnt(packet);
break;
........
}
Command | Purpose | Method |
---|---|---|
LMK_TARGET | Initialize the minfree / oom_adj levels | ProcessList::updateOomLevels() |
LMK_PROCPRIO | Update a process's oom_adj | ProcessList::setOomAdj() |
LMK_PROCREMOVE | Remove a process (currently unused) | ProcessList::remove() |
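To give a feel for what travels over the socket, here is a hedged sketch of how a client could pack an LMK_PROCPRIO request. It assumes, and this is an assumption not shown in the article, that a packet is a short array of 32-bit integers in network byte order with the command id first, followed by pid, uid and oomadj, and that the command ids start at 0 in the order of the switch statement above. The helper names are hypothetical.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Command ids in the order the switch statement lists them (assumed to start at 0).
enum LmkCmd : int32_t {
    kLmkTarget = 0,
    kLmkProcPrio = 1,
    kLmkProcRemove = 2,
    kLmkProcPurge = 3,
    kLmkGetKillCnt = 4,
};

// Hypothetical helper: pack "set oom_score_adj for pid" as big-endian 32-bit words.
std::vector<uint32_t> pack_procprio(int32_t pid, int32_t uid, int32_t oomadj) {
    std::vector<uint32_t> packet;
    packet.push_back(htonl(static_cast<uint32_t>(kLmkProcPrio)));
    packet.push_back(htonl(static_cast<uint32_t>(pid)));
    packet.push_back(htonl(static_cast<uint32_t>(uid)));
    packet.push_back(htonl(static_cast<uint32_t>(oomadj)));
    return packet;
}

// Read one field back in host byte order, as the receiving side would.
int32_t unpack_field(const std::vector<uint32_t>& packet, size_t idx) {
    return static_cast<int32_t>(ntohl(packet[idx]));
}
```

On the lmkd side, cmd_procprio would read these fields, write the adj to /proc/pid/oom_score_adj and update pidhash and procadjslot_list, as the comments in the handler describe.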
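When memory pressure gets too high, the kernel reports it through /proc/pressure/memory. With the default triggers of some 70, some 100 and full 70, an event is reported once memory stalls reach 70 ms or 100 ms within a one-second window. After the event arrives, use_new_strategy determines which handler runs.
3. Different kill strategies: mp_event_psi and mp_event_common
3.1 mp_event_psi flow
mp_event_psi judges memory state from zone watermarks. It is the handler whenever the device is low-RAM or the legacy minfree mode is not used.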
static void mp_event_psi(int data, uint32_t events, struct polling_params *poll_params) {
bool kill_pending = is_kill_pending(); // checks whether the node for last_kill_pid_or_fd still exists; true if it does
if (kill_pending && (kill_timeout_ms == 0 ||
get_time_diff_ms(&last_kill_tm, &curr_tm) < static_cast<long>(kill_timeout_ms))) {
/* Skip while still killing a process */
wi.skipped_wakeups++;
goto no_kill;
}
/*
* Process is dead or kill timeout is over, stop waiting. This has no effect if pidfds are
* supported and death notification already caused waiting to stop.
*/
stop_wait_for_proc_kill(!kill_pending);
if (vmstat_parse(&vs) < 0) { // parse /proc/vmstat
ALOGE("Failed to parse vmstat!");
return;
}
/* Starting 5.9 kernel workingset_refault vmstat field was renamed workingset_refault_file */
workingset_refault_file = vs.field.workingset_refault ? : vs.field.workingset_refault_file;
if (meminfo_parse(&mi) < 0) { // parse /proc/meminfo, match each field and obtain the free-page information
ALOGE("Failed to parse meminfo!");
return;
}
/* Reset states after process got killed */
if (killing) {
killing = false;
cycle_after_kill = true;
/* Reset file-backed pagecache size and refault amounts after a kill */
base_file_lru = vs.field.nr_inactive_file + vs.field.nr_active_file;
init_ws_refault = workingset_refault_file;
thrashing_reset_tm = curr_tm;
prev_thrash_growth = 0;
}
/* Check free swap levels */
if (swap_free_low_percentage) { // ro.lmk.swap_free_low_percentage, default 10
if (!swap_low_threshold) {
swap_low_threshold = mi.field.total_swap * swap_free_low_percentage / 100;
}
// swap_is_low becomes true once free swap drops below the percentage set by ro.lmk.swap_free_low_percentage
swap_is_low = mi.field.free_swap < swap_low_threshold; // from meminfo
}
/* Identify reclaim state */
// decided by how pgscan_direct / pgscan_kswapd changed since the previous check
if (vs.field.pgscan_direct > init_pgscan_direct) { // direct reclaim (DIRECT_RECLAIM)
init_pgscan_direct = vs.field.pgscan_direct;
init_pgscan_kswapd = vs.field.pgscan_kswapd;
reclaim = DIRECT_RECLAIM;
} else if (vs.field.pgscan_kswapd > init_pgscan_kswapd) { // background reclaim by kswapd (KSWAPD_RECLAIM)
init_pgscan_kswapd = vs.field.pgscan_kswapd;
reclaim = KSWAPD_RECLAIM;
} else if (workingset_refault_file == prev_workingset_refault) {
// neither of the two (NO_RECLAIM) and no new refaults: memory pressure is low, so do not kill
/*
* Device is not thrashing and not reclaiming, bail out early until we see these stats
* changing
*/
goto no_kill;
}
prev_workingset_refault = workingset_refault_file;
/*
* It's possible we fail to find an eligible process to kill (ex. no process is
* above oom_adj_min). When this happens, we should retry to find a new process
* for a kill whenever a new eligible process is available. This is especially
* important for a slow growing refault case. While retrying, we should keep
* monitoring new thrashing counter as someone could release the memory to mitigate
* the thrashing. Thus, when thrashing reset window comes, we decay the prev thrashing
* counter by window counts. If the counter is still greater than thrashing limit,
* we preserve the current prev_thrash counter so we will retry kill again. Otherwise,
* we reset the prev_thrash counter so we will stop retrying.
*/
// update thrashing: a high value means memory is under pressure, a low value means memory is idle
since_thrashing_reset_ms = get_time_diff_ms(&thrashing_reset_tm, &curr_tm);
if (since_thrashing_reset_ms > THRASHING_RESET_INTERVAL_MS) {
long windows_passed;
/* Calculate prev_thrash_growth if we crossed THRASHING_RESET_INTERVAL_MS */
prev_thrash_growth = (workingset_refault_file - init_ws_refault) * 100
/ (base_file_lru + 1);
windows_passed = (since_thrashing_reset_ms / THRASHING_RESET_INTERVAL_MS);
/*
* Decay prev_thrashing unless over-the-limit thrashing was registered in the window we
* just crossed, which means there were no eligible processes to kill. We preserve the
* counter in that case to ensure a kill if a new eligible process appears.
*/
if (windows_passed > 1 || prev_thrash_growth < thrashing_limit) {
prev_thrash_growth >>= windows_passed;
}
/* Record file-backed pagecache size when crossing THRASHING_RESET_INTERVAL_MS */
base_file_lru = vs.field.nr_inactive_file + vs.field.nr_active_file;
init_ws_refault = workingset_refault_file;
thrashing_reset_tm = curr_tm;
thrashing_limit = thrashing_limit_pct;
} else {
/* Calculate what % of the file-backed pagecache refaulted so far */
thrashing = (workingset_refault_file - init_ws_refault) * 100 / (base_file_lru + 1);
}
/* Add previous cycle's decayed thrashing amount */
thrashing += prev_thrash_growth;
if (max_thrashing < thrashing) {
max_thrashing = thrashing;
}
// refresh the watermarks
/*
* Refresh watermarks once per min in case user updated one of the margins.
* TODO: b/140521024 replace this periodic update with an API for AMS to notify LMKD
* that zone watermarks were changed by the system software.
*/
if (watermarks.high_wmark == 0 || get_time_diff_ms(&wmark_update_tm, &curr_tm) > 60000) {
struct zoneinfo zi;
// parse /proc/zoneinfo and match the relevant fields to obtain
// the reserved pages: zi->field.totalreserve_pages += zi->field.high;
// then compute the min/low/high watermarks
if (zoneinfo_parse(&zi) < 0) {
ALOGE("Failed to parse zoneinfo!");
return;
}
calc_zone_watermarks(&zi, &watermarks);
wmark_update_tm = curr_tm;
}
/* Find out which watermark is breached if any */
wmark = get_lowest_watermark(&mi, &watermarks); // compares zmi->nr_free_pages - zmi->cma_free against the watermarks
/*
* TODO: move this logic into a separate function
* Decide if killing a process is necessary and record the reason
*/
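// decide across several scenarios, using the watermark, thrashing value, pressure level, swap_is_low and reclaim mode, and record a kill reason for each
if (cycle_after_kill && wmark < WMARK_LOW) {
/*
* Guard against a kill that does not free enough memory, which might lead to the OOM killer stepping in.
* This can still happen after a kill when a process consumes memory faster than reclaim can free it,
* typically while running memory stress tests.
*/
kill_reason = PRESSURE_AFTER_KILL;
strncpy(kill_desc, "min watermark is breached even after kill", sizeof(kill_desc));
} else if (level == VMPRESS_LEVEL_CRITICAL && events != 0) {
/*
* The device is too busy reclaiming memory, which might lead to an ANR.
* The critical level is triggered when the PSI full stall (all tasks blocked on memory) exceeds the configured threshold.
*/
kill_reason = NOT_RESPONDING;
strncpy(kill_desc, "device is not responding", sizeof(kill_desc));
} else if (swap_is_low && thrashing > thrashing_limit_pct) { // ro.lmk.thrashing_limit: 30 or 100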
/* Page cache is thrashing while swap is low */
kill_reason = LOW_SWAP_AND_THRASHING;
snprintf(kill_desc, sizeof(kill_desc), "device is low on swap (%" PRId64
"kB < %" PRId64 "kB) and thrashing (%" PRId64 "%%)",
mi.field.free_swap * page_k, swap_low_threshold * page_k, thrashing);
/* Do not kill perceptible apps unless below min watermark or heavily thrashing */
if (wmark > WMARK_MIN && thrashing < thrashing_critical_pct) { // WMARK_MIN == 0; thrashing_critical_pct is thrashing_limit_pct * 2
min_score_adj = PERCEPTIBLE_APP_ADJ + 1; // PERCEPTIBLE_APP_ADJ is 200, so 201
}
check_filecache = true;
} else if (swap_is_low && wmark < WMARK_HIGH) { // swap_is_low comes from the percentage check above
/* Both free memory and swap are low */
kill_reason = LOW_MEM_AND_SWAP;
snprintf(kill_desc, sizeof(kill_desc), "%s watermark is breached and swap is low (%"
PRId64 "kB < %" PRId64 "kB)", wmark < WMARK_LOW ? "min" : "low",
mi.field.free_swap * page_k, swap_low_threshold * page_k);
/* Do not kill perceptible apps unless below min watermark or heavily thrashing */
if (wmark > WMARK_MIN && thrashing < thrashing_critical_pct) {
min_score_adj = PERCEPTIBLE_APP_ADJ + 1; // PERCEPTIBLE_APP_ADJ is 200, so 201
}
} else if (wmark < WMARK_HIGH && swap_util_max < 100 &&
(swap_util = calc_swap_utilization(&mi)) > swap_util_max) {
/*
* Too much anon memory is swapped out but swap is not low.
* Non-swappable allocations created memory pressure.
*/
kill_reason = LOW_MEM_AND_SWAP_UTIL;
snprintf(kill_desc, sizeof(kill_desc), "%s watermark is breached and swap utilization"
" is high (%d%% > %d%%)", wmark < WMARK_LOW ? "min" : "low",
swap_util, swap_util_max);
} else if (wmark < WMARK_HIGH && thrashing > thrashing_limit) {
/* Page cache is thrashing while memory is low */
kill_reason = LOW_MEM_AND_THRASHING;
snprintf(kill_desc, sizeof(kill_desc), "%s watermark is breached and thrashing (%"
PRId64 "%%)", wmark < WMARK_LOW ? "min" : "low", thrashing);
cut_thrashing_limit = true;
/* Do not kill perceptible apps unless thrashing at critical levels */
if (thrashing < thrashing_critical_pct) {
min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
}
check_filecache = true;
} else if (reclaim == DIRECT_RECLAIM && thrashing > thrashing_limit) {
/* Page cache is thrashing while in direct reclaim (mostly happens on lowram devices) */
kill_reason = DIRECT_RECL_AND_THRASHING;
snprintf(kill_desc, sizeof(kill_desc), "device is in direct reclaim and thrashing (%"
PRId64 "%%)", thrashing);
cut_thrashing_limit = true;
/* Do not kill perceptible apps unless thrashing at critical levels */
if (thrashing < thrashing_critical_pct) {
min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
}
check_filecache = true;
} else if (check_filecache) {
int64_t file_lru_kb = (vs.field.nr_inactive_file + vs.field.nr_active_file) * page_k;
if (file_lru_kb < filecache_min_kb) {
/* File cache is too low after thrashing, keep killing background processes */
kill_reason = LOW_FILECACHE_AFTER_THRASHING;
snprintf(kill_desc, sizeof(kill_desc),
"filecache is low (%" PRId64 "kB < %" PRId64 "kB) after thrashing",
file_lru_kb, filecache_min_kb);
min_score_adj = PERCEPTIBLE_APP_ADJ + 1;
} else {
/* File cache is big enough, stop checking */
check_filecache = false;
}
}
/* Kill a process if necessary */
if (kill_reason != NONE) {
struct kill_info ki = {
.kill_reason = kill_reason,
.kill_desc = kill_desc,
.thrashing = (int)thrashing,
.max_thrashing = max_thrashing,
}; // handed to the function that actually performs the kill
int pages_freed = find_and_kill_process(min_score_adj, &ki, &mi, &wi, &curr_tm);
if (pages_freed > 0) {
killing = true;
max_thrashing = 0;
if (cut_thrashing_limit) {
/*
* Cut thrasing limit by thrashing_limit_decay_pct percentage of the current
* thrashing limit until the system stops thrashing.
*/
thrashing_limit = (thrashing_limit * (100 - thrashing_limit_decay_pct)) / 100;
}
}
}
no_kill:
/* Do not poll if kernel supports pidfd waiting */
if (is_waiting_for_kill()) {
/* Pause polling if we are waiting for process death notification */
poll_params->update = POLLING_PAUSE;
return;
}
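/*
* Start polling after the initial PSI event;
* extend polling while the device is in direct reclaim or a process is being killed;
* do not extend when kswapd reclaims, because that might go on for a long time without causing memory pressure.
*/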
if (events || killing || reclaim == DIRECT_RECLAIM) {
poll_params->update = POLLING_START;
}
/* Decide the polling interval */
if (swap_is_low || killing) {
/* Fast polling during and after a kill or when swap is low */
poll_params->polling_interval_ms = PSI_POLL_PERIOD_SHORT_MS; //10ms
} else {
/* By default use long intervals */
poll_params->polling_interval_ms = PSI_POLL_PERIOD_LONG_MS; //100ms
}
}
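The main logic of this code is:
- Check whether a kill is still in progress; if so, skip this round.
- Parse /proc/vmstat and /proc/meminfo to obtain the memory state.
- Decide whether a kill is needed from the watermarks, the thrashing value, swap usage and so on:
  - if the watermark is still breached right after a kill, kill again;
  - if the device is not responding (PSI full stall at the critical level), kill to try to restore responsiveness;
  - if free swap is low and thrashing is high, kill;
  - if free memory is low and free swap is low, kill;
  - if free memory is low and thrashing is high, kill;
  - and so on.
- If a kill is needed, call find_and_kill_process to pick a process and kill it.
- Choose the PSI polling interval from the memory state: poll faster while swap is low or a kill is in flight.
- If lmkd is waiting for a killed process to exit, pause polling.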
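The thrashing value used in these decisions is plain integer arithmetic, so a small worked sketch may help. It isolates the two formulas from the mp_event_psi excerpt: the percentage of the file-backed page cache that refaulted since the last reset, and the decay applied to the previous window's growth. The function names are illustrative, and the clamp for very large window counts is an addition of this sketch, not something shown in the excerpt.

```cpp
#include <cassert>
#include <cstdint>

// Percentage of the file-backed page cache that refaulted since the window began.
// The "+ 1" avoids a division by zero when the file LRU is empty.
int64_t thrashing_pct(int64_t refault_now, int64_t refault_at_reset,
                      int64_t base_file_lru) {
    return (refault_now - refault_at_reset) * 100 / (base_file_lru + 1);
}

// Decay the previous window's growth: halve it once per elapsed window, unless
// exactly one window passed and the growth was already at or above the limit
// (that case is preserved so a later eligible process still gets killed).
int64_t decay_prev_growth(int64_t prev_growth, long windows_passed, int64_t limit) {
    if (windows_passed > 1 || prev_growth < limit) {
        if (windows_passed >= 63) return 0;  // clamp added here to keep the shift defined
        prev_growth >>= windows_passed;
    }
    return prev_growth;
}
```

For example, 500 new refaults against a file LRU of 999 pages give a thrashing value of 50, which is then compared with thrashing_limit in the branches above.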
3.2 mp_event_common flow
Used on devices that are not low-RAM and have use_minfree_levels enabled.
static void mp_event_common(int data, uint32_t events, struct polling_params *poll_params) {
if (meminfo_parse(&mi) < 0 || zoneinfo_parse(&zi) < 0) { // read meminfo and zoneinfo
ALOGE("Failed to get free memory!");
return;
}
if (use_minfree_levels) { // always true for the configuration discussed here
int i;
// other_free is the number of pages the system can still hand out: MemFree - high
// nr_free_pages is MemFree from /proc/meminfo: memory that is currently completely unused
// totalreserve_pages is max_protection + high from /proc/zoneinfo; max_protection is 0 on Android
other_free = mi.field.nr_free_pages - zi.totalreserve_pages;
// nr_file_pages = cached + swap_cached + buffers; the pages that cannot be dropped are subtracted, and other_file is what remains
if (mi.field.nr_file_pages > (mi.field.shmem + mi.field.unevictable + mi.field.swap_cached)) {
// other_file is roughly the number of pages of files cached in memory, excluding tmpfs and unevictable pages
other_file = (mi.field.nr_file_pages - mi.field.shmem -
mi.field.unevictable - mi.field.swap_cached);
} else {
other_file = 0;
}
// other_free and other_file are now known
min_score_adj = OOM_SCORE_ADJ_MAX + 1; // 1001
// walk the oomadj and minfree arrays and pick min_score_adj from lowmem_minfree; processes whose oomadj is below min_score_adj are spared in this round
for (i = 0; i < lowmem_targets_size; i++) {
minfree = lowmem_minfree[i];
if (other_free < minfree && other_file < minfree) {
min_score_adj = lowmem_adj[i];
break;
}
}
if (min_score_adj == OOM_SCORE_ADJ_MAX + 1) { // adj unchanged, no threshold crossed: nothing to do
if (debug_process_killing) {
ALOGI("Ignore %s memory pressure event "
"(free memory=%ldkB, cache=%ldkB, limit=%ldkB)",
level_name[level], other_free * page_k, other_file * page_k,
(long)lowmem_minfree[lowmem_targets_size - 1] * page_k);
}
return;
}
goto do_kill;
}
// without use_minfree_levels, a low-pressure event calls record_low_pressure_levels to record the state at the low level
if (level == VMPRESS_LEVEL_LOW) {
record_low_pressure_levels(&mi); // sets low_pressure_mem.min_nr_free_pages and low_pressure_mem.max_nr_free_pages
}
if (level_oomadj[level] > OOM_SCORE_ADJ_MAX) { // above 1000: this level is not considered
/* Do not monitor this pressure level */
return;
}
// current memory usage, excluding swap
if ((mem_usage = get_memory_usage(&mem_usage_file_data)) < 0) { // "/dev/memcg/memory.usage_in_bytes"
goto do_kill;
}
// current memory usage, including swap
if ((memsw_usage = get_memory_usage(&memsw_usage_file_data)) < 0) { // "/dev/memcg/memory.memsw.usage_in_bytes"
goto do_kill;
}
// Calculate percent for swappinness.
// similar to swappiness: the larger the value, the less swap is in use and the more swap space remains
mem_pressure = (mem_usage * 100) / memsw_usage;
if (enable_pressure_upgrade && level != VMPRESS_LEVEL_CRITICAL) { // ro.lmk.critical_upgrade
// We are swapping too much.
// a small value means swap is used heavily while memory pressure is still high,
// so raise the level and kill more aggressively
if (mem_pressure < upgrade_pressure) { // ro.lmk.upgrade_pressure: 100 by default in the code, 35 on my device
level = upgrade_level(level); // upgrade the vmpressure level
if (debug_process_killing) {
ALOGI("Event upgraded to %s", level_name[level]);
}
}
}
// If we still have enough swap space available, check if we want to
// ignore/downgrade pressure events.
// swap_free_low_percentage is the low-swap threshold; free swap is still above it here, so there is room to maneuver
if (mi.field.free_swap >=
mi.field.total_swap * swap_free_low_percentage / 100) { // ro.lmk.swap_free_low_percentage: 10 or 15
// If the pressure is larger than downgrade_pressure lmk will not
// kill any process, since enough memory is available.
if (mem_pressure > downgrade_pressure) { // pressure was signaled, but swap is still sufficient: do not kill
if (debug_process_killing) {
ALOGI("Ignore %s memory pressure", level_name[level]);
}
return;
} else if (level == VMPRESS_LEVEL_CRITICAL && mem_pressure > upgrade_pressure) {
if (debug_process_killing) {
ALOGI("Downgrade critical memory pressure");
} // with enough swap left, the critical level is kept only when mem_pressure is at or below upgrade_pressure
// Downgrade event, since enough memory available.
level = downgrade_level(level);
}
}
do_kill:
if (low_ram_device) { // low-RAM device
/* For Go devices kill only one task */
if (find_and_kill_process(level_oomadj[level], NULL, &mi, &wi, &curr_tm) == 0) {
if (debug_process_killing) {
ALOGI("Nothing to kill");
}
}
} else {
int pages_freed;
static struct timespec last_report_tm;
static unsigned long report_skip_count = 0;
if (!use_minfree_levels) { // rarely reached on newer releases: only with the vmpressure strategy and without use_minfree_levels
/* Free up enough memory to downgrate the memory pressure to low level */
if (mi.field.nr_free_pages >= low_pressure_mem.max_nr_free_pages) {
if (debug_process_killing) {
ALOGI("Ignoring pressure since more memory is "
"available (%" PRId64 ") than watermark (%" PRId64 ")",
mi.field.nr_free_pages, low_pressure_mem.max_nr_free_pages);
}
return;
}
min_score_adj = level_oomadj[level];
}
// this is where a process finally gets killed
pages_freed = find_and_kill_process(min_score_adj, NULL, &mi, &wi, &curr_tm);
···
/* Log whenever we kill or when report rate limit allows */
if (use_minfree_levels) {
ALOGI("Reclaimed %ldkB, cache(%ldkB) and free(%" PRId64 "kB)-reserved(%" PRId64 "kB) "
"below min(%ldkB) for oom_score_adj %d",
pages_freed * page_k,
other_file * page_k, mi.field.nr_free_pages * page_k,
zi.totalreserve_pages * page_k,
minfree * page_k, min_score_adj);
} else {
ALOGI("Reclaimed %ldkB at oom_score_adj %d", pages_freed * page_k, min_score_adj);
}
}
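Low-memory devices and high-performance devices follow different kill strategies:
- On low-memory devices, the system is generally expected to tolerate higher memory pressure.
- On high-performance devices, memory pressure is treated as an abnormal situation that should be fixed promptly so that it does not affect overall performance.
The flow of mp_event_common is:
- Parse /proc/meminfo and /proc/zoneinfo to obtain the memory state.
- If use_minfree_levels is set, derive min_score_adj from the lowmem_minfree array: compare other_free and other_file with each minfree entry in turn, and take the oomadj of the first entry that both fall below.
- If use_minfree_levels is not set, derive min_score_adj from the vmpressure level: record the memory state at the low level, look up the oomadj for the level, and, if swap is plentiful, check whether the level should be downgraded.
- Use the resulting min_score_adj to find a process and kill it.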
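The upgrade and downgrade rules on the non-minfree path are easy to misread, so here is a compact sketch of just that arithmetic. It follows the excerpt above: mem_pressure is the share of memory+swap usage that is plain memory, a low value upgrades the level, and plentiful swap can either drop the event or pull a critical event back to medium. The types and the function name are hypothetical, and the guard against a zero memsw_usage is an addition of this sketch.

```cpp
#include <cassert>
#include <cstdint>

enum Level { kLow = 0, kMedium = 1, kCritical = 2 };

struct Decision {
    bool ignore;  // true: drop the event without killing
    Level level;  // level to act on otherwise
};

Decision adjust_level(Level level, int64_t mem_usage, int64_t memsw_usage,
                      bool enable_upgrade, int64_t upgrade_pressure,
                      int64_t downgrade_pressure, bool swap_plentiful) {
    if (memsw_usage <= 0) return {false, level};  // guard added for the sketch
    int64_t mem_pressure = mem_usage * 100 / memsw_usage;
    // Swapping too much: move one level up (low -> medium -> critical).
    if (enable_upgrade && level != kCritical && mem_pressure < upgrade_pressure) {
        level = static_cast<Level>(level + 1);
    }
    if (swap_plentiful) {
        // Enough swap and little of it in use: ignore the event.
        if (mem_pressure > downgrade_pressure) return {true, level};
        // Enough swap: a critical event is pulled back to medium.
        if (level == kCritical && mem_pressure > upgrade_pressure) {
            level = kMedium;
        }
    }
    return {false, level};
}
```

With upgrade_pressure at 35 (the value the author reports for their device) and an assumed downgrade_pressure of 60, a medium event at mem_pressure 30 is upgraded to critical, while a medium event at mem_pressure 80 with plentiful swap is ignored.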
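3.3 Differences between mp_event_psi and mp_event_common
- mp_event_psi judges memory state mainly from the zoneinfo watermarks; mp_event_common mainly checks the amount of free memory in meminfo.
- mp_event_psi computes thrashing and swap usage; mp_event_common mainly works from the vmpressure level.
- mp_event_psi keeps polling periodically after an event; the mp_event_common excerpt above only reacts when an event arrives.
- mp_event_psi distinguishes more memory-pressure scenarios; mp_event_common is simpler and more direct.
- Both end by calling find_and_kill_process; mp_event_psi additionally passes a kill_info (reason, description, thrashing), whereas mp_event_common passes NULL.
- mp_event_psi adjusts the polling interval dynamically; the mp_event_common excerpt has no such logic.
- mp_event_psi records more debugging statistics.
3.4 find_and_kill_process: killing a process
For adj <= 200 (PERCEPTIBLE_APP_ADJ), the heaviest process is always the one chosen.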
static int find_and_kill_process(int min_score_adj, struct kill_info *ki, union meminfo *mi,
struct wakeup_info *wi, struct timespec *tm) {
for (i = OOM_SCORE_ADJ_MAX; i >= min_score_adj; i--) { // walk the adj values from high to low
struct proc *procp;
if (!choose_heaviest_task && i <= PERCEPTIBLE_APP_ADJ) { // ro.lmk.kill_heaviest_task is false by default
choose_heaviest_task = true; // in other words: at adj <= 200, kill the heaviest process
}
while (true) {
procp = choose_heaviest_task ? // heaviest process at adj <= 200, otherwise pick by LRU
proc_get_heaviest(i) : proc_adj_lru(i);
if (!procp) break; // no process left at this adj
killed_size = kill_one_process(procp, min_score_adj, ki, mi, wi, tm);
if (killed_size >= 0) {
if (!lmk_state_change_start) {
lmk_state_change_start = true;
stats_write_lmk_state_changed(STATE_START);
}
break;
}
}
if (killed_size) break; // a process was killed: stop scanning
}
}
if (lmk_state_change_start) {
stats_write_lmk_state_changed(STATE_STOP);
}
return killed_size;
}
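The selection order can be tried out on a toy data structure. The sketch below assumes, as the comments above suggest, that each oom_score_adj value has its own list of processes, that the least recently used entry sits at the back of the list, and that "heaviest" means largest RSS. `Proc`, `AdjBuckets` and `pick_victim` are hypothetical names; the real lmkd uses its own intrusive lists and reads RSS from /proc.

```cpp
#include <cassert>
#include <list>
#include <map>

struct Proc {
    int pid;
    long rss_pages;
};

// One bucket per oom_score_adj; newest entries at the front, LRU at the back.
using AdjBuckets = std::map<int, std::list<Proc>>;

// Return the pid that would be killed first, or -1 if nothing is eligible.
int pick_victim(const AdjBuckets& buckets, int min_score_adj, bool kill_heaviest) {
    const int kOomScoreAdjMax = 1000;
    const int kPerceptibleAppAdj = 200;
    for (int adj = kOomScoreAdjMax; adj >= min_score_adj; --adj) {
        auto it = buckets.find(adj);
        if (it == buckets.end() || it->second.empty()) continue;
        // At or below PERCEPTIBLE_APP_ADJ the heaviest process is always chosen.
        bool heaviest = kill_heaviest || adj <= kPerceptibleAppAdj;
        if (!heaviest) return it->second.back().pid;
        const Proc* best = &it->second.front();
        for (const Proc& p : it->second) {
            if (p.rss_pages > best->rss_pages) best = &p;
        }
        return best->pid;
    }
    return -1;
}
```

Scanning from adj 1000 downward means cached and background processes are always considered before perceptible or foreground ones.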
kill_one_process
The code raises the priority of the process being killed so that it dies as quickly as possible.
/* Kill one process specified by procp. Returns the size (in pages) of the process killed */
static int kill_one_process(struct proc* procp, int min_oom_score, struct kill_info *ki,
union meminfo *mi, struct wakeup_info *wi, struct timespec *tm) {
/* CAP_KILL required */
if (pidfd < 0) { // no pidfd is available for this process: fall back to kill()
start_wait_for_proc_kill(pid);
r = kill(pid, SIGKILL);
} else {
start_wait_for_proc_kill(pidfd); // start waiting for the process to die; the pidfd lets lmkd be notified of the death instead of polling for it
r = pidfd_send_signal(pidfd, SIGKILL, NULL, 0);
}
set_process_group_and_prio(pid, SP_FOREGROUND, ANDROID_PRIORITY_HIGHEST); // raise to the highest priority so the process dies as fast as possible
last_kill_tm = *tm;
inc_killcnt(procp->oomadj);
out:
/*
* WARNING: After pid_remove() procp is freed and can't be used!
* Therefore placed at the end of the function.
*/
pid_remove(pid);
return result;
}
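4. Memory metrics
4.1 zoneinfo
Field | Meaning |
---|---|
nr_free_pages | Number of free pages in the zone |
nr_file_pages | Number of file pages in the zone |
nr_shmem | Memory used by shmem/tmpfs in the zone |
nr_unevictable | Number of unevictable pages in the zone |
high | High watermark of the zone |
protection | Reserved memory of the zone |
In lmkd, zoneinfo_field_names lists the fields to parse from zoneinfo, and union zoneinfo holds the parsed values.
Parsing uses a small trick: because zoneinfo is a union, the parser can walk the array view of zoneinfo in step with zoneinfo_field_names for fast parsing, and the values can later be read quickly through the named fields.
zoneinfo also carries one extra computed value, totalreserve_pages. It is derived from the high watermark and the protection pages (which keep a zone from lending out too many pages): the high watermark plus the largest protection value.
The zoneinfo values computed by lmkd are totals; they are not kept per zone.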
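As a worked example of the totalreserve_pages rule just described (high watermark plus the largest protection value, summed over all zones), consider the sketch below. The `Zone` struct and the function name are made up for illustration; the real parser fills union zoneinfo directly from /proc/zoneinfo.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-zone view of the two inputs, both in pages.
struct Zone {
    int64_t high;                     // "high" watermark
    std::vector<int64_t> protection;  // values from the "protection: (...)" line
};

// Sum, over all zones, of the high watermark plus the largest protection value.
int64_t total_reserve_pages(const std::vector<Zone>& zones) {
    int64_t total = 0;
    for (const Zone& z : zones) {
        int64_t max_protection = 0;
        for (int64_t p : z.protection) {
            max_protection = std::max(max_protection, p);
        }
        total += max_protection + z.high;
    }
    return total;
}
```

Two zones with high watermarks of 1000 and 2500 pages, the second with a maximum protection of 300 pages, give a reserve of 3800 pages; this is the value subtracted from nr_free_pages to obtain other_free in mp_event_common.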
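4.2 meminfo
Field | Meaning |
---|---|
MemFree | Memory the system has not used yet |
Cached | File page cache, including tmpfs files that have not been swapped out |
SwapCached | Anonymous or shmem/tmpfs pages that were swapped out and have not changed since being swapped back in; once modified they are removed from SwapCached |
Buffers | Cache pages used by I/O devices, also counted in the file LRU |
Mapped | File pages currently mapped by user processes |
/proc/meminfo is produced by the meminfo_proc_show function in [kernel/msm-5.4/fs/proc/meminfo.c]; it mainly calls show_val_kb() to join each label with its value into a string and then prints those strings.
shmem is a special case: it is backed by a file system, so it does not count as anonymous memory, yet it cannot be paged out to a file either. It is therefore counted in Cached (page cache) and in Mapped (when the shmem is attached), but it sits on the anon LRU because it may be swapped out.
lmkd's meminfo also carries one extra computed field, nr_file_pages, which is cached + swap_cached + buffers. It can be thought of as the file pages that can be dropped.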
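The derived nr_file_pages value can be reproduced from the text of /proc/meminfo with a few lines of parsing. The sketch below takes the file contents as a string so that it can be tried on any sample, and works in kB as printed by the kernel, whereas lmkd itself keeps these values in pages. `parse_meminfo` and `file_pages_kb` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Parse "Key:   value kB" lines into a key -> kB map.
std::map<std::string, int64_t> parse_meminfo(const std::string& text) {
    std::map<std::string, int64_t> fields;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream value(line.substr(colon + 1));
        int64_t kb = 0;
        if (value >> kb) fields[line.substr(0, colon)] = kb;
    }
    return fields;
}

// cached + swap_cached + buffers, as described for nr_file_pages.
int64_t file_pages_kb(const std::map<std::string, int64_t>& fields) {
    auto get = [&fields](const char* key) {
        auto it = fields.find(key);
        return it == fields.end() ? int64_t{0} : it->second;
    };
    return get("Cached") + get("SwapCached") + get("Buffers");
}
```

Subtracting Shmem, Unevictable and SwapCached from this sum gives the other_file value that mp_event_common compares against the minfree thresholds.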
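4.3 memcg
Field | Meaning |
---|---|
memory.usage_in_bytes | Memory usage of the memcg, excluding swap |
memory.memsw.usage_in_bytes | Memory usage of the memcg, including swap |
Process information
Process RSS: "/proc/pid/statm"
The values are, in order: virtual address space size, RSS, shared pages, text (code) size, library size, data size and dirty pages, all in pages.
Process state: "/proc/pid/status"
Process statistics: "/proc/pid/stat"
lmkd mainly cares about field 10 (pgfault), field 12 (pgmajfault), field 22 (process start time) and the RSS size (in pages).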
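Pulling those fields out of a /proc/pid/stat line has one well-known pitfall: the second field (the command name in parentheses) may itself contain spaces, so the safe approach is to split after the last closing parenthesis. The sketch below does that on a string input. It reads RSS from field 24 of the stat line, which is an assumption of this sketch (the text above only says lmkd reads RSS in pages); the struct and function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

struct StatFields {
    int64_t minflt;     // field 10, called pgfault in the text
    int64_t majflt;     // field 12, called pgmajfault in the text
    int64_t starttime;  // field 22, in clock ticks since boot
    int64_t rss_pages;  // field 24
};

// Parse one /proc/<pid>/stat line. Returns false if the line is too short.
bool parse_proc_stat(const std::string& line, StatFields* out) {
    std::string::size_type rparen = line.rfind(')');
    if (rparen == std::string::npos) return false;
    std::istringstream rest(line.substr(rparen + 1));
    std::vector<std::string> tok;  // tok[0] is field 3 (state)
    std::string t;
    while (rest >> t) tok.push_back(t);
    if (tok.size() < 22) return false;  // need everything up to field 24
    // Field N (1-based) lives at tok[N - 3].
    out->minflt = std::stoll(tok[10 - 3]);
    out->majflt = std::stoll(tok[12 - 3]);
    out->starttime = std::stoll(tok[22 - 3]);
    out->rss_pages = std::stoll(tok[24 - 3]);
    return true;
}
```

The sample line used to check the sketch has a command name with a space in it ("my app"), which a naive whitespace split would misparse.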