Fuchsia  |  Zircon Scheduling

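  1. This article was written in July 2021. The official documentation may continue to change, so I cannot guarantee that this article will remain useful in the future.
  2. The English original uses very long attributive clauses. Within the limits of my ability, I have split them into shorter sentences and tried to make the translation read smoothly.
  3. Technical terms and vocabulary are kept in the original wherever possible to avoid ambiguity.
  4. I am not yet familiar with Fuchsia and Zircon, so translation mistakes are hard to avoid. Corrections are welcome.
  5. I am learning as I translate. Passages I do not yet understand are marked TODO, to be filled in later. Words I am not sure how to translate are marked with 【 xxx 】 as a translation reference.
  6. I work a 996 schedule and translate in my spare time after work, so I cannot promise how quickly updates will appear.
  7. No reprinting, please.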

Background

The primary responsibility of any scheduler is to share the limited resource of processor time between all threads that wish to use it. In a general purpose operating system, it tries to do so in a fair way, ensuring that all threads are allowed to make some progress.

Our scheduler is an evolution of LK’s scheduler. As such it started as a minimal scheduler implementation and was extended to meet our needs as the project grew.

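The main responsibility of any scheduler is to share limited processor resources among all threads. In a general-purpose operating system, it tries to do this fairly, so that every thread can make some progress.

Our scheduler evolved from the Little Kernel (LK) scheduler. It began as a minimal scheduler implementation and was gradually extended to meet our needs as the project grew.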

Design

Overview

In essence there is a scheduler running on each logical CPU in the machine. These schedulers run independently and use IPI (Inter-Processor Interrupts) to coordinate. However each CPU is responsible for scheduling the threads that are running on it. See CPU Assignment below for how we decide which CPU a thread is on, and how/when it migrates.

Each CPU has its own set of priority queues, one for each priority level in the system (currently 32). Note that these are FIFO queues, not the data structure known as a priority queue. Each queue holds an ordered list of runnable threads awaiting execution. When it is time for a new thread to run, the scheduler simply looks at the highest-numbered queue that contains a thread, pops the head off of that queue, and runs that thread. See Priority management below for more details about how it decides which thread should be in which queue. If there are no threads in the queues to run, it will instead run the idle thread; see Realtime and idle threads below for more details.
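The queue selection described above can be sketched in a few lines of Python. This is only an illustration, not Zircon's implementation: the class name `RunQueues`, its methods, and the `IDLE_THREAD` placeholder are hypothetical, and only the 32 priority levels and the FIFO behavior come from the text.

```python
from collections import deque

NUM_PRIORITIES = 32   # priority levels 0 (lowest) through 31 (highest)
IDLE_THREAD = "idle"  # hypothetical placeholder for the per-CPU idle thread


class RunQueues:
    """One CPU's set of FIFO queues, one per priority level (illustrative)."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]

    def enqueue(self, thread, priority):
        # New arrivals join the tail of their priority's FIFO queue.
        self.queues[priority].append(thread)

    def pick_next(self):
        # Scan from the highest-numbered queue down and pop the head of the
        # first non-empty one; fall back to the idle thread if all are empty.
        for priority in range(NUM_PRIORITIES - 1, -1, -1):
            if self.queues[priority]:
                return self.queues[priority].popleft()
        return IDLE_THREAD


rq = RunQueues()
rq.enqueue("logger", 8)
rq.enqueue("ui", 24)
rq.enqueue("audio", 24)
order = [rq.pick_next() for _ in range(4)]
print(order)
```

Threads at the same level come out in arrival order, and the idle placeholder is returned only once every queue is empty.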

Each thread is assigned the same timeslice size (THREAD_INITIAL_TIME_SLICE) when it is picked to start running. If it uses its whole timeslice it will be reinserted at the end of the appropriate priority queue. However if it has some of its timeslice remaining from a previous run it will be inserted at the head of the priority queue so it will be able to resume as quickly as possible. When it is picked back up again it will only run for the remainder of its previous timeslice.

When the scheduler selects a new thread from the priority queue it sets the CPU's preemption timer for either a full timeslice, or the remainder of the previous timeslice. When that timer fires the scheduler will stop execution on that thread, add it to the appropriate queue, select another thread and start over again.

If a thread blocks waiting for a shared resource, it is taken out of its priority queue and placed in a wait queue for that resource. When it is unblocked, it is reinserted in the appropriate priority queue of an eligible CPU (see CPU assignment and migration below), and if it has timeslice remaining, it is added to the front of the queue for expedited handling.
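The head-or-tail reinsertion rule and the preemption timer length can be illustrated with a small sketch. The names `Thread`, `requeue`, and `dispatch` are hypothetical, and the timeslice value of 10 is an assumption chosen for the example; the text only names the constant THREAD_INITIAL_TIME_SLICE without giving its value.

```python
from collections import deque
from dataclasses import dataclass

THREAD_INITIAL_TIME_SLICE = 10  # assumed value (ms), for illustration only


@dataclass
class Thread:
    name: str
    remaining: int = 0  # unused part of the current timeslice


def requeue(queue, thread):
    """Put a thread back on its priority queue after it stops running."""
    if thread.remaining > 0:
        # Leftover timeslice: go to the head so the thread resumes quickly.
        queue.appendleft(thread)
    else:
        # Timeslice used up: go to the tail of the queue.
        queue.append(thread)


def dispatch(queue):
    """Pop the head thread and return it with the preemption timer length."""
    thread = queue.popleft()
    if thread.remaining == 0:
        thread.remaining = THREAD_INITIAL_TIME_SLICE  # fresh full timeslice
    return thread, thread.remaining


queue = deque([Thread("a"), Thread("b")])
t, timer = dispatch(queue)    # "a" gets a full timeslice
t.remaining -= 4              # "a" stops after 4 ms (e.g. blocks, then unblocks)
requeue(queue, t)             # goes to the head, ahead of "b"
t2, timer2 = dispatch(queue)  # "a" again, timer set to the remainder
t2.remaining = 0              # the timer fires: timeslice fully used
requeue(queue, t2)            # goes to the tail, behind "b"
names = [th.name for th in queue]
print(timer, timer2, names)
```

The sketch keeps a single queue for brevity; in the scheme described above a blocked thread would sit in the resource's wait queue in between.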

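In effect, every logical CPU in the machine has a scheduler running on it. These schedulers run independently and coordinate using IPI (Inter-Processor Interrupts). Each CPU, however, is responsible for scheduling the threads that run on it. See CPU assignment and migration below for how we decide which CPU a thread runs on, and how and when it migrates.

Each CPU has its own set of priority queues. The system currently has 32 priority levels. Note that these are FIFO queues, not the data structure known as a priority queue. Each queue holds an ordered list of runnable threads waiting to execute. When a new thread needs to run, the scheduler simply finds the highest-numbered queue that contains a thread, pops the thread at its head, and runs it. See Priority management below for how the scheduler decides which thread belongs in which queue. If no queue has a thread to run, the scheduler runs the idle thread; see Realtime and idle threads below for details.

Every thread is given the same timeslice size (THREAD_INITIAL_TIME_SLICE) when it starts running. Once it has used its whole timeslice, it is reinserted at the tail of the appropriate priority queue. If it still has unused timeslice left over from its previous run, it is inserted at the head of the priority queue so that it can resume as soon as possible. When it is picked up again, it runs only for the remainder of that timeslice.

When the scheduler selects a new thread from a priority queue, it sets the CPU's preemption timer to either a full timeslice or the remainder of the previous one. When the timer fires, the scheduler stops the thread, adds it to the appropriate queue, and selects another thread to run.

When a thread blocks waiting for a shared resource, it is removed from its priority queue and placed in that resource's wait queue. When it is unblocked, it is reinserted into the appropriate priority queue of an eligible CPU (see CPU assignment and migration). If it has timeslice remaining, it is added to the head of the queue for expedited handling.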

Priority management

There are three different factors used to determine the effective priority of a thread, the effective priority being what is used to determine which queue it will be in.

The first factor is the base priority, which is simply the thread’s requested priority. There are currently 32 levels with 0 being the lowest and 31 being the highest.

The second factor is the priority boost. This is a value bounded between [-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ] that is used to offset the base priority. It is modified in the following cases:

  • When a thread is unblocked, after waiting on a shared resource or sleeping, it is given a one point boost.
  • When a thread yields (volunteers to give up control), or volunteers to reschedule, its boost is decremented by one but is capped at 0 (won’t go negative).
  • When a thread is preempted and has used up its entire timeslice, its boost is decremented by one but is able to go negative.

The third factor is its inherited priority. If the thread is in control of a shared resource and it is blocking another thread of a higher priority, then it is given a temporary boost up to that thread’s priority to allow it to finish quickly and allow the higher priority thread to resume.

The effective priority of the thread is either the inherited priority, if it has one, or the base priority plus its boost. When this priority changes, due to any of the factors changing, the scheduler will move it to a new priority queue and reschedule the CPU, allowing it to take control if it is now the highest priority task, or to relinquish control if it is no longer the highest.
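The three factors can be combined roughly as in the sketch below. Several details are assumptions made for illustration and are not stated in the text: the value of `MAX_PRIORITY_ADJ`, the clamping of the result to the 0 to 31 range, and the reading that a yield leaves an already negative boost unchanged. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MAX_PRIORITY_ADJ = 4  # assumed bound, for illustration only
LOWEST_PRIORITY, HIGHEST_PRIORITY = 0, 31


@dataclass
class Thread:
    base: int                        # requested priority, 0..31
    boost: int = 0                   # in [-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ]
    inherited: Optional[int] = None  # set while blocking a higher-priority thread

    def on_unblock(self):
        # One-point boost after waiting on a resource or sleeping.
        self.boost = min(self.boost + 1, MAX_PRIORITY_ADJ)

    def on_yield(self):
        # Decrement by one, but a yield never pushes the boost below 0.
        if self.boost > 0:
            self.boost -= 1

    def on_timeslice_expired(self):
        # Decrement by one; this case is allowed to go negative.
        self.boost = max(self.boost - 1, -MAX_PRIORITY_ADJ)

    def effective_priority(self):
        if self.inherited is not None:
            return self.inherited
        # Assumption: keep the result inside the valid priority range.
        return max(LOWEST_PRIORITY, min(HIGHEST_PRIORITY, self.base + self.boost))


ui = Thread(base=16)
ui.on_unblock()
ui.on_unblock()
ui_priority = ui.effective_priority()

bg = Thread(base=16)
bg.on_timeslice_expired()
bg.on_timeslice_expired()
bg_priority = bg.effective_priority()

bg.inherited = ui_priority  # bg holds a resource that ui is waiting for
bg_inherited_priority = bg.effective_priority()
print(ui_priority, bg_priority, bg_inherited_priority)
```

Two threads with the same base priority drift apart: the one that keeps unblocking moves up, the one that keeps exhausting its timeslice moves down, and inheritance temporarily overrides both.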

The intent in this system is to ensure that interactive threads are serviced quickly. These are usually the threads that interact directly with the user and cause user-perceivable latency. 

These threads usually do little work and spend most of their time blocked awaiting another user event. So they get the priority boost from unblocking while background threads that do most of the processing receive the priority penalty for using their entire timeslice.

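Three different factors determine a thread's effective priority. The effective priority decides which queue the thread is placed in.

The first factor is the base priority, which is simply the priority the thread requested. There are currently 32 levels; 0 is the lowest and 31 the highest.

The second factor is the priority boost. This is a value between -MAX_PRIORITY_ADJ and MAX_PRIORITY_ADJ that offsets the base priority. It is modified in the following cases:

  • When a thread is unblocked after waiting on a shared resource or sleeping, it receives a one-point boost.
  • When a thread yields (voluntarily gives up control) or voluntarily reschedules, its boost is decremented by one, with a floor of 0 (it will not go negative).
  • When a thread is preempted after using up its entire timeslice, its boost is decremented by one and may go negative.

The third factor is the inherited priority. If a thread holds a shared resource and is blocking another thread of higher priority, its priority is temporarily raised to match that of the thread it is blocking. This lets it finish quickly so that the higher-priority thread can resume.

A thread's effective priority is either its inherited priority, if it has one, or its base priority plus its boost. When this priority changes because any of the factors changed, the scheduler moves the thread to a new priority queue and reschedules the CPU. The thread gains control if it is now the highest-priority task, and gives up control if it no longer is.

The purpose of this system is to make sure interactive threads are serviced quickly. These threads usually interact directly with the user and are the source of user-perceivable latency. They typically do very little work and spend most of their time blocked, waiting for the next user event. So while background threads that do heavy processing are penalized for using up their whole timeslice 【the third case of the second factor】, the threads that interact with the user gain a priority boost each time they unblock.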

CPU assignment and migration

Threads are able to request which CPUs they wish to run on using a CPU affinity mask, a 32 bit mask where 0b001 is CPU 1, 0b100 is CPU 3, and 0b101 is either CPU 1 or CPU 3. This mask is usually respected, but if the CPUs it requests are all inactive, the thread will be assigned to another CPU. Also notable: if a thread is “pinned” to a CPU, that is, its mask contains only one CPU, and that CPU becomes inactive, the thread will sit unserviced until that CPU becomes active again. See CPU activation below for details.
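As a small illustration of the mask encoding, the sketch below decodes a mask into CPU numbers using the same numbering as the examples in the text (the lowest bit is CPU 1). The helper names `cpus_in_mask` and `is_pinned` are hypothetical and not part of any Zircon API.

```python
def cpus_in_mask(mask):
    """Return the CPU numbers selected by a 32 bit affinity mask.

    Numbering follows the text: bit 0 is CPU 1, bit 2 is CPU 3.
    """
    return [bit + 1 for bit in range(32) if mask & (1 << bit)]


def is_pinned(mask):
    """A thread is pinned when exactly one bit of its mask is set."""
    return mask != 0 and (mask & (mask - 1)) == 0


one = cpus_in_mask(0b001)
three = cpus_in_mask(0b100)
either = cpus_in_mask(0b101)
print(one, three, either, is_pinned(0b100), is_pinned(0b101))
```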

When selecting a CPU for a thread the scheduler will choose, in order:

  1. The CPU doing the selection, if it is idle and in the affinity mask.
  2. The CPU the thread last ran on, if it is idle and in the affinity mask.
  3. Any idle CPU in the affinity mask.
  4. The CPU the thread last ran on, if it is active and in the affinity mask.
  5. The CPU doing the selection, if it is the only one in the affinity mask or all CPUs in the mask are not active.
  6. Any active CPU in the affinity mask.
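The six-step order above can be written as a chain of checks. This is a sketch under stated assumptions rather than the kernel's code: CPUs are plain integers, the mask and the idle and active sets are Python sets, idle CPUs are assumed to be a subset of active ones, and "any" CPU is resolved by taking the lowest id. The function name `select_cpu` is hypothetical.

```python
def select_cpu(current, last, mask, idle, active):
    """Pick a CPU for a thread following the six-step order (illustrative).

    current: CPU doing the selection; last: CPU the thread last ran on.
    mask, idle, active: sets of CPU ids (idle assumed to be within active).
    """
    if current in idle and current in mask:      # step 1
        return current
    if last in idle and last in mask:            # step 2
        return last
    idle_in_mask = sorted(idle & mask)           # step 3
    if idle_in_mask:
        return idle_in_mask[0]
    if last in active and last in mask:          # step 4
        return last
    if mask == {current} or not (mask & active): # step 5
        return current
    return sorted(mask & active)[0]              # step 6


all_cpus = {0, 1, 2}
r1 = select_cpu(0, 1, {0, 1}, {0, 1}, all_cpus)
r2 = select_cpu(0, 1, {1, 2}, {1, 2}, all_cpus)
r3 = select_cpu(0, 1, {1, 2}, {2}, all_cpus)
r4 = select_cpu(0, 1, {1, 2}, set(), all_cpus)
r5 = select_cpu(0, 3, {3}, set(), {0, 1})
r6 = select_cpu(0, 3, {1, 2}, set(), all_cpus)
print(r1, r2, r3, r4, r5, r6)
```

Each call is built to fall through to a different step. The fifth shows how a thread can end up on a CPU outside its affinity mask when every CPU in the mask is inactive.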

If the thread is running on a CPU not in its affinity mask (due to case 5 above) the scheduler will try to rectify this every time the thread is preempted, yields, or voluntarily reschedules. Also if the thread changes its affinity mask the scheduler may migrate it.

Every time a thread comes back from waiting on a shared resource or sleeping and needs to be assigned a priority queue, the scheduler will re-evaluate its CPU choice for the thread, using the above logic, and may move it.

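A thread can request which CPUs it runs on by setting a CPU affinity mask. The mask is 32 bits wide: 0b001 means CPU 1, 0b100 means CPU 3, and 0b101 means either CPU 1 or CPU 3. The mask is usually honored, but if all the CPUs it requests are inactive, the thread is assigned to another CPU. Also note that if a thread is “pinned” to a CPU, meaning its affinity mask contains only one CPU, and that CPU is inactive, the thread goes unserviced until that CPU becomes active again. See CPU activation below.

When choosing a CPU for a thread, the scheduler picks in this order:

  1. The CPU doing the selection, if it is idle and in the affinity mask.
  2. The CPU the thread last ran on, if it is idle and in the affinity mask.
  3. Any idle CPU in the affinity mask.
  4. The CPU the thread last ran on, if it is active and in the affinity mask.
  5. The CPU doing the selection, if it is the only CPU in the affinity mask or if every CPU in the mask is inactive.
  6. Any active CPU in the affinity mask.

If a thread is running on a CPU that is not in its affinity mask (because of case 5 above), the scheduler tries to correct this every time the thread is preempted, yields, or voluntarily reschedules. In addition, when a thread changes its affinity mask, the scheduler may migrate it.

Every time a thread returns from waiting on a shared resource or from sleeping and needs to be assigned to a priority queue, the scheduler re-evaluates the CPU choice for the thread using the logic above, and may move it.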

CPU activation

When a CPU is being deactivated, that is, shut down and removed from the system, the scheduler will transition all running threads onto other CPUs. The only exception is threads that are “pinned”, that is, they only have the deactivating CPU in their affinity mask. These threads are put back into the run queue, where they will sit unserviced until the CPU is reactivated.

When a CPU is reactivated, it will service the waiting pinned threads, and threads that are running on non-affinity CPUs should be migrated back fairly quickly by their CPU's scheduler due to the above rules. There is no active rebalancing of threads to the newly awakened CPU, but as it should be idle more often, it should see some migration due to the logic laid out above in CPU assignment and migration.
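The deactivation behavior can be sketched as follows. This is a simplified model with several assumptions: threads are plain dictionaries, at least one other CPU stays active, a thread whose mask has no active CPU left is moved to any active CPU, and the lowest CPU id stands in for the scheduler's real choice. The function name `deactivate_cpu` is hypothetical.

```python
def deactivate_cpu(cpu, active, threads):
    """Move threads off a CPU that is shutting down (illustrative).

    Returns the new set of active CPUs and the names of pinned threads
    that stay parked until the CPU is reactivated.
    """
    remaining = active - {cpu}
    parked = []
    for thread in threads:
        if thread["cpu"] != cpu:
            continue  # not running on the deactivating CPU
        if thread["mask"] == {cpu}:
            # Pinned: stays in this CPU's run queue, unserviced for now.
            parked.append(thread["name"])
        else:
            # Prefer active CPUs in the mask; otherwise any active CPU.
            candidates = sorted((thread["mask"] & remaining) or remaining)
            thread["cpu"] = candidates[0]
    return remaining, parked


threads = [
    {"name": "pinned", "mask": {2}, "cpu": 2},
    {"name": "free", "mask": {1, 2}, "cpu": 2},
    {"name": "other", "mask": {0}, "cpu": 0},
]
active, parked = deactivate_cpu(2, {0, 1, 2}, threads)
placement = {t["name"]: t["cpu"] for t in threads}
print(sorted(active), parked, placement)
```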

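When a CPU is deactivated, that is, shut down and removed from the system, the scheduler moves all threads running on it to other CPUs. The one exception is threads pinned to that CPU, meaning the deactivating CPU is the only one in their affinity mask. These threads are put back into the run queue, where they remain unserviced until the CPU is reactivated.

When a CPU is reactivated, it runs the pinned threads that were waiting on it. Threads that have been running on CPUs outside their affinity mask are also migrated back quickly by the scheduler under the rules above. There is no active rebalancing of threads onto the newly awakened CPU, but because it is likely to be idle more often, it should receive some migrations through the logic described above.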

Realtime and idle threads

These are special threads that are treated a little differently.

The idle thread runs when no other threads are runnable. There is one on each CPU and it lives outside of the priority queues, but effectively in a priority queue of -1. It is used to track idle time and can be used by platform implementations for a low power wait mode.

Realtime threads (marked with THREAD_FLAG_REAL_TIME) are allowed to run without preemption and will run until they block, yield, or manually reschedule.

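Some special threads are handled a little differently.

  • The idle thread runs when no other thread is runnable. Each CPU has one idle thread. It lives outside the priority queues, but effectively behaves as if it were in a queue of priority -1. It is used to track idle time, and platform implementations can use it for a low-power wait mode.

  • Realtime threads (marked with THREAD_FLAG_REAL_TIME) are allowed to run without being preempted, and keep running until they block, yield, or manually reschedule.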


If you enjoyed this, you can follow the WeChat official account: 摩卡Code (MochaCode)