The ThreadPool class lives in mindspore/core/mindrt/src/thread/threadpool.cc and mindspore/core/mindrt/include/thread/threadpool.h.
Data structures
The complete set of ThreadPool data members is as follows:
protected:
std::mutex pool_mutex_;
std::vector<Worker *> workers_;
std::vector<std::unique_ptr<HQueue<TaskSplit>>> task_queues_;
std::unordered_map<std::thread::id, size_t> worker_ids_;
CoreAffinity *affinity_{nullptr};
std::atomic<size_t> actor_thread_num_{0};
std::atomic<size_t> kernel_thread_num_{0};
bool occupied_actor_thread_{true};
std::atomic_int max_spin_count_{kDefaultSpinCount};
std::atomic_int min_spin_count_{kMinSpinCount};
float server_cpu_frequence = -1.0f; // Unit : GHz
static std::mutex create_thread_pool_muntex_;
The two mutex objects simply protect shared state and need no further explanation; the remaining members are covered one by one:
1. workers_
Holds all worker threads of the pool. Note that it can store plain Worker objects as well as objects of Worker subclasses such as ActorWorker (ActorWorker and ActorThreadPool will be covered later). In other words, the workers in one pool may be homogeneous or heterogeneous (Worker and ActorWorker differ in their work loops).
This is easy to see in ActorThreadPool::CreateThreads (not reproduced here), which creates worker objects of different types and stores them all in workers_.
2. task_queues_
Holds the per-worker task queue of every worker thread in the pool.
3. worker_ids_
An unordered_map<std::thread::id, size_t> of key-value pairs: the key is a thread's id and the value is that thread's index in workers_. A search of the code base shows this structure is used only by the CPUProfiler class, i.e. for performance profiling; it is not part of the pool's core logic.
4. affinity_
Points to a CoreAffinity object, used for binding threads to CPU cores when core affinity is required.
5. actor_thread_num_ and kernel_thread_num_
The MindSpore architecture distinguishes two kinds of operators, roughly defined as follows:
1) actor operators: manage task dependencies, communication, and asynchronous execution, e.g. fusion operators (FusionActor, which fuses several actors with execution dependencies into one unit, reducing scheduling overhead and latency);
2) kernel operators: perform the actual numerical computation and memory operations, e.g. the matrix multiplication operator (MatMul).
Roughly speaking, actor_thread_num_ is the number of threads that run actor operators (actor threads; an actor thread can sometimes also run kernel operators), and kernel_thread_num_ is the number of threads that run kernel operators (kernel threads).
6. occupied_actor_thread_
A bool. When it is false, actor threads must not be occupied: when the pool distributes tasks to workers, it hands them only to kernel threads, never to actor threads. The rationale is that actor threads may be reserved for specific duties (asynchronous I/O, real-time responses, communication); controlling whether they can be occupied keeps ordinary compute tasks from blocking those critical operations.
7. max_spin_count_ and min_spin_count_
spin_count_ tracks how many times a thread has spun; once spin_count_ exceeds a threshold, the thread is put to sleep. The smaller the threshold, the sooner a spinning thread is put to sleep and yields the CPU; the larger the threshold, the longer it keeps spinning. min_spin_count_ and max_spin_count_ bound the range within which that sleep threshold can be tuned.
8. server_cpu_frequence
Stores the CPU core frequency (in GHz). No code that actually uses it could be found.
Key code walkthrough
ThreadPool has many methods; the key ones are covered here:
ThreadPool::TaskQueuesInit
int ThreadPool::TaskQueuesInit(size_t thread_num) {
  for (size_t i = 0; i < thread_num; ++i) {
    (void)task_queues_.emplace_back(std::make_unique<HQueue<TaskSplit>>());
  }
  for (size_t i = 0; i < thread_num; ++i) {
    if (task_queues_[i]->Init(kMaxHqueueSize) != true) {
      THREAD_ERROR("init task queue failed.");
      return THREAD_ERROR;
    }
  }
  THREAD_INFO("init task queues success.");
  return THREAD_OK;
}
Purpose: initialize the pool's task queues.
Flow:
1. Loop thread_num times, creating one HQueue per worker and appending it to task_queues_.
2. Call Init on each queue, setting its maximum capacity to kMaxHqueueSize.
3. Return an error if any queue fails to initialize; otherwise return success.
ThreadPool::ParallelLaunch
int ThreadPool::ParallelLaunch(const Func &func, Content content, int task_num) {
  // if single thread, run master thread
  if (task_num <= 1) {
    return SyncRunFunc(func, content, 0, task_num);
  }
  // distribute task to the KernelThread and the idle ActorThread,
  // if the task num is greater than the KernelThread num
  THREAD_DEBUG("launch: %d", task_num);
  Task task = {func, content};
  std::vector<TaskSplit> task_list;
  for (int i = 0; i < task_num; ++i) {
    (void)task_list.emplace_back(TaskSplit{&task, i});
  }
  Worker *curr = CurrentWorker();
  DistributeTask(&task_list, &task, task_num, curr);
  // synchronization
  // wait until the finished is equal to task_num
  while (task.finished != task_num) {
    if (curr != nullptr) {
      (void)curr->RunLocalKernelTask();
    }
    std::this_thread::yield();
  }
  // check the return value of task
  if (task.status != THREAD_OK) {
    return THREAD_ERROR;
  }
  return THREAD_OK;
}
Purpose: execute a task in parallel.
Flow:
1. If task_num <= 1, run the task directly on the calling thread via SyncRunFunc.
2. Wrap the work as a Task object and split it into task_num subtasks (TaskSplit).
3. Call DistributeTask to hand the subtasks out to worker threads.
4. Wait until all subtasks have finished.
5. While waiting, if the calling thread is itself a Worker, it executes local tasks via RunLocalKernelTask instead of idling.
ThreadPool::DistributeTask
void ThreadPool::DistributeTask(std::vector<TaskSplit> *task_list, Task *task, int task_num, Worker *curr) const {
  int sum_frequency = 0;
  std::vector<Worker *> assigned;
  assigned.reserve(task_num);
  int num = static_cast<int>(workers_.size()) - 1;
  int offset = 0;
  bool use_curr = (curr != nullptr);
  // if the current thread isn't nullptr, that is the curr is a ActorThread,
  // then assign (task_num - 1) tasks to workers, and run the last one by itself
  int num_assigned = use_curr ? task_num - 1 : task_num;
  int count = 0;
  if (!occupied_actor_thread_) {
    offset = static_cast<int>(actor_thread_num_);
  }
  for (int i = num; i >= offset && count < num_assigned; --i) {
    if (workers_[i]->available()) {
      assigned.push_back(workers_[i]);
      sum_frequency += workers_[i]->frequency();
      (void)++count;
    }
  }
  if (use_curr) {
    assigned.push_back(curr);
    sum_frequency += curr->frequency();
  } else if (assigned.size() != static_cast<size_t>(task_num)) {
    CalculateScales(assigned, sum_frequency);
    ActiveWorkers(assigned, task_list, assigned.size(), curr);
    SyncRunTask(task, assigned.size(), task_num);
    return;
  }
  CalculateScales(assigned, sum_frequency);
  ActiveWorkers(assigned, task_list, task_num, curr);
}
Purpose: distribute subtasks to available worker threads.
Flow:
1. Collect available Workers, scanning from the back of workers_; when occupied_actor_thread_ is false, the scan stops before reaching the actor threads, so only kernel threads are used.
2. If the calling thread is itself a Worker, reserve one subtask for it to run.
3. Call CalculateScales to apportion the work according to each worker's CPU frequency.
4. Call ActiveWorkers to wake the assigned workers; if fewer workers were collected than there are subtasks, the calling thread runs the leftover subtasks synchronously via SyncRunTask.
ThreadPool::ActiveWorkers
void ThreadPool::ActiveWorkers(const std::vector<Worker *> &workers, std::vector<TaskSplit> *task_list, int task_num,
                               const Worker *curr) const {
  // recalculate task num for each worker.
  int worker_num = static_cast<int>(workers.size());
  if (worker_num > 0) {
    int each_worker_task_num = task_num / worker_num;
    int rest_task_num = task_num % worker_num;
    int start = 0;
    int end;
    for (int i = 0; i < worker_num; ++i) {
      Worker *worker = workers[i];
      THREAD_RETURN_IF_NULL(worker);
      if (i < rest_task_num) {
        end = start + each_worker_task_num + 1;
      } else {
        end = start + each_worker_task_num;
      }
      worker->Active(task_list, start, end);
      if (worker == curr) {
        (void)worker->RunLocalKernelTask();
      }
      start = end;
    }
  }
}
Purpose: wake worker threads to execute their assigned ranges of subtasks.
Flow:
- Split the subtask list into contiguous segments, one per Worker; the first task_num % worker_num workers each take one extra subtask.
- Call worker->Active to submit each segment; if a Worker happens to be the calling thread, it runs its segment immediately via RunLocalKernelTask.