Problem description
(base) lou@lou-dell:~/6.5840/src/main$ bash test-mr.sh
*** Starting wc test.
--- wc test: PASS
*** Starting indexer test.
--- indexer test: PASS
*** Starting map parallelism test.
--- map parallelism test: PASS
*** Starting reduce parallelism test.
--- reduce parallelism test: PASS
*** Starting job count test.
--- job count test: PASS
*** Starting early exit test.
sort: cannot read: 'mr-out*': No such file or directory
cmp: EOF on mr-wc-all-initial which is empty
--- output changed after first worker exited
--- early exit test: FAIL
*** Starting crash test.
--- crash test: PASS
*** FAILED SOME TESTS
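The early exit test fails with the error sort: cannot read: 'mr-out*': No such file or directory.
Analysis shows that cmp then finds the file mr-wc-all-initial empty when comparing, which produces the EOF (end of file) error.
This usually means that nothing was written when mr-wc-all-initial was generated, most likely because the mr-out* files had not been created yet or were empty.
Looking at the mapf and reducef used by this test: in the Reduce function, any key containing "sherlock" or "tom" triggers a 3-second sleep (time.Sleep(time.Duration(3 * time.Second))). As a result, some reducef tasks have already finished while a few are still running. A worker that asks for a new task at that moment finds nothing left to do and exits, even though the job as a whole is not finished.
The fix is to add a waiting state. The waiting state lets a worker stay alive while it waits for a new task instead of exiting immediately, so the early exit test no longer fails.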
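The difference the waiting state makes can be sketched in a few lines of Go. This is a simplified model, not the lab code: the Status type, the decide function, and the withWaiting flag are hypothetical names introduced for illustration, and the "exit as soon as no idle task is left" behavior with withWaiting=false is an assumption about how the coordinator behaved before the fix.

```go
package main

import "fmt"

// Status is a hypothetical task state, mirroring IDLE / INPROGRESS / COMPLETED.
type Status int

const (
	Idle Status = iota
	InProgress
	Completed
)

// decide returns what the coordinator tells a worker that asks for work
// during the reduce phase. withWaiting=false models the assumed pre-fix
// behavior, where a worker is told to exit whenever no idle task is left.
func decide(tasks []Status, withWaiting bool) string {
	allDone := true
	for _, s := range tasks {
		if s == Idle {
			return "reduce" // there is still work to hand out
		}
		if s != Completed {
			allDone = false
		}
	}
	if !allDone && withWaiting {
		return "waiting" // someone else is still working: stay alive
	}
	return "exit"
}

func main() {
	// One slow reduce task (for example a "sherlock" key) is still running.
	tasks := []Status{Completed, InProgress, Completed}
	fmt.Println(decide(tasks, false)) // exit
	fmt.Println(decide(tasks, true))  // waiting

	tasks[1] = Completed
	fmt.Println(decide(tasks, true)) // exit
}
```

With the waiting state, "exit" is only returned once every task is completed, so the first worker to exit does so after the output is final.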
Reference code
// main/mrworker.go calls this function.
func Worker(mapf func(string, string) []KeyValue,
	reducef func(string, []string) string) {
	// Your worker implementation here.
	// uncomment to send the Example RPC to the coordinator.
	// CallExample()
	for {
		task := RequestTask()
		switch task.TaskType {
		case "map":
			DoMapTask(task, mapf)
			ReportTaskCompletion(task)
		case "reduce":
			DoReduceTask(task, reducef)
			ReportTaskCompletion(task)
		case "waiting":
			// The added waiting state: no idle task right now, but the job
			// is not finished, so stay alive and ask again later.
			// A waiting placeholder is not reported as completed.
			time.Sleep(time.Second * 3)
		case "exit":
			return
		}
		time.Sleep(time.Second) // avoid busy-waiting
	}
}
// coordinator.go
// Your code here -- RPC handlers for the worker to call.
func (c *Coordinator) RequestTask(args *RequestTaskArgs, reply *RequestTaskReply) error {
	c.Mu.Lock()
	defer c.Mu.Unlock()

	// allTasksCompleted reports whether every task in the slice is COMPLETED.
	allTasksCompleted := func(tasks []Task) bool {
		for _, task := range tasks {
			if task.Status != COMPLETED {
				return false
			}
		}
		return true
	}

	if c.Phase == MapPhase {
		// Hand out the first idle map task, if any.
		for i, task := range c.NMapTask {
			if task.Status == IDLE {
				c.NMapTask[i].Status = INPROGRESS
				c.NMapTask[i].StartTime = time.Now()
				reply.RequestTask = c.NMapTask[i]
				return nil
			}
		}
		// No idle map task, but some are still running: tell the worker to wait.
		if !allTasksCompleted(c.NMapTask) {
			reply.RequestTask = Task{TaskType: "waiting"}
			return nil
		}
	} else if c.Phase == ReducePhase {
		// Hand out the first idle reduce task, if any.
		for i, task := range c.NReduceTask {
			if task.Status == IDLE {
				c.NReduceTask[i].Status = INPROGRESS
				c.NReduceTask[i].StartTime = time.Now()
				reply.RequestTask = c.NReduceTask[i]
				return nil
			}
		}
		// No idle reduce task, but some are still running: tell the worker to wait.
		if !allTasksCompleted(c.NReduceTask) {
			reply.RequestTask = Task{TaskType: "waiting"}
			return nil
		}
	}
	// Everything in the current phase is completed: the worker may exit.
	reply.RequestTask = Task{TaskType: "exit"}
	return nil
}
Test output after adding this code
(base) lou@lou-dell:~/6.5840/src/main$ bash test-mr.sh
*** Starting wc test.
--- wc test: PASS
*** Starting indexer test.
--- indexer test: PASS
*** Starting map parallelism test.
--- map parallelism test: PASS
*** Starting reduce parallelism test.
2024/08/02 10:02:56 dialing:dial unix /var/tmp/5840-mr-1000: connect: connection refused
--- reduce parallelism test: PASS
*** Starting job count test.
--- job count test: PASS
*** Starting early exit test.
2024/08/02 10:03:34 dialing:dial unix /var/tmp/5840-mr-1000: connect: connection refused
2024/08/02 10:03:34 dialing:dial unix /var/tmp/5840-mr-1000: connect: connection refused
--- early exit test: PASS
*** Starting crash test.
--- crash test: PASS
*** PASSED ALL TESTS