MIT 6.824 Lab 1 MapReduce: distributed systems course lab notes


Problem description

(base) lou@lou-dell:~/6.5840/src/main$ bash test-mr.sh
*** Starting wc test.
--- wc test: PASS
*** Starting indexer test.
--- indexer test: PASS
*** Starting map parallelism test.
--- map parallelism test: PASS
*** Starting reduce parallelism test.
--- reduce parallelism test: PASS
*** Starting job count test.
--- job count test: PASS
*** Starting early exit test.
sort: cannot read: 'mr-out*': No such file or directory
cmp: EOF on mr-wc-all-initial which is empty
--- output changed after first worker exited
--- early exit test: FAIL
*** Starting crash test.
--- crash test: PASS
*** FAILED SOME TESTS

The early exit test fails with the error sort: cannot read: 'mr-out*': No such file or directory.

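Analysis shows that cmp hit EOF because the file mr-wc-all-initial was empty when it was compared. Nothing was written to mr-wc-all-initial when it was generated, most likely because the mr-out* files had not been produced yet (or were empty) at that moment, which matches the sort error above.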

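Looking at the mapf and reducef used by this test: in the Reduce function, any key containing "sherlock" or "tom" triggers a 3-second sleep (time.Sleep(time.Duration(3 * time.Second))). As a result, some reduce tasks have already finished while a few others are still running. A worker that asks for a task at that point has nothing left to take, and if it simply exits, the test sees a worker exit before the job is complete.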

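The fix is to add a waiting state. The waiting state lets a worker stay alive while it waits for a new task instead of exiting immediately, so the early exit test no longer fails.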
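The idea behind the waiting state can be sketched as a small decision function, independent of any RPC code. The names TaskStatus, Idle, InProgress, Completed and NextAction are hypothetical and exist only for this sketch.

```go
package main

// TaskStatus and its constants are hypothetical names used for illustration.
type TaskStatus int

const (
	Idle TaskStatus = iota
	InProgress
	Completed
)

// NextAction decides what a worker asking for work should be told:
//   - "run":     some task is still unassigned
//   - "waiting": every task is assigned, but at least one is unfinished
//   - "exit":    every task is completed
func NextAction(statuses []TaskStatus) string {
	allDone := true
	for _, s := range statuses {
		if s == Idle {
			return "run"
		}
		if s != Completed {
			allDone = false
		}
	}
	if allDone {
		return "exit"
	}
	return "waiting"
}
```

Without the middle case, a worker that finds no idle task would be told to exit even though a slow task is still running, which is the situation the early exit test checks for.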

Reference code

// main/mrworker.go calls this function.
func Worker(mapf func(string, string) []KeyValue,
	reducef func(string, []string) string) {

	// Your worker implementation here.

	// uncomment to send the Example RPC to the coordinator.
	//CallExample()
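	for {
		task := RequestTask()
		switch task.TaskType {
		case "map":
			DoMapTask(task, mapf)
			ReportTaskCompletion(task) // report only tasks that were actually executed
		case "reduce":
			DoReduceTask(task, reducef)
			ReportTaskCompletion(task)
		case "waiting":
			// Waiting state: nothing can be assigned right now, but the job is
			// not finished, so stay alive and ask again later.
			time.Sleep(time.Second * 3)
		case "exit":
			// break inside a switch would only leave the switch, so return.
			return
		}

		time.Sleep(time.Second) // avoid busy-waiting
	}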
}
// coordinator.go
// Your code here -- RPC handlers for the worker to call.
func (c *Coordinator) RequestTask(args *RequestTaskArgs, reply *RequestTaskReply) error {
	c.Mu.Lock()
	defer c.Mu.Unlock()

	allTasksCompleted := func(tasks []Task) bool {
		for _, task := range tasks {
			if task.Status != COMPLETED {
				return false
			}
		}
		return true
	}

	if c.Phase == MapPhase {
		for i, task := range c.NMapTask {
			if task.Status == IDLE {
				c.NMapTask[i].Status = INPROGRESS
				c.NMapTask[i].StartTime = time.Now()
				reply.RequestTask = c.NMapTask[i]
				return nil
			}
		}
		if !allTasksCompleted(c.NMapTask) {
			reply.RequestTask = Task{TaskType: "waiting"}
			return nil
		}
	} else if c.Phase == ReducePhase {
		for i, task := range c.NReduceTask {
			if task.Status == IDLE {
				c.NReduceTask[i].Status = INPROGRESS
				c.NReduceTask[i].StartTime = time.Now()
				reply.RequestTask = c.NReduceTask[i]
				return nil
			}
		}
		if !allTasksCompleted(c.NReduceTask) {
			reply.RequestTask = Task{TaskType: "waiting"}
			return nil
		}
	}

	reply.RequestTask = Task{TaskType: "exit"}
	return nil
}
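
The RequestTask handler above relies on c.Phase moving from MapPhase to ReducePhase somewhere else, which the post does not show. One possible way to do that is in the completion-report path. The following is only a sketch under that assumption: the types, field names, DonePhase and ReportDone are all hypothetical and simplified, not the code used in this lab solution.

```go
package main

import "sync"

// All names below are hypothetical, simplified stand-ins for this sketch.
type Status int

const (
	Idle Status = iota
	InProgress
	Completed
)

type Phase int

const (
	MapPhase Phase = iota
	ReducePhase
	DonePhase
)

type Task struct {
	TaskType string
	Status   Status
}

type Coordinator struct {
	mu          sync.Mutex
	phase       Phase
	mapTasks    []Task
	reduceTasks []Task
}

func allCompleted(tasks []Task) bool {
	for _, t := range tasks {
		if t.Status != Completed {
			return false
		}
	}
	return true
}

// ReportDone marks one task as completed and advances the phase once every
// task of the current phase is done. Reports for other task types (such as
// "waiting") and out-of-range ids are ignored.
func (c *Coordinator) ReportDone(taskType string, id int) {
	c.mu.Lock()
	defer c.mu.Unlock()

	switch taskType {
	case "map":
		if id >= 0 && id < len(c.mapTasks) {
			c.mapTasks[id].Status = Completed
		}
		if c.phase == MapPhase && allCompleted(c.mapTasks) {
			c.phase = ReducePhase
		}
	case "reduce":
		if id >= 0 && id < len(c.reduceTasks) {
			c.reduceTasks[id].Status = Completed
		}
		if c.phase == ReducePhase && allCompleted(c.reduceTasks) {
			c.phase = DonePhase
		}
	}
}
```

Advancing the phase inside the same critical section that records the completion means a request handler never observes a state where all map tasks are completed but the phase is still MapPhase, which would otherwise hand out "exit" too early.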

Test results after adding this code

(base) lou@lou-dell:~/6.5840/src/main$ bash test-mr.sh
*** Starting wc test.
--- wc test: PASS
*** Starting indexer test.
--- indexer test: PASS
*** Starting map parallelism test.
--- map parallelism test: PASS
*** Starting reduce parallelism test.
2024/08/02 10:02:56 dialing:dial unix /var/tmp/5840-mr-1000: connect: connection refused
--- reduce parallelism test: PASS
*** Starting job count test.
--- job count test: PASS
*** Starting early exit test.
2024/08/02 10:03:34 dialing:dial unix /var/tmp/5840-mr-1000: connect: connection refused
2024/08/02 10:03:34 dialing:dial unix /var/tmp/5840-mr-1000: connect: connection refused
--- early exit test: PASS
*** Starting crash test.
--- crash test: PASS
*** PASSED ALL TESTS