Golang Dynamic Keep-Alive Worker Pool Design


How to Tell That a Goroutine Has Died

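Go does not expose any API for checking whether a goroutine is still alive. To prove that a goroutine exists, the child goroutine's business logic can periodically write to a keep-alive channel, and the main goroutine watches that channel to learn the child's state. Unlike processes and threads, goroutines have no parent/child or sibling relationships: to the scheduler, every goroutine is an independent, equal flow of execution.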

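PS: Monitoring the death of child threads or child processes is far less simple; here we benefit from the convenience that Go's scheduler gives us. Since we are using Go, we should build this pattern on top of the Go scheduler.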

So how do we detect that a goroutine has died?

Child goroutine

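Add a defer to the monitored goroutine, call recover() inside the deferred function to capture the goroutine's abnormal state (a panic), and finally send a death signal to the main goroutine over a channel. The deferred function runs however the goroutine exits (normal return, panic, or runtime.Goexit), so the signal is sent in every case; recover() returns a non-nil value only when the goroutine is panicking.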
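The sketch below illustrates that claim, assuming a hypothetical `startMonitored` helper and an `exitMode` switch used only for the demonstration. It ends the goroutine in each of the three ways and shows that the deferred function sends a death signal every time. Because recover() returns nil for both a normal return and runtime.Goexit, a `finished` flag set at the end of the body is used to tell those two apart.

```go
package main

import (
	"fmt"
	"runtime"
)

// exitMode picks how the hypothetical monitored goroutine ends.
type exitMode int

const (
	normalReturn exitMode = iota
	panicExit
	goexitExit
)

// startMonitored runs a goroutine whose deferred function always sends a
// death signal on dead, describing how the goroutine ended.
func startMonitored(mode exitMode, dead chan<- string) {
	go func() {
		finished := false // becomes true only if the body runs to the end
		defer func() {
			// recover must be called directly by the deferred function.
			if r := recover(); r != nil {
				dead <- fmt.Sprintf("panic: %v", r)
			} else if finished {
				dead <- "returned normally"
			} else {
				// recover() returns nil during runtime.Goexit,
				// but the deferred call still runs.
				dead <- "runtime.Goexit"
			}
		}()

		switch mode {
		case panicExit:
			panic("boom")
		case goexitExit:
			runtime.Goexit()
		}
		finished = true
	}()
}

func main() {
	dead := make(chan string)
	for _, m := range []exitMode{normalReturn, panicExit, goexitExit} {
		startMonitored(m, dead)
		fmt.Println("death signal:", <-dead)
	}
}
```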

Main goroutine

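The main goroutine reads from this channel and, whenever it receives something, restarts that child goroutine. To do this, the main goroutine needs to record each child goroutine's ID so that it can restart exactly the one that died.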

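Implementation

We implement the approach above using a worker pool as the scenario.

WorkerManager plays the role of the main goroutine, and each worker is a child goroutine.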

WorkerManager


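package main

import (
	"fmt"
	"time"
)

type WorkerManager struct {
	// Buffered channel on which dead workers report themselves
	workerChan chan *worker
	// Total number of workers to keep alive
	nWorkers int
}

// NewWorkerManager creates a WorkerManager that supervises nworks workers.
func NewWorkerManager(nworks int) *WorkerManager {
	return &WorkerManager{
		workerChan: make(chan *worker, nworks),
		nWorkers:   nworks,
	}
}

// StartWorkerPool starts the worker pool, assigns each worker an ID and
// sets every worker to work.
func (wm *WorkerManager) StartWorkerPool() {
	// Start nWorkers workers
	for i := 0; i < wm.nWorkers; i++ {
		wk := &worker{id: i}
		go wk.work(wm.workerChan)
	}

	// Start keep-alive monitoring in its own goroutine. KeepLiveWorkers
	// never returns, so calling it directly would block the caller forever.
	go wm.KeepLiveWorkers()
}

// KeepLiveWorkers receives every worker that dies on workerChan and
// restarts it.
func (wm *WorkerManager) KeepLiveWorkers() {
	for wk := range wm.workerChan {
		fmt.Printf("Worker %d stopped with err: [%v] \n", wk.id, wk.err)
		wk.err = nil
		// Restart this worker
		go wk.work(wm.workerChan)
	}
}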

worker

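type worker struct {
	id  int
	err error
}

// work runs the worker's business logic. However the goroutine exits
// (normal return, panic or runtime.Goexit), the deferred function runs,
// so it is where the WorkerManager is notified through workerChan.
func (wk *worker) work(workerChan chan<- *worker) (err error) {
	defer func() {
		// Recover the panic so it does not crash the whole program,
		// and record why the worker stopped.
		if r := recover(); r != nil {
			if e, ok := r.(error); ok {
				wk.err = e
			} else {
				wk.err = fmt.Errorf("Panic happened with [%v]", r)
			}
		} else {
			wk.err = err
		}
		// Tell the main goroutine that this worker has died.
		workerChan <- wk
	}()

	// Do some business work.
	fmt.Println("Start Worker...ID = ", wk.id)

	// Sleep for a while, then exit via panic
	// (or via runtime.Goexit, which needs import "runtime").
	time.Sleep(time.Second * 1)

	panic("worker panic")
	// runtime.Goexit()

	return err
}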


Test

func main() {
	wm := NewWorkerManager(10)
	wm.StartWorkerPool()
	time.Sleep(100 * time.Second)
}

Output

Start Worker...ID =  0
Start Worker...ID =  1
Start Worker...ID =  9
Start Worker...ID =  4
Start Worker...ID =  5
Start Worker...ID =  6
Start Worker...ID =  7
Start Worker...ID =  2
Start Worker...ID =  8
Start Worker...ID =  3
Worker 7 stopped with err: [Panic happened with [worker panic]] 
Worker 2 stopped with err: [Panic happened with [worker panic]] 
Worker 0 stopped with err: [Panic happened with [worker panic]] 
Worker 8 stopped with err: [Panic happened with [worker panic]] 
Worker 6 stopped with err: [Panic happened with [worker panic]] 
Worker 9 stopped with err: [Panic happened with [worker panic]] 
Start Worker...ID =  2
Worker 5 stopped with err: [Panic happened with [worker panic]] 
Worker 3 stopped with err: [Panic happened with [worker panic]] 
Start Worker...ID =  9