How golang netpoll is used in the net package

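Notes:

This walkthrough is based on the Go 1.19 source; non-essential code is omitted along the way.
The netpoll part summarizes the reference articles; the net part is original.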

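When IO multiplexing comes up, most people think of select, poll, and epoll. These are all provided by the operating system kernel, and user-space code only needs to make the corresponding system calls. golang also wraps epoll for its own use. This post walks through the source code of how epoll is used in the net package. My understanding is limited, so corrections are welcome.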

1.netpoll

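First, let's look at how golang calls epoll, which also ties into golang's own runtime scheduling.

Anyone familiar with epoll knows it comes down to a few key operations:

epoll_create: initializes the data structures and creates the epoll instance
epoll_ctl: adds an fd, which can also be thought of as registering the fd
epoll_wait: waits for events, mainly read/write events, and returns the fds that are ready. In golang this has to be tied to goroutines, as detailed later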
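Before reading the runtime's version, it can help to see the three operations called directly from user code. The following is a minimal sketch, assuming Linux and the standard syscall package; the function name firstReadyFD and the use of a pipe as the monitored fd are choices made for this illustration, not something taken from the runtime.

```go
package main

import (
	"fmt"
	"syscall"
)

// firstReadyFD registers the read end of a pipe with a fresh epoll instance,
// makes it readable, and returns the fd reported by epoll_wait together with
// the fd that was registered. (Hypothetical helper, Linux only.)
func firstReadyFD() (int, int, error) {
	// epoll_create: create the epoll instance.
	epfd, err := syscall.EpollCreate1(syscall.EPOLL_CLOEXEC)
	if err != nil {
		return 0, 0, err
	}
	defer syscall.Close(epfd)

	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		return 0, 0, err
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	// epoll_ctl: register interest in read events on the pipe's read end.
	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(p[0])}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, p[0], &ev); err != nil {
		return 0, 0, err
	}

	// Make the read end readable by writing one byte to the write end.
	if _, err := syscall.Write(p[1], []byte("x")); err != nil {
		return 0, 0, err
	}

	// epoll_wait: wait up to one second, retrying if interrupted by a signal.
	events := make([]syscall.EpollEvent, 8)
	for {
		n, err := syscall.EpollWait(epfd, events, 1000)
		if err == syscall.EINTR {
			continue
		}
		if err != nil {
			return 0, 0, err
		}
		if n == 0 {
			return 0, 0, fmt.Errorf("epoll_wait timed out")
		}
		return int(events[0].Fd), p[0], nil
	}
}

func main() {
	ready, registered, err := firstReadyFD()
	fmt.Println(ready == registered, err)
}
```

This sketch uses level-triggered mode for simplicity, whereas the runtime registers fds edge-triggered, as shown later.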

First, the epoll-related functions in golang:

// runtime/netpoll.go
const (
   pdReady uintptr = 1 // ready: the fd is readable or writable
   pdWait  uintptr = 2 // keep waiting
)

// Key struct; nearly all of the operations below are built on it
type pollDesc struct {
   link *pollDesc // in pollcache, protected by pollcache.lock
   fd   uintptr   // constant for pollDesc usage lifetime
   // rg, wg are accessed atomically and hold g pointers.
   // (Using atomic.Uintptr here is similar to using guintptr elsewhere.)
   rg atomic.Uintptr // pdReady, pdWait, G waiting for read or nil
   wg atomic.Uintptr // pdReady, pdWait, G waiting for write or nil

   ...
}

// epoll initialization
func netpollGenericInit() {}

// epoll add
func poll_runtime_pollOpen(fd uintptr) (*pollDesc, int) {}

// epoll wait
func poll_runtime_pollWait(pd *pollDesc, mode int) int {}

The underlying calls are declared as follows:

// runtime/netpoll_epoll.go

// epoll_create
func netpollinit() {}

// epoll_ctl
func netpollopen(fd uintptr, pd *pollDesc) int32 {}

// epoll_wait
func netpoll(delay int64) gList {}

Next, the actual source.

1.1 epoll_create

runtime/netpoll.go

//go:linkname poll_runtime_pollServerInit internal/poll.runtime_pollServerInit
func poll_runtime_pollServerInit() { // entry point for epoll initialization
   netpollGenericInit()
}

func netpollGenericInit() {
   if atomic.Load(&netpollInited) == 0 { // netpollInited is 0 until initialization completes
      lockInit(&netpollInitLock, lockRankNetpollInit)
      lock(&netpollInitLock) // take the lock to guard against concurrent initialization
      if netpollInited == 0 {
         netpollinit() // performs the actual epoll initialization
         atomic.Store(&netpollInited, 1) // atomically set netpollInited to 1
      }
      unlock(&netpollInitLock)
   }
}
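netpollGenericInit follows the double-checked pattern: an atomic fast-path check, then a second check under the lock. A user-space sketch of the same idea, assuming sync.Mutex and sync/atomic in place of the runtime's internal lock and atomics, might look like this. The names inited, initLock, setup, and initConcurrently are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var (
	inited    uint32     // 0 until setup has run, then 1
	initLock  sync.Mutex // serializes the slow path
	initCalls int        // how many times setup ran (for demonstration only)
)

// setup stands in for the one-time work (netpollinit in the runtime).
func setup() { initCalls++ }

// genericInit runs setup exactly once, even under concurrent callers.
func genericInit() {
	if atomic.LoadUint32(&inited) == 0 { // fast path: no lock once initialized
		initLock.Lock()
		if atomic.LoadUint32(&inited) == 0 { // re-check under the lock
			setup()
			atomic.StoreUint32(&inited, 1)
		}
		initLock.Unlock()
	}
}

// initConcurrently calls genericInit from n goroutines and reports how many
// times setup actually ran.
func initConcurrently(n int) int {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			genericInit()
		}()
	}
	wg.Wait()
	return initCalls
}

func main() {
	fmt.Println(initConcurrently(100)) // prints 1
}
```

In application code sync.Once is the usual way to get this behavior; the explicit version is shown only because it mirrors the shape of the runtime code.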

The netpollinit() call above is implemented in netpoll_epoll.go; let's look at it next.

netpoll_epoll.go

func netpollinit() {
   epfd = epollcreate1(_EPOLL_CLOEXEC) // try epoll_create1 first
   if epfd < 0 { // fall back to epoll_create on kernels without epoll_create1; the size 1024 is only a hint that Linux has ignored since 2.6.8
      epfd = epollcreate(1024)
      if epfd < 0 {
         println("runtime: epollcreate failed with", -epfd)
         throw("runtime: netpollinit failed")
      }
      closeonexec(epfd)
   }
   r, w, errno := nonblockingPipe()
   if errno != 0 {
      println("runtime: pipe failed with", -errno)
      throw("runtime: pipe failed")
   }
   ev := epollevent{
      events: _EPOLLIN,
   }
   *(**uintptr)(unsafe.Pointer(&ev.data)) = &netpollBreakRd
   errno = epollctl(epfd, _EPOLL_CTL_ADD, r, &ev)
   if errno != 0 {
      println("runtime: epollctl failed with", -errno)
      throw("runtime: epollctl failed")
   }
   netpollBreakRd = uintptr(r)
   netpollBreakWr = uintptr(w)
}

1.2 epoll_ctl

runtime/netpoll.go

//go:linkname poll_runtime_pollOpen internal/poll.runtime_pollOpen
func poll_runtime_pollOpen(fd uintptr) (*pollDesc, int) {
   pd := pollcache.alloc() // allocate a pollDesc for this fd from the cache
   lock(&pd.lock) // lock the pollDesc
   wg := pd.wg.Load()
   if wg != 0 && wg != pdReady {
      throw("runtime: blocked write on free polldesc")
   }
   rg := pd.rg.Load()
   if rg != 0 && rg != pdReady {
      throw("runtime: blocked read on free polldesc")
   }
   pd.fd = fd // from here on, (re)initialize the pollDesc fields
   pd.closing = false
   pd.setEventErr(false)
   pd.rseq++
   pd.rg.Store(0)
   pd.rd = 0
   pd.wseq++
   pd.wg.Store(0)
   pd.wd = 0
   pd.self = pd
   pd.publishInfo()
   unlock(&pd.lock)

   errno := netpollopen(fd, pd) // register the fd with epoll
   if errno != 0 { // on failure, return the pollDesc to the cache
      pollcache.free(pd)
      return nil, int(errno)
   }
   return pd, 0
}

The source above boils down to:

Allocate a pollDesc object from pollcache
Initialize the pollDesc object
Call netpollopen to register the fd with epoll
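The allocate, initialize, register sequence can be imitated with a small free list. This is only a sketch under stated assumptions: desc, descCache, open, okRegister, and failRegister are hypothetical names, the cache is a plain mutex-protected linked list chained through a link field (as the link comment in pollDesc suggests), and the real pollcache allocation details are not shown in this article.

```go
package main

import (
	"fmt"
	"sync"
)

// desc is a simplified stand-in for pollDesc.
type desc struct {
	link *desc   // next free descriptor while sitting in the cache
	fd   uintptr // fd this descriptor currently describes
	seq  uintptr // bumped on every reuse, like rseq/wseq
}

// descCache is a free list of descriptors protected by a mutex.
type descCache struct {
	mu    sync.Mutex
	first *desc
}

// alloc pops a descriptor from the free list, or creates one if it is empty.
func (c *descCache) alloc() *desc {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.first == nil {
		return &desc{}
	}
	d := c.first
	c.first = d.link
	d.link = nil
	return d
}

// free pushes a descriptor back onto the free list.
func (c *descCache) free(d *desc) {
	c.mu.Lock()
	d.link = c.first
	c.first = d
	c.mu.Unlock()
}

// open mirrors the three steps: allocate, initialize, register.
// On registration failure the descriptor goes back to the cache.
func open(c *descCache, fd uintptr, register func(uintptr, *desc) error) (*desc, error) {
	d := c.alloc()
	d.fd = fd
	d.seq++
	if err := register(fd, d); err != nil {
		c.free(d)
		return nil, err
	}
	return d, nil
}

// okRegister and failRegister stand in for netpollopen succeeding or failing.
func okRegister(uintptr, *desc) error   { return nil }
func failRegister(uintptr, *desc) error { return fmt.Errorf("register failed") }

func main() {
	c := &descCache{}
	_, err := open(c, 3, failRegister)
	d, _ := open(c, 4, okRegister)
	fmt.Println(err, d.fd, d.seq)
}
```

The sequence counter shows why reuse matters: the second open gets the same descriptor back, with seq advanced so stale references can be told apart.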

Next, the implementation of netpollopen.

runtime/netpoll_epoll.go

func netpollopen(fd uintptr, pd *pollDesc) int32 {
   var ev epollevent // declare the event
   ev.events = _EPOLLIN | _EPOLLOUT | _EPOLLRDHUP | _EPOLLET
   *(**pollDesc)(unsafe.Pointer(&ev.data)) = pd // store pd in the event's data field
   return -epollctl(epfd, _EPOLL_CTL_ADD, int32(fd), &ev) // perform the actual registration
}

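In short:

An event is declared with several event types set (readable, writable, peer hang-up, edge-triggered), the pollDesc pointer pd is stored in the event's data field, and epollctl is called to register it. When epoll_wait later reports that the fd is readable, writable, or both, the returned event carries that same pollDesc, which is what the scheduling steps below work from.

That completes registration. Now back to the pollDesc struct and its fields: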

runtime/netpoll.go

const (
   pdReady uintptr = 1
   pdWait  uintptr = 2
)


type pollDesc struct {
    ...
    // rg, wg are accessed atomically and hold g pointers.
    // (Using atomic.Uintptr here is similar to using guintptr elsewhere.)
    rg atomic.Uintptr // pdReady, pdWait, G waiting for read or nil
    wg atomic.Uintptr // pdReady, pdWait, G waiting for write or nil
    ...
}

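rg and wg track the goroutine waiting for read and write events respectively. Each takes one of the following values:

rg
- pdReady (1): a read event is ready
- pdWait (2): no read event is ready yet, and a goroutine is about to park
- G waiting for read: a pointer to the goroutine that is parked waiting for a read event
- nil (0): initial state
wg
- pdReady (1): a write event is ready
- pdWait (2): no write event is ready yet, and a goroutine is about to park
- G waiting for write: a pointer to the goroutine that is parked waiting for a write event
- nil (0): initial state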

1.3 epoll_wait

This is where pending read and write events are picked up and handled. On to the source.

runtime/netpoll.go

func poll_runtime_pollWait(pd *pollDesc, mode int) int { // mode is 'r' or 'w'
   errcode := netpollcheckerr(pd, int32(mode))
   ...
   for !netpollblock(pd, int32(mode), false) {
      errcode = netpollcheckerr(pd, int32(mode))
      if errcode != pollNoError {
         return errcode
      }
      // Can happen if timeout has fired and unblocked us,
      // but before we had a chance to run, timeout has been reset.
      // Pretend it has not happened and retry.
   }
   return pollNoError
}


// As the comment below says, this returns true once IO is ready
// returns true if IO is ready, or false if timedout or closed
// waitio - wait only for completed IO, ignore errors
// Concurrent calls to netpollblock in the same mode are forbidden, as pollDesc
// can hold only a single waiting goroutine for each mode.
func netpollblock(pd *pollDesc, mode int32, waitio bool) bool {
   // pick the slot (rg or wg) that matches the mode
   gpp := &pd.rg
   if mode == 'w' {
      gpp = &pd.wg
   }

   // set the gpp semaphore to pdWait
   for {
      // Consume notification if already ready.
      if gpp.CompareAndSwap(pdReady, 0) {
         return true
      }
      if gpp.CompareAndSwap(0, pdWait) {
         break
      }

      // Double check that this isn't corrupt; otherwise we'd loop
      // forever.
      if v := gpp.Load(); v != pdReady && v != 0 {
         throw("runtime: double wait")
      }
   }

   // need to recheck error states after setting gpp to pdWait
   // this is necessary because runtime_pollUnblock/runtime_pollSetDeadline/deadlineimpl
   // do the opposite: store to closing/rd/wd, publishInfo, load of rg/wg
   if waitio || netpollcheckerr(pd, mode) == pollNoError { // park the goroutine
      gopark(netpollblockcommit, unsafe.Pointer(gpp), waitReasonIOWait, traceEvGoBlockNet, 5)
   }
   // be careful to not lose concurrent pdReady notification
   old := gpp.Swap(0)
   if old > pdWait {
      throw("runtime: corrupted polldesc")
   }
   return old == pdReady
}

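poll_runtime_pollWait calls netpollblock in a loop. netpollblock returns true only once the awaited read or write event has arrived, which ends the loop.

Taking mode = 'r' (a read event) as the example:

1.gpp is set to the address of pd.rg
2.In the for loop, if the value behind gpp is pdReady, a read event is already pending, so the function returns true without blocking; otherwise the state is set to pdWait and the loop exits
3.gopark is then called to block the current goroutine; it invokes netpollblockcommit, which points gpp at this goroutine, and the goroutine is parked
4.When the goroutine runs again, execution continues below: gpp is reset to 0 and the previous value is kept in old. If old is pdReady, a read event arrived, so old == pdReady holds and the function returns true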

When epoll_wait reports a ready event, golang has to wake the corresponding goroutine. Here is the runtime code involved:

runtime/netpoll_epoll.go

func netpoll(delay int64) gList {
   ...
   var events [128]epollevent
retry:
   n := epollwait(epfd, &events[0], int32(len(events)), waitms)
   if n < 0 {
      if n != -_EINTR {
         println("runtime: epollwait on fd", epfd, "failed with", -n)
         throw("runtime: netpoll failed")
      }
      // If a timed sleep was interrupted, just return to
      // recalculate how long we should sleep now.
      if waitms > 0 {
         return gList{}
      }
      goto retry
   }
   var toRun gList
   for i := int32(0); i < n; i++ {
      ev := &events[i]
      if ev.events == 0 {
         continue
      }

      if *(**uintptr)(unsafe.Pointer(&ev.data)) == &netpollBreakRd {
         if ev.events != _EPOLLIN {
            println("runtime: netpoll: break fd ready for", ev.events)
            throw("runtime: netpoll: break fd ready for something unexpected")
         }
         if delay != 0 {
            // netpollBreak could be picked up by a
            // nonblocking poll. Only read the byte
            // if blocking.
            var tmp [16]byte
            read(int32(netpollBreakRd), noescape(unsafe.Pointer(&tmp[0])), int32(len(tmp)))
            atomic.Store(&netpollWakeSig, 0)
         }
         continue
      }

      var mode int32
      if ev.events&(_EPOLLIN|_EPOLLRDHUP|_EPOLLHUP|_EPOLLERR) != 0 {
         mode += 'r'
      }
      if ev.events&(_EPOLLOUT|_EPOLLHUP|_EPOLLERR) != 0 {
         mode += 'w'
      }
      if mode != 0 {
         pd := *(**pollDesc)(unsafe.Pointer(&ev.data))
         pd.setEventErr(ev.events == _EPOLLERR)
         netpollready(&toRun, pd, mode)
      }
   }
   return toRun
}

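Overall, this function issues the epollwait() system call and returns the list of goroutines that are ready to run.

While epollwait is in progress the calling thread blocks (for a nonzero delay) until an IO event arrives or the timeout expires.

The main logic around retry:

Ready IO events are written into the events array and epollwait returns their count n. If n is negative because the call was interrupted (EINTR), epollwait is retried, or an empty list is returned for a timed wait; any other error is fatal. Otherwise the events array is traversed and mode is set per event: 'r', 'w', or 'r'+'w'
toRun is, per its declaration, a linked list of g
ev.events is ANDed with the event-type masks to update mode
In net, when a socket creates a new connection its pollDesc is registered with epoll, so the pollDesc can be read back from the event's data field
netpollready is then called with toRun, pd, and mode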
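The mode computation in the loop can be isolated into a small function. The sketch below assumes the Linux values of the epoll flags, copied into local constants so the example does not depend on the platform; modeFor is a hypothetical name.

```go
package main

import "fmt"

// Linux epoll event flags, redeclared locally for this sketch.
const (
	epollIn    = 0x1
	epollOut   = 0x4
	epollErr   = 0x8
	epollHup   = 0x10
	epollRdHup = 0x2000
)

// modeFor applies the same masks as the loop in netpoll: errors and hang-ups
// wake both readers and writers, so they contribute 'r' and 'w' together.
func modeFor(events uint32) int32 {
	var mode int32
	if events&(epollIn|epollRdHup|epollHup|epollErr) != 0 {
		mode += 'r'
	}
	if events&(epollOut|epollHup|epollErr) != 0 {
		mode += 'w'
	}
	return mode
}

func main() {
	// 'r' is 114 and 'w' is 119, so "both" is 233.
	fmt.Println(modeFor(epollIn), modeFor(epollOut), modeFor(epollIn|epollOut), modeFor(epollErr))
	// prints: 114 119 233 233
}
```

Because 'r' and 'w' are just small integers, their sum serves as a third distinct value, which is why netpollready compares against 'r'+'w'.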

netpollready pushes the runnable goroutines onto the toRun list. netpoll returns toRun, and the caller moves those goroutines from the blocked state to the runnable state.

runtime/netpoll.go

func netpollready(toRun *gList, pd *pollDesc, mode int32) {
   var rg, wg *g
   if mode == 'r' || mode == 'r'+'w' {
      rg = netpollunblock(pd, 'r', true)
   }
   if mode == 'w' || mode == 'r'+'w' {
      wg = netpollunblock(pd, 'w', true)
   }
   if rg != nil {
      toRun.push(rg)
   }
   if wg != nil {
      toRun.push(wg)
   }
}

Note that netpollready calls netpollunblock and pushes the result onto toRun only if it is not nil.

runtime/netpoll.go

func netpollunblock(pd *pollDesc, mode int32, ioready bool) *g {
   gpp := &pd.rg
   if mode == 'w' {
      gpp = &pd.wg
   }

   for {
      old := gpp.Load()
      if old == pdReady {
         return nil
      }
      if old == 0 && !ioready {
         // Only set pdReady for ioready. runtime_pollWait
         // will check for timeout/cancel before waiting.
         return nil
      }
      var new uintptr
      if ioready {
         new = pdReady
      }
      if gpp.CompareAndSwap(old, new) {
         if old == pdWait {
            old = 0
         }
         return (*g)(unsafe.Pointer(old))
      }
   }
}

netpollunblock is the counterpart of netpollblock: it sets the pollDesc's rg or wg to pdReady (when ioready is true) and returns the blocked g, if there is one.
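The handshake between netpollblock and netpollunblock can be imitated in user space to make the state transitions easier to follow. This is a rough sketch, not the runtime's implementation: the waiter type and its block and notify methods are hypothetical, a buffered channel stands in for gopark and the scheduler, the slot stays at the wait value instead of holding a g pointer, only one waiter at a time is supported, and the timeout and close paths are left out. It needs Go 1.19 or later for atomic.Uintptr.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Slot values, mirroring nil (0), pdReady (1), and pdWait (2).
const (
	stateIdle  uintptr = 0
	stateReady uintptr = 1
	stateWait  uintptr = 2
)

// waiter plays the role of one rg/wg slot.
type waiter struct {
	state atomic.Uintptr
	wake  chan struct{} // stands in for parking and readying a goroutine
}

func newWaiter() *waiter { return &waiter{wake: make(chan struct{}, 1)} }

// block mirrors netpollblock: consume a pending notification if there is one,
// otherwise announce the wait, sleep, and report whether IO became ready.
func (w *waiter) block() bool {
	for {
		if w.state.CompareAndSwap(stateReady, stateIdle) {
			return true // already ready: no need to sleep
		}
		if w.state.CompareAndSwap(stateIdle, stateWait) {
			break
		}
	}
	<-w.wake // "gopark": sleep until notify wakes us
	return w.state.Swap(stateIdle) == stateReady
}

// notify mirrors netpollunblock with ioready=true: mark the slot ready and
// report whether a sleeping waiter was woken.
func (w *waiter) notify() bool {
	for {
		old := w.state.Load()
		if old == stateReady {
			return false // notification already pending
		}
		if w.state.CompareAndSwap(old, stateReady) {
			if old == stateWait {
				w.wake <- struct{}{} // "goready": wake the waiter
				return true
			}
			return false
		}
	}
}

func main() {
	w := newWaiter()
	done := make(chan bool)
	go func() { done <- w.block() }()
	w.notify() // works whether it runs before or after block parks
	fmt.Println(<-done)
}
```

Either ordering of block and notify ends with block returning true, which is the property the CompareAndSwap loops in the runtime are there to guarantee.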

That covers initialization, registration, and wait in netpoll. Next, let's see how the net package handles network events on top of it.

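2.netpoll in the net package

Before getting to how netpoll is used, let's look at the key data structures behind a plain TCP server built on the net package. Knowing them makes the whole call chain easier to follow.

A simple TCP server looks like this: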

package main

import (
   "fmt"
   "net"
)

func main()  {
   lis, err := net.Listen("tcp", ":8000")
   if err != nil {
      fmt.Println(err)
      return
   }
   defer lis.Close()

   fmt.Println("server is up now...")
   for {
      conn, err := lis.Accept()
      if err != nil {
         fmt.Println(err)
         return
      }
      clientAddr := conn.RemoteAddr()
      fmt.Printf("recv client [%s] connection...\n", clientAddr)
      go serve(conn)
   }
}

func serve(conn net.Conn)  {
   defer conn.Close()
   buf := make([]byte, 64)
   n, _ := conn.Read(buf)
   conn.Write([]byte("pong: "))
   conn.Write(buf[:n])
}

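The important pieces in this example:

net.Listen(network, address string) (Listener, error) returns a TCPListener object
lis.Accept() returns a TCPConn object
TCPConn implements the io.Reader and io.Writer interfaces, so Read() reads the client's request and Write() sends the response back to the client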
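To see the server and a client side by side, here is a self-contained sketch that starts a listener on a loopback port chosen by the OS, dials it, and reads the reply. The roundTrip helper and the choice of 127.0.0.1:0 are assumptions made for this illustration; serve is a slightly condensed variant of the handler above that checks the read error and sends the reply in a single write.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// serve reads one request and answers with "pong: " followed by the request.
func serve(conn net.Conn) {
	defer conn.Close()
	buf := make([]byte, 64)
	n, err := conn.Read(buf)
	if err != nil {
		return
	}
	conn.Write(append([]byte("pong: "), buf[:n]...))
}

// roundTrip starts a one-shot server on a free loopback port, sends msg as a
// client, and returns everything the server wrote before closing.
func roundTrip(msg string) (string, error) {
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer lis.Close()

	go func() {
		conn, err := lis.Accept()
		if err != nil {
			return
		}
		serve(conn)
	}()

	conn, err := net.Dial("tcp", lis.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()

	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	// net.Conn is an io.Reader, so io.ReadAll reads until the server closes.
	reply, err := io.ReadAll(conn)
	return string(reply), err
}

func main() {
	reply, err := roundTrip("ping")
	fmt.Println(reply, err)
}
```

Both the Accept call and the Read calls look blocking to the goroutine that makes them; underneath, they go through the pollDesc machinery described in section 1.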

Based on this brief analysis, here are the corresponding data structures:

// src/net/net.go
// Types such as TCPListener and UnixListener implement this interface, one of the benefits of using interfaces in golang
type Listener interface {
   // Accept waits for and returns the next connection to the listener.
   Accept() (Conn, error)

   // Close closes the listener.
   // Any blocked Accept operations will be unblocked and return errors.
   Close() error

   // Addr returns the listener's network address.
   Addr() Addr
}

// src/net/dial.go
type ListenConfig struct {
   Control func(network, address string, c syscall.RawConn) error
   KeepAlive time.Duration
}

// sysListener contains a Listen's parameters and configuration.
type sysListener struct {
   ListenConfig
   network, address string
}


// src/net/tcpsock.go
// TCPListener is a TCP network listener. Clients should typically
// use variables of type Listener instead of assuming TCP.
type TCPListener struct {
   fd *netFD
   lc ListenConfig
}

// src/net/net.go
type Conn interface {
    Read(b []byte) (n int, err error)
    Write(b []byte) (n int, err error)
    ...
}

// src/net/net.go
type conn struct {
    fd *netFD
}

// src/net/tcpsock.go
type TCPConn struct {
    conn
}

// src/net/fd_posix.go
// Network file descriptor.
type netFD struct {
   pfd poll.FD
   ...
}

// src/internal/poll/fd_unix.go
type FD struct {
   // System file descriptor. Immutable until Close.
   Sysfd int
   // I/O poller.
   pd pollDesc
   ...
}

// src/internal/poll/fd_poll_runtime.go
type pollDesc struct {
   runtimeCtx uintptr
}

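To summarize the structures above:

Listener: interface; declares the three methods Accept()/Close()/Addr()
ListenConfig: struct; mainly wraps the Listen() method, which returns a listener object
TCPListener: struct; combines netFD and ListenConfig, and implements the Listener interface
sysListener: struct; embeds ListenConfig and wraps a set of internal listenxxx() helpers such as listenTCP/listenUDP/listenIP, which create the listening objects for each protocol, such as TCPListener/UDPConn/IPConn
poll.pollDesc: struct; wraps the runtime functions and is the point of interaction with the runtime
poll.FD: struct; wraps the lowest user-level read, write, accept, close, and similar operations; these methods issue the read/write system calls and interact with the runtime
netFD: struct; wraps poll.FD and provides the network-level read/write operations
Conn: interface; declares Read/Write/Close/SetxxxDeadline and other methods
conn: struct; implements the read/write methods, the close method, and the read/write deadline setters
TCPConn: struct; embeds conn (loosely, "inherits" from it), thereby implementing the Conn interface, and adds methods such as closing the read side or the write side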
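The layering of TCPConn over conn over netFD relies on struct embedding and method promotion. The sketch below reproduces that shape with hypothetical types (fakeFD, conn, tcpConn) backed by an in-memory buffer instead of a socket; it is meant only to show why the outer type satisfies io.ReadWriter without declaring Read or Write itself.

```go
package main

import (
	"fmt"
	"io"
)

// fakeFD stands in for netFD; it buffers bytes in memory instead of using a socket.
type fakeFD struct{ data []byte }

func (fd *fakeFD) Read(b []byte) (int, error) {
	if len(fd.data) == 0 {
		return 0, io.EOF
	}
	n := copy(b, fd.data)
	fd.data = fd.data[n:]
	return n, nil
}

func (fd *fakeFD) Write(b []byte) (int, error) {
	fd.data = append(fd.data, b...)
	return len(b), nil
}

// conn holds the descriptor and implements the generic read/write methods.
type conn struct{ fd *fakeFD }

func (c *conn) Read(b []byte) (int, error)  { return c.fd.Read(b) }
func (c *conn) Write(b []byte) (int, error) { return c.fd.Write(b) }

// tcpConn embeds conn, so Read and Write are promoted to *tcpConn.
type tcpConn struct{ conn }

// echo only knows about the interface, not the concrete type.
func echo(rw io.ReadWriter, msg string) string {
	rw.Write([]byte(msg))
	buf := make([]byte, 64)
	n, _ := rw.Read(buf)
	return string(buf[:n])
}

func main() {
	c := &tcpConn{conn{fd: &fakeFD{}}}
	fmt.Println(echo(c, "ping")) // prints ping
}
```

The same mechanism is what lets code written against net.Conn work with TCPConn, UDPConn, and the other connection types.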
