Insight as Sharp as a Knife: NIO, epoll, and I/O Multiplexing for a Better Understanding of I/O

This is day 1 of my participation in the August Writing Challenge. For event details, see: August Writing Challenge

Preface

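Picking up where the previous article left off: Java code and system calls are closely related. Java source is compiled to bytecode, which the JVM then interprets or JIT-compiles (Java itself is not the valuable part; the JVM is), and whenever that code performs I/O the JVM ends up issuing system calls. In this article we again use a simple server program to study and understand I/O.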

BIO

Whatever the language, a server program always performs the following operations:

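Call socket to obtain a file descriptor (which represents this socket)
bind to a port, for example 8090
listen: put the socket into the listening state
accept: accept client connections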
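The four steps above can be sketched in Java as follows. This is a minimal illustration rather than the demo used in this article: the class name ServerStepsSketch and the helper acceptOneClient are made up for the example, and port 0 (let the OS pick a free port) replaces 8090 so the sketch can run anywhere. In Java, ServerSocket.bind covers both the bind and the listen step.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class ServerStepsSketch {
    /** Runs socket/bind/listen/accept once on a free local port; true if a client was accepted. */
    static boolean acceptOneClient() {
        // Step 1: create an unbound server socket.
        try (ServerSocket server = new ServerSocket()) {
            // Steps 2 and 3: bind() binds the address and starts listening (backlog of 50).
            // Port 0 is an assumption of this sketch; the article uses 8090.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), 50);
            // A local client, only so that accept() has a connection to return.
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
                 // Step 4: accept() blocks until a connection is available.
                 Socket accepted = server.accept()) {
                return accepted.isConnected();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("accepted=" + acceptOneClient());
    }
}
```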

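We keep using the test demo from the previous article and connect a client to the server. The screenshot below shows that the main thread has reached a poll call (multiplexing is covered later) and is blocked there.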

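import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

public class ServerSocketTest {
    public static void main(String[] args) throws IOException {
        // socket + bind + listen on port 8090
        ServerSocket server = new ServerSocket(8090);
        System.out.println("step1:new ServerSocket(8090)");
        while (true) {
            // Blocks until a client connects
            Socket client = server.accept();
            System.out.println("step2:client\t" + client.getPort());
            // One new thread per accepted client
            new Thread(() -> {
                try {
                    InputStream in = client.getInputStream();
                    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                    String line;
                    // readLine() blocks until a line arrives; null means the client disconnected
                    while ((line = reader.readLine()) != null) {
                        System.out.println(line);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }).start();
        }
    }
}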

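Next we connect a client to the server and look again. A moment ago (previous screenshot) the thread was still blocked; now a connection event has arrived, accept is called, and a new file descriptor, 6, is produced (it represents the client connection).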

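Further down there is a clone call. What is that? As you have probably guessed, it is the new Thread(() -> ...) in the code above: each accepted client gets a new thread, here with thread id 16872.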

We can verify that a thread 16872 has appeared. As the screenshot below shows, there is indeed a new thread 16872.

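Looking at the newly created thread, we find it blocked in recvfrom(6, which matches the theory: until data from the client arrives, the thread stays blocked.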
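The blocking read can also be observed from Java without a tracing tool. The sketch below is an illustration under assumptions, not part of the original demo: the class BlockingReadSketch and its method readOnce are invented names, and a socket timeout is set only so that the otherwise indefinite wait ends and can be reported.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class BlockingReadSketch {
    /** Reads one byte from a freshly accepted connection and reports what happened. */
    static String readOnce(boolean clientSendsData, int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
             Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            if (clientSendsData) {
                client.getOutputStream().write('x');
            }
            // Without this timeout, read() would stay blocked for as long as no data arrives.
            accepted.setSoTimeout(timeoutMillis);
            try {
                int b = accepted.getInputStream().read();
                return b >= 0 ? "got data" : "end of stream";
            } catch (SocketTimeoutException e) {
                return "timed out";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readOnce(false, 300));  // client stays silent: the read waits, then times out
        System.out.println(readOnce(true, 2000));  // client sends a byte: the read returns it
    }
}
```

With a silent client the read occupies the calling thread for the whole timeout, which is why the BIO server needs one thread per connection.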

Next we type some input on the client and watch how the server reacts.

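Content was read from descriptor 6 and then written to descriptor 1 (as mentioned earlier, 1 is standard output), and the client's input shows up on the server.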

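At this point we can see fairly clearly where the I/O blocks. The example above handles multiple clients with multiple threads (one thread per client), and the kernel is necessarily involved throughout: whether the call goes through the JVM's built-in functions or through a library, it ends up as a system call.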

After all that hands-on work, here is a diagram that summarizes it.

This is BIO, in its typical form. So what is wrong with it?

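There are too many threads. Threads are resources too: creating one goes through a system call (the clone described above), which means a trap into the kernel (traditionally a software interrupt); each thread has its own stack, which consumes memory; and CPU time is wasted on context switches. These are only the surface symptoms, though. The root problem is that the I/O blocks.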

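So how do we make it non-blocking? Run man socket to read the description of the socket system call. As the excerpt below shows, the kernel can make a socket non-blocking: the SOCK_NONBLOCK flag of socket() has been supported since Linux 2.6.27 (before that, a separate fcntl call was needed to set O_NONBLOCK).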

NIO

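Passing one extra flag when creating the socket makes it non-blocking. Reading the fd is then non-blocking as well: if data is available it is read right away, and if not the call returns immediately instead of waiting. This is what is called NIO, and the name is read in two ways: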

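From the application and library point of view, N means new: a new I/O system with new components such as channel and buffer
From the operating system kernel's point of view, N means non-blocking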
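Both readings of the name meet in java.nio. The sketch below is a minimal illustration (the class NonBlockingReadSketch and the method readWithoutData are invented names): it uses the channel and buffer components and switches the accepted channel to non-blocking mode, so a read with no data available returns 0 right away instead of waiting.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

class NonBlockingReadSketch {
    /** Returns what read() reports on a connected, non-blocking channel that has no data yet. */
    static int readWithoutData() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel accepted = server.accept()) {
                // The switch from the BIO behaviour: reads on this channel no longer wait for data.
                accepted.configureBlocking(false);
                // The client has sent nothing, so read() returns 0 bytes immediately.
                return accepted.read(ByteBuffer.allocate(64));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read returned " + readWithoutData());
    }
}
```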

Compared with BIO, this model removes the need to spawn a large number of threads, so it looks like a clear step up. What is the catch?

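You may have heard of the C10K problem. Say there are 10,000 clients: every pass of the loop then makes 10,000 calls into the kernel to check whether data is available, so each pass costs O(n) system calls. Yet perhaps only a handful of those 10,000 connections have data or are otherwise ready, so the vast majority of the calls accomplish nothing. That is a waste of resources.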
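The wasted work can be made visible with a small polling loop. The sketch below is an illustration under assumptions: the class PollingSweepSketch and the method sweep are invented names, 20 local connections stand in for the 10,000 clients, and each read() attempt on a non-blocking channel is taken to cost roughly one system call.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;

class PollingSweepSketch {
    /** Returns {read() attempts in one sweep, connections that actually had data}. */
    static int[] sweep(int clients, int senders) {
        List<SocketChannel> clientSide = new ArrayList<>();
        List<SocketChannel> serverSide = new ArrayList<>();
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), clients);
            for (int i = 0; i < clients; i++) {
                clientSide.add(SocketChannel.open(server.getLocalAddress()));
                SocketChannel accepted = server.accept();
                accepted.configureBlocking(false);
                serverSide.add(accepted);
            }
            // Only a few clients send anything (one byte each).
            for (int i = 0; i < senders; i++) {
                clientSide.get(i).write(ByteBuffer.wrap(new byte[] {1}));
            }
            ByteBuffer buffer = ByteBuffer.allocate(16);
            int withData = 0;
            int attemptsInSweep;
            long deadline = System.nanoTime() + 2_000_000_000L;
            do {
                attemptsInSweep = 0;
                for (SocketChannel channel : serverSide) {
                    buffer.clear();
                    attemptsInSweep++;               // asked the kernel once for this connection
                    if (channel.read(buffer) > 0) {
                        withData++;                  // this connection really had data
                    }
                }
                // Sweep again only if some sent bytes have not arrived yet.
            } while (withData < senders && System.nanoTime() < deadline);
            return new int[] {attemptsInSweep, withData};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            for (SocketChannel c : clientSide) { try { c.close(); } catch (IOException ignored) { } }
            for (SocketChannel c : serverSide) { try { c.close(); } catch (IOException ignored) { } }
        }
    }

    public static void main(String[] args) {
        int[] result = sweep(20, 2);
        System.out.println("reads per sweep=" + result[0] + ", connections with data=" + result[1]);
    }
}
```

Every sweep asks about all 20 connections even though only 2 have anything to read, which is the O(n) cost described above.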

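How can this be solved? Can the O(n) cost be brought down? That depends on what the kernel offers. Take a look at the select system call (man select), which is described as follows: it lets a program monitor multiple file descriptors and wait until one or more of them become ready.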

SYNOPSIS
       /* According to POSIX.1-2001, POSIX.1-2008 */
       #include <sys/select.h>

       /* According to earlier standards */
       #include <sys/time.h>
       #include <sys/types.h>
       #include <unistd.h>
       int select(int nfds, fd_set *readfds, fd_set *writefds,
                  fd_set *exceptfds, struct timeval *timeout);

       void FD_CLR(int fd, fd_set *set);
       int  FD_ISSET(int fd, fd_set *set);
       void FD_SET(int fd, fd_set *set);
       void FD_ZERO(fd_set *set);

       #include <sys/select.h>

       int pselect(int nfds, fd_set *readfds, fd_set *writefds,
                   fd_set *exceptfds, const struct timespec *timeout,
                   const sigset_t *sigmask);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       pselect(): _POSIX_C_SOURCE >= 200112L

DESCRIPTION
       select() and pselect() allow a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some class of I/O operation (e.g., input possible).
       A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2) without blocking, or a sufficiently small write(2)).

       select() can monitor only file descriptors numbers that are less than FD_SETSIZE; poll(2) does not have this limitation.  See BUGS.

       The operation of select() and pselect() is identical, other than these three differences:

       (i)    select() uses a timeout that is a struct timeval (with seconds and microseconds), while pselect() uses a struct timespec (with seconds and nanoseconds).

       (ii)   select() may update the timeout argument to indicate how much time was left.  pselect() does not change this argument.

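The diagram below makes this easier to follow. The program first calls select and passes in the fds (with 10,000 file descriptors, all 10,000 are handed to the kernel in this one system call; in practice select() only accepts descriptors below FD_SETSIZE, as the excerpt above notes), and the kernel reports which of them are ready. The program then reads only from those ready descriptors, so the number of system calls is roughly O(m), where m is the number of ready descriptors. With plain NIO the program asks about every client connection individually; now all client connections are handed to one tool, select (the multiplexer). Many client connections thus share a single system call that returns the ready ones, and the program then performs the reads and writes itself.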

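Compared with the O(n) system calls of plain NIO, multiplexing cuts the number of system calls to O(m). Inside the kernel, however, an O(n) traversal of all the fds is still needed, so two problems remain: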

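select has to be passed the full set of fds on every call (for example, 100,000 fds)
The kernel still has to actively traverse the fds to find out which are readable or writable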
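To put numbers on the O(n) versus O(m) comparison, here is a small worked calculation. It is only a counting model under stated assumptions (one call per connection when polling; one select call plus one read per ready connection when multiplexing), and the class SyscallCountSketch with its methods is invented for the example.

```java
class SyscallCountSketch {
    /** Non-blocking polling: one read attempt per connection, ready or not. */
    static long pollingCalls(long connections) {
        return connections;
    }

    /** select-style multiplexing: one call to ask which fds are ready, then one read per ready fd. */
    static long multiplexedCalls(long readyConnections) {
        return 1 + readyConnections;
    }

    public static void main(String[] args) {
        long n = 10_000; // connections, as in the C10K example
        long m = 5;      // connections that are actually ready (an assumed figure)
        System.out.println("polling: " + pollingCalls(n) + " calls per pass");
        System.out.println("select:  " + multiplexedCalls(m) + " calls per pass");
    }
}
```

The model counts only calls made by the program. The two problems listed above, copying the fd set on every call and the traversal inside the kernel, are costs it does not capture.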

Is there a way around this? There certainly is. Read on.

epoll multiplexing

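Suppose the kernel could set aside a region of memory, and each time the program accepts a connection it stores that connection's file descriptor there. The fds would then not have to be passed again on later calls, which removes the repeated copying. But how do we know which of these fds are readable or writable? The old approach is active traversal: with 10,000 fds, that means checking all 10,000. How can this traversal be made faster? The answer is an event-driven approach, and this is where epoll comes in. Run man epoll to read its documentation: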

 The  epoll  API  performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them.  The epoll API can be used either as an edge-triggered or a
       level-triggered interface and scales well to large numbers of watched file descriptors.  The following system calls are provided to create and manage an epoll instance:

       *  epoll_create(2) creates a new epoll instance and returns a file descriptor referring to that instance.  (The more recent epoll_create1(2) extends the functionality of epoll_create(2).)

       *  Interest in particular file descriptors is then registered via epoll_ctl(2).  The set of file descriptors currently registered on an epoll instance is sometimes called an epoll set.

       *  epoll_wait(2) waits for I/O events, blocking the calling thread if no events are currently available.

As the excerpt shows, epoll revolves around three system calls: epoll_create, epoll_ctl, and epoll_wait.

epoll_create

epoll_create: the kernel creates an epoll instance data structure and returns a file descriptor, epfd. This descriptor refers to the kernel memory region described above.

DESCRIPTION
       epoll_create() creates a new epoll(7) instance.  Since Linux 2.6.8, the size argument is ignored, but must be greater than zero; see NOTES below.

       epoll_create()  returns a file descriptor referring to the new epoll instance.  This file descriptor is used for all the subsequent calls to the epoll interface.  When no longer required, the
       file descriptor returned by epoll_create() should be closed by using close(2).  When all file descriptors referring to an epoll instance have been closed, the kernel destroys the instance and
       releases the associated resources for reuse.

   epoll_create1()
       If  flags  is  0,  then, other than the fact that the obsolete size argument is dropped, epoll_create1() is the same as epoll_create().  The following value can be included in flags to obtain
       different behavior:

       EPOLL_CLOEXEC
              Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor.  See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.

RETURN VALUE
       On success, these system calls return a nonnegative file descriptor.  On error, -1 is returned, and errno is set to indicate the error.

epoll_ctl

Registers a file descriptor fd together with the epoll_event it is interested in, removes it, or modifies the epoll_event being monitored.

DESCRIPTION
       This  system call performs control operations on the epoll(7) instance referred to by the file descriptor epfd.  It requests that the operation op be performed for the target file descriptor,
       fd.

       Valid values for the op argument are:

       EPOLL_CTL_ADD
              Register the target file descriptor fd on the epoll instance referred to by the file descriptor epfd and associate the event event with the internal file linked to fd.

       EPOLL_CTL_MOD
              Change the event event associated with the target file descriptor fd.

       EPOLL_CTL_DEL
              Remove (deregister) the target file descriptor fd from the epoll instance referred to by epfd.  The event is ignored and can be NULL (but see BUGS below).

epoll_wait

Blocks until a registered event occurs, returns the number of events, and writes the triggered events into the caller's array of epoll_event.

SYNOPSIS
       #include <sys/epoll.h>

       int epoll_wait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout);
       int epoll_pwait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout,
                      const sigset_t *sigmask);

DESCRIPTION
       The  epoll_wait()  system  call  waits  for events on the epoll(7) instance referred to by the file descriptor epfd.  The memory area pointed to by events will contain the events that will be
       available for the caller.  Up to maxevents are returned by epoll_wait().  The maxevents argument must be greater than zero.

       The timeout argument specifies the number of milliseconds that epoll_wait() will block.  Time is measured against the CLOCK_MONOTONIC clock.  The call will block until either:

       *  a file descriptor delivers an event;

       *  the call is interrupted by a signal handler; or

       *  the timeout expires.

The whole run of the program

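socket fd5                         create the socket and obtain fd5
bind 8090                          bind to port 8090
listen fd5                         listen on fd5
epoll_create fd8                   create a region in the kernel, referred to by fd8
epoll_ctl(fd8, add fd5, accept)    put fd5 into the fd8 region, watching for incoming connections
epoll_wait(fd8)                    make the program wait until a file descriptor becomes ready
accept fd5 -> fd6                  a new client arrives and gets file descriptor fd6
epoll_ctl(fd8, add fd6)            put fd6 into fd8's kernel region too; with many clients, the region holds many fds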
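The same sequence can be followed from Java through java.nio.channels.Selector, which on Linux is typically backed by epoll. The sketch below is an illustration, not a trace of what the JVM does: the class SelectorWalkthroughSketch and the method serveOneMessage are invented names, the "cf." comments are analogies to the steps above rather than a claim about the exact system calls issued, and a local client is included only so the example finishes on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

class SelectorWalkthroughSketch {
    /** Serves one local client through a Selector and returns the text it sent. */
    static String serveOneMessage(String message) {
        byte[] expected = message.getBytes(StandardCharsets.UTF_8);
        ByteBuffer received = ByteBuffer.allocate(expected.length);
        try (Selector selector = Selector.open();                        // cf. epoll_create
             ServerSocketChannel server = ServerSocketChannel.open()) {  // cf. socket
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0)); // cf. bind + listen
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);           // cf. epoll_ctl: add the listening fd
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                client.write(ByteBuffer.wrap(expected));
                // cf. epoll_wait: block (here at most 2 s) until a registered channel is ready
                while (received.hasRemaining() && selector.select(2000) > 0) {
                    Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                    while (keys.hasNext()) {
                        SelectionKey key = keys.next();
                        keys.remove();
                        if (key.isAcceptable()) {
                            SocketChannel accepted = server.accept();     // cf. accept: a new fd
                            if (accepted != null) {
                                accepted.configureBlocking(false);
                                accepted.register(selector, SelectionKey.OP_READ); // cf. epoll_ctl: add the client fd
                            }
                        } else if (key.isReadable()
                                && ((SocketChannel) key.channel()).read(received) < 0) {
                            key.channel().close();                        // peer closed the connection
                        }
                    }
                }
            }
            for (SelectionKey key : selector.keys()) {
                key.channel().close();                                    // also closes the accepted channel
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(received.array(), 0, received.position(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(serveOneMessage("hello"));
    }
}
```

A single thread handles both the listening socket and the client connection here: it only touches a channel after the selector has reported it ready.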

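How is the event-driven part implemented? A full answer requires the concept of interrupts; here is only a brief sketch (I hope to devote a separate article to it later). When a client sends data to the server's network card, the card copies the data into memory via DMA (direct memory access) and raises an interrupt on the CPU. The CPU then runs the kernel's interrupt handling, the kernel works out which fd the data belongs to, and the ready fd is moved from area a to area b in the diagram. The program then simply picks up the ready fds from epoll_wait and reads or writes them.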
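The bookkeeping behind this can be pictured with a toy model in plain Java. It is a user-space illustration only, not how the kernel is written: the class ReadyListModel and its methods are invented, the registered set stands for area a and the ready set for area b, and onDataArrived plays the role of the kernel's reaction to the interrupt.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ReadyListModel {
    private final Set<Integer> registered = new HashSet<>();  // area a: every fd added via ctlAdd
    private final Set<Integer> ready = new LinkedHashSet<>(); // area b: fds with pending events

    /** Toy counterpart of epoll_ctl(ADD): remember the fd once instead of passing it on every call. */
    void ctlAdd(int fd) {
        registered.add(fd);
    }

    /** Called when data for fd arrives; only registered fds are marked ready. */
    void onDataArrived(int fd) {
        if (registered.contains(fd)) {
            ready.add(fd);
        }
    }

    /** Toy counterpart of epoll_wait: hands back the ready fds without scanning the registered set. */
    List<Integer> waitForEvents() {
        List<Integer> result = new ArrayList<>(ready);
        ready.clear();
        return result;
    }

    public static void main(String[] args) {
        ReadyListModel model = new ReadyListModel();
        model.ctlAdd(5);
        model.ctlAdd(6);
        model.ctlAdd(7);
        model.onDataArrived(6);
        model.onDataArrived(7);
        model.onDataArrived(6);   // already marked ready, no duplicate entry
        model.onDataArrived(99);  // never registered, ignored
        System.out.println(model.waitForEvents()); // prints [6, 7]
        System.out.println(model.waitForEvents()); // prints []
    }
}
```

The work done in waitForEvents depends on how many fds are ready, not on how many are registered, which is the improvement over the traversal that select performs.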

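To sum up briefly: epoll makes fuller use of the hardware and does not waste CPU. From BIO to NIO to epoll multiplexing, each new model emerged to solve the problems of the one before it. This is true not only of technology but of everything. That is all for today; I hope I keep up the effort!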

Where short blades clash in a narrow lane, men are cut down like grass without a sound. - Shen Mingchen