Building a Web Application from Scratch, Part 1

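Translator's Preface

Developing web applications in Python is very convenient, and there are many mature frameworks to choose from, such as Flask and Django. This series, however, builds everything from scratch, which is a good way to learn the HTTP protocol and many of the underlying principles, and that helps a great deal in understanding web application development in depth. The series currently has four parts; this is a translation of the first.

I am going to build a web application (and its web server) from scratch in Python, and this is the first post in the series. The only dependency for the whole series is the Python standard library, and I will ignore the WSGI standard.

Without further ado, let's get started!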

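The Web Server

First, we will write the HTTP server that will run our web application. Before that, though, we need to spend a little time on how the HTTP protocol works.

How HTTP Works

Simply put, HTTP clients connect to HTTP servers over the network and send them requests containing string data. The server parses each request and sends a response back to the client. The full protocol, including the request and response formats, is described in detail in RFC 2616, but I will explain it informally here so you don't have to read the whole document.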

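Request Format

A request is represented as a series of lines separated by \r\n. The first line is called the "request line". It consists of the HTTP method, followed by a space, followed by the path of the requested file, followed by another space, then the HTTP protocol version the client speaks, and finally a carriage return (\r) and a line feed (\n):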

GET /some-path HTTP/1.1\r\n

After the request line come zero or more request headers. Each header consists of a header name, followed by a colon, then an optional value, and finally \r\n:

Host: example.com\r\n
Accept: text/html\r\n

An empty line marks the end of the headers:

\r\n

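Finally, the request may contain a body: an arbitrary payload that is sent to the server along with the request.

Putting all of this together, here is a simple GET request: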

GET / HTTP/1.1\r\n
Host: example.com\r\n
Accept: text/html\r\n
\r\n

And here is a POST request with a body:

POST / HTTP/1.1\r\n
Host: example.com\r\n
Accept: application/json\r\n
Content-type: application/json\r\n
Content-length: 2\r\n
\r\n
{}
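
As a rough illustration of the format described above, the sketch below assembles a raw request from its parts. The helper name build_request is hypothetical and is not part of the article's server; it simply joins the request line and headers with \r\n, derives Content-length from the body, and ends the headers with an empty line.

```python
import typing


def build_request(
    method: str,
    path: str,
    headers: typing.Mapping[str, str],
    body: bytes = b"",
) -> bytes:
    """Assemble a raw HTTP/1.1 request (hypothetical helper for illustration)."""
    lines = [f"{method} {path} HTTP/1.1"]
    lines.extend(f"{name}: {value}" for name, value in headers.items())
    if body:
        # The length is counted in bytes, not characters.
        lines.append(f"Content-length: {len(body)}")

    # Every line ends in CRLF and an empty line terminates the headers.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


request = build_request(
    "POST",
    "/",
    {"Host": "example.com", "Content-type": "application/json"},
    body=b"{}",
)
print(request)
```

Deriving Content-length from the body, rather than typing it by hand, keeps the header and the payload from drifting apart.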

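Response Format

Responses, like requests, are made up of lines separated by \r\n. The first line of a response is called the "status line". It contains the HTTP protocol version, followed by a space, followed by the response status code, followed by another space, then the reason phrase for that status code, and finally \r\n: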

HTTP/1.1 200 OK\r\n

After the status line come the response headers, then an empty line, and then an optional response body:

HTTP/1.1 200 OK\r\n
Content-type: text/html\r\n
Content-length: 15\r\n
\r\n
<h1>Hello!</h1>
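
To make the layout concrete, here is a minimal sketch that takes the response above apart again. The function name parse_response is an assumption made for this example, and the sketch assumes the whole response is already in memory as a single bytes object, which a real client cannot rely on.

```python
import typing

RAW_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-type: text/html\r\n"
    b"Content-length: 15\r\n"
    b"\r\n"
    b"<h1>Hello!</h1>"
)


def parse_response(raw: bytes) -> typing.Tuple[int, str, typing.Dict[str, str], bytes]:
    """Split a complete raw response into (code, reason, headers, body)."""
    # The first empty line separates the status line and headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("ascii").split("\r\n")

    # The reason phrase may itself contain spaces ("Not Found"), so split twice only.
    _version, code, reason = status_line.split(" ", 2)

    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.lower()] = value.strip()

    return int(code), reason, headers, body


code, reason, headers, body = parse_response(RAW_RESPONSE)
print(code, reason, headers, body)
```

Note that the Content-length value matches the number of bytes in the body.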

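A Simple Server

Based on what we know about the protocol so far, let's write a server that sends the same response no matter what request it receives.

We need to create a socket, bind it to an address, and then start listening for connections: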

import socket

HOST = "127.0.0.1"
PORT = 9000

# By default, socket.socket creates TCP sockets.
with socket.socket() as server_sock:
    # This tells the kernel to reuse sockets that are in `TIME_WAIT` state.
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

    # This tells the socket what address to bind to.
    server_sock.bind((HOST, PORT))

    # 0 is the number of pending connections the socket may have before
    # new connections are refused.  Since this server is going to process
    # one connection at a time, we want to refuse any additional connections.
    server_sock.listen(0)
    print(f"Listening on {HOST}:{PORT}...")

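If you run the code now, it prints that it is listening on 127.0.0.1:9000 and then exits immediately. To handle incoming connections, we need to call the socket's accept method, which blocks the process until a client connects to our server.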

with socket.socket() as server_sock:
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((HOST, PORT))
    server_sock.listen(0)
    print(f"Listening on {HOST}:{PORT}...")

    client_sock, client_addr = server_sock.accept()
    print(f"New connection from {client_addr}.")

Once we have a socket connected to the client, we can start talking to it. Using the sendall method, we send the client our response:

RESPONSE = b"""\
HTTP/1.1 200 OK
Content-type: text/html
Content-length: 15

<h1>Hello!</h1>""".replace(b"\n", b"\r\n")

with socket.socket() as server_sock:
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((HOST, PORT))
    server_sock.listen(0)
    print(f"Listening on {HOST}:{PORT}...")

    client_sock, client_addr = server_sock.accept()
    print(f"New connection from {client_addr}.")
    with client_sock:
        client_sock.sendall(RESPONSE)

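If you run the code now and visit http://127.0.0.1:9000 in your browser, you will see the string "Hello!". Unfortunately, the server exits right after sending the response, so refreshing the page fails. Let's fix that: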

RESPONSE = b"""\
HTTP/1.1 200 OK
Content-type: text/html
Content-length: 15

<h1>Hello!</h1>""".replace(b"\n", b"\r\n")

with socket.socket() as server_sock:
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((HOST, PORT))
    server_sock.listen(0)
    print(f"Listening on {HOST}:{PORT}...")

    while True:
        client_sock, client_addr = server_sock.accept()
        print(f"New connection from {client_addr}.")
        with client_sock:
            client_sock.sendall(RESPONSE)

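At this point we have a web server that can serve a simple HTML page in just 25 lines of code. Not too bad!

A File Server

Let's extend the HTTP server so that it can serve files from disk.

Request Abstraction

Before making that change, we need to be able to read and parse the requests coming from the client. Since we know that request data is a series of lines, each terminated by \r\n, let's write a generator function that reads data from a socket and yields each individual line: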

import typing


def iter_lines(sock: socket.socket, bufsize: int = 16_384) -> typing.Generator[bytes, None, bytes]:
    """Given a socket, read all the individual CRLF-separated lines
    and yield each one until an empty one is found.  Returns the
    remainder after the empty line.
    """
    buff = b""
    while True:
        data = sock.recv(bufsize)
        if not data:
            return b""

        buff += data
        while True:
            try:
                i = buff.index(b"\r\n")
                line, buff = buff[:i], buff[i + 2:]
                if not line:
                    return buff

                yield line
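            except ValueError:
                # bytes.index raises ValueError, not IndexError, when the
                # buffer holds no complete line yet, so read more data.
                break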

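The code above may look a bit daunting, but all it does is read as much data as it can from the socket into a buffer and repeatedly split the buffered data into individual lines, yielding one line at a time. Once it finds an empty line, it returns whatever data is left in the buffer after that line.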
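
The sketch below shows one way to exercise this kind of generator without a network connection, and how the value it returns can be picked up from StopIteration. FakeSocket and read_head are hypothetical names introduced for this example, and the copy of iter_lines here is a variant that uses bytes.find instead of index so that it needs no exception handling.

```python
import typing


class FakeSocket:
    """Hypothetical stand-in for a socket that replays fixed chunks."""

    def __init__(self, chunks: typing.List[bytes]) -> None:
        self._chunks = list(chunks)

    def recv(self, bufsize: int) -> bytes:
        # Once the chunks run out, behave like a closed connection.
        return self._chunks.pop(0) if self._chunks else b""


def iter_lines(sock, bufsize: int = 16_384) -> typing.Generator[bytes, None, bytes]:
    """Yield CRLF-separated lines; return the data after the empty line."""
    buff = b""
    while True:
        data = sock.recv(bufsize)
        if not data:
            return b""

        buff += data
        while True:
            i = buff.find(b"\r\n")
            if i == -1:
                break  # No complete line buffered yet; read more data.

            line, buff = buff[:i], buff[i + 2:]
            if not line:
                return buff  # Empty line: what remains is the start of the body.

            yield line


def read_head(sock) -> typing.Tuple[typing.List[bytes], bytes]:
    """Collect every yielded line plus the generator's return value."""
    lines = []
    gen = iter_lines(sock)
    while True:
        try:
            lines.append(next(gen))
        except StopIteration as stop:
            # A generator's return value travels on StopIteration.value.
            return lines, stop.value


# The chunks deliberately split a line and the final CRLF across reads.
sock = FakeSocket([b"POST / HTTP/1.1\r\nHo", b"st: example.com\r\n\r", b"\n{}"])
lines, remainder = read_head(sock)
print(lines)
print(remainder)
```

A plain for loop over the generator discards the return value, which is why read_head drives it with next instead.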

Using iter_lines, we can start printing the requests we read from our clients:

with socket.socket() as server_sock:
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((HOST, PORT))
    server_sock.listen(0)
    print(f"Listening on {HOST}:{PORT}...")

    while True:
        client_sock, client_addr = server_sock.accept()
        print(f"New connection from {client_addr}.")
        with client_sock:
            for request_line in iter_lines(client_sock):
                print(request_line)

            client_sock.sendall(RESPONSE)

If you run the code now and visit http://127.0.0.1:9000 in your browser, you will see something like the following in your console:

Received connection from ('127.0.0.1', 62086)...
b'GET / HTTP/1.1'
b'Host: localhost:9000'
b'Connection: keep-alive'
b'Cache-Control: max-age=0'
b'Upgrade-Insecure-Requests: 1'
b'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'
b'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
b'Accept-Encoding: gzip, deflate, br'
b'Accept-Language: en-US,en;q=0.9,ro;q=0.8'

Pretty neat! Let's abstract this into a Request class:

import typing


class Request(typing.NamedTuple):
    method: str
    path: str
    headers: typing.Mapping[str, str]

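For now, the Request class only knows about the method, the path, and the headers. Later on, we will add support for query string parameters and for reading the request body.

To encapsulate the logic needed to build a request, we add a class method called from_socket to the Request class: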

class Request(typing.NamedTuple):
    method: str
    path: str
    headers: typing.Mapping[str, str]

    @classmethod
    def from_socket(cls, sock: socket.socket) -> "Request":
        """Read and parse the request from a socket object.

        Raises:
          ValueError: When the request cannot be parsed.
        """
        lines = iter_lines(sock)

        try:
            request_line = next(lines).decode("ascii")
        except StopIteration:
            raise ValueError("Request line missing.")

        try:
            method, path, _ = request_line.split(" ")
        except ValueError:
            raise ValueError(f"Malformed request line {request_line!r}.")

        headers = {}
        for line in lines:
            try:
                name, _, value = line.decode("ascii").partition(":")
                headers[name.lower()] = value.lstrip()
            except ValueError:
                raise ValueError(f"Malformed header line {line!r}.")

        return cls(method=method.upper(), path=path, headers=headers)

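This uses the iter_lines function we defined earlier to read the request line. From that line it extracts the method and the path, and then it reads and parses each header. Finally, it builds a Request object and returns it. Plugged into our server loop, it looks like this: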
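
The two parsing steps can also be tried out in isolation. The sketch below is an illustration only: parse_request_line and parse_header are hypothetical helpers, not part of the article's Request class. One detail worth noting is that str.partition never raises, so this sketch checks the separator explicitly to detect a header line without a colon.

```python
import typing


def parse_request_line(line: bytes) -> typing.Tuple[str, str, str]:
    """Split a request line into (method, path, version)."""
    # Unpacking raises ValueError unless there are exactly three parts.
    method, path, version = line.decode("ascii").split(" ")
    return method.upper(), path, version


def parse_header(line: bytes) -> typing.Tuple[str, str]:
    """Split a header line into a lower-cased name and its value."""
    # partition splits at the first colon only, so values such as
    # "localhost:9000" keep their own colons.
    name, sep, value = line.decode("ascii").partition(":")
    if not sep:
        raise ValueError(f"Malformed header line {line!r}.")
    return name.lower(), value.strip()


print(parse_request_line(b"get /index.html HTTP/1.1"))
print(parse_header(b"Host: localhost:9000"))

try:
    parse_request_line(b"hello")
except ValueError as e:
    print(f"Rejected: {e}")
```

Lower-casing header names makes later lookups case-insensitive, which matches how HTTP treats header names.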

with socket.socket() as server_sock:
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((HOST, PORT))
    server_sock.listen(0)
    print(f"Listening on {HOST}:{PORT}...")

    while True:
        client_sock, client_addr = server_sock.accept()
        print(f"Received connection from {client_addr}...")
        with client_sock:
            request = Request.from_socket(client_sock)
            print(request)
            client_sock.sendall(RESPONSE)

If you connect to the server now, you will see something like this:

Request(method='GET', path='/', headers={'host': 'localhost:9000', 'user-agent': 'curl/7.54.0', 'accept': '*/*'})

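Because from_socket raises an exception in certain cases, the server may crash if you send it an invalid request. To simulate such a request, you can connect to the server with telnet and send it some bogus data: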

> telnet 127.0.0.1 9000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
Connection closed by foreign host.

Sure enough, the server crashed:

Received connection from ('127.0.0.1', 62404)...
Traceback (most recent call last):
  File "server.py", line 53, in parse
    request_line = next(lines).decode("ascii")
ValueError: not enough values to unpack (expected 3, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "server.py", line 82, in <module>
    with client_sock:
  File "server.py", line 55, in parse
    raise ValueError("Request line missing.")
ValueError: Malformed request line 'hello'.

To handle this more gracefully, we wrap the call to from_socket in a try-except block and send the client a "400 Bad Request" response whenever we receive a malformed request:

BAD_REQUEST_RESPONSE = b"""\
HTTP/1.1 400 Bad Request
Content-type: text/plain
Content-length: 11

Bad Request""".replace(b"\n", b"\r\n")

with socket.socket() as server_sock:
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((HOST, PORT))
    server_sock.listen(0)
    print(f"Listening on {HOST}:{PORT}...")

    while True:
        client_sock, client_addr = server_sock.accept()
        print(f"Received connection from {client_addr}...")
        with client_sock:
            try:
                request = Request.from_socket(client_sock)
                print(request)
                client_sock.sendall(RESPONSE)
            except Exception as e:
                print(f"Failed to parse request: {e}")
                client_sock.sendall(BAD_REQUEST_RESPONSE)

If we try to crash the server again, the client now gets a response and the server keeps running:

~> telnet 127.0.0.1 9000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
HTTP/1.1 400 Bad Request
Content-type: text/plain
Content-length: 11

Bad RequestConnection closed by foreign host.

Now we are ready to start implementing the file-serving part. First, let's define a default "404 Not Found" response:

NOT_FOUND_RESPONSE = b"""\
HTTP/1.1 404 Not Found
Content-type: text/plain
Content-length: 9

Not Found""".replace(b"\n", b"\r\n")

#...

with socket.socket() as server_sock:
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((HOST, PORT))
    server_sock.listen(0)
    print(f"Listening on {HOST}:{PORT}...")

    while True:
        client_sock, client_addr = server_sock.accept()
        print(f"Received connection from {client_addr}...")
        with client_sock:
            try:
                request = Request.from_socket(client_sock)
                print(request)
                client_sock.sendall(NOT_FOUND_RESPONSE)
            except Exception as e:
                print(f"Failed to parse request: {e}")
                client_sock.sendall(BAD_REQUEST_RESPONSE)

In addition, let's add a "405 Method Not Allowed" response, since we are only going to handle GET requests:

METHOD_NOT_ALLOWED_RESPONSE = b"""\
HTTP/1.1 405 Method Not Allowed
Content-type: text/plain
Content-length: 18

Method Not Allowed""".replace(b"\n", b"\r\n")

Let's define a SERVER_ROOT constant, which represents the directory the server serves files from, and a serve_file function:

import mimetypes
import os
import socket
import typing

SERVER_ROOT = os.path.abspath("www")

FILE_RESPONSE_TEMPLATE = """\
HTTP/1.1 200 OK
Content-type: {content_type}
Content-length: {content_length}

""".replace("\n", "\r\n")


def serve_file(sock: socket.socket, path: str) -> None:
    """Given a socket and the relative path to a file (relative to
    SERVER_ROOT), send that file to the socket if it exists.  If the
    file doesn't exist, send a "404 Not Found" response.
    """
    if path == "/":
        path = "/index.html"

    abspath = os.path.normpath(os.path.join(SERVER_ROOT, path.lstrip("/")))
    if not abspath.startswith(SERVER_ROOT):
        sock.sendall(NOT_FOUND_RESPONSE)
        return

    try:
        with open(abspath, "rb") as f:
            stat = os.fstat(f.fileno())
            content_type, encoding = mimetypes.guess_type(abspath)
            if content_type is None:
                content_type = "application/octet-stream"

            if encoding is not None:
                content_type += f"; charset={encoding}"

            response_headers = FILE_RESPONSE_TEMPLATE.format(
                content_type=content_type,
                content_length=stat.st_size,
            ).encode("ascii")

            sock.sendall(response_headers)
            sock.sendfile(f)
    except FileNotFoundError:
        sock.sendall(NOT_FOUND_RESPONSE)
        return

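serve_file takes the client socket and the path to a file. It then tries to resolve that path to a real file inside SERVER_ROOT, returning "not found" for anything that falls outside SERVER_ROOT. Next it tries to open the file, works out its MIME type and size (using os.fstat), builds the response headers, and writes the file to the socket using the sendfile system call. If the file cannot be found on disk, it sends the "not found" response.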
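
The path check is the security-sensitive part of this function, so it may help to look at it on its own. The sketch below is a stricter variant rather than the article's code: resolve_path is a hypothetical helper, and it compares whole path components with os.path.commonpath instead of using a string prefix test, so that a sibling directory whose name merely starts with the root's name is rejected as well.

```python
import os
import typing

SERVER_ROOT = os.path.abspath("www")


def resolve_path(root: str, request_path: str) -> typing.Optional[str]:
    """Map a request path to a file path under root, or None if it escapes."""
    if request_path == "/":
        request_path = "/index.html"

    # normpath collapses ".." segments before the containment check.
    abspath = os.path.normpath(os.path.join(root, request_path.lstrip("/")))

    # commonpath works on whole components, so "www-private" is not
    # mistaken for something inside "www".
    if os.path.commonpath([root, abspath]) != root:
        return None

    return abspath


print(resolve_path(SERVER_ROOT, "/"))
print(resolve_path(SERVER_ROOT, "/css/site.css"))
print(resolve_path(SERVER_ROOT, "/../secret.txt"))
print(resolve_path(SERVER_ROOT, "/../www-private/notes.txt"))
```

The last request is the case where a plain prefix test would be fooled, because the normalized path still begins with the text of the root directory.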

With serve_file added, our server loop looks like this:

with socket.socket() as server_sock:
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((HOST, PORT))
    server_sock.listen(0)
    print(f"Listening on {HOST}:{PORT}...")

    while True:
        client_sock, client_addr = server_sock.accept()
        print(f"Received connection from {client_addr}...")
        with client_sock:
            try:
                request = Request.from_socket(client_sock)
                if request.method != "GET":
                    client_sock.sendall(METHOD_NOT_ALLOWED_RESPONSE)
                    continue

                serve_file(client_sock, request.path)
            except Exception as e:
                print(f"Failed to parse request: {e}")
                client_sock.sendall(BAD_REQUEST_RESPONSE)

If you add a file at www/index.html, next to your server.py file, and then visit http://localhost:9000, you will see the contents of that file.

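Winding Down

That's it for Part 1. In Part 2, we will extract Server and Response abstractions and look at how to handle multiple concurrent requests. If you want the complete source code, you can find it here.

Original article: WEB APPLICATION FROM SCRATCH, PART I

  • Author: Bogdan Popa
  • Translator: noONE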
