Neural Networks From Scratch, Part II: Perceptrons

Original: Deep Learning From Scratch II: Perceptrons - deep ideas

Translator: 孙一萌

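Perceptrons

A motivating example

Perceptrons are a miniature form of neural network and a basic building block of more complex architectures.

Before going into the details, let's look at a motivating example. Suppose we have a dataset of 100 points in the plane, half of them red and the other half blue.

Run the code below to see how the points are distributed.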

import numpy as np
import matplotlib.pyplot as plt

# Create red points centered at (-2, -2)
red_points = np.random.randn(50, 2) - 2*np.ones((50, 2))

# Create blue points centered at (2, 2)
blue_points = np.random.randn(50, 2) + 2*np.ones((50, 2))

# Plot the red and blue points
plt.scatter(red_points[:,0], red_points[:,1], color='red')
plt.scatter(blue_points[:,0], blue_points[:,1], color='blue')
plt.show()

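As the plot shows, the red points are centered at $(-2, -2)$ and the blue points are centered at $(2, 2)$. Looking at the data, do you think there is a way to decide whether a given point is red or blue?

If asked what color the point $(3, 2)$ is, you would immediately answer blue. Even though this point is not part of the data above, we can infer its color from the region it lies in (the blue one).

But is there a more general way to conclude that blue is more likely? Clearly, we can draw the line $y = -x$ on the plot above, which neatly separates the space into a red region and a blue region.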

# Plot the red and blue points again (reusing the arrays created above)
plt.scatter(red_points[:,0], red_points[:,1], color='red')
plt.scatter(blue_points[:,0], blue_points[:,1], color='blue')

# Plot the line y = -x
x_axis = np.linspace(-4, 4, 100)
y_axis = -x_axis
plt.plot(x_axis, y_axis)
plt.show()

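We can represent this line implicitly with a weight vector $w$ and a bias $b$: a point $x$ lies on the line exactly when $w^T x + b = 0$.

For the example above, we get $w = (1,1)^T$ and $b = 0$, so $w^T x + b$ equals $(1,1) \cdot x$.

The line can therefore be written as:

$(1,1) \cdot x = 0$

Now, to decide whether a point is red or blue, we only need to check whether it lies above or below the line. Plug the point $x$ into $w^T x + b$ and look at the sign of the result: if it is positive, $x$ lies above the line; if it is negative, $x$ lies below it.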

Take the point $(3,2)$ mentioned above:
$\pmatrix{ 1 & 1 } \cdot \pmatrix{ 3 \cr 2 } = 5$
Since $5 > 0$, the point lies above the line and is therefore blue.
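The sign test above can be checked numerically. The following is a minimal sketch, assuming a fixed random seed (not part of the original code) so that the run is reproducible. It evaluates $w^T x + b$ for the point $(3, 2)$ and then counts how many of the generated points fall on the expected side of the line $y = -x$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption made for reproducibility

# Same construction as in the text: 50 red points around (-2, -2), 50 blue around (2, 2)
red_points = rng.standard_normal((50, 2)) - 2
blue_points = rng.standard_normal((50, 2)) + 2

w = np.array([1.0, 1.0])  # weight vector of the line y = -x
b = 0.0                   # bias

# The single point from the text
value = w @ np.array([3.0, 2.0]) + b
print(value)  # 5.0, positive, so the point is blue

# Fraction of points that lie on the expected side of the line
red_ok = (red_points @ w + b) < 0
blue_ok = (blue_points @ w + b) > 0
accuracy = (red_ok.sum() + blue_ok.sum()) / 100
print(accuracy)
```

Because the two clusters overlap slightly in principle, the fraction is not guaranteed to be exactly 1 for every random draw.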

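Perceptron definition

In general, a classifier is a function $\hat{c}: \mathbb{R}^d \rightarrow \{1, 2, \dots, C\}$ that maps a point onto one of $C$ classes.

A binary classifier is a classifier with exactly two classes ($C = 2$).

The perceptron we used to tell red points from blue points is a binary classifier with weight vector $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$: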

$$
\hat{c}(x) = \begin{cases}
1, & w^T x + b \geq 0 \\
2, & w^T x + b < 0
\end{cases}
$$

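This function $\hat{c}(x)$ partitions $\mathbb{R}^d$ into two half-spaces, each corresponding to one class.

The red/blue example is two-dimensional ($d = 2$), and in two dimensions the space is divided along a line. In the general $d$-dimensional case, the space is always divided along a $(d - 1)$-dimensional hyperplane.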
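The definition carries over to any dimension $d$ unchanged. Here is a small sketch of $\hat{c}$ as a plain Python function. The name `perceptron_class` and the three-dimensional weights are illustrative assumptions, not part of the original article.

```python
import numpy as np

def perceptron_class(x, w, b):
    """Return class 1 if w^T x + b >= 0, otherwise class 2."""
    return 1 if np.dot(w, x) + b >= 0 else 2

# Two-dimensional case from the red/blue example: boundary is the line x1 + x2 = 0
w2, b2 = np.array([1.0, 1.0]), 0.0
print(perceptron_class(np.array([3.0, 2.0]), w2, b2))    # 1
print(perceptron_class(np.array([-3.0, -2.0]), w2, b2))  # 2

# Three-dimensional case (illustrative values): boundary is the plane x1 + x2 + x3 = 1
w3, b3 = np.array([1.0, 1.0, 1.0]), -1.0
print(perceptron_class(np.array([1.0, 1.0, 1.0]), w3, b3))  # 3 - 1 = 2 >= 0, so 1
print(perceptron_class(np.array([0.0, 0.0, 0.0]), w3, b3))  # -1 < 0, so 2
```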

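From classes to probabilities

In practice, we not only want to know which class a point most likely belongs to, we are also interested in the probability that it belongs to that class.

Earlier, when deciding between red and blue, we plugged the point $x$ into $w^T x + b$: the larger the resulting value, the farther the point lies from the separating line, and the more confident we are that it is blue.

But given a single value of $w^T x + b$, we cannot say whether it counts as large. To turn the value into a probability, we can squash it so that it lies between 0 and 1.

This can be done with the sigmoid function $\sigma$:
$p(\hat{c}(x) = 1 \mid x) = \sigma(w^T x + b)$
where $\sigma(a) = \frac{1}{1 + e^{-a}}$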

Let's take a look at the sigmoid function:

import matplotlib.pyplot as plt
import numpy as np
# Create an interval from -5 to 5 in steps of 0.01
a = np.arange(-5, 5, 0.01)

# Compute the corresponding sigmoid values
s = 1 / (1 + np.exp(-a))

# Plot the result
plt.plot(a, s)
plt.grid(True)
plt.show()

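As the plot shows, when $w^T x + b = 0$, i.e. when the point lies exactly on the separating line, the sigmoid maps the value to a probability of 0.5. The larger $w^T x + b$ becomes, the closer the probability gets to the asymptote at 1; the more negative it becomes, the closer the probability gets to the asymptote at 0.

This matches our expectations.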
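These properties can be verified directly. The sketch below uses a plain NumPy helper (named `sigmoid_fn` here to keep it distinct from the `sigmoid` Operation; the name is an assumption of this example) and evaluates it on the separating line, at the value 5 obtained for the point $(3, 2)$, and at the mirrored value $-5$.

```python
import numpy as np

def sigmoid_fn(a):
    """Plain NumPy sigmoid: 1 / (1 + e^(-a))."""
    return 1 / (1 + np.exp(-a))

print(sigmoid_fn(0.0))                     # 0.5: a point exactly on the line
print(sigmoid_fn(5.0))                     # about 0.9933: the point (3, 2)
print(sigmoid_fn(5.0) + sigmoid_fn(-5.0))  # 1.0 up to rounding
```

The last line illustrates the identity $\sigma(-a) = 1 - \sigma(a)$: mirroring a point across the line swaps the probabilities of the two classes.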

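Now let's define an Operation for the sigmoid function, which we will use later. The computational-graph classes it builds on (Operation, Graph, placeholder, Variable, add, matmul, Session) are listed first so that the code runs on its own: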


class Operation:
    """Represents a graph node that performs a computation.

    An `Operation` is a node in a `Graph` that takes zero or
    more objects as input, and produces zero or more objects
    as output.
    """

    def __init__(self, input_nodes=[]):
        """Construct Operation
        """
        self.input_nodes = input_nodes

        # Initialize list of consumers (i.e. nodes that receive this operation's output as input)
        self.consumers = []

        # Append this operation to the list of consumers of all input nodes
        for input_node in input_nodes:
            input_node.consumers.append(self)

        # Append this operation to the list of operations in the currently active default graph
        _default_graph.operations.append(self)

    def compute(self):
        """Computes the output of this operation.
        Must be implemented by the particular operation.
        """
        pass

class Graph:
    """Represents a computational graph
    """

    def __init__(self):
        """Construct Graph"""
        self.operations = []
        self.placeholders = []
        self.variables = []

    def as_default(self):
        global _default_graph
        _default_graph = self

class placeholder:
    """Represents a placeholder node that has to be provided with a value
       when computing the output of a computational graph
    """

    def __init__(self):
        """Construct placeholder
        """
        self.consumers = []

        # Append this placeholder to the list of placeholders in the currently active default graph
        _default_graph.placeholders.append(self)

class Variable:
    """Represents a variable (i.e. an intrinsic, changeable parameter of a computational graph).
    """

    def __init__(self, initial_value=None):
        """Construct Variable

        Args:
          initial_value: The initial value of this variable
        """
        self.value = initial_value
        self.consumers = []

        # Append this variable to the list of variables in the currently active default graph
        _default_graph.variables.append(self)

class add(Operation):
    """Returns x + y element-wise.
    """

    def __init__(self, x, y):
        """Construct add

        Args:
          x: First summand node
          y: Second summand node
        """
        super().__init__([x, y])

    def compute(self, x_value, y_value):
        """Compute the output of the add operation

        Args:
          x_value: First summand value
          y_value: Second summand value
        """
        return x_value + y_value

class matmul(Operation):
    """Multiplies matrix a by matrix b, producing a * b.
    """

    def __init__(self, a, b):
        """Construct matmul

        Args:
          a: First matrix
          b: Second matrix
        """
        super().__init__([a, b])

    def compute(self, a_value, b_value):
        """Compute the output of the matmul operation

        Args:
          a_value: First matrix value
          b_value: Second matrix value
        """
        return a_value.dot(b_value)

class Session:
    """Represents a particular execution of a computational graph.
    """

    def run(self, operation, feed_dict={}):
        """Computes the output of an operation

        Args:
          operation: The operation whose output we'd like to compute.
          feed_dict: A dictionary that maps placeholders to values for this session
        """

        # Perform a post-order traversal of the graph to bring the nodes into the right order
        nodes_postorder = traverse_postorder(operation)

        # Iterate all nodes to determine their value
        for node in nodes_postorder:

            if type(node) == placeholder:
                # Set the node value to the placeholder value from feed_dict
                node.output = feed_dict[node]
            elif type(node) == Variable:
                # Set the node value to the variable's value attribute
                node.output = node.value
            else:  # Operation
                # Get the input values for this operation from node_values
                node.inputs = [input_node.output for input_node in node.input_nodes]

                # Compute the output of this operation
                node.output = node.compute(*node.inputs)

            # Convert lists to numpy arrays
            if type(node.output) == list:
                node.output = np.array(node.output)

        # Return the requested node value
        return operation.output


def traverse_postorder(operation):
    """Performs a post-order traversal, returning a list of nodes
    in the order in which they have to be computed

    Args:
       operation: The operation to start traversal at
    """

    nodes_postorder = []

    def recurse(node):
        if isinstance(node, Operation):
            for input_node in node.input_nodes:
                recurse(input_node)
        nodes_postorder.append(node)

    recurse(operation)
    return nodes_postorder


class sigmoid(Operation):
    """Returns the sigmoid of x element-wise.
    """

    def __init__(self, a):
        """Construct sigmoid

        Args:
          a: Input node
        """
        super().__init__([a])

    def compute(self, a_value):
        """Compute the output of the sigmoid operation

        Args:
          a_value: Input value
        """
        return 1 / (1 + np.exp(-a_value))

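1. Example

We can now build a perceptron in Python that solves the red/blue problem from before, and use it to compute the probability that $(3,2)^T$ is a blue point:

# Create a new graph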
Graph().as_default()

x = placeholder()
w = Variable([1, 1])
b = Variable(0)
p = sigmoid( add(matmul(w, x), b) )

session = Session()
print(session.run(p, {
    x: [3, 2]
}))

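Multi-class perceptron

So far, we have only used the perceptron as a binary classifier, which gives the probability $p$ that a point belongs to one of two classes. The probability that it belongs to the other class is then simply $1 - p$.

In practice, however, there are often more than two classes. When classifying images, for example, there may be many output classes (dog, chair, person, house, and so on).

We therefore want to extend the perceptron so that it can output probabilities for multiple classes.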

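We still let $C$ denote the number of classes. But instead of the weight vector $w$ from the binary case, we introduce a weight matrix $W \in \mathbb{R}^{d \times C}$.

Each column of the weight matrix contains the weights of a separate linear classifier, one for each class.

In the binary case we computed the dot product $w^T x$; now we compute $xW$, with $x$ treated as a row vector. The result is a vector in $\mathbb{R}^C$ whose entries are the dot products of $x$ with the individual columns of the weight matrix.

Then we add a bias vector $b \in \mathbb{R}^C$ to $xW$, with one entry of $b$ per class.

This yields a vector in $\mathbb{R}^C$ in which each entry is a score for how likely the point is to belong to the corresponding one of the $C$ classes.

This may look complicated, but the matrix multiplication simply runs the linear classifiers for all $C$ classes in parallel. Each classifier has its own separating line which, as in the red/blue problem, is represented implicitly by a weight vector and a bias. Here the weight vectors are the columns of the weight matrix, and the biases are the entries of the vector $b$.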
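To make the "one classifier per column" reading concrete, here is a minimal NumPy sketch. The sizes ($d = 2$, $C = 3$) and all weight and bias values are made up for illustration and do not come from the article.

```python
import numpy as np

# Illustrative values: d = 2 input dimensions, C = 3 classes
W = np.array([[1.0, -1.0,  0.5],
              [1.0, -1.0, -0.5]])  # column j holds the weight vector of classifier j
b = np.array([0.0, 0.0, 1.0])      # one bias per class
x = np.array([3.0, 2.0])           # a single point, treated as a row vector

# All C scores at once: a vector in R^C with the values 5.0, -5.0 and 1.5
scores = x @ W + b
print(scores)

# The same scores, computed one linear classifier at a time
per_class = np.array([W[:, j] @ x + b[j] for j in range(W.shape[1])])
print(np.allclose(scores, per_class))  # True
```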

1. Softmax

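The original perceptron produces a single scalar, and by applying the sigmoid we squash this scalar into a probability between 0 and 1.

The multi-class perceptron instead produces a vector $a \in \mathbb{R}^C$. As before, the larger the $i$-th entry of $a$, the more confident we are that the input point belongs to the $i$-th class.

We therefore want to turn the vector $a$ into a vector of probabilities: each entry gives the probability that the input belongs to the corresponding class, every entry lies between 0 and 1, and all entries sum to 1.

A common way to achieve this is the softmax function, which generalizes the sigmoid to multiple output classes: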

$$\sigma(a)_i = \frac{e^{a_i}}{\sum_{j=1}^C e^{a_j}}$$

class softmax(Operation):
    """Returns the softmax of a.
    """

    def __init__(self, a):
        """Construct softmax

        Args:
          a: Input node
        """
        super().__init__([a])

    def compute(self, a_value):
        """Compute the output of the softmax operation

        Args:
          a_value: Input value (a 2-D array with one row per point)
        """
        # Normalize each row so that its entries sum to 1
        return np.exp(a_value) / np.sum(np.exp(a_value), axis=1)[:, None]
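The two defining properties (rows sum to 1, and the two-class case reduces to a sigmoid) can be checked with a standalone sketch. The helper `softmax_rows` is not part of the article's code. It subtracts each row's maximum before exponentiating, a common numerical-stability trick added here as an assumption, which leaves the result unchanged.

```python
import numpy as np

def softmax_rows(a):
    """Row-wise softmax of a 2-D array.

    Subtracting the row maximum avoids overflow in np.exp and does not
    change the result, because the common factor cancels in the ratio.
    """
    shifted = a - np.max(a, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

a = np.array([[5.0, -5.0],
              [1.0,  2.0]])
p = softmax_rows(a)
print(p.sum(axis=1))  # each row sums to 1

# With C = 2, softmax reduces to a sigmoid of the score difference a_1 - a_2
sig = 1 / (1 + np.exp(-(a[:, 0] - a[:, 1])))
print(np.allclose(p[:, 0], sig))  # True
```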

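2. Batch computation

We can pass in several inputs at once by using a matrix. Whereas before we could only pass in one point at a time, we can now pass in a matrix $X \in \mathbb{R}^{N \times d}$ in which every row contains one point ($N$ rows, i.e. $N$ points of dimension $d$).

We call such a matrix a batch.

Instead of $xW$, we then compute $XW$. This returns an $N \times C$ matrix whose rows contain $xW$ for each point $x$.

We then add the bias vector $b$ to each row (here $b$ is a $1 \times C$ row vector).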
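The shapes involved can be traced in a few lines of NumPy. This is a sketch with made-up sizes and random values (none of them from the article). Adding the one-dimensional array `b` to the $N \times C$ matrix relies on NumPy broadcasting, which adds `b` to every row.

```python
import numpy as np

N, d, C = 4, 2, 3  # illustrative sizes
rng = np.random.default_rng(0)

X = rng.standard_normal((N, d))  # batch: one d-dimensional point per row
W = rng.standard_normal((d, C))  # one weight column per class
b = np.array([0.5, -0.5, 0.0])   # one bias per class

scores = X @ W + b               # b is broadcast across the N rows
print(scores.shape)              # (4, 3)

# Row i of the batch result equals the single-point computation x_i W + b
print(np.allclose(scores[1], X[1] @ W + b))  # True
```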

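The whole procedure thus computes a function $f : \mathbb{R}^{N \times d} \rightarrow \mathbb{R}^{N \times C}$ with $f(X) = \sigma(XW + b)$, where $\sigma$ denotes the softmax applied to each row. In the corresponding computational graph, $X$ and $W$ feed a matmul node, whose output is added to $b$ and then passed through the softmax.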

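3. Example

Let's now generalize the red/blue example so that it supports batch computation and multi-class output.

# Create a new graph
Graph().as_default()

X = placeholder()

# Create a weight matrix for the two output classes:
# the weight vector for blue is (1, 1), the one for red is (-1, -1)
W = Variable([
    [1, -1],
    [1, -1]
])
b = Variable([0, 0])
p = softmax( add(matmul(X, W), b) )

# Create a Session and run the perceptron on our blue/red points
session = Session()
output_probabilities = session.run(p, {
    X: np.concatenate((blue_points, red_points))
})

# Print the first 10 rows, i.e. the probabilities of the first 10 points
print(output_probabilities[:10])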

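Since the first 10 points in the dataset are all blue, the perceptron outputs a higher probability for blue (left column) than for red.

If you have any questions, feel free to ask in the comments of the original post.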
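The same computation can be reproduced without the graph classes, which also makes it easy to turn the probabilities into predicted classes. The sketch below is a standalone approximation of the example, assuming a fixed random seed and using `np.argmax` to pick the more probable column (0 for blue, 1 for red); neither is part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption made for reproducibility
blue_points = rng.standard_normal((50, 2)) + 2
red_points = rng.standard_normal((50, 2)) - 2
X = np.concatenate((blue_points, red_points))  # blue rows first, then red

W = np.array([[1.0, -1.0],
              [1.0, -1.0]])  # column 0: blue classifier, column 1: red classifier
b = np.array([0.0, 0.0])

# Row-wise softmax of XW + b
scores = X @ W + b
exp_scores = np.exp(scores)
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Pick the more probable class per point and compare with the known colors
predicted = np.argmax(probs, axis=1)
expected = np.array([0] * 50 + [1] * 50)
accuracy = np.mean(predicted == expected)
print(accuracy)
```

Since the two score columns are negatives of each other, the blue probability here equals $\sigma(2(x_1 + x_2))$, so the prediction again depends only on which side of the line $y = -x$ a point lies.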