Original: Deep Learning From Scratch V: Multi-Layer Perceptrons
Translator: 罗莎
Reviewer: 孙一萌
- Chapter I: Computational Graphs
- Chapter II: Perceptrons
- Chapter III: Training Criterion
- Chapter IV: Gradient Descent and Backpropagation
- Chapter V: Multi-Layer Perceptrons
- Chapter VI: TensorFlow
Multi-Layer Perceptrons
Motivation
When applying machine learning to real-world problems, we often find that the classes we want to separate are not linearly separable: there is no single line with all the points of one class on one side and all the points of the other class on the other side.
Here is an example:
import numpy as np
import matplotlib.pyplot as plt
points_origin = np.random.randn(25, 2) * 0.22
points_i = np.random.randn(25, 2) * 0.22 + np.ones((25, 2))
red_points = np.append(points_origin, points_i, 0)
points_origin[:,0] += 1
points_i[:,0] -= 1
blue_points = np.append(points_origin, points_i, 0)
# Plot the red and blue points
plt.scatter(red_points[:,0], red_points[:,1], color='red')
plt.scatter(blue_points[:,0], blue_points[:,1], color='blue')
plt.show()
As you can see, it is impossible to separate the blue points from the red points with a single line. Our decision boundary therefore has to take a rather complex shape.
This is where multi-layer perceptrons come into play: they allow us to train a decision boundary that is more complex than a straight line.
Computational Graph
As the name suggests, a multi-layer perceptron (MLP) consists of multiple perceptrons stacked on top of each other, layer by layer.
Let's take a look at its computational graph:
[Figure: computational graph of a multi-layer perceptron]
As you can see, the input is fed into the first layer, a multi-dimensional perceptron with a weight matrix $W_1$ and a bias vector $b_1$. The output of that layer is then fed into a second layer, again a perceptron with another weight matrix $W_2$ and another bias vector $b_2$. This process continues through all $L$ layers, until we reach the output layer. We call the last layer the output layer and every other layer a hidden layer.
The function computed by an MLP with a single hidden layer:

$$\sigma(\sigma(X W_1 + b_1)\, W_2 + b_2)$$

The function computed by an MLP with two hidden layers:

$$\sigma(\sigma(\sigma(X W_1 + b_1)\, W_2 + b_2)\, W_3 + b_3)$$

And, generalizing, the function computed by an MLP with $L - 1$ hidden layers:

$$\sigma(\cdots\,\sigma(\sigma(X W_1 + b_1)\, W_2 + b_2)\,\cdots\, W_L + b_L)$$
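To make the generalized formula concrete, here is a minimal NumPy sketch of this forward pass. It is only an illustration: the helper names `sigma` and `mlp_forward` are made up for this example and are not part of the library built in this series, and the real implementation below applies a softmax to the output layer rather than a plain sigmoid.

```python
import numpy as np

def sigma(z):
    # Element-wise sigmoid, the squashing function used throughout this series
    return 1 / (1 + np.exp(-z))

def mlp_forward(X, weights, biases):
    # Computes sigma(... sigma(sigma(X W_1 + b_1) W_2 + b_2) ... W_L + b_L)
    a = X
    for W, b in zip(weights, biases):
        a = sigma(a.dot(W) + b)
    return a

# Example: 2 input features -> 3 hidden units -> 2 output classes (one hidden layer)
X = np.random.randn(5, 2)
weights = [np.random.randn(2, 3), np.random.randn(3, 2)]
biases = [np.random.randn(3), np.random.randn(2)]
print(mlp_forward(X, weights, biases).shape)  # (5, 2)
```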
Implementation
Using the library we have built, we can now implement multi-layer perceptrons without any further work.
import numpy as np
import matplotlib.pyplot as plt
from queue import Queue
import time
# A dictionary that will map operations to gradient functions
_gradient_registry = {}
class RegisterGradient:
"""A decorator for registering the gradient function for an op type.
"""
def __init__(self, op_type):
"""Creates a new decorator with `op_type` as the Operation type.
Args:
op_type: The name of an operation
"""
self._op_type = eval(op_type)
def __call__(self, f):
"""Registers the function `f` as gradient function for `op_type`."""
_gradient_registry[self._op_type] = f
return f
class Operation(object):
"""Represents a graph node that performs a computation.
An `Operation` is a node in a `Graph` that takes zero or
more objects as input, and produces zero or more objects
as output.
"""
def __init__(self, input_nodes = []):
"""Construct Operation
"""
self.input_nodes = input_nodes
# Initialize list of consumers (i.e. nodes that receive this operation's output as input)
self.consumers = []
# Append this operation to the list of consumers of all input nodes
for input_node in input_nodes:
input_node.consumers.append(self)
# Append this operation to the list of operations in the currently active default graph
_default_graph.operations.append(self)
def compute(self):
"""Computes the output of this operation.
"" Must be implemented by the particular operation.
"""
pass
class add(Operation):
"""Returns x + y element-wise.
"""
def __init__(self, x, y):
"""Construct add
Args:
x: First summand node
y: Second summand node
"""
super().__init__([x, y])
def compute(self, x_value, y_value):
"""Compute the output of the add operation
Args:
x_value: First summand value
y_value: Second summand value
"""
self.inputs = [x_value, y_value]
return x_value + y_value
class matmul(Operation):
"""Multiplies matrix a by matrix b, producing a * b.
"""
def __init__(self, a, b):
"""Construct matmul
Args:
a: First matrix
b: Second matrix
"""
super().__init__([a, b])
def compute(self, a_value, b_value):
"""Compute the output of the matmul operation
Args:
a_value: First matrix value
b_value: Second matrix value
"""
self.inputs = [a_value, b_value]
return a_value.dot(b_value)
class sigmoid(Operation):
"""Returns the sigmoid of x element-wise.
"""
def __init__(self, a):
"""Construct sigmoid
Args:
a: Input node
"""
super().__init__([a])
def compute(self, a_value):
"""Compute the output of the sigmoid operation
Args:
a_value: Input value
"""
return 1 / (1 + np.exp(-a_value))
class softmax(Operation):
"""Returns the softmax of a.
"""
def __init__(self, a):
"""Construct softmax
Args:
a: Input node
"""
super().__init__([a])
def compute(self, a_value):
"""Compute the output of the softmax operation
Args:
a_value: Input value
"""
return np.exp(a_value) / np.sum(np.exp(a_value), axis = 1)[:,None]
class log(Operation):
"""Computes the natural logarithm of x element-wise.
"""
def __init__(self, x):
"""Construct log
Args:
x: Input node
"""
super().__init__([x])
def compute(self, x_value):
"""Compute the output of the log operation
Args:
x_value: Input value
"""
return np.log(x_value)
class multiply(Operation):
"""Returns x * y element-wise.
"""
def __init__(self, x, y):
"""Construct multiply
Args:
x: First multiplicand node
y: Second multiplicand node
"""
super().__init__([x, y])
def compute(self, x_value, y_value):
"""Compute the output of the multiply operation
Args:
x_value: First multiplicand value
y_value: Second multiplicand value
"""
return x_value * y_value
class reduce_sum(Operation):
"""Computes the sum of elements across dimensions of a tensor.
"""
def __init__(self, A, axis = None):
"""Construct reduce_sum
Args:
A: The tensor to reduce.
axis: The dimensions to reduce. If `None` (the default), reduces all dimensions.
"""
super().__init__([A])
self.axis = axis
def compute(self, A_value):
"""Compute the output of the reduce_sum operation
Args:
A_value: Input tensor value
"""
return np.sum(A_value, self.axis)
class negative(Operation):
"""Computes the negative of x element-wise.
"""
def __init__(self, x):
"""Construct negative
Args:
x: Input node
"""
super().__init__([x])
def compute(self, x_value):
"""Compute the output of the negative operation
Args:
x_value: Input value
"""
return -x_value
@RegisterGradient("add")
def _add_gradient(op, grad):
"""Computes the gradients for `add`.
Args:
op: The `add` `Operation` that we are differentiating
grad: Gradient with respect to the output of the `add` op.
Returns:
Gradients with respect to the input of `add`.
"""
a = op.inputs[0]
b = op.inputs[1]
grad_wrt_a = grad
while np.ndim(grad_wrt_a) > len(a.shape):
grad_wrt_a = np.sum(grad_wrt_a, axis=0)
for axis, size in enumerate(a.shape):
if size == 1:
grad_wrt_a = np.sum(grad_wrt_a, axis=axis, keepdims=True)
grad_wrt_b = grad
while np.ndim(grad_wrt_b) > len(b.shape):
grad_wrt_b = np.sum(grad_wrt_b, axis=0)
for axis, size in enumerate(b.shape):
if size == 1:
grad_wrt_b = np.sum(grad_wrt_b, axis=axis, keepdims=True)
return [grad_wrt_a, grad_wrt_b]
@RegisterGradient("matmul")
def _matmul_gradient(op, grad):
"""Computes the gradients for `matmul`.
Args:
op: The `matmul` `Operation` that we are differentiating
grad: Gradient with respect to the output of the `matmul` op.
Returns:
Gradients with respect to the input of `matmul`.
"""
A = op.inputs[0]
B = op.inputs[1]
return [grad.dot(B.T), A.T.dot(grad)]
@RegisterGradient("sigmoid")
def _sigmoid_gradient(op, grad):
"""Computes the gradients for `sigmoid`.
Args:
op: The `sigmoid` `Operation` that we are differentiating
grad: Gradient with respect to the output of the `sigmoid` op.
Returns:
Gradients with respect to the input of `sigmoid`.
"""
sigmoid = op.output
return grad * sigmoid * (1-sigmoid)
@RegisterGradient("softmax")
def _softmax_gradient(op, grad):
"""Computes the gradients for `softmax`.
Args:
op: The `softmax` `Operation` that we are differentiating
grad: Gradient with respect to the output of the `softmax` op.
Returns:
Gradients with respect to the input of `softmax`.
"""
softmax = op.output
return (grad - np.reshape(
np.sum(grad * softmax, 1),
[-1, 1]
)) * softmax
@RegisterGradient("log")
def _log_gradient(op, grad):
"""Computes the gradients for `log`.
Args:
op: The `log` `Operation` that we are differentiating
grad: Gradient with respect to the output of the `log` op.
Returns:
Gradients with respect to the input of `log`.
"""
x = op.inputs[0]
return grad/x
@RegisterGradient("multiply")
def _multiply_gradient(op, grad):
"""Computes the gradients for `multiply`.
Args:
op: The `multiply` `Operation` that we are differentiating
grad: Gradient with respect to the output of the `multiply` op.
Returns:
Gradients with respect to the input of `multiply`.
"""
A = op.inputs[0]
B = op.inputs[1]
return [grad * B, grad * A]
@RegisterGradient("reduce_sum")
def _reduce_sum_gradient(op, grad):
"""Computes the gradients for `reduce_sum`.
Args:
op: The `reduce_sum` `Operation` that we are differentiating
grad: Gradient with respect to the output of the `reduce_sum` op.
Returns:
Gradients with respect to the input of `reduce_sum`.
"""
A = op.inputs[0]
output_shape = np.array(A.shape)
output_shape[op.axis] = 1
tile_scaling = A.shape // output_shape
grad = np.reshape(grad, output_shape)
return np.tile(grad, tile_scaling)
@RegisterGradient("negative")
def _negative_gradient(op, grad):
"""Computes the gradients for `negative`.
Args:
op: The `negative` `Operation` that we are differentiating
grad: Gradient with respect to the output of the `negative` op.
Returns:
Gradients with respect to the input of `negative`.
"""
return -grad
class placeholder:
"""Represents a placeholder node that has to be provided with a value
when computing the output of a computational graph
"""
def __init__(self):
"""Construct placeholder
"""
self.consumers = []
# Append this placeholder to the list of placeholders in the currently active default graph
_default_graph.placeholders.append(self)
class Variable:
"""Represents a variable (i.e. an intrinsic, changeable parameter of a computational graph).
"""
def __init__(self, initial_value = None):
"""Construct Variable
Args:
initial_value: The initial value of this variable
"""
self.value = initial_value
self.consumers = []
# Append this variable to the list of variables in the currently active default graph
_default_graph.variables.append(self)
class Graph:
"""Represents a computational graph
"""
def __init__(self):
"""Construct Graph"""
self.operations = []
self.placeholders = []
self.variables = []
def as_default(self):
global _default_graph
_default_graph = self
class Session:
"""Represents a particular execution of a computational graph.
"""
def run(self, operation, feed_dict = {}):
"""Computes the output of an operation
Args:
operation: The operation whose output we'd like to compute.
feed_dict: A dictionary that maps placeholders to values for this session
"""
# Perform a post-order traversal of the graph to bring the nodes into the right order
nodes_postorder = traverse_postorder(operation)
# Iterate all nodes to determine their value
for node in nodes_postorder:
if type(node) == placeholder:
# Set the node value to the placeholder value from feed_dict
node.output = feed_dict[node]
elif type(node) == Variable:
# Set the node value to the variable's value attribute
node.output = node.value
else: # Operation
# Get the input values for this operation from node_values
node.inputs = [input_node.output for input_node in node.input_nodes]
# Compute the output of this operation
node.output = node.compute(*node.inputs)
# Convert lists to numpy arrays
if type(node.output) == list:
node.output = np.array(node.output)
# Return the requested node value
return operation.output
def traverse_postorder(operation):
"""Performs a post-order traversal, returning a list of nodes
in the order in which they have to be computed
Args:
operation: The operation to start traversal at
"""
nodes_postorder = []
def recurse(node):
if isinstance(node, Operation):
for input_node in node.input_nodes:
recurse(input_node)
nodes_postorder.append(node)
recurse(operation)
return nodes_postorder
class GradientDescentOptimizer:
def __init__(self, learning_rate):
self.learning_rate = learning_rate
def minimize(self, loss):
learning_rate = self.learning_rate
class MinimizationOperation(Operation):
def compute(self):
# Compute gradients
grad_table = compute_gradients(loss)
# Iterate all variables
for node in grad_table:
if type(node) == Variable:
# Retrieve gradient for this variable
grad = grad_table[node]
# Take a step along the direction of the negative gradient
node.value -= learning_rate * grad
return MinimizationOperation()
def compute_gradients(loss):
# grad_table[node] will contain the gradient of the loss w.r.t. the node's output
grad_table = {}
# The gradient of the loss with respect to the loss is just 1
grad_table[loss] = 1
# Perform a breadth-first search, backwards from the loss
visited = set()
queue = Queue()
visited.add(loss)
queue.put(loss)
while not queue.empty():
node = queue.get()
#print("CurrNode: " + str(node))
# If this node is not the loss
if node != loss:
#
# Compute the gradient of the loss with respect to this node's output
#
grad_table[node] = 0
# Iterate all consumers
for consumer in node.consumers:
#print("\t Consumer: " + str(consumer))
#print("\t GradTable: " + str(grad_table))
# Retrieve the gradient of the loss w.r.t. consumer's output
lossgrad_wrt_consumer_output = grad_table[consumer]
# Retrieve the function which computes gradients with respect to
# consumer's inputs given gradients with respect to consumer's output.
consumer_op_type = consumer.__class__
bprop = _gradient_registry[consumer_op_type]
# Get the gradient of the loss with respect to all of consumer's inputs
lossgrads_wrt_consumer_inputs = bprop(consumer, lossgrad_wrt_consumer_output)
if len(consumer.input_nodes) == 1:
# If the consumer has a single input node, bprop returned the gradient itself (not a list)
grad_table[node] += lossgrads_wrt_consumer_inputs
else:
# Otherwise, lossgrads_wrt_consumer_inputs is an array of gradients for each input node
# Retrieve the index of node in consumer's inputs
node_index_in_consumer_inputs = consumer.input_nodes.index(node)
# Get the gradient of the loss with respect to node
lossgrad_wrt_node = lossgrads_wrt_consumer_inputs[node_index_in_consumer_inputs]
# Add to total gradient
grad_table[node] += lossgrad_wrt_node
#
# Append each input node to the queue
#
if hasattr(node, "input_nodes"):
for input_node in node.input_nodes:
if not input_node in visited:
visited.add(input_node)
queue.put(input_node)
# Return gradients for each visited node
return grad_table
import matplotlib.pyplot as plt
points_origin = np.random.randn(25, 2) * 0.22
points_i = np.random.randn(25, 2) * 0.22 + np.ones((25, 2))
red_points = np.append(points_origin, points_i, 0)
points_origin[:,0] += 1
points_i[:,0] -= 1
blue_points = np.append(points_origin, points_i, 0)
# Create a new graph
Graph().as_default()
# Create the input placeholder
X = placeholder()
# Create a placeholder for the training classes
c = placeholder()
# Build the hidden layer
W_hidden = Variable(np.random.randn(2, 2))
b_hidden = Variable(np.random.randn(2))
p_hidden = sigmoid(add(matmul(X, W_hidden), b_hidden))
# Build the output layer
W_output = Variable(np.random.randn(2, 2))
b_output = Variable(np.random.randn(2))
p_output = softmax(add(matmul(p_hidden, W_output), b_output))
# Build the cross-entropy loss
J = negative(reduce_sum(reduce_sum(multiply(c, log(p_output)), axis=1)))
# Build a minimization operation for the loss
minimization_op = GradientDescentOptimizer(learning_rate=0.03).minimize(J)
# Build the placeholder inputs
feed_dict = {
X: np.concatenate((blue_points, red_points)),
c:
[[1, 0]] * len(blue_points)
+ [[0, 1]] * len(red_points)
}
# Create a session
session = Session()
# Perform 1000 gradient descent iterations, printing the loss every 100 steps
for step in range(1000):
J_value = session.run(J, feed_dict)
if step % 100 == 0:
print("Step:", step, " Loss:", J_value)
session.run(minimization_op, feed_dict)
# Print the final parameter values
W_hidden_value = session.run(W_hidden)
print("Hidden layer weight matrix:\n", W_hidden_value)
b_hidden_value = session.run(b_hidden)
print("Hidden layer bias:\n", b_hidden_value)
W_output_value = session.run(W_output)
print("Output layer weight matrix:\n", W_output_value)
b_output_value = session.run(b_output)
print("Output layer bias:\n", b_output_value)
# Visualize the classification boundary
xs = np.linspace(-2, 2)
ys = np.linspace(-2, 2)
pred_classes = []
for x in xs:
for y in ys:
pred_class = session.run(p_output,
feed_dict={X: [[x, y]]})[0]
pred_classes.append((x, y, pred_class.argmax()))
xs_p, ys_p = [], []
xs_n, ys_n = [], []
for x, y, c in pred_classes:
if c == 0:
xs_n.append(x)
ys_n.append(y)
else:
xs_p.append(x)
ys_p.append(y)
plt.plot(xs_p, ys_p, 'ro', xs_n, ys_n, 'bo')
plt.show()
As you can see, we have learned a rather complex decision boundary. The more layers we use, the more complex the decision boundary can become, allowing us to learn classification patterns that would be impossible for a human to spot, especially in higher dimensions.
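To illustrate the point about depth, here is a minimal sketch of how a second hidden layer could be stacked using the same building blocks. The names `W_hidden1`, `b_hidden1`, `W_hidden2`, `b_hidden2` and the hidden width of 4 are arbitrary choices for this illustration (they are not part of the original example); the loss, feed_dict construction, and training loop are then reused unchanged.

```python
# A sketch of an MLP with two hidden layers, reusing the classes defined above
Graph().as_default()

X = placeholder()
c = placeholder()

# First hidden layer: 2 input features -> 4 hidden units
W_hidden1 = Variable(np.random.randn(2, 4))
b_hidden1 = Variable(np.random.randn(4))
p_hidden1 = sigmoid(add(matmul(X, W_hidden1), b_hidden1))

# Second hidden layer: 4 -> 4 hidden units
W_hidden2 = Variable(np.random.randn(4, 4))
b_hidden2 = Variable(np.random.randn(4))
p_hidden2 = sigmoid(add(matmul(p_hidden1, W_hidden2), b_hidden2))

# Output layer: 4 hidden units -> 2 classes
W_output = Variable(np.random.randn(4, 2))
b_output = Variable(np.random.randn(2))
p_output = softmax(add(matmul(p_hidden2, W_output), b_output))

# Cross-entropy loss and minimization, exactly as in the single-hidden-layer case
J = negative(reduce_sum(reduce_sum(multiply(c, log(p_output)), axis=1)))
minimization_op = GradientDescentOptimizer(learning_rate=0.03).minimize(J)
```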
Summary
Congratulations! Starting from scratch, you have now covered the fundamentals of building neural networks by hand. Unlike most machine learning practitioners, you know how they work under the hood and why they are built the way they are.
Let's recap. We started with computational graphs, learning how to build them and how to compute their output. We then introduced perceptrons: linear classifiers that squash the output of $w^\top x + b$ through a sigmoid to obtain class probabilities (using a softmax instead of a sigmoid in the multi-class case).
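In symbols, these two squashing functions are (matching the `sigmoid` and `softmax` operations defined above):

$$
\sigma(a) = \frac{1}{1 + e^{-a}}, \qquad
\operatorname{softmax}(a)_i = \frac{e^{a_i}}{\sum_j e^{a_j}}
$$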
Next, we learned how to judge how good a classifier is by means of a loss function, and saw that minimizing the cross-entropy loss is equivalent to maximum likelihood. We then saw how to minimize the loss via gradient descent: by iteratively stepping in the direction opposite to the gradient.
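Restated in symbols (writing $\theta$ for any trainable `Variable` and $\eta$ for the `learning_rate`), the cross-entropy loss used above and the gradient descent update are:

$$
J = -\sum_i \sum_j c_{ij} \log p_{ij}, \qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta J
$$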
Then we introduced backpropagation, which computes the derivative of the loss with respect to every node using a breadth-first search and the chain rule. With everything we had learned, we trained a good linear classifier for the red/blue example dataset.
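Concretely, the relation that `compute_gradients` accumulates in `grad_table` at every node $n$ is the chain rule summed over that node's consumers:

$$
\frac{\partial J}{\partial n} \;=\; \sum_{k \,\in\, \mathrm{consumers}(n)} \frac{\partial J}{\partial k} \cdot \frac{\partial k}{\partial n}
$$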
Finally, we learned about multi-layer perceptrons as a means of learning non-linear decision boundaries: we implemented an MLP with one hidden layer and successfully trained it to produce a good decision boundary on a dataset that is not linearly separable.
Next Steps
The following chapter will focus on practical aspects of training neural networks.