I'm a very very quick learner. -- Lou Bloom, Nightcrawler

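This chapter covers how a neural network learns from data on its own to obtain optimal weight and bias values. To do so it introduces the loss function, which quantifies how poorly the current weights fit the data. The goal of learning is to find the set of parameters that minimizes the loss function, and the method used is the gradient method.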

The significance of deep learning

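A practical neural network has a large number of layers and parameters, from thousands up to hundreds of millions, so working them out by hand is infeasible. Deep learning is therefore needed to determine the parameters from data.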

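In classical machine learning, for a task such as recognizing handwritten digits, features are usually extracted from the image first (a feature extractor converts the image into a vector), and a machine learning technique then learns the patterns in those feature vectors.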

In deep learning, the feature extraction step is also learned by the machine.

Figure: hand-crafted rules -> machine learning -> deep learning

Overfitting

Overfitting means the parameters fit one particular dataset too closely: accuracy is high on that dataset but not on others.

Overfitting is a problem that comes up often in deep learning.

Loss function

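A neural network uses a loss function as the quantitative measure when searching for the optimal weights. Mean squared error and cross-entropy error are the usual choices.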

Mean squared error

Figure: mean squared error formula, E = \frac{1}{2}\sum_k (y_k - t_k)^2

The symbols are:

y_k: output of the neural network
t_k: supervised (label) data
k: index over the dimensions of the data

# Mean squared error
import numpy as np

def mean_squared_error(y, t):
    return 0.5 * np.sum((y-t)**2)

One-hot representation

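The correct label is represented as 1 and all other labels as 0. In the digit recognition example, a supervised data item [0,0,1,0,0,0,0,0,0,0] means the digit is 2. Without one-hot encoding, t would simply be 2.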
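To make the one-hot representation and the mean squared error concrete, here is a minimal sketch. The helper `to_one_hot` and the two output vectors `y_good` and `y_bad` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def mean_squared_error(y, t):
    return 0.5 * np.sum((y - t) ** 2)

def to_one_hot(label, num_classes):
    # Hypothetical helper: row `label` of the identity matrix is its one-hot vector
    return np.eye(num_classes, dtype=int)[label]

t = to_one_hot(2, 10)  # one-hot label for the digit 2

# Two made-up network outputs: the first favors class 2, the second favors class 7
y_good = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
y_bad = np.array([0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0])

mse_good = mean_squared_error(y_good, t)
mse_bad = mean_squared_error(y_bad, t)

print(t)                   # the one-hot vector
print(mse_good, mse_bad)   # the output that favors the correct class has the smaller error
print(int(np.argmax(t)))   # argmax recovers the plain label from a one-hot vector
```

The output that puts most of its weight on the correct class gets the smaller loss, which is the property the loss function needs.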

Cross-entropy error

Figure: cross-entropy error formula, E = -\sum_k t_k \log y_k

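The label t_k is one-hot: only the correct class is 1 and all others are 0. The formula therefore reduces to E = -\log y_k, where k is the index of the correct class. The natural logarithm is monotonically increasing on (0, 1] and ranges from negative infinity to 0, so the closer y_k is to 1, the smaller E becomes, which means the output is closer to the correct (best) result.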

# Cross-entropy error
def cross_entropy_error(y, t):
    delta = 1e-7
    return -np.sum(t * np.log(y + delta))

As a safeguard, a tiny value delta is added so that the edge case log(0), which evaluates to negative infinity, never occurs.
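The behavior described above can be checked on a few values. This is a small sketch; the output vectors are made-up numbers chosen for illustration.

```python
import numpy as np

def cross_entropy_error(y, t):
    delta = 1e-7  # keeps the argument of log strictly positive
    return -np.sum(t * np.log(y + delta))

t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])  # correct class is 2

# Made-up outputs: probability 0.6 vs. 0.1 on the correct class
y_good = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
y_bad = np.array([0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0])

ce_good = cross_entropy_error(y_good, t)  # -log(0.6), roughly 0.51
ce_bad = cross_entropy_error(y_bad, t)    # -log(0.1), roughly 2.30

# With probability 0 on the correct class, delta keeps the result finite
ce_zero = cross_entropy_error(np.zeros(10), t)

print(ce_good, ce_bad, ce_zero)
```

Only the entry at the correct class contributes, so the loss is just the negative log of the probability assigned to that class.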

Mini-batch learning

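To evaluate the loss over more data in a single computation, use mini-batches: compute the loss function over a batch of data, sum the per-sample losses, and divide by the batch size to normalize.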

Figure: cross-entropy error over a batch, E = -\frac{1}{N}\sum_n \sum_k t_{nk} \log y_{nk}

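It is called a mini-batch because only a small (mini) part of the full dataset is selected for each computation, which keeps the computational cost down. It can be thought of as a sample survey.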

# Sampling with numpy
train_size = x_train.shape[0] # 60000 samples
batch_size = 10
batch_mask = np.random.choice(train_size, batch_size) # randomly pick 10 indices
x_batch = x_train[batch_mask]
t_batch = t_train[batch_mask]

The same logic could be implemented in Java, but it would not be as concise as in Python.
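The batch-averaged loss itself can be sketched as follows. The function name `cross_entropy_error_batch` and the sample values are assumptions made for this illustration; it expects one-hot labels.

```python
import numpy as np

def cross_entropy_error_batch(y, t):
    # Treat a single sample as a batch of one so both cases share one code path
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
    batch_size = y.shape[0]
    # Sum over all samples and classes, then average over the batch
    return -np.sum(t * np.log(y + 1e-7)) / batch_size

# Made-up outputs for a batch of two samples with three classes
y = np.array([[0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
t = np.array([[0, 1, 0],
              [0, 0, 1]])

loss_batch = cross_entropy_error_batch(y, t)         # mean of -log(0.8) and -log(0.4)
loss_single = cross_entropy_error_batch(y[0], t[0])  # -log(0.8)

print(loss_batch, loss_single)
```

Dividing by the batch size makes the loss comparable across different batch sizes.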

Array dimension operations in numpy

# nparray_test.py: quick test
import numpy as np

a = np.array([[11,22,33], [44,55,66]])
print(a.ndim) # 2
print(a.size) # 6, the number of elements in the array
b = a.reshape(1, a.size)
print(b) # [[11 22 33 44 55 66]]
print(b.ndim) # 2, still two-dimensional
print(b.size) # 6, the number of elements is unchanged
print(b.shape) # (1, 6)

Numerical differentiation

A derivative expresses the amount of change at a given instant, like instantaneous velocity.

Figure: definition of the derivative, \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

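Forward difference: the difference between f(x+h) and f(x). Because h cannot actually go to 0, the slope is measured slightly off from x, so there is an error.
Central difference: the difference between f(x+h) and f(x-h), centered on x. It is closer to the true value.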

# Central difference
def numerical_diff(f, x):
    h = 1e-4 # 0.0001
    return (f(x+h)-f(x-h))/(2*h)
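The difference in accuracy between the two schemes can be seen on a function whose derivative is known. This is a sketch: `forward_diff`, `central_diff`, and the test function `cube` are names chosen for the illustration.

```python
def forward_diff(f, x, h=1e-4):
    # One-sided slope between x and x + h
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h=1e-4):
    # Symmetric slope between x - h and x + h
    return (f(x + h) - f(x - h)) / (2 * h)

def cube(x):
    return x ** 3

exact = 12.0  # analytic derivative 3 * x**2 at x = 2

err_forward = abs(forward_diff(cube, 2.0) - exact)
err_central = abs(central_diff(cube, 2.0) - exact)

print(err_forward, err_central)  # the central difference error is far smaller
```

For the same h, the forward difference error shrinks in proportion to h, while the central difference error shrinks in proportion to h squared.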

Numerical differentiation example

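For the quadratic function below, the Python code computes its numerical derivative, takes the slope at x=5, and plots the function together with its tangent line.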

import numpy as np
import matplotlib.pylab as plt

def numerical_diff(f, x):
    h = 1e-4 # 0.0001
    return (f(x+h) - f(x-h)) / (2*h)

def function_1(x):
    return 0.01*x**2 + 0.1*x 

def tangent_line(f, x):
    d = numerical_diff(f, x)
    print(d)
    y = f(x) - d*x
    return lambda t: d*t + y
     
x = np.arange(0.0, 20.0, 0.1)
y = function_1(x)
plt.xlabel("x")
plt.ylabel("f(x)")

tf = tangent_line(function_1, 5)
y2 = tf(x)

plt.plot(x, y)
plt.plot(x, y2)
plt.show()

Partial derivatives

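The derivative of a function of several variables with respect to one of them is called a partial derivative. When using it, state which variable the derivative is taken with respect to.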

The function f(x0, x1) = x0^2 + x1^2 is implemented in Python as:

def function_2(x):
    return x[0]**2 + x[1]**2
    # or: return np.sum(x**2)

Its graph is:

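The partial derivatives with respect to x0 and x1 are written \frac{\partial f}{\partial x_0} and \frac{\partial f}{\partial x_1}, respectively.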

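To compute a partial derivative, substitute the known values for the other variables and apply the central difference to the remaining one. For example, the partial derivative with respect to x0 at x0=3, x1=4: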

def function_tmp1(x0):
    return x0**2 + 4.0**2.0

numerical_diff(function_tmp1, 3.0) # approximately 6.0
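The partial derivative with respect to x1 at the same point follows the same recipe. A short sketch, where `function_tmp2` is a name introduced here for illustration:

```python
def numerical_diff(f, x):
    h = 1e-4
    return (f(x + h) - f(x - h)) / (2 * h)

def function_tmp2(x1):
    # x0 is fixed at 3.0, so only x1 varies
    return 3.0 ** 2.0 + x1 ** 2

d_x1 = numerical_diff(function_tmp2, 4.0)
print(d_x1)  # close to the analytic value 2 * x1 = 8
```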

Gradient

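A derivative represents the rate of change at a given instant. Taking the partial derivative with respect to every component of a multi-dimensional vector and collecting the results into a vector gives the gradient. When stating a gradient, specify the point x0=?, x1=?, ..., xn=? at which it is evaluated.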

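The gradient of a function f with respect to a multi-dimensional parameter x can be computed as follows (including batch handling):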

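Keep in mind: x is a multi-dimensional array (a tensor), and it must have a float dtype so that adding h is not truncated.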

import numpy as np

def _numerical_gradient_no_batch(f, x):
    h = 1e-4 # 0.0001
    grad = np.zeros_like(x)
    
    for idx in range(x.size):
        tmp_val = x[idx]
        x[idx] = float(tmp_val) + h
        fxh1 = f(x) # f(x+h)
        
        x[idx] = tmp_val - h 
        fxh2 = f(x) # f(x-h)
        grad[idx] = (fxh1 - fxh2) / (2*h)
        
        x[idx] = tmp_val # restore the original value
        
    return grad


def numerical_gradient(f, X):
    if X.ndim == 1:
        return _numerical_gradient_no_batch(f, X)
    else:
        grad = np.zeros_like(X)
        
        for idx, x in enumerate(X):
            grad[idx] = _numerical_gradient_no_batch(f, x)
        
        return grad

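The gradient points in the direction in which the function value increases fastest at each point, so the negative gradient points in the direction in which it decreases fastest.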
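A quick check of the direction of the gradient, as a sketch. It uses a one-dimensional-only version of the numerical gradient and the function f(x0, x1) = x0^2 + x1^2; the step size 0.01 is an arbitrary choice for the illustration.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    # Central difference applied to one component at a time (1-D float arrays only)
    grad = np.zeros_like(x)
    for idx in range(x.size):
        tmp_val = x[idx]
        x[idx] = tmp_val + h
        fxh1 = f(x)
        x[idx] = tmp_val - h
        fxh2 = f(x)
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp_val  # restore the original value
    return grad

def function_2(x):
    return np.sum(x ** 2)

x = np.array([3.0, 4.0])
g = numerical_gradient(function_2, x)  # close to the analytic gradient (2*x0, 2*x1)

step = 0.01
f_here = function_2(x)
f_down = function_2(x - step * g)  # small step against the gradient
f_up = function_2(x + step * g)    # small step along the gradient

print(g)
print(f_down, f_here, f_up)
```

Stepping against the gradient lowers the function value and stepping along it raises the value, which is why gradient descent subtracts the gradient.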

Gradient method

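The gradient method uses the gradient to search for the minimum of a function: by repeatedly taking steps in the direction indicated by the gradient (for minimization, the negative gradient), it keeps reducing the value of the loss function.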

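Local minimum: the smallest value within a neighborhood
Global minimum: the smallest value over the whole domain
Saddle point: a minimum when viewed along one direction and a maximum along another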

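A gradient method that searches for a minimum is called gradient descent, and one that searches for a maximum is called gradient ascent. In neural networks, the gradient method generally means gradient descent.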

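Moving from the current point in the direction given by the gradient is the gradient method in plain words. The step size is denoted η and is called the learning rate. It determines how much is learned in one learning step, that is, how far the parameters are updated. Values like the learning rate that affect training but are set by hand are called hyperparameters, to distinguish them from ordinary parameters such as weights, which are learned.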

Figure: one update (learning) step, x_i \leftarrow x_i - \eta \frac{\partial f}{\partial x_i}

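A learning rate that is too large or too small will not reach a "good position". In practice the learning rate is usually varied while checking whether training is proceeding properly.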

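Learning rate too large: the values diverge to something very large
Learning rate too small: the parameters have barely been updated when training ends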

A Python implementation of gradient descent:

def gradient_descent(f, init_x, lr=0.01, step_num=100):
    x = init_x

    for i in range(step_num): # number of steps
        grad = numerical_gradient(f, x) # gradient at the current point
        x -= lr * grad # step against the gradient

    return x
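The effect of the learning rate can be tried on f(x0, x1) = x0^2 + x1^2, whose minimum is at (0, 0). This is a self-contained sketch: the starting point and the three learning rates are illustrative choices, and unlike the function above it copies `init_x` so the same starting array can be reused.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    # Central difference, one component at a time (1-D float arrays only)
    grad = np.zeros_like(x)
    for idx in range(x.size):
        tmp_val = x[idx]
        x[idx] = tmp_val + h
        fxh1 = f(x)
        x[idx] = tmp_val - h
        fxh2 = f(x)
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp_val
    return grad

def gradient_descent(f, init_x, lr=0.01, step_num=100):
    x = init_x.copy()  # assumption for this sketch: leave the caller's array untouched
    for _ in range(step_num):
        x -= lr * numerical_gradient(f, x)
    return x

def function_2(x):
    return x[0] ** 2 + x[1] ** 2

start = np.array([-3.0, 4.0])
x_ok = gradient_descent(function_2, start, lr=0.1)       # ends up very close to (0, 0)
x_large = gradient_descent(function_2, start, lr=10.0)   # overshoots and blows up
x_small = gradient_descent(function_2, start, lr=1e-10)  # barely moves from the start

print(x_ok, x_large, x_small)
```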

Gradients for neural networks

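A neural network needs the optimal weight matrix, and the gradient method can be used to compute it. For a network with weights W and loss L, the gradient \frac{\partial L}{\partial W} has the same shape as W, and each element is the partial derivative of L with respect to the corresponding weight: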

Based on this, define a simple neural network called simpleNet:

import sys, os
sys.path.append(os.pardir)  # allow importing files from the parent directory
import numpy as np
from common.functions import softmax, cross_entropy_error
from common.gradient import numerical_gradient

class simpleNet:
    def __init__(self):
        self.W = np.random.randn(2,3) # initialize the weights with random values

    def predict(self, x): # compute the network output
        return np.dot(x, self.W)

    def loss(self, x, t): # compute the cross-entropy loss
        z = self.predict(x)
        y = softmax(z)
        loss = cross_entropy_error(y, t)

        return loss

x = np.array([0.6, 0.9]) # input
t = np.array([0, 0, 1]) # correct label (one-hot)

net = simpleNet()

f = lambda w: net.loss(x, t) # w is a dummy argument; the loss reads net.W directly
dW = numerical_gradient(f, net.W)

print(dW) # print the gradient of the loss with respect to W

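With the gradient computation above, choosing a step size and a number of steps and iterating yields weights that reduce the loss function toward its minimum.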
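That iteration can be sketched without the book's `common` package. Everything below is a stand-in written for this illustration: the `softmax`, `cross_entropy_error`, and `numerical_gradient` helpers, the seed, the learning rate 0.1, and the 100 steps are assumptions, not the original implementation.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def cross_entropy_error(y, t):
    return -np.sum(t * np.log(y + 1e-7))

def numerical_gradient(f, W, h=1e-4):
    # Central difference over every element of an array of any shape
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        tmp_val = W[idx]
        W[idx] = tmp_val + h
        f_plus = f(W)
        W[idx] = tmp_val - h
        f_minus = f(W)
        grad[idx] = (f_plus - f_minus) / (2 * h)
        W[idx] = tmp_val  # restore the original value
    return grad

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))  # random initial weights
x = np.array([0.6, 0.9])         # input
t = np.array([0, 0, 1])          # correct label (one-hot)

def loss(W):
    return cross_entropy_error(softmax(np.dot(x, W)), t)

loss_before = loss(W)
dW = numerical_gradient(loss, W)  # same shape as W
for _ in range(100):
    W -= 0.1 * numerical_gradient(loss, W)  # one gradient descent step
loss_after = loss(W)

print(dW.shape)
print(loss_before, loss_after)  # the loss goes down
```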

Implementing the learning algorithm

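Adjusting the weights and biases to fit the training data is called learning. It consists of the steps below. Because the first step samples at random, the method is also called stochastic gradient descent (SGD).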

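Step 1, sample a mini-batch: randomly select part of the training data and learn from that part
Step 2, compute the gradient: compute the gradient of the loss with respect to each weight parameter (the weights are initialized randomly once, before the first iteration); the negative gradient gives the direction in which the loss decreases fastest
Step 3, update the parameters: move the weights a small amount in that direction, using the gradient from step 2
Step 4, repeat steps 1 to 3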

Class definition: the two-layer network TwoLayerNet

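First define the structure of the two-layer network. It holds the weights and biases of each layer and provides a predict function that computes the output, a loss function that computes the cross-entropy loss, an accuracy function that computes accuracy over a batch, and a numerical_gradient function that produces the gradient matrices.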

# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow importing files from the parent directory
import numpy as np
from common.functions import *
from common.gradient import numerical_gradient


class TwoLayerNet:

    # hidden_size: number of neurons in the hidden layer (layer 1)
    # weight_init_std: standard deviation of the initial random weights
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # initialize the weights
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size) # shape (input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size) # biases start at 0
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size) # shape (hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    # compute the output as defined by the network
    def predict(self, x):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
    
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1) # hidden layer uses the sigmoid activation
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2) # output layer uses the softmax activation
        
        return y
        
    # x: input data, t: supervised data
    def loss(self, x, t):
        y = self.predict(x)
        
        return cross_entropy_error(y, t)
    
    # x may be a batch of inputs here
    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)
        
        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy
        
    # x: input data, t: supervised data
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t) # loss as a function of the parameters
        
        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])
        
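        return grads # gradients of the loss
        
    # compute the gradients analytically (backpropagation), much faster than numerical_gradient
    def gradient(self, x, t):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        grads = {} # dict keyed by parameter name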
        
        batch_num = x.shape[0]
        
        # forward pass
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)
        
        # backward
        dy = (y - t) / batch_num
        grads['W2'] = np.dot(z1.T, dy)
        grads['b2'] = np.sum(dy, axis=0)
        
        da1 = np.dot(dy, W2.T)
        dz1 = sigmoid_grad(a1) * da1
        grads['W1'] = np.dot(x.T, dz1)
        grads['b1'] = np.sum(dz1, axis=0)

        return grads
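To make the shapes in the class concrete, here is a standalone sketch of the forward pass for a batch of 100 inputs. It does not use the `common` package; the random batch stands in for flattened 28x28 images, and the inline sigmoid and softmax are written out for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 784, 50, 10

# Same initialization scheme as in the class: small Gaussian weights, zero biases
W1 = 0.01 * rng.standard_normal((input_size, hidden_size))
b1 = np.zeros(hidden_size)
W2 = 0.01 * rng.standard_normal((hidden_size, output_size))
b2 = np.zeros(output_size)

x = rng.random((100, input_size))  # stand-in for a batch of 100 images

a1 = np.dot(x, W1) + b1            # (100, 784) x (784, 50) -> (100, 50)
z1 = 1.0 / (1.0 + np.exp(-a1))     # sigmoid, element-wise
a2 = np.dot(z1, W2) + b2           # (100, 50) x (50, 10) -> (100, 10)

# Row-wise softmax, shifted by the row maximum for numerical stability
a2 = a2 - a2.max(axis=1, keepdims=True)
y = np.exp(a2) / np.exp(a2).sum(axis=1, keepdims=True)

print(W1.shape, b1.shape, W2.shape, b2.shape)
print(a1.shape, y.shape)
```

Each row of `y` is a probability distribution over the 10 classes, which is why `accuracy` takes the argmax along axis 1.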

Implementing the training

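Introduce the concept of an epoch. An epoch is a unit: one epoch is the number of updates after which every item in the training set has been used once. For a training set of 10000 items learned in mini-batches of 100, one epoch is 100 iterations.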
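The arithmetic can be written out as a short sketch. The variable names are chosen for this illustration; 60000 is the size of the MNIST training set, and the batch size and iteration count are example settings.

```python
# Example from the text: 10000 training items, mini-batches of 100
iters_per_epoch_example = 10000 // 100  # 100 iterations make up one epoch

# MNIST-sized example: 60000 training images, mini-batches of 100
train_size = 60000
batch_size = 100
iters_num = 10000  # total number of parameter updates

iter_per_epoch = max(train_size // batch_size, 1)  # 600 iterations per epoch
num_epochs = iters_num / iter_per_epoch            # between 16 and 17 epochs

print(iters_per_epoch_example, iter_per_epoch, num_epochs)
```

Integer division keeps `iter_per_epoch` an integer, so it can be used directly in a check such as `i % iter_per_epoch == 0`.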

# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow importing files from the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from two_layer_net import TwoLayerNet

# load the data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

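iters_num = 10000  # number of iterations, set as appropriate
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1 # step size

train_loss_list = [] # records the loss as it decreases
train_acc_list = [] # records the training accuracy as it increases
test_acc_list = []

iter_per_epoch = max(train_size // batch_size, 1) # iterations per epoch, as an integer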

for i in range(iters_num):
    # sample a mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]
    
    # compute the gradients
    #grad = network.numerical_gradient(x_batch, t_batch)
    grad = network.gradient(x_batch, t_batch) # recomputed at every iteration
    
    # update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]
    
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)
    
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print("train acc : {:.4f}, test acc : {:.4f}".format(train_acc, test_acc))

# plot the results
markers = {'train': 'o', 'test': 's'}
x = np.arange(len(train_acc_list))
plt.plot(x, train_acc_list, label='train acc')
plt.plot(x, test_acc_list, label='test acc', linestyle='--')
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()

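Once per epoch, the accuracy on the training set and on the test set is computed and plotted. The two curves stay close together, which indicates that no overfitting occurred.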

Figure: no overfitting occurred

Reading notes 04 on 斋藤康毅's 《深度学习入门》: how neural networks learn