
Python Deep Learning Projects (Part 3)

Original: annas-archive.org/md5/2f844d9ad75aa257caad1025fa00b786

Translator: 飞龙

License: CC BY-NC-SA 4.0

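Chapter 9: Object Detection Using OpenCV and TensorFlow

Welcome to the second chapter focusing on computer vision in Python Deep Learning Projects (just to start with a data science pun!). Let's think back on what we accomplished in Chapter 8, Handwritten Digits Classification Using ConvNets, where we trained an image classifier with a convolutional neural network (CNN) to accurately classify handwritten digits in an image. What was a key characteristic of the raw data, and what was our business objective? The data was less complex than it could have been, because each image contained only one handwritten digit, and our goal was to accurately assign a digit label to the image.

What would have happened if each image had contained multiple handwritten digits? What if we had a video of the digits? What if we wanted to identify where the digits are in the image? These questions represent the challenges that real-world data embodies, and they drive data science innovation towards new models and capabilities.

Let's extend our questions and imagination to the next (hypothetical) business use case for our Python deep learning project: we will build, train, and test an object detection and classification model to be used by an automobile manufacturer for its new line of self-driving cars. Autonomous vehicles need fundamental computer vision capabilities, which you and I have organically, by virtue of our physiology and experiential learning. We humans can examine our field of vision and report whether a specific item is present and where that item is in relation to other objects. So, if I were to ask you whether you see a chicken, you would likely say no, unless you live on a farm and are looking out of your window. But if I were to ask you whether you see a keyboard, you would likely say yes, and you could even say that the keyboard is distinct from other objects and sits in front of the wall before you.

This is not a trivial task for a computer. As deep learning engineers, you will learn the intuition and model architecture that empower you to build a powerful object detection and classification engine, which we can envision being tested for use in autonomous vehicles. The data inputs in this chapter are more complex than in our previous projects, and the outcomes, when we handle them correctly, will be all the more impressive.

So, let's get started!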

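Object detection intuition

When you need your application to find and name objects in an image, you need to build a deep neural network for object detection. The visual domain is very complex: still images and the frames captured by video cameras contain many objects. Object detection is used in manufacturing for production-line process automation; in autonomous vehicles for sensing pedestrians, other vehicles, the road, and signage; and, of course, in facial recognition. Computer vision solutions based on machine learning and deep learning require you, the data scientist, to build, train, and evaluate models that can differentiate one object from another and then accurately classify the detected objects.

As you've seen in our other projects, CNNs are very powerful models for image data. We need to look at extensions of the base architectures that perform so well on single (still) images to see which approaches work best on complex images and video.

Recently, progress has been made with the following networks: Faster R-CNN, Region-based Fully Convolutional Network (R-FCN), MultiBox, Single Shot MultiBox Detector (SSD), and You Only Look Once (YOLO). We have seen the value of these models in common consumer applications such as Google Photos and Pinterest Visual Search. Some of them are even lightweight and fast enough to perform well on mobile devices.

Recent research in this field can be followed through this list of references:

PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection, arXiv:1608.08021
R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR, 2014.
SPP: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, ECCV, 2014.
Fast R-CNN, arXiv:1504.08083.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497.
R-CNN minus R, arXiv:1506.06981.
End-to-end people detection in crowded scenes, arXiv:1506.04878.
YOLO: You Only Look Once: Unified, Real-Time Object Detection, arXiv:1506.02640
Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks
Deep Residual Network: Deep Residual Learning for Image Recognition
R-FCN: Object Detection via Region-based Fully Convolutional Networks
SSD: Single Shot MultiBox Detector, arXiv:1512.02325

Also, here is a timeline of object detection developments from 1999 to 2017:

Figure 9.1: Timeline of object detection developments from 1999 to 2017

The files for this chapter can be found at github.com/PacktPublishing/Python-Deep-Learning-Projects/tree/master/Chapter09.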

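Improvements in object detection models

Object detection and classification has long been a subject of study, and the models in use build on the great successes of earlier research. A brief history of progress starts with the Histogram of Oriented Gradients (HOG) features, a computer vision technique developed by Navneet Dalal and Bill Triggs in 2005.

HOG features were fast and performed well. The great success of deep learning then made CNNs the more accurate classifiers, thanks to their deep networks. However, CNNs at the time were too slow by comparison.

The solution was to take advantage of the CNN's better classification ability and to improve its speed with the selective search paradigm, which resulted in R-CNN. Reducing the number of bounding boxes did improve speed, but not enough to meet expectations.

SPP-net was a proposed solution in which the CNN representation is computed once for the whole image and then used to derive the CNN representation of each subsection generated by selective search. Selective search generates candidate object locations by looking at pixel intensities, color, image texture, and a measure of insideness. These candidate regions are then fed into the CNN model for classification.

This improvement, in turn, led to a model named Fast R-CNN, which is trained end to end and thereby fixes the main problems of SPP-net and R-CNN. The technique was advanced further by Faster R-CNN, in which replacing selective search with a small region proposal CNN worked very well.

Here is a quick overview of the Faster R-CNN object detection pipeline:

A quick benchmark comparison of the R-CNN versions discussed previously shows the following:

	R-CNN	Fast R-CNN	Faster R-CNN
Average response time	~50 sec	~2 sec	~0.2 sec
Speed boost	1x	25x	250x

The performance improvement is impressive, and Faster R-CNN is one of the most accurate and fastest object detection algorithms for real-time applications. Other recent powerful alternatives include the YOLO model, which we will look into in detail later in this chapter.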

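Object detection using OpenCV

Let's start our project with a basic, traditional implementation using Open Source Computer Vision (OpenCV). This library is primarily targeted at real-time applications that need computer vision capabilities.

OpenCV has API wrappers in several languages, including C, C++, and Python. A good way forward is to build a quick prototype with the Python wrapper, or in any language you are comfortable with, and, once the code is done, to rewrite it in C/C++ for production.

In this chapter, we will use the Python wrapper to create our initial object detection module.

So, let's do it.

A handcrafted red object detector

In this section, we will learn how to create a feature extractor that can detect any red object in a given image using image processing techniques such as erosion, dilation, and blurring.

Installing dependencies

First, we need to install OpenCV, which we do with this simple pip command: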

pip install opencv-python

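Then we will import it, along with other modules for visualization and matrix operations:

# __future__ imports must be the first statement in a module (only needed on Python 2)
from __future__ import division
import cv2
import matplotlib
from matplotlib import colors
from matplotlib import pyplot as plt
import numpy as np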

In addition, let's define some helper functions for plotting images and contours:

# Defining some helper function
def show(image):
    # Figure size in inches
    plt.figure(figsize=(15, 15))

    # Show image, with nearest neighbour interpolation
    plt.imshow(image, interpolation='nearest')

def show_hsv(hsv):
    # matplotlib expects RGB channel order, so convert HSV to RGB (not BGR)
    rgb = cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
    show(rgb)

def show_mask(mask):
    plt.figure(figsize=(10, 10))
    plt.imshow(mask, cmap='gray')

def overlay_mask(mask, image):
    rgb_mask = cv2.cvtColor(mask, cv2.COLOR_GRAY2RGB)
    img = cv2.addWeighted(rgb_mask, 0.5, image, 0.5, 0)
    show(img)
    # Return the blended image so that callers can draw on it
    return img

def find_biggest_contour(image):
    image = image.copy()
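    # findContours returns three values in OpenCV 3.x and two in OpenCV 4.x;
    # the last two are (contours, hierarchy) in both
    contours, hierarchy = cv2.findContours(image, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)[-2:]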

    contour_sizes = [(cv2.contourArea(contour), contour) for contour in contours]
    biggest_contour = max(contour_sizes, key=lambda x: x[0])[1]

    mask = np.zeros(image.shape, np.uint8)
    cv2.drawContours(mask, [biggest_contour], -1, 255, -1)
    return biggest_contour, mask

def circle_countour(image, countour):
    image_with_ellipse = image.copy()
    ellipse = cv2.fitEllipse(countour)

    cv2.ellipse(image_with_ellipse, ellipse, (0,255,0), 2)
    return image_with_ellipse

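Exploring the image data

The first thing to do in any data science problem is to explore and understand the data, which helps make the objective clear. So, let's first load the image and examine its properties, such as the color spectrum and the dimensions: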

# Loading image and display
image = cv2.imread('./ferrari.png')
show(image)

Following is the output:

Since OpenCV stores images in memory in blue-green-red (BGR) order, we need to convert the image to red-green-blue (RGB):

image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
show(image)

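Following is the output:

Figure 9.2: The raw input image in RGB color format.

Normalizing the image

We will scale down the image dimensions, for which we will use the cv2.resize() function: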

max_dimension = max(image.shape)
scale = 700/max_dimension
image = cv2.resize(image, None, fx=scale,fy=scale)

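Next, we blur the image to smooth out pixel-level variation, using a Gaussian kernel. Gaussian filters are widely used for many operations; blurring is one of them, and it reduces noise and evens out the image. The following code performs the blur: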

image_blur = cv2.GaussianBlur(image, (7, 7), 0)

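Then we convert the RGB image into the HSV color space, which separates hue from saturation and brightness (value) and makes it easier to extract features of the image by color: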

image_blur_hsv = cv2.cvtColor(image_blur, cv2.COLOR_RGB2HSV)

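Following is the output:

Figure 9.3: The raw input image in HSV color format.

Preparing a mask

We need to create a mask that detects a specific color spectrum; let's say red. In OpenCV's 8-bit HSV representation, hue runs from 0 to 179 and red sits at both ends of that range, so we create two masks, one for each end, both restricted to sufficiently saturated and bright pixels:

# reds at the low end of the hue range (H from 0 to 10)
min_red = np.array([0, 100, 80])
max_red = np.array([10, 255, 255])
mask1 = cv2.inRange(image_blur_hsv, min_red, max_red)

# reds at the high end of the hue range (H from 170 to 180)
min_red = np.array([170, 100, 80])
max_red = np.array([180, 255, 255])
mask2 = cv2.inRange(image_blur_hsv, min_red, max_red)

# Combine both masks; the hue ranges are disjoint, so the sum never overflows
mask = mask1 + mask2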
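To make the two-range idea concrete, here is a minimal NumPy-only sketch of the same thresholding logic. It is an illustration rather than the chapter's code: the function name red_hue_mask and the three sample pixels are made up for this example, and it assumes OpenCV-style HSV values (hue in 0 to 179).

```python
import numpy as np

def red_hue_mask(hsv, sat_min=100, val_min=80):
    """Return a uint8 mask (0 or 255) marking red pixels in an OpenCV-style HSV image."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    # Red wraps around the hue axis, so it appears at both ends of the range
    low_red = h <= 10
    high_red = h >= 170
    # Ignore pixels that are too washed out or too dark to count as red
    strong = (s >= sat_min) & (v >= val_min)
    return ((low_red | high_red) & strong).astype(np.uint8) * 255

# Three hypothetical HSV pixels: red near hue 0, red near hue 179, and green
pixels = np.array([[[5, 200, 200], [175, 200, 200], [60, 200, 200]]], dtype=np.uint8)
mask = red_hue_mask(pixels)
print(mask.tolist())
```

Both red pixels are selected even though their hue values lie far apart, whereas a single hue range would miss one of them.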

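Here is what our mask looks like:

Post-processing of the mask

Once we have created the mask, we need to perform some morphological operations, which are basic image processing operations for analyzing and processing geometric structures.

First, we create a kernel, which the morphological operations will apply to the input image: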

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))

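Closing: dilation followed by erosion. It is helpful for closing small holes inside the foreground objects or small black points on the object.

Now let's perform the closing operation on the mask: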

mask_closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

Opening: erosion followed by dilation. It is used for removing noise.

Then we perform the opening operation:

mask_clean = cv2.morphologyEx(mask_closed, cv2.MORPH_OPEN, kernel)
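The effect of closing followed by opening can be seen on a tiny example. The following is a simplified NumPy sketch, not OpenCV's implementation: it assumes a fixed 3x3 square kernel and treats everything outside the image as background, and the helper names dilate and erode are chosen for this illustration only.

```python
import numpy as np

def dilate(mask):
    """3x3 binary dilation: a pixel is set if any pixel in its neighbourhood is set."""
    rows, cols = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + rows, dx:dx + cols]
    return out

def erode(mask):
    """3x3 binary erosion: a pixel survives only if its whole neighbourhood is set."""
    rows, cols = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = np.ones_like(mask)
    for dy in range(3):
        for dx in range(3):
            out &= padded[dy:dy + rows, dx:dx + cols]
    return out

# A 5x5 object with a one-pixel hole, plus an isolated noise pixel
mask = np.zeros((9, 9), dtype=bool)
mask[1:6, 1:6] = True
mask[3, 3] = False   # hole inside the object
mask[7, 7] = True    # noise speck

closed = erode(dilate(mask))      # closing fills the hole
cleaned = dilate(erode(closed))   # opening removes the speck
print(cleaned.astype(int))
```

After both steps only the solid 5x5 object remains, which is the behaviour the closing and opening calls above aim for on the real mask.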

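Following is the output:

Figure 9.4: The output of the morphological closing and opening operations (left), which we combine to get the final processed mask (right).

In the preceding screenshot, you can see (on the left) how the morphological operations change the structure of the mask, and how combining the two operations (on the right) gives a cleaner, denoised structure.

Applying the mask

It's time to use the mask we created to extract the object from the image. First, we find the biggest contour using the helper function; this is the largest region of the object we need to extract. Then we overlay the mask on the image and draw an ellipse around the extracted object: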

# Extract biggest bounding box
big_contour, red_mask = find_biggest_contour(mask_clean)

# Apply mask
overlay = overlay_mask(red_mask, image)

# Draw bounding box
circled = circle_countour(overlay, big_contour)

show(circled)

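Following is the output:

Figure 9.5: The red region (the car body) detected in the image, with an ellipse drawn around it.

Voila! We successfully extracted the object and drew a boundary around it using simple image processing techniques.

Object detection using deep learning

In this section, we will learn how to build a world-class object detection module without relying much on traditional handcrafted techniques. We will use a deep learning approach, which is powerful enough to extract features from raw images automatically and then use those features for classification and detection.

First, we will build an object detector with a ready-made Python library that can use most of the state-of-the-art pretrained models; afterwards, we will learn how to implement a fast and accurate object detector with the YOLO architecture.

Quick implementation of object detection

Object detection saw wide adoption after 2012, as the industry shifted towards deep learning. Accurate and increasingly fast models such as R-CNN, Fast R-CNN, Faster R-CNN, and RetinaNet, along with fast yet highly accurate models such as SSD and YOLO, are all in production use today. In this section, we will use fully featured, ready-made feature extractors from a Python library, which take only a few lines of code. We will also discuss a production-grade setup.

So, let's get started.

Installing all the dependencies

This is the same step we performed in the previous chapters. First, let's install all the dependencies. Here, we use a Python module called ImageAI (github.com/OlafenwaMoses/ImageAI), which is an effective way to quickly build your own object detection application: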

pip install tensorflow
pip install keras
pip install numpy
pip install scipy
pip install opencv-python
pip install pillow
pip install matplotlib
pip install h5py
# Here we are installing ImageAI
pip3 install https://github.com/OlafenwaMoses/ImageAI/releases/download/2.0.2/imageai-2.0.2-py3-none-any.whl

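We will use a Python 3.x environment to run this module.

For this implementation, we will use a pretrained ResNet model trained on the COCO dataset (cocodataset.org/#home), a large-scale object detection, segmentation, and captioning dataset. You can also use other pretrained models, such as the following: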

DenseNet-BC-121-32.h5 (github.com/OlafenwaMoses/ImageAI/releases/download/1.0/DenseNet-BC-121-32.h5) (31.7 MB)
inception_v3_weights_tf_dim_ordering_tf_kernels.h5 (github.com/OlafenwaMoses/ImageAI/releases/download/1.0/inception_v3_weights_tf_dim_ordering_tf_kernels.h5) (91.7 MB)
resnet50_coco_best_v2.0.1.h5 (github.com/OlafenwaMoses/ImageAI/releases/download/1.0/resnet50_coco_best_v2.0.1.h5) (146 MB)
resnet50_weights_tf_dim_ordering_tf_kernels.h5 (github.com/OlafenwaMoses/ImageAI/releases/download/1.0/resnet50_weights_tf_dim_ordering_tf_kernels.h5) (98.1 MB)
squeezenet_weights_tf_dim_ordering_tf_kernels.h5 (github.com/OlafenwaMoses/ImageAI/releases/download/1.0/squeezenet_weights_tf_dim_ordering_tf_kernels.h5) (4.83 MB)
yolo-tiny.h5 (github.com/OlafenwaMoses/ImageAI/releases/download/1.0/yolo-tiny.h5) (33.9 MB)
yolo.h5 (github.com/OlafenwaMoses/ImageAI/releases/download/1.0/yolo.h5) (237 MB)

To download the pretrained model, use the following command:

wget https://github.com/OlafenwaMoses/ImageAI/releases/download/1.0/resnet50_coco_best_v2.0.1.h5

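Implementation

Now that we have all the dependencies and the pretrained model ready, we will implement a state-of-the-art object detection model. We import ImageAI's ObjectDetection class with the following code: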

from imageai.Detection import ObjectDetection
import os
model_path = os.getcwd()

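Then we create an instance of ObjectDetection and set the model type to RetinaNet with setModelTypeAsRetinaNet(). Next, we set the path of the downloaded ResNet model and call the loadModel() function: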

object_detector = ObjectDetection()
object_detector.setModelTypeAsRetinaNet()
object_detector.setModelPath( os.path.join(model_path , "resnet50_coco_best_v2.0.1.h5"))
object_detector.loadModel()

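Once the model is loaded into memory, we can feed it new images in any common image format, such as JPEG or PNG. The function also places no constraint on the image size, so you can use data of any dimension and the model will handle it internally. We use detectObjectsFromImage() to feed in the image. This method saves an annotated copy of the image to output_image_path and returns more information about each detected object, such as its bounding box coordinates, its label, and its confidence score.

Following are some of the images used as input to the model for object detection:

Figure 9.6: Since I was traveling to Asia (Malaysia/Langkawi) while writing this chapter, I decided to try some real images that I captured on the trip.

The following code feeds an image into the model: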

object_detections = object_detector.detectObjectsFromImage(input_image=os.path.join(model_path , "image.jpg"), output_image_path=os.path.join(model_path , "imagenew.jpg"))

Further, we iterate over the object_detections list to read all the objects the model predicted, with their respective confidence scores:

for eachObject in object_detections:
    print(eachObject["name"] , " : " , eachObject["percentage_probability"])
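Since every detection is a dictionary with a name and a percentage_probability, the results can be post-processed with plain Python. The following sketch keeps only confident detections and sorts them; the helper filter_detections, the 60% cut-off, and the sample list are assumptions made for illustration and are not part of ImageAI.

```python
def filter_detections(detections, min_probability=60.0):
    """Keep detections at or above min_probability (in percent), most confident first."""
    kept = [d for d in detections if d["percentage_probability"] >= min_probability]
    return sorted(kept, key=lambda d: d["percentage_probability"], reverse=True)

# Hypothetical detections in the same shape as the entries printed above
sample = [
    {"name": "person", "percentage_probability": 54.687},
    {"name": "person", "percentage_probability": 75.93},
    {"name": "bird", "percentage_probability": 81.139},
]

for detection in filter_detections(sample):
    print(detection["name"], ":", detection["percentage_probability"])
```

Raising or lowering min_probability trades missed objects against false alarms, so the right value depends on the application.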

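Following is how the results look:

Figure 9.7: The results from the object detection model, with bounding boxes drawn around the detected objects. The results include the name of each object and its confidence score.

So, we can see that the pretrained model performed well with very few lines of code.

Deployment

Now that we have all the basic code ready, let's deploy the ObjectDetection module into production. In this section, we will write a RESTful service that accepts an image as input and returns the detected objects as the response.

We will define a POST function that accepts image files with the PNG, JPG, JPEG, and GIF extensions. The path of the uploaded image is passed to the ObjectDetection module, which performs the detection; the results are returned as JSON: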

from flask import Flask, request, jsonify, redirect
import os , json
from imageai.Detection import ObjectDetection

model_path = os.getcwd()

PRE_TRAINED_MODELS = ["resnet50_coco_best_v2.0.1.h5"]

# Creating ImageAI objects and loading models

object_detector = ObjectDetection()
object_detector.setModelTypeAsRetinaNet()
object_detector.setModelPath( os.path.join(model_path , PRE_TRAINED_MODELS[0]))
object_detector.loadModel()
# Warm-up detection at start-up; this expects a sample.jpg file in the working directory
object_detections = object_detector.detectObjectsFromImage(input_image='sample.jpg')

# Define model paths and the allowed file extensions
UPLOAD_FOLDER = model_path
ALLOWED_EXTENSIONS = set(['png', 'jpg', 'jpeg', 'gif'])

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER

def allowed_file(filename):
    return '.' in filename and \
           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

@app.route('/predict', methods=['POST'])
def upload_file():
    # check if the post request has the file part
    if 'file' not in request.files:
        return jsonify({'error': 'No file part'}), 400
    file = request.files['file']
    # an empty filename means that no file was selected
    if file.filename == '':
        return jsonify({'error': 'No selected file'}), 400
    # reject unsupported extensions before saving, so file_path is always defined below
    if not allowed_file(file.filename):
        return jsonify({'error': 'File type not allowed'}), 400

    filename = file.filename
    file_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
    file.save(file_path)

    try:
        object_detections = object_detector.detectObjectsFromImage(input_image=file_path)
    except Exception as ex:
        return jsonify(str(ex))

    # collect [label, confidence] pairs for the response
    resp = []
    for eachObject in object_detections:
        resp.append([eachObject["name"],
                     round(eachObject["percentage_probability"], 3)])

    return json.dumps(dict(enumerate(resp)))

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=4445)

Save the file as object_detection_ImageAI.py and run the following command to start the web service:

python object_detection_ImageAI.py

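Following is the output:

Figure 9.8: Terminal output after successfully starting the web service.

In another terminal, you can now try calling the API with the following command: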

curl -X POST \
 http://0.0.0.0:4445/predict \
 -H 'content-type: multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW' \
 -F file=@/Users/rahulkumar/Downloads/IMG_1651.JPG

Following is the response output:

{
 "0": ["person",54.687],
 "1": ["person",56.77],
 "2": ["person",55.837],
 "3": ["person",75.93],
 "4": ["person",72.956],
 "5": ["bird",81.139]
}
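On the client side, a response of this shape can be decoded with the standard json module. The sketch below is only an illustration: the helper summarize and the shortened response_text are made up for this example, and the values are copied from the response shown above.

```python
import json
from collections import Counter

def summarize(response_text):
    """Count detections per label and return the most confident [label, score] pair."""
    detections = json.loads(response_text)
    counts = Counter(label for label, _ in detections.values())
    best = max(detections.values(), key=lambda item: item[1])
    return counts, best

# A shortened, hypothetical copy of the service response
response_text = '{"0": ["person", 54.687], "1": ["person", 56.77], "2": ["bird", 81.139]}'

counts, best = summarize(response_text)
print(dict(counts))
print(best)
```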

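So, this is awesome; with just a few hours of work, you have a production-grade object detection module that is close to the state of the art.

Real-time object detection using YOLOv2

A great advance in object detection and classification came with an approach in which you only look once (YOLO) at the input image. In this single pass, the goal is to set the corner coordinates of the bounding box to be drawn around each detected object and to classify the object, with the whole task handled by a regression model. The approach is able to avoid false positives because it takes the context of the whole image into account, rather than just a smaller section as in the region proposal methods described earlier. The convolutional neural network (CNN), shown as follows, can sweep over the image in a single pass, and so it is fast enough for applications with real-time processing requirements.

YOLOv2 predicts N bounding boxes in each cell of an S×S grid, established in a preceding step, and associates a confidence level with the classification of the object in each box.

Figure 9.9: Overview of how YOLO works. The input image is divided into a grid and passed through the detection process, which produces a large number of bounding boxes that are then filtered by applying a threshold.

The outcome of this process is a total of S×S×N candidate boxes. For a great proportion of these boxes, you will get quite low confidence scores; by applying a low threshold (30% in this case), you can eliminate most of the wrongly classified objects, as shown in the figure.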
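The arithmetic behind S×S×N and the thresholding step can be sketched in a few lines. The values S = 13 and N = 5 match the grid size and box count used in this chapter's model configuration; the confidence scores are random placeholders invented for the illustration, not real network output.

```python
import numpy as np

S, N = 13, 5                  # grid size and boxes per cell
total_boxes = S * S * N       # number of candidate boxes per image
print(total_boxes)

# Hypothetical confidence scores: almost every box has a low score
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 0.2, size=(S, S, N))
scores[6, 6, 2] = 0.90        # two boxes that actually contain objects
scores[3, 10, 0] = 0.45

# Keep only the boxes whose confidence reaches the 30% threshold
keep = scores >= 0.3
print(int(keep.sum()), "boxes survive out of", total_boxes)
```

With 13×13×5 = 845 candidates, the threshold discards everything except the two confident boxes in this toy example.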

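In this section, we will use a pretrained YOLOv2 model for object detection and classification.

Preparing the dataset

In this part, we will explore data preparation with the existing COCO dataset and with a custom dataset. If you want to train the YOLO model on many classes, follow the instructions in the pre-existing dataset section; if you want to build your own custom object detector, follow the instructions in the custom dataset section.

Using the pre-existing COCO dataset

For this implementation, we will use the COCO dataset, a great resource for training YOLOv2 for large-scale image detection, segmentation, and captioning. Download the dataset from cocodataset.org by running the following commands in a terminal:

Get the training dataset: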

wget http://images.cocodataset.org/zips/train2014.zip

Get the validation dataset:

wget http://images.cocodataset.org/zips/val2014.zip

Get the train and validation annotations:

wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip

Now, let's convert the annotations from COCO format to VOC format:

Install Baker:

pip install baker

Create the folders to store the images and annotations:

mkdir images annotations

Unzip train2014.zip and val2014.zip under the images folder:

unzip train2014.zip -d ./images/
unzip val2014.zip -d ./images/

Unzip annotations_trainval2014.zip under the annotations folder:

unzip annotations_trainval2014.zip -d ./annotations/

Create folders to store the converted data, and then run the conversion:

mkdir output
mkdir output/train
mkdir output/val

python coco2voc.py create_annotations /TRAIN_DATA_PATH train /OUTPUT_FOLDER/train
python coco2voc.py create_annotations /TRAIN_DATA_PATH val /OUTPUT_FOLDER/val

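The final converted folder structure looks as follows:

Figure 9.10: Illustration of the COCO data extraction and formatting process

This establishes a one-to-one correspondence between images and annotations. When the validation set is empty, the training set is automatically split into training and validation sets at a ratio of 8:2.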
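An 8:2 split of this kind can be sketched as follows. This is an illustration of the idea only, not the code the chapter's scripts use: the function name split_train_valid, the fixed seed, and the shuffling step are assumptions.

```python
import random

def split_train_valid(items, train_fraction=0.8, seed=42):
    """Shuffle the items reproducibly and split them into training and validation lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(train_fraction * len(items))
    return items[:cut], items[cut:]

# Ten hypothetical sample identifiers
train, valid = split_train_valid(range(10))
print(len(train), "training samples,", len(valid), "validation samples")
```

Shuffling before the split avoids a validation set that only contains the last files in directory order.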

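As a result, we will have two folders, ./images and ./annotations, for training purposes.

Using a custom dataset

Now, if you want to build an object detector for your specific use case, you need to scrape around 100 to 200 images from the web and annotate them. There are many annotation tools available online, such as LabelImg (github.com/tzutalin/labelImg) and the Fast Image Data Annotation Tool (FIAT) (github.com/christopher5106/FastAnnotationTool).

To help you get started with a custom object detector, we have provided some sample images with their annotations. Look in the repository folder named Chapter09/yolo/new_class/.

Each image has a corresponding annotation, as shown in the following figure:

Figure 9.11: The relationship between an image and its annotation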
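Tools such as LabelImg can save annotations as Pascal VOC XML files, one per image. The following sketch shows how such a file could be read with the standard library; the sample XML, its file name, and its coordinates are invented for this example, and the chapter's own parse_annotation helper may work differently.

```python
import xml.etree.ElementTree as ET

# A hypothetical VOC-style annotation with a single labelled object
SAMPLE_XML = """<annotation>
  <filename>00001.jpg</filename>
  <size><width>450</width><height>319</height><depth>3</depth></size>
  <object>
    <name>kangaroo</name>
    <bndbox><xmin>233</xmin><ymin>89</ymin><xmax>386</xmax><ymax>262</ymax></bndbox>
  </object>
</annotation>"""

def parse_voc(xml_text):
    """Return the file name, image size, and labelled boxes of one annotation."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        objects.append({
            "name": obj.findtext("name"),
            "xmin": int(box.findtext("xmin")),
            "ymin": int(box.findtext("ymin")),
            "xmax": int(box.findtext("xmax")),
            "ymax": int(box.findtext("ymax")),
        })
    return {
        "filename": root.findtext("filename"),
        "width": int(root.findtext("size/width")),
        "height": int(root.findtext("size/height")),
        "objects": objects,
    }

annotation = parse_voc(SAMPLE_XML)
print(annotation["filename"], annotation["objects"])
```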

We also need to download the pretrained weights from pjreddie.com/darknet/yolo/; we will use them to initialize the model and then train the custom object detector on top of them:

wget https://pjreddie.com/media/files/yolo.weights

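Installing all the dependencies:

We will use the Keras API with a TensorFlow backend to build the YOLOv2 architecture. Let's install all the dependencies (note that the pip package providing cv2 is named opencv-python):

pip install keras tensorflow tqdm numpy opencv-python imgaug

Following is the code to import them: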

from keras.models import Sequential, Model
from keras.layers import Reshape, Activation, Conv2D, Input, MaxPooling2D, BatchNormalization, Flatten, Dense, Lambda
from keras.layers.advanced_activations import LeakyReLU
from keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
from keras.optimizers import SGD, Adam, RMSprop
from keras.layers.merge import concatenate
import matplotlib.pyplot as plt
import keras.backend as K
import tensorflow as tf
import imgaug as ia
from tqdm import tqdm
from imgaug import augmenters as iaa
import numpy as np
import pickle
import os, cv2
from preprocessing import parse_annotation, BatchGenerator
from utils import WeightReader, decode_netout, draw_boxes

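#Setting GPU configs
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# "0" exposes the first GPU; an empty string would hide all GPUs and force training on the CPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

It is always recommended to use a GPU to train any YOLO model.

Configuring the YOLO model

YOLO models are designed with a set of hyperparameters and other configuration. This configuration defines the type of model to build, along with other parameters such as the input image size and the list of anchors. At the moment you have two options: tiny YOLO and full YOLO. The following code defines the type of model to be built: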

# List of object that YOLO model will learn to detect from COCO dataset 
#LABELS = ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']

# Labels for the custom curated dataset (the name must be LABELS, which the code below uses).
LABELS = ['kangaroo']
IMAGE_H, IMAGE_W = 416, 416
GRID_H,  GRID_W  = 13 , 13
BOX              = 5
CLASS            = len(LABELS)
CLASS_WEIGHTS    = np.ones(CLASS, dtype='float32')
OBJ_THRESHOLD    = 0.3
NMS_THRESHOLD    = 0.3
ANCHORS          = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]

NO_OBJECT_SCALE  = 1.0
OBJECT_SCALE     = 5.0
COORD_SCALE      = 1.0
CLASS_SCALE      = 1.0

BATCH_SIZE       = 16
WARM_UP_BATCHES  = 0
TRUE_BOX_BUFFER  = 50
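OBJ_THRESHOLD and NMS_THRESHOLD control how raw predictions are pruned: boxes below the objectness threshold are dropped, and of any group of heavily overlapping boxes only the most confident one is kept (non-maximum suppression). The following is a minimal, class-agnostic sketch of that idea; the function names, box format, and sample boxes are assumptions for illustration, and the chapter's decode_netout helper may differ in detail.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, obj_threshold=0.3, nms_threshold=0.3):
    """Return the indices of the boxes to keep, most confident first."""
    # Drop low-confidence boxes, then visit the rest from best to worst
    order = sorted((i for i, s in enumerate(scores) if s >= obj_threshold),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in kept):
            kept.append(i)
    return kept

# Hypothetical boxes: two overlapping, one separate, one with a low score
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, and the last box never passes the objectness threshold.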

Configure the paths of the pretrained model and the images, as in the following code:

wt_path = 'yolo.weights'                      
train_image_folder = '/new_class/images/'
train_annot_folder = '/new_class/anno/' 
valid_image_folder = '/new_class/images/' 
valid_annot_folder = '/new_class/anno/'

Defining the YOLO v2 model

Now, let's have a look at the architecture of the YOLOv2 model:

# the function to implement the organization layer (thanks to github.com/allanzelener/YAD2K)
def space_to_depth_x2(x):
    return tf.space_to_depth(x, block_size=2)
input_image = Input(shape=(IMAGE_H, IMAGE_W, 3))
true_boxes  = Input(shape=(1, 1, 1, TRUE_BOX_BUFFER , 4))

# Layer 1
x = Conv2D(32, (3,3), strides=(1,1), padding='same', name='conv_1', use_bias=False)(input_image)
x = BatchNormalization(name='norm_1')(x)
x = LeakyReLU(alpha=0.1)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)

# Layer 2
x = Conv2D(64, (3,3), strides=(1,1), padding='same', name='conv_2', use_bias=False)(x)
x = BatchNormalization(name='norm_2')(x)
x = LeakyReLU(alpha=0.1)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)

# Layer 3
# Layer 4
# Layer 23
# For the entire architecture, please refer to the yolo/Yolo_v2_train.ipynb notebook here: https://github.com/PacktPublishing/Python-Deep-Learning-Projects/blob/master/Chapter09/yolo/Yolo_v2_train.ipynb

The network architecture we just created can be found here: github.com/PacktPublishing/Python-Deep-Learning-Projects/blob/master/Chapter09/Network_architecture/network_architecture.png

Following is the output:

Total params: 50,983,561
Trainable params: 50,962,889
Non-trainable params: 20,672

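Training the model

Following are the steps to train the model:

Load the weights that we downloaded and use them to initialize the model: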

weight_reader = WeightReader(wt_path)
weight_reader.reset()
nb_conv = 23
for i in range(1, nb_conv+1):
    conv_layer = model.get_layer('conv_' + str(i))

    if i < nb_conv:
        norm_layer = model.get_layer('norm_' + str(i))

        size = np.prod(norm_layer.get_weights()[0].shape)

        beta  = weight_reader.read_bytes(size)
        gamma = weight_reader.read_bytes(size)
        mean  = weight_reader.read_bytes(size)
        var   = weight_reader.read_bytes(size)

        weights = norm_layer.set_weights([gamma, beta, mean, var])       

    if len(conv_layer.get_weights()) > 1:
        bias   = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[1].shape))
        kernel = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[0].shape))
        kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
        kernel = kernel.transpose([2,3,1,0])
        conv_layer.set_weights([kernel, bias])
    else:
        kernel = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[0].shape))
        kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
        kernel = kernel.transpose([2,3,1,0])
        conv_layer.set_weights([kernel])

Randomize the weights of the last layer:

layer   = model.layers[-4] # the last convolutional layer
weights = layer.get_weights()

new_kernel = np.random.normal(size=weights[0].shape)/(GRID_H*GRID_W)
new_bias   = np.random.normal(size=weights[1].shape)/(GRID_H*GRID_W)

layer.set_weights([new_kernel, new_bias])

Create the generator configuration, as in the following code:

generator_config = {
    'IMAGE_H' : IMAGE_H, 
    'IMAGE_W' : IMAGE_W,
    'GRID_H' : GRID_H, 
    'GRID_W' : GRID_W,
    'BOX' : BOX,
    'LABELS' : LABELS,
    'CLASS' : len(LABELS),
    'ANCHORS' : ANCHORS,
    'BATCH_SIZE' : BATCH_SIZE,
    'TRUE_BOX_BUFFER' : 50,
}

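Create the training and validation batches (normalize is assumed to be the image-scaling helper defined in the chapter's notebook):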

# Training batch data
train_imgs, seen_train_labels = parse_annotation(train_annot_folder, train_image_folder, labels=LABELS)
train_batch = BatchGenerator(train_imgs, generator_config, norm=normalize)

# Validation batch data
valid_imgs, seen_valid_labels = parse_annotation(valid_annot_folder, valid_image_folder, labels=LABELS)
valid_batch = BatchGenerator(valid_imgs, generator_config, norm=normalize, jitter=False)

Set up the early stopping and checkpoint callbacks:

early_stop = EarlyStopping(monitor='val_loss', 
                           min_delta=0.001, 
                           patience=3, 
                           mode='min', 
                           verbose=1)

checkpoint = ModelCheckpoint('weights_coco.h5', 
                             monitor='val_loss', 
                             verbose=1, 
                             save_best_only=True, 
                             mode='min', 
                             period=1)

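Use the following code to train the model (custom_loss is assumed to be the YOLO loss function defined in the chapter's notebook):

log_root = os.path.expanduser('~/logs/')
# Create the log directory if it does not exist yet, so that os.listdir does not fail
if not os.path.exists(log_root):
    os.makedirs(log_root)
tb_counter = len([log for log in os.listdir(log_root) if 'coco_' in log]) + 1
tensorboard = TensorBoard(log_dir=log_root + 'coco_' + str(tb_counter),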
                          histogram_freq=0, 
                          write_graph=True, 
                          write_images=False)

optimizer = Adam(lr=0.5e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
#optimizer = SGD(lr=1e-4, decay=0.0005, momentum=0.9)
#optimizer = RMSprop(lr=1e-4, rho=0.9, epsilon=1e-08, decay=0.0)

model.compile(loss=custom_loss, optimizer=optimizer)

model.fit_generator(generator = train_batch, 
                    steps_per_epoch = len(train_batch), 
                    epochs = 100, 
                    verbose = 1,
                    validation_data = valid_batch,
                    validation_steps = len(valid_batch),
                    callbacks = [early_stop, checkpoint, tensorboard], 
                    max_queue_size = 3)

Following is the output of a short run with epochs set to 2:

Epoch 1/2
11/11 [==============================] - 315s 29s/step - loss: 3.6982 - val_loss: 1.5416

Epoch 00001: val_loss improved from inf to 1.54156, saving model to weights_coco.h5
Epoch 2/2
11/11 [==============================] - 307s 28s/step - loss: 1.4517 - val_loss: 1.0636

Epoch 00002: val_loss improved from 1.54156 to 1.06359, saving model to weights_coco.h5

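Following is the TensorBoard output for just two epochs:

Figure 9.12: The loss curve over two epochs

Evaluating the model

Once training is complete, let's perform prediction by feeding an input image into the model:

First, we load the trained weights into the model: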

model.load_weights("weights_coco.h5")

Now set the test image path and read the image:

input_image_path = "my_test_image.jpg"
image = cv2.imread(input_image_path)
dummy_array = np.zeros((1,1,1,1,TRUE_BOX_BUFFER,4))
plt.figure(figsize=(10,10))

Normalize the image:

input_image = cv2.resize(image, (416, 416))
input_image = input_image / 255.
input_image = input_image[:,:,::-1]
input_image = np.expand_dims(input_image, 0)

Make a prediction:

netout = model.predict([input_image, dummy_array])

boxes = decode_netout(netout[0], 
                      obj_threshold=OBJ_THRESHOLD,
                      nms_threshold=NMS_THRESHOLD,
                      anchors=ANCHORS, 
                      nb_class=CLASS)

image = draw_boxes(image, boxes, labels=LABELS)

plt.imshow(image[:,:,::-1]); plt.show()

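Here are some of the results:

Congratulations, you have developed a fast and reliable state-of-the-art object detector.

We learned how to build a world-class object detection model with the YOLO architecture, and the results look very promising. You can now also deploy the same model on mobile devices or a Raspberry Pi.

Image segmentation

Image segmentation is the process of classifying what is in an image at the pixel level. For example, given a picture with a person in it, separating the person from the rest of the image is image segmentation, and it is done using pixel-level information.

We will use the COCO dataset for image segmentation.

Before executing any of the SegNet scripts, do the following: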

cd SegNet
wget http://images.cocodataset.org/zips/train2014.zip
mkdir images
unzip train2014.zip -d images

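While executing the SegNet scripts, make sure that your current working directory is SegNet.

Importing all the dependencies

Make sure to restart the session before proceeding.

We will use numpy, pandas, keras, pylab, skimage, matplotlib, and pycocotools, as in the following code: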

from __future__ import absolute_import
from __future__ import print_function

import pylab
import numpy as np
import pandas as pd
import skimage.io as io
import matplotlib.pyplot as plt

from pycocotools.coco import COCO
pylab.rcParams['figure.figsize'] = (8.0, 10.0)
import cv2

import keras.models as models
from keras.models import Sequential
from keras.layers import Layer, Dense, Dropout, Activation, Flatten, Reshape, Permute
from keras.layers import Conv2D, MaxPool2D, UpSampling2D, ZeroPadding2D
from keras.layers import BatchNormalization

from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from keras.optimizers import Adam

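import keras
# Use channels-first ordering; older Keras releases expose this as set_image_dim_ordering('th')
keras.backend.set_image_data_format('channels_first')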

from tqdm import tqdm
import itertools
%matplotlib inline

Exploring the data

We will first define the location of the annotation file used for image segmentation and then initialize the COCO API:

# set the location of the annotation file associated with the train images
annFile='annotations/annotations/instances_train2014.json'

# initialize COCO api with
coco = COCO(annFile)

Here is the output you should get:

loading annotations into memory...
Done (t=12.84s)
creating index...
index created!

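Images

Since we are building a binary segmentation model, let's consider only the images in the images/train2014 folder that are tagged with the person label, so that we can segment the person out of the image. The COCO API gives us easy-to-use methods, two of the most frequently used being getCatIds and getImgIds. The following snippet extracts the IDs of all images tagged with the person label: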

# extract the category ids using the label 'person'
catIds = coco.getCatIds(catNms=['person'])

# extract the image ids using the catIds
imgIds = coco.getImgIds(catIds=catIds )

# print number of images with the tag 'person'
print("Number of images with the tag 'person' :" ,len(imgIds))

This should be the output:

Number of images with the tag 'person' : 45174

Now let's plot an image with the following snippet:

# extract the details of image with the image id
img = coco.loadImgs(imgIds[2])[0]
print(img)

# load the image using the location of the file listed in the image variable
I = io.imread('images/train2014/'+img['file_name'])

# display the image
plt.imshow(I)

Here is the output you should get:

{'height': 426, 'coco_url': 'http://images.cocodataset.org/train2014/COCO_train2014_000000524291.jpg', 'date_captured': '2013-11-18 09:59:07', 'file_name': 'COCO_train2014_000000524291.jpg', 'flickr_url': 'http://farm2.staticflickr.com/1045/934293170_d1b2cc58ff_z.jpg', 'width': 640, 'id': 524291, 'license': 3}

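We get the following output image:

Figure 9.13: Plot of a sample image from the dataset.

In the preceding snippet, we pass an image ID to the loadImgs method of COCO to extract the details of the corresponding image. If you look at the output of the img variable, one of the keys listed is file_name. This key holds the name of the image located in the images/train2014/ folder.

Then we read the image with the imread method of the io module we imported, and plot it with matplotlib.pyplot.

Annotations

Now let's load the annotation corresponding to the previous image and plot it on top of the image. The coco.getAnnIds() function retrieves the annotation IDs for an image ID. Then, the coco.loadAnns() function loads the annotations, and the coco.showAnns() function plots them. It is important that you plot the image first and only then plot the annotations, as in the following snippet: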

# display the image
plt.imshow(I)

# extract the annotation id 
annIds = coco.getAnnIds(imgIds=img['id'], catIds=catIds, iscrowd=None)

# load the annotation
anns = coco.loadAnns(annIds)

# plot the annotation on top of the image
coco.showAnns(anns)

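Following should be the output:

Figure 9.14: Visualizing the annotation on the image

To obtain the annotation as a label array, use the coco.annToMask() function, as in the following snippet. This array will help us form the segmentation target: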

# build the mask for display with matplotlib
mask = coco.annToMask(anns[0])

# display the mask
plt.imshow(mask)

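Following should be the output:

Figure 9.15: Visualizing only the annotation

Preparing the data

Let's now define a data_list() function that automates loading the images and their segmentation arrays into memory and uses OpenCV to resize them to a shape of 360*480. The function returns two lists, containing the images and the segmentation arrays: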

def data_list(imgIds, count = 12127, ratio = 0.2):
    """Function to load image and its target into memory.""" 
    img_lst = []
    lab_lst = []

    for x in tqdm(imgIds[0:count]):
        # load image details
        img = coco.loadImgs(x)[0]

        # read image
        I = io.imread('images/train2014/'+img['file_name'])
        if len(I.shape)<3:
            continue

        # load annotation information
        annIds = coco.getAnnIds(imgIds=img['id'], catIds=catIds, iscrowd=None)

        # load annotation
        anns = coco.loadAnns(annIds)

        # prepare mask
        mask = coco.annToMask(anns[0])

        # This condition makes sure that the mask contains exactly two
        # labels, that is, both background and person pixels
        if len(np.unique(mask)) == 2:

            # Next condition selects images where ratio of area covered by the
            # person to the entire image is greater than the ratio parameter
            # This is done to not have large class imbalance
            if (len(np.where(mask>0)[0])/len(np.where(mask>=0)[0])) > ratio :

                # The mask should have 2 classes
                # (0 - background/other, 1 - person).
                # To avoid issues with cv2 during the resize operation,
                # set any label 2 to 1, making label 1 the person.
                mask[mask==2] = 1

                # resize image and mask to shape (480, 360)
                I= cv2.resize(I, (480,360))
                mask = cv2.resize(mask, (480,360))

                # append mask and image to their lists
                img_lst.append(I)
                lab_lst.append(mask)
    return (img_lst, lab_lst)

# get images and their labels
img_lst, lab_lst = data_list(imgIds)

print('Sum of images for training, validation and testing :', len(img_lst))
print('Unique values in the labels array :', np.unique(lab_lst[0]))

Following should be the output:

Sum of images for training, validation and testing : 1997
Unique values in the labels array : [0 1]
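The ratio test inside data_list() amounts to measuring the fraction of pixels that belong to the person. A compact way to express that check is sketched below; the helper name foreground_ratio and the toy mask are assumptions made for this illustration.

```python
import numpy as np

def foreground_ratio(mask):
    """Fraction of pixels in a label mask that are non-zero (foreground)."""
    mask = np.asarray(mask)
    return float(np.count_nonzero(mask)) / mask.size

# A hypothetical 4x5 mask in which 6 of the 20 pixels belong to the person
mask = np.zeros((4, 5), dtype=np.uint8)
mask[1:3, 1:4] = 1

ratio = foreground_ratio(mask)
print(ratio, ratio > 0.2)
```

Images whose ratio stays below the chosen value are skipped, which is why only 1,997 of the first 12,127 person images remain.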

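Image normalization

First, let's define the make_normalize() function, which takes an image and performs histogram equalization on each of its channels. The returned object is a normalized array: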

def make_normalize(img):
    """Function to histogram normalize images."""
    norm_img = np.zeros((img.shape[0], img.shape[1], 3),np.float32)

    b=img[:,:,0]
    g=img[:,:,1]
    r=img[:,:,2]

    norm_img[:,:,0]=cv2.equalizeHist(b)
    norm_img[:,:,1]=cv2.equalizeHist(g)
    norm_img[:,:,2]=cv2.equalizeHist(r)

    return norm_img

plt.figure(figsize = (14,5))
plt.subplot(1,2,1)
plt.imshow(img_lst[9])
plt.title(' Original Image')
plt.subplot(1,2,2)
plt.imshow(make_normalize(img_lst[9]))
plt.title(' Histogram Normalized Image')

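Following should be the output:

Figure 9.16: An image before and after histogram normalization

In the preceding screenshot, the original picture on the left is clearly visible, whereas the normalized picture on the right is barely visible. This is expected: make_normalize() returns float32 values in the range 0 to 255, whereas matplotlib expects floating-point RGB data in the range 0 to 1 and clips anything above it.

Encoding

Having defined the make_normalize() function, we now define a make_target() function. It takes a segmentation array of shape (360, 480) and returns a segmentation target of shape (360, 480, 2). In the target, channel 0 represents the background: it is 1 at locations that belong to the background and 0 elsewhere. Channel 1 represents the person: it is 1 at locations that belong to the person and 0 elsewhere. The following code implements the function: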

def make_target(labels):
    """Function to one hot encode targets."""
    x = np.zeros([360,480,2])
    for i in range(360):
        for j in range(480):
            x[i,j,labels[i][j]]=1
    return x

plt.figure(figsize = (14,5))
plt.subplot(1,2,1)
plt.imshow(make_target(lab_lst[0])[:,:,0])
plt.title('Background')
plt.subplot(1,2,2)
plt.imshow(make_target(lab_lst[0])[:,:,1])
plt.title('Person')

Following should be the output:

Figure 9.17: Visualizing the encoded target arrays
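The nested loops in make_target() visit all 172,800 pixels one at a time. As an optional alternative, the same one-hot encoding can be obtained by indexing an identity matrix with the label array. This is a sketch of an equivalent formulation, not the chapter's code; the name make_target_vectorized is made up for this example.

```python
import numpy as np

def make_target_vectorized(labels, num_classes=2):
    """One-hot encode an integer label array of shape (H, W) into shape (H, W, num_classes)."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row k of the identity matrix is the one-hot vector for class k
    return np.eye(num_classes)[labels]

# A tiny hypothetical label array: 0 is background, 1 is person
labels = np.array([[0, 1],
                   [1, 0]])
target = make_target_vectorized(labels)
print(target.shape)
print(target[:, :, 1])   # the person channel
```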

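Model data

We now define a function named model_data() that takes a list of images and a list of labels. It applies the make_normalize() function to each image to normalize it, and the make_target() function to each label/segmentation array to obtain the encoded array.

The function returns two arrays, one containing the normalized images and the other the corresponding target arrays: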

def model_data(images, labels):
    """Function to perform normalize and encode operation on each image."""
    # empty label and image list
    array_lst = []
    label_lst=[]

    # apply normalize function on each image and encoding function on each label
    for x,y in tqdm(zip(images, labels)):
        array_lst.append(np.rollaxis(make_normalize(x), 2))
        label_lst.append(make_target(y))

    return np.array(array_lst), np.array(label_lst)

# Get model data
train_data, train_lab = model_data(img_lst, lab_lst)

flat_image_shape = 360*480

# reshape target array
train_label = np.reshape(train_lab,(-1,flat_image_shape,2))

# test data
test_data = train_data[1900:]
# validation data
val_data = train_data[1500:1900]
# train data
train_data = train_data[:1500]

# test label
test_label = train_label[1900:]
# validation label
val_label = train_label[1500:1900]
# train label
train_label = train_label[:1500]

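In the preceding snippet, we also split the data into training, validation, and test sets, with 1,500 data points in the training set, 400 in the validation set, and 97 in the test set.

Defining the hyperparameters

Following are some of the hyperparameters that we will use throughout the code; they are fully configurable: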

# define optimizer
optimizer = Adam(lr=0.002)

# input shape to the model
input_shape=(3, 360, 480)

# training batchsize
batch_size = 6

# number of training epochs
nb_epoch = 60

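To learn more about optimizers and their API in Keras, visit keras.io/optimizers/. If you run into resource exhaustion errors on the GPU, reduce batch_size.

Experiment with different learning rates, optimizers, and values of batch_size to see how they affect the quality of the model, and if you get better results, share them with the deep learning community.

Defining SegNet

For image segmentation, we will build a SegNet model, which is very similar to the autoencoder we built in Chapter 8, Handwritten Digits Classification Using ConvNets, as shown here: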

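Figure 9.18: The SegNet architecture used in this chapter

The SegNet model we define takes images of shape (3, 360, 480) as input, with segmentation arrays of shape (172800, 2) as targets, and has the following layers in the encoder:

The first layer is a 2D convolution with 64 filters of size 3*3 and relu activation, followed by batch normalization and then downsampling with MaxPooling2D of size 2*2.
The second layer is a 2D convolution with 128 filters of size 3*3 and relu activation, followed by batch normalization and then downsampling with MaxPooling2D of size 2*2.
The third layer is a 2D convolution with 256 filters of size 3*3 and relu activation, followed by batch normalization and then downsampling with MaxPooling2D of size 2*2.
The fourth layer is again a 2D convolution, with 512 filters of size 3*3 and relu activation, followed by batch normalization.

The model has the following layers in the decoder:

The first layer is a 2D convolution with 512 filters of size 3*3 and relu activation, followed by batch normalization and then upsampling with UpSampling2D of size 2*2.
The second layer is a 2D convolution with 256 filters of size 3*3 and relu activation, followed by batch normalization and then upsampling with UpSampling2D of size 2*2.
The third layer is a 2D convolution with 128 filters of size 3*3 and relu activation, followed by batch normalization and then upsampling with UpSampling2D of size 2*2.
The fourth layer is a 2D convolution with 64 filters of size 3*3 and relu activation, followed by batch normalization.
The fifth layer is a 2D convolution with 2 filters of size 1*1, followed by Reshape, Permute, and a softmax activation layer that predicts the scores.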
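
To see why the target has 172,800 rows, the spatial size of the feature map can be traced through the layers listed above. This is a rough sketch that assumes every convolution preserves the spatial size (zero padding of 1 followed by a 3*3 'valid' convolution) and that pooling and upsampling both use a factor of 2; trace_spatial_shape is an illustrative helper, not part of the chapter's code:

```python
def trace_spatial_shape(height, width, n_pool=3, n_upsample=3, factor=2):
    """Follow the (height, width) of the feature map through the network.

    Only pooling and upsampling change the spatial size here; the
    convolutions are assumed to be size preserving.
    """
    shapes = [(height, width)]
    for _ in range(n_pool):
        height, width = height // factor, width // factor
        shapes.append((height, width))
    for _ in range(n_upsample):
        height, width = height * factor, width * factor
        shapes.append((height, width))
    return shapes

shapes = trace_spatial_shape(360, 480)
flat_pixels = shapes[-1][0] * shapes[-1][1]
print(shapes)
print(flat_pixels)
```

Three poolings shrink the map to 45 by 60, and three upsamplings restore 360 by 480, so flattening the final map gives one row per pixel with two class scores each.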

The model is described by the following code:

model = Sequential()
# Encoder
model.add(Layer(input_shape=input_shape))
model.add(ZeroPadding2D())
model.add(Conv2D(filters=64, kernel_size=(3,3), padding='valid', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2,2)))

model.add(ZeroPadding2D())
model.add(Conv2D(filters=128, kernel_size=(3,3), padding='valid', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2,2)))

model.add(ZeroPadding2D())
model.add(Conv2D(filters=256, kernel_size=(3,3), padding='valid', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2,2)))

model.add(ZeroPadding2D())
model.add(Conv2D(filters=512, kernel_size=(3,3), padding='valid', activation='relu'))
model.add(BatchNormalization())

# Decoder # For the remaining part of this section of the code refer to the segnet.ipynb file in the SegNet folder. Here is the github link: https://github.com/PacktPublishing/Python-Deep-Learning-Projects/tree/master/Chapter09

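Compiling the model

With the model defined, compile it with categorical_crossentropy as the loss and Adam as the optimizer, as set by the optimizer variable in the hyperparameters section. We also define ReduceLROnPlateau to reduce the learning rate during training when needed, as follows: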

# compile model
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

# use ReduceLROnPlateau to adjust the learning rate
reduceLROnPlat = ReduceLROnPlateau(monitor='val_acc', factor=0.75, patience=5,
                      min_delta=0.005, mode='max', cooldown=3, verbose=1)

callbacks_list = [reduceLROnPlat]
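
The effect of the callback parameters can be illustrated with a simplified, framework-free sketch. It approximates how a reduce-on-plateau schedule behaves for a metric that should increase (mode='max'); it is not the Keras implementation, and the class name PlateauSchedule is hypothetical:

```python
class PlateauSchedule:
    """Simplified reduce-on-plateau logic for a metric that should increase."""

    def __init__(self, lr, factor=0.75, patience=5, min_delta=0.005, cooldown=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.cooldown = cooldown
        self.best = float("-inf")
        self.wait = 0
        self.cooldown_left = 0

    def step(self, metric):
        """Update the schedule with this epoch's metric and return the learning rate."""
        if self.cooldown_left > 0:
            # During cooldown, epochs without improvement are not counted
            self.cooldown_left -= 1
            self.wait = 0
        if metric > self.best + self.min_delta:
            self.best = metric
            self.wait = 0
        elif self.cooldown_left == 0:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.cooldown_left = self.cooldown
                self.wait = 0
        return self.lr

schedule = PlateauSchedule(lr=0.002)
# A validation accuracy that never improves after the first epoch
lrs = [schedule.step(0.70) for _ in range(6)]
print(lrs)
```

With patience=5, the learning rate stays at 0.002 for the five epochs after the first one and is multiplied by 0.75 on the sixth call.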

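Fitting the model

Once the model is compiled, we fit it to the data with the model's fit method. Because we are training on a small dataset, it is important to set the shuffle parameter to True so that the training images are reshuffled for every epoch: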

# fit the model
history = model.fit(train_data, train_label, callbacks=callbacks_list,
                    batch_size=batch_size, epochs=nb_epoch,
                    verbose=1, shuffle = True, validation_data = (val_data, val_label))

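This should be the output:

Figure 9.19: Training output

The following shows the accuracy and loss curves:

Figure 9.20: Curves showing training progress

Testing the model

With the model trained, evaluate it on the test data, as follows: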

loss,acc = model.evaluate(test_data, test_label)
print('Loss :', loss)
print('Accuracy :', acc)

This should be the output:

97/97 [==============================] - 7s 71ms/step
Loss : 0.5390811630131043
Accuracy : 0.7633129960482883

We see that the SegNet model we built has a loss of 0.539 and an accuracy of 76.33% on the test images.
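
For a segmentation model whose output is one class distribution per pixel, the reported accuracy can be read as the fraction of pixels assigned to the right class. The following sketch shows that calculation on two tiny, made-up masks; pixel_accuracy is an illustrative helper and not something the chapter defines:

```python
def pixel_accuracy(predicted, truth):
    """Fraction of pixels whose predicted class matches the ground truth."""
    total = 0
    correct = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            total += 1
            correct += int(p == t)
    if total == 0:
        raise ValueError("empty masks")
    return correct / total

# Made-up 2x4 masks: 0 = background, 1 = person
truth_mask = [[0, 0, 1, 1],
              [0, 1, 1, 1]]
pred_mask = [[0, 1, 1, 1],
             [0, 0, 1, 1]]
acc = pixel_accuracy(pred_mask, truth_mask)
print(acc)
```

Two of the eight pixels disagree, so this toy prediction scores 0.75.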

Let's plot the test images and their generated segmentations to understand what the model has learned:

for i in range(3):
    plt.figure(figsize = (10,3))
    plt.subplot(1,2,1)
    plt.imshow(img_lst[1900+i])
    plt.title('Input')
    plt.subplot(1,2,2)
    plt.imshow(model.predict_classes(test_data[i:i+1]).reshape(360,480))
    plt.title('Segmentation')

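The following should be the output:

Figure 9.21: Segmentations generated on the test images

From the preceding figure, we can see that the model is able to segment the person out of the image.

Conclusion

The first part of the project was to build an object detection classifier using the YOLO architecture in Keras.

The second part of the project was to build a binary image segmentation model on COCO images containing only a person and the background. The goal was to build a model good enough to segment the person from the background.

The model we built by training on 1,500 images of shape 360*480*3 reaches an accuracy of 79% on the training data and about 78% on the validation data; the test evaluation shown earlier reported 76.33%. The model successfully segments the person in the image, but the borders of the segmentations are slightly off from where they should be, most likely because of the small training set. Given the number of images used for training, the model does a good job of segmentation.

More images in this dataset are available for training. Training on all of them could take more than a day on an Nvidia Tesla K80 GPU, but doing so should give noticeably better segmentations.

Summary

In the first part of this chapter, we learned how to build a RESTful service for object detection using an existing classifier, and how to build an accurate object detector using the YOLO architecture and Keras while applying transfer learning. In the second part of the chapter, we learned what image segmentation is and built an image segmentation model on images from the COCO dataset. We also tested the performance of the object detector and the image segmenter on test data and determined that we met our goals.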

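Chapter 10: Building Face Recognition Using FaceNet

In the previous chapter, we learned how to detect objects in an image. In this chapter, we look at a specific use case of object detection: face recognition. Face recognition combines two major operations: face detection, followed by face classification.

The (hypothetical) client providing our business use case for this project is a Tier III high-performance computing data center with a sustainability certification. They designed the facility to meet the highest standards of protection against natural disasters and equipped it with many redundant systems.

The facility already has ultra-high security protocols in place to guard against malicious human-caused disasters, and they want to strengthen its security with facial recognition for access control to the secure areas of the facility.

The servers they house and maintain process some of the most sensitive, valuable, and influential data in the world, so the stakes are very high.

The face recognition system needs to accurately identify not only their own employees but also employees of their clients, who occasionally tour the data center for inspection.

They have asked us to provide a POC of this intelligence-based capability for review and later inclusion across their data center.

So, in this chapter, we will learn how to build a world-class face recognition system. We will define the pipeline as follows:

Face detection: First, look at an image and find all the possible faces in it
Face extraction: Second, focus on each face and understand it, for example, whether it is turned sideways or badly lit
Feature extraction: Third, extract unique features from the faces using a convolutional neural network (CNN)
Classifier training: Finally, compare the unique features of that face against all the people already known to determine the person's name

You will learn the main ideas behind each step, and how to build your own face recognition system in Python using the following deep learning technologies:

dlib (dlib.net/): Provides a library that can be used for face detection and alignment.
OpenFace (cmusatyalab.github.io/openface/): A deep learning face recognition model developed by Brandon Amos et al. (bamos.github.io/). It can also run in real time on mobile devices.
FaceNet (arxiv.org/abs/1503.03832): A CNN architecture used for feature extraction. FaceNet uses triplet loss as its loss function, which works by minimizing the distance between an anchor and a positive example (the same identity) while maximizing the distance between the anchor and a negative example (a different identity).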
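
The triplet loss idea can be sketched for a single triplet of embeddings. This is a minimal illustration using squared Euclidean distances and a margin; the margin value of 0.2 and the two-dimensional toy vectors are assumptions made for the example, and the function names are not taken from any of the libraries above:

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet."""
    pos_dist = squared_distance(anchor, positive)
    neg_dist = squared_distance(anchor, negative)
    # Zero loss once the negative is farther than the positive by at least the margin
    return max(pos_dist - neg_dist + margin, 0.0)

anchor = [0.0, 1.0]
positive = [0.0, 0.9]   # same identity, close to the anchor
negative = [1.0, 0.0]   # different identity, far from the anchor

easy = triplet_loss(anchor, positive, negative)
hard = triplet_loss(anchor, negative, positive)  # roles swapped on purpose
print(easy, hard)
```

The first triplet already satisfies the margin and contributes no loss, while the swapped triplet violates it and is penalized.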

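Setting up the environment

Since the setup can get very complicated and time-consuming, and is not the focus of this chapter, we will build a Docker image that contains all of the dependencies, including dlib, OpenFace, and FaceNet.

Getting the code

Fetch the code that we will use to build face recognition from the repository: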

git clone https://github.com/PacktPublishing/Python-Deep-Learning-Projects
cd Python-Deep-Learning-Projects/Chapter10/

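Building the Docker image

Docker is a container platform that simplifies deployment. It solves the problem of installing software dependencies across different server environments. If you are new to Docker, you can read more at www.docker.com/.

To install Docker on a Linux machine, run the following command: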

curl https://get.docker.com | sh

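For other systems such as macOS and Windows, visit docs.docker.com/install/. You can skip this step if you already have Docker installed.

Once Docker is installed, you should be able to use the docker command in the terminal.

Now we will create a Dockerfile that installs all of the dependencies, including OpenCV, dlib, and TensorFlow. The file can be found in the GitHub repository at github.com/PacktPublishing/Python-Deep-Learning-Projects/tree/master/Chapter10/Dockerfile: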

#Dockerfile for our env setup
FROM tensorflow/tensorflow:latest

RUN apt-get update -y --fix-missing
RUN apt-get install -y ffmpeg
RUN apt-get install -y build-essential cmake pkg-config \
                    libjpeg8-dev libtiff5-dev libjasper-dev libpng12-dev \
                    libavcodec-dev libavformat-dev libswscale-dev libv4l-dev \
                    libxvidcore-dev libx264-dev \
                    libgtk-3-dev \
                    libatlas-base-dev gfortran \
                    libboost-all-dev \
                    python3 python3-dev python3-numpy

RUN apt-get install -y wget vim python3-tk python3-pip

WORKDIR /
RUN wget -O opencv.zip https://github.com/Itseez/opencv/archive/3.2.0.zip \
    && unzip opencv.zip \
    && wget -O opencv_contrib.zip https://github.com/Itseez/opencv_contrib/archive/3.2.0.zip \
    && unzip opencv_contrib.zip

# install opencv3.2
RUN cd /opencv-3.2.0/ \
   && mkdir build \
   && cd build \
   && cmake -D CMAKE_BUILD_TYPE=RELEASE \
            -D INSTALL_C_EXAMPLES=OFF \
            -D INSTALL_PYTHON_EXAMPLES=ON \
            -D OPENCV_EXTRA_MODULES_PATH=/opencv_contrib-3.2.0/modules \
            -D BUILD_EXAMPLES=OFF \
            -D BUILD_opencv_python2=OFF \
            -D BUILD_NEW_PYTHON_SUPPORT=ON \
            -D CMAKE_INSTALL_PREFIX=$(python3 -c "import sys; print(sys.prefix)") \
            -D PYTHON_EXECUTABLE=$(which python3) \
            -D WITH_FFMPEG=1 \
            -D WITH_CUDA=0 \
            .. \
    && make -j8 \
    && make install \
    && ldconfig \
    && rm /opencv.zip \
    && rm /opencv_contrib.zip

# Install dlib 19.4
RUN wget -O dlib-19.4.tar.bz2 http://dlib.net/files/dlib-19.4.tar.bz2 \
    && tar -vxjf dlib-19.4.tar.bz2

RUN cd dlib-19.4 \
    && cd examples \
    && mkdir build \
    && cd build \
    && cmake .. \
    && cmake --build . --config Release \
    && cd /dlib-19.4 \
    && pip3 install setuptools \
    && python3 setup.py install \
    && cd $WORKDIR \
    && rm /dlib-19.4.tar.bz2

ADD $PWD/requirements.txt /requirements.txt
RUN pip3 install -r /requirements.txt

CMD ["/bin/bash"]

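Now build the image by executing the following command (the trailing dot sets the build context to the current directory):

docker build -t hellorahulk/facerecognition -f Dockerfile .

It takes approximately 20-30 minutes to install all of the dependencies and build the Docker image.

Downloading pretrained models

We will download a few additional files, which are used and discussed in detail later in this chapter.

Download dlib's face landmark predictor with the following commands: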

curl -O http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2
bzip2 -d shape_predictor_68_face_landmarks.dat.bz2
cp shape_predictor_68_face_landmarks.dat facenet/

Download the pretrained Inception-ResNet model:

curl -L -O https://www.dropbox.com/s/hb75vuur8olyrtw/Resnet-185253.pb
cp Resnet-185253.pb pre-model/

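Once all the components are ready, the folder structure should look roughly as follows:

Folder structure of the code

Make sure the images of the people you want to train the model on are kept in the /data folder, with each file named following the pattern /data/<class_name>/<class_name>_000<count>.jpg.

The /output folder will contain the trained SVM classifier, and all preprocessed images will be saved in a subfolder, /intermediate, using the same naming convention as the /data folder.

Pro tip: For better accuracy, always keep more than five sample images per class. This helps the model converge faster and generalize better.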
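
Before training, it can help to check that the files follow the naming convention and that every class has enough images. The sketch below works on paths relative to /data; the regular expression is an assumption derived from the pattern described above, and both helper names are hypothetical:

```python
import re
from collections import Counter

# Pattern assumed from the convention <class_name>/<class_name>_000<count>.jpg
IMAGE_PATTERN = re.compile(r"^(?P<cls>[^/]+)/(?P=cls)_(?P<count>\d+)\.jpg$")

def count_images_per_class(relative_paths):
    """Count correctly named images per class; return (counts, rejected paths)."""
    counts = Counter()
    rejected = []
    for path in relative_paths:
        match = IMAGE_PATTERN.match(path)
        if match:
            counts[match.group("cls")] += 1
        else:
            rejected.append(path)
    return counts, rejected

def classes_below_minimum(counts, minimum=10):
    """Return the class names that have fewer than `minimum` images."""
    return sorted(cls for cls, n in counts.items() if n < minimum)

# Made-up file names for illustration
paths = ["alice/alice_0001.jpg", "alice/alice_0002.jpg",
         "bob/bob_0001.jpg", "bob/carol_0001.jpg"]
counts, rejected = count_images_per_class(paths)
too_small = classes_below_minimum(counts, minimum=2)
print(dict(counts), rejected, too_small)
```

A file whose name does not repeat its folder name is rejected, which catches images that were copied into the wrong class folder.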

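Building the pipeline

Face recognition is a biometric solution that identifies people by measuring the unique characteristics of their faces. To perform face recognition, you need a way to uniquely represent a face.

The basic idea behind any face recognition system is to break the face down into distinctive features and then use those features to represent identity.

Building a robust feature extraction pipeline is very important, because it directly affects the performance and accuracy of our system. In the 1960s, Woodrow Bledsoe used a technique that marked the coordinates of prominent facial features, including the location of the hairline, eyes, and nose.

Later, in 2005, a much more robust technique was introduced: Histogram of Oriented Gradients (HOG). It captures the orientation of gradients over densely sampled pixels in an image.

The most advanced technique to date, outperforming all the others, uses CNNs. In 2015, researchers at Google released a paper describing their system, FaceNet (arxiv.org/abs/1503.03832), which uses a CNN to learn features directly from the image pixels rather than extracting them by hand.

To build the face recognition pipeline, we will design the following flow (shown as orange blocks in the diagram):

Preprocessing: Find all the faces and fix their orientation.
Feature extraction: Extract unique features from the processed face images.
Classifier training: Train an SVM classifier on the 128-dimensional features.

This is illustrated in the following diagram:

This diagram illustrates the end-to-end flow of the face recognition pipeline.

We will look at each step in detail and build our world-class face recognition system.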

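Preprocessing of images

The first step in our pipeline is face detection. We will then align the faces, extract features, and finally run the preprocessing on Docker.

Face detection

Obviously, it is important to first locate the faces in a given photograph so that they can be fed into the later parts of the pipeline. There are many ways to detect faces, such as detecting skin texture, oval/round shape detection, and other statistical methods. We will use a method called HOG.

HOG is a feature descriptor that represents the distribution (histogram) of gradient directions (oriented gradients), which are used as features. The gradients (x and y derivatives) of an image are useful because their magnitude is large around edges and corners (regions of abrupt intensity change), which make excellent features in an image.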
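
The core of the descriptor, a magnitude-weighted histogram of gradient orientations, can be sketched for a single small patch. This is a deliberately reduced illustration: a full HOG implementation also divides the image into cells and normalizes over blocks, which is omitted here, and the function name, bin count, and toy patch are assumptions for the example:

```python
import math

def orientation_histogram(patch, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    height, width = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the x and y derivatives
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

# A 4x4 patch with a vertical edge: dark on the left, bright on the right
edge_patch = [[0, 0, 1, 1]] * 4
hist = orientation_histogram(edge_patch)
print(hist)
```

For the vertical edge, all of the gradient energy falls into the first bin (orientations near 0 degrees), because the intensity only changes along the x direction.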

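To find faces in an image, we first convert the image to grayscale. We then look at every pixel in the image, one at a time, and use the HOG detector to extract its gradient orientation. We will use dlib.get_frontal_face_detector() to create our face detector.

The following small example shows the HOG-based face detector in use: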

import sys
import dlib
from skimage import io

# Create a HOG face detector using the built-in dlib class
face_detector = dlib.get_frontal_face_detector()

# Load the image into an array
file_name = 'sample_face_image.jpeg'
image = io.imread(file_name)

# Run the HOG face detector on the image data.
# The result will be the bounding boxes of the faces in our image.
detected_faces = face_detector(image, 1)

print("Found {} faces.".format(len(detected_faces)))

# Loop through each face we found in the image
for i, face_rect in enumerate(detected_faces):
 # Detected faces are returned as an object with the coordinates 
 # of the top, left, right and bottom edges
  print("- Face #{} found at Left: {} Top: {} Right: {} Bottom: {}".format(i+1, face_rect.left(), face_rect.top(), face_rect.right(), face_rect.bottom()))

The output is as follows:

Found 1 faces.
- Face #1 found at Left: 365 Top: 365 Right: 588 Bottom: 588

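Aligning faces

Once we know the region in which the face is located, we can apply various isolation techniques to extract the face from the overall image.

One challenge to deal with is that faces in images may be turned in different directions, which makes them look different to a machine.

To solve this, we will warp each image so that the eyes and lips are always in the same place in the output image. This makes it much easier to compare faces in the next steps. To do so, we will use an algorithm called face landmark estimation.

The basic idea is to come up with 68 specific points (called landmarks) that exist on every face: the top of the chin, the outside edge of each eye, the inner edge of each eyebrow, and so on. We then train a machine learning algorithm to find these 68 specific points on any face.

The 68 landmarks we will locate on every face are shown in the following diagram:

This image was created by Brandon Amos (bamos.github.io/), who works on the OpenFace project (github.com/cmusatyalab/openface).

Here is a small snippet demonstrating how to use the face landmark model that we downloaded in the Setting up the environment section: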

import sys
import dlib
import cv2
import openface

predictor_model = "shape_predictor_68_face_landmarks.dat"

# Create a HOG face detector , Shape Predictor and Aligner
face_detector = dlib.get_frontal_face_detector()
face_pose_predictor = dlib.shape_predictor(predictor_model)
face_aligner = openface.AlignDlib(predictor_model)

# Image file name is hardcoded here; replace it with your own image
file_name = 'sample_face_image.jpeg'

# Load the image
image = cv2.imread(file_name)

# Run the HOG face detector on the image data
detected_faces = face_detector(image, 1)

print("Found {} faces.".format(len(detected_faces)))

# Loop through each face we found in the image
for i, face_rect in enumerate(detected_faces):

  # Detected faces are returned as an object with the coordinates 
  # of the top, left, right and bottom edges
  print("- Face #{} found at Left: {} Top: {} Right: {} Bottom: {}".format(i, face_rect.left(), face_rect.top(), face_rect.right(), face_rect.bottom()))

  # Get the face's pose
  pose_landmarks = face_pose_predictor(image, face_rect)

 # Use openface to calculate and perform the face alignment
  alignedFace = face_aligner.align(534, image, face_rect, landmarkIndices=openface.AlignDlib.OUTER_EYES_AND_NOSE)

 # Save the aligned image to a file
  cv2.imwrite("aligned_face_{}.jpg".format(i), alignedFace)

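With this approach, we can apply basic image transformations such as rotation and scaling that preserve parallel lines. These are known as affine transformations (en.wikipedia.org/wiki/Affine_transformation).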
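
The parallel-line property can be checked numerically. The sketch below applies a 2x3 affine matrix to points by hand; the matrix values and the helper names are illustrative and unrelated to what openface computes internally:

```python
def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix [[a, b, tx], [c, d, ty]] to an (x, y) point."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

def direction(p, q):
    """Direction vector from point p to point q."""
    return (q[0] - p[0], q[1] - p[1])

def is_parallel(u, v, tol=1e-9):
    # Two 2D vectors are parallel when their cross product is zero
    return abs(u[0] * v[1] - u[1] * v[0]) < tol

# Example transform: scale by 2, rotate by 90 degrees, then translate by (5, -3)
transform = [[0.0, -2.0, 5.0],
             [2.0, 0.0, -3.0]]

# Two parallel segments before the transform
seg_a = ((0.0, 0.0), (1.0, 1.0))
seg_b = ((0.0, 2.0), (1.0, 3.0))

moved_a = tuple(apply_affine(transform, p) for p in seg_a)
moved_b = tuple(apply_affine(transform, p) for p in seg_b)
still_parallel = is_parallel(direction(*moved_a), direction(*moved_b))
print(moved_a, moved_b, still_parallel)
```

Both segments are rotated, scaled, and shifted, yet their direction vectors remain parallel, which is exactly the property that makes affine warps suitable for face alignment.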

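The output is as follows:

With segmentation, we solved the problem of finding the largest face in an image, and with alignment, we standardized the input image to be centered based on the location of the eyes and the bottom lip.

Here is a sample from our dataset, showing the raw image and the processed image:

Feature extraction

Now that we have segmented and aligned the data, we will generate a vector embedding for each identity. These embeddings can then be used as input to a classification, regression, or clustering task.

Training a CNN to output face embeddings requires a lot of data and computing power. However, once the network has been trained, it can generate measurements for any face, even ones it has never seen before. So this step only needs to be done once.

For convenience, we have provided a pretrained Inception-Resnet-v1 model, which you can run on any face image to get a 128-dimensional feature vector. We downloaded this file in the Setting up the environment section, and it is located at /pre-model/Resnet-185253.pb.

If you want to try this step yourself, OpenFace provides a Lua script (github.com/cmusatyalab/openface/blob/master/batch-represent/batch-represent.lua) that generates embeddings for all the images in a folder and writes them to a CSV file.

The code that creates the embeddings for the input images follows this paragraph. It is available in the repository at github.com/PacktPublishing/Python-Deep-Learning-Projects/blob/master/Chapter10/facenet/train_classifier.py.

In the process, we load the trained components from the ResNet model, such as embedding_layer, images_placeholder, and phase_train_placeholder, along with the images and labels: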

def _create_embeddings(embedding_layer, images, labels, images_placeholder, phase_train_placeholder, sess):
    """
    Uses model to generate embeddings from :param images.
    :param embedding_layer: 
    :param images: 
    :param labels: 
    :param images_placeholder: 
    :param phase_train_placeholder: 
    :param sess: 
    :return: (tuple): image embeddings and labels
    """
    emb_array = None
    label_array = None
    try:
        i = 0
        while True:
            batch_images, batch_labels = sess.run([images, labels])
            logger.info('Processing iteration {} batch of size: {}'.format(i, len(batch_labels)))
            emb = sess.run(embedding_layer,
                           feed_dict={images_placeholder: batch_images, phase_train_placeholder: False})

            emb_array = np.concatenate([emb_array, emb]) if emb_array is not None else emb
            label_array = np.concatenate([label_array, batch_labels]) if label_array is not None else batch_labels
            i += 1

    except tf.errors.OutOfRangeError:
        pass

    return emb_array, label_array

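Here is a quick view of the embedding creation process. We feed in the image and label data along with a few components from the pretrained model:

The output of the process is a 128-dimensional vector representing the face image.

Execution on Docker

We will run the preprocessing in our Docker image. We mount the project directory as a volume inside the Docker container (using the -v flag) and run the preprocessing script on the input data. The results are written to a directory specified with command-line arguments.

The align_dlib.py file is sourced from CMU. It provides methods for detecting a face in an image, finding facial landmarks, and aligning the face using those landmarks: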

docker run -v $PWD:/facerecognition \
-e PYTHONPATH=$PYTHONPATH:/facerecognition \
-it hellorahulk/facerecognition python3 /facerecognition/facenet/preprocess.py \
--input-dir /facerecognition/data \
--output-dir /facerecognition/output/intermediate \
--crop-dim 180

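In the preceding command, we set the input data path with the --input-dir flag. This directory should contain the images that we want to process.

We also set the output path with the --output-dir flag, where the segmented and aligned images are stored. We will use these output images as input for training.

The --crop-dim flag defines the output dimensions of the images. In this case, all images are stored at 180 × 180.

This process creates an /intermediate folder inside the /output folder, containing all the preprocessed images.

Training the classifier

First, we load the segmented and aligned images from the input directory (the --input-dir flag). During training, we apply preprocessing to the images that adds random transformations, producing more images to train on.

These images are fed into the pretrained model in batches of 128. The model returns a 128-dimensional embedding for each image, that is, a 128 x 128 matrix for each batch.

After creating these embeddings, we use them as feature inputs to a scikit-learn SVM classifier, which is trained to recognize each identity.

The following command starts the process and trains the classifier. The classifier is saved as a pickle file at the path defined by the --classifier-path argument: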

docker run -v $PWD:/facerecognition \
-e PYTHONPATH=$PYTHONPATH:/facerecognition \
-it hellorahulk/facerecognition \
python3 /facerecognition/facenet/train_classifier.py \
--input-dir /facerecognition/output/intermediate \
--model-path /facerecognition/pre-model/Resnet-185253.pb \
--classifier-path /facerecognition/output/classifier.pkl \
--num-threads 16 \
--num-epochs 25 \
--min-num-images-per-class 10 \
--is-train

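A few parameters can be tuned:

--num-threads: Adjust according to your CPU/GPU configuration
--num-epochs: Change according to your dataset
--min-num-images-per-class: Change according to your dataset
--is-train: Include this flag to run training; omit it to run evaluation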
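
How such flags are typically wired up can be sketched with argparse. This is a hypothetical reconstruction based only on the flags shown in the commands above; the defaults and help strings are assumptions, and the actual train_classifier.py in the repository may differ:

```python
import argparse

def build_parser():
    """Build a parser for the flags used in the training and evaluation commands."""
    parser = argparse.ArgumentParser(description="Train or evaluate the face classifier")
    parser.add_argument("--input-dir", required=True,
                        help="Directory of aligned face images")
    parser.add_argument("--model-path", required=True,
                        help="Path to the frozen embedding model (.pb)")
    parser.add_argument("--classifier-path", required=True,
                        help="Where the pickled classifier is saved or loaded")
    # The defaults below are illustrative assumptions
    parser.add_argument("--num-threads", type=int, default=16)
    parser.add_argument("--num-epochs", type=int, default=25)
    parser.add_argument("--min-num-images-per-class", type=int, default=10)
    # A boolean switch: present means train, absent means evaluate
    parser.add_argument("--is-train", action="store_true")
    return parser

common = ["--input-dir", "/facerecognition/output/intermediate",
          "--model-path", "/facerecognition/pre-model/Resnet-185253.pb",
          "--classifier-path", "/facerecognition/output/classifier.pkl"]

train_args = build_parser().parse_args(common + ["--is-train"])
eval_args = build_parser().parse_args(common + ["--num-epochs", "2"])
print(train_args.is_train, eval_args.is_train, eval_args.num_epochs)
```

Using a store_true switch matches how the commands in this chapter pass --is-train without a value.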

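This process can take a while, depending on the number of images you are training on. Once it completes, you will find a classifier.pkl file inside the /output folder.

You can now use the classifier.pkl file to make predictions and deploy it to production.

Evaluation

We will now evaluate the performance of the trained model by executing the following command: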

docker run -v $PWD:/facerecognition \
-e PYTHONPATH=$PYTHONPATH:/facerecognition \
-it hellorahulk/facerecognition \
python3 /facerecognition/facenet/train_classifier.py \
--input-dir /facerecognition/output/intermediate \
--model-path /facerecognition/pre-model/Resnet-185253.pb \
--classifier-path /facerecognition/output/classifier.pkl \
--num-threads 16 \
--num-epochs 2 \
--min-num-images-per-class 10

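After the execution completes, you will see predictions with confidence scores, as shown in the following screenshot:

We can see that the model predicts with 99.5% accuracy, and that prediction is relatively fast.

Summary

We successfully completed a world-class face recognition POC for our hypothetical high-performance data center, using the deep learning technologies of OpenFace, dlib, and FaceNet.

We built a pipeline that includes the following:

Face detection: Examine an image and find all the faces it contains
Face extraction: Focus on each face and understand its basic characteristics
Feature extraction: Extract unique features from the faces using a CNN
Classifier training: Compare those unique features against all the people already known and determine the person's name

A robust face recognition system provides a level of access control security that meets the high standards demanded by this Tier III facility. This project is a great example of the power of deep learning to make a meaningful impact on our client's business operations.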

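Chapter 11: Automated Image Captioning

In the previous chapter, we learned how to build object detection and classification models, which was very exciting. In this chapter, we will do something even more impressive by combining current state-of-the-art techniques in computer vision and natural language processing into a complete image description approach (www.cs.cmu.edu/~afarhadi/papers/sentence.pdf). The resulting model generates a natural language description of any image it is given.

Our team has been asked to build this model to generate natural language descriptions of images, to serve as the core intelligence for a company that wants to help visually impaired people take advantage of the explosion of photo sharing on the web. It is exciting to think that this deep learning technology could effectively bring image content to life for this community. People likely to enjoy the results of our work range from those who have been visually impaired since birth to the aging population. These users and more could use an image-captioning bot based on the model from this project to learn what is in posted images so that they can, for example, keep up with their family.

With this in mind, let's look at the deep learning engineering we need to do. The idea is to replace the encoder (RNN layer) in an encoder-decoder architecture with a deep convolutional neural network (CNN) trained to classify objects in images.

Normally, the last layer of a CNN is a softmax layer, which assigns each object a probability of being present in the image. If we remove the softmax layer, however, we can feed the CNN's rich encoding of the image into the decoder (a language generation RNN) designed to produce phrases. We can then train the whole system directly on images and their captions, which maximizes the likelihood that the generated descriptions best match the training descriptions for each image.

Here is a small illustration of the automated image captioning model. The top-left corner shows the encoder-decoder architecture for sequence-to-sequence models, which is combined with the object detection model, as shown in the following diagram:

In this implementation, we will use a pretrained Inception-v3 model, trained on the ImageNet dataset, as the feature extractor in the encoder.

Data preparation

Let's import all of the dependencies needed to build the auto-captioning model.

All the Python files and Jupyter Notebooks for this chapter can be found at github.com/PacktPublishing/Python-Deep-Learning-Projects/tree/master/Chapter11.

Initialization

For this implementation, we need TensorFlow version 1.9 or later, and we will also enable eager execution mode (www.tensorflow.org/guide/eager), which helps us debug the code more effectively. Here is the code: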

# Import TensorFlow and enable eager execution
import tensorflow as tf
tf.enable_eager_execution()

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

import re
import numpy as np
import os
import time
import json
from glob import glob
from PIL import Image
import pickle

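Downloading and preparing the MS-COCO dataset

We will use the MS-COCO dataset (cocodataset.org/#home) to train our model. The dataset contains more than 82,000 images, each annotated with at least five different captions. The following code downloads and extracts the dataset automatically: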

annotation_zip = tf.keras.utils.get_file('captions.zip', 
                                          cache_subdir=os.path.abspath('.'),
                                          origin = 'http://images.cocodataset.org/annotations/annotations_trainval2014.zip',
                                          extract = True)
annotation_file = os.path.dirname(annotation_zip)+'/annotations/captions_train2014.json'

name_of_zip = 'train2014.zip'
if not os.path.exists(os.path.abspath('.') + '/' + name_of_zip):
  image_zip = tf.keras.utils.get_file(name_of_zip, 
                                      cache_subdir=os.path.abspath('.'),
                                      origin = 'http://images.cocodataset.org/zips/train2014.zip',
                                      extract = True)
  PATH = os.path.dirname(image_zip)+'/train2014/'
else:
  PATH = os.path.abspath('.')+'/train2014/'

This involves a large download: we will use the training set, which is a 13 GB file.

The following is the output:

Downloading data from http://images.cocodataset.org/annotations/annotations_trainval2014.zip 
252878848/252872794 [==============================] - 6s 0us/step 
Downloading data from http://images.cocodataset.org/zips/train2014.zip 
13510574080/13510573713 [==============================] - 322s 0us/step

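In this example, we select a subset of 40,000 captions and use these and the corresponding images to train the model. As always, caption quality should improve if you choose to use more data: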

# read the json annotation file
with open(annotation_file, 'r') as f:
    annotations = json.load(f)

# storing the captions and the image name in vectors
all_captions = []
all_img_name_vector = []

for annot in annotations['annotations']:
    caption = '<start> ' + annot['caption'] + ' <end>'
    image_id = annot['image_id']
    full_coco_image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (image_id)

    all_img_name_vector.append(full_coco_image_path)
    all_captions.append(caption)

# shuffling the captions and image_names together
# setting a random state
train_captions, img_name_vector = shuffle(all_captions,
                                          all_img_name_vector,
                                          random_state=1)

# selecting the first 40000 captions from the shuffled set
num_examples = 40000
train_captions = train_captions[:num_examples]
img_name_vector = img_name_vector[:num_examples]

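Once data preparation is complete, all image paths are stored in the img_name_vector list, and the associated captions are stored in train_captions, as shown in the following screenshot: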
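
The two string operations in the loop above, wrapping a caption in start and end tokens and building the image path from the numeric ID, can be isolated in a short sketch. The helper names, the example caption, and the ID 42 are made up for illustration:

```python
def wrap_caption(caption):
    """Add the start and end tokens that delimit a caption for the decoder."""
    return "<start> " + caption + " <end>"

def coco_image_path(base_dir, image_id):
    # The file name pads the numeric image id with zeros to 12 digits
    return "{}COCO_train2014_{:012d}.jpg".format(base_dir, image_id)

# Made-up annotation for illustration
example = {"image_id": 42, "caption": "a dog runs on the beach"}
wrapped = wrap_caption(example["caption"])
path = coco_image_path("train2014/", example["image_id"])
print(wrapped)
print(path)
```

The start and end tokens give the decoder an explicit signal for where a caption begins and where generation should stop.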

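Data preparation for the deep CNN encoder

Next, we will use Inception-v3 (pretrained on ImageNet) to process each image and extract features from its last convolutional layer. We create a helper function that converts an input image into the format Inception-v3 expects: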

#Resizing the image to (299, 299)
#Using the preprocess_input method to place the pixels in the range of -1 to 1.

def load_image(image_path):
    img = tf.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize_images(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

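Now let's initialize the Inception-v3 model and load the pretrained ImageNet weights. To do so, we create a tf.keras model whose output layer is the last convolutional layer in the Inception-v3 architecture.

When creating the Keras model, notice the include_top=False argument, which excludes the fully connected layers at the top of the network: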

image_model = tf.keras.applications.InceptionV3(include_top=False, 
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

The output is as follows:

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.5/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5
87916544/87910968 [==============================] - 40s 0us/step

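So, image_features_extract_model is our deep CNN encoder, which is responsible for extracting features from a given image.

Performing feature extraction

Now we will preprocess each image with the deep CNN encoder and save the output to disk:

We load the images in batches using the load_image() helper function created earlier
We feed the images into the encoder to extract features
We dump the features as numpy arrays: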

encode_train = sorted(set(img_name_vector))
#Load images
image_dataset = tf.data.Dataset.from_tensor_slices(
                                encode_train).map(load_image).batch(16)
# Extract features
for img, path in image_dataset:
  batch_features = image_features_extract_model(img)
  batch_features = tf.reshape(batch_features, 
                              (batch_features.shape[0], -1, batch_features.shape[3]))
#Dump into disk
  for bf, p in zip(batch_features, path):
    path_of_feature = p.numpy().decode("utf-8")
    np.save(path_of_feature, bf.numpy())

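Data preparation for the language generation (RNN) decoder

The first step is to preprocess the captions.

We will perform a few basic preprocessing steps on the captions, as follows:

First, we tokenize the captions (for example, by splitting on spaces). This gives us a vocabulary of all the unique words in the data (for example, "playing", "football", and so on).
Next, we limit the vocabulary size to the top 5,000 words to save memory and replace all other words with the <unk> token (for unknown). You can tune this according to your use case.
We then create a word-to-index mapping and the reverse mapping.
Finally, we pad all sequences to the same length as the longest one.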
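
The four steps above can be sketched without any framework. This is a simplified stand-in for what the Keras Tokenizer and pad_sequences do, written in plain Python; the helper names, the tiny vocabulary size, and the two example captions are assumptions made for the illustration:

```python
from collections import Counter

def build_vocab(captions, top_k):
    """Map the top_k most frequent words to indices 1..top_k; 0 is reserved for padding."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    word_index = {"<pad>": 0}
    for word, _ in counts.most_common(top_k):
        word_index[word] = len(word_index)
    # Every word outside the top_k shares a single unknown-word index
    word_index["<unk>"] = len(word_index)
    return word_index

def encode(captions, word_index):
    """Convert captions to equal-length lists of integers, padded at the end with 0."""
    unk = word_index["<unk>"]
    seqs = [[word_index.get(w, unk) for w in c.lower().split()] for c in captions]
    max_len = max(len(s) for s in seqs)
    return [s + [0] * (max_len - len(s)) for s in seqs]

captions = ["<start> a dog plays football <end>",
            "<start> a cat sleeps <end>"]
word_index = build_vocab(captions, top_k=3)
index_word = {index: word for word, index in word_index.items()}
encoded = encode(captions, word_index)
print(word_index)
print(encoded)
```

With top_k=3, only the three most frequent tokens keep their own index, so the rarer words all map to the <unk> index, and the shorter caption is padded with a trailing 0.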

Here is the code:

# Helper func to find the maximum length of any caption in our dataset

def calc_max_length(tensor):
    return max(len(t) for t in tensor)

# Performing tokenization on the top 5000 words from the vocabulary
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k, 
                                                  oov_token="<unk>", 
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')

# Converting text into sequence of numbers
tokenizer.fit_on_texts(train_captions)
train_seqs = tokenizer.texts_to_sequences(train_captions)

tokenizer.word_index = {key:value for key, value in tokenizer.word_index.items() if value <= top_k}

# putting <unk> token in the word2idx dictionary
tokenizer.word_index[tokenizer.oov_token] = top_k + 1
tokenizer.word_index['<pad>'] = 0

# creating the tokenized vectors
train_seqs = tokenizer.texts_to_sequences(train_captions)

# creating a reverse mapping (index -> word)
index_word = {value:key for key, value in tokenizer.word_index.items()}

# padding each vector to the max_length of the captions
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')

# calculating the max_length 
# used to store the attention weights
max_length = calc_max_length(train_seqs)

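The end result is an array of integer sequences, as shown in the following screenshot:

Now we split the data into training and validation samples using an 80:20 ratio: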

img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector,cap_vector,test_size=0.2,random_state=0)

# Checking the sample counts
print ("No of Training Images:",len(img_name_train))
print ("No of Training Caption: ",len(cap_train) )
print ("No of Validation Images:",len(img_name_val))
print ("No of Validation Caption:",len(cap_val) )

No of Training Images: 32000
No of Training Caption:  32000
No of Validation Images: 8000
No of Validation Caption: 8000

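Setting up the data pipeline

Our images and captions are ready! Next, let's create a tf.data dataset (www.tensorflow.org/api_docs/python/tf/data/Dataset) to use for training our model. We prepare the pipeline for the image and text models by transforming and batching them: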

# Defining parameters
BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index)

# shape of the vector extracted from Inception-V3 is (64, 2048)
# these two variables represent that
features_shape = 2048
attention_features_shape = 64

# loading the numpy files 
def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8')+'.npy')
    return img_tensor, cap

#We use the from_tensor_slices to load the raw data and transform them into the tensors

dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))

# Using the map() to load the numpy files in parallel
# NOTE: Make sure to set num_parallel_calls to the number of CPU cores you have
# https://www.tensorflow.org/api_docs/python/tf/py_func
dataset = dataset.map(lambda item1, item2: tf.py_func(
          map_func, [item1, item2], [tf.float32, tf.int32]), num_parallel_calls=8)

# shuffling and batching
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(1)

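Defining the captioning model

The model architecture we use to build the auto-captioning is inspired by the Show, Attend and Tell paper (arxiv.org/pdf/1502.03044.pdf). The features we extracted from the last convolutional layer of Inception-v3 give us a tensor of shape (8, 8, 2048), which we then reshape to (64, 2048).

This tensor is then passed through the CNN encoder, which consists of a single fully connected layer. The RNN (a GRU in our case) attends over the image to predict the next word: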

def gru(units):
  if tf.test.is_gpu_available():
    return tf.keras.layers.CuDNNGRU(units, 
                                    return_sequences=True, 
                                    return_state=True, 
                                    recurrent_initializer='glorot_uniform')
  else:
    return tf.keras.layers.GRU(units, 
                               return_sequences=True, 
                               return_state=True, 
                               recurrent_activation='sigmoid', 
                               recurrent_initializer='glorot_uniform')

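Attention

Now we define the widely known attention mechanism, Bahdanau attention (arxiv.org/pdf/1409.0473.pdf). It needs the features from the CNN encoder, of shape (batch_size, 64, embedding_dim), and returns the context vector and the attention weights over the time axis: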

class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, features, hidden):
    # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
    hidden_with_time_axis = tf.expand_dims(hidden, 1)

    # score shape == (batch_size, 64, hidden_size)
    score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

    # attention_weights shape == (batch_size, 64, 1)
    # we get 1 at the last axis because we are applying score to self.V
    attention_weights = tf.nn.softmax(self.V(score), axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * features
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
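
The last two steps of the call method, turning scores into weights with a softmax and summing the weighted features, can be illustrated in plain Python for a single example. The scoring layers (W1, W2, and V) are left out, so the scores are simply given as inputs; the helper names and toy values are illustrative:

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(features, scores):
    """Weight each location's feature vector by its attention weight and sum them."""
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return context, weights

# Three image locations with 2-dimensional features (toy values)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform_context, uniform_weights = context_vector(features, [0.0, 0.0, 0.0])
focused_context, focused_weights = context_vector(features, [10.0, 0.0, 0.0])
print(uniform_weights, uniform_context)
print(focused_weights, focused_context)
```

Equal scores spread the attention evenly over all locations, while one large score makes the context vector almost identical to the features of that location.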

CNN encoder

Now let's define the CNN encoder, which is a single fully connected layer followed by a ReLU activation function:

class CNN_Encoder(tf.keras.Model):
    # The features were already extracted and saved to disk as .npy files,
    # so this encoder only passes them through a fully connected layer
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        # shape after fc == (batch_size, 64, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x

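RNN decoder

Here we define the RNN decoder, which takes the encoded features from the encoder. The features are fed into the attention layer, and the resulting context vector is concatenated with the input embedding vector. The concatenated vector is then passed into the GRU module and on through two fully connected layers: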

class RNN_Decoder(tf.keras.Model):
  def __init__(self, embedding_dim, units, vocab_size):
    super(RNN_Decoder, self).__init__()
    self.units = units

    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = gru(self.units)
    self.fc1 = tf.keras.layers.Dense(self.units)
    self.fc2 = tf.keras.layers.Dense(vocab_size)

    self.attention = BahdanauAttention(self.units)

  def call(self, x, features, hidden):
    # defining attention as a separate model
    context_vector, attention_weights = self.attention(features, hidden)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # shape == (batch_size, max_length, hidden_size)
    x = self.fc1(output)

    # x shape == (batch_size * max_length, hidden_size)
    x = tf.reshape(x, (-1, x.shape[2]))

    # output shape == (batch_size * max_length, vocab)
    x = self.fc2(x)

    return x, state, attention_weights

  def reset_state(self, batch_size):
    return tf.zeros((batch_size, self.units))

encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)

Loss function

We use the Adam optimizer to train the model and mask the loss calculated for the <pad> token:

optimizer = tf.train.AdamOptimizer()

# We are masking the loss calculated for padding
def loss_function(real, pred):
    mask = 1 - np.equal(real, 0)
    loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask
    return tf.reduce_mean(loss_)
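
The masking can be illustrated with a small plain-Python sketch. Unlike the function above, which works on logits, this version takes probabilities directly to keep the arithmetic visible; it mirrors the choice of averaging over all positions, including the masked ones. The function name and the toy distributions are assumptions for the example:

```python
import math

def masked_cross_entropy(real, probs, pad_id=0):
    """Mean negative log-likelihood where padded positions contribute zero."""
    losses = []
    for target, dist in zip(real, probs):
        # The mask is 0 for padding and 1 for every real token
        mask = 0.0 if target == pad_id else 1.0
        losses.append(-math.log(dist[target]) * mask)
    return sum(losses) / len(losses)

# Two positions in a batch: a real token (id 2) and a padding token (id 0)
real = [2, 0]
probs = [[0.1, 0.1, 0.8],
         [0.5, 0.25, 0.25]]
loss = masked_cross_entropy(real, probs)
print(loss)
```

Only the first position contributes to the sum, so the prediction made at the padded position has no influence on the loss.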

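Training the captioning model

Now, let's train the model. The first thing to do is to extract the features stored in the respective .npy files and pass them through the CNN encoder.

The encoder output, the hidden state (initialized to 0), and the decoder input (the start token) are passed to the decoder. The decoder returns the predictions and the decoder hidden state.

The decoder hidden state is then passed back into the model, and the predictions are used to calculate the loss. During training, we use the teacher forcing technique to decide the next input to the decoder.

Teacher forcing is a technique in which the target word is passed as the next input to the decoder. It helps the model learn the correct sequence, or the correct statistical properties of the sequence, quickly.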
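
The difference between teacher forcing and feeding back the model's own predictions can be shown with a toy sketch. Here predict_next is a hypothetical stand-in for the decoder, and the token IDs are made up:

```python
def decoder_inputs(target, predict_next, teacher_forcing=True):
    """Return the sequence of tokens fed to the decoder at each step.

    target: full token sequence, with the start token at index 0
    predict_next: callable mapping the current input token to a predicted token
    """
    inputs = [target[0]]
    for i in range(1, len(target) - 1):
        predicted = predict_next(inputs[-1])
        # Teacher forcing feeds the ground-truth token; otherwise the model's own guess
        inputs.append(target[i] if teacher_forcing else predicted)
    return inputs

# Toy "model" that always predicts token 9, and a made-up target sequence
always_nine = lambda token: 9
target = [1, 5, 6, 7, 2]

forced = decoder_inputs(target, always_nine, teacher_forcing=True)
free_running = decoder_inputs(target, always_nine, teacher_forcing=False)
print(forced)
print(free_running)
```

With teacher forcing, a bad prediction at one step does not corrupt the inputs of later steps, which is why training tends to converge faster.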

The final step is to calculate the gradients through backpropagation and apply them with the optimizer:

EPOCHS = 20
loss_plot = []

for epoch in range(EPOCHS):
    start = time.time()
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        loss = 0

        # initializing the hidden state for each batch
        # because the captions are not related from image to image
        hidden = decoder.reset_state(batch_size=target.shape[0])

        dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

        with tf.GradientTape() as tape:
            features = encoder(img_tensor)

            for i in range(1, target.shape[1]):
                # passing the features through the decoder
                predictions, hidden, _ = decoder(dec_input, features, hidden)

                loss += loss_function(target[:, i], predictions)

                # using teacher forcing
                dec_input = tf.expand_dims(target[:, i], 1)

        total_loss += (loss / int(target.shape[1]))

        variables = encoder.variables + decoder.variables

        gradients = tape.gradient(loss, variables) 

        optimizer.apply_gradients(zip(gradients, variables), tf.train.get_or_create_global_step())

        if batch % 100 == 0:
            print ('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, 
                                                          batch, 
                                                          loss.numpy() / int(target.shape[1])))
    # storing the epoch end loss value to plot later
    loss_plot.append(total_loss / len(cap_vector))

    print ('Epoch {} Loss {:.6f}'.format(epoch + 1, 
                                         total_loss/len(cap_vector)))
    print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

The following is the output:

After running a few training epochs, let's plot the loss against the epoch:

plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.show()

The output is as follows:

Plot of loss against epochs during training

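Evaluating the captioning model

The evaluation function is similar to the training loop, except that we do not use teacher forcing here. At each time step, the input to the decoder is its own previous prediction, together with the hidden state and the encoder output.

A few key points to remember when making predictions:

Stop predicting when the model predicts the end token
Store the attention weights for every time step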
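
These two points can be sketched independently of TensorFlow. In the following minimal example, step_fn is a hypothetical stand-in for the decoder (the hidden state is omitted for brevity), and the toy scores and attention values are made up:

```python
def greedy_decode(step_fn, start_id, end_id, max_length):
    """Greedy decoding loop that stops at end_id and records attention.

    step_fn is a hypothetical stand-in for the decoder: it takes the previous
    token id and returns (scores, attention_weights) for the next step.
    """
    result, attention_history = [], []
    token = start_id
    for _ in range(max_length):
        scores, attention = step_fn(token)
        attention_history.append(attention)   # one entry per time step
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        result.append(token)
        if token == end_id:                   # end token predicted: stop here
            break
    return result, attention_history

# Toy decoder: always prefers token (previous + 1), with made-up attention.
def toy_step(prev):
    scores = [0.0] * 5
    scores[min(prev + 1, 4)] = 1.0
    return scores, [0.5, 0.5]

tokens, attention = greedy_decode(toy_step, start_id=0, end_id=3, max_length=10)
```

The max_length bound matters as much as the end-token check: a model that never emits the end token would otherwise loop forever.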

Let's define the evaluate() function:

def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()
        result.append(index_word[predicted_id])

        if index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot

In addition, let's create a helper function to visualize the attention points for the predicted words:

def plot_attention(image, result, attention_plot):
    temp_image = np.array(Image.open(image))

    fig = plt.figure(figsize=(10, 10))

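    len_result = len(result)
    # square grid large enough to hold one subplot per predicted word
    grid_size = int(np.ceil(np.sqrt(len_result)))
    for l in range(len_result):
        temp_att = np.resize(attention_plot[l], (8, 8))
        ax = fig.add_subplot(grid_size, grid_size, l+1)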
        ax.set_title(result[l])
        img = ax.imshow(temp_image)
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()

# captions on the validation set
rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
real_caption = ' '.join([index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)

print ('Real Caption:', real_caption)
print ('Prediction Caption:', ' '.join(result))
plot_attention(image, result, attention_plot)
# opening the image
Image.open(img_name_val[rid])

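The output is as follows:

Deploying the captioning model

Now, let's deploy the whole module as a RESTful service. To do this, we will write inference code that loads the latest checkpoint and makes a prediction for a given image.

Look at the inference.py file in the repository. All of the code is similar to the training loop, except that we do not use teacher forcing here. At each time step, the input to the decoder is its own previous prediction, together with the hidden state and the encoder output.

One important part is loading the model into memory. We use tf.train.Checkpoint() to group the optimizer, encoder, and decoder, and then restore all of their learned weights from the latest checkpoint. The following is the code for this: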

checkpoint_dir = './my_model'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(
                                 optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder,
                                )

checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

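So, we will create an evaluate() function that defines the prediction loop. To make sure that the prediction ends after a certain number of words, we stop predicting when the model predicts the end token, <end>: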

def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    # Extract features from the test image
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    # Feature is fed into the encoder
    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []
    # Prediction loop
    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()
        result.append(index_word[predicted_id])
        # Hard stop when end token is predicted
        if index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot

Now let's use this evaluate() function in the web application code:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
@author: rahulkumar
"""

from flask import Flask , request, jsonify

import time
from inference import evaluate
import tensorflow as tf

app = Flask(__name__)

@app.route("/wowme")
def AutoImageCaption():
    image_url=request.args.get('image')
    print(image_url)
    image_extension = image_url[-4:]
    image_path = tf.keras.utils.get_file(str(int(time.time()))+image_extension, origin=image_url)
    result, attention_plot = evaluate(image_path)
    data = {'Prediction Caption:': ' '.join(result)}

    return jsonify(data)

if __name__ == "__main__":
    app.run(host = '0.0.0.0',port=8081)
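
The handler above takes the last four characters of the URL as the file extension, which only works for three-letter extensions and for URLs without a query string. A more tolerant approach might look like the following sketch. It assumes Python 3, and the helper name and the .jpg fallback are hypothetical choices, not part of the book's code.

```python
import os
from urllib.parse import urlparse

def extension_from_url(image_url, default=".jpg"):
    """Return the file extension of the URL's path, or a default if none."""
    path = urlparse(image_url).path          # drops any ?query or #fragment
    extension = os.path.splitext(path)[1]    # e.g. ".jpeg", ".png", or ""
    return extension.lower() or default

jpeg_ext = extension_from_url("https://example.com/a/photo.jpeg?size=large")
png_ext = extension_from_url("https://example.com/photo.PNG")
missing_ext = extension_from_url("https://example.com/photo")
```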

Run the following command in the terminal to start the web application:

python caption_deploy_api.py

You should get the following output:

* Running on http://0.0.0.0:8081/ (Press CTRL+C to quit)

Now we request the API, as follows:

curl "0.0.0.0:8081/wowme?image=https://www.beautifulpeopleibiza.com/images/BPI/img_bpi_destacada.jpg"
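
The same request can be prepared from Python. The sketch below only builds the request URL and percent-encodes the image parameter, so that characters such as : and / in the image address cannot be confused with the API's own URL. The helper name and the example image address are hypothetical; the /wowme route and port 8081 come from the application code.

```python
from urllib.parse import urlencode

def build_caption_request(image_url, host="0.0.0.0", port=8081):
    """Build the GET URL for the /wowme endpoint with an encoded image URL."""
    query = urlencode({"image": image_url})
    return "http://{}:{}/wowme?{}".format(host, port, query)

url = build_caption_request("https://example.com/cat.jpg")
# With the server running, the caption could then be fetched with, for example:
#   import urllib.request
#   print(urllib.request.urlopen(url).read().decode("utf-8"))
```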

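We should get the predicted caption, as shown in the following screenshot:

To get better predictions, make sure you train the model on large images.

Wow! We have just deployed a state-of-the-art automatic captioning module.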

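Summary

In this implementation, we used a pretrained Inception-v3 model, trained on the ImageNet dataset, as the feature extractor within the encoder of our deep learning solution. The solution combines state-of-the-art techniques from computer vision and natural language processing into a complete image description approach (www.cs.cmu.edu/~afarhadi/papers/sentence.pdf) that can construct computer-generated natural descriptions of any image it is given. With this trained model, we have effectively broken down the barrier between images and language, and provided a technology that could be used as part of an application to help visually impaired people enjoy the benefits of the photo-sharing megatrend. Great work!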