
Machine Learning with R Quick Start Guide (Part 1)

Original: annas-archive.org/md5/13dee0bdb6445b090ed411f424dc82f4

Translator: 飞龙

License: CC BY-NC-SA 4.0

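Preface

This book is a practical guide to building predictive models with machine learning algorithms. Tutorials commonly rely on toy or small datasets, which works well for learning basic concepts but falls short when you try to apply what you have learned to real problems.

This book covers the main steps in developing predictive models based on machine learning algorithms. Data collection, data processing, univariate and multivariate analysis, and the application of the most common machine learning algorithms are some of the steps described. This is a programming book with many lines of code, so you can replicate every step described in it.

The book shows why there is no single way to build a model; the different options available at each modeling step are the key to obtaining accurate and useful models.

The use cases in this book are based on the financial industry. This is mainly because I am familiar with the information and the problems, and because a large amount of data exists to which many techniques can be applied, which is representative of the problems found in real life.

The theoretical framework of the book is based on explaining the financial crisis and its causes. Can we predict the next financial crisis? If not, at least you will learn very practical data-handling techniques.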

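Who this book is for

This book is a useful textbook for graduate students and a reference for researchers, as well as for machine learning and big data practitioners who want to understand how to handle large amounts of data and the main issues in predictive model development and the application of machine learning algorithms. It covers fundamental modern topics in machine learning and describes some key aspects of applying the algorithms. The book focuses on credit risk and the financial crisis, so it may also interest academics in that field.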

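What this book covers

Chapter 1, R Fundamentals for Machine Learning, introduces the problem this book will solve and covers the basics of obtaining and running R for use in later chapters.

Chapter 2, Predicting Bank Failures - Data Collection, covers the main problems that arise when collecting data and how to structure the data to obtain relevant features or variables for developing your first predictive model.

Chapter 3, Predicting Bank Failures - Descriptive Analysis, shows how to observe and describe data, how to deal with highly imbalanced data, and how to handle missing values in variables.

Chapter 4, Predicting Bank Failures - Univariate Analysis, covers analyzing and measuring the individual predictive power of variables and their relationship with the target variable. Because the number of variables is large, the chapter also includes some techniques for reducing it.

Chapter 5, Predicting Bank Failures - Multivariate Analysis, shows the implementation of different machine learning algorithms. Logistic regression, regularization methods, gradient boosting, neural networks, and support vector machines (SVMs) are briefly explained and implemented in an attempt to obtain a model that accurately predicts bank failures. The chapter also includes some basic guidelines on how to combine the results of different models to improve accuracy, and on how to generate models in an automatic and visual way.

Chapter 6, Visualizing Economic Problems across Countries, covers how the financial crisis evolved into a sovereign debt crisis that even shook the foundations and solvency of the European Union. The chapter shows how to measure macroeconomic imbalances in different countries. Specifically, it will help you understand cluster analysis, models that are unsupervised in nature, and how these techniques can help solve supervised problems.

Chapter 7, Sovereign Crisis - Natural Language Processing and Topic Modeling, introduces the concepts of text mining and topic extraction. The chapter shows that text mining is very useful for gathering information from qualitative reports.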

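To get the most out of this book

This is a programming book, so some programming experience is desirable to get the most out of its content. If you are new to programming, Chapter 1, R Fundamentals for Machine Learning, will give you a starting point for understanding the R language and how it works. Basic concepts and structures are briefly explained.

This first chapter will not make you an expert in R, but it gives you the key guidelines for understanding all the code in this book.

The latest versions of R and RStudio are needed to replicate the code included in this book.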

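Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1.  Log in or register at www.packt.com.
2.  Select the Support tab.
3.  Click on Code Downloads & Errata.
4.  Enter the name of the book in the search box and follow the onscreen instructions.

Once the file is downloaded, make sure that you unzip or extract the folder using the latest version of one of the following:

*   WinRAR/7-Zip for Windows
*   Zipeg/iZip/UnRarX for Mac
*   7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at github.com/PacktPublishing/Machine-Learning-with-R-Quick-Start-Guide. If the code is updated, it will be updated in the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at **github.com/PacktPublishing/**. Check them out!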

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: www.packtpub.com/sites/default/files/downloads/9781838644338_ColorImages.pdf.

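Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We can create a list using list() or by concatenating other lists."

A block of code is set as follows:

n<-10
n
[1] 10

Any command-line input or output is written as follows:

install.packages("ggplot2")

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Look for Download and Install R, and select your operating system. We are installing for Windows, so select the Windows link."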

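Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at customercare@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packt.com/submit-erra…, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.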

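Chapter 1: R Fundamentals for Machine Learning

You are probably used to hearing words such as big data, machine learning, and artificial intelligence in the news. The number of new applications of these terms that appear every day is astonishing. The recommender systems used by Amazon and Netflix, search engines, stock market analysis, and even speech recognition are just a few examples. New algorithms and techniques appear every year, many of them based on previous approaches or combining different existing algorithms. At the same time, more and more tutorials and courses focus on teaching them.

Many courses share some common limitations, such as solving toy problems or focusing entirely on algorithms. These limitations can give you a mistaken understanding of the data modeling approach. The modeling process depends heavily on earlier steps such as business understanding, data understanding, and data preparation. Without these steps, the model is likely to be flawed when it is applied in the future. Moreover, model development does not end once a suitable algorithm has been found. Performance evaluation, interpretability, and deployment of the model are also very relevant and are the culmination of the modeling process.

In this book, we will learn how to develop different predictive models. The applications and examples in the book are based on the financial sector and attempt to build a theoretical framework that helps you understand the causes of the financial crisis, which had a huge impact on countries around the world.

All the algorithms and techniques used in this book are implemented in R. Nowadays, R is one of the main languages of data science. There is intense debate about which language is better, R or Python. Both languages have many strengths and some weaknesses.

In my experience, R is more powerful for financial data analysis. I have found many R libraries specialized in this field, but far fewer in Python. Nevertheless, credit risk and financial information are closely tied to time series handling which, at least in my opinion, Python does better. Recurrent networks, such as long short-term memory (LSTM) networks, are also better implemented in Python. However, R offers more powerful libraries for data visualization and interactive output. I recommend using R and Python interchangeably according to the needs of your project. Packt offers many good resources on machine learning with Python, some of which are listed here for your convenience:

*   Python Machine Learning - Second Edition, www.packtpub.com/big-data-and-business-intelligence/python-machine-learning-second-edition
*   Hands-On Data Science and Python Machine Learning, www.packtpub.com/big-data-and-business-intelligence/hands-data-science-and-python-machine-learning
*   Python Machine Learning By Example, www.packtpub.com/big-data-and-business-intelligence/python-machine-learning-example

In this chapter, let's refresh your knowledge of machine learning and start coding in R.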

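This chapter will cover the following topics:

*   R and RStudio installation
*   Some basic commands
*   Objects, special cases, and basic operators in R
*   Controlling code flow
*   All about R packages
*   Taking further steps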

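R and RStudio installation

Let's start by installing R. It is completely free and can be downloaded from cloud.r-project.org/. Installing R is an easy task.

Let's look at the steps for installing R on a Windows PC. For other operating systems, the steps are simple and available at the same cloud.r-project.org/ link.

Let's begin installing R on a Windows system:

1.  Visit cloud.r-project.org/.
2.  Look for Download and Install R, and select your operating system. We are installing for Windows, so select the Windows link.
3.  Go to Subdirectories and click on base.
4.  You will be redirected to a page that shows Download R X.X.X for Windows. At the time of writing this book, the version to click on is Download R 3.5.2 for Windows.
5.  Save and run the .exe file.
6.  You can now choose the setup language for installing R.
7.  A setup wizard will open; keep clicking Next until you reach Select Destination Location.
8.  Choose your preferred location and click Next.
9.  Click the Next button a few more times until R starts installing.
10. Once the installation has finished, R will notify you with the message Completing the R for Windows 3.5.2 Setup Wizard. You can now click Finish.
11. You can find the R shortcut on your desktop; double-click it to launch R.
12. As with any other application, if you cannot find R on your desktop, you can click the Start button, All Programs, and then look for R and launch it.
13. You will see a screen similar to the following screenshot:

This is the R command prompt, waiting for input.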

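Things to note about R

Before entering commands, you should know that R is a case-sensitive, interpreted language.

You can either enter commands manually or run a set of commands from a source file. R provides many built-in functions that deliver most of its functionality. As a user, you can also create your own user-defined functions.

You can also create and manipulate objects. As you may know, an object is anything that can be assigned a value. An interactive session requires all objects to be in memory during execution, whereas functions can be placed in packages that are referenced by the current program and accessed when needed.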

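Using RStudio

In addition to R, it is recommended to use RStudio. RStudio is an integrated development environment (IDE) and, like any other IDE, it enhances your interaction with R.

RStudio provides a well-organized interface that clearly presents graphs, data tables, R code, and output at the same time.

In addition, RStudio provides features such as an import wizard, for importing and exporting files in different formats without writing code.

After seeing the standard R GUI, you will find it quite similar to RStudio; the difference is that RStudio is much more intuitive and user-friendly than the R GUI. You can choose from many options in the menus and even customize the GUI to your needs. The desktop version of RStudio can be downloaded from www.rstudio.com/products/rstudio/download/#download.

RStudio installation

The installation steps are very similar to those for R, so there is no need to describe them in detail.

The first time you open RStudio, you will see three different windows. You can enable a fourth window by going to File, New File, and selecting R Script:

In the top-left window, scripts can be written, saved, and executed. The window below it on the left is the console, where R code can be executed directly.

The top-right window displays the variables and objects defined in the workspace. It also shows the history of previously executed commands. Finally, the bottom-right window displays the working directory.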

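Some basic commands

Here is a list of useful commands for getting started with R and RStudio:

*   `help.start()`: Starts the HTML version of the R documentation
*   `help(command)`/`??command`/`help.search(command)`: Displays or searches the help related to a specific command
*   `demo()`: A user-friendly interface for running demonstrations of R scripts
*   `library(help=package)`: Lists the functions and datasets in a package
*   `getwd()`: Prints the current working directory
*   `ls()`: Lists the objects in the current session
*   `setwd(mydirectory)`: Changes the working directory to `mydirectory`
*   `options()`: Displays the current option settings
*   `options(digits=5)`: Sets the number of significant digits printed in output
*   `history()`: Displays previous commands, up to a limit of 25
*   `history(max.show=Inf)`: Displays all commands regardless of the limit
*   `savehistory(file="myfile")`: Saves the history (the default file is `.Rhistory`)
*   `loadhistory(file="myfile")`: Recalls your command history
*   `save.image()`: Saves the current workspace to the `.RData` file in the working directory
*   `save(object list, file="myfile.RData")`: Saves the listed objects to the specified file
*   `load("myfile.RData")`: Loads the objects stored in the specified file
*   `q()`: Quits R, prompting you to save the current workspace
*   `library(package)`: Loads a package into the current session
*   `install.packages(package)`: Downloads and installs a package from a CRAN-like repository or even from a local file
*   `rm(object1, object2, ...)`: Removes objects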
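
As a brief illustration of a few of these commands, the following sketch inspects and cleans up the workspace. The object names `tmp_a` and `tmp_b` are placeholders chosen for this example, and the path printed by `getwd()` depends on your machine.

```r
# Show the current working directory (the path depends on your machine)
getwd()

# Create two throwaway objects and confirm that they are in the workspace
tmp_a <- 1
tmp_b <- "text"
c("tmp_a", "tmp_b") %in% ls()

# Limit printed output to 5 significant digits, then restore the old setting
old_opts <- options(digits = 5)
print(pi)
options(old_opts)

# Remove the objects again and check that one of them is gone
rm(tmp_a, tmp_b)
exists("tmp_a")
```

Saving the value returned by `options()` and passing it back afterwards is a common way to make a temporary change without disturbing the rest of the session.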

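To execute a command in RStudio, write it in the console and press Enter.

In RStudio, interactive documents can be created by combining lines of code and plain text. R Notebooks help you interact directly with R and produce publication-quality documents as output.

To create a new notebook in RStudio, go to File, New File, R Notebook. The default notebook will open, as shown in the following screenshot:

This notebook is a plain text file with an .Rmd extension. A file contains three types of content: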

*   An (optional) YAML header surrounded by --- lines
*   R code chunks delimited by `` ```{r} `` and `` ``` `` lines
*   Text mixed with simple text formatting

R code chunks allow for the execution of code and display the results in the same notebook. To execute a chunk, click the Run button within the chunk or place the cursor inside it and press Ctrl + Shift + Enter. To insert a new chunk, click the Insert Chunk button on the toolbar or press Ctrl + Alt + I.

When the notebook is saved, an HTML file containing the code and output is generated and saved alongside it. To see what the HTML file looks like, click the Preview button or use the shortcut Ctrl + Shift + K. You can find and download all the code of this book as an R Notebook, where you can execute all the code without typing it yourself.

Objects, special cases, and basic operators in R

By now, you will have figured out that R is an object-oriented language. All our variables, data, and functions will be stored in the active memory of the computer as objects. These objects can be modified using different operators or functions. An object in R has two attributes, namely, mode and length.

Mode describes the basic type of the elements and has four options:

*   Numeric: Decimal numbers
*   Character: Sequences of characters (strings)
*   Complex: Combinations of real and imaginary numbers, for example, x+ai
*   Logical: Either TRUE (1) or FALSE (0)

Length is the number of elements in an object.

In most cases, we need not care whether the elements of a numerical object are integers, reals, or even complex numbers. Calculations are carried out internally as double-precision numbers, real or complex, depending on the case. To work with complex numbers, we must explicitly indicate the complex part.

If an element or value is unavailable, we assign it the special value NA. Operations on NA elements usually result in NA, unless we use functions that can treat missing values in some way or omit them. Sometimes, calculations lead to a positive or negative infinite value (represented by R as Inf or -Inf, respectively). Other calculations lead to results that are not numbers, which R represents as NaN (short for not a number).

Working with objects

You can create an object using the <- operator:

n<-10

n

[1] 10

In the preceding code, an object called `n` is created. A value of `10` has been assigned to this object. The assignment can also be made using the `assign()` function, although this isn't very common.
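
The following minimal sketch gathers the ideas introduced so far: `assign()` as an alternative to `<-`, the mode and length attributes, and the special values `NA`, `Inf`, and `NaN`. The object names `m` and `v` are illustrative only.

```r
# assign() creates an object from a name given as a string
assign("m", 10)
mode(m)      # "numeric"
length(m)    # 1

# Objects of the other basic modes
mode("book")   # "character"
mode(TRUE)     # "logical"
mode(1 + 4i)   # "complex"

# Special values
v <- c(1, NA, 3)
sum(v)                 # NA, because one element is missing
sum(v, na.rm = TRUE)   # 4, the missing value is omitted
1 / 0                  # Inf
0 / 0                  # NaN
```

Many summary functions accept `na.rm = TRUE`, which is the usual way to omit missing values instead of propagating `NA`.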

Once the object has been created, it is possible to perform operations on it, like in any other programming language:

n+5

[1] 15


These are some examples of basic operations.

Let's create our variables:

x<-4

y<-3


Now, we can carry out some basic operations:

*   Sum of variables:

x + y

[1] 7


*   Subtraction of variables:

x - y

[1] 1


*   Multiplication of variables:

x * y

[1] 12


*   Division of variables:

x / y

[1] 1.333333


*   Power of variables:

x ^ y

[1] 64


Likewise in R, there are defined constants that are widely used, such as the following ones:

*   The `pi` (π) number:

x * pi

[1] 12.56637


*   Exponential function:

exp(y)

[1] 20.08554


There are also functions for working with numbers, such as the following:

*   Sign (positive or negative of a number):

sign(y)

[1] 1


*   Finding the maximum value:

max(x,y)

[1] 4


*   Finding the minimum value:

min(x,y)

[1] 3


*   Factorial of a number:

factorial(y)

[1] 6


*   Square root function:

sqrt(y)

[1] 1.732051


It is also possible to assign the result of previous operations to another object. For example, the sum of variables `x` and `y` is assigned to an object named `z`:

z <- x + y

z

[1] 7


As shown previously, these functions apply to numeric variables. There are also comparison operators, which return logical values:

x > y

[1] TRUE

x + y != 8

[1] TRUE


The main logical operators are summarized in the following table:

| **Operator** | **Description** |
| --- | --- |
| < | Less than |
| <= | Less than or equal to |
| > | Greater than |
| >= | Greater than or equal to |
| == | Equal to |
| != | Not equal to |
| !x | Not *x* |
| x \| y | *x* or *y* |
| x & y | *x* and *y* |
| isTRUE(x) | Test if *x* is TRUE |

# Working with vectors

A **vector** is one of the basic data structures in R. All of its elements must be of the same type, which can be logical, integer, double, complex, character, or raw. Let's see how vectors work.

Let's create some vectors by using `c()`:

a<-c(1,3,5,8)

a

[1] 1 3 5 8


When objects of different types are combined in a vector, the elements are converted so that they all belong to the same class. A vector that contains only numbers has the `numeric` class:

y <- c(1,3)

class(y)

[1] "numeric"
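
To make the conversion visible, here is a minimal sketch that mixes types inside `c()`. The vector names are hypothetical; the conversion order (logical, then numeric, then character) is standard R behavior.

```r
num_chr <- c(1, "book")   # the number is converted to a character string
class(num_chr)            # "character"

lgl_num <- c(TRUE, 2)     # TRUE is converted to 1
class(lgl_num)            # "numeric"
lgl_num                   # 1 2
```

Because the conversion is silent, it is worth checking `class()` after combining values read from different sources.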


When we apply commands and functions to a vector variable, they are also applied to every element in the vector:

y <- c(1,5,1)

y + 3

[1] 4 8 4


You can use the `:` operator if you wish to create a vector of consecutive numbers:

c(1:10)

[1] 1 2 3 4 5 6 7 8 9 10


Do you need to create more complex vectors? Then use the `seq()` function. You can fix the step size with `by`, or fix the number of points in an interval with `length.out` and let R work out the step size, which can be handy in machine learning:

seq(1, 5, by=0.1)

[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6

[18] 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3

[35] 4.4 4.5 4.6 4.7 4.8 4.9 5.0

seq(1, 5, length.out=22)

[1] 1.000000 1.190476 1.380952 1.571429 1.761905 1.952381 2.142857

[8] 2.333333 2.523810 2.714286 2.904762 3.095238 3.285714 3.476190

[15] 3.666667 3.857143 4.047619 4.238095 4.428571 4.619048 4.809524

[22] 5.000000


The `rep()` function is used to repeat the value of *x*, *n* number of times:

rep(3,20)

[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3


# Vector indexing

The elements of a vector are stored in order, and indexing is how we access them.

An index vector can be logical, integer, or character.

A vector of positive integers, counted from 1, selects the elements at those positions. Negative values can also be used, to exclude elements.

Let's see some examples of indexing:

*   Returns the *n*th element of *x*:

x <- c(9,8,1,5)

x[3]

[1] 1


*   Returns all *x* values except the *n*th element:

x[-3]

[1] 9 8 5


*   Returns the elements from position *a* to position *b*:

x[1:2]

[1] 9 8


*   Returns items that are greater than *a* and less than *b*:

x[x>0 & x<4]

[1] 1


Moreover, you can even use a logical vector. In this case, the elements at positions marked `TRUE` are returned, and those marked `FALSE` are dropped:

x[c(TRUE, FALSE, FALSE, TRUE)]

[1] 9 5
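
Character indexes work when the elements of a vector have names. The sketch below uses a hypothetical named vector, `scores`, to show this third kind of index.

```r
# A named vector: each element can be addressed by its name
scores <- c(first = 9, second = 8, third = 1, fourth = 5)

scores["third"]                        # the element named "third"
scores[c("first", "fourth")]           # several elements by name
unname(scores[c("first", "fourth")])   # 9 5, with the names dropped
```

Indexing by name keeps working if the elements are later reordered, which positional indexing does not.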


# Functions on vectors

In addition to the functions and operators that we've seen for numerical values, there are some specific functions for vectors, such as the following:

*   Sum of the elements present in a vector:

sum(x)

[1] 23


*   Product of elements in a vector:

prod(x)

[1] 360


*   Length of a vector:

length(x)

[1] 4


*   Modifying a vector using the `<-` operator:

x

[1] 9 8 1 5

x[1]<-22

x

[1] 22 8 1 5


# Factor

A **factor** is a vector used to represent categorical data. It stores the values together with the set of distinct categories, known as levels, that the variable can take. Factors are created with the `factor()` function:

r<-c(1,4,7,9,8,1)

r<-factor(r)

r

[1] 1 4 7 9 8 1

Levels: 1 4 7 8 9


# Factor levels

**Levels** are the possible values that a variable can take. Although the original value 1 is repeated, it appears only once in the levels.

Factors can be created from either numeric or character variables, but the levels of a factor are always stored as characters.

Let's run the `levels()` command:

levels(r)

[1] "1" "4" "7" "8" "9"


As you can see, `1`, `4`, `7`, `8`, and `9` are the possible levels that the factor `r` can have.

The `exclude` parameter allows you to exclude levels when creating a factor. Values that belong to an excluded level become `NA`:

factor(r, exclude=4)

[1] 1 <NA> 7 9 8 1

Levels: 1 7 8 9


Finally, let's create an ordered factor by setting `ordered=TRUE`, so that the levels have a defined order:

a<- c(1,2,7,7,1,2,2,7,1,7)

a<- factor(a, levels=c(1,2,7), ordered=TRUE)

a

[1] 1 2 7 7 1 2 2 7 1 7

Levels: 1 < 2 < 7
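
The practical effect of ordering is that comparisons between values become meaningful. The following sketch assumes a hypothetical categorical variable with the levels low, medium, and high.

```r
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)

is.ordered(sizes)   # TRUE
sizes < "high"      # TRUE FALSE TRUE TRUE
table(sizes)        # counts per level, listed in level order
```

With an unordered factor, the same comparison would return `NA` with a warning, because no order is defined between the levels.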


# Strings

Any value written within single or double quotes is considered a **string**:

c<-"This is our first string"

c

[1] "This is our first string"

class(c)

[1] "character"


Single quotes are allowed, but R prints strings with double quotes regardless of how they were entered.

# String functions

Let's see how we can transform or convert strings using R.

The most relevant string functions are as follows:

*   To know the number of characters in a string:

nchar(c)

[1] 24


*   To return the substring of *x*, originating at a particular character in *x*:

substring(c,4)

[1] "s is our first string"


*   To return the substring of *x* originating at one character located at *n* and ending at another character located at a place, *m*:

substring(c,1,4)

[1] "This"


*   To divide the string *x* into a list of substrings, using the delimiter as a separator:

strsplit(c, " ")

[[1]]

[1] "This" "is" "our" "first" "string"


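*   To find which elements of a character vector contain a given pattern. `grep()` returns the indices of the matching elements (here `1`, since the only element matches) and `integer(0)` when there is no match: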

grep("our", c)

[1] 1

grep("book", c)

integer(0)


*   To look for the first occurrence of a pattern in a string:

regexpr("our", c)

[1] 9

attr(,"match.length")

[1] 3

attr(,"index.type")

[1] "chars"

attr(,"useBytes")

[1] TRUE


*   To convert the string into lowercase:

tolower(c)

[1] "this is our first string"


*   To convert the string into capital letters:

toupper(c)

[1] "THIS IS OUR FIRST STRING"


*   To replace the first occurrence of the pattern in a string with the given value:

sub("our", "my", c)

[1] "This is my first string"


*   To replace all occurrences of the pattern in a string with the given value:

gsub("our", "my", c)

[1] "This is my first string"


*   To join strings, placing the given separator between them, using `paste(string1, string2, sep="Separator")`:

paste(c,"My book",sep=" : ")

[1] "This is our first string : My book"
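
Several of the functions above are often combined. The sketch below applies them to the individual words of the same sentence and contrasts `grep()`, which returns positions, with `grepl()`, which returns one logical value per element. The variable names `s` and `words` are chosen for this example.

```r
s <- "This is our first string"
words <- strsplit(s, " ")[[1]]   # character vector of five words

grep("i", words)                 # positions of words containing "i": 1 2 4 5
grepl("i", words)                # TRUE TRUE FALSE TRUE TRUE
toupper(substring(s, 1, 4))      # "THIS"
```

`grepl()` is usually the more convenient choice inside an `if` condition or as a logical index.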


# Matrices

A matrix has a two-dimensional, rectangular layout, and matrices in R are no different.

# Representing matrices

To represent a matrix with *r* rows and *c* columns, the `matrix()` function is used. By default, the matrix is filled column by column:

m<-matrix(c(1,2,3,4,5,6), nrow=2, ncol=3)

m

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6


# Creating matrices

A matrix can be created by rows instead of by columns, which is done by using the `byrow` parameter, as follows:

m<-matrix(c(1,2,3,4,5,6), nrow=2, ncol=3,byrow=TRUE)

m

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6


With the `dimnames` parameter, row and column names can be added to the matrix:

m<-matrix(c(1,2,3,4,5,6), nrow=2, ncol=3,byrow=TRUE,dimnames=list(c('Obs1', 'Obs2'), c('col1', 'Col2','Col3')))

m

col1 Col2 Col3

Obs1 1 2 3

Obs2 4 5 6


There are three more alternatives for creating matrices, namely `rbind()`, `cbind()`, and `array()`:

rbind(1:3,4:6,10:12)

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6

[3,] 10 11 12

cbind(1:3,4:6,10:12)

[,1] [,2] [,3]

[1,] 1 4 10

[2,] 2 5 11

[3,] 3 6 12

m<-array(c(1,2,3,4,5,6), dim=c(2,3))

m

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6


# Accessing elements in a matrix

You can access the elements of a matrix in a similar way to the elements of a vector, using indexing. Here, however, the index consists of a row number and a column number, in that order.

Here are some examples of accessing elements:

*   If you want to access the element in the first row and second column:

m<-array(c(1,2,3,4,5,6), dim=c(2,3))

m

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6

m[1,2]

[1] 3


*   Similarly, accessing the element at the second column and second row:

m[2,2]

[1] 4


*   Accessing the elements in only the second row:

m[2,]

[1] 2 4 6


*   Accessing only the first column:

m[,1]

[1] 1 2


# Matrix functions

Furthermore, there are specific functions for matrices:

*   The following function extracts the diagonal as a vector:

m<-matrix(c(1,2,3,4,5,6,7,8,9), nrow=3, ncol=3)

m

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

diag(m)

[1] 1 5 9


*   Returns the dimensions of a matrix:

dim(m)

[1] 3 3


*   Returns the sum of columns of a matrix:

colSums(m)

[1] 6 15 24


*   Returns the sum of rows of a matrix:

rowSums(m)

[1] 12 15 18


*   The transpose of a matrix can be obtained using the following code:

t(m)

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6

[3,] 7 8 9


*   Returns the determinant of a matrix:

det(m)

[1] 0


*   The eigenvalues and eigenvectors of a matrix are obtained using the following code:

eigen(m)

eigen() decomposition

$values

[1] 1.611684e+01 -1.116844e+00 -5.700691e-16

$vectors

[,1] [,2] [,3]

[1,] -0.4645473 -0.8829060 0.4082483

[2,] -0.5707955 -0.2395204 -0.8164966

[3,] -0.6770438 0.4038651 0.4082483
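
As a quick sanity check on the decomposition, the sketch below verifies the defining property of an eigenpair, namely that multiplying the matrix by an eigenvector gives the same result as scaling that eigenvector by its eigenvalue. The names `e`, `lhs`, and `rhs` are introduced here for illustration.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
e <- eigen(m)

# For the first eigenpair, m %*% v should equal lambda * v (up to rounding)
lhs <- m %*% e$vectors[, 1]
rhs <- e$values[1] * e$vectors[, 1]
all.equal(as.vector(lhs), rhs)

# The third eigenvalue is numerically zero, consistent with det(m) being 0
abs(e$values[3]) < 1e-8
```

Note that `%*%` is the matrix product, whereas `*` multiplies element by element. `all.equal()` is used instead of `==` because floating-point results rarely match exactly.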


# Lists

A **list** is an ordered collection of objects, known as its components, which can be of different types.

# Creating lists

We can create a list using `list()` or by concatenating other lists:

x<- list(1:4,"book",TRUE, 1+4i)

x

[[1]]

[1] 1 2 3 4

[[2]]

[1] "book"

[[3]]

[1] TRUE

[[4]]

[1] 1+4i


Components are ordered and numbered, so they can always be referred to by their position.

# Accessing components and elements in a list

To access each component in a list, a double bracket should be used:

x[[1]]

[1] 1 2 3 4


It is also possible to access individual elements within a component:

x[[1]][2:4]

[1] 2 3 4
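
Components can also be given names, which allows access with `$` or with a name inside double brackets. The sketch below uses a hypothetical list, `book`, and also shows the difference between single and double brackets.

```r
book <- list(title = "book", pages = 1:4, available = TRUE)

book$title               # "book"
book[["pages"]]          # 1 2 3 4
class(book["pages"])     # "list": single brackets keep the list wrapper
class(book[["pages"]])   # "integer": double brackets extract the component

# Lists can be concatenated with c()
longer <- c(book, list(price = 1 + 4i))
length(longer)           # 4
```

A common source of errors is using single brackets when the component itself is needed, since the result is still a list of length one.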


# Data frames

Data frames are special lists used to store tabular data. There is one constraint: all the elements of the list must have the same length. Each element of the list is a column, and the length of the elements is the number of rows.

Unlike a matrix, a data frame can hold columns of different classes, although all the values within a single column share the same class.

Let's quickly create a data frame using the `data.frame()` function:

a <- c(1, 3, 5)

b <- c("red", "yellow", "blue")

c <- c(TRUE, FALSE, TRUE)

df <- data.frame(a, b, c)

df

a b c

1 1 red TRUE

2 3 yellow FALSE

3 5 blue TRUE


You can see the headers of the table, `a`, `b`, and `c`; they are the column names. Every line of the table represents a row, starting with the name of each row.

# Accessing elements in data frames

It is possible to access each cell in the table.

To do this, specify the coordinates of the desired cell. The coordinates begin with the position of the row and end with the position of the column:

df[2,1]

[1] 3


We can even use the row and column names instead of numeric values:

df[,"a"]

[1] 1 3 5
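
Two further access patterns are used constantly in practice: the `$` operator for a single column, and a logical condition in the row position to filter rows. The sketch below rebuilds the same small data frame so that it is self-contained.

```r
df <- data.frame(a = c(1, 3, 5),
                 b = c("red", "yellow", "blue"),
                 c = c(TRUE, FALSE, TRUE))

df$b             # one column, selected by name
df[df$a > 1, ]   # the rows in which a is greater than 1
df[df$c, "a"]    # the values of a in rows where c is TRUE: 1 5
```

Leaving the column position empty, as in `df[df$a > 1, ]`, keeps all the columns of the selected rows.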


Some packages contain datasets that can be loaded to the workspace, for example, the `iris` dataset:

data(iris)


# Functions of data frames

Some functions can be used on data frames:

*   To find out the number of columns in a data frame:

ncol(iris)

[1] 5


*   To obtain the number of rows:

nrow(iris)

[1] 150


*   To print the first `10` rows of data:

head(iris,10)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

7 4.6 3.4 1.4 0.3 setosa

8 5.0 3.4 1.5 0.2 setosa

9 4.4 2.9 1.4 0.2 setosa

10 4.9 3.1 1.5 0.1 setosa


*   Print the last `5` rows of the `iris` dataset:

tail(iris,5)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

146 6.7 3.0 5.2 2.3 virginica

147 6.3 2.5 5.0 1.9 virginica

148 6.5 3.0 5.2 2.0 virginica

149 6.2 3.4 5.4 2.3 virginica

150 5.9 3.0 5.1 1.8 virginica


*   Finally, general information of the entire dataset is obtained using `str()`:

str(iris)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


Although there are a lot of operations to work with data frames, such as merging, combining, or slicing, we won't go any deeper for now. We will be using data frames in further chapters, and shall cover more operations later.

# Importing or exporting data

In R, there are several functions for reading and writing data from many sources and formats. Importing data into R is quite simple.

The most common files to import into R are Excel or text files. Nevertheless, in R, it is also possible to read files in SPSS, SYSTAT, Stata, or SAS formats, among others.

In the case of Stata and SYSTAT files, I would recommend the use of the `foreign` package.

Let's install and load the `foreign` package:

install.packages("foreign")

library(foreign)


For SPSS and SAS files, the `Hmisc` package offers ease of use and functionality:

install.packages("Hmisc")

library(Hmisc)


Let's see some examples of importing data:

*   Import a comma delimited text file. The first row contains the variable names, the comma is used as a separator, and the `id` column provides the row names:

mydata<-read.table("c:/mydata.csv", header=TRUE,sep=",", row.names="id")


*   To read an Excel file, you can either export it to a comma delimited file and then import it, or use the `xlsx` package. Make sure that the first row contains the column names, which become the variable names.
*   Let's read the first worksheet from a workbook, `myexcel.xlsx`:

library(xlsx)

mydata<-read.xlsx("c:/myexcel.xlsx", 1)


*   Now, we will read a specific sheet of an Excel file, selected by name:

mydata<-read.xlsx("c:/myexcel.xlsx", sheetName= "mysheet")


*   Reading from the SYSTAT format:

library(foreign)

mydata<-read.systat("c:/mydata.syd")


*   Reading from the SPSS format:
    1.  First, the file should be saved from SPSS in a transport format:

get file='c:/mydata.sav'.

export outfile='c:/mydata.por'.


    2.  Then, the file can be imported into R with the `Hmisc` package:

library(Hmisc)

mydata<-spss.get("c:/mydata.por", use.value.labels=TRUE)


*   To import a file from SAS, again, the dataset should first be converted in SAS to a transport format:

libname out xport 'c:/mydata.xpt';

data out.mydata;

set sasuser.mydata;

run;

library(Hmisc)

mydata<-sasxport.get("c:/mydata.xpt")


*   Reading from the Stata format:

library(foreign)

mydata<-read.dta("c:/mydata.dta")


Hence, we have seen how easy it is to read data from different file formats. Let's see how simple exporting data is.

There are analogous functions to export data from R to other formats. For SAS, SPSS, and Stata, the `foreign` package can be used. For Excel, you will need the `xlsx` package.

Here are a few exporting examples:

*   We can export data to a tab delimited text file like this:

write.table(mydata, "c:/mydata.txt", sep="\t")


*   We can export to an Excel spreadsheet like this:

library(xlsx)

write.xlsx(mydata, "c:/mydata.xlsx")


*   We can export to SPSS like this:

library(foreign)

write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps", package="SPSS")


*   We can export to SAS like this:

library(foreign)

write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")


*   We can export to Stata like this:

library(foreign)

write.dta(mydata, "c:/mydata.dta")
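
The examples above depend on files and packages that may not be present on your machine. As a self-contained sketch using only base R, the following code writes a small data frame to a temporary CSV file and reads it back. The data values and the use of `tempfile()` are assumptions made for this illustration.

```r
mydata <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# Write to a temporary CSV file; row.names = FALSE avoids an extra column
path <- tempfile(fileext = ".csv")
write.csv(mydata, path, row.names = FALSE)

# Read the file back; read.csv() treats the first row as the header by default
restored <- read.csv(path)
all.equal(mydata, restored)

unlink(path)   # clean up the temporary file
```

A round trip like this is a simple way to confirm that an export preserves column names and types before sharing the file.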


# Working with functions

Functions are the core of R, and they are useful to structure and modularize code. We have already seen some functions in the preceding sections. These are built-in functions, available either in base R or through the packages we install.

On the other hand, we can define and create our own functions based on different operations and computations we want to perform on the data. We will create functions in R using the `function()` directive, and these functions will be stored as objects in R.

Here is what the structure of a function in R looks like:

myfunction <- function(arg1, arg2, ...)

{

statements

return(object)

}


The objects created inside a function are local to that function, and the object it returns can be of any data type. We can even pass functions as arguments to other functions.

Functions in R support nesting, which means that we can define a function within a function and the code will work just fine.

The value returned by a function is the value of the last expression evaluated, unless `return()` is called explicitly.

Once a function is defined, we can use that function using its name and passing the required arguments.

Let's create a function named `squaredNum`, which calculates the square value of a number:

squaredNum<-function(number)

{

a<-number^2

return(a)

}


Now, we can calculate the square of any number using the function that we just created:

squaredNum(425)

[1] 180625
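
Two of the points made earlier, that the last evaluated expression is the return value and that functions can be passed to other functions, can be sketched as follows. The `power()` function is a hypothetical generalization of `squaredNum` with a default exponent, not code from the book.

```r
# power() is a hypothetical generalization of squaredNum
power <- function(number, exponent = 2) {
  number^exponent   # the last evaluated expression is the return value
}

power(425)     # 180625, using the default exponent
power(2, 3)    # 8

# Functions are objects, so they can be passed to other functions
sapply(c(1, 2, 3), power)   # 1 4 9
```

Default argument values keep the common case short while still allowing the caller to override the behavior.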


As we move on in this book, we will see how important such user-defined functions are.

# Controlling code flow

R has a set of control structures that organize the flow of execution of a program, depending on the conditions of the environment. Here are the most important ones:

*   `if`/`else`: Tests a condition and executes code accordingly
*   `for`: Executes a loop a fixed number of times, once for each element of a sequence
*   `while`: Evaluates a condition and keeps executing the loop for as long as the condition is true
*   `repeat`: Executes a loop indefinitely, until it is interrupted
*   `break`: Used to interrupt the execution of a loop
*   `next`: Skips the rest of the current iteration and moves on to the next one
*   `return`: Exits a function

The basic structure of `if` is `if (test_expression) { statement }`.

Here, if `test_expression` evaluates to true, the `statement` is executed; otherwise, it is not.

An `else` branch can be added as follows: `if (test_expression) { statement1 } else { statement2 }`.

In this case, `statement2` is executed only if `test_expression` evaluates to false.

Let's see how this works. We will evaluate an `if` expression like so:

x<-4

y<-3

if (x >3) {

y <- 10

} else {

y<- 0

}


Since `x` is greater than `3`, `y` is assigned the value `10`:

print(y)

[1] 10


If there are more than two alternatives, additional conditions are chained with `else if`, like this: `if (test_expression1) { statement1 } else if (test_expression2) { statement2 } else if (test_expression3) { statement3 } else { statement4 }`.
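
A chained condition of this kind can be sketched as follows. The `classify()` function and its thresholds are illustrative choices, not part of the book's code.

```r
# classify() is an illustrative helper with arbitrary thresholds
classify <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    "zero"
  } else if (x < 10) {
    "small"
  } else {
    "large"
  }
}

classify(-3)   # "negative"
classify(0)    # "zero"
classify(4)    # "small"
classify(25)   # "large"
```

The conditions are tested from top to bottom and only the first branch that matches is executed, so the order of the tests matters.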

The `for` command takes an iterator variable and assigns it the successive values of a sequence or vector. It is usually used to iterate over the elements of an object, such as a vector or a list.

An easy example is as follows, where the `i` variable takes different values from `1` to `10` and prints them. Then, the loop finishes:

for (i in 1:10){

print(i)

}

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5

[1] 6

[1] 7

[1] 8

[1] 9

[1] 10


Additionally, loops can be nested in the same code:

x<- matrix(1:6,2,3)

for (i in seq_len(nrow(x))){

for (j in seq_len(ncol(x))){

print(x[i,j])}

}

[1] 1

[1] 3

[1] 5

[1] 2

[1] 4

[1] 6


The `while` command is used to create loops until a specific condition is met. Let's look at an example:

x <- 1

while (x >= 1 & x < 20){

print(x)

x = x+1

}

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5

[1] 6

[1] 7

[1] 8

[1] 9

[1] 10

[1] 11

[1] 12

[1] 13

[1] 14

[1] 15

[1] 16

[1] 17

[1] 18

[1] 19


Here, the values of `x` are printed while `x` is greater than or equal to `1` and less than `20`. A while loop starts by testing the condition; if it is true, the body of the loop is executed. The condition is then tested again, and the loop keeps running until the result is false.

The `repeat` and `break` commands are related. The `repeat` command starts an infinite loop, so the only way out of it is through the `break` instruction:

x <- 1

repeat{

print(x)

x = x+1

if (x == 6){

break

}

}

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5


We can use the `break` statement inside for and while loops to stop the iterations of the loop and control the flow.

Finally, the `next` command can be used to skip an iteration without terminating the loop. When R reaches `next`, it abandons the current iteration and moves on to the next one.

Let's look at an example of `next`, where the first 5 of 15 iterations are skipped:

for (i in 1:15){

if (i <= 5){

next

} else { print(i)

} }

[1] 6

[1] 7

[1] 8

[1] 9

[1] 10

[1] 11

[1] 12

[1] 13

[1] 14

[1] 15
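
`next` and `break` can also be combined in the same loop. In the following sketch, which uses a hypothetical accumulator named `evens`, odd numbers are skipped and the loop stops as soon as the counter exceeds 10.

```r
evens <- c()
for (i in 1:20) {
  if (i %% 2 == 1) next   # skip odd numbers
  if (i > 10) break       # stop once i exceeds 10
  evens <- c(evens, i)    # keep the even numbers seen so far
}
evens   # 2 4 6 8 10
```

The loop ends at `i = 12`, the first even number above 10, so the remaining iterations are never executed.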


Before we start the next chapters of this book, it is recommended that you practice with these examples. Take your time and think about the code and how to use it. In the upcoming chapters, you will see a lot of code and new functions. Don't be concerned if you don't understand all of them. It is more important to understand the entire process of developing a predictive model and all the things you can do with R.

I have tried to make all of the code accessible, and it is possible to replicate all the tables and results provided in this book. Just enjoy understanding the process and reuse all the code you need in your own applications.

# All about R packages

Packages in R are a collection of functions and datasets that are developed by the community.

# Installing packages

Although R contains several functions in its basic installation, we will need to install additional packages to add new R functionalities. For example, with R it is possible to visualize data using the `plot` function. Nevertheless, we could install the `ggplot2` package to obtain more attractive plots.

A package mainly includes R code (not always just R code), documentation with explanations about the package and functions inside it, examples, and even datasets.

Packages are placed on different repositories where you can install them.

Two of the most popular repositories for R packages are as follows:

*   **CRAN**: The official repository, maintained by the R community around the world. All of the packages that are published on this repository should meet quality standards.
*   **GitHub**: This repository is not specific for R packages, but many of the packages have open source projects located in them. Unlike CRAN, there is no review process when a package is published.

To install a package from CRAN, use the `install.packages()` command. For example, the `ggplot2` package can be installed using the following command:

```r
install.packages("ggplot2")
```


To install packages from repositories other than CRAN, I would recommend using the `devtools` package:

```r
install.packages("devtools")
```


This package simplifies the process of installing packages from different repositories. With this package, some functions are available, depending on the repository you want to download a package from.

For example, use `install_cran()` to download a package from CRAN or `install_github()` to download it from GitHub.

After the package has been downloaded and installed, we'll load it into our current R session using the `library` function. It is important to load packages so that we can use these new functions in our R session:

```r
library(ggplot2)
```


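The `require` function can also be used to load packages. The only difference between `require` and `library` is what happens when a package cannot be found: `library` stops with an error, whereas `require` only issues a warning, returns `FALSE`, and lets the code continue.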
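The difference can be checked with a short sketch. The package name below is deliberately made up so that the "not found" path is triggered; everything else is base R.

```r
# Hypothetical package name, chosen only because it should not exist.
missing_pkg <- "thisPackageDoesNotExist123"

# require() returns FALSE (and warns) when the package is unavailable.
loaded <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)
print(loaded)

# library() raises an error instead, which we catch here to inspect it.
result <- tryCatch({
  library(missing_pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")
print(result)
```

In scripts, this makes `require` convenient inside an `if` statement, while `library` is usually the safer choice when the rest of the code cannot run without the package.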

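# Necessary packages

To run all of the code shown in this book, you will need to install some of the packages we have mentioned. Specifically, you need to install the following packages (sorted alphabetically):

+   `Amelia`: A package for visualizing and imputing missing data.
+   `Boruta`: Implements a feature selection algorithm for finding relevant variables.
+   `caret`: This package (short for **classification and regression training**) implements several machine learning algorithms for building predictive models.
+   `caTools`: Contains several basic utility functions, including functions for prediction metrics or for splitting samples.
+   `choroplethr`/`choroplethrMaps`: Creates maps in R.
+   `corrplot`: Calculates the correlation among variables and displays it graphically.
+   `DataExplorer`: Includes different functions for the data exploration process.
+   `dplyr`: A package for data manipulation.
+   `fBasics`: Includes techniques for exploratory data analysis.
+   `funModeling`: Functions for data cleaning, variable importance analysis, and model performance.
+   `ggfortify`: Functions providing data visualization tools for statistical analysis.
+   `ggplot2`: A system for declaratively creating graphics.
+   `glmnet`: A package oriented to Lasso and elastic-net regularized regression models.
+   `googleVis`: An R interface to Google Charts.
+   `h2o`: A package that includes fast and scalable algorithms, including gradient boosting, random forest, and deep learning.
+   `h2oEnsemble`: Provides functionality for creating ensembles from the base learning algorithms that are accessible through the `h2o` package.
+   `Hmisc`: Contains many functions useful for data analysis and for importing files in different formats.
+   `kohonen`: Facilitates the creation and visualization of self-organizing maps.
+   `lattice`: A package for creating powerful graphics.
+   `lubridate`: Contains functions for working with dates in an easy way.
+   `MASS`: Contains several statistical functions.
+   `plotrix`: Contains many plotting, labeling, axis, and color scaling functions.
+   `plyr`: Contains tools for splitting, applying, and combining data.
+   `randomForest`: The random forest algorithm for classification and regression.
+   `rattle`: Provides a GUI for different R packages that can aid data mining.
+   `readr`: Provides a fast and friendly way to read `.csv`, `.tsv`, or `.fwf` files.
+   `readtext`: Functions to import and handle plain and formatted text files.
+   `recipes`: A useful package for data manipulation and analysis.
+   `rpart`: Implements classification and regression trees.
+   `rpart.plot`: The easiest way to plot a tree created with the `rpart` package.
+   `Rtsne`: An implementation of **t-distributed Stochastic Neighbor Embedding** (**t-SNE**).
+   `RWeka`: Contains many data mining algorithms, as well as tools to preprocess and classify data. It provides an easy-to-use interface for operations such as regression, clustering, association, and visualization.
+   `rworldmap`: Enables the mapping of country-level and gridded user datasets.
+   `scales`: Provides methods to automatically detect breaks and determine labels for axes and legends. It does the work of mapping.
+   `smbinning`: A set of functions for building scoring models.
+   `SnowballC`: Makes it easy to use the well-known Porter stemming algorithm, which collapses words to a common root to aid the comparison of vocabulary.
+   `sqldf`: Functions for manipulating R data frames using SQL.
+   `tibbletime`: Useful functions for working with time series.
+   `tidyquant`: A package focused on retrieving, manipulating, and scaling financial data analysis in the easiest way possible.
+   `tidyr`: Includes functionality for data frame manipulation.
+   `tidyverse`: A package that bundles packages for manipulating, exploring, and visualizing data.
+   `tm`: A text mining package for R.
+   `VIM`: With this package, missing values can be visualized.
+   `wbstats`: This package gives you access to data and statistics from the World Bank API.
+   `WDI`: Searches, extracts, and formats data from the World Bank's **World Development Indicators** (**WDI**).
+   `wordcloud`: This package provides powerful functions that help you create pretty word clouds. It can also help to visualize differences and similarities between two documents.

Once these packages are installed, we can start to use all the code included in the following chapters.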
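Rather than installing the packages one by one, it can be convenient to check which ones are still missing first. The following is a minimal sketch: the helper name `missing_packages` is hypothetical, and the vector only includes a few of the names from the list, so extend it as needed. Packages that are not hosted on CRAN would need a different installation route, such as `devtools`.

```r
# A few of the packages from the list; add the remaining names as needed.
required_pkgs <- c("ggplot2", "dplyr", "caret", "glmnet")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required_pkgs)
print(to_install)

# Uncomment to install whatever is missing from CRAN:
# if (length(to_install) > 0) install.packages(to_install)
```

The installation line is left commented out so that running the sketch never changes your library by accident.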

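# Further steps

We will use a US bankruptcy problem statement to help you gain insight into the machine learning process and to give you practical experience of handling and solving real-world problems. The following chapters describe each step in detail.

The aim of the following chapters is to describe all of the steps and alternatives involved in developing a model based on machine learning techniques.

We will go through several steps, from extracting information and generating new variables to validating the model. As we will see, at each step of the development there are several alternatives. In most cases, the best alternative is the one that yields the better predictive model, but sometimes a different alternative is chosen because of constraints imposed by the future use of the model or by the type of problem we want to solve.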

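# Background on the financial crisis

In this book, we will address two different problems related to the financial crisis: the bankruptcy of US banks and the assessment of the solvency of European countries. Why did I choose such specific problems for this book? Firstly, because of my concern about financial crises and my aim of helping to avoid future ones. Secondly, it is an interesting problem because of the large amount of data available, which makes it well suited to understanding machine learning techniques.

Most of the chapters in this book cover the development of a predictive model to detect failures of banks. To solve this problem, we will use a large dataset that exhibits some of the most typical problems you may face when working with different algorithms: for example, a large number of observations and variables, and an unbalanced sample, which means that one class in the classification model is much larger than the other.

Some of the steps that we will see in the following chapters are as follows:

+   Data collection
+   Feature generation
+   Descriptive analysis
+   Treatment of missing information
+   Univariate analysis
+   Multivariate analysis
+   Model selection

The final chapters will focus on developing models to detect economic imbalances in European countries, while covering some basic text mining and clustering techniques.

Although this book is technical, one of the most important aspects of every big data and machine learning solution is understanding the problem that we need to solve.

By the end of this book, you will see that knowing the algorithms alone is not enough to develop a model. There are many important steps to follow before jumping into running algorithms. If you pay attention to these preliminary steps, you are more likely to obtain good results.

In this sense, and because I am passionate about economic theory, the repository that holds the code for this book includes a summary, from an economic point of view, of the causes of the problems we are going to solve. Specifically, it describes the causes of the financial crisis and how it spread and turned into a sovereign crisis.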

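# Summary

In this opening chapter, we established the purpose of this book. Now that you have a basic understanding of R and its concepts, we will move on to developing two main predictive models. We will cover all the necessary steps: data collection, data analysis, and feature selection, and describe different algorithms in a practical way.

In the next chapter, we will start to solve the programming problem and collect the data needed to begin model development.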


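# Chapter 2: Predicting Failures of Banks - Data Collection

In every model development, we need to obtain enough data to build the model. It is very common to read the expression "garbage in, garbage out": if you develop a model with poor data, the resulting model will be poor as well.

Especially in machine learning applications, we expect to have large amounts of data, although in many cases this is not so. Regardless of the amount of information available, the quality of the data is the most important issue.

Moreover, as a developer, it is very important to have structured data, because it can be worked on immediately. However, data usually comes in an unstructured form, which means that processing and preparing it for development takes a lot of time. Many people think that machine learning applications are only about using new algorithms or techniques, when in reality the process is more complex and requires more time to understand the data you have and to obtain the maximum value from all the observations. Through the real case discussed in this book, we will see that data collection, cleaning, and preparation are some of the most important and time-consuming tasks.

In this chapter, we will look at how to collect data for our problem statement:

+   Collecting financial data
+   Collecting the target variable
+   Structuring data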

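# Collecting financial data

We will obtain our data from the website of the **Federal Deposit Insurance Corporation** (**FDIC**) ([`www.fdic.gov/`](https://www.fdic.gov/)). The FDIC is an independent agency created by the US Congress whose goal is to maintain people's confidence and the stability of the financial system.

# Why FDIC?

The FDIC provides deposit insurance to depositors in US commercial banks and savings institutions. Thus, if a US bank fails and closes, the FDIC guarantees that depositors will not lose their savings, up to a maximum of $250,000.

The FDIC also examines and supervises certain financial institutions. These institutions are obliged to periodically report detailed information from their financial statements related to the following:

+   Capital levels
+   Solvency
+   The quality, type, liquidity, and diversification of assets
+   Loan and investment concentrations
+   Earnings
+   Liquidity

Information on banks is publicly available on the FDIC website, and we can download it for our purposes. We will find that the information is already structured in the so-called **Uniform Bank Performance Report** (**UBPR**), which includes several ratios that combine different accounts from the financial statements.

For example, if you want to obtain the UBPR of a specific bank, or just want to see any other UBPR report, you can select Uniform Bank Performance Report (UBPR) at [`cdr.ffiec.gov/public/ManageFacsimiles.aspx`](https://cdr.ffiec.gov/public/ManageFacsimiles.aspx):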

![图片](https://p6-xtjj-sign.byteimg.com/tos-cn-i-73owjymdk6/801ae15d47794a4889bb8459ce9cef78~tplv-73owjymdk6-jj-mark-v1:0:0:0:0:5o6Y6YeR5oqA5pyv56S-5Yy6IEAg5biD5a6i6aOe6b6Z:q75.awebp?rk3s=f64ab15b&x-expires=1771369591&x-signature=B2R7HstumJY18RGHpWQx9a44PTc%3D)

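The Report drop-down menu allows the UBPR to be selected. We can search for individual banks by name or by other options such as the FDIC certificate number. In addition, the details of all available banks can be downloaded at once by going to [`cdr.ffiec.gov/public/PWS/DownloadBulkData.aspx`](https://cdr.ffiec.gov/public/PWS/DownloadBulkData.aspx).

For example, the following screenshot shows how to download bulk data of financial ratios for 2016 in text format: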

![图片](https://p6-xtjj-sign.byteimg.com/tos-cn-i-73owjymdk6/99ee50e3b34e4afba38ea0b63887173e~tplv-73owjymdk6-jj-mark-v1:0:0:0:0:5o6Y6YeR5oqA5pyv56S-5Yy6IEAg5biD5a6i6aOe6b6Z:q75.awebp?rk3s=f64ab15b&x-expires=1771369591&x-signature=ksWnmpe7LCIgAfpFzSSopWGeRRM%3D)

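You should just select the UBPR Ratio - Single Period option, then the desired date (12/31/2016), and finally set the output format, for example, tab delimited.

In this exercise, we need to download many files, one for each year from 2002 to 2016. You do not have to download the data if you do not want to: after the relevant steps in the code are applied, the R workspace is saved, and this backup is available to readers so that they do not need to spend time running the code or downloading the information.

In my experience of learning any programming language, it is very frustrating to spend time hunting for errors in code while other learners move ahead. These workspaces therefore mean that readers never need to get frustrated by a problem with a specific line of code, or even with a specific package that does not work properly on their computer.

In this case, the information is downloaded as delimited text files, which makes it easier to load into R later. The ZIP file for each year contains several text files. These text files contain quarterly information about specific areas of the banks. The total size of all the ZIP files from 2002 to 2016 is about 800 MB.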

# Listing files

We should create a folder on our computer for each year's files, where each ZIP file is to be decompressed.

Once the folders are created, we can write the following code in R to list all the folders that we have created:

```r
myfiles <- list.files(path = "../MachineLearning/Banks_model/Data", pattern = "20", full.names = TRUE)

print(myfiles)
##  [1] "../MachineLearning/Banks_model/Data/2002"
##  [2] "../MachineLearning/Banks_model/Data/2003"
##  [3] "../MachineLearning/Banks_model/Data/2004"
##  [4] "../MachineLearning/Banks_model/Data/2005"
##  [5] "../MachineLearning/Banks_model/Data/2006"
##  [6] "../MachineLearning/Banks_model/Data/2007"
##  [7] "../MachineLearning/Banks_model/Data/2008"
##  [8] "../MachineLearning/Banks_model/Data/2009"
##  [9] "../MachineLearning/Banks_model/Data/2010"
## [10] "../MachineLearning/Banks_model/Data/2011"
## [11] "../MachineLearning/Banks_model/Data/2012"
## [12] "../MachineLearning/Banks_model/Data/2013"
## [13] "../MachineLearning/Banks_model/Data/2014"
## [14] "../MachineLearning/Banks_model/Data/2015"
## [15] "../MachineLearning/Banks_model/Data/2016"
```

The `pattern` option allows us to search for all the folders whose names contain `20`, going through all the folders we created previously.
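Note that `pattern` is interpreted as a regular expression, so `"20"` matches any name that contains those two characters. The following sketch, which uses a throwaway folder layout invented for the example, shows how an anchored pattern restricts the match to four-digit year folders.

```r
# Build a temporary folder layout (folder names are made up for illustration).
base_dir <- file.path(tempdir(), "Data_demo")
dir.create(base_dir, showWarnings = FALSE)
for (d in c("2002", "2003", "IDS", "backup_2020_old")) {
  dir.create(file.path(base_dir, d), showWarnings = FALSE)
}

# "20" matches any name containing "20", including the backup folder.
loose <- list.files(path = base_dir, pattern = "20")

# Anchoring keeps only names that are exactly four digits starting with 20.
strict <- list.files(path = base_dir, pattern = "^20[0-9]{2}$")

print(loose)
print(strict)
```

If the data folder only ever contains the yearly folders, both patterns give the same result; the stricter one simply protects against stray folders.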

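# Finding files

Let's read all of the `.txt` files contained in each folder of the `myfiles` list. Once the `.txt` files have been read for each year, they are merged into one single table. The process takes several minutes to complete (in my case, almost 30 minutes).

```r
library(readr)

t <- proc.time()

for (i in 1:length(myfiles)) {

  tables <- list()
  # Re-list the yearly folders, because rm() at the end of each pass removes myfiles
  myfiles <- list.files(path = "../MachineLearning/Banks_model/Data", pattern = "20", full.names = TRUE)

  # Files of the current year; the last file in the folder is skipped
  filelist <- list.files(path = myfiles[i], pattern = "*", full.names = TRUE)
  filelist <- filelist[1:(length(filelist) - 1)]

  for (h in 1:length(filelist)) {

    # Data rows: the first two lines of each file are skipped
    aux <- as.data.frame(read_delim(filelist[h], "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE, skip = 2))

    # Variable names are taken from the first line
    variables <- colnames(as.data.frame(read_delim(filelist[h], "\t", escape_double = FALSE, col_names = TRUE, trim_ws = TRUE, skip = 0)))

    colnames(aux) <- variables

    dataset_name <- paste("aux", h, sep = '')
    tables[[h]] <- assign(dataset_name, aux)

  }

  # Merge all the tables of the year by bank and reporting period
  final_data_name <- paste("year", i + 2001, sep = '')
  union <- Reduce(function(x, y) merge(x, y, all = TRUE,
      by = c("ID RSSD", "Reporting Period")), tables, accumulate = FALSE)

  assign(final_data_name, union)
  # Keep only the yearly tables, plus tables and t
  rm(list = ls()[!ls() %in% c(ls(pattern = "year*"), "tables", "t")])
}

proc.time() - t
```

So, the code first lists all the folders that we have created. Then, it lists all the `.txt` files in each folder and reads them into R. Each `.txt` file provides a different data frame, and these are then merged into a single table. The result of the code is 15 different tables, one for each year from 2002 to 2016.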
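The merging step relies on `Reduce` applying `merge` pairwise over a list of tables. The following sketch reproduces that idea on three tiny data frames; the column names and values are invented and only stand in for the real text files.

```r
# Toy stand-ins for the text files of one year (values are made up).
ratios_a <- data.frame(id = c(1, 2, 3), period = "Q4", ratio_a = c(0.1, 0.2, 0.3))
ratios_b <- data.frame(id = c(1, 2),    period = "Q4", ratio_b = c(10, 20))
ratios_c <- data.frame(id = c(2, 3),    period = "Q4", ratio_c = c(5, 6))

tables <- list(ratios_a, ratios_b, ratios_c)

# Reduce() applies merge() pairwise: ((a merge b) merge c).
# all = TRUE keeps banks that are missing from some of the files (outer join).
combined <- Reduce(
  function(x, y) merge(x, y, by = c("id", "period"), all = TRUE),
  tables
)

print(combined)
```

Banks that are absent from one of the tables still appear in the result, with `NA` in the columns they did not provide.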

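# Merging results

Let's now merge the yearly tables using the `rbind` function. This is possible because all the tables contain exactly the same columns:

```r
rm(tables)
database<-rbind(year2002,year2003,year2004,year2005,year2006,year2007,year2008,year2009,year2010,year2011,year2012,year2013,year2014,year2015,year2016)
```

# Removing tables

Using the `rm()` command, we can remove all the tables in the workspace except `database`:

```r
rm(list=ls()[! ls() %in% c(ls(pattern="database"))])
```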

# Knowing your observations

The database contains a total of 420,404 observations and 1,571 columns:

```r
print("Number of observations:")
## [1] "Number of observations:"
print(nrow(database))
## [1] 420404
print("Number of columns/variables:")
## [1] "Number of columns/variables:"
ncol(database)
## [1] 1571
```

Let's see what the dataset looks like now, or at least the first few observations and columns:

```r
head(database[,1:5])
##   ID RSSD       Reporting Period UBPR1795 UBPR3123.x UBPR4635
## 1 1000052 12/31/2002 11:59:59 PM      958       1264      996
## 2 1000100 12/31/2002 11:59:59 PM      -26       2250       33
## 3 1000276 12/31/2002 11:59:59 PM       46        719       86
## 4 1000409 12/31/2002 11:59:59 PM    13926      57059    19212
## 5 1000511 12/31/2002 11:59:59 PM       37        514       86
## 6 1000557 12/31/2002 11:59:59 PM        0        120       16
```

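As you can see, the first column is the identifier of each bank. The second column gives the reference date of the financial information. The rest of the columns are coded with the UBPR prefix and a number. This situation is very common in real cases: a large number of variables is available, but their meaning is unknown. It can be very problematic, because we do not know for sure whether some variables already take the target variable into account, or whether a variable will be available when the model is put into production.

In our case, this is not really a problem, because you can find a dictionary with the meaning of the variables at cdr.ffiec.gov/CDRDownload/CDR/UserGuide/v96/FFIEC%20UBPR%20Complete%20User%20Guide_2019-01-11.Zip.

For example, the first variable, `UBPR1795`, is net credit losses, which measures the total amount of bank loans that resulted in losses because they were not repaid.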

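# Dealing with duplicates

When we merged the different text files into one table per year, some columns were duplicated because they are included in more than one text file at the same time. For example, all the ratios included in the text file called Summary ratios are replicated in other text files. In these cases, R assigns an `.x` or `.y` suffix to the variables.

In the following code, we remove the variables with the `.x` suffix because they are duplicated in the database:

```r
database[,grep("\\.x$",colnames(database))]<-NULL
```

The `grep` function searches the column names for the `.x` suffix. The dot is escaped so that it matches a literal dot, and `$` anchors the match to the end of the name. If a column name matches, the column is removed. In addition, the `.y` suffix is stripped from the column names:

```r
var_names<-names(database)

var_names<-gsub("\\.y$","",var_names)

colnames(database)<-var_names

rm(var_names)
```

Finally, the import process also created some erroneous and inaccurate variables. The names of these columns start with X. These variables are removed as well, as follows:

```r
database[,grep("X",colnames(database))]<-NULL
```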
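Because `grep` and `gsub` treat their pattern as a regular expression, an unescaped dot matches any character and an unanchored pattern matches anywhere in the name. The following sketch illustrates the difference on a handful of invented column names; whether the stricter patterns change anything on the real data depends on the actual column names, which are not listed here.

```r
# Invented column names imitating the suffixes created by merge().
cols <- c("UBPR1795", "UBPR3123.x", "UBPR3123.y", "UBPRExtra", "X17", "TAX_RATE")

# An unescaped dot matches any character, so ".x" also hits "UBPRExtra" ("Ex").
loose <- grep(".x", cols, value = TRUE)

# Escaping the dot and anchoring at the end matches only the literal suffix.
strict <- grep("\\.x$", cols, value = TRUE)

# "X" matches anywhere in the name; "^X" matches only at the start.
anywhere <- grep("X", cols, value = TRUE)
at_start <- grep("^X", cols, value = TRUE)

print(loose)
print(strict)
print(anywhere)
print(at_start)
```

An alternative for literal suffixes is `fixed = TRUE`, which switches off regular expression matching altogether.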

Let's save the workspace for the following steps:

```r
save.image("Data1.RData")
```

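# Operating our problem

The database contains a column that indicates the date of each bank's financial statements (the `Reporting Period` field). Each bank can appear in the dataset several times, once per quarter from December 2002 to December 2016.

However, this field is not recognized as a date in R:

```r
class(database$'Reporting Period')
## [1] "character"
```

Let's convert this field into a date:

First, extract the left part of the `Reporting Period` column. The first 10 characters are extracted into a new variable called `Date`:

```r
database$Date<-substr(database$'Reporting Period',1,10)
```

Let's convert this new column into a date using the `as.Date` command:

```r
database$Date<-as.Date(database$Date,"%m/%d/%Y")
```

Finally, remove the `Reporting Period` field, as it is no longer relevant:

```r
database$'Reporting Period'<-NULL
```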
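The conversion can be tried on a single value before applying it to the whole column. This sketch uses one string in the same format as the `Reporting Period` field; the variable names are only illustrative.

```r
# One value in the format used by the Reporting Period column.
reporting_period <- "12/31/2002 11:59:59 PM"

# The first 10 characters hold the date part (mm/dd/yyyy).
date_text <- substr(reporting_period, 1, 10)
date_value <- as.Date(date_text, format = "%m/%d/%Y")

print(class(date_value))
print(format(date_value, "%Y-%m-%d"))

# Once the value is a Date, components such as the month are easy to extract.
print(as.numeric(format(date_value, "%m")))
```

Extracting the month in this way is what makes it possible to keep only the year-end statements later on.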

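We have information for all the quarters from 2002 to 2016, but we are only interested in the financial information provided at the end of each year.

Let's filter the dataset to keep only the information from December of each year:

```r
database<-database[as.numeric(format(database$Date, "%m"))==12,]
```

After the previous line of code, our database contains 110,239 observations:

```r
print("Observations in the filtered dataset:")
## [1] "Observations in the filtered dataset:"
nrow(database)
## [1] 110239
```

In addition, it contains 1,494 variables, as shown in the following code block:

```r
print("Columns in the filtered dataset:")
## [1] "Columns in the filtered dataset:"
ncol(database)
## [1] 1494
```

At this point, let's save a backup of the workspace:

```r
save.image("Data2.RData")
```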

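You can now look at all the variables in the dataset:

```r
database_names<-data.frame(colnames(database))
```

As the number of variables is quite high, it is advisable to save the names of the variables to a CSV file, which can then be opened in Excel:

```r
write.csv(database_names,file="database_names.csv")
rm(database_names)
```

As you can see, the names of some variables in the dataset are a kind of code. We also know that the meaning of each variable can be found on the FDIC website. This situation is really common, especially in credit risk applications, where the information describes account movements or transaction details.

It is important to understand the meaning of the variables in some way, or at least to know how they were generated. If we do not, we may include as predictors some variables that are very close to the target variable, or even variables that will not be available when the model is implemented. However, we know that there is no evident target in the dataset. So, let's collect the target variable for our problem.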

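# Collecting the target variable

We need to determine whether a bank has failed in the past; this will be our target. This information is also available on the FDIC website, at www.fdic.gov/bank/individual/failed/banklist.html.

The website includes the banks that have failed since October 2000, which covers the whole of our dataset.

Let's look at the steps to achieve this:

Download this information into a `.csv` file:

```r
download.file("https://www.fdic.gov/bank/individual/failed/banklist.csv", "failed_banks.csv",method="auto", quiet=FALSE, mode = "wb", cacheOK = TRUE)
```

Even though this list is updated periodically, the results remain replicable because the historical information does not change. In any case, the file used for the development is also included in the data repository of this book.

Now, load the downloaded file into R, as follows:

```r
failed_banks<-read.csv("failed_banks.csv", header=TRUE)
```

Use the following command to see all the variables and some details of the data included in the list of failed banks:

```r
str(failed_banks)
```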

Let's print the first ten rows, as follows:

```r
head(failed_banks,n=10)# First 10 rows of dataset
##                                                Bank.Name
## 1                    Washington Federal Bank for Savings
## 2        The Farmers and Merchants State Bank of Argonia
## 3                                    Fayette County Bank
## 4  Guaranty Bank, (d/b/a BestBank in Georgia & Michigan)
## 5                                         First NBC Bank
## 6                                          Proficio Bank
## 7                          Seaway Bank and Trust Company
## 8                                 Harvest Community Bank
## 9                                            Allied Bank
## 10                          The Woodbury Banking Company
##                  City ST  CERT               Acquiring.Institution
## 1             Chicago IL 30570                  Royal Savings Bank
## 2             Argonia KS 17719                         Conway Bank
## 3          Saint Elmo IL  1802           United Fidelity Bank, fsb
## 4           Milwaukee WI 30003 First-Citizens Bank & Trust Company
## 5         New Orleans LA 58302                        Whitney Bank
## 6  Cottonwood Heights UT 35495                   Cache Valley Bank
## 7             Chicago IL 19328                 State Bank of Texas
## 8          Pennsville NJ 34951 First-Citizens Bank & Trust Company
## 9            Mulberry AR    91                        Today's Bank
## 10           Woodbury GA 11297                         United Bank
##    Closing.Date Updated.Date
## 1     15-Dec-17    21-Feb-18
## 2     13-Oct-17    21-Feb-18
## 3     26-May-17    26-Jul-17
## 4      5-May-17    22-Mar-18
## 5     28-Apr-17     5-Dec-17
## 6      3-Mar-17     7-Mar-18
## 7     27-Jan-17    18-May-17
## 8     13-Jan-17    18-May-17
## 9     23-Sep-16    25-Sep-17
## 10    19-Aug-16    13-Dec-18
```

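The file contains the following relevant information:

+   The number of failed banks
+   The state where these banks were located
+   When they failed
+   The acquiring institution

It would be very interesting to plot the evolution of failures over time. To do this, let's check whether the `Closing.Date` column is recognized as a date:

```r
class(failed_banks$Closing.Date)
## [1] "factor"
```

This column is not a date. In R versions before 4.0.0, `read.csv` converts text columns to factors by default, which is why the class shown here is `factor`; in later versions it is `character`. In both cases, a conversion is needed. Let's convert the column into a date using the `dmy` function from the `lubridate` library, which works in a similar way to `as.Date`:

```r
library(lubridate)
failed_banks$Closing.Date <- dmy(failed_banks$Closing.Date)
class(failed_banks$Closing.Date)
## [1] "Date"
```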

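# Structuring data

Having obtained our target variable and got to know our dataset, we can now move on to the actual data collection according to our target. Here, we will try to obtain the data of the banks for the different years, as described in the Collecting the target variable section.

To do this, we create a new variable that extracts only the year in which the bank failed, and then count the number of banks by year:

```r
failed_banks$year<-as.numeric(format(failed_banks$Closing.Date, "%Y"))

Failed_by_Year<-as.data.frame(table(failed_banks$year))
colnames(Failed_by_Year)<-c("year","Number_of_banks")

print(Failed_by_Year)
##    year Number_of_banks
## 1  2000               2
## 2  2001               4
## 3  2002              11
## 4  2003               3
## 5  2004               4
## 6  2007               3
## 7  2008              25
## 8  2009             140
## 9  2010             157
## 10 2011              92
## 11 2012              51
## 12 2013              24
## 13 2014              18
## 14 2015               8
## 15 2016               5
## 16 2017               8
```

Let's look at our data graphically:

```r
library(ggplot2)

theme_set(theme_classic())

# Plot
g <- ggplot(Failed_by_Year, aes(year, Number_of_banks))
g + geom_bar(stat="identity", width = 0.5, fill="tomato2") +
      labs(title="Number of failed banks over time",
      caption="Source: FDIC list of failed banks")+
      theme(axis.text.x = element_text(angle=65, vjust=0.6))
```

The preceding code produces a bar chart of the number of failed banks per year.

As the chart shows, the number of failed banks increased during the dot-com bubble crisis of 2001 and 2002 and during the financial crisis that started in 2008.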

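We now need to merge the list of failed banks with our database. In the failed banks dataset, there is a column with the ID of each bank, specifically the certificate number column. This is a number assigned by the FDIC to uniquely identify institutions and insurance certificates.

However, in the other database, which contains the financial information, the ID number is called the RSSD ID, which is different. This number is a unique identifier assigned to institutions by the Federal Reserve System.

So, how can we join these two datasets? We need a mapping between the two identifiers. This mapping can also be found on the FDIC website, again in the same section where we previously downloaded the bulk data for all the financial statements. Remember that the website can be accessed at cdr.ffiec.gov/public/pws/downloadbulkdata.aspx.

On this website, we need to download the Call Reports - Single Period files for the relevant period (2002-2016).

In each of the files we have just downloaded, we can find a file named `FFIEC CDR Call Bulk POR mmddyyyy.txt`.

This file contains general information about every bank. First, we use these files to assign an `ID RSSD` number to each bank in the failed banks list. Then, we can join the financial ratios to the failed banks list using the `ID RSSD` field.

After downloading the files, use the `list.files` function to list all the available files on your system.

We need to find all the files whose names contain `FFIEC CDR Call Bulk POR`:

```r
myfiles <- list.files(path = "../MachineLearning/Banks_model/Data/IDS", pattern = "FFIEC CDR Call Bulk POR",  full.names = TRUE)
```

Now, we will read all of the files into R and merge them into a data frame called `IDs`. In addition, a new column holding the year that the information corresponds to is created (it ends up as `id_year` in the final table). We need to store both the IDs and the dates because the identifiers can change over time. For example, when two banks merge, one of them disappears from the dataset, while the other may keep the same number or obtain a new one.

You can create a new, empty data frame called `IDs` as follows:

```r
IDs<-matrix("NA",0,4)
colnames(IDs)<-c("ID RSSD","CERT","Name","id_year")
IDs<-as.data.frame(IDs)
```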

Then, we iteratively read all the text files and append them to this `IDs` data frame:

```r
for (i in 1:length(myfiles)) {
  aux <- read.delim(myfiles[i])
  aux$year <- as.numeric(2000 + i)
  aux <- aux[, c(1, 2, 6, ncol(aux))]
  colnames(aux) <- c("ID RSSD", "CERT", "Name", "id_year")
  IDs <- rbind(IDs, aux)
}
```

Let's print the resulting table as follows:

```r
head(IDs)
##   ID RSSD  CERT                             Name id_year
## 1      37 10057           BANK OF HANCOCK COUNTY    2001
## 2     242  3850 FIRST COMMUNITY BANK XENIA-FLORA    2001
## 3     279 28868      MINEOLA COMMUNITY BANK, SSB    2001
## 4     354 14083                 BISON STATE BANK    2001
## 5     439 16498                     PEOPLES BANK    2001
## 6     457 10202                 LOWRY STATE BANK    2001
```

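A master table is now available containing the `ID RSSD` and the certificate number (`CERT`) of each bank over time.

You can remove irrelevant objects as follows:

```r
rm(list=setdiff(ls(), c("database","failed_banks","IDs")))
```

Next, we will merge the failed banks list and the `IDs` dataset using the certificate number. Before merging, we need to convert the certificate number to a numeric format in both datasets:

```r
failed_banks$CERT<-as.numeric(failed_banks$CERT)

IDs$CERT<-as.numeric(IDs$CERT)
```

If we try to merge the failed banks list with the `IDs` dataset directly, we run into a problem. The failed banks list has a column with the year in which each bank failed, and if we join the two tables using that year, the bank will not be found in the `IDs` table.

As the `IDs` snapshots correspond to December of each year, a bank that failed during a given year cannot still exist at the end of that same year.

To merge the two datasets correctly, create a new variable (`id_year`) in the failed banks dataset by subtracting one year from the `year` column:

```r
failed_banks$id_year<-failed_banks$year-1
```

Now the failed banks are joined to the `IDs` information using the `merge` function. This function is easy to use; you only need to specify the two tables and the names of the columns used for the join:

```r
failed_banks<-merge(failed_banks,IDs,by=c("CERT","id_year"),all.x=TRUE)
failed_banks<-failed_banks[,c("CERT","ID RSSD","Closing.Date")]
head(failed_banks)
##   CERT ID RSSD Closing.Date
## 1   91   28349   2016-09-23
## 2  151  270335   2011-02-18
## 3  182  454434   2010-09-17
## 4  416    3953   2012-06-08
## 5  513  124773   2011-05-20
## 6  916  215130   2014-10-24
```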
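The reason for shifting the year can be seen on a toy example. In the following sketch, all certificate numbers, identifiers, and years are invented: each failed bank is present in the December snapshots up to the year before it failed, but not in the snapshot of the failure year itself.

```r
# Toy list of failed banks (values are made up).
failed <- data.frame(CERT = c(100, 200), year = c(2010, 2012))

# Toy December snapshots: each bank disappears before its failure year ends.
ids <- data.frame(
  CERT    = c(100, 100, 200, 200),
  id_year = c(2008, 2009, 2010, 2011),
  ID_RSSD = c(5001, 5001, 6001, 6001)
)

# Joining on the failure year itself finds no identifier ...
same_year <- merge(failed, ids,
                   by.x = c("CERT", "year"),
                   by.y = c("CERT", "id_year"),
                   all.x = TRUE)

# ... whereas joining on the previous year recovers it.
failed$id_year <- failed$year - 1
prev_year <- merge(failed, ids, by = c("CERT", "id_year"), all.x = TRUE)

print(same_year)
print(prev_year)
```

With `all.x = TRUE`, unmatched banks are kept with `NA` identifiers, which makes a wrong join key easy to spot.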

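A new backup of the workspace is made as follows:

```r
save.image("Data3.RData")
```

Now, the database containing the financial statements can be merged with the list of failed banks, and the target variable can then be created. We will join the two tables using the `ID RSSD` identifier:

```r
database<-merge(database,failed_banks,by=c("ID RSSD"),all.x = TRUE)
## Warning in merge.data.frame(database, failed_banks, by = c("ID RSSD"),
## all.x = TRUE): column name 'UBPR4340' is duplicated in the result
```

Two new columns have been added to the database: `CERT` and `Closing.Date`. The preceding code warns us about a duplicated column that was not detected before, so we should delete one of the duplicates. Using the `grep` function, we obtain the positions of the columns that contain the `UBPR4340` variable:

```r
grep("UBPR4340",colnames(database))
## [1]  852 1454
```

Delete the second column in which the duplicated variable appears:

```r
database[,1454]<-NULL
```

When a missing value is found in either of these two new variables (`CERT` and `Closing.Date`), it indicates that the bank is still operating in the US financial system. On the other hand, if a bank has information in these variables, it indicates that the bank has failed. We can see how many failed observations there are in the database:

```r
nrow(database[!is.na(database$Closing.Date),c('ID RSSD','Date','Closing.Date')])
## [1] 3705
```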

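There are 3,705 observations in the dataset that correspond to failed banks. As you can see, the failed observations are a small fraction of the total.

The failed observations do not represent unique failed banks: a failed bank can have several financial statements from the years before it finally failed. For example, for the bank shown in the following code block, financial information is available for different years. According to our database, this bank failed in 2010:

```r
failed_data<-database[!is.na(database$Closing.Date),c('ID RSSD','Date','Closing.Date')]
head(failed_data)
##     ID RSSD       Date Closing.Date
## 259    2451 2003-12-31   2010-07-23
## 260    2451 2007-12-31   2010-07-23
## 261    2451 2008-12-31   2010-07-23
## 262    2451 2005-12-31   2010-07-23
## 263    2451 2004-12-31   2010-07-23
## 264    2451 2009-12-31   2010-07-23
```

We should assess the time horizon of our predictive model. The greater the gap between the date of the information and the closing date, the lower the expected predictive power of our model. The explanation is quite simple: predicting the failure of a bank from information that is five years old is more difficult than predicting it from information that is one or two years old.

Let's calculate the difference, in years, between the closing date and the date of the balance sheet:

```r
database$Diff<-as.numeric((database$Closing.Date-database$Date)/365)
```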
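To see what this calculation produces, the following sketch applies it to a single pair of dates, taken from the last row of the example bank listed above. The variable names are only illustrative.

```r
# Dates of the last statement and the closing of the example bank.
statement_date <- as.Date("2009-12-31")
closing_date   <- as.Date("2010-07-23")

# Subtracting two Date objects yields a time difference in days.
days_to_failure <- as.numeric(closing_date - statement_date)

# Dividing by 365 approximates years (leap years are ignored, as in the text).
years_to_failure <- days_to_failure / 365

print(days_to_failure)
print(round(years_to_failure, 7))
```

For banks that never failed, the closing date is missing, so the difference is `NA`; this is what later separates failed from non-failed observations.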

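What will our target variable be? What do we want to predict? Well, we could develop a model that predicts bankruptcy within six months, one year, or even five years after the current financial information.

The definition of the target variable should be made according to the purpose of the model, also taking into account the number of failed or bad banks in the sample.

The standard horizon varies depending on the portfolio, the purpose of the model, and the sample of bad banks (the minority class), which should be large enough to develop a robust model.

The definition of the time horizon is very important: it determines the goal of our model and its future use.

For example, we could classify as bad those banks in the dataset that failed less than one year after the financial statements:

```r
database$Default0<-ifelse(database$Diff>=1 | is.na(database$Diff),0,1)
```

According to this definition, the number of bad banks would be as follows:

```r
table(database$Default0)
##
##      0      1
## 109763    476
```

Only 476 banks in the dataset failed less than one year after their financial information was observed.

For example, the following bank failed only about half a year after its financial information was observed:

```r
head(database[database$Default0==1,c('ID RSSD','Date','Closing.Date','Diff')],1)
##     ID RSSD       Date Closing.Date      Diff
## 264    2451 2009-12-31   2010-07-23 0.5589041
database$Default0<-NULL
```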
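Since the horizon is a modeling choice, it can help to wrap the definition in a small function and compare several horizons. This is only a sketch: `make_target` is a hypothetical helper, not part of the book's code, and the input values are invented.

```r
# Hypothetical helper: flag observations whose failure occurs within
# `horizon` years. `diff_years` is NA for banks that never failed.
make_target <- function(diff_years, horizon = 1) {
  ifelse(is.na(diff_years) | diff_years >= horizon, 0, 1)
}

# Invented times to failure, in years (NA = bank still operating).
diff_years <- c(0.56, 1.56, 4.56, NA)

print(make_target(diff_years, horizon = 1))
print(make_target(diff_years, horizon = 2))
```

A longer horizon labels more observations as bad, which enlarges the minority class but also asks the model to predict further into the future.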

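At this point, a new backup of the workspace is made:

```r
save.image("Data4.RData")
```

In this problem, we have seen that most of the banks are solvent and that these banks are repeated several times in the sample, although with different financial statements.

However, it is not necessary to keep every good bank in the sample; it is more relevant to increase the importance of the bad banks. There are some techniques to deal with this problem.

One of them is to assign different weights to the good and bad observations so that the two classes become more balanced. This approach, although useful, makes machine learning algorithms much slower to run, because we would use the whole dataset, which in our case is more than 100,000 observations.

Highly unbalanced classes, such as the ones we have found in this problem, can negatively affect model fitting. Rather than keeping all the observations, it is very common to subsample the data. Three main techniques are usually applied:

+   Undersampling: This is probably the simplest strategy. It consists of randomly reducing the majority class to the same size as the minority class. Undersampling solves the imbalance problem, but it generally shrinks the dataset, especially when the minority class is very scarce. If that happens, the model results are likely to be poor.
+   Oversampling: The minority class is randomly sampled several times until it reaches the size of the majority class. The most common approach is to replicate the minority observations several times. At this stage of the solution, we have not yet selected the data used to train or test our future algorithms, so oversampling can be problematic: copies of the same minority examples could end up in both the training and validation sets, leading to overfitting and misleading results.
+   Other techniques: Techniques such as the **Synthetic Minority Over-sampling Technique** (**SMOTE**) and **Random Over-Sampling Examples** (**ROSE**) reduce the majority class and create artificial new observations in the minority class.

In this case, we will take a hybrid approach.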
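The first two strategies can be sketched in a few lines of base R. The toy data frame below is invented and much smaller than the real dataset; it only illustrates the mechanics and is not the hybrid approach used in the rest of the chapter.

```r
set.seed(1)

# Toy imbalanced sample: 95 good banks (0) and 5 bad banks (1).
toy <- data.frame(id = 1:100, Default = c(rep(0, 95), rep(1, 5)))

good <- toy[toy$Default == 0, ]
bad  <- toy[toy$Default == 1, ]

# Undersampling: keep a random subset of the majority class
# with the same size as the minority class.
under <- rbind(good[sample(nrow(good), nrow(bad)), ], bad)

# Oversampling: resample the minority class with replacement
# until it reaches the size of the majority class.
over <- rbind(good, bad[sample(nrow(bad), nrow(good), replace = TRUE), ])

print(table(under$Default))
print(table(over$Default))
```

Note how undersampling leaves only 10 rows, while oversampling produces many exact copies of the same five bad banks, which is the source of the overfitting risk described above.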

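To make the following steps easier, we rename the first column, which contains the identifier of each bank:

```r
colnames(database)[1]<-"ID_RSSD"
```

Now we will treat failed and non-failed banks differently. Let's start with the part of the database that contains only failed banks:

```r
database_Failed<-database[!is.na(database$Diff),]
```

There are 3,705 observations containing information on failed banks:

```r
nrow(database_Failed)
## [1] 3705
```

This sample looks like this:

```r
head(database_Failed[,c("ID_RSSD","Date","Diff")])

##     ID_RSSD       Date      Diff
## 259    2451 2003-12-31 6.5643836
## 260    2451 2007-12-31 2.5616438
## 261    2451 2008-12-31 1.5589041
## 262    2451 2005-12-31 4.5616438
## 263    2451 2004-12-31 5.5616438
## 264    2451 2009-12-31 0.5589041
```

As shown, the failed banks sample contains financial information for the same bank for several years. For each bank, the financial information closest to the date of failure will finally be selected.

To do this, we create an auxiliary table that contains, for each bank, the minimum distance between an observation and the failure date. We will use a handy package, `sqldf`, which allows us to write queries as if we were using SQL:

```r
aux<-database_Failed[,c('ID_RSSD','Diff')]

library(sqldf)
aux<-sqldf("SELECT ID_RSSD,
       min(Diff) as min_diff,
       max(Diff) as max_diff
       from aux group by ID_RSSD")

head(aux)

##   ID_RSSD  min_diff max_diff
## 1    2451 0.5589041 7.564384
## 2    3953 0.4383562 9.443836
## 3   15536 0.8301370 6.835616
## 4   16337 0.7506849 7.756164
## 5   20370 0.4027397 8.408219
## 6   20866 0.5589041 7.564384
```

Now, our sample of failed banks is merged with this auxiliary table:

```r
database_Failed<-merge(database_Failed,aux,by=c("ID_RSSD"))
```

Then, we select only the observations in which the difference between the financial statement date and the closing date equals the `min_diff` column:

```r
database_Failed<-database_Failed[database_Failed$Diff==database_Failed$min_diff,]
```

Remove the recently created columns as follows:

```r
database_Failed$min_diff<-NULL
database_Failed$max_diff<-NULL
```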
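The same selection can be expressed without SQL. The following sketch is an alternative to the `sqldf` query rather than the method used above; it relies on base R's `ave` function and runs on a tiny invented table.

```r
# Toy data: two failed banks with several year-end statements each
# (values are made up).
failed <- data.frame(
  ID_RSSD = c(1, 1, 1, 2, 2),
  Diff    = c(2.56, 1.56, 0.56, 1.44, 0.44)
)

# ave() returns, for every row, the minimum Diff within that row's bank.
failed$min_diff <- ave(failed$Diff, failed$ID_RSSD, FUN = min)

# Keep only the statement closest to the failure date for each bank.
closest <- failed[failed$Diff == failed$min_diff, c("ID_RSSD", "Diff")]

print(closest)
```

Because `ave` returns a vector aligned with the original rows, no separate merge with an auxiliary table is needed.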

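Now, we want to reduce the number of non-failed observations. To do this, we randomly select one year of financial statements for each bank.

Extract all the observations of non-failed banks using the following code:

```r
database_NonFailed<-database[is.na(database$Diff),]
```

To select the financial statements at random, we follow these steps:

First, set a seed. When generating random numbers, a seed is needed to obtain reproducible results. Using the same seed will allow you to obtain the same results as those described in this book:

```r
set.seed(10)
```

Generate random numbers; we generate as many random numbers as there are rows in the non-failed banks dataset:

```r
Random<-runif(nrow(database_NonFailed))
```

Add the random numbers to the database as a new column:

```r
database_NonFailed<-cbind(database_NonFailed,Random)
```

Calculate the maximum random number for each bank and create a new data frame called `max`:

```r
max<-aggregate(database_NonFailed$Random, by = list(database_NonFailed$ID_RSSD), max)

colnames(max)<-c("ID_RSSD","max")
```

Join the non-failed banks data frame with the `max` data frame. Then, select only the observations whose random number equals the maximum for each bank:

```r
database_NonFailed<-merge(database_NonFailed,max,all.x=TRUE)
database_NonFailed<-database_NonFailed[database_NonFailed$max==database_NonFailed$Random,]
```

Remove the irrelevant columns as follows:

```r
database_NonFailed$max<-NULL
database_NonFailed$Random<-NULL
```

Using the `dim` function, we can obtain the number of observations of non-failed banks. You can see that the number of good banks has been reduced significantly:

```r
dim(database_NonFailed)

## [1] 9654 1496
```

There are only 9,654 observations and 1,496 variables.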

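So, we can finally build the dataset for developing our model by combining the previous data frames:

```r
Model_database<-rbind(database_NonFailed,database_Failed)
```

The target variable can now be defined as well:

```r
Model_database$Default<-ifelse(is.na(Model_database$Diff),0,1)
```

The rest of the objects loaded in the workspace can be removed as follows:

```r
rm(list=setdiff(ls(), c("Model_database")))
```

It is common to use the current variables to define new features that are then included in the development. These new variables are usually called derived variables. Thus, we can define a derived variable as a new variable calculated from one or more base variables.

A very intuitive example is calculating a variable called `age` from a database containing information on different customers. This variable can be calculated as the difference between the date on which the customer's record is stored in the system and their date of birth.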
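A minimal sketch of such a derived variable follows. The customer table, its column names, and the dates are all invented for the example, and dividing by 365.25 is only an approximation of completed years.

```r
# Toy customer table (values are made up).
customers <- data.frame(
  customer_id = c(1, 2),
  birth_date  = as.Date(c("1980-06-15", "1995-01-31")),
  record_date = as.Date(c("2016-12-31", "2016-12-31"))
)

# Derived variable: approximate age in completed years at the record date.
days_alive <- as.numeric(customers$record_date - customers$birth_date)
customers$age <- floor(days_alive / 365.25)

print(customers)
```

The new column adds nothing that was not already in the two dates, but it presents the information in a form that a model can use directly.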

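New variables should add useful, non-redundant information that facilitates subsequent learning and helps the generalization step.

Feature generation should not be confused with feature extraction. Feature extraction is related to dimensionality reduction, as it transforms the original features and selects a subset from the pool of potential original and derived features to be used in our model.

Nevertheless, in the problem we are dealing with, building additional variables is not very relevant. We have a very large dataset that measures all the relevant aspects in the analysis of a financial institution.

In addition, at this point of the development, variables that were included during data extraction or processing but play no role in the model development must be removed.

Thus, the following variables are removed:

```r
Model_database$CERT<-NULL

Model_database$Closing.Date<-NULL

Model_database$Diff<-NULL
```

All of these steps were needed to build our database. You can see how much time we have spent in this chapter collecting the data and the target variable and trying to organize it all. In the next chapter, we will start to analyze the data that we have obtained. Before continuing, you can make a final backup, as follows:

```r
save.image("Data5.RData")
```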

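# Summary

In this chapter, we started collecting the data needed to develop a model that predicts bank failures. In this case, we downloaded a large amount of data and structured it. We also created our target variable. By the end of this chapter, you should have learned that data collection is the first, and most important, step in model development. When you work on your own problems, take time to understand the problem, and then think about what kind of data you need and how to get it. In the next chapter, we will carry out a descriptive analysis of the data we have obtained.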
