Introduction
Neural networks are a computational model inspired by the way biological neural networks in the human brain function. They consist of interconnected nodes, or artificial neurons, organized in layers. Information is processed through these layers, with each connection carrying a weight that is adjusted during the learning process. Neural networks are commonly used in machine learning to recognize patterns, make predictions, and perform various tasks based on data inputs.
We will delve into the details of this definition in subsequent posts, demonstrating how neural networks can outperform other methods. Specifically, we will begin with logistic regression and illustrate, through a straightforward example, how it may fall short and how deep learning, by contrast, can be advantageous. As usual, we adopt a step-by-step approach to progressively address the challenges raised by this kind of data.
We consulted the following books as references to compose this series.
- Deep Learning (Goodfellow, Bengio, Courville)
- Deep Learning: Foundations and Concepts (Bishop, Bishop)
What is logistic regression?
We will not delve into the details of logistic regression in this post, as we have already dedicated an entire series to this topic earlier. We refer the reader to this link. Here, we will simply recap the final results.
Logistic regression is a statistical method used for binary classification problems, where the outcome variable has two possible classes. It models the probability that an instance belongs to a particular class, and its predictions are in the form of probabilities between 0 and 1. The logistic function σ (sigmoid) is employed to transform a linear combination of input features into a probability score.
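This definition can be sketched in a few lines. The snippet below is an illustrative pure-Python sketch (the article's own implementation uses C# with ML.NET); the weights and features are arbitrary placeholder values.

```python
import math

def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Logistic regression: a sigmoid applied to a linear combination of the inputs."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# With all-zero weights, the model is maximally uncertain: probability 0.5
print(predict_proba([0.0, 0.0], 0.0, [0.12, 0.154]))
```

Training consists of adjusting `weights` and `bias` so that these probabilities match the observed classes.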
Important
The sigmoid function that appears in logistic regression is not arbitrary and can be derived through a simple mathematical study. Once again, refer to our previous post for detailed explanations.
Observing logistic regression in action
To witness logistic regression in action, we will swiftly implement it with ML.NET in the straightforward example outlined below, where we have two possible outcomes for 2-dimensional inputs.
The corresponding dataset is illustrated below:
X;Y;Category
0.12;0.154;1
0.135;0.26;1
0.125;0.142;1
0.122;0.245;1
0.115;0.142;1
0.132;0.2;1
0.84;0.76;0
0.834;0.8;0
0.855;0.78;0
0.865;0.84;0
0.835;0.82;0
0.82;0.745;0
...
Implementing logistic regression with ML.NET
ML.NET is an open-source machine learning framework developed by Microsoft that allows developers to integrate machine learning models into their .NET applications. ML.NET supports a variety of machine learning tasks, including classification, regression, clustering, and recommendation.
The C# code for employing logistic regression is presented below:
```csharp
static void Main(string[] args)
{
    var ctx = new MLContext();

    // Load data from file
    var path = AppContext.BaseDirectory + "/dataset.csv";
    var data = ctx.Data.LoadFromTextFile<DataRecord>(path, hasHeader: true, separatorChar: ';');

    // Concatenate X and Y into a single Features vector
    // and train a logistic regression binary classifier
    var dataPipe = ctx.Transforms.Concatenate("Features", new[]
    {
        "X", "Y"
    }).Append(ctx.BinaryClassification.Trainers.LbfgsLogisticRegression(featureColumnName: "Features"));
    var model = dataPipe.Fit(data);

    // Predict an unforeseen input
    var record = new DataRecord()
    {
        X = 0.25f,
        Y = 0.24f
    };
    var pe = ctx.Model.CreatePredictionEngine<DataRecord, DataRecordPrediction>(model);
    var category = pe.Predict(record);
}
```
Here are the predicted values for some generated inputs.
| X | Y | Prediction |
|---|---|---|
| 0.25 | 0.24 | 1 |
| 0.05 | 0.02 | 1 |
| 0.92 | 0.86 | 0 |
| 0.5 | 0.55 | 0 |
This example is a simplified one, as the data is well-separated, making it relatively easy for a method like logistic regression to predict the correct values. But how will this algorithm perform with a significantly more complex dataset?
What happens if the data is not linearly separable?
We will now apply logistic regression to the dataset depicted below. It is notably more intricate than the previous one (despite being in a two-dimensional space, facilitating data representation), and, most importantly, the data is not linearly separable. The objective is to observe how logistic regression behaves under such circumstances.
We will utilize the identical C# code as above (with different data), and observe some predicted values.
| X | Y | Prediction |
|---|---|---|
| -0.25 | 0.24 | 0 |
| 0.45 | -0.72 | 0 |
| 0.92 | 0.86 | 0 |
| -0.5 | -0.55 | 0 |
Here, it is evident that many predicted values are inaccurate. The algorithm consistently returns the same predicted probability (0.5), indicating that it cannot adapt to the specific problem at hand.
This is to be expected: logistic regression applies a sigmoid to a linear combination of the inputs, so it can only be effective at separating linearly separable data. As that is clearly not the case in the given sample, this method proves inappropriate.
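This failure can be reproduced without ML.NET. The sketch below is a hypothetical pure-Python stand-in for the trainer: plain logistic regression, fitted by gradient descent on data labeled by the sign of X·Y (a quadrant pattern no straight line can separate). The symmetric gradients roughly cancel, the weights stay near zero, and every prediction hovers around 0.5.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR-like dataset: the category depends on the sign of X * Y,
# so no straight line can separate the two classes.
random.seed(0)
data = []
for _ in range(400):
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    data.append((x, y, 1 if x * y < 0 else 0))

# Plain logistic regression trained by batch gradient descent
w1 = w2 = b = 0.0
lr = 0.1
for _ in range(1000):
    g1 = g2 = gb = 0.0
    for x, y, t in data:
        p = sigmoid(w1 * x + w2 * y + b)
        g1 += (p - t) * x
        g2 += (p - t) * y
        gb += (p - t)
    w1 -= lr * g1 / len(data)
    w2 -= lr * g2 / len(data)
    b -= lr * gb / len(data)

# Every predicted probability remains close to 0.5: the model cannot adapt
for x, y in [(-0.25, 0.24), (0.45, -0.72), (0.92, 0.86), (-0.5, -0.55)]:
    print(round(sigmoid(w1 * x + w2 * y + b), 2))
```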
Is it possible to mitigate this phenomenon?
The short answer is yes: theoretically, there are mathematical methods that can address this issue, and we will observe them in action in the next post. The extended answer is that, in practice, it can be very challenging.
So, how can this problem be overcome?
The idea is not revolutionary: it simply involves first applying a fixed nonlinear transformation f to the inputs instead of working directly with the original inputs.
The logistic regression algorithm is then applied to the new dataset.
Information 1
The approach of mapping the input space into another space to make the data linearly separable or more tractable in this new space is the foundation for kernel methods (from which we can derive support vector machines and similar techniques). This process can be highly fruitful if correctly applied and has robust advantages.
Information 2
In the figure above, the two spaces have the same dimensions for representation convenience, but in practice, there is no reason for them to be the same.
How should we proceed in our case?
This process might sound a bit theoretical and abstract at first, but we will see how to put it into practice on the example from the previous section. For that, we will apply the following function to the inputs.
f(X,Y)=XY
Here are the results of this mapping for some values:
| X | Y | f(X, Y) |
|---|---|---|
| -0.9 | 0.5664324069 | -0.509789165 |
| 0.89 | -0.4208036774 | -0.374515273 |
| 0.18 | 0.9626599405 | 0.173267893 |
| ... | ... | ... |
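The mapping itself is a one-liner. The sketch below (plain Python, for illustration) applies it to a few inputs: points from the two classes end up on opposite sides of 0 in the new one-dimensional space, where a single threshold separates them.

```python
def f(x, y):
    """Fixed nonlinear transformation: map the 2-D input to the 1-D product."""
    return x * y

# A few illustrative inputs and their images in the new space
samples = [(-0.25, 0.24), (0.05, -0.02), (0.92, 0.86), (-0.5, -0.55)]
for x, y in samples:
    print(x, y, f(x, y))
```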
Information
In our example, the final space is a 1-dimensional space, whereas the input space was 2-dimensional. This is by no means a general rule of thumb: the final space can be much larger than the input space.
Data is linearly separable in the new space.
Applying the logistic regression algorithm to the new dataset
Once the mapping has been performed on the input space, it suffices to apply logistic regression to the new dataset, and we can hope to make accurate predictions. Obviously, inputs must first be mapped to the new space before being evaluated by the trained model.
| X | Y | f(X, Y) | Prediction |
|---|---|---|---|
| -0.25 | 0.24 | -0.06 | 1 |
| 0.05 | -0.02 | -0.001 | 1 |
| 0.92 | 0.86 | 0.7912 | 0 |
| -0.5 | -0.55 | 0.275 | 0 |
And now the predictions are perfectly accurate.
What are the drawbacks of this method?
The mathematical method we developed above appears to be well-suited for data with irregular shapes: it essentially involves magically performing a mapping between spaces to make the data linearly separable in the final space.
However, the challenge lies in how to identify this mapping. In our toy example, it was relatively easy to deduce the transformation because we could visualize the data graphically. But real-world scenarios involve many more dimensions and much more complex data that may be challenging or impossible to visualize. This is indeed the major issue with this process: we can never be certain that in the final space, our data is distinctly linearly separable.
Does this mean that the current approach is not productive? Absolutely not! While we can uphold the conceptual idea of mapping data to another space, the key lies in refining how this mapping is carried out. Imagine automatically discovering this mapping instead of relying on guesswork: wouldn't that be a significant advancement? This automation exists: enter neural networks and deep learning.
What are neural networks?
We will discuss the rationale behind neural networks for multiclass classification (with K classes), but the following can be readily adapted to regression.
The goal is to extend the linear model y(x) = σ(w1ϕ1(x) + ... + wDϕD(x)) by making the basis functions ϕ1, ..., ϕD depend on parameters, and then to allow these parameters to be adjusted along with the coefficients w1, ..., wD during training.
The idea of neural networks is to use basis functions that are themselves nonlinear functions of a linear combination of the inputs. That leads to the basic neural network model which can be described as a series of functional transformations.
How does this work concretely?
We continue to assume that we have a dataset composed of N records, each of which possesses D features. To illustrate, our previous toy example had 2 features (X and Y).
- Construct M linear combinations a1, ..., aM of the input variables x1, ..., xD, namely aj = wj1x1 + ... + wjDxD + wj0 for j = 1, ..., M.
Information
What is M, and what does it stand for? That is not very important for the moment, but the intuitive idea behind this formula is to construct a mapping from our original D-dimensional input space to another M-dimensional feature space. We will revisit this point later in this post.
σ can be the sigmoid (or other activation function) in the case of binary classification or the identity in the case of regression.
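Putting the two steps together, a single hidden layer can be sketched as follows. This is an illustrative Python sketch of the transformations just described, with arbitrary example weights (not a trained network); the article's own training code uses C#/ML.NET.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a network with one hidden layer.

    1. Hidden layer: M linear combinations a_j of the D inputs,
       each passed through a nonlinear activation (here tanh).
    2. Output layer: a linear combination of the hidden units,
       passed through sigma (the sigmoid, for binary classification).
    """
    a = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W1, b1)]
    z = [math.tanh(aj) for aj in a]  # hidden unit activations
    out = sum(w * zj for w, zj in zip(W2, z)) + b2
    return sigmoid(out)

# D = 2 inputs, M = 3 hidden units, 1 output (illustrative weights)
W1 = [[0.5, -0.3], [0.8, 0.8], [-0.6, 0.1]]
b1 = [0.0, 0.1, -0.2]
W2 = [1.2, -0.7, 0.4]
b2 = 0.05
print(forward([0.25, 0.24], W1, b1, W2, b2))
```

Because the hidden-layer weights W1 shape the basis functions themselves, training adjusts the mapping between spaces at the same time as the final linear separator.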
Information
We can observe that the final expression of the output is much more complex than that of simple logistic regression. This complexity provides flexibility, as now we will be able to model complex configurations (such as non-linearly separable data), but it comes at the expense of learning complexity. We now have many more weights to adjust, and the algorithms dedicated to this task are not trivial. It is this complexity that hindered neural network development in the late 80s.
Information
The neural network can be further generalized by customizing the activation functions; for example, we can use tanh instead of the sigmoid.
It is quite customary to represent a neural network with a graphical form, as shown below.
This graphical notation has several advantages over the mathematical formalism, as it can emphasize two key points.
- First, we can consider additional layers of processing.
Information
The more layers, the more flexible the neural network becomes. In this context, one might think that increasing the number of layers will eventually enable the model to handle all real-world situations (which is the premise of deep learning). However, this approach significantly increases the complexity of learning parameters and, consequently, demands substantial computing resources.
- Second, we can see that M represents the number of "hidden" units. The term "hidden" stems from the fact that these units are not part of the input or output layers but exist in intermediate layers, where they contribute to the network's ability to learn complex patterns and representations from the data.
Information
M must be chosen judiciously: if it is too small, the neural network may struggle to generalize accurately. On the other hand, if it is too large, there is a risk of overfitting, along with an increase in the number of parameters to learn.
The number of input and output units in a neural network is generally determined by the dimensionality of the dataset, whereas the number M of hidden units is a free parameter that can be adjusted to give the best predictive performance.
Bishop (Pattern Recognition and Machine Learning)
Now that we have introduced the formalism and demonstrated how complex configurations can be represented by neural networks, it's time to explore how the parameters and weights involved in the process can be learned. We will delve into the dedicated procedure developed for this purpose.
To avoid overloading this article, readers interested in this implementation and in the backpropagation algorithm can find the continuation here.