Reading context: leisure reading in spare time
Rating: ★★★★☆
Half of the book is traditional machine learning with sklearn, the other half is deep learning with keras-tf2
- The X matrix: each row is the transpose of one instance
x^(i) is the vector of all the feature values (excluding the label) of the i-th instance in the dataset, and y^(i) is the label (the desired output value for that instance).
X is a matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the i-th row is equal to the transpose of x^(i), noted (x^(i))^T.
- confusion matrix
You can clearly see the kinds of errors the classifier makes. Remember that rows represent actual classes, while columns represent predicted classes. The column for class 8 is quite bright, which tells you that many images get misclassified as 8s. However, the row for class 8 is not that bad, telling you that actual 8s in general get properly classified as 8s. As you can see, the confusion matrix is not necessarily symmetrical. You can also see that 3s and 5s often get confused (in both directions). Analyzing the confusion matrix often gives you insights into ways to improve your classifier. Looking at this plot, it seems that your efforts should be spent on reducing the false 8s. For example, you could try to gather more training data for digits that look like 8s (but are not) so that the classifier can learn to distinguish them from real 8s. Or you could engineer new features that would help the classifier—for example, writing an algorithm to count the number of closed loops (e.g., 8 has two, 6 has one, 5 has none). Or you could preprocess the images (e.g., using Scikit-Image, Pillow, or OpenCV) to make some patterns, such as closed loops, stand out more. Analyzing individual errors can also be a good way to gain insights on what your classifier is doing and why it is failing, but it is more difficult and time-consuming
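A minimal sketch of the rows-are-actual / columns-are-predicted convention, using Scikit-Learn's confusion_matrix on a few made-up labels (the toy arrays below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (made up for illustration): classes 3, 5, and 8
y_true = np.array([3, 5, 3, 8, 8, 5])
y_pred = np.array([3, 3, 5, 8, 3, 5])

# Rows = actual classes, columns = predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[3, 5, 8])
# cm[0, 1] counts actual 3s predicted as 5s; cm[1, 0] counts actual 5s predicted as 3s
```

Note that the matrix need not be symmetrical: here one 8 was predicted as a 3, but no 3 was predicted as an 8.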
Precision (TP / (TP + FP))
Recall (TP / (TP + FN))
The F1 score is the harmonic mean of precision and recall (Equation 3-3). Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.
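A tiny sketch of why the harmonic mean punishes imbalance, with hypothetical precision/recall values:

```python
# Harmonic mean of precision and recall; dominated by the lower of the two values
def f1_score_from(precision, recall):
    return 2 * precision * recall / (precision + recall)

regular_mean = (0.9 + 0.1) / 2    # 0.5 -- looks acceptable
f1 = f1_score_from(0.9, 0.1)      # 0.18 -- low recall drags F1 down
```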
Note that when there are only two classes (K = 2), this cost function is equivalent to Logistic Regression's cost function (log loss; see Equation 4-17).
- Cross entropy
Cross entropy originated from information theory. Suppose you want to efficiently transmit information about the weather every day. If there are 8 options (sunny, rainy, etc.), you could encode each option using 3 bits, since 2^3 = 8. However, if you think it will be sunny almost every day, it would be much more efficient to code "sunny" on just 1 bit (0) and the other 7 options on 4 bits (starting with a 1). Cross entropy measures the average number of bits you actually send per weather option. If your assumption about the weather is perfect, cross entropy will be equal to the entropy of the weather itself (i.e., its intrinsic unpredictability). But if your assumption is wrong (e.g., if it rains often), cross entropy will be larger by an amount called the Kullback-Leibler (KL) divergence, also
called relative entropy. The cross entropy between two probability distributions p and q is defined as H(p, q) = -Σx p(x) log q(x) (at least when the distributions are discrete).
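The weather example can be checked numerically; a minimal numpy sketch (distributions invented for illustration):

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) * log2(q(x)), measured in bits
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

# 8 equally likely weather options: the entropy is exactly 3 bits
p = np.full(8, 1 / 8)
entropy = cross_entropy(p, p)                    # 3.0 bits

# A wrong assumption about the weather: cross entropy exceeds the true
# entropy, and the gap is the KL divergence
q = np.array([0.5] + [0.5 / 7] * 7)
kl_divergence = cross_entropy(p, q) - entropy    # > 0
```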
- Ensemble Learning
- Voting Classifiers
One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed.
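A hedged Scikit-Learn sketch of this idea (the moons dataset and default hyperparameters are stand-ins, not the book's exact setup):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three very different training algorithms, combined by hard (majority) voting
voting_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression(random_state=42)),
                ("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC(random_state=42))],
    voting="hard")
voting_clf.fit(X_train, y_train)
accuracy = voting_clf.score(X_test, y_test)
```

The ensemble often edges out each of its individual members on the test set.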
- Bagging and Pasting
Another approach is to use the same training algorithm for every predictor and train them on different random subsets of the training set. When sampling is performed
with replacement, this method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting. In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
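A minimal Scikit-Learn sketch (moons dataset and hyperparameters are illustrative assumptions): the bootstrap flag is what separates bagging from pasting.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bootstrap=True -> bagging (sampling with replacement);
# bootstrap=False would give pasting (sampling without replacement)
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
accuracy = bag_clf.score(X_test, y_test)
```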
- The Perceptron
The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron called a threshold logic unit (TLU), sometimes also called a linear threshold unit (LTU). The inputs and outputs are numbers (instead of binary on/off values), and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs (z = w1 x1 + w2 x2 + ... + wn xn = x^T w), then applies a step function to that sum and outputs the result: h_w(x) = step(z).
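A minimal sketch of a single TLU (weights and inputs below are made-up numbers, and a Heaviside step stands in for the step function):

```python
import numpy as np

def tlu(x, w):
    # Weighted sum of the inputs: z = w1*x1 + w2*x2 + ... + wn*xn
    z = np.dot(w, x)
    # Step function (Heaviside): output 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

output = tlu(x=[2.0, 3.0], w=[0.5, -0.2])   # z = 1.0 - 0.6 = 0.4 >= 0 -> 1
```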
- Vanishing/Exploding Gradients
# Gradient Clipping
optimizer = keras.optimizers.SGD(clipvalue=1.0)
[0.9, 100.0] -> [0.9, 1.0]
optimizer = keras.optimizers.SGD(clipnorm=1.0)
[0.9, 100.0] ->[0.00899964, 0.9999595]
This optimizer will clip every component of the gradient vector to a value between –1.0 and 1.0. This means that all the partial derivatives of the loss (with regard to each and every trainable parameter) will be clipped between –1.0 and 1.0. The threshold is a hyperparameter you can tune. Note that it may change the orientation of the gradient vector. For instance, if the original gradient vector is [0.9, 100.0],
it points mostly in the direction of the second axis; but once you clip it by value, you get [0.9, 1.0], which points roughly in the diagonal between the two axes. In practice, this approach works well. If you want to ensure that Gradient Clipping does not change the direction of the gradient vector, you should clip by norm by setting clipnorm instead of clipvalue. This will clip the whole gradient if its ℓ2 norm is greater than the threshold you picked. For example, if you set clipnorm=1.0, then the vector [0.9, 100.0] will be clipped to [0.00899964, 0.9999595], preserving its orientation but almost eliminating the first component. If you observe that the gradients explode during training (you can track the size of the gradients using TensorBoard), you may want to try both clipping by value and clipping by norm, with different thresholds, and see which option performs best on the validation set.
- Monte Carlo (MC) Dropout
one weird trick
First, the paper established a profound connection between dropout networks (i.e., neural networks containing a Dropout layer before every weight layer) and approximate Bayesian inference, giving dropout a solid mathematical justification. Second, the authors introduced a powerful technique called MC Dropout, which can boost the performance of any trained dropout model without having to retrain it or even modify it at all, provides a much better measure of the model’s uncertainty, and is also amazingly simple to implement
We just make 100 predictions over the test set, setting training=True to ensure
that the Dropout layer is active, and stack the predictions. Since dropout is
active, all the predictions will be different. Recall that predict() returns a
matrix with one row per instance and one column per class. Because there are
10,000 instances in the test set and 10 classes, this is a matrix of shape [10000,10]
y_probas = np.stack([model(X_test_scaled, training=True)
for sample in range(100)])
y_proba = y_probas.mean(axis=0)
- Encoding Categorical Features Using One-Hot Vectors
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)
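Continuing the snippet above, a hedged usage sketch: looking up a few categories (the out-of-vocabulary category "DESERT" is invented for illustration) and one-hot encoding the resulting indices:

```python
import tensorflow as tf

vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

# Known categories map to their vocabulary index; unknown ones ("DESERT")
# are hashed into one of the num_oov_buckets out-of-vocabulary buckets
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND"])
cat_indices = table.lookup(categories)
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
```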
-
Memory Requirements Convolutional layers require a large amount of RAM. This is especially true during training, because the reverse pass of backpropagation requires all the intermediate values computed during the forward pass. Using 32-bit floats, the convolutional layer's output will occupy 200 × 150 × 100 × 32 = 96 million bits (12 MB) of RAM. During inference (i.e., when making a prediction for a new instance), the RAM occupied by one layer can be released as soon as the next layer has been computed
-
SENet An SE block analyzes the output of the unit it is attached to, focusing exclusively on the depth dimension (it does not look for any spatial pattern), and it learns which features are usually most active together. It then uses this information to recalibrate the feature maps
An SE block is composed of just three layers: a global average pooling layer, a hidden dense layer using the ReLU activation function, and a dense output layer using the sigmoid activation function
As earlier, the global average pooling layer computes the mean activation for each feature map: for example, if its input contains 256 feature maps, it will output 256 numbers representing the overall level of response for each filter. The next layer is where the “squeeze” happens: this layer has significantly fewer than 256 neurons—typically 16 times fewer than the number of feature maps (e.g., 16 neurons)—so the 256 numbers get compressed into a small vector (e.g., 16 dimensions). This is a low-dimensional vector representation (i.e., an embedding) of the distribution of feature responses. This bottleneck step forces the SE block to learn a general representation of the feature combinations (we will see this principle in action again when we discuss autoencoders in Chapter 17). Finally, the output layer takes the embedding and outputs a recalibration vector containing one number per feature map (e.g., 256), each between 0 and 1. The feature maps are then multiplied by this recalibration vector, so irrelevant features (with a low recalibration score) get scaled down while relevant features (with a recalibration score close to 1) are left alone
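A hedged Keras sketch of that description (the helper name se_block, the ratio of 16, and the 8×8×256 input shape are illustrative assumptions, not the paper's exact code):

```python
from tensorflow import keras

def se_block(feature_maps, ratio=16):
    n_filters = feature_maps.shape[-1]
    # Squeeze: one mean activation per feature map, compressed to a small embedding
    squeeze = keras.layers.GlobalAveragePooling2D()(feature_maps)
    squeeze = keras.layers.Dense(n_filters // ratio, activation="relu")(squeeze)
    # Excite: one recalibration score in [0, 1] per feature map
    excite = keras.layers.Dense(n_filters, activation="sigmoid")(squeeze)
    excite = keras.layers.Reshape([1, 1, n_filters])(excite)
    # Scale each feature map by its recalibration score
    return keras.layers.Multiply()([feature_maps, excite])

inputs = keras.Input(shape=[8, 8, 256])       # e.g., 256 feature maps
recalibrated = se_block(inputs)
```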
-
TimeDistributed gives the model one-to-many and many-to-many capabilities by applying a layer to every time step
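A minimal sequence-to-sequence sketch (layer sizes are arbitrary): the same Dense layer is applied independently at every time step.

```python
import numpy as np
from tensorflow import keras

model = keras.models.Sequential([
    keras.Input(shape=[None, 1]),                       # any sequence length
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))])

# Batch of 2 series, 5 time steps each -> one 10-dimensional output per step
outputs = model.predict(np.zeros([2, 5, 1]))
```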
-
LSTM
As the long-term state c_(t-1) traverses the network from left to right, you can see that it first goes through a forget gate, dropping some memories, and then it adds some new memories via the addition operation (which adds the memories that were selected by an input gate). The result c_(t) is sent straight out, without any further transformation. So, at each time step, some memories are dropped and some memories are added. Moreover, after the addition operation, the long-term state is copied and passed through the tanh function, and then the result is filtered by the output gate. This produces the short-term state h_(t) (which is equal to the cell's output for this time step, y_(t)). Now let's look at where new memories come from and how the gates work.
The three other layers are gate controllers. Since they use the logistic activation function, their outputs range from 0 to 1. As you can see, their outputs are fed to element-wise multiplication operations, so if they output 0s they close the gate, and if they output 1s they open it
-
WaveNet They stacked 1D convolutional layers, doubling the dilation rate (how spread apart each neuron’s inputs are) at every layer: the first convolutional layer gets a glimpse of just two time steps at a time, while the next one sees four time steps (its receptive field is four time steps long), the next one sees eight time steps, and so on (see Figure 15-11). This way, the lower layers learn short-term patterns, while the higher layers learn long-term patterns. Thanks to the doubling dilation rate, the network can process extremely large sequences very efficiently
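A hedged Keras sketch of the stacking scheme (filter counts and the 1,2,4,8-twice pattern are illustrative; causal padding keeps each output aligned with past inputs only):

```python
import numpy as np
from tensorflow import keras

model = keras.models.Sequential()
model.add(keras.Input(shape=[None, 1]))
# Dilation rate doubles at every layer: receptive fields of 2, 4, 8, 16 steps...
for rate in (1, 2, 4, 8) * 2:
    model.add(keras.layers.Conv1D(filters=20, kernel_size=2, padding="causal",
                                  activation="relu", dilation_rate=rate))
model.add(keras.layers.Conv1D(filters=10, kernel_size=1))

# Causal padding preserves the sequence length (here 16 time steps)
outputs = model.predict(np.zeros([1, 16, 1]))
```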
-
Attention Mechanisms Instead of just sending the encoder’s final hidden state to the decoder, we now send all of its outputs to the decoder. At each time step, the decoder’s memory cell computes a weighted sum of all these encoder outputs: this determines which words it will focus on at this step
It’s actually pretty simple: they are generated by a type of small neural network called an alignment model (or an attention layer), which is trained jointly with the rest of the Encoder–Decoder model.
One extra benefit of attention mechanisms is that they make it easier to understand what led the model to produce its output. This is called
explainability. It can be especially useful when the model makes a mistake: for example, if an image of a dog walking in the snow is labeled as “a wolf walking in the snow,” then you can go back and check what the model focused on when it output the word “wolf.” You may find that it was paying attention not only to the dog, but also to the snow, hinting at a possible explanation: perhaps the way the model learned to distinguish dogs from wolves is by checking whether or not there’s a lot of snow around. You can then fix this by training the model with more images of wolves without snow, and dogs with snow
-
Transformer Architecture
- The encoder’s Multi-Head Attention layer encodes each word’s relationship with every other word in the same sentence, paying more attention to the most relevant ones . This attention mechanism is called self-attention (the sentence is paying attention to itself)
- The positional embeddings are simply dense vectors (much like word embeddings) that represent the position of a word in the sentence. They are needed because the Multi-Head Attention layers do not consider the order or the position of the words; they only look at their relationships. Since all the other layers are time-distributed, they have no way of knowing the position of each word (either relative or absolute). Obviously, the relative and absolute word positions are important, so we need to give this information to the Transformer somehow, and positional embeddings are a good way to do this.
-
Multi-Head Attention
To understand how a Multi-Head Attention layer works, we must first understand the Scaled Dot-Product Attention layer. Suppose the encoder analyzed the input sentence "They played chess" and managed to understand that the word "They" is the subject and the word "played" is the verb, so it encoded this information in the representations of these words. Now suppose the decoder has already translated the subject and thinks it should translate the verb next. To do so, it needs to fetch the verb from the input sentence. This is similar to a dictionary lookup: it is as if the encoder had created a dictionary {"subject": "They", "verb": "played", ...} and the decoder wanted to look up the value corresponding to the key "verb". However, the model does not have discrete tokens to represent the keys (such as "subject" or "verb"); it has vectorized representations of these concepts (which it learned during training), so the key it uses for the lookup (called the query) will not exactly match any key in the dictionary. The solution is to compute a similarity score between the query and each key in the dictionary, then use the softmax function to convert these similarity scores into weights that add up to 1. If the key that represents the verb is by far the most similar to the query, then that key's weight will be close to 1. The model can then compute a weighted sum of the corresponding values, so if the weight of the "verb" key is close to 1, the weighted sum will be very close to the representation of the word "played". In short, you can think of this whole process as a differentiable dictionary lookup. Like Luong attention, the similarity measure the Transformer uses is just the dot product; in fact, the equation is the same as for Luong attention, except for a scaling factor.
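A minimal numpy sketch of Scaled Dot-Product Attention as described above (the toy keys, values, and query are invented; the second key plays the role of the "verb" entry):

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exps / exps.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Similarity of each query with each key is a dot product, scaled by
    # sqrt(d_keys); softmax turns the scores into weights summing to 1
    d_keys = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_keys))
    return weights @ V    # weighted sum of the values: a soft dictionary lookup

# Two "dictionary entries" (keys with associated values) and one query that
# strongly matches the first key
K = np.array([[10.0, 0.0], [0.0, 10.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
Q = np.array([[10.0, 0.0]])
result = scaled_dot_product_attention(Q, K, V)   # ~ the first value, [1.0, 2.0]
```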
-
Tying Weights When an autoencoder is neatly symmetrical, like the one we just built, a common technique is to tie the weights of the decoder layers to the weights of the encoder layers. This halves the number of weights in the model, speeding up training and limiting the risk of overfitting
-
Sparse Autoencoders Another kind of constraint that often leads to good feature extraction is sparsity: by adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer. For example, it may be pushed to have on average only 5% significantly active neurons in the coding layer. This forces the autoencoder to represent each input as a combination of a small number of activations. As a result, each neuron in the coding layer typically ends up representing a useful feature (if you could speak only a few words per month, you would probably try to make them worth listening to)
sparse_l1_encoder = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28]),
keras.layers.Dense(100, activation="selu"),
keras.layers.Dense(300, activation="sigmoid"),
keras.layers.ActivityRegularization(l1=1e-3)])
sparse_l1_decoder = keras.models.Sequential([
keras.layers.Dense(100, activation="selu", input_shape=[300]),
keras.layers.Dense(28 * 28, activation="sigmoid"),
keras.layers.Reshape([28, 28])])
sparse_l1_ae = keras.models.Sequential([sparse_l1_encoder,
sparse_l1_decoder])
This ActivityRegularization layer just returns its inputs, but as a side effect it adds a training loss equal to the sum of absolute values of its inputs (this layer only has an effect during training). Equivalently, you could remove the ActivityRegularization layer and set activity_regularizer=keras.regularizers.l1(1e-3) in the previous layer.
Another approach, which often yields better results, is to measure the actual sparsity of the coding layer at each training iteration, and penalize the model when the measured sparsity differs from a target sparsity
- Variational Autoencoders
- They are probabilistic autoencoders, meaning that their outputs are partly determined by chance, even after training (as opposed to denoising autoencoders, which use randomness only during training).
- Most importantly, they are generative autoencoders, meaning that they
can generate new instances that look like they were sampled from the
training set.
As you can see in the diagram, although the inputs may have a very convoluted distribution, a variational autoencoder tends to produce codings that look as though they were sampled from a simple Gaussian distribution: during training, the cost function (discussed next) pushes the codings to gradually migrate within the coding space (also called the latent space) to end up looking like a cloud of Gaussian points. One great consequence is that after training a variational autoencoder, you can very easily generate a new instance: just sample a random coding from the Gaussian distribution, decode it, and voilà!
Now, let’s look at the cost function. It is composed of two parts. The first is
the usual reconstruction loss that pushes the autoencoder to reproduce its inputs (we can use cross entropy for this, as discussed earlier). The second is the latent loss that pushes the autoencoder to have codings that look as though they were sampled from a simple Gaussian distribution: it is the KL divergence between the target distribution (i.e., the Gaussian distribution) and the actual distribution of the codings.
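A hedged numpy sketch of that latent loss, using the common closed-form KL divergence between N(mean, σ²) codings and a standard Gaussian, parameterized by log σ² as in many implementations:

```python
import numpy as np

def latent_loss(mean, log_var):
    # KL divergence between N(mean, exp(log_var)) and the standard Gaussian N(0, 1)
    return -0.5 * np.sum(1 + log_var - np.exp(log_var) - mean ** 2)

# Codings already standard Gaussian -> no latent loss
zero_loss = latent_loss(np.zeros(3), np.zeros(3))
# Codings drift away from N(0, 1) -> the loss grows
drift_loss = latent_loss(np.ones(3), np.zeros(3))
```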
- TFLite’s model converter can take a SavedModel and compress it to a much lighter format based on FlatBuffers. This is an efficient cross-platform serialization library (a bit like Protocol Buffers) initially created by Google for gaming. It is designed so you can load FlatBuffers straight to RAM without any preprocessing: this reduces the loading time and memory footprint.
Appendix C
Machine Learning Project Checklist main steps:
- Frame the problem and look at the big picture.
- Get the data.
- Explore the data to gain insights.
- Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
- Explore many different models and shortlist the best ones.
- Fine-tune your models and combine them into a great solution.
- Present your solution.
- Launch, monitor, and maintain your system.
Obviously, you should feel free to adapt this checklist to your needs.
Frame the Problem and Look at the Big Picture
- Define the objective in business terms.
- How will your solution be used?
- What are the current solutions/workarounds (if any)?
- How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
- How should performance be measured?
- Is the performance measure aligned with the business objective?
- What would be the minimum performance needed to reach the business objective?
- What are comparable problems? Can you reuse experience or tools?
- Is human expertise available?
- How would you solve the problem manually?
- List the assumptions you (or others) have made so far.
- Verify assumptions if possible.
Get the Data
Note: automate as much as possible so you can easily get fresh data.
- List the data you need and how much you need.
- Find and document where you can get that data.
- Check how much space it will take.
- Check legal obligations, and get authorization if necessary.
- Get access authorizations.
- Create a workspace (with enough storage space).
- Get the data.
- Convert the data to a format you can easily manipulate (without changing the data itself).
- Ensure sensitive information is deleted or protected (e.g., anonymized).
- Check the size and type of data (time series, sample, geographical, etc.).
- Sample a test set, put it aside, and never look at it (no data snooping!).
Explore the Data
Note: try to get insights from a field expert for these steps.
-
Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
-
Create a Jupyter notebook to keep a record of your data exploration.
-
Study each attribute and its characteristics:
- Name
- Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
- % of missing values
- Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
- Usefulness for the task
- Type of distribution (Gaussian, uniform, logarithmic, etc.)
-
For supervised learning tasks, identify the target attribute(s).
-
Visualize the data.
-
Study the correlations between attributes.
-
Study how you would solve the problem manually.
-
Identify the promising transformations you may want to apply.
-
Identify extra data that would be useful (go back to “Get the Data”).
-
Document what you have learned.
Prepare the Data
Notes:
- Work on copies of the data (keep the original dataset intact).
- Write functions for all data transformations you apply, for five reasons:
- So you can easily prepare the data the next time you get a fresh dataset
- So you can apply these transformations in future projects
- To clean and prepare the test set
- To clean and prepare new data instances once your solution is live
- To make it easy to treat your preparation choices as hyperparameters
- Data cleaning:
- Fix or remove outliers (optional).
- Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).
- Feature selection (optional):
- Drop the attributes that provide no useful information for the task.
- Feature engineering, where appropriate:
- Discretize continuous features.
- Decompose features (e.g., categorical, date/time, etc.).
- Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
- Aggregate features into promising new features.
- Feature scaling:
- Standardize or normalize features.
Shortlist Promising Models
Notes:
- If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests).
- Once again, try to automate these steps as much as possible.
- Train many quick-and-dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forest, neural net, etc.) using standard parameters.
- Measure and compare their performance.
- For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds.
- Analyze the most significant variables for each algorithm.
- Analyze the types of errors the models make.
- What data would a human have used to avoid these errors?
- Perform a quick round of feature selection and engineering.
- Perform one or two more quick iterations of the five previous steps.
- Shortlist the top three to five most promising models, preferring models that make different types of errors.
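The N-fold cross-validation step above can be sketched with Scikit-Learn (iris and logistic regression are stand-ins for your data and models):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# 5-fold cross-validation: mean +/- std summarizes the performance measure
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean, std = scores.mean(), scores.std()
```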
Fine-Tune the System
Notes:
- You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning
- As always, automate what you can.
- Fine-tune the hyperparameters using cross-validation:
- Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., if you’re not sure whether to replace missing values with zeros or with the median value, or to just drop the rows).
- Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long,you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek et al.).
- Try Ensemble methods. Combining your best models will often produce better performance than running them individually.
- Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.
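For the random-search step above, a minimal sketch (iris, a random forest, and these parameter ranges are illustrative stand-ins):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)
# Randomly samples n_iter hyperparameter combinations instead of a full grid
param_distribs = {"n_estimators": randint(10, 100),
                  "max_depth": randint(2, 10)}
rnd_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                param_distribs, n_iter=5, cv=3, random_state=42)
rnd_search.fit(X, y)
```

rnd_search.best_params_ then holds the best sampled combination.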
Present Your Solution
- Document what you have done.
- Create a nice presentation.
- Make sure you highlight the big picture first.
- Explain why your solution achieves the business objective.
- Don’t forget to present interesting points you noticed along the way.
- Describe what worked and what did not.
- List your assumptions and your system’s limitations.
- Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).
Launch!
- Get your solution ready for production (plug into production data inputs, write unit tests, etc.).
- Write monitoring code to check your system’s live performance at regular intervals and trigger alerts when it drops.
- Beware of slow degradation: models tend to “rot” as data evolves.
- Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
- Also monitor your inputs’ quality (e.g., a malfunctioning sensor sending random values, or another team’s output becoming stale). This is particularly important for online learning systems.
- Retrain your models on a regular basis on fresh data (automate as much as possible).