Introduction

ML is everywhere: spam filters, search engines, cleaning robots, and so on.

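The nature of ML
- Grew out of AI
- A new capability for computers
Application areas
- Data mining: extracting datasets from the ever-growing web, e.g. click counts, media reports, biology, engineering.
- Applications that are impractical to program by hand: autonomous helicopters, handwriting recognition, natural language processing, CV
- Learning to think the way the human brain does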

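I. What exactly is ML?

There are many competing definitions.

Giving computers the ability to learn without being explicitly programmed (an older definition, and not a very precise one)
A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Example: learning to play checkers
1. E is the experience of playing close to ten thousand games
2. T is the task of playing checkers
3. P is the probability of winning a game of checkers
Quiz: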

Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?

A. Classifying emails as spam or not spam. # T

B. Watching you label emails as spam or not spam. # E

C. The number (or fraction) of emails correctly classified as spam/not spam. # P

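D. None of the above - this is not a machine learning problem. # clearly wrong
Current machine learning algorithms
1. Main categories: supervised learning and unsupervised learning
2. Other types: reinforcement learning, (preference-based) recommender systems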

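II. Supervised learning

1. Definition

A model is built from inputs whose correct outputs are already given, and is then used to output the "right answer" for inputs that are not in the training set.
Regression: predict a continuous-valued output (the value is not known in advance)
Classification: predict a discrete-valued output (the set of possible outputs is fixed in advance)

2. Examples

Regression problem (continuous model)
- Predicts a continuous output (the output value is not fixed in advance)
Classification problem (discrete model)
- Predicts a discrete output (the output is one of a known set of values)
1. Many other features can also be included: Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, and so forth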

3. Quiz

You're running a company, and you want to develop learning algorithms to address each of two problems.

Problem 1: You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months. # predict the sales of a large inventory over the next 3 months - regression

Problem 2: You'd like software to examine individual customer accounts, and for each account decide if it has been hacked/compromised. # decide whether a user account is normal - classification

Should you treat these as classification or as regression problems? [C]

A. Treat both as classification problems.

B. Treat problem 1 as a classification problem, problem 2 as a regression problem.

C. Treat problem 1 as a regression problem, problem 2 as a classification problem.

D. Treat both as regression problems.

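III. Unsupervised learning

1. Definition

The input data has no "correct" labels, or every example carries the same label. In this setting, a clustering algorithm tries to partition the input data into "clusters",
grouping together all the data points that are similar (logically close) to one another.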
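To make the idea of grouping unlabeled data concrete, here is a minimal sketch using k-means. The notes do not name a specific clustering algorithm, so k-means, the farthest-first initialization, the function name `kmeans`, and the sample points are all assumptions chosen for illustration.

```python
import numpy as np

def kmeans(points, k, iterations=20):
    """Group unlabeled points into k clusters (plain k-means, illustrative only)."""
    # Farthest-first initialization (an assumption): start from the first point,
    # then repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    for _ in range(k - 1):
        gaps = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[gaps.argmax()])
    centroids = np.array(centroids, dtype=float)

    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return labels, centroids

# Two well-separated groups of made-up 2-D points, with no labels attached.
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centroids = kmeans(points, k=2)
```

The algorithm is never told which point belongs to which group; the cluster indices in `labels` come only from how close the points are to each other.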

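Figure: the difference between supervised and unsupervised learning

2. Examples

Google News search, genomics, social network analysis, star clusters, market segmentation
Cocktail party:
1. Problem: there are two microphones and two speakers. Both microphones pick up both voices at the same time, but at different intensities. How can the two voices be separated? Alternatively, only one person is speaking but the background is noisy. How can the speech be separated from the background noise?
2. It can be solved with a single line of code: `[W, s, v] = svd((repmat(sum(x.*x, 1), size(x, 1), 1).*x)*x');`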
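For readers who do not use Octave, the one-liner can be unpacked step by step. The following is a sketch of a NumPy translation, assuming `x` holds the two microphone signals as a 2-by-n array; the random data here is only a stand-in, and the snippet shows the matrix operations rather than a complete audio-separation pipeline.

```python
import numpy as np

# Made-up stand-in for the two microphone recordings: shape (2, n_samples).
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 1000))

# Octave: repmat(sum(x.*x, 1), size(x, 1), 1) .* x
# Sum of squares per sample (column), repeated for every row, then used as weights.
weights = np.tile(np.sum(x * x, axis=0), (x.shape[0], 1))
weighted = weights * x

# Octave: svd(weighted * x')
# Note: NumPy returns V already transposed (vh), unlike Octave's v.
W, s, vh = np.linalg.svd(weighted @ x.T)
```

The result is a singular value decomposition of a 2-by-2 matrix, so `W` and `vh` are 2-by-2 and `s` holds two singular values in descending order.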

3. Quiz

Of the following examples, which would you address using an unsupervised learning algorithm? (Check all that apply.)

A. Given email labeled as spam/not spam, learn a spam filter. # labels are given, so this is a (supervised) classification problem

B. Given a set of news articles found on the web, group them into sets of articles about the same story. # no labels, and the articles must be grouped ✓

C. Given a database of customer data, automatically discover market segments and group customers into different market segments. # no labels, and the customers must be grouped ✓

D. Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not. # labels are given, so this is a (supervised) classification problem

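Concrete algorithms

The concrete steps for solving a problem.

I. Model representation

1. Problem description

House price prediction:

Figure: house price prediction

Properties:
1. It is a supervised learning problem, because the input data is already labeled
2. It is a regression problem, because the house price is not one of a fixed set of values but a continuous quantity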

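2. Terminology:

Figure: terminology

The input data is called the "training set"
m is the number of training examples
x is the input variable, i.e. the features
y is the output variable, i.e. the target
(x, y) denotes one training example. In $(x^{(i)}, y^{(i)})$ the superscript $(i)$ is neither an i-th derivative nor an i-th power; it is the index of the i-th training example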
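The notation above maps directly onto plain data structures. The sketch below uses a small made-up training set (house size and price); the numbers and the helper name `example` are assumptions for illustration only.

```python
# Hypothetical training set: (size in square feet, price in thousands of dollars).
training_set = [(2104, 460), (1416, 232), (1534, 315), (852, 178)]

m = len(training_set)                     # m: number of training examples
x = [pair[0] for pair in training_set]    # x: input variable (feature)
y = [pair[1] for pair in training_set]    # y: output variable (target)

def example(i):
    """Return (x^(i), y^(i)), the i-th training example, counting from 1."""
    return x[i - 1], y[i - 1]

second = example(2)   # the pair (x^(2), y^(2))
```

Note that the index i in the notes starts at 1, while Python lists start at 0, hence the `i - 1`.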

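3. General workflow for solving the problem:

Choose a hypothesis function: here, a linear regression function
- General forms:
  
  Linear: y = ax + b (univariate linear regression)
  
  Nonlinear
Compute the error
Adjust the parameters

As shown in the figure:

Figure: the general steps of machine learning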

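II. Cost function

1. General graph of a univariate linear function

Figure: representation of a univariate function

2. Cost function

Motivation: for the hypothesis y = ax + b, choose a and b so that, for every input x in the training set, the output fits the corresponding y as closely as possible.
Idea: to measure how well a given a and b fit, we use a second function, the cost function, and minimize its value
General form: $\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2}$, where m is the number of training examples.
1. $J(a, b) = \frac{1}{2m} \sum_{i=1}^{m}\left(h_{a,b}(x^{(i)}) - y^{(i)}\right)^{2}$, where $h_{a,b}(x) = ax + b$. This is called the squared error function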

3. Overall setup

Hypothesis:

$h_{\theta}(x)=\theta_{0}+\theta_{1} x$

Parameters:

$\theta_{0}, \theta_{1}$

Cost Function:

$J\left(\theta_{0}, \theta_{1}\right)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$

Goal: $\underset{\theta_{0}, \theta_{1}}{\operatorname{minimize}} J\left(\theta_{0}, \theta_{1}\right)$
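The hypothesis and cost function above translate almost line for line into code. This is a minimal sketch; the function names `hypothesis` and `cost` and the three data points are assumptions made for the example.

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x"""
    return theta0 + theta1 * x

def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum of squared prediction errors."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

# Made-up training set lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

perfect = cost(0.0, 1.0, xs, ys)   # the line passes through every point
flat = cost(0.0, 0.0, xs, ys)      # h(x) = 0 for every x
```

Here `perfect` is 0 because every error is 0, while `flat` equals (1 + 4 + 9) / 6, so a worse fit gives a larger J.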

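4. Example (a hypothesis with a single parameter, setting b = 0):

With b = 0, each value of a corresponds to one straight line through the origin. Plug that line's outputs on the dataset x, together with y, into the cost function, and plot J against a.
The fitting process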
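The a versus J curve described above can be tabulated directly. The following sketch assumes a made-up training set on the line y = x and a hypothetical helper `cost_b0`; it evaluates J for a range of candidate slopes and picks the lowest point.

```python
# Made-up training set lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

def cost_b0(a):
    """J(a) for the hypothesis h(x) = a * x (b fixed at 0)."""
    m = len(xs)
    return sum((a * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Candidate slopes a = -0.5, -0.25, ..., 2.5.
candidates = [i / 4 for i in range(-2, 11)]
curve = [(a, cost_b0(a)) for a in candidates]

# The best slope is the one with the smallest cost.
best_a, best_cost = min(curve, key=lambda pair: pair[1])
```

Plotting the pairs in `curve` gives the bowl-shaped a versus J graph; for this data the bottom of the bowl is at a = 1.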

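5. Example (two parameters $\theta_{0}, \theta_{1}$)

Cost function of two variables

Figure: cost surface of two variables

Contour plot (I need to understand this better)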
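One way to get a feel for the cost surface and its contour plot is to evaluate J on a grid of parameter pairs. The sketch below assumes a made-up dataset generated from y = 1 + x and an arbitrary grid range; both are illustrative choices, not values from the notes.

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 3.0, 4.0])    # made-up data generated from y = 1 + x

theta0_values = np.linspace(-1.0, 3.0, 41)   # grid step 0.1
theta1_values = np.linspace(-1.0, 3.0, 41)

# T0[i, j] = theta0_values[i], T1[i, j] = theta1_values[j]
T0, T1 = np.meshgrid(theta0_values, theta1_values, indexing="ij")

# residuals[i, j, k] = h(x_k) - y_k for the parameter pair at grid cell (i, j)
residuals = T0[..., None] + T1[..., None] * xs - ys
J = (residuals ** 2).sum(axis=-1) / (2 * len(xs))

# Locate the lowest cell of the surface.
i, j = np.unravel_index(J.argmin(), J.shape)
best = (theta0_values[i], theta1_values[j])
```

Each contour line of the plot connects grid cells with the same value in `J`; the array can be handed to a plotting routine such as matplotlib's `contour` to draw it, and the lowest cell sits at the center of the rings.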

III. Gradient descent

1. Problem description

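Have some function $J(\theta_{0}, \theta_{1})$

Want $\min_{\theta_{0}, \theta_{1}} J(\theta_{0}, \theta_{1})$

Outline:

● Start with some $(\theta_{0}, \theta_{1})$ # (the initial values are usually all set to 0)

● Keep changing $(\theta_{0}, \theta_{1})$ to reduce $J(\theta_{0}, \theta_{1})$ until we hopefully end up at a minimum # (or stop after a manually chosen number of iterations)

Gradient descent is a general method for minimizing a function; in these notes it is applied to linear regression.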

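2. Example

Standing on the surface of a two-variable function, which way should we step to reach the "foot of the mountain" as quickly as possible?
Use a locally optimal (greedy) strategy: at each point, step in the direction that goes downhill fastest

Figure: gradient descent

3. Mathematical principle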

Gradient descent algorithm

repeat until convergence {

$\theta_{j}:=\theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J\left(\theta_{0}, \theta_{1}\right) \quad \text { (for } j=0 \text { and } j=1 \text { ) }$

}

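Notes:

① := denotes assignment

② ${\alpha}$ is the "learning rate", a scalar that controls the "step size" of gradient descent. The larger the value, the bigger the step.

③ $\frac{\partial}{\partial \theta_{j}} J\left(\theta_{0}, \theta_{1}\right)$ is a partial derivative; each $\theta_{j}$ is updated iteratively until J reaches a minimum.

Correct: Simultaneous update (all parameters are updated together from the old values; the learning rate is the same for every parameter and must be positive)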

$\begin{array}{l} \text { temp0 }:=\theta_{0}-\alpha \frac{\partial}{\partial \theta_{0}} J\left(\theta_{0}, \theta_{1}\right) \\ \text { temp1 }:=\theta_{1}-\alpha \frac{\partial}{\partial \theta_{1}} J\left(\theta_{0}, \theta_{1}\right) \\ \theta_{0}:=\text { temp0 } \\ \theta_{1}:=\text { temp1 } \end{array}$
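The difference between a simultaneous update and a one-at-a-time update is easiest to see on a cost whose two partial derivatives depend on both parameters. The sketch below uses a made-up cost J = θ0² + θ0·θ1 + θ1² purely for illustration; the function names are assumptions.

```python
def grad(theta0, theta1):
    """Partial derivatives of the made-up cost J = theta0**2 + theta0*theta1 + theta1**2."""
    return 2 * theta0 + theta1, theta0 + 2 * theta1

def simultaneous_step(theta0, theta1, alpha):
    """Correct: both partial derivatives are evaluated at the OLD parameters."""
    d0, d1 = grad(theta0, theta1)
    temp0 = theta0 - alpha * d0
    temp1 = theta1 - alpha * d1
    return temp0, temp1

def sequential_step(theta0, theta1, alpha):
    """Incorrect: theta1's derivative already sees the NEW theta0."""
    theta0 = theta0 - alpha * grad(theta0, theta1)[0]
    theta1 = theta1 - alpha * grad(theta0, theta1)[1]
    return theta0, theta1

correct = simultaneous_step(1.0, 1.0, 0.1)
incorrect = sequential_step(1.0, 1.0, 0.1)
```

Starting from (1, 1) with α = 0.1, the simultaneous step moves both parameters by the same amount, while the sequential version gives a different θ1 because it mixes old and new values.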

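4. Key points

Example: two different starting points on the $J(\theta)$ curve of the house price fit
The role of the derivative term ( $\frac{d}{d \theta_{1}} J\left(\theta_{1}\right)$ )
1. $\theta_{1}$ lies to the left of the optimum: the derivative is negative, so the update increases $\theta_{1}$
2. $\theta_{1}$ lies to the right of the optimum: the derivative is positive, so the update decreases $\theta_{1}$

Figure: how the derivative term drives the fit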

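The role of the learning rate ( $\alpha$ )
1. If it is small, $\theta_{1}$ changes slowly from one iteration to the next, but the steps are finer.
2. If it is large, the opposite holds: the steps can overshoot the minimum, and the iteration may even fail to converge.
If the initial point is already at a local optimum, it will not move. Even a point that is not an optimum but has a zero derivative stops the iteration (what can be done about that?). Gradient descent does not compare local optima by itself; a common workaround is to run it from several different starting points and keep the result with the lowest J.
Gradient descent automatically takes smaller steps as it approaches a local minimum, because the derivative shrinks there. There is no need to decrease ${\alpha}$ after each iteration.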
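The effects of the learning rate can be reproduced on a toy problem. The sketch below minimizes a made-up cost J(θ) = θ², whose derivative is 2θ; the function name `run` and the specific α values are assumptions chosen to show each behavior.

```python
def run(alpha, theta=1.0, steps=20):
    """Gradient descent on the made-up cost J(theta) = theta**2 (derivative 2*theta)."""
    history = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
        history.append(theta)
    return history

slow = run(alpha=0.01)                   # tiny steps: still far from 0 after 20 iterations
good = run(alpha=0.1)                    # steps shrink on their own as the derivative shrinks
diverging = run(alpha=1.1)               # overshoots further on every step
at_optimum = run(alpha=0.1, theta=0.0)   # derivative is 0, so theta never moves

# Step sizes for the well-behaved run, with alpha held constant throughout.
step_sizes = [abs(b - a) for a, b in zip(good, good[1:])]
```

In the well-behaved run every entry of `step_sizes` is smaller than the one before it even though α never changes, which is the "automatic" step adjustment noted above.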

5. Gradient descent for linear regression

Basic steps
1. Compute the partial derivatives for j = 0 and j = 1:
  
  $\begin{array}{l} \frac{\partial}{\partial \theta_{j}} J\left(\theta_{0}, \theta_{1}\right)=\frac{\partial}{\partial \theta_{j}} \frac{1}{2 m} \sum_{i=1}^{m}\left(\theta_{0}+\theta_{1} x^{(i)}-y^{(i)}\right)^{2} \\ j=0: \frac{\partial}{\partial \theta_{0}} J\left(\theta_{0}, \theta_{1}\right)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \\ j=1: \frac{\partial}{\partial \theta_{1}} J\left(\theta_{0}, \theta_{1}\right)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)} \end{array}$
2. Update $\theta_{0}, \theta_{1}$ iteratively:
  
  repeat until convergence { # stop once the desired precision or a preset number of iterations is reached
  
  $\begin{aligned} \theta_{0} &:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \\ \theta_{1} &:=\theta_{1}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)} \end{aligned}$
  
  }
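3. Note: for linear regression, the cost surface over $(\theta_{0}, \theta_{1})$ is always bowl-shaped (convex).
  
  Such a function has no local optima other than the single global optimum.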
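Putting the two steps together gives a complete, if minimal, training loop. This is a sketch under stated assumptions: the function name `gradient_descent`, the default α of 0.1, the stopping tolerance, and the four data points are all illustrative choices rather than values from the notes.

```python
def gradient_descent(xs, ys, alpha=0.1, max_iters=5000, tol=1e-12):
    """Fit h(x) = theta0 + theta1 * x by gradient descent on the squared error cost."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0              # start from all zeros
    for _ in range(max_iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                                   # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m        # dJ/dtheta1
        # Simultaneous update: both gradients were computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        # Stop early once both gradients are (numerically) zero.
        if abs(grad0) < tol and abs(grad1) < tol:
            break
    return theta0, theta1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]      # made-up data lying on y = 1 + 2x
theta0, theta1 = gradient_descent(xs, ys)
```

Because the cost surface is a single bowl, starting from zeros is enough here: the loop heads to the one global minimum, which for this data is close to θ0 = 1 and θ1 = 2.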
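How to read the contour plot:
1. The center is the lowest point; during iteration the parameters keep moving toward it.
2. $\theta_{0}$ and $\theta_{1}$ are the two independent axes of the plot and are not functions of each other; the color of each contour line indicates the value of J

Figure: contour plot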

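6. "Batch" gradient descent

Process:

Some variants work differently and use only part of the training set in each iteration. "Batch" gradient descent uses every training example in each iteration; the method described above is of this kind.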
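The distinction comes down to which examples feed the gradient in a single update. The sketch below contrasts the two on a made-up dataset; the helper name `gradient`, the subset size of 2, and the name "mini-batch" for the subset-based variant are assumptions (the notes do not name that variant).

```python
import random

def gradient(theta0, theta1, samples):
    """Gradient of the squared error cost over the given (x, y) samples."""
    n = len(samples)
    errors = [(theta0 + theta1 * x - y, x) for x, y in samples]
    return sum(e for e, _ in errors) / n, sum(e * x for e, x in errors) / n

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]   # made-up, y = 1 + 2x

# "Batch": every training example contributes to each update.
batch_grad = gradient(0.0, 0.0, data)

# Subset-based variant (often called mini-batch): only some examples are used.
random.seed(0)
subset = random.sample(data, 2)
subset_grad = gradient(0.0, 0.0, subset)
```

The batch gradient is exact for the training set but costs one pass over all m examples per update; the subset gradient is cheaper per step but only approximates it.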

ML day01 - Introduction_Gradient descent
