01 - ML Basic Concepts



  • ML = looking for a function f

  • Different types of functions

    • Regression: f outputs a scalar
    • Classification: given classes, f outputs the correct one
    • Structured Learning: create something with structure (image, document)
  • How to find such an f? => training

    • step1: f with unknown parameters

    • step2: define loss(L) from training data

      • loss is a function of the parameters

      • loss measures how good a set of parameter values is

        eg:

        L=\frac{1}{N}\sum_n e_n,\quad \text{MAE: } e=|y-\hat y|,\quad \text{MSE: } e=(y-\hat y)^2

        MAE: L is mean absolute error

        MSE: L is mean squared error

        01-1.png
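The two error measures above can be sketched in a few lines of Python; the toy dataset and the candidate parameters (b, w) below are illustrative choices, not from the notes:

```python
def mae(y_pred, y_true):
    # mean absolute error: L = (1/N) * sum |y - y_hat|
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

def mse(y_pred, y_true):
    # mean squared error: L = (1/N) * sum (y - y_hat)^2
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

# toy data and one candidate set of parameter values (b, w)
xs = [1.0, 2.0, 3.0]
labels = [2.0, 4.0, 6.0]
b, w = 0.0, 1.5
preds = [b + w * x for x in xs]   # linear model: y = b + w * x

print(mae(preds, labels))  # 1.0
print(mse(preds, labels))  # 3.5 / 3 ≈ 1.1667
```

Different (b, w) values give different losses; the training goal is to find the values that minimize L.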

      • optimization: w^*,b^*=\arg\min_{w,b}L

        method: Gradient Descent

        • randomly pick an initial value w_0

        • compute \frac{\partial L}{\partial w}\Big|_{w=w_0}

        01-2.png

        > if negative => increase w
        > elif positive => decrease w
        >
        > so w_0 \to w_1
        

        what about the increment ?

        01-3.png

        \eta\cdot\frac{\partial L}{\partial w}\Big|_{w=w_0}

        \eta is the learning rate

        \eta: a parameter that has to be set manually => hyperparameter

        in conclusion, w_1\leftarrow w_0-\eta\frac{\partial L}{\partial w}\Big|_{w=w_0}

        • update w iteratively

        01-4.png

        So gradient descent can get stuck in a local minimum (in practice this doesn't actually cause a problem).
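The iterative update above can be sketched as a plain-Python loop. The toy data, the learning rate, and the hand-derived MSE gradient below are illustrative assumptions:

```python
# One-parameter model y = w * x with MSE loss; true slope is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
eta = 0.01   # learning rate (hyperparameter, set by hand)
w = 0.0      # arbitrarily chosen initial value w_0

for step in range(1000):
    # dL/dw for L = (1/N) * sum (w*x - y)^2  is  (2/N) * sum (w*x - y) * x
    grad = 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w = w - eta * grad   # w_{t+1} <- w_t - eta * dL/dw

print(round(w, 4))  # converges toward the true slope 2.0
```

Each iteration moves w against the sign of the gradient, exactly as the blockquote above describes: negative gradient increases w, positive gradient decreases it.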

        • In cases with multiple parameters, it's similar to having only a single parameter.

        01-5.png

        01-6.png

      prediction (then adjust the model based on the prediction results, and repeat...)

      The above example is based on a linear model.

      However, a linear model has limitations (model bias).

      solution: add a set of piecewise linear functions

      01-7.png

      You can modify the parameters (c, b, w) in the function to adjust its shape.

      So the new model is more flexible.

      y=b+\sum_i c_i\,\mathrm{sigmoid}(b_i+w_ix_i)\\ y=b+\sum_i c_i\,\mathrm{sigmoid}\big(b_i+\textstyle\sum_j w_{ij}x_j\big)

      i indexes the sigmoid functions; j indexes the features; \mathrm{sigmoid}(\cdot)=\sigma(\cdot)
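A minimal sketch of this model's forward pass; the parameter values below are toy choices of my own, not from the notes:

```python
import math

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^-z)
    return 1 / (1 + math.exp(-z))

def forward(x, b, c, b_i, W):
    # y = b + sum_i c_i * sigmoid(b_i + sum_j W[i][j] * x[j])
    # x: feature vector (length j); W[i]: weights of sigmoid unit i
    y = b
    for ci, bi, wi in zip(c, b_i, W):
        z = bi + sum(wij * xj for wij, xj in zip(wi, x))
        y += ci * sigmoid(z)
    return y

# two features, two sigmoid units, toy parameters
x = [1.0, 2.0]
y = forward(x, b=0.5, c=[1.0, -1.0], b_i=[0.0, 1.0],
            W=[[0.3, -0.2], [0.1, 0.4]])
print(y)
```

All of b, the c_i, b_i, and w_ij together form the parameter vector θ that the next step optimizes.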

      01-8.png

      01-9.png

      01-10.png

      01-11.png

      so this time Loss = L(\theta)

    • step3: optimization

      • \vec\theta^*=\arg\min_{\vec\theta}L

        • randomly pick initial values \vec\theta_0

        01-12.png (gradient: \vec g=\nabla L(\vec\theta_0))

        • update θ\vec\theta iteratively

          • \vec\theta_1\leftarrow\vec\theta_0-\eta\vec g\\ \vec\theta_2\leftarrow\vec\theta_1-\eta\vec g\\ \ldots
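The vector update above, sketched with NumPy on a toy quadratic loss; the loss function and its hand-derived gradient are illustrative assumptions:

```python
import numpy as np

# Toy loss L(theta) = ||theta - target||^2, gradient derived by hand.
target = np.array([1.0, -2.0, 3.0])
theta = np.zeros(3)   # arbitrarily chosen theta_0
eta = 0.1

for _ in range(200):
    g = 2 * (theta - target)   # gradient vector of L at current theta
    theta = theta - eta * g    # same rule as the scalar case, vectorized

print(np.round(theta, 4))  # approaches [ 1. -2.  3.]
```

The rule is identical to the single-parameter case; the only difference is that θ and g are vectors holding every parameter and every partial derivative at once.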

        01-13.png

        if N = 10000 and batch size = 10, how many updates are there in 1 epoch?

        answer: 1000 updates
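The arithmetic behind this answer: one epoch visits all N examples once, and each batch triggers exactly one parameter update:

```python
# updates per epoch = N / batch_size (assuming N divides evenly)
N = 10000
batch_size = 10
updates_per_epoch = N // batch_size
print(updates_per_epoch)  # 1000
```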

      • sigmoid \to ReLU (Rectified Linear Unit): c\max(0,\,b+wx)

        \text{sigmoid: } y=b+\sum_i c_i\,\mathrm{sigmoid}\big(b_i+\textstyle\sum_j w_{ij}x_j\big)\\ \text{ReLU: } y=b+\sum_{\textcolor{red}{2i}} c_i\max\big(0,\,b_i+\textstyle\sum_j w_{ij}x_j\big)

        which one is better? =>ReLU
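One way to see why the ReLU sum runs over 2i terms: two ReLUs can compose one piecewise-linear "hard sigmoid". A small sketch (the clamp-to-[0, 1] shape is an illustrative choice):

```python
def relu(z):
    # ReLU: max(0, z)
    return max(0.0, z)

def hard_sigmoid(z):
    # Two ReLUs compose one piecewise-linear "hard sigmoid":
    # relu(z) - relu(z - 1) equals z clamped to [0, 1]
    return relu(z) - relu(z - 1)

print(hard_sigmoid(-0.5), hard_sigmoid(0.3), hard_sigmoid(2.0))  # 0.0 0.3 1.0
```

So replacing each sigmoid with a pair of ReLUs keeps the same expressive power, which is why twice as many ReLU terms appear in the formula above.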

      • multiple hidden layers

        Increasing this hyperparameter (the number of hidden layers) can reduce the Loss, but it also increases the complexity of the model.

      • 01-14.png deep means many hidden layers; but why do we want "deep" rather than "fat" (just putting all the neurons in one wide layer)?