01 - ML Basic Concepts



  • ML = looking for a function f

  • Different types of functions

    • Regression: f outputs a scalar
    • Classification: given classes, f outputs the correct one
    • Structured Learning: create something with structure (image, document)
  • How to find such an f? => training

    • step1: f with unknown parameters

    • step2: define loss(L) from training data

      • loss is a function of the parameters

      • loss measures how good a set of parameter values is

        eg:

        L=\frac{1}{N}\sum_n e_n,\quad \text{MAE: } e=|y-\hat y|,\quad \text{MSE: } e=(y-\hat y)^2

        MAE: L is mean absolute error

        MSE: L is mean squared error

        01-1.png
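The two error measures above can be sketched in a few lines of Python; the toy dataset and the candidate parameters (b, w) below are illustrative choices, not from the notes:

```python
def mae(y_pred, y_true):
    # mean absolute error: L = (1/N) * sum |y - y_hat|
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

def mse(y_pred, y_true):
    # mean squared error: L = (1/N) * sum (y - y_hat)^2
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

# toy data and one candidate set of parameter values (b, w)
xs = [1.0, 2.0, 3.0]
labels = [2.0, 4.0, 6.0]
b, w = 0.0, 1.5
preds = [b + w * x for x in xs]   # linear model: y = b + w * x

print(mae(preds, labels))  # 1.0
print(mse(preds, labels))  # 3.5 / 3 ≈ 1.1667
```

Different (b, w) values give different losses; the training goal is to find the values that minimize L.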

      • optimization: w^*,b^*=\arg\min_{w,b}L

        method: Gradient Descent

        • randomly pick an initial value w_0

        • compute \frac{\partial L}{\partial w}\Big|_{w=w_0}

        01-2.png

        > if negative => increase w
        > elif positive => decrease w
        >
        > so w_0 \to w_1
        

        what about the increment ?

        01-3.png

        \eta\cdot\frac{\partial L}{\partial w}\Big|_{w=w_0}

        \eta is the learning rate

        \eta: a parameter that has to be set manually => hyperparameter

        in conclusion, w_1\leftarrow w_0-\eta\frac{\partial L}{\partial w}\Big|_{w=w_0}

        • update w iteratively

        01-4.png

        So gradient descent can get stuck in a local minimum (in practice this doesn't actually cause a problem).
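The iterative update above can be sketched as a plain-Python loop. The toy data, the learning rate, and the hand-derived MSE gradient below are illustrative assumptions:

```python
# One-parameter model y = w * x with MSE loss; true slope is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
eta = 0.01   # learning rate (hyperparameter, set by hand)
w = 0.0      # arbitrarily chosen initial value w_0

for step in range(1000):
    # dL/dw for L = (1/N) * sum (w*x - y)^2  is  (2/N) * sum (w*x - y) * x
    grad = 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w = w - eta * grad   # w_{t+1} <- w_t - eta * dL/dw

print(round(w, 4))  # converges toward the true slope 2.0
```

Each iteration moves w against the sign of the gradient, exactly as the blockquote above describes: negative gradient increases w, positive gradient decreases it.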

        • In cases with multiple parameters, it's similar to having only a single parameter.

        01-5.png

        01-6.png

      prediction (then adjust the model based on the prediction results, and repeat...)

      The above example is based on a linear model.

      However, a linear model has limitations (model bias).

      solution: add a set of piecewise linear functions

      01-7.png

      You can modify the parameters (c, b, w) in the function to adjust its shape.

      So the new model is more flexible.

      y=b+\sum_i c_i\,\mathrm{sigmoid}(b_i+w_ix_i)\\ y=b+\sum_i c_i\,\mathrm{sigmoid}\big(b_i+\textstyle\sum_j w_{ij}x_j\big)

      i indexes the sigmoid functions; j indexes the features; \mathrm{sigmoid}(\cdot)=\sigma(\cdot)
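A minimal sketch of this model's forward pass; the parameter values below are toy choices of my own, not from the notes:

```python
import math

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^-z)
    return 1 / (1 + math.exp(-z))

def forward(x, b, c, b_i, W):
    # y = b + sum_i c_i * sigmoid(b_i + sum_j W[i][j] * x[j])
    # x: feature vector (length j); W[i]: weights of sigmoid unit i
    y = b
    for ci, bi, wi in zip(c, b_i, W):
        z = bi + sum(wij * xj for wij, xj in zip(wi, x))
        y += ci * sigmoid(z)
    return y

# two features, two sigmoid units, toy parameters
x = [1.0, 2.0]
y = forward(x, b=0.5, c=[1.0, -1.0], b_i=[0.0, 1.0],
            W=[[0.3, -0.2], [0.1, 0.4]])
print(y)
```

All of b, the c_i, b_i, and w_ij together form the parameter vector θ that the next step optimizes.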

      01-8.png

      01-9.png

      01-10.png

      01-11.png

      so this time Loss = L(\theta)

    • step3: optimization

      • \vec\theta^*=\arg\min_{\vec\theta}L

        • randomly pick initial values \vec\theta_0

        01-12.png (gradient: \vec g=\nabla L(\vec\theta_0))

        • update θ\vec\theta iteratively

          • \vec\theta_1\leftarrow\vec\theta_0-\eta\vec g\\ \vec\theta_2\leftarrow\vec\theta_1-\eta\vec g\\ \ldots
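The vector update above, sketched with NumPy on a toy quadratic loss; the loss function and its hand-derived gradient are illustrative assumptions:

```python
import numpy as np

# Toy loss L(theta) = ||theta - target||^2, gradient derived by hand.
target = np.array([1.0, -2.0, 3.0])
theta = np.zeros(3)   # arbitrarily chosen theta_0
eta = 0.1

for _ in range(200):
    g = 2 * (theta - target)   # gradient vector of L at current theta
    theta = theta - eta * g    # same rule as the scalar case, vectorized

print(np.round(theta, 4))  # approaches [ 1. -2.  3.]
```

The rule is identical to the single-parameter case; the only difference is that θ and g are vectors holding every parameter and every partial derivative at once.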

        01-13.png

        if N = 10000 and batch size = 10, how many updates are there in 1 epoch?

        answer: 1000 updates
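The arithmetic behind this answer: one epoch visits all N examples once, and each batch triggers exactly one parameter update:

```python
# updates per epoch = N / batch_size (assuming N divides evenly)
N = 10000
batch_size = 10
updates_per_epoch = N // batch_size
print(updates_per_epoch)  # 1000
```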

      • sigmoid \to ReLU (Rectified Linear Unit): c\max(0,\,b+wx)

        \text{sigmoid: } y=b+\sum_i c_i\,\mathrm{sigmoid}\big(b_i+\textstyle\sum_j w_{ij}x_j\big)\\ \text{ReLU: } y=b+\sum_{\textcolor{red}{2i}} c_i\max\big(0,\,b_i+\textstyle\sum_j w_{ij}x_j\big)

        which one is better? =>ReLU
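One way to see why the ReLU sum runs over 2i terms: two ReLUs can compose one piecewise-linear "hard sigmoid". A small sketch (the clamp-to-[0, 1] shape is an illustrative choice):

```python
def relu(z):
    # ReLU: max(0, z)
    return max(0.0, z)

def hard_sigmoid(z):
    # Two ReLUs compose one piecewise-linear "hard sigmoid":
    # relu(z) - relu(z - 1) equals z clamped to [0, 1]
    return relu(z) - relu(z - 1)

print(hard_sigmoid(-0.5), hard_sigmoid(0.3), hard_sigmoid(2.0))  # 0.0 0.3 1.0
```

So replacing each sigmoid with a pair of ReLUs keeps the same expressive power, which is why twice as many ReLU terms appear in the formula above.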

      • multiple hidden layers

        Increasing this hyperparameter (the number of hidden layers) can reduce the Loss, but it also increases the complexity of the model.

      • 01-14.png deep means many hidden layers; but why do we want "deep" rather than "fat" (just putting all the neurons in one wide layer)?