Machine Learning (1): Linear Regression with One Variable

Notes of Andrew Ng’s Machine Learning —— (1) Linear Regression with One Variable

Model Representation

Notation

  • $x^{(i)}$ denotes the input variables, called input features
  • $y^{(i)}$ denotes the output variable, called the target variable
  • $(x^{(i)}, y^{(i)})$ is called a training example
  • $\{(x^{(i)}, y^{(i)});\ i=1, \dots, m\}$ is called a training set

⚠️【Note】The superscript "$^{(i)}$" in the notations is simply an index into the training set.

  • $X$ denotes the space of input values
  • $Y$ denotes the space of output values

In many examples, $X = Y = \mathbb{R}$.

A more formal description of supervised learning

Given a training set, the goal is to learn a function $h: X \to Y$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$.

For historical reasons, this function $h$ is called a hypothesis.

Pictorially:

*(figure: training set → learning algorithm → hypothesis $h$, which maps an input $x$ to a predicted $y$)*

With our new notation, the categories of supervised learning problems can be described as follows:

  • regression problem: the target variable is continuous.
  • classification problem: $y$ can take on only a small number of discrete values.

Hypothesis & Regression Model

To represent the hypothesis $h$, suppose, for example, that we want to fit a data set like this:

*(figure: scatter plot of the training data)*

Simply, we may set $h_\theta(x)=\theta_0+\theta_1x$. This means that we are going to predict that $y$ is a linear function of $x$.

Plotted over the data, it will look something like this:

*(figure: the straight line $h_\theta(x)$ drawn through the data)*

And what this function is doing is predicting that $y$ is some straight-line function of $x$.

In this case, the model is called linear regression with one variable (or univariate linear regression).

P.S. We could also fit more complicated (perhaps non-linear) functions, but the linear case is the simplest.
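As a quick illustration (my own addition, not from the course), here is what this hypothesis looks like in Python; the parameter values are made up:

```python
# hypothetical parameters; in practice they are learned from the data
theta0, theta1 = 0.5, 0.8

def h(x):
    """The hypothesis h_theta(x) = theta_0 + theta_1 * x."""
    return theta0 + theta1 * x

print(h(2.0))  # predicted y for x = 2.0 -> 0.5 + 0.8 * 2.0 = 2.1
```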

Cost Function

Well, now we have got this:

Hypothesis: $h_\theta(x)=\theta_0+\theta_1x$, with parameters $\theta_0$, $\theta_1$.

To get a "good" hypothesis function, we need to choose θ0\theta_0, θ1\theta_1 so that hθ(x)h_\theta(x) is close to yy for our training exmaples (x,y)(x, y). A cost function is of great help with this goal.

Cost function

A cost function helps us measure the accuracy of our hypothesis function.

It takes an average of the squared differences between the hypothesis's results on the inputs x and the actual outputs y:

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]^2 \tag{1}$$

Function $(1)$ is exactly our cost function. Take a look at it:

  • $h_\theta(x^{(i)})-y^{(i)}$ is the difference between the predicted value and the actual value.

  • $\frac{1}{m}\sum^{m}_{i=1}\dots$ gives the mean of the squares of $h_\theta(x^{(i)})-y^{(i)}$.

  • The mean is halved ($\frac{1}{2}$) as a convenience for the computation of gradient descent: differentiating the squared term produces a factor of $2$ that cancels the $\frac{1}{2}$, since $\frac{d}{df}\frac{1}{2}f^2 = f$.

P.S. This function is otherwise called the "Squared error function", or "Mean squared error".
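To make this concrete, here is a minimal sketch of the cost function in Python (my own, not from the notes), assuming the training set is a list of `(x, y)` pairs:

```python
def cost(theta0, theta1, training_set):
    """Squared error cost J(theta0, theta1), cf. equation (1)."""
    m = len(training_set)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in training_set)
    return total / (2 * m)

# a perfect fit gives J = 0; a worse fit gives a larger J
print(cost(0, 1.0, [(1, 1), (2, 2), (3, 3)]))  # 0.0
print(cost(0, 0.5, [(1, 1), (2, 2), (3, 3)]))  # ~0.583
```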

Goal

To make $h_\theta(x)$ close to $y$, we just need to minimize our cost function by adjusting the values of $\theta_0$ and $\theta_1$.

We describe this goal like this:

$$\mathop{minimize}\limits_{\theta_0, \theta_1} J(\theta_0, \theta_1) \tag{2}$$

Or, more directly:

$$\mathop{minimize}\limits_{\theta_0, \theta_1} \frac{1}{2m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]^2 \tag{3}$$

And if we are working with linear regression with one variable, $h_\theta(x)=\theta_0+\theta_1x$.

⚠️【Note】Hypothesis Function & Cost Function

  • $h_\theta(x)$, for fixed $\theta_i$, is a function of $x$.
  • $J(\theta_0, \theta_1)$ is a function of the parameters $\theta_i$.

Cost Function - Intuition I

Let's draw some pictures to better understand what the values of the cost function mean.

To get started, we are going to work with a simplified hypothesis function:

$$h_\theta(x) = \theta_1 x \qquad (\textrm{i.e. } \theta_0 = 0)$$

Our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by $h_\theta(x)$) which passes through these scattered data points.

Our objective is to get the best possible line. The best possible line is the one for which the average squared vertical distance of the scattered points from the line ($\frac{1}{m}\sum^m_{i=1}[h_\theta(x^{(i)})-y^{(i)}]^2$) is the least.

Ideally, the line should pass through all the points of our training data set. In such a case, the value of the cost function $J(\theta_0, \theta_1)$ will be $0$.

E.g. an ideal situation where $J=0$:

Let this be our training set: only three points $(1, 1)$, $(2, 2)$ and $(3, 3)$.

*(figure: the three data points on the x-y plane)*

Now, we try setting $\theta_1$ to different values: $-0.5$, $0$, $0.5$, $1$, $1.5$, ...

When $\theta_1=1$, we get a slope of $1$ which goes through every single data point in our model. Conversely, when $\theta_1=0.5$, we see the vertical distance from our fit to the data points increase:

*(figure: the fitted lines for different values of $\theta_1$)*

By doing this, we get a series of graphs of $h_\theta(x)$ in the x-y plane, which yields the following $\theta_1$-$J(\theta_1)$ graph:

*(figure: $J(\theta_1)$ plotted against $\theta_1$, with its minimum at $\theta_1=1$)*

Thus, as a goal, we should try to minimize the cost function. In this case, $\theta_1=1$ is our global minimum.
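We can check this little example numerically. A sketch of mine that evaluates $J(\theta_1)$ on the three-point training set for each candidate $\theta_1$:

```python
# J(theta1) on the toy training set (1,1), (2,2), (3,3), with theta0 fixed at 0
data = [(1, 1), (2, 2), (3, 3)]
m = len(data)

def J(theta1):
    return sum((theta1 * x - y) ** 2 for x, y in data) / (2 * m)

for theta1 in (-0.5, 0, 0.5, 1, 1.5):
    print(theta1, round(J(theta1), 3))
# -0.5 -> 5.25 | 0 -> 2.333 | 0.5 -> 0.583 | 1 -> 0.0 | 1.5 -> 0.583
# the minimum is indeed at theta1 = 1
```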

Cost Function - Intuition II

Unlike before, this time we won't continue with the simplified hypothesis; we are going to keep both parameters $\theta_0$ and $\theta_1$, so the hypothesis function will be $h_\theta(x)=\theta_0+\theta_1x$.

Here's our problem formulation as usual:

$$\begin{array}{rl} \textrm{Hypothesis: } & h_\theta(x)=\theta_0+\theta_1x\\ \textrm{Parameters: } & \theta_0, \theta_1\\ \textrm{Cost Function: } & J(\theta_0, \theta_1) = \frac{1}{2m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]^2\\ \textrm{Goal: } & \mathop{minimize}\limits_{\theta_0, \theta_1} J(\theta_0, \theta_1) \end{array}$$

Same as last time, we want to understand the hypothesis $h$ and the cost function $J$ via a series of graphs. However, this time we'd like to use a contour plot to describe our $J(\theta_0, \theta_1)$.

A contour plot is a graph that contains many contour lines.

A contour line of a two-variable function has a constant value at all points on the same line.

An example:

*(figure: a hypothesis in the x-y plane, next to the contour plot of $J(\theta_0, \theta_1)$)*

Taking any color and going along the 'circle', one would expect to get the same value of the cost function.

To approach our optimization objective, we can try setting the parameters $\theta_i$ to different values.

When $\theta_0 = 360$ and $\theta_1 = 0$, the value of $J(\theta_0, \theta_1)$ in the contour plot gets closer to the center, thus reducing the cost function error. Now we get a better fit of the data:

*(figure: the improved fit and its position on the contour plot)*

Minimizing the cost function as much as possible, $\theta_1$ and $\theta_0$ tend to be around 0.12 and 250 respectively. Plotting those values on the graph to the right seems to put our point in the center of the innermost 'circle'.

Obviously, we don't want to write software that just draws a contour plot and then try to manually read off the numbers to reach our goal. We want an efficient algorithm that automatically finds the values of $\theta_0$ and $\theta_1$ minimizing the cost function $J$. The gradient descent algorithm that we will talk about next works great on this problem.
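For the curious, a contour plot like the ones above can be drawn in a few lines of Python. This is just a sketch of mine with a synthetic data set I made up, not the data from the course:

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic training set, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 2.5, 3.5, 4.5])
m = len(x)

# evaluate J(theta0, theta1) on a grid of parameter values via broadcasting
t0, t1 = np.meshgrid(np.linspace(-2, 4, 200), np.linspace(-1, 3, 200))
J = ((t0[..., None] + t1[..., None] * x - y) ** 2).sum(axis=-1) / (2 * m)

plt.contour(t0, t1, J, levels=30)
plt.xlabel(r'$\theta_0$')
plt.ylabel(r'$\theta_1$')
plt.show()
```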

Gradient Descent

There is an algorithm called gradient descent for minimizing the cost function $J$. We can use it not only in linear regression; it's actually used all over the place in machine learning.

Let's talk about gradient descent for minimizing some arbitrary function $J$. So here's the problem setup:

Have some function $J(\theta_0, \theta_1)$; want to $\mathop{minimize}\limits_{\theta_0, \theta_1} J(\theta_0, \theta_1)$.

We put $\theta_0$ on the x axis and $\theta_1$ on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup:

*(figure: 3D surface of $J(\theta_0, \theta_1)$ over the $\theta_0$-$\theta_1$ plane)*

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.

Imagine that we are physically standing at a point on a hill. In gradient descent, what we're going to do is spin 360 degrees around, look all around us, and ask: "If I were to take a little step in some direction, and I want to go down the hill as quickly as possible, what direction should I take?" Then we take a step in that direction, and repeat until we converge to a local minimum, like the black lines in the picture above show.

Notice that if we choose different starting points for gradient descent, we may reach different local optima.

Mathematically, this is the definition of the gradient descent algorithm:

Gradient Descent Algorithm

repeat until convergence {

$\qquad\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)\qquad \textrm{(for } j=0 \textrm{ and } j=1\textrm{)}$

}

$\alpha$ is a number called the learning rate. It basically controls how big a step we take downhill with gradient descent.

At each iteration, one should simultaneously update the parameters $\theta_1, \theta_2, \dots, \theta_n$. Updating a specific parameter prior to calculating another one within the same iteration would yield a wrong implementation:

Correct (simultaneous update): $temp_0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1)$, $temp_1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1)$, then $\theta_0 := temp_0$, $\theta_1 := temp_1$.
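In code, the difference is easy to miss. A minimal sketch of mine, using a made-up cost $J = (\theta_0 + \theta_1 - 1)^2$ chosen so the two update orders actually give different results:

```python
alpha = 0.1
theta = [0.0, 0.0]

# toy cost J(t0, t1) = (t0 + t1 - 1)**2; both partials are 2*(t0 + t1 - 1)
def d_theta0(t): return 2 * (t[0] + t[1] - 1)
def d_theta1(t): return 2 * (t[0] + t[1] - 1)

# WRONG: theta[1]'s gradient would see the already-updated theta[0],
# giving theta = [0.2, 0.16] here
# theta[0] = theta[0] - alpha * d_theta0(theta)
# theta[1] = theta[1] - alpha * d_theta1(theta)

# RIGHT: evaluate both gradients at the old theta, then assign together
temp0 = theta[0] - alpha * d_theta0(theta)
temp1 = theta[1] - alpha * d_theta1(theta)
theta[0], theta[1] = temp0, temp1
print(theta)  # [0.2, 0.2]
```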

Gradient Descent Intuition

Let's explore the scenario where we use one parameter $\theta_1$ and plot its cost function to implement gradient descent. Our formula for a single parameter is to repeat until convergence:

$$\theta_1 := \theta_1 - \alpha\frac{d}{d\theta_1}J(\theta_1)$$

Regardless of the sign of the slope $\frac{d}{d\theta_1}J(\theta_1)$, $\theta_1$ eventually converges to its minimum value.

The following graph shows that when the slope is negative, the value of $\theta_1$ increases, and when it is positive, the value of $\theta_1$ decreases:

*(figure: $\theta_1$ moving toward the minimum of $J(\theta_1)$ from both sides)*

On a side note, we should adjust our parameter $\alpha$ to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge, or taking too much time to obtain the minimum value, implies that our step size is wrong.

*(figure: if $\alpha$ is too small, descent is slow; if $\alpha$ is too large, it can overshoot the minimum and fail to converge)*

How does gradient descent converge with a fixed step size $\alpha$? The intuition behind the convergence is that $\frac{d}{d\theta_1}J(\theta_1)$ approaches $0$ as we approach the bottom of our convex function. At the minimum, the derivative will always be $0$, and thus we get:

$$\theta_1 := \theta_1 - \alpha \cdot 0$$
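A tiny numerical sketch of this (my own, on the made-up function $J(\theta_1)=\theta_1^2$): the gradient shrinks as we descend, so the steps shrink too, even though $\alpha$ is fixed:

```python
# J(theta) = theta**2, so dJ/dtheta = 2*theta
theta, alpha = 4.0, 0.3
for step in range(6):
    grad = 2 * theta
    theta -= alpha * grad
    print(step, round(theta, 4), round(alpha * grad, 4))
# each step alpha*grad is smaller than the last, so theta converges
# to 0 with a fixed learning rate and no decay schedule
```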

Gradient Descent for Linear Regression

Now we have learnt gradient descent, the linear regression model, and the squared error cost function. This time, we are going to put gradient descent together with our cost function, which gives us an algorithm for linear regression: fitting a straight line to our data.

This is what we worked out:

Gradient descent algorithm: repeat until convergence { $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)$ }; linear regression model: $h_\theta(x)=\theta_0+\theta_1x$ with cost function $J(\theta_0, \theta_1) = \frac{1}{2m}\sum^m_{i=1}[h_\theta(x^{(i)})-y^{(i)}]^2$.

We are going to apply gradient descent algorithm to minimize our squared error cost function.

The key term is the derivative term:

$$\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) = \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]^2 = \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum^m_{i=1}\left[\theta_0+\theta_1x^{(i)}-y^{(i)}\right]^2$$

$$\begin{array}{rl} j=0: & \frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]\\ j=1: & \frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right] \cdot x^{(i)} \end{array}$$
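If you want to convince yourself the derivatives are right, a quick way (my own sketch, on made-up data) is to compare them against a centered finite-difference approximation:

```python
def J(t0, t1, data):
    m = len(data)
    return sum((t0 + t1 * x - y) ** 2 for x, y in data) / (2 * m)

def grad(t0, t1, data):
    # the analytic partial derivatives derived above
    m = len(data)
    g0 = sum((t0 + t1 * x - y) for x, y in data) / m
    g1 = sum((t0 + t1 * x - y) * x for x, y in data) / m
    return g0, g1

data = [(1, 2.0), (2, 2.5), (3, 3.5)]  # made-up points
t0, t1, eps = 1.0, 1.0, 1e-6
num0 = (J(t0 + eps, t1, data) - J(t0 - eps, t1, data)) / (2 * eps)
num1 = (J(t0, t1 + eps, data) - J(t0, t1 - eps, data)) / (2 * eps)
print(grad(t0, t1, data))  # analytic: (0.333..., 0.833...)
print(num0, num1)          # numeric: matches to ~1e-9
```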

Plug them back into our gradient descent algorithm:

repeat until convergence {

$\qquad\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right]$

$\qquad\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum^m_{i=1}\left[h_\theta(x^{(i)})-y^{(i)}\right] \cdot x^{(i)}$

}

Notice: update $\theta_0$ and $\theta_1$ simultaneously.

The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate. So, this is simply gradient descent on the original cost function J.

This method looks at every example in the entire training set on every step, and is called batch gradient descent.
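For comparison with the pure-Python experiment at the end of this post, here is a minimal vectorized NumPy sketch of batch gradient descent (my own; it uses a fixed iteration count and a made-up $\alpha$ instead of a convergence test):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, iters=10000):
    """Fit h(x) = theta0 + theta1 * x with batch gradient descent."""
    theta0 = theta1 = 0.0
    m = len(x)
    for _ in range(iters):
        err = theta0 + theta1 * x - y  # uses ALL m examples per step
        # err was computed before either assignment, so this is a
        # simultaneous update of theta0 and theta1
        theta0 -= alpha * err.sum() / m
        theta1 -= alpha * (err * x).sum() / m
    return theta0, theta1

x = np.array([0., 1., 2., 3., 4., 5., 6.])
print(batch_gradient_descent(x, y=x))  # ~ (0.0, 1.0) for this toy set
```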

Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges to the global minimum (assuming the learning rate α is not too large). Indeed, J is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function:

*(figure: contours of a quadratic function and the trajectory taken by gradient descent)*

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48, 30). The x's in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through as it converged to its minimum.


Experiment

As a supplement, I wrote a single-variable gradient descent in Python that is very easy to understand but surprisingly slow:

(For brevity, I removed a lot of comments; the full code is in this Gist.)

import math
import random

class LinearRegressionWithOneVariable(object):
    def __init__(self, training_set):
        self.training_set = training_set  # [(x: int, y: int), ...]
        self.theta = [0, 0]
    
    def _hypothesis(self, x, theta):
        return theta[0] + theta[1] * x
    
    def _cost(self, dataset, theta):
        s = 0
        for i in dataset:
            s += (self._hypothesis(i[0], theta) - i[1]) ** 2
        return s / (2 * len(dataset))
    
    def _gradient_descent(self, dataset, starting_theta, learning_rate, epsilon, max_count=4294967296):
        theta = list.copy(starting_theta)
        last_theta = list.copy(starting_theta)
        cost = self._cost(dataset, theta)
        last_cost = cost * 2
        count = 0
        epsilon = abs(epsilon)
        diff = epsilon + 1
        while (diff > epsilon):
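            # strategy of each pass: take one gradient step; if it helped,
            # keep it and grow the learning rate 5x; if it overshot (cost
            # grew, theta went NaN/inf, or an OverflowError was raised),
            # roll theta back and shrink the learning rate 10x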
            count += 1
            if count >= max_count:
                raise RuntimeError("not convergence after {mc} iterations.".format(mc=max_count))
    
            try:
                # compute BOTH gradient sums with the current theta before
                # updating anything, so that theta[0] and theta[1] are
                # updated simultaneously (see the note in "Gradient Descent")
                t_sum_0 = sum((self._hypothesis(i[0], theta) - i[1] for i in dataset))
                t_sum_1 = sum(
                    ((self._hypothesis(i[0], theta) - i[1]) * i[0] for i in dataset)
                    )
                theta[0] = theta[0] - learning_rate * t_sum_0 / len(dataset)
                theta[1] = theta[1] - learning_rate * t_sum_1 / len(dataset)
    
                last_cost = cost
                cost = self._cost(dataset, theta)
    
                if not any((math.isnan(x) or math.isinf(x) for x in theta)) and abs(cost) <= abs(last_cost):
                    diff = max((abs(last_theta[i] - theta[i]) for i in range(len(theta))))
                    last_theta = list.copy(theta)
                    learning_rate += learning_rate * 4
                else:
                    theta = list.copy(last_theta)
                    learning_rate /= 10
                    if (learning_rate == 0):
                        learning_rate = self._get_learning_rate(self.training_set)
    
            except OverflowError:
                theta = list.copy(last_theta)
                learning_rate /= 10
                if (learning_rate == 0):
                    learning_rate = self._get_learning_rate(self.training_set)


        return theta, count
    
    def _get_learning_rate(self, dataset):
        return 1 / max((i[1] for i in dataset))
    
    def regress(self, epsilon=1, learning_rate=0, verbose=False):
        init_theta = [random.random(), random.random()]
    
        if learning_rate == 0:
            learning_rate = self._get_learning_rate(self.training_set)
        
        self.theta, count = self._gradient_descent(self.training_set, init_theta, learning_rate, epsilon)
        
        if verbose:
            print('Gradient descent finished after {count} iterations:\ntheta_0:\t{t0}\ntheta_1:\t{t1}\ncost:\t\t{cost}\nhypothesis:\th(x) = {t0} + {t1}*x'.format(
                    count=count, t0=self.theta[0], t1=self.theta[1], cost=self._cost(self.training_set, self.theta)))
    
    def hypothesis(self, x):
        return self._hypothesis(x, self.theta)

Usage example:

    >>> model = LinearRegressionWithOneVariable([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)])
    >>> model.regress(verbose=True, epsilon=0.0001)
    theta_0:        0.025486336182825836
    theta_1:        0.9940305813471573
    cost:           9.99815680487604e-05
    hypothesis:     h(x) = 0.025486336182825836 + 0.9940305813471573*x
    >>> model.hypothesis(10)
    9.9657921496544