ISLR - Statistical Learning


2.1 What Is Statistical Learning?

In essence, statistical learning refers to a set of approaches for estimating f.

2.1.1 Why Estimate f?

There are two main reasons that we may wish to estimate f: prediction and inference.

2.1.2 How Do We Estimate f?

Parametric Methods

Parametric methods involve a two-step model-based approach.

  1. First, we make an assumption about the functional form, or shape, of f. For example, one very simple assumption is that f is linear in X: f(X) = β0 + β1X1 + β2X2 + ... + βpXp. Once we have assumed that f is linear, the problem of estimating f is greatly simplified. Instead of having to estimate an entirely arbitrary p-dimensional function f(X), one only needs to estimate the p + 1 coefficients β0, β1, ..., βp.
  2. After a model has been selected, we need a procedure that uses the training data to fit or train the model. In the case of the linear model, we need to estimate the parameters β0, β1, ..., βp. That is, we want to find values of these parameters such that Y ≈ β0 + β1X1 + β2X2 + ... + βpXp. The most common approach to fitting the model is referred to as (ordinary) least squares. However, least squares is only one of many possible ways to fit the linear model; a small sketch of least-squares fitting follows this list.
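To make the two steps concrete, here is a minimal Python sketch with simulated data and variable names of my own choosing (not from the book): Step 1 assumes a linear form for f, and Step 2 estimates the p + 1 coefficients by ordinary least squares.

```python
import numpy as np

# Simulated data purely for illustration; the true f is taken to be linear.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta_true + rng.normal(scale=0.5, size=n)

# Step 1: assume f is linear, so only the p + 1 coefficients need estimating.
# Step 2: fit those coefficients by ordinary least squares.
X_design = np.column_stack([np.ones(n), X])              # prepend an intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)                                          # estimates of β0, β1, ..., βp
```

Because the whole problem has been reduced to estimating p + 1 numbers, a fairly small training set is enough to fit the model.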

Non-parametric Methods

Non-parametric methods do not make explicit assumptions about the functional form of f. Instead, they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly.

Advantage: by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f. Any parametric approach brings with it the possibility that the functional form used to estimate f is very different from the true f, in which case the resulting model will not fit the data well. In contrast, non-parametric approaches avoid this danger entirely, since essentially no assumption about the form of f is made.

Major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.
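As a rough illustration of a non-parametric fit, the sketch below uses k-nearest neighbors regression (my choice of method; the data and the value of k are made up for illustration). No functional form for f is assumed: the estimate at each point is simply the average response of the k nearest training observations.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Simulated data with a non-linear true f.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)

# No assumption about the shape of f: each prediction averages the
# responses of the 9 closest training points.
knn = KNeighborsRegressor(n_neighbors=9)
knn.fit(x.reshape(-1, 1), y)
y_hat = knn.predict(x.reshape(-1, 1))
```

Choosing a small k gives a rougher, wigglier estimate; a larger k gives a smoother one. Either way, a reasonably accurate estimate of f requires many more observations than the linear model above would need.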

2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability

Why would we ever choose to use a more restrictive method instead of a very flexible approach? If we are mainly interested in inference, restrictive models are much more interpretable. For instance, when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between Y and X1, X2, ..., Xp. In contrast, very flexible approaches, such as splines and boosting methods, can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response.

Least squares linear regression is relatively inflexible but is quite interpretable. The lasso relies upon the linear model but uses an alternative fitting procedure for estimating the coefficients β0, β1, ..., βp. The new procedure is more restrictive in estimating the coefficients, and sets a number of them to exactly zero. Hence in this sense the lasso is a less flexible approach than linear regression. It is also more interpretable than linear regression, because in the final model the response variable will only be related to a small subset of the predictors, namely those with nonzero coefficient estimates. Generalized additive models (GAMs) instead extend the linear model to allow for certain non-linear relationships. Consequently, GAMs are more flexible than linear regression. They are also somewhat less interpretable than linear regression, because the relationship between each predictor and the response is now modeled using a curve. Finally, fully non-linear methods such as bagging, boosting, and support vector machines with non-linear kernels are highly flexible approaches that are harder to interpret.
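The lasso's restrictiveness is easy to see in a quick sketch (simulated data; the penalty strength alpha is an arbitrary illustrative value): with many predictors, the fitted model keeps only a few nonzero coefficients, which is exactly what makes it easy to read.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Ten predictors, but only the first two actually matter.
rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# The L1 penalty shrinks the coefficients and sets many of them exactly to zero.
lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)   # most entries are exactly 0; the fitted model involves only a few predictors
```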

We have established that when inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately; interpretability is not a concern. In this setting, we might expect that it will be best to use the most flexible model available. Surprisingly, this is not always the case! We will often obtain more accurate predictions using a less flexible method. This phenomenon, which may seem counterintuitive at first glance, has to do with the potential for overfitting in highly flexible methods.
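The overfitting point can be illustrated with a small simulation (everything here is made up for illustration): a very flexible degree-15 polynomial follows the training points closely, yet a plain straight line typically predicts new data better when the true relationship really is simple.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Simulated data: the true relationship is a straight line plus noise.
rng = np.random.default_rng(3)
x_train = rng.uniform(-3, 3, 30)
x_test = rng.uniform(-3, 3, 200)
f = lambda x: 1.0 + 0.5 * x
y_train = f(x_train) + rng.normal(scale=1.0, size=x_train.size)
y_test = f(x_test) + rng.normal(scale=1.0, size=x_test.size)

# Compare an inflexible fit (degree 1) with a highly flexible one (degree 15)
# on held-out data; the flexible fit chases the training noise.
for degree in (1, 15):
    fit = Polynomial.fit(x_train, y_train, deg=degree)
    test_mse = np.mean((fit(x_test) - y_test) ** 2)
    print(degree, float(test_mse))
```

The flexible polynomial achieves a lower training error but usually a higher test error, which is exactly the overfitting the passage describes.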