Creates a linear regression model
Category: Machine Learning / Initialize Model / Regression
Note
Applies to: Machine Learning Studio
This content pertains only to Studio. Similar drag and drop modules have been added to the visual interface in Machine Learning service. Learn more in this article comparing the two versions.
Module overview
This article describes how to use the Linear Regression module in Azure Machine Learning Studio to create a linear regression model for use in an experiment. Linear regression attempts to establish a linear relationship between one or more independent variables and a numeric outcome, or dependent variable.
You use this module to define a linear regression method, and then train a model using a labeled dataset. The trained model can then be used to make predictions. Alternatively, the untrained model can be passed to Cross-Validate Model for cross-validation against a labeled data set.
More about linear regression
Linear regression is a common statistical method, which has been adopted in machine learning and enhanced with many new methods for fitting the line and measuring error. In the most basic sense, regression refers to prediction of a numeric target. Linear regression is still a good choice when you want a very simple model for a basic predictive task. Linear regression also tends to work well on high-dimensional, sparse data sets lacking complexity.
Azure Machine Learning Studio supports a variety of regression models, in addition to linear regression. However, the term 'regression' can be interpreted loosely, and some types of regression provided in other tools are not supported in Studio.
- The classic regression problem involves a single independent variable and a dependent variable. This is called simple regression. This module supports simple regression.
- Multiple linear regression involves two or more independent variables that contribute to a single dependent variable. Problems in which multiple inputs are used to predict a single numeric outcome are also called multivariate linear regression. The Linear Regression module can solve these problems, as can most of the other regression modules in Studio.
- Multi-label regression is the task of predicting multiple dependent variables within a single model. For example, in multi-label logistic regression, a sample can be assigned to multiple different labels. (This is different from the task of predicting multiple levels within a single class variable.) This type of regression is not supported in Azure Machine Learning. To predict multiple variables, create a separate learner for each output that you wish to predict.
For years statisticians have been developing increasingly advanced methods for regression. This is true even for linear regression. This module supports two methods to measure error and fit the regression line: ordinary least squares and gradient descent. A short code sketch comparing the two follows the list below.
- Gradient descent is a method that minimizes the amount of error at each step of the model training process. There are many variations on gradient descent and its optimization for various learning problems has been extensively studied. If you choose this option for Solution method, you can set a variety of parameters to control the step size, learning rate, and so forth. This option also supports use of an integrated parameter sweep.
- Ordinary least squares is one of the most commonly used techniques in linear regression. For example, least squares is the method that is used in the Analysis ToolPak for Microsoft Excel. Ordinary least squares refers to the loss function, which computes error as the sum of the squared distances from the actual values to the predicted line, and fits the model by minimizing this squared error. This method assumes a strong linear relationship between the inputs and the dependent variable.
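Neither fitting method is specific to Studio. The following minimal NumPy sketch (synthetic data, illustrative step size and iteration count are assumptions) contrasts the closed-form least-squares solution with a plain gradient-descent fit of the same linear model.

```python
import numpy as np

# Toy data: y = 2*x1 - 3*x2 + 1 + noise (synthetic, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1 + 0.1 * rng.normal(size=200)
Xb = np.hstack([X, np.ones((200, 1))])        # add an intercept column

# Ordinary least squares: solve the normal equations in closed form
w_ols = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Gradient descent: repeatedly step against the gradient of the mean squared error
w_gd = np.zeros(3)
lr = 0.05                                     # illustrative learning rate
for _ in range(2000):
    grad = 2 / len(y) * Xb.T @ (Xb @ w_gd - y)
    w_gd -= lr * grad

print(w_ols)   # approximately [ 2, -3,  1]
print(w_gd)    # converges to nearly the same coefficients
```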
How to configure Linear Regression
This module supports two methods for fitting a regression model, with very different options:
- Gradient descent is a better choice for models that are more complex, or that have too little training data given the number of variables. This option also supports a parameter sweep, if you train the model using Tune Model Hyperparameters to automatically optimize the model parameters.
- For small datasets, it is best to select ordinary least squares. This should give very similar results to Excel.
Create a regression model using ordinary least squares
- Add the Linear Regression Model module to your experiment in Studio. You can find this module in the Machine Learning category. Expand Initialize Model, expand Regression, and then drag the Linear Regression Model module to your experiment.
- In the Properties pane, in the Solution method dropdown list, select Ordinary Least Squares. This option specifies the computation method used to find the regression line.
- In L2 regularization weight, type the value to use as the weight for L2 regularization. We recommend that you use a non-zero value to avoid overfitting. To learn more about how regularization affects model fitting, see this article: L1 and L2 Regularization for Machine Learning.
- Select the option, Include intercept term, if you want to view the term for the intercept. Deselect this option if you don't need to review the regression formula.
- For Random number seed, you can optionally type a value to seed the random number generator used by the model. Using a seed value is useful if you want to maintain the same results across different runs of the same experiment. Otherwise, the default is to use a value from the system clock.
- Deselect the option, Allow unknown categorical levels, if you want missing values to raise an error. If this option is selected, an additional level is created for each categorical column. Any levels in the test dataset that were not present in the training dataset are mapped to this additional level.
- Add the Train Model module to your experiment, and connect a labeled dataset.
- Run the experiment.
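The steps above are performed in the Studio UI, but a rough code analogue may help clarify what the module computes. The sketch below assumes scikit-learn: Ridge with a small alpha stands in for ordinary least squares with an L2 regularization weight, and fit_intercept mirrors Include intercept term. The dataset and parameter values are illustrative only, not the module's implementation.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

# Synthetic labeled dataset standing in for the experiment's training data
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Ridge ~ ordinary least squares with an L2 regularization weight;
# fit_intercept plays the role of "Include intercept term"
model = Ridge(alpha=0.001, fit_intercept=True)
model.fit(X_train, y_train)                # analogous to Train Model
print(model.coef_, model.intercept_)       # analogous to Visualize on the trainer output
print(model.score(X_test, y_test))         # analogous to Score Model plus evaluation
```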
Results for ordinary least squares model
After training is complete:
- To view the model's parameters, right-click the trainer output and select Visualize.
- To make predictions, connect the trained model to the Score Model module, along with a dataset of new values.
- To perform cross-validation against a labeled data set, connect the untrained model to Cross-Validate Model.
Create a regression model using online gradient descent
- Add the Linear Regression Model module to your experiment in Studio. You can find this module in the Machine Learning category. Expand Initialize Model, expand Regression, and drag the Linear Regression Model module to your experiment.
- In the Properties pane, in the Solution method dropdown list, choose Online Gradient Descent as the computation method used to find the regression line.
- For Create trainer mode, indicate whether you want to train the model with a predefined set of parameters, or if you want to optimize the model by using a parameter sweep.
- Single Parameter: If you know how you want to configure the linear regression network, you can provide a specific set of values as arguments.
- Parameter Range: If you want the algorithm to find the best parameters for you, set Create trainer mode option to Parameter Range. You can then specify multiple values for the algorithm to try.
- For Learning rate, specify the initial learning rate for the stochastic gradient descent optimizer.
- For Number of training epochs, type a value that indicates how many times the algorithm should iterate through examples. For datasets with a small number of examples, this number should be large to reach convergence.
- Normalize features: If you have already normalized the numeric data used to train the model, you can deselect this option. By default, the module normalizes all numeric inputs to a range between 0 and 1. Note: Remember to apply the same normalization method to new data used for scoring.
- In L2 regularization weight, type the value to use as the weight for L2 regularization. We recommend that you use a non-zero value to avoid overfitting. To learn more about how regularization affects model fitting, see this article: L1 and L2 Regularization for Machine Learning.
- Select the option, Average final hypothesis, to average the final hypothesis. In regression models, hypothesis testing means using some statistic to evaluate the probability of the null hypothesis, which states that there is no linear correlation between a dependent and independent variable. In many regression problems, you must test a hypothesis involving more than one variable. This option is enabled by default, meaning the algorithm tests a combination of the parameters where two or more parameters are involved.
- Select the option, Decrease learning rate, if you want the learning rate to decrease as iterations progress.
- For Random number seed, you can optionally type a value to seed the random number generator used by the model. Using a seed value is useful if you want to maintain the same results across different runs of the same experiment.
- Deselect the option, Allow unknown categorical levels, if you want missing values to raise an error. When this option is selected, an additional level is created for each categorical column. Any levels in the test dataset not present in the training dataset are mapped to this additional level.
- Add a labeled dataset and one of the training modules. If you are not using a parameter sweep, use the Train Model module. To have the algorithm find the best parameters for you, train the model using Tune Model Hyperparameters. Note: If you configure the model with specific values using the Single Parameter option and then switch to the Parameter Range option, the model is trained using the minimum value in the range for each parameter. Conversely, if you configure specific settings when you create the model but select the Parameter Range option, the model is trained using the default values for the learner as the range of values to sweep over.
- Run the experiment.
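As with the ordinary least squares procedure, the configuration lives in the Studio UI. As a rough analogue only, the scikit-learn SGDRegressor sketch below maps the module's gradient-descent settings onto loosely corresponding arguments; the mapping and the values are assumptions for illustration, not the module's implementation.

```python
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# MinMaxScaler mirrors "Normalize features" (inputs scaled to [0, 1]).
# The SGDRegressor arguments loosely mirror the module's parameters:
#   eta0 ~ Learning rate, max_iter ~ Number of training epochs,
#   alpha ~ L2 regularization weight, learning_rate="invscaling" ~ Decrease learning rate,
#   average=True ~ Average final hypothesis, random_state ~ Random number seed.
model = make_pipeline(
    MinMaxScaler(),
    SGDRegressor(loss="squared_error", penalty="l2", alpha=0.001,
                 eta0=0.1, learning_rate="invscaling",
                 max_iter=10, tol=None,        # run exactly 10 passes over the data
                 average=True, random_state=123),
)
model.fit(X, y)                                # analogous to Train Model
print(model.predict(X[:5]))                    # analogous to Score Model
```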
Results for online gradient descent
After training is complete:
- To make predictions, connect the trained model to the Score Model module, together with new input data.
- To perform cross-validation against a labeled data set, connect the untrained model to Cross-Validate Model.
Examples
For examples of regression models, see these sample experiments in the Azure AI Gallery:
- Compare Regressors: Contrasts several different kinds of regression models.
- Cross Validation for Regression: Demonstrates linear regression using ordinary least squares.
- Twitter sentiment analysis: Uses several different regression models to generate predicted ratings.
Technical notes
This section contains implementation details, tips, and answers to frequently asked questions.
Usage tips
Many tools support creation of linear regression models, ranging from the simple to the complex. For example, you can easily perform linear regression in Excel, using the Analysis ToolPak, or you can code your own regression algorithm using R, Python, or C#.
However, because linear regression is a well-established technique that is supported by many different tools, there are many different interpretations and implementations. Not all types of models are supported equally by all tools. There are also some differences in nomenclature to observe.
- Regression methods are often categorized by the number of predictor and response variables. For example, multiple linear regression refers to a model with several predictor variables and a single response.
- In Matlab, multivariate regression refers to a model that has multiple response variables.
- In Azure Machine Learning, regression models support a single response variable.
- In the R language, the features provided for linear regression depend on the package you are using. For example, the glm function will give you the ability to create a logistic regression model with multiple independent variables. In general, Azure Machine Learning Studio provides the same functionality as the R glm function.
We recommend that you use this module, Linear Regression, for typical regression problems.
In contrast, if you are using multiple variables to predict a class value, we recommend the Two-Class Logistic Regression or Multiclass Logistic Regression modules.
If you want to use other linear regression packages that are available for the R language, we recommend that you use the Execute R Script module and call the lm or glm functions, which are included in the runtime environment of Azure Machine Learning Studio.
Module parameters
Name | Range | Type | Default | Description |
---|---|---|---|---|
Normalize features | any | Boolean | true | Indicate whether instances should be normalized |
Average final hypothesis | any | Boolean | true | Indicate whether the final hypothesis should be averaged |
Learning rate | >=double.Epsilon | Float | 0.1 | Specify the initial learning rate for the stochastic gradient descent optimizer |
Number of training epochs | >=0 | Integer | 10 | Specify how many times the algorithm should iterate through examples. For datasets with a small number of examples, this number should be large to reach convergence. |
Decrease learning rate | Any | Boolean | true | Indicate whether the learning rate should decrease as iterations progress |
L2 regularization weight | >=0.0 | Float | 0.001 | Specify the weight for L2 regularization. Use a non-zero value to avoid overfitting. |
Random number seed | any | Integer | | Specify a value to seed the random number generator used by the model. Leave blank for default. |
Allow unknown categorical levels | any | Boolean | true | Indicate whether an additional level should be created for each categorical column. Any levels in the test dataset not available in the training dataset are mapped to this additional level. |
Include intercept term | Any | Boolean | True | Indicate whether an additional term should be added for the intercept |
Outputs
Name | Type | Description |
---|---|---|
Untrained model | ILearner interface | An untrained regression model |
See also
Regularized least squares
Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.
RLS is used for two main reasons. The first comes up when the number of variables in the linear system exceeds the number of observations. In such settings, the ordinary least-squares problem is ill-posed and is therefore impossible to fit because the associated optimization problem has infinitely many solutions. RLS allows the introduction of further constraints that uniquely determine the solution.
The second reason that RLS is used occurs when the number of variables does not exceed the number of observations, but the learned model suffers from poor generalization. RLS can be used in such cases to improve the generalizability of the model by constraining it at training time. This constraint can either force the solution to be 'sparse' in some way or to reflect other prior knowledge about the problem such as information about correlations between features. A Bayesian understanding of this can be reached by showing that RLS methods are often equivalent to priors on the solution to the least-squares problem.
General formulation
Consider a learning setting given by a probabilistic space $(X \times Y, \rho(X, Y))$, $Y \in \mathbb{R}$. Let $S = \{(x_i, y_i)\}_{i=1}^{n}$ denote a training set of $n$ pairs i.i.d. with respect to $\rho$. Let $V : Y \times \mathbb{R} \to [0; \infty)$ be a loss function. Define $F$ as the space of the functions $f : X \to \mathbb{R}$ such that the expected risk
$$\varepsilon(f) = \int V(y, f(x)) \, d\rho(x, y)$$
is well defined. The main goal is to minimize the expected risk:
$$\inf_{f \in F} \varepsilon(f).$$
Since the problem cannot be solved exactly there is a need to specify how to measure the quality of a solution. A good learning algorithm should provide an estimator with a small risk.
As the joint distribution $\rho$ is typically unknown, the empirical risk is taken. For regularized least squares the square loss function is introduced:
$$\varepsilon(f) = \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2.$$
However, if the functions are from a relatively unconstrained space, such as the set of square-integrable functions on $X$, this approach may overfit the training data, and lead to poor generalization. Thus, it should somehow constrain or penalize the complexity of the function $f$. In RLS, this is accomplished by choosing functions from a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, and adding a regularization term to the objective function, proportional to the norm of the function in $\mathcal{H}$:
$$\inf_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \lVert f \rVert_{\mathcal{H}}^{2}, \qquad \lambda > 0.$$
Kernel formulation
Definition of RKHS
An RKHS can be defined by a symmetric positive-definite kernel function $K(x, z)$ with the reproducing property:
$$\langle K_x, f \rangle_{\mathcal{H}} = f(x),$$
where $K_x(z) = K(x, z)$. The RKHS for a kernel $K$ consists of the completion of the space of functions spanned by $\{K_x \mid x \in X\}$:
$$f(x) = \sum_{i=1}^{n} \alpha_i K_{x_i}(x),$$
where all $\alpha_i$ are real numbers. Some commonly used kernels include the linear kernel, inducing the space of linear functions:
$$K(x, z) = x^{\top} z,$$
the polynomial kernel, inducing the space of polynomial functions of order $d$:
$$K(x, z) = (x^{\top} z + 1)^{d},$$
and the Gaussian kernel:
$$K(x, z) = e^{-\frac{\lVert x - z \rVert^{2}}{\sigma^{2}}}.$$
Note that for an arbitrary loss function $V$, this approach defines a general class of algorithms named Tikhonov regularization. For instance, using the hinge loss leads to the support vector machine algorithm, and using the epsilon-insensitive loss leads to support vector regression.
Arbitrary kernel
The representer theorem guarantees that the solution can be written as:
$$f(x) = \sum_{i=1}^{n} c_i K(x_i, x) \quad \text{for some } c \in \mathbb{R}^{n}.$$
The minimization problem can be expressed as:
$$\min_{c \in \mathbb{R}^{n}} \frac{1}{n} \lVert Y - K c \rVert_{\mathbb{R}^{n}}^{2} + \lambda \lVert f \rVert_{\mathcal{H}}^{2},$$
where, with some abuse of notation, the $i, j$ entry of the kernel matrix $K$ (as opposed to the kernel function $K(\cdot, \cdot)$) is $K_{ij} = K(x_i, x_j)$.
For such a function,
$$\lVert f \rVert_{\mathcal{H}}^{2} = \langle f, f \rangle_{\mathcal{H}} = \Big\langle \sum_{i=1}^{n} c_i K(x_i, \cdot), \sum_{j=1}^{n} c_j K(x_j, \cdot) \Big\rangle_{\mathcal{H}} = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) = c^{\top} K c.$$
The following minimization problem can be obtained:
$$\min_{c \in \mathbb{R}^{n}} \frac{1}{n} \lVert Y - K c \rVert_{\mathbb{R}^{n}}^{2} + \lambda c^{\top} K c.$$
As the sum of convex functions is convex, the solution is unique and its minimum can be found by setting the gradient with respect to $c$ to $0$:
$$-\frac{1}{n} K (Y - K c) + \lambda K c = 0 \;\Rightarrow\; c = (K + \lambda n I)^{-1} Y,$$
where $c \in \mathbb{R}^{n}$.
Complexity
The complexity of training is basically the cost of computing the kernel matrix plus the cost of solving the linear system, which is roughly $O(n^{3})$. The computation of the kernel matrix for the linear or Gaussian kernel is $O(n^{2} d)$. The complexity of testing is $O(n)$.
Prediction
The prediction at a new test point $x_{*}$ is:
$$f(x_{*}) = \sum_{i=1}^{n} c_i K(x_i, x_{*}) = K(X, x_{*})^{\top} c.$$
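A minimal NumPy sketch of the kernel RLS solution and prediction formulas above; the Gaussian kernel, bandwidth, and synthetic 1-D data are illustrative choices, not prescribed by the text.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2), evaluated for all pairs of rows
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

lam, n = 0.1, len(y)
K = gaussian_kernel(X, X)
# c = (K + lambda * n * I)^{-1} Y  -- solve the linear system rather than inverting
c = np.linalg.solve(K + lam * n * np.eye(n), y)

# Prediction at new points: f(x*) = sum_i c_i K(x_i, x*)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
f_new = gaussian_kernel(X_new, X) @ c
print(f_new)   # roughly follows sin(x) at the five test points
```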
Linear kernel
For convenience a vector notation is introduced. Let $X$ be an $n \times d$ matrix, where the rows are input vectors, and $Y$ an $n \times 1$ vector where the entries are the corresponding outputs. In terms of vectors, the kernel matrix can be written as $K = X X^{\top}$. The learning function can be written as:
$$f(x_{*}) = K_{x_{*}} c = x_{*}^{\top} X^{\top} c = x_{*}^{\top} w.$$
Here we define $w = X^{\top} c$. The objective function can be rewritten as:
$$\frac{1}{n} \lVert Y - K c \rVert_{\mathbb{R}^{n}}^{2} + \lambda c^{\top} K c = \frac{1}{n} \lVert Y - X w \rVert_{\mathbb{R}^{n}}^{2} + \lambda \lVert w \rVert_{\mathbb{R}^{d}}^{2}.$$
The first term is the objective function from ordinary least squares (OLS) regression, corresponding to the residual sum of squares. The second term is a regularization term, not present in OLS, which penalizes large $w$ values. As a smooth finite-dimensional problem is considered, standard calculus tools can be applied. In order to minimize the objective function, the gradient is calculated with respect to $w$ and set to zero:
$$\frac{1}{n} X^{\top} (X w - Y) + \lambda w = 0 \;\Rightarrow\; w = (X^{\top} X + \lambda n I)^{-1} X^{\top} Y.$$
This solution closely resembles that of standard linear regression, with an extra term $\lambda n I$. If the assumptions of OLS regression hold, the solution $w = (X^{\top} X)^{-1} X^{\top} Y$, with $\lambda = 0$, is an unbiased estimator, and is the minimum-variance linear unbiased estimator, according to the Gauss–Markov theorem. The term $\lambda n I$ therefore leads to a biased solution; however, it also tends to reduce variance. This is easy to see, as the covariance matrix of the $w$-values is proportional to $(X^{\top} X)^{-1}$, and therefore large values of $\lambda$ will lead to lower variance. Therefore, manipulating $\lambda$ corresponds to trading off bias and variance. For problems with high-variance $w$ estimates, such as cases with relatively small $n$ or with correlated regressors, the optimal prediction accuracy may be obtained by using a nonzero $\lambda$, and thus introducing some bias to reduce variance. Furthermore, it is not uncommon in machine learning to have cases where $n < d$, in which case $X^{\top} X$ is rank-deficient, and a nonzero $\lambda$ is necessary to compute $(X^{\top} X + \lambda n I)^{-1}$.
Complexity
The parameter $\lambda$ controls the invertibility of the matrix $X^{\top} X + \lambda n I$. Several methods can be used to solve the above linear system, Cholesky decomposition being probably the method of choice, since the matrix $X^{\top} X + \lambda n I$ is symmetric and positive definite. The complexity of this method is $O(n d^{2})$ for training and $O(d)$ for testing. The cost $O(n d^{2})$ is essentially that of computing $X^{\top} X$, whereas the inverse computation (or rather the solution of the linear system) is roughly $O(d^{3})$.
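A minimal NumPy sketch of the closed-form solution $w = (X^{\top} X + \lambda n I)^{-1} X^{\top} Y$ derived above, on synthetic data with an illustrative regularization weight.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 0.01
# w = (X^T X + lambda * n * I)^{-1} X^T Y, solved as a linear system
w = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)
print(w)   # close to w_true; a larger lam shrinks the coefficients toward zero
```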
Feature maps and Mercer's theorem
In this section it will be shown how to extend RLS to any kind of reproducing kernel $K$. Instead of the linear kernel, a feature map $\Phi : X \to F$ is considered for some Hilbert space $F$, called the feature space. In this case the kernel is defined as $K(x, x') = \langle \Phi(x), \Phi(x') \rangle_{F}$. The matrix $X$ is now replaced by the new data matrix $\Phi$, where $\Phi_{ij} = \varphi_j(x_i)$, the $j$-th component of $\Phi(x_i)$.
It means that for a given training set $K = \Phi \Phi^{\top}$. Thus, the objective function can be written as:
$$\min_{c \in \mathbb{R}^{n}} \lVert Y - \Phi \Phi^{\top} c \rVert_{\mathbb{R}^{n}}^{2} + \lambda c^{\top} \Phi \Phi^{\top} c.$$
This approach is known as the kernel trick. This technique can significantly simplify the computational operations. If $F$ is high dimensional, computing $\Phi$ may be rather intensive. If the explicit form of the kernel function is known, we just need to compute and store the $n \times n$ kernel matrix $K$.
In fact, the Hilbert space $F$ need not be isomorphic to $\mathbb{R}^{m}$, and can be infinite dimensional. This follows from Mercer's theorem, which states that a continuous, symmetric, positive definite kernel function can be expressed as:
$$K(x, z) = \sum_{i=1}^{\infty} \sigma_i e_i(x) e_i(z),$$
where the $e_i$ form an orthonormal basis for $L^{2}(X)$, and $\sigma_i \in \mathbb{R}$. If a feature map is defined with components $\varphi_i(x) = \sqrt{\sigma_i} e_i(x)$, it follows that $K(x, z) = \langle \varphi(x), \varphi(z) \rangle$. This demonstrates that any kernel can be associated with a feature map, and that RLS generally consists of linear RLS performed in some possibly higher-dimensional feature space. While Mercer's theorem shows how one feature map can be associated with a kernel, in fact multiple feature maps can be associated with a given reproducing kernel. For instance, the map $\varphi(x) = K_x$ satisfies the property $K(x, z) = \langle \varphi(x), \varphi(z) \rangle_{\mathcal{H}}$ for an arbitrary reproducing kernel.
Bayesian interpretation
Least squares can be viewed as a likelihood maximization under an assumption of normally distributed residuals. This is because the exponent of the Gaussian distribution is quadratic in the data, and so is the least-squares objective function. In this framework, the regularization terms of RLS can be understood to be encoding priors on $w$. For instance, Tikhonov regularization corresponds to a normally distributed prior on $w$ that is centered at 0. To see this, first note that the OLS objective is proportional to the log-likelihood function when each sampled $y_i$ is normally distributed around $w^{\top} x_i$. Then observe that a normal prior on $w$ centered at 0 has a log-probability of the form
$$\log P(w) = q - \alpha \sum_{j=1}^{d} w_j^{2},$$
where $q$ and $\alpha$ are constants that depend on the variance of the prior and are independent of $w$. Thus, minimizing the logarithm of the likelihood times the prior is equivalent to minimizing the sum of the OLS loss function and the ridge regression regularization term.
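Making the correspondence explicit under these Gaussian assumptions (a sketch; additive constants are dropped):
$$-\log\bigl[P(Y \mid X, w)\, P(w)\bigr] = \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} (y_i - w^{\top} x_i)^{2} + \alpha \sum_{j=1}^{d} w_j^{2} + \text{const},$$
and rescaling by $2\sigma^{2}/n$, which does not change the minimizer, gives
$$\frac{1}{n} \lVert Y - X w \rVert_{2}^{2} + \lambda \lVert w \rVert_{2}^{2}, \qquad \lambda = \frac{2 \sigma^{2} \alpha}{n},$$
so the maximum a posteriori estimate coincides with the ridge (Tikhonov) solution.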
This gives a more intuitive interpretation for why Tikhonov regularization leads to a unique solution to the least-squares problem: there are infinitely many vectors satisfying the constraints obtained from the data, but since we come to the problem with a prior belief that is normally distributed around the origin, we will end up choosing a solution with this constraint in mind.
Other regularization methods correspond to different priors. See the list below for more details.
Specific examples
Ridge regression (or Tikhonov regularization)
One particularly common choice for the penalty function $R$ is the squared $\ell_2$ norm, i.e.,
$$R(w) = \lVert w \rVert_{2}^{2}, \qquad \min_{w \in \mathbb{R}^{d}} \frac{1}{n} \lVert Y - X w \rVert_{2}^{2} + \lambda \lVert w \rVert_{2}^{2}.$$
The most common names for this are Tikhonov regularization and ridge regression. It admits a closed-form solution for $w$:
$$w = (X^{\top} X + \lambda n I)^{-1} X^{\top} Y.$$
The name ridge regression alludes to the fact that the $\lambda n I$ term adds positive entries along the diagonal 'ridge' of the sample covariance matrix $X^{\top} X$.
When $\lambda = 0$, i.e., in the case of ordinary least squares, the condition $d > n$ causes the sample covariance matrix $X^{\top} X$ not to have full rank, so it cannot be inverted to yield a unique solution. This is why there can be an infinitude of solutions to the ordinary least squares problem when $d > n$. However, when $\lambda > 0$, i.e., when ridge regression is used, the addition of $\lambda n I$ to the sample covariance matrix ensures that all of its eigenvalues will be strictly greater than 0. In other words, it becomes invertible, and the solution becomes unique.
Compared to ordinary least squares, ridge regression is not unbiased. It accepts a little bias to reduce variance and the mean square error, and thereby helps to improve prediction accuracy. Thus, the ridge estimator yields more stable solutions by shrinking coefficients, but suffers from a lack of sensitivity to the data.
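A quick numerical check of the invertibility claim; the sizes and the regularization weight in this NumPy sketch are arbitrary illustrative choices with $d > n$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 5, 10                 # fewer observations than variables: X^T X is rank-deficient
X = rng.normal(size=(n, d))
lam = 0.1

C = X.T @ X
print(np.linalg.matrix_rank(C))                            # at most n = 5, so the 10x10 matrix is singular
print(np.linalg.eigvalsh(C + lam * n * np.eye(d)).min())   # all eigenvalues >= lam * n > 0, hence invertible
```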
Lasso regression
The least absolute shrinkage and selection operator (LASSO) method is another popular choice. In lasso regression, the penalty function is the $\ell_1$ norm, i.e.
$$R(w) = \lVert w \rVert_{1}, \qquad \min_{w \in \mathbb{R}^{d}} \frac{1}{n} \lVert Y - X w \rVert_{2}^{2} + \lambda \lVert w \rVert_{1}.$$
Note that the lasso penalty function is convex but not strictly convex. Unlike Tikhonov regularization, this scheme does not have a convenient closed-form solution: instead, the solution is typically found using quadratic programming or more general convex optimization methods, as well as by specific algorithms such as the least-angle regression algorithm.
An important difference between lasso regression and Tikhonov regularization is that lasso regression forces more entries of $w$ to actually equal 0 than would otherwise occur. In contrast, while Tikhonov regularization forces entries of $w$ to be small, it does not force more of them to be 0 than would be otherwise. Thus, LASSO regularization is more appropriate than Tikhonov regularization in cases in which we expect the number of non-zero entries of $w$ to be small, and Tikhonov regularization is more appropriate when we expect that entries of $w$ will generally be small but not necessarily zero. Which of these regimes is more relevant depends on the specific data set at hand.
Besides the feature selection described above, LASSO has some limitations. Ridge regression provides better accuracy in the case $n > d$ for highly correlated variables.[1] In the other case, $n < d$, LASSO selects at most $n$ variables. Moreover, LASSO tends to select some arbitrary variables from a group of highly correlated samples, so there is no grouping effect.
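A small illustration of the sparsity contrast, assuming scikit-learn's Lasso and Ridge estimators; the data, alpha values, and threshold are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n, d = 80, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [4.0, -3.0, 2.0]              # only 3 of 20 features are relevant
y = X @ w_true + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print((lasso.coef_ == 0).sum())            # many coefficients driven exactly to zero
print((np.abs(ridge.coef_) < 1e-6).sum())  # ridge shrinks but rarely hits exact zero
```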
ℓ0 Penalization
The most extreme way to enforce sparsity is to say that the actual magnitude of the coefficients of $w$ does not matter; rather, the only thing that determines the complexity of $w$ is the number of non-zero entries. This corresponds to setting $R(w)$ to be the $\ell_0$ norm of $w$. This regularization function, while attractive for the sparsity that it guarantees, is very difficult to solve because doing so requires optimization of a function that is not even weakly convex. Lasso regression is the minimal possible relaxation of $\ell_0$ penalization that yields a weakly convex optimization problem.
Elastic net
For any non-negative $\lambda_1$ and $\lambda_2$ the objective has the following form:
$$\min_{w \in \mathbb{R}^{d}} \frac{1}{n} \lVert Y - X w \rVert_{2}^{2} + \lambda_1 \lVert w \rVert_{1} + \lambda_2 \lVert w \rVert_{2}^{2}.$$
Let $\alpha = \frac{\lambda_1}{\lambda_1 + \lambda_2}$; then the solution of the minimization problem can be described as:
$$\min_{w \in \mathbb{R}^{d}} \frac{1}{n} \lVert Y - X w \rVert_{2}^{2} \quad \text{s.t.} \quad (1 - \alpha) \lVert w \rVert_{1} + \alpha \lVert w \rVert_{2}^{2} \le t$$
- for some $t$.
Consider $(1 - \alpha) \lVert w \rVert_{1} + \alpha \lVert w \rVert_{2}^{2}$ as an Elastic Net penalty function.
When $\lambda_1 = 0$, elastic net becomes ridge regression, whereas when $\lambda_2 = 0$ it becomes Lasso. The Elastic Net penalty function doesn't have a first derivative at 0 and it is strictly convex, taking on properties of both lasso regression and ridge regression.
One of the main properties of the Elastic Net is that it can select groups of correlated variables. The difference between the weight vectors of samples $x_i$ and $x_j$ is given by:
$$|w_i(\lambda_1, \lambda_2) - w_j(\lambda_1, \lambda_2)| \le \frac{\sum_{k=1}^{n} |y_k|}{\lambda_2} \sqrt{2 (1 - \rho_{ij})},$$
- where $\rho_{ij} = x_i^{\top} x_j$ is the sample correlation.[2]
If $x_i$ and $x_j$ are highly correlated ($\rho_{ij} \to 1$), the weight vectors are very close. In the case of negatively correlated samples ($\rho_{ij} \to -1$) the samples $-x_j$ can be taken. To summarize, for highly correlated variables the weight vectors tend to be equal, up to a sign in the case of negatively correlated variables.
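A small illustration of the grouping effect, assuming scikit-learn's ElasticNet and Lasso; the synthetic data with two nearly identical features and the parameter values are illustrative only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(4)
n = 100
z = rng.normal(size=n)
# Two almost perfectly correlated features plus one independent feature
X = np.column_stack([z, z + 0.01 * rng.normal(size=n), rng.normal(size=n)])
y = 3 * z + 0.1 * rng.normal(size=n)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(enet.coef_)    # weight is shared across the two correlated features
print(lasso.coef_)   # lasso tends to keep one of the pair and drop the other
```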
Partial list of RLS methods
The following is a list of possible choices of the regularization function $R(\cdot)$, along with the name for each one, the corresponding prior if there is a simple one, and ways for computing the solution to the resulting optimization problem.
Name | Regularization function | Corresponding prior | Methods for solving |
---|---|---|---|
Tikhonov regularization | $\lVert w \rVert_{2}^{2}$ | Normal | Closed form |
Lasso regression | $\lVert w \rVert_{1}$ | Laplace | Proximal gradient descent, least angle regression |
$\ell_0$ penalization | $\lVert w \rVert_{0}$ | – | Forward selection, Backward elimination, use of priors such as spike and slab |
Elastic nets | $\lambda_1 \lVert w \rVert_{1} + \lambda_2 \lVert w \rVert_{2}^{2}$ | – | Proximal gradient descent |
Total variation regularization | – | – | Split–Bregman method, among others |
See also
- Regularization (mathematics)
- Generalization error, one of the reasons regularization is used.
References
- ^ Tibshirani, Robert (1996). 'Regression shrinkage and selection via the lasso' (PDF). Journal of the Royal Statistical Society, Series B. 58 (1): 267–288.
- ^ Zou, Hui; Hastie, Trevor (2005). 'Regularization and Variable Selection via the Elastic Net' (PDF). Journal of the Royal Statistical Society, Series B. 67 (2): 301–320.
External links
- http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf Regularization and Variable Selection via the Elastic Net (presentation)
- Regularized Least Squares and Support Vector Machines (presentation)
- Regularized Least Squares (presentation)