
Creating a linear regression model
Simple linear regression is easy to understand, yet it forms the basis of all regression techniques: once its concepts are clear, the other types of regression will be easier to address. To begin with, let's take a real-world example of applying linear regression.
Consider some data collected on a group of bikers, consisting of the following aspects:
- Number of years of use
- Number of kilometers traveled in one year
- Number of falls
Through regression analysis, we find that, on average, as the number of kilometers traveled increases, the number of falls also increases, while as the number of years of motorcycle use (and therefore the rider's experience) increases, the number of falls tends to decrease.
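To make this concrete, here is a minimal sketch in Python that builds an entirely hypothetical version of such a dataset and checks the direction of both relationships; all of the values are invented for illustration only.

```python
import numpy as np

# Entirely hypothetical data for eight bikers (values invented for illustration)
years_of_use = np.array([1, 2, 3, 5, 6, 8, 9, 10])
km_per_year = np.array([12000, 5000, 14000, 7000, 15000, 6000, 13000, 4000])
falls = np.array([7, 2, 7, 2, 6, 0, 3, 0])

# Direction of the relationships described above
print(np.corrcoef(km_per_year, falls)[0, 1])   # positive: more km, more falls
print(np.corrcoef(years_of_use, falls)[0, 1])  # negative: more experience, fewer falls
```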
The linear regression method consists of identifying a line that is capable of representing the distribution of points in a two-dimensional plane: if the points corresponding to the observations lie near this line, then the chosen model describes the link between the variables effectively.
In theory, there are an infinite number of lines that may approximate the observations, while in practice, there is only one mathematical model that optimizes the representation of the data. In the case of a linear mathematical relationship, the observations of the variable y can be obtained by a linear function of the observations of the variable x. For each observation, we will have the following:

$$y_i = \alpha x_i + \beta$$
In the preceding formula, x is the explanatory variable and y is the response variable. The parameters α and β, which represent the slope of the line and the intercept with the y-axis, respectively, must be estimated based on the observations collected for the two variables included in the model.
The slope α is of particular interest: it represents the variation of the mean response for every unit increment of the explanatory variable. What does a change in this coefficient tell us? If the slope is positive, the regression line increases from left to right; if the slope is negative, the line decreases from left to right. When the slope is zero, the explanatory variable has no effect on the value of the response. However, it is not just the sign of α that establishes the weight of the relationship between the variables; its magnitude is also important. In the case of a positive slope, the mean response is higher when the explanatory variable is higher, while in the case of a negative slope, the mean response is lower when the explanatory variable is higher.
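As a small numeric illustration (the coefficient values below are invented), evaluating the line at a few values of x shows that each unit increase in the explanatory variable changes the mean response by exactly α:

```python
# Invented coefficients, used only to illustrate the role of the slope
alpha, beta = 0.5, 1.0           # positive slope, intercept

for x in (4, 5, 6):
    y = alpha * x + beta         # mean response predicted by the line
    print(x, y)                  # each step of 1 in x raises y by alpha = 0.5
```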
If we have a set of observations in the form (x1, y1), (x2, y2), ... (xn, yn), for each of these pairs, we can write an equation. In this way, we get a system of linear equations. We can represent this system in matrix form, as follows:

$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \end{bmatrix}
$$
We will name the terms contained in this formula as follows:

$$
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix}, \quad
K = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}
$$
This can be expressed using a condensed formulation:

$$Y = XK$$
This represents a system of linear equations, and to locate the least squares solution, we will resolve the following equation:

$$K = (X^T X)^{-1} X^T Y$$
In the previous equation, there are three mathematical operations involving matrices: transpose, inverse, and matrix multiplication.
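As a sketch of how this computation looks in practice, the following Python code builds the design matrix X from the hypothetical kilometers/falls data used earlier (here with kilometers expressed in thousands) and applies exactly these three operations with NumPy:

```python
import numpy as np

# Hypothetical observations: x = km traveled (thousands), y = number of falls
x = np.array([12.0, 5.0, 14.0, 7.0, 15.0, 6.0, 13.0, 4.0])
y = np.array([7.0, 2.0, 7.0, 2.0, 6.0, 0.0, 3.0, 0.0])

# Design matrix X: a column for x and a column of ones for the intercept
X = np.column_stack([x, np.ones_like(x)])

# K = (X^T X)^-1 X^T Y: transpose, inverse, and matrix multiplication
K = np.linalg.inv(X.T @ X) @ X.T @ y
alpha, beta = K
print(f"slope alpha = {alpha:.4f}, intercept beta = {beta:.4f}")
```

In real code, np.linalg.lstsq(X, y, rcond=None) is usually preferred over forming the inverse explicitly, as it is more numerically stable; the explicit form above simply mirrors the formula.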
But how does least squares regression work? In the least squares method, the coefficients are estimated by determining numerical values that minimize the sum of the squared deviations between the observed responses and the fitted responses.
As we said, given n points (x1, y1), (x2, y2), ... (xn, yn) in the observed population, a least squares regression line is defined as follows:

$$\hat{y} = \alpha x + \beta$$

This is the equation of the line for which the following quantity is minimal:

$$\sum_{i=1}^{n} \left( y_i - (\alpha x_i + \beta) \right)^2$$
This quantity represents the sum of the squares of the vertical distances of each experimental datum (xi, yi) from the corresponding point on the straight line.

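A quick way to convince ourselves of this minimality is to compute the sum of squares for the fitted coefficients and for slightly perturbed ones; the sketch below reuses the invented data from the earlier snippet, and any perturbation makes the sum larger:

```python
import numpy as np

# Invented data and least squares fit, as in the earlier sketch
x = np.array([12.0, 5.0, 14.0, 7.0, 15.0, 6.0, 13.0, 4.0])
y = np.array([7.0, 2.0, 7.0, 2.0, 6.0, 0.0, 3.0, 0.0])
X = np.column_stack([x, np.ones_like(x)])
(alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)

# Sum of squared vertical distances for a candidate line y = a*x + b
def sse(a, b):
    return np.sum((y - (a * x + b)) ** 2)

print(sse(alpha, beta))         # minimal, by construction
print(sse(alpha + 0.1, beta))   # larger: the slope was perturbed
print(sse(alpha, beta + 0.5))   # larger: the intercept was perturbed
```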
To understand this concept, it is easier to draw the distances between these points and the line, formally called residuals, for a few of the data points. Once the coefficients are obtained, calculating the residuals is really simple: the observed minus the estimated values, that is:

$$e_i = y_i - \hat{y}_i$$
A residual is a measure of how well a regression line fits an individual data point. The model is said to fit the data well if the residuals appear to behave randomly; however, the model fits the data poorly if the residuals display a systematic pattern.
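Continuing with the same invented data, the residuals are obtained directly as the observed minus the fitted values; printing them (or plotting them against x) is the usual first check for a systematic pattern:

```python
import numpy as np

# Invented data and least squares fit, as in the earlier sketches
x = np.array([12.0, 5.0, 14.0, 7.0, 15.0, 6.0, 13.0, 4.0])
y = np.array([7.0, 2.0, 7.0, 2.0, 6.0, 0.0, 3.0, 0.0])
X = np.column_stack([x, np.ones_like(x)])
(alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - (alpha * x + beta)   # observed minus estimated values
print(residuals)
print(residuals.sum())  # ~0: with an intercept, residuals always sum to zero
```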