Keras 2.x Projects

Exploratory analysis

Before starting with data analysis through multiple linear regression, we conduct an exploratory analysis to understand how the data is distributed and extract preliminary knowledge:

To display the first 20 rows of the DataFrame that has been imported, we can use the head() function, as follows:

print(data.head(20))

The following results are returned:

The head() function, with no arguments, gets the first five rows of data from the DataFrame.
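As a quick check of this default, consider a minimal sketch on a small, made-up DataFrame (not the Boston data):

```python
import pandas as pd

# A tiny, hypothetical DataFrame used only to illustrate head()
df = pd.DataFrame({'x': range(10)})

# With no arguments, head() returns the first five rows
print(df.head())

# An explicit argument overrides the default
print(df.head(3))
```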

Now, the dataset is available in our Python environment. To extract further information, we can invoke the info() function, as follows:

print(data.info())

The info() method prints a concise summary of a DataFrame, including the index dtype and column dtypes, non-null values, and memory usage.

The following results are returned:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim 506 non-null float64
zn 506 non-null float64
indus 506 non-null float64
chas 506 non-null int64
nox 506 non-null float64
rm 506 non-null float64
age 506 non-null float64
dis 506 non-null float64
rad 506 non-null int64
tax 506 non-null float64
ptratio 506 non-null float64
black 506 non-null float64
lstat 506 non-null float64
medv 506 non-null float64
dtypes: float64(12), int64(2)
memory usage: 55.4 KB
None

A series of additional information is returned for all the variables contained in the dataset. To get a preview of the data contained in it, we can calculate a series of basic statistics.

To do so, we will use the describe() function in the following way:

summary = data.describe()
summary = summary.transpose()
print(summary)

We have simply transposed the results to make printing on the screen easier.

The following results are returned:

The describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding not a number (NaN) values. It analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided.

In the preceding screenshot, we can see that the variables have different ranges. When the predictors have different ranges, a feature with a larger numeric range can affect the response variable more than one with a smaller range, which in turn can impair the prediction's accuracy. Our goal is to improve predictive accuracy, not to let a particular feature dominate the prediction because of its large numeric range. We may therefore need to scale the values of the different features so that they fall within a common range. Through this statistical procedure, it is possible to compare identical variables belonging to different distributions, as well as different variables or variables expressed in different units.

Remember, it is a good practice to rescale the data before training a regression algorithm. With rescaling, data units are eliminated, allowing you to compare data from different locations easily.

In this case, we will use the min-max method (usually called feature scaling) to get all the scaled data in the range [0, 1]. The formula to achieve this is as follows:

x_scaled = (x - min(x)) / (max(x) - min(x))
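A minimal sketch of this formula in plain Python (assuming the standard min-max definition, with the maximum strictly greater than the minimum):

```python
def min_max_scale(values):
    # x' = (x - min(x)) / (max(x) - min(x)); assumes max > min
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([2.0, 4.0, 6.0]))  # → [0.0, 0.5, 1.0]
```

Note that every scaled value falls in [0, 1], with the minimum mapped to 0 and the maximum to 1.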

To perform feature scaling, we can use the preprocessing package available in the sklearn library. The sklearn library is a free software machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, including support vector machines (SVMs), random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Remember, to import a library that is not present in the initial distribution of Python, you must use the pip install command, followed by the name of the library. This command should be used only once and not every time you run the code.

The sklearn.preprocessing package provides several common utility functions and transformer classes to modify the features available in a representation that best suits our needs. We will begin using the following steps:

As always, we start by importing the package:

from sklearn.preprocessing import MinMaxScaler

To scale each feature to a given range, in our case between zero and one, the MinMaxScaler class can be used.

Let's start by defining the scaler object:

scaler = MinMaxScaler()

Now, just to confirm what we are going to do, we print the parameters that will be used for the upcoming scaling:

print(scaler.fit(data))

The fit method computes the minimum and maximum that is to be used for later scaling. The result is as follows:

MinMaxScaler(copy=True, feature_range=(0, 1))

Now, we can scale the features:

DataScaled = scaler.fit_transform(data)

The fit_transform method fits to the data and then transforms it.
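To make the fit/transform split concrete, here is a minimal, hypothetical re-implementation of the two steps (assuming the default feature_range of (0, 1)); it is a sketch for illustration, not the actual sklearn code:

```python
import numpy as np

class TinyMinMaxScaler:
    """A toy stand-in for MinMaxScaler, for illustration only."""

    def fit(self, X):
        # fit() learns the per-column minimum and maximum
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self

    def transform(self, X):
        # transform() applies (x - min) / (max - min) column by column
        return (X - self.min_) / (self.max_ - self.min_)

    def fit_transform(self, X):
        # fit_transform() simply chains the two steps
        return self.fit(X).transform(X)

# A small, made-up array (not the Boston data)
X = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 40.0]])
print(TinyMinMaxScaler().fit_transform(X))
```

Keeping fit and transform separate matters when the same minima and maxima, learned on training data, must later be applied to unseen data.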

A NumPy array is returned. It is advisable to report the results in the starting format (a pandas DataFrame), at least for comparison purposes.

Let's do this using the following code block:

DataScaled = pd.DataFrame(DataScaled, columns=BHNames)

To verify that the transformation was carried out, we will print the basic statistics that we had already calculated previously:

summary = DataScaled.describe()
summary = summary.transpose()
print(summary)

The following results are returned:

With reference to the preceding screenshot, every variable now falls within the range between 0 and 1. We will now move on to a visual analysis. For example, we can plot a boxplot of each variable.

A boxplot, which is also referred to as a box-and-whisker chart, is a graphical representation that's used to describe the distribution of a sample by simple dispersion and position indexes. A boxplot can be represented either horizontally or vertically by means of a rectangular partition divided by two segments. The rectangle (box) is delimited by the first quartile (25th percentile) and the third quartile (75th percentile), and divided by the median (50th percentile), as shown in the following diagram:

The segments outside the box are the lower and the upper whiskers. By default, each whisker extends up to 1.5 times the interquartile range from the top or bottom of the box, to the furthest datum within that distance. In this way, the four equally populated ranges delineated by the quartiles are graphically represented. To plot a boxplot in Python, we can use the matplotlib library.
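The quantities a boxplot encodes can be computed by hand. The following sketch, on a small made-up sample, applies the default 1.5 * IQR whisker rule described above:

```python
import numpy as np

# A small, hypothetical sample with one extreme value
sample = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 12.0])

q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Whiskers extend to the furthest datum within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
lower_whisker = sample[sample >= lower_fence].min()
upper_whisker = sample[sample <= upper_fence].max()

# Points beyond the fences are drawn individually as outliers
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(q1, median, q3)            # box edges and median line
print(lower_whisker, upper_whisker)
print(outliers)                  # → [12.]
```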

The matplotlib library is a Python 2-D plotting library that produces publication-quality figures in a variety of hard copy formats and interactive environments across platforms. The matplotlib library tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, and so on with just a few lines of code. The matplotlib.pyplot module consists of a collection of command-style functions that make matplotlib work in a similar way to MATLAB. Each pyplot function makes a change to a figure, such as creating a figure, creating a plotting area in a figure, plotting some lines in a plotting area, decorating the plot with labels, and so on.

As always, let's start by importing the library into Python:

import matplotlib.pyplot as plt

The available data is in pandas DataFrame format. For this reason, we can use the pandas.DataFrame.boxplot function. This function makes a box plot from the DataFrame columns, which are optionally grouped by some other columns:

boxplot = DataScaled.boxplot(column=BHNames)
plt.show()

Finally, to display the plot on the screen, the plt.show() function will be used.

This function displays all figures and blocks until the figures have been closed. In the following diagram, the boxplots of all the variables contained in the DataScaled DataFrame are shown:

From the analysis of the previous diagram, we can note that several variables have outliers, with the crim variable being the one that has the largest number.

Outlier values are numerically different from the rest of the collected data. Statistics derived from samples containing outliers can be misleading.
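As a first way of quantifying this, the 1.5 * IQR rule used by the boxplot whiskers can also count outliers per column directly in pandas. A sketch on a tiny made-up DataFrame (not the Boston data):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 2.0, 3.0, 100.0],     # 100.0 is far from the rest
    'b': [10.0, 11.0, 12.0, 13.0, 14.0],  # no outliers
})

q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# True where a value lies beyond the 1.5 * IQR fences
is_outlier = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
print(is_outlier.sum())  # outlier count per column
```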

In the following chapters, we'll look at how we can handle this problem. Furthermore, from the analysis of the previous diagram, we can see that there are many predictors; often, this creates problems rather than helping us. We can therefore find which of the available predictors are most correlated with the response variable. A standardized measure of the relationship between two variables is given by correlation, which can be calculated starting from covariance. In Python, correlation coefficients are calculated by the pandas.DataFrame.corr() function; it computes the pairwise correlation of columns, excluding NA/null values. Three methods are available, namely:

  • pearson (standard correlation coefficient)
  • kendall (Kendall Tau correlation coefficient)
  • spearman (Spearman rank correlation)

Remember, the correlation coefficient of two random variables is a measure of their linear dependence.

In the following code block, we calculate the correlation coefficients for the DataScaled DataFrame:

CorData = DataScaled.corr(method='pearson')

To display all DataFrame columns on the screen, we can use option_context with one or more options:

with pd.option_context('display.max_rows', None,
                       'display.max_columns', CorData.shape[1]):
    print(CorData)

In the following screenshot, we can see the results:

Due to the large number of variables, the obtained matrix is not easily interpretable. To overcome this inconvenience, we can plot a correlogram. A correlogram is a graph of a correlation matrix. It is very useful to highlight the most correlated variables in a data table. In this plot, correlation coefficients are colored according to their value. A correlation matrix can also be reordered according to the degree of association between variables. We can plot a correlogram in Python using the matplotlib.pyplot.matshow() function. This function displays a DataFrame as a matrix in a new figure window. The origin is set at the upper left-hand corner, and rows (first dimension of the array) are displayed horizontally. The aspect ratio of the figure window is that of the array, unless this would make an excessively short or narrow figure. Tick labels for the x-axis are placed on top. Let's view these steps in the following code block:

plt.matshow(CorData)
plt.xticks(range(len(CorData.columns)), CorData.columns)
plt.yticks(range(len(CorData.columns)), CorData.columns)
plt.colorbar()
plt.show()

From the preceding code block, we know that plt.xticks and plt.yticks set the tick locations and labels of the x-axis and y-axis, and plt.colorbar() adds a colorbar to the plot. Finally, plt.show() displays the plot on the screen. The correlogram is shown in the following diagram:

As we are interested in the existing relationship between the response variable (medv) and predictors, we will only analyze the last line of the correlation matrix. In it, we can see the predictors that are most closely related, namely rm, lstat, and ptratio. Indeed, these variables have colors that approach the extremes of the color label (the different color is due to the positive or negative correlation, as shown in the color label in the right-hand part of the plot).
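The same reading of the last row can be done numerically: sort the response column of the correlation matrix by absolute value and the strongest predictors come first. A sketch with a tiny, hypothetical matrix standing in for CorData (illustrative values only):

```python
import pandas as pd

# Hypothetical 3 x 3 correlation matrix (made-up coefficients)
corr = pd.DataFrame(
    [[ 1.00, -0.74,  0.70],
     [-0.74,  1.00, -0.61],
     [ 0.70, -0.61,  1.00]],
    index=['medv', 'lstat', 'rm'],
    columns=['medv', 'lstat', 'rm'],
)

# Drop the trivial self-correlation, then rank by magnitude;
# the sign gives the direction, the magnitude the strength
ranked = corr['medv'].drop('medv').abs().sort_values(ascending=False)
print(ranked)
```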