Applied Deep Learning with Python

Explore the Boston housing dataset

  1. Navigate to the Data exploration subtopic in the Jupyter Notebook and run the cell containing df.describe():

This computes various summary statistics, including the mean, standard deviation, minimum, and maximum for each column. The table gives a high-level idea of how everything is distributed. Note that we have taken the transpose of the result by adding a .T to the output; this swaps the rows and columns. Going forward with the analysis, we will specify a set of columns to focus on.
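To see what the transposed summary looks like, here is a minimal sketch using a hypothetical two-column stand-in for the housing DataFrame (the values below are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature stand-in for two of the housing columns
df = pd.DataFrame({'RM': [6.5, 5.9, 7.1, 6.2],
                   'MEDV': [24.0, 21.6, 34.7, 28.1]})

# .T swaps rows and columns, so each row now describes one column:
# count, mean, std, min, 25%, 50%, 75%, max
summary = df.describe().T
print(summary)
```

With the transpose, adding more columns grows the table downward rather than sideways, which is easier to scan for wide DataFrames.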

  1. Run the cell where these "focus columns" are defined:
    cols = ['RM', 'AGE', 'TAX', 'LSTAT', 'MEDV'] 
  1. This subset of columns can be selected from df using square brackets. Display this subset of the DataFrame by running df[cols].head():

Let's recall what each of these columns represents. From the dataset documentation, we have the following:

    • RM: average number of rooms per dwelling
    • AGE: proportion of owner-occupied units built prior to 1940
    • TAX: full-value property-tax rate per $10,000
    • LSTAT: % lower status of the population
    • MEDV: median value of owner-occupied homes in $1000's

To look for patterns in this data, we can start by calculating the pairwise correlations using pd.DataFrame.corr.

  1. Calculate the pairwise correlations for our selected columns by running the cell containing the following code:
   df[cols].corr()

The resulting table shows the correlation score between each pair of columns. Large positive scores indicate a strong positive (that is, in the same direction) correlation. As expected, we see maximum values of 1 on the diagonal.
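As a quick sketch of what .corr() produces, consider a hypothetical five-row DataFrame (the values are made up for illustration, not taken from the housing data):

```python
import pandas as pd

# Hypothetical toy data: TAX rises with AGE, MEDV falls with LSTAT
df = pd.DataFrame({'AGE': [30, 45, 60, 75, 90],
                   'TAX': [200, 300, 390, 500, 610],
                   'LSTAT': [3, 6, 10, 15, 22],
                   'MEDV': [40, 33, 27, 20, 12]})

# Pairwise Pearson correlations; the diagonal is always 1
corr = df.corr()
print(corr.round(2))
```

Columns that move together (AGE and TAX here) show scores near +1, while columns that move in opposite directions (LSTAT and MEDV) show negative scores.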

The Pearson coefficient is defined as the covariance between two variables, divided by the product of their standard deviations:

$$\rho_{XY} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$

The covariance, in turn, is defined as follows:

$$\mathrm{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Here, n is the number of samples, $x_i$ and $y_i$ are the individual samples being summed over, and $\bar{x}$ and $\bar{y}$ are the means of each set.
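This definition can be checked numerically. The sketch below computes the Pearson coefficient by hand on a small made-up sample and compares it against NumPy's built-in np.corrcoef:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

# Covariance: mean of the products of deviations from each mean
cov = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson coefficient: covariance over the product of standard deviations
r = cov / (x.std() * y.std())

# Agrees with NumPy's built-in correlation matrix
print(r, np.corrcoef(x, y)[0, 1])
```

Note that any constant factor from the choice of 1/n versus 1/(n-1) cancels between the numerator and denominator, so the coefficient is the same either way.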

Instead of straining our eyes to look at the preceding table, it's nicer to visualize it with a heatmap. This can be done easily with Seaborn.

  1. Run the next cell to initialize the plotting environment, as discussed earlier in the chapter. Then, to create the heatmap, run the cell containing the following code:
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    ax = sns.heatmap(df[cols].corr(),
                     cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))
    ax.xaxis.tick_top()  # move labels to the top
    plt.savefig('../figures/chapter-1-boston-housing-corr.png',
                bbox_inches='tight', dpi=300)

We call sns.heatmap and pass the pairwise correlation matrix as input. We use a custom color palette here to override the Seaborn default. The function returns a matplotlib.axes object which is referenced by the variable ax. The final figure is then saved as a high-resolution PNG to the figures folder.

  1. For the final step in our dataset exploration exercise, we'll visualize the DataFrame using Seaborn's pairplot function. Run the cell containing the following code:
    sns.pairplot(df[cols],
                 plot_kws={'alpha': 0.6},
                 diag_kws={'bins': 30})

Having previously used a heatmap to visualize a simple overview of the correlations, this plot allows us to see the relationships in far more detail.
Looking at the histograms on the diagonal, we see the following:

    • a: RM and MEDV have the closest shape to normal distributions.
    • b: AGE is skewed to the left and LSTAT is skewed to the right (this may seem counterintuitive at first, but skew refers to the side of the long tail, not the side where the bulk of the distribution sits).
    • c: For TAX, we find a large amount of the distribution is around 700. This is also evident from the scatter plots.
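To make the skew terminology concrete, here is a minimal sketch (using made-up samples, not the housing data) showing that pandas' .skew() reports a negative value for a left-skewed sample and a positive one for a right-skewed sample:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Left-skewed: most values near the top of the range, long tail to the left
left = pd.Series(100 - rng.exponential(scale=10, size=1000))

# Right-skewed: most values near zero, long tail to the right
right = pd.Series(rng.exponential(scale=10, size=1000))

print(left.skew(), right.skew())
```

The sign of the skew statistic follows the tail: AGE's long left tail gives it negative skew, while LSTAT's long right tail gives it positive skew.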

Taking a closer look at the MEDV histogram in the bottom right, we actually see something similar to TAX, where there is a large upper-limit bin around $50,000. Recall that when we ran df.describe(), the min and max of MEDV were 5k and 50k, respectively. This suggests that median house values in the dataset were capped at $50,000.
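One way to check for such a cap is to count how many observations sit exactly at the column maximum; a noticeable pile-up there suggests the values were censored. A minimal sketch with a hypothetical capped series (the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in: MEDV-like values with an artificial cap at 50
medv = pd.Series([24.0, 50.0, 21.6, 50.0, 34.7, 50.0])

# A spike of identical values at the maximum suggests capping/censoring
n_capped = (medv == medv.max()).sum()
print(f'{n_capped} of {len(medv)} values sit at the cap of {medv.max()}')
```

If a meaningful fraction of rows hit the cap, it is worth remembering during modeling that the true values above the cap are unknown.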