Applied Deep Learning with Python

Linear models with Seaborn and scikit-learn

  1. Scroll to the subtopic Introduction to predictive analytics in the Jupyter Notebook and look just above it at the pairplot we created in the previous section. In particular, look at the scatter plots in the bottom-left corner:

Note how the number of rooms per house (RM) and the percentage of the population that is lower class (LSTAT) are highly correlated with the median house value (MEDV). Let's pose the following question: how well can we predict MEDV given these variables?

To help answer this, let's first visualize the relationships using Seaborn. We will draw the scatter plots along with the line of best fit linear models.

  2. Draw scatter plots along with the linear models by running the cell that contains the following:
    fig, ax = plt.subplots(1, 2)
    sns.regplot('RM', 'MEDV', df, ax=ax[0],
                scatter_kws={'alpha': 0.4})
    sns.regplot('LSTAT', 'MEDV', df, ax=ax[1],
                scatter_kws={'alpha': 0.4})

The line of best fit is calculated by minimizing the ordinary least squares error function, something Seaborn does automatically when we call the regplot function. Also note the shaded areas around the lines, which represent 95% confidence intervals.

These 95% confidence intervals describe the uncertainty in the fitted line at each point along its length. In practice, Seaborn determines them by bootstrapping the data, a process where new samples of the dataset are created through random sampling with replacement, the line is refit to each resample, and the spread of the resulting fits is measured. The number of bootstrap resamples can be set manually by passing the n_boot argument.
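
For instance, both the confidence level and the number of bootstrap resamples can be passed directly to regplot. The following sketch (an aside, not part of the exercise) assumes the df DataFrame from this section is already loaded:

    fig, ax = plt.subplots()
    # Re-draw the RM fit with a 99% confidence band and 500 bootstrap resamples
    sns.regplot('RM', 'MEDV', df, ax=ax,
                ci=99, n_boot=500,
                scatter_kws={'alpha': 0.4})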
  3. Seaborn can also be used to plot the residuals for these relationships. Plot the residuals by running the cell containing the following:
    fig, ax = plt.subplots(1, 2)
    ax[0] = sns.residplot('RM', 'MEDV', df, ax=ax[0],
                          scatter_kws={'alpha': 0.4})
    ax[0].set_ylabel(r'MEDV residuals $(y - \hat{y})$')
    ax[1] = sns.residplot('LSTAT', 'MEDV', df, ax=ax[1],
                          scatter_kws={'alpha': 0.4})
    ax[1].set_ylabel('')

Each point on these residual plots is the difference between that sample (y) and the linear model prediction ($\hat{y}$). Residuals greater than zero are data points that would be underestimated by the model. Likewise, residuals less than zero are data points that would be overestimated by the model.

Patterns in these plots can indicate sub-optimal modeling. In each preceding case, we see diagonally arranged scatter points in the positive region. These are caused by the $50,000 cap on MEDV: since every capped sample has the same y value, its residual is determined entirely by the prediction, which places those points along a straight line. The RM residuals are clustered nicely around zero, which indicates a good fit. The LSTAT residuals, on the other hand, appear to be clustered somewhat below zero.
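
To make the definition of a residual concrete, the values plotted by residplot can also be computed by hand. The following sketch (an aside, not part of the exercise) assumes df is already loaded and uses scikit-learn's LinearRegression, which we introduce formally in the next step:

    from sklearn.linear_model import LinearRegression

    # Fit MEDV on LSTAT and compute the residuals y - y_hat directly
    x = df['LSTAT'].values.reshape(-1, 1)
    y = df['MEDV'].values
    model = LinearRegression()
    model.fit(x, y)
    residuals = y - model.predict(x)  # positive values are underestimated samples
    print('mean residual = {:.4f}'.format(residuals.mean()))  # ~0 for an OLS fit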

  4. Moving on from visualizations, the fits can be quantified by calculating the mean squared error. We'll do this now using scikit-learn. Define a function that calculates the line of best fit and the mean squared error by running the cell that contains the following:
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def get_mse(df, feature, target='MEDV'):
        # Get x, y to model
        y = df[target].values
        x = df[feature].values.reshape(-1,1)
        # Instantiate and fit the linear regression model
        model = LinearRegression()
        model.fit(x, y)
        # Predict the target and calculate the mean squared error
        y_pred = model.predict(x)
        error = mean_squared_error(y, y_pred)
        print('mse = {:.2f}'.format(error))
        print()

In the get_mse function, we first assign the variables y and x to the target, MEDV, and the chosen input feature, respectively. These are cast as NumPy arrays by calling the values attribute. The feature array is reshaped into the two-dimensional format expected by scikit-learn; this is only necessary when modeling a one-dimensional feature space. The model is then instantiated and fitted on the data. For linear regression, the fitting consists of computing the model parameters using the ordinary least squares method (minimizing the sum of squared errors for each sample). Finally, after determining the parameters, we predict the target variable and use the results to calculate the MSE.
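
As a side note, the ordinary least squares solution that scikit-learn computes can be checked against the normal equations directly. The sketch below (an aside, not part of the exercise) compares the two for the RM feature, assuming df is already loaded:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    x = df['RM'].values.reshape(-1, 1)
    y = df['MEDV'].values

    # scikit-learn's fit
    model = LinearRegression().fit(x, y)

    # Direct OLS via the normal equations, with a column of ones for the intercept
    X = np.hstack([np.ones_like(x), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)

    print(beta[0], model.intercept_)  # intercepts agree
    print(beta[1], model.coef_[0])    # slopes agree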

  5. Call the get_mse function for both RM and LSTAT by running the cell containing the following:
      get_mse(df, 'RM')
      get_mse(df, 'LSTAT')

Comparing the MSE values, it turns out the error is slightly lower for LSTAT. Looking back at the scatter plots, however, it appears that we might have even better success using a polynomial model for LSTAT. In the next activity, we will test this by computing a third-order polynomial model with scikit-learn.
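
As a preview of the general approach (the activity walks through it in detail), a third-order polynomial model can be built by expanding the feature with scikit-learn's PolynomialFeatures before fitting the same linear model. This sketch assumes df is already loaded:

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Expand LSTAT into [1, LSTAT, LSTAT^2, LSTAT^3] and fit a linear model on it
    x = df['LSTAT'].values.reshape(-1, 1)
    y = df['MEDV'].values
    x_poly = PolynomialFeatures(degree=3).fit_transform(x)

    model = LinearRegression().fit(x_poly, y)
    y_pred = model.predict(x_poly)
    print('mse = {:.2f}'.format(mean_squared_error(y, y_pred)))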

Forgetting about our Boston housing dataset for a minute, consider another real-world situation where you might employ polynomial regression. The following example models weather data. In the following plot, we see temperatures (lines) and precipitation (bars) for Vancouver, BC, Canada:

Either of these fields is likely to be fit quite well by a fourth-order polynomial. This would be a very valuable model to have, for example, if you were interested in predicting the temperature or precipitation for a continuous range of dates.
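
To illustrate the idea in code (using made-up placeholder values rather than the actual climate data), a fourth-order polynomial can be fit to a set of monthly values with NumPy and then evaluated at any date in between:

    import numpy as np

    # Hypothetical monthly mean temperatures in degrees C -- placeholder values only
    months = np.arange(1, 13)
    temps = np.array([4.0, 5.0, 7.0, 9.5, 13.0, 16.0,
                      18.0, 18.0, 15.0, 10.5, 6.5, 4.0])

    # Fit a fourth-order polynomial and evaluate it on a continuous range of dates
    poly = np.poly1d(np.polyfit(months, temps, deg=4))
    print(poly(6.5))  # interpolated estimate between June and July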

You can find the data source for this here:

http://climate.weather.gc.ca/climate_normals/results_e.html?stnID=888.