Create categorical fields from continuous variables and make segmented visualizations

Scroll up to the pair plot in the Jupyter Notebook where we compared MEDV, LSTAT, TAX, AGE, and RM.

Take a look at the panels containing AGE. As a reminder, this feature is defined as the proportion of owner-occupied units built prior to 1940. We are going to convert this feature to a categorical variable. Once it's been converted, we'll be able to replot this figure with each panel segmented by color according to the age category.

Scroll down to Subtopic Building and exploring categorical features and click into the first cell. Type and execute the following to plot the AGE cumulative distribution:

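    # Cumulative histogram of AGE; the KDE line width is set to 0 to hide it
    sns.distplot(df.AGE.values, bins=100,
                 hist_kws={'cumulative': True},
                 kde_kws={'lw': 0})
    plt.xlabel('AGE')
    plt.ylabel('CDF')
    # Horizontal guides at one third and two thirds of the samples
    plt.axhline(0.33, color='red')
    plt.axhline(0.66, color='red')
    plt.xlim(0, df.AGE.max());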

Note that we set kde_kws={'lw': 0} to give the kernel density estimate a line width of zero, which hides it in the preceding figure.

Looking at the plot, there are very few samples with low AGE, whereas there are far more with a very large AGE. This is indicated by the steepness of the distribution on the far right-hand side.

The red lines indicate 1/3 and 2/3 points in the distribution. Looking at the places where our distribution intercepts these horizontal lines, we can see that only about 33% of the samples have AGE less than 55 and 33% of the samples have AGE greater than 90! In other words, a third of the housing communities have less than 55% of homes built prior to 1940. These would be considered relatively new communities. On the other end of the spectrum, another third of the housing communities have over 90% of homes built prior to 1940. These would be considered very old.

We'll use the places where the red horizontal lines intercept the distribution as a guide to split the feature into categories: Relatively New, Relatively Old, and Very Old.
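The crossing points can also be computed instead of being read off the plot by eye. The sketch below uses the pandas Series.quantile method on age_demo, a hypothetical synthetic stand-in for df.AGE; the values it prints are therefore not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.AGE (synthetic values, not the real data).
rng = np.random.default_rng(0)
age_demo = pd.Series(100 * rng.beta(2.0, 0.8, size=500), name='AGE')

# AGE values below which 33% and 66% of the samples fall. These are the
# x-positions where the cumulative distribution crosses the two red lines.
cut_points = age_demo.quantile([0.33, 0.66])
print(cut_points)
```

In the notebook, df.AGE.quantile([0.33, 0.66]) would give the corresponding values for the real data.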

Setting the segmentation points as 50 and 85, create a new categorical feature by running the following code:

    def get_age_category(x):
        if x < 50:
            return 'Relatively New'
        elif 50 <= x < 85:
            return 'Relatively Old'
        else:
            return 'Very Old'
    df['AGE_category'] = df.AGE.apply(get_age_category)

Here, we are using the very handy pandas method apply, which applies a function to a given column or set of columns. The function being applied, in this case get_age_category, should take one argument and return one value for the new column. Because we call apply on a single column, the argument passed each time is just a single value: the AGE of the sample.

The apply method is great because it can solve a variety of problems and allows for easily readable code. Often, though, vectorized methods such as those under pd.Series.str can accomplish the same thing much faster. It's therefore advisable to avoid apply where a vectorized alternative exists, especially when working with large datasets. We'll see some examples of vectorized methods in the upcoming chapters.
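For this particular binning task, one possible vectorized alternative is pd.cut. This is a sketch of that option, not the approach used in the text; age_demo is a hypothetical stand-in for df.AGE, chosen to include values on both sides of each boundary.

```python
import numpy as np
import pandas as pd

def get_age_category(x):
    if x < 50:
        return 'Relatively New'
    elif 50 <= x < 85:
        return 'Relatively Old'
    else:
        return 'Very Old'

# Hypothetical stand-in for df.AGE, including the boundary values 50 and 85.
age_demo = pd.Series([2.9, 49.9, 50.0, 84.9, 85.0, 100.0], name='AGE')

# right=False makes each bin closed on the left and open on the right:
# [-inf, 50), [50, 85), [85, inf), matching the if/elif/else logic.
age_cat = pd.cut(
    age_demo,
    bins=[-np.inf, 50, 85, np.inf],
    labels=['Relatively New', 'Relatively Old', 'Very Old'],
    right=False,
)

# Check that both approaches assign the same label to every value.
matches = (age_cat.astype(str) == age_demo.apply(get_age_category)).all()
print(age_cat.tolist(), matches)
```

Note that pd.cut returns a categorical column rather than plain strings, which also records the ordering of the three labels.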

Check how many samples we've grouped into each age category by typing df.groupby('AGE_category').size() into a new cell and running it.

Looking at the result, it can be seen that two class sizes are fairly equal, and the Very Old group is about 40% larger. We are interested in keeping the classes comparable in size, so that each is well-represented and it's straightforward to make inferences from the analysis.
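If near-equal class sizes matter more than round cut-off values, the boundaries can be placed at the quantiles themselves. The following sketch uses pd.qcut on age_demo, a hypothetical synthetic stand-in for df.AGE. It reuses the three labels for illustration only, and the resulting cut-offs will differ from the 50 and 85 chosen above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.AGE (synthetic values, not the real data).
rng = np.random.default_rng(0)
age_demo = pd.Series(100 * rng.beta(2.0, 0.8, size=500), name='AGE')

labels = ['Relatively New', 'Relatively Old', 'Very Old']

# q=3 splits at the 1/3 and 2/3 quantiles, so each bin receives
# roughly the same number of samples.
age_cat_equal = pd.qcut(age_demo, q=3, labels=labels)

sizes = age_cat_equal.value_counts()
print(sizes)
```

With many tied values in a column, the bins can come out uneven, so it is worth checking the sizes as done here.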

It may not always be possible to assign samples into classes evenly, and in real-world situations, it's very common to find highly imbalanced classes. In such cases, it's important to keep in mind that it will be difficult to make statistically significant claims with respect to the under-represented class. Predictive analytics with imbalanced classes can be particularly difficult. The following blog post offers an excellent summary of methods for handling imbalanced classes when doing machine learning: https://svds.com/learning-imbalanced-classes/.

Let's see how the target variable is distributed when segmented by our new feature AGE_category.

Make a violin plot by running the following code:

    sns.violinplot(x='MEDV', y='AGE_category', data=df,
                   order=['Relatively New', 'Relatively Old', 'Very Old']);

The violin plot shows a kernel density estimate of the median house value distribution for each age category. We see that they all resemble a normal distribution. The Very Old group contains the lowest median house value samples and has a relatively large width, whereas the other groups are more tightly centered around their average. The Relatively New group is skewed to the high end, which is evident from the enlarged right half and the position of the white dot in the thick black line within the body of the distribution.

This white dot represents the median, and the thick black line spans the middle 50% of the population (it extends from the first quartile to the third quartile). The thin black line represents the box plot whiskers, which by default reach the most extreme data points lying within 1.5 times the interquartile range of the quartiles and so cover most of the population. This inner visualization can be modified to show the individual data points instead, by passing inner='point' to sns.violinplot(). Let's do that now.

Redo the violin plot adding the inner='point' argument to the sns.violinplot call:
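The code for this step is not reproduced here. A minimal sketch, assuming the same call as before with only the inner='point' argument added, might look as follows. The small df_demo frame is a hypothetical stand-in with made-up values so that the snippet runs on its own; in the notebook, pass data=df instead.

```python
import numpy as np
import pandas as pd
import seaborn as sns

order = ['Relatively New', 'Relatively Old', 'Very Old']

# Hypothetical stand-in for df with made-up values (not the Boston data).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    'MEDV': rng.normal(22, 8, size=150).clip(5, 50),
    'AGE_category': np.repeat(order, 50),
})

# inner='point' draws each observation inside the violin
# instead of the default miniature box plot.
ax = sns.violinplot(x='MEDV', y='AGE_category', data=df_demo,
                    order=order, inner='point')
```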

It's good to make plots like this for test purposes in order to see how the underlying data connects to the visual. We can see, for example, how there are no median house values lower than roughly $16,000 for the Relatively New segment, and therefore the distribution tail actually contains no data. Due to the small size of our dataset (only about 500 rows), we can see this is the case for each segment.

Re-do the pairplot from earlier, but now include color labels for each AGE category. This is done by simply passing the hue argument, as follows:

    cols = ['RM', 'AGE', 'TAX', 'LSTAT', 'MEDV', 'AGE_category']
    sns.pairplot(df[cols], hue='AGE_category',
                 hue_order=['Relatively New', 'Relatively Old', 'Very Old'],
                 plot_kws={'alpha': 0.5}, diag_kws={'bins': 30});

Looking at the histograms, the underlying distributions of each segment appear similar for RM and TAX. The LSTAT distributions, on the other hand, look more distinct. We can focus on them in more detail by again using a violin plot.

Make a violin plot comparing the LSTAT distributions for each AGE_category segment:
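The code for this step is not reproduced here either. A minimal sketch, assuming it mirrors the earlier MEDV violin plot with LSTAT on the x axis, is given below. df_demo is a hypothetical stand-in whose made-up LSTAT values simply widen from group to group; in the notebook, pass data=df instead.

```python
import numpy as np
import pandas as pd
import seaborn as sns

order = ['Relatively New', 'Relatively Old', 'Very Old']

# Hypothetical stand-in for df: made-up LSTAT values whose spread
# grows from one age category to the next (not the Boston data).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    'LSTAT': np.concatenate([
        rng.normal(8, 3, size=50),
        rng.normal(12, 5, size=50),
        rng.normal(16, 7, size=50),
    ]).clip(1, 40),
    'AGE_category': np.repeat(order, 50),
})

# One violin per age category, drawn in a fixed order.
ax = sns.violinplot(x='LSTAT', y='AGE_category', data=df_demo,
                    order=order)
```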

Unlike the MEDV violin plot, where each distribution had roughly the same width, here we see the width increasing along with AGE. Communities with primarily old houses (the Very Old segment) contain anywhere from very few to many lower class residents, whereas Relatively New communities are much more likely to be predominantly higher class, with over 95% of samples having a smaller lower class percentage than the Very Old communities. This makes sense, because Relatively New neighborhoods would be more expensive.