Applied Deep Learning with Python

Load the Boston housing dataset

  1. In the Chapter 1 Jupyter Notebook, scroll to the subtopic Loading the Data into Jupyter Using a Pandas DataFrame of Our First Analysis: The Boston Housing Dataset. The Boston housing dataset can be accessed from the sklearn.datasets module using the load_boston function (note that load_boston was removed in scikit-learn 1.2 and later; this chapter assumes an older version).

 

  1. Run the first two cells in this section to load the Boston dataset and see the data structure's type:

The output of the second cell tells us that it's a scikit-learn Bunch object. Let's get some more information about that to understand what we are dealing with.

  1. Run the next cell to import the base object from scikit-learn utils and print the docstring in our Notebook:


Reading the resulting docstring suggests that it's basically a dictionary, and can essentially be treated as such.
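That dictionary-like behavior can be illustrated with a minimal sketch of a Bunch-style container (the real class is sklearn.utils.Bunch; this toy version and its sample keys are for illustration only):

```python
# Minimal sketch of a Bunch-like object: a dict whose keys
# are also accessible as attributes (modeled on sklearn.utils.Bunch).
class Bunch(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

b = Bunch(data=[[1, 2], [3, 4]], feature_names=['a', 'b'])
print(b['feature_names'])  # dictionary-style access: ['a', 'b']
print(b.feature_names)     # attribute-style access:  ['a', 'b']
```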

  1. Print the field names (that is, the keys to the dictionary) by running the next cell. We find these fields to be self-explanatory: ['DESCR', 'target', 'data', 'feature_names'].
  1. Run the next cell to print the dataset description contained in boston['DESCR']. Note that in this call, we explicitly print the field value so that the Notebook renders the content in a more readable format than the raw string representation (that is, than if we just typed boston['DESCR'] without wrapping it in a print statement). We then see the dataset information as we've previously summarized:
    Boston House Prices dataset
    ===========================

    Notes
    ------
    Data Set Characteristics:

        :Number of Instances: 506
        :Number of Attributes: 13 numeric/categorical predictive
        :Median Value (attribute 14) is usually the target

        :Attribute Information (in order):
            - CRIM     per capita crime rate by town

            - MEDV     Median value of owner-occupied homes in $1000's

        :Missing Attribute Values: None

Of particular importance here are the feature descriptions (under Attribute Information). We will use this as reference during our analysis.

Now, we are going to create a Pandas DataFrame that contains the data. This is beneficial for a few reasons: all of our data will be contained in one object, there are useful and computationally efficient DataFrame methods we can use, and other libraries such as Seaborn have tools that integrate nicely with DataFrames.

In this case, we will create our DataFrame with the standard constructor method.

  1. Run the cell where Pandas is imported and the docstring is retrieved for pd.DataFrame:

The docstring reveals the DataFrame input parameters. We want to feed in boston['data'] for the data and use boston['feature_names'] for the headers.
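As a sketch of that constructor call, here is the same pattern applied to a tiny synthetic array standing in for boston['data'] and boston['feature_names'] (the real arrays have shape (506, 13); these values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for boston['data'] and boston['feature_names'].
data = np.array([[0.006, 18.0],
                 [0.027, 0.0]])
feature_names = np.array(['CRIM', 'ZN'])

# Feed the data in as `data` and the names in as `columns`.
df = pd.DataFrame(data=data, columns=feature_names)
print(df.shape)  # (2, 2): (number of samples, number of features)
```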

  1. Run the next few cells to print the data, its shape, and the feature names:

Looking at the output, we see that our data is stored in a 2D NumPy array. Running boston['data'].shape returns a tuple whose first and second elements are the length (number of samples) and the number of features, respectively.

  1. Load the data into a Pandas DataFrame df by running the following:
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])

In machine learning, the variable that is being modeled is called the target variable; it's what you are trying to predict given the features. For this dataset, the suggested target is MEDV, the median house value in thousands of dollars.

  1. Run the next cell to see the shape of the target:

We see that it has the same length as the features, which is what we expect. It can, therefore, be added as a new column to the DataFrame.

  1. Add the target variable to df by running the cell with the following:
    df['MEDV'] = boston['target']
  1. To distinguish the target from our features, it can be helpful to store it at the front of our DataFrame. Move the target variable to the front of df by running the cell with the following:
     y = df['MEDV'].copy()
     del df['MEDV']
     df = pd.concat((y, df), axis=1)

Here, we introduce a dummy variable y to hold a copy of the target column before removing it from the DataFrame. We then use the Pandas concat function to combine it with the remaining DataFrame along axis=1 (the columns axis), as opposed to axis=0, which would stack rows.
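The same move-to-front pattern can be sketched on a toy DataFrame (the column values here are made up):

```python
import pandas as pd

df = pd.DataFrame({'CRIM': [0.006, 0.027], 'MEDV': [24.0, 21.6]})

y = df['MEDV'].copy()            # copy the target column
del df['MEDV']                   # remove it from the DataFrame
df = pd.concat((y, df), axis=1)  # re-attach it as the first column

print(list(df.columns))  # ['MEDV', 'CRIM']
```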

You will often see dot notation used to reference DataFrame columns. For example, previously we could have done y = df.MEDV.copy() . This does not work for deleting columns, however; del df.MEDV would raise an error.
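A quick sketch of that asymmetry, on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'MEDV': [24.0, 21.6]})

# Reading a column via dot notation works...
y = df.MEDV.copy()

# ...but deleting one this way does not: del df.MEDV raises an error
# (an AttributeError in current pandas) and the column survives.
try:
    del df.MEDV
    raised = None
except Exception as exc:
    raised = type(exc).__name__

print(raised, 'MEDV' in df.columns)
```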
  1. Now that the data has been loaded in its entirety, let's take a look at the DataFrame.

We can do df.head() or df.tail() to see a glimpse of the data and len(df) to make sure the number of samples is what we expect. Run the next few cells to see the head, tail, and length of df:

Each row is labeled with an index value, as seen in bold on the left side of the table. By default, these are a set of integers starting at 0 and incrementing by one for each row.

  1. Printing df.dtypes will show the datatype contained within each column.

Run the next cell to see the datatypes of each column.

For this dataset, we see that every field is a float and therefore most likely a continuous variable, including the target. This means that predicting the target variable is a regression problem.
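That check can be sketched on a toy all-float DataFrame (column values invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'CRIM': [0.006, 0.027], 'MEDV': [24.0, 21.6]})

# Every column is float64, i.e. continuous, so predicting
# the target would be a regression problem.
print(df.dtypes)
```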

  1. The next thing we need to do is clean the data by dealing with any missing values, which Pandas represents as NaN. These can be identified by running df.isnull(), which returns a Boolean DataFrame of the same shape as df. To get the number of NaNs per column, we can do df.isnull().sum(). Run the next cell to calculate the number of NaN values in each column:

For this dataset, we see there are no NaNs, which means we have no immediate work to do in cleaning the data and can move on.
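A sketch of that missing-value check, on a toy DataFrame with one NaN deliberately planted:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'CRIM': [0.006, np.nan], 'MEDV': [24.0, 21.6]})

mask = df.isnull()          # Boolean DataFrame, same shape as df
counts = df.isnull().sum()  # NaN count per column

print(mask.shape == df.shape)           # True
print(counts['CRIM'], counts['MEDV'])   # 1 0
```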

  1. To simplify the analysis, the final thing we'll do before exploration is remove some of the columns. We won't bother looking at these, and instead focus on the remainder in more detail.

  Remove some columns by running the cell that contains the following code:

  for col in ['ZN', 'NOX', 'RAD', 'PTRATIO', 'B']:
      del df[col]
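An equivalent one-liner uses DataFrame.drop with its columns parameter; here is a sketch on a toy DataFrame with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'ZN': [18.0], 'NOX': [0.54], 'CRIM': [0.006]})

# Equivalent to deleting the columns one by one in a loop.
df = df.drop(columns=['ZN', 'NOX'])
print(list(df.columns))  # ['CRIM']
```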