Keras 2.x Projects

Neural networks for regression using Keras

The real estate market is where sellers and buyers exchange real estate of any kind, such as housing, land, and commercial premises. Real estate prices depend on a series of factors that make an asset more or less attractive to potential buyers.

These factors include the socioeconomic conditions, environmental conditions, and educational facilities of the area in which the property is located. Analyzing how these factors affect the cost of real estate can help professionals in the sector predict market trends as those conditions change.

To do this, we will run a neural network regression on the Boston dataset and predict the median values of owner-occupied homes for the test data. The dataset describes 13 numerical properties of houses in Boston suburbs, and is concerned with modeling the price of houses in those suburbs in thousands of dollars. As such, this is a regression predictive modeling problem. Input attributes include features such as the crime rate, the proportion of nonretail business acres, chemical concentrations, and more.

To get the data for this section, we will draw on the large collection of data available in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml.

The dataset has the following dimensions:

  • Number of instances: 506
  • Number of attributes: 13 continuous attributes (including the class attribute medv), and one binary-valued attribute (chas)

The attributes, each followed by a brief description, are as follows:

  • crim: Per capita crime rate by town
  • zn: Proportion of residential land zoned for lots over 25,000 square feet
  • indus: Proportion of nonretail business acres per town
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox: Nitric oxides concentration (parts per ten million)
  • rm: Average number of rooms per dwelling
  • age: Proportion of owner-occupied units built prior to 1940
  • dis: Weighted distances to five Boston employment centers
  • rad: Index of accessibility to radial highways
  • tax: Full-value property tax rate per $10,000
  • ptratio: Pupil-teacher ratio by town
  • black: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  • lstat: Percentage of lower-status population
  • medv: Median value of owner-occupied homes in $1,000s

Of these, medv is the response variable, while the other 13 variables are possible predictors. The goal of this analysis is to fit a regression model that best explains the variation in medv. Is there a relationship between the 13 predictors and the medv response variable? Can we predict the medv value based on the 13 input columns? Answering these questions will allow us to predict the median value of owner-occupied homes according to a series of factors.

The data is available in a file named housing.data in the UCI repository. To start, let's look at how we can import the data into Python. To do this, we will use the read_csv function of the pandas library, which loads the data into a pandas DataFrame.

The first thing to do is import the library that we will use, as follows:

import pandas as pd

From now on, to refer to any function contained in the pandas library, we just use the prefix pd. The as clause of the import statement binds the library to this shorter alias.
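As a quick illustration of the alias, the following minimal sketch checks that pd and pandas refer to the same module, and then builds a small DataFrame through the alias. The two rows and the same_module and frame names are made up for this example only:

```python
import pandas
import pandas as pd

# Both names are bound to the same module object; pd is only a shorter alias.
same_module = pd is pandas

# Functions and classes are reached through the alias, for example pd.DataFrame.
# The values below are illustrative, not the real dataset.
frame = pd.DataFrame({"rm": [6.5, 6.4], "medv": [24.0, 21.6]})

print(same_module)
print(frame.shape)
```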

The pandas library is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. In particular, it offers data structures and operations for manipulating numerical tables and time series.

The data file does not contain a header row, so the variable names must be retrieved from another file, which is also available in the UCI archive.

Now, let's put them in a list:

BHNames = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm',
           'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat', 'medv']
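Before using the list, it can be worth checking that it matches the dataset description: 14 names in total, with the response in the last position. The following is a minimal sketch; the predictors and response variable names are our own, introduced only for this check:

```python
BHNames = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm',
           'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat', 'medv']

# The first 13 names are the candidate predictors; the last one is the response.
predictors = BHNames[:-1]
response = BHNames[-1]

# Guard against typos: every name must be unique.
all_unique = len(set(BHNames)) == len(BHNames)

print(len(BHNames), len(predictors), response, all_unique)
```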

Let's look at how we can import the data contained in the dataset in Python:

url='https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
data = pd.read_csv(url, delim_whitespace=True, names=BHNames)

As we said, to import a dataset, we use the read_csv function of the pandas library. Instead of a local filename, this function also accepts the complete URL of a file hosted in a web repository. To do this, we first stored the complete URL in the url variable, and then passed this variable to the function. Two other parameters have also been passed to the function, namely delim_whitespace and names. The first specifies whether whitespace is used as the separator (sep); setting it to True is equivalent to passing sep='\s+'. The second specifies the list of column names to use. Note that pandas 2.2 and later deprecate delim_whitespace in favor of sep='\s+'.
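To see what these parameters do without downloading anything, here is a minimal sketch that reads a small inline sample laid out like housing.data (whitespace-separated, no header row). The three rows are included for illustration only, and we assume sep=r"\s+", which has the same effect as delim_whitespace=True and also works on recent pandas versions:

```python
import io

import pandas as pd

BHNames = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm',
           'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat', 'medv']

# Illustrative rows in the same layout as housing.data: no header,
# columns separated by a variable number of spaces.
sample = """\
0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00
0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80 396.90   9.14  21.60
0.02729   0.00   7.070  0  0.4690  7.1850  61.10  4.9671   2  242.0  17.80 392.83   4.03  34.70
"""

# sep=r"\s+" splits on runs of whitespace; names supplies the missing header.
data = pd.read_csv(io.StringIO(sample), sep=r"\s+", names=BHNames)

print(data.shape)
print(data[['crim', 'rm', 'medv']])
```

The same call works with the url variable in place of io.StringIO(sample); the in-memory buffer is used here only so that the sketch runs offline.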