
Data scaling
Analyzing the results of the describe() function, we can see that the variables have different ranges. When features have different numeric ranges, a feature with a wider range can have a greater impact on the target variable than one with a narrower range, and this can, in turn, hurt the prediction's accuracy. To remove this effect, we can scale the values of the different features so that they all fall within a common range.
Remember, it is good practice to rescale the data before training a deep learning algorithm. With rescaling, the units of measurement are eliminated, allowing you to easily compare data from different sources.
In this case, we will use z-score standardization. This technique consists of subtracting the column mean from each value in a column, and then dividing the result by the column's standard deviation. The formula to achieve this is as follows:
z = (x − μ) / σ
The result of standardization is that the features will be rescaled so that they'll have the properties of a standard normal distribution, as follows:
- μ=0
- σ=1
Here, μ is the mean and σ is the standard deviation from the mean.
In summary, the z-score (also called the standard score) represents the number of standard deviations by which an observed value lies above or below the mean of what is observed or measured. Values above the mean have positive z-scores, while values below the mean have negative z-scores. The z-score is a dimensionless quantity, obtained by subtracting the population mean from an individual raw score and then dividing the difference by the population standard deviation.
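To make the definition concrete, here is a minimal sketch that computes z-scores by hand on a small toy column (the values are hypothetical, chosen only for illustration):

```python
import numpy as np

# Toy column of values (hypothetical data, for illustration only)
x = np.array([2.0, 4.0, 6.0, 8.0])

mu = x.mean()        # population mean
sigma = x.std()      # population standard deviation (ddof=0)

z = (x - mu) / sigma  # z-score for each observation

print(z)              # values below the mean are negative, above are positive
print(z.mean(), z.std())  # ~0.0 and 1.0, as expected after standardization
```

Note that the values below the mean (2 and 4) get negative z-scores and the values above it (6 and 8) get positive ones, while the standardized column ends up with mean 0 and standard deviation 1.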
Before proceeding, it is advisable to split the data into input and target. This will be particularly useful because scaling will only affect the input values. Let's start with the input:
InputNames = HDNames.copy()
InputNames.pop()
Input = pd.DataFrame(DataNew.iloc[:, 0:13],columns=InputNames)
First, we created a new list of names, removing the name of our target (HeartDisease) from the original one (HDNames). To do this, we used the pop() function, which removes the last item from a list and returns it. Then, the iloc indexer is used to extract the first 13 columns from the DataNew DataFrame. Let's move on to the target:
Target = pd.DataFrame(DataNew.iloc[:, 13],columns=['HeartDisease'])
To scale the input data, as we did in Chapter 2, Modeling Real Estate Using Regression Analysis, we will use the sklearn.preprocessing package once again. Specifically, we will use the StandardScaler class, but as always, we start by importing the package:
from sklearn.preprocessing import StandardScaler
Let's start by defining the scaler variable:
scaler = StandardScaler()
To compute the scaling parameters that will be used for the upcoming transformation, and to print the scaler's configuration, we call the fit() method, as follows:
print(scaler.fit(Input))
The fit() method computes the mean and standard deviation that will be used for later scaling. The result is as follows:
StandardScaler(copy=True, with_mean=True, with_std=True)
Now we can scale the features:
InputScaled = scaler.fit_transform(Input)
The fit_transform() method fits the scaler to the data and then transforms it. It returns a NumPy array of the same shape as the input. It is advisable to convert the results back into the starting format (a pandas DataFrame), at least for comparison purposes. Let's do this:
InputScaled = pd.DataFrame(InputScaled,columns=InputNames)
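The round trip from DataFrame to NumPy array and back can be sketched on a small stand-in for Input (the two columns and their values below are hypothetical, not taken from the heart disease dataset):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for Input, with two feature columns
InputNames = ['Age', 'Chol']
Input = pd.DataFrame({'Age': [29.0, 54.0, 77.0],
                      'Chol': [180.0, 240.0, 300.0]})

scaler = StandardScaler()
InputScaled = scaler.fit_transform(Input)  # returns a NumPy array
print(type(InputScaled))                   # <class 'numpy.ndarray'>

# Wrap the array back into a DataFrame so the column names survive
InputScaled = pd.DataFrame(InputScaled, columns=InputNames)
print(InputScaled.mean().round(6))         # each column's mean is ~0
```

Wrapping the array in a DataFrame restores the column labels, which fit_transform() discards; without this step, the later describe() call would show anonymous numeric column indices.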
To verify the transformation that's been carried out, we again print the basic statistics that we calculated previously:
summary = InputScaled.describe()
summary = summary.transpose()
print(summary)
The following results are returned:

Looking at the preceding results, we can confirm that all input variables now have a mean of about zero and a standard deviation of about one.
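This check can be reproduced end to end on synthetic data. The sketch below generates a random stand-in for the input (the column names and distribution are hypothetical), scales it, and inspects the mean and std columns of the transposed describe() summary:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the raw input: 100 rows, 3 feature columns
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(50, 10, size=(100, 3)),
                   columns=['Age', 'RestBP', 'Chol'])

InputScaled = pd.DataFrame(StandardScaler().fit_transform(raw),
                           columns=raw.columns)

summary = InputScaled.describe().transpose()
print(summary[['mean', 'std']])
# mean is ~0 for every column; std is ~1 (describe() uses the sample
# standard deviation, so it sits slightly above 1: sqrt(n / (n - 1)))
```

The small gap between the reported std and exactly 1 comes from describe() using the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0).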