Keras 2.x Projects

Data splitting

Training the parameters of a prediction function and testing it on the same data is a methodological mistake. A model that simply repeated the labels of the samples it had already seen would achieve a perfect score, but it would fail to predict anything useful on data it has not previously explored. This situation is called overfitting. To avoid it, it is common practice in a machine learning experiment to split the available data into a training set and a test set.

Data splitting is an operation that divides the available data into two sets, generally for cross-validation purposes. One set is used to train a predictive model, while the other is used to test the model's performance. Training and testing the model forms the basis for any further use of the model for prediction in predictive analytics. For example, given a dataset of 100 rows that includes the predictor and response variables, we split it into a convenient ratio (say, 70:30) and allocate 70 rows for training and 30 rows for testing. The rows are selected randomly to reduce bias. Once the training data is available, it is fed to the neural network to fit the function that maps input to output. The training data determines the weights and biases that, together with the chosen activation functions, produce the output from the input.
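To make this concrete, the following is a minimal sketch of such a random 70:30 split, written with NumPy and pandas on a hypothetical 100-row DataFrame (the DataFrame, its column names, and the seed are illustrative assumptions; in the rest of this section, we will use scikit-learn instead):

import numpy as np
import pandas as pd

# Hypothetical dataset: 100 rows, four predictors, and one response
df = pd.DataFrame(np.random.rand(100, 5),
                  columns=['x1', 'x2', 'x3', 'x4', 'y'])

rng = np.random.RandomState(0)          # fixed seed, so the split is reproducible
indices = rng.permutation(len(df))      # shuffle the row indices to reduce bias
n_train = int(0.70 * len(df))           # 70 rows go to training
train_df = df.iloc[indices[:n_train]]   # 70 randomly selected training rows
test_df = df.iloc[indices[n_train:]]    # the remaining 30 rows for testing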

Once sufficient convergence is achieved, the model is stored in memory, and the next step is testing it. We pass the 30 test rows through the model to check whether the actual output matches the predicted output. This evaluation yields various metrics that can be used to validate the model. If the accuracy is too low, the model has to be rebuilt, with changes to the training data and to the other parameters passed to the neural network builder.
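As a sketch of this testing step, assuming a regression model object with a standard predict() method has already been fitted on the training rows (the model variable and the choice of metric here are illustrative assumptions), the comparison between actual and predicted outputs might look like this:

from sklearn.metrics import mean_squared_error

Y_pred = model.predict(X_test)             # predicted outputs for the held-out test rows
mse = mean_squared_error(Y_test, Y_pred)   # compare actual versus predicted values
print('Test MSE = ', mse)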

To split the data, we will use the scikit-learn library; more specifically, the sklearn.model_selection.train_test_split() function, which quickly computes a random split of the data into training and test subsets.

Let's start by importing the function:

from sklearn.model_selection import train_test_split

At this point, to make our work easier, we will divide the starting DataFrame into two parts: predictors (X) and target (Y).

To do this, the pandas.DataFrame.drop() function will be used:

X = DataScaled.drop('medv', axis = 1)
print(X.describe())
Y = DataScaled['medv']
print(Y.describe())
print('X shape = ', X.shape)
print('Y shape = ', Y.shape)

The pandas.DataFrame.drop() function removes the specified labels from rows or columns. We can remove rows or columns by specifying the label names and the corresponding axis, or by specifying the index or column names directly. When using a multi-index, labels on different levels can be removed by specifying the level. To extract X, we removed the target column (medv) from the starting DataScaled DataFrame.
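As a quick illustration of how drop() behaves, here is a throwaway example on a small DataFrame (unrelated to DataScaled):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
print(df.drop('b', axis = 1))   # drops the column labeled 'b'
print(df.drop(0, axis = 0))     # drops the row with index label 0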

Along with the summary statistics from describe(), the two shape prints return the following results:

X shape = (506, 13)
Y shape = (506,)

So, X has 13 columns (the predictors) and Y has a single column (the target). Now, we have to split both into training and test subsets:

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 5)
print('X train shape = ', X_train.shape)
print('X test shape = ', X_test.shape)
print('Y train shape = ', Y_train.shape)
print('Y test shape = ', Y_test.shape)

Four parameters are passed to the train_test_split() function: X, Y, test_size, and random_state. X and Y are the predictor and target DataFrames. The test_size parameter is optional and can be a float, an integer, or None (default=0.25). If it is a float between 0.0 and 1.0, it represents the proportion of the dataset to include in the test split. If it is an int, it represents the absolute number of test samples. If it is None, the value is set to the complement of the train size. In our case, we set test_size = 0.30, which means that 30% of the data is set aside as test data. Finally, the random_state parameter sets the seed used by the random number generator; this guarantees that the split is reproducible across repeated runs.
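The following sketch illustrates these alternative values on the same X and Y (the specific numbers are just examples):

# test_size as a float: 30% of the rows go to the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 5)

# test_size as an int: exactly 100 rows go to the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 100, random_state = 5)

# The same random_state always reproduces the same split
a, _, _, _ = train_test_split(X, Y, test_size = 0.30, random_state = 5)
b, _, _, _ = train_test_split(X, Y, test_size = 0.30, random_state = 5)
print(a.equals(b))   # True: identical training sets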

Returning to our 70:30 split, the print() statements shown previously return the following results:

X train shape = (354, 13)
X test shape = (152, 13)
Y train shape = (354,)
Y test shape = (152,)

So, the starting DataFrame is split into two datasets: one with 354 rows (X_train) and one with 152 rows (X_test). A similar subdivision is made for Y.
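As a quick sanity check, the two subsets together should account for all 506 rows, with no row appearing in both (a minimal verification sketch):

print(X_train.shape[0] + X_test.shape[0])            # 354 + 152 = 506
print(len(set(X_train.index) & set(X_test.index)))   # 0: the sets do not overlap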