TensorFlow Machine Learning Projects

Building a TFBT model for exoplanet detection

In this section, we shall build a gradient boosted trees model for detecting exoplanets using the Kepler dataset. Let us follow these steps in the Jupyter Notebook to build and train the exoplanet finder model:

  1. We will save the names of all the feature columns in a list with the following code:
numeric_column_headers = x_train.columns.values.tolist()
  2. We will then bucketize each feature column into two buckets around its mean, since the TFBT estimator only accepts bucketed features, with the following code (a small illustration of the resulting buckets follows the code):
bc_fn = tf.feature_column.bucketized_column
nc_fn = tf.feature_column.numeric_column
bucketized_features = [bc_fn(source_column=nc_fn(key=column),
                             boundaries=[x_train[column].mean()])
                       for column in numeric_column_headers]
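
To see what this bucketing produces, here is a small, hypothetical illustration (the column name and values are made up and are not from the Kepler dataset): each numeric value maps to a two-element one-hot vector according to whether it falls below or above the column mean.

import numpy as np
import tensorflow as tf

# Three made-up flux readings; their mean is roughly 0.53
features = {'FLUX.1': np.array([0.2, 0.5, 0.9], dtype=np.float32)}
nc = tf.feature_column.numeric_column(key='FLUX.1')
bc = tf.feature_column.bucketized_column(
        source_column=nc,
        boundaries=[float(np.mean(features['FLUX.1']))])
# input_layer densifies the bucketized column into one-hot bucket indicators
dense = tf.feature_column.input_layer(features, [bc])
with tf.Session() as sess:
    print(sess.run(dense))  # [[1. 0.] [1. 0.] [0. 1.]]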
  3. Since we only have numeric bucketized features and no other kinds of features, we store them in the all_features variable with the following code:
all_features = bucketized_features
  4. We will then define the batch size and create a function that will supply batches of features and labels from the training data. To create this function, we use the convenience function tf.estimator.inputs.pandas_input_fn() provided by TensorFlow, as in the following code:
batch_size = 32
pi_fn = tf.estimator.inputs.pandas_input_fn
train_input_fn = pi_fn(x=x_train,
                       y=y_train,
                       batch_size=batch_size,
                       shuffle=True,
                       num_epochs=None)
  5. Similarly, we will create another data input function, named eval_input_fn, that will be used to evaluate the model from the test features and labels, using the following code:
eval_input_fn = pi_fn(x=x_test,
                      y=y_test,
                      batch_size=batch_size,
                      shuffle=False,
                      num_epochs=1)
  6. We will define the number of trees to be created as 100 and the number of training steps as 100. We also define the BoostedTreesClassifier as the estimator using the following code:
n_trees = 100
n_steps = 100

m_fn = tf.estimator.BoostedTreesClassifier
model = m_fn(feature_columns=all_features,
             n_trees=n_trees,
             n_batches_per_layer=batch_size,
             model_dir='./tfbtmodel')
Since we are performing classification, we use the BoostedTreesClassifier. For regression problems, where a continuous value needs to be predicted, TensorFlow also provides an estimator named BoostedTreesRegressor.
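
As a minimal sketch only (we do not train it in this chapter), the regression variant is constructed in the same way from the same bucketized features; the model_dir name here is hypothetical:

r_fn = tf.estimator.BoostedTreesRegressor
regressor = r_fn(feature_columns=all_features,
                 n_trees=n_trees,
                 n_batches_per_layer=batch_size,
                 model_dir='./tfbtregmodel')  # hypothetical folder name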

One of the parameters provided to the estimator is model_dir, which defines where the trained model will be stored. The estimators are built such that, in subsequent invocations, they look for the model in that folder and reuse it for inference and prediction. We name the folder tfbtmodel to save the model.
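
For example, assuming the checkpoint in the tfbtmodel folder already exists, a minimal sketch of reusing it for prediction looks as follows; predict_input_fn is a name we introduce here for illustration:

predict_input_fn = pi_fn(x=x_test,
                         batch_size=batch_size,
                         shuffle=False,
                         num_epochs=1)
# Constructing the estimator with the same model_dir restores the trained trees
restored_model = m_fn(feature_columns=all_features,
                      n_trees=n_trees,
                      n_batches_per_layer=batch_size,
                      model_dir='./tfbtmodel')
predictions = restored_model.predict(input_fn=predict_input_fn)
print(next(predictions)['class_ids'])  # predicted class for the first test example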

We have provided only the minimum set of parameters needed to define the BoostedTreesClassifier. Please look up the definition of this estimator in the TensorFlow API documentation to find the various other parameters that can be provided to further customize it, as in the sketch below.
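
For instance, a sketch of a more customized classifier might look as follows; the values are illustrative rather than tuned, and the folder name is hypothetical:

tuned_model = m_fn(feature_columns=all_features,
                   n_trees=n_trees,
                   n_batches_per_layer=batch_size,
                   max_depth=4,             # shallower trees
                   learning_rate=0.05,      # smaller shrinkage per tree
                   l2_regularization=0.01,  # penalize large leaf weights
                   model_dir='./tfbtmodel_tuned')  # hypothetical folder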

The following output in the Jupyter Notebook describes the classifier estimator defined in step 6 and its various settings:

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': './tfbtmodel', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fdd48c93b38>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
  7. Next, we will train the model for 100 steps using the train_input_fn function, which supplies the exoplanet input data, with the following code:
model.train(input_fn=train_input_fn, steps=n_steps)

The Jupyter Notebook shows the following output to indicate the training in progress:

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./tfbtmodel/model.ckpt-19201
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving checkpoints for 19201 into ./tfbtmodel/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:loss = 1.0475121e-05, step = 19201
INFO:tensorflow:Saving checkpoints for 19202 into ./tfbtmodel/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Loss for final step: 1.0475121e-05.
  8. Use the eval_input_fn, which provides batches from the test dataset, to evaluate the model with the following code:
results = model.evaluate(input_fn=eval_input_fn)

The Jupyter Notebook shows the following output as the progress of the evaluation:

INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-09-07-04:23:31
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./tfbtmodel/model.ckpt-19203
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-09-07-04:23:50
INFO:tensorflow:Saving dict for global step 19203: accuracy = 0.99122804, accuracy_baseline = 0.99122804, auc = 0.49911517, auc_precision_recall = 0.004386465, average_loss = 0.09851996, global_step = 19203, label/mean = 0.00877193, loss = 0.09749381, precision = 0.0, prediction/mean = 4.402521e-05, recall = 0.0
WARNING:tensorflow:Issue encountered when serializing resources.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 19203: ./tfbtmodel/model.ckpt-19203

Note that during the evaluation the estimator loads the parameters saved in the checkpoint file: 

INFO:tensorflow:Restoring parameters from ./tfbtmodel/model.ckpt-19203
  9. The results of the evaluation are stored in the results collection. Let us print each item in it using a for loop with the following code:
for key, value in sorted(results.items()):
    print('{}: {}'.format(key, value))

The Notebook shows the following results:

accuracy: 0.9912280440330505
accuracy_baseline: 0.9912280440330505
auc: 0.4991151690483093
auc_precision_recall: 0.004386465065181255
average_loss: 0.0985199585556984
global_step: 19203
label/mean: 0.008771929889917374
loss: 0.09749381244182587
precision: 0.0
prediction/mean: 4.4025211536791176e-05
recall: 0.0

We achieve an accuracy of almost 99% with our very first model. This is because the estimators come prewritten with several optimizations, so we did not need to set the various hyperparameter values ourselves. For some datasets, the default hyperparameter values in the estimators will work out of the box; for other datasets, you will have to experiment with the various inputs to the estimators.