
Building a TFBT model for exoplanet detection
In this section, we shall build a gradient boosted trees model for detecting exoplanets using the Kepler dataset. Let us follow these steps in the Jupyter Notebook to build and train the exoplanet finder model:
- We will save the names of all the features in a list with the following code:
numeric_column_headers = x_train.columns.values.tolist()
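As a quick sanity check, we can inspect this list. The example output below assumes the Kaggle Kepler labelled time-series dataset, where the feature columns are named FLUX.1 through FLUX.3197:
print(len(numeric_column_headers))  # number of feature columns
print(numeric_column_headers[:3])   # e.g. ['FLUX.1', 'FLUX.2', 'FLUX.3']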
- We will then bucketize the feature columns into two buckets around the mean, since the TFBT estimator only accepts bucketed features, with the following code:
bc_fn = tf.feature_column.bucketized_column
nc_fn = tf.feature_column.numeric_column
bucketized_features = [bc_fn(source_column=nc_fn(key=column),
                             boundaries=[x_train[column].mean()])
                       for column in numeric_column_headers]
- Since we only have numeric bucketized features and no other kinds of features, we store them in the all_features variable with the following code:
all_features = bucketized_features
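To see what a bucketized column does, consider a minimal illustration (the column name f and the boundary 0.0 here are hypothetical, not part of the notebook): a value below the boundary lands in bucket 0, a value at or above it lands in bucket 1, and tf.feature_column.input_layer one-hot encodes the bucket index:
# Hypothetical column 'f' with a single boundary at 0.0
example_col = bc_fn(source_column=nc_fn(key='f'), boundaries=[0.0])
dense = tf.feature_column.input_layer({'f': [[-1.5], [2.3]]}, [example_col])
with tf.Session() as sess:
    print(sess.run(dense))  # [[1. 0.]   -1.5 falls in bucket 0
                            #  [0. 1.]]   2.3 falls in bucket 1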
- We will then define the batch size and create a function that provides inputs from the label and feature vectors created from the training data. To create this function, we use the convenience function tf.estimator.inputs.pandas_input_fn() provided by TensorFlow. We will use the following code:
batch_size = 32
pi_fn = tf.estimator.inputs.pandas_input_fn
train_input_fn = pi_fn(x=x_train,
                       y=y_train,
                       batch_size=batch_size,
                       shuffle=True,
                       num_epochs=None)
- Similarly, we will create another data input function, named eval_input_fn, that will be used to evaluate the model from the test feature and label vectors, using the following code:
eval_input_fn = pi_fn(x=x_test,
                      y=y_test,
                      batch_size=batch_size,
                      shuffle=False,
                      num_epochs=1)
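Note that shuffle=False and num_epochs=1 make the evaluation pass deterministic and run exactly once over the test set, whereas the training function above cycles through the data indefinitely (num_epochs=None). As a quick check that is not part of the original notebook, we could pull a single batch and inspect its shape; tf.train.MonitoredSession starts the queue runners that pandas_input_fn relies on:
# Calling the input function returns a (features, labels) pair of tensors
features, labels = eval_input_fn()
with tf.train.MonitoredSession() as sess:
    f_batch, l_batch = sess.run([features, labels])
print(l_batch.shape)  # (32,), one label per sample in the batch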
- We will define the number of trees to be created as 100 and the number of steps to be used for training as 100. We will also define the BoostedTreesClassifier as the estimator using the following code:
n_trees = 100
n_steps = 100
m_fn = tf.estimator.BoostedTreesClassifier
model = m_fn(feature_columns=all_features,
             n_trees=n_trees,
             n_batches_per_layer=batch_size,
             model_dir='./tfbtmodel')
One of the parameters provided to the estimator is model_dir, which defines where the trained model is stored. Estimators are built such that, in subsequent invocations, they look for the model in that folder and reuse it for inference and prediction. We name the folder tfbtmodel to save the model.
The following output in the Jupyter Notebook describes the classifier estimator and its various settings:
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': './tfbtmodel', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fdd48c93b38>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
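Because the checkpoints live in model_dir, a fresh estimator pointed at the same folder picks up the latest checkpoint and can serve predictions without retraining. The following minimal sketch is not part of the original notebook; the predict() call and its output keys follow the standard estimator API:
# A new estimator pointed at the same model_dir resumes from the
# latest checkpoint in that folder
loaded_model = m_fn(feature_columns=all_features,
                    n_trees=n_trees,
                    n_batches_per_layer=batch_size,
                    model_dir='./tfbtmodel')
# predict() returns a generator of dicts with keys such as
# 'class_ids' and 'probabilities'
for pred in loaded_model.predict(input_fn=eval_input_fn):
    print(pred['class_ids'], pred['probabilities'])
    break  # inspect only the first prediction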
- Next, we will train the model for 100 steps using the train_input_fn function, which provides the exoplanet input data, with the following code:
model.train(input_fn=train_input_fn, steps=n_steps)
The Jupyter Notebook shows the following output to indicate the training in progress (note that the global step starts at 19201 because the model_dir already contained checkpoints from earlier training runs):
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
WARNING:tensorflow:Issue encountered when serializing resources. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./tfbtmodel/model.ckpt-19201
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:Issue encountered when serializing resources. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving checkpoints for 19201 into ./tfbtmodel/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:loss = 1.0475121e-05, step = 19201
INFO:tensorflow:Saving checkpoints for 19202 into ./tfbtmodel/model.ckpt.
WARNING:tensorflow:Issue encountered when serializing resources. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Loss for final step: 1.0475121e-05.
- We will then evaluate the model using the eval_input_fn, which provides batches from the test dataset, with the following code:
results = model.evaluate(input_fn=eval_input_fn)
The Jupyter Notebook shows the following output as the evaluation progresses:
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-09-07-04:23:31
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./tfbtmodel/model.ckpt-19203
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-09-07-04:23:50
INFO:tensorflow:Saving dict for global step 19203: accuracy = 0.99122804, accuracy_baseline = 0.99122804, auc = 0.49911517, auc_precision_recall = 0.004386465, average_loss = 0.09851996, global_step = 19203, label/mean = 0.00877193, loss = 0.09749381, precision = 0.0, prediction/mean = 4.402521e-05, recall = 0.0
WARNING:tensorflow:Issue encountered when serializing resources. Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'_Resource' object has no attribute 'name'
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 19203: ./tfbtmodel/model.ckpt-19203
Note that during the evaluation the estimator loads the parameters saved in the checkpoint file:
INFO:tensorflow:Restoring parameters from ./tfbtmodel/model.ckpt-19203
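As a small aside that is not part of the original notebook, tf.train.latest_checkpoint() can be used to confirm which checkpoint in the directory will be restored:
# Prints the path of the newest checkpoint in the model directory,
# for example ./tfbtmodel/model.ckpt-19203
print(tf.train.latest_checkpoint('./tfbtmodel'))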
- The results of the evaluation are stored in the results collection. Let us print each item in the results collection using a for loop with the following code:
for key, value in sorted(results.items()):
    print('{}: {}'.format(key, value))
The Notebook shows the following results:
accuracy: 0.9912280440330505
accuracy_baseline: 0.9912280440330505
auc: 0.4991151690483093
auc_precision_recall: 0.004386465065181255
average_loss: 0.0985199585556984
global_step: 19203
label/mean: 0.008771929889917374
loss: 0.09749381244182587
precision: 0.0
prediction/mean: 4.4025211536791176e-05
recall: 0.0
It is observed that we achieve an accuracy of almost 99% with the first model itself. Note, however, that this matches accuracy_baseline and that recall is 0.0: the dataset is highly imbalanced (label/mean is roughly 0.0088, so fewer than 1% of the samples are exoplanets), which means a model that always predicts the majority class already scores 99% accuracy. Metrics such as precision, recall, and AUC are therefore more informative here. The estimators are prewritten with several optimizations, so we did not need to set the various hyperparameter values ourselves. For some datasets, the default hyperparameter values in the estimators will work out of the box, but for other datasets, you will have to experiment with the various inputs to the estimators.
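As an illustration of the knobs available, tf.estimator.BoostedTreesClassifier exposes tree-specific hyperparameters such as max_depth, learning_rate, and the regularization terms. The following sketch shows a hand-tuned configuration; the specific values are illustrative only, not recommendations for this dataset:
# Illustrative values only; these are parameters BoostedTreesClassifier
# exposes, not tuned settings for the Kepler data
tuned_model = m_fn(feature_columns=all_features,
                   n_trees=50,                # fewer trees
                   n_batches_per_layer=batch_size,
                   max_depth=4,               # shallower trees
                   learning_rate=0.05,        # slower shrinkage
                   l2_regularization=0.01,    # penalize large leaf weights
                   model_dir='./tfbtmodel_tuned')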