Hands-On Neural Networks with Keras

Matching a model to use cases

Every time we choose to define an observation as a function of some features, we open up a Pandora's box of semi-causally linked features, where each feature itself could be redefined (or quantified) as a function of other features. In doing so, we might want to take a step back and consider what exactly we are trying to represent. Is our model capturing relevant patterns? Can we rely on our data? Will our resources, whether algorithms or computational firepower, suffice for learning from the data we have?

Recall our earlier scenario of predicting the quantity of food an individual is likely to consume in each meal. The features that we discussed, such as their physical exertion, could be redefined as a function of their metabolic and hormonal activity. Similarly, dietary preferences could be redefined as a function of their gut bacteria and stool composition. Each of these redefinitions adds new features to our model, bringing with them additional complexity.

Perhaps we could even achieve greater accuracy in predicting exactly how much takeout you should order. Would this be worth the effort of getting a stomach biopsy every day? Or installing a state-of-the-art electron microscope in your toilet? Most of you will agree: no, it would not be. How did we come to this consensus? Simply by assessing our use case of dietary prediction and selecting features that are relevant enough to predict what we want to predict, in a fashion that is reliable and proportional to our situation. A complex model supplemented by high-quality hardware (such as toilet sensors) is unnecessary and unrealistic for the use case of dietary prediction. You could just as easily achieve a functional predictive model based on easily obtainable features, such as purchase history and prior preferences.
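To make this concrete, here is a minimal sketch of such a proportionate model in Keras. The feature names and synthetic data are purely illustrative assumptions (there is no real dataset here): we pretend each individual is described by three easily obtainable features and train a small regression network to predict grams of food consumed per meal.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical, easily obtainable features per individual (illustrative only):
# [weekly takeout orders (scaled), typical portion size (scaled), hours since last meal (scaled)]
rng = np.random.default_rng(42)
X = rng.random((200, 3))

# Synthetic target: grams consumed per meal, a noisy function of the features
y = 300 + 150 * X[:, 1] + 40 * X[:, 2] + rng.normal(0, 10, 200)

# A small network is proportionate to this simple use case:
# no biopsies or toilet sensors required
model = keras.Sequential([
    keras.Input(shape=(3,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)

# Predict the portion size for one new individual
pred = model.predict(X[:1], verbose=0)
print(pred.shape)
```

The point is not this particular architecture, but the proportionality: a handful of cheap features and a few dense layers are a reasonable starting point for this use case, and anything much heavier would need to justify its cost.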

The essence of this story is that you may define any observable phenomenon as a function of other phenomena in a recursive manner, but a clever data scientist will know when to stop by picking appropriate features that reasonably fit the use case; are readily observable and verifiable; and robustly deal with all relevant situations. All we need is to approximate a function that reliably predicts the output classes for our data points. Inducing a representation of our phenomenon that is too complex or too simplistic will naturally lead to the demise of our ML project.