For me, 2016 is a year of water leaks.
I was forced to move apartments during the summer. (Blame my old landlord for the lower frequency of posts this year!) That old apartment was overrun by water issues. In the past four years, there were two big leaks in addition to annual visible "seepage" in the ceiling. The first big leak ruined my first night back from Hurricane Sandy-induced evacuation. It was an absolute gusher as a pipe broke in the apartment right above me – people were knocking on my door and I woke up to find a downpour through the light fixture in my kitchen.
By contrast, the second big one was a slow-moving leak behind the closet in my living room. The water coursed through ten apartments from the top of the building downwards, eventually reaching the ground-floor tenants who raised the alarm. By the time I returned home from work, everything in the closet was soaked, and water had started dripping from the living-room ceiling.
Imagine one is building a statistical model to detect water leaks in apartment buildings. (You know that's where I was headed…)
The first key issue to consider is the speed of the leak – think of it as the strength of the signal. The first downpour was quickly detected, and well-contained because it was so strong. The downstairs tenant pounded on my door, and the superintendent immediately turned off the water supply. The second incident was drip-drip as well as hidden behind the closet, and therefore it took a while to be detected. When you have a weak signal, or if the signal is hidden or mixed in with noise, you have a much harder problem on your hands.
Traditionally, we build models that presume some uncertainty, and rely on discovering causal relationships or persistent correlations. Recently, as I shall discuss below, some models can be rendered effective just by collecting more of the right data.
As a statistician, I like to find auxiliary data that are correlated with the signal. So I'd study the pattern of piping in the building, past incidents of leakage, causes of water leaks, and so on. In a different incident in a relative's building, also in 2016, the water leak was caused by a defective valve installed on all the dishwashers in the building, courtesy of a recent building-wide renovation project. Several other apartments had already leaked for the same reason, so in principle, subsequent leakage events could well have been predicted by the insurer. In this paradigm, I use past data to establish causal pathways (or at least persistent correlations), which allows me to issue probabilities of leakage for each apartment in the building.
A new paradigm of predictive modeling is taking shape but is poorly understood. These "Big Data" modelers would call for installation of sensors everywhere. Imagine a sensor had been added to the dishwasher, and every five seconds, it would transmit a message, proclaiming "I'm not leaking", "I'm not leaking." In this way, any leak would be found within five seconds of occurring. Surely, this model could outperform traditional models if predictive accuracy were measured.
While the predictive accuracy of this sensor-driven model may look better, the skill in building models has played no role in it. The improvement is achieved through surveillance. It is as if we had hired a 24/7/365 guard to look over every dishwasher in the building.
A similar situation has played out in the digital advertising world in the last 5 years or so. A fast-growing segment of this market is called "remarketing." This marketing tactic explains why many of us are stalked by corresponding ads after we browsed a certain product webpage, or placed an item into the shopping cart, or purchased something and paid for it.
Any model will tell you that the further you have ventured into the purchasing cycle, the more likely you are to consummate the sale. For example, those who have browsed and placed an item into the shopping cart are more likely to purchase this item than those who merely browsed and did not place the item in the shopping cart. So if the analyst knows that you have placed something into the shopping cart, he or she is smart to stalk you around cyberspace with ads for that product – the remarketing ads placed by the analyst then have a better chance of being credited as the "cause" of the transaction.
But the act of placing something in the shopping cart is not a real cause of the transaction. It is merely a necessary step in the buying process, closer to the desired final act of paying for the item. Every final transaction must pass through this step. To the extent that putting the item in the shopping cart predicts a purchase, it is due to active surveillance of the process. Our understanding of why the customer did what he or she did has not improved at all.
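To make the point concrete, here is a minimal sketch with made-up numbers. The funnel counts and the function name `purchase_rate_given` are purely hypothetical, chosen for illustration and not taken from any real campaign. It shows how the "prediction" sharpens simply because the observation is taken closer to the purchase.

```python
# Hypothetical funnel counts, invented for illustration only.
# Each stage is a subset of the one before it.
funnel = {"browsed": 1000, "carted": 200, "purchased": 50}


def purchase_rate_given(stage, funnel):
    """Share of shoppers who reached `stage` and went on to purchase."""
    return funnel["purchased"] / funnel[stage]


rate_after_browsing = purchase_rate_given("browsed", funnel)  # 0.05
rate_after_carting = purchase_rate_given("carted", funnel)    # 0.25

print(f"P(purchase | browsed) = {rate_after_browsing:.2f}")
print(f"P(purchase | carted)  = {rate_after_carting:.2f}")
```

In this toy funnel, all 50 purchasers are counted among the 200 who carted the item, so the higher rate tells us only that we looked later in the process, not why anyone bought.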
Remarketing models will predict purchasing behavior better than other types of models, and this advantage does not arise from better skills. The analyst is just utilizing data collected closer to the time of the subsequent transaction. The role of the remarketing ad is facilitating the execution of an already-made decision. It sometimes gets it wrong, usually because of an already-executed purchase, or a prior decision not to purchase the item (after browsing it). In these cases, the model falls for false-positive signals.
The trouble with remarketing is that it tends to send money to the wrong players: the remarketing ad is crowned the "cause" of the transaction while the hidden true cause of the transaction (e.g. a celebrity endorsement) is afforded no compensation. As advertising dollars shift more and more to remarketing, due to its apparent predictive prowess, the system robs the other channels that may have created the demand for the product.
In the above example, the problem of predicting who will consummate a purchase has been turned into a race of who can insert an ad closest to the purchase event, which has in turn become the problem of who can gather the clearest signals as close to the purchase event as possible. No sophisticated analysis is required.
I now return to the sensor-driven predictive model of water leaks, and the same dynamic applies. The sensors are so close to the action that we don't need complex analysis to get a good prediction.
Is that last point really true? I think there are pre-conditions for this type of model to work well. The theory behind the leakage, i.e. that the dishwasher valve is the sole, known cause of water leaks, must be rock-solid. If other valves in an apartment are also defective, this model will miss some of the leaks. If the dishwasher valve fault is necessary but not sufficient for leaking water, then the model will suffer from some false-positive errors.
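The two failure modes just described can be tallied in a small sketch. The event records below are hypothetical, invented only to show one case of each kind, and the function name `tally` is my own.

```python
# Hypothetical events, one of each kind, invented for illustration.
# Each record: (dishwasher_sensor_alarm, leak_actually_occurred, note)
events = [
    (True, True, "dishwasher valve failed and water leaked"),
    (True, False, "dishwasher valve faulted but no water escaped"),
    (False, True, "a different valve leaked; no sensor on it"),
    (False, False, "nothing happened"),
]


def tally(events):
    """Count how the sensor-driven model fares against what really happened."""
    counts = {"caught": 0, "false_positive": 0, "missed": 0, "quiet": 0}
    for alarm, leak, _note in events:
        if alarm and leak:
            counts["caught"] += 1
        elif alarm and not leak:
            # Valve fault is necessary but not sufficient for a leak.
            counts["false_positive"] += 1
        elif leak:
            # The leak came from a cause outside the assumed theory.
            counts["missed"] += 1
        else:
            counts["quiet"] += 1
    return counts


print(tally(events))
```

How often each case shows up in practice is exactly what would need to be measured; the sketch only names the cells to count.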
The success of the sensor-driven model also presumes the sanctity of the data. It is possible – though the odds need to be measured – that these cheap sensors may once in a while generate false signals, or fail to transmit, say due to end of life. In a world filled with gazillions of sensors (imagined by those tech execs pushing the era of the "Internet of Things"), even a tiny per-sensor fault rate generates a huge number of failures.
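A back-of-the-envelope sketch of that arithmetic follows. The sensor count and the one-in-a-million fault rate are assumptions I picked for illustration, not measured values; only the five-second reporting interval comes from the example above.

```python
SECONDS_PER_DAY = 86_400


def expected_faulty_messages_per_day(n_sensors, interval_seconds, fault_rate_per_message):
    """Expected number of faulty messages per day across all sensors."""
    messages_per_sensor = SECONDS_PER_DAY // interval_seconds
    return n_sensors * messages_per_sensor * fault_rate_per_message


# Assumed inputs: one million sensors, one message every five seconds,
# and a one-in-a-million chance that any single message is wrong.
faulty = expected_faulty_messages_per_day(
    n_sensors=1_000_000,
    interval_seconds=5,
    fault_rate_per_message=1e-6,
)
print(f"Expected faulty messages per day: about {faulty:,.0f}")
```

Under these assumed inputs, each sensor sends 17,280 messages a day, so the fleet would be expected to produce roughly 17,280 faulty messages daily, each one a potential false alarm or missed leak.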