By James Kobielus (@jameskobielus)
To truly see deep into the future, you need to see just as far into the past. If you have a large enough sample of data on how a population behaved in the past, you can predict its future behavior with a reasonable degree of confidence.
But if your historical sample is very small, you’ll have a tough time explaining why your statistical predictions of future events are better than the proverbial coin flip. And if the event you’re trying to predict is rare, you may not have enough historical occurrences to overcome the statistical biases inherent in small samples. In that case, you may be able to see far into the deep past, but the events you’re searching for are as sparse as Earth-like exoplanets in the vastness of space.
Data scientists are never content to throw up their hands and say we can’t at least have a statistical best guess on rare events. Indeed, decision makers demand clarity on “black swans” and other once-in-a-lifetime events with significant downside risks. We would all like to believe that the probability of these showstoppers is far from random, though the shape of their distribution curves is anybody’s guess.
Nevertheless, data scientists try various approaches to wrap their models around rare events. In this recent blog post, Tavish Srivastava provides guidance for building logistic regression models to predict rare events with confidence in spite of small-sample bias. Logistic regression predicts the outcome of a categorical dependent variable (e.g., the binary outcomes of “event will occur” vs. “event won’t occur”) based on one or more predictive independent variables. Here’s a more general discussion of the issues involved in applying logistic regression to the prediction of rare events.
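For readers who want to see the mechanics, here is a minimal sketch of how a fitted logistic regression model turns predictor values into an event probability. It is not taken from Srivastava's post; the function name and the coefficient values in the usage note are hypothetical, chosen only for illustration.

```python
import math


def event_probability(intercept, coefficients, predictors):
    """Return P(event will occur) under a logistic regression model.

    The model computes log-odds = intercept + sum(b_i * x_i) and maps
    them to a probability with the logistic (sigmoid) function.
    """
    if len(coefficients) != len(predictors):
        raise ValueError("need exactly one coefficient per predictor")
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    # Two algebraically equivalent forms, chosen so exp() never overflows.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


# Hypothetical model: a strongly negative intercept encodes a rare event.
baseline = event_probability(-4.0, [0.8], [0.0])
elevated = event_probability(-4.0, [0.8], [2.0])
```

With an intercept of -4.0, the baseline probability is a little under 2 percent, which is the kind of rare outcome the post is concerned with. The hard part in practice is estimating the intercept and coefficients reliably when so few events appear in the sample.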
The take-away is that, to use this approach with confidence, a data scientist should have both a large enough sample of the events being predicted and a large enough number of occurrences, within the sample, of the least-frequent outcome being predicted (e.g., “will they churn?” vs. “will they not churn?”).
This is a modeling challenge where big data provides undoubted value. The more of the entire population of event data (or at least a very large sample) you can collect, store, and analyze in a Hadoop or other big-data cluster, the more likely you are to find enough occurrences.
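One way to make that two-part check concrete is sketched below. The threshold of 10 rare outcomes per predictor is a commonly cited rule of thumb, not a figure from this post, so treat it as an assumption and adjust it to your own standards. The function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SampleCheck:
    n_samples: int
    n_rare: int                  # occurrences of the least-frequent outcome
    rare_rate: float             # n_rare / n_samples
    events_per_predictor: float
    adequate: bool


def check_rare_event_sample(outcomes, n_predictors, min_events_per_predictor=10.0):
    """Summarize whether a binary sample has enough rare outcomes to model.

    `outcomes` is a sequence of truthy/falsy labels. The default threshold
    is an assumed rule of thumb, not a guarantee of a well-behaved model.
    """
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    n_samples = len(outcomes)
    positives = sum(1 for y in outcomes if y)
    # The rarer of the two classes is the one that limits the model.
    n_rare = min(positives, n_samples - positives)
    rare_rate = n_rare / n_samples if n_samples else 0.0
    per_predictor = n_rare / n_predictors
    return SampleCheck(n_samples, n_rare, rare_rate, per_predictor,
                       per_predictor >= min_events_per_predictor)


# Hypothetical churn data: 30 churners among 10,000 customers, 5 predictors.
result = check_rare_event_sample([1] * 30 + [0] * 9970, n_predictors=5)
```

In this made-up example the sample is large overall, yet it holds only six churn events per predictor, so the check flags it as inadequate. That is exactly the situation the take-away warns about: total sample size alone does not settle the question.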
Even a limited sample from a whole-population big-data store may contain more rare occurrences than would a smaller-population database that was culled from the same event-data source.
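A rough back-of-the-envelope sketch shows why. It assumes every record is an independent draw with the same fixed event rate, which is a simplification, and the rate and sample sizes in the usage lines are hypothetical. The function names are illustrative only.

```python
import math


def expected_rare_events(sample_size, event_rate):
    """Expected number of rare occurrences in a sample of the given size."""
    return sample_size * event_rate


def prob_at_least(sample_size, event_rate, min_events):
    """Binomial probability of seeing at least `min_events` occurrences.

    Terms are computed in log space via lgamma so that large sample
    sizes do not overflow.
    """
    if not 0.0 < event_rate < 1.0:
        raise ValueError("event_rate must be strictly between 0 and 1")
    if min_events <= 0:
        return 1.0
    if min_events > sample_size:
        return 0.0
    log_p = math.log(event_rate)
    log_q = math.log1p(-event_rate)
    below = 0.0  # accumulates P(X < min_events)
    for k in range(min_events):
        log_pmf = (math.lgamma(sample_size + 1) - math.lgamma(k + 1)
                   - math.lgamma(sample_size - k + 1)
                   + k * log_p + (sample_size - k) * log_q)
        below += math.exp(log_pmf)
    return max(0.0, 1.0 - below)


# Hypothetical: a 0.1% event rate, and a need for at least 50 occurrences.
small_db = prob_at_least(5_000, 0.001, 50)        # small culled database
big_sample = prob_at_least(100_000, 0.001, 50)    # sample from a big-data store
```

Under these assumed numbers, the small database is expected to hold about 5 occurrences and the larger sample about 100, so only the latter has a realistic chance of clearing a 50-occurrence bar.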