Two decades ago, we saw the early introduction of data collection and storage systems into industrial settings. My own experience with these systems, such as the PI data historian system from OSIsoft, was in the pulp and paper industry, but the prevalence of these systems crosses sector boundaries. The purpose was to look at trends in all the different sensor readings, control actions and actuator settings in an industrial process, in order to attempt to draw conclusions from the accumulated machine records that might not be immediately obvious to operators or technical staff. These conclusions could then presumably be used to reduce operating costs or improve product quality.
Growth of these systems tracked the falling cost of computer data storage. With memory chips and hard drives expensive early in the computer era, operating data was often overwritten daily if not hourly. A study I ran in the late 1990s [see reference below] had access to 36 months' worth of industrial data, collected daily by a mill engineer and averaged monthly. So 36 data points to describe 3 years of operations ... today, a year's worth of industrial process data from a full-scale plant amounts to terabytes, but can be stored indefinitely at a cost of a few dollars.
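To put that contrast in rough numbers, here is a back-of-envelope sketch (the tag count, sampling rate and bytes per sample below are assumptions for illustration, not figures from any particular mill or from the study above):

```python
# Back-of-envelope estimate of annual historian data volume.
# All inputs are assumptions for illustration, not measurements.
sensors = 10_000           # tags recorded by the historian
sample_rate_hz = 1         # one reading per second per tag
bytes_per_sample = 8       # a double-precision value, ignoring compression

seconds_per_year = 365 * 24 * 3600
bytes_per_year = sensors * sample_rate_hz * bytes_per_sample * seconds_per_year

print(f"{bytes_per_year / 1e12:.1f} TB per year")  # ~2.5 TB per year
# Compare with the 1990s study above: 36 monthly averages, i.e. 36 numbers
# to describe three years of operation.
```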
In parallel with the rise of data historians and archiving systems, we saw the growth of analytical processes to try to make sense of this new avalanche of data. A quick Google search yields an enormous list of textbooks and articles on the topic of data mining and so-called 'statistical learning'; some are even sold by Amazon as well as publishers such as Springer-Verlag or Wiley.
Figure 1: Data from Nature, as used to illustrate correlation versus cause and effect by the late Prof. Martin Weber, Dep't of Chemical Engineering, McGill University, ca. 1990.
Figure 2: In this correlation, each stork is associated with approximately one live birth per working day (5 days per week, 50 weeks per year).
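As a toy illustration of how such a correlation can arise without any cause-and-effect link, consider two entirely synthetic series that merely share a common upward trend (the numbers below are invented; they are not the Nature data shown in the figures):

```python
# Synthetic illustration of a spurious correlation: neither series causes
# the other, they simply share a common upward trend plus some noise.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1980, 2000)

storks = 100 + 3 * (years - 1980) + rng.normal(0, 2, years.size)       # nesting pairs
births = 5000 + 150 * (years - 1980) + rng.normal(0, 100, years.size)  # live births

r = np.corrcoef(storks, births)[0, 1]
print(f"correlation coefficient: {r:.2f}")  # close to 1, despite no causal link
```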
In fact, the variable in question was manipulated by a control algorithm whose objective was to maintain the minimum product quality needed to satisfy customers in the face of raw material variability, while using the least raw material and process inputs necessary to do so. Reduced quality meant lost or heavily discounted sales, and had a much bigger impact on the bottom line than the slight increase in operating costs occasionally required to maintain product quality in the face of the inevitable disturbances. The quality-related variables were not flagged by the data mining process, since they did not move at all and were therefore uncorrelated with any other variable. This was actually a good thing, as it was evidence of a well-designed control algorithm: yes, the manipulated variable moves around a lot, but the result is that the controlled variable (product quality) is stable.
So this is an example where faulty information was extracted from big data, and it shows the importance of understanding the difference between industrial data (where a controller or an operator may have a hand on a valve) and lab-generated data (which can cover every possible combination of the variables, even combinations that would never be implemented industrially). The article referenced above provides another example: mill data showed conclusively that using less bleach led to brighter paper, a correlation which is clearly nonsense on its own, but which makes sense in context.
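A small simulation illustrates how closed-loop mill data can produce exactly that "less bleach means brighter paper" correlation. Everything below is synthetic, and the gains and noise levels are invented; the point is only the sign of the correlation: the controller adds bleach when the incoming pulp is darker, and because the compensation is imperfect, more bleach ends up associated with lower final brightness in the archived data.

```python
# Synthetic sketch of closed-loop mill data; all values are invented.
import numpy as np

rng = np.random.default_rng(1)
n = 1000

incoming = rng.normal(60, 5, n)   # brightness of incoming pulp (the disturbance)
target = 80.0                     # brightness target for the finished product
bleach_gain = 2.0                 # brightness points gained per unit of bleach

# Feedback controller: dose bleach in proportion to how far the incoming pulp
# falls below target, with imperfect compensation (only 80% of what is needed).
bleach = 0.8 * (target - incoming) / bleach_gain
brightness = incoming + bleach_gain * bleach + rng.normal(0, 0.5, n)

r = np.corrcoef(bleach, brightness)[0, 1]
print(f"bleach vs. brightness correlation: {r:.2f}")  # strongly negative
# Naive data mining concludes "less bleach gives brighter paper"; in reality
# the controller simply uses more bleach whenever the pulp is darker.
```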
One thing I have learned from helping to develop process control systems is this: it is very hard to find a good process control person. The best approach is to find a good control person (of whom there are many) and pair them with a good process person (of whom there are also many). I would suggest the same is true of any of the broad new tools for data analysis (mining, process modeling and integration, etc.): there are lots of analytics people out there, but for best effect they need to be paired with a process expert who understands the underlying chemistry and physics, as well as the business context.
I have focused here on industrial challenges. Today we are seeing data mining and analysis tools applied to any sphere where lots of data exists. For instance, Google collects staggering amounts of data every second (including data about who reads this blog, and what else those readers might be interested in), and presumably is building analytical systems to extract information about the world from all this data. (Should we worry that Google knows so much about us? A discussion for another post...)
So while this post has discussed data in an industrial setting, perhaps it is worth asking whether the same challenges exist in social and other applications of data analysis, and if so, how they are being managed. Big Questions around Big Data.
Reference: Browne, T.C., Miles, K.B., McDonald, J.D. and Wood, J.R., “Multivariate Analysis of Seasonal Pulp Quality Variations in a TMP Mill”, Pulp Pap Can 105(10):35-39 (October 2004).