During my brief research on BigData, I’ve found five types of S.L.I.P.S. that a data scientist might encounter along the way: Statistics, Learning, Information, Psychology and Sources.
1) Statistics (Left Foot)
This is without any doubt the main and best-known technical aspect. The most common slip concerning statistics is mistaking correlation for causation. In other words, discovering correlations among variables doesn’t necessarily imply a cause-effect relation: two variables can move together simply because a third, unobserved variable drives both. Loosely speaking, some statistical association is usually a necessary, but never a sufficient, condition for a cause-effect relationship.
(see also K. Borne: Statistical Truisms in the Age of BigData).
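To make the slip concrete, here is a minimal sketch in Python. Everything in it is invented for illustration (the variable names `temperature`, `ice_cream` and `sunburns`, the coefficients, the noise levels): a hidden driver generates two series that never influence each other, yet they come out strongly correlated. Once the driver is regressed out, the correlation between what is left should be close to zero.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, zs):
    """What is left of ys after a least-squares straight-line fit on zs."""
    n = len(ys)
    my, mz = sum(ys) / n, sum(zs) / n
    slope = sum((z - mz) * (y - my) for y, z in zip(ys, zs)) / sum((z - mz) ** 2 for z in zs)
    return [(y - my) - slope * (z - mz) for y, z in zip(ys, zs)]

rng = random.Random(42)

# Hypothetical confounder: temperature drives both series below.
temperature = [rng.gauss(25, 5) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 3) for t in temperature]
sunburns = [1.5 * t + rng.gauss(0, 3) for t in temperature]

# Strong correlation, although neither series causes the other.
raw_corr = pearson(ice_cream, sunburns)

# Correlation after removing the effect of the confounder from both series.
partial_corr = pearson(residuals(ice_cream, temperature),
                       residuals(sunburns, temperature))

print(f"raw correlation:     {raw_corr:.2f}")
print(f"partial correlation: {partial_corr:.2f}")
```

In real data the confounder is rarely known in advance, so this check only works for the variables you thought of measuring.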
2) Learning (Right Foot)
OK, let’s assume that a cause-effect relationship exists: which model or algorithm should you choose to describe it? There are many: ARMA, Kalman filter, neural networks, custom models,… which one fits best? A model that has been validated with the data available now might no longer be valid in the future. So keep monitoring the model by measuring its prediction error, that is, the gap between the values it estimates and the values actually observed.
Choosing a model implies making assumptions. In other words, never stop learning from data and be open to breaking your assumptions, otherwise predictions and analysis will be skewed.
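One possible way to put this monitoring into practice is sketched below. It is only an illustration under stated assumptions: the class name `PredictionErrorMonitor`, the window size, the tolerance factor and the toy `model` are all hypothetical. The idea is to keep a rolling mean absolute error and raise a flag when it grows well beyond the error measured at validation time.

```python
import random
from collections import deque

class PredictionErrorMonitor:
    """Flags a model whose recent error drifts away from its validation error."""

    def __init__(self, baseline_mae, window=50, tolerance=2.0):
        self.baseline_mae = baseline_mae    # error measured when the model was validated
        self.tolerance = tolerance          # how many times worse we accept (assumed value)
        self.errors = deque(maxlen=window)  # most recent absolute errors only

    def update(self, predicted, observed):
        """Record one prediction/observation pair and report the current status."""
        self.errors.append(abs(predicted - observed))
        return self.is_degraded()

    def rolling_mae(self):
        return sum(self.errors) / len(self.errors)

    def is_degraded(self):
        window_full = len(self.errors) == self.errors.maxlen
        return window_full and self.rolling_mae() > self.tolerance * self.baseline_mae

def model(x):
    """Toy model, assumed to have been fitted when the data followed y = 2x."""
    return 2.0 * x

rng = random.Random(0)
monitor = PredictionErrorMonitor(baseline_mae=1.0)
first_alarm = None

for step in range(300):
    x = step / 10.0
    shift = 5.0 if step >= 100 else 0.0  # the real process changes at step 100
    observed = 2.0 * x + shift + rng.gauss(0, 1)
    if monitor.update(model(x), observed) and first_alarm is None:
        first_alarm = step

print("first alarm at step:", first_alarm)
```

The alarm comes some steps after the change, because the window needs to fill up with enough bad predictions. A shorter window reacts faster but is more easily fooled by noise.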
3) Information (Right Hand)
Which information is really meaningful? That’s the first point to clarify before implementing a BigData initiative or any new BI tool for your business.
Another slip is mistaking data for information. According to information theory, and well-grounded common sense as well, data are facts, while information is an interpretation of facts based upon assumptions (see also the D.A.I. model).
(see also: D. Laney & M. Beyer: BigData Strategy Essentials for Business and IT).
4) Psychology (the Head… of course!)
Have you ever heard about echo-chamber effects and social influence? Well, what happens is that social media might amplify irrational behaviours, where individuals (me included) base their decisions, more or less consciously, not only on their own knowledge or values but also on the actions of those who acted before them.
In particular, whenever dealing with tricky, slippery tools such as BigData sentiment analysis, it is better to consider carefully the relevance and impact of psychology and behaviour. The risk is gathering data that is intrinsically biased (see also My Issue with BigData Sentiments).
(see also:
D. Amerland: How Semantic Search is changing end-user behaviour
C. Sunstein: Echo Chambers: Bush v. Gore, Impeachment, and Beyond – Princeton University Press
e! Science News: Information technology amplifies irrational group behavior).
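The mechanism of deciding on the basis of those who acted before you can be illustrated with a toy simulation. This is a deliberately simplified sketch, not a model taken from the references above: the decision rule (follow the crowd as soon as one option leads by two choices) and all the numbers are assumptions made for the example. Each person receives a private hint that is right 60% of the time, and people choose one after another.

```python
import random

def last_agent_is_wrong(n_agents, signal_accuracy, rng):
    """Simulate one crowd choosing in sequence; True if the last choice is wrong."""
    lead = 0  # (number of correct choices) - (number of wrong choices) so far
    choice_is_correct = True
    for _ in range(n_agents):
        private_signal_correct = rng.random() < signal_accuracy
        if lead >= 2:
            choice_is_correct = True    # ignores own hint, follows the crowd (right)
        elif lead <= -2:
            choice_is_correct = False   # ignores own hint, follows the crowd (wrong)
        else:
            choice_is_correct = private_signal_correct
        lead += 1 if choice_is_correct else -1
    return not choice_is_correct

rng = random.Random(7)
trials = 2000
wrong_crowds = sum(last_agent_is_wrong(100, 0.6, rng) for _ in range(trials))
wrong_share = wrong_crowds / trials

print(f"crowds that end up herding on the wrong option: {wrong_share:.0%}")
```

In this toy setting a noticeable share of crowds locks onto the wrong option even though every single hint is more often right than wrong. Data collected from such a crowd mostly records the herd, not what each individual privately knew.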
5) Sources (Left Hand)
Variety!!! That is one of the three Vs suggested by D. Laney: Volume, Velocity and Variety. Choosing the right model is not the only thing that matters in order to avoid biased predictions and insights: what about the reliability of the data sources used for the analysis? If the data is biased, predictions and insights will be biased as well. In particular, any data series has a variance and a bias that cannot be eliminated.
How to mitigate such a risk? By gathering data from different sources and weighting each of them according to its reliability, as measured by its variance: the lower the variance, the higher the weight.
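A common way to implement this kind of weighting is inverse-variance weighting, sketched below. The three estimates and their variances are made-up numbers, and the sketch assumes that the sources are independent and that their variances are known. Note that it reduces variance only: a bias shared by all sources would survive the averaging.

```python
def inverse_variance_mean(estimates, variances):
    """Combine estimates of the same quantity, weighting each by 1 / variance."""
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    combined = sum(w * x for w, x in zip(weights, estimates)) / total_weight
    combined_variance = 1.0 / total_weight  # valid for independent sources
    return combined, combined_variance

# Hypothetical ratings of the same hotel from three sources of different reliability.
estimates = [4.2, 3.8, 4.6]
variances = [0.04, 0.25, 1.0]

combined, combined_variance = inverse_variance_mean(estimates, variances)
print(f"combined estimate: {combined:.2f}")          # 4.16
print(f"combined variance: {combined_variance:.3f}")  # 0.033
```

The combined variance is lower than that of the best single source, which is the whole point of drawing on several sources instead of trusting one.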
Moreover, as a BigData scientist and as a consumer as well, never forget positive and negative SEO tactics. It’s a social-digital jungle out there! (see Tripadvisor: a Case Study to Think Why BigData Variety matters).