99.99% of Big Data is irrelevant! So why do we need it?

As part of an ongoing series on analytics and big data, Michael Wu, principal scientist of analytics at Lithium Technologies, shares his thoughts on the explosion of data due to the social media revolution. 

One of the most common arguments favoring Big Data is that data is versatile and doesn’t really have a shelf life. Even though you don’t need it today, its relevance and utility may become apparent in the future. And since you never know what you might need in the future, you might as well store everything that you can now.
This argument is almost tautological: it is irrefutably true no matter how you interpret it. Since the future is inaccessible (at least for now), and humans are risk-averse, we will always want to hedge against the unknown future. The only question left is how cheaply we can track and store these data. If it is cheap, this approach makes sense!
Although data storage is relatively cheap these days, a Big Data initiative has hidden costs beyond the mere cost of hard drives. Because Big Data is too big to be stored in, or analysed on, conventional databases, you need a completely new stack of technology for its capture, storage and analysis. This stack is known as the SMAQ stack (i.e. Storage, MapReduce and Query). One of the most popular SMAQ stacks is based on Hadoop, an open source framework modelled on Google's MapReduce and the Google File System (GFS). Because the software is open source, the SMAQ stack itself isn’t expensive. The real cost is the new talent needed to use this stack effectively so that enterprises can derive insights from Big Data.
Despite the fact that Big Data technology is relatively cheap, the total cost of ownership (TCO) of any Big Data initiative may still be quite high when you factor in the cost of human resources. So, Big Data is definitely an investment that may not be right for everyone.
Your signal is my noise
Let’s look at a different argument for Big Data. Although the data relevant to any one person is not big at all, the overlap between different people's relevant data is also tiny. What is relevant to me may be completely useless to you, and vice versa: your signal is probably somebody else's noise. Since we usually don’t know who will be looking at these data, we must store everything we can in order to better serve everyone.
The small overlap in relevance is most apparent among Data as a Service (DaaS) vendors such as Social Media Monitoring (SMM) platforms, a.k.a. listening platforms. If you are a company or a brand using SMM, you are probably concerned with the conversation about you and your competitors. That is actually a very tiny fraction of the conversation on social media because there are conversations about hundreds of thousands of different brands out there. Every brand will be interested in the conversation about itself, and every brand will have a different set of competitors. Since no one knows which brand will subscribe to which DaaS, DaaS vendors need to be prepared to serve all brands by storing all conversations on the social web.
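The point about small, barely overlapping signals can be made concrete with a toy sketch. Everything in it is hypothetical: the brand names, the posts, and the keyword-matching rule are invented for illustration and do not describe how any real listening platform works.

```python
# Hypothetical toy data: each post is (post_id, text). Brand names are made up.
posts = [
    (1, "Loving my new Acme phone"),
    (2, "Globex support was slow today"),
    (3, "Acme vs Globex: which camera is better?"),
    (4, "Initech released a new printer"),
    (5, "Weather is great this morning"),
    (6, "Initech printer jammed again"),
]

# Each brand watches itself plus its own (assumed) competitors.
watchlists = {
    "Acme": {"acme", "globex"},
    "Initech": {"initech"},
}


def relevant_ids(posts, keywords):
    """Return the ids of posts that mention any keyword (case-insensitive)."""
    return {pid for pid, text in posts if any(k in text.lower() for k in keywords)}


def jaccard(a, b):
    """Overlap between two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


acme_signal = relevant_ids(posts, watchlists["Acme"])
initech_signal = relevant_ids(posts, watchlists["Initech"])
overlap = jaccard(acme_signal, initech_signal)

print(sorted(acme_signal), sorted(initech_signal), overlap)
```

In this toy set each brand cares about only a few posts and the two sets of relevant posts do not overlap at all, yet a vendor that cannot predict its subscribers would still have to keep every post, including the one that is nobody's signal.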
Now, if you are not a DaaS provider (e.g. SMM or VRM) you might not need all these “big” data. For a brand, all you really need are the conversations about you and your competitors. There are several options for getting these data.
  1. You can capture and store the data yourself.
  2. You can buy the data (with a big check).
  3. Or you can subscribe to a DaaS provider and get these data at a much lower cost.
Maybe you don’t need Big Data
Both arguments above hinge on the fact that the precise use of the big data is unknown. We don’t know what questions we may need to answer, and we don’t know what data can help answer them.
Sometimes, however, we do know the questions we need to answer. In fact, we often have some very specific business questions with regard to social media: What is the ROI? Which technology is most engaging? Who are your most valuable influencers? In these cases, you don't need “big” data. You just need the “right” data, the relevant data, the precise data that addresses your question! And that is usually a pretty small data set; sometimes it can even be loaded and analysed on a beefy personal computer.
Conclusion
Alright, there are probably hundreds of reasons for and against Big Data. I’ve talked about three here. What are your arguments for or against Big Data?
Although there is little dispute about the utility of Big Data, collecting and storing the data yourself may not be the most economical way to get it. So when should you start thinking seriously about your own Big Data initiative?
  1. If you have access to the talent and can do it cheaply. That includes the talent to extract and analyse the relevant data in order to derive insights and value from it.
  2. If you are a DaaS provider and need the data to serve your customers.
  3. If, on the other hand, you have specific questions, you probably don't need one: all you really need is the “right” data, which is usually not big at all!
Michael Wu, Ph.D. is the principal scientist of analytics at Lithium Technologies. Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics and its application to social CRM. You can follow him on Twitter at mich8elwu.