Recently, I have been reading up on big data mainly because I’m a geek and partly because of all the buzz floating around about it. I also got to sit down at VMworld and try out the new Big Data Extensions for vSphere which got me thinking even more about trying out Hadoop and seeing what it really could do.
What is Big Data
Wikipedia defines Big Data as “…the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
Let me start by saying that Big Data, just like “Cloud”, is not anything new. In fact, companies like Teradata have been processing big data since before I was born (http://en.wikipedia.org/wiki/Teradata). However, with recent advancements in open source projects like Hadoop, combined with a media spotlight on NSA leaks and god knows who else collecting every bit of information about you… it’s no wonder that the “Big Data” buzzword took off. I would say that the closest thing to Big Data, before big data, was probably data mining… in both cases you are taking lots of data and looking for patterns, or at least records that stand out.
A little more history. In 1991 Teradata shipped its first system capable of handling 1TB of data to Wal-Mart. Also in 1991, IBM introduced a 1GB hard drive… and later, in 1992, Seagate introduced a 2.1GB hard drive. So to make the math easy, let’s just say that big data in 1991 was 1000x bigger than a PC hard drive. Today, with PC hard drives topping 4TB, it’s not hard to believe that big data is probably 4 Petabytes or more! In fact, that could be considered a “common” big data installation, considering Facebook published how it moved its 30 petabyte big data platform in 2011… imagine how big that is now…
You’ve probably also read about the NSA’s new datacenter in Utah (Another article)… people claim it has a 100,000 sq ft datacenter (which isn’t really that uncommon). So let’s assume they can fit 5000 racks of servers in that space, and they use 2u servers with one 4TB hard drive in each server. That comes to 420 Petabytes of storage. Obviously that is all speculation, but that is big data LOL.
((5000 racks * 42u each) / 2u servers) = 105,000 servers * 4TB drive in each = 420,000TB or 420 Petabytes
Other people are whispering that this datacenter could support as much as 5 zettabytes of data too. 1 Zettabyte = 1,000 Exabytes = 1,000,000 Petabytes, or about 250 million 4TB hard drives. Personally I find that a little hard to believe right now, however I could see it scaling to that at some point.
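If you want to play with those numbers yourself, here is a minimal Python sketch of the same back-of-the-envelope math. Every input (rack count, 42u racks, 2u servers, one 4TB drive per server) is just the guess from above, the function names are made up for illustration, and the units are decimal (1 PB = 1,000 TB).

```python
# Back-of-the-envelope storage math. All inputs are guesses, not real NSA figures.
TB_PER_PB = 1_000        # decimal (SI) units
PB_PER_ZB = 1_000_000    # 1 ZB = 1,000 EB = 1,000,000 PB


def estimate_storage_pb(racks=5000, rack_units=42, server_units=2, tb_per_server=4):
    """Return (server_count, total_petabytes) for the assumed layout."""
    servers = racks * rack_units // server_units
    return servers, servers * tb_per_server / TB_PER_PB


def drives_needed(petabytes, tb_per_drive=4):
    """How many drives of the given size it takes to hold that many petabytes."""
    return petabytes * TB_PER_PB // tb_per_drive


servers, total_pb = estimate_storage_pb()
print(servers, total_pb)               # 105000 420.0
print(drives_needed(5 * PB_PER_ZB))    # 1250000000
```

So under these assumptions, 5 zettabytes would take about 1.25 billion 4TB drives, versus the 105,000 drives in the guess above.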
The good side of Big Data
Anyhow, what really interests me about big data is its commercial uses, such as what Wal-Mart, Kroger, Meijer (Ohio grocery store chains) or any other retailer would use big data for. In fact you’re probably already seeing the output of big data and don’t even know it. Where do you think those coupons that print during checkout come from? Let me elaborate…
If I were to come in and buy baby formula, baby wipes, and a big list of groceries, you would just think I was your typical family doing its grocery shopping. But what if you were able to look at every receipt from every store you have, then compare them to find items that are regularly purchased together? Obviously sometimes people purchase on impulse, but if you look at enough data (say a few million receipts) you would see patterns. From those patterns you could then, in real-time, look at which items I am purchasing that day, determine which ones I may have forgotten or which ones relate to the “normal patterns” I have, and print coupons for me to use the next time I’m shopping.
Alternatively, supermarkets or retailers could also use this big data to determine how much of an item to stock at any given time, when to expect to hire seasonal help, etc.
Plus if you add in frequent shopper card programs like most big retailers have, just think about the possibilities… If I am buying formula and wipes on a regular basis in 2013, it’s pretty safe to say that I will be looking for toddler clothes in 2015… and even back to school supplies in 2018. And with my shopper card, a retailer like Wal-Mart can certainly track where I live… so who needs 10-year-old census data when they have almost real-time analytics?
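To make the receipt idea a little more concrete, here is a toy Python sketch. It is not how any retailer actually does it; it just counts how often two items show up on the same receipt, then suggests items that usually go with what is in my basket today. The function names, the min_count threshold, and the sample receipts are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def count_pairs(receipts):
    """Count how many receipts contain each unordered pair of items."""
    pair_counts = Counter()
    for items in receipts:
        # sorted(set(...)) so each pair is counted once per receipt, in a stable order
        for pair in combinations(sorted(set(items)), 2):
            pair_counts[pair] += 1
    return pair_counts


def suggest(basket, pair_counts, min_count=2):
    """Suggest items often bought with something in the basket but missing from it."""
    basket = set(basket)
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if n < min_count:
            continue  # ignore pairs that look like one-off impulse buys
        if a in basket and b not in basket:
            scores[b] += n
        elif b in basket and a not in basket:
            scores[a] += n
    return [item for item, _ in scores.most_common()]


receipts = [
    ["formula", "wipes", "diapers"],
    ["formula", "wipes", "milk"],
    ["formula", "diapers"],
    ["milk", "bread"],
]
pair_counts = count_pairs(receipts)
print(suggest(["formula", "milk"], pair_counts))
```

With four receipts this is trivial, but the same counting applied to a few million receipts is where the patterns (and the coupons) would come from.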
So basically all you need is a big pool of information on which you can use big data tools like Hadoop to look for patterns, and then use those patterns to positively impact your business. You might be thinking that this sounds like what databases have been doing for years… and you would be right, but databases require structured data… columns and rows of predictable records to be useful. And at some point a database hits a limit on the number of rows it can handle before becoming slow. The idea of Big Data is that you write code that can run on every record of data no matter what it looks like. This allows Hadoop to break your dataset into chunks, distribute those chunks to thousands of compute nodes, run your code on each record in parallel, and finally combine the results. Remember, many hands make light work.
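Here is a rough sketch of that split / process / combine idea in plain Python. To be clear, this is not the Hadoop API and nothing here is distributed; the function names and the comma-separated receipt lines are assumptions for illustration. On a real cluster each chunk would be handed to a different node, while here the chunks are simply processed one after another.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Break the dataset into fixed-size chunks (a cluster would spread these over nodes)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Count the items sold in one chunk of messy, comma-separated receipt lines."""
    counts = Counter()
    for line in chunk:
        for item in line.split(","):
            item = item.strip()
            if item:  # skip blanks left by empty or sloppy lines
                counts[item] += 1
    return counts


def combine(left, right):
    """Merge two partial results into one."""
    return left + right


def run_job(records, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    partials = [map_chunk(chunk) for chunk in chunks]  # sequential here, parallel on a cluster
    return reduce(combine, partials, Counter())


records = ["formula, wipes", "milk,bread", "wipes", "formula , milk, wipes", ""]
totals = run_job(records)
print(totals["wipes"])  # 3
```

The nice property is that the answer does not depend on how the data gets chunked, which is exactly what lets you throw more nodes at the problem.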
Taking the bad with the good
Like any good spy movie or novel there are always some people who are more fascinated with the evil possibilities of a technology. So while the good things that big data can bring are awesome, the evil stuff scares the hell out of me.
For example, if I’m using my shopper’s card every time I purchase an item, it’s pretty safe to say that I may use my debit card, or a credit card… or maybe even a check (I think those still exist LOL). So technically a retailer could have more financial information about me than my hometown bank does. Don’t get me wrong, I don’t think many retailers are interested in stealing my bank information… after all there isn’t that much money in my accounts anyhow LOL. But we also see that the government is looking at everything and even has direct access to many large companies’ databases. And worse yet, what about when some black hat breaks into said database… their intentions may not be so noble.
So as Voltaire, Stan Lee, and maybe even FDR say… “With great power comes great responsibility”, and it couldn’t be more true with big data.
More posts on the way
Lately I have been slacking off on my blogging duties, mainly because of an addition to my family, and because I’m also working on a vCloud related book in addition to my normal “day job”.
However with that being said, I have had a chance to play around with a couple Hadoop distributions lately and hope to post some articles about them very shortly.
Stay tuned, and thanks for reading!