“Big Data” seems to be the buzzword everywhere, and the number of blogs on this very topic has been growing exponentially. Let me take a step back to understand what to expect. At India TechEd 2012 we plan to cover this very topic under the Architect track, and personally I am really excited to see it discussed from multiple angles. As budding Architects, there is a ton to look out for; refer to my previous post on what is coming your way on Architecture. At TechEd India we will have speakers discuss the problem statements and the possible solutions, with recommendations on architecture. In this blog post I will talk about some of them, but I am not going to steal the awesome content they are lining up :).
Where does Big Data fit? It covers datasets that exceed the size and boundaries of normal processing capabilities, forcing you to take non-traditional approaches.
I had been wanting to write on this topic for a while, and strangely figured out that the SQL Community is running T-SQL Tuesday on this very topic anyway. Now, with the announcements at SQL PASS and Microsoft's investments in this space, this is a huge deal.
When we talk about Big Data we are fundamentally looking at 3 basic dimensions:
- Large Data (in the range of petabytes to exabytes and more)
- Complex Data (Write once – read many times, Dynamic Schema data)
- Unstructured data (Text mining, Images, Videos, Logs)
And these are the same problems the industry currently has with database / data store systems. Systems built around RFID tags, Web logs, sensors, medical images, telecom, public sector databases and so on are all grappling with this problem.
Where to start?
Hadoop started as a way to quickly process Web log files. Web 2.0 sites were finding that they were accumulating logs that contained valuable click information and user behavior data. As an alternative to parsing log data and storing it in a relational database, Hadoop emerged as a way to keep the log files in their original format and allow processing and analysis.
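To make the idea concrete, here is a minimal sketch in plain Python (not Hadoop itself) of the map-and-reduce style of processing log files in their original format. The log layout, the sample lines and the function names are all my own assumptions for illustration; real Web logs and real Hadoop jobs will look different.

```python
from collections import Counter
from itertools import chain

# Hypothetical raw log lines in an assumed "ip timestamp method path status" layout.
SAMPLE_LOG = [
    "10.0.0.1 2012-03-01T10:00:00 GET /home 200",
    "10.0.0.2 2012-03-01T10:00:01 GET /products 200",
    "10.0.0.1 2012-03-01T10:00:05 GET /products 200",
    "malformed line",
    "10.0.0.3 2012-03-01T10:00:09 GET /home 404",
]


def map_line(line):
    """Map step: emit a (path, 1) pair for a well-formed line, nothing otherwise."""
    fields = line.split()
    if len(fields) != 5:
        return []  # skip lines that do not match the assumed layout
    return [(fields[3], 1)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def count_hits(lines):
    """Run the map step over every raw line, then reduce the emitted pairs."""
    return reduce_counts(chain.from_iterable(map_line(line) for line in lines))


if __name__ == "__main__":
    print(count_hits(SAMPLE_LOG))
```

The point of the sketch is that the raw lines are never loaded into a table first: each line is parsed only at the moment it is processed, and bad lines are simply skipped. In a real cluster the map step would run in parallel on many machines and the framework would group the pairs by key before the reduce step.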
Though the basic concept is simple and powerful, let me point you to the post Pinal Dave wrote today for a basic explanation. He takes a stab at demystifying the basics of Hadoop, Pig, Hive and MapReduce. Feel free to read more on them:
- Pig – A high-level language that lets non-programmers use Hadoop
- Hive – An SQL query implementation for Hadoop
- HBase – A key/value store for Hadoop
One other resource I would like to point to in this context is Cloudera, for learning resources. Cloudera is a for-profit company that produces integrated, tested, and commercially supported Hadoop releases. Look at some of the other extensions they support; some of the new releases make for an interesting read.
- Hue – Hadoop user interface
- Sqoop – tool to import relational data
- Flume – tool to import nonrelational data
- Oozie – workflow engine and many more.
Relational or DW Database Obsolete?
Personally, I don’t think we are talking about an either-or choice here. There is something that makes these concepts of Hadoop interesting and viable for organizations to start considering. Let me call out some of them (not exhaustive though):
- Hadoop clusters can be on x86 commodity hardware
- No need to build cubes for predictive analysis of large data
- Relational DB have their own limits on scale-out and scale-up scenarios
- Addition of scale-out options easy with Hadoop
With this steady stream of data, is this what the industry is also looking for? Check the McKinsey Global Institute paper, “Big data: The next frontier for innovation, competition, and productivity”; the numbers are mind-blowing.
- 1.5 million more data-savvy managers needed in the US alone
- 140,000-190,000 deep analytical talent positions
- €250 billion Potential annual value to Europe’s public sector
- 15 out of 17 sectors in the US have more data stored per company than the US Library of Congress
Read the whitepaper; there are many more statistics that make this Big Data really big. Now take big data patterns at sites like Facebook or Twitter, with millions of data streams coming in every minute, and say you want some analytics on top. Does this Big Data architecture qualify here, or do you need different architectural choices? Well, don’t forget to tune into our India TechEd Architecture track for the details :).
Microsoft Integration Points
From Microsoft, you are going to see a lot of work happen here, since it is all data. Applications like Excel, PowerPivot, Power View, SQL Server Analysis Services and SQL Server Reporting Services are some of the integration points we have seen in the recent past at SQL PASS. More about this can be read on the MS Big Data home site.
Channel-9 Video: Lynn Langit and Dave Nielsen discuss "Big Data" in the Cloud
MSR Research Paper on Big Data – gives a nice read
Another Research Paper: Big Data and Cloud Computing: New Wine or just New Bottles?
What we can see is that, as we get to know this more recent phenomenon of Big Data, even the cloud seems to embrace it with both hands. You are going to see some serious integration across the platform, and it is a great sign for us:
- Connectors for Hadoop, integrating it with SQL Server and SQL Server Parallel Data Warehouse.
- An ODBC driver for Hive, permitting any Windows application to access and run queries against the Hive data warehouse.
- An Excel Hive Add-in, which enables the movement of data directly from Hive into Excel or PowerPivot.
Where to start
I highly recommend the Apache Hadoop on Windows wiki; please bookmark it. Within the Microsoft ecosystem, there are 3 other interesting reference pages you don’t want to miss.
On-Premise Deployment of Apache Hadoop for Windows
Windows Azure Deployment of Apache Hadoop for Windows
Windows Azure Deployment of Hadoop on the Elastic Map Reduce (EMR) Portal
This forms a great ecosystem from on-premise to the Cloud. As part of this whole bundle of links, I couldn’t resist linking to Rob Farley, who has been kind enough to point out that Big Data now features in 24 Hours of PASS too. Nice timing to talk more and more about Big Data.
Personally, I see tons of learning with Big Data coming our way, and 2012 will start the same kind of conversation that we started about BI back in the 2005 timeframe. So get prepared for some Big Hype, Big Challenges, Big Insights and a Big Year of Big Data coming your way.