Big Data and Hadoop are now long-lasting buzzwords in the data processing community. Yet, few database practitioners understand what these technologies are, how to use them productively and how to integrate them into a conventional data processing landscape. It’s no wonder, as nearly all resources on these topics target software developers and not data professionals.
For this tutorial, we use IBM BigInsights Hadoop system and besides exploring the common Hadoop features we delve into some of its unique enhancements.
Here is the overview of what we are going to talk about:
- What is Big Data? For sure you could not escape the Big Data buzzword, but do you know what Big Data really is? Is your data Big? How about Medium data? Could you/should you apply Hadoop and its tooling to it? There are benefits even if your data is not huge!
- MapReduce algorithm. At the heart of Hadoop is MapReduce, the algorithm for processing large data sets with a parallel, distributed algorithm executing on a cluster. Learn about this algorithm that brings scalability and fault-tolerance to variety of applications.
- Hadoop. Hadoop is the framework that implements the common parts of the MapReduce. It provides the environment in which to run user Big Data programs. It is fault tolerant, it scales, it is cost effective and it can enable thousands of computers to jointly process data in parallel.
- Hive and Pig. While Java APIs for Hadoop allow for a lot of flexibility, they are at a fairly low level. For data professionals, the productive way of approaching the Hadoop is at a higher level: Hive allows for a subset of SQL to be run over the files stored in Hadoop’s Distributed File System (HDFS), while Pig is a data flow language. See the characteristics of both and its strengths and weaknesses.
- HBase. The database for Hadoop. Complementing traditional Hadoop processing, which falls into a category of batch processing, HBase is a database that provides online / real-time performance. It lies on top of the other Hadoop infrastructure and it is a distributed columnar database.
- Big SQL. Of course, the most productive approach for a data practitioner would be trusted SQL, but plain Hadoop does not have this feature. IBM’s Big SQL extension to Hadoop provides SQL users a familiar environment to become productive with Hadoop and even to use the JDBC APIs. You will learn how to use Big SQL and quickly become productive with Big Data applications.
How about the labs? In the tutorial we will show hands on how to start exploiting the benefits of Hadoop using the IBM BigInsights Hadoop distribution. We will use the QuickStart edition where you can begin exploring Hadoop in a virtual machine - just unpack and run. You will get the instructions on how to get it after the tutorial and run the examples yourselves.
I am looking forward to seeing you at the tutorial at the Information on Demand Conference, November 7th 2013. The tutorial is part of the Big Data and Analytics Tutorial Series. Register now here.