8/15/13

Big Data, Hadoop and Big SQL: A Crash Course at IOD 2013


Big Data and Hadoop are by now long-standing buzzwords in the data processing community. Yet few database practitioners understand what these technologies are, how to use them productively, and how to integrate them into a conventional data processing landscape. It is no wonder: nearly all resources on these topics target software developers rather than data professionals.

At this year’s IBM Information on Demand Conference, November 3-7 in Las Vegas, I will be giving a tutorial that addresses exactly this concern: we will approach Big Data and Hadoop technologies from the perspective of data professionals. We will introduce the key elements of the Hadoop ecosystem and IBM’s enhancements to it, and highlight the impact of these technologies on data systems and practices in the enterprise.

For this tutorial we use the IBM BigInsights Hadoop distribution and, besides exploring the common Hadoop features, delve into some of its unique enhancements.

Here is an overview of what we are going to talk about:


  • What is Big Data? You surely could not escape the Big Data buzzword, but do you know what Big Data really is? Is your data Big? How about Medium data? Could you, and should you, apply Hadoop and its tooling to it? There are benefits even if your data is not huge!
  • MapReduce algorithm. At the heart of Hadoop is MapReduce, a model for processing large data sets with a parallel, distributed computation executing on a cluster. Learn how this model brings scalability and fault tolerance to a variety of applications (a word-count sketch follows after this list).
  • Hadoop. Hadoop is the framework that implements the common machinery of MapReduce. It provides the environment in which to run user Big Data programs. It is fault tolerant, it scales, it is cost effective, and it can enable thousands of computers to jointly process data in parallel.
  • Hive and Pig. While the Java APIs for Hadoop allow for a lot of flexibility, they operate at a fairly low level. For data professionals, the productive way of approaching Hadoop is at a higher level: Hive allows a subset of SQL to be run over files stored in Hadoop’s Distributed File System (HDFS), while Pig is a data flow language. See the characteristics of both and their respective strengths and weaknesses.
  • HBase. The database for Hadoop. Complementing traditional Hadoop processing, which falls into the category of batch processing, HBase is a distributed columnar database that provides online, real-time performance. It sits on top of the rest of the Hadoop infrastructure (a small client sketch follows after this list).
  • Big SQL. Of course, the most productive approach for a data practitioner would be trusted SQL, but plain Hadoop does not offer it. IBM’s Big SQL extension to Hadoop gives SQL users a familiar environment in which to become productive with Hadoop, including through the JDBC APIs (a JDBC sketch follows after this list). You will learn how to use Big SQL and quickly become productive with Big Data applications.
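
To make the MapReduce model concrete, below is the canonical word-count job written against Hadoop’s Java MapReduce API, roughly as it appears in the Hadoop tutorials of this era. It is a minimal sketch: the input and output paths are supplied on the command line, and a real job would add more configuration and error handling.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: for every word in the input split, emit the pair (word, 1).
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(line.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: the framework groups the pairs by word; sum the counts.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) {
            sum += c.get();
          }
          context.write(word, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on map nodes
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Notice how little of this code deals with distribution: Hadoop itself splits the input, schedules map and reduce tasks across the cluster, and recovers from node failures.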
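
For a flavor of HBase’s programming model, here is a minimal client sketch against the HBase Java API of this era. The table name and column names below are made-up examples, and the table has to be created first, for instance from the HBase shell:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        // Reads the cluster location from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "products"); // assumed, pre-created table
        try {
          // Write a single cell: row key "row1", column family "d", qualifier "name".
          Put put = new Put(Bytes.toBytes("row1"));
          put.add(Bytes.toBytes("d"), Bytes.toBytes("name"), Bytes.toBytes("widget"));
          table.put(put);

          // Read it back by row key -- an online, single-row lookup,
          // in contrast to a batch MapReduce scan.
          Get get = new Get(Bytes.toBytes("row1"));
          Result result = table.get(get);
          byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("name"));
          System.out.println(Bytes.toString(value));
        } finally {
          table.close();
        }
      }
    }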
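
And to illustrate the Big SQL point, here is a bare-bones JDBC sketch. Treat the driver class name, connection URL, port, credentials, and the sales table as placeholders rather than gospel; the exact values come from your BigInsights installation and its documentation.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class BigSqlExample {
      public static void main(String[] args) throws Exception {
        // Placeholder driver class and URL -- check the BigInsights
        // documentation of your release for the actual values.
        Class.forName("com.ibm.biginsights.bigsql.jdbc.BigSQLDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:bigsql://bivm.example.com:7052/default", "biadmin", "password");
        try {
          Statement stmt = con.createStatement();
          // Plain SQL over data living in Hadoop; "sales" is a made-up table.
          ResultSet rs = stmt.executeQuery(
              "SELECT product, SUM(quantity) AS total FROM sales GROUP BY product");
          while (rs.next()) {
            System.out.println(rs.getString("product") + ": " + rs.getLong("total"));
          }
          rs.close();
          stmt.close();
        } finally {
          con.close();
        }
      }
    }

The point of the sketch: to a JDBC client, Big SQL looks like any other relational data source, which is exactly what makes it approachable for SQL practitioners.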

How about the labs? In the tutorial we will show hands-on how to start exploiting the benefits of Hadoop using the IBM BigInsights Hadoop distribution. We will use the QuickStart Edition, with which you can begin exploring Hadoop in a virtual machine: just unpack and run. You will get instructions after the tutorial on how to obtain it and run the examples yourselves.

I am looking forward to seeing you at the tutorial at the Information on Demand Conference on November 7th, 2013. The tutorial is part of the Big Data and Analytics Tutorial Series. Register now here.

1 comment:

  1. Hi Vladimir,

    What is your take on Big SQL vs. Impala? How would you characterize Big SQL in Matt Aslett's taxonomy? "7 Hadoop questions. Q5: SQL in Hadoop, SQL on Hadoop, or SQL and Hadoop?"
    Cheers,
    Jim Tommaney
