8/15/13

Big Data, Hadoop and Big SQL: A Crash Course at IOD 2013


Big Data and Hadoop have long been buzzwords in the data processing community. Yet few database practitioners understand what these technologies are, how to use them productively, and how to integrate them into a conventional data processing landscape. It’s no wonder, as nearly all resources on these topics target software developers rather than data professionals.

At this year’s IBM Information on Demand Conference, November 3-7 in Las Vegas, I will be giving a tutorial that addresses this concern specifically: we will approach Big Data and Hadoop technologies from the perspective of data professionals. We will introduce the key elements of the Hadoop ecosystem and IBM’s enhancements to it, and highlight the impact of these technologies on data systems and practices in the enterprise.

For this tutorial, we use the IBM BigInsights Hadoop system. Besides exploring the common Hadoop features, we delve into some of its unique enhancements.

Here is an overview of what we are going to talk about:


  • What is Big Data? You surely could not escape the Big Data buzzword, but do you know what Big Data really is? Is your data Big? How about Medium data? Could you, and should you, apply Hadoop and its tooling to it? There are benefits even if your data is not huge!
  • MapReduce algorithm. At the heart of Hadoop is MapReduce, a parallel, distributed algorithm for processing large data sets on a cluster. Learn about this algorithm, which brings scalability and fault tolerance to a variety of applications.
  • Hadoop. Hadoop is the framework that implements the common parts of MapReduce. It provides the environment in which to run user Big Data programs. It is fault tolerant, it scales, it is cost effective, and it can enable thousands of computers to jointly process data in parallel.
  • Hive and Pig. While the Java APIs for Hadoop allow for a lot of flexibility, they are at a fairly low level. For data professionals, the productive way of approaching Hadoop is at a higher level: Hive allows a subset of SQL to be run over the files stored in the Hadoop Distributed File System (HDFS), while Pig is a data flow language. See the characteristics of both, along with their strengths and weaknesses.
  • HBase. The database for Hadoop. Traditional Hadoop processing falls into the category of batch processing; HBase complements it as a database that provides online, real-time performance. It is a distributed columnar database that sits on top of the rest of the Hadoop infrastructure.
  • Big SQL. Of course, the most productive approach for a data practitioner would be tried-and-trusted SQL, but plain Hadoop does not offer it. IBM’s Big SQL extension to Hadoop gives SQL users a familiar environment in which to become productive with Hadoop, and even lets them use the JDBC APIs. You will learn how to use Big SQL and quickly become productive with Big Data applications.
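
To make the MapReduce idea from the overview a bit more concrete, here is a minimal sketch of the classic word count in plain Python. It does not use Hadoop or any of its APIs; it only imitates the map, shuffle and reduce phases in a single process. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (hypothetical driver)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "hadoop processes big data"]
    print(word_count(docs))
```

On a real cluster the framework would run many map and reduce calls on different machines and take care of the shuffle, the scheduling and the recovery from failures. The sketch only shows the shape of the code a user supplies.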

How about the labs? In the tutorial we will show, hands-on, how to start exploiting the benefits of Hadoop using the IBM BigInsights Hadoop distribution. We will use the QuickStart edition, which lets you begin exploring Hadoop in a virtual machine: just unpack and run. You will get instructions on how to obtain it, so you can run the examples yourselves after the tutorial.

I am looking forward to seeing you at the tutorial at the Information on Demand Conference on November 7, 2013. The tutorial is part of the Big Data and Analytics Tutorial Series. Register now here.

8/13/13

NoSQL Now 2013 Tutorial: Introduction to Hadoop Ecosystem

The NoSQL Now! conference has emerged as one of the most exciting gatherings for NoSQL practitioners and those who are looking to apply NoSQL in production. This conference is particularly interesting as it is the largest vendor-neutral forum focused on NoSQL. There is a great variety of alternative NoSQL approaches, and learning about them is the key to successful adoption.

At my company, SciSpike, one of our typical engagements is introducing new technologies. NoSQL and Big Data are among the most popular topics lately, and we have been educating and advising our clients since the early days of NoSQL.

This year at NoSQL Now! 2013, I am teaching a tutorial, Introduction to Hadoop Ecosystem. It is a great starting point for newcomers to Hadoop. In this fast-paced tutorial we will provide a rapid immersion into Big Data with the Hadoop ecosystem. We start with an introduction to Big Data and the MapReduce algorithm. We continue with the Hadoop framework and show the ways to interact with the Hadoop file system and the cluster. We then introduce Hive and Pig as the higher-level interfaces for processing data, and HBase as the distributed, columnar database. At the end, we explore some exciting new projects in the Hadoop ecosystem. You will leave with a good understanding of what Hadoop is and how to start using it in your projects.

Join us for NoSQL Now! 2013, August 20-22 in San Jose. Check out the Introduction to Hadoop Ecosystem tutorial and stay for the many exciting presentations and discussions.