10/9/13

Succeeding with Polyglot Persistence






As a consultant, I work with clients, helping them become successful with new data and software technologies. For some of our clients, relational databases are not the optimal choice, and various NoSQL systems emerge as reasonable alternatives.

At the Global Big Data Conference this year in Silicon Valley (Santa Clara), I will be giving a talk about our journey into polyglot persistence: the use of several types of data stores, each a good match for a particular part of the information processing problem.

As we started embracing Big Data and NoSQL across a number of projects, it quickly became clear that one technology is not going to be the solution for all of our needs. We begin by outlining the issues relational technology has with scalability and new data formats. We then survey the dominant NoSQL technologies with examples and show how they fit into the big picture.

We will show how to productively bring NoSQL systems into the enterprise, including classical reporting systems and integration strategies with relational systems. You will benefit from getting a clear picture of what type of NoSQL data store is a good match for each piece of the data processing puzzle.

Outline:

  •  What is polyglot persistence?
  •  The relational database problems
  •  Taming Big Data with Hadoop and MapReduce
  •  Scalability with Key/Value and Columnar stores
  •  Flexibility of Document stores
  •  Finding connections with Graph databases
  •  Data Governance for NoSQL
  •  NoSQL and Master Data Management
  •  NoSQL integration strategies


More details and registration at the Global Big Data Conference site.

8/15/13

Big Data, Hadoop and Big SQL: A Crash Course at IOD 2013


Big Data and Hadoop have been buzzwords in the data processing community for a while now. Yet, few database practitioners understand what these technologies are, how to use them productively, and how to integrate them into a conventional data processing landscape. It’s no wonder, as nearly all resources on these topics target software developers rather than data professionals.

At this year’s IBM Information on Demand Conference, November 3-7 in Las Vegas, I will be giving a tutorial that addresses this concern directly: we will approach Big Data and Hadoop technologies from the perspective of data professionals. We will introduce the key elements of the Hadoop ecosystem and IBM’s enhancements, and highlight the impact of these technologies on the data systems and practices in the enterprise.

For this tutorial, we use the IBM BigInsights Hadoop system; besides exploring the common Hadoop features, we delve into some of its unique enhancements.

Here is an overview of what we are going to talk about:


  • What is Big Data? Surely you could not escape the Big Data buzzword, but do you know what Big Data really is? Is your data Big? How about medium data? Could you, and should you, apply Hadoop and its tooling to it? There are benefits even if your data is not huge!
  • MapReduce algorithm. At the heart of Hadoop is MapReduce, a programming model for processing large data sets in a parallel, distributed fashion on a cluster. Learn how it brings scalability and fault tolerance to a variety of applications (a word-count sketch in Java follows this list).
  • Hadoop. Hadoop is the framework that implements the common parts of MapReduce. It provides the environment in which to run your Big Data programs. It is fault tolerant, it scales, it is cost effective, and it can enable thousands of computers to jointly process data in parallel.
  • Hive and Pig. While the Java APIs for Hadoop allow for a lot of flexibility, they are at a fairly low level. For data professionals, the productive way of approaching Hadoop is at a higher level: Hive allows a subset of SQL to be run over the files stored in Hadoop’s Distributed File System (HDFS), while Pig is a data flow language. See the characteristics of both and their strengths and weaknesses.
  • HBase. The database for Hadoop. Complementing traditional Hadoop processing, which falls into the category of batch processing, HBase is a database that provides online, real-time performance. It sits on top of the rest of the Hadoop infrastructure and is a distributed columnar database (see the client sketch after this list).
  • Big SQL. Of course, the most productive approach for a data practitioner would be trusted SQL, but plain Hadoop does not have this feature. IBM’s Big SQL extension to Hadoop gives SQL users a familiar environment in which to become productive with Hadoop, and even to use the JDBC APIs (a JDBC sketch appears after this list). You will learn how to use Big SQL and quickly become productive with Big Data applications.
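
To make the MapReduce bullet concrete, here is a minimal word-count sketch in Java against the classic (Hadoop 1.x era) MapReduce API. The HDFS input and output paths come in as hypothetical command-line arguments; treat this as an illustration under those assumptions, not as production code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework handles the distribution: mappers run where the data blocks live, and the shuffle groups all counts for a word onto one reducer. That is where the scalability and fault tolerance come from.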
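
HBase’s online, per-row access contrasts with that batch flow. Here is a sketch against the HBase Java client API of that era (0.94-style); the sensor table and its d column family are hypothetical and assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    HTable table = new HTable(conf, "sensor");        // hypothetical table
    try {
      // Write one cell: row key, column family, qualifier, value.
      Put put = new Put(Bytes.toBytes("row-42"));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
      table.put(put);

      // Read it back immediately: no batch job involved.
      Result result = table.get(new Get(Bytes.toBytes("row-42")));
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
      System.out.println("temp = " + Bytes.toString(value));
    } finally {
      table.close();
    }
  }
}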
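
Finally, a hedged sketch of reaching Big SQL from a plain JDBC client (Hive’s JDBC access looks much the same). The connection URL, credentials, and the sales table are placeholders made up for illustration; the exact driver jar, URL format, and port depend on your BigInsights version, so check its documentation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigSqlSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder URL and credentials; adjust host, port, schema, and user.
    String url = "jdbc:bigsql://bihost:7052/default";
    try (Connection con = DriverManager.getConnection(url, "biadmin", "secret");
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT product, SUM(amount) AS total FROM sales GROUP BY product")) {
      while (rs.next()) {
        System.out.println(rs.getString("product") + ": " + rs.getLong("total"));
      }
    }
  }
}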

How about the labs? In the tutorial we will show, hands on, how to start exploiting the benefits of Hadoop using the IBM BigInsights Hadoop distribution. We will use the QuickStart Edition, with which you can begin exploring Hadoop in a virtual machine - just unpack and run. After the tutorial, you will get instructions on how to obtain it and run the examples yourself.

I am looking forward to seeing you at the tutorial at the Information on Demand Conference, November 7th, 2013. The tutorial is part of the Big Data and Analytics Tutorial Series. Register now here.

8/13/13

NoSQL Now 2013 Tutorial: Introduction to Hadoop Ecosystem

The NoSQL Now! conference has emerged as one of the most exciting gatherings for NoSQL practitioners and those who are looking to apply NoSQL in production. This conference is particularly interesting as it is the largest vendor-neutral forum focused on NoSQL. There is a great variety of alternative NoSQL approaches, and learning about them is the key to successful adoption.

At my company, SciSpike, one of our typical engagements is introducing new technologies. NoSQL and Big Data are among the most popular topics lately, and we have been educating and advising our clients since the early days of NoSQL.

This year at NoSQL Now! 2013, I am teaching the tutorial Introduction to Hadoop Ecosystem. It is a great starting point for newcomers to Hadoop. In this fast-paced tutorial we provide a rapid immersion into Big Data with the Hadoop ecosystem. We start with an introduction to Big Data and the MapReduce algorithm. We continue with the Hadoop framework and show the ways to interact with the Hadoop file system and the cluster (a small sketch of the file system API follows below). We then introduce Hive and Pig, the higher-level interfaces for processing data, and HBase, the distributed, columnar database. At the end, we explore some exciting new projects in the Hadoop ecosystem. You will leave with a good understanding of what Hadoop is and how to start using it in your projects.
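
To give a flavor of that file system interaction, here is a small sketch against Hadoop’s Java FileSystem API; the local and HDFS paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS (both paths are hypothetical).
    fs.copyFromLocalFile(new Path("/tmp/events.log"),
                         new Path("/user/demo/events.log"));

    // List the target directory to confirm the upload.
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(status.getPath() + "  (" + status.getLen() + " bytes)");
    }
    fs.close();
  }
}

The same operations are also available interactively through the hadoop fs shell commands.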

Join us for NoSQL Now! 2013, August 20-22 in San Jose. Check out the Introduction to Hadoop Ecosystem tutorial and join us for many exciting presentations and discussions.

4/19/13

Tutorial at EDW 2013: Introduction to Hadoop and Big Data Technologies


For a long time, relational databases have been the only technology for most data architects. In the last couple of years, we have seen the emergence of Big Data, of Hadoop as a dominant way to process it, and the rise of non-relational stores, collectively known as NoSQL databases.

The tutorial I will be teaching at EDW 2013 aims to provide a practical overview of the key approaches, technologies, and systems in this growing space. Regardless of whether you end up applying these technologies, a data architect must understand them in order to make the best choices for data processing.

Introduction to Hadoop and Big Data Technologies


This tutorial is designed to provide a rapid immersion into Big Data with Hadoop. We start with an introduction to the Hadoop cluster and teach the ways to interact with the Hadoop file system and the cluster. We introduce Hive and Pig, popular higher-level interfaces for managing data in the Hadoop system. We conclude with an overview of related NoSQL stores. Upon completion, attendees will understand:
  • Big Data concepts and technologies 
  • MapReduce concepts 
  • The Hadoop file system 
  • Hive and Pig for productive data management and development 
  • NoSQL Stores 
More about the tutorial at the EDW 2013 site.

3/16/13

IBM Champion 2013

I have received the news that I was awarded the title of IBM Champion for 2013 - for the fifth year in a row. IBM awards this title to non-IBMers for "exceptional contributions to the technical community". Thank you!

It is a pleasure to work with a motivated and inspiring community of professionals building better software for our "smarter planet".

12/2/12

Experiences with Big Data, NoSQL, Integration with Conventional Systems and more… Interview at IOD 2012


While at IBM’s Information on Demand 2012 Conference, between the talks and client meetings, I was interviewed by IBM’s developerWorks channel. You can see the interview here:



11/5/12

Developing High Performance Database Applications with pureQuery and IBM Data Studio


NoSQL data stores are in focus when it comes to high performance data applications. However, for many organizations relational databases remain the mainstay of the infrastructure, and we need to get them executing efficiently as well. In recent years, the enthusiasm around various object-relational mapping tools has been partly replaced with healthy skepticism. Explicit control over SQL has become an important feature, especially when we are dealing with established databases that were not created from scratch by developers.

When working in a DB2 environment, pureQuery is a particularly attractive solution through its combination of explicit SQL control, high performance, and excellent tooling. This year I will be speaking on pureQuery at the IBM Information on Demand 2012 conference in Las Vegas.
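
To give a flavor of that combination, here is a hedged sketch of pureQuery’s inline style, with the SQL stated explicitly and the results mapped onto a simple bean. The package names and method signatures here are quoted from memory and may differ between pureQuery Runtime versions, so treat them as assumptions; the query targets the EMPLOYEE table of DB2’s SAMPLE database.

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.List;

// Assumed pureQuery Runtime classes; verify against your version's javadoc.
import com.ibm.pdq.runtime.Data;
import com.ibm.pdq.runtime.factory.DataFactory;

public class PureQuerySketch {

  // Simple bean mapped onto the result columns by name.
  public static class Employee {
    public String empno;
    public String lastname;
  }

  public static void main(String[] args) throws Exception {
    Connection con = DriverManager.getConnection(
        "jdbc:db2://dbhost:50000/SAMPLE", "db2user", "secret"); // placeholder connection
    Data data = DataFactory.getData(con);

    // The SQL stays an explicit, visible string: nothing generated behind our back.
    List<Employee> employees = data.queryList(
        "SELECT EMPNO, LASTNAME FROM EMPLOYEE WHERE WORKDEPT = ?",
        Employee.class, "D11");

    for (Employee e : employees) {
      System.out.println(e.empno + " " + e.lastname);
    }
    con.close();
  }
}

Because the SQL is plainly visible, it can be reviewed and tuned by a DBA, and pureQuery’s static binding can even bind it to DB2 packages ahead of time.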

Update: The slides of the talk are now on SlideShare, but the really interesting part of the talk, not captured in the slides, is the hands-on demo, in which we build a DB2-backed application in a matter of minutes and then profile and optimize SQL performance.