Monday, March 10, 2014

BigData Discussion - MapReduce vs ETL tools? Series 2

I thought I would take the MapReduce vs ETL tools discussion to the next level and look at the initiatives IBM is taking. IBM DataStage already provides integration for BigData processing within Hadoop: with Information Server 9.1, it can launch an Oozie workflow within Hadoop that processes unstructured data and returns the results to your ETL workflow. As mentioned in the previous blog, custom coding is not the way forward; leveraging tools like DataStage lets developers integrate BigData into their ETL workflow using the same tooling. Here is the link to Information Server 9.1.

Using IBM's BigInsights Enterprise Hadoop platform, structured data from an external RDBMS can be brought into the BigInsights cluster and processed next to your unstructured data. After processing, the results can be stored at the desired location. BigInsights 2.1 enables developers to use BigSQL, an ANSI SQL interface to Hive, HBase, HDFS, CSV (delimited, sequence files), and JSON. Existing reporting and BI tools can use it too, as can Data Analysts who don't know MapReduce or Pig. Data Analysts can also use the BigSheets functionality, which gives them a spreadsheet-like interface with a good number of spreadsheet functions. You can refer to the features of BigInsights 2.1 here.
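To make the idea of querying structured and unstructured data side by side with plain SQL more concrete, here is a minimal sketch. It does not use BigSQL or BigInsights at all: Python's built-in sqlite3 module stands in for the SQL layer, and the customer table, the log format, and the names CUSTOMERS, LOG_LINES and parse_events are all assumptions made up for illustration.

```python
import re
import sqlite3

# Hypothetical structured data, as it might arrive from an external RDBMS.
CUSTOMERS = [(1, "Acme"), (2, "Globex")]

# Hypothetical unstructured data: raw log lines, one of them malformed.
LOG_LINES = [
    "2014-03-10 09:00:01 customer=1 action=login",
    "2014-03-10 09:05:12 customer=2 action=purchase",
    "2014-03-10 09:07:45 customer=1 action=purchase",
    "malformed line without fields",
]

LOG_PATTERN = re.compile(r"customer=(\d+) action=(\w+)")


def parse_events(lines):
    """Turn raw log lines into (customer_id, action) rows, skipping bad lines."""
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            yield int(match.group(1)), match.group(2)


# sqlite3 is only a stand-in here for a SQL interface over the cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE events (customer_id INTEGER, action TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", CUSTOMERS)
conn.executemany("INSERT INTO events VALUES (?, ?)", parse_events(LOG_LINES))

# An analyst who knows SQL, but not MapReduce or Pig, can now join both sources.
purchases = conn.execute(
    """
    SELECT c.name, COUNT(*) AS purchases
    FROM customers c
    JOIN events e ON e.customer_id = c.id
    WHERE e.action = 'purchase'
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

print(purchases)
```

The point of the sketch is only that once both kinds of data sit behind one SQL interface, the analyst writes an ordinary join and does not need to care where each table came from.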

Tooling lets a new developer focus on the actual requirements rather than on understanding multiple functions, scripts, and utilities that have little or no documentation and may or may not follow any standards. With these tools, developers can start contributing immediately, and ramp-up time is greatly reduced.

Based on my interactions and work with customers, the MapReduce customer base is primarily using Hadoop to augment the existing data warehouse environment, with no exploration of replacing it. It enables them to process unstructured and structured data at scale. Of course, if existing ETL tooling vendors do not move towards innovative business propositions and create a seamless Hadoop and ETL infrastructure, Hadoop certainly has the potential to overcome many of the limitations of traditional ETL/ELT tools. Organizations are exploring ways to remove ETL bottlenecks, and the only challenge they have is that Hadoop in its current form is not a complete ETL solution. While it offers powerful utilities and massive horizontal scalability, it does not provide all the functionality and capabilities required to deploy ETL/ELT. Existing vendors are already moving towards filling this gap, and it needs to happen soon.

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Monday, March 3, 2014

BigData Discussion - Is Hadoop (MapReduce) replacing ETL tools? Series 1

Around 4 years back, in my other blog, I was discussing ETL vs ELT and why integration tools like DataStage and suites like Information Server remain relevant. With the evolution of Hadoop, people are asking whether it is going to replace ETL tools. It is the circle of life, but we cannot really replace something that exists with pieces that are spread across and not even standardized. To avoid saying "it depends", let me simply say "No": Hadoop/MapReduce is not going to replace ETL tools, though it might force them to focus more on innovation in their core frameworks, which has been out of focus for quite some time.

No doubt Hadoop is a powerful data management framework, with MapReduce providing the required aggregation capabilities across massive structured and unstructured data sets. We can definitely perform ETL via custom coding, but that does not replace traditional ETL tools. Custom coding takes us back more than a decade, to when tools like DataStage started changing the industry. I do agree that Hadoop and MapReduce can complement existing ETL tools or replace core layers to improve performance. For customers, these still remain ETL tools with lots of built-in functionality for data cleansing, alignment, modeling, and transformation, which no one wants to replace with custom-managed code in the future. We need to be progressing rather than moving backward in history. Where Hadoop/MapReduce can help is in processing very large datasets, and if we merge its capabilities with traditional engines, next-generation ETL tools are going to be far stronger, with seamless processing of traditional and large data. MapReduce can also be used for complex processing or unstructured data.
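For readers who have not written a MapReduce job by hand, here is a rough idea of what "aggregation via custom coding" involves. This is a single-process Python sketch of the map, shuffle and reduce steps, not Hadoop code; the record format and the names RECORDS, map_phase, shuffle, reduce_phase and run_job are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: one "region,amount" sales record per line.
RECORDS = [
    "east,100.0",
    "west,250.5",
    "east,40.0",
    "north,10.0",
    "west,49.5",
]


def map_phase(line):
    """Emit (key, value) pairs for a single input record."""
    region, amount = line.split(",")
    yield region.strip(), float(amount)


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate all values collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_phase(record) for record in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


totals = run_job(RECORDS)
print(totals)
```

Even this toy version leaves out cleansing, error handling, and metadata, which is exactly the built-in functionality an ETL tool gives you without hand-written code.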

Traditional ETL vendors are already enhancing their tools to run on Hadoop and take advantage of the processing and cost benefits of the natively parallel Hadoop environment. This will help them with large files that are processed on the Hadoop cluster. We might move away from Hadoop to something else in the future; data processing requirements will change, but integration requirements will remain the same. You are not going to throw away data created over decades, or spend another decade reprocessing it, by which time there will be another innovation. From a business perspective, you are going to have both traditional ETL tools and Hadoop/MapReduce, and each of them has a role to play.

I will share more on this topic in the coming months, looking at the innovation happening in this area across various vendors and the kind of process optimization required in this field.


Saturday, March 1, 2014

ETL - Performing Predictive Analytics in the flow

Until recently, ETL tools were considered only as a way to create the EDW, which is then used to perform various analytics later on. Tools like IBM SPSS are capable of predicting with confidence, leading to smart decisions. With the change in industry dynamics and the growth of BigData, real-time and "now" are taking precedence over "in the future".
Currently, the standard for predictive analytics is a traditional batch flow: data is extracted from the EDW into different marts, predictive models are applied to obtain valuable insights, and the results of the analytics are fed to decision makers. Here the analytic model is built once and applied to large amounts of data in batch. This requires repeated I/O and transformation before the results are available to the end application. A more efficient method is to run the analytic models during the import (or export) of new data into (from) the warehouse.

Integration between ETL and predictive analytic tools opens up an entirely different set of business opportunities; the existing InfoSphere DataStage and SPSS integration already provides this capability. With this approach, the analytical model can be applied to the data as it is ingested into the warehouse or mart, and the output can be stored directly in the resulting tables. Once the statistical model output is available in the data warehouse or data mart, business applications such as reporting tools and marketing campaigns can readily use this data without a separate analytic step.
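The difference between scoring in a separate batch step and scoring in the flow can be sketched in a few lines. This is not the DataStage and SPSS integration itself, only an illustration of the pattern in plain Python: the hand-set logistic coefficients in MODEL, the customer fields, and the customer_scores table are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import math
import sqlite3

# Hypothetical coefficients standing in for a model built once, elsewhere.
MODEL = {"intercept": -3.0, "monthly_spend": 0.02, "support_calls": 0.5}


def score(row):
    """Apply the (assumed) logistic model to one record during the load."""
    z = (
        MODEL["intercept"]
        + MODEL["monthly_spend"] * row["monthly_spend"]
        + MODEL["support_calls"] * row["support_calls"]
    )
    return 1.0 / (1.0 + math.exp(-z))


def extract():
    """Hypothetical source rows arriving through the ETL flow."""
    yield {"customer_id": 1, "monthly_spend": 50.0, "support_calls": 0}
    yield {"customer_id": 2, "monthly_spend": 100.0, "support_calls": 4}


def load_with_scores(conn, rows):
    """Score each row as it is ingested and store the result with the data."""
    conn.execute(
        "CREATE TABLE customer_scores ("
        "customer_id INTEGER PRIMARY KEY, monthly_spend REAL, "
        "support_calls INTEGER, churn_score REAL)"
    )
    conn.executemany(
        "INSERT INTO customer_scores VALUES (?, ?, ?, ?)",
        (
            (r["customer_id"], r["monthly_spend"], r["support_calls"], score(r))
            for r in rows
        ),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse or mart
load_with_scores(conn, extract())

# Reporting can read the scores directly; no separate analytic pass is needed.
scored = conn.execute(
    "SELECT customer_id, churn_score FROM customer_scores ORDER BY customer_id"
).fetchall()
print(scored)
```

Because the score is written in the same pass as the row, the data is read and transformed once, which is the saving in repeated I/O that the batch approach cannot offer.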
