Monday, March 3, 2014

BigData Discussion - Is Hadoop (MapReduce) replacing ETL tools? Series 1

Around four years ago, on my other blog, I discussed ETL vs. ELT and why integration tools like DataStage and suites like Information Server remain relevant. With the evolution of Hadoop, people are now asking whether it is going to replace ETL tools. Change is the circle of life, but we cannot really replace something established with pieces that are scattered and not even standardized. To avoid saying "it depends", let me simply say "No": Hadoop/MapReduce is not going to replace ETL tools. It might, however, force them to focus more on innovation in their core frameworks, which have been out of focus for quite some time.

No doubt Hadoop is a powerful data management framework, with MapReduce providing all the required aggregation capabilities across massive structured and unstructured data sets. We can definitely perform ETL via custom coding, but that does not replace traditional ETL tools. Custom coding takes us back more than a decade, to the time when tools like DataStage started changing the industry. I do agree that Hadoop and MapReduce can complement existing ETL tools, or replace their core layers to improve performance. For customers, though, these remain ETL tools with lots of built-in functionality for data cleansing, alignment, modeling, and transformation, which no one wants to replace with custom-coded management in the future. We need to move forward, not backward in history. Where Hadoop/MapReduce can help is in processing very large datasets, and if we merge those capabilities with traditional engines, next-generation ETL tools are going to be far stronger, with seamless processing of both traditional and large data. MapReduce can also be used for complex processing or for unstructured data.
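To make the "custom coding" point concrete, here is a minimal sketch in plain Python of what a hand-written, MapReduce-style ETL step looks like. It does not use Hadoop at all; the input records and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are my own illustrative assumptions, not any vendor's API. Notice that even trivial cleansing (trimming, case normalization, rejecting bad records) has to be written by hand, which is exactly what ETL tools give you out of the box.

```python
from collections import defaultdict

# Hypothetical raw input: "region,amount" records containing typical dirt.
RAW_LINES = [
    "east, 100",
    "West,250",
    "EAST,50",
    "west,not_a_number",  # bad amount, has to be rejected by hand
    "",                   # blank line
]


def map_phase(line):
    """Extract and transform: parse one line, cleanse it, emit (key, value) pairs."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2:
        return []  # malformed record
    region, amount = parts
    try:
        return [(region.lower(), float(amount))]
    except ValueError:
        return []  # non-numeric amount


def shuffle(pairs):
    """Group values by key, which a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate: total amount per region."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(run_job(RAW_LINES))
```

On a real cluster the framework would distribute the map and reduce calls and handle the shuffle for you, but the cleansing and aggregation logic would still be your own code to write, test, and maintain.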

Traditional ETL vendors are already enhancing their tools to run on Hadoop and take advantage of the processing and cost benefits of the natively parallel Hadoop environment. This helps with large files that are processed on the Hadoop cluster. We might move away from Hadoop to something else in the future; data processing requirements will change, but integration requirements will remain the same. You are not going to throw away data created over decades, or spend another decade reprocessing it, by which time another innovation will have arrived. From a business perspective, you are going to need both traditional ETL tools and Hadoop/MapReduce; each of them has a role to play.

I will share more on this topic in the coming months, as I watch the innovation happening in this area across various vendors and the kind of process optimization this field requires.

-Ritesh
Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions

Comments:

  1. Thanks for the post, sir. But I would like to know how we can use an ELT tool with these kinds of databases, as nowadays I hear about using DataStage with a Hadoop DB. I know DataStage supports a new Big Data stage, but can we achieve with it what we can in Hadoop?

  2. Very nice post! In the IT landscape, ETL (extract, transform, load) processes have long been used for building data warehouses and enabling reporting systems. Using business intelligence (BI) oriented ETL processes, businesses extract data from highly distributed sources, transform it through manipulation, parsing, and formatting, and load it into staging databases. From this staging area, data summarizations and analytical processes then populate data warehouses and data marts. More at www.youtube.com/watch?v=1jMR4cHBwZE
