InfoSphere Information Server How To: BigData Discussion - Is Hadoop (MapReduce) replacing ETL tools? Series 1

About four years ago, on my other blog, I discussed ETL vs. ELT and why integration tools like DataStage and suites like Information Server remain relevant. With the evolution of Hadoop, people are now asking whether it is going to replace ETL tools. It is the circle of life, but we cannot really replace something established with pieces that are scattered and not even standardized. To avoid saying "it depends", let me simply say "No": Hadoop/MapReduce is not going to replace ETL tools, though it might force them to focus more on innovation in their core frameworks, which have been out of focus for quite some time.

No doubt Hadoop is a powerful data management framework, with MapReduce providing the required aggregation capabilities across massive structured and unstructured data sets. We can definitely perform ETL via custom coding, but that does not replace traditional ETL tools. Custom coding takes us back more than a decade, to the time when tools like DataStage started changing the industry. I do agree, though, that Hadoop can complement existing ETL tools or replace core layers to improve performance. For customers, they still remain ETL tools with lots of built-in functionality for data cleansing, alignment, modeling, and transformation, which no one wants to replace with custom-managed code in the future. We need to progress rather than move backward in history. Where Hadoop/MapReduce can help is in processing very large datasets, and if we merge its capabilities with traditional engines, next-generation ETL tools are going to be far stronger, with seamless processing of both traditional and large data. MapReduce can also be used for complex processing or for unstructured data.
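To make the custom-coding point concrete, here is a minimal sketch of a MapReduce-style aggregation written by hand. It is plain Python running in memory, not the Hadoop API, and the record layout (`region, amount`), the sample data, and every function name are assumptions made up for illustration. Notice that even the trivial cleansing rules have to be coded and maintained by us, which is exactly the work an ETL tool ships as built-in functionality.

```python
from collections import defaultdict


def map_phase(records):
    """Map: clean each raw line and emit (region, amount) pairs."""
    for line in records:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue  # hand-written cleansing: skip malformed rows
        region, amount = parts
        try:
            yield region.upper(), float(amount)
        except ValueError:
            continue  # skip rows whose amount is not numeric


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values of each key into a total."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records):
    """Chain the three phases, as a framework would do across a cluster."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical sample input, including two rows that need cleansing.
sample = ["east, 10.5", "west, 4", "East, 2.5", "bad row", "west, n/a"]
totals = run_job(sample)
print(totals)
```

On a real cluster the framework takes care of distribution and the shuffle step, but the map and reduce logic, including every cleansing rule, remains custom code.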

Traditional ETL vendors are already enhancing their tools to run on Hadoop and take advantage of the processing and cost benefits of the natively parallel Hadoop environment. This will help them with large files that are processed on the Hadoop cluster. We may well move away from Hadoop to something else in the future; data processing requirements will change, but integration requirements will remain the same. You are not going to throw away data created over decades, or spend another decade reprocessing it, by which time there will be yet another innovation. From a business perspective, you are going to need both traditional ETL tools and Hadoop/MapReduce; each of them has a role to play.

I will share more on this topic in the coming months, as I follow the innovation happening in this area across various vendors and the kind of process optimization required in this field.

-Ritesh

Disclaimer: The postings on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.