Informatica in Big Data

The presentation above covers the first topic in my series of talks, Ten Tools for Ten Big Data Areas. The idea for this series came from the ten legendary artifacts of ancient China, when people were honored to possess one of them.

For data integration in big data, I pick Informatica, which used to be a public company traded on NASDAQ as INFA. In the middle of this year, Informatica announced the successful completion of its acquisition by a company controlled by the Permira funds and the Canada Pension Plan Investment Board (CPPIB). Additionally, Informatica announced that Microsoft Corporation and Salesforce Ventures had agreed to become strategic investors in the company alongside the Permira funds and CPPIB. The acquisition is valued at approximately $5.3 billion, with Informatica stockholders receiving $48.75 in cash per share.

In the big data area especially, Informatica has a leading product called Informatica Big Data Edition - Developer. This is a brand new tool in the Informatica family, with a new user interface based on Eclipse. It is a single tool for developing all ETL job components. In Informatica Developer, there is no longer a session. Instead, there is a new concept called an application. We can add mappings or workflows to an application and then deploy and run that bulk of ETL jobs as one application.
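To make the application concept concrete, here is a minimal sketch in plain Python (purely conceptual, not Informatica's actual API; all class and object names are hypothetical) of mappings and workflows bundled into a single deployable unit:

```python
# Conceptual sketch only -- not Informatica's real API.
# It models the Developer-era idea: mappings and workflows are
# bundled into an "application", which becomes the unit of
# deployment and execution (replacing PowerCenter sessions).

class Mapping:
    """A single ETL mapping (source -> transformations -> target)."""
    def __init__(self, name):
        self.name = name

class Workflow:
    """An ordered group of mappings with control logic."""
    def __init__(self, name, mappings):
        self.name = name
        self.mappings = mappings

class Application:
    """The deployable unit: holds mappings and workflows."""
    def __init__(self, name):
        self.name = name
        self.objects = []

    def add(self, obj):
        self.objects.append(obj)

    def deploy_and_run(self):
        # In the real tool, deployment targets a Data Integration
        # Service; here we just iterate over the bundled objects.
        for obj in self.objects:
            print(f"running {type(obj).__name__}: {obj.name}")

app = Application("daily_etl")
app.add(Mapping("load_customers"))
app.add(Workflow("orders_flow",
                 [Mapping("stage_orders"), Mapping("load_orders")]))
app.deploy_and_run()
```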

The main advantage of Informatica Developer is that it converts the ETL logic/mapping into a Hive query and executes it on top of the Hadoop cluster. For example, you can even push non-Hadoop-related jobs to run on top of Hadoop. This not only lets ETL jobs leverage the computing resources of Hadoop but also removes the extra budget for a dedicated ETL server cluster, as was needed in the Informatica PowerCenter era. In addition, this design has lots of potential as Hive evolves in the big data ecosystem, such as Hive on Spark. In the future, Informatica Developer should be able to leverage more distributed computing frameworks beyond Hadoop, such as Spark, for better performance.
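To illustrate the pushdown idea, here is a minimal sketch of my own (not Informatica's actual generated SQL; the table and column names are hypothetical) showing how a simple filter-and-aggregate mapping could be compiled into a single Hive query, so the heavy lifting runs on the cluster rather than on an ETL server:

```python
# Sketch of mapping-to-Hive pushdown. A declarative mapping
# (source, filter, aggregate, target) is compiled into one
# HiveQL statement that Hadoop executes directly.

mapping = {
    "source": "sales_raw",            # hypothetical Hive table
    "filter": "amount > 0",           # filter transformation
    "group_by": "region",             # aggregator transformation
    "aggregate": "SUM(amount) AS total_amount",
    "target": "sales_by_region",      # hypothetical Hive table
}

def to_hive_query(m):
    """Compile the mapping dict into one INSERT ... SELECT."""
    return (
        f"INSERT OVERWRITE TABLE {m['target']}\n"
        f"SELECT {m['group_by']}, {m['aggregate']}\n"
        f"FROM {m['source']}\n"
        f"WHERE {m['filter']}\n"
        f"GROUP BY {m['group_by']};"
    )

print(to_hive_query(mapping))
```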

However, this tool, still new in the Informatica family, has its limitations. Below are some I have come across recently.

  • Does not support Hive tables in complex formats, such as Avro
  • Does not support writing into bucketed tables (see the sketch after this list)
  • Does not support using parameters at the row level or in complex data object paths
  • Does not support returning the target's successful or failed rows from a mapping
  • Cannot run a workflow or application directly in the Developer tool
  • Errors and exceptions are not reported at run time when running in Hive mode
  • Overall reliability needs improvement, e.g., OOM errors and exceptions in data adapters
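As a concrete example of the bucketed-table limitation, a common workaround is to have the ETL tool write into a plain staging table and let Hive itself redistribute the rows into buckets. Below is a minimal sketch that only assembles the HiveQL statements; all table names are hypothetical, and the statements would be submitted through any Hive client (beeline, PyHive, etc.):

```python
# Sketch of a workaround for the bucketed-table limitation:
# write into a plain (non-bucketed) staging table first, then
# let Hive redistribute the rows into the bucketed target.
# All table names are hypothetical.

staging_ddl = """
CREATE TABLE IF NOT EXISTS sales_staging (
  region STRING,
  amount DOUBLE
) STORED AS ORC;
"""

bucketed_ddl = """
CREATE TABLE IF NOT EXISTS sales_bucketed (
  region STRING,
  amount DOUBLE
) CLUSTERED BY (region) INTO 8 BUCKETS
STORED AS ORC;
"""

# The ETL tool loads sales_staging; Hive then redistributes
# into buckets. On older Hive versions, also run:
#   SET hive.enforce.bucketing = true;
redistribute = """
INSERT OVERWRITE TABLE sales_bucketed
SELECT region, amount FROM sales_staging;
"""

for stmt in (staging_ddl, bucketed_ddl, redistribute):
    print(stmt.strip())
```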

As alternatives, there are other ETL tools to choose from. Talend and Pentaho both provide big data versions of their tools, but those versions are almost the same as the regular ones except that common big data connectors are shipped with them. Therefore, these tools read or write data to and from Hadoop rather than running on top of Hadoop, as the sketch below illustrates.
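Here is a minimal sketch of that connector style, assuming the third-party hdfs Python package and a hypothetical namenode URL and file paths: the data leaves the cluster, is transformed on the ETL host, and is shipped back, whereas pushdown keeps the computation on the cluster.

```python
# Connector-style ETL sketch: the data is pulled out of Hadoop,
# transformed on the ETL host, and written back. Contrast with
# pushdown, where a generated Hive query keeps the work on the
# cluster. The namenode URL and file paths are hypothetical.

from hdfs import InsecureClient  # third-party 'hdfs' package

client = InsecureClient("http://namenode:50070", user="etl")

# Pull the file off the cluster, line by line.
with client.read("/data/in/sales.csv", encoding="utf-8",
                 delimiter="\n") as reader:
    rows = [line.split(",") for line in reader if line.strip()]

# Transform locally on the ETL host: keep positive amounts.
kept = [r for r in rows if float(r[1]) > 0]

# Ship the result back into HDFS.
client.write("/data/out/sales_clean.csv",
             data="\n".join(",".join(r) for r in kept),
             encoding="utf-8", overwrite=True)
```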

There is another new data flow tool that people may want to pay attention to: Apache NiFi, which supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. The company behind this tool, Onyara, is an early-stage startup acquired by Hortonworks in August of 2015.
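To give a feel for the directed-graph idea, here is a tiny conceptual sketch in plain Python (purely an illustration, not NiFi's actual API) of processors wired together for routing and transformation:

```python
# Conceptual sketch of a NiFi-style directed graph: each
# processor transforms or routes records and passes them on to
# its successor. This is an illustration, not NiFi's API.

def ingest():
    """Source processor: emit raw records."""
    yield {"type": "order", "amount": 42}
    yield {"type": "heartbeat"}

def route(records):
    """Routing processor: pass only business records through."""
    for r in records:
        if r.get("type") == "order":
            yield r

def transform(records):
    """Transformation processor: enrich each record."""
    for r in records:
        r["amount_cents"] = r["amount"] * 100
        yield r

def deliver(records):
    """Sink processor: hand records to a downstream system."""
    for r in records:
        print("delivered:", r)

# Wire the processors into a small directed graph and run it.
deliver(transform(route(ingest())))
```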