Your shopping cart is empty!
Think of GE, the company has over 100+ years of history & presence in almost all the industries. Over these years company’s management style has been changed from book keeping to SAP. This transition was not a single day transition. In transition, from book keeping to SAP, they used a wide array of technologies, ranging from mainframes to PCs, data storage ranging from flat files to relational databases, programming languages ranging from Cobol to Java. This transformation resulted into different businesses, or to be precise different sub businesses within a business, running different applications, different hardware and different architecture. Technologies are introduced as and when invented & as and when required.
This directly resulted into the scenario, like HR department of the company running on Oracle Applications, Finance running SAP, some part of process chain supported by mainframes, some data stored on Oracle, some data on mainframes, some data in VSM files & the list goes on. If one day company requires a consolidated reports of assets, there are two ways.
Obviously second approach is going to be the best.
Now to fetch the data from different systems, making it coherent, and loading into a Data Warehouse requires some kind of extraction, cleansing, integration, and load. ETL stands for Extraction, Transformation & Load.
ETL Tools provide facility to Extract data from different non-coherent systems, cleanse it, merge it and load into target systems.
This is a typical question which will be asked in interview. Why we need an ETL tool? Why can’t we use a programming language to build a data warehouse?
ETL tools like informatica, datastage, abinitio etc are becoming popular in building a data warehouse. ETL tools are used because of the following reasons:
ETL tools can connect and read data from multiple sources like relational databases, flat files, xml files, cobol files etc. The capability of connecting and reading data from different sources is readily built in ETL tools. As an user you don’t need to write a code for this. If you have used a programming languages, you have to write your own code for connecting to multiple sources and reading.
ETL tools comes with built-in error handling capability. With programming languages again you have to take care of error handling by identifying each and every possible error.
You can process data in parallel with ETL tools easily. Processing data in parallel with programming languages may result in thread safety issues and might lead to data corruption.
For any Datawarehouse project, the common requirements everyone has w.r.t. Extract, Transform and Loading of data is listed below:
1. To be able to extract data from heteregenous sources - files, legacy databases, oracle, db2,ms sql server, mainframes etc . To extract this as a job that can be run/scheduled based on some time interval.
2. To fetch the data in chunks/batches(data retrieval and commit in chunks) so we do not get in to complex way of data processing, all at one time, and also to avoid memory contention issues, memory overflow issues etc
3. To troubleshoot the issues easily i.e., availability of standard error finding process through logs, debugging option during the runtime of job, savepoints , flexibility to see error summary or error in detail etc.
ETL tools are generally needed for enterprise’s “data consolidation” and “data integration” needs. It is very true that ETL tools are not the only way to achieve needed objective. But they provide rapid project development and easy code maintenance.
Data warehouse is typical example of data consolidation, where you need to consolidate many systems to see overall performance of a business, product or service.
Enterprise applications integration is one of the potential ETL domains, Here etl tools enables you in such way that your applications can communicate to each other. In simple words “you can enable two applications to share data / status”.
THe key factors to consider before purchasing an ETL tool.
Ascertain ETL tool’s ability to handle multiple data formats
Third generation ETL tools are capable of handling structured data from different data sources including legacy systems such as mainframe and UNIX based application systems, spreadsheets, XML format, proprietary databases, and others. But managing unstructured data is still a challenge for ETL tools.
The next generation ETL tools would have the ability to handle structured and unstructured data simultaneously as we access huge streams of valuable data from social networking media. Handling streaming data is another challenge for most of the ETL tools.
1) Consider branded ETL tools to save time
Today, numerous business intelligence (BI) vendors are coming up with BI and analytics packages, which contain pre-built data models and ETL modules to load different types of data onto them. These ETL tool packages accelerate the delivery of BI reports. If the package fits the needs of a business user and the required data is easily available, the module can be implemented in a couple of months and the user can start generating reports. On the other hand, the traditional way of building custom BI systems takes at least 9 to10 months.
2) Conduct data profiling at source data level
Data profiling with an ETL tool is an important activity to audit and understand what is being captured into source systems through the business process. A syntactic check validates if the data is meeting technical constraints. whereas a semantic check audits if the data is abiding by the inherent business rules and policies. Data health is quite important for designing and implementing an ETL process or implementing a new platform.
As a best practice, data should be audited and corrected at the source data level rather than the data warehouse, reporting or new application platform.
3) Check if ETL tool supports data quality/ cleansing
Integration of data quality activity with ETL functionality is the most sought after feature in any ETL tool. However, many ETL tools do not support advanced data quality and cleansing features. ETL tools with advanced data quality audit and cleansing features will have better applicability in the modern BI applications where streaming and historical data are required to be integrated seamlessly.
4) Look out for meta data support
Meta data management is a central activity for any ETL and reporting project. Most of the ETL tools support meta data capturing and maintenance features. The main challenge in current meta data management facilities of ETL tools is sharing a meta data at different layers of an information management (IM) system. There are few independent meta data management systems, such as Superglue from Informatica and MetaCenter of Data Advantage Group that can be deployed to create a central meta data repository for IM initiatives.
5) Ensure version control of the ETL maps
An ETL tool should support version controlling mechanism though which the developer can maintain different versions of the source code, without overwriting the original code. The ETL tool should also prevent multiple developers from working on the same extract so that at a time only one can make changes in the ETL work flow.
Almost all ETL tools have version control features, but they are not advanced enough to manage complex project delivery where ETL is being developed with varied groups in different time zones. Integration of advanced software version control tools like CVS and Code Co-op is a challenging process, because they are good to detect and control software versions in a distributed environment, whereas ETL version control requires not only configuration management of ETL codes but also of meta data and embedded business rules.
Confirm if ETL tool supports data manipulation
Most of the current generation ETL tools support data manipulation in and out of target database. In data integration projects, data extracted from multiple data sources are manipulated outside of the target database to fit into a uniform format. In a historical data warehouse, ETL needs to manipulate existing record in the target database before inserting a new instance of the same record. Current ETL tools are capable of managing data manipulation inside and outside of the target database.
There are more than 75 tools on the market that purport to have some ETL function. These may be classified according to function, engine type and development environment.
The tools fall into several general categories:
EtL Tools – These "small t" tools emphasize the extraction and loading processes and should actually be referred to as data migration tools instead of full-function ETL processors.
eTL or ETl Tools – These "small e" or "small l" tools typically accept only a specific input or output type, such as a flat file source or a specific database format, but offer fairly robust transformation functions within the processing engine.
eTl Tools – These "capital T" tools perform the data transformation step relatively well, but may lack efficient connectivity to many of the common data formats that may be encountered.
ETL Tools – These "complete" ETL tools provide a rich mix of functionality and connectivity, but may be significantly more expensive than tools found in the other categories. For extremely complex projects or those attempting to process massive amounts of data, these tools may present the only true option for ensuring success in the ETL phase of the project. In other cases, this class of tool may offer features that simply are not required in the existing environment.
The engine type classification segments the tools by how the developed ETL processes are executed. Typically, the tools encountered fall into one of two categories: server engine or client engine. The server engine allows execution of multiple concurrent jobs from more than one developer. Server engines generally take advantage of multiple CPUs and are designed to coordinate and manage the execution of multiple simultaneous routines.
The client engine is simpler and assumes that the ETL routines are executed on the same machine as they are developed. Concurrency for multiple jobs is limited, if allowed at all, and client engine tools often do not support scaling across multiple CPU machines. Client engines are significantly cheaper to purchase due to their more limited functionality and scalability.
Development environments are typically split two ways: GUI- based or code-based tools. Code-based tools are the most familiar and may not be considered "tools" independent of the language they represent. For example, Perl can be used as a code-based ETL tool, but it is also a more generalized programming language. The embedded transactional code languages within common database platforms (e.g., PL/SQL with Oracle, Transact*SQL with Microsoft SQL Server, etc.) may also provide ETL functionality, but are not limited to this capability. Aside from general programming languages, several tools on the market utilize a custom- scripting language developed explicitly for the optimization of ETL routines.
GUI-based ETL tools have been on the market for at least six years. The purpose of the GUI is to remove the coding layer for the developer and allow generation of ETL routines without requiring the mastery of any particular coding language. GUI tools are also useful in that they provide some self-documentation about the job flow just from the layout and positioning of the graphical elements.
Now that you have some background on the types of tools available, what criteria should you use to help decide which tool is best for you? We will focus the discussion around the five C's: complexity, concurrency, continuity, cost and conformity.
Complexity is generally evaluated through the following series of questions:
Concurrency evaluates both the number of developers and number of simultaneous processes that the tool must support:
Continuity involves two distinct subjects. Continuity establishes the scalability of the tool to meet future demands from system growth, and it evaluates how reliable the product is under unusual circumstances:
Cost represents the real purchase cost and perhaps the hidden support costs associated with each product:
Conformity outlines how the tool behaves against existing architectural limitations or requirements:
You will find that ETL tools can range in cost from free (freeware) to more than several hundred thousand dollars. As such, your decision delivers a widely ranging and potentially lasting impact on your systems architecture. If you are not already using a tool, you obviously first must compare the decision to purchase a tool against the "keep doing it like we're doing it today" option. Very few IT departments are able to meet the current and future development needs of the business users without substantial growth. One way to offset the need for additional headcount is to take advantage of the efficiency that modern ETL tools can bring if chosen and employed effectively.
To be blunt, you may encounter the feeling of buyer's remorse with almost any software purchase you make. Unfortunately, there is no single right answer in any tool choice today (despite what vendors tell you). As such, your goal should be to understand in advance where the expected weaknesses will occur with your chosen solution and be prepared to mitigate those consequences.
Your first order of business in outlining your ETL selection is to determine if any critical business factors exist that might help limit the number of tools you will consider. Examples of these factors may include:
The specific objective is to identify conditions that weed out solutions that simply will not be acceptable within your given architecture. Ideally, this first step should identify a simple list of the top six or seven tools that might meet the majority of your requirements. An exhaustive review of the tools is not required at this point.
After winnowing out the clearly unacceptable choices, your next step in the process is to outline what criteria you will use in the evaluation. You may use the questions in the earlier sections of this article as a guideline; but you should be fairly specific and outline specific data sources with which you require connectivity, hardware expectations, operating system options, etc.
After building the list, generate a matrix with your decision criteria as rows and the tool choices as the columns of the matrix. Then rank each of the tools using publicly available information (Web sites, publications, etc.) as much as possible. If you do not feel you have enough information available to adequately answer your questions, contact a vendor representative to obtain the necessary information to complete the matrix. Note that managing more than two to three vendor presentations and handling appropriate follow-up can be an extremely time-consuming task. For this reason, you may wish not to begin onsite formal vendor presentations until you've created your "short list" of the two to three finalists for the selection. This approach will allow you to truly focus your energies on understanding why one tool is specifically better than just one or two others. With a wider field, you will generally find many "triangular" relationships (tool A is better than tool B at criteria 1, tool B is better than tool C at criteria 2, tool C is better than tool A at criteria 3, etc.) that confuse the overall picture.
Eventually, you should arrive at a reasonable conclusion about which tool will best meet your needs. You should definitely try to obtain a customer reference from each vendor, preferably another company doing approximately the same function as your organization (though for competitive reasons this may be hard to obtain). Focus on your key technical requirements to ensure that the customer reference you obtain is relevant and useful to you. A customer reference accessing similar systems and operating on similar hardware and operating systems is probably more applicable than another similar industry reference on a completely different platform.
In the end, of course, it all comes down to putting ink to paper and taking on the responsibility for this new project. Much like having a baby, you won't know for some time how decisions you make today turn out; and you'll inevitably have second thoughts along the way (Was I ready for this?). However, with a solid decision-making process behind you, you can expect to grow and succeed with your selection. At the final bell, it is the results that count!