
ETL Need and Factors
 


Need for ETL Tools

Think of GE: the company has more than 100 years of history and a presence in almost every industry. Over these years the company's management systems have evolved from bookkeeping to SAP, and that transition did not happen in a single day. In moving from bookkeeping to SAP, GE used a wide array of technologies: hardware ranging from mainframes to PCs, data storage ranging from flat files to relational databases, and programming languages ranging from COBOL to Java. As a result, different businesses, or more precisely different sub-businesses within a business, ended up running different applications on different hardware and different architectures, with technologies introduced as and when they were invented and as and when they were required.

This leads directly to a scenario in which the HR department runs Oracle Applications, Finance runs SAP, part of the process chain is supported by mainframes, some data is stored in Oracle, some on mainframes, some in VSAM files, and the list goes on. If one day the company requires a consolidated report of its assets, there are two ways to produce it:

  • The first is completely manual: generate separate reports from the different systems and integrate them by hand.
  • The second is to fetch the data from the different systems and applications, build a data warehouse, and generate reports from it as required.
 

Obviously, the second approach is the better one.
Fetching the data from different systems, making it coherent, and loading it into a data warehouse requires some kind of extraction, cleansing, integration, and loading. ETL stands for Extraction, Transformation and Load.

ETL tools provide the facility to extract data from different, non-coherent systems, cleanse it, merge it, and load it into target systems.
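
As a purely illustrative sketch of what "making data coherent" involves, the Python snippet below merges two hypothetical extracts that describe assets in different layouts and date formats into one uniform target structure; every field name and format shown is invented for the example.

    # Sketch: merge two non-coherent extracts into one coherent target layout.
    # The source layouts, field names and date formats are hypothetical.
    from datetime import datetime

    hr_rows = [{"emp": "E01", "asset": "laptop", "issued": "03/11/2010"}]                     # DD/MM/YYYY dates
    finance_rows = [{"ASSET_ID": "F88", "ASSET_NAME": " TURBINE ", "BOOK_DATE": "2009-07-21"}]  # ISO dates

    def from_hr(row):
        # Transform an HR record into the common warehouse layout.
        return {"asset_name": row["asset"].strip().title(),
                "source_system": "HR",
                "recorded_on": datetime.strptime(row["issued"], "%d/%m/%Y").date()}

    def from_finance(row):
        # Transform a Finance record into the same layout.
        return {"asset_name": row["ASSET_NAME"].strip().title(),
                "source_system": "FINANCE",
                "recorded_on": datetime.strptime(row["BOOK_DATE"], "%Y-%m-%d").date()}

    # "Load": the cleansed, uniform records can now go into one target table.
    warehouse_assets = [from_hr(r) for r in hr_rows] + [from_finance(r) for r in finance_rows]
    print(warehouse_assets)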

 

ETL Tools

This is a typical interview question: why do we need an ETL tool? Why can't we use a programming language to build a data warehouse?

ETL tools such as Informatica, DataStage and Ab Initio have become popular for building data warehouses. ETL tools are used for the following reasons:

ETL tools can connect to and read data from multiple sources: relational databases, flat files, XML files, COBOL files and so on. This connectivity is built into the tools, so as a user you do not need to write code for it. With a programming language, you would have to write your own code to connect to each source and read from it.
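
To make this concrete, here is a minimal sketch of the hand-written connectivity code an ETL tool spares you. It uses only the Python standard library (sqlite3 and csv) as stand-in sources; the file names, table and query are hypothetical.

    # Hand-coded extraction from two dissimilar sources; an ETL tool
    # provides this connectivity out of the box.
    import csv
    import sqlite3

    def read_from_database(db_path, query):
        # Pull rows from a relational source as a list of dictionaries.
        conn = sqlite3.connect(db_path)
        try:
            conn.row_factory = sqlite3.Row
            return [dict(row) for row in conn.execute(query)]
        finally:
            conn.close()

    def read_from_flat_file(file_path):
        # Pull records from a delimited flat-file source.
        with open(file_path, newline="") as handle:
            return list(csv.DictReader(handle))

    # Every additional source type (XML, spreadsheets, mainframe files, ...)
    # needs another hand-written reader; this is the effort an ETL tool removes.
    assets = read_from_database("finance.db", "SELECT * FROM assets")   # hypothetical database
    employees = read_from_flat_file("hr_extract.csv")                   # hypothetical flat file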

ETL tools come with built-in error-handling capability. With a programming language, you again have to take care of error handling yourself by identifying and coding for each possible error.

You can easily process data in parallel with ETL tools. Processing data in parallel with a programming language can introduce thread-safety issues and may lead to data corruption.
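
The sketch below, again only illustrative, shows the kind of thread-safety detail (a lock around shared state) that you must manage yourself when parallelising transformations by hand; an ETL engine's parallel execution handles this internally. The transformation itself is a made-up placeholder.

    # Hand-rolled parallel transformation with a shared counter.
    import threading
    from concurrent.futures import ThreadPoolExecutor

    lock = threading.Lock()
    rows_processed = 0          # shared state across worker threads

    def transform_record(record):
        # Placeholder transformation: trim and upper-case every value.
        return {key: str(value).strip().upper() for key, value in record.items()}

    def worker(batch):
        global rows_processed
        transformed = [transform_record(record) for record in batch]
        with lock:              # without the lock, concurrent updates can be lost
            rows_processed += len(transformed)
        return transformed

    batches = [[{"id": i, "name": " ge "}] for i in range(4)]   # toy batches
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(worker, batches))
    print(rows_processed, results[0])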
 
For any data warehouse project, the common requirements with respect to extracting, transforming and loading data are listed below:
1. To extract data from heterogeneous sources (files, legacy databases, Oracle, DB2, MS SQL Server, mainframes, etc.), and to run or schedule this extraction as a job at some time interval.
2. To fetch the data in chunks or batches (retrieving and committing data in chunks), so that we avoid processing everything at once and avoid memory contention and memory overflow issues; see the sketch after this list.
3. To troubleshoot issues easily, i.e., to have a standard error-finding process through logs, a debugging option at job runtime, savepoints, and the flexibility to see an error summary or the errors in detail.
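
A minimal sketch of points 2 and 3 above, assuming hypothetical source and target tables: rows are fetched and committed in fixed-size batches, and each batch is logged so that failures can be traced.

    # Chunked extraction and load with per-batch commits and logging.
    import logging
    import sqlite3

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("etl_batch_load")

    def load_in_batches(source_db, target_db, batch_size=1000):
        source = sqlite3.connect(source_db)
        target = sqlite3.connect(target_db)
        batch_number = 0
        try:
            cursor = source.execute("SELECT id, amount FROM transactions")   # hypothetical source table
            while True:
                rows = cursor.fetchmany(batch_size)      # retrieve one chunk at a time
                if not rows:
                    break
                batch_number += 1
                target.executemany(
                    "INSERT INTO fact_transactions (id, amount) VALUES (?, ?)", rows)
                target.commit()                          # commit per chunk, not one huge transaction
                log.info("batch %d: %d rows committed", batch_number, len(rows))
        except sqlite3.Error:
            log.exception("load failed after %d committed batches", batch_number)
            raise
        finally:
            source.close()
            target.close()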
 
ETL tools are generally needed for an enterprise's data consolidation and data integration needs. It is true that ETL tools are not the only way to achieve these objectives, but they provide rapid project development and easier code maintenance.
 
A data warehouse is a typical example of data consolidation, where you need to consolidate many systems to see the overall performance of a business, product or service.
Enterprise application integration is another potential ETL domain. Here, ETL tools enable your applications to communicate with one another; in simple words, they let two applications share data and status.
 
 

ETL Factors

The key factors to consider before purchasing an ETL tool are discussed below.

1) Ascertain the ETL tool's ability to handle multiple data formats

Third-generation ETL tools are capable of handling structured data from different data sources, including legacy systems such as mainframe and UNIX-based application systems, spreadsheets, XML, proprietary databases and others. Managing unstructured data, however, is still a challenge for ETL tools.

The next generation of ETL tools will need the ability to handle structured and unstructured data simultaneously, as we now access huge streams of valuable data from social networking media. Handling streaming data is another challenge for most ETL tools.

2) Consider branded ETL tools to save time

Today, numerous business intelligence (BI) vendors offer BI and analytics packages that contain pre-built data models and ETL modules to load different types of data into them. These ETL tool packages accelerate the delivery of BI reports. If the package fits the needs of a business user and the required data is readily available, the module can be implemented in a couple of months and the user can start generating reports. By contrast, the traditional way of building custom BI systems takes at least 9 to 10 months.

3) Conduct data profiling at the source data level

Data profiling with an ETL tool is an important activity for auditing and understanding what the business process captures into source systems. A syntactic check validates that the data meets technical constraints, whereas a semantic check audits whether the data abides by the inherent business rules and policies. Data health is very important when designing and implementing an ETL process or implementing a new platform.
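
The toy example below illustrates the distinction; the field names, formats and business rules are invented purely to show what a syntactic check versus a semantic check might look like.

    # Two kinds of profiling checks on a single hypothetical source record.
    from datetime import date

    def syntactic_check(record):
        # Technical constraints: mandatory fields, types, formats.
        errors = []
        if not record.get("customer_id"):
            errors.append("customer_id is missing")
        try:
            date.fromisoformat(record.get("order_date", ""))
        except ValueError:
            errors.append("order_date is not a valid date")
        return errors

    def semantic_check(record):
        # Business rules: values that are well formed but not plausible.
        errors = []
        if float(record.get("discount_pct", 0)) > 50:
            errors.append("discount exceeds the 50% policy limit")
        if record.get("ship_date", "") < record.get("order_date", ""):
            errors.append("shipment recorded before the order date")
        return errors

    sample = {"customer_id": "C001", "order_date": "2011-02-30",
              "ship_date": "2011-01-05", "discount_pct": "65"}
    print(syntactic_check(sample) + semantic_check(sample))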

As a best practice, data should be audited and corrected at the source data level rather than in the data warehouse, the reporting layer or the new application platform.

4) Check whether the ETL tool supports data quality and cleansing

Integration of data quality functionality with ETL is one of the most sought-after features in any ETL tool. However, many ETL tools do not support advanced data quality and cleansing features. ETL tools with advanced data quality auditing and cleansing features are better suited to modern BI applications, where streaming and historical data must be integrated seamlessly.

5) Look for meta data support

Meta data management is a central activity in any ETL and reporting project. Most ETL tools support meta data capture and maintenance. The main challenge with the current meta data management facilities of ETL tools is sharing meta data across the different layers of an information management (IM) system. There are a few independent meta data management systems, such as Superglue from Informatica and MetaCenter from Data Advantage Group, that can be deployed to create a central meta data repository for IM initiatives.

6) Ensure version control of the ETL maps

An ETL tool should support a version control mechanism through which developers can maintain different versions of the source code without overwriting the original. The ETL tool should also prevent multiple developers from working on the same extract, so that only one of them can change an ETL workflow at a time.

Almost all ETL tools have version control features, but they are not advanced enough to manage complex project delivery where ETL is developed by different groups in different time zones. Integrating advanced software version control tools such as CVS or Code Co-op is challenging: they are good at detecting and controlling software versions in a distributed environment, whereas ETL version control requires configuration management not only of the ETL code but also of the meta data and embedded business rules.

7) Confirm that the ETL tool supports data manipulation

Most current-generation ETL tools support data manipulation both inside and outside the target database. In data integration projects, data extracted from multiple sources is manipulated outside the target database to fit a uniform format. In a historical data warehouse, the ETL process needs to manipulate the existing record in the target database before inserting a new instance of the same record.
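
As an illustration of that last point, the sketch below closes off the existing instance of a record before inserting the new one, in the style of a slowly changing dimension; the dim_customer table and its columns are hypothetical.

    # Expire the current version of a record, then insert the new instance.
    import sqlite3

    def apply_new_customer_version(conn, customer_id, address, load_date):
        # Step 1: manipulate the existing record in the target -
        # close off the currently active version of this customer.
        conn.execute(
            """UPDATE dim_customer
               SET current_flag = 0, valid_to = ?
               WHERE customer_id = ? AND current_flag = 1""",
            (load_date, customer_id))
        # Step 2: insert the new instance of the same business record.
        conn.execute(
            """INSERT INTO dim_customer
               (customer_id, address, valid_from, valid_to, current_flag)
               VALUES (?, ?, ?, NULL, 1)""",
            (customer_id, address, load_date))
        conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE dim_customer (
        customer_id TEXT, address TEXT,
        valid_from TEXT, valid_to TEXT, current_flag INTEGER)""")
    apply_new_customer_version(conn, "C001", "12 New Street", "2011-06-01")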

 

 

There are more than 75 tools on the market that purport to have some ETL function. These may be classified according to function, engine type and development environment.

Function

The tools fall into several general categories:

EtL Tools – These "small t" tools emphasize the extraction and loading processes and should actually be referred to as data migration tools instead of full-function ETL processors.

eTL or ETl Tools – These "small e" or "small l" tools typically accept only a specific input or output type, such as a flat file source or a specific database format, but offer fairly robust transformation functions within the processing engine.

eTl Tools – These "capital T" tools perform the data transformation step relatively well, but may lack efficient connectivity to many of the common data formats that may be encountered.

ETL Tools – These "complete" ETL tools provide a rich mix of functionality and connectivity, but may be significantly more expensive than tools found in the other categories. For extremely complex projects or those attempting to process massive amounts of data, these tools may present the only true option for ensuring success in the ETL phase of the project. In other cases, this class of tool may offer features that simply are not required in the existing environment.

Engine Type

The engine type classification segments the tools by how the developed ETL processes are executed. Typically, the tools encountered fall into one of two categories: server engine or client engine. The server engine allows execution of multiple concurrent jobs from more than one developer. Server engines generally take advantage of multiple CPUs and are designed to coordinate and manage the execution of multiple simultaneous routines.

The client engine is simpler and assumes that the ETL routines are executed on the same machine as they are developed. Concurrency for multiple jobs is limited, if allowed at all, and client engine tools often do not support scaling across multiple CPU machines. Client engines are significantly cheaper to purchase due to their more limited functionality and scalability.

Development Environment

Development environments are typically split two ways: GUI-based or code-based tools. Code-based tools are the most familiar and may not be considered "tools" independent of the language they represent. For example, Perl can be used as a code-based ETL tool, but it is also a more generalized programming language. The embedded transactional code languages within common database platforms (e.g., PL/SQL with Oracle, Transact-SQL with Microsoft SQL Server, etc.) may also provide ETL functionality, but are not limited to this capability. Aside from general programming languages, several tools on the market utilize a custom scripting language developed explicitly for the optimization of ETL routines.

GUI-based ETL tools have been on the market for at least six years. The purpose of the GUI is to remove the coding layer for the developer and allow generation of ETL routines without requiring the mastery of any particular coding language. GUI tools are also useful in that they provide some self-documentation about the job flow just from the layout and positioning of the graphical elements.

Key Questions To Ask

Now that you have some background on the types of tools available, what criteria should you use to help decide which tool is best for you? We will focus the discussion around the five C's: complexity, concurrency, continuity, cost and conformity.

Complexity

Complexity is generally evaluated through the following series of questions:

  • How many distinct sources of information will need to be processed with this tool?
  • If using files, what is the expected size of the input files? Are they ASCII or EBCDIC formatted? For mainframe file sources, do they contain OCCURS, REDEFINES or packed fields? For other file sources, are the files delimited or fixed length? What character is used as a delimiter?
  • If using relational database sources, how many rows will need to be queried each time the job is run? Do primary and foreign key relationships exist within and/or between the tables of interest? Can the tool support the types of joins needed to execute against the tables of interest? Is it possible to see and tune/edit the SQL the tool is generating against the database?
  • What "magic values" exist within given data fields? Is it necessary to translate computer codes into more understandable terms?
  • Is there a need to join information together from dissimilar sources? Is there a requirement to match flat-file data with fields in a relational database table? Is there a requirement to join information from two different types of databases?
  • Does the data warehouse use the "slowly changing dimensions" concept? If so, how does the tool support appropriately updating dimensional attributes?
  • How well does the tool support testing and debugging while in development?

Concurrency

Concurrency evaluates both the number of developers and number of simultaneous processes that the tool must support:

  • How many developers will need access to the tool?
  • Is there a need to implement security to limit access for individuals or groups? How does the tool support this? Is it possible to leverage LDAP or other in-place security mechanisms for authentication purposes?
  • How are change management functions or "versioning" handled within the tool? What happens if two developers attempt to make changes to the same routine or transformation object at the same time? Can older incarnations of various routines be retained for reference? What complexities are introduced through the specific versioning or change management functions used by the tool?
  • What is the maximum number of simultaneous processes that are expected to run? What size server would be required to support this number of concurrent processes with this tool? Can the tool support running twice that many simultaneous processes?
  • How flexible is the tool when executing dependent or sequential job streams? Can the tool recover from a failed job without manual intervention? How flexible is the error condition testing? How are errors logged?
  • Is the product multithreaded? If so, is the threading unlimited or set within the product? Are multiple jobs required to achieve high degrees of parallelization?

Continuity

Continuity involves two distinct subjects. Continuity establishes the scalability of the tool to meet future demands from system growth, and it evaluates how reliable the product is under unusual circumstances:

  • If data sources double in size, can this tool support that level of processing?
  • How easy is it to make changes to developed jobs or routines?
  • Can hardware and operating system resources beyond the "minimum" requirements for the product be deployed to support it?
  • How well does the product respond to spurious failures such as network problems, server crashes or running out of disk space?
  • How often does the vendor upgrade the product? How painful are the upgrade installations?

Cost

Cost represents the real purchase cost and perhaps the hidden support costs associated with each product:

  • What is the quoted price for the product options needed to be successful with this product? Where will additional software licensing costs (adding data sources, developer seats, server CPUs, etc.) be incurred in the future?
  • What future software costs will be incurred to receive software support and upgrades from the vendor? Is the first year of support and maintenance included with product purchase?
  • Is it necessary to purchase additional hardware or software components in order to effectively utilize this tool?
  • Who will administer the tool? Will it be necessary to hire a dedicated person or can someone already on the team fulfill this duty?
  • How much training is required to learn the tool? How much additional training is required to administer it? What advanced topic courses are offered? What do the courses cost? How often and where are they offered?
  • How available are support resources outside of those provided by the vendor? What resellers or systems integrators support the product? Are there local users groups? Are there technical forums available on the Internet?
  • How long should it take to implement our first production processes? How long is the typical learning curve for a tool like this (measured to the point of effective use of the tool, not just through the end of training)?

Conformity

Conformity outlines how the tool behaves against existing architectural limitations or requirements:

  • Does this tool support the hardware platform and operating system that will be used? Are there any incompatibilities with any other software typically run on the servers (backup software, virus scanning software, etc.)?
  • Does this tool integrate with the network and user security protocols used? What, if any, work-arounds will be required?
  • Does this tool allow access to the type and location of data sources that will be used? Should mainframe files be accessed directly (and can the tool support this?) or is it necessary to extract mainframe data to files first?
  • Can the tool import and/or export meta data or other key information with the specific tools (data modeling tools, business intelligence tools, etc.) that are deployed in the data warehouse?
  • What form of documentation does the tool generate? Does this format and content meet internal documentation standards? What additional documentation will need to be developed and maintained outside of the tool?
  • Is this purchase a departmental tool or will the tool be used across the enterprise? How would the tool architecture scale to grow from a departmental solution into one suitable for enterprise deployment?

Evaluating the ETL Purchase Decision

You will find that ETL tools can range in cost from free (freeware) to more than several hundred thousand dollars. As such, your decision delivers a widely ranging and potentially lasting impact on your systems architecture. If you are not already using a tool, you obviously first must compare the decision to purchase a tool against the "keep doing it like we're doing it today" option. Very few IT departments are able to meet the current and future development needs of the business users without substantial growth. One way to offset the need for additional headcount is to take advantage of the efficiency that modern ETL tools can bring if chosen and employed effectively.

To be blunt, you may encounter the feeling of buyer's remorse with almost any software purchase you make. Unfortunately, there is no single right answer in any tool choice today (despite what vendors tell you). As such, your goal should be to understand in advance where the expected weaknesses will occur with your chosen solution and be prepared to mitigate those consequences.

First Steps – "Deal Breakers"

Your first order of business in outlining your ETL selection is to determine if any critical business factors exist that might help limit the number of tools you will consider. Examples of these factors may include:

  • We must have a tool that runs on {specific operating system}.
  • We must be able to process {amount} GB of data nightly in a {number} hour batch window.
  • We must have connectivity to {type of data source}.
  • We must be able to run our ETL jobs on the mainframe.
  • We must have a tool that supports concurrent development by multiple sites.
  • We absolutely cannot spend more than ${x}.

The specific objective is to identify conditions that weed out solutions that simply will not be acceptable within your given architecture. Ideally, this first step should identify a simple list of the top six or seven tools that might meet the majority of your requirements. An exhaustive review of the tools is not required at this point.

Making the Final Decision

After winnowing out the clearly unacceptable choices, your next step in the process is to outline what criteria you will use in the evaluation. You may use the questions in the earlier sections of this article as a guideline; but you should be fairly specific and outline specific data sources with which you require connectivity, hardware expectations, operating system options, etc.

After building the list, generate a matrix with your decision criteria as rows and the tool choices as the columns of the matrix. Then rank each of the tools using publicly available information (Web sites, publications, etc.) as much as possible. If you do not feel you have enough information available to adequately answer your questions, contact a vendor representative to obtain the necessary information to complete the matrix. Note that managing more than two to three vendor presentations and handling appropriate follow-up can be an extremely time-consuming task. For this reason, you may wish not to begin onsite formal vendor presentations until you've created your "short list" of the two to three finalists for the selection. This approach will allow you to truly focus your energies on understanding why one tool is specifically better than just one or two others. With a wider field, you will generally find many "triangular" relationships (tool A is better than tool B on criterion 1, tool B is better than tool C on criterion 2, tool C is better than tool A on criterion 3, etc.) that confuse the overall picture.
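
As a rough illustration of such a matrix, the toy sketch below computes a weighted score per tool; the tool names, criteria weights and scores are entirely invented.

    # Toy decision matrix: criteria as rows, candidate tools as columns.
    # All names, weights and scores below are hypothetical.
    criteria_weights = {"connectivity": 0.30, "concurrency": 0.20,
                        "continuity": 0.15, "cost": 0.20, "conformity": 0.15}

    # Scores from 1 (poor) to 5 (excellent) per tool and criterion.
    scores = {
        "Tool A": {"connectivity": 5, "concurrency": 4, "continuity": 4, "cost": 2, "conformity": 4},
        "Tool B": {"connectivity": 3, "concurrency": 5, "continuity": 3, "cost": 4, "conformity": 3},
        "Tool C": {"connectivity": 4, "concurrency": 3, "continuity": 5, "cost": 5, "conformity": 2},
    }

    for tool, marks in scores.items():
        total = sum(criteria_weights[c] * marks[c] for c in criteria_weights)
        print(f"{tool}: weighted score {total:.2f}")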

Eventually, you should arrive at a reasonable conclusion about which tool will best meet your needs. You should definitely try to obtain a customer reference from each vendor, preferably another company doing approximately the same function as your organization (though for competitive reasons this may be hard to obtain). Focus on your key technical requirements to ensure that the customer reference you obtain is relevant and useful to you. A customer reference accessing similar systems and operating on similar hardware and operating systems is probably more applicable than another similar industry reference on a completely different platform.

In the end, of course, it all comes down to putting ink to paper and taking on the responsibility for this new project. Much like having a baby, you won't know for some time how decisions you make today turn out; and you'll inevitably have second thoughts along the way (Was I ready for this?). However, with a solid decision-making process behind you, you can expect to grow and succeed with your selection. At the final bell, it is the results that count!
