
ETL Testing


ETL is commonly associated with Data Warehousing projects, but in reality any form of bulk data movement from a source to a target can be considered ETL. Large enterprises often need to move application data from one system to another for data integration or data migration purposes. ETL testing is a data-centric testing process to validate that the data has been transformed and loaded into the target as expected.

ETL Testing Types

ETL or Data warehouse testing is categorized into four different engagements irrespective of the technology or ETL tools used: new data warehouse testing, migration testing, change request testing, and report testing.

ETL Testing Techniques

Apart from the core ETL testing methods described in this tutorial, other testing methods like system integration testing, user acceptance testing, incremental testing, regression testing, retesting, and navigation testing are also carried out to make sure everything is smooth and reliable.

ETL Testing Process

Similar to any other testing that lies under Independent Verification and Validation, ETL testing also goes through the same phases.

Metadata Testing

The purpose of Metadata Testing is to verify that the table definitions conform to the data model and application design specifications.

Data Type Check – Verify that the table and column data type definitions are as per the data model design specifications.

Example: Data Model column data type is NUMBER but the database column data type is STRING (or VARCHAR).
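A check like this can be automated by querying the database catalog. Below is a minimal sketch, assuming the database exposes the ANSI information_schema (Oracle, for example, uses ALL_TAB_COLUMNS instead) and reusing the customer_dim target table from the examples later in this tutorial:

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'customer_dim'
ORDER BY ordinal_position

The result can then be compared, column by column, against the data model specification.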

Data Length Check – Verify that the lengths of the database columns match the data model design specifications.

Example: The data model specifies a length of 100 for the ‘first_name’ column, but the corresponding database table column is only 80 characters long.
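The same catalog view also exposes column lengths. A sketch, again assuming the ANSI information_schema, that returns a row only when the ‘first_name’ column is shorter than the modeled 100 characters:

SELECT column_name, character_maximum_length
FROM information_schema.columns
WHERE table_name = 'customer_dim'
  AND column_name = 'first_name'
  AND character_maximum_length < 100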

Index/Constraint Check – Verify that proper constraints and indexes are defined on the database tables as per the design specifications.
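Constraints can be listed from the catalog as well. A sketch assuming the ANSI information_schema; note that index definitions live in vendor-specific catalogs (for example pg_indexes in PostgreSQL or ALL_INDEXES in Oracle):

SELECT constraint_name, constraint_type
FROM information_schema.table_constraints
WHERE table_name = 'customer_dim'

The returned primary key, foreign key, unique, and check constraints can then be matched against the design specifications.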

Data Completeness Testing

The purpose of Data Completeness tests is to verify that all the expected data is loaded into the target from the source. Some of the tests that can be run are: compare and validate counts, aggregates (min, max, sum, avg), and actual data between the source and target.

Record Count Validation – Compare the record counts of the primary source table and the target table, and check for any rejected records.

Example: A simple count of records comparison between the source and target tables.

Source Query

SELECT count(1) src_count FROM customer

Target Query

SELECT count(1) tgt_count FROM customer_dim

Column Data Profile Validation – Column or attribute level data profiling is an effective tool to compare source and target data without comparing the entire data set. It is similar to comparing checksums of your source and target data. These tests are essential when testing large volumes of data.

Some of the common data profile comparisons that can be done between the source and target are row counts, counts of distinct values, counts of nulls, and aggregates such as min, max, sum, and avg for each column.
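For example, the following pair of queries profiles a single column on both sides; the two result rows should match. This is a sketch that assumes a customer_id column in the customer and customer_dim tables used above:

Source Query

SELECT count(1) row_cnt, count(DISTINCT customer_id) distinct_cnt, min(customer_id) min_val, max(customer_id) max_val FROM customer

Target Query

SELECT count(1) row_cnt, count(DISTINCT customer_id) distinct_cnt, min(customer_id) min_val, max(customer_id) max_val FROM customer_dim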

Data Quality Testing

The purpose of Data Quality tests is to verify the accuracy of the data. Data profiling is used to identify data quality issues, and the ETL is designed to fix or handle these issues. However, source data keeps changing, and new data quality issues may be discovered even after the ETL is in production. Automating the data quality checks in the source and target systems is an important aspect of ETL execution and testing.

Duplicate Data Checks – Look for duplicate rows with the same unique key column or a unique combination of columns, as per the business requirement.

Example: The business requirement says that a combination of First Name, Last Name, Middle Name and Date of Birth should be unique.

Sample query to identify duplicates

SELECT fst_name, lst_name, mid_name, date_of_birth, count(1) FROM Customer GROUP BY fst_name, lst_name, mid_name, date_of_birth HAVING count(1) > 1

Data Validation Rules – Many database fields can contain a range of values that cannot be enumerated. However, there are reasonable constraints or rules that can be applied to detect situations where the data is clearly wrong. Instances of fields containing values violating the validation rules defined represent a quality gap that can impact ETL processing.

Example: Date of birth (DOB). This is defined as the DATE datatype and can assume any valid date. However, a DOB in the future, or more than 100 years in the past, is probably invalid. Also, the date of birth of a child should not be earlier than that of their parents.
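A sketch of such a rule check, assuming a customer_dim table with a date_of_birth column; the interval syntax is ANSI and may need adjusting for a specific database. Any rows returned violate the rule:

SELECT customer_id, date_of_birth
FROM customer_dim
WHERE date_of_birth > CURRENT_DATE
   OR date_of_birth < CURRENT_DATE - INTERVAL '100' YEAR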

Data Integrity Checks – This measurement addresses “keyed” relationships of entities within a domain. The goal of these checks is to identify orphan records in the child entity, i.e. rows whose foreign key has no matching record in the parent entity.
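A typical orphan-record check uses an outer join from the child to the parent. A sketch with a hypothetical order_fact child table referencing the customer_dim parent on customer_id:

SELECT o.order_id, o.customer_id
FROM order_fact o
LEFT JOIN customer_dim c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL

Every row returned is a child record whose customer_id has no matching parent.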

Data Transformation Testing

Data is transformed during the ETL process so that it can be consumed by applications on the target system. Because the target systems depend on this transformed data, it is important to test the transformations. There are two approaches for testing transformations: white-box testing and black-box testing.

Transformation testing using White Box approach – White-box testing is a testing technique that examines the program structure and derives test data from the program logic/code.

For transformation testing, this involves reviewing the transformation logic from the mapping design document and the ETL code to come up with test cases.

The steps to be followed are listed below:

  1. Review the transformation logic from the mapping design document and the ETL code.
  2. Reimplement the transformation logic in a programming language of your choice.
  3. Run the reimplementation on the source data and compare its output with the data in the target table.

The advantage of this approach is that the tests can easily be rerun against larger source data. The disadvantage is that the tester has to reimplement the transformation logic.

Example: In a financial company, the interest earned on a savings account depends on the daily balance in the account for the month (a SQL sketch follows the steps below).

  1. Review the requirement and design for calculating the interest.
  2. Implement the logic using your favorite programming language.
  3. Compare your output with data in the target table.
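A minimal SQL sketch of steps 2 and 3, assuming hypothetical daily_balance and interest_fact tables and an assumed annual rate of 2% paid monthly; any rows returned are potential defects:

SELECT b.account_id,
       round(avg(b.balance) * 0.02 / 12, 2) expected_interest, -- tester's reimplementation
       max(t.interest_amount) actual_interest                  -- value loaded by the ETL
FROM daily_balance b
JOIN interest_fact t ON t.account_id = b.account_id
WHERE b.balance_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
GROUP BY b.account_id
HAVING round(avg(b.balance) * 0.02 / 12, 2) <> max(t.interest_amount)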

Transformation testing using Black Box approach – Black-box testing is a method of software testing that examines the functionality of an application without peering into its internal structures or workings. For transformation testing, this involves reviewing the transformation logic from the mapping design document and setting up the test data appropriately.

The steps to be followed are listed below:

  1. Review the transformation logic from the mapping design document.
  2. Set up test data in the source for each transformation scenario and work out the expected values for the transformed data.
  3. Execute the ETL and compare the data loaded into the target table with the expected values.

The advantage of this approach is that the transformation logic does not need to be reimplemented during testing. The disadvantage is that the tester needs to set up test data for each transformation scenario and come up with the expected values for the transformed data manually.
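One way to organize the comparison is to load the manually computed expected values into a table of their own. A sketch, assuming a hypothetical expected_customer_dim table holding the expected rows and a transformed full_name column to verify:

SELECT e.customer_id
FROM expected_customer_dim e
LEFT JOIN customer_dim t
  ON t.customer_id = e.customer_id
 AND t.full_name = e.full_name -- repeat for each transformed column under test
WHERE t.customer_id IS NULL

Rows returned are expectations that the target either misses or transforms differently.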
