Data Transformation types and dimensional attributes
 


One of the main functions of an Extract, Transform, and Load (ETL) tool is to transform data. The transformation step is the most vital stage of building a structured data warehouse. With this post, I’d like to help you get a better understanding of the major transformation types in ETL. Here they are:

    • Format revision
    • Decoding of fields
    • Calculated and derived values
    • Splitting of single fields
    • Merging of information
    • Character set conversion
    • Unit of measurement conversion
    • Date/Time conversion
    • Summarization
    • Key restructuring
    • De-duplication

Today I will tell you about the following ETL transformation types: format revision, decoding of fields, calculated and derived values, splitting of single fields, and merging of information.

Format revision. Fields can contain numeric and text data types. If they do, you need to standardize and change the data type to text to provide values that could be correctly perceived by the users. The length of fields can also be different and you can standardize it.

Decoding of fields. In multiple source systems, the same data items are described by a variety of field values. Also, many legacy systems are notorious for using cryptic codes to represent business values. This ETL transformation type changes codes into values that make sense to the end-users.

Calculated and derived values. Sometimes you have to calculate the total cost and the profit margin before data can be stored in the data warehouse, which is an example of the calculated value. You may also want to store customer’s age separately—that would be an example of the derived value.

Splitting single fields. The first name, middle name, and last name, as well as some other values, were stored as a large text in a single field in the earlier legacy systems. You need to store individual components of names and addresses in separate fields in your data repository to improve the operating performance by indexing and analyzing individual components.

Merging of information. This type of data transformation in ETL does not literally mean the merging of several fields to create a single field. In this case, merging of information stands for establishing the connection between different fields, such as product price, description, package types, and viewing these fields as a single entity.

 

Transform

The transform stage applies to a series of rules or functions to the extracted data from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database:

  • Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), for example roll_no, age, and salary, then the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all those records where salary is not present (salary = null).
  • Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
  • Encoding free-form values (e.g., mapping "Male" to "1")
  • Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
  • Sorting
  • Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
  • Aggregation (for example, rollup — summarizing multiple rows of data — total sales for each store, and for each region, etc.)
  • Generating surrogate-key values
  • Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
  • Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)
  • Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)
  • Lookup and validate the relevant data from tables or referential files for slowly changing dimensions.
  • Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may result in exceptions, for example, when a code translation parses an unknown code in the extracted data.