Data Load is the process of taking the transformed data and loading it where the users can access it. If the architecture contains a staging database, then loading is a two-step process -
- Load the transformed data to the staging database.
- Load the data in the staging database to the warehouse/mart.
If the staging area is a file system, then we load the data directly to the warehouse/mart. During the load we prevent end users from accessing the warehouse/mart tables on which the load is happening. The reason for this is to avoid
- Incomplete data analysis - While the load is in progress, the fact and dimension tables do not have the complete data. If the users perform their analysis on the partially loaded data, the results would be incomplete/inaccurate.
- Query performance issues - If a user fires a SELECT query on the dimension and fact tables required for the analysis while an INSERT or UPDATE operation is in progress, the performance of both the SELECT query and the INSERT/UPDATE operations would degrade.
Data Load Operations
The operations involved in loading data to the warehouse/mart are as follows
- Insert Operation - One of the basic operations is to insert data into the tables. In DWH, we perform inserts for new data or to add newer versions of dimension data (SCD). We use an INSERT statement to perform this.
- Update Operation - An operation used to overwrite values already present in the tables. In DWH, we perform updates for data that has changed. We use an UPDATE statement to perform this.
- Upsert Operation - This is a special operation: if a row is already present then we update the contents of the row, otherwise we insert it into the table.
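As a concrete sketch, the three row-level operations can be shown with Python's built-in sqlite3 module. The `dim_customer` table and its columns are made up for illustration, and the `ON CONFLICT` upsert syntax varies by database (Oracle and SQL Server use MERGE, for example):

```python
import sqlite3

# Hypothetical dimension table, for illustration only.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Insert: a brand-new row.
con.execute("INSERT INTO dim_customer VALUES (1, 'Alice', 'Boston')")

# Update: overwrite a value that changed in the source.
con.execute("UPDATE dim_customer SET city = 'Chicago' WHERE customer_id = 1")

# Upsert: insert the row if absent, otherwise update it in place.
con.execute("""
    INSERT INTO dim_customer VALUES (?, ?, ?)
    ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name, city = excluded.city
""", (1, 'Alice', 'Denver'))

print(con.execute("SELECT * FROM dim_customer").fetchall())  # [(1, 'Alice', 'Denver')]
```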
- Bulk Load Utility - The above-mentioned Insert, Update and Upsert operations are performed one row at a time. While dealing with large amounts of data, processing one row at a time can become a performance bottleneck. Instead we use database bulk-loading utilities, which take an input dataset (flat file) and load the contents of the file into the table.
The bulk loader utility takes two datasets as inputs, one containing the data to be loaded and the other containing control information. The control dataset names the table and the columns into which the data has to be loaded; it also instructs the database on how to load the data. The load utility creates a reject file for records that have non-unique values for the primary key of the table.
The load operation works in one of three modes, namely
- APPEND - Here the rows are appended to the table.
- INSERT - Here the table must be empty and the data from the input dataset is loaded into the table.
- REPLACE - It empties the table and loads the data from the file.
We generally use the APPEND mode. Care must be taken that while the load utility is running on a table, no other SQL is fired against that table.
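Real bulk loaders (Oracle SQL*Loader, PostgreSQL COPY, bcp, and so on) are database-specific utilities; the sketch below only mimics the behaviour of the three modes in Python with sqlite3 to make it concrete. The table, file contents, and column names are illustrative:

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (sku TEXT, store TEXT, amount REAL)")

# Stand-in for the flat file a real bulk loader would read.
flat_file = io.StringIO("sku,store,amount\nA1,S01,10.5\nA2,S01,7.25\n")
rows = list(csv.DictReader(flat_file))

def bulk_load(con, table, rows, mode="APPEND"):
    """Mimic the three bulk-load modes: APPEND adds rows, REPLACE empties
    the table first, INSERT requires the table to be empty."""
    if mode == "REPLACE":
        con.execute(f"DELETE FROM {table}")  # table name is trusted here
    elif mode == "INSERT":
        if con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]:
            raise ValueError("INSERT mode requires an empty table")
    con.executemany(f"INSERT INTO {table} VALUES (:sku, :store, :amount)", rows)
    con.commit()

bulk_load(con, "fact_sales", rows, mode="APPEND")
print(con.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0])  # 2
```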
- Referential Integrity Constraint - The referential integrity constraint checks whether the value of a foreign key column is present in the parent table from which the foreign key is derived. This constraint is applied when we insert new rows or update the foreign key columns. While inserting or loading large amounts of data, this constraint can become a performance bottleneck. Hence we disable the foreign key constraints on tables dealing with large amounts of data, especially fact tables. We then need to make sure that the purpose of referential integrity is served by the ETL process itself.
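As a sketch of what "referential integrity maintained by ETL" means: with the database's foreign key checking turned off for the load, the ETL itself has to look for orphaned keys. A minimal sqlite3 illustration (table names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (product_id INTEGER REFERENCES dim_product, amount REAL);
""")
con.execute("PRAGMA foreign_keys = OFF")   # disable FK checking for the bulk load

con.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
con.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 10.0), (99, 5.0)])    # product_id 99 has no parent row

# The ETL must now enforce integrity itself: find orphaned foreign keys.
orphans = con.execute("""
    SELECT f.product_id FROM fact_sales f
    LEFT JOIN dim_product d USING (product_id)
    WHERE d.product_id IS NULL
""").fetchall()
print(orphans)  # [(99,)]
```

Orphaned rows found this way are typically routed to a reject file or reloaded once the missing dimension row arrives.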
- Indexes - Indexes defined on tables help speed up fetching data from tables. However, they can pose a performance issue when loading data into the table, especially while dealing with large amounts of data (fact tables), because every time a row is inserted or the value of an indexed column changes, the index has to be recomputed. Hence we drop the indexes on the table before inserting/updating/loading large amounts of data and rebuild them once the load operation is over. This can significantly improve the performance of the load operation.
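The drop-load-rebuild pattern looks roughly like this in sqlite3 (index and table names are illustrative; other databases have their own index rebuild commands):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (sku TEXT, amount REAL)")
con.execute("CREATE INDEX idx_sales_sku ON fact_sales (sku)")

rows = [(f"SKU{i}", float(i)) for i in range(10_000)]

# Drop the index so it is not maintained on every single insert ...
con.execute("DROP INDEX idx_sales_sku")
con.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
# ... then rebuild it once, after the load completes.
con.execute("CREATE INDEX idx_sales_sku ON fact_sales (sku)")
con.commit()
```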
Types of ETL Load
There are two types of ETL loads, namely -
- History Load - The data warehouse/mart is expected to house historical data, and we keep data for as long as the end users need it for their analysis. For example, users may want to compare a store's monthly sales with its monthly sales for the last 3 years; here we would have to keep at least 3 years of data so that end users can perform this analysis. When we build and implement the data warehouse/mart, it is empty, and we would not want to start building the history only from the day it is implemented; in that case the end users would have to wait 3 years from the day of implementation to perform this particular analysis. Hence we identify the sources where the history data can be found and perform a one-time ETL to extract the required history data and load it to the warehouse.
- Incremental Load - Incremental load is the periodic load that keeps the data warehouse updated with the most recent transactional data. This is an ongoing process that continues throughout the life of the warehouse/mart. The periodicity of the incremental loads depends on the availability time of the source data. We can have incremental loads on a
- Daily basis.
- Weekly basis.
- Monthly basis.
- Quarterly basis.
- Yearly basis.
Please note that for incremental loads we could have a combination of periodicities. For example, we could get the sales transaction file on a daily basis, the SKU master file every weekend, and the location file every month end. In this case we would load the sales transactions daily, the SKU data once a week, and the location data once a month.
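A combination of periodicities like the one above amounts to a small scheduling rule. A hypothetical sketch in Python (the feed names and the Saturday/month-end schedule are assumptions taken from the example):

```python
import calendar
from datetime import date

def feeds_due(run_date: date) -> list[str]:
    """Decide which incremental feeds to load on a given run date.
    Illustrative schedule: sales daily, SKU master every Saturday,
    locations on the last day of the month."""
    due = ["sales_transactions"]                        # daily feed, always due
    if run_date.weekday() == 5:                         # Saturday
        due.append("sku_master")
    last_day = calendar.monthrange(run_date.year, run_date.month)[1]
    if run_date.day == last_day:                        # month end
        due.append("locations")
    return due

print(feeds_due(date(2024, 3, 30)))  # ['sales_transactions', 'sku_master'] (a Saturday)
```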
ETL Incremental Load Window (known as Load Window)
This is the time frame during which the loading of data needs to be completed, so that the loaded data is ready for end users to perform their analysis upon.
- The load window for daily loads is defined as the timeframe, within the 24-hour window, by which the most recent data needs to be loaded. Usually this falls during the night, when the business users would not be accessing the data. Eg - 7 PM thru 5:30 AM.
- The load window for weekly loads is defined as the timeframe, on one of the 7 days of the week, by which the most recent data has to be loaded. Usually this falls on the weekend, when the business users would not be accessing the data. Eg - Saturday 7 PM thru Sunday 5:30 AM.
- The load window for monthly/quarterly/yearly loads is defined as the timeframe, on the last weekend of the month/quarter/year, by which the most recent data has to be loaded.
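A load window that crosses midnight (such as the 7 PM thru 5:30 AM example above) needs a small twist when checking whether a timestamp falls inside it. A minimal Python sketch, with the window times taken from the daily example:

```python
from datetime import datetime, time

# Illustrative daily load window: 7 PM through 5:30 AM the next day.
WINDOW_START = time(19, 0)
WINDOW_END = time(5, 30)

def in_load_window(ts: datetime, start=WINDOW_START, end=WINDOW_END) -> bool:
    """True if ts falls inside the load window; handles windows that
    cross midnight (start > end) as well as same-day windows."""
    t = ts.time()
    if start <= end:
        return start <= t <= end
    return t >= start or t <= end     # window wraps past midnight

print(in_load_window(datetime(2024, 1, 1, 23, 0)))  # True
print(in_load_window(datetime(2024, 1, 1, 12, 0)))  # False
```

An ETL scheduler can use a check like this to refuse to start (or to alert on) a load that would spill outside the window.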