Make your own free website on Tripod.com
Venky's World
Handling Error Loads
Home
About Me
Favorite Links
Contact Me
Why I Started This?
Welcome to the DW World
OLTP & OLAP
How to Start Isolating data
Grain is the Grail
Schemas
Normalization
Facts & Dimensions
Oh Yeah, Its ETL Time
SCD
SCD, is this Table/Column dependent
Handling Error Loads

Handling Error Loads

 Consider 2 cases.

1.      Business requires that for certain dates; the data must not be loaded into the warehouse. But due to negligence this has been loaded. How do we remove the data?

2.      Due to some error in the loading process, some older dates data have not been loaded into the warehouse. How do we go about loading them?

These 2 cases are very important to be considered. In my experience, these are the 2 cases that are generally neglected. Such cases are not supposed to happen. But our warehouse must be robust enough to handle all these cases.

            Though both these cases look trivial, people who have encountered such problems would agree with me that this is not trivial & requires some forethought before we go about the load. I will try to address both the above cases here. 

1.                  Removal of data from a warehouse requires careful planning. Removal of facts is easier. This requires an ordinary delete. For example, it is required to delete the data for 10th Jan 2005. To delete the facts, a simple delete statement on the fact table using the corresponding surrogate key for 10th Jan 2005. Removal of dimensions that were loaded on that day should also be removed. But remember we have SCD implementation on these dimensions. Hence, it is not easier to remove the dimensions. In fact, sometimes it is not possible to delete at all. We would analyze why & how we go about this.

a.       SCD-1: We shall recall what SCD-1 does. Whenever a matching business key occurs, then the other attributes for the dimension are updated with the present value. Whenever a new business key that does not exist the dimension occurs, then it is inserted as it is into the dimension. The entire dimension would have timestamps conveying when the dimension was loaded or updated. We cannot just go about deleting the rows that were loaded on 10th Jan 2005. For example, consider a new department that was added on 10th Jan 2005. This would have got loaded on the same day. New employees of this new department would get data after 10th Jan 2005. If we delete the new department, we would orphan the facts. That is the fact table would be having a key that parent dimension does not have. This should not happen. We can go about deleting this department if there are no facts for this department after 10th Jan 2005. Now consider an even tricky situation. A department undergoes a name change on 10th Jan 2005. Hence, according to SCD-1 the old department name would have got updated to the new name on 10th Jan 2005. But according to business whatever happened on 10th Jan 2005 is of no use. Hence we must not be using the new department name. The question is how do we revert back to the old department name? Should we revert back to the old department name? What is the old department name? The new facts that were loaded on 11th Jan 2005 would now be linked to the department dimension table with the new department name. This must be changed to the old department name. How do we do that? This cannot be done unless we get the help of the business users. Looking at our database we cannot get the old department name. This should be obtained from the business & then this should be updated in the dimension table.

I am a constant advocate of including a new column in the SCD 1 implementation that identifies whether the row was updated/inserted. When removing data for a particular date, we can first obtain all the rows that were inserted on that particular day. Then we have to identify whether these new rows have facts after that day. If not we can delete them. If yes, then we have to keep them. For the merged records, there is no other option but to get the help of the business. Of course for the above example it is assumed that the name change that occurred on 10th Jan 2005 is not valid according to the business. 

                        SCD-2:

For SCD-2 dimensions, the removal of records is bit easier. Consider the same example of removing records of 10th Jan 2005. Lets consider the same department name change that occurred on 10th Jan 2005. According to SCD-2 the older department name would be retained. But this record would be logically deleted in the sense that a new record for the same department with a new department name would be the active record in the dimension for this dimension. Our task would be is to delete this record & make the older department name record as the active record. And also all those keys that map to the new department name record key have to be updated to the older record key. 

      For SCD-3 & SCD-6 the explanation would be made in the future articles. 

2.                  The 2nd case is even more delicate than the 1st one. This would involve affecting the entire warehouse if not done properly & carefully. Before moving further I would like to explain the concept of CDC or Change Data Capture.

a.       CDC: A warehouse is loaded on an incremental basis. The loading is done everyday/every week/every month depending on the newness of the data required. Lets assume everyday load. Consider, the load to start on 10th Dec 2004. On doing the load on 10th Dec 2004, it would load all data before 10th Dec 2004. Every day data will be loaded. When a load is done on 11th Dec 2004, the loading process would load the data for 10th Dec 2004 alone. I.e. it would load all the data that was not loaded in the previous load. This is known as change data capture.

Now consider a case wherein some data that was not originally present in the source system has been brought (say lost data recovery) back to the source system. It is required that we load this data into the warehouse. To tackle such cases, a separate process that does not depend on CDC must be developed. Also, this load must not affect the CDC flags/dates.

            Going back & loading older data would affect the data consistency. How? Well consider the case of department dimension. We go back load older data. Now since it’s a older data it would be having older department name. Hence, when a load is done in this data, for SCD-1 implementation the new department name would get updated to older department names. This is absolutely not desired. Hence, for such kind of loads, only inserts must be allowed into the dimensions. Also, for SCD-2, if the same department already exists then the old department name record would be inserted but would be logically deleted since the latest department name record is logically active.

Enter supporting content here

Attitude,Not Aptitude, Determines Altitude