Consider 2 cases.
1. Business requires that for certain dates; the data
must not be loaded into the warehouse. But due to negligence this has been loaded. How do we remove the data?
2. Due to some error in the loading process, some older
dates data have not been loaded into the warehouse. How do we go about loading them?
2 cases are very important to be considered. In my experience, these are the 2 cases that are generally neglected. Such cases
are not supposed to happen. But our warehouse must be robust enough to handle all these cases.
Though both these cases look trivial, people who have encountered such problems would agree with me that this is not
trivial & requires some forethought before we go about the load. I will try to address both the above cases here.
Removal of data from a warehouse requires careful planning. Removal of facts is easier. This requires an ordinary delete.
For example, it is required to delete the data for 10th Jan 2005. To delete the facts, a simple delete statement
on the fact table using the corresponding surrogate key for 10th Jan 2005. Removal of dimensions that were loaded
on that day should also be removed. But remember we have SCD implementation on these dimensions. Hence, it is not easier to
remove the dimensions. In fact, sometimes it is not possible to delete at all. We would analyze why & how we go about
a. SCD-1: We shall recall what SCD-1 does. Whenever
a matching business key occurs, then the other attributes for the dimension are updated with the present value. Whenever a
new business key that does not exist the dimension occurs, then it is inserted as it is into the dimension. The entire dimension
would have timestamps conveying when the dimension was loaded or updated. We cannot just go about deleting the rows that were
loaded on 10th Jan 2005. For example, consider a new department that was added on 10th Jan 2005. This
would have got loaded on the same day. New employees of this new department would get data after 10th Jan 2005.
If we delete the new department, we would orphan the facts. That is the fact table would be having a key that parent dimension
does not have. This should not happen. We can go about deleting this department if there are no facts for this department
after 10th Jan 2005. Now consider an even tricky situation. A department undergoes a name change on 10th
Jan 2005. Hence, according to SCD-1 the old department name would have got updated to the new name on 10th Jan
2005. But according to business whatever happened on 10th Jan 2005 is of no use. Hence we must not be using the
new department name. The question is how do we revert back to the old department name? Should we revert back to the old department
name? What is the old department name? The new facts that were loaded on 11th Jan 2005 would now be linked to the
department dimension table with the new department name. This must be changed to the old department name. How do we do that?
This cannot be done unless we get the help of the business users. Looking at our database we cannot get the old department
name. This should be obtained from the business & then this should be updated in the dimension table.
I am a constant advocate of including a new column in the SCD 1 implementation
that identifies whether the row was updated/inserted. When removing data for a particular date, we can first obtain all the
rows that were inserted on that particular day. Then we have to identify whether these new rows have facts after that day.
If not we can delete them. If yes, then we have to keep them. For the merged records, there is no other option but to get
the help of the business. Of course for the above example it is assumed that the name change that occurred on 10th
Jan 2005 is not valid according to the business.
For SCD-2 dimensions, the removal of records is
bit easier. Consider the same example of removing records of 10th Jan 2005. Lets consider the same department name
change that occurred on 10th Jan 2005. According to SCD-2 the older department name would be retained. But this
record would be logically deleted in the sense that a new record for the same department with a new department name would
be the active record in the dimension for this dimension. Our task would be is to delete this record & make the older
department name record as the active record. And also all those keys that map to the new department name record key have to
be updated to the older record key.
For SCD-3 & SCD-6 the explanation would be made in the future articles.
The 2nd case is even more delicate than the 1st one. This would involve affecting the entire
warehouse if not done properly & carefully. Before moving further I would like to explain the concept of CDC or Change
a. CDC: A warehouse is loaded on an incremental
basis. The loading is done everyday/every week/every month depending on the newness of the data required. Lets assume everyday
load. Consider, the load to start on 10th Dec 2004. On doing the load on 10th Dec 2004, it would load
all data before 10th Dec 2004. Every day data will be loaded. When a load is done on 11th Dec 2004,
the loading process would load the data for 10th Dec 2004 alone. I.e. it would load all the data that was not loaded
in the previous load. This is known as change data capture.
Now consider a case wherein some data that was not
originally present in the source system has been brought (say lost data recovery) back to the source system. It is required
that we load this data into the warehouse. To tackle such cases, a separate process that does not depend on CDC must be developed.
Also, this load must not affect the CDC flags/dates. Going back & loading older data would affect
the data consistency. How? Well consider the case of department dimension. We go back load older data. Now since it’s
a older data it would be having older department name. Hence, when a load is done in this data, for SCD-1 implementation the
new department name would get updated to older department names. This is absolutely not desired. Hence, for such kind of loads,
only inserts must be allowed into the dimensions. Also, for SCD-2, if the same department already exists then the old department
name record would be inserted but would be logically deleted since the latest department name record is logically active.