
Posted on May 27, 2010
Discover how Gain Theory automated their data ingestion and improved collaboration, productivity and time-to-delivery thanks to CloverDX.
Read case studyIn the last part of our data warehouse (DWH) tutorial, I showed you how to load a dimension table that stores historical data according to the Slowly Changing Dimension Type 1 (SCD1). In today’s post, I will focus on a Slowly Changing Dimension Type 2 (SCD2) dimension table. I think that SCD2 is the most challenging sub-task of the ETL part of DWH design and each ETL architect should be able to deal with it.
In contrast to SCD1, SCD2 table stores and preserves the history of attributes. So once the value of the attribute is changed in external system (OLTP) we have to create a new record in SCD2 dimension table with the actual value but we also have to mark the old record in SCD2 table as obsolete. The most common way to obsolete the record is to maintain two additional attributes: valid_from and valid_to. Then the record is considered valid at particular date D when valid_from < D ≤ valid_to. You can find a detailed explanation of SCD2 principles in Kimball’s DWH bible or on wikipedia.org.
Let us show how SCD2 works in real in a small example. We will use DWH schema introduced in SCD1’s post.
It consists of four dimensions (Customer, Product, Store and Date), one degenerate dimension (invoice number) and one fact table (Sales). Fact table stores two additive facts: units and total price.
Store table is populated as SCD1 and we will load Customer table that was marked as SCD2 dimension table. Let’s imagine that Customer changed his email. What will happen in OLTP and DWH Customer table named D_CUSTOMER?
C0001;John;Newman;john.newman@hotmail.com =>
C0001;John;Newman;newman.john@gmail.com
0001;C0001;John;Newman;john.newman@hotmail.com;2009-10-10;null =>
0001;C0001;John;Newman;john.newman@hotmail.com;2009-10-10;2010-05-20
0002;C0001;John;Newman;newman.john@gmail.com;2010-05-21;null
Notice especially the first two attributes (columns) and the last two attributes of DWH table. The first attribute is a surrogate key, it is a unique identifier of the record in D_CUSTOMER table. It is generated by ETL process. The second one (C0001) is a natural key, a unique identifier of customer in OLTP. When you list all records of the same natural key in D_CUSTOMER you will get a complete history of one customer.
The principle how SCD2 works is explained now I will describe an implementation of SCD2 in CloverDX. You can see the CloverDX graph below.
The basic data-flow of the graph is very simple: in the lower branch we read the data from OLTP (for us as in previous DWH post it’s a CSV file), in the upper branch we read data that is already stored in our SCD2 table. We have to use the Dedup component, as we want only one actual record for each customer. The two branches intersect in DataIntersection component that processes the records according to the natural key. The component has three output ports, as there are three possible outcomes: