How to Properly Load Slowly Changing Dimensions using T-SQL Merge
One of the most compelling reasons to learn T-SQL Merge is that it performs Slowly Changing Dimension handling so well. This post is the fourth in a series called Have You Got the Urge to Merge?. This post builds on information from the other three, so I suggest you stop and read those before continuing, particularly the last one What exactly are Dimensions and Why do They Slowly Change? Go ahead… I’ll wait.
For most SSIS developers, their first experience in loading them is to use the SSIS Slowly Changing Dimension Wizard. It’s fairly easy and straight forward to use, but offers painfully poor performance. There are also 3rd Party SQL Server Integration Services (SSIS) SCD options available, but I’ve tested all of them and never found one that I was happy with. As you can probably guess, I prefer to use T-SQL Merge statements in an SSIS Execute SQL Task. There are several reasons for this.
What are the Pros and Cons of using Merge to load Slowly Changing Dimensions?
- T-SQL Merge is among the best performing options for loading Slowly Changing Dimensions.
- Merge statements are very versatile and allow for high degree of customization. This is important when you run into a scenario which deviates from standard data warehousing practices.
- The Merge source can be a table, view, common table expression, etc. In fact, the Merge destination (or target) could be a view as well.
- The Source and Destination should be on the same SQL instance. Yes, you can perform Merge with linked servers, but I wouldn’t do so.
- Merge statements can be wordy and tedious to write, as you’ll see in a minute. Merging a couple of tables with 20+ columns can be described as tedious at best.
- Merge performance relies upon proper indexing and setup. Small mistakes can result in drastically reduced performance.
- The initial load of a table can be painful since all of the information is typically written twice, once in staging and then again in the destination table.
Merge statements can be confusing at first so there’s a bit of a learning curve. However, once you get the hang of it, I think you’ll agree that Merge is a great way to go.
When discussing Merge today we are going to stipulate the following:
- Each entity in our Dimension will have 1 active record which will be designated by a RowIsCurrent (or another variation) field and/or RowStartDate and RowEndDate.
- Entities in our Dimension will be continuous. So, for our DimCustomer, we’ll assume that we don’t have customers cancelling their accounts and then re-starting
them. There will be no gaps in time.
- On each load, there will only be 1 record per entity to Merge. In other words, for a particular customer, there will only be 1 changed record per load. You’ll notice that my code includes a field called “RankNo” which handles situations where this is not the case such as initial loads. I’ll dive deeper into that process in a future article.
Let’s Get to the Code!
First download the following file which will create a database called JediMergeDemo and the necessary schema and tables. For those who read my last article, these tables and data will look very familiar. As in the previous article, CustomerName will be a Type 1 change, while Planet will be Type 2. I’m including the Fact Table even though we aren’t going to load it. I’m including it because Foreign Key relationships affect how you write your Merge statements.
First, let’s pre-load our Dimension table and also our Staging table. We’ll start by loading a record that requires a Type 1 change in staging. Obi-Wan Kenobi changes his name to Ben Kenobi before he goes into hiding.
Type 1 Merge
However after extensive testing, even with a dimension where only 1 of many attributes is handled as Type 1, there was no difference in performance or query plan.So, Type 1 Merge is simple, right? Well, Type 2 Merge is a lot more difficult.
Type 2 Merge
Note: Properly handling Type 2 changes with T-SQL Merge is complex, and most people don’t perform it correctly. In order to convey the basics, as well as the complexities, I’m going to present this in layers. If you want to simply want to see the end result, feel free to scroll down to the final statement.
- Find the current record for each entity (per business key) in the Dimension. Most often this is done by finding the record where RowisCurrent = ‘Y’ or some variation. However, In certain situations, you may need to find the current record by comparing the change date to RowStartDate and RowEndDate.
- If there are changes to any of the Type 2 columns for that current row, expire it. This is typically done by setting the RowIsCurrent value to ‘N’ as well as assigning the RowEndDate equal to the change date. In the last article, I explained that some people handle these dates differently.
- Insert a record with a new Surrogate Key for that Dimension record and insert all of the new values into it. The RowEndDate will be ’12/31/9999′, NULL, or some other value you choose.
The Merge Into portion of the Type 2 is just like the Type 1 Merge.