SQL/SSIS Data Warehouse fact table loading, best practices?

user1709091 · Nov 1, 2012 · Viewed 10.2k times

I am building my first data warehouse in SQL Server 2008/SSIS and I am looking for some best practices around loading the fact tables.

Currently in my DW I have about 20 dimensions (Offices, Employees, Products, Customers, etc.) that are all Type 1 SCDs. In my DW structure, there are a few things I have already applied:

  • No NULLs (replaced with a blank for text or 0 for numerics during staging)
  • Unknown member populated in each dimension (SK ID 0)
  • UPSERT for SCD Type 1 loading from stage to production tables
  • SELECT DISTINCT for loading the dimensions
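The Type 1 UPSERT from stage to production (update attributes in place on a business-key match, insert otherwise) can be done in a single set-based statement. Here is a minimal sketch using Python's bundled SQLite in place of SQL Server, with hypothetical table and column names; SQL Server 2008 would use MERGE, while SQLite spells the same pattern ON CONFLICT DO UPDATE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Production dimension: surrogate key + business key + a Type 1 attribute.
cur.execute("""CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY,
    customer_bk TEXT UNIQUE,      -- business key from the source system
    customer_name TEXT)""")

# Row 0 is the 'unknown member' that failed lookups fall back to.
cur.execute("INSERT INTO dim_customer VALUES (0, '', 'Unknown')")
cur.execute("INSERT INTO dim_customer (customer_bk, customer_name) "
            "VALUES ('C001', 'Acme')")

# Staged rows: one existing business key (changed name) and one new one.
cur.execute("CREATE TABLE stg_customer (customer_bk TEXT, customer_name TEXT)")
cur.executemany("INSERT INTO stg_customer VALUES (?, ?)",
                [("C001", "Acme Corp"), ("C002", "Globex")])

# Type 1 UPSERT: overwrite the attribute on match, insert on no match.
# (The WHERE true is required by SQLite's INSERT..SELECT..ON CONFLICT grammar.)
cur.execute("""
    INSERT INTO dim_customer (customer_bk, customer_name)
    SELECT DISTINCT customer_bk, customer_name FROM stg_customer WHERE true
    ON CONFLICT(customer_bk) DO UPDATE SET customer_name = excluded.customer_name
""")
conn.commit()

rows = cur.execute("SELECT customer_bk, customer_name FROM dim_customer "
                   "ORDER BY customer_sk").fetchall()
print(rows)  # [('', 'Unknown'), ('C001', 'Acme Corp'), ('C002', 'Globex')]
```

Note that no history is kept: Type 1 simply overwrites the old attribute value, which is why the answer below says large dimensions are less of a worry here.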

In my fact-loading SSIS project, my current method for resolving dimension keys is a chain of Lookup transformations (20+), one against each of the DIMs, after which the fact table is populated with the resulting surrogate keys.

For my lookups I set:

  • Full cache
  • Ignore failure for "no matching entries"
  • A Derived Column transformation with "ISNULL(surrogate_idkey) ? 0 : surrogate_idkey" for each SK, so that failed lookups default to SK ID 0 (the unknown member)
  • Some of my dimension lookups match on more than one business key
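The ignore-failure lookup plus derived-column pattern above amounts to coalescing a missed lookup to the unknown member's surrogate key. A small sketch of that logic, with a Python dictionary standing in for the full-cache lookup and all names hypothetical:

```python
# Full-cache lookup: the whole business-key -> surrogate-key map is loaded
# into memory before the data flow starts (this is what "hydrates").
customer_lookup = {"C001": 1, "C002": 2}

# A lookup with a composite business key is just a cache keyed on a tuple.
office_lookup = {("NY", "Sales"): 10, ("LA", "Support"): 11}

def lookup_sk(business_key, cache):
    """Ignore-failure lookup + derived column, equivalent to the SSIS
    expression  ISNULL(surrogate_idkey) ? 0 : surrogate_idkey."""
    return cache.get(business_key, 0)   # 0 = unknown member

fact_rows = ["C001", "C999", "C002"]    # C999 has no dimension match
sks = [lookup_sk(bk, customer_lookup) for bk in fact_rows]
print(sks)  # [1, 0, 2]
```

The benefit of defaulting to 0 rather than failing the row is that the fact load never drops or diverts data; unmatched rows surface in reports under the "Unknown" member, where they can be investigated later.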

Is this the best approach? Pictures attached to help with my description above.


Answer

Bill Anton · Nov 1, 2012

Looks fine. There are options if you start to run into performance issues, but if this is stable (finishes within data-loading time window, source systems aren't being drained of resources, etc), then I see no reason to change.

Some potential issues to keep an eye on...

  1. Having 20+ full-cache Lookup transforms may pose a problem if your dimensions grow in size, due to memory constraints on the SSIS system. But since they are Type 1, I wouldn't worry.
  2. Full-cache lookups "hydrate" (load their caches) before execution begins, so having 20+ of them may slow down package startup.

A common alternative (to what you have above) is to extract the fact table data from the source system and land it in a staging area, then do all the dimension key lookups in a single SQL statement. Some even keep a set of dimension key mapping tables in the staging area specifically for this purpose. This reduces locking/blocking on the source system, which matters if you pull a lot of data each load and would otherwise hold the source system while the rows stream through those 20+ lookup transforms.
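That set-based alternative can be sketched as follows, again with SQLite standing in for SQL Server and all table/column names hypothetical. A LEFT JOIN against each dimension plus COALESCE(..., 0) reproduces the unknown-member defaulting in one statement instead of 20+ Lookup transforms:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
    CREATE TABLE dim_customer (customer_sk INTEGER, customer_bk TEXT);
    INSERT INTO dim_customer VALUES (1, 'C001'), (2, 'C002');

    CREATE TABLE dim_product (product_sk INTEGER, product_bk TEXT);
    INSERT INTO dim_product VALUES (1, 'P01');

    -- Staged fact extract, still carrying source-system business keys.
    CREATE TABLE stg_sales (customer_bk TEXT, product_bk TEXT, amount REAL);
    INSERT INTO stg_sales VALUES ('C001', 'P01', 100.0), ('C999', 'P01', 50.0);

    CREATE TABLE fact_sales (customer_sk INTEGER, product_sk INTEGER, amount REAL);

    -- One set-based statement resolves every surrogate key:
    -- LEFT JOIN each dimension, COALESCE misses to the unknown member (0).
    INSERT INTO fact_sales (customer_sk, product_sk, amount)
    SELECT COALESCE(c.customer_sk, 0),
           COALESCE(p.product_sk, 0),
           s.amount
    FROM stg_sales s
    LEFT JOIN dim_customer c ON c.customer_bk = s.customer_bk
    LEFT JOIN dim_product  p ON p.product_bk  = s.product_bk;
""")

rows = cur.execute("SELECT * FROM fact_sales ORDER BY amount DESC").fetchall()
print(rows)  # [(1, 1, 100.0), (0, 1, 50.0)]
```

Since the joins run inside the database, memory use and parallelism are the engine's problem rather than the SSIS box's, and the source system was only touched during the initial extract.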

Having a good staging area strategy becomes more important when you have a large amount of data, large dimensions, complex key mappings (usually due to multiple source systems), and short data-loading time windows.