Handling nulls in a data warehouse

rrydman · Jun 10, 2009 · Viewed 7.6k times

I'd like your input on best practices for handling null or empty data values in data warehousing with SSIS/SSAS.

I have several fact and dimension tables that contain null values in different rows.

Specifics:

1) What is the best way to handle null date/time values? Should I make a 'default' row in my time or date dimensions and point SSIS to the default row when a null is found?

2) What is the best way to handle null or empty values inside dimension data? Ex: I have some rows in an 'Accounts' dimension that have empty (not NULL) values in the Account Name column. Should I convert these empty or null values in the column to a specific default value?

3) Similar to point 1 above: what should I do if I end up with a fact table row that has no matching record for one of its dimension columns? Do I need default dimension records for each dimension in case this happens?

4) Any suggestions or tips on how to handle these operations in SQL Server Integration Services (SSIS)? Best data flow configurations or best transformation objects to use would be helpful.

Thanks :-)

Answer

Steve Homer · Jun 11, 2009

As the previous answer states, there can be many different meanings attached to Null values for a dimension: unknown, not applicable, etc. If it is useful to be able to distinguish between them in your application, adding "pseudo" dimension entries can help.

In any case, I would avoid having either Null fact foreign keys or Null dimension fields. Having even a single 'unknown' dimension value will help your users define queries that include a catch-all grouping where the data quality isn't 100% (and it never is).
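
To illustrate the catch-all grouping, here is a minimal sketch. The tables `dbo.FactSales` and `dbo.DimAccount` and their columns are hypothetical and not part of the original answer. With an inner join, fact rows whose foreign key is NULL drop out of the result entirely, so totals stop reconciling; if those facts point at an 'Unknown account' row instead, they show up as their own group.

```sql
-- Hypothetical tables (assumptions, not from the post):
--   dbo.FactSales(AccountSK, SaleAmount)
--   dbo.DimAccount(AccountSK, AccountName), including an 'Unknown account' row
SELECT a.AccountName,
       SUM(f.SaleAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimAccount AS a
  ON a.AccountSK = f.AccountSK   -- a NULL AccountSK would never match here
GROUP BY a.AccountName;
```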

One very simple trick I've been using for this, and which hasn't bitten me yet, is to define my dimensions' surrogate keys using int IDENTITY(1,1) in T-SQL (start at 1 and increment by 1 per row). Pseudo keys ("Unavailable", "Unassigned", "Not applicable") are defined as negative ints and populated by a stored procedure run at the beginning of the ETL process.

For example, a table created as


    CREATE TABLE [dbo].[Location]
    (
        [LocationSK] [int] IDENTITY(1,1) NOT NULL,
        [Name] [varchar](50) NOT NULL,
        [Abbreviation] [varchar](4) NOT NULL,
        [LocationBK] [int] NOT NULL,
        [EffectiveFromDate] [datetime] NOT NULL,
        [EffectiveToDate] [datetime] NULL,
        [Type1Checksum] [int] NOT NULL,
        [Type2Checksum] [int] NOT NULL
    ) ON [PRIMARY]

And a stored procedure populating the table with


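    -- Explicit LocationSK values require IDENTITY_INSERT on an IDENTITY column
    SET IDENTITY_INSERT dbo.Location ON;
    INSERT INTO dbo.Location (LocationSK, Name, Abbreviation, LocationBK,
                              EffectiveFromDate, Type1Checksum, Type2Checksum)
    VALUES (-1, 'Unknown location', 'Unk', -1, '1900-01-01', 0, 0);
    SET IDENTITY_INSERT dbo.Location OFF;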
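
Because the procedure runs at the start of every ETL run, it helps if it can be re-run safely. The following is only a sketch of one way to do that: the procedure name, the extra keys -2 and -3, and their labels are assumptions for illustration. It also assumes SQL Server 2008 or later for the `VALUES` derived table. Note that inserting explicit values into an `IDENTITY` column requires `SET IDENTITY_INSERT ... ON`.

```sql
-- Hypothetical procedure name; keys -2/-3 and their labels are illustrative.
CREATE PROCEDURE dbo.SeedLocationPseudoRows
AS
BEGIN
    SET NOCOUNT ON;

    -- Allow explicit values for the IDENTITY column LocationSK.
    SET IDENTITY_INSERT dbo.Location ON;

    -- Insert only the pseudo rows that are not already present.
    INSERT INTO dbo.Location (LocationSK, Name, Abbreviation, LocationBK,
                              EffectiveFromDate, Type1Checksum, Type2Checksum)
    SELECT p.LocationSK, p.Name, p.Abbreviation, p.LocationSK, '19000101', 0, 0
    FROM (VALUES (-1, 'Unknown location', 'Unk'),
                 (-2, 'Unassigned',       'Una'),
                 (-3, 'Not applicable',   'N/A')) AS p (LocationSK, Name, Abbreviation)
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.Location AS l
                      WHERE l.LocationSK = p.LocationSK);

    SET IDENTITY_INSERT dbo.Location OFF;
END;
```

Since the identity seed starts at 1, the negative keys can never collide with keys generated for real dimension rows.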

I have made it a rule to have at least one such pseudo row per dimension, used whenever the dimension lookup fails, and to build exception reports that track the number of facts assigned to such rows.
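
As a rough sketch of both halves of that rule, assuming a hypothetical staging table `staging.Sales` and fact table `dbo.FactSales` (only `dbo.Location` comes from the example above), the lookup can fall back to the pseudo key with a `LEFT JOIN` and `COALESCE`, and the exception report is then a simple count over the negative keys. It also assumes that a NULL `EffectiveToDate` marks the current version of a dimension row.

```sql
-- Hypothetical tables: staging.Sales(LocationBK, SaleAmount),
--                      dbo.FactSales(LocationSK, SaleAmount)

-- Resolve the surrogate key; use -1 ('Unknown location') when the lookup fails.
INSERT INTO dbo.FactSales (LocationSK, SaleAmount)
SELECT COALESCE(l.LocationSK, -1),
       s.SaleAmount
FROM staging.Sales AS s
LEFT JOIN dbo.Location AS l
       ON l.LocationBK = s.LocationBK
      AND l.EffectiveToDate IS NULL;   -- assumption: NULL means current row

-- Exception report: number of facts assigned to each pseudo row.
SELECT l.LocationSK,
       l.Name,
       COUNT(*) AS FactCount
FROM dbo.FactSales AS f
JOIN dbo.Location AS l
  ON l.LocationSK = f.LocationSK
WHERE f.LocationSK < 0
GROUP BY l.LocationSK, l.Name
ORDER BY FactCount DESC;
```

In an SSIS data flow, the same effect is usually obtained with a Lookup transformation whose unmatched rows are given the pseudo key, for example in a Derived Column.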