How to overcome Netezza's lack of unique constraint/referential integrity enforcement?

ihadanny picture ihadanny · Apr 13, 2011 · Viewed 14.4k times · Source

It appears that the lack of support for enforcing 2 basic constraints (unique and foreign key), is a reason for loss of many man-hours debugging and troubleshooting difficult problems. What starts as a trivial, easy-to-fix problem (dup rows /inconsistent primary-detail tables) goes un-noticed, grows and raises complex edge-scenarios in our application, or even in our hardware (e.g. self-join with dups can cause inflation and storage exhaustion).

  • Netezza serves multiple purposes in our environment: production, research, qa and staging. Naturally our ETL processes can’t be mature and can’t verify all constraints in all those scenarios.
  • Even in our most mature applications used in production, where data is verified while ETL loads it, we create a series of tables each the result of a calculation on its predecessors. Sometimes the data-integrity is broken along the way, not right on the start (as a result of a buggy statement)

Can anyone recommend a methodology/tool for avoiding these headaches?

Answer

Kerr picture Kerr · Nov 28, 2012

I understand there is an accepted answer, though I wanted to contribute an alternate approach. Being new to my current position, I am not privy to all the business decisions behind primary key declarations in our warehouse. I developed logging type approach to track duplicate row removal efforts over time. Here are the major features of this design:

  • Always up to date, addressing the fluid nature of DDL / DML
    • New / dropped tables
    • New / updated primary keys
    • New / updated / deleted rows
  • Self-populating history
    • Tracks improvement over time
    • Provides basis for trending analysis at all levels
  • Easily query target tables for research purposes
    • No self joins with HAVING clause or lookup of key columns required
  • Addresses only Primary Keys at this time
    • Can easily be expanded to address Unique constraints (CONTYPE = 'u' in _V_RELATION_KEYDATA)

What follows is everything required from a Netezza perspective. Where noted, you will need to fill in gaps to create dynamic SQL.

First, I created a table that tracks the database, table and internal rowid of all duplicate records.

CREATE TABLE
    NZ_DUPLICATE_PKS
    (
        DATABASE_NAME CHARACTER VARYING(128) NOT NULL
        ,TABLE_OWNER CHARACTER VARYING(128) NOT NULL
        ,TABLE_NAME CHARACTER VARYING(128) NOT NULL
        ,ROW_ID BIGINT NOT NULL
        ,CURRENT_RECORD_INDICATOR CHARACTER(1) NOT NULL
        ,CREATE_TIMESTAMP TIMESTAMP NOT NULL
        ,LAST_UPDATE_TIMESTAMP TIMESTAMP NOT NULL
    )
DISTRIBUTE ON
    (
        ROW_ID
    );

NOTE: YMMV on the distribution key and volume of rows going into the table. Row IDs in our Netezza applicance had an even enough natural distribution that it served me well on a Mustang based NPS 10050.

Next, a staging version of this table was created:

CREATE TABLE
    STG_NZ_DUPLICATE_PKS
    (
        DATABASE_NAME CHARACTER VARYING(128)
        ,TABLE_OWNER CHARACTER VARYING(128)
        ,TABLE_NAME CHARACTER VARYING(128)
        ,ROW_ID BIGINT
        ,CURRENT_RECORD_INDICATOR CHARACTER(1)
        ,CREATE_TIMESTAMP TIMESTAMP
        ,LAST_UPDATE_TIMESTAMP TIMESTAMP
    )
DISTRIBUTE ON
    (
        ROW_ID
    );

I then created dynamic queries from system views to seed the staging table. Here is the base query I started with:

SELECT
    DATABASE
    ,OWNER
    ,RELATION
    ,CONSTRAINTNAME
    ,ATTNAME
FROM
    {YOUR_DATABASE_NAME}._V_RELATION_KEYDATA
WHERE
    CONTYPE = 'p'
    -- Exclude the duplicate tracking table
    AND RELATION != 'NZ_DUPLICATE_PKS'
ORDER BY
    DATABASE
    ,OWNER
    ,RELATION
    ,CONSTRAINTNAME
    ,CONSEQ
;

Now I loop through the base query to dynamically create insert queries. My shop uses DataStage, whose approach is esoteric and not worth expounding upon here.

NOTE: Here is where a little work is needed to loop and construct dynamic SQL. One could use myriad flavors of shell, Perl, Python, etc. Using a sample table with two column key, here is what to construct for insertion into the staging table:

INSERT
INTO
    STG_NZ_DUPLICATE_PKS
    (
        DATABASE_NAME
        ,TABLE_OWNER
        ,TABLE_NAME
        ,ROW_ID
        ,CURRENT_RECORD_INDICATOR
        ,CREATE_TIMESTAMP
        ,LAST_UPDATE_TIMESTAMP
    )
SELECT
    '{YOUR_DATABASE_NAME}' DATABASE_NAME
    ,'{YOUR_TABLE_OWNER}' TABLE_OWNER
    ,'{YOUR_TABLE_NAME}' TABLE_NAME
    ,DUPS.ROWID ROW_ID
    ,'Y' CURRENT_RECORD_INDICATOR
    ,CURRENT_TIMESTAMP CREATE_TIMESTAMP
    ,CURRENT_TIMESTAMP LAST_UPDATE_TIMESTAMP
FROM
    {YOUR_TABLE_NAME} DUPS
    INNER JOIN
        (
            SELECT
                {KEY_COLUMN_1}
                ,{KEY_COLUMN_2}
            FROM
                {YOUR_TABLE_NAME}
            GROUP BY
                {KEY_COLUMN_1}
                ,{KEY_COLUMN_2}
            HAVING
                COUNT(*) > 1
        )
        KEYS
        ON
            DUPS.{KEY_COLUMN_1} = KEYS.{KEY_COLUMN_1}
            AND DUPS.{KEY_COLUMN_2} = KEYS.{KEY_COLUMN_2};

After looping through all tables to seed the staging table, I then run a series of queries, treating the database, owner, table name and row ID like a slowly changing dimension. This query end dates records in the target table that do not exist in staging table:

UPDATE
    NZ_DUPLICATE_PKS
SET
    CURRENT_RECORD_INDICATOR = 'N'
    ,LAST_UPDATE_TIMESTAMP = CURRENT_TIMESTAMP
WHERE
    CURRENT_RECORD_INDICATOR = 'Y'
    AND
    (
        DATABASE_NAME
        ,TABLE_OWNER
        ,TABLE_NAME
        ,ROW_ID
    )
    NOT IN
    (
        SELECT
            DATABASE_NAME
            ,TABLE_OWNER
            ,TABLE_NAME
            ,ROW_ID
        FROM
            STG_NZ_DUPLICATE_PKS
    )
;

Finally, insert the latest records into the target table:

INSERT
INTO
    NZ_DUPLICATE_PKS
    (
        DATABASE_NAME
        ,TABLE_OWNER
        ,TABLE_NAME
        ,ROW_ID
        ,CURRENT_RECORD_INDICATOR
        ,CREATE_TIMESTAMP
        ,LAST_UPDATE_TIMESTAMP
    )
SELECT
    DATABASE_NAME
    ,TABLE_OWNER
    ,TABLE_NAME
    ,ROW_ID
    ,CURRENT_RECORD_INDICATOR
    ,CREATE_TIMESTAMP
    ,LAST_UPDATE_TIMESTAMP
FROM
    STG_NZ_DUPLICATE_PKS
WHERE
    (
        DATABASE_NAME
        ,TABLE_OWNER
        ,TABLE_NAME
        ,ROW_ID
    )
    NOT IN
    (
        SELECT
            DATABASE_NAME
            ,TABLE_OWNER
            ,TABLE_NAME
            ,ROW_ID
        FROM
            NZ_DUPLICATE_PKS
        WHERE
            CURRENT_RECORD_INDICATOR = 'Y'
    )
;

NOTE: Our environment is not such that an insert-only model is necessary. Netezza veterans will be familiar with this line of thought. If your environment is insert-only, adjust strategy accordingly.

Once everything is in place, it's a snap to locate duplicate rows for investigation:

SELECT
    *
FROM
    {YOUR_TABLE_NAME}
WHERE
    ROWID IN
    (
        SELECT
            ROW_ID
        FROM
            NZ_DUPLICATE_PKS
        WHERE
            CURRENT_RECORD_INDICATOR = 'Y'
            AND DATABASE_NAME = '{YOUR_DATABASE_NAME}'
            AND TABLE_OWNER = '{YOUR_OWNER_NAME}'
            AND TABLE_NAME = '{YOUR_TABLE_NAME}'
    );

I like this because it's simple and the same for all tables, regardless of differences in primary key declaration.

I also use this query a lot to view current primary key violations by table:

SELECT
    DATABASE_NAME
    ,TABLE_OWNER
    ,TABLE_NAME
    ,COUNT(*) QUANTITY
FROM
    NZ_DUPLICATE_PKS
WHERE
    CURRENT_RECORD_INDICATOR = 'Y'
GROUP BY
    1
    ,2
    ,3
ORDER BY
    1
    ,2
    ,3;

That sums everything up. I hope some folks find it useful. We have already made a lot of progress with this approach. At this point, you may be wondering why I went to all this trouble. I detest that PK violations are allowed into our warehouse, and I desired a comprehensive approach to their eradication. The above process has been running daily in our production environment for a couple months. We have ~350 tables with primary keys declared, ranging in size from 5 row dimensions to ~200 million row facts @ 10Gb. For Netezza, this is a pretty modest outlay. The entire process takes less than 10 minutes on our Mustang NPS 10050.