It appears that the lack of support for enforcing 2 basic constraints (unique and foreign key), is a reason for loss of many man-hours debugging and troubleshooting difficult problems. What starts as a trivial, easy-to-fix problem (dup rows /inconsistent primary-detail tables) goes un-noticed, grows and raises complex edge-scenarios in our application, or even in our hardware (e.g. self-join with dups can cause inflation and storage exhaustion).
Can anyone recommend a methodology/tool for avoiding these headaches?
I understand there is an accepted answer, though I wanted to contribute an alternate approach. Being new to my current position, I am not privy to all the business decisions behind primary key declarations in our warehouse. I developed logging type approach to track duplicate row removal efforts over time. Here are the major features of this design:
What follows is everything required from a Netezza perspective. Where noted, you will need to fill in gaps to create dynamic SQL.
First, I created a table that tracks the database, table and internal rowid of all duplicate records.
CREATE TABLE
NZ_DUPLICATE_PKS
(
DATABASE_NAME CHARACTER VARYING(128) NOT NULL
,TABLE_OWNER CHARACTER VARYING(128) NOT NULL
,TABLE_NAME CHARACTER VARYING(128) NOT NULL
,ROW_ID BIGINT NOT NULL
,CURRENT_RECORD_INDICATOR CHARACTER(1) NOT NULL
,CREATE_TIMESTAMP TIMESTAMP NOT NULL
,LAST_UPDATE_TIMESTAMP TIMESTAMP NOT NULL
)
DISTRIBUTE ON
(
ROW_ID
);
NOTE: YMMV on the distribution key and volume of rows going into the table. Row IDs in our Netezza applicance had an even enough natural distribution that it served me well on a Mustang based NPS 10050.
Next, a staging version of this table was created:
CREATE TABLE
STG_NZ_DUPLICATE_PKS
(
DATABASE_NAME CHARACTER VARYING(128)
,TABLE_OWNER CHARACTER VARYING(128)
,TABLE_NAME CHARACTER VARYING(128)
,ROW_ID BIGINT
,CURRENT_RECORD_INDICATOR CHARACTER(1)
,CREATE_TIMESTAMP TIMESTAMP
,LAST_UPDATE_TIMESTAMP TIMESTAMP
)
DISTRIBUTE ON
(
ROW_ID
);
I then created dynamic queries from system views to seed the staging table. Here is the base query I started with:
SELECT
DATABASE
,OWNER
,RELATION
,CONSTRAINTNAME
,ATTNAME
FROM
{YOUR_DATABASE_NAME}._V_RELATION_KEYDATA
WHERE
CONTYPE = 'p'
-- Exclude the duplicate tracking table
AND RELATION != 'NZ_DUPLICATE_PKS'
ORDER BY
DATABASE
,OWNER
,RELATION
,CONSTRAINTNAME
,CONSEQ
;
Now I loop through the base query to dynamically create insert queries. My shop uses DataStage, whose approach is esoteric and not worth expounding upon here.
NOTE: Here is where a little work is needed to loop and construct dynamic SQL. One could use myriad flavors of shell, Perl, Python, etc. Using a sample table with two column key, here is what to construct for insertion into the staging table:
INSERT
INTO
STG_NZ_DUPLICATE_PKS
(
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
,CURRENT_RECORD_INDICATOR
,CREATE_TIMESTAMP
,LAST_UPDATE_TIMESTAMP
)
SELECT
'{YOUR_DATABASE_NAME}' DATABASE_NAME
,'{YOUR_TABLE_OWNER}' TABLE_OWNER
,'{YOUR_TABLE_NAME}' TABLE_NAME
,DUPS.ROWID ROW_ID
,'Y' CURRENT_RECORD_INDICATOR
,CURRENT_TIMESTAMP CREATE_TIMESTAMP
,CURRENT_TIMESTAMP LAST_UPDATE_TIMESTAMP
FROM
{YOUR_TABLE_NAME} DUPS
INNER JOIN
(
SELECT
{KEY_COLUMN_1}
,{KEY_COLUMN_2}
FROM
{YOUR_TABLE_NAME}
GROUP BY
{KEY_COLUMN_1}
,{KEY_COLUMN_2}
HAVING
COUNT(*) > 1
)
KEYS
ON
DUPS.{KEY_COLUMN_1} = KEYS.{KEY_COLUMN_1}
AND DUPS.{KEY_COLUMN_2} = KEYS.{KEY_COLUMN_2};
After looping through all tables to seed the staging table, I then run a series of queries, treating the database, owner, table name and row ID like a slowly changing dimension. This query end dates records in the target table that do not exist in staging table:
UPDATE
NZ_DUPLICATE_PKS
SET
CURRENT_RECORD_INDICATOR = 'N'
,LAST_UPDATE_TIMESTAMP = CURRENT_TIMESTAMP
WHERE
CURRENT_RECORD_INDICATOR = 'Y'
AND
(
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
)
NOT IN
(
SELECT
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
FROM
STG_NZ_DUPLICATE_PKS
)
;
Finally, insert the latest records into the target table:
INSERT
INTO
NZ_DUPLICATE_PKS
(
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
,CURRENT_RECORD_INDICATOR
,CREATE_TIMESTAMP
,LAST_UPDATE_TIMESTAMP
)
SELECT
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
,CURRENT_RECORD_INDICATOR
,CREATE_TIMESTAMP
,LAST_UPDATE_TIMESTAMP
FROM
STG_NZ_DUPLICATE_PKS
WHERE
(
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
)
NOT IN
(
SELECT
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,ROW_ID
FROM
NZ_DUPLICATE_PKS
WHERE
CURRENT_RECORD_INDICATOR = 'Y'
)
;
NOTE: Our environment is not such that an insert-only model is necessary. Netezza veterans will be familiar with this line of thought. If your environment is insert-only, adjust strategy accordingly.
Once everything is in place, it's a snap to locate duplicate rows for investigation:
SELECT
*
FROM
{YOUR_TABLE_NAME}
WHERE
ROWID IN
(
SELECT
ROW_ID
FROM
NZ_DUPLICATE_PKS
WHERE
CURRENT_RECORD_INDICATOR = 'Y'
AND DATABASE_NAME = '{YOUR_DATABASE_NAME}'
AND TABLE_OWNER = '{YOUR_OWNER_NAME}'
AND TABLE_NAME = '{YOUR_TABLE_NAME}'
);
I like this because it's simple and the same for all tables, regardless of differences in primary key declaration.
I also use this query a lot to view current primary key violations by table:
SELECT
DATABASE_NAME
,TABLE_OWNER
,TABLE_NAME
,COUNT(*) QUANTITY
FROM
NZ_DUPLICATE_PKS
WHERE
CURRENT_RECORD_INDICATOR = 'Y'
GROUP BY
1
,2
,3
ORDER BY
1
,2
,3;
That sums everything up. I hope some folks find it useful. We have already made a lot of progress with this approach. At this point, you may be wondering why I went to all this trouble. I detest that PK violations are allowed into our warehouse, and I desired a comprehensive approach to their eradication. The above process has been running daily in our production environment for a couple months. We have ~350 tables with primary keys declared, ranging in size from 5 row dimensions to ~200 million row facts @ 10Gb. For Netezza, this is a pretty modest outlay. The entire process takes less than 10 minutes on our Mustang NPS 10050.