I'm just testing out AWS Redshift. After discovering some duplicate data from an insert that I'd hoped would simply fail on a duplicate value in the key column, reading the docs reveals that primary key constraints aren't "enforced".
Has anyone figured out how to prevent duplication on a primary key (per the "traditional" expectation)?
Thanks to any Redshift pioneers!
I assign UUIDs when the records are created. If the records are inherently unique, I use version 4 UUIDs (random); when they aren't, I use version 5 (SHA-1 hash), with the natural keys as input.
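For what it's worth, here's a minimal Python sketch of that assignment step, done in the loading application before the rows ever reach Redshift. The namespace UUID and the natural-key values are hypothetical; substitute your own.

import uuid

# Hypothetical fixed namespace for this application's natural keys;
# any UUID works as long as it never changes once ids are in use.
APP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "myapp.example.com")

def record_id(natural_key_parts=None):
    """Version 4 (random) UUID for inherently unique records, or a
    deterministic version 5 (SHA-1) UUID derived from the natural key."""
    if natural_key_parts is None:
        return uuid.uuid4()
    # Join the key columns with a separator so the same business key
    # always hashes to the same UUID.
    return uuid.uuid5(APP_NAMESPACE, "|".join(str(p) for p in natural_key_parts))

print(record_id(["cust-42", "2013-07-01"]))  # same inputs -> same id
print(record_id())                           # fresh random id each call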
Then you can follow AWS's merge (upsert) instructions quite easily. If your input contains duplicates, you should be able to clean them up in your staging table with a SQL statement along these lines:
CREATE TABLE cleaned AS
SELECT
    pk_field,
    field_1,
    field_2,
    ...
FROM (
    SELECT
        -- Number the rows within each pk_field group; keeping only r = 1
        -- leaves one row per key. ORDER BY pk_field inside the partition is
        -- effectively arbitrary, so order by a timestamp or version column
        -- instead if a particular duplicate should win.
        ROW_NUMBER() OVER (PARTITION BY pk_field ORDER BY pk_field) AS r,
        t.*
    FROM table1 t
) x
WHERE x.r = 1;
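Once cleaned exists, the merge itself is the delete-then-insert pattern from the AWS docs, run in one transaction. Here's a rough sketch driving it from Python with psycopg2; the connection details and the target_table name are placeholders, and it assumes cleaned and target_table have identical column layouts.

import psycopg2

# Placeholder connection details; point these at your own cluster.
conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="myuser",
    password="mypassword",
)

# The connection context manager wraps the statements in a single
# transaction: commit on success, rollback on any exception.
with conn, conn.cursor() as cur:
    # Delete target rows that are about to be replaced...
    cur.execute("""
        DELETE FROM target_table
        USING cleaned
        WHERE target_table.pk_field = cleaned.pk_field;
    """)
    # ...then insert the deduplicated staging rows and drop the scratch table.
    cur.execute("INSERT INTO target_table SELECT * FROM cleaned;")
    cur.execute("DROP TABLE cleaned;")
conn.close()

Since Redshift itself enforces none of this, the uniqueness guarantee is only as strong as the load process, so every path that writes to the target needs to go through the same staging-and-merge step.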