Deleting duplicate rows from Redshift

Question 1

Deleting duplicate rows from Redshift

sql amazon-redshift sql-delete

Neil · Jun 2, 2016 · Viewed 36.6k times

Answer

Because Redshift does not enforce uniqueness on any column, Ziggy's 3rd option is probably best. Once we decide to go the temp table route, it is more efficient to swap the whole table out, since deletes and inserts are expensive in Redshift.

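begin;
-- Build a deduplicated copy. Note that CREATE TABLE AS does not inherit
-- constraints, identity columns, or default column values.
create table table_name_new as select distinct * from table_name;
-- Swap the deduplicated table in under the original name.
alter table table_name rename to table_name_old;
alter table table_name_new rename to table_name;
-- Drop the old copy (skip this step if you want to validate first).
drop table table_name_old;
commit;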

If space isn't an issue, you can keep the old table around for a while and use the other methods described here to validate that the row count of the original, after accounting for duplicates, matches the row count of the new table.
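As a rough sketch of that validation, assuming the drop step was skipped so that table_name_old still exists, the two counts can be compared side by side. The column aliases distinct_old_rows and new_rows are made up for illustration.

```sql
-- Assumes table_name_old was kept (the drop step was skipped).
-- The two numbers should be equal if the swap preserved every distinct row.
select
    (select count(*)
       from (select distinct * from table_name_old) as d) as distinct_old_rows,
    (select count(*) from table_name)                     as new_rows;
```

If the numbers differ, something wrote to one of the tables during the swap, which is one more reason to pause loads first.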

If you're doing constant loads to such a table you'll want to pause that process while this is going on.

If the duplicates are only a small percentage of a large table, you might instead copy one distinct record for each duplicated row to a temp table, delete all records from the original that join with the temp table, and then append the temp table back to the original. Make sure you vacuum the original table afterwards (which you should be doing on a schedule for large tables anyway).
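A minimal sketch of that approach, under two assumptions that are not in the original: the table has a column, here called id (hypothetical), that identifies which rows are duplicates of each other, and duplicated rows are identical in every column. The temp table name dup_rows is also made up.

```sql
begin;

-- One copy of every row whose id occurs more than once.
-- select distinct only collapses rows that match in every column.
create temp table dup_rows as
select distinct t.*
from table_name t
where t.id in (
    select id
    from table_name
    group by id
    having count(*) > 1
);

-- Remove every copy of the duplicated rows from the original.
delete from table_name
using dup_rows
where table_name.id = dup_rows.id;

-- Put a single copy of each back.
insert into table_name
select * from dup_rows;

commit;

-- VACUUM cannot run inside a transaction block, so it goes after the commit.
vacuum table_name;
```

If rows sharing an id differ in some other column, select distinct keeps all of the variants, so the result would still contain more than one row per id and the selection logic would need to pick a winner explicitly.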

Question 2

I am trying to delete some duplicate data in my Redshift table.

Below is my query:

With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by record_indicator Order by record_indicator) as Duplicate From table_name)
delete from duplicates
Where Duplicate > 1 ;

This query is giving me an error.

Amazon Invalid operation: syntax error at or near "delete";

I'm not sure what the issue is, as the syntax for the WITH clause seems to be correct. Has anybody faced this situation before?
