How do I remove duplicated SNPs using PLINK?

user1236418 · Mar 25, 2012

I am working with PLINK to analyse genome-wide data.

Does anyone know how to remove duplicated SNPs?

Answer

Benjamatic · Mar 23, 2016

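In PLINK 1.9, use --list-duplicate-vars suppress-first. This writes the duplicates to a .dupvar file but leaves the first variant of each duplicate group off the list, so that excluding the listed variants removes the extra copies and leaves one intact. Add ids-only if you want a plain ID list that --exclude can read. I've known this to slip up, though.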
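As a rough sketch of that two-step approach, assuming the data are in a binary fileset: the helper name remove_dups and the prefixes mydata and mydata_nodup are placeholders, not names from the question. The function does nothing if plink or the input files are missing.

```shell
#!/usr/bin/env bash

# Hypothetical helper: remove duplicate variants from a binary fileset.
# $1 = input prefix (expects $1.bed/.bim/.fam), $2 = output prefix.
remove_dups() {
  local in_prefix="$1" out_prefix="$2"
  if ! command -v plink >/dev/null 2>&1 || [ ! -f "${in_prefix}.bed" ]; then
    echo "plink or ${in_prefix}.bed not found; skipping" >&2
    return 1
  fi
  # Step 1: list duplicate IDs, minus the first of each group, in <out>.dupvar.
  plink --bfile "$in_prefix" --list-duplicate-vars ids-only suppress-first \
        --out "$out_prefix" || return 1
  # Step 2: exclude the listed variants, keeping one per duplicate group.
  plink --bfile "$in_prefix" --exclude "${out_prefix}.dupvar" \
        --make-bed --out "$out_prefix"
}

remove_dups mydata mydata_nodup || true   # mydata is a placeholder prefix
```

It is worth checking the .dupvar file by eye before the second step, since the list decides which variants are dropped.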

Instead of using --exclude as Davy suggested, you can also use --extract, which keeps a list of SNPs rather than getting rid of one. There's an easy method on any Unix-based system (assuming your data are in PED/MAP format and split by chromosome):

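for i in {1..22}; do
  # MAP columns: chromosome, SNP ID, genetic distance, bp position.
  # Keep the ID of the first SNP seen at each chromosome/position pair.
  awk '!seen[$1, $4]++ { print $2 }' "yourfile_chr${i}.map" > "keepers_chr${i}.txt"
done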

This will create one keepers_chr${i}.txt file per chromosome, containing the IDs of SNPs at unique locations. Then run PLINK on each original fileset with --extract keepers_chr${i}.txt, together with --make-bed and a per-chromosome --out name such as unique_file_chr${i}.
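Here is a minimal end-to-end sketch for a single chromosome. The toy MAP file, its SNP IDs and positions, and the prefix toy_chr1 are invented for illustration. It uses awk to keep the first SNP seen at each position, which is one way to build the list, and it only calls PLINK if a plink binary and the matching PED file are present.

```shell
#!/usr/bin/env bash

prefix="toy_chr1"   # hypothetical file prefix, not from the original post

# Toy MAP file: chromosome, SNP ID, genetic distance, bp position.
# rs2 and rs2_dup share position 2000, so only the first should be kept.
printf '1\trs1\t0\t1000\n1\trs2\t0\t2000\n1\trs2_dup\t0\t2000\n1\trs3\t0\t3000\n' \
  > "${prefix}.map"

# Keep the ID of the first SNP seen at each chromosome/position pair.
awk '!seen[$1, $4]++ { print $2 }' "${prefix}.map" > keepers_chr1.txt
cat keepers_chr1.txt

# Run PLINK only when it is installed and the PED file is present.
if command -v plink >/dev/null 2>&1 && [ -f "${prefix}.ped" ]; then
  plink --file "${prefix}" --extract keepers_chr1.txt \
        --make-bed --out unique_file_chr1
fi
```

Note that this treats two SNPs as duplicates purely because they share a position. It does not compare alleles, so check the dropped IDs if that distinction matters for your data.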