I am working with PLINK to analyse genome-wide data.
Does anyone know how to remove duplicated SNPs?
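In PLINK 1.9, use --list-duplicate-vars suppress-first, which lists duplicates (variants sharing the same position and allele codes) but leaves the first member of each group off the list. Add ids-only so the list contains only variant IDs, then pass it to --exclude: that removes the extra copies and leaves the first one intact. I've known this to slip up, though.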
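As a rough sketch of that two-step route, assuming PED/MAP input: the function name and the file prefixes are hypothetical, and the duplicate list is expected in <output prefix>.dupvar, which is where PLINK 1.9 writes it when --out is given.

```shell
# Hypothetical helper: $1 = input PED/MAP prefix, $2 = output prefix.
remove_duplicate_vars() {
    # Step 1: list duplicate variant IDs, omitting the first member of each group.
    # With --out "$2", the list should land in "$2.dupvar".
    plink --file "$1" --list-duplicate-vars ids-only suppress-first --out "$2" || return 1

    # Step 2: drop the listed IDs and write a binary fileset.
    plink --file "$1" --exclude "$2.dupvar" --make-bed --out "$2"
}
```

Called as, say, remove_duplicate_vars yourfile_chr1 yourfile_chr1_nodup. It's worth comparing variant counts before and after, given that this option can slip up.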
Instead of using --exclude as Davy suggested, you can also use --extract, which keeps a list of SNPs rather than getting rid of one. There's an easy way to build that list on any Unix-based system (assuming your data are in PED/MAP format and split by chromosome):
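for i in {1..22}; do
    # MAP columns: chromosome, SNP ID, genetic distance, bp position.
    # Print the ID of the first SNP seen at each chromosome/position pair.
    awk '!seen[$1, $4]++ { print $2 }' yourfile_chr${i}.map > keepers_chr${i}.txt
done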
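This will create a keepers_chr${i}.txt file for each chromosome, containing the IDs of SNPs at unique locations. Then run PLINK on your original file(s) with --extract keepers_chr${i}.txt, along with --make-bed --out unique_file (giving each chromosome its own output name so the results don't overwrite each other).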
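To check what a position-based filter keeps before running it on real data, here is a small sketch on a toy MAP file. The SNP IDs and file names are made up, and awk keyed on chromosome and bp position is just one way to do the de-duplication.

```shell
# Toy MAP file: chromosome, SNP ID, genetic distance, bp position.
# rsA and rsB sit at the same position; rsC is on its own.
printf '1\trsA\t0\t1000\n1\trsB\t0\t1000\n1\trsC\t0\t2000\n' > toy_chr1.map

# Keep the ID of the first SNP seen at each chromosome/position pair.
awk '!seen[$1, $4]++ { print $2 }' toy_chr1.map > toy_keepers_chr1.txt

cat toy_keepers_chr1.txt
# prints:
# rsA
# rsC
```

Only the first SNP at each position survives, so rsB is dropped. If you'd rather keep a particular member of a duplicate pair, sort or filter the MAP file first.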