Extract sample data from VCF files

rokosir picture rokosir · Feb 6, 2014 · Viewed 11.5k times · Source

I have a large Variant Call format (VCF) file (> 4GB) which has data for several samples.

I have browsed Google, Stackoverflow as well as tried the VariantAnnotation package in R to somehow extract data only for a particular sample, but have not found any information on how to do that in R.

Did anybody try anything like that, or maybe knows of another package that would enable this?

Answer

Martin Morgan picture Martin Morgan · Feb 6, 2014

In VariantAnnotation use a ScanVcfParam to specify the data that you'd like to extract. Using the sample VCF file included with the package

library(VariantAnnotation)
vcfFile = system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")

discover information about the file

scanVcfHeader(vcfFile)
## class: VCFHeader 
## samples(5): HG00096 HG00097 HG00099 HG00100 HG00101
## meta(1): fileformat
## fixed(0):
## info(22): LDAF AVGPOST ... VT SNPSOURCE
## geno(3): GT DS GL

Formulate a request for the "LDAF", "AVGPOST" info fields, "GT" genotype field for samples "HG00097", "HG00101" for variants on chromosome 22 between coordinates 50300000, 50400000

param = ScanVcfParam(
    info=c("LDAF", "AVGPOST"),
    geno="GT",
    samples=c("HG00097", "HG00101"),
    which=GRanges("22", IRanges(50300000, 50400000)))

Read the requested data

vcf = readVcf(vcfFile, "hg19", param=param)

and extract from VCF the relevant data

head(geno(vcf)[["GT"]])
##             HG00097 HG00101
## rs7410291   "0|0"   "0|0"  
## rs147922003 "0|0"   "0|0"  
## rs114143073 "0|0"   "0|0"  
## rs141778433 "0|0"   "0|0"  
## rs182170314 "0|0"   "0|0"  
## rs115145310 "0|0"   "0|0"  
head(info(vcf)[["LDAF"]])
## [1] 0.3431 0.0091 0.0098 0.0062 0.0041 0.0117
ranges(vcf)
## IRanges of length 1169
##           start      end width             names
## [1]    50300078 50300078     1         rs7410291
## [2]    50300086 50300086     1       rs147922003
## [3]    50300101 50300101     1       rs114143073
## [4]    50300113 50300113     1       rs141778433
## [5]    50300166 50300166     1       rs182170314
## ...         ...      ...   ...               ...
## [1165] 50364310 50364312     3 22:50364310_GCA/G
## [1166] 50364311 50364313     3 22:50364311_CAT/C
## [1167] 50364464 50364464     1       rs150069372
## [1168] 50364465 50364465     1       rs146661152
## [1169] 50364609 50364609     1       rs184235324

Maybe you're only interested in genotype element "GS" as a simple R matrix, then just specify the samples and / or ranges you're interested in and use readGeno (or readGT or readInfo for similar specialized queries).

There is extensive documentation in the VariantAnnotation vignettes and reference manual; see also ?ScanVcfParam; example(ScanVcfParam).