I have a large Variant Call format (VCF) file (> 4GB) which has data for several samples.
I have browsed Google, Stackoverflow as well as tried the VariantAnnotation package in R to somehow extract data only for a particular sample, but have not found any information on how to do that in R.
Did anybody try anything like that, or maybe knows of another package that would enable this?
In VariantAnnotation use a ScanVcfParam
to specify the data that you'd like to extract. Using the sample VCF file included with the package
library(VariantAnnotation)
vcfFile = system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")
discover information about the file
scanVcfHeader(vcfFile)
## class: VCFHeader
## samples(5): HG00096 HG00097 HG00099 HG00100 HG00101
## meta(1): fileformat
## fixed(0):
## info(22): LDAF AVGPOST ... VT SNPSOURCE
## geno(3): GT DS GL
Formulate a request for the "LDAF", "AVGPOST" info fields, "GT" genotype field for samples "HG00097", "HG00101" for variants on chromosome 22 between coordinates 50300000, 50400000
param = ScanVcfParam(
info=c("LDAF", "AVGPOST"),
geno="GT",
samples=c("HG00097", "HG00101"),
which=GRanges("22", IRanges(50300000, 50400000)))
Read the requested data
vcf = readVcf(vcfFile, "hg19", param=param)
and extract from VCF the relevant data
head(geno(vcf)[["GT"]])
## HG00097 HG00101
## rs7410291 "0|0" "0|0"
## rs147922003 "0|0" "0|0"
## rs114143073 "0|0" "0|0"
## rs141778433 "0|0" "0|0"
## rs182170314 "0|0" "0|0"
## rs115145310 "0|0" "0|0"
head(info(vcf)[["LDAF"]])
## [1] 0.3431 0.0091 0.0098 0.0062 0.0041 0.0117
ranges(vcf)
## IRanges of length 1169
## start end width names
## [1] 50300078 50300078 1 rs7410291
## [2] 50300086 50300086 1 rs147922003
## [3] 50300101 50300101 1 rs114143073
## [4] 50300113 50300113 1 rs141778433
## [5] 50300166 50300166 1 rs182170314
## ... ... ... ... ...
## [1165] 50364310 50364312 3 22:50364310_GCA/G
## [1166] 50364311 50364313 3 22:50364311_CAT/C
## [1167] 50364464 50364464 1 rs150069372
## [1168] 50364465 50364465 1 rs146661152
## [1169] 50364609 50364609 1 rs184235324
Maybe you're only interested in genotype element "GS" as a simple R matrix, then just specify the samples and / or ranges you're interested in and use readGeno
(or readGT
or readInfo
for similar specialized queries).
There is extensive documentation in the VariantAnnotation vignettes and reference manual; see also ?ScanVcfParam
; example(ScanVcfParam)
.