How to download complete genome sequences with Biopython Entrez.esearch

user2662581 · Aug 27, 2013

I need to download only complete genome sequences from NCBI, in GenBank (full) format. I am interested in 'complete genome' records, not 'whole genome' ones.

My script:

from Bio import Entrez
Entrez.email = "[email protected]"
gatunek='Escherichia[ORGN]'
handle = Entrez.esearch(db='nucleotide',
     term=gatunek, property='complete genome' )#title='complete genome[title]')
result = Entrez.read(handle)

As a result, I get only small genome fragments, about 484 bp in size:

LOCUS       NZ_KE350773              484 bp    DNA     linear   CON 23-AUG-2013
DEFINITION  Escherichia coli E1777 genomic scaffold scaffold9_G, whole genome
       shotgun sequence.

I know how to do it manually via the NCBI website, but that is very time-consuming. The query I use there is:

escherichia[orgn] AND complete genome[title]

That returns multiple genomes with sizes around 5,154,862 bp, and this is what I need to reproduce via Entrez.esearch.

Answer

peterjc · Oct 18, 2013

You've done the hard part and worked out the query:

escherichia[orgn] AND complete genome[title]

So use that as the search query via Biopython as well!

from Bio import Entrez
Entrez.email = "[email protected]"
search_term = "escherichia[orgn] AND complete genome[title]"
handle = Entrez.esearch(db='nucleotide', term=search_term)
result = Entrez.read(handle)
handle.close()
print(result['Count'])  # total number of matching records

Currently that gives me 140 results, starting with 545778205, which is the same as the website: http://www.ncbi.nlm.nih.gov/nuccore/?term=escherichia%5Borgn%5D+AND+complete+genome%5Btitle%5D
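The search above only returns record IDs, and esearch hands back just the first 20 of them unless retmax is raised. To actually save the GenBank records, something along these lines might work. It is a minimal sketch, not a tested pipeline: the helper names (build_term, batches, download_genbank), the output path, and the batch size are all made up for illustration, and rettype="gbwithparts" is my assumption for what the website calls "GenBank (full)".

```python
def build_term(organism, title_phrase="complete genome"):
    """Compose the same query string that works on the NCBI website."""
    return f"{organism}[orgn] AND {title_phrase}[title]"


def batches(ids, size):
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def download_genbank(term, out_path, email, retmax=200, batch_size=20):
    """Search the nucleotide database and append each hit to one GenBank file.

    Hypothetical helper: parameter names and defaults are illustrative only.
    """
    # Imported here so the two helpers above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email
    # retmax raises the number of IDs returned (the esearch default is 20).
    handle = Entrez.esearch(db="nucleotide", term=term, retmax=retmax)
    result = Entrez.read(handle)
    handle.close()

    with open(out_path, "w") as out:
        for chunk in batches(result["IdList"], batch_size):
            # Assumption: "gbwithparts" returns the full GenBank record,
            # including the sequence; retmode="text" is assumed to give str.
            fetch = Entrez.efetch(db="nucleotide", id=",".join(chunk),
                                  rettype="gbwithparts", retmode="text")
            out.write(fetch.read())
            fetch.close()
    return len(result["IdList"])
```

Calling download_genbank(build_term("escherichia"), "genomes.gb", email=...) with your own address would then write the records to genomes.gb. Fetching in small batches keeps each request modest, since complete genomes are several megabases each.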