Im trying to extract only the first hit from an NCBI xml BLAST file. next I would like to get only the first HSP. at the final stage I would like to get these based on best score. to make things clear here a sample of the xml file:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastx</BlastOutput_program>
<BlastOutput_version>blastx 2.2.22 [Sep-27-2009]</BlastOutput_version>
<BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~"Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>/Applications/blast/db/viral1.protein.faa</BlastOutput_db>
<BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
<BlastOutput_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</BlastOutput_query-def>
<BlastOutput_query-len>1024</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>1e-05</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>lcl|1_0</Iteration_query-ID>
<Iteration_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</Iteration_query-def>
<Iteration_query-len>1024</Iteration_query-len>
<Iteration_stat>
<Statistics>
<Statistics_db-num>68007</Statistics_db-num>
<Statistics_db-len>19518578</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
<Iteration_message>No hits found</Iteration_message>
</Iteration>
<Iteration>
<Iteration>
<Iteration_iter-num>6</Iteration_iter-num>
<Iteration_query-ID>lcl|6_0</Iteration_query-ID>
<Iteration_query-def>DSAD-090629_plate11A05a.g1 CHROMAT_FILE: DSAD-090629_plate11A05a.g1 PHD_FILE: DSAD-090629_plate11A05a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A05a DIRECTION: rev</Iteration_query-def>
<Iteration_query-len>1068</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gnl|BL_ORD_ID|23609</Hit_id>
<Hit_def>gi|38707884|ref|NP_945016.1| Putative ribose-phosphate pyrophosphokinase [Enterobacteria phage Felix 01]</Hit_def>
<Hit_accession>23609</Hit_accession>
<Hit_len>293</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>49.2914</Hsp_bit-score>
<Hsp_score>116</Hsp_score>
<Hsp_evalue>5.15408e-06</Hsp_evalue>
<Hsp_query-from>580</Hsp_query-from>
<Hsp_query-to>792</Hsp_query-to>
<Hsp_hit-from>202</Hsp_hit-from>
<Hsp_hit-to>273</Hsp_hit-to>
<Hsp_query-frame>-1</Hsp_query-frame>
<Hsp_identity>26</Hsp_identity>
<Hsp_positive>45</Hsp_positive>
<Hsp_gaps>2</Hsp_gaps>
<Hsp_align-len>73</Hsp_align-len>
<Hsp_qseq>MHIIGDVE--GRTCILVDDMVDTAGTLCHAAKALKERGAAKVYAYCTHPVLSGRAIENIENSVLDELVVTNTI</Hsp_qseq>
<Hsp_hseq>MRILDDVDLTDKTVMILDDICDGGRTFVEAAKHLREAGAKRVELYVTHGIFS-KDVENLLDNGIDHIYTTNSL</Hsp_hseq>
<Hsp_midline>M I+ DV+ +T +++DD+ D T AAK L+E GA +V Y TH + S + +EN+ ++ +D + TN++</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>2</Hit_num>
<Hit_id>gnl|BL_ORD_ID|2466</Hit_id>
<Hit_def>gi|51557505|ref|YP_068339.1| large tegument protein [Suid herpesvirus 1]</Hit_def>
<Hit_accession>2466</Hit_accession>
<Hit_len>3084</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>48.9062</Hsp_bit-score>
<Hsp_score>115</Hsp_score>
<Hsp_evalue>6.70494e-06</Hsp_evalue>
<Hsp_query-from>369</Hsp_query-from>
<Hsp_query-to>875</Hsp_query-to>
<Hsp_hit-from>2312</Hsp_hit-from>
<Hsp_hit-to>2465</Hsp_hit-to>
<Hsp_query-frame>-2</Hsp_query-frame>
<Hsp_identity>52</Hsp_identity>
<Hsp_positive>70</Hsp_positive>
<Hsp_gaps>4</Hsp_gaps>
<Hsp_align-len>173</Hsp_align-len>
<Hsp_qseq>APESQEPGASTWRSSTSVVKKGQPSQK*CTSSVTSKAVPASWSTTWSTLPAPCATPPKR*KSAAPPRSTPTAPTRCCPAAPSRTSRIPSWTSWWSPTPSRCPLRRSPARVFASSTSPR-SSPKRSAASATKNRSAP---CSAKRNWPDHTAPPRAGLFALPPEAGRKPQGGLV</Hsp_qseq>
<Hsp_hseq>APPAQKPPAQPATAAATTAPKATPQTQPPTRAQTQTAPPPPSAAT-----AAAQVPPQ------PPSSQPAAKPRGAPPAPPAPP--PPSAQTTLPRPAAPPAPPPPS---AQTTLPRPAPPPPSAPAATPTPPAPGPAPSAKKSDGDRIVEPKAG---APPDVRDAKFGGKV</Hsp_hseq>
<Hsp_midline>AP +Q+P A ++ + K P + T + T A P + T A PP+ PP S P A R P AP P P P+ P P+ A +T PR + P SA +AT AP SAK++ D P+AG PP+ GG V</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>68007</Statistics_db-num>
<Statistics_db-len>19518578</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
basically each query search creates an Iteration element. each iteration can have multiple hit which in turn can have multiple HSPs. I would like to get only the first hit and it's first HSP from each iteration. if the BLAST found no hits I would like to ignore the iteration. I worked up this simple code:
#!/usr/bin/env python
from elementtree.ElementTree import parse
from elementtree import ElementTree as ET
file = open("/Applications/blast/blanes_viral_nr_results.xml", "r")
save_file = open("/Applications/blast/Blast_parse_ET.txt", 'w')
tree = parse(file)
elem = tree.getroot()
print elem
Per_ID = ()
save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t\n\n\n\n' % ("It_Num\t", "It_ID\t", "Hit_Def\t", "Num\t", "ID\t", "ACC\t"))
iteration = tree.findall('BlastOutput_iterations/Iteration')
for iteration in iteration:
for hit in iteration.findall('Iteration_hits/Hit'):
It_Num = iteration.findtext('Iteration_iter-num')
It_ID = iteration.findtext('Iteration_query-def')
Hit_Def = hit.findtext('Hit_def')
Num = hit.findtext('Hit_num')
ID = hit.findtext('Hit_id')
DEF = hit.findtext('Hit_def')
ACC = hit.findtext('Hit_accession')
save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t' % (It_Num, It_ID[12:26], Hit_Def[1:10], Num, ID, ACC,))
for hsp in hit.findall('Hit_hsps'):
HSPN = hsp.findtext('Hsp/Hsp_num')
identities = hsp.findtext('Hsp/Hsp_identity')
#print 'id: ', identities.rjust(4),
length = hsp.findtext('Hsp/Hsp_align-len')
#print 'len:', length.rjust(4),
Per_ID = int(identities) * 100.0 / int(length)
#print hsp.findtext('Hsp/Hsp_qseq')[:50]
#print hsp.findtext('Hsp/Hsp_midline')[:50]
#print hsp.findtext('Hsp/Hsp_hseq')[:50]
save_file.write('%s\t%s\t%s\%st\n' % ('***', '%', HSPN, Per_ID))
save_file.write('n\n' % ())
any help would be greatly appriciated!
While building your own parser may be "fun" there is already a package out there that can parse BLAST xml files ... it can even do the intermediate calling of a local BLAST instance for you if you so desire.
The main site is here: http://biopython.org/wiki/Biopython
and the XML BLAST parser is here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82
Something like:
from Bio.Blast import NCBIXML
with open('xml/results/file') as handle:
all_records = NCBIXML.parse(handle)
first_record = all_records.next()
Should work. I generally love the BioPython parsers and writers but I do not like the class-structure organization. So I generally just use the parser and extract the info I need into my own structure. YMMV
Hope that helps.