How to break out of only one nested loop

biohazard · Sep 1, 2013 · Viewed 61.1k times

I have two tab-delimited files, and I need to test every row in the first file against all the rows in the other file. For instance,

file1:

row1    c1    36    345   A
row2    c3    36    9949  B
row3    c4    36    858   C

file2:

row1    c1    3455  3800
row2    c3    6784  7843
row3    c3    10564 99302
row4    c5    1405  1563

Let's say I would like to output every row of file1 for which col[3] is smaller than col[2] in at least one (not necessarily every) row of file2 that has the same col[1].

Expected output:

row1    c1    36    345   A
row2    c3    36    9949  B

Since I am working in Ubuntu, I would like the input command to look like this:
python code.py [file1] [file2] > [output]

I wrote the following code:

import sys

filename1 = sys.argv[1]
filename2 = sys.argv[2]

file1 = open(filename1, 'r')
file2 = open(filename2, 'r')

done = False

for x in file1.readlines():
    col = x.strip().split()
    for y in file2.readlines():
        col2 = y.strip().split()
        if col[1] == col2[1] and col[3] < col2[2]:
            done = True
            break
        else: continue
print x

However, the output looks like this:

row2    c3    36    9949  B

This is more evident with larger datasets, but I always get only the last row for which the condition in the nested loop was true. I suspect that "break" is breaking me out of both loops. I would like to know (1) how to break out of only one of the for loops, and (2) whether this is the only problem in my code.

Answer

NPE · Sep 1, 2013

break and continue apply to the innermost loop.
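
A toy example may help confirm this. It does not use the asker's data, the function name is made up for illustration, and it is written for Python 3 (the question uses the Python 2 print statement):

```python
def pairs_before_inner_break(outer_values, inner_values, stop_at):
    """Collect (outer, inner) pairs, breaking the inner loop at stop_at."""
    visited = []
    for outer in outer_values:
        for inner in inner_values:
            if inner == stop_at:
                break  # leaves only the inner loop
            visited.append((outer, inner))
        # execution resumes here, then the outer loop moves to its next item
    return visited


print(pairs_before_inner_break(["a", "b"], [1, 2, 3], stop_at=2))
# prints [('a', 1), ('b', 1)]
```

Both "a" and "b" appear in the result, so the outer loop kept running after the inner break.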

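The issue is that you open the second file only once, so it is read only once. The first call to file2.readlines() consumes the whole file and leaves the file position at the end. When for y in file2.readlines(): executes a second time, file2.readlines() returns an empty list, so the inner loop body never runs again.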
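
The effect can be reproduced without any files on disk. In this small Python 3 sketch, io.StringIO stands in for an opened file and the two lines of content are invented for the demonstration:

```python
import io

# io.StringIO behaves like a text file opened with open(); contents are made up.
handle = io.StringIO("row1\tc1\nrow2\tc3\n")

first_pass = handle.readlines()   # reads everything up to the end of the stream
second_pass = handle.readlines()  # the position is already at the end

print(len(first_pass), len(second_pass))
# prints 2 0
```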

Either move file2 = open(filename2, 'r') into the outer loop, or use seek() to rewind to the beginning of file2.
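
As a rough sketch of the seek() variant, assuming Python 3 and using io.StringIO in place of the real files so the example is self-contained: the function name matching_rows is hypothetical, and the int() conversion is an addition not mentioned above, included because the split fields are strings and would otherwise be compared lexicographically.

```python
import io


def matching_rows(file1, file2):
    """Return rows of file1 whose col[3] is below col[2] of some file2 row with the same col[1]."""
    matches = []
    for line1 in file1:
        col = line1.split()
        file2.seek(0)  # rewind so the inner loop sees every row of file2 again
        for line2 in file2:
            col2 = line2.split()
            # int() is an assumption: the columns are taken to hold integers
            if col[1] == col2[1] and int(col[3]) < int(col2[2]):
                matches.append(line1.rstrip("\n"))
                break  # stop scanning file2; the outer loop continues
    return matches


# Sample data from the question, with StringIO standing in for the opened files.
file1 = io.StringIO(
    "row1\tc1\t36\t345\tA\n"
    "row2\tc3\t36\t9949\tB\n"
    "row3\tc4\t36\t858\tC\n"
)
file2 = io.StringIO(
    "row1\tc1\t3455\t3800\n"
    "row2\tc3\t6784\t7843\n"
    "row3\tc3\t10564\t99302\n"
    "row4\tc5\t1405\t1563\n"
)

for row in matching_rows(file1, file2):
    print(row)
```

With real files, the same function should work on the objects returned by open(sys.argv[1]) and open(sys.argv[2]). Note that each matching row is recorded inside the outer loop, at the point where the match is found, rather than once after both loops have finished.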