I have two tab-delimited files, and I need to test every row in the first file against all the rows in the other file. For instance,
file1:
row1 c1 36 345 A
row2 c3 36 9949 B
row3 c4 36 858 C
file2:
row1 c1 3455 3800
row2 c3 6784 7843
row3 c3 10564 99302
row4 c5 1405 1563
Let's say I would like to output all the rows in file1 for which col[3] of file1 is smaller than any (not every) col[2] of file2, given that col[1] is the same in both rows.
Expected output:
row1 c1 36 345 A
row2 c3 36 9949 B
Since I am working in Ubuntu, I would like the input command to look like this:
python code.py [file1] [file2] > [output]
I wrote the following code:
import sys
filename1 = sys.argv[1]
filename2 = sys.argv[2]
file1 = open(filename1, 'r')
file2 = open(filename2, 'r')
done = False
for x in file1.readlines():
    col = x.strip().split()
    for y in file2.readlines():
        col2 = y.strip().split()
        if col[1] == col2[1] and col[3] < col2[2]:
            done = True
            break
        else: continue
print x
However, the output looks like this:
row2 c3 36 9949 B
This is more evident with larger datasets, but basically I always get only the last row for which the condition in the nested loop was true. I suspect that "break" is breaking me out of both loops. I would like to know (1) how to break out of only one of the for loops, and (2) whether this is the only problem I've got here.
break and continue apply only to the innermost enclosing loop.
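A minimal sketch to illustrate this (the loop bounds and the name visited are made up for the demonstration): the break below leaves only the inner loop, and the outer loop carries on with its next iteration.

```python
# Illustrative only: `break` exits the inner loop, the outer loop keeps going.
visited = []
for outer in range(3):
    for inner in range(3):
        if inner == 1:
            break  # leaves only the `for inner` loop
        visited.append((outer, inner))

# The outer loop ran all three times; each inner loop stopped at inner == 1.
print(visited)
```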
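The issue is that you open the second file only once, so it is read only once. The first call to file2.readlines() consumes the whole file and leaves the file position at the end. When for y in file2.readlines(): executes a second time, file2.readlines() returns an empty list, so the inner loop body never runs again.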
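The effect is easy to reproduce in isolation. In this sketch, io.StringIO stands in for the opened file2 (an assumption made only to keep the example self-contained; a real file object behaves the same way with respect to readlines() and seek()).

```python
import io

# io.StringIO is a stand-in for the opened file2 (assumption for the demo).
file2 = io.StringIO("row1\tc1\t3455\t3800\nrow2\tc3\t6784\t7843\n")

first = file2.readlines()   # reads every line, position is now at the end
second = file2.readlines()  # nothing left to read
file2.seek(0)               # rewind to the beginning
third = file2.readlines()   # the lines are available again

print(len(first), len(second), len(third))
```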
Either move file2 = open(filename2, 'r') into the outer loop, or use seek() to rewind to the beginning of file2 before each pass of the inner loop.
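As one possible way to put this together, here is a sketch that reads file2 into a list once instead of reopening or rewinding it. The function name matching_rows is hypothetical. It also converts the compared columns with int(), which is an assumption about the intent: the original compares strings, and as strings "9949" < "10564" is false, so row2 would not appear in the expected output.

```python
import sys


def matching_rows(lines1, lines2):
    """Return lines of lines1 whose col[3] is smaller than col[2] of at
    least one line in lines2 that has the same col[1]."""
    # Parse file2 once so it can be scanned again for every row of file1.
    rows2 = [line.split() for line in lines2 if line.strip()]
    kept = []
    for line in lines1:
        col = line.split()
        if not col:
            continue  # skip blank lines
        for col2 in rows2:
            # int() is an assumption: the columns are compared as numbers.
            if col[1] == col2[1] and int(col[3]) < int(col2[2]):
                kept.append(line.rstrip("\n"))
                break  # one match is enough; leaves only the inner loop
    return kept


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as file1, open(sys.argv[2]) as file2:
        for row in matching_rows(file1, file2.readlines()):
            print(row)
```

Saved as code.py, this should work with the invocation from the question: python code.py [file1] [file2] > [output]. Printing inside the if branch (here, collecting into kept) is what makes each matching row appear, rather than only whatever x holds after the loops finish.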