I have two large files (sets of filenames). Roughly 30.000 lines in each file. I am trying to find a fast way of finding lines in file1 that are not present in file2.
For example, if this is file1:
line1
line2
line3
And this is file2:
line1
line4
line5
Then my result/output should be:
line2
line3
This works:
grep -v -f file2 file1
But it is very, very slow when used on my large files.
I suspect there is a good way to do this using diff()
, but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.
Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?
EDIT: To follow up on my own question, this is the best way I have found so far using diff()
:
diff file2 file1 | grep '^>' | sed 's/^>\ //'
Surely, there must be a better way?
The comm command (short for "common") may be useful comm - compare two sorted files line by line
#find lines only in file1
comm -23 file1 file2
#find lines only in file2
comm -13 file1 file2
#find lines common to both files
comm -12 file1 file2
The man
file is actually quite readable for this.