Python Difflib Deltas and Compare Ndiff

NealWalters · Jan 20, 2011

I was looking to do something like what I believe change control systems do: they compare two files and save a small diff each time the file changes. I've been reading this page: http://docs.python.org/library/difflib.html, but apparently it's not sinking in.

I was trying to recreate this in the fairly simple program shown below, but what I seem to be missing is that the deltas contain at least as much as the original file, and more.

Is it not possible to get to just the pure changes? The reason I ask is hopefully obvious - to save disk space.
I could just save the entire chunk of code each time, but it would be better to save current code once, then small diffs of the changes.

I'm also still trying to figure out why many difflib functions return a generator instead of a list, what's the advantage there?

Will difflib work for me, or do I need to find a more professional package with more features?

# Python Difflib demo 
# Author: Neal Walters 
# loosely based on http://ahlawat.net/wordpress/?p=371
# 01/17/2011 

# build the files here - later we will just read the files probably 
file1Contents="""
for j = 1 to 10: 
   print "ABC"
   print "DEF" 
   print "HIJ"
   print "JKL"
   print "Hello World"
   print "j=" + j 
   print "XYZ"
"""

file2Contents = """
for j = 1 to 10: 
   print "ABC"
   print "DEF" 
   print "HIJ"
   print "JKL"
   print "Hello World"
   print "XYZ"
print "The end"
"""

filename1 = "diff_file1.txt" 
filename2 = "diff_file2.txt" 

file1 = open(filename1,"w") 
file2 = open(filename2,"w") 

file1.write(file1Contents) 
file2.write(file2Contents) 

file1.close()
file2.close() 
#end of file build 

lines1 = open(filename1, "r").readlines()
lines2 = open(filename2, "r").readlines()

import difflib

print "\n FILE 1 \n" 
for line in lines1:
  print line 

print "\n FILE 2 \n" 
for line in lines2: 
  print line 

diffSequence = difflib.ndiff(lines1, lines2) 

print "\n ----- SHOW DIFF ----- \n" 
for i, line in enumerate(diffSequence):
    print line

diffObj = difflib.Differ() 
deltaSequence = diffObj.compare(lines1, lines2) 
deltaList = list(deltaSequence) 

print "\n ----- SHOW DELTALIST ----- \n" 
for i, line in enumerate(deltaList):
    print line



#let's suppose we store just the diffSequence in the database 
#then we want to take the current file (file2) and recreate the original (file1) from it
#by backward applying the diff 

restoredFile1Lines = difflib.restore(diffSequence,1)  # 1 indicates file1 of 2 used to create the diff 

restoreFileList = list(restoredFile1Lines)

print "\n ----- SHOW REBUILD OF FILE1 ----- \n" 
# this is not showing anything! 
for i, line in enumerate(restoreFileList): 
    print line

Thanks!

UPDATE:

contextDiffSeq = difflib.context_diff(lines1, lines2) 
contextDiffList = list(contextDiffSeq) 

print "\n ----- SHOW CONTEXTDIFF ----- \n" 
for i, line in enumerate(contextDiffList):
    print line

----- SHOW CONTEXTDIFF -----

***

---

***************

*** 5,9 ****

     print "HIJ"

     print "JKL"

     print "Hello World"

-    print "j=" + j

     print "XYZ"

--- 5,9 ----

     print "HIJ"

     print "JKL"

     print "Hello World"

     print "XYZ"

+ print "The end"

Another update:

In the old days of Panvalet and Librarian, source management tools for the mainframe, you could create a changeset like this:

++ADD 9
   print "j=" + j 

This simply means: add a line (or lines) after line 9. Then there were keywords like ++REPLACE or ++UPDATE. http://www4.hawaii.gov/dags/icsd/ppmo/Stds_Web_Pages/pdf/it110401.pdf
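
For comparison, a position-based changeset in the same spirit can be sketched in Python 3 with difflib.SequenceMatcher.get_opcodes, which reports which index ranges of the old sequence map to which ranges of the new one. This is only a rough sketch: the helpers make_changeset and apply_changeset are hypothetical names, not part of difflib, and the sample lines are adapted from the files above.

```python
import difflib


def make_changeset(old_lines, new_lines):
    """Return (start, end, replacement_lines) for every non-equal region."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    changeset = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep only the old span and the new lines that replace it.
            changeset.append((i1, i2, new_lines[j1:j2]))
    return changeset


def apply_changeset(old_lines, changeset):
    """Rebuild the other version from old_lines plus the changeset."""
    result = list(old_lines)
    # Work from the end so that earlier indices remain valid.
    for i1, i2, replacement in reversed(changeset):
        result[i1:i2] = replacement
    return result


old = [
    "for j = 1 to 10:\n",
    '   print "ABC"\n',
    '   print "DEF"\n',
    '   print "HIJ"\n',
    '   print "JKL"\n',
    '   print "Hello World"\n',
    '   print "j=" + j\n',
    '   print "XYZ"\n',
]
new = old[:6] + ['   print "XYZ"\n', 'print "The end"\n']

forward = make_changeset(old, new)    # old -> new
backward = make_changeset(new, old)   # new -> old, the direction asked about
rebuilt_new = apply_changeset(old, forward)
rebuilt_old = apply_changeset(new, backward)
```

Only the changed regions are stored, so the changeset stays small when edits are small. Swapping the arguments, as in backward, gives the reverse delta needed to recreate the older file from the current one.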

Answer

Jim Brissom · Jan 20, 2011

I'm also still trying to figure out why many difflib functions return a generator instead of a list, what's the advantage there?

Well, think about it for a second: if you compare files, those files can in theory (and will in practice) be quite large. Returning the delta as a list, for example, means building the complete delta in memory at once, which is not a smart thing to do.
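
A minimal sketch of this behaviour, written for Python 3 (the question's code is Python 2) with made-up sample lines: the generator returned by difflib.ndiff produces lines on demand and can be consumed only once. That also appears to be why the difflib.restore call in the question prints nothing, since the earlier for loop had already exhausted diffSequence.

```python
import difflib

old_lines = ["alpha\n", "beta\n", "gamma\n"]
new_lines = ["alpha\n", "gamma\n", "delta\n"]

# ndiff returns a generator; no comparison work happens until it is iterated.
delta_gen = difflib.ndiff(old_lines, new_lines)

first_pass = list(delta_gen)    # consumes the generator
second_pass = list(delta_gen)   # already exhausted, so this is empty

# If the delta has to be reused, materialise it once and keep the list.
delta = list(difflib.ndiff(old_lines, new_lines))
restored_old = list(difflib.restore(delta, 1))  # 1 = first sequence
restored_new = list(difflib.restore(delta, 2))  # 2 = second sequence
```

Keeping the list is fine for small inputs; for large files the generator can be streamed straight to disk instead.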

As for only returning the difference, well, there is another advantage in using a generator - just iterate over the delta and keep whatever lines you are interested in.

If you read the difflib documentation for Differ-style deltas, you will see a paragraph that reads:

Each line of a Differ delta begins with a two-letter code:
Code    Meaning
'- '    line unique to sequence 1
'+ '    line unique to sequence 2
'  '    line common to both sequences
'? '    line not present in either input sequence

So, if you only want the differences, you can easily keep just those lines by filtering on the prefix with str.startswith.
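
A small sketch of that filter in Python 3, using made-up sample lines. Note one caveat: the filtered list no longer records where each line belongs, so it cannot be passed to difflib.restore to rebuild either file.

```python
import difflib

old_lines = ["alpha\n", "beta\n", "gamma\n"]
new_lines = ["alpha\n", "gamma\n", "delta\n"]

# str.startswith accepts a tuple, so both prefixes can be tested at once.
changes = [
    line
    for line in difflib.ndiff(old_lines, new_lines)
    if line.startswith(("- ", "+ "))
]

removed = [line[2:] for line in changes if line.startswith("- ")]
added = [line[2:] for line in changes if line.startswith("+ ")]
```

Here changes holds only the removed and added lines, each still carrying its two-character code.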

You can also use difflib.context_diff to obtain a compact delta which shows only the changes plus a few lines of surrounding context.
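
As a rough illustration in Python 3, the sketch below compares the size of an ndiff delta with context_diff output for a 50-line input with one edited line. The sample data and the "old"/"new" labels are made up; n is the documented parameter that controls how many context lines are kept.

```python
import difflib

old_lines = ["line %d\n" % i for i in range(1, 51)]
new_lines = list(old_lines)
new_lines[9] = "line ten, edited\n"

# ndiff repeats every line of the input, changed or not.
full_delta = list(difflib.ndiff(old_lines, new_lines))

# context_diff keeps n lines of context around each change (default n=3).
context_delta = list(difflib.context_diff(old_lines, new_lines, "old", "new"))

# n=0 drops the context and leaves only headers and changed lines.
bare_delta = list(
    difflib.context_diff(old_lines, new_lines, "old", "new", n=0)
)

print(len(full_delta), len(context_delta), len(bare_delta))
```

difflib.unified_diff takes the same arguments and is usually a little shorter still. Keep in mind that difflib.restore only understands ndiff/Differ deltas; difflib does not provide a function that applies a context or unified diff, so rebuilding a file from one needs a separate patch step.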