Diff algorithm C++

João Santos picture João Santos · May 28, 2011 · Viewed 8.2k times · Source

I'm trying to create a program in C++ that could be able to diff two .txt files.

struct line
{
    string text;
    size_t num;
    int status;
};

void compareFiles(vector<line> &buffer_1, vector<line> &buffer_2, size_t index_1, size_t index_2)
{
    while(index_1 < buffer_1.size())
    {
         while(index_2 < buffer_2.size())
         {  
             X = buffer_1[index_1].text;
             Y = buffer_2[index_2].text;
             if(X == Y)
             {
                 ++index_1;
                 ++index_2;
             }
             else
             {
                 LCS();
                 string lcs = printLCS(X.length(), Y.length());

                 /*
                 * Here's my problem
                 */

             }
         }
     }
 }

As you can see, I have two buffers (vectors of lines) previously loaded with files contents . I also have the LCS algorithm fully functional (tested). LCS works on strings X and Y globally defined.

So, what I really need to do is compare buffers line by line with LCS, but I didn't manage a way to do it.

Could you please help me?

Answer

Zach Rattner picture Zach Rattner · May 28, 2011

When in doubt, I usually defer to someone who's done it before. The venerable diff program has been around forever, and does what you want to do. Furthermore, it is open source, so head over to ftp://mirrors.kernel.org/gnu/diffutils/diffutils-3.0.tar.gz and check it out.

Once you unzip the archive, open up src/analyze.c. The diff_2_files function begins on line 472. The code that does the actual comparison runs from lines 512 - 537. They are reproduced below:

for (;; cmp->file[0].buffered = cmp->file[1].buffered = 0)
{
    /* Read a buffer's worth from both files.  */
    for (f = 0; f < 2; f++)
        if (0 <= cmp->file[f].desc)
            file_block_read (&cmp->file[f],
                buffer_size - cmp->file[f].buffered);

    /* If the buffers differ, the files differ.  */
    if (cmp->file[0].buffered != cmp->file[1].buffered
            || memcmp (cmp->file[0].buffer,
                    cmp->file[1].buffer,
                    cmp->file[0].buffered))
    {
        changes = 1;
        break;
    }

    /* If we reach end of file, the files are the same.  */
    if (cmp->file[0].buffered != buffer_size)
    {
        changes = 0;
        break;
    }
}

The idea there is to load two buffers of identical size, then load each file into a buffer. Compare the two files one buffer at a time using memcmp, and see if any buffer differs from the other. If any buffer comparison does not return equal, then the two files are different. It's also important to note that you never have to read more than two buffers worth of data at a time, so this approach works on huge files too.