How to verify whether source code was copied from the web

rspr · Aug 20, 2012

I am building a web tool to check whether submitted content was taken from the web or is the submitter's own work: a plagiarism detector.

My idea is to generate a checksum and use it as a key to compare against other entries. However, if someone makes small changes, such as adding or removing comments or renaming variables and functions, the checksum will be different, so this approach won't work.
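To illustrate the problem, here is a minimal sketch using Python's standard `hashlib`. The sample snippets and the `checksum` helper are made up for this example; the point is only that any edit, however trivial, yields an unrelated digest.

```python
import hashlib


def checksum(source: str) -> str:
    """Return the SHA-256 hex digest of the source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


# Hypothetical submissions, invented for this illustration.
original = "def add(a, b):\n    return a + b\n"
# Same logic, but with a renamed parameter and an added comment.
modified = "def add(x, b):\n    # sum two numbers\n    return x + b\n"

# Byte-for-byte identical text always produces the same digest.
same_file = checksum(original) == checksum(original)
# A trivial edit produces a completely different digest.
after_edit = checksum(original) == checksum(modified)

print(same_file, after_edit)
```

A checksum therefore only answers "are these two files byte-for-byte identical?", not "how similar are they?".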

Any suggestions for a better way?

Answer

Craig Ringer · Aug 20, 2012

Plagiarism detection is a special case of similarity detection. This is a big field of study that's almost as old as computer science itself. There is a lot of published research, and there just isn't a single simple answer.

See, e.g., a Google Scholar search for "code similarity plagiarism" or "plagiarism detection". Regular Google searches for things like "source code similarity detection algorithm" can also be useful.

There are plenty of existing tools in the space, too, so I'm surprised you're trying to write your own.

As you've noted, a checksum won't do the job unless the code is perfectly identical. Techniques that can help include:

  • Building word-frequency histograms and comparing them

  • Extracting comment text and looking for copied comments using text-substring matching

  • Extracting variable, class and method names and looking for other code that uses the same names. You have to do a lot of correction for "obvious" names that everyone will choose, and for names that are dictated by the problem, like implementing a particular interface or API. Private class member variables and the local variables inside a function or method are the most useful to compare. You will need the help of a compiler, or at least a syntax parser for the language, to extract these.

  • Looking for differences in indenting style. Did the user use all-spaces indenting, except for this one function that's indented with tabs?

  • Comparing parse trees or token streams to strip out the effects of formatting. You'd usually have to compare individual functions, etc., not just the code as a whole.

  • ... and lots more
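As one illustration of the token-stream idea, here is a minimal sketch that assumes the submissions are Python source and uses the standard `tokenize` and `difflib` modules. The function names `normalized_tokens` and `token_similarity` are invented for this example, and collapsing every identifier to a single placeholder is just one possible normalization, not an established algorithm.

```python
import difflib
import io
import keyword
import tokenize


def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, dropping comments and blank lines and
    replacing identifiers and literals with generic placeholders."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # commentary and layout only
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")  # any variable, function or class name
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")
        elif tok.type == tokenize.STRING:
            tokens.append("STR")
        elif tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE):
            # Keep block structure, but not the whitespace that expressed it.
            tokens.append(tokenize.tok_name[tok.type])
        else:
            tokens.append(tok.string)  # keywords and operators as written
    return tokens


def token_similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1] comparing the two normalized token streams."""
    matcher = difflib.SequenceMatcher(
        None, normalized_tokens(a), normalized_tokens(b), autojunk=False
    )
    return matcher.ratio()


# Hypothetical submissions, invented for this illustration.
original = "def add(a, b):\n    return a + b\n"
disguised = "def total(x, y):\n\t# sum two numbers\n\treturn x + y\n"
unrelated = "for i in range(10):\n    print(i)\n"

print(token_similarity(original, disguised))
print(token_similarity(original, unrelated))
```

Renaming, re-indenting and adding comments leave the normalized stream unchanged, so the first pair compares as identical while the second does not. This sketch compares whole snippets; in practice you would split each submission into functions first and compare those, and very short functions will match each other by coincidence.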

What you'll have to do is produce a report that weighs all these factors and others and presents them to a human so the human can make a decision. Your tool should explain why it thinks two results are similar, not just that they are similar.
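A rough sketch of what such an evidence report could look like, assuming Python-style `#` comments and using only the standard library. Every helper name here (`word_histogram`, `cosine`, `comments`, `indent_styles`, `build_report`) is hypothetical, and the three signals are deliberately naive. The sketch only lists evidence and does not try to weight it, since sensible weights depend on the assignment.

```python
import math
import re
from collections import Counter


def word_histogram(source: str) -> Counter:
    """Count identifier-like words in the source text."""
    return Counter(re.findall(r"[A-Za-z_]\w*", source))


def cosine(h1: Counter, h2: Counter) -> float:
    """Cosine similarity between two word-frequency histograms."""
    dot = sum(h1[w] * h2[w] for w in h1.keys() & h2.keys())
    norm = math.sqrt(sum(c * c for c in h1.values())) * math.sqrt(
        sum(c * c for c in h2.values())
    )
    return dot / norm if norm else 0.0


def comments(source: str) -> set[str]:
    """Collect '#' comment texts. Naive: also matches '#' inside strings."""
    return {m.strip() for m in re.findall(r"#\s*(.+)", source) if m.strip()}


def indent_styles(source: str) -> set[str]:
    """Report which indentation characters start the lines of a submission."""
    styles = set()
    for line in source.splitlines():
        if line.startswith("\t"):
            styles.add("tabs")
        elif line.startswith(" "):
            styles.add("spaces")
    return styles


def build_report(a: str, b: str) -> list[str]:
    """Return human-readable evidence lines for a reviewer to judge."""
    report = [
        "word-frequency cosine similarity: "
        f"{cosine(word_histogram(a), word_histogram(b)):.2f}"
    ]
    for text in sorted(comments(a) & comments(b)):
        report.append(f"identical comment in both submissions: {text!r}")
    for label, source in (("A", a), ("B", b)):
        if len(indent_styles(source)) > 1:
            report.append(f"submission {label} mixes tab and space indentation")
    return report


# Hypothetical submissions, invented for this illustration.
sub_a = "def add(a, b):\n    # sum two numbers\n    return a + b\n"
sub_b = "def add(a, b):\n\t# sum two numbers\n    return a + b\n"

for line in build_report(sub_a, sub_b):
    print(line)
```

Each line of the report names the evidence behind it, so the reviewer sees why two submissions were flagged rather than a bare score.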