Similar code detector

Šimon Tóth · Jun 6, 2012

I'm searching for a tool that can compare source code for similarity.

We currently have a very trivial system that produces a huge number of false positives, and the real positives easily get buried among them.

My requirements are:

  • a reasonably small number of false positives
  • a good detection rate (yes, these two pull against each other)
  • ideally with a more complex output than just a single value
  • usable for C (C99) and C++ (C++03 and optimally C++11)
  • still maintained
  • usable for comparing two source files against each other
  • usable in non-interactive mode

EDIT:

To avoid confusion, the following two code snippets are equivalent and should be detected as such:

for (int i = 0; i < 10; i++) { bla; }

int i = 0; while (i < 10) { bla; i++; }

The same here:

int x = 10; y = x + 5;

int a = 10; y = a + 5;
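The renaming case above can be illustrated with a minimal sketch, assuming a very rough tokenizer and a tiny keyword list. The helper names `tokenize` and `normalize_identifiers` are hypothetical and do not belong to any existing tool. Each non-keyword identifier is replaced by a canonical name in order of first appearance, so two snippets that differ only in variable names produce the same token sequence. This does not cover the for/while case, which would need normalization at the syntax-tree or control-flow level.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: split source text into identifier/number tokens and
// single-character punctuation tokens. Whitespace is dropped. Comments,
// string literals, and multi-character operators are not handled.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalnum(c) || c == '_') {
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            tokens.push_back(src.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, src[i]));
            ++i;
        }
    }
    return tokens;
}

// Hypothetical helper: rename every non-keyword identifier to id0, id1, ...
// in order of first appearance. The keyword list is deliberately incomplete.
std::vector<std::string> normalize_identifiers(const std::vector<std::string>& tokens) {
    static const std::set<std::string> keywords = {
        "int", "for", "while", "if", "else", "return"};
    std::unordered_map<std::string, std::string> names;
    std::vector<std::string> out;
    for (const auto& t : tokens) {
        bool is_ident = std::isalpha(static_cast<unsigned char>(t[0])) || t[0] == '_';
        if (!is_ident || keywords.count(t) > 0) {
            out.push_back(t);  // numbers, punctuation, and keywords pass through
            continue;
        }
        auto it = names.find(t);
        if (it == names.end()) {
            it = names.emplace(t, "id" + std::to_string(names.size())).first;
        }
        out.push_back(it->second);
    }
    return out;
}
```

With these helpers, `normalize_identifiers(tokenize("int x = 10; y = x + 5;"))` and the same call on `"int a = 10; y = a + 5;"` yield equal vectors, while a snippet that uses the variables differently does not.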

Answer

Throwback1986 · Jun 6, 2012

I've used MOSS (http://theory.stanford.edu/~aiken/moss/) in the past to detect plagiarized code. Since it works on a semantic level, it will detect the situations you presented above. The tool is language-aware, so comments are not considered in the analysis, and it goes a long way toward detecting code that has been modified through simple search-and-replace of variable and/or function names.
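To make the "comments are not considered" idea concrete, here is a minimal sketch of a comment-insensitive similarity score. It is not MOSS's implementation, only an illustrative stand-in: it strips comments and whitespace, collects all character k-grams, and reports their Jaccard overlap. The function names and the default `k = 5` are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Remove // and /* */ comments and all whitespace.
// Simplification: comment markers inside string or char literals are not
// recognized as literal text.
std::string strip_comments_and_space(const std::string& src) {
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src.compare(i, 2, "//") == 0) {
            while (i < src.size() && src[i] != '\n') ++i;  // skip to end of line
        } else if (src.compare(i, 2, "/*") == 0) {
            std::size_t end = src.find("*/", i + 2);
            i = (end == std::string::npos) ? src.size() : end + 2;
        } else {
            if (!std::isspace(static_cast<unsigned char>(src[i]))) {
                out.push_back(src[i]);
            }
            ++i;
        }
    }
    return out;
}

// Collect the set of all substrings of length k.
std::set<std::string> kgrams(const std::string& text, std::size_t k) {
    std::set<std::string> grams;
    for (std::size_t i = 0; i + k <= text.size(); ++i) {
        grams.insert(text.substr(i, k));
    }
    return grams;
}

// Jaccard similarity of the two k-gram sets: |A ∩ B| / |A ∪ B|.
double jaccard_similarity(const std::string& a, const std::string& b,
                          std::size_t k = 5) {
    std::set<std::string> ga = kgrams(strip_comments_and_space(a), k);
    std::set<std::string> gb = kgrams(strip_comments_and_space(b), k);
    if (ga.empty() && gb.empty()) return 1.0;  // nothing to compare
    std::size_t shared = 0;
    for (const auto& g : ga) shared += gb.count(g);
    return static_cast<double>(shared) /
           static_cast<double>(ga.size() + gb.size() - shared);
}
```

Two inputs that differ only in comments score 1.0 here. Inputs that differ in variable names score well below 1.0, because raw character k-grams are sensitive to renaming. A tool that is robust to search-and-replace would presumably canonicalize identifiers before comparing.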

Note: I used the tool a few years ago when I taught computer science in grad school, and it worked wonderfully in detecting code that had been yanked from the internet. Here is a well-documented account of a similar application: http://fie2012.org/sites/fie2012.org/history/fie99/papers/1110.pdf

If you google "measure software similarity", you should find a few more useful hits, such as this one: http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/detectiontools_sourcecode.html