Moss
Strings - look for exact textual matches of segments, for instance five-word runs. Fast, but can be confused by
renaming identifiers.
Tokens - as with strings, but using a lexer to convert the program into tokens first. This discards whitespace,
comments, and identifier names, making the system more robust to simple text replacements. Most academic
plagiarism detection systems work at this level, using different algorithms to measure the similarity between token
sequences.
Parse Trees - build and compare parse trees. This allows higher-level similarities to be detected. For instance, tree
comparison can normalize conditional statements, and detect equivalent constructs as similar to each other.
Program Dependency Graphs (PDGs) - a PDG captures the actual flow of control in a program, and allows much
Metrics - metrics capture 'scores' of code segments according to certain criteria; for instance, "the number of loops
and conditionals", or "the number of different variables used". Metrics are simple to calculate and can be compared
quickly, but can also lead to false positives: two fragments with the same scores on a set of metrics may do entirely
different things.
Hybrid approaches - for instance, parse trees + suffix trees can combine the detection capability of parse trees with
The previous classification was developed for code refactoring, and not for academic plagiarism detection (an important goal
of refactoring is to avoid duplicate code, referred to as code clones in the literature). The above approaches are effective
against different levels of similarity; low-level similarity refers to identical text, while high-level similarity can be due to similar
specifications. In an academic setting, when all students are expected to code to the same specifications, functionally
equivalent code (with high-level similarity) is entirely expected, and only low-level similarity is considered proof of
cheating.
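The token-level approach described above can be illustrated with a minimal sketch. It assumes a small C-like language; the keyword set, the regular expression, and the function names `normalize`, `kgrams`, and `similarity` are all hypothetical choices made for this example and are not taken from any of the tools discussed here. String literals and preprocessor directives are not handled.

```python
import re

# Hypothetical keyword set for a small C-like language (an assumption of this sketch).
KEYWORDS = {"int", "for", "while", "if", "else", "return"}

# Comments first, then identifiers/keywords, integer literals, and any other single character.
TOKEN_RE = re.compile(r"//[^\n]*|/\*.*?\*/|[A-Za-z_]\w*|\d+|\S", re.DOTALL)


def normalize(source):
    """Lex source into tokens, dropping whitespace and comments and
    replacing identifier names and numbers with placeholders."""
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme.startswith("//") or lexeme.startswith("/*"):
            continue  # comments carry no structural information
        if lexeme in KEYWORDS:
            tokens.append(lexeme)
        elif re.fullmatch(r"[A-Za-z_]\w*", lexeme):
            tokens.append("ID")  # identifier names are discarded
        elif lexeme.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)
    return tokens


def kgrams(tokens, k=5):
    """Set of all runs of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a, b, k=5):
    """Jaccard similarity of the k-token runs of two source fragments."""
    ga, gb = kgrams(normalize(a), k), kgrams(normalize(b), k)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


original = "int total = 0; for (int i = 0; i < n; i++) { total += i; } // sum"
renamed = "int s = 0;\nfor (int k = 0; k < m; k++) {\n  s += k; /* add */ }"
print(similarity(original, renamed))
```

Because renaming identifiers, reflowing whitespace, and editing comments leave the token stream unchanged, the two fragments in the example compare as identical, whereas an exact string match on the raw text would find almost nothing in common.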
MOSS and JPlag
Many source-code similarity detection tools exist. Below is a selection of the most popular ones, each with a brief description. These tools aim to detect and point out
similarities between source-code files. These similarities should be carefully investigated by the academic before taking action against students for plagiarism. The
output provided by the tools can be used as evidence if the academic decides to take matters further.
Free tools
JPlag promises to find ‘similarities among multiple sets of source code files’. JPlag was developed by Guido Malpohl in 1996. It currently supports Java, C#, C, C++,
Scheme, and natural language text. JPlag is free, but users are required to create an account. JPlag uses a variation of the Karp-Rabin comparison algorithm developed
by Wise, with added optimizations that improve its run-time efficiency.
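As a rough illustration of the rolling-hash idea behind Karp-Rabin matching, the sketch below fingerprints every window of `k` characters and uses the fingerprints to find substrings shared by two texts. It shows only the basic technique, not JPlag's variation or its optimizations; the function names and the `base` and `mod` parameters are assumptions chosen for this example.

```python
def rolling_hashes(text, k, base=257, mod=(1 << 61) - 1):
    """Hash of every length-k window of text, each derived from the
    previous one in constant time."""
    if k <= 0 or len(text) < k:
        return []
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % mod  # drop the oldest character
        h = (h * base + ord(text[i])) % mod      # append the newest character
        hashes.append(h)
    return hashes


def common_substrings(a, b, k):
    """Sorted (i, j) pairs such that a[i:i+k] == b[j:j+k]."""
    index = {}
    for i, h in enumerate(rolling_hashes(a, k)):
        index.setdefault(h, []).append(i)
    matches = []
    for j, h in enumerate(rolling_hashes(b, k)):
        for i in index.get(h, []):
            if a[i:i + k] == b[j:j + k]:  # verify, to rule out hash collisions
                matches.append((i, j))
    return sorted(matches)


print(common_substrings("abcdef", "xxcdex", 3))
```

In a plagiarism detector the same scheme would typically be applied to a token sequence rather than to raw characters, so that matches survive renaming and reformatting.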
MOSS (Measure Of Software Similarity) was developed by Alex Aiken in 1994. MOSS finds similarities in programs written in a number of different languages: C, C++, Java, Pascal, Ada,
ML, Lisp, and Scheme. MOSS is a free service, but users must create an account.
Sherlock was developed at Warwick University’s Computer Science department. It is available as part of the BOSS Online Submission System or as a stand-alone
The PMD open source tool provides a Copy/Paste Detector (CPD) for finding duplicate code. It works with Java, JSP, C, C++, Fortran and PHP code, and provides
guidance on how to add other programming languages to the tool. Similarly to JPlag, CPD uses a variation of the Karp-Rabin string matching algorithm developed by Wise.
Unlike JPlag, MOSS, and Sherlock, this tool is not specifically aimed at detecting similarities in students’ work, but it works well for that purpose. The developers of
PMD provide excellent support and documentation for this tool. Because it is a duplicate code detector, CPD scans each file for repeated code and therefore also reports
similar code found within a single file. It is equally successful at finding similar code across different files, so it can be used as a tool for detecting similarity
between source-code files.
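The difference between within-file and cross-file duplicates can be seen in a simplified sketch. It is not CPD's implementation: it indexes runs of consecutive non-blank lines (ignoring indentation) rather than tokens, and the function name `find_duplicates` and its `window` parameter are hypothetical.

```python
from collections import defaultdict


def find_duplicates(files, window=3):
    """files maps a file name to its source text.

    Returns {block: [(file name, 1-based start line), ...]} for every run of
    `window` consecutive non-blank lines that occurs at more than one place."""
    seen = defaultdict(list)
    for name, text in files.items():
        # Keep original line numbers while skipping blank lines and indentation.
        lines = [(n, line.strip())
                 for n, line in enumerate(text.splitlines(), 1)
                 if line.strip()]
        for i in range(len(lines) - window + 1):
            block = tuple(s for _, s in lines[i:i + window])
            seen[block].append((name, lines[i][0]))
    return {block: locs for block, locs in seen.items() if len(locs) > 1}


files = {
    "a.c": "x = 1;\ny = 2;\nz = x + y;\nprint(z);\nx = 1;\ny = 2;\nz = x + y;\n",
    "b.c": "  x = 1;\n\n  y = 2;\n  z = x + y;\n",
}
for block, locations in find_duplicates(files).items():
    print(block, locations)
```

The one duplicated block in the example is reported at two locations in `a.c` and one in `b.c`. For plagiarism checking only the cross-file locations are of interest, so the within-file hits would normally be filtered out.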
Commercial tools
CodeMatch is a commercial source-code plagiarism detector that claims its algorithm is superior to those of the other tools listed here. CodeMatch currently supports the following
programming languages: BASIC, C, C++, C#, Delphi, Flash ActionScript, Java, JavaScript, MASM, Pascal, Perl, PHP, PowerBuilder, Ruby, SQL, Verilog, VHDL.