Moss

The document discusses various source code similarity detection algorithms and tools that can be used to detect plagiarism in student code submissions. It categorizes algorithms based on whether they analyze strings, tokens, parse trees, program dependency graphs, or metrics. It also mentions hybrid approaches. Popular free tools described are JPlag, MOSS, Sherlock, and the PMD Copy/Paste Detector, while CodeMatch is identified as a commercial option.

Uploaded by

Avneesh Kar
Copyright
© Attribution Non-Commercial (BY-NC)

Source-code similarity detection algorithms can be classified as based on either

• Strings - look for exact textual matches of segments, for instance five-word runs. Fast, but can be confused by renaming identifiers.

• Tokens - as with strings, but using a lexer to convert the program into tokens first. This discards whitespace, comments, and identifier names, making the system more robust to simple text replacements. Most academic plagiarism detection systems work at this level, using different algorithms to measure the similarity between token sequences.

• Parse Trees - build and compare parse trees. This allows higher-level similarities to be detected. For instance, tree comparison can normalize conditional statements, and detect equivalent constructs as similar to each other.

• Program Dependency Graphs (PDGs) - a PDG captures the actual flow of control in a program, and allows much higher-level equivalences to be located, at a greater expense in complexity and calculation time.

• Metrics - metrics capture 'scores' of code segments according to certain criteria; for instance, "the number of loops and conditionals", or "the number of different variables used". Metrics are simple to calculate and can be compared quickly, but can also lead to false positives: two fragments with the same scores on a set of metrics may do entirely different things.

• Hybrid approaches - for instance, parse trees + suffix trees can combine the detection capability of parse trees with the speed afforded by suffix trees, a type of string-matching data structure.
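
The difference between the string level and the token level can be illustrated with a small sketch. It is not taken from any of the tools discussed here: the function names, the placeholder tokens ("ID", "NUM", "STR"), the five-item window, and the use of Jaccard overlap as the score are all assumptions made for illustration, and Python's standard tokenize module stands in for a language-specific lexer.

```python
import io
import keyword
import tokenize


def ngrams(items, n=5):
    """Return the set of all runs of n consecutive items."""
    return {tuple(items[i:i + n]) for i in range(len(items) - n + 1)}


def normalized_tokens(source):
    """Lex Python source, dropping comments and layout and hiding names."""
    skipped = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
               tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skipped:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")    # every identifier looks the same
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)  # keywords and operators are kept
    return out


def jaccard(a, b):
    """Overlap of two sets, between 0.0 (disjoint) and 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = """
def total(values):
    result = 0  # running sum
    for v in values:
        result = result + v
    return result
"""

renamed = """
def add_up(items):
    acc = 0
    for x in items:
        acc = acc + x
    return acc
"""

# String level: five-word runs of the raw text.
string_score = jaccard(ngrams(original.split()), ngrams(renamed.split()))

# Token level: five-token runs after normalization.
token_score = jaccard(ngrams(normalized_tokens(original)),
                      ngrams(normalized_tokens(renamed)))

print(string_score)  # 0.0: renaming breaks every five-word run
print(token_score)   # 1.0: the normalized token streams are identical
```

In this example the second fragment differs from the first only in its identifier names and a removed comment, which is enough to hide it from the string comparison but not from the token comparison.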

The previous classification was developed for code refactoring, and not for academic plagiarism detection (an important goal of refactoring is to avoid duplicate code, referred to as code clones in the literature). The above approaches are effective against different levels of similarity; low-level similarity refers to identical text, while high-level similarity can be due to similar specifications. In an academic setting, when all students are expected to code to the same specifications, functionally equivalent code (with high-level similarity) is entirely expected, and only low-level similarity is considered as proof of cheating.

Academic program plagiarism systems

MOSS and JPlag 

Source Code Similarity Detection Tools

Many source-code similarity detection tools exist. Below is a selection of the most popular ones, each with a brief description. These tools aim to detect and point out similarities between source-code files. The academic should investigate these similarities carefully before taking action against students for plagiarism. The output provided by the tools could be used as evidence if the academic decides to take matters further.
Free tools

JPlag promises to find ‘similarities among multiple sets of source code files’. JPlag was developed by Guido Malpohl in 1996. It currently supports Java, C#, C, C++, Scheme, and natural language text. JPlag is free, but users are required to create an account. JPlag uses a variation of the comparison algorithm developed by Wise (Greedy String Tiling, which relies on Karp-Rabin hashing), and adds optimizations to improve its run-time efficiency.
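
Wise's comparison algorithm is generally known as Greedy String Tiling. The sketch below shows the basic idea only, under several assumptions: it is the naive cubic-time form without the Karp-Rabin speed-up or any of JPlag's optimizations, the function names and the min_match parameter are hypothetical, and the similarity formula is one common choice rather than a statement of what any particular tool computes.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return non-overlapping tiles (start_a, start_b, length) common to a and b.

    Repeatedly finds the longest run of unmarked matching tokens, marks it
    as a tile, and stops when no run of at least min_match tokens is left.
    """
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        longest = min_match
        candidates = []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                # Extend the match while tokens agree and are still unmarked.
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > longest:
                    longest, candidates = k, [(i, j, k)]
                elif k == longest:
                    candidates.append((i, j, k))
        if not candidates:
            break
        for i, j, k in candidates:
            # Skip candidates that overlap a tile marked earlier this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for t in range(k):
                marked_a[i + t] = True
                marked_b[j + t] = True
            tiles.append((i, j, k))
    return tiles


def tiling_similarity(a, b, tiles):
    """Fraction of both sequences covered by tiles (assumed scoring rule)."""
    if not a and not b:
        return 1.0
    covered = sum(length for _, _, length in tiles)
    return 2 * covered / (len(a) + len(b))


# Each letter stands for one token; the second sequence swaps two blocks.
seq_a = list("abcdefg")
seq_b = list("efgxabc")
tiles = greedy_string_tiling(seq_a, seq_b)
print(tiles)
print(tiling_similarity(seq_a, seq_b, tiles))
```

Because tiles may appear in any order, reordering blocks of code does not hide the match, which is one reason this family of algorithms suits plagiarism detection.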

MOSS (Measure Of Software Similarity) was developed by Alex Aiken in 1994. MOSS finds similarities between programs written in a number of different languages: C, C++, Java, Pascal, Ada, ML, Lisp, and Scheme. MOSS is a free service, but users must create an account.

Free open-source tools

Sherlock was developed at Warwick University’s Computer Science department. It is available as part of the BOSS Online Submission System or as a stand-alone application. It compares both source code and natural language texts for similarity.

The PMD open source tool provides a Copy/Paste Detector (CPD) for finding duplicate code. Similarly to JPlag, CPD uses a variation of the Karp-Rabin string matching algorithm developed by Wise. It works with Java, JSP, C, C++, Fortran, and PHP code, and its documentation explains how to add other programming languages to the tool. Unlike JPlag, MOSS, and Sherlock, this tool is not specifically aimed at detecting similarities in students’ work, but it works well for that purpose. The developers of PMD provide excellent support and documentation for this tool. Because it is a duplicate code detector, CPD scans within each file and so reports similar code found inside a single file. It also reports similar code across different files, so it can be used to detect similarity between source-code files.
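
The core of Karp-Rabin matching is a rolling hash: the hash of each fixed-length window is derived from the previous one in constant time, so shared windows between two files can be found quickly. The following is a minimal sketch of that idea and not CPD's actual implementation; the function names, the base and modulus values, and the choice to work on characters rather than tokens are assumptions made for illustration.

```python
def rolling_hashes(text, k, base=257, mod=(1 << 61) - 1):
    """Yield (start, hash) for every window of k characters in text."""
    if k <= 0 or len(text) < k:
        return
    codes = [ord(c) for c in text]
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for c in codes[:k]:
        h = (h * base + c) % mod
    yield 0, h
    for i in range(k, len(codes)):
        # Remove the oldest character, shift, and append the new one.
        h = ((h - codes[i - k] * high) * base + codes[i]) % mod
        yield i - k + 1, h


def common_windows(a, b, k):
    """Return (start_a, start_b) for every k-character window shared by a and b."""
    index = {}
    for start, h in rolling_hashes(a, k):
        index.setdefault(h, []).append(start)
    matches = []
    for j, h in rolling_hashes(b, k):
        for i in index.get(h, []):
            # Equal hashes can collide, so confirm with a direct comparison.
            if a[i:i + k] == b[j:j + k]:
                matches.append((i, j))
    return matches


snippet = "int sum = a + b;"
program = "x = 1; int sum = a + b; y = 2;"
matches = common_windows(snippet, program, k=10)
print(matches)
```

Adjacent matches such as (0, 7) and (1, 8) belong to the same copied region; a real detector would merge them into one reported duplicate.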

Commercial tools

CodeMatch is a commercial source-code plagiarism detector whose vendor claims that its algorithm is superior to those of the other tools listed here. CodeMatch currently supports the following programming languages: BASIC, C, C++, C#, Delphi, Flash ActionScript, Java, JavaScript, MASM, Pascal, Perl, PHP, PowerBuilder, Ruby, SQL, Verilog, VHDL.
