Automatic Conceptual Analysis for Plagiarism Detection
Abstract
In order to detect plagiarism, comparisons must be made between a target document (the suspect)
and reference documents. Numerous automated systems exist which check at the text-string level.
If the scope is kept constrained, as for example in within-cohort plagiarism checking, then performance is very reasonable. On the other hand, if one extends the focus to a very large corpus such as the WWW, performance can be reduced to an impracticable level. Of the three case studies presented in this paper, the first two give insight into text-string comparators, whilst the third considers the new and promising conceptual analysis approach to plagiarism detection, now made achievable by the computationally efficient Normalised Word Vector (NWV) algorithm. The paper concludes with a caution on the use of high-tech in the absence of high-touch.
Keywords: academic malpractice, conceptual analysis, conceptual footprint, semantic footprint,
Normalised Word Vector, NWV, plagiarism.
Introduction
Plagiarism is now acknowledged to pose a significant threat to academic integrity. There is a
growing array of software packages to help address the problem. Most of these offer a string-of-
text comparison. Newly emerging are software packages and services to ‘generate’ assignments. Naturally there will be a cat-and-mouse game for a while, and in the meantime academics need to be alert to the possibilities of academic malpractice via plagiarism and to adopt appropriate and promising counter-measures, including the newly emerging algorithms for fast conceptual analysis. One such emergent agent is the Normalised Word Vector (NWV) algorithm (Williams, 2006), which was originally developed for use in the Automated Essay Grading (AEG) domain.
AEG is a relatively new technology which aims to score or grade essays at the level of expert
humans. This is achieved by creating a mathematical representation of the semantic information
in addition to checking spelling, grammar, and other more usual parameters associated with essay
assessment. The mathematical representation is computed for each student essay and compared with a mathematical representation computed for the model answer. If we can represent the semantic content of an essay we are able to compare it to some standard model, hence determine a grade or assign an authenticity parameter relative to any given standard.

Material published as part of this publication, either on-line or in print, is copyrighted by the Informing Science Institute. Permission to make digital or paper copy of part or all of these works for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage AND that copies 1) bear this notice in full and 2) give the full citation on the first page. It is permissible to abstract these works so long as credit is given. To copy in all other cases or to republish or to post on a server or to redistribute to lists requires specific permission and payment of a fee. Contact Publisher@InformingScience.org to request redistribution permission.
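The comparison just described, a student essay's representation matched against that of the model answer, can be sketched in outline. The following is purely an illustration using a plain bag-of-words cosine similarity, not the actual NWV computation, which is specified in Williams (2006); all names and texts are invented for the example:

```python
from collections import Counter
import math

def word_vector(text):
    """Crude semantic representation: lower-cased word frequencies."""
    return Counter(text.lower().split())

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse frequency vectors (0..1)."""
    dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Illustrative model answer and student essay (invented texts).
model = word_vector("plagiarism is copying words or ideas without giving credit")
essay = word_vector("copying ideas without credit is plagiarism")
score = cosine_similarity(model, essay)
```

A score near 1 indicates near-identical word usage; an authenticity check would flag suspiciously high similarity between two student essays rather than between an essay and the model.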
Even where plagiarism detection technology is available, implementing it effectively can be a challenge in itself. In one prominent case where the technology was forced on students, the reaction led to a ruling granting a student the right to bypass a university-mandated plagiarism check prior to assignment submission (Figure 2).
A student at McGill University has won the right to have his assignments marked without first submit-
ting them to an American, anti-plagiarism website.
Jesse Rosenfeld refused to submit three assignments for his second-year economics class to Tur-
nitin.com, a website that compares submitted works to other student essays in its database, as well as
to documents on the web and published research papers.
Last Updated: Friday, January 16, 2004 | 11:11 AM ET
Figure 2: McGill student wins fight over anti-cheating website
Source: http://www.cbc.ca/canada/story/2004/01/16/mcgill_turnitin030116.html
Whilst the plagiarism problem is significant, it is not solvable only by applying plagiarism detection techniques. There needs to be a recognition that students are not entirely to blame (Williams, 2002). Quite obviously we need to agree on a working definition of plagiarism which is simple to understand and to check.
In a light-hearted vein, the entry for plagiarism in The Devil’s Dictionary by Ambrose Bierce
reads: PLAGIARISM, n.
A literary coincidence compounded of a discreditable priority and an honorable subsequence.
This might be the sort of definition which would be used to justify excusing a first or minor in-
stance of plagiarism but it does not admit of the measures which may be needed to detect it. A
more precise and practically applicable definition, that indicates the measures which may be
needed to detect plagiarism, is found on the www.plagiarism.org site:
• copying words or ideas from someone else without giving credit
• changing words but copying the sentence structure of a source
without giving credit
• copying so many words or ideas from a source that it makes up the majority
of your work, whether you give credit or not (see our section on "fair use" rules)
From the above we can see the essential elements: words; style or structure; and ideas. Therefore,
checking systems must look for matching words, analyze style, and create a map of the ideas contained in candidate plagiarism cases. The first of these is well catered for by established systems based on string-of-text matching.
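Of the three elements, style is the least often checked automatically. As a hedged sketch of what a style check might compute (these particular features and names are illustrative, not drawn from any of the tools discussed), two simple stylometric measures are average sentence length and vocabulary richness:

```python
import re

def style_profile(text):
    """Two crude stylometric features: mean sentence length in words,
    and type-token ratio (distinct words divided by total words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "mean_sentence_len": len(words) / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }
```

An abrupt shift in such features partway through a single assignment can hint at pasted material, even before any source is located.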
Case 1: WCopyfind
The University of Virginia’s freely available WCopyfind software
(http://plagiarism.phys.virginia.edu) is a delightful example of the power of the computer to help
in addressing the plagiarism problem. It makes text-string comparisons and can be instructed to
find sub-string matches of given length and similarity characteristics. Such fine tuning permits the
exclusion of obvious non-plagiarism cases despite text-string matches.
To determine the efficacy of WCopyfind the author devised a trial. Some 600 student assign-
ments from a course on Societal Impacts of Information Technology were checked for within-
cohort plagiarism. The assignments were between 500 and 2000 words and were in either English or German. The system is computationally very efficient and took only seconds to highlight five cases requiring closer scrutiny.
Figure 4, Figure 5, and Figure 6 show WCopyfind – system interface, WCopyfind – report, and
WCopyfind – document comparison, respectively.
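The matching WCopyfind performs is configurable, and its exact parameters are documented with the tool itself; the underlying idea of finding shared word sequences of a given minimum length can be sketched as follows (a simplified illustration, not WCopyfind's actual algorithm):

```python
def ngrams(words, n):
    """All contiguous word n-grams in a list of words."""
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_phrases(doc_a, doc_b, min_len=6):
    """Word sequences of min_len words appearing verbatim in both
    documents; longer shared passages show up as overlapping n-grams."""
    a, b = doc_a.lower().split(), doc_b.lower().split()
    return ngrams(a, min_len) & ngrams(b, min_len)
```

Raising `min_len` is the kind of fine tuning mentioned above: it excludes short coincidental matches at the risk of missing lightly edited copying.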
The suspected plagiarized text was submitted for a Google search and immediately revealed a source on the Web containing the same text (Figure 7).
It is interesting but perhaps not surprising to note that those who plagiarize from fellow students
will also copy from elsewhere (personal experience). The analysis thus far has not proven plagiarism but simply highlighted its possible existence and located the evidence. That text strings match does not of itself permit one to conclude plagiarism, as the text may be properly referenced.
The suspect text was found in the document www.bsi.de/fachthem/rfid/RIKCHA.pdf (Figure 7) and can now be carefully matched with the student text to determine the extent and accuracy of the copying. In short, WCopyfind is a text-string-matching approach to plagiarism detection that is useful for within-cohort applications, but is not amenable to large-scale ‘extra-cohort’ plagiarism detection (i.e., searching the WWW). Case study 2 investigates one program that is designed for this purpose.
Case 2: EVE2
The result was ‘disappointing’ too: EVE2 flagged only a low level of potential plagiarism, most of which was due to legitimate referencing, and identified two websites (Figure 9). On the other hand, one is delighted that one’s research students are creating their own work!
Some forms of copying are readily apparent to humans but not so simply detected automatically. Consider the assignment fragment in Figure
10. These words appeared in an assignment submitted by a student doing a capstone course in
Information Systems & Technology.
Web sites involves a mixture development between print publishing and software development, be-
tween marketing and computing, between internal communications and external relations, and be-
tween art and technology. Software engineering provides processes that are useful in developing the
web sites and web site engineering priniciples can be used to help bring web projects under control
and minimize the risk of a project being delivered late or over budget.
The next section of the paper presents some case analyses using a promising new technology to
aid in plagiarism detection – the use of the Normalised Word Vector (NWV) algorithm to create a
conceptual footprint of student assignments.
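Ahead of the case analyses, the idea of a conceptual footprint can be illustrated in miniature. In the sketch below, surface words are normalised to concept classes before counting, so that synonym substitution no longer hides the similarity; the tiny concept map is invented for the example, and the real NWV algorithm (Williams, 2006) is considerably more elaborate:

```python
from collections import Counter

# Invented, minimal word-to-concept map; a real system would draw on a
# large thesaurus-like resource rather than a hand-written dictionary.
CONCEPTS = {
    "school": "education", "studies": "education", "learn": "education",
    "leave": "departure", "quit": "departure",
    "age": "age", "years": "age",
}

def conceptual_footprint(text):
    """Relative frequency of concept classes rather than surface words."""
    hits = [CONCEPTS[w] for w in text.lower().split() if w in CONCEPTS]
    total = len(hits)
    counts = Counter(hits)
    return {c: n / total for c, n in counts.items()} if total else {}
```

Two essays that express the same ideas in different words then yield similar footprints, which a plain text-string comparison would miss.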
Since conceptual analysis is the main topic being addressed in this paper we consider a second
example – this time with assignments of 300 to 500 words written by year 10 students on the
topic of "School Leaving Age" (Figure 15).
The age at which students are legally allowed to leave school should not be raised from 15 years of age
to 17 years of age.
If a student is kept in school against their will they are less likely to do well in their studies. As a major-
ity of them wont even try to do well and learn what is being tort. They will also disrupt other student that
are willing to learn.
Also at the age of 15 these people are becoming adults. They are beginning to form their own ideas of
what they want to do in the future. Some students are good at school and it would benefit them to stay
others are not good at school and are better off in the work force. These people know there strengths
and weaknesses and are there for better equipt to make the decision for them selves.
People have to be allowed to make their own mistakes. There are always options if the person is un-
happy with their decision. Such as seeking higher education in the field they wish to go into at places like
TAFE or they could even go back to school if they decide they have made the wrong choice. The point is
it should be their responsibility to make the choice that is going to effect the rest of there lives.
Plagiarism detected
Extract from the master essay
Paragraph 1
The age at which students are legally allowed to leave school should not be raised from 15 years of age
to 17 years of age.
Matched extract(s)
I belivev that the legal age to leave school should be raise from the age of 15 to the age of 17.
I agree with the Minister of Education, that the legal age that students should be allowed to leave school,
is at the age of 17 years old not 15 years old.
The arguement that is stated in this essay is should students be allowed to leave school at 15 years of
age or should be change to a later age.
* According to the Minister of Education, the legal age for students to leave school will be changed from
15 years of age to 17 in 2002.
Matches such as these do not of themselves indicate plagiarism; however, the reader may contemplate how well the computer, or more precisely the NWV algorithm, has determined semantic proximity, which may be an indicator of plagiarism.
Conclusion
Through three case studies the author has illustrated how text-string comparison can be effective in detecting within-cohort plagiarism (Case 1), but can be inefficient for plagiarism detection on a larger scale such as the WWW (Case 2). Furthermore, it has been shown that while text-string comparisons are effective, they may not flag the replication of others’ ideas using semantically similar words. To detect such forms of copying one needs to use conceptual analysis. We have applied the NWV algorithm because it is the fastest method known to extract semantic content from essays of arbitrary length; its efficacy was shown in Case 3.
Whilst the results achieved with this ‘hi-tech’ approach are promising, one should stress that a ‘hi-touch’ approach is not to be ruled out and may be used in a complementary manner for increased efficacy in detecting and addressing plagiarism (Figure 16).
In Figure 16 the ‘hi-tech’ approach can be seen used in step 2, whereas the ‘hi-touch’ approach is relied upon for the remainder of the steps. The term ‘hi-tech/hi-touch’ comes from Naisbitt (1982). As in all cases where humans rely on technology to help solve problems, in this situation there is a very large degree of reliance on human (6 out of 7 steps), as opposed to artificial, intelligence.
1) select some text fragment which is ‘unlikely’ to come from the nominated source and search for
<selected text>
2) compare search results with original & highlight matching text
3) professor invites student for interview – bring paper copy of assignment
4) ask student to highlight all words which have been copied
5) compare student’s highlighting with professor’s highlighting and you can guess the student’s
reaction: DISBELIEF
6) professor listens patiently to the student’s explanation, protestation, justification
7) professor explains:
if we are HONEST in the assessment process and with each other,
then we can TRUST that the system is FAIR to everyone
and society will RESPECT the worth of your degree from this university:
for this reason we both have the RESPONSIBILITY to uphold academic INTEGRITY
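Step 2 of the process above, highlighting the text shared by the assignment and a retrieved source, can be partly automated. As one possible sketch (a generic illustration, not a prescription of any particular tool), Python’s standard difflib can list the verbatim word runs common to the two texts:

```python
from difflib import SequenceMatcher

def matching_blocks(suspect, source, min_words=4):
    """Verbatim word runs of at least min_words shared by the two texts,
    as candidates for highlighting against the original."""
    a, b = suspect.split(), source.split()
    sm = SequenceMatcher(a=a, b=b, autojunk=False)
    return [" ".join(a[m.a:m.a + m.size])
            for m in sm.get_matching_blocks() if m.size >= min_words]
```

The professor still performs steps 3 to 7 in person; the tool merely prepares the highlighted evidence.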
References
Bierce, A. (1911). The Devil’s dictionary. Retrieved from http://www.thedevilsdictionary.com
Kloda, L.A. & Nicholson, K. (2005). Plagiarism detection software and academic integrity: The Canadian
perspective. In Proceedings Librarians’ Information Literacy Annual Conference (LILAC), London
(UK). Retrieved from http://eprints.rclis.org/archive/00005409/
Karner, J. (2001). Der Plagiator. Retrieved from http://old.onlinejournalismus.de/meinung/plagiator.html
Maurer, H., Kappe, F. & Zaka, B. (2006). Plagiarism – A survey. Journal of Universal Computer Science,
12(8), 1050-1084.
Naisbitt, J. (1982). Megatrends. Ten new directions transforming our lives. Warner Books.
Turnitin. (2007). http://www.turnitin.com/static/home.html
Williams, J. B. (2002). The plagiarism problem: Are students entirely to blame? In Proceedings of ASCILITE 2002. Retrieved from http://www.ascilite.org.au/conferences/auckland02/proceedings/papers/189.pdf
Williams, R. (2006). The power of normalised word vectors for automatically grading essays. The Journal
of Issues in Informing Science and Information Technology, 3, 721-730. Retrieved from
http://informingscience.org/proceedings/InSITE2006/IISITWill155.pdf
Biography
Heinz Dreher is Associate Professor in Information Systems at the Curtin Business School, Curtin University, Perth, Western Australia. He has published in the educational technology and information systems domain through conferences, journals, invited talks and seminars; is currently the holder of Australian National Competitive Grant funding for a 4-year E-Learning project and a 4-year project on Automated Essay Grading technology development, trial usage and evaluation; has received numerous industry grants for investigating hypertext-based systems in training and business scenarios; and is an experienced and accomplished teacher, receiving awards for his work in cross-cultural awareness and course design. In 2004 he was appointed Adjunct Professor for Computer Science at TU Graz, and continues to collaborate in teaching & learning and research projects with European partners.
Dr Dreher’s research and development in the hypertext domain has centred on the empowering aspects of text & document technology since 1988. The systems he has developed provide support for educators and teachers, and for document creators and users from business and government. ‘DriveSafe’, ‘Active Writing’, ‘The Effectiveness of Hypertext to Support Quality Improvement’, ‘Water Bill 1990 Hypertext Project’, ‘A Prototype Hypertext Operating Manual for LNG Plant Dehydration Unit’, and ‘Hypertextual Tender Submission - Telecom Training Programme’ were all hypertext construction and evaluation projects in industry or education. The Hypertext Research Laboratory, whose aim was to facilitate the application of hypertext-based technology in academe, business and the wider community, was founded by him in December 1989.
Acknowledgements
The author would like to acknowledge the InSITE reviewers for their helpful comments and in
particular thank Carl Dreher for his extensive and critical appraisal of early drafts of the paper.