
Detecting Fake Content with Relative Entropy Scoring 1

Thomas Lavergne 2, Tanguy Urvoy 3, and François Yvon 4

Abstract. How to distinguish natural texts from artificially generated ones? Fake content is commonly encountered on the Internet, ranging from web scraping to random word salads. Most of this fake content is generated for spam purposes. In this paper, we present two methods to deal with this problem. The first one uses classical language models, while the second one is a novel approach using short-range information between words.

1 INTRODUCTION

Fake content is flourishing on the Internet. The motivations to build fake content are various, for example:

• many spam e-mails and spam blog comments are completed with random texts to avoid being detected by conventional methods such as hashing;
• many spam Web sites are designed to automatically generate thousands of interconnected web pages on a selected topic, in order to reach the top of search engine response lists [7];
• many fake friends generators are available to boost one's popularity in social networks [9].

The textual content of this production ranges from random "word salads" to complete plagiarism. Plagiarism, even when it includes some alterations, is well detected by semi-duplicate signature schemes [2, 10]. On the other hand, natural texts and sentences have many simple statistical properties that are not matched by typical word salads, such as the average sentence length, the type/token ratio, the distribution of grammatical words, etc. [1]. Based on such attributes, it is fairly straightforward to build robust, genre-independent classification systems that can sort salads from natural texts with a pretty high accuracy [5, 13, 11].

Some spammers use templates, scripts, or grammar-based generators like the "Dada-engine" [3] to mimic natural texts efficiently. The main weakness of these generators is their low productivity and their tendency to always generate the same patterns. The productivity is low because a good generator requires a lot of rules, hence a lot of human work, to generate semantically consistent texts. On the other hand, a generator with too many rules will be hard to maintain and will tend to generate incorrect patterns.

As a consequence, the "writing style" of a computer program is often less subtle, and therefore easier to characterize, than a human writer's. We have already proposed an efficient method to detect the "style" of computer-generated HTML code [20], and similar methods apply to text generators.

To keep on with this example, the "Dada-engine" is able to generate thousands of essays about postmodernism that may fool a tired human reader. Yet, a classifier trained on stylistic features immediately detects reliable profiling behaviours such as:

• this generator never generates sentences of less than five words;
• it never uses more than 2500 word types (this bounded vocabulary is a consequence of the bounded size of the grammar);
• it tends to repeatedly use phrases such as "the postcapitalist paradigm of".

To ensure, at low cost, a good quality of the generated text and the diversity of the generated patterns, most fake contents are built by copying and blending pieces of real texts collected from crawled web sites or RSS feeds: this technique is called web scraping. There are many tools like RSSGM 5 or RSS2SPAM 6 available to generate fake content by web scraping. However, as long as the generated content is a patchwork of relatively large pieces of texts (sentences or paragraphs), semi-duplicate detection techniques can accurately recognize it as fake [6, 16].

Figure 1. A typical web page generated by a Markovian generator. This page was hidden in the www.med.univ-rennes1.fr web site (2008-04-08).

1 Work supported by the MADSPAM 2.0 ANR project.
2 Orange Labs / ENST Paris, email: name.surname@orange-ftgroup.com
3 Orange Labs, email: name.surname@orange-ftgroup.com
4 Univ Paris Sud 11 & LIMSI/CNRS, email: surname@limsi.fr
5 The "Really Simple Site Generator Modified" (RSSGM) is a good example of a freely available web scraping tool which combines text patchworks and Markovian random text generators.
6 See web site rss2spam.com

The text generators that perform the best trade-off between patchworks and word salads are the ones that use statistical language models to generate natural texts. A language model, trained on a large dataset collected on the Web, can indeed be used to produce completely original, yet relatively coherent texts. In the case of Web spamming, the training dataset is often collected from search engine response lists to forge query-specific or topic-specific fake web pages. Figure 1 gives a nice example of this kind of fake web page. This page is part of a huge "link farm" which is polluting several university and government web sites to deceive the TrustRank [8] algorithm. Here is a sample of text from this web page:
Example 1 The necklace tree is being buttonholed to play cellos and the burgundian premeditation in the Vinogradoff, or Wonalancet am being provincialised to connect. Were difference viagra levitra cialis then the batsman's dampish ridiculousnesses without Matamoras did hear to liken, or existing and tuneful difference viagra levitra cialis devotes them. Our firm stigmasterol with national monument if amid the microscopic field was reboiling a concession notwithstanding whisks.

Even if it is complete nonsense, this text shares many statistical properties with natural texts (except for the high frequency of stuffed keywords like "viagra" or "cialis"). It also presents the great advantage of being completely unique. The local syntactic and semantic consistency of short word sequences in this text suggests that it was probably generated with a second order (i.e. based on 2-gram statistics) Markovian model.

The aim of this paper is to propose a robust and genre-independent technique to detect computer-generated texts. Our method complements existing spam filtering systems, and proves to perform well on text generated with statistical language models.

In Section 2, we discuss the intrinsic relation between the two problems of plagiarism detection and fake content detection, and we propose a game paradigm to describe the combination of these two problems. In Section 3, we present the datasets that we have used in our experiments. In Section 4, we evaluate the ability of standard n-gram models to detect fake texts, and conversely, of different text generators to fool these models. In Section 5, we introduce and evaluate a new approach: relative entropy scoring, whose efficiency is boosted by the huge Google n-gram dataset (see Section 6).

2 ADVERSARIAL LANGUAGE MODELS

2.1 A fake text detection game

The problem of fake text detection is well defined as a two-player variant of the Turing game: each player uses a dataset of "human" texts and a language model. Player A (the spammer) generates fake texts and Player B (the tester) tries to detect them amongst other texts. We may assume, especially if Player B is a search engine, that Player A's dataset is included in Player B's dataset, but even in this situation, Player B is not supposed to know which part it is. The ability of Player B to filter generated texts among real texts determines the winner (see Figure 2). Each element of the game is crucial: the relative sizes of the datasets induce the expressiveness of the language model required to avoid overfitting, which in turn determines the quality and quantity of text that may be produced or detected. The length of the texts submitted to the tester is also an important factor.

The answer to the question "Who will win this game?" is not trivial. Player A should generate a text "as real as possible", but he should not replicate too long pieces of the original texts, by copying them directly or by using a generator that overfits its training set. Indeed, this kind of plagiarism may be detected by the other player. If his dataset is too small, he will not be able to learn anything from rare events (3-grams or more) without running the risk of being detected as a plagiarist.

2.2 Fair Use Abuses

Wikipedia is frequently used as a source for web scraping. To illustrate this point, we performed an experiment to find the most typical Wikipedia phrases. We first sorted and counted all 2-grams, 3-grams and 4-grams appearing at least two times in a dump of the English Wikipedia. From these n-grams, we selected the ones that do not appear in the Google 1 Tera 5-gram collection [19]. If we except the unavoidable preprocessing divergence errors described in Section 3.2, our computation reveals that respectively 26%, 29%, and 44% of Wikipedia 2-grams, 3-grams and 4-grams are out of the Google collection: all these n-grams are likely to be markers of Wikipedia content. This means that even small pieces of text may be reliable markers of plagiarism.

The most frequent markers that we found are side effects of the Wikipedia internal system: for example "appprpriate" and "the maintenance tags or" are typical outputs of Smackbot, a robot used by Wikipedia to clean up tags' dates. We also found many "natural" markers like "16 species worldwide" or "historical records the village". When searching for "16 species worldwide" on the Google search engine, we found respectively two pages from Wikipedia, two sites about species and two spam sites (see Figure 3). The same test with "historical records the village" yielded two Wikipedia pages and many "fair use" sites such as answer.com or locr.com.

Figure 3. The 6th answer of Google for the query "16 species worldwide" is a casino web scraping page hidden in the worldmassageforum.com web site (2008-04-14).

To conclude this small experiment: even if it is "fair use" to pick some phrases from a renowned web site like Wikipedia, a web scraper should avoid using pieces of text that are too rare or too large if he wants to avoid drawing the attention of anti-spam teams.
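The marker-mining experiment above boils down to a set difference: count a corpus's n-grams and keep the recurring ones that are absent from a reference collection. The sketch below is an illustrative reconstruction under our own naming, not the authors' code; a real run would stream the corpus rather than hold it in memory.

```python
from collections import Counter

def candidate_markers(corpus_tokens, reference_ngrams, n=3, min_count=2):
    """Count all n-grams of a token list and keep those occurring at
    least `min_count` times that are absent from a reference collection:
    such n-grams are candidate markers of the corpus's content."""
    counts = Counter(
        tuple(corpus_tokens[i:i + n])
        for i in range(len(corpus_tokens) - n + 1)
    )
    return {g for g, c in counts.items()
            if c >= min_count and g not in reference_ngrams}

# Toy example: "16 species worldwide" repeats in the corpus but is
# missing from the reference collection, so it comes out as a marker.
corpus = ("there are 16 species worldwide and 16 species worldwide "
          "are listed here").split()
reference = {("are", "listed", "here")}
print(candidate_markers(corpus, reference, n=3))
# {('16', 'species', 'worldwide')}
```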

Figure 2. Adversarial language models game rules.

3 DATASETS AND EXPERIMENTAL PROTOCOL

3.1 Datasets
For our experiments, we have used 3 natural texts corpora and the
Google n-grams collection:
• newsp: articles from the French newspaper "Le Monde";
• euro: English EU parliament proceedings;
• wiki: Wikipedia dumps in English;
• google1T: a collection of English n-grams from Google [19].

We chose the newsp and euro corpora for testing on small but homogeneous data, and wiki to validate our experiments on more realistic data. Sizes and n-gram counts of these corpora are summarized in Table 1.

Table 1. Number of words and n-grams in our datasets. There is no low frequency cut-off except for the google1T collection, where it was set to 200 for 1-grams and 40 for other n-grams.

            tokens    1gms    2gms    3gms    4gms
newsp       76M       194K    2M      7M      10M
euro        55M       76K     868K    3M      4M
wiki        1433M     2M      27M     92M     154M
google1T    1024B     13M     314M    977M    1313M

3.2 Text preprocessing

We used our own tools to extract textual content from the XML and HTML datasets. For sentence segmentation, we used a conservative script, which splits text at every sentence-final punctuation mark, with the help of a list of known abbreviations. For tokenization, we used the Penn-TreeBank tokenization script, modified here to fit more precisely the tokenization used for the google1T n-gram collection.

3.3 Experimental Protocol

Each corpus was evenly split into three parts as displayed in Figure 2: one for training the detector, one for training the generator, and the last one as a natural reference. Because we focus more on text generation than on text plagiarism, we chose to separate the training set of the detector from the training set of the generator. All the numbers reported below are based on 3 different replications of this splitting procedure.

In order to evaluate our detection algorithms, we test them on different types of text generators:

• pw5 and pw10: patchworks of sequences of 5 or 10 words;
• ws10, ws25 and ws50: natural text stuffed with 10%, 25% or 50% of common spam keywords;
• lm2, lm3 and lm4: Markovian texts, produced using the SRILM toolkit [18] generation tool, using 2, 3 and 4-gram language models.

Each of these generated texts, as well as the natural texts used as reference, is split into sets of texts of 2K, 5K and 10K words, in order to assess the detection accuracy over different text sizes. A small and randomly chosen set of test texts is kept for tuning the classification threshold. The remaining texts are used for evaluation; the performance is evaluated using the F-measure, which averages the system's recall and precision.

We also test our algorithms against a "real" fake content set of texts crawled on the Web from the "viagra" link farm of Figure 1. This spam dataset represents 766K words.

4 STANDARD N-GRAM MODELS

4.1 Perplexity-based filtering

Markovian n-gram language models are widely used for natural language processing tasks such as machine translation and speech recognition, but also for information retrieval tasks [12]. These models represent sequences of words under the hypothesis of a restricted-order Markovian dependency, typically between 2 and 6. For instance, with a 3-gram model, the probability of a sequence of k > 2 words is given by:

p(w1 . . . wk) = p(w1) p(w2|w1) · · · p(wk|wk−2 wk−1)   (1)

A language model is entirely defined by the conditional probabilities p(w|h), where h denotes the n − 1 words long history of w. To ensure that all terms p(w|h) are non-null, even for unknown h or w, the model probabilities are smoothed (see [4] for a survey). In all our experiments, we resorted to the simple Katz backoff smoothing scheme. A conventional way to estimate how well a language model p predicts a text T = w1 . . . wN is to compute its perplexity over T:

PP(p, T) = 2^H(T,p) = 2^( −(1/N) Σ_{i=1..N} log2 p(wi|hi) )   (2)

Our baseline filtering system uses conventional n-gram models (with n = 3 and n = 4) to detect fake content, based on the assumption that texts having a high perplexity w.r.t. a given language model are more likely to be forged than texts with a low perplexity. Perplexities are computed with the SRILM toolkit [18], and the detection is performed by thresholding these perplexities, where the threshold is tuned on some development data.

4.2 Experimental results

Table 2 summarizes the performance of our classifier for different corpora and different text lengths.

Table 2. F-measure of fake content detector based on perplexity calculation using 3 and 4 order n-gram models against the corpora of fake texts described in Section 3.3.

             3-gram model         4-gram model
          newsp  euro  wiki    newsp  euro  wiki
pw5  2k   0.70   0.76  0.26    0.70   0.78  0.28
     5k   0.90   0.89  0.39    0.90   0.85  0.37
pw10 2k   0.31   0.50  0.21    0.30   0.51  0.17
     5k   0.43   0.65  0.30    0.42   0.67  0.29
ws10 2k   0.85   0.94  0.44    0.81   0.95  0.51
     5k   0.97   0.97  0.71    0.96   0.95  0.73
ws25 2k   1.00   0.99  0.79    1.00   0.99  0.99
     5k   0.97   1.00  0.80    0.98   1.00  0.98
ws50 2k   1.00   1.00  0.90    1.00   1.00  1.00
     5k   1.00   1.00  0.91    1.00   1.00  1.00
lm2  2k   0.95   0.88  0.83    0.95   0.87  0.97
     5k   0.96   0.92  0.90    0.94   0.96  0.97
lm3  2k   0.39   0.25  0.20    0.45   0.27  0.29
     5k   0.56   0.25  0.21    0.60   0.30  0.38
lm4  2k   0.46   0.25  0.28    0.48   0.28  0.41
     5k   0.60   0.25  0.21    0.66   0.29  0.44
spam 2k   –      –     1.00    –      –     1.00

A first remark is that detection performance steadily increases with the length of the evaluated texts; likewise, larger corpora globally help the detector.

We note that patchwork generators of order 10 are hard to detect with our n-gram models: only low order generators on homogeneous corpora are detected. Nevertheless, as explained in Section 2.2, even 5-word patchworks can be accurately detected using plagiarism detection techniques.

In comparison, our baseline system accurately detects fake content generated by word stuffing, even with a moderate stuffing rate. It also performs well with fake content generated using second order Markovian generators. 3-gram models are able to generate many natural word patterns, and are very poorly detected, even by "stronger" 4-gram models.

Figure 4. Examples of useful n-grams. "and" has many possible successors, "the" being the most likely; in comparison, "ladies and" has few plausible continuations, the most probable being "gentlemen"; likewise for "bed and", which is almost always followed by "breakfast". Finding "bed and the" in a text is thus a strong indicator of forgery.
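The perplexity filter of Equation 2 reduces to a short computation once smoothed conditional probabilities are available. The sketch below is illustrative, not the authors' implementation: `cond_prob` stands in for any smoothed model (the paper uses Katz backoff via SRILM), and the toy model and threshold are ours.

```python
import math

def perplexity(text_tokens, cond_prob, order=3):
    """Perplexity of a token sequence under an n-gram model (Eq. 2):
    PP = 2 ** ( -1/N * sum_i log2 p(w_i | h_i) )."""
    logsum = 0.0
    for i, w in enumerate(text_tokens):
        # History: up to order-1 preceding words (shorter at text start).
        h = tuple(text_tokens[max(0, i - order + 1):i])
        logsum += math.log2(cond_prob(w, h))
    return 2.0 ** (-logsum / len(text_tokens))

def is_fake(text_tokens, cond_prob, threshold):
    """Flag a text as forged when its perplexity exceeds a threshold
    tuned on held-out development data."""
    return perplexity(text_tokens, cond_prob) > threshold

# Toy model: a closure standing in for a smoothed 3-gram model; it
# returns a fixed probability so the numbers are easy to check.
model = lambda w, h: 0.25
text = "the cat sat on the mat".split()
print(perplexity(text, model))  # 4.0, since every log2 p = -2
print(is_fake(text, model, threshold=3.5))  # True
```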
The last line of Table 2 displays detection results against "real" fake content from the link farm of Figure 1. We used models trained and tuned on the Wikipedia corpus. Detection is 100% correct for this approximately 10% stuffed second order Markovian text.

5 A FAKE CONTENT DETECTOR BASED ON RELATIVE ENTROPY

5.1 Useful n-grams

The effectiveness of n-gram language models as fake content detectors is a consequence of their ability to capture short-range semantic and syntactic relations between words: fake content generated by word stuffing or second order models fails to respect these relations. In order to be effective against 3-gram or higher order Markovian generators, this detection technique requires training a strictly higher order model, whose reliable estimation requires larger volumes of data. In fact, a side effect of smoothing is that the probability of unknown n-grams is computed through "backing off" to simpler models. Furthermore, in natural texts, many relations between words are short-range enough to be captured by 3-gram models: even if a model is built with a huge amount of high order n-grams to minimize the use of backoff, most of these n-grams will be well predicted by lower order models. The few mistakes of the generator will be flooded by an overwhelming number of natural sequences.

In natural language processing, high order language models generally yield improved performance, but these models require huge training corpora and a lot of computing power and memory. To make these models tractable, pruning needs to be carried out to reduce the model size. As explained above, the information conveyed by most high order n-grams is low: these redundant n-grams can be removed from the model without hurting the performance, as long as adequate smoothing techniques are used. Language model pruning can be performed using conditional probability estimates [14] or relative entropy between n-gram distributions [17]. Instead of removing n-grams from a large model, it is also possible to start with a small model and then insert those higher order n-grams which improve performance, until a maximum size is reached [15]. Our entropy-based detector uses a similar strategy to score n-grams according to the semantic relation between their first and last words. This is done by finding useful n-grams, i.e. n-grams that can help detect fake content.

Useful n-grams are the ones with a strong dependency between the first and the last word (see Figure 4). As we will show, focusing on these n-grams allows us to significantly improve detection performance. Our method gives a high penalty to n-grams like "bed and the" while rewarding n-grams such as "bed and breakfast".

Let {p(·|h)} define an n-gram language model. We denote by h′ the truncated history, that is, the suffix of length n − 2 of h. For each history h, we can compute the Kullback-Leibler (KL) divergence between the conditional distributions p(·|h) and p(·|h′) ([12]):

KL( p(·|h) || p(·|h′) ) = Σ_w p(w|h) log( p(w|h) / p(w|h′) )   (3)

The KL divergence reflects the information lost in the simpler model when the first word in the history is dropped. It is always non-negative, and it is null if the first word in the history conveys no information about any successor word, i.e. if ∀w, p(w|h) = p(w|h′). In our context, the interesting histories are the ones with high KL scores.

To score n-grams according to the dependency between their first and last words, we use the pointwise KL divergence, which measures the individual contribution of each word to the total KL divergence:

PKL(h, w) = p(w|h) log( p(w|h) / p(w|h′) )   (4)

For a given n-gram, a high PKL signals that the probability of the word w is highly dependent on the n − 1 preceding words. To detect fake content, i.e. content that fails to respect these "long-distance" relationships between words, we penalize n-grams with low PKL when there exist n-grams sharing the same history with higher PKL. The penalty score assigned to an n-gram (h, w) is:

S(h, w) = max_v PKL(h, v) − PKL(h, w)   (5)

This score represents a progressive penalty for not respecting the strongest relationship between the first word of the history h and a possible successor 7: argmax_v PKL(h, v). The total score S(T) of a text T is computed by averaging the scores of all its n-grams with known histories.

5.2 Experimentation

We replicated the experiments reported in Section 4, using PKL models to classify natural and fake texts. Table 3 summarizes our main findings. These results show a clear improvement in the detection of fake content generated with Markovian generators of a smaller order than the one used by the detector. Models whose order is equal or higher tend to respect the relationships that our model tests, and cannot be properly detected.

The drop of quality in the detection of texts generated using word stuffing can be explained by the lack of smoothing in the probability estimates of our detector. In order to be efficient, our filtering system needs to find a sufficient number of known histories; yet, in these texts, a lot of n-grams contain stuffed words, and are thus unknown to the detector. This problem can be fixed using bigger models or larger n-gram lists. The drop in quality for patchwork detection has a similar explanation, and calls for similar fixes. In these texts, most n-grams are natural by construction. The only "implausible" n-grams are the ones that span two of the original word sequences, and these are also often unknown to the system.

7 This word is not necessarily the same as argmax_v p(v|h).
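Equations 4 and 5 can be computed directly from tables of conditional probabilities. The sketch below is an illustrative reconstruction under our own naming, not the authors' implementation; `cond_full` and `cond_trunc` stand for p(·|h) and p(·|h′), and we use log base 2 where the paper leaves the base unspecified (the penalty ranking is unaffected).

```python
import math

def pkl_scores(cond_full, cond_trunc):
    """Pointwise KL divergence (Eq. 4) for every n-gram (h, w):
    PKL(h, w) = p(w|h) * log2( p(w|h) / p(w|h') ),
    where h' is the history h with its first word dropped."""
    pkl = {}
    for (h, w), p in cond_full.items():
        pkl[(h, w)] = p * math.log2(p / cond_trunc[(h[1:], w)])
    return pkl

def penalty(h, w, pkl):
    """Penalty score (Eq. 5): gap between the best PKL achievable with
    history h and the PKL of the observed continuation w."""
    best = max(v for (hh, _), v in pkl.items() if hh == h)
    return best - pkl[(h, w)]

# Toy distributions: after "bed and", "breakfast" is almost certain,
# while after plain "and" it is rare; "bed and the" is penalized.
cond_full = {(("bed", "and"), "breakfast"): 0.9,
             (("bed", "and"), "the"): 0.1}
cond_trunc = {(("and",), "breakfast"): 0.01,
              (("and",), "the"): 0.5}
pkl = pkl_scores(cond_full, cond_trunc)
print(penalty(("bed", "and"), "the", pkl) > 0)  # True: forgery indicator
```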
Table 3. F-measure of fake content detector based on relative entropy scoring using 3 and 4 order n-gram models against our corpora of natural and fake content.

             3-gram model         4-gram model
          newsp  euro  wiki    newsp  euro  wiki
pw5  2k   0.47   0.82  0.81    0.25   0.42  0.44
     5k   0.68   0.93  0.91    0.35   0.57  0.59
pw10 2k   0.28   0.48  0.47    0.16   0.27  0.31
     5k   0.36   0.64  0.62    0.18   0.27  0.32
ws10 2k   0.18   0.27  0.21    0.09   0.21  0.23
     5k   0.16   0.43  0.45    0.20   0.25  0.31
ws25 2k   0.50   0.67  0.66    0.30   0.29  0.33
     5k   0.67   0.87  0.81    0.28   0.43  0.45
ws50 2k   0.82   0.90  0.92    0.40   0.45  0.51
     5k   0.94   0.98  0.96    0.64   0.63  0.69
lm2  2k   0.99   0.99  0.99    0.72   0.78  0.82
     5k   0.98   0.99  0.99    0.82   0.96  0.97
lm3  2k   0.26   0.35  0.29    0.85   0.88  0.87
     5k   0.35   0.35  0.39    0.87   0.87  0.92
lm4  2k   0.32   0.35  0.34    0.59   0.58  0.58
     5k   0.35   0.33  0.34    0.77   0.79  0.80
6 TRAINING WITH GOOGLE'S N-GRAMS

The previous experiments have shown that bigger corpora are required in order to efficiently detect fake content. To validate our techniques, we have thus built a genre-independent detector using the Google n-gram corpus. This model is more generic and can be used to detect fake content in any corpus of English texts.

Using the same datasets as before, this model yielded the results summarized in Table 4. As one can see, improving the coverage of rare histories paid off, as it allows an efficient detection of almost all generators, even for the smaller texts. The only generators that pass the test are the higher order Markovian generators.

Table 4. F-measure of fake content detector based on relative entropy scoring using 3-gram and 4-gram models learned on Google n-grams against our corpora of natural and fake content.

          3-gram model     4-gram model
          euro   wiki      euro   wiki
pw5  2k   0.92   0.97      0.42   0.77
pw10 2k   0.92   0.81      0.67   0.81
ws10 2k   0.90   0.79      0.90   0.92
ws25 2k   0.91   0.97      0.72   0.96
ws50 2k   0.95   0.97      0.42   0.89
lm2  2k   0.96   0.96      0.96   0.98
lm3  2k   0.68   0.32      0.88   0.98
lm4  2k   0.77   0.62      0.77   0.62

7 CONCLUSION

Even if advanced generation techniques are already used by some spammers, most of the fake content we found on the Internet consists of word salads or patchworks of search engine response lists. Most of the texts we found are easily detected by standard 2-gram models, and this justifies our use of "artificial" artificial texts.

We presented two approaches to fake content detection: a language model approach, which gives fairly good results, and a novel technique, relative entropy scoring, which yielded improved results against advanced generators such as Markovian text generators. We showed that it is possible to efficiently detect generated texts that are natural enough to be undetectable with standard stylistic tests, yet sufficiently different from each other to be uncatchable with plagiarism detection schemes. These methods have been validated using a domain-independent model based on Google's n-grams, yielding a very efficient fake content detector.

We believe that robust spam detection systems should combine a variety of techniques to effectively combat the variety of fake content generation systems: the techniques presented in this paper seem to bridge a gap between plagiarism detection schemes and stylistics detection systems. As such, they might become part of standard anti-spam toolkits.

Our future work will include tests with higher order models built with Google's n-grams, and detection tests against other generators such as texts produced by automatic translators or summarizers.

REFERENCES

[1] R. H. Baayen, Word Frequency Distributions, Kluwer Academic Publishers, Amsterdam, The Netherlands, 2001.
[2] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, 'Syntactic clustering of the web', in Computer Networks, volume 29, pp. 1157–1166, Amsterdam, (1997). Elsevier Publishers.
[3] A. C. Bulhak. The dada engine. http://dev.null.org/dadaengine/.
[4] S. F. Chen and J. T. Goodman, 'An empirical study of smoothing techniques for language modeling', in 34th ACL, pp. 310–318, Santa Cruz, (1996).
[5] D. Fetterly, M. Manasse, and M. Najork, 'Spam, damn spam, and statistics: using statistical analysis to locate spam web pages', in WebDB '04, pp. 1–6, New York, NY, USA, (2004).
[6] D. Fetterly, M. Manasse, and M. Najork, 'Detecting phrase-level duplication on the world wide web', in ACM SIGIR, Salvador, Brazil, (2005).
[7] Z. Gyöngyi and H. Garcia-Molina, 'Web spam taxonomy', in AIRWeb Workshop, (2005).
[8] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen, 'Combating Web spam with TrustRank', in VLDB '04, pp. 576–587, Toronto, Canada, (2004).
[9] P. Heymann, G. Koutrika, and H. Garcia-Molina, 'Fighting spam on social web sites: A survey of approaches and future challenges', Internet Computing, IEEE, 11(6), 36–45, (2007).
[10] A. Kołcz and A. Chowdhury, 'Hardening fingerprinting by context', in CEAS '07, Mountain View, CA, USA, (2007).
[11] T. Lavergne, 'Taxonomie de textes peu-naturels', in JADT '08, volume 2, pp. 679–689, (2008). In French.
[12] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, MA, 1999.
[13] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly, 'Detecting spam web pages through content analysis', in WWW Conference, (2006).
[14] K. Seymore and R. Rosenfeld, 'Scalable backoff language models', in ICSLP '96, volume 1, pp. 232–235, Philadelphia, PA, (1996).
[15] V. Siivola and B. Pellom, 'Growing an n-gram model', in 9th INTERSPEECH, pp. 1309–1312, (2005).
[16] B. Stein, S. Meyer zu Eissen, and M. Potthast, 'Strategies for retrieving plagiarized documents', in ACM SIGIR, pp. 825–826, New York, NY, USA, (2007).
[17] A. Stolcke, 'Entropy-based pruning of backoff language models', 1998.
[18] A. Stolcke, 'SRILM – an extensible language modeling toolkit', 2002.
[19] T. Brants and A. Franz, 'Web 1T 5-gram corpus version 1.1', 2006. LDC ref: LDC2006T13.
[20] T. Urvoy, E. Chauveau, P. Filoche, and T. Lavergne, 'Tracking web spam with HTML style similarities', ACM TWeb, 2(1), 1–28, (2008).
