paper4
Abstract. How can natural texts be distinguished from artificially generated ones? Fake content is commonly encountered on the Internet, ranging from web scraping to random word salads. Most of this fake content is generated for spam purposes. In this paper, we present two methods to deal with this problem. The first one uses classical language models, while the second one is a novel approach using short-range information between words.

1 INTRODUCTION
Fake content is flourishing on the Internet. The motivations for building fake content vary, for example:

• many spam e-mails and spam blog comments are padded with random text to avoid being detected by conventional methods such as hashing;
• many spam Web sites are designed to automatically generate thousands of interconnected web pages on a selected topic, in order to reach the top of search engine response lists [7];
• many fake-friend generators are available to boost one's popularity in social networks [9].

To continue with this example, the “Dada-engine” is able to generate thousands of essays about postmodernism that may fool a tired human reader. Yet a classifier trained on stylistic features immediately detects reliable profiling behaviours such as the following:

• this generator never generates sentences of fewer than five words;
• it never uses more than 2500 word types (this bounded vocabulary is a consequence of the bounded size of the grammar);
• it tends to repeatedly use phrases such as “the postcapitalist paradigm of”.

To ensure, at low cost, both a good quality of the generated text and the diversity of the generated patterns, most fake content is built by copying and blending pieces of real texts collected from crawled web sites or RSS feeds: this technique is called web scraping. There are many tools, such as RSSGM⁵ or RSS2SPAM⁶, available to generate fake content by web scraping. However, as long as the generated content is a patchwork of relatively large pieces of text (sentences or paragraphs), semi-duplicate detection techniques can accurately recognize it as fake [6, 16].
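
To illustrate why a patchwork of large copied pieces is easy to recognize, the following is a minimal sketch of one common family of semi-duplicate detection, word shingling with a containment score. It is an assumption made for illustration and not necessarily the technique of [6, 16]; the function names, the shingle length k = 3, and the toy texts are all hypothetical.

```python
def shingles(text, k=3):
    """Return the set of word k-grams (shingles) found in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(candidate, sources, k=3):
    """Fraction of the candidate's shingles that also occur in any source.

    A value close to 1.0 suggests the candidate was assembled from pieces
    of the known sources; a value close to 0.0 suggests no copied material.
    """
    cand = shingles(candidate, k)
    if not cand:
        return 0.0
    # Pool the shingles of every known source text (hypothetical crawl).
    known = set().union(*(shingles(s, k) for s in sources))
    return len(cand & known) / len(cand)


# Toy data, invented for this sketch.
sources = [
    "the quick brown fox jumps over the lazy dog",
    "a stitch in time saves nine they say",
]
patchwork = "the quick brown fox jumps a stitch in time saves nine"
unrelated = "completely unrelated words appear in this sentence here"

patchwork_score = containment(patchwork, sources)
unrelated_score = containment(unrelated, sources)
```

On this toy input, only the shingles that straddle the seam between the two copied pieces are missing from the sources, so the patchwork scores far higher than the unrelated sentence. The score drops as the copied pieces get shorter, which is consistent with the restriction above to relatively large pieces of text.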