Fighting Unicode-Obfuscated Spam

Changwei Liu
Indiana University
changweil@gmail.com

Sid Stamm
Indiana University
sstamm@indiana.edu

ABSTRACT

In the last few years, obfuscation has been used more and more by spammers to make spam emails bypass filters. The standard method is to use images that look like text, since typical spam filters are unable to parse such messages; this is what is used in so-called “rock phishing”. To fight image-based spam, many spam filters use heuristic rules in which emails containing images are flagged, and since not many legit emails are composed mainly of a big image, this aids in detecting image-based spam. The spammers are thus interested in circumventing these methods. Unicode transliteration is a convenient tool for spammers, since it allows a spammer to create a large number of homomorphic clones of the same-looking message; since Unicode contains many characters that are unique but appear very similar, spammers can translate a message’s characters at random to hide blacklisted words in an effort to bypass filters. In order to defend against these Unicode-obfuscated spam emails, we developed a prototype tool that can be used with SpamAssassin to block spam obfuscated in this way by mapping polymorphic messages to a common, more homogeneous representation. This representation can then be filtered using traditional methods. We demonstrate the ease with which Unicode polymorphism can be used to circumvent spam filters such as SpamAssassin, and then describe a de-obfuscation technique that can be used to catch messages that have been obfuscated in this fashion.

Keywords

Spam emails, Unicode characters, obfuscated emails, de-obfuscated emails, SpamAssassin

Copyright is held by the author/owner. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee.
APWG eCrime Researchers Summit, Oct. 4–5, 2007, Pittsburgh, PA, USA.

1. INTRODUCTION

Since 1998, large quantities of spam email have been sent to recipients’ email accounts. Currently, spam email is estimated to make up 75-80% of the overall volume of email sent [25], which is not only annoying and time-consuming for users, but also adds to network congestion. Additionally, some spam emails pose a serious threat, in the form of phishing. While most spam is just unsolicited bulk e-mail, phishing spam can be an attempt by attackers to get people to reveal private information (such as bank account information or PIN numbers) by providing deceptive links in those emails. Both types of emails involve some sort of deception centered around obfuscation, or hiding the truth of the message from filters or a quick visual look at the message.

In response to these deceptive problems, people have developed many methods to fight against spam email. At first, blacklists and whitelists were used to block or accept email [3]. Later, private channels between senders and receivers were created as a means to classify good email and spam email [13, 10]. This was done because spammers can change IP addresses very easily, circumventing blacklists and whitelists, and possibly causing users to lose good email [12]. At about the same time, spam filters (tools that identify spam by detecting keywords and phrases in an email message’s headers and body text) were put into practice [7]. An effective, mature and currently widely deployed open source spam filter called SpamAssassin [26] can catch most of the common spam email by being correctly “trained”, a process involving feeding the filter good and bad email, letting it learn from that data.

However, spammers are very agile. In response to the above methods of blocking spam, they have developed ingenious techniques to circumvent anti-spam tools. With the widespread application of excellent spam filters such as SpamAssassin and CRM114, spammers have developed “image-based” spam emails (like the Rock Phishing discussed above), word obfuscation, and other methods to get around the filters [20]. One particularly difficult obfuscation mechanism is HTML redrawing; spammers can slice a message up into columns, and send the columns in the email physically appearing to filters as rows. Using some HTML tricks, the email can display the slices vertically, instead of horizontally as they appear to the filters, essentially rendering text top-to-bottom first instead of left-to-right. Spam filtering software sees the columns in order, more-or-less garbage, but when rendered on the screen, people can read across the columns and decipher the spam.

Among all these circumvention techniques, word obfuscation is a very prevalent method used by spammers. In general, they use the following techniques to hide spam words from filters: misplaced spaces, purposeful misspelling, embedded special characters, Unicode letter transliteration and HTML redrawing. When these techniques are combined,
spammers can produce a vast number of ways to hide keywords. For example, there are more than six hundred quintillion (600,000,000,000,000,000,000) ways available to spell the word Viagra [2]! Dealing with misplaced spaces, purposeful misspelling and embedded special characters has been mentioned in other papers such as [1]. We will focus on Unicode letter substitutions (also called partial-word transliterations). Although Unicode letter substitution is just a small part of obfuscation, as there are many visually or semantically similar characters existing in UCS (Universal Character Set), the spammers can use similar but unequal characters to replace corresponding original English characters, thus producing spam emails whose obfuscation is practically undetectable to most people’s eyes. This requires that the community adopt new tactics in order to maintain the upper hand against spammers and phishers.

In this paper, we describe a script to emulate the process of replacing English characters at random with Unicode characters. We then show the effect of processing those generated obfuscated messages through SpamAssassin, which assigns scores based on the “spammyness” of a message. The result shows that obfuscated email can bypass SpamAssassin quite easily. We also show how to de-obfuscate the messages so they will be assigned the more accurate, much higher scores. (This is not merely a matter of inverting the previous obfuscation, as sometimes, additional language-based heuristics must be used to resolve situations in which there is more than one possible and sensible preimage, e.g., involving characters such as “I” and “1”.) As a result, we can integrate the de-obfuscating script into SpamAssassin to fight against obfuscated email messages.

The rest of this paper is organized as follows. Section 2 covers related work and background. Section 3 describes our specific experimental process, results, analysis, and the shortcomings of this technique. We finish up with a summary of our findings and future work.

2. RELATED WORK AND BACKGROUND

Spam emails have created onerous work for email providers as well as email users all over the world. According to the International Messaging Anti-Abuse Working Group, spam accounted for about 80% of all email traffic on the Internet during the first three months of 2006. If providers and users don’t take effective anti-spam measures, email might become so inconvenient that people will stop using it. Since 1998, considerable progress has been made to stop spam emails. Basically, there are two kinds of anti-spam methods: one prevents spam emails before they reach users’ email boxes, and the other removes them from users’ inboxes.

2.1 Anti-Spam Techniques

The simplest methods of preventing spam emails before they reach users’ email boxes are the black list and the white list. A black list is a database of IP addresses that send spam emails [12]. In practice, this method is effective, but spammers can easily switch their IP addresses by switching to another computer, using a bot network, or using a different mail relay. By contrast, a white list is a list of “known” senders or IP addresses, and is used to prevent anyone not in the list from sending mail to a user’s mailbox. Unfortunately, this causes users to miss good emails sometimes, resulting in a loss of unexpected-but-important emails [3].

Better solutions have been developed [13, 10]. R.J. Hall suggests using electronic mail channel identifiers [13]: a sender gives every receiver a specific channel by giving him a unique email address that differs in channel ID but has the same user name (for example, Jim could ask Alice to send him mail at jim+1@server.com, and ask Bob to use jim+2@server.com). Incoming emails will be divided into different classes according to the receivers’ email address channels. Those emails that come from an unknown email address will be put in a public channel (like jim@server.com) and subjected to more scrutiny. A similar method is suggested where a sender and receiver communicate with a unique email address that is composed of a core address and an extension [10]. The extension can be obtained by a handshake between a receiver and a sender, which may cause the receiver to pay a computational or monetary cost. Of course, this monetary cost can be used to “punish” spammers as well.

Both of the above anti-spam techniques are vulnerable to attackers eavesdropping on the communication lines. This can be defended against with message authentication codes [16]. Unfortunately, this also adds computational cost to sending mail.

2.2 Commonly Used Anti-Spam Tools

The three methods mentioned above can effectively prevent spam emails with complicated “passwords” and computational or monetary cost, but there are some people who simply will not make the effort to complete the complicated process, or do not wish to pay the cost. This overshadows the convenience of email and suppresses its usefulness. Because of this, many users resort to email filters to prevent spam emails. Currently, the most popular kind of spam filter is the content-based spam filter [19]. Another simple, standard spam filter technique is called Filtering by Duplicate Detection (FDD) [14]. The idea is that casual spammers probably send a couple of very similar emails to the same recipient. This similarity can be used by an email software agent to identify spam and delete those emails received more than once. However, Hall points out that spam countermeasures based on duplicate detection schemes are foiled if spammers randomly assign users’ email addresses to different subsets and send somewhat different versions to each subset [14]. Duplicates then won’t be received, but a few very similar ones might. Moreover, spammers can easily increase their effectiveness in combatting FDD by increasing the number of subsets.

Keyword-based filtering and latent semantic indexing are further examples of content-based filtering techniques. They block spam emails by finding bad words in message headers or body. This is an easy but efficient method; the drawback is that these techniques rely on manually constructed (probably non-exhaustive) keyword lists, and thus may make many errors [4].

Continually training a naive Bayesian filter with spam emails and non-spam (ham) emails yields a robust, adaptive and highly accurate system. Particular words have particular, different probabilities of occurring in spam email versus in legitimate email. When being trained, the Bayesian filter will adjust the probabilities that a word appears in spam emails or legitimate emails based on input from the user. Later, the filter is used (out of training) to identify spam based on the results of training.
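The word-probability training described for Bayesian filters can be illustrated with a small sketch. This is not SpamAssassin's or CRM114's implementation: the function names `train` and `spam_probability`, the whitespace tokenizer, the add-one smoothing, and the four toy messages are all assumptions made for illustration, and the sketch is in Python rather than the Perl the authors used.

```python
import math
from collections import Counter


def train(messages):
    """Return (per-token message counts, number of messages) for one class."""
    counts = Counter()
    for text in messages:
        # Count each token once per message (presence, not frequency).
        counts.update(set(text.lower().split()))
    return counts, len(messages)


def spam_probability(text, spam_model, ham_model):
    """Estimate P(spam | tokens) with naive Bayes in log-odds form."""
    spam_counts, n_spam = spam_model
    ham_counts, n_ham = ham_model
    log_odds = math.log(n_spam / n_ham)  # prior odds from class sizes
    for token in set(text.lower().split()):
        # Add-one smoothing keeps unseen tokens from producing log(0).
        p_spam = (spam_counts[token] + 1) / (n_spam + 2)
        p_ham = (ham_counts[token] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Toy training data (hypothetical, for illustration only).
spam_model = train(["cheap viagra now", "viagra discount"])
ham_model = train(["meeting at noon", "lunch now"])

plain = spam_probability("viagra", spam_model, ham_model)
# Same word with Cyrillic U+0456 in place of Latin "i": an unseen token.
disguised = spam_probability("v\u0456agra", spam_model, ham_model)
```

In this toy setup the plain keyword pushes the estimate above one half, while the disguised token has never been seen in either class, contributes no evidence, and leaves the estimate at the prior. That is the weakness Unicode substitution exploits.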
Both CRM114 [23] and SpamAssassin [26] combine several technologies with Bayesian techniques to fight spam. The creators of SpamAssassin even claim that it “typically differentiates successfully between spam and non-spam in between 95% and 99% of cases” [27]. This measurement is based on the SpamAssassin corpus [28], but shows the confidence its creators have in its accuracy.

Spammers can use many different ingenious techniques, including obfuscation, to circumvent these filters. As mentioned in the introduction, spammers can make use of Unicode to obfuscate emails while still maintaining “good looking” messages.

2.3 Partial-Word Unicode Transliteration

Before Unicode was deployed, there were many different computer encoding systems, each satisfying the requirements of people who speak different languages. To some extent, those different encoding systems conflicted with one another. For example, two encodings can use the same code to represent two different characters or use different codes for the same character, which makes it extremely important to carefully translate the characters when moving from one encoding scheme to another. Amidst all the confusion, Unicode was designed to represent almost all standard characters in the world. However, people did not anticipate that it would also bring more problems: there are many similar characters in Unicode, making it hard to visually differentiate between characters that are technically different. Spammers can thus exploit this visual similarity to replace original characters in emails, generating messages without any blacklisted keywords that can bypass spam filters including CRM114 and SpamAssassin. For example, a fullwidth v (U+FF56) can be used to replace the Latin v (U+0076) and the Cyrillic i (U+0456) to replace the Latin i (U+0069) in “Viagra” to get a technically different, yet visually similar word. In particular, phishers may use similar characters to replace IRI/IDN in URLs to hide from blacklists, or look visually legit [6]. The IRI/IDN attack and the spam obfuscation are called “Unicode attacks”, and are used to send users “perfect looking” phishing emails. These messages include the legit-looking address, in an attempt to persuade users to give out their private information [11, 15, 17].

To fight this “Unicode attack”, a UC-Simlist was constructed [6]. In this list, every character is followed by some visually or semantically similar characters. The listed characters are paired with a similarity value from 0.8 to 1, with 1 being visually or semantically identical. Though this UC-Simlist just includes English, Chinese and Japanese and a complete version is still needed, it was enough for our tests based on English. Our goal was to replace original English characters with Unicode characters in spam emails, resulting in obfuscated spam emails; we then aimed to convert non-English characters back to form words that might signal spam filters. Our technique will be described in depth later.

Many anti-phishing tools are available to help users protect themselves from phishing attacks. In defense against homograph phishing attacks, the IRI/IDN SecuChecker and browser address/status bars can be used to warn users, sometimes with different colors [5]. Some phishing websites are very well spoofed and hidden, easily fooling users, and several recent studies (see, e.g., [8, 21, 22]) indicate that typical users often ignore warnings, so the best anti-spam method probably involves simply blocking spam emails, including phishing emails.

This paper is not directly about phishing prevention, so anti-phishing techniques will not be discussed here, though similar techniques to ours may be employed to more-or-less “homogenize” any string (URL or other) into one language or character set.

3. OUR EXPERIMENT

The final goal of our scheme is to block obfuscated spam emails by converting obfuscated characters to original English characters, but, before that, we first emulate attackers by creating obfuscated spam emails and phishing spam emails with similar characters in Unicode.

3.1 Simulating the Threat

Our experiment was conducted in two phases: first we wrote a script to produce obfuscated spam emails from raw spammy samples, replacing original English characters at random with Unicode characters. Next, we wrote a script to de-obfuscate these obfuscated emails by detecting and changing those non-English characters to English characters.

After training SpamAssassin, we used it to process the original (not-yet-obfuscated) spam emails, obfuscated emails and de-obfuscated emails to measure the scores SpamAssassin assigned to each. We then analyzed those scores to see whether the obfuscated spam emails could bypass spam filters while the de-obfuscated ones could not. In short, the result of the experiment was promising, showing that all obfuscated emails can circumvent the spam filter, but our de-obfuscation script helps SpamAssassin detect the obfuscated emails by converting them to de-obfuscated forms. However, this requires a trade-off of a little computational time since the script has to spend a few moments searching for and removing possible obfuscation.

To improve the speed of the de-obfuscation script, we focus only on reducing to ASCII characters; we just searched for Unicode characters used for obfuscation in the simlists of ASCII characters and then replaced them with corresponding ASCII characters. (In fact, there is no need to find any other characters since our experiment is based on spam emails in English.) The computational time is approximately 5-7 seconds on a 2GHz system for a normal-size email, but the time will be longer for longer emails because the script will spend more time searching and replacing non-English Unicode characters with original English characters. Additionally, our code can be optimized with techniques discussed in Section 4.2 to run more quickly.

We also performed the same experiment on spam phishing emails. Because spam phishing emails are generally well written and “perfect-looking”, unlike spam emails peddling erectile dysfunction medication or “hot sex”, which normally don’t need to look like they come from some authority like a bank, we adjusted our scripts to produce nearly “perfect-looking” spam phishing emails (only replacing English characters with exact-match Unicode substitutes). However, the result is the same as mentioned above: the obfuscated emails still pass through the filter, but can be de-obfuscated using our technique.
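The two phases described in Section 3.1 can be sketched as a round trip. This is a minimal illustration, not the authors' Perl scripts: the `SIMLIST` table is a tiny hypothetical stand-in for the similarity-1 entries of the UC-Simlist, and the names `obfuscate`, `deobfuscate`, `rate`, and the sample message are assumptions.

```python
import random

# Hypothetical similarity-1 entries; the real UC-Simlist is far larger.
SIMLIST = {
    "a": ["\u0430"],            # Cyrillic small a
    "c": ["\u0441", "\u03f2"],  # Cyrillic small es, Greek lunate sigma
    "e": ["\u0435"],            # Cyrillic small ie
    "i": ["\u0456"],            # Cyrillic Byelorussian-Ukrainian i
    "o": ["\u043e", "\u03bf"],  # Cyrillic small o, Greek omicron
    "p": ["\u0440"],            # Cyrillic small er
    "s": ["\u0455"],            # Cyrillic small dze
}

# Reverse table: each look-alike maps back to its ASCII character.
REVERSE = {sub: base for base, subs in SIMLIST.items() for sub in subs}


def obfuscate(text, rate=0.5, rng=random):
    """Replace each eligible character with a random look-alike."""
    out = []
    for ch in text:
        subs = SIMLIST.get(ch)
        if subs and rng.random() < rate:
            out.append(rng.choice(subs))
        else:
            out.append(ch)
    return "".join(out)


def deobfuscate(text):
    """Map known look-alikes back to ASCII; leave everything else alone."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


rng = random.Random(2007)  # fixed seed so the sketch is repeatable
original = "special price on pills"
disguised = obfuscate(original, rate=1.0, rng=rng)
restored = deobfuscate(disguised)
```

Because every look-alike in this toy table has exactly one ASCII preimage, the round trip is lossless. The sketch does not cover the ambiguous cases the paper discusses, such as a digit “1” that could stand for either “l” or “i”.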
3.2 Experiment Tools

UC-Simlist. We used the UC-Simlist to detect “Unicode attacks”. According to visual similarity and semantic similarity, Fu et al. [5, 6] constructed a UC-Simlist, where a character is paired with a set of visually and semantically similar Unicode characters. When replacing a reference character in the original message with a similar target character, we used targets whose similarity to the reference character is equal to 1 to produce decent-looking obfuscated spam emails (this made the change as inconspicuous as possible). In addition, using only English messages allowed us to use a smaller portion of the UC-Simlist, reducing the amount of computational time required.

SpamAssassin. Using the de-obfuscation technique, any good spam filter like SpamAssassin or CRM114 should produce similar results. We chose to use SpamAssassin as the spam filter since it is free software and widely used. It is also easy to configure and to extend with new rules. Furthermore, it is written in Perl, making it easy to integrate our scripts (also written in Perl). SpamAssassin uses a diverse range of tests to identify spam emails including not only header tests, body phrase tests, white/blacklists, collaborative spam identification databases, DNS blacklists (“RBLs”), character sets and locales, but also Bayesian filtering [26] that can be trained by each individual user with spam emails and non-spam (ham) emails. If the Bayesian filter is well-trained, it will catch up to “99.5% of spam with less than 0.03% false positives” [10].

In our experiment, we used SpamAssassin’s public corpus [28], Jose Nazario’s phishing corpus [18] and spam or phishing emails collected at Indiana University/Purdue University in Indianapolis (IUPUI) to train SpamAssassin. We randomly selected 3000 spam emails as well as more than 3000 ham (legit) emails from the corpora to train SpamAssassin. As it would not affect the analysis for the final results, those spam email and ham email headers were removed and only the content was used for our tests.

Linux and Perl. Our obfuscating and de-obfuscating processing was written in Perl 5.6 with the SpamAssassin Perl modules installed from the CPAN repository. This experiment was conducted on Fedora Core 6, with Linux kernel version 2.6.20.

3.3 Training SpamAssassin

First, we trained SpamAssassin using the method mentioned in Section 3.2. SpamAssassin has a threshold value, which is used to decide whether an email is spam or not. All emails whose scores are above this threshold are spam, while those that have lower scores than the threshold are not classified as spam. In general, the default threshold is 5, but users can set it according to their own needs. In any case, the threshold’s value will not greatly affect our desired outcome. In our experiment, the default value of 5 was used.

After training SpamAssassin, we selected some typical spam emails from the collection of spam emails in the IUPUI corpus for obfuscating and de-obfuscating. There are more than 1,000 text spam emails available for our experiments, and their scores through SpamAssassin range from 5.4 to more than 20. After analysis, we found that those spam emails that contained URIs listed in the Spamhaus Blocklist (SBL) or the ws.surbl.org blocklist, or obvious spam keywords (like porn-related words), got higher scores than ordinary spam emails, and could likely benefit most from Unicode-related obfuscation. We chose to use more than 100 spam emails like these whose scores were larger than 7.9. Obfuscating those emails whose scores are below 7.9 is not very useful, since their lower scores show that they are not obvious spam emails.

3.4 Obfuscating Spam

Next, we obfuscated the chosen set of spam emails. Our obfuscation script randomly selected some original characters in the input messages and replaced them with characters chosen at random from the UC-Simlist entry for the original character. For example, in the sentence “Finally the real thing”, our script might choose characters “i r h”. Each one of these letters is looked up in the UC-Simlist and one of the respective similar characters whose similarity is 1 is picked at random to replace it. Each English character might have many similar characters; in fact, in the UC-Simlist, the character U+006C (“l”) has more than 16 similar characters (see Table 1).

1:217C:ⅼ         1:FF29:I         1:0406:І         1:FF4C:l
1:2160:Ⅰ         1:006C:l         1:10C0:ǀ         1:0399:Ι
1:0049:I         0.93103:05C0:׀   0.93103:05C0:׀   0.89843:0196:Ɩ
0.8879:1F77:ί    0.87931:1F30:ἰ   0.87931:1F31:ἱ   0.87878:0140:ŀ

Table 1: Similar characters of 006C “l” in UC-Simlist. Format: <similarity>:<hex code>:<character>. Here, we use similarity to mean the visual or semantic correspondence, as defined by Fu et al. The similarity “1” means that the Unicode character is visually or semantically identical to the “spoofed” character.

Our obfuscation script randomly chose a similar character among these similar characters whose similarity is 1 to replace the original character. This randomness gives a different obfuscated spam email every time the obfuscation script is run with the same spam email (though the result looks the same, and the inputs are all the same). Each attempt at obfuscating a given spam email would therefore get a different score. For this reason, we generated 10 obfuscated emails for each input message and computed the average score to determine an expected value (Table 2).

3.5 De-Obfuscating the Spam

Once we figured out the expected score of each obfuscated email, we had to see what the expected de-obfuscation score would be. We wrote a second script that converts the non-English Unicode characters in its input to a (hopefully same-as-original) English character. Because we chose only English inputs, the script could convert the non-English characters to their original (pre-obfuscated) characters nearly every time.

The de-obfuscating script addresses every character with value greater than 255 (the boundary the script uses to separate ASCII English characters from other Unicode characters). It also addresses all characters that are numeric and between two letters, such as the number one in “pen1s”. For each of these chosen characters, a reverse lookup is performed on the UC-Simlist (the exact inverse of what was performed to choose an obfuscated replacement during the obfuscation stage). The de-obfuscation reverse lookup is a bit more complex since the script must choose the right original English character; for example, “1” is not only similar to “l”, but also similar to “i”. To avoid using a dictionary to find the “best” result, we can maximize accuracy by attempting the reverse lookups in order of English-letter frequency (for example, checking if something matches “O” before checking to see if it matches “Q”). This de-obfuscation script could be further refined for more accuracy, sacrificing a bit of speed; an English language dictionary could be employed to make sure that the de-obfuscation generated something resembling an English word. We did not employ this, as it turns out even choosing any replacement that has a perfect (1) similarity will recover most of the original characters well enough.

Since we used the average of ten runs to determine the expected score of the obfuscated messages, we de-obfuscated each of those ten messages and took the average score of the de-obfuscated messages to determine how effective the de-obfuscation was expected to be.

4. RESULTS

Using the obfuscation and de-obfuscation techniques mentioned above, we produced some obfuscated spam emails (based on some selected from the IUPUI spam corpus) and then de-obfuscated them. Each message’s “spammyness” was measured by SpamAssassin, and the average scores for the ten runs of obfuscated and de-obfuscated messages were calculated.

Table 2 and Figure 1 show the experimental results for a size-ten subset of the original emails we tested. The contents of the spam messages can be found in Appendix A.

No.   Raw Score   Obfuscated   De-obfuscated
1     16.2        3.15         16.2
2     21.7        5.6          21.7
3     17.2        4.38         17.2
4     10.6        1.9          10.6
5     7.9         5.85         7.9
6     21.7        5.0          21.7
7     19.2        5.4          19.2
8     12.1        2.55         12.1
9     13.1        5.94         13.1
10    10.6        5.35         10.6

Table 2: Scores of 10 spam emails. The emails are shown in Appendix A. The table also shows the average scores, as produced by ten independent obfuscations and associated de-obfuscations. The scores were calculated by SpamAssassin.

Figure 1: Score comparisons between spam messages’ raw score and obfuscated score for 10 emails found in Appendix A.

From Figure 1, we can see that obfuscated emails got lower scores than original emails. In these spam emails, there are many keywords, phrases or URLs tagged as spam tokens in the databases of the trained SpamAssassin, but in the obfuscated versions of those emails, most of those keywords, phrases or URLs were changed. The changed forms are not in SpamAssassin’s keyword list, so they aren’t scored as highly. Thus, obfuscated emails’ scores will probably always be lower than those of the original spam emails.

Also, we can see that some spam emails got high scores, while some of them got lower scores. For example, message 1’s score is much higher than message 5’s. This is because message 1 is a common spam email including not only bad keywords, but also a blacklisted URL. After obfuscation, some bad keywords as well as the URL in SpamAssassin’s blacklist were changed (making them new to SpamAssassin). As a result, obfuscated message scores are on average much lower than the original score. In contrast with message 1, message 5 does not have a blacklisted URL, so its score is lower (7.9). Also, its body is short, so it can’t be changed very much, and its obfuscated score ends up still above the threshold of 5 (classified as spam).

In fact, spam emails’ scores are not fixed. They depend on SpamAssassin’s training for its naive Bayesian learning (mentioned in 3.2). However, this does not affect our results, since those keywords and URLs partly changed by Unicode characters in spam emails will be new to any specific trained SpamAssassin (no matter how it is trained).

In Figure 2, some parts of the first spam message and one of its obfuscated versions are displayed. (Screenshots of the full text can be found in Appendix B.) In the obfuscated email, we can see that the link at the bottom and some words like “pen1s” and “enlargment” were partly changed into non-English Unicode characters. (Notice that in this original spam message, some key words had already been obfuscated by spammers in order to bypass spam filters. Examples of this are “enhacment” and “pen1s”; note the misspelling and the use of digits. We note that these two are still blocked by SpamAssassin.) It was the obfuscation’s change of the URL and these keywords that decreased the email’s spam score.

As mentioned above, we used characters whose similarity is specified as 1 for obfuscation characters to get the highest-quality obfuscation. As a result, our de-obfuscation script can quite closely convert those obfuscated messages back to their original form. When close to the original form, their corresponding scores are very close to (if not exactly the same as) those of the original messages. However, it is almost impossible to convert all obfuscated characters to the “real” English characters without some sort of context inference. For example, in the spam email of Figure 2, we can invert “pen1s” to “penis” by judging if the number
...
A top team of British scientists and medical doctors have worked to
develop the state-of-the-art Pen1s Enlargment Patch delivery system
which automatically increases penls size up to 3-4 full inches.
...
Here's the link to check out!
http://www.all-love-pillzz.net/pt/?46&wnwug

...
A top teаm οf British sϲientists аnd medicaǀ doсtorѕ һаve worked to
develοp tһe ѕtate-ofˍtһe-аrt Рenlѕ EnΙarɡment Раtϲһ deӀiverу system
whiϲһ autοmaticalǀy inсreaѕeѕ pen1ѕ ѕize up tο 3-4 fulӀ inсһeѕ.
...
Here's tһe ǀink to cһeck out!
http://www.all-Ӏove-piІlzz.net/pt/?46&wnwuɡ

Figure 2: A spam email before and after obfuscation. The screenshots of the
full texts can be found in Appendix B. In the original spam email, the
spammer has changed the keyword “penis” to “pen1s” and “penls” and misspelled
“enlargement” as “enlargment”, trying to bypass the spam filter. However,
after being trained, SpamAssassin can block this spam email. In the
obfuscated email, our obfuscation script replaced some characters in many
keywords, including “pen1s” and “penls” from the original spam email, with
similar Unicode characters from UC-Simlist. The changed keywords are
therefore new to SpamAssassin, which can help the obfuscated spam email
bypass SpamAssassin.

Figure 3: A phishing email before and after “perfect” obfuscation. To make
the obfuscated email look perfect, we replaced original English characters
only with Unicode characters that take up the same amount of space as the
English characters and look very similar to them. Moreover, we adjusted the
font to make all characters, both the original English characters and the
Unicode characters, look more similar. We use red to identify the obfuscated
characters in the “after” figure.

1’s left and right characters are letters instead of numbers, but it is hard
to catch “l” and invert it to “i” since both “l” and “i” are English
characters. This is because we did not make the script look up in a
dictionary whether a word is real or not. (This can be done, but will
increase the processing
time.) As a result, not all de-obfuscation will be perfect, and in fact some
de-obfuscations might recover words more fully than they appear in the
original source.

We ran this experiment on more than 100 spam emails, analyzing their
obfuscated and de-obfuscated versions. The results showed that, for typical
spam emails, the average score of the obfuscated emails is usually much lower
than that of the original spam emails, which means that well-obfuscated
emails can bypass the spam filter. However, after being de-obfuscated by our
script, they can more easily be caught.

4.1 The Phishing Email Twist
In comparison with spam emails, phishing emails are well written and look
more “authentic”. Before doing experiments with phishing emails, we adjusted
the Simlist used for finding non-English replacements, making the obfuscated
emails look “perfect”. For example, in most fonts, the Cyrillic “i” and “v”
look exactly the same as the ASCII equivalents, but they are different to the
filter. This subset of the Simlist produced “perfect” looking results, though
it shrank the number of obfuscation variations for each input message. We
assert that it is hard for a human reader to discern which characters are not
English characters in the obfuscated version in the sample (Figure 3).

Since phishing emails are well written, they often bypass a filter more
easily. Of course, we can adjust SpamAssassin by training it on such emails,
but this probably increases the probability of a false positive, that is, of
filtering out a real bank message. However, since phishers are in the
business to make money, phishing emails always include some telephone
numbers, links to websites, or postal addresses. For emails that contain
keywords or URLs found in SpamAssassin’s databases, we get a result similar
to that of the spam email tests.

However, in our experiment with dozens of phishing emails obtained from
various sites on the Internet, we found that most of them do not get high
scores from SpamAssassin in the first place. One possible reason is that the
phone numbers and website links are not listed in SpamAssassin’s blacklists.
Another reason might be that phishers put authentic websites in those emails,
but redirect users to malicious websites using techniques like pharming.

4.2 Shortcomings and Improvements
Our de-obfuscation method works well, but is not perfect. It is tough to
measure its accuracy since each person receives a different set of spam
emails, and spammers regularly change techniques. The overhead of a few
seconds per message is not desirable: a person does not wish to wait five
seconds for each email they receive. The performance of our scripts should be
optimized to make them faster.
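
To make the limits of the method concrete, the per-character lookup and the
digit-context check described in Section 4 can be sketched in a few lines of
Python. This is a minimal sketch under stated assumptions, not the script
used in the experiments: REVERSE_SIMLIST is a tiny hypothetical excerpt
standing in for UC-Simlist, and DIGIT_TO_LETTER and deobfuscate are names
introduced here purely for illustration.

```python
# Hypothetical excerpt of a reverse index: homoglyph -> ASCII letter.
# The real UC-Simlist is far larger; these entries are illustrative only.
REVERSE_SIMLIST = {
    "\u0430": "a",  # Cyrillic small letter a
    "\u0435": "e",  # Cyrillic small letter ie
    "\u043e": "o",  # Cyrillic small letter o
    "\u0440": "p",  # Cyrillic small letter er
    "\u0455": "s",  # Cyrillic small letter dze
    "\u0456": "i",  # Cyrillic small letter Byelorussian-Ukrainian i
    "\u03bf": "o",  # Greek small letter omicron
}

# Assumed digit-for-letter substitutions (not taken from the paper).
DIGIT_TO_LETTER = {"1": "i", "0": "o"}


def deobfuscate(text: str) -> str:
    """Map homoglyphs back to ASCII, then invert digits flanked by letters."""
    # Pass 1: one hash-table lookup per character; unknown characters pass through.
    chars = [REVERSE_SIMLIST.get(ch, ch) for ch in text]

    # Pass 2: a digit is treated as a disguised letter only when both of its
    # neighbours are letters, so quantities such as "3-4" or "10" are kept.
    out = []
    for i, ch in enumerate(chars):
        left = chars[i - 1] if i > 0 else ""
        right = chars[i + 1] if i + 1 < len(chars) else ""
        if ch in DIGIT_TO_LETTER and left.isalpha() and right.isalpha():
            out.append(DIGIT_TO_LETTER[ch])
        else:
            out.append(ch)
    return "".join(out)
```

Under these assumptions, deobfuscate("pen1s") returns "penis", while "penls"
is returned unchanged because “l” is already an English letter; catching that
case would require a dictionary lookup or similar context inference.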
4.2.1 Tweaks and Improvements
There are tweaks and changes that can make our de-obfuscation faster or more
accurate. For example, two UC-Simlist indexes can be used: one that maps
English/ASCII characters to similar-looking Unicode characters, and another
that maps each (non-English) Unicode character back to the English/ASCII
characters that resemble it. These two structures can be placed into hash
tables to enable very fast lookups for obfuscation and de-obfuscation,
respectively. Using these hash tables to (de-)obfuscate characters would make
the processing time for each email negligible.

To improve accuracy, we could use dictionary lookups to verify that the
de-obfuscation of a word worked, or to help guide the de-obfuscation choices
when there are many de-obfuscations of a word (such as in the case where the
script must choose between an upper-case “I” and a lower-case “l”).
Additionally, more advanced word-stemming [1] or edit-distance (from banned
keywords) techniques can be used on words in the message to more accurately
find “spammy” content.

5. CONCLUSION
Nowadays, spammers are more intense and agile than ever before, and they
employ many tactics while trying to get their messages past spam filters. We
have described a few methods they are using, including one that involves
obfuscation by similar characters in Unicode: replacing English characters
with non-English equivalents. We described a de-obfuscation script that
reverses this obfuscation process quite well; this script can be integrated
into an existing spam filter to improve its efficacy against spam that is
obfuscated in this fashion. However, spammers probably use more kinds of
obfuscation, such as HTML techniques, to bypass filters, so this is not at
all a complete solution for spam filtering, but it is surely a step in the
right direction.

6. ACKNOWLEDGEMENTS
We would like to thank Markus Jakobsson for his guidance, Arvind Ashok for
help with the experiments, and Rachna Dhamija and Jon Praed for their helpful
feedback on a previous version of the paper.

7. REFERENCES
[1] S. Ahmed, F. Mithun, “Word Stemming to Enhance Spam Filtering,” in the
Conference on Email and Anti-Spam (CEAS’04) 2004.
http://www.ceas.cc/papers-2004/167.
[2] R. Cockerham, “There are 600,426,974,379,824,381,952 ways to spell
Viagra.” http://cockeyed.com/lessons/viagra/viagra.html. Retrieved on
25 July 2007.
[3] D. Cook, J. Hartnett, K. Manderson, J. Scanlan, “Catching Spam Before it
Arrives: Domain Specific Dynamic Blacklists,”
http://crpit.com/confpapers/CRPITV54Cook.pdf.
[4] L. F. Cranor, B. A. LaMacchia, “Spam!” Communications of the ACM,
August 1998.
[5] A. Y. Fu, W. Zhang, X. Deng, W. Liu, “Safeguard against unicode attacks:
generation and Application of UC-simlist,” in the 15th International World
Wide Web Conference (WWW’06), May 2006.
[6] A. Y. Fu, X. Deng, W. Liu, G. Little, “The Methodology and an Application
to Fight Against Unicode Attacks,” in Proceedings of the Second Symposium on
Usable Privacy and Security (SOUPS’06) July 2006. ACM Press.
[7] F. D. Garcia, J.H. Hoepman, J. V. Nieuwenhuizen, “Spam Filter Analysis,”
arXiv report, February 2004. Available at
http://arxiv.org/PS_cache/cs/pdf/0402/0402046v1.pdf
[8] S. L. Garfinkel and R. C. Miller, “Johnny 2: a user test of key
continuity management with S/MIME and Outlook Express,” Proceedings of the
2005 Symposium on Usable Privacy and Security, 2005, pp. 13 – 24
[9] P. Graham, “Better Bayesian Filtering,” Spam Conference, January 2003.
Available at http://www.paulgraham.com/better.html.
[10] E. Gabber, M. Jakobsson, Y. Matias, A. Mayer, “Curbing Junk E-mail via
Secure Classification,” Financial Cryptography, 1998.
[11] E. Gabrilovich, A. Gontmakher, “The Homograph Attack,” Communications of
the ACM, February 2002.
[12] J. Goodman, G. V. Cormack, D. Heckerman, “Spam and the Ongoing Battle
for the Inbox,” Communications of the ACM, February 2007.
[13] R. J. Hall, “Channels: Avoiding Unwanted Electronic Mail,”
Communications of the ACM, Volume 41 Issue 3, 1998.
[14] R. J. Hall, “A Countermeasure to Duplicate-detecting Anti-spam
Techniques,” Available at http://citeseer.ist.psu.edu/279802.html, accessed
25 July 2007.
[15] M. Jakobsson, “Modeling and Preventing Phishing Attacks,” Phishing Panel
in Financial Cryptography 2005. Available at
www.informatics.indiana.edu/markus/papers/phishing_jakobsson.pdf
[16] M. Jakobsson, J. Linn, J. Algesheimer, “How to Protect Against a
Militant Spammer,”
http://www.informatics.indiana.edu/markus/papers/spam.pdf, accessed
1 July 2007.
[17] M. Jakobsson and S. A. Myers (Eds.), Phishing and Countermeasures:
Understanding the Increasing Problem of Electronic Identity Theft. ISBN
0-471-78245-9, Hardcover, 739 pages, December 2006.
[18] J. Nazario, “Phishing Corpus,”
http://monkey.org/~jose/blog/viewpage.php?page=phishing_corpus. Accessed
22 May 2007.
[19] U. Shardanand, P. Maes, “Social Information Filtering: Algorithms for
Automating ’Word of Mouth’,” Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems. May 1995.
[20] B. Thorson, “How Spammers Bypass E-mail Security,” EE Times,
25 July 2007. http://www.eetimes.com/showArticle.jhtml?articleID=23900564
[21] A. Tsow and M. Jakobsson, “Deceit and Deception: A Large User Study of
Phishing,” Technical Report TR649, Indiana University, August 2007.
http://www.cs.indiana.edu/pub/techreports/TR649.pdf
[22] S. Srikwan, M. Jakobsson, “Using Cartoons to Teach Internet Security.”
DIMACS Technical Report 2007-11, July 2007.
http://www.informatics.indiana.edu/markus/documents/security-education.pdf
[23] CRM114. http://crm114.sourceforge.net,
Accessed 22 May 2007.
[24] Anti-Phishing Group of City University of Hong
Kong, http://antiphishing.cs.cityu.edu.hk.
[25] Messaging Anti-Abuse Working Group, Email
Metrics Program: “The Network Operator’s
Perspective, Report #4—3rd and 4th Quarters
2006,” Available at http://www.maawg.org/about/
MAAWGMetric_2006_3_4_report.pdf
[26] SpamAssassin.
http://wiki.apache.org/spamassassin, Accessed
22 May 2007.
[27] SpamAssassin Readme file.
http://www.cpan.org/modules/by-module/Mail/
Mail-SpamAssassin-2.64.readme Accessed 22 May
2007.
[28] SpamAssassin public Corpus,
http://spamassassin.apache.org/publiccorpus,
Accessed 25 May 2006.
APPENDIX
A. 10 SAMPLE SPAM MESSAGES BEFORE OBFUSCATION

Figure 4: This email was scored 16.2 by SpamAssassin. As mentioned in the analysis in Section 4, the body of
this email is obviously spam. SpamAssassin recognizes the message as spam due to the blacklisted
URLs. Both the content and the link cause SpamAssassin to assign the message a high score. Obfuscating lowers
the score because the keywords, phrases, and links are disguised.

Figure 5: This email was scored 21.7 by SpamAssassin. Although the text of the body of this email is confusing
to users (in fact, spammers can use JavaScript to remove the small letters among the capital letters before
receivers read this kind of email), it is tagged as spam by SpamAssassin after training on spam emails and ham
emails. In particular, the link is in several of SpamAssassin’s blocklists, which gives it a high score.

Figure 6: This email was scored 17.2 by SpamAssassin. Although the text of the body of this email is confusing
to users (in fact, spammers can use JavaScript to remove the small letters among the capital letters
before receivers read this kind of email), it is tagged as spam by SpamAssassin. In particular, the link is in
SpamAssassin’s blocklists, which gives it a higher score.

Figure 7: The email scored 10.6 by SpamAssassin. Both the content and link contribute to the score.

Figure 8: The email scored 7.9 by SpamAssassin. The body is judged to be spam, but the link is not in the
blacklist, so the score of this email is not as high as that of other emails with blocked links.

Figure 9: The email scored 21.7 by SpamAssassin. Although the text of the body of this email is confusing
to users, it is tagged as spam by SpamAssassin, largely due to the link. We note that spammers can use CSS
to remove the small letters before the email is displayed.
Figure 10: The email scored 19.2. The body of this email is not pure text, but it is still considered spam by
SpamAssassin. In addition, the link contributes to the high score.

Figure 11: The email scored 12.1. The content of this email is judged as spam by SpamAssassin. The link, which is
in SpamAssassin’s blocklist, also contributes to this email’s higher SpamAssassin score.
Figure 12: The email scored 13.1. The small letters can be removed using CSS.

Figure 13: The email scored 10.6. SpamAssassin considers this spam and blocks the link. Both the content
and link cause a higher score.
B. FULL-SIZE SCREENSHOTS

Figure 14: The original spam message before obfuscation. We can see that the spammer has changed the
keyword “penis” to “pen1s” and “penls” and misspelled “enlargement” as “enlargment”, trying to
bypass the spam filter. However, after being trained, SpamAssassin can block this.
Figure 15: The spam message after obfuscation. In this obfuscated email, some characters were replaced with
other similar Unicode characters in UC-Simlist by our obfuscation script. This helps the obfuscated spam
email bypass SpamAssassin.
