
An N-gram Analysis of Korean English
Learners' Writing1
Shinchul Hong
(Busan University of Foreign Studies)
Hong, Shinchul. 2013. An N-gram Analysis of Korean English
Learners' Writing. Korean Journal of English Language and
Linguistics 13-2, 00-00. The purpose of the study is to
investigate Korean EFL learners' use of continuous word
sequences. To this end, the study compiles a small corpus of
Korean learners' essays for contrastive analysis with a native
learner corpus, and extracts lists of 4-word sequences
(4-grams) from the Korean learner corpus (KLC) and the
native learner corpus (LOCNESS). The analysis considers two
aspects: grammatical structures and functional roles. First,
from the perspective of grammatical structure, Korean learners
overuse one type of structure: Noun phrase/Pronoun + verb
(I do not think), whereas native learners prefer the structure
Preposition + Noun phrase (at the end of). Second, from the
perspective of functional roles, Korean learners overuse
4-grams indicating their attitudes towards propositions (I have to
study), and underuse text-organizing 4-grams (on the other
hand).
Key Words: 4-gram, type/token ratio, grammatical categories,
functional categories, cut-off point, noticing.
1 This work was supported by the 2013 Busan University of Foreign
Studies research grant.
1. Introduction
The use of continuous sequences of words has been
considered a useful means to measure learners' fluent
language use (Bamberg, 1983; Cortes, 2004; Howarth, 1998;
Hyland, 2008). The theoretical background to this issue is
that native speakers rely more on prefabricated word
sequences in their language use (Ädel & Erman, 2012).
Moreover, they tend to use many more fixed sequences of
words in spoken language than in written language (Pawley &
Syder, 1983). According to Bresnan (1999), speech is based
on prefabricated word formulas which are easily retrieved
from long-term memory, so speakers can use them in
real-time communication. An important feature here is that
native speakers are likely to have a unique way of using
formulaic language in a particular register. Hyland (2008)
describes it as naturalness which indicates the identity of
their language community. For this reason, ESL/EFL learners
need to acquire this kind of naturalness in order to achieve
communicative competence. In spite of its importance,
ESL/EFL teachers often lack sufficient understanding of
the mechanism of formulaic language, so
learners may have difficulties in making appropriate
grammatical and lexical choices (Howarth, 1998).
Fortunately, the development of corpus linguistics has
encouraged researchers to investigate collocational patterns
of continuous word sequences in various areas (Biber,
Johansson, Leech, Conrad & Finegan, 1999; Hong, 2012).
Furthermore, it is quite easy to extract continuous sequences
of n-words from corpora using user-friendly concordancers
(e.g., WordSmith 5.0; AntConc 3.2.1). A continuous sequence
of n-words here refers to n-grams, such as in order to
(3-gram), or in the long run (4-gram) (for more details, see:
Section 2.1). For this reason, a variety of lexical patterns
can be explored from collocations of one or two words to
n-grams. Even though n-gram analyses provide pedagogically
useful information and enhance teachers' understanding of
their distinctive patterns, learners still have great difficulties
in their production (Cortes, 2006). This phenomenon seems
to indicate that simplistic presentations of n-grams do not
guarantee their appropriate use and acquisition. According to
Howarth (1998), even advanced non-native writers may fail
to communicate their subject matter efficiently and
effectively, not because of any lack of academic ability,
but because of incompleteness of language use. Even though
corpus-based collocations have had a great impact on
syllabus design and approaches (e.g., data-driven learning,
lexical approach), it is necessary to emphasise pedagogically
useful n-grams so that learners can become sensitive to the
appropriate disciplinary repertoires of native speakers. In
this regard, the study investigates how Korean learners use
4-word sequences (4-grams), focusing on grammatical
structures and four discourse functions: referential,
text-organizer, stance and other.
2. Literature Review
2.1 Terminology
One of the difficulties of researching continuous word
sequences is that different terminologies are used with more
or less similar definitions. Thus it is necessary to clarify
what an n-gram is in this study. In the literature, there are
perhaps 10 different terminologies in common use:
phraseology (Cowie, 1998; Granger & Meunier, 2008),
formulaic sequences (Wray, 2008), lexical bundles (Biber et
al., 1999), n-grams (Stubbs, 2007; Hong, 2012), clusters
(Hyland, 2008), recurrent word combinations (Altenberg,
1998), phrasicons (De Cock, Granger, Leech, & McEnery,
1998), multi-word constructions (Liu, 2012), skipgrams
(Cheng, Greaves, & Warren 2006) and concgrams (Cheng,
Greaves, Sinclair, & Warren, 2008). In a broad sense, these
10 terminologies can be divided into three groups (see
Table 1).
Table 1 Terminologies and definitions

Group  Terminologies                        Definition
1      phraseology, formulaic sequence,     co-occurring word patterns
       phrasicon, concgram2                 (not necessarily continuous)
2      lexical bundle, n-gram, multi-word   repeated continuous sequences
       construction, cluster, recurrent     of words
       word combination
3      skipgram                             non-continuous sequence of words
                                            using a skip distance of n (e.g.,
                                            2-skipgram: reduce word-A word-B
                                            expenditure) (Hong, 2012)

Even though the above terminologies refer to a similar notion
of word sequences, the researcher would like to discuss the
notion of phraseology because it is widely used in Second
Language Acquisition (SLA) and language education (Granger,
1998a; Gries, 2008). According to Gries, phraseology is
different from n-gram in two respects. First, phraseology
consists of at least two grammatical elements (e.g., to eke
out a living: to eke out + Determiner + living). Second,
phraseology should have semantic unity unlike an n-gram
(e.g., 4-gram: far as is necessary). In this study, the
definition of n-gram refers to a continuous sequence of
n-words (e.g., 4-gram: the nature of the) without any
consideration of the number of grammatical elements or
semantic unity. One reason for using n-gram rather than
phraseology in this study is that n-grams can be easily
extracted from corpus data with a concordancer, such as
WordSmith 5.0. Furthermore, n-grams are able to include
phraseological patterns if the value of n is increased3.
2 Concgram refers to co-occurring word sequences which are generated
by the software "Concgram" which indicates whether or not they are
continuous (Cheng et al., 2008; Hong, 2012).
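The definition above (a continuous sequence of n words, taken without regard to grammatical or semantic unity) can be illustrated with a minimal Python sketch. It is not the procedure used in the study, which relies on WordSmith; the function name `extract_ngrams`, the whitespace tokenization, and the toy sentence are assumptions made only for illustration.

```python
from collections import Counter


def extract_ngrams(tokens, n=4):
    """Return every continuous sequence of n tokens, in order of occurrence."""
    # A sliding window of width n; texts shorter than n yield an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy input (hypothetical); real corpora would need proper tokenization.
sentence = "at the end of the day at the end of the race"
tokens = sentence.lower().split()

ngrams = extract_ngrams(tokens, n=4)
counts = Counter(ngrams)

for gram, freq in counts.most_common(3):
    print(" ".join(gram), freq)
```

Because the window ignores grammatical boundaries, the output includes sequences such as "of the day at", which has no semantic unity. This is the property that distinguishes an n-gram from phraseology in the sense discussed above.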
2.2 Linguistic Features

The theoretical background of n-gram is based on the
study of lexico-grammar in terms of syntagmatics and
paradigmatics (Granger, 1998a). According to Granger, an
n-gram can play the role of a sentence-builder which
functions as a macro-text organizer in the text. Therefore,
native speakers' language use is viewed as placing a series
of sentence builders in the right order, that is, the order in
common use in their language community. This idea is not
very different from Sinclair's (1991) idiom principle, which
holds that meaning is constructed by a series of
prefabricated word chunks. These prefabricated word chunks
have flexible characteristics (e.g., during the winter, in the
winter) unlike idioms which are fixed (e.g., kick the bucket).
From this viewpoint, Hoey (2004) extends Sinclair's idea into
lexical priming, meaning that every lexical item is
primed for use in terms of collocation,
colligation, and semantic association. For example, red is
primed as an adjective (a red sunset) and a noun (the colour
red) (Hoey, 2004, p. 23).
In a narrow sense, Cortes (2004) describes the specific
features of lexical bundles (n-grams). First, they have a form
of extended collocation, which is a sequence of three or
more words. Since n-gram is a more or less technical term,
it need not be limited to three or four words. However,
4-grams are commonly analysed in the sense that they are
considered to be an optimum length for an n-gram (for
more details, see Section 3.3). Second, an n-gram has the
feature of flexible compositionality and its meaning can be
retrieved from individual words (e.g., at the beginning of).
Third, an n-gram is not necessarily a complete unit of grammatical
structure: concordancers extract n-grams on the basis of
continuous sequences of n-words without any consideration
of grammatical units. Fourth, an n-gram need not form a single
semantic unit. For example, the 4-gram theoretical
and do not does not play a role as a single unit of meaning.
3 WordSmith 5.0 can adjust the span of words (from 1 to 12 words).
2.3 Previous Research on Formulaic Language

In the literature, there are two broad trends. The first is to
investigate the written production of native vs. non-native
speakers and students vs. experts (Ädel & Erman, 2012; Chen
& Baker, 2010; Cortes, 2004; Hyland 2008). The second
trend is corpus-based register analysis (Biber et al., 1999)
and ESP (English for Specific Purposes) (Hong, 2012). With
regard to the first one, phraseology rather than n-gram or
lexical bundles has been explored to discover the formulaic
nature of pragmatic rules when using continuous sequences
of words and to apply the findings to the language classroom
(Howarth, 1998). Chen and Baker (2010) analyse lexical
bundles (4-grams) of academic writing. For this, three
corpora (published academic writing, native students' writing,
Chinese students' writing) are analysed both quantitatively
and qualitatively. The results of comparing the corpora show
that Chinese students have the smallest range of lexical
bundles among the three corpora. A distinctive finding of
their study is that native and non-native students' use of
lexical bundles is surprisingly similar in terms of structure
and function. However, the experts' lexical bundles are
different from those of the students in that their writing
shows a wider range of bundles.
Hyland (2008) analyses variations in multi-word
expressions (4-grams) within four different disciplines4 in
terms of form, structure and function. Three corpora are
used: research articles, Ph.D. dissertations, and master's theses.
The result of the study demonstrates that each discipline has
its own characteristic types of bundles, and supports
the findings of other studies such as Cortes (2004) and Biber
(2006). In Hyland's study, bundles can play the role of
representing the identity of each discipline. For this reason,
in terms of EAP (English for Academic Purposes), learners
need to understand how lexical bundles are commonly used
in their subject matter.
The above studies show why formulaic language, bundles,
or n-grams are important to achieve natural language use.
Therefore, it is necessary to use bundles appropriately in the
classroom. Cortes (2006) examines the effects of teaching
multi-word combinations to university students taking a
history class. For this, she analyses two sets of writing
produced in the course: pre- and post-instruction. Interestingly, the
results show no significant difference
between the two. However, she argues that the students' awareness
of using bundles increased. Furthermore, her study
emphasises systematic exposure to target bundles on the
basis of the development of students' knowledge within their
specialised community.
In light of the results of the above study, the use of
n-grams needs to be taken into account in pedagogical
applications for EFL learners. As in the study of Cortes
(2006), a systematic approach is likely to be required to
achieve the naturalness which may characterize their target
language community. For this reason, the study will focus on
investigating the similarities and differences between Korean
learners' and native learners' 4-grams in their writing in terms of
grammatical patterns and discourse functions.
4 Four different disciplines: Electrical engineering, biology, business
studies, applied linguistics.
3. Method
3.1 Data Collection and Participants

A total of 435 undergraduate students majoring in English language
and literature (females: 154, males: 281) participate in the
Korean Learner Corpus (KLC) project. The level of the
participants is considered to be intermediate, as determined
on the basis of TOEIC scores (700-750). The KLC project
collects learner profiles and consent forms for data collection.
In their profiles, learners report their score on the
proficiency test (TOEIC). As part of an assignment on an
obligatory course, they are asked to write a 500-word
essay. In practice, the range is 400-500 words, so the
average length of each essay is approximately 450 words. In
order to achieve homogeneity of data collection, two criteria
are employed: TOEIC scores and essay length.
3.2 Corpus Compilation

The study analyses 4-grams in EFL Korean learners'
written production. For this, 435 academic essays are used
to compile the KLC (see Table 2). They are collected from
two universities in Busan. The design criteria for the KLC are
the same as those of the ICLE (International Corpus of
Learner English) project conducted by Granger (see
Granger, 1998b).
Table 2 KLC Design Criteria

Feature           Category          Attribute
Language-related  Mode              Written
                  Genre             Academic essay
                  Style             Argumentative
                  Topic             Suggested topics (see Appendix)
Learner-related   Age range         20-30 years old
                  Level             Intermediate (undergraduates majoring in
                                    English language and/or literature)
                  Mother tongue     Korean
                  Learning context  EFL
Task-related      Data collection   Cross-sectional
                  Task setting      Untimed
                  Elicitation       Prepared
                  Technicality      Non-technical

Since the KLC is to be compared contrastively
with the LOCNESS (Louvain Corpus of Native English Essays:
see Granger, 1998b) as a native learner corpus, the KLC
is compiled according to criteria similar to those of its
counterpart. As a result, the specifications of the two corpora
are more or less similar (see Table 3).
Table 3 Specifications of the KLC and the LOCNESS

Category                  KLC       LOCNESS
Number of Texts           435       189
Tokens                    196,453   201,839
Types                     11,234    12,749
Type/Token Ratio (TTR)    5.79      6.32
Standardized TTR          41.15     39.85
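For readers unfamiliar with the two ratios in Table 3, the following Python sketch shows one common way to compute them. The TTR follows the definition used in this paper (type/token*100). The standardized TTR is sketched here as the mean TTR over consecutive fixed-size chunks of the text; the chunk size, the handling of the final incomplete chunk, and the function names are assumptions, and the exact settings behind the figures in Table 3 are not stated in this paper.

```python
def ttr(tokens):
    """Type/token ratio as a percentage (types / tokens * 100)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens) * 100


def standardized_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full chunks of chunk_size tokens.

    Assumption: an incomplete final chunk is ignored, and a text shorter
    than one chunk falls back to the plain TTR.
    """
    n_chunks = len(tokens) // chunk_size
    if n_chunks == 0:
        return ttr(tokens)
    chunk_ttrs = [
        ttr(tokens[i * chunk_size:(i + 1) * chunk_size])
        for i in range(n_chunks)
    ]
    return sum(chunk_ttrs) / n_chunks


# Toy input (hypothetical) with a tiny chunk size so the effect is visible.
toy = "the cat sat on the mat and the dog sat on the rug".split()
print(round(ttr(toy), 2))
print(round(standardized_ttr(toy, chunk_size=4), 2))
```

The plain TTR falls as a text grows, because frequent words keep recurring, whereas the chunk-based average is less sensitive to corpus size. This is why the two measures in Table 3 are on very different scales.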
3.3 Generating Lists of 4-grams

Two kinds of 4-gram lists are generated using WordSmith
Tools 5.0 (Scott, 2010) in the study: a KLC 4-gram list and
a LOCNESS 4-gram list. The same extraction criteria are
applied to each list. There are two critical issues
in this regard. One is why 4-word sequences rather than
other word sequences (e.g., 3- or 5-grams) are used.
According to Cortes (2004), 4-grams are ideal in that they
include 3-grams, and have more tokens than 5-grams (over
10 times more frequent). Furthermore, according to Ädel and
Erman (2012), using 4-grams is conducive to a richer study in that
the analysis can be compared with those of other
studies. For this reason, 4-grams are commonly used in
studies of continuous word sequences (Ädel & Erman, 2012;
Chen & Baker, 2010; Cortes, 2004, 2006).
Another issue is the cut-off point in terms of frequency and
dispersion. Studies adopt different criteria for the cut-off
point (see Table 4), and according to Ädel and Erman
(2012), the cut-off point is somewhat arbitrary. The study therefore
adopts the following cut-off point: 4 occurrences (frequency) in at
least 3 texts (dispersion). On the basis of the studies presented in
Table 4, the cut-off point is conservatively determined.
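The cut-off point described above combines a frequency threshold with a dispersion threshold. A minimal Python sketch of such a filter is given below; it is not the WordSmith procedure used in the study, and the function names, the whitespace tokenization, and the toy texts are assumptions for illustration only.

```python
from collections import Counter


def four_grams(tokens, n=4):
    """Return all continuous n-token sequences in one text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def apply_cutoff(texts, min_freq=4, min_texts=3, n=4):
    """Keep n-grams occurring at least min_freq times in at least min_texts texts."""
    freq = Counter()        # total occurrences across the corpus
    dispersion = Counter()  # number of different texts containing the n-gram
    for tokens in texts:
        grams = four_grams(tokens, n)
        freq.update(grams)
        dispersion.update(set(grams))  # count each text once per n-gram
    return {
        gram: count
        for gram, count in freq.items()
        if count >= min_freq and dispersion[gram] >= min_texts
    }


# Toy corpus (hypothetical): each inner list stands for one tokenized essay.
texts = [
    "on the other hand i think".split(),
    "on the other hand we agree".split(),
    "on the other hand on the other hand".split(),
    "i do not think i do not think i do not think i do not think".split(),
]
kept = apply_cutoff(texts)
print(kept)
```

In the toy corpus, "on the other hand" passes both thresholds, while "i do not think" is frequent enough but occurs in only one text and is therefore excluded. The dispersion criterion thus guards against sequences that reflect the habits of a single writer.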
Table 4 Frequency and Dispersion

Study                 Frequency                             Dispersion
Cortes (2004)         20 occurrences per million words      5 texts
Chen & Baker (2010)   25 occurrences per million words      3 texts
Biber et al. (1999)   10 occurrences                        5 texts
Hyland (2008)         20 occurrences per million words      10% of texts
Ädel & Erman (2012)   25 occurrences per million words      9 texts
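Most thresholds in Table 4 are normalized per million words, whereas the cut-off adopted in this study is a raw count. As a rough aid to comparison, the sketch below converts a raw frequency of 4 into occurrences per million words, using the token counts reported in Table 3. The helper name `per_million` is an assumption, and the conversion is offered only as an illustration, not as part of the study's procedure.

```python
def per_million(raw_count, corpus_tokens):
    """Convert a raw frequency into occurrences per million words."""
    return raw_count / corpus_tokens * 1_000_000


# Token counts taken from Table 3.
KLC_TOKENS = 196_453
LOCNESS_TOKENS = 201_839

klc_rate = per_million(4, KLC_TOKENS)
locness_rate = per_million(4, LOCNESS_TOKENS)

print(f"KLC: {klc_rate:.1f} per million words")
print(f"LOCNESS: {locness_rate:.1f} per million words")
```

Under these figures, a raw cut-off of 4 occurrences corresponds to roughly 20 occurrences per million words in both corpora, which is of the same order as the normalized thresholds listed in Table 4.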
3.4 Procedure of Data Analysis

The KLC project analyses EFL Korean learners' essays in
terms of grammatical and functional categories. Twelve
grammatical categories are used (see Table 5).
Table 5 Grammatical Categories

Grammatical category      Example
Noun Phrase (NP)          a high school student
Preposition+NP (PP)       at the same time
Passive (PA)              have been interested in
Anticipatory-it (Ant-it)  it is important to
Be+NP/AP/PP (BE)          is one of the
Verb+NP/PP (VN)           have the right to
Modal (MO)                would like to go
-ing (ING)                getting a good job
To-infinitive (TO)        to be a teacher
Conjunction (CO)          when I was a
NP/Pronoun+Verb (NV)      I do not think
Others (Other)*           alive at the end
* Others: It includes 4-grams which cannot be categorised into the
other categories.
In the literature, grammatical structures of n-grams depend
on the taxonomy developed by Douglas Biber, which is
used to analyse the structural patterns of lexical bundles
in different genres (Biber et al., 1999; Biber, 2006).
His taxonomy is based on three categories: verb-phrase
fragments, dependent clause fragments, and
noun/prepositional phrases (Juknevičienė, 2009). In this paper,
Biber's grammatical categories are modified as shown in
Table 5.
In terms of functional categories, four main categories are
adopted in the study (see Table 6). Three categories
(Referential, Text-organizer, Stance) are widely used in
linguistic research (Biber, Conrad & Cortes, 2004; Chen &
Baker, 2010; Cortes, 2004; Liu, 2012). One difference
is that the study adopts the category of Other, which the
above studies exclude.
Table 6 Functional categories

Category         Sub-category                         Example
Referential      Time markers (RT)                    at the end of
                 Place markers (RP)                   the center of the
                 Descriptive bundles (RD)             the root of the
                 Quantifying bundles (RQ)             a large number of
Text organizers  Contrast/Comparison inferences (TC)  on the other hand
                 Focus (TFo)                          in my case I
                 Framing (TFr)                        in addition to the
                 Topic introduction (TT)              in this essay I
Stance           Epistemic-impersonal/                I think that this
                 Probable-possible (SE)
                 Obligatory/directive (SO)            do not have to
                 Ability (SA)                         it is difficult to
Other            OTHER                                to go to the
The 4-grams are generated through the WordSmith
concordancer and then manually classified in terms of the
grammatical and functional categories presented in Tables 5
and 6, following two steps. As a first step, the researcher
categorises the list of 4-grams. As a second step, two native
speakers working as English teachers at university level examine the
first categorization. When there is a disagreement, they
discuss it and reach a consensus for validity and
reliability.
4. Results
4.1 4-gram Lists

The KLC has more tokens for 4-grams than LOCNESS once the
cut-off point is applied (see Table 7). A distinctive feature
here is that the KLC yields fewer 4-grams than LOCNESS without
the cut-off point, but more with the cut-off point.
Table 7 Frequencies of 4-gram lists

        KLC                                LOCNESS
        with the         without the       with the         without the
        cut-off point*   cut-off point     cut-off point    cut-off point
Type    510 (0.35%)**    144,521           420 (0.25%)      166,776
Token   3,470 (2.26%)    152,926           2,833 (1.62%)    174,696
TTR***  14.69            94.50             14.82            95.46
* Cut-off point: Frequency 4/ at least 3 texts
** 0.35% = 510/144,521*100
*** TTR = type/token*100
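The derived figures in Table 7 follow from the two formulas given in its notes. As a check on the arithmetic, the sketch below recomputes the relative frequencies and TTRs from the raw counts in the table. The dictionary layout and the names `pct`, `TABLE7`, and `summarize` are assumptions introduced for this illustration.

```python
def pct(part, whole):
    """Return part/whole as a percentage."""
    return part / whole * 100


# Raw counts copied from Table 7 (cut = with the cut-off point, all = without).
TABLE7 = {
    "KLC": {"type_cut": 510, "type_all": 144_521,
            "token_cut": 3_470, "token_all": 152_926},
    "LOCNESS": {"type_cut": 420, "type_all": 166_776,
                "token_cut": 2_833, "token_all": 174_696},
}


def summarize(c):
    """Recompute the derived figures of Table 7 for one corpus."""
    return {
        "type_share": pct(c["type_cut"], c["type_all"]),     # note ** in Table 7
        "token_share": pct(c["token_cut"], c["token_all"]),
        "ttr_cut": pct(c["type_cut"], c["token_cut"]),       # note *** in Table 7
        "ttr_all": pct(c["type_all"], c["token_all"]),
    }


SUMMARY = {name: summarize(c) for name, c in TABLE7.items()}
for name, row in SUMMARY.items():
    print(name, {key: f"{value:.3f}" for key, value in row.items()})
```

The recomputed values agree with those reported in Table 7 to within 0.01; the small differences in the last digit are consistent with the reported figures having been truncated rather than rounded.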
Since the two corpora have the same design criteria, the
difference can be understood as showing that Korean learners use
more recurrent 4-grams. In other words, they are likely to depend
more on recurrent formulaic sequences than learners who are native
speakers. Relative frequencies of tokens with the cut-off point
are 2.26% (KLC) and 1.62% (LOCNESS). The results for types
show a similar pattern to that of tokens. The TTR (Type-Token
Ratio) also indicates that the two corpora have similar
patterns in terms of variety. In a broad sense, Korean
learners' list of 4-grams is not likely to be much different
from that of the native learners. Henceforth, all analyses in
this study will be based on lists of 4-grams with the cut-off
point.
4.2 Grammatical Category

The results for 4-grams show a marked difference
between the two groups in terms of grammatical categories
(see Table 8 and Figure 1). A distinctive feature here is that
NP (Noun Phrase) is the most commonly used of the
categories in both corpora. The specific structure of the NP is a
noun with a post-modifier fragment (Noun+of+Noun: ①). Another
interesting point in the results is that the category with the
second highest frequency of type and token is NV
(Noun/Pronoun+Verb: ③) in the KLC, but PP (Prepositional
Phrase: ②) in LOCNESS.
① Noun Phrase
Friendship is one of the most important relationships that
we build in our lives (KLC: 14).
② Prepositional Phrase
Another rule is at the end of the race, come in the pit
slowly and do not hit the person in front of you
(LOCNESS: 68)
③ Noun/Pronoun + Verb
I do not think that dispatch of troops is the decision for
world peace. (KLC: 47)
Table 8 Raw Frequencies of Grammatical Categories

            Type          Token          TTR
Category*   KLC   LOC**   KLC    LOC     KLC     LOC
NP          114   154     868    1149    13.13   13.40
PP          58    100     424    765     13.67   13.07
PA          3     9       12     51      25      17.64
Ant-it      12    18      68     99      17.64   18.18
BE          34    16      260    90      13.07   17.77
VN          53    16      334    92      15.86   17.39
MO          8     14      49     86      16.32   16.27
ING         3     3       17     17      17.64   17.64
TO          19    9       113    50      16.81   18
CO          72    47      469    257     15.35   18.28
NV          101   19      667    94      15.14   20.21
OTHER       33    15      189    83      17.46   18.07
* see Table 5  ** LOC=LOCNESS
Korean learners' overuse of
NV is likely to reflect a simplistic way of expressing their
personal opinions with the personal pronoun I (e.g., I want to
go, I think it is, I really want to, I do not agree). On the
other hand, native learners' use of PP (27% of tokens) shows a
similar frequency pattern to the academic genre in the native
reference corpus (LSWE): KLC=12%, LOC=27%, LSWE=33%5
(see Figure 1).
5 LSWE (Longman Spoken and Written English) Corpus: Noun phrase-30%,
Prepositional phrase-33%, Verb phrase-37% (for more detail, see: Biber
et al., 1999).
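The percentages quoted for PP (KLC=12%, LOC=27%) can be reproduced from the token counts in Table 8, on the assumption that they are shares of all 4-gram tokens in each corpus. The following sketch computes the share of every grammatical category; the dictionary names and the helper `token_share` are assumptions introduced for illustration.

```python
# Token counts per grammatical category, copied from Table 8.
TOKENS = {
    "KLC": {"NP": 868, "PP": 424, "PA": 12, "Ant-it": 68, "BE": 260,
            "VN": 334, "MO": 49, "ING": 17, "TO": 113, "CO": 469,
            "NV": 667, "OTHER": 189},
    "LOCNESS": {"NP": 1149, "PP": 765, "PA": 51, "Ant-it": 99, "BE": 90,
                "VN": 92, "MO": 86, "ING": 17, "TO": 50, "CO": 257,
                "NV": 94, "OTHER": 83},
}


def token_share(counts):
    """Return each category's share of all tokens as a percentage."""
    total = sum(counts.values())
    return {category: count / total * 100 for category, count in counts.items()}


SHARES = {corpus: token_share(counts) for corpus, counts in TOKENS.items()}

for corpus, shares in SHARES.items():
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    print(corpus, [(category, round(share, 1)) for category, share in ranked[:3]])
```

The category totals sum to the token counts with the cut-off point in Table 7 (3,470 and 2,833), and the PP shares round to the 12% and 27% cited above.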
Figure 1 Relative Frequencies of Type and Token
From the perspective of language variety, the 4-grams of the
KLC and LOCNESS do not differ much overall (see Table
7). However, the TTRs of the individual categories show a different
pattern (see Table 8 and Figure 2).
Figure 2 Relative frequencies of TTR
First, the TTR of PA (Passive: ④) in the KLC is much higher
than that of LOCNESS.
④ Passive
I have been interested in volunteering since I translated
from Africa (KLC: 120).
However, this difference may be influenced by the low
frequency of tokens and types in this category (see Table 8).
Second, the category of BE (be + NP/AP/PP: ⑤) in the KLC has a
lower TTR than that of LOCNESS. In other words, Korean
learners are likely to focus on one particular pattern (be +
NP) rather than others. However, native learners use several
different patterns (be + NP, be + PP).
⑤ Be + NP/AP/PP/-ing
There are some famous coastal towns such as Cancun
which is one of the greatest museums in the world and it
is considered as a Holy Land for the tourists of the city
(KLC: 239).
His belief is that everything is for the best, even the death
of two hundred thousand in the earthquake at Lisbon is
deemed as God's will... (LOCNESS 101).
Third, Korean learners' variety of NV (Noun/Pronoun+Verb)
is also lower than that of native learners. NV 4-grams in the KLC
concentrate on one pattern (personal pronoun +
verb), which accounts for 79.61% of tokens and
77.22% of types. In LOCNESS, the percentages are 58.51%
(token) and 68.42% (type). In spite of Korean learners'
overuse of NV (see Table 8), their usage is quite limited, to
just a few types of 4-grams.
4.3 Functional Category

The results of analysing functional category (reduced
categories: Stance, Text-organizer and Referential) show that
both corpora have the highest frequency of Referential in
terms of type and token (see Figure 3). However, they have
different patterns in the categories of Stance (①) and
Text-organizer (②). In particular, the KLC has a lower
frequency of Text-organizer and a higher frequency of
Stance than LOCNESS.
① Stance
But I 'd like to introduce two things that I think it's more
special than others (KLC: 121)
② Text-organizer
On the other hand, Korean economy is located in the
middle of the ranking (KLC: 234)
These are concrete reasons and resulting consequences
that would come about as a result of televising executions
(LOCNESS: 152)
Figure 3 Relative Frequencies of Token and Type
From the perspective of function, Korean learners are
likely to overuse 4-grams when describing writers' evaluation
and attitudes about propositions, but underuse expressions
indicating the structure of essays. First, Korean learners
show weakness in organizing what they want to
write, such as topic introduction (in this essay I), text
framing (in addition to the), and focusing (this is the most).
According to Chen and Baker (2010), Chinese students'
essays also show the same pattern: they significantly
underuse the category of Text-organizers. Second, in the case of
Stance, Korean learners overuse this category and this may
be related to their overemphasis on expressing their
opinions. Since the essay topics are argumentative, they may
try to deliver their ideas or comments as clearly as possible. In a
study by Herriman and Aronsson (2009), non-native speakers
tend to overuse pseudo-clefts with a first person pronoun
(e.g., what I think, I think that) when they make a comment.
On the basis of the above two results, it is possible to infer
the pattern of Korean learners when they write an essay:
they focus on expressing their ideas, but with a
weak structure for their essay format in terms of discourse
function.
In Table 9, a more specific distribution of functional
categories in Korean learners' essays is presented. From the
viewpoint of type, the KLC has the highest frequency of SO
(Stance: Obligatory/directive) to describe writers' attitudes,
but LOCNESS has the highest frequency of RD (Referential:
Descriptive bundles) (excluding the category of Other). Moreover,
they have a similar pattern in the frequencies for token. Korean
learners' overuse of SO (e.g., I want to do, I do not have, I don't
want to) may be related to over-simplifying the description of their
attitude to propositions instead of expressing it in different ways.
In other words, since Korean learners may not know various
expressions to describe their opinions, they are likely to
prefer to use a couple of fixed expressions. However, native
learners may adopt a functionally different strategy when
they argue a certain topic. They often use descriptive
bundles (e.g., as a symbol of, as an example of, and the use
of) rather than a typical way to make their comments, such
as the category of Stance.
Table 9 Raw Frequencies for Functional Category

                       Type        Token         TTR
Category*              KLC   LOC   KLC    LOC    KLC     LOC
Referential     RT     45    13    387    163    11.62   7.97
                RP     25    20    175    171    14.28   11.69
                RD     24    48    139    293    17.26   16.38
                RQ     37    32    369    205    10.02   15.60
Text-organizer  TC     18    13    123    122    14.63   10.65
                TFo    15    8     85     47     17.64   17.02
                TFr    9     28    43     206    20.93   13.59
                TT     4     7     19     35     21.05   20
Stance          SE     46    45    291    266    15.80   16.91
                SO     67    18    480    111    13.95   16.21
                SA     10    5     58     37     17.24   13.51
Other                  210   183   1301   1177   16.14   15.54
* see Table 6
Another distinctive feature in the two corpora is the use of
TFr (Text-organizer: Framing). The KLC has a lower
frequency for this category than LOCNESS. Korean learners
might have great difficulties in structuring topics or
comments (e.g., in addition to the). For this reason, they tend
to adopt a direct approach to organize and introduce what
they want to say (e.g., in this essay I).
From the perspective of diversity, each category's TTR
shows a marked difference between the two corpora. In
particular, RQ (Referential: Quantifying bundles, ①) and TFr
have different TTRs (see Figure 4).
① Referential (Quantifying bundles)
A lot of students choose majors what will lead to jobs for
life and will make a lot of money over what suit them
(KLC: 345).
Figure 4 Relative Frequencies of TTR
Even though the frequency of RQ tokens in the KLC is higher
than that of LOCNESS, Korean learners use limited types of
4-grams such as a lot of students, a lot of people, a lot of
information. They tend to use the typical pattern "a lot of
+ plural countable noun/uncountable noun". However, native
learners have various types of RQ in their essays such as
the rest of the, a large number of, the amount of money, a
great deal of. Another distinctive feature is the TTR
of TFr. LOCNESS has a higher frequency of
tokens, but its TTR for TFr is much lower than that of the KLC.
The higher TTR in the KLC is likely to be
related to the low frequency of tokens. Since Korean
learners' raw frequency for TFr is too low, their high
TTR may not represent genuine variety in their use of
the category.
5. Conclusion and Implications
The descriptive patterns for Korean learners and native
learners show how they use 4-grams differently in this
comparative study. In terms of grammatical category,
NP-based 4-grams are the most common category used in
their essays. However, Korean learners also overuse the
category of NV (NP+Verb), unlike native learners who use
PP (Prepositional Phrase) as the second most common
category. In terms of functional category, a distinctive feature
is that Korean learners overuse the category of Stance and
underuse Text-organizers. In other words, they concentrate
on demonstrating their attitude or evaluation in their
argumentative essays. However, this emphasis does not seem
to improve the quality of their essays, because the
format of their essays is not well organized with regard to
introducing, framing, contrasting and focusing on what they
want to argue.
This kind of difference reveals pedagogically useful
information in that Korean learners use 4-grams differently,
and the use of 4-grams is considered one of the parameters for
determining native-like proficiency. One issue in this regard is whether
teachers need to teach the formulaic patterns of language to
encourage learners to use them. According to Cortes (2004),
unconscious learning of n-grams is not very helpful to EFL
learners. In other words, they need to learn n-grams
consciously but, in one sense, it is doubtful whether
memorizing a long list of 4-grams in a simplistic manner is effective.
Instead, native speakers' typical patterns of n-grams can be
acquired through EFL learners' noticing (Conzett, 2000;
Cortes, 2004). From the viewpoint of pedagogy, noticing
needs careful application to maximise learners' subsequent
language development due to the complexity of cognitive
processes underlying it (Hong, 2010). For this reason, EFL
learners are required to raise their awareness of how
formulaic language is used differently in the area of their
target language in a proper manner. Appropriate methodology
for teaching and learning n-grams is necessary, and this also
should contribute to learners' creative use on the basis of
their acquisition of recurrent n-grams.
This study has some limitations in terms of the following
aspects. First, extracting 4-grams from the corpora is based
on the cut-off point (frequency of 4 and at least 3 texts). In
this regard, since there is no reasonable consensus in the
literature, the point adopted in the study is tentative.
However, the study has tried to adopt it as conservatively as
possible. The level of the cut-off point can be determined
according to research questions, but a common standard is
necessary for contrastive analysis among other types of corpora.
Second, the taxonomy of grammatical and functional categories is
still problematic. In order to maximise validity and reliability in
the classification, the study followed two steps to check it.
As with the first limitation, it would be reasonable to set a
standard for classification. Third, the study needs more
investigation of learners' psychological aspects in terms of
using 4-grams to explain why a particular type of formulaic
language is preferred. Furthermore, it could be a future
research question to explore EFL learners' use of n-grams
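
The role of the cut-off point mentioned in the first limitation (a frequency of 4 and a dispersion of at least 3 texts) can be illustrated with a minimal sketch. It is not the WordSmith procedure used in the study: the function name extract_ngrams, the simple lowercase tokenisation, and the decision to let n-grams run across sentence boundaries are assumptions made only for illustration.

```python
import re
from collections import Counter


def extract_ngrams(texts, n=4, min_freq=4, min_texts=3):
    """Return {ngram: (frequency, number_of_texts)} for n-grams
    that meet both the frequency and the dispersion cut-off.

    The default thresholds follow the cut-off point reported in the
    text; the tokenisation below is an illustrative assumption.
    """
    freq = Counter()        # total occurrences across all texts
    dispersion = Counter()  # number of texts containing the n-gram

    for text in texts:
        # Assumed tokenisation: lowercase words, apostrophes kept.
        tokens = re.findall(r"[a-z']+", text.lower())
        grams = [" ".join(tokens[i:i + n])
                 for i in range(len(tokens) - n + 1)]
        freq.update(grams)
        dispersion.update(set(grams))  # count each text only once

    return {g: (freq[g], dispersion[g])
            for g in freq
            if freq[g] >= min_freq and dispersion[g] >= min_texts}
```

Lowering or raising min_freq and min_texts changes which 4-grams survive, which is why a cut-off point chosen for one corpus may not transfer directly to corpora of a different size or type.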
References
Ädel, A. & Erman, B. 2012. Recurrent word combinations in
academic writing by native and non-native speakers of
English: A lexical bundles approach. English for Specific
Purposes, 31, 81-92.
Altenberg, B. 1998. On the phraseology of spoken English: The
evidence of recurrent word-combinations. In A. P. Cowie
(Ed.), Phraseology: Theory, analysis, and application,
101-122. Oxford: Oxford University Press.
Bamberg, B. 1983. What makes a text coherent? College
Composition and Communication, 34(4), 417-429.
Biber, D. 2006. University language: A corpus-based study of
spoken and written registers. Amsterdam: John Benjamins.
Biber, D., Conrad, S., & Cortes, V. 2004. If you look at...: Lexical
bundles in university teaching and textbooks. Applied
Linguistics, 25, 371-405.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. 1999.
Longman grammar of spoken and written English. London:
Longman.
Bresnan, J. 1999. Linguistic theory at the turn of the century.
Plenary address to the 12th World Congress of Applied
Linguistics. Tokyo, Japan.
Chen, Y.-H. & Baker, P. 2010. Lexical bundles in L1 and L2
academic writing. Language Learning and Technology, 14(2),
30-49.
Cheng, W., Greaves, C. & Warren, M. 2006. From n-gram to
skipgram to concgram. International Journal of Corpus
Linguistics, 11(4), 411-433.
Cheng, W., Greaves, C., Sinclair, J. M. & Warren, M. 2008.
Uncovering the extent of the phraseological tendency:
Towards a systematic analysis of concgrams. Applied
Linguistics, 30(2), 236-252.
Conzett, J. 2000. Integrating collocation into a reading and writing
course. In M. Lewis (Ed.), Teaching collocation, 70-87. Hove:
Language Teaching Publications.
Cortes, V. 2004. Lexical bundles in published and student
disciplinary writing: Examples from history and biology.
English for Specific Purposes, 23, 397-423.
Cortes, V. 2006. Teaching lexical bundles in the disciplines: An
example from a writing intensive history class. Linguistics
and Education, 17, 391-406.
Cowie, A. P. 1998. Introduction. In A. P. Cowie (Ed.), Phraseology:
Theory, analysis, and application, 1-20. Oxford: Oxford
University Press.
De Cock, S., Granger, S., Leech, G. & McEnery, T. 1998. An
automated approach to the phrasicon of EFL learners. In S.
Granger (Ed.), Learner English on computer, 67-79. London:
Longman.
Granger, S. 1998a. Prefabricated patterns in advanced EFL writing:
Collocations and formulae. In A. Cowie (Ed.), Phraseology:
Theory, analysis, and application, 145-160. Oxford: Oxford
University Press.
Granger, S. (Ed.). 1998b. Learner English on computer. London:
Longman.
Granger, S. & Meunier, F. (Eds.). 2008. Phraseology: An
interdisciplinary perspective. Amsterdam: John Benjamins.
Gries, S. T. 2008. Phraseology and linguistic theory: A brief survey.
In S. Granger & F. Meunier (Eds.), Phraseology: An
interdisciplinary perspective, 3-25. Amsterdam & Philadelphia:
John Benjamins.
Herriman, J. & Aronsson, M. B. 2009. Themes in Swedish advanced
learners' writing in English. In K. Aijmer (Ed.), Corpora and
language teaching, 101-120. Amsterdam: John Benjamins.
Hoey, M. 2004. A world beyond collocation: New perspectives on
vocabulary teaching. In M. Lewis (Ed.), Corpora and language
learners. Amsterdam: John Benjamins.
Hong, S. C. 2010. EFL learners' consciousness-raising through a
corpus-based approach. English Teaching, 65(1), 57-86.
Hong, S. C. 2012. An n-gram analysis of maritime English. The
Journal of Linguistic Science, 61(2), 283-328.
Howarth, P. 1998. The phraseology of learners' academic writing. In
A. Cowie (Ed.), Phraseology: Theory, analysis, and
application, 161-186. Oxford: Oxford University Press.
Hyland, K. 2008. As can be seen: Lexical bundles and disciplinary
variation. English for Specific Purposes, 27, 4-21.
Juknevičienė, R. 2009. Lexical bundles in learner language:
Lithuanian learners vs. native speakers. Kalbotyra, 61(3),
61-72.
Liu, D. 2012. The most frequently used multi-word constructions in
academic written English: A multi-corpus study. English for
Specific Purposes, 31, 25-35.
Pawley, A. & Syder, F. H. 1983. Two puzzles for linguistic theory:
Nativelike selection and nativelike fluency. In J. C. Richards
& R. W. Schmidt (Eds.), Language and communication,
191-230. London: Longman.
Scott, M. 2010. WordSmith Tools (version 5.0) [Computer software].
Oxford: Oxford University Press.
Sinclair, J. 1991. Corpus, concordance, collocation. Oxford: Oxford
University Press.
Stubbs, M. 2007. An example of frequent English phraseology:
Distribution, structures and functions. In R. Facchinetti (Ed.),
Corpus Linguistics 25 years on, 89-105. Amsterdam: Rodopi.
Wray, A. 2008. Formulaic language: Pushing the boundaries. Oxford:
Oxford University Press.
Appendix
1. Crime does not pay.
2. The prison system is outdated. No civilised society should punish
its criminals: it should rehabilitate them.
3. Most university degrees are theoretical and do not prepare
students for the real world. They are therefore of very little
value.
4. A man/woman's financial reward should be commensurate with
their contribution to the society they live in.
5. The role of censorship in our society.
6. Marx once said that religion was the opium of the masses. If he
was alive at the end of the 20th century, he would replace
religion with television.
7. All armies should consist entirely of professional soldiers: there
is no value in a system of military service.
8. The Gulf War has shown us that it is still a great thing to fight
for one's country.
9. Feminists have done more harm to the cause of women than
good.
10. In his novel Animal Farm, George Orwell wrote "All men are
equal: but some are more equal than others". How true is this
today?
11. In the words of the old song "Money is the root of all evil".
12. Europe.
13. In the 19th century, Victor Hugo said: "How sad it is to think
that nature is calling out but humanity refuses to pay heed." Do
you think it is still true nowadays?
14. Some people say that in our modern world, dominated by
science technology and industrialisation, there is no longer a place
for dreaming and imagination. What is your opinion?
Hong, Shinchul
Department of English Interpretation and Translation
Busan University of Foreign Studies
15 Seokpo-ro Nam-gu Busan, 608-738, Korea
Tel: 051-640-3726
Email: garstang@bufs.ac.kr