An N-Gram Analysis of Korean English Learners' Writing
Shinchul Hong
(Busan University of Foreign Studies)
1. Introduction
considered a useful means to measure learners' fluent
language use (Bamberg, 1983; Cortes, 2004; Howarth, 1998;
Hyland, 2008). The theoretical background to this issue is
that native speakers rely more on prefabricated word
sequences in their language use (Ädel & Erman, 2012).
Moreover, they tend to use many more fixed sequences of
words in spoken language than in written language (Pawley &
Syder, 1983). According to Bresnan (1999), speech is based
on prefabricated word formulas which are easily retrieved
from long-term memory, so speakers can use them in
real-time communication. An important feature here is that
native speakers are likely to have a unique way of using
formulaic language in a particular register. Hyland (2008)
describes it as naturalness which indicates the identity of
their language community. For this reason, ESL/EFL learners
need to acquire this kind of naturalness in order to achieve
communicative competence. In spite of its importance,
ESL/EFL teachers often do not have a sufficient
understanding of the mechanisms of formulaic language, so
learners may have difficulties in making appropriate
grammatical and lexical choices (Howarth, 1998).
Fortunately, the development of corpus linguistics has
encouraged researchers to investigate collocational patterns
of continuous word sequences in various areas (Biber,
Johansson, Leech, Conrad & Finegan, 1999; Hong, 2012).
Furthermore, it is quite easy to extract continuous sequences
of n words from corpora using user-friendly concordancers
(e.g., WordSmith 5.0; AntConc 3.2.1). A continuous sequence
of n words is referred to here as an n-gram, such as in order to
(3-gram) or in the long run (4-gram) (for more details, see
Section 2.1). For this reason, a variety of lexical patterns
can be explored from collocations of one or two words to
n-grams. Even though n-gram analyses provide pedagogically
useful information and enhance teachers' understanding of
their distinctive patterns, learners still have great difficulties
in their production (Cortes, 2006). This phenomenon seems
to indicate that simplistic presentations of n-grams do not
guarantee their appropriate use and acquisition. According to
Howarth (1998), even advanced non-native writers may fail
to communicate their subject matter efficiently and
effectively, not because of any academic weakness,
but because of their incomplete command of the language. Even though
corpus-based collocations have had a great impact on
syllabus design and approaches (e.g., data-driven learning,
lexical approach), it is necessary to emphasise pedagogically
useful n-grams so that learners can achieve awareness of a
certain sensitivity to appropriate disciplinary repertoires of
native speakers. In this regard, this study investigates
how Korean learners use 4-word
sequences (4-grams), focusing on grammatical structures and
four discourse functions: referential, text-organizer, stance
and other.
2. Literature Review
2.1 Terminology
One of the difficulties of researching continuous word
sequences is that different terms are used with more
or less similar definitions. Thus it is necessary to clarify
what an n-gram is in this study. In the literature, there are
about 10 different terms in common use:
phraseology (Cowie, 1998; Granger & Meunier, 2008),
formulaic sequences (Wray, 2008), lexical bundles (Biber et
al., 1999), n-grams (Stubbs, 2007; Hong, 2012), clusters
(Hyland, 2008), recurrent word combinations (Altenberg,
1998), phrasicons (De Cock, Granger, Leech, & McEnery,
1998), multi-word constructions (Liu, 2012), skipgrams
(Cheng, Greaves, & Warren, 2006) and concgrams (Cheng,
Greaves, Sinclair, & Warren, 2008). In a broad sense, these
10 terms can be divided into three groups (see
Table 1).
consideration of the number of grammatical elements or
semantic units. One reason for using the term n-gram rather than
phraseology in this study is that n-grams can be easily
extracted from corpus data with a concordancer, such as
WordSmith 5.0. Furthermore, n-grams can capture
phraseological patterns if the value of n is increased.
4-grams are commonly analysed because four is
considered to be the optimum value of n (for
more details, see Section 3.3). Second, an n-gram has
flexible compositionality, and its meaning can be
retrieved from its individual words (e.g., at the beginning of).
Third, an n-gram need not be a complete grammatical
unit: concordancers extract n-grams as
continuous sequences of n words without any consideration
of grammatical units. Fourth, an n-gram need not form a
single semantic unit. For example, the 4-gram theoretical
and do not does not function as a single unit of meaning.
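The purely positional nature of this extraction can be illustrated with a minimal Python sketch. It is not the procedure of WordSmith or any other concordancer; the tokenizer and the function names (tokenize, extract_ngrams) are assumptions made for illustration. The example sentence is essay topic 3 from the Appendix, which is where the 4-gram theoretical and do not comes from.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters, digits and apostrophes.

    This simple tokenizer is an assumption for illustration only.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def extract_ngrams(tokens: List[str], n: int = 4) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n tokens, ignoring grammatical units."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = ("Most university degrees are theoretical and do not "
            "prepare students for the real world.")
four_grams = extract_ngrams(tokenize(sentence), n=4)

# Each window is reported regardless of whether it is a grammatical
# or semantic unit, e.g. "theoretical and do not".
for gram in four_grams:
    print(" ".join(gram))
```

Because the window simply slides one word at a time, sequences that cross phrase boundaries are extracted alongside well-formed ones such as for the real world.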
different from those of the students in that their writing
shows a wider range of bundles.
Hyland (2008) analyses variations in multi-word
expressions (4-grams) across four different disciplines in
terms of form, structure and function. Three corpora are
used: research articles, Ph.D. dissertations and master's theses.
The results of the study demonstrate that each discipline has
its own characteristic types of bundles, and they support
those of other studies such as Cortes (2004) and Biber
(2006). In Hyland's study, bundles can play the role of
representing the identity of each discipline. For this reason,
in terms of EAP (English for Academic Purposes), learners
need to understand how lexical bundles are commonly used
in their subject matter.
The above studies show why formulaic language, bundles,
or n-grams are important to achieve natural language use.
Therefore, it is necessary to use bundles appropriately in the
classroom. Cortes (2006) examines the effects of teaching
multi-word combinations to university students taking a
history class. For this purpose, she analyses two sets of student writing
from the course: pre- and post-instruction. Interestingly,
the results show no significant difference
between the two. However, she maintains that the students' awareness
of using bundles increased. Furthermore, her study
emphasises the systematic exposure to target bundles on the
basis of the development of students' knowledge within their
specialised community.
In light of the results of the above study, the use of
n-grams needs to be taken into account in pedagogical
applications for EFL learners. As in the study of Cortes
(2006), a systematic approach is likely to be required to
achieve the naturalness which may characterize their target
language community. For this reason, the study will focus on
investigating the similarities and differences of Korean
learners' n-grams in their writing in terms of grammatical
patterns and discourse functions.
3. Method
Table 2 KLC Design Criteria
Feature           Category          Attribute
Language-related  Mode              Written
                  Genre             Academic essay
                  Style             Argumentative
                  Topic             Suggested topics (see Appendix)
Learner-related   Age range         20-30 years old
                  Level             Intermediate (undergraduates majoring in
                                    English language and/or literature)
                  Mother tongue     Korean
                  Learning context  EFL
Task-related      Data collection   Cross-sectional
                  Task setting      Untimed
                  Elicitation       Prepared
                  Technicality      Non-technical
criteria to extract 4-gram lists. There are two critical issues
in this regard. One is why 4-word sequences rather than
other word sequences (e.g., 3- or 5-grams) are used.
According to Cortes (2004), 4-grams are ideal in that they
include 3-grams, and have more tokens than 5-grams (over
10 times more frequent). Furthermore, according to Ädel and
Erman (2012), using 4-grams is conducive to a richer study in that
the results of a 4-gram analysis can be compared with those of other
studies. For this reason, 4-grams are commonly used in
studies of continuous word sequences (Ädel & Erman, 2012;
Chen & Baker, 2010; Cortes, 2004, 2006).
Another issue is the cut-off point in terms of frequency and
dispersion. Studies adopt
different criteria for the cut-off point (see Table 4). According to Ädel and Erman
(2012), the cut-off point is somewhat arbitrary. This study therefore
adopts the following cut-off point: 4 occurrences (frequency) and
3 texts (dispersion). On the basis of the studies presented in
Table 4, this cut-off point is conservative.
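The effect of such a cut-off point can be sketched in Python as follows. This is only an illustration under stated assumptions: the thresholds are taken to be inclusive and applied to raw counts (at least 4 occurrences in at least 3 different texts), the function names (ngrams, apply_cutoff) are hypothetical, and the toy corpus is invented rather than drawn from the KLC or LOCNESS.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Gram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int = 4) -> List[Gram]:
    """Return every contiguous window of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def apply_cutoff(texts: List[List[str]], n: int = 4,
                 min_freq: int = 4, min_texts: int = 3) -> Dict[Gram, int]:
    """Keep n-grams that meet both the frequency and the dispersion threshold.

    Assumption: both thresholds are inclusive and use raw counts.
    """
    freq: Counter = Counter()
    texts_with: Dict[Gram, Set[int]] = defaultdict(set)
    for text_id, tokens in enumerate(texts):
        for gram in ngrams(tokens, n):
            freq[gram] += 1                 # frequency: total occurrences
            texts_with[gram].add(text_id)   # dispersion: distinct texts
    return {gram: count for gram, count in freq.items()
            if count >= min_freq and len(texts_with[gram]) >= min_texts}


# Invented toy corpus: four tokenized "texts".
toy_corpus = [
    "on the other hand i think that".split(),
    "on the other hand it is true".split(),
    "on the other hand on the other hand".split(),
    "i think that i think that".split(),
]
kept = apply_cutoff(toy_corpus)
print(kept)
```

In the toy corpus only on the other hand survives, since it occurs four times across three texts; every other 4-gram fails at least one of the two thresholds.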
Table 5 Grammatical Categories
Grammatical category       Example
Noun Phrase (NP)           a high school student
Preposition+NP (PP)        at the same time
Passive (PA)               have been interested in
Anticipatory-it (Ant-it)   it is important to
Be+NP/AP/PP (BE)           is one of the
Verb+NP/PP (VN)            have the right to
Modal (MO)                 would like to go
-ing (ING)                 getting a good job
To-infinitive (TO)         to be a teacher
Conjunction (CO)           when I was a
NP/Pronoun+Verb (NV)       I do not think
Others (Other)*            alive at the end
* Others: 4-grams which cannot be categorised into the other categories.
Referential      Place markers (RP)                            the center of the
                 Descriptive bundles (RD)                      The root of the
                 Quantifying bundles (RQ)                      A large number of
Text organizers  Contrast/Comparison inferences (TC)           On the other hand
                 Focus (TFo)                                   In my case I
                 Framing (TFr)                                 In addition to the
                 Topic introduction (TT)                       In this essay I
Stance           Epistemic-impersonal/Probable-possible (SE)   I think that this
                 Obligatory/directive (SO)                     do not have to
                 Ability (SA)                                  it is difficult to
Other            OTHER                                         to go to the
4. Results
Type     510 (0.35%)**    144,521    420 (0.25%)      166,776
Token    3,470 (2.26%)    152,926    2,833 (1.62%)    174,696
TTR***   14.69            94.50      14.82            95.46
* Cut-off point: frequency of 4 in at least 3 texts
** 0.35% = 510/144,521*100
*** TTR = type/token*100
Since the two corpora have the same design criteria, the
difference suggests that Korean learners use more
4-grams. In other words, they are likely to depend more on
recurrent formulaic sequences than native-speaker
learners do. The relative frequencies of tokens with the cut-off point
are 2.26% (KLC) and 1.62% (LOCNESS). The results for types
show a similar pattern to those for tokens. The TTR (Type-Token
Ratio) also indicates that the two corpora have similar
patterns in terms of variety. In a broad sense, the Korean
learners' list of 4-grams is not likely to be much different
from that of the native learners. Henceforth, all analyses in
this study are based on the lists of 4-grams with the cut-off
point.
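The percentages and ratios reported above follow directly from the two formulas given in the table notes (relative frequency = part/whole*100; TTR = type/token*100). The short Python sketch below recomputes them. The dictionary keys are hypothetical labels, and reading the second figure in each pair as the total without the cut-off point is an assumption based on the note 0.35% = 510/144,521*100. The last decimal place may differ from the printed values depending on whether figures are rounded or truncated.

```python
def percentage(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return part / whole * 100


# Counts copied from the table above; key names are illustrative only.
corpora = {
    "KLC": {"types_cutoff": 510, "tokens_cutoff": 3470,
            "types_all": 144521, "tokens_all": 152926},
    "LOCNESS": {"types_cutoff": 420, "tokens_cutoff": 2833,
                "types_all": 166776, "tokens_all": 174696},
}

for name, c in corpora.items():
    type_share = percentage(c["types_cutoff"], c["types_all"])
    token_share = percentage(c["tokens_cutoff"], c["tokens_all"])
    ttr_cutoff = percentage(c["types_cutoff"], c["tokens_cutoff"])
    ttr_all = percentage(c["types_all"], c["tokens_all"])
    print(f"{name}: types {type_share:.2f}%, tokens {token_share:.2f}%, "
          f"TTR (cut-off) {ttr_cutoff:.2f}, TTR (all) {ttr_all:.2f}")
```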
① Noun Phrase
Friendship is one of the most important relationships that
we build in our lives (KLC: 14).
② Prepositional Phrase
Another rule is at the end of the race, come in the pit
slowly and do not hit the person in front of you
(LOCNESS: 68)
③ Noun/Pronoun + Verb
I do not think that dispatch of troops is the decision for
world peace. (KLC: 47)
Figure 1 Relative Frequencies of Type and Token
④ Passive
I have been interested in volunteering since I translated
from Africa (KLC: 120).
⑤ Be + NP/AP/PP/-ing
There are some famous coastal towns such as Cancun
which is one of the greatest museums in the world and it
is considered as a Holy Land for the tourists of the city
(KLC: 239).
His belief is that everything is for the best, even the death
of two hundred thousand in the earthquake at Lisbon is
deemed as God's will... (LOCNESS: 101).
different patterns in the categories of Stance (①) and
Text-organizer (②). In particular, the KLC has a lower
frequency of Text-organizer and a higher frequency of
Stance than LOCNESS.
① Stance
But I 'd like to introduce two things that I think it's more
special than others (KLC: 121)
② Text-organizer
On the other hand, Korean economy is located in the
middle of the ranking (KLC: 234)
These are concrete reasons and resulting consequences
that would come about as a result of televising executions
(LOCNESS: 152)
underuse the category of Text-organizers. In the case of
Stance, Korean learners overuse this category and this may
be related to their overemphasis on expressing their
opinions. Since the essay topics are argumentative, they may
deliver their ideas or comments as clearly as possible. In a
study by Herriman and Aronsson (2009), non-native speakers
tend to overuse pseudo-clefts with a first person pronoun
(e.g., what I think, I think that) when they make a comment.
On the basis of the above two results, it is possible to infer
the pattern of Korean learners when they write an essay.
They tend to focus on presenting their own ideas, while the
structure of their essays remains weak in terms of discourse
function.
In Table 9, a more specific distribution of functional
categories in Korean learners' essays is presented. From the
viewpoint of type, the KLC has the highest frequency of SO
(Stance: Obligatory/directive), which describes writers' attitudes,
whereas LOCNESS has the highest frequency of RD (Referential: Descriptive bundles)
(excluding the category of Other). Moreover, the frequencies
for token show a similar pattern. Korean learners'
overuse of SO (e.g., I want to do, I do not have, I don't want
to) may be related to oversimplification: they describe their
attitudes to propositions with a few expressions instead of using a variety of them. In
other words, since Korean learners may not know various
expressions to describe their opinions, they are likely to
prefer a couple of fixed expressions. However, native
learners may adopt a functionally different strategy when
they argue about a certain topic. They often use descriptive
bundles (e.g., as a symbol of, as an example of, and the use
of) rather than a typical way of making comments, such
as the category of Stance.
Category*        Code    Type          Token           TTR
                         KLC    LOC    KLC     LOC     KLC      LOC
Referential      RT       45     13     387     163    11.62     7.97
                 RP       25     20     175     171    14.28    11.69
                 RD       24     48     139     293    17.26    16.38
                 RQ       37     32     369     205    10.02    15.60
Text-organizer   TC       18     13     123     122    14.63    10.65
                 TFo      15      8      85      47    17.64    17.02
                 TFr       9     28      43     206    20.93    13.59
                 TT        4      7      19      35    21.05    20.00
Stance           SE       46     45     291     266    15.80    16.91
                 SO       67     18     480     111    13.95    16.21
                 SA       10      5      58      37    17.24    13.51
Other                    210    183    1301    1177    16.14    15.54
* see Table 6
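The token counts in Table 9 can be aggregated by functional group to show the contrast between Stance and Text-organizer bundles in the two corpora. The Python sketch below is one possible way of doing so and is not part of the original analysis; the data structure and the function name group_shares are assumptions, while the counts are copied from the table.

```python
from typing import Dict, Tuple

# Token counts from Table 9 as (KLC, LOCNESS) pairs.
table9_tokens: Dict[str, Dict[str, Tuple[int, int]]] = {
    "Referential": {"RT": (387, 163), "RP": (175, 171),
                    "RD": (139, 293), "RQ": (369, 205)},
    "Text-organizer": {"TC": (123, 122), "TFo": (85, 47),
                       "TFr": (43, 206), "TT": (19, 35)},
    "Stance": {"SE": (291, 266), "SO": (480, 111), "SA": (58, 37)},
    "Other": {"OTHER": (1301, 1177)},
}


def group_shares(table: Dict[str, Dict[str, Tuple[int, int]]]
                 ) -> Dict[str, Tuple[float, float]]:
    """Return each group's share (%) of all 4-gram tokens per corpus."""
    totals = {
        group: (sum(klc for klc, _ in subs.values()),
                sum(loc for _, loc in subs.values()))
        for group, subs in table.items()
    }
    klc_total = sum(klc for klc, _ in totals.values())
    loc_total = sum(loc for _, loc in totals.values())
    return {group: (klc / klc_total * 100, loc / loc_total * 100)
            for group, (klc, loc) in totals.items()}


shares = group_shares(table9_tokens)
for group, (klc_pct, loc_pct) in shares.items():
    print(f"{group:15s} KLC {klc_pct:5.1f}%   LOCNESS {loc_pct:5.1f}%")
```

The group totals sum to the overall token counts with the cut-off point (3,470 for the KLC and 2,833 for LOCNESS), so the shares are directly comparable across the two corpora.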
Figure 4 Relative Frequencies of TTR
5. Conclusion and Implications
processes underlying it (Hong, 2010). For this reason, EFL
learners need to raise their awareness of how
formulaic language is properly used in different areas of their
target language. An appropriate methodology
for teaching and learning n-grams is necessary, and it
should also contribute to learners' creative use of language on the basis of
their acquisition of recurrent n-grams.
This study has some limitations.
First, the extraction of 4-grams from the corpora is based
on the cut-off point (a frequency of 4 in at least 3 texts).
Since there is no consensus on this point in the
literature, the cut-off point adopted in the study is tentative.
However, the study has tried to set it as conservatively as
possible. The level of the cut-off point can be determined
according to research questions, but a common standard is necessary for
contrastive analysis across different types of corpora. Second,
the taxonomy of grammatical and functional categories is still
problematic. In order to maximise validity and reliability in
the classification, the study followed two steps to check it.
As with the first limitation, it would be reasonable to set a
standard for classification. Third, the study needs more
investigation of learners' psychological aspects in terms of
using 4-grams to explain why a particular type of formulaic
language is preferred. Furthermore, it could be a future
research question to explore EFL learners' use of n-grams
References
101-122. Oxford: Oxford University Press.
Bamberg, B. 1983. What makes a text coherent? College
Composition and Communication, 34(4), 417-429.
Biber, D. 2006. University language: A corpus-based study of
spoken and written registers. Amsterdam: Benjamins.
Biber, D., Conrad, S., & Cortes, V. 2004. If you look at...: Lexical
bundles in university teaching and textbooks. Applied
Linguistics, 25, 371-405.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. 1999.
Longman grammar of spoken and written English. London:
Longman.
Bresnan, J. 1999. Linguistic theory at the turn of the century.
Plenary address to the 12th World Congress of Applied
Linguistics. Tokyo, Japan.
Chen, Y.-H. & Baker, P. 2010. Lexical bundles in L1 and L2
academic writing. Language Learning and Technology, 14(2),
30-49.
Cheng, W., Greaves, C., & Warren, M. 2006. From n-gram to
skipgram to concgram. International Journal of Corpus
Linguistics, 11(4), 411-433.
Cheng, W., Greaves, C., Sinclair, J. M., & Warren, M. 2008.
Uncovering the extent of the phraseological tendency:
Towards a systematic analysis of concgrams. Applied
Linguistics, 30(2), 236-252.
Conzett, J. 2000. Integrating collocation into a reading and writing
course. In M. Lewis (Ed.). Teaching collocation, 70-87. Hove:
Language Teaching Publications.
Cortes, V. 2004. Lexical bundles in published and student
disciplinary writing: Examples from history and biology.
English for Specific Purposes, 23, 397-423.
Cortes, V. 2006. Teaching lexical bundles in the disciplines: An
example from a writing intensive history class. Linguistics
and Education, 17, 391-406.
Cowie, A. P. 1998. Introduction. In A. P. Cowie (Ed.), Phraseology:
Theory, analysis, and application, 1-20. Oxford: Oxford
University Press.
De Cock, S., Granger, S., Leech, G. & McEnery, T. 1998. An
automated approach to the phrasicon of EFL learners. In S.
Granger (Ed.), Learner English on computer, 67-79. London:
Longman.
Granger, S. 1998a. Prefabricated patterns in advanced EFL writing:
Collocations and formulae. In A. Cowie (Ed.), Phraseology:
Theory, analysis, and application, 145-160. Oxford: Oxford
University Press.
Granger, S. (Ed.). 1998b. Learner English on computer. London:
Longman.
Granger, S. & Meunier, F. (Eds.). 2008. Phraseology: An
interdisciplinary perspective. Amsterdam: John Benjamins.
Gries, S. T. 2008. Phraseology and linguistic theory: A brief survey.
In S. Granger, & F. Meunier (Eds.), Phraseology: An
interdisciplinary perspective, 3-25. Amsterdam & Philadelphia:
John Benjamins.
Herriman, J. & Aronsson, M. B. 2009. Themes in Swedish advanced
learners' writing in English. In K. Aijmer (Ed.), Corpora and
language teaching, 101-120. Amsterdam: John Benjamins.
Hoey, M. 2004. A world beyond collocation: New perspectives on
vocabulary teaching. In M. Lewis (Ed.), Corpora and language
learners. Amsterdam: John Benjamins.
Hong, S. C. 2010. EFL learners' consciousness-raising through a
corpus-based approach. English Teaching, 65(1), 57-86.
Hong, S. C. 2012. An n-gram analysis of maritime English. The
Journal of Linguistic Science, 61(2), 283-328.
Howarth, P. 1998. The phraseology of learner's academic writing. In
A. Cowie (Ed.), Phraseology: Theory, analysis, and
application, 161-186. Oxford: Oxford University Press.
Hyland, K. 2008. As can be seen: Lexical bundles and disciplinary
variation. English for Specific Purposes, 27, 4-21.
Juknevičienė, R. 2009. Lexical bundles in learner language:
Lithuanian learners vs. native speakers. Kalbotyra, 61(3),
61-72.
Liu, D. 2012. The most frequently used multi-word constructions in
academic written English: A multi-corpus study. English for
Specific Purposes, 31, 25-35.
Pawley, A. & Syder, F. H. 1983. Two puzzles for linguistic theory:
Nativelike selection and nativelike fluency. In J. C. Richards
& R. W. Schmidt (Eds.), Language and communication,
191-230. London: Longman.
Scott, M. 2010. WordSmith Tools (version 5.0) [Computer software].
Oxford: Oxford University Press.
Sinclair, J. 1991. Corpus, concordance, collocation. Oxford: Oxford
University Press.
Stubbs, M. 2007. An example of frequent English phraseology:
Distribution, structures and functions. In R. Facchinetti (Ed.),
Corpus linguistics 25 years on, 89-105. Amsterdam: Rodopi.
Wray, A. 2008. Formulaic language: Pushing the boundaries. Oxford:
Oxford University Press.
Appendix
1. Crime does not pay.
2. The prison system is outdated. No civilised society should punish
its criminals: it should rehabilitate them.
3. Most university degrees are theoretical and do not prepare
students for the real world. They are therefore of very little
value.
4. A man/woman's financial reward should be commensurate with
their contribution to the society they live in.
5. The role of censorship in our society.
6. Marx once said that religion was the opium of the masses. If he
was alive at the end of the 20th century, he would replace
religion with television.
7. All armies should consist entirely of professional soldiers: there
is no value in a system of military service.
8. The Gulf War has shown us that it is still a great thing to fight
for one's country.
9. Feminists have done more harm to the cause of women than
good.
10. In his novel Animal Farm, George Orwell wrote "All men are
equal: but some are more equal than others". How true is this
today?
11. In the words of the old song "Money is the root of all evil".
12. Europe.
13. In the 19th century, Victor Hugo said: "How sad it is to think
that nature is calling out but humanity refuses to pay heed." Do
you think it is still true nowadays?
14. Some people say that in our modern world, dominated by
science, technology and industrialisation, there is no longer a place
for dreaming and imagination. What is your opinion?
Hong, Shinchul
Department of English Interpretation and Translation
Busan University of Foreign Studies
15 Seokpo-ro Nam-gu Busan, 608-738, Korea
Tel: 051-640-3726
Email: garstang@bufs.ac.kr