
Annotating particle realization and ellipsis in Korean

2012, Proceedings of the Sixth Linguistic Annotation Workshop

We present a novel scheme for annotating the realization and ellipsis of Korean particles. The annotated data comprise 100,128 Ecel (a space-based word unit) from spoken and written corpora, each composed of four different genres, so that we can evaluate how register variation contributes to Korean particle ellipsis. Identifying the grammatical functions of particles and zero particles is critical for deriving a valid linguistic analysis of argument realization, for semantic and discourse analysis, and for computational parsing. The primary challenge is to design a reliable scheme for classifying particles while making a clear distinction between ellipsis and non-occurrence. We discuss in detail the issues involved in particle annotation and present solutions. In addition to providing a statistical analysis and outcomes, we briefly discuss linguistic factors involved in particle ellipsis.

Sun-Hee Lee
Wellesley College
Wellesley, MA 02481, U.S.A.
slee6@wellesley.edu

Jae-young Song
Yonsei University
Seoul, Korea
jysong@yonsei.ac.kr

1 Introduction

In Korean, the grammatical function of a nominal is represented by a morphologically attached postpositional particle. Particles carry a wide range of linguistic information: grammatical relations (subject, object), semantic roles (Agent, Patient, Location, Instrument, etc.), and discourse/pragmatic properties, as with topic markers, delimiters, and auxiliary particles, as well as conjunctions. Due to their complex linguistic functions, particles are one of the most rigorously investigated topics in Korean linguistics.

(1) onul-un   Mina-ka  kyosil-eyse   cemsim-ul  mek-e.
    today-TOP M-SUBJ   classroom-in  lunch-OBJ  eat-END
    ‘Today, Mina eats lunch in the classroom.’

In example (1), the particle ka indicates subjecthood and ul indicates objecthood.1
The subject particle ka also marks Agent (a semantic role); the locative particle eyse combines with a nominal referring to Location; un marks topichood in the given discourse, and so on. In spite of their linguistic function (representing the grammatical relations of subject and object), these particles frequently disappear, particularly in spoken Korean (Hong et al., 1998; Kim and Kwon, 2004; Lee and Thompson, 1989; Lee, 2006, 2008). Previous studies have mainly focused on case particles and suggested that register variation is the key factor in particle ellipsis. However, few studies have comprehensively examined both spoken and written data with specific annotation features and guidelines.

1 The subject and object particles have the phonological variants i and lul, respectively.

Using balanced spoken and written data, this paper explores the realization of all particles and the ellipsis of case particles, including subject and object case. In order to test the effect of register variation on particle realization, we designed balanced corpora that each include four different styles. The spoken corpora include everyday conversations, informal monologues (story-telling), TV debates, and lectures/speeches; the written corpora include personal essays, novels, news articles, and academic papers.

Categorizing particles requires a well-articulated classification. Particles have complex grammatical functions, and it is difficult to determine whether a missing particle is a case of ellipsis or of non-occurrence. We discuss these challenges in the context of developing a novel annotation scheme and guidelines. We examine particle ellipsis patterns across registers, as well as the semantic and pragmatic factors that trigger particle ellipsis.

Proceedings of the 6th Linguistic Annotation Workshop, pages 175–183, Jeju, Republic of Korea, 12-13 July 2012. © 2012 Association for Computational Linguistics
2 Relevant Background

Within theoretical linguistics, Korean particles have been classified according to three distinct linguistic functions: case particles, auxiliary particles, and conjunctive particles (Nam, 2000; Lee, 2006).2 A case particle combines with an argument or adjunct nominal and specifies the grammatical relation and semantic role of the nominal within the argument structure of a predicate. In contrast to case particles, auxiliary particles are not based on the grammatical relation between a nominal and a predicate; they introduce extra semantic and discourse interpretations. This category includes topic markers and delimiters, as well as other particles with diverse lexical meanings. In addition, there are conjunctive particles that attach to nominals and connect them to the following ones.

2 Although particles combine with nominals, they sometimes follow a verbal phrase or a sentence, adding semantic and pragmatic meanings of honorification, focus, etc. Some researchers assign these particles to a special category (Nam, 2000). In this study, we only examine particles combining with nominals and not with phrasal or sentential categories.

Identifying the diverse functions of particles is important for syntactic, semantic, and discourse analyses in Korean. When a particle is elided, recovering the information behind the missing particle is essential for determining accurate grammatical relations, which is a prerequisite for computational parsing, discourse analysis, machine translation, etc. However, the recovery process for missing particles does not include auxiliary particles as candidates, due to their unpredictable distributions; auxiliary particles have their own discourse and pragmatic meanings, and their distributions over nominals are not restricted by grammatical relations with predicates.

On the other hand, the validity of recovering a missing particle into its original form can itself be questioned; it has been argued in the literature that zero marking is the unmarked option and that there is no ellipsis or deletion of particles (Lee and Thompson, 1989; Fujii and Ono, 2000, inter alia). However, whether a particle is deleted or originates as a zero form, a missing particle corresponds to a particular case particle, and identifying it is crucial for determining the grammatical and semantic function of the bare nominal.

With respect to particle ellipsis in Korean and also Japanese, most previous research has focused on subject and object particles. There have been contradictory reports on the dropping rates of these particles. Whereas Kwon (1989) and Hong et al. (1998) report a higher dropping rate for subject particles, Kim and Kwon (2004) and Lee (2006) argue for a higher dropping rate for object case markers in colloquial Korean. Among these studies, Hong et al. (1998) analyze different radio shows with a total time span of 60 minutes, and Lee (2006) analyzes the Call Friend Korean (CFK) corpus of telephone speech. Even disregarding the small data size (the former with fewer than 2000 noun phrases and the latter with 1956 overtly expressed subject and object NPs), the statistical results are less than convincing given the lack of a specific annotation scheme and guidelines. For example, Hong et al. (1998) include nominals with some topic markers or delimiters as tokens of case marker ellipsis. However, as mentioned in Lee (2008), these cases need to be excluded from the list of case ellipsis because subject and object particles are morphologically restricted from co-occurring with auxiliary particles in Korean. Although Lee (2008) excludes optional occurrences of object particles in light verb constructions, it is not clear how non-occurrences of particles were separated from ellipsis of particles in that corpus study without specific guidelines.
In order to develop a more comprehensive analysis of case ellipsis, it is necessary to employ large data sets with different registers across spoken and written Korean, together with a well-established annotation scheme and guidelines.

3 The Data and Annotation Scheme

3.1 Data

We extracted 100,128 Ecel with morphological tagging from the Sejong Corpora to create balanced spoken and written corpora, each composed of four different registers with different degrees of formality. Approximately 2,000 Ecel were selected from each of 49 files to build the balanced corpora. Table 1 summarizes the composition of the data.

Registers          Type                             # of Files   Size
Spoken   Private   Everyday Conversations (E)       7            12,504
                   Monologues (M)                   6            12,502
         Public    TV Debates & Discussions (D)     6            12,547
                   Lectures & Speeches (L)          6            12,526
Written            Personal Essays (PE)             6            12,510
                   Novels (N)                       6            12,505
                   Newspaper Articles (P)           6            12,511
                   Academic Textbooks (A)           6            12,505

Table 1. Composition of Balanced Corpora

3.2 Annotation Scheme

In agglutinative languages like Korean, particles are attached to preceding nominals without spaces, and identifying the position of a particle requires accurate segmentation. Although we extracted data with morphological tags, the tags sometimes reflected errors in spacing, morpheme identification, segmentation, etc. Therefore, we manually corrected relevant errors in segmentation and morpheme tags before performing annotation. Using morpheme tags, we identified all the nominal categories in the corpora that can combine with particles, including all the nominals with and without particles. We annotated realized particles and determined their categories using the tag set in Figure 1. In addition, we selected four annotation features to mark up particle realization and ellipsis. The given tag set has been used to annotate both realized particles and missing particles. However, annotating missing particles presents challenges and requires a new annotation scheme.
Elided particles are recovered using the case particles, based upon the grammatical relations between a nominal and a predicate. The details are presented in the next section.

• Tag Set of Particles
  o Case Particles3:
      Subject (S): ka/i
      Subject Honorific (SH): keyse
      Object (O): ul/lul
      Genitive (G): uy
      Dative (D): ey/eykey (‘to’), hanthey (‘to’)
      Dative Honorific (DH): kkey (‘to’)
      Complement (C): ka/i
      Adverbial Case (B):
        Time (BT): ey (‘in, at’)
        Location (BL): ey (‘to’), eyse (‘from’)
        Instrument (BI): lo/ulo (‘with’)
        Direction (BD): lo/ulo (‘to, as’)
        Source (BS): eyse (‘from’), eykey(se) (‘from’), hanthey(se) (‘from’), pwuthe (‘from’), ulopwuthe (‘from’), eysepwuthe (‘from’)
        Goal (BG): ey (‘to’), kkaci (‘to’)
        Accompany (BA): wa/kwa (‘with’), hako (‘with’), ilang/lang (‘with’)
      Vocative (V): a/ya
      Comparative (R): pota (‘than’), mankhum (‘as~as’), etc.
  o Discourse/Modal:
      Topic (T): un/nun/n
      Auxiliary (A): to (‘also’), man (‘only’), mata (‘each’), pakkey (‘only’), chelem (‘like’), mankhum (‘as much as’), etc.
  o Conjunction (J): wa/kwa (‘and’), hako (‘and’), ina/na (‘or’), itunci/tunci (‘or’), ilang/lang (‘and’), etc.
• Annotation Features
      Realized Particle, Realized Particle Type
      Missing Particle, Missing Particle Type

Figure 1. Annotation Scheme of Particles

3 We focused on particles that directly follow nominals. Thus, particles that appear after verb phrases or sentences have been excluded from our tag set, including the direct quotation particles lako and hako.

3.3 Ellipsis vs. Non-Occurrence of Particles

As defined in Fry (2001), ellipsis is the phenomenon whereby a speaker omits an obligatory element of syntactic structure. However, there are at least three morpho-syntactic constructions in Korean where a particle does not need to be recovered because it is not obligatory in the given position. Our annotation distinguishes these optional non-occurrences from the particle ellipsis phenomenon and marks them separately.
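The four annotation features in Figure 1, together with the three-way distinction between a realized particle, an elided particle, and a non-occurrence, can be pictured as a small record type. The following is only a minimal sketch under our own assumptions: the names NominalAnnotation, Status, and PARTICLE_TYPES are hypothetical, only a handful of the Figure 1 tags are listed, and the bare-object and bare-nominal records are constructed for illustration rather than taken from the corpora.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    REALIZED = "realized"              # a particle is overtly attached
    ELLIPSIS = "ellipsis"              # an obligatory particle was omitted
    NON_OCCURRENCE = "non-occurrence"  # bare nominal, nothing to recover


# A few tags from Figure 1; this dictionary layout is an assumption.
PARTICLE_TYPES = {
    "S": "Subject",
    "O": "Object",
    "G": "Genitive",
    "BL": "Adverbial Case: Location",
    "T": "Topic",
    "J": "Conjunction",
}


@dataclass(frozen=True)
class NominalAnnotation:
    """One nominal with the four features: realized/missing particle and type."""
    nominal: str
    realized_particle: Optional[str] = None
    realized_type: Optional[str] = None
    missing_particle: Optional[str] = None
    missing_type: Optional[str] = None

    def __post_init__(self):
        # Reject type tags that are not in the (partial) tag set above.
        for tag in (self.realized_type, self.missing_type):
            if tag is not None and tag not in PARTICLE_TYPES:
                raise ValueError(f"unknown particle type: {tag}")

    @property
    def status(self) -> Status:
        if self.realized_particle is not None:
            return Status.REALIZED
        if self.missing_particle is not None:
            return Status.ELLIPSIS
        return Status.NON_OCCURRENCE


# Mina-ka from example (1): realized subject particle.
subject = NominalAnnotation("Mina", realized_particle="ka", realized_type="S")
# Constructed variant of example (1) with a bare object: ellipsis of ul.
bare_object = NominalAnnotation("cemsim", missing_particle="ul", missing_type="O")
# Constructed bare nominal for which no particle needs to be recovered.
bare_nominal = NominalAnnotation("yenghwa")
```

Keeping the status derived from the four features, rather than stored separately, means a record cannot claim ellipsis without naming the particle to be recovered.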
First, the occurrence of the genitive case uy is optional, depending on various syntactic and semantic relations between two nominals in Korean. For example, the genitive uy tends to disappear after a complement nominal of a verbal noun, e.g., yenghwa-uy/Ø chwalyeng (movie-GEN + filming) 'filming of a movie', whereas it appears after a subject nominal of a verbal noun, e.g., John-uy/*?Ø wusung (John-GEN + winning) 'John's winning'. Due to complex linguistic factors, there is still controversy regarding how to predict occurrences of the genitive case in Korean (Lee, 2005; Hong, 2009), and native speakers' intuitions on the positions of the dropped genitive particle and its recoverability vary.4 Therefore, we chose not to annotate the genitive particle uy when it does not occur, and we do not count particle ellipsis within a nominal phrase.

4 Although semantic change and lexical insertion can be used for identifying morphological compounds, it is still very difficult to distinguish nominal compounds from syntactic nominal complexes. Therefore, school grammars present some inconsistent distinctions. For example, wuli nala (we country) 'our country' is considered a single lexical word, a compound nominal, whereas the similar combination wuli kacok (we family) 'our family' is a complex NP composed of two separate nouns.

Second, particles are optional in light verb constructions, as mentioned in previous research (e.g., Lee and Thompson, 1989; Lee and Park, 2008). In Korean, the morphological combination of a Sino-Korean (or foreign-borrowed) verbal noun with the light verbs (LV) hata 'do', toyta 'become', and sikhita 'make' is very frequent, e.g., silhyen (accomplishment) + hata/toyta/sikhita 'to accomplish/to be accomplished/to make it accomplish', stheti (study) + hata 'to study', etc. In these light verb constructions, the subject particle i/ka or the object particle ul/lul can appear after the verbal noun, as in silhyen-ul hata (accomplishment-OBJ do), silhyen-i toyta (accomplishment-SBJ become), silhyen-ul sikhita (accomplishment-OBJ make), stheti-lul hata (study-OBJ do), etc. Realization of these case particles, however, is not mandatory and is even unnatural when the argument of the verbal noun appears in the same sentence, as in the following example.

(3) ?*John-i  kkum-ul    silhyen-ul          hayssta.
    J-NOM     dream-OBJ  accomplishment-OBJ  did
    'John accomplished his dream.'

Considering the morpho-syntactic unity of N+LV combinations as single predicates and the awkwardness of a realized particle after a verbal noun, we conclude that N+LV combinations do not involve case ellipsis.5 However, when these LV combinations include negation, the negative adverb intervenes between the verbal noun and the LV, and the particle i/ka or ul/lul follows the verbal noun. In those constructions, we exceptionally assume particle ellipsis. This decision affects the result of our corpus analysis due to the high frequency of LV combinations, particularly with respect to object particle ellipsis. In contrast, Lee and Thompson (1989) assume particle ellipsis in N+LV combinations unless there is another nominal with an object particle licensed in front of the verbal noun. Although we exclude particle ellipsis in light verb constructions, we separately mark up possible case realizations of LV combinations in order to measure the extent to which they affect the statistical results.

5 It is also arguable whether the realization of a particle after a verbal noun is based on the subcategorization feature of the light verb hata or toyta. In personal communication, some scholars have suggested that the realization of a particle after a verbal noun may be a case of insertion. On this argument, particle omission is not even possible for the LV constructions. This needs to be more thoroughly investigated by examining historical corpus data.

Third, optional particles frequently appear with bound nouns (or defective nouns) in Korean. Bound nouns are nominals that do not occur without being preceded by a demonstrative, an adnominal clause, or another noun; they include tey 'place', ttay 'time', swu 'way', ke(s) 'thing', cwul 'way', chek 'pretense', etc. Bound nouns are functionally limited with respect to neighboring constituents. For instance, the bound noun ttay 'time' only combines with a clause ending with the adnominal ending -(u)l, whereas hwu 'after' combines with a clause ending with -(u)n.6 In addition to this morpho-syntactic reliance on the preceding clause, many bound nouns form formulaic expressions with the following predicates (e.g., the bound noun swu 'way' only combines with the existential predicates issta 'exist' and epsta 'do not exist').

(4) hakkyo-eyse  kongpwuha-l  swu(-ka)    issta.
    school-at    study-REL    way(-NOM)   exist
    'It is possible to study at school.'

Considering that particles on bound nouns are frequently dropped and do not represent grammatical relations of bound nouns with respect to the predicate, we also exclude them as cases of ellipsis.7

6 For bound nouns in Korean, refer to Sohn (1999).

7 Classifiers belonging to bound nouns show interesting patterns of case particle realization in Korean; classifiers form morpho-syntactic combinations such as [Noun + Number + Classifier], e.g., sakwa han kay (apple one thing) 'one apple'. Normally, a case particle appears on the initial content noun or the final classifier (e.g., [sakwa-ka/lul han kay], [sakwa han kay-ka/lul]), or there is a copy of the case particle from the content noun (e.g., [sakwa-ka/lul han kay-ka/lul]). In this study, as long as there is one particle realized on either the content noun or the classifier, we do not count it as case ellipsis.

Spoken Corpora              E      M      D      L      Total
Particle Realization        2081   2853   3334   3672   11940
Predicate Nominals (P)      741    590    742    757    2830
Zero Particles
  Ellipsis                  843    395    237    185    1660
  Compounds (N)             320    297    350    411    1378
  Optional (E)              796    735    841    802    3174
  Light Verb (L)            308    190    482    410    1390
  Vocative (V)              24     3      6      20     53
Errors                      82     36     41     43     202

Written Corpora             PE     N      P      A      Total
Particle Realization        4707   4715   4603   4928   18953
Predicate Nominals (P)      593    600    393    612    2197
Zero Particles
  Ellipsis                  98     86     165    12     361
  Compounds (N)             406    104    1941   728    3179
  Optional (E)              996    1125   1492   712    4325
  Light Verb (L)            361    437    965    917    2680

Table 2. Grammatical Realization of the Nominal Category8

8 E: Everyday Conversations; M: Monologues; D: Debates; L: Lectures; PE: Personal Essays; N: Novels; P: Newspapers; A: Academic Texts

In addition to optional particles, we also note that some constructions mandatorily require non-occurrence of particles. We have already seen that the genitive particle is not allowed within nominal compounds, e.g., [palcen+Ø(*-uy) keyhwoyk+Ø(*-uy) pokose] 'development plan report'. In addition, some bound nouns form formulaic (or idiomatic) expressions with their neighboring words and do not combine with particles, e.g., kes-(*kwa) + kathta (thing-(*with) + similar) 'seem', ke-Ø + aniya (thing + isn't) 'isn't it?', N-Ø + ttaymwun (N + reason), etc. Particle omission can also be required by the lexical properties of nominals. For example, numbers belonging to the nominal category combine with subject or object particles as well as with other auxiliary and discourse particles (e.g., tases-un/-i/-ul 'five-TOP/SBJ/OBJ'). However, they cannot take any particle when followed by count bound nouns, e.g., tases-Ø + kay/salam/pen/kaci/... (five + items/people/sorts, etc.). Similarly, time nominals such as onul 'today', ecey 'yesterday', and nayil 'tomorrow' stand alone without particles as adverbial phrases, even though they combine with other particles in different syntactic positions. In contrast, time nominals such as onul achim 'this morning' and 2000 nyen 'year 2000' can stand alone but also combine with the time particle ey. These temporal occurrences of ey are considered optional.

In summary, optional and mandatory non-occurrence of particles restricted by morpho-syntactic and lexical constraints needs to be distinguished from the omission of obligatory particles. Therefore, we include the following features to annotate bare nominals that do not mandate recovery of particles.

E  - Non-occurrence of a particle based upon lexical or morpho-syntactic constraints.
N  - Non-occurrence of a particle after a nominal that forms a compound with the following nominal.
L  - Non-occurrence of a particle in light verb constructions.

In addition, nominals can be combined with the copula ita or appear at the end of a phrase or a sentence without the copula in Korean. These predicate nominals have been annotated separately from other nominals. When a nominal is repeated by mistake, with or without a particle, these erroneous nominals are separately marked and excluded from counts of particle realization and ellipsis. Separate features are given to handle these cases.

P  - Predicate nominals combining with the copula ita. It also marks a nominal standing alone without ita, as in an answering utterance.
ER - Errors, including a nominal repeated by mistake or an incomplete utterance.

3.4 Principles of Annotating Particle Omission and Inter-Annotator Agreement

Our annotation principles for missing particles are as follows:

• With respect to missing particles, we annotate only obligatory case particles and conjunctive particles while excluding discourse/modal particles. This captures the minimum needed for a particle prediction system.
• In the process of recovering elided forms, there are cases in which more than one particle could be correct.
  Instead of selecting a single best particle, we present a set of multiple candidates without preference ranking.
• Particle stacking is allowed in Korean. We annotate stacked particles as single units without separating them into smaller particles. However, their segmentation is specified under the annotation feature of realized particle type. Missing particles, however, exclude stacked particles. Most particle stacking includes a discourse/modal particle that adds its specific meaning to the attached nominals.

Based on our annotation scheme and guidelines, two experienced annotators manually annotated realized particles, missing particles, and their types on the spoken and written corpora separately and cross-examined each other's annotation. Difficult cases were singled out and discussed until the annotators reached agreement. In order not to inflate the values with words that do not take particles, we removed words that do not belong to the nominal categories (nouns, pronouns, bound nouns, and numbers). The realized particles were provided to the annotators with the morphological analysis. Thus, we computed the inter-annotator agreement on only the 466 nominals with no particles within 5000 Ecel (before cross-examination). The kappa statistic on the case ellipsis by the two annotators is 91.23% for the specific particles. The agreement rate is much higher than we expected, but it can be attributed to the annotation guidelines, which were clear and limited the recovery of particles to case particles, excluding auxiliary and discourse particles. The two annotators were highly trained, having over two years of experience with particle annotation tasks.

4 Corpus Analysis

Table 2 summarizes the results of particle annotation of all the nominals, and Table 3 focuses on particle realization and ellipsis. Table 2 shows all nominal realizations, with and without particles.
Zero particles include both particle ellipsis and bare nominals for which no particle is recovered: nominals that do not require particles because they are components of compound nominals (N), and nominals that appear without particles in the corpora although they may optionally take them (E). In addition, the spoken corpora include bare nominals used as vocative phrases without particles. These cases have been counted separately. Erroneous usage of nominals only appears in the spoken corpora. Light verb combinations here only include cases that may allow realization of subject or object case particles; their numbers are high in both the spoken and the written corpora.

As we see in Table 3, the overall case ellipsis rates are not that high across the two registers, but the difference between the spoken and the written corpora is significant (χ² = 851.78, p < .001).

Spoken      E      M      D      L      Total
Realized    71%    88%    93%    95%    88%
Ellipsis    29%    12%    7%     5%     12%

Written     PE     N      P      A      Total
Realized    98%    98%    97%    99.7%  98%
Ellipsis    2%     2%     3%     0.3%   2%

Table 3. Particle Realization vs. Ellipsis

Furthermore, genre plays an even more significant role within the spoken corpora. Particle ellipsis in everyday conversations is significantly more frequent than in monologues, debates, or lectures, using a Bonferroni-adjusted alpha level of .008 per comparison (.05/6) (χ²(1) = 266.64, p < .001; χ²(1) = 571.19, p < .001; χ²(1) = 746.93, p < .001). Particle ellipsis in monologues is significantly more frequent than in debates or lectures (χ²(1) = 61.66, p < .001; χ²(1) = 126.59, p < .001). In contrast, the comparison between debates and lectures shows a lower chi-square value than the other cases, although the value is still significant (χ²(1) = 11.72, p < .001). Table 4 presents the annotation results of case particle realization and ellipsis, including subject and object particles.
Spoken
Particles   E            M            D            L             Total
SUBJ +      63% (539)    88% (776)    93% (927)    95% (848)     85% (3090)
SUBJ −      37% (318)    11% (97)     7% (67)      5% (48)       15% (530)
OBJ +       51% (398)    73% (535)    85% (698)    89% (771)     75% (2402)
OBJ −       49% (389)    27% (198)    15% (121)    11% (92)      25% (800)
CONJ +      92% (57)     68% (54)     90% (89)     98% (137)     88% (337)
CONJ −      8% (5)       32% (26)     10% (10)     2% (3)        12% (44)
OTHERS +    81% (549)    90% (634)    95% (859)    97% (1174)    92% (3213)
OTHERS −    19% (131)    10% (74)     4% (39)      3% (42)       8% (286)

Written
Particles   PE           N             P            A             Total
SUBJ +      97% (743)    97% (840)     92% (635)    99.7% (588)   98% (2806)
SUBJ −      3% (25)      3% (24)       3% (18)      0.3% (2)      2% (69)
OBJ +       94% (967)    95% (1066)    99% (1050)   99% (1026)    97% (4109)
OBJ −       5% (56)      5% (53)       1% (13)      1% (9)        3% (131)
CONJ +      100% (133)   100% (113)    97% (226)    99.7% (276)   99% (748)
CONJ −      0% (0)       0% (0)        3% (7)       0.3% (1)      1% (8)
OTHERS +    99% (1778)   99.5% (1739)  93% (1680)   100% (2173)   98% (7370)
OTHERS −    1% (17)      0.5% (9)      7% (127)     0% (0)        2% (153)

Table 4. Realization and Ellipsis of Case Particles

Overall dropping rates of subject particles and object particles differ between the spoken and the written corpora. Object particle dropping is significantly more frequent in the spoken corpora than in the written corpora (χ² = 797.03, p < .001). Within the spoken corpora, there is also some variation according to genre. Both subject and object dropping rates increase as the genres become less formal. In everyday conversations, the dropping rate of object particles reaches 49% and the dropping rate of subject particles is 37%. While the dropping rates of both particles decrease in the formal registers of the spoken corpora, the dropping rate of object particles is consistently higher than the dropping rate of subject particles in each register. In parallel, conjunctive particles and other case particles are more frequently dropped in the spoken corpora than in the written corpora.9

Our findings can be summarized as follows:

• In Korean, particle ellipsis is not very frequent.
  The particle dropping rate for subjects is 12% in the spoken corpora and 2% in the written corpora.
• The effect of register variation on particle ellipsis (everyday conversations vs. debates & lectures) demonstrates that particle dropping is less preferred in formal contexts. However, formality per se is not the deciding factor, but a partially related factor.10
• Across the spoken corpora, object particles drop more frequently than subject particles (χ² = 115.17, p < .001).
• Other case and connective particles are also more frequently elided in the spoken corpora.

9 Unexpectedly, conjunctive particles drop more frequently in monologues than in everyday conversations.

10 This can be supported by the fact that register variation does not affect particle dropping in the written corpus.

5 Linguistic Properties in Particle Ellipsis

The frequent case particle ellipsis in the spoken corpora suggests that discourse needs to be further investigated: discourse factors appear to contribute to particle ellipsis, as suggested in Lee and Thompson (1989). Using the corpus annotation, we can explore linguistic properties involved in particle ellipsis.

5.1 Definiteness and Specificity

A case particle is likely to be dropped when the preceding noun is definite or specific (Kim, 1991). The definite NP ku haksayng 'that student' can drop subject case. This contrasts with the fact that the indefinite expression etten haksayng ‘some student’ cannot appear without the subject particle.

(5) a. ku    haksayng-i/-Ø   na-lul  chacawa-ss-e.
       that  student-SBJ/Ø   I-OBJ   visit-PAST-END
       ‘That student visited me.’
    b. etten  haksayng-i/*Ø   na-lul  chacawa-ss-e.
       some   student-SBJ/Ø   I-OBJ   visit-PAST-END
       ‘Some student visited me.’

In our annotated corpus, the particles that are attached to personal pronouns and wh-pronouns are frequently dropped. This suggests that definiteness is a crucial factor for licensing particle dropping.11

11 Lee (2006, 2010) argues that case ellipsis of subjects and objects interacts with the definiteness of nominals. The rate of case ellipsis for strongly definite subject NPs is significantly higher than the rate for weakly definite NPs. However, object case ellipsis works in the opposite direction. It is difficult to identify the definiteness of a nominal in Korean, where definite and indefinite articles do not exist. We have not annotated definiteness features in our corpora, but intend to as part of future work.

5.2 Familiarity and Salience in Discourse

Particle ellipsis is also based on the discourse property of familiarity (background).12 In the following example, it is more natural to drop the object particle from tampay 'cigarette' when speaking in a convenience store. This is because selling cigarettes is already familiar knowledge shared among the discourse participants.

(6) tampay-?lul/-Ø   cwu-seyyo.
    cigarette-OBJ/Ø  give-IMPERATIVE
    ‘Please give me cigarettes.’

12 Similarly, Lee and Thompson (1989) propose that "sharedness between communicators" is the pragmatic factor determining object particle ellipsis in discourse.

However, when the object particle is used in (6), the object cigarette is exclusively designated or highlighted. This contrasts with the fact that the speaker commonly uses a nominal referring to discourse participants, such as you and I, proper names, or titles, without a particle in order to catch the attention of the listener(s). Also, when a subject or object nominal is scrambled out of its original position and appears at the sentence-initial or sentence-final position, the particle disappears to emphasize the salience of the nominal element, as in (7).

(7) a. philyohan-n     kel    hanato    mos  tule,  na-Ø.
       necessary-REL   thing  anything  not  take   I-Ø
       'I cannot take anything that is necessary.'
    b. saylo   o-n       sensayng-Ø (ul),   ne   alla?
       newly   come-REL  teacher-Ø (OBJ)    you  know
       'Do you know the new teacher?'

Examination of our annotated corpora strongly suggests that particle ellipsis is associated with two contrastive discourse properties, familiarity and salience, and also that it interacts with other grammatical mechanisms such as word order, lexical category, and possibly prosody.13

6 Final Remarks

In this study, we presented our annotation work on particle realization and ellipsis using spoken and written corpora in Korean. A new annotation scheme and principles were presented, along with challenging issues and solutions, such as the recovery of missing particles and the distinction between ellipsis and non-occurrence of particles. In order to evaluate the effect of register variation on particle ellipsis, we incorporated four different genres. Our major finding is that the rate of particle ellipsis in Korean is not as high as generally assumed and that register variation is a significant factor only in the spoken corpora. The more informal dialogs are, the more often particles are elided. Our corpus annotation suggests that particle ellipsis is related to semantic/pragmatic constraints activated among discourse participants, which include definiteness, specificity, familiarity, and salience.

The implication of these findings is significant not only for linguistic theory, but also for language processing, Korean language teaching, and translation. Particle ellipsis will be a more serious issue for computational modeling that incorporates informal spoken dialogs than for computational processing of written texts. In language teaching, particles need to be emphasized more for formal writing and formal speaking, based on their frequency in the given register (Lee et al., this volume).

Next, we plan to run error detection software on our corpus to verify the consistency of our annotation (Dickinson and Meurers, 2003), to prepare for releasing the data with guidelines, to further analyze the results of the annotation, and to address more elaborate linguistic implications of the annotated data.
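The paper does not state how the chi-square values in Section 4 were computed. As a rough check, the following minimal sketch assumes a Pearson test on 2×2 contingency tables built from the counts in Tables 2 and 4; the function name chi2_2x2 and its yates flag are illustrative choices, not part of the original study. Under these assumptions, the uncorrected statistic for everyday conversations versus monologues (Table 2) and the continuity-corrected statistic for spoken versus written object particles (Table 4) both come out close to the reported values of 266.64 and 797.03. Which variant the authors used for each comparison is not stated, so this correspondence should be read as an observation rather than a reconstruction of their procedure.

```python
import math


def chi2_2x2(a, b, c, d, yates=False):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (1 degree of freedom).

    Returns (statistic, p_value). With yates=True, Yates' continuity
    correction is applied (an assumption; the paper does not mention it).
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Bonferroni-adjusted alpha for the six pairwise genre comparisons (.05/6).
ALPHA = 0.05 / 6

# Table 2, spoken corpora: (realized, ellipsis) for E and M.
stat_em, p_em = chi2_2x2(2081, 843, 2853, 395)

# Table 4, object particles: (OBJ +, OBJ -) totals for spoken and written.
stat_obj, p_obj = chi2_2x2(2402, 800, 4109, 131, yates=True)

print(f"E vs. M ellipsis:        chi2 = {stat_em:.2f}, significant = {p_em < ALPHA}")
print(f"OBJ, spoken vs. written: chi2 = {stat_obj:.2f}, p < .001 = {p_obj < 0.001}")
```

The p-value uses the closed form for one degree of freedom, so the sketch needs only the standard library.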
13 Case ellipsis and realization have also been examined within information structure-based analyses such as Lee (2006, 2010) and Kwon and Zribi-Hertz (2008).

References

Markus Dickinson and Detmar Meurers. 2003. Detecting Errors in Part-of-Speech Annotation. Proceedings of the 10th Conference of European Chapter of the Association for Computational Linguistics (EACL03). Budapest, Hungary.

John Fry. 2001. Ellipsis and ‘wa’-marking in Japanese conversation. Doctoral Dissertation. Stanford University.

Noriko Fujii and Tsuyoshi Ono. 2000. The Occurrence and Non-Occurrence of the Japanese Direct Object Marker O in Conversation. Studies in Language, 24(1): 1-39.

Jeanette K. Gundel, Nancy Hedberg and Ron Zacharski. 1993. Cognitive Status and the Form of Referring Expressions in Discourse. Language, 69: 274-307.

Minpyo Hong, Kyongjae Park, Inkie Chung, and Jiyoung Kim. 1998. Elided Postpositions in Spoken Korean and their Implications on Center Management. Korean Journal of Cognitive Science, 9(3): 35-45.

Young-joo Hong. 2009. Syntactic Relation between Two Nominals in NP and Non-occurrence of の/uy (Myengsakwu Nay-uy Cenhang Myengsa-wa Hwuhang Myengsa-uy Thongsacek Kwankey-wa の/uy-uy Pisilhyen, In Korean). Japanese Study (Ilbon Yengoo), 40: 639-653. The Institute of Japanese Studies. Seoul.

Paul Hopper and Sandra A. Thompson. 1984. The Discourse Basis for Lexical Categories in Universal Grammar. Language, 60: 703-752.

Ji-Eun Kim. 1991. A Study on the Condition in Realizing Subject without Case Marker in Korean. Hangul, 212.

Kun-hee Kim and Jae-il Kwon. 2004. Korean Particles in Spoken Discourse: A Statistical Analysis for the Unification of Grammar. Hanmal Yenku, 15: 1-22.

Eon-Suk Ko. 2000. A Discourse Analysis of the Realization of Objects in Korean. Japanese/Korean Linguistics, 9: 195-208. Stanford: CSLI Publication.

Jae-il Kwon. 1989. Characteristic of Case and the Methodology of the Case Ellipsis. Language Research, 25(1): 129-139.

Song-Nim Kwon and Anne Zribi-Hertz. 2008. Differential Function Marking, Case, and Information Structure: Evidence from Korean. Language, 84(2): 258-299.

Hanjung Lee. 2006. Parallel Optimization in Case Systems: Evidence from Case Ellipsis in Korean. Journal of East Asian Linguistics, 15: 69-96.

Hanjung Lee. 2010. Explaining Variation in Korean Case Ellipsis: Economy versus Iconicity. Journal of East Asian Linguistics, 19: 292-318.

Hyo Sang Lee and Sandra A. Thompson. 1989. A Discourse Account of the Korean Accusative Marker. Studies in Language, 13: 105-128.

Seon-woong Lee. 2005. A Study on Realization of Nominal Arguments (Myengsa-uy Nonhang Silhyen Yangsang, In Korean).

Sun-Hee Lee. 2006. Particles (Cosa). Why Do We Need to Reinvestigate Part of Speeches? (in Korean): 302-346.

Sun-Hee Lee, Markus Dickinson, and Ross Israel. 2012. Developing Learner Corpus Annotation for Korean Particle Errors. In Proceedings of the Sixth Linguistic Annotation Workshop (this volume). Jeju, Korea.

Yoon-jin Nam. 2000. A Statistical Analysis of Modern Korean Particles (Hyentay Hankwuke-ey tayhan Kyelyang Enehakcek Yenkwu). Thayhaksa.

Yongkyoon No. 1991. A Centering Approach to the *[CASE][TOPIC] Restriction in Korean. Linguistics, 29: 653-668.

Yu-hyun Park. 2006. A Study on the Particle '-ka''s Non-Realization in Modern Korean Spoken Language. Emwunlonchong, 45: 211-260.

Ho-Min Sohn. 1999. The Korean Language. Cambridge University Press. Cambridge, UK.

Enric Vallduví and Maria Vilkuna. 1998. On Rheme and Kontrast. The Limits of Syntax, eds. Peter Culicover and Louise McNally, 79-109. New York: Academic Press.

Shuichi Yatabe. 1999. Particle Ellipsis and Focus Projection in Japanese. Language, Information, Text, 6: 79-104.