
Unsupervised Relation Extraction for E-Learning Applications

2012


Unsupervised Relation Extraction for E-Learning Applications

Naveed Afzal

A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy

2012

This work or any part thereof has not previously been presented in any form to the University or to any other institutional body whether for assessment, publication, or for other purposes (unless otherwise indicated). Save for any express acknowledgements, references and/or bibliographies cited in the work, I confirm that the intellectual content of the work is the result of my own efforts and of no other person.

The right of Naveed Afzal to be identified as author of this work is asserted in accordance with ss.77 and 78 of the Copyright, Designs and Patents Act 1988. At this date, copyright is owned by the author.

Signature…………………. Date………………..

Wisdom is not the product of schooling but the lifelong attempt to acquire it. (Albert Einstein)

Abstract

In this modern era many educational institutes and business organisations are adopting the e-Learning approach as it provides an effective method for educating and testing their students and staff. The continuous development in the area of information technology and the increasing use of the internet have resulted in a huge global market and rapid growth for e-Learning. Multiple Choice Tests (MCTs) are a popular form of assessment and are quite frequently used by many e-Learning applications as they are well adapted to assessing factual, conceptual and procedural information. In this thesis, we present an alternative to the lengthy and time-consuming activity of developing MCTs by proposing a Natural Language Processing (NLP) based approach that relies on semantic relations extracted using Information Extraction to automatically generate MCTs.

Information Extraction (IE) is an NLP field used to recognise the most important entities present in a text, and the relations between those concepts, regardless of their surface realisations. In IE, text is processed at a semantic level that allows the partial representation of the meaning of a sentence to be produced. IE has two major subtasks: Named Entity Recognition (NER) and Relation Extraction (RE). In this work, we present two unsupervised RE approaches (surface-based and dependency-based). The aim of both approaches is to identify the most important semantic relations in a document without assigning explicit labels to them, in order to ensure broad coverage, unrestricted to predefined types of relations. In the surface-based approach, we examined different surface pattern types, each implementing different assumptions about the linguistic expression of semantic relations between named entities, while in the dependency-based approach we explored how dependency relations based on dependency trees can be helpful in extracting relations between named entities. Our findings indicate that the presented approaches are capable of achieving high precision rates.

Our experiments make use of traditional, manually compiled corpora along with similar corpora automatically collected from the Web. We found that an automatically collected web corpus is still unable to ensure the same level of topic relevance as attained in manually compiled traditional corpora. Comparison between the surface-based and the dependency-based approaches revealed that the dependency-based approach performs better.
Our research enabled us to automatically generate questions regarding the important concepts present in a domain by relying on unsupervised relation extraction approaches, as extracted semantic relations allow us to identify key information in a sentence. The extracted patterns (semantic relations) are then automatically transformed into questions. In the surface-based approach, questions are automatically generated from sentences matched by the extracted surface-based semantic patterns, relying on a certain set of rules. Conversely, in the dependency-based approach questions are automatically generated by traversing the dependency tree of the extracted sentence matched by the dependency-based semantic patterns. The MCQ systems produced from these surface-based and dependency-based semantic patterns were extrinsically evaluated by two domain experts in terms of question and distractor readability, usefulness of semantic relations, relevance, acceptability of questions and distractors, and overall MCQ usability. The evaluation results revealed that the MCQ system based on dependency-based semantic relations performed better than the surface-based one. A major outcome of this work is an integrated system for MCQ generation that has been evaluated by potential end users.

Acknowledgements

First of all, I would like to thank the Almighty who has enabled me to complete this thesis. This thesis would not have been possible without help from a lot of people and I would like to take this opportunity to thank them. Special thanks to my director of studies, Ruslan Mitkov, and my supervisors Viktor Pekar and Atefeh Farzindar for their continuous guidance, support and encouragement. I would like to express my special gratitude and appreciation to Alison Carminke and Erin Stokes who took the trouble of reading the final draft of my thesis and helped me improve it with their valuable comments. I would also like to thank Syed Amir Iqbal and Ruth Seal for their help and feedback during evaluation.

This thesis would not be the way it is without the valuable comments and suggestions from members of the Research Group in Computational Linguistics at the University of Wolverhampton. In alphabetical order they are Miranda Chong, Iustin Dornescu, Richard Evans, Le An Ha, Iustina Ilisei, Natali Konstantinova, Georgiana Marsic, Constantin Orasan, Yvonne Skalban, Lucia Specia and Irina Temnikova. I would like to express my special gratitude and appreciation to my former research advisor Mark Stevenson who introduced me to the world of Natural Language Processing. I still think fondly of my time as a postgraduate student spent working with him.

Finally, but most importantly, I would also like to thank my family and friends for their unstinting support, motivation and encouragement. My parents have always believed in me and encouraged me to strive for excellence in all that I do. There are no sufficient words to thank them for their help and everything they did for me.

Table of Contents

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
CHAPTER 1: INTRODUCTION
  1.1 E-Learning
  1.2 Automatic Assessment in E-Learning
  1.3 Multiple-Choice Questions (MCQs)
  1.4 Challenges in Automatic Generation of Multiple Choice Questions
  1.5 Aims of the Thesis
  1.6 Original Contributions
  1.7 System Overview
  1.8 Structure of the Thesis
CHAPTER 2: BACKGROUND
  2.1 Automatic Multiple Choice Question Generation
  2.2 Information Extraction (IE)
    2.2.1 Applications of IE
    2.2.2 Subtasks of IE
    2.2.3 Evaluation of IE Systems
    2.2.4 Strategies to Perform IE
    2.2.5 Machine Learning Approaches in IE
  2.3 Approaches to building Named Entity Recognition Systems
    2.3.1 Supervised Learning Approach
    2.3.2 Semi-supervised Learning Approach
    2.3.3 Unsupervised Learning Approach
  2.4 Rule-based Approaches to Relation Extraction
    2.4.1 AutoSlog
    2.4.2 PALKA
  2.5 Supervised Approaches to Relation Extraction
    2.5.1 CRYSTAL
    2.5.2 LIEP
    2.5.3 WHISK
    2.5.4 GATE
  2.6 Semi-supervised Approaches to Relation Extraction
    2.6.1 AutoSlog-TS
    2.6.2 Snowball: Extracting Relations from Large Plain-Text Collections
    2.6.3 Dependency Tree based Pattern Models
  2.7 Unsupervised Approaches to Relation Extraction
  2.8 Relation Extraction in the Biomedical Domain
  2.9 Use of Web as a corpus
  2.10 Summary
CHAPTER 3: STEM SENTENCES SELECTION VIA IE
  3.1 Unsupervised Surface-based Patterns
    3.1.1 Our Approach
    3.1.2 NER and PoS Tagging of Biomedical Texts
    3.1.3 Extraction of Candidate Patterns
    3.1.4 Pattern Ranking
    3.1.5 Evaluation
    3.1.6 Results
  3.2 Unsupervised Dependency-based Patterns
    3.2.1 Automatic Parsing of Text
    3.2.2 Our Approach
    3.2.3 Extraction of Candidate Patterns
    3.2.4 Pattern Ranking
    3.2.5 Evaluation
    3.2.6 Results
  3.3 Comparison between Surface-based and Dependency-based Approaches
  3.4 Summary
CHAPTER 4: QUESTIONS AND DISTRACTORS GENERATION
  4.1 Question Generation
  4.2 Our Approach
    4.2.1 Surface-based Patterns
    4.2.2 Dependency-based Patterns
  4.3 Distractors Generation
    4.3.1 Our Approach
  4.4 Summary
CHAPTER 5: EXTRINSIC EVALUATION
  5.1 Overview
  5.2 Our Approach
    5.2.1 Evaluation Data
    5.2.2 Evaluation Method
    5.2.3 Results
    5.2.4 Comparison
    5.2.5 Discussion
  5.3 Summary
CHAPTER 6: CONCLUSIONS
  6.1 Thesis Contributions
  6.2 Thesis Review
  6.3 Future Work
APPENDIX A: PREVIOUSLY PUBLISHED WORK
APPENDIX B: EXAMPLES OF AUTOMATICALLY GENERATED MCQS
APPENDIX C: RESULT TABLES
BIBLIOGRAPHY

List of Tables

Table 1: Contingency table
Table 2: Example gene and protein names in various linguistic forms
Table 3: Tagging accuracies
Table 4: GENIA NER performance
Table 5: Untagged word patterns along with their frequencies
Table 6: PoS-tagged word patterns along with their frequencies
Table 7: Verb-centred patterns along with their frequencies
Table 8: Patterns only containing stop-words
Table 9: Homogeneity scores of corpora
Table 10: Similarity scores of corpora
Table 11: Percentages of heads correctly attached
Table 12: SVO patterns along with their frequencies
Table 13: Adapted linked-chain patterns along with their frequencies
Table 14: Examples of extracted patterns along with automatically generated questions
Table 15: Examples of automatically generated distractors
Table 16: Evaluation results of surface-based and dependency-based MCQ systems
Table 17: Interpretation of Kappa score
Table 18: Kappa score
Table 19: Weighted Kappa score
Table 20: p-values of Chi-Square

List of Figures

Figure 1: An example of a Multiple Choice Question
Figure 2: Overall system architecture
Figure 3: Relation Extraction approach
Figure 4: Rank-thresholding results for untagged word patterns using GENIA corpus
Figure 5: Score-thresholding results for untagged word patterns using GENIA corpus
Figure 6: Rank-thresholding results for PoS-tagged word patterns using GENIA corpus
Figure 7: Score-thresholding results for PoS-tagged word patterns using GENIA corpus
Figure 8: Rank-thresholding results for verb-centred word patterns using GENIA corpus
Figure 9: Score-thresholding results for verb-centred word patterns using GENIA corpus
Figure 10: Rank-thresholding results for untagged word patterns along with prepositions using GENIA corpus
Figure 11: Score-thresholding results for untagged word patterns along with prepositions using GENIA corpus
Figure 12: Rank-thresholding results for PoS-tagged word patterns along with prepositions using GENIA corpus
Figure 13: Score-thresholding results for PoS-tagged word patterns along with prepositions using GENIA corpus
Figure 14: Rank-thresholding results for verb-centred word patterns along with prepositions using GENIA corpus
Figure 15: Score-thresholding results for verb-centred word patterns along with prepositions using GENIA corpus
Figure 16: Precision scores of best performing ranking method for verb-centred patterns in score-thresholding
Figure 17: Precision scores of best performing ranking method for verb-centred patterns with prepositions in score-thresholding
Figure 18: Precision, recall and F-score for verb-centred patterns with prepositions in score-thresholding measure using CHI
Figure 19: Precision, recall and F-score for verb-centred patterns with prepositions in score-thresholding measure using NMI
Figure 20: Dependency tree of ‘PROTEIN activates PROTEIN in CELL’
Figure 21: Encoded biomedical text
Figure 22: Rank-thresholding results for adapted linked chain patterns using GENIA corpus
Figure 23: Rank-thresholding results for adapted linked chain patterns using GENIA corpus
Figure 24: Precision scores of best performing ranking method for adapted linked chain dependency patterns in score-thresholding
Figure 25: Precision, recall and F-score for adapted linked chain dependency patterns in score-thresholding measure using CHI
Figure 26: Precision, recall and F-score for adapted linked chain dependency patterns in score-thresholding measure using NMI
Figure 27: Comparison of precision scores using NMI for GENIA corpus between dependency-based and verb-centred surface-based patterns
Figure 28: Comparison of precision scores using CHI for GENIA corpus between dependency-based and verb-centred surface-based patterns
Figure 29: Automatic question generation from dependency tree
Figure 30: Screenshot of extrinsic evaluation interface
Figure 31: Comparison between surface-based and dependency-based MCQ systems

Abbreviations

BNC – British National Corpus
CHI – Chi-Square
FBQ – Fill-in-the-Blank Question
GUI – Graphical User Interface
ICT – Information and Communication Technology
IE – Information Extraction
IG – Information Gain
IGR – Information Gain Ratio
IR – Information Retrieval
LL – Log-Likelihood
MCQ – Multiple Choice Question
MCT – Multiple Choice Test
MI – Mutual Information
MT – Machine Translation
MUC – Message Understanding Conference
NE – Named Entity
NER – Named Entity Recognition
NLG – Natural Language Generation
NMI – Normalised Mutual Information
NLP – Natural Language Processing
PoS – Parts-of-Speech
RE – Relation Extraction
SC – Semantic Class
SVO – Subject Verb Object
VLE – Virtual Learning Environment

Chapter 1: Introduction

1.1 E-Learning

In the modern era of information technology many organisations and institutions offer diverse forms of training to their employees or learners, and most of these training options utilise e-Learning. In the last two decades, e-Learning has seen exponential growth, mainly due to the development of the internet, which has made online materials accessible to more people than ever, allowing many corporations, educational institutes, governments and other organisations to use it in their training process. E-learning has also been referred to by different terms such as online learning, web-based training and computer-based training. E-learning is fundamentally a learning process that is facilitated and supported by Information and Communications Technology (ICT). Learning objectives play a pivotal role in the design of any learning material as they help to design lessons which are easier for the learner to comprehend and the instructor to evaluate. The quality of e-Learning depends upon its contents and its delivery. The concept of e-Learning is growing at a rapid rate, since more and more people are using computers frequently in every field of life. E-learning has made a huge impact in the field of education as it has been exploited effectively in higher education to enhance the traditional forms of teaching and administration, and students are more comfortable with e-Learning methods and technologies.
E-learning can be CD-ROM-based, network-based or internet-based, and it can contain text, audio, video and a Virtual Learning Environment (VLE). A VLE is a software platform on which learning materials are assembled and made available. Distance education (in which the learner and the instructor are separated by space and/or time) has also provided a base for e-Learning development. It is delivered through a variety of learning resources, e.g. learning guides and supplementary digital media. Currently many educational institutes use blended learning, a term used to describe education that combines on-campus and distance learning approaches. It includes conventional on-campus courses supplemented by some e-Learning. In order for e-Learning to be effective it must use reliable and easy-to-use technology.

E-learning also has a major impact in the industrial field. The ability to acquire new skills and knowledge is important for any professional in this fast-moving world. According to a survey report in 2008 [1], the vast majority of public sector organisations (82%) and 42% of private sector organisations used e-learning for the training of their employees. The global market for e-Learning is growing at a rapid rate as many business organisations and educational institutes are seeking to deliver their learning in a smarter and more cost-effective way. E-learning products have a huge market world-wide: the UK e-learning market alone was estimated at between £500m and £700m in 2009 [2]. The future of e-Learning depends on the development of IT technologies.

[1] http://www.cipd.co.uk/NR/rdonlyres/3A3AD4D6-F818-4231-863B4848CE383B46/0/learningdevelopmentsurvey.pdf
[2] http://www.elearningcentre.co.uk/Reviews_and_resources/Market_Size_Reports_/The_UK_e_learning_market_2009

1.2 Automatic Assessment in E-Learning

Automatic assessment is one of the main strengths of e-Learning. Assessment is a process used to test the acquired knowledge of a person on a specific topic/subject. According to Linn and Miller (2005), “Assessment is a general term that includes the full range of procedures used to gain information about student learning (observations, ratings of performance or projects, paper-and-pencil tests) and the formation of value judgements concerning learning progress.” Assessment has a vital role to play in the areas of education and training as it determines whether or not learning objectives are being met. Educational institutes such as schools and universities conduct regular assessments of their students. Effective assessment aids teachers in analysing learning problems and progress, improving and enhancing their own performance, and achieving and maintaining academic standards. Many organisations, in both the public and private sectors, also conduct regular assessments of their employees as well as job applicants. In many areas, such as health-care and law, specialists have to undertake compulsory assessment procedures in order to attain national qualifications and the right to practise their profession.

The development and delivery of assessment materials, the analysis of their results and the provision of feedback to numerous test takers is an extremely laborious and time-consuming task. According to Stiggins (2001), most teachers in schools or higher education institutes often lack the knowledge and skills to create effective assessment materials. Moreover, they are also unable to correctly interpret the assessment results in order to use them for future adaptation.
Automatic assessment in e-Learning provides immediate feedback, enables the instructor to ensure the continuous intellectual, social and physical development of the learner, and can also be linked to other computer-based or online materials. ICT-based assessment support technologies have been used for some time in different educational scenarios (see McFarlane 2001, 2002; Weller, 2002 for a review of ICT-based assessment support). The use of ICT-based assessment has many advantages when compared to paper-and-pencil testing and it is more appropriate for large-scale assessments (e.g. Ball et al., 2003; Abell et al., 2004; Scheuermann and Pereira, 2008). Moreover, ICT-based assessments considerably lessen the amount of time and money spent on manually producing assessment exercises (Pollock et al., 2000). ICT has been widely used to help author and deliver assessments to students through software such as TRIADS and QuestionMark, and frameworks such as OLAAF. ICT has also been used in assessment scoring and feedback provision (e.g. Leacock and Chodorow, 2003; Higgins et al., 2004; Pulman and Sukkarieh, 2005). TOEFL (Test of English as a Foreign Language), GRE (Graduate Record Examinations) and GMAT (Graduate Management Admission Test) are examples of widely used ICT-based assessments.

1.3 Multiple-Choice Questions (MCQs)

Multiple choice questions (MCQs), also known as Multiple-choice tests (MCTs), provide a popular solution for large-scale assessments as they make it much easier for test-takers to take tests and for examiners to interpret their results. MCTs are frequently used in various fields (e.g. education, market research, elections and policies) and can effectively measure a learner’s knowledge and understanding levels. The emergence of e-Learning has created even higher demand for MCTs as they are one of the most effective ways for an e-learner to get feedback.

Multiple-choice tests (MCTs) are a form of objective assessment in which a user selects one answer from a set of alternative choices for a given question. In the literature (see, e.g., Isaacs, 1994) the structure of a multiple choice question is described as follows. A multiple choice question is known as an item. The part of text which states the question is called the stem, while the set of possible answers (correct and incorrect) are called options. The correct answer is called the key, while incorrect answers are called distractors. Figure 1 shows an example of a multiple choice question.

Figure 1: An example of a Multiple Choice Question

MCT items are close-ended questions and are well suited to assessing factual, conceptual and procedural information as they are straightforward to conduct and instantaneously provide an effective measure of test-takers’ performance. MCTs have been employed by many instructors as a preferred assessment tool and it is estimated that 45%-67% of student assessments utilise MCTs (Siegfried and Kennedy, 1995; Lister 2000, 2001; Becker and Watts, 2001; Carter et al., 2003). MCTs lend themselves well to online delivery and computer grading. Most students are quite familiar with this mechanism of assessment. Usually an expert, trained in the relevant disciplines, is employed to create an MCT. The expert familiarises himself with all the reading materials examinees are supposed to know, designs questions and exercises relevant to the most vital concepts discussed in the materials, and creates a list of possible answers.
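The item anatomy just described (stem, options, key, distractors) maps naturally onto a simple data structure. The following is a minimal sketch of such a structure; the class and field names are our own illustration rather than part of any system described in this thesis, and the distractors shown are invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class MCQItem:
    stem: str                                         # the text stating the question
    key: str                                          # the correct answer
    distractors: list = field(default_factory=list)   # the incorrect answers

    @property
    def options(self):
        """All possible answers: the key together with the distractors."""
        return [self.key] + self.distractors

item = MCQItem(stem="Which of the following drugs can relieve headaches?",
               key="aspirin",
               distractors=["insulin", "penicillin", "warfarin"])
print(item.options)
```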
MCTs face criticism due to the belief that they only test a superficial memorisation of facts, and also that while MCTs may be useful for formative assessment, they have no place in examinations where the student should be tested on more than just their ability to recall facts. Moreover, it requires substantial effort to design the content of an MCT (McKeachie, 2002), as poorly written MCTs conceal learners’ knowledge rather than revealing it (Becker and Johnston, 1999; Dufresne, Leonard and Gerace, 2002). The process of manually creating high quality MCTs is quite expensive in terms of time and resources. These costs become even higher when assessments are conducted at short intervals and the content of the test needs to be fresh for every session.

Benton et al. (2004) presented a detailed analysis of MCT item generation, comparing MCT items generated with and without the aid of ICT. For the ICT-aided condition, WebCT™, a commercial course-management software package, was used to deliver MCT items. Their experimental results revealed that the MCT items generated without the aid of ICT were poor and that ICT could really help instructors in creating better quality MCT items. They argued that MCT items generated with the aid of ICT would help instructors to achieve educational objectives by providing guidance and feedback for them to produce better quality MCT items in the future. Their study also affirmed the claims made by Stiggins (2001) that instructors do not know how to design effective assessments.

Research has been carried out in order to determine the best ways to construct MCTs which can provide valid measures of target knowledge. Haladyna et al. (2002) conducted a literature review in the area of MCTs and presented a set of guidelines for instructors to follow during the manual construction of MCTs.

1.4 Challenges in Automatic Generation of Multiple Choice Questions

In the previous section, we discussed the definition of MCTs, their advantages and drawbacks, and the main guidelines to follow when writing MCTs (see Haladyna et al., 2002 for more details). In this section, we will look at the automatic generation of MCTs. As mentioned earlier, the main challenges in the construction of MCT items are the selection of important concepts in a document and the selection of plausible distractors which will enable confident test takers to be better distinguished from unconfident ones. Automated generation of MCT items would solve the problems faced during manual creation of MCT items. The objective of this research is to provide an alternative to the lengthy and laborious activity of developing MCT items by proposing a new automated approach for multiple choice question (MCQ) generation.

All the recent approaches to automatically generating MCQs (see Section 2.1 for further details) in principle take input texts and generate questions by removing some words from a sentence; for example, Mitkov et al. (2003, 2006) employed conversion patterns in order to convert declarative sentences into interrogatives. Their approach mainly relied on the use of a simple set of syntactic transformational rules in order to automatically generate questions. The methodology for distractors (wrong alternatives) varies from research to research. The main idea for distractor selection is to select words semantically or compositionally similar to the correct answer. Most of the studies use machine-readable dictionaries for distractor selection. Mitkov et al. (2003, 2006) and Brown et al.
(2005) employed WordNet [3], a lexical resource in which English nouns, verbs, adjectives and adverbs are grouped into synonym sets, while Kunichika et al. (2002) and Sumita et al. (2005) used their in-house thesauri (see Section 2.1 for further details).

There are also a few commercial systems for effective delivery of learning materials such as MCQs. Questionmark [4] is a well-known and established leader in computerised education technologies in the world. Its products and services focus on technologies to facilitate remote, efficient and secure assessment of numerous test-takers. ETS [5] is a US-based, private non-profit organisation that provides assessment services around the world. ETS is the developer of the world-renowned TOEFL, SAT and GRE tests. ETS also develops software tools for computerised MCT assessment, which are similarly concerned with the management of test materials, their secure administration, the analysis of their results and feedback to students.

[3] An online lexical reference system by Princeton University (http://wordnet.princeton.edu/)
[4] http://www.questionmark.com
[5] http://www.ets.org

In the field of Natural Language Processing (NLP), the automatic generation of multiple choice questions has been gaining a lot of attention over the last decade (see Section 2.1 for more details). NLP is a field in computer science and linguistics in which computers are used to process human languages in textual form in a way that is based on the meaning of the text in order to perform some useful task. The main motivation behind NLP is to build computer systems that can perform tasks which require understanding of textual language and to understand how humans communicate using language.

Automatic generation of MCQs is an emerging topic in the application of NLP. In order to automatically generate MCQs it is important to identify important concepts and the relationships between those concepts in a text. NLP applications such as Term Extraction and Information Extraction help us to accomplish the aforementioned tasks. Automatic generation of questions can be considered a specialised application of Natural Language Generation (NLG), which is a sub-area of NLP. The NLG task is to generate natural language from a machine representation such as a knowledge base or a logical form. Recent advances in NLP technologies have enabled researchers to employ them in the automatic generation of MCQs, but the work done in this area still does not have a long history. Most of the approaches (see Section 2.1 for further details) have extracted important concepts employing NLP technologies and transformed declarative sentences into questions. Some researchers (e.g. Brown et al., 2005; Hoshino and Nakagawa, 2005; Sumita et al., 2005) have employed automatically generated MCQs to measure test takers’ proficiency in English. In recent times domain ontologies have also been employed to automatically generate MCQs.

1.5 Aims of the Thesis

The main aim of the thesis is to identify the ways in which Information Extraction (IE) methodologies can improve the quality of automatically generated MCT items and overcome the shortcomings faced by previous approaches. Previous approaches (e.g. Mitkov et al., 2003, 2006 and Sumita et al., 2005) mostly rely on the syntactic structures of sentences to generate questions.
The main problem with these approaches is the selection of appropriate sentences from which to automatically generate questions, as sometimes a sentence is too simple or too complicated to be used. Therefore, in this research we will explore semantic relations between important concepts, as processing of text at the semantic level allows us to produce a representation of the meaning of the sentence. The advantage of using a semantic relation is that it can be expressed using different syntactic structures. Semantic relations are the principal relations between two concepts expressed by words or phrases, e.g. hypernymy (IS-A relation) and meronymy (Part-Whole relation). Semantic relations play a vital role in many NLP fields such as Information Extraction, Question Answering and Automatic Summarisation. Identification of semantic relations in a text is a complex task and it involves the discovery of certain linguistic patterns in the text that indicate the presence of a particular relation. A pattern consists of words and syntactic categories in text, or the underlying syntactic structure (parse tree) of the text, and a pattern represents the entities related by the semantic relation in the text.

One of the drawbacks of the syntactic approach is that the wording of the question is similar to that of the original sentence (e.g. “Aspirin can relieve headaches.” becomes “Which of the following drugs can relieve headaches?”), hence it can be answered by somebody who tries to memorise complete sentences from the textbook. On the other hand, if the semantic relation between “aspirin” and “headache” can be established (“aspirin RELIEVE headache”), then patterns can be used to generate questions whose wordings do not depend on the original sentence wording. For example, if the relationship is “DRUG A RELIEVE SYMPTOM B” then the following question templates can be used (see the short code sketch below):

Which of the following drugs can relieve SYMPTOM B?
If you have SYMPTOM B, you should use which of the following drugs?

In this way, the generation engine would be more flexible and would be able to generate questions with different wordings.

1.6 Original Contributions

This thesis provides original contributions in the field of automatic generation of MCTs. This research presents a system for the automatic generation of MCT items based on Information Extraction methodologies, as it is important to recognise the most important concepts present in a text and the relations between those concepts, regardless of their surface realisations. This research is mainly focused on generating MCTs from the biomedical domain, but the presented approach is quite flexible and can easily be adapted to generate MCTs from other domains as well. Many NLP technologies which deliver promising results in the newswire or business domains do not yield good results in the biomedical domain (see Section 2.2 for further details). Moreover, there is a lot of interest in techniques which can identify, extract, manage, integrate and discover new hidden knowledge from the biomedical domain.

In order to achieve this main aim, several goals need to be met. First of all, it is necessary to introduce the concept of IE, its major components and the important issues which need to be considered during the IE process. IE has two major components: Named Entity Recognition (NER) and Relation Extraction (RE). The thesis looks at various approaches (supervised, semi-supervised and unsupervised) for each component of IE.
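To make the question-template idea from Section 1.5 concrete, here is a minimal sketch of template-based question generation from an extracted relation triple. The relation label, the template store and the function are our own illustration of the principle, not the actual rules used by the system described later in this thesis:

```python
# Illustrative templates for one relation type; each template hides the drug
# (the key) and verbalises the relation independently of the source wording.
TEMPLATES = {
    "RELIEVE": [
        "Which of the following drugs can relieve {symptom}?",
        "If you have {symptom}, you should use which of the following drugs?",
    ],
}

def questions_from_relation(drug, relation, symptom):
    """Return (stem, key) pairs, one per template of the given relation."""
    stems = [t.format(symptom=symptom) for t in TEMPLATES.get(relation, [])]
    return [(stem, drug) for stem in stems]

for stem, key in questions_from_relation("aspirin", "RELIEVE", "headaches"):
    print(stem, "->", key)
```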
This thesis focuses on the RE component and investigates an unsupervised approach for RE, as most of the recent IE approaches rely on some sort of domain-specific knowledge (e.g. seed examples, training data or hand-crafted rules; see Section 2.6 for more details) to extract relations from unannotated free text (e.g. Basili et al., 2000; Català et al., 2000; Harabagiu and Maiorano, 2000; Yangarber and Grishman, 2000; Yangarber 2000, 2003; Català, 2003; Greenwood et al., 2005; Stevenson and Greenwood, 2009), which is quite laborious and time-consuming. We employed an unsupervised RE approach as it allowed us to cover a potentially unrestricted range of semantic relations, while most supervised and semi-supervised approaches can learn to extract only those relations that have been exemplified in annotated text, seed patterns or seed named entities.

After the unsupervised RE process, the important extracted semantic relations are transformed into questions. The important issues which need to be considered during the question generation phase are the quality of the generated questions and their syntactic correctness. After the question generation phase, the generation of plausible distractors takes place. To assess the usefulness of the investigation and the quality of the generated questions and distractors, an extrinsic evaluation is carried out. The automatically generated questions will be evaluated for their readability, relevance, acceptability and usefulness of semantic relations, and similarly the automatically generated distractors will be evaluated for their readability, relevance and acceptability. At the end, the overall acceptability of the whole automatically generated MCT items will also be assessed.

To summarise, the original contributions of this thesis are:

• Fully implemented automatically generated MCQ systems based on IE.
• Adopted unsupervised Relation Extraction approaches (surface-based and dependency-based patterns) for the MCQ problem which extract important relations from the text.
• Various evaluation approaches to measure the association of extracted relations within the biomedical domain as compared to the general domain.
• Developed new methods for the generation of high quality questions which are grammatically and syntactically correct based on the extracted relations.
• Generation of plausible distractors for each question by utilising different semantic similarity measures.
• An extrinsic evaluation of automatically generated multiple choice test items.

1.7 System Overview

The overall architecture of the proposed system mainly consists of three modules: IE, question generation and distractor generation (see Figure 2). In order to automatically generate MCTs our research will focus on the following main steps: first we will recognise the important concepts in the text and the semantic relations between them using Information Extraction (IE) methodologies (Chapter 3). The extracted semantic relations will allow us to select the most appropriate sentences for automatic question generation. In later stages (Chapter 4) the extracted semantic relations will be transformed into questions by employing a certain set of rules. The process of selecting plausible distractors will make use of a distributional similarity measure (Chapter 4).
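Schematically, the three modules chain together as in the toy skeleton below. Every function is an invented stand-in used purely to show the data flow of Figure 2, not the thesis's actual implementation:

```python
def extract_relations(corpus):
    # IE module (Chapter 3): NER, candidate-pattern extraction, pattern ranking.
    # Toy stand-in: pretend one ranked relation triple was extracted.
    return [("aspirin", "RELIEVE", "headaches")]

def generate_question(relation):
    # Question-generation module (Chapter 4): rules turn a relation into a stem.
    drug, _, symptom = relation
    return f"Which of the following drugs can relieve {symptom}?", drug

def generate_distractors(key, n=3):
    # Distractor module (Chapter 4): terms distributionally similar to the key.
    # Toy stand-in for the distributional-similarity lookup.
    pool = {"aspirin": ["ibuprofen", "paracetamol", "codeine"]}
    return pool.get(key, [])[:n]

def build_mcqs(corpus):
    for relation in extract_relations(corpus):
        stem, key = generate_question(relation)
        yield {"stem": stem, "key": key, "distractors": generate_distractors(key)}

for mcq in build_mcqs(corpus=None):
    print(mcq)
```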
[Figure 2: Overall system architecture. The pipeline runs from an unannotated corpus through Named Entity Recognition, extraction of candidate patterns, pattern ranking and evaluation, yielding semantic relations; rules then drive question generation and a distributional similarity measure drives distractor generation, producing the output MCQs.]

As mentioned earlier, Haladyna et al. (2002) proposed a set of guidelines for instructors to follow during the manual construction of MCTs in order to produce more effective and valid MCTs. These empirical guidelines address various issues during the manual construction of MCT items, such as their readability, content, usability and effectiveness. Moreover, these guidelines emphasised that during the construction of MCTs instructors should focus on important concepts to test a higher level of learning, and that MCTs should not be too general, should be grammatically correct, should use simple vocabulary, must contain a single right answer and should make all distractors plausible. Our research will also follow these guidelines to automatically generate high-quality and effective MCT items. The use of semantic relations in our research will enable us to generate better quality MCT items by focusing on important concepts in the text, while plausible distractors will be automatically generated using the distributional similarity measure.

1.8 Structure of the Thesis

The rest of this thesis is structured as follows: Chapter 2 provides the background for the automatic generation of MCT items and IE. Chapter 3 discusses the unsupervised approaches for relation extraction based on surface form and dependency trees, and their evaluation in order to select stem sentences for the automatic generation of MCQs. Chapter 4 elaborates on the process of question generation and distractor generation, while Chapter 5 presents the extrinsic evaluation of the automatically generated MCT items. Chapter 6 contains the concluding remarks and future directions of work.

In this section we elaborate on the various tasks performed by each chapter in this thesis. Chapter 2 provides a summary of the work done so far in the area of automatic generation of MCT items. This chapter then discusses the field of Information Extraction (IE), applications of IE, subtasks of IE, its two major components (Named Entity Recognition (NER) and Relation Extraction (RE)), various supervised and unsupervised approaches for these components, the evaluation of IE systems, and various supervised, semi-supervised and unsupervised IE systems. In this chapter we look at the various dependency tree based pattern models and the comparison among these models. At the end of this chapter we also describe the use of the Web as a corpus.

Chapter 3 discusses unsupervised semantic relations extracted using IE techniques for stem sentence selection. It elaborates on two unsupervised approaches (surface-based and dependency-based) for RE from the biomedical domain. In the surface-based approach, we explore several different types of linguistic patterns, while the dependency-based approach makes use of a slightly modified version of the linked chain model. Different pattern ranking methods (information-theoretic and statistical) are used to rank the extracted patterns. We employed two different approaches to select the extracted patterns. The chapter ends by making a comparison between the two unsupervised approaches.
Chapter 4 describes how extracted semantic relations in the form of linguistic patterns are used to select stem sentences and how these patterns are then transformed into syntactically correct automatically generated questions. Moreover, this chapter explains the different distributional similarity measures used to select plausible distractors for the automatically generated questions.

Chapter 5 presents an extrinsic evaluation of the whole MCT system in terms of question and distractor readability, relevance, usefulness of semantic relations and acceptability. At the end we also look at the overall usability of automatically generated MCT items.

Chapter 6 contains the concluding remarks and directions for future work.

Chapter 2: Background

In this chapter, we will discuss work done so far in the area of automatic generation of multiple choice test items. After that we will review previous work on NLP methods on which our own work draws in order to develop a new, semantics-aware method for automatic generation of MCQs. This chapter will present an overview of Information Extraction, its application in the real world and its two major components: Named Entity Recognition and Relation Extraction. This chapter will also provide a survey of the various supervised, semi-supervised and unsupervised approaches to building Information Extraction systems. We will also examine and compare various dependency tree based pattern models, along with the use of the Web as a corpus.

2.1 Automatic Multiple Choice Question Generation

Even though NLP has made significant progress in recent years, NLP methods, and the area of automatic generation of MCT items in particular, have started being used in e-Learning applications only very recently.

One of the first significant studies in this area was published by Mitkov et al. (2003, 2006), who presented a computer-aided system for the automatic generation of multiple choice test items. Their system offered an alternative to the lengthy and demanding activity of manual construction of MCT items by proposing an NLP-based methodology for automatic generation of MCT items from instructive texts such as textbook chapters and encyclopaedia entries. Their system mainly consists of three parts: term extraction, stem generation and distractor selection. In the term extraction phase (Ha, 2007), the source text is parsed, and the parser labels each word in the source text with its part-of-speech and syntactic category. After part-of-speech identification, nouns are sorted by their frequencies. The system employs certain rules and a frequency threshold for each noun; any noun whose frequency exceeds the threshold is regarded as a key term. The key terms are used to identify important concepts in a text from which questions are automatically generated.
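As a rough illustration of this frequency-based key-term selection (the actual system uses a full parser and additional rules; the tokeniser, tagger and threshold below are our own simplification using NLTK):

```python
from collections import Counter
import nltk  # assumes the 'punkt' tokeniser and PoS-tagger models are downloaded

def key_terms(text, threshold=3):
    """Return nouns whose frequency in the text meets the threshold."""
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)                     # PoS-tag every token
    noun_counts = Counter(w for w, tag in tagged if tag.startswith("NN"))
    return [w for w, c in noun_counts.items() if c >= threshold]
```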
The system was tested using a linguistic textbook in order to generate MCT items and found that 57% automatically generated MCT items were judged worthy of keeping as test items, of which 94% required some level of post-editing. The main advantage of this approach is that it has given a completely new alternative solution to the time-consuming and laborious activity of manual construction of MCT items, which is at the present moment the most extensively, used method for the students’ knowledge evaluation. The main disadvantage of this system is its reliance on the syntactic structure of sentences to produce MCT items as it produces questions from sentences which have SVO or SV structure. Moreover, the identification of key terms in a sentence is also an issue as identification of irrelevant concepts (key terms) results in unusable stem generation. Karamanis et al. (2006) conducted a pilot study to use Mitkov et al. (2006) system in a medical domain and their results revealed that some questions were simply too vague or too basic to be employed as MCQ in a medical domain. They concluded that further research is needed regarding question quality and usability criteria. Skalban (2009) presented a detailed analysis of the Mitkov et al. (2006) system and highlighted the short-comings it faced. Her work distinguishes between critical and non-critical errors identified in the system output. Non-critical errors are errors with a low impact on the overall worthiness of the item; questions containing non-critical errors can typically be used after post-editing. Critical errors, however, have a detrimental impact on the worthiness of a question; post-editing is not possible. Her work also revealed that key term errors created the most unusable MCT items, accounting for nearly 50% of unworthy items. A key term error occurs, where a question has been generated based on a term which does not represent an important concept in the source text. On the surface, these questions can be syntactically 15 flawless. However, they are still unworthy because questions generated from unimportant concepts are not useful for knowledge assessment. Sumita et al. (2005) presented a system which automatically generated questions in order to measure test-takers’ proficiency in English. The method described in this paper generates Fill-in-the-Blank Questions (FBQs) using a corpus, a thesaurus and the Web. The FBQs are created by replacing verbs with gaps in an input sentence. The possible distractors are retrieved from a thesaurus and then new sentences are created by replacing each gap in the input sentence with a distractor. They conducted their experiments on non-native speakers of the English Language and found that their method is quite effective in measuring proficiency of English in non-native speakers. The main drawback of this approach is that the selection of wrong input sentences results in FBQs which even native speakers are unable to answer. Moreover, the quality of generated FBQs is evaluated by a single native English speaker and it needs to be evaluated further. Brown et al. (2005) used an approach that tests knowledge of students by automatically generating test items for vocabulary assessment. Their system produced six different types of questions for vocabulary assessment by making use of a WordNet. The six different types of questions include: definition, synonym, antonym, hypernym, hyponym and cloze questions. The cloze question requires the use of a target word in a specific context. 
In order to produce definition questions, the system made use of WordNet glosses, choosing the first definition which did not include the target word. A synonym question requires the matching of a target word to its synonym, which is extracted from WordNet; an antonym question requires a word to be matched to its antonym, also obtained from WordNet; and hypernym and hyponym questions require the matching of a word to its hypernym and hyponym respectively. Cloze questions were likewise produced using WordNet glosses. The experimental results suggested that questions generated using this approach provide an efficient way to automatically assess word knowledge. The approach relies heavily on WordNet, however, and is unable to produce any questions for words which are not present in it.

Chen et al. (2006) presented an approach for the semi-automatic generation of grammar test items by employing NLP techniques. Their approach was based on manually designed patterns which were used to find authentic sentences on the Web; these were then transformed into grammatical test items. Distractors were also obtained from the Web by applying some modifications to the manually designed patterns, e.g. changing part-of-speech, or adding, deleting, replacing or reordering words. The experimental results revealed that 77% of the generated MCQs were regarded as worthy (i.e. could be used directly or needed only minor revision). The disadvantage of this approach is that it requires a considerable amount of effort and knowledge to manually design the patterns which the system later employs to generate grammatical test items.

A semi-automatic system to assist teachers in producing cloze tests based on online news articles was presented by Hoshino and Nakagawa (2007). In cloze tests, questions are generated by removing one or more words from a passage, and the test takers have to fill in the missing words. According to this paper, one of the reasons for selecting newspaper articles is that they are usually grammatically correct and suitable for English education. The system focuses on multiple-choice fill-in-the-blank tests and generates two types of distractors: vocabulary distractors and grammar distractors. For vocabulary distractors the system employs a frequency-based method, while for grammar distractors it makes use of ten grammar targets based on Tateno's (2005) research. The system consists of two main components: a pre-processing component and a graphical user interface (GUI). The input documents are first pre-processed and then go through various sub-processes: text extraction, sentence splitting, tagging and lemmatisation, synonym lookup, frequency annotation, inflection generation, grammar target mark-up, grammar distractor generation and selection of vocabulary distractors. The GUI allows the user to interact with the system. User evaluation revealed that 80% of the generated items were deemed suitable.

A system for the automatic generation of MCT items which makes use of domain ontologies was presented by Papasalouros et al. (2008). Ontologies encode the domain knowledge of important concepts and the relationships among these concepts, including knowledge which can be inferred, i.e. facts which are not explicitly defined.
In order to generate MCTs, this paper utilised three different strategies: class-based strategies (based on hierarchies), property-based strategies (based on roles between individuals) and terminology-based strategies. The MCTs generated by this approach were evaluated in terms of quality, syntactic correctness and the number of questions produced for different domain-specific ontologies. The experimental results revealed that not all of the questions produced are syntactically correct, and that more sophisticated Natural Language Generation (NLG) techniques are required to overcome this problem. Moreover, property-based strategies produced a greater number of questions than class-based and terminology-based strategies, but the questions they produce are difficult to manipulate syntactically.

Most of the previous approaches to automatically generating MCTs have been used for vocabulary and grammatical assessments of English. Fundamentally, most of the approaches generate questions by replacing some words in the input text and rely largely on syntactic transformations (e.g. Mitkov et al. 2003, 2006), generating questions by transforming declarative sentences into interrogative ones. The main drawback of these approaches is that the generated MCTs mostly test the recall of facts: they are grammatically correct but of limited use in real-life applications. The main challenge is therefore to automatically generate MCTs which allow the examiner/instructor to evaluate test takers not only on superficial memorisation of facts but also on higher levels of cognition. This research addresses this problem by extracting semantic rather than surface-level or syntactic relations between key concepts in a text via IE methodologies, and then generating questions from these semantic relations. The methodology presented in this research is unsupervised and can easily be adapted to other domains. In the next section we discuss the concept of IE and various approaches to IE in detail.

2.2 Information Extraction (IE)

Information Extraction (IE) is an NLP field concerned with processing unstructured natural language text and presenting it in a structured form such as a database. IE is the identification of specific items of information in text. The goal of IE is to extract salient facts about pre-specified types of semantic classes of objects (entities) and relationships among these entities. Entities are generally noun phrases in unstructured text, e.g. names of persons, posts, locations and organisations, while relationships between two or more entities are described in a pre-defined way, e.g. "interact with" is a relationship between two biological objects (proteins). The extracted information is then automatically stored in databases for further processing.

A pattern-matching approach is employed by many IE systems, where each pattern consists of a regular expression and an associated mapping from syntactic to logical form. During the pattern extraction process it is important to extract patterns that are general enough to extract correct information from the text, while at the same time ensuring that they do not extract incorrect information. Consider the example "James Anderson was appointed vice president of the Proctor & Gamble Company of London". In this example, the entities we are interested in extracting are:

Person = James Anderson
Company = Proctor & Gamble
Post = Vice President

A minimal sketch of such a slot-filling pattern is given below.
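As a toy illustration of this idea (and not the implementation of any particular system discussed here), the following sketch maps the example sentence onto the three template slots with a single hand-written regular expression; a real IE system would combine many such patterns with named entity recognition and syntactic analysis.

# A minimal sketch of slot filling with a single extraction pattern: the
# regular expression encodes one surface realisation of an appointment event
# and maps its matches onto the Person, Post and Company slots.
import re

APPOINT = re.compile(
    r'(?P<Person>[A-Z][a-z]+ [A-Z][a-z]+) was appointed '
    r'(?P<Post>[a-z ]+?) of the (?P<Company>[A-Z][\w&. ]+?)(?: of [A-Z]\w+)?$'
)

sentence = ("James Anderson was appointed vice president "
            "of the Proctor & Gamble Company of London")
match = APPOINT.search(sentence)
if match:
    print(match.groupdict())
    # {'Person': 'James Anderson', 'Post': 'vice president',
    #  'Company': 'Proctor & Gamble Company'}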
Generally, a template is used to define the items of interest in a specific text. A template consists of a collection of slots (in the aforementioned example these slots are Person, Company and Post), each of which may be filled with one or more values.

Portability is one of the major issues in IE, as adapting an existing IE system to a new domain requires manual tuning of domain-specific linguistic knowledge such as terminological dictionaries, domain-specific lexico-semantics, extraction patterns and so on. Building these linguistic knowledge resources by hand is very laborious and time-consuming, so automatic methods using NLP are required to learn them. Apart from portability, large-scale IE systems also face many other challenges in terms of achieving high accuracy, performance, maintainability and usability (see Feldman, 2006 for further details).

2.2.1 Applications of IE

IE is widely used in many applications. It is utilised to automatically track specific event types from news sources, such as disease outbreaks (Grishman et al., 2002). Many customer-oriented organisations collect various forms of unstructured data from customer interactions; to make effective use of this data, IE is applied to integrate it with organisational databases. IE also has a great deal to offer to end-user industries of all kinds, notably banks, financial companies, publishers and governments. For example, finance companies are interested in knowing which company acquisitions took place in a specified time span; in effect, they would like widely dispersed textual information compressed into a simple database. IE is used in Personal Information Management (PIM) systems, which seek to organise personal data such as personal information, emails, personal activities, projects and people in a structured, inter-linked format (Cai et al., 2005; Chakrabarti et al., 2005; Cutrell and Dumais, 2006). A great deal of research has recently been done in the area of bioinformatics, where a major problem is the extraction of biological objects and the relationships between them from repositories, e.g. the extraction of protein names and their interactions from PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) (Bunescu et al., 2005; Plake et al., 2006). Moreover, IE has successfully been playing its part in the processing of clinical documents, including patient discharge summaries and radiology reports, and in assisting clinical decisions (Harkema et al., 2005; Savova et al., 2008; Boytcheva et al., 2009).

Many web-oriented applications make frequent use of IE. Citation web databases such as Citeseer (http://citeseer.ist.psu.edu/) and Google Scholar (http://scholar.google.co.uk/) employ IE in order to extract individual publication records, titles, authors and references from papers, and to segment citation strings into individual author, title, venue and year fields (Ponomareva et al., 2009). IE is used for the automatic annotation of web pages for the Semantic Web (Stevenson and Ciravegna, 2003). IE is also applied to build opinion databases from blogs, newsgroup posts and product reviews, which in turn help organisations to find out useful features of a product and the prevailing polarity of opinion regarding a specific product (Liu et al., 2005; Popescu and Etzioni, 2005). Moreover, IE interacts with many other areas of NLP, including text classification, information retrieval, text mining and question answering (Ravichandran and Hovy, 2002).
For example, IE in a multi-lingual NLP environment may help a machine translation system to translate important facts accurately into the target language, as it can provide the knowledge base for information retrieval, question answering and text summarisation (Heng, 2008). IE can also help to improve the performance of a text mining system by discovering useful knowledge from unstructured text (Mooney and Bunescu, 2005).

2.2.2 Subtasks of IE

The process of IE generally consists of the following subtasks (see Jurafsky and Martin, 2008 for more details):

- Named Entity Recognition (NER): detects and classifies the proper names mentioned in a text
- Co-reference resolution: links or clusters all the mentions that refer to the same named entity
- Relation detection and classification / Relation Extraction: finds and classifies relations among the entities discovered in a given text
- Event detection and classification: finds events and fills in their participant slots with the named entities detected
- Temporal expression recognition: identifies temporal expressions in text
- Temporal analysis: maps temporal expressions onto specific dates or times of day
- Template filling: fills in templates using snippets of text extracted from a given text or inferred from it

Most of the aforementioned IE subtasks are domain dependent. In this research we focus on the following two subtasks:

- Named Entity Recognition (NER)
- Relation Extraction (RE)

Named Entity Recognition (NER) is a key part of an IE system. NER involves the identification of proper names in texts and their classification into a set of predefined categories of interest. These Named Entities (NEs) differ according to the nature of the text: newspaper texts contain the names of people, places and organisations, while biological articles contain the names of genes and proteins. Robust handling of proper names is an essential part of many NLP fields, e.g. IR. A large amount of research has been done on NER in the recent past, with many main conference tracks and workshops on the topic since 2000. Most of the early systems used handcrafted rule-based algorithms for NER, while most modern systems employ various machine learning algorithms. The first major event dedicated to the NER task was MUC-6 (Grishman and Sundheim, 1996). Two shared tasks on NER have been conducted within the Conference on Computational Natural Language Learning: CoNLL 2002 (http://www.cnts.ua.ac.be/conll2002/ner/) (Tjong Kim Sang, 2000) and CoNLL 2003 (http://www.cnts.ua.ac.be/conll2003/ner/) (Tjong Kim Sang and Meulder, 2003). Several NER systems (Nadeau and Sekine, 2007) have been developed to address diverse textual genres and domains; for example, Maynard et al. (2001) designed a system for emails, scientific texts and religious texts. Porting an existing NER system to a new domain or textual genre still remains a major challenge.

Following NER, the next step is the RE phase. The goal here is to identify all the instances of specific relationships or events in text. For example, it is not sufficient to find the occurrence of two biological objects (e.g. a protein and a gene) in a biomedical text; one must also identify whether there is a relationship between them. Generally, a template is used to classify the items which are to be extracted from the text.
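To make these two subtasks concrete, the following minimal sketch uses the off-the-shelf spaCy library (our illustrative choice, not a tool used in this thesis) to detect named entities and to propose within-sentence entity pairs as candidate arguments for the subsequent RE phase.

# A minimal sketch of the NER + RE pipeline: NER detects and classifies the
# entities, and every pair of entities co-occurring in the same sentence
# becomes a candidate argument pair for relation extraction. Assumes the
# en_core_web_sm model has been installed.
import spacy
from itertools import combinations

nlp = spacy.load("en_core_web_sm")
doc = nlp("James Anderson was appointed vice president of the "
          "Proctor & Gamble Company of London.")

for sent in doc.sents:
    entities = [(ent.text, ent.label_) for ent in sent.ents]
    print("Named entities:", entities)
    for pair in combinations(entities, 2):
        print("Candidate relation pair:", pair)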
2.2.3 Evaluation of IE Systems

Information Extraction systems are normally evaluated by comparing the performance of a system against human judgement of the same text. The output identified by the humans is known as the gold standard. IE system evaluations began with the Message Understanding Conferences (MUCs), which were sponsored by the U.S. government and funded by the Defense Advanced Research Projects Agency (DARPA). One of the purposes of these conferences was to develop methods for the formal evaluation of IE systems (Grishman and Sundheim, 1996). Seven Message Understanding Conferences have taken place so far, with a different domain selected for each. MUC-1 (1987) and MUC-2 (1989) were concerned with messages about naval operations; MUC-3 (1991) and MUC-4 (1992) with news articles related to terrorist activities; MUC-5 (1993) with news articles related to joint ventures and microelectronics; MUC-6 (1995) with news articles related to management changes; and MUC-7 (1997) with news articles related to space vehicles and missile launches. The Automatic Content Extraction (ACE) programme (http://www.itl.nist.gov/iad/mig//tests/ace/) has carried forward the work started by the MUCs by organising various evaluation tasks, including named entity detection and recognition, relation detection and recognition, event relation detection and recognition, co-reference resolution and named entity translation. The Text Analysis Conference (TAC) (http://www.nist.gov/tac/about/index.html) has held a series of evaluations and workshops to provide an infrastructure for the large-scale evaluation of different NLP fields (e.g. question answering, recognising textual entailment, summarisation and knowledge base population).

The main aim of evaluation is to find out whether the system can identify the output in the gold standard, and only that output. IE borrows the Information Retrieval (IR) concepts of Precision and Recall for evaluation. A system's Precision score measures what proportion of the relations it identifies are correct, while its Recall score measures what proportion of the correct relations it manages to identify.

Precision (P) = Correct Answers / Answers Produced
Recall (R) = Correct Answers / Total Possible Correct

Both notions can be made clear by examining the contingency table (Table 1):

                            Correct (System)       Incorrect (System)
Correct (Gold Standard)     True Positives (TP)    False Negatives (FN)
Incorrect (Gold Standard)   False Positives (FP)   True Negatives (TN)

Table 1: Contingency table

True Positives (TP) are the correct answers produced by the system, while False Positives (FP) are answers produced by the system which are not present in the gold standard. False Negatives (FN) are correct answers present in the gold standard but not identified by the system, while True Negatives (TN) are answers identified as incorrect by both the gold standard and the system.

P = TP / (TP + FP)
R = TP / (TP + FN)

Precision ranges between 0 (none of the identified events were correct) and 1 (all of them were correct), while Recall likewise ranges between 0 (no correct events identified) and 1 (all of the correct events were identified).

Precision and Recall are often combined into a single metric, the F-measure, which is the harmonic mean of the two:

F = 2PR / (P + R)

In the aforementioned equation of the F-measure, Precision and Recall are given equal weights.
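The following short sketch computes these measures, together with the Accuracy measure discussed next, from the contingency counts of Table 1; the counts themselves are invented purely for illustration.

# A minimal sketch of the evaluation measures defined above, computed from
# the four contingency counts. The counts are invented for illustration.
def evaluate(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy

p, r, f, a = evaluate(tp=80, fp=20, fn=40, tn=860)
print(f"P={p:.2f}  R={r:.2f}  F={f:.2f}  Accuracy={a:.2f}")
# P=0.80  R=0.67  F=0.73  Accuracy=0.94 -- the high accuracy despite modest
# recall illustrates why accuracy is misleading on skewed data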
Precision and Recall are inversely related, which means that it is possible to boost one at the cost of reducing the other, depending on the needs of the intended application. For example, an IR system (e.g. a search engine) can often increase its Recall by retrieving more documents, at the cost of an increasing number of irrelevant documents retrieved (decreasing Precision). Another way to judge an IE or IR system is by its Accuracy, that is, the fraction of its classifications (correct vs. incorrect in IE, relevant vs. irrelevant in IR) that are right. In terms of the contingency table (Table 1), the Accuracy of a system is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is not considered an appropriate evaluation measure in either IR or IE due to data skewness (see Manning et al., 2008 for further details). The measures of Precision and Recall are preferred, as both concentrate on the return of True Positives (TP), asking what percentage of the correct answers the system has found and how many False Positives (FP) it has also returned.

In supervised approaches (see Section 2.2.5), in order to evaluate the performance of a classifier the data set is usually divided into three independent parts: the training data, the validation data and the test data. Classifiers use the training data for learning, the validation data for parameter optimisation and the test data to calculate the error rate. Generally, most classifiers use one-third of the data for testing and the remaining two-thirds for training. In situations where the training or testing data is not representative enough to cover all classes in the data, a statistical technique known as cross-validation is employed. In cross-validation the data is divided into a fixed number of folds of equal size, and each fold in turn is used for testing while the remainder is used for training. 10-fold cross-validation has become the method most used in practice: the data is divided randomly into 10 parts, each part (fold) in turn is used for testing and the remainder for training, and this procedure is repeated 10 times. The error rate is calculated each time, and finally the 10 error estimates are averaged to obtain an overall error estimate. Lavelli et al. (2004) critically reviewed the evaluation methodologies used by various IE systems and emphasised the need for the development of a more reliable and detailed evaluation methodology.

2.2.4 Strategies to Perform IE

A number of factors influence the decision to utilise a particular strategy for building an IE system, including the availability of training data, the availability of linguistic resources, the availability of knowledge engineers and the level of desired performance (see Kaiser and Miksch, 2005 for more details). Generally, there are two strategies for building IE systems:

- Knowledge Engineering
- Statistical or Machine Learning

Most of the early IE systems (e.g. Lehnert et al., 1992; Riloff, 1993) were based on the knowledge engineering strategy, but suffered from the knowledge acquisition bottleneck. In the knowledge engineering strategy, a human expert (a person who is familiar with the domain) defines hand-coded rules or regular expressions to perform the task of extracting the desired information from the text. In order to achieve this, the human expert needs a good linguistic understanding of the task at hand.
This strategy is quite laborious and time-consuming, as it depends heavily on a domain-specific dictionary and therefore requires a great deal of manual engineering. The advantage of this strategy is that, with sufficient skill and experience, high-precision systems can be developed. Its disadvantages are that it involves a very meticulous development process and needs experts with both linguistic and domain expertise. Systems built using this strategy generally have low coverage/recall, because adequate coverage is very hard to ensure through introspection alone, while manual analysis of a corpus is very expensive and cannot guarantee adequate coverage either. This strategy is most suitable in scenarios where training data is scarce or expensive to acquire and the highest possible performance is critical.

The machine learning strategy mostly uses statistical methods and learns extraction patterns or rules from annotated corpora and from interaction with users. It is centred on producing training data rather than on hand-crafting rules, as is the case in the knowledge engineering strategy. Corpus statistics are derived automatically from the training data and used to process novel data. The advantages of this strategy are domain portability, no need for a human expert, and data-driven rules ensuring full coverage of the examples. Its disadvantage is that it will not work if there is no training data, or only a small quantity of it. This strategy is most appropriate in situations where training data is available in large quantities or easy to obtain, and where no skilled rule writers are available for the task. In order to achieve high accuracy, it relies heavily on a large set of training examples. Statistical and machine learning approaches have become quite popular in the IE research community in recent years (e.g. Soderland and Lehnert, 1994; Bikel et al., 1998; Kleinberg, 2002; McCallum and Jensen, 2003; Wang et al., 2005).

2.2.5 Machine Learning Approaches in IE

In the last section, we introduced the knowledge engineering and machine learning strategies in IE; in this section we discuss the various machine learning approaches used in IE. Since 2000, machine learning algorithms have been used quite frequently for building IE systems (Nadeau and Sekine, 2007). There are three main types of machine learning algorithms with respect to the degree of supervision they require:

- Supervised Algorithms
- Semi-supervised Algorithms
- Unsupervised Algorithms

Supervised approaches in IE exploit a procedure known as classification. Classification is the process of assigning objects from a universe to two or more classes. In a classification task, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. A classifier's performance is measured in terms of its error rate: if the classifier predicts the class of an object correctly it is counted as a success, and as an error otherwise. In supervised learning algorithms, the system is given examples of text manually marked up (annotated) with what should be learned from it (e.g. NEs or relations). The focal point of supervised learning is to study the features of positive and negative examples over a large annotated corpus and to devise rules that capture instances of the desired type.
Supervised approaches have the advantage of access to training data (containing positive and negative examples), which enables them to learn complex patterns and give good performance, but the annotation of text with entities or events is a very time-consuming task. The annotation process is quite slow, and it is difficult to set guidelines that cover every instance; without proper guidelines, however, the data will be inconsistent. Classifiers use supervised learning in order to sort data into pre-defined groups. Many researchers have effectively used supervised learning for IE (e.g. Zelenko et al., 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2006). One example of a supervised learning algorithm in IE is WHISK (Soderland, 1999), discussed in Section 2.5.3.

Semi-supervised learning algorithms require a small degree of supervision and utilise a technique called "bootstrapping", which uses a small set of seeds (examples) to start the learning process. The system searches for sentences that contain these seed examples and tries to identify some contextual clues they have in common. It then identifies other instances that appear in a similar context, adds them to the seed examples and starts the learning process again. This process continues until enough instances have been gathered. In this approach very few examples of annotated text are specified, alongside a large quantity of raw text (Ando and Zhang, 2005; Bunescu and Mooney, 2007). The idea of using bootstrapping for IE pattern acquisition was first introduced by Riloff (1996). Examples of semi-supervised learning algorithms based on dependency trees used for pattern learning in IE are the work carried out by Yangarber et al. (2000) and Stevenson and Greenwood (2005) (see Section 2.6.3 for more details). Semi-supervised approaches reduce the time and effort needed to manually hand-craft rules or patterns, but they also have drawbacks. The main disadvantage is that, although the seed examples may be very reliable for a given task, the accuracy of the learned patterns decreases dramatically if any wrong patterns are accepted during the iteration process. Moreover, semi-supervised approaches are dependent on the set of seed examples provided by the expert, as a bad set of seed examples can lead to a poor set of extraction patterns.

Unsupervised learning algorithms do not rely on any hand-labelled training data or seed examples. Most unsupervised learning algorithms use a technique called "clustering", which organises similar observations (patterns, in our case) into subsets known as clusters. Unsupervised learning algorithms are mostly used in scenarios where annotated data or seed examples are not available. Both classification and clustering place objects into groups or classes, but the major difference between classification (supervised learning) and clustering (unsupervised learning) is that in classification the classes are pre-defined, while in clustering nothing is defined in advance. Examples of unsupervised learning algorithms applied to IE include Sekine (2006), Shinyama and Sekine (2006) and Eichler et al. (2008) (discussed in detail in Section 2.7).

2.3 Approaches to building Named Entity Recognition Systems

The first systems for NER were rule-based, relying on pattern-matching rules and precompiled lists of information, i.e.
gazetteers; in the MUC-7 competition (http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html#named), for example, five NER systems out of eight were rule-based. The research community has since moved towards machine learning methods for NER. In the absence of training examples, handcrafted rules remain the preferred technique for NER (e.g. Sekine and Nobata, 2004, who developed an NER system for 200 named entity types). In the biomedical domain, rule-based approaches are also used to identify named entities in the biomedical literature (see Ananiadou and McNaught, 2006 for more details). The major drawback of rule-based approaches is the issue of portability, as these approaches are difficult to adapt to different domains. There are three machine learning approaches to building NER systems:

- Supervised Learning Approach
- Semi-supervised Learning Approach
- Unsupervised Learning Approach

2.3.1 Supervised Learning Approach

Supervised learning is the dominant technique employed to solve the problem of NER. The supervised learning approach studies the features of positive and negative examples of Named Entities (NEs) over a large collection of annotated documents and learns rules that capture instances of a given type. Supervised learning techniques include Hidden Markov Models (HMMs) (Bikel et al., 1998; Borkar et al., 2001; Agichtein and Ganti, 2004; Finkel et al., 2005), Decision Trees (Sekine, 1998), Maximum Entropy Models (ME) (Borthwick et al., 1998; Chieu and Ng, 2003; Florian et al., 2007), Maximum Entropy Markov Models (MEMMs) (McCallum et al., 2000), Support Vector Machines (SVM) (Asahara and Matsumoto, 2003; Mayfield et al., 2003), boosting (Carreras et al., 2003), memory-based learning (MBL) (Meulder and Daelemans, 2003) and Conditional Random Fields (CRF) (McCallum and Li, 2003). All of the abovementioned techniques usually involve a system which reads a large annotated corpus, memorises lists of entities and creates disambiguation rules based on discriminative features. CRFs (McCallum and Li, 2003) are considered the state-of-the-art method for label assignment to token sequences (words), as they have a more flexible and powerful mechanism for exploiting arbitrary feature sets along with dependencies in the labels of neighbouring words (Sarawagi, 2008). Apart from IE, supervised learning approaches are frequently used in many other fields of NLP (e.g. Mehdi et al., 2010 used a supervised learning approach for the summarisation of legal documents). The major shortcoming of the supervised learning approach is its requirement for a large annotated corpus, which is sometimes difficult to obtain.

2.3.2 Semi-supervised Learning Approach

As mentioned earlier, semi-supervised approaches rely on the process of bootstrapping, and many systems have used this technique for NER. Brin (1998) used regular expressions in order to generate lists of book titles paired with book authors from the Web; the system started with a few seed examples and learned new ones, the main idea being that many websites conform to a reasonably uniform format across the site. Collins and Singer (1999) used a parsing technique to search for NE patterns, where a pattern is a proper name followed by a noun phrase in apposition. In this system, patterns are kept in pairs {spelling, context}, where spelling refers to the proper name and context refers to the noun phrase in its context.
The system starts with an initial seed of spelling rules; candidates which satisfy a spelling rule are classified, and their contexts are accumulated. The most frequent contexts are then turned into a set of contextual rules, which are in turn used to find further spelling rules, and so on. Riloff and Jones (1999) introduced a mutual bootstrapping technique which makes use of a set of entities and a set of contexts; their experiments showed that the performance of the algorithm deteriorates with the introduction of noise. Cucchiarelli and Velardi (2001) presented an NER system based on Riloff and Jones's (1999) mutual bootstrapping that used syntactic relations (e.g. subject-object) to discover contextual evidence around named entities.

Pasca et al. (2006) presented a semi-supervised approach to NER, employing Lin's (1998) distributional similarity measure to generate synonyms (i.e. words which are members of the same semantic class) for pattern generalisation. They conducted their experiments on a huge corpus (100 million web documents), starting with only 10 seed examples, and demonstrated that it is possible to generate one million named entities with a precision of about 88%. Data selection also plays an important role in the learning process: Heng and Grishman (2006) noted that selecting documents using information-retrieval-like relevance measures brought the best results in their experiments, rather than relying on a huge collection of documents.

2.3.3 Unsupervised Learning Approach

Clustering is a typical approach used for unsupervised learning. This approach relies on lexical resources (e.g. WordNet), lexical patterns and statistics computed on a large unannotated corpus. Alfonseca and Manandhar (2002) presented an approach to the problem of labelling an input word with the appropriate NE type, making use of WordNet synsets and the surrounding context of the input word. Evans (2003) presented an NER system based on the idea of hypernyms described by Hearst (1992) in order to identify named entities. Shinyama and Sekine (2004) presented an approach based on the observation that NEs often appear synchronously in several news articles, whereas common nouns do not; this allows the identification of rare NEs in an unsupervised manner and can be useful in combination with other NER methods.

Nadeau et al. (2006) presented an unsupervised approach to NER which made use of simple heuristics based on the work of Mikheev (1999), Petasis et al. (2001) and Palmer and Day (1997) to perform NE disambiguation. Their approach can be divided into two stages: in the first stage, a large gazetteer of entities (a list of entities) is created, and in the second stage heuristics are used to identify and classify NEs in the given context of a document. They evaluated their system's performance against a basic supervised system using the MUC-7 NER corpus (Chinchor, 1998). The supervised system was able to achieve high precision but low recall, while the unsupervised system achieved higher recall at the cost of lower precision.

Semi-supervised and unsupervised approaches are useful when a large amount of training data is unavailable or difficult to obtain. A great deal of research on NER is being carried out across various languages, domains and textual genres (Nadeau and Sekine, 2007).
A supervised learning approach gives good performance in the presence of large collections of annotated data, while semi-supervised and unsupervised approaches promise fast deployment for many NE types without the prerequisite of an annotated corpus (Nadeau and Sekine, 2007).

2.4 Rule-based Approaches to Relation Extraction

Relation Extraction (RE) is the second integral part of any IE system, after the NE extraction task. Most rule-based approaches in IE rely on hand-written rules or dictionaries and do not learn from annotated examples. In this section we review a few of the well-known rule-based approaches employed in relation extraction.

2.4.1 AutoSlog

Riloff (1993) presented a system called AutoSlog to tackle the knowledge engineering bottleneck. AutoSlog is based on the idea of automatically constructing a "concept dictionary" for an information extraction task, and builds on the selective concept extraction method. Selective concept extraction is a form of extraction that selectively processes relevant texts while effectively ignoring irrelevant ones. CIRCUS, proposed by Lehnert (1990), is employed for shallow sentence analysis. In order to extract information from texts, CIRCUS depends on concept nodes, which are an integral part of the AutoSlog system. A concept node consists of a triggering lexical item, enabling conditions in the context, and a case frame. The AutoSlog algorithm employs a set of heuristics to determine which words and phrases are most likely to activate useful concept nodes, and assumes that the verb determines the role of the noun phrase (NP). The AutoSlog system requires human intervention in order to filter out bad concept node definitions wrongly introduced by the heuristics or by shallow parser failures. A dictionary for the domain of terrorist events (MUC-4) was constructed in only 5 person-hours using AutoSlog; evaluated against a manually built dictionary which had required approximately 1500 person-hours of effort, it achieved 98% of the manually built dictionary's performance.

2.4.2 PALKA

The PALKA (Parallel Automatic Linguistic Knowledge Acquisition) system, presented by Kim and Moldovan (1995), uses knowledge-based information from text for the automatic acquisition of linguistic patterns. PALKA uses an induction method to produce extraction rules as a pair of a meaning frame and a phrasal pattern, called a Frame-Phrasal pattern structure (FP-structure). Patterns are constructed using this FP-structure from training texts, and the acquired patterns are then generalised using an inductive learning mechanism. PALKA creates a new rule if the existing rules cannot be used, and then generalises it with the existing ones to include a new positive instance.

In the next sections (2.5 - 2.7), we look at various machine learning approaches to relation extraction. A good overview of machine learning approaches for relation extraction is provided by McDonald (2005) and Bach and Badaskar (2007).

2.5 Supervised Approaches to Relation Extraction

Supervised approaches to relation extraction rely on user involvement to provide training examples, from which extraction rules are induced. This section critically reviews supervised approaches to relation extraction; these systems use rule learning algorithms to automatically generate relation extraction patterns from annotated text corpora.

2.5.1 CRYSTAL

Soderland et al.
(1995) presented a system called CRYSTAL, based on the concept of automatically creating dictionaries to identify relevant information in a training corpus. The CRYSTAL system takes texts which have been processed by a syntactic parser. A domain expert is required to annotate the training documents, from which CRYSTAL learns extraction rules. Inductive learning is used to find similar rules and merge them together by finding the most restrictive constraints that cover both rules.

2.5.2 LIEP

Huffman (1996) presented the LIEP system, which learns dictionaries of extraction patterns directly from user-provided examples of texts and the events to be extracted from them. The LIEP system uses multi-slot rules for extraction; it lets the user identify events of interest in texts, as the system is based on the assumption that an annotated training corpus is difficult to obtain. LIEP tries to choose extraction patterns which maximise coverage of the positive examples. If a new example cannot be matched by a known pattern, LIEP attempts to generalise a known pattern to cover the example; if generalisation is not possible, a new pattern is constructed.

2.5.3 WHISK

Soderland (1999) presented the supervised learning system known as WHISK. WHISK uses a machine learning algorithm to induce regular expressions that are later used as extraction rules. A user annotates the events present in a set of sentences, and WHISK then learns rules from these examples. WHISK has two pre-processing stages: semantic classes, in which named entities are marked, and a chunking parse, in which each sentence is broken down into groups of words. The user annotates more sentences, and the rules which disagree with the new examples are rejected. Rules are learned for each sentence not covered by the existing rules, and this process continues until all sentences are covered.

2.5.4 GATE

Cunningham et al. (2002) presented GATE (General Architecture for Text Engineering), a graphical development environment enabling researchers and users to develop and deploy various language engineering components and resources. It contains many useful tools that can be used individually or together; ANNIE (A Nearly-New IE system) is one of them. ANNIE contains a tokeniser, a sentence splitter, a PoS tagger, a gazetteer, a finite state transducer, an orthomatcher and a coreferencer. In the first step, the tokeniser splits the text into tokens (e.g. words and punctuation). The sentence splitter then segments these tokens into sentences, and the PoS tagger annotates the tokens with their PoS tags. The gazetteer consists of lists of named entities (e.g. lists of cities, organisations, etc.). The finite state transducer (semantic tagger) contains handcrafted rules that describe the patterns to match and the annotations to be created as a result. The orthomatcher recognises relations between named entities, and the coreferencer finds identity relations between named entities in the text. GATE is quite user-friendly and has an easy-to-use environment which provides extensive annotation facilities to researchers; annotation can be done manually or semi-automatically by running processing resources over the corpus. GATE was first implemented as a rule-based system and was later supplied with the functionality to perform IE using supervised machine learning. GATE has provided researchers with a number of useful facilities for addressing a wide range of issues in NLP application development.
It is also quite robust and scalable.

2.6 Semi-supervised Approaches to Relation Extraction

In this section we critically review the semi-supervised approaches to relation extraction proposed so far.

2.6.1 AutoSlog-TS

Riloff (1996) presented an improved version of the AutoSlog system known as AutoSlog-TS. Experiments were conducted in three domains: terrorist events (MUC-4), and joint ventures and microelectronics (MUC-5), and the results were compared against the AutoSlog system. One of the drawbacks of the AutoSlog system is that it requires an annotated corpus, which is quite time-consuming and takes a huge amount of effort to produce. The main idea presented in this paper is that domain-specific expressions appear more often in relevant documents than in irrelevant ones. AutoSlog-TS does not require an annotated corpus; it only needs a classified corpus (relevant vs. non-relevant). AutoSlog-TS applies exhaustive processing: after a partial parse, it generates an extraction pattern for every noun phrase in the training corpus. This results in a large number of patterns being generated, which are then evaluated on the basis of co-occurrence statistics with the relevant sub-corpora. The user is involved in the process of judging the patterns' relevance, and patterns with a relevance score below 0.5 are discarded. The experiments were conducted in all three domains. The MUC-4 data consisted of 1500 documents (772 relevant); AutoSlog generated 1237 patterns, which were manually filtered down to 450 in 5 hours, while AutoSlog-TS generated 32,345 patterns, of which 11,225 relevant patterns were retained after filtering. The MUC-4 results were compared against those of AutoSlog and showed that AutoSlog obtained higher recall while AutoSlog-TS was able to achieve higher precision. Portability is a big issue in knowledge-based natural language processing systems, and AutoSlog-TS reduces user involvement in porting IE systems to a new domain: a human needs to provide texts classified as relevant and non-relevant, judge the resulting ranked list of patterns, and label the resulting patterns in order to specify which kinds of event they will generate.

2.6.2 Snowball: Extracting Relations from Large Plain-Text Collections

Agichtein and Gravano (2000) presented a semi-supervised relation extraction system known as Snowball, based on the Dual Iterative Pattern Expansion (DIPRE) algorithm (Brin, 1998). The Snowball system relies on a small set of seed examples and a general regular expression that the named entities must match in order to generate patterns from the text. Snowball patterns include named entity tags (e.g. <LOCATION>-based <ORGANISATION>), in contrast to DIPRE's untyped patterns (e.g. <STRING1>-based <STRING2>). In the Snowball system, patterns were generated by clustering patterns similar to the seed examples using a simple single-pass clustering algorithm. Pattern and tuple evaluation was an integral part of the system, which kept only those patterns and tuples with a high confidence score; the confidence score of a tuple is high if it was generated by several highly selective patterns. The Snowball system used a newswire corpus in its experiments; the training collection consisted of 178,000 documents, while the test collection consisted of 142,000 documents. The Snowball system was able to achieve higher precision and recall scores than DIPRE. Portability is one of the major advantages of the Snowball system, as it requires only a handful of seed examples for each new scenario.
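To illustrate what a Snowball-style, NE-typed pattern looks like in operation, the following minimal sketch applies the pattern <LOCATION>-based <ORGANISATION> to text whose entities have already been tagged. The tagging format, the example text and the omission of confidence scoring are all simplifications for illustration.

# A minimal sketch of applying an NE-typed surface pattern of the kind
# Snowball induces. The input is assumed to be pre-tagged by an NE tagger;
# each match yields an (organisation, location) tuple.
import re

tagged = ("... <LOCATION>Redmond</LOCATION>-based "
          "<ORGANISATION>Microsoft</ORGANISATION> announced ...")

PATTERN = re.compile(
    r'<LOCATION>(?P<location>.+?)</LOCATION>-based '
    r'<ORGANISATION>(?P<organisation>.+?)</ORGANISATION>'
)

for m in PATTERN.finditer(tagged):
    print((m.group('organisation'), m.group('location')))   # ('Microsoft', 'Redmond')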
2.6.3 Dependency Tree based Pattern Models

In this section, we discuss various dependency tree based pattern models for relation extraction and compare them.

2.6.3.1 SVO Model

Yangarber et al. (2000) presented the SVO (Subject-Verb-Object) model. The motivation behind this approach is to minimise the manual labour required to construct pattern bases for new domains, by using unannotated, unclassified text and unsupervised learning. The system learns extraction patterns using dependency parsing and pattern evaluation scores. The patterns used are tuples consisting of four elements: subject, verb, object, and a phrase referring to either the subject or the object. According to the presented approach, good patterns are strong indicators of relevant documents. The system starts with a large corpus of documents and a set of useful extraction patterns known as seeds. These patterns are then used to divide the corpus into relevant and irrelevant documents: relevant documents are those matched by one or more patterns, while irrelevant documents are those matched by none. The patterns which occur more frequently in the relevant documents are selected and added to the seeds. Documents which match the seed patterns are given a score of 1 and all others 0, and the following formula is used to compute the score of each candidate pattern:

Score(p) = (|H ∩ R| / |H|) · log(|H ∩ R|)

where H is the set of documents matched by the pattern p and R is the set of relevant documents. Using this formula, the highest-scoring pattern is added to the set of accepted patterns. The corpus is first pre-processed to identify named entities, and the Connexor parser (www.connexor.com) is then employed for parsing. The MUC-6 management succession task is used to test the system, with the following seed patterns:

COMPANY-{appoint, elect, promote, name}-PERSON
PERSON-{resign, depart, quit, step-down}

The patterns produced by the system cannot be used directly for extraction, so it is difficult to apply the MUC-6 approach to evaluation. Evaluation is therefore based on how accurately the patterns match relevant documents and avoid matching irrelevant ones. A corpus consisting of 100 MUC-6 test documents and 150 documents randomly chosen from the main corpus was used for this purpose. The main advantage of the presented system is that it offers an unsupervised approach without any need for annotated examples. The disadvantage is that the patterns cannot be used directly for an RE task, so the system can only be evaluated on a text filtering task rather than on extraction.

2.6.3.2 Chain Model

Sudo et al. (2001) presented a tree-based pattern representation approach in which a pattern is represented as a path in the dependency tree of a sentence. The previous approaches described in Riloff (1996) and Yangarber et al. (2000) are based on the common assumption that relevant documents contain good patterns, and both rely on the sentence structure of English; they fail in the case of free word order languages such as Japanese. This paper offers an alternative approach to the automatic acquisition of patterns. In the first stage, a morphological analyser and an NE tagger are employed to pre-process the documents. The second stage retrieves the relevant document set, from which the relevant sentence set is extracted.
Finally, all the sentences in the relevant sentence set are parsed, and the system takes those paths whose frequency is higher than a certain threshold as extracted patterns. The Mainichi-Newspaper-95 and Mainichi-Newspaper-94 corpora are used for training and testing the system respectively. The system achieves quite a low recall; moreover, this pattern representation may not be able to represent pattern context adequately either.

2.6.3.3 Subtree Model

Sudo et al. (2003) describe the limitations of the previous two extraction pattern models (Yangarber et al., 2000 and Sudo et al., 2001) and present a new subtree model based on subtrees of the dependency tree; their evaluation shows that the proposed model outperforms the previous models. The SVO model (Yangarber et al., 2000) is based upon the direct syntactic relation between a predicate and its arguments, and this pattern representation is limited in what it can extract from a sentence. The chain model (Sudo et al., 2001) pattern representation may not be able to represent the context of a pattern adequately. The subtree model is a generalisation of the two abovementioned models: any subtree of a dependency tree can be regarded as an extraction pattern candidate, so the model subsumes all of the patterns proposed by the previous two models. The experiments are conducted using two sets of Japanese texts: a management succession scenario and a murder/arrest scenario. The process of obtaining extraction patterns consists of the following three stages: pre-processing, document retrieval and ranking of candidate patterns. Patterns for each model are generated and ranked, with the following formula used for the ranking of subtree patterns:

score_i = tf_i · (log(N / df_i))^β

where:

- tf_i is the frequency of pattern i in the relevant documents
- df_i is the number of documents containing pattern i
- N is the total number of documents in the collection
- β is used to control the weight on the df_i portion

The advantage of the subtree model is that it allows more varied context to be captured and can extract more specific scenario patterns, while its disadvantage is the additional complexity of processing a large number of patterns.

2.6.3.4 Linked Chain Model

Greenwood et al. (2005) presented a novel approach which makes use of more complex pattern models than previous approaches. The paper presented a new pattern model, the 'Linked Chain Model', which is an extension of the chain model (see Section 2.6.3.2): it joins pairs of chains which share a common verb root but no direct descendants. The motivation behind this approach is that language is frequently used to articulate the same information in different ways, so the approach learns patterns automatically by identifying patterns with meanings similar to a set of seed patterns. In order to extract patterns from the corpora, the paper uses a weakly supervised bootstrapping method similar to Yangarber (2003), which learns patterns from a corpus based upon their similarity to seed patterns. Learned patterns are ranked by an iterative algorithm which compares each candidate pattern against the centroid vector of the currently accepted patterns; the four highest-scoring patterns in each iteration are then added to the accepted patterns.

2.6.3.5 A Semantic Approach to IE Pattern Induction

Stevenson and Greenwood (2005) presented an alternative approach to Yangarber et al. (2000) for learning IE patterns.
The approach is based on the assumption that patterns with similar meanings can be expected to be valuable for extraction, and the algorithm presented in the paper performs well when compared with the previously reported document-centric approach. The approach uses an iterative learning algorithm for pattern learning: it starts with a set of seed patterns known to be useful extraction patterns, compares every other pattern with those acknowledged to be good, then selects the highest-scoring of these and adds them to the set of good patterns. This process continues until enough patterns have been learned. The approach is evaluated using two evaluation regimes: document filtering and sentence filtering. Document filtering involves distinguishing relevant documents from irrelevant ones, while sentence filtering evaluates how accurately the generated patterns can distinguish between relevant and non-relevant sentences. The results produced by this approach are much superior to those produced by Yangarber et al. (2000). However, the approach fails to represent events which cannot be described by an SVO structure, so a more expressive model is required.

2.6.3.6 Comparing IE Models

Stevenson and Greenwood (2006) compared the four previously reported pattern models based on dependency trees and evaluated them using three different dependency parsers. The results of the experiments conducted in this paper show that the linked chain pattern model performs better than the other models. The choice of pattern model is very important for any extraction task: the model should be expressive enough to extract the required information accurately from a dependency tree. The SVO model (see Section 2.6.3.1) uses subject-verb-object tuples from the dependency tree as extraction patterns and is unable to represent linguistic constructions such as nominalisations and prepositional phrases. The chain model (see Section 2.6.3.2) has the ability to represent information expressed as a nominalisation or prepositional phrase, but is unable to represent sentences containing transitive verbs and also fails to represent the context of a pattern adequately. The linked chain model (see Section 2.6.3.4) is able to encode the information represented by both the SVO and chain models collectively. The subtree model (see Section 2.6.3.3) is richer in terms of information representation than the abovementioned models, but it produces too many patterns, which are an uphill task to compute, and so adds additional complexity. The experiments were conducted on newspaper texts and biomedical texts using three dependency parsers, in order to find suitable pattern representation models for encoding the information of interest to IE systems: MINIPAR (http://webdocs.cs.ualberta.ca/~lindek/minipar.htm) (Lin, 1999), the Machinese Syntax parser (www.connexor.com/software/syntax/) (Tapanainen and Järvinen, 1997) and the Stanford parser (http://www-nlp.stanford.edu/software/lex-parser.shtml) (Klein and Manning, 2003). The SVO and chain models performed poorly and provided less coverage, while the linked chain model achieved a bounded coverage of 95%, which means that it can represent the majority of the relations present in the dependency tree. Stevenson and Greenwood (2009) presented an analysis of the various models' performance on two different textual domains: management succession and biomedical text. Their analysis reveals wide variation in the models' performance.
In this paper, each pattern model was analysed in terms of its ability to represent relevant information, the number of patterns generated, and performance on an IE scenario. The experimental results showed that the linked chain model's performance is quite promising compared to the other pattern models.

2.7 Unsupervised Approaches to Relation Extraction

In this section, we review a few of the most recent unsupervised approaches to relation extraction.

Hasegawa et al. (2004) presented an unsupervised approach for the discovery of relations among named entities in a newspaper domain. Their approach employed a clustering technique to cluster named entity pairs according to the similarity of the context words intervening between them. The relation discovery process was based on the assumption that pairs of named entities co-occurring in similar contexts can be grouped together in a cluster. After NER, two named entities are considered to co-occur if they appear within the same sentence and are separated by at most N intervening words. A vector space model and cosine similarity were employed to calculate the similarities between the sets of contexts of named entity pairs. The approach used a maximum of 5 context words between named entities and set a frequency threshold of 30 for co-occurring named entity pairs. The presented approach was able to achieve good precision and recall, but one of its drawbacks is that, because of the high frequency threshold, the system was unable to discover some valuable relations.

Sekine (2006) and Shinyama and Sekine (2006) presented two unsupervised approaches to IE known as 'On-demand IE' and 'Pre-emptive IE' respectively. The basic motive behind both approaches is to identify the most salient relations in documents and extract information on user demand by employing unsupervised learning methods. The on-demand IE system (Sekine, 2006) extracts salient relations from text based on a user query and builds tables from these extracted relations using paraphrase discovery technology. The system makes use of recent advances in pattern discovery, paraphrase discovery and extended NE tagging. It uses a newspaper corpus, retrieves relevant documents based on a user query, and then applies a PoS tagger, a dependency analyser and an extended NE tagger to extract patterns from the relevant documents. The extracted patterns are then arranged into sets of similar patterns by applying paraphrase recognition, and a table is created for each pattern set containing more than two patterns. Shinyama and Sekine (2006) (pre-emptive IE) apply NER, coreference resolution and parsing to a newspaper corpus in order to extract relations between NEs. The approach uses unrestricted relation discovery to discover all possible relations in the texts and present them as tables. In unrestricted relation discovery, the relations appearing repeatedly in a corpus are extracted automatically (without human intervention). The extracted relations are grouped into pattern tables of NE pairs expressing the same relation, using clustering to group semantically similar relations.

Etzioni et al. (2008) presented an unsupervised approach to RE using the Web as a corpus.
Their approach used a huge corpus of 9 million web pages to automatically extract all relations between noun phrases. The main contribution of this work is an open RE system known as TEXTRUNNER, which consists of three key modules: a self-supervised learner, a single-pass extractor and a redundancy-based assessor. The self-supervised learner module produces a classifier from a small sample corpus without any hand-tagged data; this classifier labels candidate extractions as 'trustworthy' or not. The single-pass extractor module makes a single pass over the whole corpus to extract tuples of all possible relations. These extracted tuples are then sent to the classifier and only those labelled as trustworthy are kept. The redundancy-based assessor module assigns a probability score to each trustworthy tuple based on a probabilistic model of redundancy in text (Downey et al., 2005). The experimental results reported in this paper show that TEXTRUNNER achieves a 33% relative error reduction for a comparable number of extractions when compared with the state-of-the-art Web RE system KNOWITALL (Etzioni et al., 2005). Moreover, TEXTRUNNER was able to achieve higher precision than KNOWITALL.

Eichler et al. (2008) presented an unsupervised RE system (IDEX) which automatically extracts information regarding an input topic provided by the user. The documents relevant to the given topic are retrieved and the extracted relations are clustered in an unsupervised way. IDEX employs LingPipe (http://alias-i.com/lingpipe/) for sentence boundary detection, NER and coreference resolution. IDEX only considers sentences containing at least two NEs for relation extraction. These selected sentences are then parsed using the Stanford parser (http://nlp.stanford.edu/). IDEX then extracts all the verb relations, i.e. for each verb its subject(s), object(s), preposition(s) with arguments and auxiliary verb(s), and it keeps only those verb relations where at least the subject or object is an NE. Extracted relations are grouped into relation clusters based on their similarity. The experiments used the Berlin Central Station corpus, comprising 1068 web pages downloaded from Google and consisting of 55255 sentences, from which 10773 relation instances were automatically extracted and clustered. The system produced 306 clusters, of which 121 were deemed consistent (i.e. all instances in the cluster express similar relations), 35 partly consistent and 69 not consistent.

2.8 Relation Extraction in the Biomedical Domain

There is a large body of research dedicated to the problem of extracting relations from general-domain texts, and from biomedical texts in particular. BioNLP (http://www.bionlp.org/) has played a great role in biomedical research by providing a platform with useful resources for the research community. Most previous approaches have been supervised and have tried both to extract relations and to assign labels describing the semantic types of the relations (Cunningham et al., 2002; Zelenko et al., 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2006 among many others). These approaches require a manually annotated corpus, which is very laborious and time-consuming to produce (see Section 2.5). Semi-supervised and unsupervised approaches rely on seed patterns and/or examples of specific types of relations (see Section 2.6 and Section 2.7).
As is known from the literature, RE in the biomedical domain is quite difficult compared to other domains, such as the news domain, due to the inherently complex nature of biomedical text (e.g. Cohen and Hersh, 2005). As sentences in the biomedical domain are syntactically complex, the RE phase depends upon the correct identification of the NEs and the correct analysis of the linguistic constructions expressing relations between them. In the biomedical domain, most work has focused on fully supervised or semi-supervised approaches. For example, Wong (2001) used templates to determine protein-protein interactions from biomedical text. Most of the supervised approaches relied on regular expressions to learn patterns, while semi-supervised approaches exploited pre-defined seed patterns and cue words (Ananiadou and McNaught, 2006).

Blaschke et al. (1999) presented a system for the automatic detection of protein-protein interactions from scientific abstracts. Their approach relies on pre-specified protein names and a set of verbs that represent the actions; the paper does not report any precision or recall scores.

Ono et al. (2001) presented a system for the extraction of protein-protein interactions from the biomedical literature. The system employed sets of regular expression rules and cue words ("interact", "bind", etc.) along with a protein name dictionary to extract the relations between pairs of proteins. The system achieved a high performance, with a precision of 94% and a recall of 85%. One of the shortcomings of this approach is its inability to deal with complex sentences that distance a subject or object from a verb.

Huang et al. (2004) presented a data-driven approach for the extraction of protein-protein interactions from the biomedical literature. Their approach employed a dynamic programming algorithm along with a protein dictionary in order to compute distinguishing patterns by aligning relevant sentences and key verbs that describe protein-protein interactions. Their system was able to attain a precision of 80.5% and a recall of 80%.

Corney et al. (2004) describe a system known as BioRAT which constructs templates using a set of regular expressions, part-of-speech tags, gazetteer categories, literal strings and words. BioRAT is designed to give people a powerful tool for locating and analysing research papers: it plays the role of a research assistant by finding documents relevant to a given query and automatically highlighting the salient facts in each document. BioRAT achieved a precision of 55.7% and a recall of 20.3% on biomedical abstracts, and precision and recall scores of 51.25% and 43.6% respectively on full-length papers.

Martin et al. (2004) presented another approach based on pattern matching, which extracted protein-protein interactions using a number of dictionaries containing protein names and their synonyms, protein interaction verbs and their synonyms, and common strings that are helpful in the identification of unknown proteins.

Fundel et al. (2007) developed a tool known as RelEx to extract biomedical relations (protein-gene interactions) from free text in the biomedical literature. The tool is based on dependency trees along with rules to process these trees. For NER of gene and protein names it employs a synonym dictionary (Fundel and Zimmer, 2006), while a list of restriction terms is used to specify the relations of interest in the text.
RelEx extracted relations from dependency trees by extracting paths connecting pairs of proteins while making sure that these paths contain relevant terms describing the relation between the given pair of proteins. RelEx was evaluated on a comprehensive set of one million MEDLINE abstracts dealing with gene and protein relations and was able to attain 80% precision and 80% recall.

All of the aforementioned approaches mostly rely on pattern matching and require a large number of patterns in order to extract the desired information. Overall, there has been little work on fully unsupervised approaches to RE, ones that would be able to locate significant relations in a particular collection of texts. Semi-supervised approaches, while offering considerable savings on the preparation of training data, are still limited to pre-defined types of relations that have to be instantiated in either seed extraction patterns, seed pairs of related named entities, or annotated examples. Relation Extraction in the biomedical domain has been addressed primarily with either supervised approaches or those based on manually written extraction rules, which are rather inadequate in scenarios where the relation types of interest are not known in advance.

2.9 Use of Web as a corpus

The Web is the largest possible source of free textual data, containing hundreds of billions of words in various languages and consistently growing at a rapid pace. Presently, many researchers use the Web (e.g. via WebCorp: http://www.webcorp.org.uk/) as a data source in their research, as it enables them to tackle the data sparseness bottleneck in various NLP applications. Kilgarriff and Grefenstette (2003) shed light on the use of the Web as a corpus for many NLP applications, arguing that having large amounts of data can improve performance more than fine-tuning algorithms. Manning and Schütze (1999) likewise suggested that having a large amount of training data (as is the case with a Web corpus) is very useful for many statistical NLP applications. Algorithms using the Web as a corpus have been successful at many linguistic tasks and have frequently surpassed sophisticated methods based on traditional corpora (e.g. Turney, 2001; Keller and Lapata, 2003).

The Web is a huge source of information and has had a huge impact on the field of NLP, but it has its drawbacks too. One main drawback of using the Web as a corpus is that, alongside useful text, it also contains a lot of useless material. Another disadvantage is that it is impossible to replicate an experiment exactly at a later time, as the Web is constantly in flux and growing at a rapid pace. Apart from redundancy, one of the other main criticisms of using the Web as a corpus is that it is not balanced as an ideal or traditional corpus should be, and as a result the data obtained from a Web corpus might not be representative. On the other hand, Kilgarriff and Grefenstette (2003) argued that no corpus is completely balanced and representative.

The Web is also quite frequently used by many researchers in the area of IE. Brin (1998) presented an approach known as Dual Iterative Pattern Relation Extraction (DIPRE) which extracted relations (book titles and authors) from the Web automatically or with minimal human intervention. Due to the progress made in computer hardware, many IE researchers have used unsupervised approaches based on the Web, e.g.
Sekine (2006), Shinyama and Sekine (2006), Banko et al. (2007) and Eichler et al. (2008) (see Section 2.7 for further details). Mukherjea and Sahay (2006) used the Web in order to automatically discover biomedical relations; their approach relied on retrieving relevant information from web search engines by employing various lexico-syntactic patterns as queries. In our research, we will carry out our experiments using traditional corpora as well as a corpus collected from the Web and will compare the results obtained from these corpora.

2.10 Summary

In this chapter, we have discussed the various approaches presented so far for automatically generating multiple choice test items. We also elaborated the concept of IE, its subtasks, its main components, its evaluation, approaches to building IE systems and its applications in the real world. IE has two main components: NER and RE. We have described various supervised, semi-supervised and unsupervised approaches for each component. We also looked at the various dependency-tree-based pattern models and compared these models. At the end of the chapter, we discussed the growing trend of using the Web as a corpus, and its advantages and disadvantages.

Chapter 3: Stem Sentences Selection via IE

In this chapter, we will discuss the IE component of our system (see Figure 2 in Section 1.7). We will investigate two unsupervised approaches (surface-based and dependency-based) to Relation Extraction to be applied in the context of automatic generation of multiple-choice questions (MCQs). Our assumption for Relation Extraction is that relations hold between Named Entities mentioned in the same sentence, and that the presence or absence of a relation is independent of the text preceding or following the sentence. This implies that only information obtained from sentences containing the two Named Entities is relevant for Relation Extraction. In the surface-based approach, we will examine three different surface pattern types, each implementing different assumptions about the linguistic expression of semantic relations between Named Entities, while in the dependency-based approach we will explore how dependency relations based on dependency trees can be helpful in extracting relations between Named Entities. We will evaluate both approaches in terms of precision, recall and F-score. Our experiments make use of traditional corpora along with a similar corpus collected from the Web. At the end of this chapter, we will compare the surface-based and dependency-based approaches.

3.1 Unsupervised Surface-based Patterns

The approach aims to identify the most important semantic relations in a document without assigning explicit labels to them, in order to ensure broad coverage, unrestricted to predefined types of relations, which is particularly important in the context of testing learners' familiarity with learning material. Our main findings indicate that the approach is capable of achieving high precision scores and that its enhancement with linguistic knowledge helps to produce significantly improved patterns. The intended application for the proposed method is in the context of an e-Learning system for the automatic assessment of students' comprehension of training texts; however, it can also be applied to other NLP scenarios where it is necessary to recognise important semantic relations without any prior knowledge as to their types.
Information Extraction (IE) is an important problem in many information access applications. As mentioned in Chapter 2, Named Entity Recognition (NER) and Relation Extraction (RE) are the two integral components of any IE system. The first step is the identification of the NEs present in the text; these NEs will differ depending on the nature of the text and the intended application. Following the identification of NEs, the next step is the RE phase, whose goal is to identify all instances of specific semantic relations between NEs of interest in the text. For this purpose, RE patterns are used to recognise and label these relations.

3.1.1 Our Approach

The main advantage of our approach (Afzal and Pekar, 2009) is that it can cover a potentially unrestricted range of semantic relations, while supervised and semi-supervised approaches (see Section 2.5 and Section 2.6) can learn to extract only those relations that have been exemplified in annotated text, seed patterns or seed named entities. Moreover, our approach is well suited to situations where a lot of unannotated text is available, as it does not require manually annotated text or seeds. Such an approach can be useful, specifically, in applications such as Multiple-Choice Question generation (Mitkov et al., 2006; see Section 2.1) or a pre-emptive approach in which viable IE patterns are created in advance without human intervention (Shinyama and Sekine, 2006; Sekine, 2006; see Section 2.7). Figure 3 shows the overall architecture of our approach. We elaborate on the NER process in Section 3.1.2; Section 3.1.3 explains the process of candidate pattern extraction; Section 3.1.4 describes various information-theoretic measures and statistical tests for ranking patterns according to their association with a domain corpus; Section 3.1.5 discusses the evaluation procedures; and the experimental results are discussed in Section 3.1.6.

[Figure 3: Relation Extraction approach — a pipeline leading from an unannotated corpus through Named Entity Recognition, Extraction of Candidate Patterns and Pattern Ranking to Evaluation and the resulting Semantic Relations]

We will employ this approach for the automatic generation of MCQs, where it will be used to find relations and NEs in educational texts that are important for testing students' familiarity with key facts contained in the texts. In order to achieve this, we need an IE method that has high precision and at the same time works with unrestricted semantic types of relations (i.e. without reliance on seeds); recall is of secondary importance to precision.

3.1.2 NER and PoS Tagging of Biomedical Texts

Biomedical NER is generally considered to be more difficult than NER in other domains such as newswire text. There is a huge number of NEs in the biomedical domain and new ones are constantly added (Wilbur and Smith, 2007), which means that neither dictionaries nor the training data approach will be sufficiently comprehensive for NER. The volume of published biomedical research has expanded at a rapid rate in the recent past. MEDLINE (Medical Literature Analysis and Retrieval System Online; http://www.nlm.nih.gov/pubs/factsheets/medline.html) is the U.S. National Library of Medicine database containing over 18 million references to journal articles on biomedicine. MEDLINE is currently growing at a rate of over 600,000 new citations each year (http://www.nlm.nih.gov/bsd/stats/cit_added.html). PubMed (http://www.ncbi.nlm.nih.gov/pubmed/), a search engine, is used to access the MEDLINE content.
NER in the biomedical domain has been researched over the years through various challenges such as BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology; http://biocreative.sourceforge.net/) and shared tasks at conferences addressing the issues and evaluating the performance of various named entity recognition systems. Named entities (NEs) in the biomedical domain are expressed in various linguistic forms such as abbreviations, plurals, compounds, coordination, cascades, acronyms and apposition (Zhou et al., 2004). These various linguistic forms are exemplified in Table 2 (Ananiadou and McNaught, 2006).

Linguistic Form   Example Gene and Protein Names
Abbreviation      GLA
Plural            p38MPAKs, ERK1/2
Compound          Rpg1p/Tif32p
Coordination      91 and 84 kDa proteins
Cascade           kappa 3 binding factor (such that kappa 3 is a gene name)
Description       an inhibitor of p53
Acronym           Phospholipase D (PLD)
Apposition        PD98059, specific MEK1/2 inhibitor

Table 2: Example gene and protein names in various linguistic forms

One NE can be used to represent different concepts, which results in further ambiguity; for example, 'ferritin' can be a biological substance or a laboratory test. Moreover, many biological NEs have several names, e.g. 'PTEN' and 'MMACI' refer to the same gene, which in turn makes NER in the biomedical domain more difficult. Another problem is that authors frequently do not follow existing naming conventions, instead introducing their own abbreviations and using them throughout their papers (Chen et al., 2005). Moreover, NEs in the biomedical domain are on average much longer than NEs from other domains. It is generally much easier for both humans and automated systems to determine whether an NE is present than to detect its boundaries (Yeh et al., 2005), and case is not always a reliable indicator of sentence boundaries (e.g. a new sentence can start with a lowercase word in the biomedical domain). Yeh et al. (2005) also compared the length distribution of gene names with the length distribution of organisation names in the newswire domain; their results revealed that the average length of gene names was 2.09, compared to 1.69 for organisation names.

Due to the syntactic and semantic complexity of the biomedical domain, many IE systems have utilised tools (e.g. part-of-speech taggers, NER, parsers, ontologies) specifically designed and developed for the biomedical domain (e.g. Andrade and Valencia, 1998; Pustejovsky et al., 2001, 2002). Grover et al. (2005) presented a report investigating the suitability of current NLP resources for syntactic and semantic analysis in the biomedical domain. The GENIA tagger (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/) is a tool specifically designed for biomedical texts, which analyses English sentences and outputs base forms, part-of-speech tags, chunk tags and NE tags. The GENIA part-of-speech tagger is trained on a general domain corpus (the Wall Street Journal corpus) as well as the GENIA corpus and the PennBioIE corpus (Kulick et al., 2004). As a result, the GENIA part-of-speech tagger is able to handle various kinds of biomedical text and achieves a very high accuracy on biomedical text. Table 3 shows the tagging accuracies of taggers trained on different data sets (Tsuruoka et al., 2005; Tsuruoka and Tsujii, 2005).
                                       WSJ corpus   GENIA corpus
A tagger trained on the WSJ corpus     97.05%       85.19%
A tagger trained on the GENIA corpus   78.57%       98.49%
GENIA tagger                           96.94%       98.26%

Table 3: Tagging accuracies

The GENIA tagger produces output in the following format:

    word1  base1  POStag1  chunktag1  NEtag1
    word2  base2  POStag2  chunktag2  NEtag2
    ...

The tagger represents the chunk tags in the IOB format (B for BEGIN, I for INSIDE and O for OUTSIDE). The NE tagger is designed to recognise mainly the following named entities: protein, DNA, RNA, cell_type and cell_line. The NE tagger is trained on the NLPBA data set (http://research.nii.ac.jp/~collier/workshops/JNLPBA04st.htm), a shared task on biomedical NE recognition held from March to April 2004, whose main objective was to identify and classify terms in bio-molecular biology corresponding to instances of concepts of particular interest to biologists. Table 4 shows the performance of the GENIA NE recogniser (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/).

Entity Type   Precision   Recall   F-score
Protein       65.82       81.41    72.79
DNA           65.64       66.76    66.20
RNA           60.45       68.64    64.29
Cell Line     56.12       59.60    57.81
Cell Type     78.51       70.54    74.31
Overall       67.45       75.78    71.37

Table 4: GENIA NER performance

3.1.3 Extraction of Candidate Patterns

Our general approach to the discovery of interesting extraction patterns consists of two main stages: (i) the construction of potential patterns from an unannotated domain corpus and (ii) their relevance ranking.

3.1.3.1 Linguistic types of patterns

Once the training corpus has been tagged with the GENIA tagger, the pattern building process takes place. Its goal is to identify which NEs are likely to be semantically related to each other. The procedure for constructing candidate patterns is based on the idea that important semantic relations are expressed with the help of recurrent linguistic constructions, and that these constructions can be recognised by examining sequences of content words (nouns, verbs, adjectives and adverbs) appearing between NEs. Semantic patterns are widely used in the area of IE, and as in IE we are interested in the extraction of semantic classes of objects (NEs), the relationships among these NEs and the events in which these entities participate. To find such constructions, we impose a limit on the number of content words intervening between the two NEs. We experimented with different thresholds and finally settled on a minimum of one and a maximum of three content words between two NEs. The reason for introducing this condition is that if there are no content words between two NEs then, although some relation might exist between them, it is likely to be a very abstract grammatical relation. For example, in "X of Y" there is a relation between X and Y, but the phrase does not explicitly express any domain-specific knowledge. On the other hand, if there are too many content words intervening between two NEs, then it is likely they are not related at all. We build patterns using this approach and store each pattern along with its frequency in a database. In extracted patterns, lexical items are represented in lowercase while semantic classes are capitalised: for example, in the pattern "PROTEIN encode PROTEIN", encode is a lexical item while PROTEIN is a semantic class.
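To make this construction procedure concrete, the sketch below shows one way the candidate patterns could be built from GENIA-style tagger output. It builds the simplest (untagged) variant; the tagged variants described next differ only in how each lemma is rendered. The token representation, tag prefixes and the one-token-per-NE simplification are illustrative assumptions rather than the exact implementation.

```python
# A minimal sketch of candidate surface-pattern construction, assuming each
# sentence is a list of (word, lemma, pos, ne_tag) tuples as produced by a
# GENIA-style tagger, and (for brevity) that each NE spans a single token.
from collections import Counter

CONTENT_POS = ("NN", "VB", "JJ", "RB")   # nouns, verbs, adjectives, adverbs

def is_content(pos):
    return pos.startswith(CONTENT_POS)

def extract_candidates(sentence):
    """Yield 'NE_CLASS lemma ... NE_CLASS' patterns for each consecutive NE
    pair separated by one to three content words."""
    nes = [(i, tag) for i, (_, _, _, tag) in enumerate(sentence) if tag != "O"]
    for (i, class1), (j, class2) in zip(nes, nes[1:]):
        between = [lemma for (_, lemma, pos, _) in sentence[i + 1:j]
                   if is_content(pos)]
        if 1 <= len(between) <= 3:              # the 1..3 content-word window
            yield " ".join([class1.upper(),
                            *[w.lower() for w in between],
                            class2.upper()])

def build_pattern_database(corpus):
    """Count every candidate pattern over the whole tagged corpus."""
    counts = Counter()
    for sentence in corpus:
        counts.update(extract_candidates(sentence))
    return counts
```

The passive-to-active normalisation and stop-word filtering described later in this section would then be applied before the patterns are stored.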
In this chapter we describe experiments with different surface pattern types, each implementing different assumptions about the linguistic expression of semantic relations between named entities, first without prepositions and then with prepositions included. In the first phase of experiments we consider the following surface pattern types without prepositions:

• Untagged word patterns
• PoS-tagged word patterns
• Verb-centred patterns

Untagged word patterns consist of NEs and their intervening content words. Some examples of the most frequent untagged word patterns from the GENIA corpus, along with their frequencies, are shown in Table 5.

Patterns                       Frequency
PROTEIN activation PROTEIN     53
DNA contain DNA                46
PROTEIN include PROTEIN        43
PROTEIN bind DNA               39
PROTEIN as well PROTEIN        37
PROTEIN expression PROTEIN     35
PROTEIN activate PROTEIN       32
CELL_TYPE express PROTEIN      31
PROTEIN expression CELL_TYPE   29
PROTEIN induce PROTEIN         29

Table 5: Untagged word patterns along with their frequencies

PoS-tagged word patterns additionally contain the PoS of each content word. Table 6 shows examples of the most frequent PoS-tagged word patterns from the GENIA corpus along with their frequencies.

Patterns                         Frequency
PROTEIN activation_n PROTEIN     53
DNA contain_v DNA                46
PROTEIN include_v PROTEIN        43
PROTEIN bind_v DNA               39
PROTEIN as_a well_a PROTEIN      37
PROTEIN expression_n PROTEIN     35
PROTEIN activate_v PROTEIN       32
CELL_TYPE express_v PROTEIN      31
PROTEIN expression_v CELL_TYPE   29
PROTEIN induce_v PROTEIN         29

Table 6: PoS-tagged word patterns along with their frequencies

Verb-centred patterns are patterns in which the presence of a verb is compulsory. We require the presence of a verb because verbs are the main predicative class of words, typically expressing specific semantic relations between the nouns used as their arguments. Table 7 shows some of the most frequent verb-centred patterns from the GENIA corpus along with their frequencies.

Patterns                      Frequency
DNA contain_v DNA             46
PROTEIN include_v PROTEIN     43
PROTEIN bind_v DNA            39
PROTEIN activate_v PROTEIN    32
CELL_TYPE express_v PROTEIN   31
PROTEIN induce_v PROTEIN      29
DNA encode_v PROTEIN          27
CELL_LINE express_v PROTEIN   20
PROTEIN involve_v PROTEIN     18
PROTEIN bind_v PROTEIN        18

Table 7: Verb-centred patterns along with their frequencies

Moreover, in the pattern building phase, patterns containing the passive form of the verb, such as:

PROTEIN be_v express_v CELL_TYPE

are converted into the active voice form:

CELL_TYPE express_v PROTEIN

Because such patterns were taken to express a similar semantic relation between NEs, passive-to-active conversion was carried out in order to relieve the problem of data sparseness: it helped to increase the frequency of unique patterns and reduce the total number of patterns. For the same reason, negation expressions (not, does not, etc.) were also removed from the patterns, as they express a semantic relation between NEs equivalent to the one expressed in patterns where the negation particle is absent. In addition, patterns containing only stop-words (a list of English stop-words common in IR) were filtered out; Table 8, after the sketch below, shows a few examples of stop-word patterns which were filtered out during candidate pattern construction.
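Before turning to the stop-word examples in Table 8, the following rough sketch illustrates the passive-to-active and negation normalisation just described. The pattern representation (a list of tokens with PoS suffixes) and the simple passive test are simplifying assumptions; a real implementation would need fuller auxiliary and participle checks.

```python
# A rough sketch of pattern normalisation: strip negation tokens, then swap
# the NEs around the verb in simple 'NE1 be_v VERB_v NE2' passives.
NEGATIONS = {"not", "n't", "never", "no"}

def normalise(pattern_tokens):
    """pattern_tokens, e.g. ['PROTEIN', 'be_v', 'express_v', 'CELL_TYPE']."""
    tokens = [t for t in pattern_tokens if t.split("_")[0] not in NEGATIONS]
    # Treat 'NE1 be_v VERB NE2' as a passive and convert to active voice
    if len(tokens) == 4 and tokens[1] == "be_v" and tokens[2].endswith("_v"):
        tokens = [tokens[3], tokens[2], tokens[0]]
    return " ".join(tokens)

assert normalise(["PROTEIN", "be_v", "express_v", "CELL_TYPE"]) == \
       "CELL_TYPE express_v PROTEIN"
```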
DNA through PROTEIN
PROTEIN such as PROTEIN
PROTEIN with PROTEIN in CELL_TYPE
PROTEIN be same in CELL_LINE
PROTEIN against PROTEIN

Table 8: Patterns only containing stop-words

3.1.4 Pattern Ranking

After candidate patterns have been constructed, the next step is to rank the patterns according to their significance in the domain corpus. The ranking method we use requires a general corpus that serves as a source of examples of the use of the patterns in domain-independent texts. To extract candidates from the general corpus, we treated every noun as a potential named-entity holder, and the candidate construction procedure described above was applied to find potential patterns of the three different types in the general corpus. Some of these ranking methods have been used for classifying words according to their meanings (Pekar et al., 2004), but to our knowledge this approach is the first to explore these ranking methods for ranking IE patterns. We adopted them because they are more appropriate for our unsupervised RE approach than the pattern ranking method used by semi-supervised approaches (Yangarber et al., 2000; Sudo et al., 2001; Sudo et al., 2003), where tf-idf is used to iteratively collect IE patterns and relevant documents from a collection of relevant and irrelevant documents.

In order to score candidate patterns for domain relevance, we measure the strength of association of a pattern with the domain corpus as opposed to the general corpus. The patterns are scored using the following methods for measuring the association between a pattern and the domain corpus:

• Information Gain (IG)
• Information Gain Ratio (IGR)
• Mutual Information (MI)
• Normalised Mutual Information (NMI)
• Log-likelihood (LL)
• Chi-Square (CHI)

These association measures were included in the study as they have different theoretical principles behind them: IG, IGR, MI and NMI are information-theoretic concepts, while LL and CHI are statistical tests of association. A short computational sketch of the four information-theoretic measures follows their definitions below.

Information Gain measures the amount of information obtained about the domain specialisation of corpus c, given that pattern p is found in it:

\[ IG(p,c) = \sum_{d \in \{c,c'\}} \sum_{g \in \{p,p'\}} P(g,d) \log \frac{P(g,d)}{P(g)\,P(d)} \]

where p is a candidate pattern, c the domain corpus, p' a pattern other than p, c' the general corpus, P(c) the probability of c in the "overall" corpus {c, c'}, and P(p) the probability of p in the overall corpus.

Information Gain Ratio aims to overcome one disadvantage of IG, namely that IG grows not only with the increase of the dependence between p and c, but also with the increase of the entropy of p. IGR removes this factor by normalising IG by the entropy:

\[ IGR(p,c) = \frac{IG(p,c)}{-\sum_{g \in \{p,p'\}} P(g) \log P(g)} \]

Pointwise Mutual Information has traditionally been used in statistical NLP to measure the association between two linguistic phenomena, such as the elements of a multiword unit. Pointwise MI between corpus c and pattern p measures how much information the presence of p contains about c, and vice versa:

\[ MI(p,c) = \log \frac{P(p,c)}{P(p)\,P(c)} \]

Mutual Information has a well-known problem of being biased towards infrequent events. To tackle this problem, we normalised the MI score by a discounting factor, following the formula proposed in Lin and Pantel (2002).
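The sketch below shows how these four information-theoretic scores can be computed from a pattern's frequencies in the two corpora. The probability estimates and, in particular, the exact form of the Lin-and-Pantel-style discount are assumptions of this sketch, not a reproduction of the thesis implementation.

```python
import math

def probabilities(f_pc, f_pg, n_c, n_g):
    """Joint and marginal probabilities for pattern p over the combined
    corpus {c, c'}; f_* are the pattern's frequencies, n_* the corpus sizes."""
    n = n_c + n_g
    p_joint = {("p", "c"): f_pc / n, ("p", "c'"): f_pg / n,
               ("p'", "c"): (n_c - f_pc) / n, ("p'", "c'"): (n_g - f_pg) / n}
    p_g = {"p": (f_pc + f_pg) / n, "p'": (n - f_pc - f_pg) / n}
    p_d = {"c": n_c / n, "c'": n_g / n}
    return p_joint, p_g, p_d

def information_gain(f_pc, f_pg, n_c, n_g):
    p_joint, p_g, p_d = probabilities(f_pc, f_pg, n_c, n_g)
    return sum(p * math.log(p / (p_g[g] * p_d[d]))
               for (g, d), p in p_joint.items() if p > 0)

def information_gain_ratio(f_pc, f_pg, n_c, n_g):
    _, p_g, _ = probabilities(f_pc, f_pg, n_c, n_g)
    entropy = -sum(p * math.log(p) for p in p_g.values() if p > 0)
    return information_gain(f_pc, f_pg, n_c, n_g) / entropy

def pointwise_mi(f_pc, f_pg, n_c, n_g):
    # Assumes the pattern occurs at least once in the domain corpus.
    p_joint, p_g, p_d = probabilities(f_pc, f_pg, n_c, n_g)
    return math.log(p_joint[("p", "c")] / (p_g["p"] * p_d["c"]))

def normalised_mi(f_pc, f_pg, n_c, n_g):
    # A discount in the style of Lin and Pantel (2002); the exact factor
    # used in the thesis is not reproduced here, so this is an assumption.
    m = min(f_pc + f_pg, n_c)
    discount = (f_pc / (f_pc + 1)) * (m / (m + 1))
    return pointwise_mi(f_pc, f_pg, n_c, n_g) * discount
```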
Chi-Square and Log-likelihood are statistical tests which work with frequencies and rank-order scales, both calculated from a contingency table containing the observed and expected frequencies of occurrence of a pattern in the domain and general corpora. Chi-Square is calculated as follows:

\[ \chi^2(p,c) = \sum_{d \in \{c,c'\}} \frac{(O_d - E_d)^2}{E_d} \]

where O_d is the observed frequency of p in the domain and general corpus respectively, and E_d is its expected frequency in the two corpora, calculated as:

\[ E_i = N_i \times \frac{\sum_i O_i}{\sum_i N_i} \]

where N_i is the total frequency of patterns in corpus i. Log-likelihood is calculated according to the following formula:

\[ -2\ln\lambda = 2 \sum_i O_i \ln \frac{O_i}{E_i} \]

This equates to calculating LL as follows:

\[ LL(p,c) = 2 \left( O_1 \log \frac{O_1}{E_1} + O_2 \log \frac{O_2}{E_2} \right) \]

where O_1 and O_2 are the observed frequencies of a pattern p in the domain and general corpus respectively, while E_1 and E_2 are its expected frequency values in the two corpora.

In addition to these six measures, we introduce a meta-ranking method that combines the scores produced by several individual association measures (apart from MI), in order to leverage agreement between different association measures and downplay the idiosyncrasies of individual ones. We excluded MI because of its bias towards infrequent events, as mentioned earlier (Lin and Pantel, 2002). Because the association functions range over different values (for example, IGR ranges between 0 and 1), we first normalise the scores assigned by each method:

\[ S_{norm}(p) = \frac{s(p)}{\max_{q \in P} s(q)} \]

where s(p) is the non-normalised score for pattern p from the candidate pattern set P. The normalised scores are then averaged across the different methods and used to produce a meta-ranking of the candidate patterns.

Apart from the aforementioned pattern ranking methods, we also used the most frequently used pattern ranking method, tf-idf, as a baseline in our experiments. The tf-idf scoring is commonly used in IR (Manning and Schütze, 1999), and Sudo et al. (2003) (see Section 2.6.3.3) used it to rank IE patterns. We used the following formula:

\[ score_i = tf_i \times \log \frac{N}{df_i} \]

where tf_i is the frequency of pattern i in the domain corpus, df_i is the number of documents containing pattern i, and N is the total number of documents in the collection (both domain and general corpus).

Given the ranking of candidate patterns produced by a scoring method, a certain number of highest-ranking patterns can be selected for evaluation. We studied two different ways of selecting these patterns: (i) setting a threshold on the association score, below which candidate patterns are discarded (henceforth, score-thresholding), and (ii) selecting a fixed number of top-ranking patterns (henceforth, rank-thresholding). During the evaluation, we experimented with different rank- and score-thresholding values.

3.1.5 Evaluation

3.1.5.1 Experimental data

We used the GENIA corpus as the domain corpus, while the British National Corpus (BNC) was used as the general corpus. The GENIA corpus consists of 2000 abstracts extracted from MEDLINE, containing 18,421 sentences. In the evaluation phase, the GENIA EVENT Annotation corpus (Kim et al., 2008; http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/home/wiki.cgi?page=Event+Annotation) is used. It consists of 1000 MEDLINE abstracts similar to the GENIA corpus and has 9,372 sentences. The main difference between the GENIA and GENIA EVENT corpora is that in the GENIA EVENT corpus events are identified and annotated.
In order to handle the problem of data sparseness caused by the small size of the GENIA corpus, we developed a WEB corpus (consisting of 132,582 sentences) by collecting MEDLINE articles similar to the GENIA corpus from the National Library of Medicine (http://www.nlm.nih.gov/). The WEB corpus was collected using a commercial web crawler which implements a methodology for collecting a topical corpus similar to the one implemented in tools such as BootCat (http://bootcat.sslmit.unibo.it/). The commercial web crawler was preferred over BootCat because it has an integrated term extractor, so high-quality terms were automatically extracted from the pages being analysed and used to build further queries automatically, whereas BootCat extracts single words. The crawler is fully automated, i.e. no manual revision of the extracted terms is needed after each iteration. Moreover, it queries multiple search engines (Google, Yahoo and Bing), so the crawling results are not biased towards any particular search engine. Because it uses a term extractor, the crawler is better at crawling highly technical domains, which are best captured by multi-word terms; BootCat, by contrast, was primarily intended to collect language-specific, topic-independent corpora, for which single words are more suitable.

The initial set of queries was constructed manually by defining several topical terms (named entities) from the GENIA corpus, e.g. protein and DNA, and combining them randomly. The crawler collects web pages by making calls to several popular search engines, extracts topical terminology from the pages, selects the most promising topical terms to create new queries, and uses them to collect more web pages on the topic. The crawler collected web pages in this iterative manner until the desired corpus size was reached. The crawler strips boilerplate content (navigation menus, standard notices, etc.) from each page, removes HTML tags, and detects and discards duplicate pages. The GENIA named entity tagger was then used for NER and PoS tagging.

The quality of the collected corpus was evaluated using corpus homogeneity and similarity scores. In order to ensure that the WEB corpus is sufficiently on-topic, it is important to know how similar the two corpora are. Corpus similarity also plays a pivotal role when porting an NLP application from one domain with one corpus to another domain with a different corpus. Corpus similarity is a complex issue and there is no generally accepted method of measuring it. Kilgarriff (1997), Kilgarriff and Rose (1998) and Kilgarriff (2001) argued that it is most important to first determine the homogeneity of a corpus before computing its similarity to another corpus, as the judgement of similarity can become unreliable if a homogeneous corpus is compared with a heterogeneous one. Kilgarriff (1997) presented an overview of various approaches to corpus similarity and proposed a word frequency list approach to measure corpus similarity and homogeneity. We used the Kilgarriff (1997) approach as it is considerably easier to count words accurately than syntactic categories.
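At its core, this frequency-list method reduces to a chi-square statistic computed over the most frequent words of two token lists, as in the sketch below; the pooling of the two top-word lists into a single vocabulary and the preprocessing details are assumptions of this sketch. The next paragraphs describe how we applied it for homogeneity (comparing two halves of one corpus) and similarity (comparing two different corpora).

```python
# A minimal sketch of a Kilgarriff-style frequency-list comparison: compute
# a chi-square statistic over the most frequent words of two token lists
# (already tagged and filtered of punctuation and stop-words).
from collections import Counter

def chi_square_between(tokens_a, tokens_b, top_n=500):
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    vocab = {w for w, _ in (freq_a + freq_b).most_common(top_n)}
    n_a, n_b = sum(freq_a.values()), sum(freq_b.values())
    chi2 = 0.0
    for w in vocab:
        o_a, o_b = freq_a[w], freq_b[w]
        # Expected counts assume the word is equally likely in both lists
        rate = (o_a + o_b) / (n_a + n_b)
        e_a, e_b = rate * n_a, rate * n_b
        chi2 += (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
    return chi2
```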
In order to measure corpus homogeneity, we divided each corpus into two equal parts and produced a word frequency list for each sub-corpus by processing the text with the GENIA tagger and filtering out punctuation and stop-words. We then took the 500 most frequent words from each sub-corpus and calculated the chi-square statistic for the difference between the two sub-corpora, as Kilgarriff and Rose (1998) and Kilgarriff (2001) showed that the chi-square statistic performs considerably better than other information-theoretic and statistical measures. To determine the similarity between two corpora, we likewise produced the top 500 words from each corpus and calculated the chi-square statistic. Low chi-square scores indicate homogeneous and highly similar corpora, while high scores correspond to heterogeneous and dissimilar corpora.

Corpus        Chi-Score
GENIA         1379.693
GENIA EVENT   2364.577
WEB           14750.369
BNC           20872371.995

Table 9: Homogeneity scores of corpora

Table 9 shows the homogeneity scores between the two sub-corpora of each corpus used in the experiments. We observe that the GENIA and GENIA EVENT corpora achieve quite low scores, which shows that both corpora are homogeneous. This is rather unsurprising, as both corpora were compiled by hand to ensure topic relevance and are generally accepted as benchmark biomedical corpora. The WEB and BNC scores show that these two corpora are more heterogeneous. BNC exhibits the greatest heterogeneity, which is explained by the fact that the corpus is meant to cover the broadest possible range of domains in general British English. The WEB corpus is much more homogeneous than BNC, but still has a chi-square score an order of magnitude greater than the GENIA corpora, reflecting the fact that automatic web collection methods are still incapable of ensuring the same level of topic relevance as achieved in manually compiled corpora.

In the next step, we calculated the similarity scores between these corpora using the same chi-square score. Table 10 shows that the GENIA and GENIA EVENT corpora are quite similar to each other, while the high scores for all the other corpus pairs mean that they are quite dissimilar.

              GENIA EVENT   WEB          BNC
GENIA         2137.63       173207.002   23686564.063
GENIA EVENT                 136568.630   23008298.781
WEB                                      28068572.14

Table 10: Similarity scores of corpora

As mentioned earlier, BNC is a heterogeneous corpus, which is reflected here in the form of higher similarity scores (i.e. greater dissimilarity), while the WEB corpus similarity scores are also quite high, in line with its greater heterogeneity compared with the manually compiled GENIA and GENIA EVENT corpora. We collected the WEB corpus to attain higher recall in our experiments but, as is quite obvious from the homogeneity and similarity scores (Tables 9 and 10), the WEB corpus is neither homogeneous nor similar to the GENIA or GENIA EVENT corpus. One possible reason for this is that GENIA is a very narrow-domain corpus, and it is hard to collect relevant topical documents automatically.

3.1.5.2 Evaluation method

In order to evaluate the quality of the extracted patterns, we examined their ability to capture pairs of related named entities in the manually annotated evaluation corpus, without recognising the types of the semantic relations. Selecting a certain number of best-ranking patterns, we measured precision, recall and F-score; a sketch of this matching-based evaluation is given below.
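The sketch treats the evaluation as matching each selected pattern against the annotated sentences and checking the captured NE pairs against the gold annotation. The representation of gold pairs and the `match` helper are illustrative assumptions.

```python
# Illustrative sketch of the matching-based evaluation: a pattern match
# yields a pair of NEs, and the pair counts as correct if the gold
# annotation relates those two NEs (relation types are not checked).

def evaluate(selected_patterns, sentences):
    """sentences: list of (tagged_sentence, gold_pairs), where gold_pairs is
    the set of annotated related-NE pairs; returns (precision, recall, f1)."""
    extracted, correct, gold_total = 0, 0, 0
    for tagged_sentence, gold_pairs in sentences:
        gold_total += len(gold_pairs)
        for pattern in selected_patterns:
            for ne_pair in match(pattern, tagged_sentence):  # assumed helper
                extracted += 1
                if ne_pair in gold_pairs:
                    correct += 1
    precision = correct / extracted if extracted else 0.0
    recall = correct / gold_total if gold_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```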
To test the statistical significance of differences in the results of different methods and configurations, we used a paired t-test, having randomly divided the evaluation corpus (the GENIA EVENT Annotation corpus) into 20 subsets of equal size, each containing 461 sentences on average. We collected precision, recall and F-score for each of these subsets and then used the paired t-test to establish the statistical significance of differences between the surface pattern types and between the ranking methods under the score-thresholding measure.

3.1.6 Results

In the first phase of experiments, we considered all surface pattern types (untagged, PoS-tagged and verb-centred) without prepositions. We carried out our experiments on all three corpora (GENIA, WEB and GENIA+WEB) for all three surface pattern types. As we found in Section 3.1.5.1 that the WEB corpus is not similar to the GENIA or GENIA EVENT corpus, in this section we discuss the results for the GENIA corpus only, while Appendix C contains complete results for all three corpora along with precision, recall and F-scores.

The numbers of untagged word patterns extracted from each corpus are: GENIA 12230, WEB 42718, GENIA+WEB 52511, BNC 1956473 and GENIA EVENT 5763. Figure 4 shows the rank-thresholding results for untagged word patterns using the GENIA corpus, with precision scores represented along the Y-axis; recall is very low under the rank-thresholding measure (see Table 1 in Appendix C for complete results in terms of precision, recall and F-scores).

[Figure 4: Rank-thresholding results (Top 100/200/300) for untagged word patterns using the GENIA corpus, comparing the IG, IGR, MI, NMI, LL, CHI, Meta and tf-idf ranking methods]

Figure 4 clearly shows that CHI, Meta and NMI are the best performing ranking methods while MI is the worst. Moreover, IG, IGR and LL achieved quite similar results. After rank-thresholding, the next set of experiments is based on the score-thresholding measure for untagged word patterns for each corpus (GENIA, WEB and GENIA+WEB). Here we consider only those threshold scores which enable us to attain high precision scores (see Table 4 in Appendix C for complete results in terms of precision, recall and F-score for each corpus). Figure 5 shows the results of the score-thresholding measure for untagged word patterns using the GENIA corpus.

[Figure 5: Score-thresholding results (thresholds >0.1, >0.2, >0.3) for untagged word patterns using the GENIA corpus, comparing the IG, IGR, MI, NMI, LL, CHI, Meta and tf-idf ranking methods]

In Figure 5, we are able to achieve 100% precision scores using the CHI and Meta ranking methods, but at the cost of very low recall. Here too, IG, IGR and LL achieved quite similar results, while tf-idf performed better than them. We carried out a similar set of experiments using PoS-tagged word patterns. The numbers of PoS-tagged word patterns extracted from each corpus are: GENIA 12239, WEB 43708, GENIA+WEB 53871, BNC 1969040 and GENIA EVENT 5676. Figure 6 shows the rank-thresholding results for PoS-tagged word patterns using the GENIA corpus, with precision scores represented along the Y-axis (see Table 2 in Appendix C for complete results in terms of precision, recall and F-score for each corpus).
[Figure 6: Rank-thresholding results (Top 100/200/300) for PoS-tagged word patterns using the GENIA corpus]

The results in Figure 6 indicate that, as in Figure 4 (rank-thresholding results for untagged word patterns), CHI, Meta and NMI are the best performing ranking methods while MI is the worst. The overall results obtained using the rank-thresholding measure for PoS-tagged word patterns show that it achieves higher precision scores than untagged word patterns (Figure 4). The next set of experiments is based on the score-thresholding measure for PoS-tagged word patterns for each corpus. As with untagged word patterns, we report only those threshold scores for the GENIA corpus that attain high precision scores (see Table 5 in Appendix C for complete results in terms of precision, recall and F-score for each corpus).

[Figure 7: Score-thresholding results (thresholds >0.1, >0.2, >0.3) for PoS-tagged word patterns using the GENIA corpus]

As in Figure 5, in Figure 7 we are able to achieve a 100% precision score, but recall is very low. In the final set of experiments on surface pattern types without prepositions, we carried out a similar set of experiments using verb-centred word patterns. The numbers of verb-centred word patterns extracted from each corpus are: GENIA 8328, WEB 28645, BNC 1604809 and GENIA EVENT 4010. Figure 8 shows the rank-thresholding results for verb-centred word patterns using the GENIA corpus, with precision scores represented along the Y-axis (see Table 3 in Appendix C for complete results in terms of precision, recall and F-score for each corpus).

[Figure 8: Rank-thresholding results (Top 100/200/300) for verb-centred word patterns using the GENIA corpus]

The overall results achieved using the rank-thresholding measure for verb-centred word patterns are similar to those for PoS-tagged word patterns in that higher precision scores are achieved than for untagged word patterns (Figure 4). Moreover, as for the other surface pattern types, IG, IGR and LL attained quite similar results on all three corpora. In the next set of experiments, we used the score-thresholding measure for verb-centred word patterns for each corpus, using only those threshold scores that provide higher precision scores for the GENIA corpus (see Table 6 in Appendix C for complete results in terms of precision, recall and F-score for each corpus).

[Figure 9: Score-thresholding results (thresholds >0.1, >0.2, >0.3) for verb-centred word patterns using the GENIA corpus]

Figure 9 shows the results of the score-thresholding measure for verb-centred word patterns using the GENIA corpus; they indicate that overall we are able to achieve higher precision scores than for the other surface pattern types on the GENIA corpus.

In the next phase of experiments, we also considered prepositions occurring between two NEs, along with the content words, during the pattern learning process, again obtaining the same surface pattern types (untagged word patterns, PoS-tagged word patterns and verb-centred word patterns) but with prepositions included. Prepositions are used to express relations of place, direction, time or possession.
We used the same set of corpora and ranking methods as in the previous phase of experiments. As in the first phase, we carried out our experiments on all three corpora for each surface pattern type with prepositions. The numbers of untagged word patterns with prepositions extracted from each corpus are: GENIA 10093, WEB 34122, GENIA+WEB 41990, BNC 991004 and GENIA EVENT 4854. Figure 10 shows the rank-thresholding results for untagged word patterns with prepositions using the GENIA corpus (see Table 7 in Appendix C for complete results in terms of precision, recall and F-score for each corpus).

[Figure 10: Rank-thresholding results (Top 100/200/300) for untagged word patterns with prepositions using the GENIA corpus]

The results in Figure 10 show that the addition of prepositions to untagged word patterns has been very useful and has increased overall precision scores compared with untagged word patterns without prepositions for the GENIA corpus (Figure 4). After rank-thresholding, the next set of experiments is based on the score-thresholding measure for untagged word patterns with prepositions for each corpus (GENIA, WEB and GENIA+WEB). Here too we consider only those threshold scores which enable us to attain high precision scores for the GENIA corpus (see Table 10 in Appendix C for complete results in terms of precision, recall and F-score for each corpus).

[Figure 11: Score-thresholding results (thresholds >0.1, >0.2, >0.3) for untagged word patterns with prepositions using the GENIA corpus]

We carried out a similar set of experiments using PoS-tagged word patterns with prepositions. The numbers of PoS-tagged word patterns with prepositions extracted from each corpus are: GENIA 9237, WEB 33871, GENIA+WEB 41245, BNC 840057 and GENIA EVENT 4446. Figure 12 shows the rank-thresholding results for PoS-tagged word patterns with prepositions using the GENIA corpus, with precision scores along the Y-axis (see Table 8 in Appendix C for complete results in terms of precision, recall and F-score for each corpus).

[Figure 12: Rank-thresholding results (Top 100/200/300) for PoS-tagged word patterns with prepositions using the GENIA corpus]

After rank-thresholding, in the next set of experiments we used the score-thresholding measure for PoS-tagged word patterns with prepositions for each corpus (see Table 11 in Appendix C for complete results in terms of precision, recall and F-score for each corpus). Figure 13 shows the results of the score-thresholding measure for PoS-tagged word patterns with prepositions using the GENIA corpus; they indicate that the addition of prepositions to PoS-tagged word patterns has been very helpful and has increased overall precision scores compared with PoS-tagged word patterns without prepositions (Figure 7).

[Figure 13: Score-thresholding results (thresholds >0.1, >0.2, >0.3) for PoS-tagged word patterns with prepositions using the GENIA corpus]

We carried out a similar set of experiments using verb-centred word patterns with prepositions for each corpus.
The numbers of verb-centred word patterns with prepositions extracted from each corpus are: GENIA 6645, WEB 23931, GENIA+WEB 29353, BNC 598948 and GENIA EVENT 3271. Figure 14 shows the rank-thresholding results for verb-centred patterns with prepositions using the GENIA corpus, with precision scores along the Y-axis (see Table 9 in Appendix C for complete results in terms of precision, recall and F-scores for each corpus).

[Figure 14: Rank-thresholding results (Top 100/200/300) for verb-centred word patterns with prepositions using the GENIA corpus]

After rank-thresholding, we used the score-thresholding measure for verb-centred surface patterns with prepositions for each corpus (see Table 12 in Appendix C for complete results in terms of precision, recall and F-score for each corpus). Figure 15 shows the results of the score-thresholding measure for verb-centred patterns with prepositions for the GENIA corpus, with precision scores represented along the Y-axis.

[Figure 15: Score-thresholding results (thresholds >0.1, >0.2, >0.3) for verb-centred word patterns with prepositions using the GENIA corpus]

3.1.6.1 Ranking methods

In Section 3.1.6, we carried out our experiments on three different surface pattern types (untagged, PoS-tagged and verb-centred), both without and with prepositions, using the different pattern ranking methods described in Section 3.1.4. In all experiments we found that IG, IGR and LL achieved quite similar results, that CHI, Meta and NMI were the best performing ranking methods, and that MI was the worst in terms of precision scores. The tf-idf ranking method performed better than MI on all occasions, but it is not really applicable to our work, as our corpus consists only of documents that describe relevant domain information, in contrast to the corpus used by Sudo et al. (2003). Although the CHI and Meta ranking methods attained higher precision scores, their recall scores are very low. We used two evaluation measures, rank-thresholding and score-thresholding, and found that score-thresholding performs better, as we are able to achieve a 100% precision score with it. Moreover, when we compared the surface pattern types without prepositions to those with prepositions, we found that the pattern types with prepositions generally performed better, as the addition of prepositions is useful for the extracted semantic relations. Of the three surface pattern types explored (untagged, PoS-tagged and verb-centred), the verb-centred and PoS-tagged pattern types are better than untagged word patterns. Figure 16 shows the precision scores of the best performing ranking method (CHI) for each corpus for verb-centred patterns under the score-thresholding measure, while Figure 17 shows the same results for verb-centred patterns with prepositions.
[Figure 16: Precision scores of the best performing ranking method (CHI) for verb-centred patterns under score-thresholding (thresholds >0.08 to >0.4), for the GENIA, WEB and GENIA+WEB corpora]

[Figure 17: Precision scores of the best performing ranking method (CHI) for verb-centred patterns with prepositions under score-thresholding (thresholds >0.08 to >0.4), for the GENIA, WEB and GENIA+WEB corpora]

Overall, across all these sets of experiments, the IG, IGR and LL ranking methods perform quite similarly to each other and, in general, there is no statistically significant difference between them. While the literature suggests that IGR performs better than IG (Quinlan, 1986; Manning and Schütze, 1999), we found that in general there is no statistically significant difference between IG and IGR, or between IGR and LL, for any of the three pattern types. Moreover, in all these experiments, due to the aforementioned problem, MI performs quite poorly; the normalised version of MI helps to alleviate this problem, and there is a statistically significant difference (p < 0.01) between NMI and the other ranking methods for all three pattern types. Contrary to our expectations, the meta-ranking method did not improve on the best individual ranking method; we found a statistically significant difference (p < 0.05) between the meta-ranking method and all the other ranking methods for all three pattern types. We also found that the score-thresholding method is better than the rank-thresholding method, as we were able to achieve 100% precision scores.

3.1.6.2 Types of patterns

PoS-tagged word patterns and verb-centred patterns perform better than untagged word patterns. Verb-centred patterns work well because verbs are known to express semantic relations between named entities through their syntactic arguments; PoS-tagged word patterns add important semantic information to the pattern and can disambiguate words appearing in it. In order to find out whether the differences between the three pattern types are statistically significant, we carried out a paired t-test again. We found that there is no statistically significant difference between PoS-tagged word patterns and verb-centred patterns. Apart from IG, IGR and LL, there is a statistically significant difference between all the ranking methods for untagged versus PoS-tagged word patterns, and for untagged versus verb-centred patterns.

3.1.6.3 Precision vs. F-measure optimisation

In terms of F-score, verb-centred patterns achieved a higher F-score than the other pattern types, and the addition of prepositions to each pattern type also results in a higher F-score (see Appendix C for more details). Moreover, CHI and NMI are the best performing ranking methods; Figure 18 and Figure 19 show the precision, recall and F-score for verb-centred patterns with prepositions for the GENIA corpus achieved using these two ranking methods.
Figure 18 clearly shows that even though CHI achieved high precision scores, its recall and F-score are quite low, while Figure 19 shows that NMI achieved much better recall and F-score than CHI, but at the cost of lower precision scores.

The score-thresholding measure achieves higher precision than the rank-thresholding measure. High precision is quite important in applications such as MCQ generation. In score-thresholding, it is possible to optimise for high precision (up to 100%), though the F-score is generally quite low. MCQ applications rely on the production of good questions rather than the production of all possible questions, so high precision plays a vital role in such applications.

3.2 Unsupervised Dependency-based Patterns

3.2.1 Automatic Parsing of Text

Syntactic analysis of text, also known as parsing, is the process of determining the grammatical structure of a sentence's constituents. In syntactic analysis, a sentence is recursively decomposed into smaller units called constituents or phrases. These constituents are then categorised into noun phrases or verb phrases according to their internal structures. Syntactic analysis is generally represented in the form of a parse tree. Syntax plays an important role in making language useful for communication. Syntax in linguistics attempts to describe language in terms of certain rules, and many theoretical approaches to syntax have been presented in relation to automatic parsing.

Dependency trees are regarded as a suitable basis for semantic pattern acquisition, as they abstract away from the surface structure to represent relations between elements (entities) of a sentence. Semantic patterns represent semantic relations between elements of sentences. One of the advantages of using dependency trees is that they provide a useful structure for the sentences by annotating edges with dependency functions, e.g. subject, object etc. (Fundel et al., 2007). A pattern is defined as a path in the dependency tree passing through zero or more intermediate nodes (Sudo et al., 2001). Stevenson and Greenwood (2009) provided insight into the usefulness of dependency patterns (see Section 2.6.3.6). In their work, they revealed that dependency parsers have the advantage of generating analyses which abstract away from the surface realisation of text to a greater extent than phrase structure grammars tend to, resulting in semantic information being more accessible in the representation of the text, which can be useful for IE. Several approaches in IE have relied on dependency trees in order to extract patterns for the automatic acquisition of IE systems (Yangarber et al., 2000; Sudo et al., 2001; Sudo et al., 2003; Stevenson and Greenwood, 2005 and Greenwood et al., 2005) (see Section 2.6.3). Apart from IE, Lin and Pantel (2001) used dependency trees in order to infer rules for question answering, while Szpektor et al. (2004) made use of dependency trees for paraphrase identification.
More recently, dependency parsers have been used in systems which identify protein interactions in biomedical texts (Katrenko and Adriaans, 2006; Erkan et al., 2007 and Saetre et al., 2007). All the abovementioned approaches have used different pattern models based on a particular part of the dependency analysis. The motive behind all of these models is to extract the necessary information from text without being overly complex. All of the pattern models have made use of semantic patterns based on dependency trees for the identification of items of interest in text. These models vary in terms of their complexity, expressivity and performance in an extraction scenario.

3.2.2 Our Approach

In our dependency-based approach, we employed two dependency tree pattern models: the SVO pattern model (SVO patterns) and an adapted version of the linked chain pattern model. We used the SVO pattern model (Yangarber et al., 2000; see Section 2.6.3.1 for more details) as a baseline in our experiments. In the SVO model, we extracted all subject-verb-object tuples from the dependency parse of a sentence and discarded the remainder of the dependency parse.

Our adapted linked chain pattern model approach (Afzal et al., 2011) is based on the linked chain pattern model presented by Greenwood et al. (2005) (see Section 2.6.3.4). The linked chain pattern model combines pairs of chains in a dependency tree which share a common verb root but no direct descendants. We selected the linked chain dependency pattern model as it is the best performing pattern model and its performance is consistently better than the collective performance of both the SVO and chain dependency pattern models (Stevenson and Greenwood, 2009). In our approach, we have treated every Named Entity (NE) as a chain in a dependency tree if it is less than 5 dependencies away from the verb root and the words linking the NE to the verb root are from the category of content words (verbs, nouns, adverbs and adjectives) or prepositions. We consider only those chains in the dependency tree of a sentence which contain NEs, which is much more efficient than the subtree model of Sudo et al. (2003) (see Section 2.6.3.3), where all subtrees containing verbs are taken into account. This allows us to extract more meaningful patterns from the dependency tree of a sentence. We extract from a sentence all NE chains which follow the aforementioned rule and combine them together. The extracted patterns are then stored in a database along with their frequencies.

3.2.3 Extraction of Candidate Patterns

As with the learning of surface-based patterns, our general approach to learning dependency-based patterns consists of the same two main stages: (i) the construction of potential patterns from an unannotated domain corpus and (ii) their relevance ranking.

3.2.3.1 Pre-processing steps

The first step in constructing candidate patterns is to perform NE recognition in an unannotated domain corpus. We will explain the whole process of candidate pattern extraction from dependency trees with the help of the example shown below:

Fibrinogen activates NF-kappa B in mononuclear phagocytes.

We used the GENIA tagger (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/) for NER, and the following example shows the NER output for a biomedical text:

<protein> Fibrinogen </protein> activates <protein> NF-kappa B </protein> in <cell_type> mononuclear phagocytes </cell_type>.

Once the NEs are recognised in the domain corpus by the GENIA tagger, we replace all the NEs with their respective semantic classes, so the aforementioned sentence is transformed into the following sentence:

PROTEIN activates PROTEIN in CELL.
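This masking step can be sketched as follows; a minimal illustration assuming GENIA-style inline tags as in the example above, with a hypothetical function name (note that the thesis example abbreviates cell_type to CELL):

    import re

    def mask_entities(tagged_sentence):
        # Replace every <class> ... </class> span with its upper-cased
        # semantic class label.
        return re.sub(r'<(\w+)>.*?</\1>',
                      lambda m: m.group(1).upper(),
                      tagged_sentence)

    s = ('<protein> Fibrinogen </protein> activates <protein> NF-kappa B '
         '</protein> in <cell_type> mononuclear phagocytes </cell_type>.')
    print(mask_entities(s))  # PROTEIN activates PROTEIN in CELL_TYPE.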
The transformed sentences are then parsed using the Machinese Syntax parser (Tapanainen and Järvinen, 1997; http://www.connexor.com/software/syntax/). The Machinese Syntax parser uses a functional dependency grammar for parsing. It first labels each word with all its possible function types and then applies a collection of handwritten rules to introduce links between specific types in a given context and remove all the other function types. The Machinese Syntax parser was evaluated in terms of correct identification of attached heads on three different genres in the Bank of English (Järvinen, 1994) data. Table 11 shows the results in terms of precision and recall.

                Precision   Recall
    Broadcast   93.4%       88.0%
    Literature  96.0%       88.6%
    Newspaper   95.3%       87.9%

Table 11: Percentages of heads correctly attached

Stevenson and Greenwood (2007) used three different parsers, including the Machinese Syntax parser, in order to compare different IE models. They carried out their experiments on two different corpora: the MUC-6 corpus and a biomedical corpus (see Section 2.6.3.6).

Figure 20 shows the dependency tree produced by the parser for the aforementioned adapted sentence example.

Figure 20: Dependency tree of 'PROTEIN activates PROTEIN in CELL'

The analyses produced by the Machinese Syntax parser are encoded to make the most of the information they contain and to ensure consistent structures from which patterns can be extracted. Figure 21 shows the encoded output of a biomedical text.

    <s id="S1">
    <W ID="2" LEMMA="protein" POS="N" FUNC="SUBJ" DEP="3">PROTEIN</W>
    <W ID="3" LEMMA="activate" POS="V" FUNC="+FMAINV" DEP="1">activates</W>
    <W ID="4" LEMMA="protein" POS="N" FUNC="OBJ" DEP="3">PROTEIN</W>
    <W ID="5" LEMMA="in" POS="PREP" FUNC="ADVL" DEP="3">in</W>
    <W ID="6" LEMMA="cell" POS="N" FUNC="P" DEP="5">CELL</W>
    <W ID="7" LEMMA="." POS="" FUNC="" DEP="none">.</W>
    </s>

Figure 21: Encoded biomedical text

3.2.3.2 Dependency-based patterns

After the encoding, the patterns are extracted from the dependency trees using the methodology described in Section 3.2.2. For example, the following SVO pattern was extracted from Figure 21:

[V/activate] (subj[PROTEIN] + obj[PROTEIN])

The following adapted linked chain patterns were extracted from the same example (Figure 21):

[V/activate] (subj[PROTEIN] + obj[PROTEIN])
[V/activate] (obj[PROTEIN] + prep[in] + p[CELL_TYPE])

To represent dependency tree patterns, we employed a formalism similar to that used by Sudo et al. (2003). Each node in the dependency tree is represented in the format a[B], e.g. subj[PROTEIN], where a is the dependency relation between this node and its parent (subj) and B is the semantic class of the named entity. The relationship between nodes is represented as X (A+B+C), which indicates that nodes A, B and C are direct descendants of X. The patterns, along with their frequencies, are stored in a database. As with surface-based patterns, we also filtered out dependency-based patterns containing only stop-words. In SVO patterns, we extracted only those SVO patterns where both subject and object are named entities.
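A minimal sketch of this SVO extraction over the Figure 21 encoding follows; the regular expression, the NE class list and the fixed attribute order are simplifying assumptions for illustration, not the thesis implementation:

    import re

    NE_CLASSES = {'PROTEIN', 'DNA', 'RNA', 'CELL_TYPE', 'CELL_LINE', 'CELL'}
    W = re.compile(r'<W ID="(\d+)"\s*LEMMA="([^"]*)"\s*POS="([^"]*)"\s*'
                   r'FUNC="([^"]*)"\s*DEP="([^"]*)">([^<]*)</W>')

    def svo_patterns(encoded_sentence):
        # Index every <W> element by its ID.
        words = {m.group(1): {'lemma': m.group(2), 'pos': m.group(3),
                              'func': m.group(4).lower(), 'dep': m.group(5),
                              'token': m.group(6)}
                 for m in W.finditer(encoded_sentence)}
        patterns = []
        for wid, w in words.items():
            if w['pos'] != 'V':
                continue
            # Collect the direct dependants of this verb by function type.
            args = {d['func']: d['token'] for d in words.values()
                    if d['dep'] == wid}
            subj, obj = args.get('subj'), args.get('obj')
            # Keep the tuple only when both slots are named-entity classes.
            if subj in NE_CLASSES and obj in NE_CLASSES:
                patterns.append(
                    f"[V/{w['lemma']}] (subj[{subj}] + obj[{obj}])")
        return patterns

On the Figure 21 encoding this yields [V/activate] (subj[PROTEIN] + obj[PROTEIN]).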
Table 12 shows some examples of the most frequent SVO patterns, along with their frequencies, extracted from the GENIA corpus.

    Pattern                                          Frequency
    [V/contain] (subj[DNA] + obj[DNA])                  34
    [V/activate] (subj[PROTEIN] + obj[PROTEIN])         32
    [V/contain] (subj[PROTEIN] + obj[PROTEIN])          19
    [V/induce] (subj[PROTEIN] + obj[PROTEIN])           18
    [V/encode] (subj[DNA] + obj[PROTEIN])               17
    [V/express] (subj[CELL_TYPE] + obj[PROTEIN])        16
    [V/inhibit] (subj[PROTEIN] + obj[PROTEIN])          14
    [V/form] (subj[PROTEIN] + obj[PROTEIN])              6
    [V/regulate] (subj[PROTEIN] + obj[PROTEIN])          6
    [V/stimulate] (subj[PROTEIN] + obj[PROTEIN])         6

Table 12: SVO patterns along with their frequencies

The total number of SVO patterns extracted from the GENIA corpus is very small, and one of the main reasons for this is that the SVO pattern model does not perform well in the biomedical domain. This fact was also highlighted by Stevenson and Greenwood (2009), who argued that the reason is that in the biomedical domain named entities are described in ways that the SVO pattern model is unable to represent, as it is restricted to verbs and their direct arguments only. In their work they compared various pattern models using two domains: MUC-6 and biomedical data (see Section 2.6.3.6).

Table 13 shows some examples of the most frequent adapted linked chain patterns, along with their frequencies, extracted from the GENIA corpus.

    Pattern                                                                     Frequency
    [V/contain] (subj[DNA] + obj[DNA])                                             34
    [V/activate] (subj[PROTEIN] + obj[PROTEIN])                                    32
    [V/contain] (subj[PROTEIN] + obj[PROTEIN])                                     19
    [V/induce] (subj[PROTEIN] + app[PROTEIN])                                      19
    [V/activate] (a[DNA] + obj[PROTEIN])                                           18
    [V/induce] (subj[PROTEIN] + obj[PROTEIN])                                      18
    [V/interact] (subj[PROTEIN] + prep[in] + p[PROTEIN])                           17
    [V/induce] (subj[PROTEIN] + obj[phosphorylation] + prep[of] + p[PROTEIN])      17
    [V/encode] (subj[DNA] + obj[PROTEIN])                                          17
    [V/induce] (subj[PROTEIN] + subj[PROTEIN])                                     17

Table 13: Adapted linked-chain patterns along with their frequencies

In our experiments we preferred to use the adapted linked chain pattern model, as it makes it possible to encode more of the information present in a sentence than the SVO pattern model (Section 2.6.3.1) or the chain pattern model (Section 2.6.3.2); this fact was also highlighted by Stevenson and Greenwood (2009). Moreover, the SVO pattern model performed very poorly in the biomedical domain compared to the linked chain pattern model (Stevenson and Greenwood, 2009).

3.2.4 Pattern Ranking

In order to rank the extracted candidate patterns, we employed the same information theoretic concepts (Information Gain (IG), Information Gain Ratio (IGR), Mutual Information (MI) and Normalised Mutual Information (NMI)) and statistical tests of association (Log-likelihood (LL) and Chi-Square (CHI)), along with the meta-ranking and tf-idf ranking methods, which we used in the surface-based approach (see Section 3.1.4 for further details).

3.2.5 Evaluation

We used the same experimental data as in the surface-based pattern experiments (see Section 3.1.5 for further details). The numbers of adapted linked chain dependency patterns extracted from each corpus are: GENIA 5066, WEB 13653, GENIA+WEB 17694, BNC 419274 and GENIA EVENT 3031. The quality of the extracted patterns is evaluated by employing the same approach as described in Section 3.1.5.2.

3.2.6 Results

We conducted our experiments on all 3 corpora (GENIA, WEB and GENIA+WEB).
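The precision, recall and F-score figures reported throughout can be computed generically as below; this is a sketch assuming a simple set-based comparison of extracted patterns against a gold standard, rather than the exact protocol of Section 3.1.5.2:

    def precision_recall_f(extracted, gold):
        # True positives: extracted patterns that also occur in the gold set.
        tp = len(set(extracted) & set(gold))
        precision = tp / len(extracted) if extracted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        return precision, recall, f_score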
Similar to the surface-based approach, here we will discuss the results for the GENIA corpus only, while the complete results for all three corpora in terms of precision, recall and F-scores are given in Appendix C. Figure 22 shows the rank-thresholding results for adapted linked chain dependency patterns using the GENIA corpus; precision scores are represented along the Y-axis (for complete results see Table 13 in Appendix C).

Figure 22: Rank-thresholding results for adapted linked chain patterns using the GENIA corpus (precision on the Y-axis for the Top 100, 200 and 300 patterns under the IG, IGR, MI, NMI, LL, CHI, Meta and tf-idf ranking methods)

Figure 22 shows that, similar to the surface-based approach, CHI and NMI are the best performing ranking methods while MI is the worst. Moreover, IG, IGR and LL achieved quite similar results.

In the next step we used the score-thresholding measure for each corpus, as in the surface-based approach. Here too, we consider only those threshold scores that give us high precision scores (see Table 14 in Appendix C for complete results for each corpus). Figure 23 shows the results of the score-thresholding measure for adapted linked chain dependency patterns using the GENIA corpus.

Figure 23: Score-thresholding results for adapted linked chain patterns using the GENIA corpus (precision on the Y-axis at threshold scores >0.1, >0.2 and >0.3)

3.2.6.1 Ranking methods

We carried out our experiments using both thresholding measures: rank-thresholding and score-thresholding. In both sets of experiments, similar to the surface-based approach (Section 3.1.6), CHI is the best performing ranking method, but its recall scores are very low. MI is the worst performing ranking method, while IG, IGR and LL attained quite similar results. Moreover, we found that there is no statistically significant difference (at the p < 0.05 level) between IG and LL, or between IGR and LL. Similar to the surface-based approach, tf-idf achieved quite reasonable results, but it is not the best performing ranking method. Figure 24 shows the precision scores of the best performing ranking method (CHI) in the score-thresholding method for dependency patterns.

Figure 24: Precision scores of the best performing ranking method for adapted linked chain dependency patterns in score-thresholding (GENIA, WEB and GENIA+WEB at threshold scores >0.08 to >0.4)

3.2.6.2 Score vs. rank thresholding

We also found that the score-thresholding method produces better results than rank-thresholding, as we are able to achieve higher precision with the former measure.

3.2.6.3 Precision vs. F-measure optimisation

As mentioned earlier, CHI is the best performing ranking method in terms of precision scores, but its recall scores are very low. Using the NMI ranking method we are able to achieve quite reasonable results in terms of both precision and recall scores. Figure 25 and Figure 26 show precision, recall and F-score for the GENIA corpus using these ranking methods (CHI and NMI).
Figure 25: Precision, recall and F-score for adapted linked chain dependency patterns in the score-thresholding measure using CHI (threshold scores >0.06 to >0.3)

Figure 26: Precision, recall and F-score for adapted linked chain dependency patterns in the score-thresholding measure using NMI (threshold scores >0.06 to >0.3)

Similar to the surface-based approach (Section 3.1.6.4), in the dependency-based approach the score-thresholding measure achieves higher precision than rank-thresholding. Applications such as MCQ generation, as mentioned earlier, rely on high precision, so the score-thresholding method gives us the opportunity to attain higher precision, albeit at low recall.

3.3 Comparison between Surface-based and Dependency-based Approaches

In Section 3.1, we discussed different surface pattern types (untagged word patterns, PoS-tagged word patterns and verb-centred patterns), with and without prepositions. The experimental results revealed that the verb-centred pattern type with prepositions performed better than the other pattern types and, moreover, that the inclusion of prepositions provides useful insight into the extracted semantic relations. We employed different ranking methods and found that CHI and NMI are the best performing ones. CHI is the best performing ranking method in terms of precision scores, but its recall scores are very low (Figure 18), while using NMI we are able to attain much better recall scores (Figure 19). Moreover, the score-thresholding measure performs better than rank-thresholding.

In Section 3.2, we explored the dependency-based pattern approach, and there too we found that overall CHI (Figure 25) and NMI (Figure 26) are the best performing ranking methods, while the score-thresholding measure outperforms rank-thresholding.

In this section, we compare the precision scores obtained by using the best performing ranking methods (NMI and CHI) for the dependency-based patterns with those for the surface-based verb-centred patterns along with prepositions for the GENIA corpus. Figure 27 shows the comparison of precision scores obtained using the NMI ranking method for the GENIA corpus between the dependency-based patterns and the surface-based verb-centred patterns along with prepositions.

Figure 27: Comparison of precision scores using NMI for the GENIA corpus between dependency-based and verb-centred surface-based patterns (threshold scores >0.08 to >0.4)

Figure 27 shows that the NMI ranking method applied to dependency-based patterns is able to achieve higher precision scores than the NMI ranking method applied to surface-based verb-centred patterns, while Figure 28 shows the same comparison using the CHI ranking method.

Figure 28: Comparison of precision scores using CHI for the GENIA corpus between dependency-based and verb-centred surface-based patterns (threshold scores >0.08 to >0.4)

Figure 28 also shows that the precision scores attained by the dependency-based approach are higher than the scores attained by the surface-based approach.
Overall, the results shown in Figures 27 and 28 reveal that the dependency-based patterns outperform the best performing surface-based pattern type (verb-centred along with prepositions) in terms of precision scores. Moreover, the dependency-based approach provided more coverage than the surface-based approach. The dependency-based approach enabled us to extract semantic relations that the surface-based approach was unable to extract, as it abstracts away from the different surface realisations of semantic relations. The surface-based approach was able to extract quite effectively those semantic relations that involved PROTEIN and DNA named entities, but it was unable to extract a few semantic relations involving the CELL_LINE, CELL_TYPE and RNA named entities, which the dependency-based approach was able to extract effectively. For example:

[V/express] (subj[CELL_LINE] + obj[RNA])
[V/activate] (p[CELL_LINE] + p[CELL_LINE])
[V/show] (subj[CELL_TYPE] + obj[expression] + prep[of] + p[RNA])
[V/enhance] (a[RNA] + obj[transcription] + prep[in] + p[CELL_LINE])
[V/inhibit] (a[RNA] + obj[transcription] + prep[in] + p[CELL_LINE])
[V/mediate] (obj[transcription] + prep[of] + p[DNA] + prep[in] + p[CELL_LINE])

Our detailed analysis has revealed that the dependency-based approach is much more effective in extracting semantic relations than the surface-based approach.

3.4 Summary

In this chapter, we have presented two unsupervised approaches (surface-based and dependency-based) for Relation Extraction in the biomedical domain. In the surface-based approach, we experimented with three different surface pattern types and showed that PoS-tagged and verb-centred patterns achieve higher precision than untagged word patterns, while in the dependency-based approach we employed an adapted version of the linked chain pattern model to extract patterns from dependency trees. We explored different ranking methods and found that, in both the surface-based and the dependency-based approach, the CHI ranking method obtained higher precision than the other ranking methods, while NMI was the second best ranking method. In the dependency-based approach we found that we are able to achieve good results if a biomedical corpus is first adapted and dependency patterns are then extracted from it. Moreover, we found that there is no statistically significant difference between IG and IGR, or between LL and CHI, in either approach. We employed two different techniques, the rank-thresholding measure and the score-thresholding measure, and found that the score-thresholding measure performs better. Moreover, corpus homogeneity and similarity scores revealed that the use of the Web as a corpus is still unable to ensure the same level of topic relevance as achieved in manually compiled corpora. At the end of this chapter, we compared the dependency-based approach with the best performing surface-based approach and found that the dependency-based approach achieves better results.

Chapter 4: Questions and Distractors Generation

In this chapter, we will look at the way extracted patterns (i.e. semantic relations) are automatically transformed into questions. First, we will discuss the approach employed to transform extracted surface-based patterns into questions, and then the approach used to transform extracted dependency-based patterns into questions.
At the end of this chapter, we will elaborate on the process of automatically generating distractors for each question using a distributional similarity measure.

4.1 Question Generation

Automatic question generation is an important and emerging area of research in NLP. It has the potential to be employed in various areas such as intelligent tutoring systems, dialogue systems (Walker et al., 2001) and educational technologies (Graesser et al., 2005). In automatic question generation it is not only important to ask questions which are grammatically correct, but also for the generated questions to ask about important concepts described in a given text (Vanderwende, 2008). Moreover, it is also important to automatically generate questions that stimulate the learning process among learners. Recent workshops on Question Generation Task and Evaluation (http://www.questiongeneration.org/) are trying to define a shared task for question generation. In 2010, the Question Generation Shared Task and Evaluation Challenge (QGSTEC 2010, http://www.questiongeneration.org/QGSTEC2010) focused on evaluating the generation of questions from paragraphs and the generation of questions from sentences.

It is well-known that generating/asking good questions is a complicated task (Graesser and Person, 1994). Vanderwende (2007, 2008) emphasised the need to generate important questions from a given text. Ruminator (Ureel et al., 2005) is a computer system which generates questions from simplified input sentences, but it relies heavily on this simplification and produces quite a large number of obvious or easy questions. Because of this, the quality of the generated questions is not particularly good, and moreover the generated questions are not informative enough. Another question generation system, presented by Schwartz et al. (2004), generates questions in order to help the learning process. This system depends on summarisation as a pre-processing step for the identification of important questions in a given text. The authors noted that question selections created by the system can be difficult to process.

Gates (2008) presented an approach that could automatically generate fact-based reading comprehension questions by using a look-back strategy, i.e. re-reading the text to find the answer to a given question. The system makes use of several existing NLP resources, i.e. BBN's IdentiFinder (Bikel et al., 1999) for recognising named entities, and specific PropBank (Palmer et al., 2005) semantic arguments (e.g. ARG0, ARG1) identified using ASSERT (Pradhan et al., 2005). The system uses the CBC4Kids corpus (news texts for children) and produces a reading passage along with 5 randomly selected questions and clickable answers in the text. The system measures the accuracy of reading comprehension questions in terms of the grammaticality, semantic correctness and practicality of the questions produced from the text. 81% of the questions the system generated from reading comprehension texts were acceptable. The drawback of this system is that most of the questions are quite obvious and too easy to answer.

Chen et al. (2009) presented an approach to generate self-questioning instructions automatically from any given informational text, especially focusing on children's texts (children in grades 1-3).
Previous work (Mostow and Chen, 2009) automatically generated self-questioning instructions from narrative text by first generating questions from the text and then augmenting the questions into strategy questions. Narrative text focuses on characters, their behaviours and their mental states (e.g. happy, sad, think, regret), while informational text places emphasis on descriptions and rationalisations of certain objective phenomena. Due to the different natures of narrative text and informational text, the same approach cannot be applied to both. Informational text does not contain many mental states, so the system has to make use of discourse markers which indicate causal relationships (conditional and temporal contexts such as if, after), modality (i.e. possibility and necessity), and inference rules to generate questions from informational text. The system evaluated the generated questions in terms of their grammatical correctness and how well they made sense in the context of the text. From 444 sentences in the test corpus, the system generated 180 questions in total: 15 questions about conditional contexts (86.7% acceptable), 88 questions about temporal information (65.9% acceptable) and 77 questions about modality (87.0% acceptable).

Kalady et al. (2010) presented an approach to automatically generate questions based on syntactic and keyword modelling. Their approach mainly relied on parse tree manipulation, named entity recognition and Up-keys (significant phrases in a document) to automatically generate factoid and definitional questions from input documents. The factoid questions are generated from a single sentence and are very simple (e.g. yes/no questions and wh-questions about the subject, object, adverbials and prepositional phrases in the sentence). The process of generating definitional questions is quite different from that for factoid questions, as they have descriptive answers; here the authors used the concept of Up-keys, which are keywords relating to the input document (Das and Elikkottil, 2010). The authors evaluated only the factoid-based questions, by preparing a gold standard of questions from a set of documents and comparing the automatically generated questions with them. They reported the results in terms of precision, recall and F-score; their system achieved a precision score of 0.46, a recall of 0.68 and an F-score of 0.55. The main drawback of this approach is its inability to handle lengthy and complex sentences, as well as the fact that the automatically generated questions are very simple and easy to answer.

It still remains a great challenge in the field of NLP to decide which parts of a given text are important, as the identification of the key concepts present in a text is a critical subtask during automatic question generation (Nielsen, 2008). Moreover, it is also important for the automatically generated questions to be syntactically and semantically well-formed.

4.2 Our Approach

Our research enables us to generate questions regarding the important concepts present in a domain. This is done by relying on the unsupervised Relation Extraction approach; the extracted semantic relations allow us to identify key information in a sentence. In Chapter 3, we extracted the important semantic relations present in a domain in the form of patterns, and in this chapter we will describe our approach to automatically transforming those extracted semantic relations (patterns) into questions.
The questions generated by our approach are more effective because they are generated from important concepts present in the given domain by relying on the extracted semantic relations. Our approach to the automatic generation of questions depends upon accurate output from the named entity tagger and the parser.

4.2.1 Surface-based Patterns

In order to automatically generate questions from surface-based patterns, we first assume that the user has supplied a set of documents on which students will be tested. We will refer to this set of documents as the "evaluation corpus" (in this research, we used a small subset of the GENIA EVENT Annotation corpus as an evaluation corpus). In Chapter 3, we extracted a set of relevance-ranked semantic patterns from the GENIA corpus. As we found that NMI and CHI are the best performing ranking methods, we select semantic patterns attaining higher precision or higher F-score at certain score thresholds using the score-thresholding measure. In our surface-based approach, semantic patterns always start and end with a named entity (see Section 3.1). We therefore extract surface-based semantic patterns from the evaluation corpus and try to match them with the semantic patterns learned from the GENIA corpus; when a match is found, we extract the whole sentence from the evaluation corpus and then automatically transform the extracted pattern into a question by using a certain set of rules (Table 14). This whole automatic question generation process can be illustrated by the following example:

Pattern: DNA contain_v DNA

Step 1: Identify instantiations of the pattern in the evaluation corpus; this involves finding the template (in the above example, the verb 'contain') and the slot fillers (two specific DNAs in the above example). The aforementioned pattern is matched in the evaluation corpus and the relevant sentence is extracted from it:

Thus, the gamma 3 ECS is an inducible promoter containing cis elements that critically mediate CD40L and IL-4-triggered transcriptional activation of the human C gamma 3 gene.

Step 2: The part of the extracted sentence that contains the template together with the slot fillers is tagged with <QP> and </QP> tags, as shown below:

Thus, the <DNA> gamma 3 ECS </DNA> is an <QP> <DNA> inducible promoter </DNA> containing <DNA> cis elements </DNA> </QP> that critically mediate <protein> CD40L </protein> and IL-4-triggered transcriptional activation of the <DNA> human C gamma 3 gene </DNA>.

Step 3: In this step, we extract the semantic tags and actual names from the extracted sentence by employing the Machinese parser (Tapanainen and Järvinen, 1997). After parsing, the extracted semantic pattern is transformed into the following question:

Which DNA contains cis elements?

As mentioned earlier, our surface-based patterns consist of two named entities, one at the start and the other at the end of the pattern, along with content words and prepositions. For the automatic generation of questions from the various forms of extracted patterns, we developed a certain set of rules (Table 14) based on the semantic classes (named entities) and the part-of-speech (PoS) information present in a pattern. We employ verb-centred patterns along with prepositions for question generation, as the presence of a verb between two NEs generally represents a meaningful semantic relation between them.
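One such rule (the SC1 verb SC2 template from Table 14) can be sketched as below; the slot fillers are assumed to have been matched already, and the simple inflection heuristic and the function name are illustrative rather than the thesis's actual rule engine:

    def which_question(sc1, verb, filler2):
        # 'Which SC1 <verb-3sg> <filler2>?', e.g.
        # ('DNA', 'contain', 'cis elements') -> 'Which DNA contains cis elements?'
        third_person = verb + ('es' if verb.endswith(('s', 'sh', 'ch', 'x'))
                               else 's')
        return f"Which {sc1} {third_person} {filler2}?"

    print(which_question('DNA', 'contain', 'cis elements'))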
During the evaluation of the different types of patterns in Chapter 3, we also found that verb-centred patterns along with prepositions achieve good results in terms of precision, recall and F-score compared to the untagged word patterns and the PoS-tagged word patterns. During the automatic generation of questions, we also employed a list of irregular verbs in order to produce the past participle forms of irregular verbs. Table 14 contains a few examples of patterns and their respective automatically generated questions. Here SC represents a Semantic Class (i.e. a named entity type). All these rules are domain-independent and rely only on the presence of the semantic classes and the PoS information between them.

SC1 verb SC2
  DNA contain_v DNA
  Which DNA contains cis elements? / Which DNA is contained by inducible promoter?

SC1 verb preposition SC2
  CELL_TYPE culture_v with_i PROTEIN
  Which cell_type is cultured with IL-4?

SC1 verb adjective SC2
  CELL_TYPE express_v several_j PROTEIN
  Which cell_type expresses several low molecular weight transmembrane adaptor proteins?

SC1 verb verb SC2
  CELL_TYPE exhibit_v enhance_v PROTEIN
  Which cell_type exhibits enhance IL-2?

SC1 adverb verb SC2
  PROTEIN efficiently_a activate_v DNA
  Which DNA is efficiently activated by Oct2?

SC1 verb preposition SC2
  PROTEIN bind_v to_t DNA
  Which protein binds to ribosomal protein gene promoters?

SC1 verb noun preposition SC2
  CELL_LINE confirm_v importance_n of_i PROTEIN
  Which cell_line confirms importance of NF-kappa B?

SC1 verb preposition adjective SC2
  CELL_TYPE derive_v from_i adherent_j CELL_TYPE
  Which cell_type derives from adherent PBMC?

SC1 verb preposition noun preposition SC2
  CELL_TYPE result_v in_i activation_n of_i PROTEIN
  Which cell_type results in activation of TNF-alpha?

SC1 adverb verb noun preposition SC2
  CELL_LINE specifically_a induce_v transcription_n from_i DNA
  Which cell_line specifically induces transcription from interleukin-2 enhancer?

Table 14: Examples of extracted patterns along with automatically generated questions

The quality of the automatically generated questions in terms of their readability, relevance and acceptability will be evaluated in Chapter 5.

4.2.2 Dependency-based Patterns

In a similar way to the surface-based patterns approach, we match a learned relevance-ranked dependency-based pattern (GENIA corpus) with a dependency-based pattern of the evaluation corpus, and the relevant sentence is then extracted from the evaluation corpus. The extracted sentence is then automatically transformed into a question. The automatic question generation process can be explained by the following example. Consider the following pattern expressing a semantic relation between two named entities (a DNA and a protein):

[V/encode] (subj[DNA] + obj[PROTEIN])

This pattern is matched with the following sentence, which contains its instantiation:

This structural similarity suggests that the pAT 133 gene encodes a transcription factor with a specific biological function.

Our dependency-based patterns always include a main verb, so in order to automatically generate questions we traverse the whole dependency tree of the extracted sentence and extract all of the words which depend on the main verb present in the dependency parse of the sentence. From the aforementioned sentence, we thus extract the part of the sentence based on the presence of the main verb from the dependency pattern. The part of the sentence is then transformed into the question by selecting the subtree of the parse bounded by the two named entities present in the dependency pattern. Figure 29 shows the dependency parse of the aforementioned sentence.

Figure 29: Automatic question generation from dependency tree

From the dependency parse in Figure 29, the following question is automatically generated by traversing the whole dependency tree of the sentence and extracting all of the words that depend on the main verb:

Which DNA encodes a transcription factor with a specific biological function?
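This traversal can be sketched as a simple transitive-closure computation over the dependency links; `heads`, mapping each word id to its head id, and `words`, mapping ids to tokens, are assumed illustrative data structures:

    def words_under(verb_id, heads, words):
        # Collect the main verb and every word that transitively
        # depends on it, then restore the original word order.
        keep = {verb_id}
        changed = True
        while changed:
            changed = False
            for wid, head in heads.items():
                if head in keep and wid not in keep:
                    keep.add(wid)
                    changed = True
        return ' '.join(words[i] for i in sorted(keep))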
Similar to the surface-based questions, the quality of the automatically generated dependency-based questions will be evaluated in Chapter 5.

In both the surface-based and dependency-based approaches, we are able to automatically generate only one type of question (Which questions) regarding the named entities present in a semantic relation. Our approach is not capable of automatically generating other types of questions (e.g. Why, How and What questions); to do that, one would have to look at various NLG techniques, which is beyond the scope of this thesis.

4.3 Distractors Generation

Distractors play a vital role in a multiple-choice question, as good-quality distractors ensure a credible development of the learners' knowledge. The automatic generation of plausible distractors is a very important task in the automatic generation of MCQs. During the process of automatically generating distractors, the purpose is to find words which are semantically similar to the correct answer but incorrect in the given context. Goodrich (1977) analysed the potency and discrimination power of manually generated distractors. Previous approaches have used different methods to automatically generate distractors. Mitkov et al. (2006) used several WordNet-based semantic similarity measures, such as the Lesk algorithm (Lesk, 1986), the Jiang and Conrath measure (Jiang and Conrath, 1997), the Lin measure (Lin, 1997) and the Leacock-Chodorow measure (Leacock and Chodorow, 1998), to automatically generate distractors. Most of the previous approaches (e.g. Brown et al., 2005; Sumita et al., 2005 and Hoshino and Nakagawa, 2007) have focused on second language acquisition (i.e. grammar and vocabulary). In these approaches distractors are generally generated by employing WordNet, a machine-readable thesaurus, or in-house thesauri to retrieve similar words (synonyms, antonyms, hypernyms, hyponyms etc.). Pino et al. (2008) used WordNet to measure semantic similarity, while Papasalouros et al. (2008) used domain ontologies built manually by domain experts to automatically generate distractors. Smith et al. (2009) used distributional information from a corpus. Mitkov et al. (2009) argued that semantic similarity measures appear to be a more logical way of automatically generating distractors; they carried out their experiments using various semantic similarity measures and found that there is no statistically significant difference between them. Mitkov et al. (2009) used both WordNet and corpora for the automatic generation of distractors. Pino and Eskenazi (2009) presented an automatic approach to generating morphological distractors for cloze questions for English vocabulary learning. In morphological distractors, the distractor is a morphological variant of the correct answer; for example, if the correct answer is "interested" then the distractor can be "interesting".
Several morphological variant types were generated, such as adding -ing or -ed to a verb, -s to a noun, and -er or -est to an adjective. Aldabe and Maritxalar (2010) presented a corpus-based approach for the automatic generation of distractors in the Basque language. Their approach made use of semantic similarity measures and ontologies in the process of automatically generating distractors; they used Latent Semantic Analysis (LSA) to compute context-word similarity.

In order to generate distractors, our approach relies on distributional similarity measures. Distributional similarity is based on the distributional hypothesis, which states that words occurring in similar contexts tend to have similar meanings (Harris, 1954; Firth, 1957 and Harshman, 1970). Mitkov et al. (2006) suggested the usefulness of distributional similarity measures for automatically generating plausible distractors. Previous research has considered different levels of context, e.g. the document in which a word occurs, an n-gram, a bag of words on either side, or the words with which it has some grammatical dependency. Distributional similarity is a useful measure and is used in many NLP applications such as language modelling, word classification (Turney and Litman, 2003), query expansion in IR (Cao et al., 2008), automatic thesaurus generation (e.g. Grefenstette, 1994; Hatzivassiloglou, 1996; Lin, 1998 and Caraballo, 1999), word sense disambiguation (Yuret and Yatbaz, 2010), fact extraction (Paşca et al., 2006), semantic role labelling (Erk, 2007) and textual advertising (Chang et al., 2009).

We prefer distributional similarity measures over taxonomic similarity measures (such as those based on WordNet) for automatically generating distractors, as the latter require a detailed, manually compiled ontology or a resource containing high-quality definitions of all possible terms. Another drawback of taxonomic similarity measures is their limited coverage, as they require all candidate named entities and terms found in the instructional material to be recorded in the ontology, which is itself a time-consuming and labour-intensive task. Once an ontology is created, updating it is again an expensive and time-consuming task. Moreover, with these manually built lexical resources, matching the measure to the resource is a research problem in itself, as highlighted by Weeds (2003).

Distributional similarity allows us to alleviate the problem of data sparseness by estimating the probabilities of unseen co-occurrences of words from the probabilities of seen co-occurrences of similar words. Moreover, the distributional similarity measure allows us to automatically generate semantically close distractors that are more plausible and better at distinguishing confident test takers from uncertain ones. In distributional similarity, similar named entities are generally computed by comparing co-occurrence vectors between all named entities (Sarmento et al., 2007). The advantage of using distributional similarity is that it is corpus-driven, in contrast to manually created lexical resources (Grefenstette, 1994).
Various distributional similarity measures have been proposed for estimating word co-occurrence probabilities (e.g. the L1 Norm, the Euclidean Distance, the Cosine Metric (Salton and McGill, 1983), Jaccard's Coefficient (Frakes and Baeza-Yates, 1992), the Dice Coefficient (Frakes and Baeza-Yates, 1992), the Kullback-Leibler Divergence (Cover and Thomas, 1991) and the Jensen-Shannon Divergence (Rao, 1982)). Dagan (2000), Weeds (2003) and Mohammad and Hirst (2005) have presented detailed reviews of various distributional similarity measures. The best distributional similarity measure will be the one which returns the most plausible neighbours in the context of a particular application and thus leads to the best performance in that application. Several distributional similarity measures, such as the Euclidean distance, the cosine and the L1 distance, treat the distributions as vectors and make use of geometrically motivated functions to measure distributional similarity. Lee (2001) presented a detailed comparison of various distributional similarity measures.

Distributional similarity has also been used in the area of IE. Lin and Pantel (2001) used it to show that patterns which occur with similar pairs tend to have similar meanings. Turney et al. (2003) further showed that pairs of words that co-occur in similar patterns tend to have similar semantic relations. The distributional hypothesis relies on the availability of a large corpus and is vulnerable to the inevitable data sparseness: reliable estimates of semantic similarity cannot be obtained for words that are infrequent in the corpus. The availability of a large corpus enables us to examine the contexts in which words appear and then calculate the similarity between the various context distributions.

4.3.1 Our Approach

In order to produce distractors from the corpus, we carried out linguistic processing using the GENIA tagger, which provides us with tokenised text along with part-of-speech (PoS) information. In order to handle the data sparseness issue, we built a pool of various biomedical corpora, including GENIA, GENIA EVENT, BioInfer (http://mars.cs.utu.fi/BioInfer/), YPD (Hodges et al., 1999), Yapex (http://www.sics.se/humle/projects/prothalt/#data), MIPS (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC146421/), the WEB corpus (http://www.ncbi.nlm.nih.gov/) and the BioMed corpus (http://www.biomedcentral.com/info/about/datamining/), and generated distractors from these corpora. After linguistic processing, we build a frequency matrix: we scan the corpora for each semantic class (named entity) occurring with a notional word (noun, verb, adverb or adjective) and record their frequencies in a database. In this way, we are able to construct distributional models of all the candidate named entities found in the text. Once an accurate and informative contextual representation of each semantic class has been extracted, along with the frequencies, the semantic classes are compared using the distributional hypothesis that similar words appear in similar contexts. The distractors for a given correct answer are then automatically generated by measuring its similarity to all the candidate named entities. At the end, we select the top 4 most similar candidate named entities as the distractors. Table 15 shows some examples of correct answers and distractors automatically generated by our approach. Our aim is to automatically generate plausible distractors, so if the correct answer is a protein then our approach automatically generates protein distractors that are involved in similar processes or belong to the same biological category.
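A minimal sketch of this selection step follows, assuming `vectors` maps each candidate named entity to a dictionary of context-word probabilities derived from the frequency matrix, and using the Jensen-Shannon divergence adopted below; the function names are illustrative:

    from math import log

    def kl(p, q):
        # Kullback-Leibler divergence D(p || q); q must be non-zero
        # wherever p is, which holds when q is the averaged distribution.
        return sum(pi * log(pi / q[w]) for w, pi in p.items() if pi > 0)

    def jsd(p, q):
        # Jensen-Shannon divergence: average KL divergence of p and q
        # to their average distribution.
        m = {w: 0.5 * (p.get(w, 0) + q.get(w, 0)) for w in set(p) | set(q)}
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def top_distractors(answer, vectors, n=4):
        # Rank every other candidate by divergence from the correct
        # answer's context distribution and keep the n closest.
        ranked = sorted((jsd(vectors[answer], v), ne)
                        for ne, v in vectors.items() if ne != answer)
        return [ne for _, ne in ranked[:n]]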
Table 15: Examples of automatically generated distractors (correct answers paired with four distributionally similar named entities each, covering entities such as K562 cells, STAT1, NF-kappa B transcription factor, IL-2, T lymphocytes, HIV-1 Tat, Fas ligand and the human alpha-globin promoter)

In our research, we used grammatical relation data to model context. The use of grammatical relation data to model context is not new; as Harris (1968) stated:

"The meaning of entities and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities."

We used the Jensen-Shannon divergence (Rao, 1983 and Lin, J., 1991), also known as the information radius, in order to measure the distributional similarity between two context vectors (i.e. named entities). It is a popular distributional similarity measure based on a smoothed version of the Kullback-Leibler divergence measure (Kullback and Leibler, 1951; Cover and Thomas, 1991; Pereira et al., 1993) and has been frequently employed in word clustering and nearest neighbour techniques (e.g. Dagan et al., 1999; Lapata et al., 2001; Dhilon et al., 2002). The Kullback-Leibler divergence, or relative entropy, is an asymmetric measure which is employed to estimate the similarity between two probability mass functions. Cover and Thomas (1991) defined the relative entropy $D(p \,\|\, q)$ between two distributions p and q as "the inefficiency of assuming that the distribution is q when the true distribution is p". So the relative entropy is:

$$D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

The relative entropy is equal to zero if the two distributions are equal.

The Jensen-Shannon divergence is a symmetric measure and is a popular alternative to the Kullback-Leibler divergence measure. Dagan et al. (1999) defined it as "the average of Kullback-Leibler divergence of each of the two distributions to their average distribution":

$$\mathrm{dist}_{JS}(p, q) = \frac{1}{2}\left[\, D\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + D\!\left(q \,\Big\|\, \frac{p+q}{2}\right) \right]$$

Dagan et al. (1997) performed a comparative study of various distributional similarity measures and found that the Jensen-Shannon divergence consistently performs better than the other distributional similarity measures. In Chapter 5, we will evaluate the quality of the automatically generated distractors in terms of their readability, relevance to the correct answer and levels of acceptability.

4.4 Summary

In this chapter, we discussed in detail the approaches used to automatically generate questions from both relation extraction approaches (surface-based and dependency-based). In the surface-based approach, questions were automatically generated from sentences matched by extracted surface-based semantic relations, relying on a certain set of rules, while in the dependency-based approach the questions were automatically generated by traversing the dependency tree of the extracted sentence matched by the dependency-based semantic relation. At the end of this chapter, we discussed our approach to automatically generating distractors using distributional similarity measures.
In Chapter 5, we will evaluate the automatically generated questions and distractors in terms of their readability, relevance and acceptability.

Chapter 5: Extrinsic Evaluation

In this chapter, we will discuss the importance of extrinsic/user-centred evaluation for any NLP system, the evaluation data used during the extrinsic evaluation of both MCQ systems (surface-based and dependency-based), and the criteria used for the extrinsic evaluation of both systems. We will also elaborate on the results obtained for each MCQ system using the evaluation criteria and compare the evaluation results of the two MCQ systems. We involved biomedical experts to extrinsically evaluate both of the systems according to pre-specified evaluation criteria. At the end of this chapter, we will measure the agreement between the two evaluators by employing Kappa statistics.

5.1 Overview

Real application users have a vital role to play in the extrinsic or user-centred evaluation process. The involvement of real users in the evaluation process may vary depending upon the nature of the application. According to Paroubek et al. (2007), user-centred evaluation is a paradigm in which the goal is to analyse the utilisation of the NLP application and its various functionalities by the users in their environment. User-centred evaluation is quite frequently employed by the Information Retrieval (IR), Machine Translation (MT), Natural Language Generation (NLG) and Automatic Summarisation research communities (Hirschman and Mani, 2003; Reiter et al., 2005; Paroubek et al., 2007). In intrinsic evaluation, the output produced by the system is compared against a gold standard (output produced manually by humans before the evaluation). Precision, recall and F-score are the most frequently used evaluation metrics during intrinsic/automatic evaluation. Intrinsic evaluation is the most popular and commonly used evaluation measure, depending upon the availability of a gold standard. In many NLP applications, intrinsic evaluation is used to evaluate components of the application, as we have done during the evaluation of the IE component of both MCQ systems. Extrinsic evaluation is a form of global evaluation in which the application as a whole is evaluated, as we will be doing in this chapter.

Evaluation has become an integral part of any NLP system (Hirschman and Mani, 2003). The sole purpose of evaluation is to provide a common ground on which to compare systems and approaches. During the process of system evaluation, it is essential to identify all the system elements that can figure as performance factors. Spärck Jones and Galliers (1996) made the following observation regarding the process of evaluation:

"Evaluation must be designed to address issues relevant to the specific task domain of the NLP system; therefore, NLP systems operating in different task domains require different evaluation criteria."

All the stakeholders (funding organisations, the research community and end users) want to know how useful the system is in real-life applications and how the system performs in comparison to others. In recent years, the NLP community has invested a lot of time and effort in the evaluation of NLP systems through the organisation of conferences (e.g. the Language Resources and Evaluation Conference (LREC, http://www.lrec-conf.org/)) and many evaluation campaigns such as the Message Understanding Conferences (MUC, http://www-nlpir.nist.gov/related_projects/muc/), the Document Understanding Conferences (DUC, http://www-nlpir.nist.gov/projects/duc/intro.html) and the Text Retrieval Conferences (TREC, http://trec.nist.gov/).
The Text Analysis Conference (TAC) also has many tracks (http://www.nist.gov/tac/tracks/index.html) focused on providing a common evaluation procedure that can improve the performance of NLP systems on end-user tasks. There are several different ways to evaluate NLP systems (Paroubek et al., 2007). In black-box evaluation (Palmer and Finin, 1990), the evaluation is mainly concerned with the output of the system and not with how the system achieves this output. In white-box/glass-box evaluation (Palmer and Finin, 1990), all the system components are assessed in order to find out how the system attains its results. Black-box evaluation is relatively easy compared with white-box evaluation in terms of time and resources.

5.2 Our Approach

In Chapter 3, we evaluated the IE component of our systems (surface-based and dependency-based) using automatic/gold-standard evaluation. In this chapter, we will evaluate both MCQ systems as a whole in a user-centred fashion. The quality of automatically generated MCQs is generally evaluated by human evaluators. The evaluation used in our approach is mainly concerned with the adequate and appropriate generation of MCQs, as well as with the amount of human intervention required. In other words, we want to evaluate our systems in terms of their robustness and efficiency.

5.2.1 Evaluation Data

For the purpose of the evaluation, we randomly selected a small subset of the GENIA EVENT corpus. We found in Chapter 3 that, in both the surface-based and dependency-based approaches, NMI and CHI are the best performing ranking methods during the unsupervised relation extraction phase. CHI achieved very high precision scores but very low recall scores (Figures 18 and 25), while with NMI (Figures 19 and 26) the recall scores are relatively higher than with CHI (see Appendix C for further details). For this reason, during the extrinsic evaluation phase of the automatically generated MCQ systems, we employ NMI for both approaches (surface-based and dependency-based), as it was the only ranking method that enabled us to achieve a higher F-score for both approaches and can therefore provide a better evaluation of both MCQ systems in terms of usability and effectiveness. Similarly, in Chapter 3 we found that the score-thresholding measure performed better than the rank-thresholding measure, so we have chosen the score-thresholding measure here. We selected a score threshold (score > 0.01) for NMI for both approaches, as it gives the maximum F-score for both: for the surface-based approach it gives an F-score of 54%, while for the dependency-based approach the F-score is 65%.

5.2.2 Evaluation Method

The extrinsic evaluations of both MCQ systems (surface-based and dependency-based) follow criteria similar to those used by Farzindar and Lapalme (2004) for the evaluation of LetSum (an automatic legal text summariser). In LetSum, the extrinsic evaluations were based on legal expert judgement. They defined a series of specific questions for the judgement, covering the main topics of the document. If a user is able to answer the questions correctly by reading only the summary, it means that the summary contains all of the necessary information from the source judgement. Extrinsic evaluation can measure to what extent a specific NLP application benefits from employing a certain method or measure.
Both MCQ systems (surface-based and dependency-based) automatically generated 80 and 52 MCQs respectively from the evaluation dataset for NMI score > 0.01. In order to evaluate the quality of the automatically generated MCQs, we use the following criteria:

- Readability: the readability of automatically generated questions and distractors is evaluated by asking whether each is clear, rather clear or incomprehensible.

- Usefulness of semantic relation: questions are automatically generated by relying on semantic relations, so it is important to evaluate the usefulness of the semantic relation present in a question by asking whether it is clear, rather clear or incomprehensible.

- Relevance: automatically generated questions should be relevant to the extracted sentence from which the question is generated; similarly, automatically generated distractors should be relevant to the automatically generated question and its answer. Both questions and distractors are evaluated in terms of relevance by asking whether each is very relevant, rather relevant or not relevant.

- Acceptability: in order to evaluate the acceptability of automatically generated questions and distractors, the evaluators are asked to rate them on a scale of 0 to 5 (where 0 means unacceptable and 5 means totally acceptable).

- Overall MCQ usability: at the end of the evaluation, the evaluators are asked to assess the overall usability of each automatically generated MCQ by selecting one option from directly usable, needs minor revision, needs major revision or unusable.

Figure 30 shows a screenshot of the interface used during the extrinsic evaluation of both automatically generated MCQ systems (Appendix B shows a few examples of automatically generated MCQs). The biomedical experts were asked to complete this interface during the extrinsic evaluation of each MCQ.

In the extrinsic evaluation, two biomedical experts (both post-docs) were asked to evaluate both MCQ systems (surface-based and dependency-based) according to the aforementioned criteria. Both evaluators were highly experienced: one evaluator's main area of research focuses on isolating, characterising and growing stem cells from Keloid and Dupuytren's disease, and is currently working in Plastics and Reconstructive Surgery Research (http://www.plasticsurgeryresearch.org/people/PostDocs.html), while the other is a bio-curator with a PhD in molecular biology, currently working for the HUGO Gene Nomenclature Committee (HGNC, http://www.genenames.org/about/team).

Both evaluators were asked to give a score for the readability of questions and distractors from 1 (incomprehensible) to 3 (clear), the usefulness of semantic relations from 1 (incomprehensible) to 3 (clear), question and distractor relevance from 1 (not relevant) to 3 (very relevant), question and distractor acceptability from 0 (unacceptable) to 5 (acceptable) and overall MCQ usability from 1 (unusable) to 4 (directly usable).

Figure 30: Screenshot of extrinsic evaluation interface

5.2.3 Results

Table 16 shows the results obtained for the surface-based and dependency-based MCQ systems, where QR, DR, USR, QRelv, DRelv, QA, DA and MCQ Usability represent Question Readability, Distractors Readability, Usefulness of Semantic Relation, Question Relevance, Distractors Relevance, Question Acceptability, Distractors Acceptability and Overall MCQ Usability respectively.
              QR     DR     USR    QRelv  DRelv  QA     DA     MCQ Usability
              (1-3)  (1-3)  (1-3)  (1-3)  (1-3)  (0-5)  (0-5)  (1-4)

Surface-based MCQ System
Evaluator 1   2.15   2.96   2.14   2.04   2.24   2.53   3.04   2.61
Evaluator 2   1.74   2.29   1.88   1.66   2.10   1.95   3.28   2.11
Average       1.95   2.63   2.01   1.85   2.17   2.24   3.16   2.36

Dependency-based MCQ System
Evaluator 1   2.42   2.98   2.38   2.37   2.31   3.25   3.73   3.37
Evaluator 2   2.25   2.15   2.46   2.23   2.06   3.27   3.15   2.79
Average       2.34   2.57   2.42   2.30   2.19   3.26   3.44   3.08

Table 16: Evaluation results of surface-based and dependency-based MCQ systems

5.2.4 Comparison

In this section, we compare the results of the surface-based and dependency-based MCQ systems. For this purpose, we take the average scores of all the categories for each MCQ system and compare them. Figure 31 shows the comparison between the two MCQ systems.

[Figure 31: three bar charts comparing the average scores of the two systems per evaluation criterion: QR, DR, USR, QRelv and DRelv (scale 1-3); QA and DA (scale 0-5); and MCQ Usability (scale 1-4)]

Figure 31: Comparison between surface-based and dependency-based MCQ systems

The results in Figure 31 show that MCQs generated using the dependency-based approach achieve better results in the extrinsic evaluation than those generated using the surface-based approach, in terms of question readability, usefulness of semantic relation, question and distractor relevance, question and distractor acceptability and overall MCQ usability.

In terms of overall MCQ usability, the extrinsic evaluation results show that for the surface-based MCQ system 35% of MCQ items were considered directly usable, 30% needed minor revisions, 14% needed major revisions and 21% were deemed unusable. For the dependency-based MCQ system, we found that 65% of MCQ items were considered directly usable, 23% needed minor revisions, 6% needed major revisions and 6% were unusable.

5.2.5 Discussion

We used Kappa statistics (Cohen, 1960) to measure the agreement between the two evaluators. The Kappa statistic is a useful and popular quantitative measure of agreement between evaluators. The Kappa coefficient between evaluators is defined as:

K = (P_A - P_E) / (1 - P_E)

where P_A is the proportion of times the evaluators agree and P_E is the proportion of times that we would expect the evaluators to agree by chance. K = 1 when there is complete agreement among the evaluators, while K = 0 when there is no agreement beyond chance. The interpretation of the Kappa score is very important, and an example of a commonly used scale is presented in Table 17 (Cohen, 1960).

Kappa Score    Agreement
< 0.20         Poor
0.21 - 0.40    Fair
0.41 - 0.60    Moderate
0.61 - 0.80    Good
0.81 - 1.00    Excellent

Table 17: Interpretation of Kappa score

In our extrinsic evaluation, both evaluators evaluated both MCQ systems (surface-based and dependency-based) according to the criteria described in Section 5.2.2. We measured the agreement between the evaluators using the Kappa score, which is shown in Table 18.
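For concreteness, the computation of the (unweighted) Kappa coefficient can be sketched as follows; this is a minimal illustration of the formula above, in which the helper function and the ratings are hypothetical:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: K = (P_A - P_E) / (1 - P_E)."""
    n = len(ratings_a)
    # P_A: observed proportion of items on which the two evaluators agree.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # P_E: expected chance agreement, computed from each evaluator's
    # marginal distribution over the rating categories.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical question-readability ratings (1 = incomprehensible, 3 = clear):
print(cohens_kappa([3, 2, 3, 1, 2, 3], [3, 3, 3, 1, 2, 2]))  # approx. 0.45
```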
Evaluation Criteria               Kappa Score           Kappa Score
                                  (Surface-based MCQ)   (Dependency-based MCQ)
Question Readability              0.29                  0.31
Distractors Readability           0.08                  -0.13
Usefulness of Semantic Relation   0.21                  0.42
Question Relevance                0.27                  0.22
Distractors Relevance             0.29                  0.31
Question Acceptability            0.27                  0.26
Distractors Acceptability         0.12                  0.10
Overall MCQ Usability             0.25                  0.23

Table 18: Kappa score

The average Kappa score is 0.27, which is fair according to Table 17 but not very high, owing to the various sub-categories present in the extrinsic evaluation. We therefore used weighted Kappa (Cohen, 1968) to measure the agreement across major sub-categories between which there is a meaningful difference. For example, for question readability we had three sub-categories: 'Clear', 'Rather Clear' and 'Incomprehensible'. In this case we may not care whether one evaluator rates the readability of a question as 'Clear' while the other rates the same question as 'Rather Clear'; we would care, however, if one evaluator rates it as 'Clear' while the other rates the same question as 'Incomprehensible'. In weighted Kappa, we assigned an agreement score of 1 when the two evaluators agreed exactly, and a score of 0.5 when one evaluator rated the question readability of a question as 'Clear' while the other rated it as 'Rather Clear'. We used similar criteria for distractors readability, usefulness of semantic relation, question relevance and distractors relevance. For question and distractor acceptability, we assigned an agreement score of 1 when both evaluators agreed completely, and a score of 0.5 when both evaluators chose acceptability values between 0 and 2, or when both chose values between 3 and 5. For overall MCQ usability, we assigned a score of 1 when the two evaluators agreed, and a score of 0.5 when one evaluator marked an MCQ as 'Directly Usable' while the other marked the same MCQ as 'Needs Minor Revision', or when one evaluator marked an MCQ as 'Needs Major Revision' while the other marked it as 'Unusable'. Table 19 shows the results obtained using weighted Kappa.

Evaluation Criteria               Kappa Score           Kappa Score
                                  (Surface-based MCQ)   (Dependency-based MCQ)
Question Readability              0.44                  0.44
Distractors Readability           0.48                  0.37
Usefulness of Semantic Relation   0.37                  0.51
Question Relevance                0.43                  0.42
Distractors Relevance             0.48                  0.54
Question Acceptability            0.46                  0.45
Distractors Acceptability         0.39                  0.39
Overall MCQ Usability             0.43                  0.41

Table 19: Weighted Kappa score

The results in Table 19 show that the use of weighted Kappa increased the agreement between the two evaluators from fair to moderate. Since the agreement between the two evaluators is still not very high, we do not restrict ourselves to the average scores of the two evaluators, but also analyse the scores assigned by each evaluator separately. One of the main reasons for the lack of a high agreement score between the two evaluators is that these MCQs were generated from a part of the GENIA EVENT corpus, which is very different from an instructional text or teaching material. As mentioned earlier, the GENIA EVENT corpus consists of MEDLINE abstracts, and as a result some automatically generated MCQs are ambiguous or lack context.
For example, for one MCQ one evaluator classified the question readability as 'Clear' while the other classified the same MCQ as 'Rather Clear' owing to the lack of context. This can be illustrated by the following example:

Sentence: Conversely inhibition of NF-kappaB confers a tenfold increase in glucocorticoid mediated apoptosis establishing that NF-kappaB also functions as an antiapoptotic factor.

The following question was automatically generated from the aforementioned sentence:

Which protein also functions as an antiapoptotic factor?

According to the feedback of one evaluator, this question is ambiguous and needs more context: there are hundreds of apoptotic factors, so there is a possibility of more than one right answer to this question. Similarly, NF-kappa B refers to a family of several proteins rather than a single protein, so context is also important in automatically generating good quality MCQs.

Moreover, the GENIA named entity tagger's occasional inability to recognise the boundaries of a named entity sometimes resulted in an MCQ where the answer to the question is partially given in the question itself. This can be seen in the following example:

Sentence: The B cell-specific nuclear factor OTF-2 positively regulates transcription of the human class II transplantation gene DRA.

The following question was automatically generated from the aforementioned sentence:

Which protein OTF-2 positively regulates transcription of the human class II transplantation gene DRA?

According to the evaluator's feedback, the answer to the question is partially given in the question, and the actual question should be:

Which protein positively regulates transcription of the human class II transplantation gene DRA?

But due to the GENIA tagger's inability to recognise some named entity boundaries, our system was unable to automatically generate the correct question.

In order to test the significance of the difference between the two sets of MCQ systems (surface-based and dependency-based), we used the Chi-Square test which, being a non-parametric statistical test, is suitable as we cannot assume a normal distribution of evaluator scores. In carrying out the test, we compared two sets of scores assigned by one evaluator: the scores assigned to MCT items generated with the surface-based method and those assigned to MCT items generated with the dependency-based method. Table 20 shows the p-values of the Chi-Square test obtained using the evaluation scores provided by the two evaluators; values marked with an asterisk indicate a statistically significant difference between the surface-based and dependency-based MCQ systems (at the level of p < 0.05).

Evaluation Criteria               p-values of Chi-Square Test
                                  Evaluator 1   Evaluator 2
Question Readability              0.1912        0.0011*
Distractors Readability           0.5496        0.4249
Usefulness of Semantic Relation   0.2737        0.0002*
Question Relevance                0.0855        0.0004*
Distractors Relevance             0.1244        0.7022
Question Acceptability            0.1449        0.0028*
Distractors Acceptability         0.0715        0.4123
Overall MCQ Usability             0.0026*       0.0010*

Table 20: p-values of the Chi-Square test

Both evaluators agreed during the extrinsic evaluation that the dependency-based MCQ system is better than the surface-based MCQ system in terms of overall MCQ usability, and this is confirmed by the p-values of the Chi-Square test (Table 20): there is indeed a statistically significant difference between the surface-based and dependency-based MCQ systems in terms of overall MCQ usability.
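As an illustration, such a test can be run with a standard statistics library. The sketch below assumes SciPy is available and uses per-category counts reconstructed approximately from the overall-usability percentages reported in Section 5.2.4; the exact per-evaluator counts are not reproduced here:

```python
from scipy.stats import chi2_contingency

# Approximate counts of overall MCQ usability judgements, reconstructed from
# the percentages reported above (80 surface-based and 52 dependency-based items).
# Category order: unusable, needs major revision, needs minor revision, directly usable.
surface_counts = [17, 11, 24, 28]
dependency_counts = [3, 3, 12, 34]

# chi2_contingency takes a contingency table of observed counts (system x category)
# and returns the statistic, the p-value, the degrees of freedom and the
# expected counts under the null hypothesis of identical score distributions.
chi2, p_value, dof, _ = chi2_contingency([surface_counts, dependency_counts])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```

A contingency-table Chi-Square over system-by-category counts corresponds to the comparison of score distributions described above.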
The MCQs generated by the dependency-based system are more usable than those generated by the surface-based system.

Our extrinsic evaluation methodology enables us to evaluate automatically generated MCQs in terms of question and distractor readability, usefulness of semantic relation, question and distractor relevance, question and distractor acceptability and the overall usability of an MCQ. In 2010, the First Question Generation Shared Task Evaluation Challenge (QGSTEC, http://www.questiongeneration.org/QGSTEC2010) used similar evaluation criteria, evaluating automatically generated questions in terms of relevance, question type, syntactic correctness and fluency, ambiguity and variety. Mitkov et al. (2006, see Section 2.1) carried out the extrinsic evaluation of their automatically generated MCQ system on a much broader scale by using item response theory (Gronlund, 1982), evaluating their MCT items in terms of their difficulty and discrimination. We were unable to carry out such an extrinsic evaluation of our MCQ systems due to a lack of resources, and in the future we would like to explore this evaluation approach for our systems.

5.3 Summary

We had already measured the performance of the Information Extraction component of the system using automatic, gold-standard evaluation in terms of precision, recall and F-score in Chapter 3. In this chapter, we used extrinsic evaluation to measure the performance of the whole MCQ systems based on surface-based and dependency-based semantic patterns. We first elaborated on the importance of evaluation in NLP systems and then on the criteria used in this evaluation. Two biomedical experts evaluated both systems on the basis of these pre-defined evaluation criteria, and we found that the dependency-based MCQ system performed better than the surface-based MCQ system. Moreover, we used Kappa statistics to measure the agreement between the two evaluators and found that there is moderate agreement between them. We also found that there is a statistically significant difference between the overall MCQ usability of the dependency-based and surface-based MCQ systems, and that the MCQs generated by the dependency-based system are more usable than those generated by the surface-based system.

Chapter 6: Conclusions

This chapter provides a summary of the main contributions of the thesis, presents a review of the whole thesis and outlines directions for future extensions of this work.

6.1 Thesis Contributions

This thesis has presented research in the area of unsupervised relation extraction for e-Learning applications. We mainly focused on the automatic generation of multiple-choice questions (MCQs). The main aim of this thesis was to use IE methodologies to improve the quality of automatically generated MCQs and to overcome the problems faced by previous approaches. Most of the previous approaches to the automatic generation of MCQs relied on the syntactic structures of sentences to generate questions, while different approaches focused on different methods to automatically generate distractors (Section 2.1). The main drawback of these approaches was that they were unable to automatically generate questions from complex sentences; another problem they faced was the selection of appropriate sentences for automatic question generation. In contrast, our approach attempts to capture semantic rather than syntactic relations between key terms and named entities in a text.
In this way, our approach makes use of semantic relations in order to select the best candidate sentences for question generation. Our approach consisted of three main phases: in the first phase we used IE methodologies to extract semantic relations; in the second phase we automatically generated questions using these semantic relations; and in the third phase distractors were automatically generated using distributional similarity measures. This aim was accomplished by adopting unsupervised relation extraction approaches (surface-based and dependency-based) to extract the important semantic relations from the text. In the surface-based approach, we investigated several surface-based pattern types, while in the dependency-based approach we studied semantic relations extracted on the basis of the dependency tree of a sentence.

We conducted experiments with various information-theoretic and statistical measures to rank candidate semantic patterns by domain relevance, as well as meta-ranking (a method that combines multiple pattern-ranking methods). The domain ranking methods were used to select those patterns that capture the most important semantic relations between key notions discussed in domain text. Both surface-based and dependency-based patterns selected in this way were evaluated in terms of precision, recall and F-score. The experimental results revealed that, overall, in both the surface-based and dependency-based approaches Normalised Mutual Information (NMI) and Chi-Square (CHI) were the best performing ranking methods. Moreover, we studied two different measures to select patterns, the score-thresholding measure and the rank-thresholding measure, and found that score-thresholding performed better than rank-thresholding.

These extracted semantic relations (surface-based and dependency-based) allowed us to automatically generate better quality questions by focusing on the important concepts present in a given text. In the surface-based approach, questions were automatically generated from semantic relations by using a set of rules based on the named entities and part-of-speech information present in the surface-based patterns. In the dependency-based approach, the questions were automatically generated by traversing the dependency tree of a sentence. As dependency-based patterns always include a main verb, we traverse the whole dependency tree of the extracted sentence and extract all words that depend on the main verb present in the dependency pattern in order to automatically generate questions.

At the next stage, plausible distractors were automatically generated by using a distributional similarity measure. Distributional similarity is known to adequately model the semantic similarity between lexical expressions and is used quite frequently in many NLP applications (Section 4.3). There exist several distributional similarity measures, and previous studies suggest that Information Radius is one of the best performing among them. Distributional similarity measures are corpus-driven and have broad coverage compared with thesaurus-based methods, which have limited coverage. Moreover, we preferred distributional similarity measures over taxonomic similarity measures (such as those making use of WordNet), as the latter require a detailed, manually compiled ontology or a resource containing high-quality definitions of all possible terms.
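To illustrate how such a measure can be computed, the following is a minimal sketch of the Information Radius (Jensen-Shannon divergence) between the context-frequency distributions of two expressions; the context words and counts shown are hypothetical:

```python
import math

def kl_divergence(p, q):
    """D(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def information_radius(counts_a, counts_b):
    """Jensen-Shannon divergence between two context-count dictionaries."""
    vocab = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    p = [counts_a.get(w, 0) / total_a for w in vocab]
    q = [counts_b.get(w, 0) / total_b for w in vocab]
    # The mean distribution m is non-zero wherever p or q is non-zero,
    # so both KL terms below are well defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical context counts for two candidate terms:
ctx_a = {"binds": 4, "inhibits": 2, "expression": 6}
ctx_b = {"binds": 3, "activates": 1, "expression": 5}
print(information_radius(ctx_a, ctx_b))  # 0 = identical contexts; lower = more similar
```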
After the individual components of the systems were evaluated using intrinsic evaluation (i.e. against gold-standard data), we carried out an extrinsic/user-centred evaluation of the whole integrated MCQ systems. We presented an extrinsic evaluation approach to assess the quality of the automatically generated MCQ systems. Both MCQ systems were evaluated in terms of question and distractor readability, usefulness of semantic relation, question and distractor relevance, acceptability and the overall usability of the automatically generated MCQs. Two domain experts evaluated both systems according to the aforementioned evaluation criteria, and the results revealed that MCQs generated using the dependency-based approach were more usable than those generated using the surface-based approach.

In this research, we mainly focused on the biomedical domain, but the developed methods for pattern extraction, distractor generation and question generation are quite portable and can easily be extended to other domains.

6.2 Thesis Review

In this section, we present a brief summary of the various chapters of the thesis.

Chapter 1 contained the introduction to the research topic and shed light on the importance of e-Learning and the growing need for effective and efficient e-Learning applications. The chapter also briefly described the importance of multiple choice questions in assessment and the challenges faced during the automatic generation of multiple choice questions. It also elaborated on a set of goals that needed to be accomplished for the successful completion of this research.

Chapter 2 presented an overview of the work done so far in the area of automatic generation of multiple choice questions, along with a detailed description of the drawbacks and achievements of previous approaches. In this chapter we also defined the concept of Information Extraction (IE), its applications, its subtasks (Named Entity Recognition (NER) and Relation Extraction (RE)), the evaluation of IE systems, different strategies for performing IE and various machine learning approaches in IE. The chapter also provided an overview of various supervised, semi-supervised and unsupervised IE systems, and elaborated on the importance and growing use of the Web as a corpus and the challenges faced during its use.

Chapter 3 contained a detailed description of the IE phase of this research. In that chapter, we presented two unsupervised RE approaches (surface-based and dependency-based) that can cover a potentially unrestricted range of semantic relations, in contrast to other RE approaches, which can only learn to extract those relations presented in annotated text or seed patterns. In our experiments we employed various information-theoretic and statistical measures to rank extracted semantic patterns, and the experimental results revealed that in both the surface-based and dependency-based approaches Normalised Mutual Information and Chi-Square were the best performing ranking methods in terms of precision, recall and F-score. For pattern selection, we used rank-thresholding and score-thresholding measures and found that score-thresholding performed better than rank-thresholding. In the surface-based approach, we explored three different pattern types, both with and without prepositions. The experimental results showed that verb-centred surface patterns with prepositions were the best among the surface pattern types.
We also compared the best performing surface-based approach with the dependency-based approach and found that the dependency-based approach attained better results than the surface-based approach. Our unsupervised RE approaches were able to achieve high precision scores, which was very important, as having high precision scores allowed us to automatically generate good quality MCQs.

Chapter 4 contained a detailed description of how semantic patterns (surface-based and dependency-based) are automatically transformed into good quality questions. Using these extracted semantic relations, our approach enabled us to identify the parts of a given text that are worth asking a question about. Plausible distractors were automatically generated by using a distributional similarity measure. The reason for choosing a distributional similarity measure was that it is corpus-based, alleviates the problem of data sparseness and provides good coverage compared with taxonomic similarity measures, which require detailed, manually compiled ontologies and have limited coverage. For the automatic generation of distractors, we collected various biomedical corpora and built a frequency matrix of semantic classes (named entities) co-occurring with notional words in the corpora. This enabled our distributional similarity measure to automatically generate distractors (similar expressions) appearing in similar contexts.

Chapter 5 described an extrinsic evaluation method used to assess the quality of both MCQ systems (surface-based and dependency-based) in terms of question and distractor readability, usefulness of semantic relation, question and distractor relevance, acceptability and the overall usability of the automatically generated MCQs. Two biomedical experts independently evaluated both MCQ systems according to the aforementioned evaluation criteria. The results of this evaluation revealed that the quality and usability of the MCQs generated by the dependency-based MCQ system were much better than those of the surface-based MCQ system.

6.3 Future Work

During this research and the development of the systems for the automatic generation of MCQs, a series of potential future leads emerged. They remain unaddressed in this thesis due to the unavailability of resources and time restrictions, and are discussed in this section.

One of the major advantages of our approach to the automatic generation of MCQs is its domain-independence and portability. It makes use of an unsupervised semantic relation extraction method, so it can easily be adapted to other domains. In the future, we would like to extend our approach to other domains. A further direction of research is to demonstrate its portability to other specialist domains and to study its dependence on the amount and quality of the corpora from which IE patterns are learned.

The IE component of our MCQ generation systems is based on the assumption that a semantic relation holds between named entities stated in the same sentence, and that the presence or absence of a relation is independent of the text preceding or following the sentence. It will be interesting to investigate a relation extraction process that operates over multiple sentences rather than a single sentence.
Moreover, it would increase the number of extracted semantic relations, and ultimately the quality of the automatically generated MCQs, if the given text were first processed by an anaphora and co-reference resolution system that replaces all anaphors with their antecedents before semantic relations are extracted from the text.

In the IE phase, we used the Machinese parser for the dependency-based approach. It would be interesting to investigate what impact other parsers, such as MINIPAR (http://webdocs.cs.ualberta.ca/~lindek/minipar.htm) and the Stanford parser (http://www-nlp.stanford.edu/software/lex-parser.shtml), would have on the precision, recall and F-score of the relation extraction process.

The extracted semantic relations can also be useful in other applications such as testing reading comprehension, where the IE component can identify important concepts in a given text and show which parts of the learning material are vital and worth testing.

The automatic question generation phase may benefit from the use of NLG technology (McIntyre and Lapata, 2009; Barzilay and Lapata, 2005; Reiter and Dale, 2000) to improve the quality and grammaticality of automatically generated questions. Another direction of future work is to improve the quality of automatically generated questions further and use them in intelligent tutoring systems, dialogue systems and game-based learning environments.

For automatic distractor generation, we used a distributional similarity measure, which is a corpus-driven approach. The Web, the biggest corpus available to the research community, is frequently used in many NLP applications today, so it would be interesting to investigate the use of the Web as a source for automatic distractor generation. Wikipedia (http://en.wikipedia.org/wiki/Main_Page) is another useful resource that could be employed in automatic distractor generation.

It would be interesting to carry out the extrinsic evaluation of our MCQ systems on a much broader scale using item response theory (Gronlund, 1982). Mitkov et al. (2006) used this theory during the extrinsic evaluation of their MCQ system, in which they evaluated MCT items in terms of their difficulty and discrimination.

In the future, our approach to the automatic generation of MCQs could be personalised to help address the potential knowledge gaps of individuals. In this way, our approach could provide significant assistance to teachers and instructors during the entire learning process.

Appendix A: Previously Published Work

This appendix provides a brief description of the papers included in this thesis that have previously been published in the proceedings of peer-reviewed and well-known international conferences. The papers were extended to address the shortcomings identified after their publication and were then included in this thesis.

- Afzal, N. & Pekar, V. (2009). Unsupervised relation extraction for automatic generation of multiple-choice questions. In Proceedings of the Recent Advances in Natural Language Processing (RANLP 2009). Borovets, Bulgaria, pp. 1-5. This paper presents the unsupervised surface-based relation extraction approach. The findings of this paper are described in Section 3.1 of Chapter 3.

- Afzal, N., Mitkov, R. & Farzindar, A. (2011). Unsupervised relation extraction using dependency trees for automatic generation of multiple-choice questions. In C. Butz and P.
Lingras (Eds.): Canadian Artificial Intelligence, LNAI 6657. Newfoundland and Labrador, Canada: Springer, Heidelberg, pp. 32-43. This paper presents the unsupervised dependency-based relation extraction approach. The findings of this paper are used in Section 3.2 of Chapter 3.

Appendix B: Examples of Automatically Generated MCQs

This appendix contains a few examples of MCQs automatically generated using the dependency-based approach, along with the sentences from which each MCQ was generated.

Sentence: PPARalpha activators inhibit cytokine-induced vascular cell adhesion molecule-1 expression in human endothelial cells.

Which protein activators inhibit cytokine-induced vascular cell adhesion molecule-1 expression in human endothelial cells?
- Interleukin-5
- PPARalpha
- cultured human ECs
- lymphoid and myeloid cells
- proinflammatory mediator

Sentence: Taken together these results indicate that STAT1 plays a pivotal role in the differentiation/maturation process of monocytes as an early transcription factor initially activated by adherence and then able to modulate the expression of functional genes such as ICAM-1 and FcgammaRI.

Which protein plays a pivotal role in the differentiation/maturation process of monocytes as an early transcription factor?
- JAK3
- NF-kappa B
- STAT1
- transcription factor
- STAT3

Sentence: We show that TLR2 associates with the high-affinity LPS binding protein membrane CD14 to serve as an LPS receptor complex and that LPS treatment enhances the oligomerization of TLR2.

Which protein associates with the high-affinity LPS binding protein membrane CD14?
- Phosphatidylinositol 3-kinase
- T-cell-specific transcription factor
- TLR2
- eukaryotic transcription factor
- HLA-DM

Sentence: We have found that ISG expression in the monocytic U937 cell line differs from most cell lines previously examined.

Which protein expression in the monocytic U937 cell line differs from most cell lines?
- ISG
- SOCS-1
- beta-like globin cluster
- early growth response-1 gene
- Rel/NF-kappa B

Sentence: We show here that c-Rel binds to kappa B sites as homodimers as well as heterodimers with p50.

Which protein binds to kappa B sites as homodimers as well as heterodimers?
- B cells
- NF-kappa B
- NF-kappa B
- c-Rel
- p65

Sentence: We also present evidence that IL-6 kappa B binding factor II functions as a repressor specific for IL-6 kappa B-related kappa B motifs in lymphoid cells.

Which protein functions as a repressor specific for IL-6 kappa B-related kappa B motifs in lymphoid cells?
- IL-6 kappa B binding factor II
- Translocated hormone/receptor complexes
- positive and negative regulatory factors
- recombinant caspase 3
- p1-79 probes

Sentence: The long terminal repeat (LTR) region of HIV proviral DNA contains binding sites for nuclear factor kappa B (NF-kappa B) and this transcriptional activator appears to regulate HIV activation.

Which DNA region of HIV proviral DNA contains binding sites for nuclear factor kappa B (NF-kappa B)?
- Epstein-Barr viral DNA
- chronically infected T cell line
- long terminal repeat
- transcription factor family
- IL-1alpha gene

Sentence: We report here that the HIV-1-encoded Nef protein inhibits the induction of NF-kappa B DNA-binding activity by T-cell mitogens.

Which protein inhibits the induction of NF-kappa B DNA-binding activity by T-cell mitogens?
- HIV-1-encoded Nef protein
- immediate precursors
- prognostic factor
- reticulocytes
- metastasis-suppressor gene

Sentence: We have found that the p49 (100) DNA binding subunit together with p65 can act in concert with Tat-I to stimulate the expression of HIV-CAT plasmid.

Which protein together with p65 can act in concert with Tat-I?
- HLA DQA1*0201
- human PAX-5 gene
- p49 (100) DNA binding subunit
- raf
- immune system regulatory and effector cells

Appendix C: Result Tables

        GENIA              WEB                GENIA + WEB
        P     R     F      P     R     F      P     R     F

Top 100 Ranked Patterns
IG      0.530 0.009 0.018  0.150 0.003 0.005  0.200 0.003 0.007
IGR     0.560 0.010 0.019  0.150 0.003 0.005  0.200 0.003 0.007
MI      0.330 0.006 0.011  0.030 0.001 0.001  0.080 0.001 0.003
NMI     0.680 0.012 0.023  0.390 0.007 0.013  0.550 0.010 0.019
LL      0.560 0.010 0.019  0.150 0.003 0.005  0.200 0.003 0.007
CHI     0.740 0.013 0.025  0.570 0.010 0.019  0.640 0.011 0.022
Meta    0.740 0.013 0.025  0.480 0.008 0.016  0.540 0.009 0.018
tf-idf  0.660 0.011 0.023  0.380 0.007 0.013  0.530 0.009 0.018
Top 200 Ranked Patterns
IG      0.560 0.019 0.038  0.210 0.007 0.014  0.200 0.007 0.013
IGR     0.565 0.020 0.038  0.210 0.007 0.014  0.205 0.007 0.014
MI      0.305 0.011 0.020  0.030 0.001 0.002  0.105 0.004 0.007
NMI     0.530 0.018 0.036  0.380 0.013 0.025  0.455 0.016 0.031
LL      0.565 0.020 0.038  0.210 0.007 0.014  0.205 0.007 0.014
CHI     0.615 0.021 0.041  0.465 0.016 0.031  0.540 0.019 0.036
Meta    0.605 0.021 0.041  0.315 0.011 0.021  0.430 0.015 0.029
tf-idf  0.525 0.018 0.035  0.375 0.013 0.025  0.390 0.014 0.026
Top 300 Ranked Patterns
IG      0.543 0.028 0.054  0.173 0.009 0.017  0.213 0.011 0.021
IGR     0.540 0.028 0.053  0.173 0.009 0.017  0.217 0.011 0.021
MI      0.343 0.018 0.034  0.037 0.002 0.004  0.120 0.006 0.012
NMI     0.540 0.028 0.053  0.320 0.017 0.032  0.400 0.021 0.040
LL      0.540 0.028 0.053  0.173 0.009 0.017  0.217 0.011 0.021
CHI     0.577 0.030 0.057  0.387 0.020 0.038  0.483 0.025 0.048
Meta    0.543 0.028 0.054  0.317 0.016 0.031  0.377 0.020 0.037
tf-idf  0.527 0.027 0.052  0.313 0.016 0.031  0.347 0.018 0.034

Table 1: Rank-thresholding results of untagged word patterns

        GENIA              WEB                GENIA + WEB
        P     R     F      P     R     F      P     R     F

Top 100 Ranked Patterns
IG      0.780 0.014 0.027  0.320 0.006 0.011  0.300 0.005 0.010
IGR     0.790 0.014 0.027  0.320 0.006 0.011  0.300 0.005 0.010
MI      0.430 0.008 0.015  0.030 0.001 0.001  0.080 0.001 0.003
NMI     0.810 0.014 0.028  0.520 0.009 0.018  0.660 0.012 0.023
LL      0.790 0.014 0.027  0.320 0.006 0.011  0.300 0.005 0.010
CHI     0.900 0.016 0.031  0.700 0.012 0.024  0.800 0.014 0.028
Meta    0.860 0.015 0.030  0.390 0.007 0.014  0.460 0.008 0.016
tf-idf  0.800 0.014 0.028  0.420 0.007 0.015  0.520 0.009 0.018
Top 200 Ranked Patterns
IG      0.755 0.027 0.051  0.380 0.013 0.026  0.415 0.015 0.028
IGR     0.755 0.027 0.051  0.380 0.013 0.026  0.415 0.015 0.028
MI      0.420 0.015 0.029  0.050 0.002 0.003  0.125 0.004 0.009
NMI     0.720 0.025 0.049  0.480 0.017 0.033  0.545 0.019 0.037
LL      0.755 0.027 0.051  0.380 0.013 0.026  0.415 0.015 0.028
CHI     0.755 0.027 0.051  0.565 0.020 0.038  0.570 0.020 0.039
Meta    0.765 0.027 0.052  0.400 0.014 0.027  0.480 0.017 0.033
tf-idf  0.715 0.025 0.049  0.440 0.016 0.030  0.490 0.017 0.033
Top 300 Ranked Patterns
IG      0.720 0.038 0.072  0.307 0.016 0.031  0.353 0.019 0.035
IGR     0.730 0.039 0.073  0.303 0.016 0.030  0.353 0.019 0.035
MI      0.460 0.024 0.046  0.043 0.002 0.004  0.140 0.007 0.014
NMI     0.707 0.037 0.071  0.410 0.022 0.041  0.503 0.027 0.051
LL      0.730 0.039 0.073  0.303 0.016 0.030  0.353 0.019 0.035
CHI     0.740 0.039 0.074  0.423 0.022 0.043  0.500 0.026 0.050
Meta    0.727 0.038 0.073  0.407 0.021 0.041  0.480 0.025 0.048
tf-idf  0.677 0.036 0.068  0.373 0.020 0.037  0.430 0.023 0.043

Table 2: Rank-thresholding results of PoS-tagged word patterns

        GENIA              WEB                GENIA + WEB
        P     R     F      P     R     F      P     R     F

Top 100 Ranked Patterns
IG      0.840 0.021 0.041  0.380 0.009 0.018  0.410 0.010 0.020
IGR     0.840 0.021 0.041  0.380 0.009 0.018  0.410 0.010 0.020
MI      0.380 0.009 0.018  0.060 0.001 0.003  0.100 0.002 0.005
NMI     0.790 0.020 0.038  0.380 0.009 0.018  0.480 0.012 0.023
LL      0.840 0.021 0.041  0.380 0.009 0.018  0.410 0.010 0.020
CHI     0.820 0.020 0.040  0.540 0.013 0.026  0.620 0.015 0.030
Meta    0.830 0.021 0.040  0.370 0.009 0.018  0.380 0.009 0.018
tf-idf  0.780 0.019 0.038  0.390 0.010 0.019  0.450 0.011 0.022
Top 200 Ranked Patterns
IG      0.765 0.038 0.073  0.335 0.017 0.032  0.410 0.010 0.020
IGR     0.765 0.038 0.073  0.340 0.017 0.032  0.410 0.010 0.020
MI      0.360 0.018 0.034  0.040 0.002 0.004  0.160 0.008 0.015
NMI     0.710 0.035 0.067  0.330 0.016 0.031  0.395 0.020 0.038
LL      0.765 0.038 0.073  0.340 0.017 0.032  0.410 0.010 0.020
CHI     0.735 0.037 0.070  0.365 0.018 0.035  0.465 0.023 0.044
Meta    0.750 0.037 0.071  0.310 0.015 0.029  0.395 0.020 0.038
tf-idf  0.690 0.034 0.066  0.320 0.016 0.030  0.435 0.022 0.041
Top 300 Ranked Patterns
IG      0.770 0.058 0.107  0.263 0.020 0.037  0.357 0.027 0.050
IGR     0.760 0.057 0.106  0.267 0.020 0.037  0.353 0.026 0.049
MI      0.413 0.031 0.058  0.040 0.003 0.006  0.157 0.012 0.022
NMI     0.603 0.045 0.084  0.247 0.018 0.034  0.330 0.025 0.046
LL      0.757 0.057 0.105  0.260 0.019 0.036  0.353 0.026 0.049
CHI     0.623 0.047 0.087  0.297 0.022 0.041  0.367 0.027 0.051
Meta    0.667 0.050 0.093  0.277 0.021 0.039  0.327 0.024 0.045
tf-idf  0.597 0.045 0.083  0.283 0.021 0.039  0.337 0.025 0.047

Table 3: Rank-thresholding results of verb-centred word patterns

        GENIA              WEB                GENIA + WEB
        P     R     F      P     R     F      P     R     F

Threshold score > 0.01
IG      0.354 0.461 0.401  0.067 0.064 0.065  0.106 0.143 0.122
IGR     0.348 0.358 0.353  0.060 0.065 0.062  0.110 0.125 0.117
MI      0.354 0.518 0.420  0.025 0.126 0.041  0.082 0.523 0.141
NMI     0.357 0.441 0.395  0.027 0.121 0.044  0.083 0.457 0.140
LL      0.349 0.353 0.351  0.060 0.065 0.062  0.110 0.125 0.117
CHI     0.348 0.230 0.277  0.158 0.047 0.072  0.228 0.064 0.099
Meta    0.353 0.392 0.372  0.026 0.097 0.042  0.083 0.383 0.136
tf-idf  0.332 0.265 0.295  0.059 0.085 0.070  0.100 0.332 0.154
Threshold score > 0.02
IG      0.355 0.280 0.313  0.078 0.043 0.055  0.121 0.079 0.095
IGR     0.356 0.213 0.266  0.080 0.043 0.056  0.123 0.076 0.094
MI      0.355 0.429 0.388  0.026 0.115 0.043  0.082 0.450 0.139
NMI     0.354 0.384 0.368  0.028 0.108 0.045  0.084 0.404 0.139
LL      0.356 0.213 0.266  0.080 0.043 0.056  0.123 0.076 0.094
CHI     0.342 0.163 0.221  0.282 0.033 0.058  0.326 0.039 0.070
Meta    0.347 0.294 0.318  0.029 0.083 0.043  0.086 0.309 0.135
tf-idf  0.326 0.238 0.275  0.063 0.066 0.064  0.114 0.311 0.167
Threshold score > 0.03
IG      0.354 0.197 0.253  0.089 0.035 0.051  0.131 0.056 0.078
IGR     0.482 0.064 0.113  0.085 0.033 0.047  0.131 0.056 0.078
MI      0.353 0.392 0.372  0.027 0.105 0.043  0.083 0.413 0.138
NMI     0.348 0.327 0.337  0.028 0.097 0.044  0.084 0.361 0.136
LL      0.482 0.064 0.113  0.085 0.033 0.047  0.130 0.055 0.078
CHI     0.339 0.159 0.216  0.314 0.025 0.047  0.386 0.029 0.054
Meta    0.348 0.258 0.296  0.033 0.073 0.045  0.091 0.253 0.134
tf-idf  0.304 0.201 0.242  0.064 0.053 0.058  0.132 0.251 0.173
Threshold score > 0.04
IG      0.479 0.058 0.104  0.095 0.025 0.039  0.148 0.044 0.068
IGR     0.511 0.050 0.092  0.099 0.026 0.042  0.149 0.044 0.068
MI      0.350 0.368 0.359  0.026 0.099 0.042  0.083 0.397 0.137
NMI     0.344 0.305 0.323  0.029 0.091 0.044  0.084 0.340 0.135
LL      0.511 0.050 0.092  0.099 0.026 0.042  0.149 0.044 0.068
CHI     0.571 0.029 0.055  0.413 0.019 0.037  0.518 0.023 0.044
Meta    0.346 0.219 0.268  0.043 0.063 0.051  0.104 0.198 0.137
tf-idf  0.324 0.161 0.215  0.075 0.042 0.054  0.160 0.223 0.187
Threshold score > 0.05
IG      0.515 0.044 0.082  0.102 0.021 0.035  0.153 0.037 0.060
IGR     0.511 0.043 0.079  0.102 0.021 0.035  0.153 0.037 0.060
MI      0.349 0.341 0.345  0.026 0.093 0.041  0.083 0.368 0.135
NMI     0.346 0.290 0.316  0.028 0.086 0.042  0.085 0.330 0.135
LL      0.510 0.043 0.079  0.102 0.021 0.035  0.153 0.037 0.060
CHI     0.585 0.024 0.047  0.462 0.017 0.032  0.543 0.019 0.036
Meta    0.340 0.210 0.260  0.042 0.059 0.049  0.105 0.191 0.136
tf-idf  0.350 0.148 0.208  0.092 0.035 0.051  0.212 0.196 0.204
Threshold score > 0.06
IG      0.537 0.034 0.063  0.111 0.018 0.031  0.168 0.032 0.053
IGR     0.551 0.032 0.061  0.111 0.018 0.031  0.167 0.031 0.052
MI      0.344 0.319 0.331  0.027 0.089 0.041  0.083 0.350 0.134
NMI     0.342 0.265 0.299  0.029 0.083 0.043  0.086 0.310 0.134
LL      0.551 0.032 0.061  0.111 0.018 0.031  0.167 0.031 0.052
CHI     0.576 0.023 0.044  0.516 0.014 0.027  0.589 0.015 0.030
Meta    0.344 0.171 0.229  0.126 0.047 0.069  0.172 0.080 0.109
tf-idf  0.352 0.115 0.173  0.119 0.029 0.047  0.340 0.130 0.188
Threshold score > 0.07
IG      0.544 0.029 0.055  0.113 0.016 0.028  0.168 0.027 0.047
IGR     0.537 0.028 0.052  0.113 0.015 0.027  0.169 0.027 0.047
MI      0.344 0.315 0.329  0.026 0.086 0.040  0.082 0.343 0.133
NMI     0.341 0.261 0.295  0.029 0.081 0.042  0.086 0.303 0.134
LL      0.536 0.027 0.052  0.113 0.015 0.027  0.169 0.027 0.047
CHI     0.733 0.015 0.030  0.558 0.012 0.024  0.638 0.013 0.025
Meta    0.341 0.169 0.226  0.134 0.046 0.068  0.173 0.073 0.103
tf-idf  0.360 0.075 0.124  0.144 0.024 0.041  0.434 0.116 0.182
Threshold score > 0.08
IG      0.538 0.026 0.049  0.129 0.013 0.024  0.170 0.025 0.043
IGR     0.537 0.023 0.044  0.114 0.013 0.023  0.169 0.025 0.043
MI      0.340 0.299 0.318  0.026 0.084 0.040  0.082 0.330 0.132
NMI     0.339 0.254 0.291  0.029 0.079 0.043  0.087 0.296 0.134
LL      0.537 0.023 0.044  0.114 0.013 0.023  0.170 0.025 0.043
CHI     0.750 0.014 0.028  0.569 0.011 0.021  0.640 0.011 0.022
Meta    0.526 0.037 0.069  0.153 0.042 0.066  0.190 0.065 0.096
tf-idf  0.398 0.062 0.107  0.201 0.017 0.031  0.451 0.095 0.157
Threshold score > 0.09
IG      0.562 0.022 0.042  0.148 0.012 0.022  0.188 0.022 0.039
IGR     0.560 0.022 0.042  0.148 0.012 0.022  0.188 0.022 0.039
MI      0.338 0.293 0.314  0.026 0.081 0.039  0.082 0.325 0.132
NMI     0.342 0.240 0.282  0.029 0.076 0.042  0.088 0.286 0.135
LL      0.560 0.022 0.042  0.144 0.012 0.022  0.188 0.022 0.039
CHI     0.740 0.013 0.025  0.579 0.010 0.019  0.687 0.010 0.020
Meta    0.529 0.035 0.066  0.159 0.040 0.064  0.196 0.061 0.093
tf-idf  0.548 0.033 0.063  0.259 0.014 0.026  0.467 0.072 0.125
Threshold score > 0.1
IG      0.563 0.021 0.040  0.159 0.011 0.020  0.200 0.018 0.033
IGR     0.564 0.021 0.040  0.192 0.011 0.020  0.200 0.018 0.033
MI      0.341 0.281 0.308  0.026 0.079 0.039  0.083 0.322 0.132
NMI     0.341 0.237 0.280  0.029 0.074 0.041  0.087 0.282 0.134
LL      0.562 0.020 0.040  0.188 0.010 0.019  0.200 0.018 0.033
CHI     0.806 0.010 0.020  0.614 0.009 0.017  0.711 0.009 0.018
Meta    0.524 0.034 0.064  0.153 0.037 0.060  0.255 0.046 0.078
tf-idf  0.589 0.019 0.038  0.294 0.010 0.018  0.491 0.053 0.096
Threshold score > 0.2
IG      0.667 0.007 0.014  0.164 0.005 0.009  0.203 0.007 0.014
IGR     0.667 0.007 0.014  0.164 0.005 0.009  0.203 0.007 0.014
MI      0.337 0.232 0.275  0.025 0.066 0.036  0.085 0.269 0.129
NMI     0.338 0.159 0.216  0.041 0.054 0.047  0.103 0.178 0.131
LL      0.667 0.007 0.014  0.164 0.005 0.009  0.203 0.007 0.014
CHI     1.000 0.004 0.009  0.697 0.004 0.008  0.688 0.004 0.008
Meta    0.800 0.009 0.018  0.300 0.011 0.022  0.382 0.016 0.031
tf-idf  0.709 0.013 0.025  0.345 0.007 0.013  0.541 0.045 0.084
Threshold score > 0.3
IG      0.568 0.004 0.009  0.165 0.003 0.006  0.203 0.004 0.008
IGR     0.568 0.004 0.009  0.165 0.003 0.006  0.203 0.004 0.008
MI      0.337 0.213 0.261  0.026 0.059 0.036  0.086 0.244 0.127
NMI     0.550 0.025 0.048  0.138 0.041 0.064  0.177 0.065 0.095
LL      0.568 0.004 0.009  0.165 0.003 0.006  0.203 0.004 0.008
CHI     1.000 0.002 0.004  0.727 0.003 0.006  0.875 0.002 0.005
Meta    1.000 0.004 0.008  0.683 0.005 0.010  0.600 0.006 0.011
tf-idf  0.714 0.008 0.015  0.375 0.004 0.008  0.614 0.022 0.042
Threshold score > 0.4
IG      0.750 0.002 0.004  0.480 0.002 0.004  0.500 0.003 0.005
IGR     0.733 0.002 0.004  0.480 0.002 0.004  0.500 0.003 0.005
MI      0.329 0.190 0.241  0.027 0.052 0.036  0.088 0.213 0.124
NMI     0.544 0.019 0.038  0.157 0.034 0.056  0.198 0.053 0.084
LL      0.733 0.002 0.004  0.480 0.002 0.004  0.500 0.003 0.005
CHI     1.000 0.002 0.003  0.917 0.002 0.004  1.000 0.002 0.003
Meta    1.000 0.002 0.005  0.696 0.003 0.006  0.882 0.003 0.005
tf-idf  0.741 0.003 0.007  0.464 0.002 0.004  0.667 0.012 0.023

Table 4: Score-thresholding results of untagged word patterns

        GENIA              WEB                GENIA + WEB
        P     R     F      P     R     F      P     R     F

Threshold score > 0.01
IG      0.444 0.328 0.377  0.084 0.072 0.078  0.145 0.119 0.131
IGR     0.439 0.377 0.406  0.084 0.072 0.078  0.142 0.126 0.133
MI      0.436 0.684 0.533  0.032 0.173 0.054  0.103 0.700 0.179
NMI     0.439 0.593 0.504  0.035 0.164 0.058  0.106 0.620 0.182
LL      0.439 0.378 0.407  0.084 0.072 0.078  0.142 0.123 0.131
CHI     0.439 0.266 0.331  0.221 0.056 0.090  0.349 0.065 0.110
Meta    0.436 0.398 0.416  0.036 0.123 0.056  0.108 0.467 0.176
tf-idf  0.414 0.311 0.355  0.042 0.140 0.064  0.110 0.439 0.176
Threshold score > 0.02
IG      0.457 0.239 0.314  0.166 0.043 0.068  0.203 0.066 0.100
IGR     0.457 0.242 0.316  0.156 0.044 0.069  0.203 0.066 0.100
MI      0.438 0.582 0.500  0.034 0.157 0.056  0.106 0.610 0.180
NMI     0.436 0.513 0.471  0.037 0.146 0.059  0.109 0.551 0.182
LL      0.457 0.242 0.316  0.156 0.044 0.069  0.203 0.066 0.099
CHI     0.442 0.213 0.287  0.364 0.035 0.064  0.467 0.045 0.083
Meta    0.438 0.349 0.389  0.041 0.104 0.059  0.115 0.372 0.175
tf-idf  0.427 0.278 0.337  0.047 0.115 0.067  0.112 0.374 0.173
Threshold score > 0.03
IG      0.729 0.056 0.104  0.209 0.032 0.056  0.258 0.046 0.078
IGR     0.731 0.057 0.107  0.209 0.032 0.056  0.252 0.044 0.074
MI      0.443 0.526 0.481  0.035 0.141 0.056  0.107 0.557 0.180
NMI     0.435 0.433 0.434  0.039 0.130 0.060  0.112 0.470 0.181
LL      0.730 0.057 0.106  0.209 0.032 0.056  0.253 0.044 0.075
CHI     0.737 0.038 0.072  0.412 0.026 0.049  0.497 0.031 0.058
Meta    0.440 0.301 0.357  0.046 0.093 0.061  0.121 0.316 0.175
tf-idf  0.418 0.183 0.255  0.053 0.103 0.070  0.123 0.349 0.182
Threshold score > 0.04
IG      0.749 0.047 0.088  0.213 0.026 0.046  0.276 0.035 0.062
IGR     0.749 0.047 0.088  0.215 0.026 0.047  0.276 0.035 0.062
MI      0.434 0.498 0.463  0.035 0.137 0.056  0.108 0.540 0.180
NMI     0.433 0.395 0.413  0.039 0.121 0.059  0.112 0.442 0.179
LL      0.749 0.047 0.088  0.215 0.026 0.047  0.276 0.035 0.062
CHI     0.751 0.033 0.063  0.559 0.020 0.039  0.517 0.024 0.047
Meta    0.439 0.267 0.332  0.058 0.081 0.068  0.139 0.253 0.179
tf-idf  0.465 0.165 0.244  0.062 0.077 0.069  0.124 0.328 0.180
Threshold score > 0.05
IG      0.816 0.033 0.063  0.247 0.021 0.040  0.318 0.029 0.053
IGR     0.817 0.031 0.060  0.246 0.021 0.039  0.318 0.029 0.053
MI      0.434 0.444 0.439  0.036 0.126 0.056  0.109 0.487 0.177
NMI     0.436 0.374 0.402  0.039 0.116 0.058  0.112 0.430 0.178
LL      0.817 0.031 0.060  0.246 0.021 0.039  0.318 0.029 0.053
CHI     0.743 0.031 0.060  0.655 0.017 0.033  0.709 0.019 0.037
Meta    0.447 0.219 0.294  0.056 0.075 0.064  0.138 0.245 0.177
tf-idf  0.499 0.150 0.231  0.068 0.064 0.066  0.124 0.317 0.179
Threshold score > 0.06
IG      0.809 0.028 0.053  0.319 0.018 0.034  0.310 0.025 0.046
IGR     0.827 0.026 0.051  0.319 0.018 0.034  0.310 0.025 0.046
MI      0.430 0.422 0.426  0.037 0.120 0.056  0.110 0.458 0.177
NMI     0.432 0.337 0.379  0.041 0.109 0.060  0.114 0.394 0.177
LL      0.827 0.026 0.051  0.319 0.018 0.034  0.310 0.025 0.046
CHI     0.878 0.018 0.035  0.683 0.015 0.029  0.750 0.016 0.031
Meta    0.444 0.216 0.290  0.177 0.057 0.087  0.236 0.094 0.135
tf-idf  0.500 0.094 0.158  0.108 0.042 0.060  0.151 0.268 0.194
Threshold score > 0.07
IG      0.852 0.024 0.047  0.307 0.016 0.030  0.361 0.021 0.041
IGR     0.876 0.020 0.039  0.308 0.016 0.030  0.360 0.021 0.040
MI      0.431 0.410 0.420  0.036 0.115 0.055  0.109 0.444 0.176
NMI     0.431 0.334 0.376  0.040 0.105 0.058  0.114 0.387 0.176
LL      0.876 0.020 0.039  0.308 0.016 0.030  0.360 0.021 0.040
CHI     0.873 0.017 0.033  0.719 0.012 0.024  0.796 0.014 0.028
Meta    0.441 0.212 0.286  0.178 0.055 0.084  0.239 0.090 0.131
tf-idf  0.542 0.071 0.125  0.119 0.033 0.052  0.200 0.246 0.221
Threshold score > 0.08
IG      0.718 0.018 0.035  0.296 0.014 0.027  0.354 0.020 0.037
IGR     0.720 0.019 0.037  0.297 0.014 0.027  0.351 0.020 0.037
MI      0.430 0.388 0.408  0.036 0.111 0.054  0.110 0.432 0.175
NMI     0.431 0.329 0.373  0.040 0.102 0.058  0.114 0.375 0.175
LL      0.720 0.019 0.037  0.297 0.014 0.027  0.354 0.020 0.037
CHI     0.921 0.012 0.024  0.742 0.012 0.023  0.835 0.013 0.025
Meta    0.724 0.041 0.078  0.201 0.050 0.079  0.267 0.081 0.124
tf-idf  0.601 0.069 0.124  0.141 0.024 0.041  0.257 0.182 0.213
Threshold score > 0.09
IG      0.698 0.016 0.030  0.378 0.012 0.023  0.415 0.015 0.028
IGR     0.715 0.017 0.034  0.386 0.013 0.024  0.415 0.015 0.028
MI      0.427 0.379 0.402  0.036 0.106 0.053  0.109 0.419 0.173
NMI     0.433 0.307 0.360  0.041 0.099 0.058  0.115 0.362 0.175
LL      0.715 0.017 0.034  0.386 0.013 0.024  0.412 0.014 0.028
CHI     0.955 0.011 0.022  0.773 0.010 0.020  0.829 0.010 0.020
Meta    0.725 0.038 0.072  0.205 0.048 0.078  0.268 0.077 0.119
tf-idf  0.680 0.050 0.093  0.184 0.020 0.035  0.313 0.155 0.208
Threshold score > 0.1
IG      0.692 0.015 0.029  0.359 0.011 0.021  0.402 0.014 0.027
IGR     0.697 0.015 0.029  0.359 0.011 0.021  0.402 0.014 0.027
MI      0.427 0.372 0.398  0.035 0.104 0.053  0.109 0.408 0.173
NMI     0.431 0.304 0.357  0.040 0.096 0.056  0.115 0.358 0.174
LL      0.697 0.015 0.029  0.359 0.011 0.021  0.402 0.014 0.027
CHI     0.951 0.010 0.020  0.774 0.008 0.017  0.839 0.009 0.018
Meta    0.739 0.034 0.066  0.272 0.040 0.070  0.349 0.055 0.095
tf-idf  0.700 0.036 0.068  0.276 0.015 0.028  0.376 0.119 0.180
Threshold score > 0.2
IG      0.853 0.005 0.010  0.320 0.006 0.011  0.348 0.007 0.014
IGR     0.853 0.007 0.014  0.320 0.006 0.011  0.348 0.007 0.014
MI      0.427 0.297 0.350  0.035 0.082 0.049  0.111 0.339 0.167
NMI     0.439 0.209 0.283  0.054 0.072 0.062  0.136 0.239 0.173
LL      0.853 0.005 0.010  0.320 0.006 0.011  0.348 0.007 0.014
CHI     1.000 0.004 0.007  0.806 0.004 0.009  0.839 0.005 0.009
Meta    0.894 0.010 0.021  0.400 0.014 0.027  0.495 0.019 0.037
tf-idf  0.775 0.025 0.048  0.300 0.010 0.019  0.535 0.086 0.148
Threshold score > 0.3
IG      0.810 0.003 0.006  0.250 0.004 0.008  0.293 0.005 0.010
IGR     0.810 0.003 0.006  0.250 0.004 0.008  0.293 0.005 0.010
MI      0.424 0.275 0.334  0.035 0.074 0.048  0.112 0.305 0.164
NMI     0.729 0.034 0.064  0.173 0.050 0.078  0.236 0.085 0.125
LL      0.810 0.003 0.006  0.250 0.004 0.008  0.293 0.005 0.010
CHI     1.000 0.002 0.004  0.850 0.003 0.006  0.938 0.003 0.005
Meta    1.000 0.004 0.008  0.833 0.006 0.012  0.837 0.006 0.013
tf-idf  0.854 0.014 0.028  0.385 0.007 0.015  0.695 0.052 0.096
Threshold score > 0.4
IG      1.000 0.002 0.004  0.203 0.003 0.006  0.239 0.004 0.008
IGR     1.000 0.002 0.004  0.203 0.003 0.006  0.239 0.004 0.008
MI      0.422 0.245 0.310  0.036 0.068 0.047  0.114 0.277 0.161
NMI 0.722 0.026 0.050 0.199 0.042 0.070 0.266 0.071 0.112 LL 1.000 0.002 0.004 0.203 0.003 0.006 0.239 0.004 0.008 CHI 1.000 0.002 0.003 0.909 0.002 0.004 0.917 0.002 0.004 Meta 1.000 0.002 0.004 0.826 0.003 0.007 0.895 0.003 0.006 tf-idf 0.906 0.005 0.010 0.453 0.004 0.008 0.830 0.016 0.032 Table 5: Score-thresholding results of PoS-tagged word patterns 145 GENIA P WEB R GENIA + WEB F P R F P R F Threshold score > 0.01 IG 0.447 0.361 0.400 0.071 0.071 0.071 0.147 0.200 0.169 IGR 0.444 0.455 0.449 0.069 0.072 0.070 0.152 0.179 0.164 MI 0.451 0.689 0.545 0.030 0.156 0.050 0.108 0.722 0.188 NMI 0.448 0.589 0.509 0.031 0.147 0.052 0.110 0.650 0.189 LL 0.444 0.455 0.449 0.068 0.073 0.071 0.152 0.179 0.164 CHI 0.444 0.291 0.352 0.099 0.055 0.071 0.203 0.119 0.150 Meta 0.448 0.511 0.478 0.032 0.111 0.050 0.113 0.494 0.184 tf-idf 0.415 0.504 0.455 0.036 0.108 0.054 0.115 0.458 0.183 Threshold score > 0.02 IG 0.460 0.233 0.309 0.125 0.050 0.072 0.177 0.086 0.116 IGR 0.450 0.307 0.365 0.124 0.050 0.071 0.180 0.084 0.115 MI 0.446 0.572 0.501 0.031 0.140 0.051 0.110 0.632 0.187 NMI 0.448 0.507 0.476 0.033 0.131 0.052 0.113 0.578 0.189 LL 0.450 0.307 0.365 0.125 0.050 0.072 0.187 0.083 0.115 CHI 0.445 0.213 0.288 0.164 0.039 0.063 0.277 0.054 0.091 Meta 0.448 0.357 0.397 0.036 0.094 0.052 0.117 0.393 0.180 tf-idf 0.435 0.504 0.467 0.042 0.108 0.061 0.118 0.423 0.184 Threshold score > 0.03 IG 0.776 0.054 0.100 0.176 0.038 0.062 0.268 0.051 0.086 IGR 0.457 0.228 0.305 0.183 0.037 0.062 0.270 0.046 0.079 MI 0.446 0.513 0.477 0.031 0.126 0.050 0.111 0.579 0.186 NMI 0.443 0.430 0.436 0.033 0.116 0.052 0.115 0.506 0.187 LL 0.457 0.228 0.305 0.179 0.038 0.062 0.270 0.046 0.079 CHI 0.440 0.209 0.283 0.211 0.030 0.053 0.331 0.042 0.075 Meta 0.448 0.315 0.370 0.040 0.085 0.054 0.121 0.334 0.178 tf-idf 0.435 0.383 0.407 0.045 0.076 0.056 0.126 0.352 0.186 Threshold score > 0.04 IG 0.770 0.047 0.088 0.230 0.028 0.049 0.313 0.036 0.064 IGR 0.775 0.051 0.095 0.232 0.028 0.050 0.312 0.035 0.063 MI 0.444 0.493 0.467 0.031 0.122 0.050 0.112 0.553 0.186 NMI 0.444 0.390 0.416 0.034 0.108 0.051 0.115 0.470 0.184 LL 0.775 0.051 0.095 0.224 0.028 0.050 0.312 0.035 0.063 146 CHI 0.760 0.034 0.065 0.289 0.025 0.045 0.356 0.028 0.052 Meta 0.445 0.293 0.353 0.048 0.074 0.058 0.136 0.275 0.182 tf-idf 0.594 0.226 0.327 0.050 0.067 0.057 0.132 0.274 0.178 Threshold score > 0.05 IG 0.770 0.041 0.078 0.245 0.023 0.042 0.350 0.029 0.053 IGR 0.768 0.045 0.084 0.245 0.023 0.042 0.350 0.029 0.053 MI 0.441 0.441 0.441 0.031 0.111 0.048 0.112 0.509 0.184 NMI 0.445 0.368 0.403 0.033 0.102 0.050 0.115 0.455 0.184 LL 0.768 0.045 0.084 0.245 0.023 0.042 0.350 0.029 0.053 CHI 0.764 0.030 0.058 0.360 0.019 0.036 0.488 0.021 0.040 Meta 0.445 0.265 0.332 0.047 0.070 0.056 0.136 0.266 0.180 tf-idf 0.627 0.187 0.288 0.055 0.055 0.055 0.141 0.240 0.178 Threshold score > 0.06 IG 0.762 0.038 0.073 0.263 0.020 0.038 0.347 0.024 0.044 IGR 0.766 0.039 0.074 0.263 0.020 0.038 0.347 0.024 0.044 MI 0.439 0.421 0.429 0.031 0.105 0.048 0.113 0.481 0.182 NMI 0.441 0.334 0.380 0.034 0.097 0.050 0.116 0.423 0.182 LL 0.766 0.039 0.074 0.263 0.020 0.038 0.347 0.024 0.044 CHI 0.758 0.028 0.054 0.404 0.017 0.032 0.565 0.018 0.036 Meta 0.446 0.216 0.291 0.090 0.053 0.066 0.183 0.131 0.153 tf-idf 0.658 0.130 0.217 0.068 0.048 0.057 0.192 0.162 0.176 Threshold score > 0.07 IG 0.864 0.025 0.049 0.336 0.018 0.034 0.409 0.021 0.040 IGR 0.856 0.027 0.052 0.340 0.017 0.033 0.409 0.021 0.040 MI 0.439 0.406 0.422 0.031 0.102 0.048 0.112 0.468 0.181 NMI 0.441 0.332 0.379 
0.034 0.094 0.050 0.116 0.408 0.180 LL 0.856 0.027 0.052 0.340 0.017 0.033 0.409 0.021 0.040 CHI 0.889 0.016 0.031 0.455 0.015 0.029 0.610 0.016 0.031 Meta 0.445 0.214 0.289 0.095 0.051 0.067 0.183 0.125 0.149 tf-idf 0.686 0.076 0.137 0.079 0.036 0.050 0.202 0.136 0.163 Threshold score > 0.08 IG 0.840 0.021 0.041 0.348 0.016 0.030 0.410 0.019 0.036 IGR 0.850 0.017 0.032 0.350 0.016 0.031 0.410 0.019 0.036 MI 0.443 0.386 0.412 0.031 0.098 0.047 0.113 0.453 0.180 NMI 0.440 0.327 0.375 0.034 0.091 0.049 0.116 0.394 0.179 147 LL 0.850 0.023 0.044 0.352 0.016 0.031 0.410 0.019 0.036 CHI 0.881 0.015 0.029 0.524 0.013 0.026 0.630 0.014 0.028 Meta 0.762 0.040 0.076 0.110 0.047 0.065 0.198 0.108 0.140 tf-idf 0.725 0.026 0.050 0.101 0.024 0.039 0.233 0.107 0.146 Threshold score > 0.09 IG 0.859 0.020 0.039 0.345 0.015 0.028 0.412 0.017 0.033 IGR 0.844 0.020 0.039 0.345 0.015 0.028 0.415 0.018 0.034 MI 0.439 0.378 0.406 0.031 0.094 0.046 0.111 0.435 0.177 NMI 0.442 0.305 0.361 0.035 0.089 0.050 0.117 0.383 0.179 LL 0.844 0.020 0.039 0.345 0.015 0.028 0.415 0.018 0.034 CHI 0.875 0.014 0.027 0.578 0.012 0.023 0.689 0.013 0.025 Meta 0.758 0.039 0.074 0.111 0.045 0.064 0.201 0.105 0.138 tf-idf 0.753 0.014 0.028 0.134 0.020 0.035 0.240 0.090 0.131 Threshold score > 0.1 IG 0.889 0.014 0.027 0.353 0.013 0.025 0.404 0.016 0.031 IGR 0.891 0.014 0.028 0.345 0.013 0.025 0.411 0.017 0.032 MI 0.439 0.371 0.402 0.031 0.091 0.046 0.112 0.429 0.177 NMI 0.440 0.302 0.358 0.034 0.085 0.048 0.116 0.372 0.176 LL 0.891 0.014 0.028 0.345 0.013 0.025 0.411 0.017 0.032 CHI 0.947 0.009 0.018 0.608 0.011 0.022 0.758 0.012 0.023 Meta 0.754 0.036 0.069 0.109 0.042 0.061 0.199 0.100 0.133 tf-idf 0.783 0.009 0.018 0.215 0.017 0.031 0.268 0.073 0.115 Threshold score > 0.2 IG 0.852 0.006 0.011 0.341 0.007 0.014 0.389 0.009 0.018 IGR 0.852 0.006 0.011 0.341 0.007 0.014 0.389 0.009 0.018 MI 0.434 0.318 0.367 0.029 0.074 0.042 0.111 0.358 0.170 NMI 0.440 0.208 0.283 0.044 0.064 0.052 0.134 0.260 0.176 LL 0.852 0.006 0.011 0.341 0.007 0.014 0.389 0.009 0.018 CHI 1.000 0.003 0.006 0.800 0.006 0.012 0.852 0.006 0.011 Meta 0.867 0.010 0.019 0.310 0.015 0.029 0.396 0.019 0.037 tf-idf 0.826 0.005 0.009 0.393 0.012 0.023 0.411 0.044 0.079 Threshold score > 0.3 IG 0.789 0.004 0.007 0.280 0.005 0.010 0.326 0.007 0.014 IGR 0.789 0.004 0.007 0.280 0.005 0.010 0.326 0.007 0.014 MI 0.432 0.275 0.336 0.032 0.068 0.043 0.114 0.311 0.166 148 NMI 0.754 0.031 0.060 0.091 0.046 0.061 0.178 0.117 0.141 LL 0.789 0.004 0.007 0.280 0.005 0.010 0.326 0.007 0.014 CHI 1.000 0.002 0.004 0.789 0.004 0.007 0.833 0.004 0.007 Meta 1.000 0.004 0.008 0.727 0.006 0.012 0.405 0.008 0.016 tf-idf 0.846 0.003 0.005 0.500 0.008 0.016 0.572 0.032 0.060 Threshold score > 0.4 IG 0.714 0.002 0.005 0.254 0.004 0.009 0.316 0.006 0.012 IGR 0.714 0.002 0.004 0.254 0.004 0.009 0.316 0.006 0.012 MI 0.431 0.247 0.314 0.033 0.062 0.043 0.116 0.283 0.164 NMI 0.735 0.025 0.048 0.100 0.037 0.054 0.195 0.095 0.128 LL 0.714 0.002 0.005 0.254 0.004 0.009 0.316 0.006 0.012 CHI 1.000 0.001 0.003 0.867 0.003 0.006 1.000 0.002 0.005 Meta 1.000 0.002 0.005 0.810 0.004 0.008 0.842 0.004 0.008 tf-idf 0.875 0.002 0.003 0.677 0.005 0.010 0.764 0.010 0.021 Table 6: Score-thresholding results of verb-centred word patterns GENIA P R WEB GENIA + WEB F P R F P R F Top 100 Ranked Patterns IG 0.720 0.015 0.029 0.030 0.001 0.001 0.030 0.001 0.001 IGR 0.740 0.015 0.030 0.030 0.001 0.001 0.030 0.001 0.001 MI 0.400 0.008 0.016 0.010 0.000 0.000 0.180 0.004 0.007 NMI 0.770 0.016 0.031 0.190 0.004 0.008 0.250 
GENIA (P R F) | WEB (P R F) | GENIA + WEB (P R F)

Top 100 Ranked Patterns
IG     0.720 0.015 0.029 | 0.030 0.001 0.001 | 0.030 0.001 0.001
IGR    0.740 0.015 0.030 | 0.030 0.001 0.001 | 0.030 0.001 0.001
MI     0.400 0.008 0.016 | 0.010 0.000 0.000 | 0.180 0.004 0.007
NMI    0.770 0.016 0.031 | 0.190 0.004 0.008 | 0.250 0.005 0.010
LL     0.740 0.015 0.030 | 0.030 0.001 0.001 | 0.030 0.001 0.001
CHI    0.770 0.016 0.031 | 0.220 0.005 0.009 | 0.200 0.004 0.008
Meta   0.770 0.016 0.031 | 0.120 0.002 0.005 | 0.100 0.002 0.004
tf-idf 0.750 0.015 0.030 | 0.130 0.003 0.005 | 0.170 0.004 0.007

Top 200 Ranked Patterns
IG     0.670 0.028 0.053 | 0.025 0.001 0.002 | 0.050 0.002 0.004
IGR    0.680 0.028 0.054 | 0.025 0.001 0.002 | 0.050 0.002 0.004
MI     0.380 0.016 0.030 | 0.020 0.000 0.001 | 0.135 0.006 0.011
NMI    0.575 0.024 0.046 | 0.130 0.005 0.010 | 0.195 0.008 0.015
LL     0.680 0.028 0.054 | 0.025 0.001 0.002 | 0.050 0.002 0.004
CHI    0.630 0.026 0.050 | 0.150 0.006 0.012 | 0.200 0.008 0.016
Meta   0.705 0.029 0.056 | 0.085 0.004 0.007 | 0.090 0.004 0.007
tf-idf 0.670 0.028 0.053 | 0.100 0.004 0.008 | 0.140 0.006 0.011

Top 300 Ranked Patterns
IG     0.607 0.037 0.071 | 0.047 0.003 0.005 | 0.067 0.004 0.008
IGR    0.627 0.039 0.073 | 0.047 0.003 0.005 | 0.070 0.004 0.008
MI     0.443 0.027 0.052 | 0.010 0.001 0.001 | 0.140 0.009 0.016
NMI    0.500 0.031 0.058 | 0.147 0.009 0.017 | 0.180 0.011 0.021
LL     0.657 0.041 0.076 | 0.047 0.003 0.005 | 0.070 0.004 0.008
CHI    0.537 0.033 0.062 | 0.133 0.008 0.016 | 0.173 0.011 0.020
Meta   0.640 0.040 0.075 | 0.083 0.005 0.010 | 0.107 0.007 0.012
tf-idf 0.613 0.038 0.071 | 0.097 0.006 0.011 | 0.110 0.007 0.013

Table 7: Rank-thresholding results of untagged word patterns along with prepositions

GENIA (P R F) | WEB (P R F) | GENIA + WEB (P R F)

Top 100 Ranked Patterns
IG     0.610 0.014 0.027 | 0.100 0.002 0.004 | 0.110 0.002 0.005
IGR    0.610 0.014 0.027 | 0.110 0.002 0.005 | 0.110 0.002 0.005
MI     0.440 0.010 0.019 | 0.010 0.000 0.000 | 0.050 0.001 0.002
NMI    0.690 0.016 0.030 | 0.320 0.007 0.014 | 0.350 0.008 0.015
LL     0.610 0.014 0.027 | 0.110 0.002 0.005 | 0.110 0.002 0.005
CHI    0.740 0.017 0.033 | 0.380 0.009 0.017 | 0.430 0.010 0.019
Meta   0.760 0.017 0.033 | 0.240 0.005 0.011 | 0.240 0.005 0.011
tf-idf 0.670 0.015 0.029 | 0.260 0.006 0.011 | 0.290 0.007 0.013

Top 200 Ranked Patterns
IG     0.655 0.029 0.056 | 0.135 0.006 0.012 | 0.175 0.008 0.015
IGR    0.650 0.029 0.056 | 0.130 0.006 0.011 | 0.170 0.008 0.015
MI     0.465 0.021 0.040 | 0.020 0.001 0.002 | 0.060 0.003 0.005
NMI    0.635 0.029 0.055 | 0.205 0.009 0.018 | 0.265 0.012 0.023
LL     0.650 0.029 0.056 | 0.130 0.006 0.011 | 0.170 0.008 0.015
CHI    0.655 0.029 0.056 | 0.220 0.010 0.019 | 0.235 0.011 0.020
Meta   0.650 0.029 0.056 | 0.185 0.008 0.016 | 0.200 0.009 0.017
tf-idf 0.650 0.029 0.056 | 0.185 0.008 0.016 | 0.215 0.010 0.019

Top 300 Ranked Patterns
IG     0.667 0.045 0.084 | 0.103 0.007 0.013 | 0.137 0.009 0.017
IGR    0.643 0.043 0.081 | 0.103 0.007 0.013 | 0.137 0.009 0.017
MI     0.470 0.032 0.059 | 0.023 0.002 0.003 | 0.063 0.004 0.008
NMI    0.550 0.037 0.070 | 0.173 0.012 0.022 | 0.233 0.016 0.029
LL     0.643 0.043 0.081 | 0.103 0.007 0.013 | 0.133 0.009 0.017
CHI    0.563 0.038 0.071 | 0.177 0.012 0.022 | 0.227 0.015 0.029
Meta   0.607 0.041 0.077 | 0.157 0.011 0.020 | 0.200 0.013 0.025
tf-idf 0.603 0.041 0.076 | 0.143 0.010 0.018 | 0.177 0.012 0.022

Table 8: Rank-thresholding results of PoS-tagged word patterns along with prepositions
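Rank-thresholding, reported in Tables 7 and 8 above (and in Tables 9 and 13 below), differs from score-thresholding only in how patterns are selected: the n highest-ranked patterns are kept, whatever their absolute scores. A minimal sketch, under the same assumed helper names as before:

    def top_n_patterns(scored_patterns, n=100):
        # Sort patterns by score, highest first, and keep the first n
        # (rank-thresholding); the selected patterns can then be scored
        # with evaluate_at_threshold-style code.
        ranked = sorted(scored_patterns.items(), key=lambda kv: kv[1], reverse=True)
        return [p for p, _ in ranked[:n]]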
GENIA (P R F) | WEB (P R F) | GENIA + WEB (P R F)

Top 100 Ranked Patterns
IG     0.620 0.019 0.037 | 0.180 0.006 0.011 | 0.140 0.004 0.008
IGR    0.620 0.019 0.037 | 0.180 0.006 0.011 | 0.140 0.004 0.008
MI     0.430 0.013 0.026 | 0.030 0.001 0.002 | 0.030 0.001 0.002
NMI    0.690 0.021 0.041 | 0.300 0.009 0.018 | 0.360 0.011 0.021
LL     0.620 0.019 0.037 | 0.180 0.006 0.011 | 0.140 0.004 0.008
CHI    0.700 0.021 0.042 | 0.350 0.011 0.021 | 0.330 0.010 0.020
Meta   0.760 0.023 0.045 | 0.230 0.007 0.014 | 0.200 0.006 0.012
tf-idf 0.660 0.020 0.039 | 0.240 0.007 0.014 | 0.260 0.008 0.015

Top 200 Ranked Patterns
IG     0.690 0.042 0.080 | 0.130 0.008 0.015 | 0.150 0.001 0.002
IGR    0.670 0.041 0.077 | 0.140 0.009 0.016 | 0.155 0.001 0.002
MI     0.445 0.027 0.051 | 0.015 0.001 0.002 | 0.050 0.003 0.006
NMI    0.545 0.033 0.063 | 0.215 0.013 0.025 | 0.285 0.017 0.033
LL     0.670 0.041 0.077 | 0.140 0.009 0.016 | 0.155 0.001 0.002
CHI    0.560 0.034 0.065 | 0.225 0.014 0.026 | 0.255 0.016 0.029
Meta   0.610 0.037 0.070 | 0.200 0.012 0.023 | 0.225 0.014 0.026
tf-idf 0.605 0.037 0.070 | 0.210 0.013 0.024 | 0.220 0.013 0.025

Top 300 Ranked Patterns
IG     0.607 0.056 0.102 | 0.093 0.009 0.016 | 0.103 0.009 0.017
IGR    0.600 0.055 0.101 | 0.093 0.009 0.016 | 0.107 0.010 0.018
MI     0.470 0.043 0.079 | 0.020 0.002 0.003 | 0.050 0.005 0.008
NMI    0.503 0.064 0.113 | 0.173 0.016 0.029 | 0.233 0.021 0.039
LL     0.600 0.055 0.101 | 0.093 0.009 0.016 | 0.110 0.010 0.018
CHI    0.533 0.049 0.090 | 0.190 0.017 0.032 | 0.230 0.021 0.039
Meta   0.543 0.050 0.091 | 0.167 0.015 0.028 | 0.213 0.020 0.036
tf-idf 0.533 0.049 0.090 | 0.170 0.016 0.029 | 0.187 0.017 0.031

Table 9: Rank-thresholding results of verb-centred word patterns along with prepositions

GENIA (P R F) | WEB (P R F) | GENIA + WEB (P R F)

Threshold score > 0.01
IG     0.449 0.693 0.545 | 0.018 0.043 0.025 | 0.075 0.230 0.113
IGR    0.448 0.759 0.563 | 0.018 0.044 0.026 | 0.075 0.227 0.112
MI     0.444 0.776 0.564 | 0.012 0.068 0.020 | 0.068 0.484 0.119
NMI    0.449 0.728 0.555 | 0.012 0.064 0.020 | 0.068 0.450 0.119
LL     0.448 0.759 0.563 | 0.018 0.043 0.025 | 0.075 0.226 0.112
CHI    0.466 0.488 0.477 | 0.020 0.033 0.025 | 0.076 0.162 0.103
Meta   0.450 0.728 0.556 | 0.012 0.058 0.020 | 0.067 0.420 0.115
tf-idf 0.435 0.684 0.532 | 0.009 0.051 0.015 | 0.061 0.410 0.107

Threshold score > 0.02
IG     0.449 0.693 0.545 | 0.027 0.032 0.029 | 0.078 0.100 0.088
IGR    0.450 0.716 0.552 | 0.027 0.032 0.029 | 0.079 0.102 0.089
MI     0.452 0.700 0.549 | 0.012 0.061 0.020 | 0.068 0.445 0.118
NMI    0.454 0.657 0.537 | 0.012 0.058 0.020 | 0.068 0.408 0.117
LL     0.450 0.716 0.552 | 0.027 0.032 0.030 | 0.079 0.102 0.089
CHI    0.470 0.405 0.435 | 0.045 0.024 0.031 | 0.089 0.048 0.062
Meta   0.449 0.692 0.545 | 0.012 0.048 0.019 | 0.066 0.319 0.109
tf-idf 0.433 0.650 0.520 | 0.011 0.048 0.018 | 0.063 0.377 0.108

Threshold score > 0.03
IG     0.449 0.693 0.545 | 0.031 0.025 0.028 | 0.081 0.071 0.076
IGR    0.457 0.577 0.510 | 0.030 0.022 0.026 | 0.081 0.071 0.076
MI     0.453 0.653 0.535 | 0.012 0.058 0.019 | 0.068 0.421 0.117
NMI    0.463 0.522 0.491 | 0.012 0.054 0.020 | 0.068 0.369 0.114
LL     0.457 0.577 0.510 | 0.031 0.025 0.028 | 0.081 0.071 0.076
CHI    0.468 0.345 0.398 | 0.062 0.020 0.030 | 0.110 0.036 0.054
Meta   0.459 0.536 0.495 | 0.013 0.043 0.019 | 0.069 0.249 0.107
tf-idf 0.414 0.487 0.448 | 0.012 0.045 0.019 | 0.066 0.329 0.110

Threshold score > 0.04
IG     0.449 0.693 0.545 | 0.030 0.020 0.024 | 0.078 0.057 0.066
IGR    0.462 0.520 0.490 | 0.031 0.020 0.024 | 0.077 0.056 0.065
MI     0.460 0.537 0.496 | 0.012 0.056 0.019 | 0.068 0.400 0.116
NMI    0.464 0.498 0.481 | 0.013 0.052 0.020 | 0.067 0.349 0.113
LL     0.462 0.522 0.490 | 0.031 0.020 0.024 | 0.078 0.057 0.066
CHI    0.467 0.295 0.362 | 0.077 0.016 0.026 | 0.118 0.033 0.052
Meta   0.461 0.500 0.480 | 0.014 0.039 0.021 | 0.069 0.235 0.106
tf-idf 0.430 0.402 0.416 | 0.012 0.041 0.019 | 0.068 0.291 0.111

Threshold score > 0.05
IG     0.466 0.372 0.414 | 0.031 0.018 0.023 | 0.079 0.049 0.061
IGR    0.463 0.490 0.476 | 0.031 0.017 0.022 | 0.079 0.049 0.061
MI     0.465 0.512 0.487 | 0.012 0.053 0.019 | 0.068 0.387 0.115
NMI    0.466 0.482 0.474 | 0.012 0.049 0.020 | 0.067 0.335 0.111
LL     0.463 0.491 0.477 | 0.030 0.017 0.021 | 0.079 0.049 0.061
CHI    0.468 0.294 0.361 | 0.099 0.015 0.026 | 0.128 0.017 0.030
Meta   0.467 0.440 0.453 | 0.020 0.032 0.024 | 0.076 0.157 0.103
tf-idf 0.442 0.314 0.367 | 0.013 0.034 0.018 | 0.074 0.223 0.111
Threshold score > 0.06
IG     0.466 0.372 0.414 | 0.032 0.014 0.019 | 0.082 0.039 0.053
IGR    0.468 0.428 0.447 | 0.035 0.016 0.022 | 0.079 0.040 0.053
MI     0.464 0.496 0.480 | 0.012 0.051 0.019 | 0.067 0.365 0.114
NMI    0.464 0.463 0.463 | 0.012 0.046 0.019 | 0.065 0.308 0.108
LL     0.468 0.428 0.447 | 0.035 0.016 0.022 | 0.079 0.040 0.053
CHI    0.467 0.269 0.341 | 0.111 0.012 0.022 | 0.139 0.014 0.025
Meta   0.468 0.381 0.420 | 0.020 0.030 0.024 | 0.077 0.149 0.101
tf-idf 0.445 0.256 0.325 | 0.014 0.027 0.019 | 0.076 0.178 0.106

Threshold score > 0.07
IG     0.466 0.372 0.414 | 0.030 0.012 0.017 | 0.082 0.037 0.051
IGR    0.466 0.372 0.414 | 0.030 0.012 0.017 | 0.081 0.036 0.050
MI     0.461 0.486 0.473 | 0.012 0.050 0.019 | 0.067 0.356 0.113
NMI    0.468 0.414 0.439 | 0.012 0.044 0.019 | 0.064 0.297 0.106
LL     0.466 0.372 0.414 | 0.030 0.012 0.017 | 0.081 0.036 0.050
CHI    0.464 0.263 0.336 | 0.121 0.011 0.020 | 0.141 0.012 0.022
Meta   0.464 0.352 0.400 | 0.041 0.024 0.030 | 0.087 0.056 0.068
tf-idf 0.453 0.230 0.305 | 0.016 0.024 0.019 | 0.084 0.123 0.100

Threshold score > 0.08
IG     0.466 0.372 0.414 | 0.030 0.011 0.016 | 0.087 0.034 0.049
IGR    0.464 0.318 0.377 | 0.030 0.011 0.016 | 0.087 0.034 0.049
MI     0.462 0.472 0.467 | 0.012 0.049 0.019 | 0.067 0.342 0.112
NMI    0.470 0.401 0.433 | 0.012 0.044 0.019 | 0.065 0.285 0.106
LL     0.464 0.318 0.377 | 0.030 0.011 0.016 | 0.087 0.034 0.049
CHI    0.464 0.262 0.335 | 0.121 0.008 0.016 | 0.141 0.011 0.021
Meta   0.467 0.332 0.388 | 0.043 0.023 0.030 | 0.088 0.054 0.067
tf-idf 0.461 0.201 0.280 | 0.018 0.020 0.019 | 0.085 0.098 0.091

Threshold score > 0.09
IG     0.466 0.372 0.414 | 0.033 0.011 0.016 | 0.085 0.030 0.044
IGR    0.466 0.283 0.352 | 0.033 0.011 0.016 | 0.086 0.031 0.045
MI     0.462 0.461 0.462 | 0.012 0.047 0.019 | 0.067 0.340 0.112
NMI    0.467 0.363 0.408 | 0.013 0.042 0.020 | 0.066 0.268 0.106
LL     0.466 0.283 0.352 | 0.033 0.011 0.016 | 0.086 0.031 0.045
CHI    0.470 0.172 0.252 | 0.143 0.007 0.014 | 0.187 0.009 0.018
Meta   0.467 0.301 0.366 | 0.048 0.020 0.029 | 0.074 0.045 0.056
tf-idf 0.466 0.191 0.271 | 0.020 0.018 0.019 | 0.090 0.052 0.066

Threshold score > 0.1
IG     0.469 0.189 0.269 | 0.031 0.009 0.014 | 0.085 0.027 0.041
IGR    0.464 0.280 0.349 | 0.031 0.009 0.014 | 0.085 0.027 0.041
MI     0.464 0.454 0.459 | 0.012 0.046 0.019 | 0.066 0.328 0.110
NMI    0.465 0.341 0.393 | 0.012 0.041 0.019 | 0.066 0.267 0.105
LL     0.465 0.280 0.349 | 0.031 0.009 0.014 | 0.085 0.027 0.041
CHI    0.470 0.172 0.252 | 0.140 0.007 0.013 | 0.198 0.008 0.015
Meta   0.466 0.275 0.346 | 0.049 0.020 0.028 | 0.092 0.043 0.059
tf-idf 0.460 0.184 0.262 | 0.021 0.013 0.016 | 0.093 0.048 0.063

Threshold score > 0.2
IG     0.670 0.028 0.054 | 0.032 0.004 0.007 | 0.100 0.017 0.029
IGR    0.670 0.028 0.054 | 0.032 0.004 0.007 | 0.100 0.017 0.029
MI     0.464 0.314 0.374 | 0.011 0.038 0.017 | 0.064 0.277 0.104
NMI    0.462 0.260 0.333 | 0.022 0.028 0.024 | 0.078 0.130 0.097
LL     0.670 0.028 0.054 | 0.032 0.004 0.007 | 0.100 0.017 0.029
CHI    0.773 0.018 0.034 | 0.276 0.003 0.007 | 0.278 0.003 0.006
Meta   0.735 0.023 0.046 | 0.083 0.007 0.014 | 0.115 0.010 0.018
tf-idf 0.539 0.074 0.129 | 0.021 0.011 0.014 | 0.111 0.036 0.055

Threshold score > 0.3
IG     0.738 0.012 0.024 | 0.052 0.003 0.005 | 0.049 0.002 0.005
IGR    0.738 0.012 0.024 | 0.052 0.003 0.005 | 0.049 0.002 0.005
MI     0.460 0.280 0.348 | 0.012 0.034 0.018 | 0.065 0.227 0.101
NMI    0.463 0.166 0.244 | 0.067 0.020 0.031 | 0.100 0.035 0.052
LL     0.738 0.012 0.024 | 0.052 0.003 0.005 | 0.049 0.002 0.005
CHI    0.778 0.012 0.023 | 0.314 0.002 0.004 | 0.417 0.002 0.004
Meta   0.730 0.017 0.033 | 0.071 0.004 0.007 | 0.086 0.004 0.007
tf-idf 0.586 0.046 0.084 | 0.025 0.009 0.013 | 0.144 0.019 0.034
Threshold score > 0.4
IG     0.702 0.008 0.016 | 0.025 0.001 0.002 | 0.033 0.001 0.002
IGR    0.696 0.008 0.016 | 0.025 0.001 0.002 | 0.033 0.001 0.002
MI     0.457 0.253 0.325 | 0.013 0.029 0.018 | 0.068 0.198 0.102
NMI    0.753 0.014 0.028 | 0.083 0.016 0.026 | 0.111 0.025 0.040
LL     0.696 0.008 0.016 | 0.025 0.001 0.002 | 0.033 0.001 0.002
CHI    0.850 0.004 0.007 | 0.250 0.001 0.002 | 0.333 0.001 0.002
Meta   0.892 0.007 0.013 | 0.120 0.002 0.004 | 0.128 0.002 0.004
tf-idf 0.655 0.019 0.037 | 0.082 0.007 0.012 | 0.156 0.009 0.017

Table 10: Score-thresholding results of untagged word patterns along with prepositions

GENIA (P R F) | WEB (P R F) | GENIA + WEB (P R F)

Threshold score > 0.01
IG     0.439 0.615 0.512 | 0.017 0.047 0.025 | 0.045 0.107 0.063
IGR    0.440 0.583 0.501 | 0.017 0.047 0.025 | 0.044 0.099 0.061
MI     0.444 0.696 0.542 | 0.012 0.071 0.021 | 0.038 0.265 0.066
NMI    0.444 0.648 0.527 | 0.013 0.067 0.022 | 0.038 0.246 0.067
LL     0.440 0.583 0.501 | 0.017 0.047 0.025 | 0.044 0.099 0.061
CHI    0.447 0.342 0.387 | 0.023 0.037 0.028 | 0.068 0.055 0.061
Meta   0.440 0.610 0.511 | 0.013 0.056 0.021 | 0.039 0.212 0.066
tf-idf 0.388 0.559 0.458 | 0.014 0.059 0.023 | 0.046 0.223 0.077

Threshold score > 0.02
IG     0.447 0.379 0.410 | 0.024 0.033 0.027 | 0.046 0.067 0.054
IGR    0.443 0.449 0.446 | 0.024 0.032 0.028 | 0.046 0.066 0.054
MI     0.442 0.623 0.517 | 0.012 0.063 0.021 | 0.038 0.242 0.066
NMI    0.443 0.538 0.486 | 0.014 0.061 0.023 | 0.040 0.213 0.068
LL     0.443 0.449 0.446 | 0.024 0.032 0.028 | 0.046 0.066 0.054
CHI    0.450 0.229 0.303 | 0.075 0.023 0.035 | 0.119 0.031 0.050
Meta   0.437 0.442 0.439 | 0.015 0.046 0.023 | 0.041 0.155 0.064
tf-idf 0.391 0.538 0.453 | 0.017 0.049 0.025 | 0.049 0.190 0.078

Threshold score > 0.03
IG     0.447 0.379 0.410 | 0.027 0.022 0.024 | 0.045 0.046 0.045
IGR    0.450 0.339 0.386 | 0.026 0.022 0.024 | 0.045 0.046 0.045
MI     0.439 0.560 0.492 | 0.013 0.061 0.022 | 0.039 0.220 0.067
NMI    0.441 0.446 0.444 | 0.015 0.056 0.024 | 0.042 0.188 0.068
LL     0.450 0.339 0.387 | 0.026 0.022 0.024 | 0.045 0.046 0.045
CHI    0.452 0.200 0.277 | 0.130 0.018 0.032 | 0.166 0.025 0.043
Meta   0.448 0.362 0.401 | 0.018 0.041 0.025 | 0.045 0.132 0.067
tf-idf 0.399 0.456 0.425 | 0.019 0.041 0.026 | 0.053 0.153 0.078

Threshold score > 0.04
IG     0.458 0.217 0.295 | 0.029 0.020 0.023 | 0.047 0.038 0.042
IGR    0.455 0.248 0.321 | 0.028 0.020 0.023 | 0.047 0.038 0.042
MI     0.441 0.531 0.482 | 0.014 0.057 0.022 | 0.040 0.207 0.067
NMI    0.438 0.429 0.433 | 0.016 0.051 0.024 | 0.043 0.175 0.069
LL     0.455 0.248 0.321 | 0.028 0.020 0.023 | 0.047 0.038 0.042
CHI    0.449 0.197 0.273 | 0.155 0.016 0.029 | 0.199 0.020 0.037
Meta   0.449 0.327 0.378 | 0.021 0.038 0.027 | 0.053 0.116 0.072
tf-idf 0.399 0.456 0.425 | 0.022 0.037 0.027 | 0.056 0.117 0.075

Threshold score > 0.05
IG     0.458 0.217 0.295 | 0.034 0.017 0.022 | 0.051 0.029 0.037
IGR    0.455 0.214 0.291 | 0.032 0.017 0.022 | 0.049 0.031 0.038
MI     0.441 0.496 0.467 | 0.014 0.054 0.022 | 0.041 0.197 0.068
NMI    0.440 0.410 0.425 | 0.015 0.048 0.023 | 0.043 0.167 0.068
LL     0.455 0.214 0.291 | 0.032 0.017 0.022 | 0.049 0.031 0.038
CHI    0.445 0.193 0.270 | 0.160 0.013 0.023 | 0.218 0.018 0.032
Meta   0.444 0.285 0.347 | 0.024 0.036 0.029 | 0.056 0.106 0.073
tf-idf 0.425 0.402 0.413 | 0.024 0.029 0.026 | 0.060 0.092 0.073

Threshold score > 0.06
IG     0.673 0.041 0.078 | 0.034 0.015 0.020 | 0.053 0.027 0.036
IGR    0.674 0.041 0.078 | 0.034 0.015 0.020 | 0.052 0.027 0.036
MI     0.438 0.438 0.438 | 0.014 0.051 0.022 | 0.041 0.182 0.067
NMI    0.443 0.396 0.418 | 0.016 0.047 0.024 | 0.044 0.159 0.068
LL     0.663 0.042 0.078 | 0.034 0.015 0.020 | 0.053 0.027 0.036
CHI    0.732 0.020 0.039 | 0.185 0.012 0.022 | 0.227 0.016 0.029
Meta   0.449 0.236 0.309 | 0.024 0.035 0.028 | 0.093 0.048 0.063
tf-idf 0.457 0.316 0.374 | 0.025 0.025 0.025 | 0.078 0.067 0.072
Threshold score > 0.07
IG     0.673 0.041 0.078 | 0.042 0.013 0.020 | 0.051 0.023 0.032
IGR    0.661 0.038 0.071 | 0.043 0.013 0.020 | 0.051 0.023 0.032
MI     0.434 0.423 0.428 | 0.015 0.050 0.022 | 0.042 0.176 0.067
NMI    0.446 0.348 0.391 | 0.017 0.045 0.024 | 0.044 0.149 0.068
LL     0.661 0.038 0.071 | 0.043 0.013 0.020 | 0.051 0.023 0.032
CHI    0.725 0.020 0.038 | 0.311 0.009 0.018 | 0.236 0.014 0.026
Meta   0.452 0.208 0.285 | 0.066 0.025 0.036 | 0.096 0.045 0.061
tf-idf 0.463 0.218 0.297 | 0.045 0.020 0.027 | 0.080 0.049 0.061

Threshold score > 0.08
IG     0.643 0.033 0.063 | 0.072 0.013 0.022 | 0.064 0.020 0.031
IGR    0.643 0.033 0.063 | 0.072 0.013 0.022 | 0.064 0.020 0.031
MI     0.433 0.419 0.426 | 0.015 0.048 0.022 | 0.042 0.170 0.067
NMI    0.443 0.336 0.382 | 0.018 0.044 0.025 | 0.045 0.139 0.068
LL     0.643 0.033 0.063 | 0.072 0.013 0.022 | 0.064 0.020 0.031
CHI    0.752 0.018 0.035 | 0.339 0.009 0.017 | 0.232 0.013 0.024
Meta   0.450 0.205 0.282 | 0.066 0.023 0.034 | 0.106 0.040 0.058
tf-idf 0.509 0.124 0.199 | 0.060 0.013 0.022 | 0.101 0.039 0.056

Threshold score > 0.09
IG     0.653 0.029 0.055 | 0.066 0.011 0.019 | 0.091 0.017 0.028
IGR    0.640 0.030 0.058 | 0.066 0.011 0.019 | 0.092 0.017 0.029
MI     0.440 0.413 0.426 | 0.015 0.047 0.023 | 0.042 0.166 0.067
NMI    0.442 0.334 0.381 | 0.018 0.042 0.026 | 0.046 0.132 0.068
LL     0.640 0.030 0.058 | 0.066 0.011 0.019 | 0.092 0.017 0.029
CHI    0.743 0.017 0.033 | 0.376 0.009 0.017 | 0.461 0.009 0.018
Meta   0.448 0.203 0.279 | 0.069 0.021 0.032 | 0.108 0.038 0.056
tf-idf 0.534 0.070 0.124 | 0.071 0.011 0.018 | 0.114 0.028 0.045

Threshold score > 0.1
IG     0.653 0.029 0.055 | 0.067 0.011 0.019 | 0.092 0.017 0.028
IGR    0.649 0.028 0.054 | 0.067 0.011 0.019 | 0.092 0.017 0.028
MI     0.439 0.407 0.423 | 0.014 0.045 0.022 | 0.042 0.165 0.067
NMI    0.444 0.312 0.367 | 0.018 0.041 0.025 | 0.046 0.131 0.068
LL     0.649 0.028 0.054 | 0.067 0.011 0.019 | 0.092 0.017 0.028
CHI    0.737 0.016 0.031 | 0.427 0.008 0.015 | 0.488 0.009 0.017
Meta   0.649 0.031 0.059 | 0.080 0.020 0.033 | 0.123 0.036 0.056
tf-idf 0.559 0.045 0.083 | 0.094 0.008 0.015 | 0.141 0.020 0.035

Threshold score > 0.2
IG     0.970 0.007 0.014 | 0.149 0.006 0.011 | 0.171 0.008 0.015
IGR    0.970 0.007 0.014 | 0.149 0.006 0.011 | 0.171 0.008 0.015
MI     0.441 0.306 0.361 | 0.015 0.035 0.021 | 0.044 0.123 0.065
NMI    0.446 0.194 0.271 | 0.024 0.031 0.027 | 0.059 0.101 0.075
LL     0.970 0.007 0.014 | 0.149 0.006 0.011 | 0.171 0.008 0.015
CHI    0.952 0.004 0.009 | 0.629 0.005 0.010 | 0.688 0.005 0.010
Meta   0.870 0.009 0.018 | 0.179 0.010 0.019 | 0.207 0.012 0.023
tf-idf 0.577 0.025 0.048 | 0.109 0.005 0.009 | 0.202 0.014 0.026

Threshold score > 0.3
IG     1.000 0.003 0.007 | 0.136 0.004 0.008 | 0.144 0.006 0.012
IGR    1.000 0.003 0.007 | 0.136 0.004 0.008 | 0.144 0.006 0.012
MI     0.436 0.264 0.329 | 0.016 0.032 0.021 | 0.044 0.113 0.064
NMI    0.703 0.018 0.034 | 0.076 0.019 0.030 | 0.120 0.036 0.055
LL     1.000 0.003 0.007 | 0.136 0.004 0.008 | 0.144 0.006 0.012
CHI    1.000 0.002 0.005 | 0.652 0.003 0.007 | 0.696 0.004 0.007
Meta   0.955 0.005 0.009 | 0.193 0.006 0.012 | 0.204 0.008 0.015
tf-idf 0.637 0.016 0.032 | 0.168 0.004 0.007 | 0.280 0.010 0.020

Threshold score > 0.4
IG     1.000 0.002 0.005 | 0.131 0.004 0.007 | 0.146 0.005 0.010
IGR    1.000 0.003 0.007 | 0.132 0.004 0.007 | 0.146 0.005 0.010
MI     0.437 0.213 0.287 | 0.017 0.029 0.022 | 0.048 0.102 0.065
NMI    0.714 0.015 0.029 | 0.086 0.017 0.028 | 0.132 0.030 0.049
LL     1.000 0.002 0.005 | 0.132 0.004 0.007 | 0.146 0.005 0.010
CHI    1.000 0.002 0.004 | 0.733 0.002 0.005 | 0.813 0.003 0.006
Meta   1.000 0.002 0.005 | 0.652 0.003 0.007 | 0.680 0.004 0.008
tf-idf 0.719 0.010 0.020 | 0.211 0.003 0.005 | 0.468 0.007 0.013

Table 11: Score-thresholding results of PoS-tagged word patterns along with prepositions
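Among the ranking measures compared in these tables, tf-idf is the most widely known. Purely as an illustration of the kind of score involved (the measures themselves are defined earlier in the thesis, and the exact formulation and normalisation used there may differ), a textbook tf-idf score rewards patterns that are frequent in the domain corpus but concentrated in few documents:

    import math

    def tf_idf(pattern_freq, doc_freq, n_docs):
        # Textbook tf-idf: term frequency weighted by inverted document
        # frequency; doc_freq is smoothed to avoid division by zero.
        return pattern_freq * math.log(n_docs / (1 + doc_freq))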
GENIA (P R F) | WEB (P R F) | GENIA + WEB (P R F)

Threshold score > 0.01
IG     0.447 0.675 0.538 | 0.018 0.050 0.027 | 0.050 0.102 0.067
IGR    0.447 0.622 0.520 | 0.019 0.050 0.027 | 0.050 0.109 0.069
MI     0.452 0.700 0.550 | 0.013 0.070 0.021 | 0.039 0.274 0.069
NMI    0.450 0.659 0.535 | 0.013 0.067 0.022 | 0.040 0.254 0.070
LL     0.447 0.622 0.520 | 0.019 0.050 0.027 | 0.050 0.109 0.068
CHI    0.452 0.345 0.391 | 0.026 0.039 0.031 | 0.081 0.056 0.066
Meta   0.448 0.612 0.517 | 0.014 0.059 0.022 | 0.042 0.225 0.071
tf-idf 0.409 0.609 0.489 | 0.016 0.068 0.026 | 0.045 0.244 0.076

Threshold score > 0.02
IG     0.453 0.412 0.432 | 0.030 0.035 0.032 | 0.054 0.071 0.062
IGR    0.447 0.444 0.446 | 0.030 0.035 0.032 | 0.054 0.072 0.062
MI     0.449 0.634 0.525 | 0.013 0.063 0.021 | 0.040 0.250 0.069
NMI    0.450 0.532 0.487 | 0.015 0.061 0.024 | 0.042 0.221 0.071
LL     0.447 0.444 0.446 | 0.030 0.035 0.032 | 0.054 0.072 0.062
CHI    0.452 0.238 0.312 | 0.086 0.024 0.038 | 0.114 0.035 0.054
Meta   0.448 0.437 0.442 | 0.017 0.048 0.025 | 0.044 0.168 0.070
tf-idf 0.411 0.554 0.472 | 0.018 0.059 0.028 | 0.049 0.195 0.078

Threshold score > 0.03
IG     0.453 0.412 0.432 | 0.035 0.023 0.028 | 0.054 0.049 0.051
IGR    0.454 0.356 0.399 | 0.034 0.024 0.028 | 0.054 0.050 0.052
MI     0.447 0.558 0.496 | 0.014 0.061 0.022 | 0.041 0.228 0.070
NMI    0.448 0.443 0.445 | 0.015 0.056 0.024 | 0.043 0.194 0.071
LL     0.454 0.356 0.399 | 0.035 0.024 0.028 | 0.053 0.050 0.051
CHI    0.451 0.206 0.283 | 0.156 0.020 0.036 | 0.194 0.026 0.046
Meta   0.451 0.408 0.428 | 0.020 0.044 0.028 | 0.049 0.141 0.073
tf-idf 0.426 0.424 0.425 | 0.020 0.051 0.029 | 0.054 0.153 0.080

Threshold score > 0.04
IG     0.457 0.252 0.325 | 0.040 0.021 0.028 | 0.059 0.039 0.047
IGR    0.458 0.253 0.326 | 0.037 0.021 0.027 | 0.058 0.039 0.047
MI     0.448 0.527 0.484 | 0.014 0.058 0.023 | 0.042 0.216 0.070
NMI    0.446 0.428 0.437 | 0.016 0.052 0.024 | 0.045 0.183 0.072
LL     0.458 0.253 0.326 | 0.037 0.021 0.027 | 0.058 0.039 0.047
CHI    0.447 0.203 0.279 | 0.182 0.017 0.032 | 0.228 0.021 0.039
Meta   0.451 0.323 0.376 | 0.024 0.039 0.030 | 0.058 0.121 0.079
tf-idf 0.411 0.297 0.345 | 0.026 0.042 0.032 | 0.066 0.121 0.085

Threshold score > 0.05
IG     0.699 0.039 0.074 | 0.048 0.018 0.027 | 0.063 0.034 0.044
IGR    0.459 0.221 0.299 | 0.046 0.018 0.026 | 0.063 0.034 0.044
MI     0.449 0.490 0.468 | 0.014 0.055 0.023 | 0.043 0.205 0.071
NMI    0.444 0.405 0.424 | 0.016 0.049 0.024 | 0.045 0.176 0.071
LL     0.459 0.221 0.299 | 0.048 0.018 0.027 | 0.063 0.034 0.044
CHI    0.444 0.200 0.276 | 0.198 0.015 0.027 | 0.242 0.018 0.034
Meta   0.447 0.293 0.354 | 0.026 0.039 0.031 | 0.060 0.118 0.079
tf-idf 0.440 0.183 0.258 | 0.030 0.030 0.030 | 0.073 0.097 0.084

Threshold score > 0.06
IG     0.699 0.039 0.074 | 0.050 0.016 0.025 | 0.066 0.026 0.038
IGR    0.698 0.039 0.074 | 0.050 0.016 0.025 | 0.066 0.026 0.038
MI     0.445 0.436 0.440 | 0.014 0.052 0.023 | 0.043 0.189 0.070
NMI    0.449 0.396 0.421 | 0.016 0.047 0.024 | 0.046 0.167 0.072
LL     0.698 0.039 0.074 | 0.050 0.016 0.025 | 0.066 0.026 0.038
CHI    0.741 0.019 0.038 | 0.230 0.013 0.025 | 0.248 0.017 0.031
Meta   0.452 0.244 0.317 | 0.026 0.037 0.030 | 0.101 0.051 0.068
tf-idf 0.464 0.129 0.202 | 0.048 0.024 0.032 | 0.116 0.074 0.090

Threshold score > 0.07
IG     0.689 0.034 0.065 | 0.056 0.015 0.023 | 0.068 0.024 0.036
IGR    0.697 0.035 0.067 | 0.056 0.015 0.023 | 0.068 0.024 0.036
MI     0.443 0.423 0.433 | 0.015 0.050 0.023 | 0.043 0.183 0.070
NMI    0.451 0.354 0.397 | 0.017 0.046 0.025 | 0.047 0.159 0.072
LL     0.697 0.035 0.067 | 0.056 0.015 0.023 | 0.068 0.024 0.036
CHI    0.741 0.019 0.038 | 0.351 0.010 0.020 | 0.254 0.015 0.028
Meta   0.452 0.214 0.290 | 0.076 0.027 0.040 | 0.105 0.048 0.066
tf-idf 0.483 0.090 0.151 | 0.068 0.018 0.028 | 0.124 0.057 0.078
Threshold score > 0.08
IG     0.689 0.034 0.065 | 0.077 0.014 0.024 | 0.071 0.021 0.032
IGR    0.675 0.032 0.061 | 0.077 0.014 0.024 | 0.072 0.021 0.033
MI     0.443 0.420 0.431 | 0.015 0.049 0.022 | 0.044 0.179 0.070
NMI    0.449 0.340 0.387 | 0.018 0.045 0.025 | 0.047 0.151 0.072
LL     0.675 0.032 0.061 | 0.077 0.014 0.024 | 0.071 0.021 0.032
CHI    0.763 0.018 0.035 | 0.366 0.009 0.018 | 0.262 0.013 0.026
Meta   0.453 0.208 0.285 | 0.076 0.025 0.038 | 0.107 0.043 0.062
tf-idf 0.529 0.056 0.102 | 0.085 0.011 0.020 | 0.178 0.036 0.060

Threshold score > 0.09
IG     0.664 0.029 0.056 | 0.068 0.012 0.020 | 0.071 0.019 0.030
IGR    0.664 0.029 0.056 | 0.068 0.012 0.020 | 0.071 0.019 0.030
MI     0.444 0.413 0.428 | 0.015 0.048 0.023 | 0.044 0.174 0.070
NMI    0.448 0.338 0.385 | 0.019 0.044 0.026 | 0.048 0.143 0.072
LL     0.664 0.029 0.056 | 0.068 0.012 0.020 | 0.071 0.019 0.030
CHI    0.760 0.017 0.034 | 0.405 0.009 0.018 | 0.243 0.010 0.019
Meta   0.451 0.206 0.283 | 0.095 0.023 0.037 | 0.118 0.040 0.060
tf-idf 0.569 0.035 0.066 | 0.097 0.007 0.014 | 0.195 0.025 0.044

Threshold score > 0.1
IG     0.662 0.026 0.051 | 0.067 0.011 0.019 | 0.091 0.017 0.029
IGR    0.667 0.027 0.052 | 0.067 0.011 0.019 | 0.091 0.017 0.029
MI     0.444 0.403 0.423 | 0.015 0.046 0.022 | 0.044 0.170 0.070
NMI    0.447 0.309 0.365 | 0.018 0.043 0.026 | 0.048 0.142 0.072
LL     0.667 0.027 0.052 | 0.067 0.011 0.019 | 0.091 0.017 0.029
CHI    0.757 0.016 0.032 | 0.443 0.008 0.016 | 0.526 0.009 0.018
Meta   0.774 0.025 0.049 | 0.099 0.023 0.037 | 0.121 0.038 0.058
tf-idf 0.608 0.027 0.051 | 0.139 0.006 0.011 | 0.246 0.017 0.032

Threshold score > 0.2
IG     0.944 0.005 0.010 | 0.202 0.006 0.012 | 0.153 0.008 0.015
IGR    0.944 0.005 0.010 | 0.202 0.006 0.012 | 0.153 0.008 0.015
MI     0.443 0.303 0.359 | 0.015 0.038 0.022 | 0.047 0.135 0.069
NMI    0.445 0.201 0.277 | 0.026 0.034 0.029 | 0.063 0.108 0.080
LL     0.944 0.005 0.010 | 0.202 0.006 0.012 | 0.153 0.008 0.015
CHI    0.909 0.003 0.006 | 0.680 0.005 0.010 | 0.692 0.006 0.011
Meta   0.871 0.008 0.016 | 0.214 0.011 0.021 | 0.227 0.013 0.024
tf-idf 0.654 0.016 0.032 | 0.216 0.003 0.007 | 0.283 0.012 0.023

Threshold score > 0.3
IG     1.000 0.002 0.005 | 0.170 0.005 0.010 | 0.170 0.006 0.011
IGR    1.000 0.002 0.005 | 0.170 0.005 0.010 | 0.170 0.006 0.011
MI     0.438 0.274 0.337 | 0.017 0.035 0.023 | 0.047 0.124 0.068
NMI    0.725 0.018 0.035 | 0.091 0.021 0.034 | 0.138 0.038 0.060
LL     1.000 0.002 0.005 | 0.170 0.005 0.010 | 0.170 0.006 0.011
CHI    1.000 0.002 0.004 | 0.750 0.004 0.007 | 0.706 0.004 0.007
Meta   0.917 0.003 0.007 | 0.242 0.007 0.013 | 0.233 0.008 0.016
tf-idf 0.690 0.009 0.018 | 0.296 0.002 0.005 | 0.339 0.006 0.013

Threshold score > 0.4
IG     1.000 0.002 0.003 | 0.163 0.004 0.008 | 0.159 0.005 0.010
IGR    1.000 0.002 0.003 | 0.163 0.004 0.008 | 0.159 0.005 0.010
MI     0.441 0.226 0.299 | 0.019 0.032 0.024 | 0.051 0.112 0.070
NMI    0.735 0.015 0.030 | 0.109 0.020 0.033 | 0.150 0.032 0.053
LL     1.000 0.002 0.003 | 0.163 0.004 0.008 | 0.159 0.005 0.010
CHI    1.000 0.001 0.002 | 0.833 0.003 0.006 | 0.846 0.003 0.007
Meta   1.000 0.002 0.004 | 0.750 0.004 0.007 | 0.706 0.004 0.007
tf-idf 0.704 0.006 0.012 | 0.417 0.002 0.003 | 0.417 0.003 0.006

Table 12: Score-thresholding results of verb-centred word patterns along with prepositions
GENIA (P R F) | WEB (P R F) | GENIA + WEB (P R F)

Top 100 Ranked Patterns
IG     0.770 0.025 0.049 | 0.210 0.007 0.013 | 0.210 0.007 0.013
IGR    0.770 0.025 0.049 | 0.210 0.007 0.013 | 0.210 0.007 0.013
MI     0.560 0.018 0.036 | 0.020 0.001 0.001 | 0.190 0.006 0.012
NMI    0.940 0.031 0.060 | 0.410 0.014 0.026 | 0.510 0.017 0.033
LL     0.770 0.025 0.049 | 0.210 0.007 0.013 | 0.210 0.007 0.013
CHI    0.960 0.032 0.061 | 0.420 0.014 0.027 | 0.510 0.017 0.033
Meta   0.900 0.030 0.057 | 0.350 0.012 0.022 | 0.430 0.014 0.027
tf-idf 0.920 0.030 0.059 | 0.390 0.013 0.025 | 0.460 0.015 0.029

Top 200 Ranked Patterns
IG     0.800 0.053 0.099 | 0.135 0.009 0.017 | 0.240 0.016 0.030
IGR    0.800 0.053 0.099 | 0.135 0.009 0.017 | 0.235 0.016 0.029
MI     0.560 0.037 0.069 | 0.045 0.003 0.006 | 0.160 0.011 0.020
NMI    0.815 0.054 0.101 | 0.330 0.022 0.041 | 0.445 0.029 0.055
LL     0.800 0.053 0.099 | 0.135 0.009 0.017 | 0.235 0.016 0.029
CHI    0.815 0.054 0.101 | 0.395 0.026 0.049 | 0.445 0.030 0.056
Meta   0.830 0.055 0.103 | 0.365 0.024 0.045 | 0.425 0.028 0.053
tf-idf 0.810 0.053 0.100 | 0.360 0.024 0.045 | 0.430 0.028 0.053

Top 300 Ranked Patterns
IG     0.780 0.077 0.140 | 0.167 0.016 0.030 | 0.210 0.021 0.038
IGR    0.787 0.078 0.142 | 0.167 0.016 0.030 | 0.210 0.021 0.038
MI     0.540 0.053 0.097 | 0.037 0.004 0.007 | 0.163 0.016 0.029
NMI    0.707 0.070 0.127 | 0.277 0.027 0.050 | 0.353 0.035 0.064
LL     0.790 0.078 0.142 | 0.167 0.016 0.030 | 0.210 0.021 0.038
CHI    0.710 0.070 0.128 | 0.380 0.038 0.068 | 0.440 0.044 0.079
Meta   0.740 0.073 0.133 | 0.310 0.031 0.056 | 0.377 0.037 0.068
tf-idf 0.707 0.070 0.128 | 0.337 0.033 0.061 | 0.387 0.038 0.070

Table 13: Rank-thresholding results of adapted linked chain dependency patterns

GENIA (P R F) | WEB (P R F) | GENIA + WEB (P R F)

Threshold score > 0.01
IG     0.748 0.107 0.187 | 0.150 0.073 0.098 | 0.223 0.145 0.176
IGR    0.748 0.107 0.187 | 0.153 0.076 0.101 | 0.225 0.144 0.176
MI     0.567 0.816 0.669 | 0.048 0.190 0.077 | 0.161 0.822 0.269
NMI    0.566 0.767 0.651 | 0.049 0.179 0.077 | 0.163 0.771 0.268
LL     0.748 0.107 0.187 | 0.151 0.077 0.102 | 0.225 0.144 0.176
CHI    0.577 0.529 0.552 | 0.191 0.059 0.090 | 0.263 0.099 0.144
Meta   0.571 0.643 0.605 | 0.051 0.144 0.076 | 0.161 0.596 0.253
tf-idf 0.553 0.575 0.564 | 0.054 0.157 0.080 | 0.176 0.527 0.264

Threshold score > 0.02
IG     0.796 0.051 0.097 | 0.199 0.054 0.085 | 0.263 0.094 0.138
IGR    0.796 0.051 0.097 | 0.199 0.054 0.085 | 0.264 0.093 0.137
MI     0.566 0.744 0.643 | 0.048 0.174 0.076 | 0.162 0.758 0.267
NMI    0.570 0.706 0.631 | 0.051 0.162 0.078 | 0.165 0.687 0.266
LL     0.796 0.051 0.097 | 0.199 0.054 0.085 | 0.264 0.093 0.137
CHI    0.591 0.243 0.344 | 0.327 0.042 0.074 | 0.330 0.064 0.107
Meta   0.569 0.547 0.558 | 0.053 0.120 0.073 | 0.163 0.496 0.245
tf-idf 0.557 0.532 0.544 | 0.057 0.131 0.079 | 0.184 0.457 0.263

Threshold score > 0.03
IG     0.785 0.035 0.067 | 0.220 0.047 0.078 | 0.263 0.074 0.116
IGR    0.785 0.035 0.067 | 0.219 0.047 0.077 | 0.263 0.074 0.116
MI     0.566 0.711 0.631 | 0.048 0.165 0.074 | 0.164 0.734 0.268
NMI    0.568 0.663 0.612 | 0.050 0.148 0.074 | 0.165 0.646 0.263
LL     0.785 0.035 0.067 | 0.220 0.047 0.078 | 0.263 0.074 0.116
CHI    0.613 0.146 0.236 | 0.414 0.033 0.061 | 0.426 0.040 0.073
Meta   0.577 0.355 0.439 | 0.056 0.105 0.073 | 0.164 0.403 0.233
tf-idf 0.567 0.491 0.526 | 0.058 0.088 0.070 | 0.225 0.337 0.270

Threshold score > 0.04
IG     0.784 0.025 0.049 | 0.203 0.033 0.056 | 0.259 0.061 0.098
IGR    0.786 0.025 0.049 | 0.203 0.033 0.056 | 0.260 0.060 0.098
MI     0.566 0.681 0.618 | 0.048 0.150 0.073 | 0.163 0.662 0.261
NMI    0.569 0.620 0.593 | 0.050 0.140 0.074 | 0.164 0.608 0.258
LL     0.786 0.025 0.049 | 0.203 0.033 0.056 | 0.260 0.060 0.098
CHI    0.604 0.139 0.226 | 0.429 0.024 0.045 | 0.443 0.033 0.062
Meta   0.586 0.237 0.337 | 0.106 0.079 0.090 | 0.200 0.189 0.194
tf-idf 0.575 0.412 0.480 | 0.115 0.080 0.094 | 0.246 0.254 0.250

Threshold score > 0.05
IG     0.727 0.018 0.036 | 0.209 0.029 0.051 | 0.268 0.050 0.085
IGR    0.727 0.018 0.036 | 0.209 0.029 0.051 | 0.268 0.050 0.085
MI     0.566 0.658 0.608 | 0.048 0.144 0.072 | 0.163 0.641 0.260
NMI    0.567 0.598 0.582 | 0.050 0.130 0.072 | 0.161 0.550 0.249
LL     0.727 0.018 0.036 | 0.209 0.029 0.051 | 0.268 0.050 0.085
CHI    0.595 0.130 0.214 | 0.418 0.019 0.037 | 0.456 0.027 0.052
Meta   0.604 0.145 0.234 | 0.103 0.072 0.085 | 0.195 0.176 0.185
tf-idf 0.581 0.363 0.446 | 0.121 0.069 0.088 | 0.297 0.166 0.213
Threshold score > 0.06
IG     0.685 0.012 0.024 | 0.198 0.026 0.046 | 0.273 0.045 0.078
IGR    0.685 0.012 0.024 | 0.198 0.026 0.046 | 0.273 0.045 0.078
MI     0.564 0.646 0.602 | 0.047 0.139 0.071 | 0.163 0.627 0.259
NMI    0.565 0.563 0.564 | 0.051 0.122 0.072 | 0.162 0.522 0.247
LL     0.685 0.012 0.024 | 0.198 0.026 0.046 | 0.273 0.045 0.078
CHI    0.865 0.051 0.096 | 0.446 0.016 0.032 | 0.507 0.023 0.044
Meta   0.600 0.137 0.223 | 0.139 0.062 0.086 | 0.222 0.125 0.160
tf-idf 0.631 0.287 0.395 | 0.152 0.062 0.088 | 0.378 0.131 0.195

Threshold score > 0.07
IG     0.630 0.010 0.019 | 0.191 0.022 0.040 | 0.265 0.042 0.073
IGR    0.630 0.010 0.019 | 0.191 0.022 0.040 | 0.264 0.042 0.072
MI     0.565 0.620 0.591 | 0.047 0.133 0.069 | 0.162 0.601 0.256
NMI    0.563 0.537 0.550 | 0.050 0.118 0.070 | 0.162 0.508 0.245
LL     0.630 0.010 0.019 | 0.191 0.022 0.040 | 0.264 0.042 0.072
CHI    0.871 0.049 0.092 | 0.430 0.013 0.026 | 0.516 0.021 0.040
Meta   0.594 0.130 0.213 | 0.142 0.058 0.082 | 0.222 0.114 0.151
tf-idf 0.713 0.203 0.316 | 0.170 0.053 0.081 | 0.417 0.084 0.140

Threshold score > 0.08
IG     0.758 0.008 0.016 | 0.181 0.020 0.036 | 0.255 0.039 0.067
IGR    0.758 0.008 0.016 | 0.181 0.020 0.036 | 0.255 0.039 0.067
MI     0.565 0.607 0.585 | 0.047 0.131 0.070 | 0.162 0.588 0.254
NMI    0.565 0.526 0.545 | 0.050 0.113 0.069 | 0.162 0.486 0.243
LL     0.758 0.008 0.016 | 0.181 0.020 0.036 | 0.255 0.039 0.067
CHI    0.906 0.038 0.073 | 0.422 0.012 0.022 | 0.528 0.019 0.036
Meta   0.795 0.060 0.112 | 0.187 0.049 0.077 | 0.157 0.088 0.113
tf-idf 0.823 0.124 0.216 | 0.206 0.046 0.075 | 0.439 0.050 0.090

Threshold score > 0.09
IG     0.733 0.007 0.014 | 0.167 0.016 0.030 | 0.259 0.036 0.064
IGR    0.733 0.007 0.014 | 0.168 0.016 0.030 | 0.259 0.036 0.064
MI     0.563 0.593 0.578 | 0.047 0.129 0.069 | 0.160 0.572 0.250
NMI    0.572 0.507 0.538 | 0.051 0.109 0.070 | 0.162 0.463 0.240
LL     0.733 0.007 0.014 | 0.167 0.016 0.030 | 0.259 0.036 0.064
CHI    0.900 0.036 0.069 | 0.667 0.008 0.016 | 0.515 0.016 0.032
Meta   0.860 0.048 0.092 | 0.217 0.046 0.076 | 0.259 0.080 0.122
tf-idf 0.854 0.058 0.109 | 0.311 0.032 0.057 | 0.443 0.028 0.053

Threshold score > 0.1
IG     0.704 0.006 0.012 | 0.174 0.015 0.027 | 0.252 0.034 0.060
IGR    0.704 0.006 0.012 | 0.174 0.015 0.027 | 0.252 0.034 0.060
MI     0.564 0.588 0.576 | 0.048 0.122 0.069 | 0.159 0.554 0.248
NMI    0.569 0.483 0.523 | 0.050 0.106 0.068 | 0.162 0.460 0.240
LL     0.704 0.006 0.012 | 0.174 0.015 0.027 | 0.252 0.034 0.060
CHI    0.898 0.035 0.067 | 0.714 0.007 0.013 | 0.500 0.015 0.029
Meta   0.856 0.047 0.089 | 0.207 0.042 0.070 | 0.251 0.075 0.115
tf-idf 0.866 0.045 0.085 | 0.371 0.021 0.039 | 0.476 0.022 0.043

Threshold score > 0.2
IG     0.571 0.003 0.005 | 0.159 0.004 0.007 | 0.210 0.007 0.013
IGR    0.571 0.003 0.005 | 0.159 0.004 0.007 | 0.210 0.007 0.013
MI     0.566 0.473 0.515 | 0.044 0.090 0.059 | 0.157 0.456 0.234
NMI    0.600 0.133 0.218 | 0.105 0.064 0.079 | 0.200 0.155 0.175
LL     0.571 0.003 0.005 | 0.159 0.004 0.007 | 0.210 0.007 0.013
CHI    1.000 0.015 0.029 | 0.800 0.003 0.005 | 0.917 0.004 0.007
Meta   1.000 0.013 0.025 | 0.337 0.011 0.020 | 0.434 0.021 0.040
tf-idf 0.879 0.019 0.037 | 0.443 0.013 0.025 | 0.737 0.009 0.018

Threshold score > 0.3
IG     1.000 0.001 0.001 | 0.195 0.003 0.005 | 0.211 0.005 0.010
IGR    1.000 0.001 0.001 | 0.195 0.003 0.005 | 0.211 0.005 0.010
MI     0.562 0.320 0.408 | 0.040 0.074 0.052 | 0.154 0.380 0.220
NMI    0.812 0.055 0.104 | 0.141 0.047 0.070 | 0.230 0.090 0.130
LL     1.000 0.001 0.001 | 0.195 0.003 0.005 | 0.211 0.005 0.010
CHI    1.000 0.009 0.018 | 1.000 0.001 0.002 | 1.000 0.002 0.004
Meta   1.000 0.004 0.008 | 0.302 0.005 0.010 | 0.364 0.008 0.015
tf-idf 0.895 0.011 0.022 | 0.656 0.010 0.020 | 0.842 0.005 0.010

Threshold score > 0.4
IG     1.000 0.001 0.001 | 0.154 0.002 0.004 | 0.236 0.004 0.008
IGR    1.000 0.001 0.001 | 0.154 0.002 0.004 | 0.236 0.004 0.008
MI     0.569 0.209 0.306 | 0.040 0.064 0.049 | 0.154 0.329 0.209
NMI    0.939 0.031 0.059 | 0.203 0.036 0.061 | 0.281 0.057 0.095
LL     1.000 0.001 0.001 | 0.154 0.002 0.004 | 0.236 0.004 0.008
CHI    1.000 0.005 0.010 | 1.000 0.001 0.001 | 1.000 0.001 0.002
Meta   1.000 0.001 0.003 | 0.154 0.002 0.004 | 0.286 0.005 0.010
tf-idf 0.941 0.005 0.010 | 0.800 0.004 0.008 | 0.917 0.004 0.007

Table 14: Score-thresholding results of adapted linked chain dependency patterns
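All F-measures in Tables 5-14 are the balanced F-score, i.e. the harmonic mean of precision and recall. As a spot check against the tables, the MI row for GENIA at threshold 0.01 in Table 14 (P = 0.567, R = 0.816) reproduces the tabulated value:

    F = \frac{2PR}{P + R} = \frac{2 \times 0.567 \times 0.816}{0.567 + 0.816} \approx 0.669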
In Proceedings of 14th International Conference on Applications of Natural Language to Information Systems (NLDB-09). Saarbrucken, Germany: Springer Berlin / Heidelberg, pp. 169-181. 187 Popescu, A.-M. & Etzioni, O. (2005). Extracting product features and opinions from reviews. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT-05). Association for Computational Linguistics, pp. 339-346. Pradhan, S., Hacioglu, K., Krugler, V., Ward, W., Martin, J.H. and Jurafsky,D. (2005). Support vector learning for semantic argument classification. Machine Learning, 60(1-3), pp. 11-39. Pulman, S.G. & Sukkarieh, J.Z. (2005). Automatic short answer marking. In Proceedings of 2nd Workshop on Building Educational Applications using NLP, Association for Computational Linguistics, Ann Arbor, Michigan, pp. 9-16. Pustejovsky, J., Castano, J., Zhang, J., Kotecki, M. & Cohran, B. (2002). Robust relational parsing over biomedical literature: Extracting inhibit relations. In Proceedings of the 7th Annual Pacific Symposium on Biocomputing. pp. 362-373. Pustejovsky, J., Castano, J., Cohran, B., Kotecki, M. & Morrell, M. (2001). Automatic extraction of acronym-meaning pairs from MEDLINE databases. Studies in Health Technology and Informatics, 84(Pt 1), pp. 371-375. Quinlan, J.R. (1986). Induction of decision trees. J. W. Shavlik & T. G. Dietterich, eds. Machine Learning, 1(1), pp. 81-106. Rao, C.R. (1982). Diversity: Its measurement, decomposition, apportionment and analysis. Sankhyà: The Indian Journal of Statistics Series A, 44, Series (1), pp.1–22. Ravichandran, D. & Hovy, E. (2002). Learning surface text patterns for a Question Answering system. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL-02). Philadelphia, PA. Reiter, E., Sripada, S., Hunter, J., Yu, J. & Davy, I. (2005). Choosing words in computer-generated weather forecasts. Artificial Intelligence, 167(1-2), pp. 137-169. 188 Reiter, E. & Dale, R. (2000). Building Natural Language Generation Systems. Cambridge University Press. Riloff, E. & Jones, R. (1999). Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99). pp. 474-479. Riloff, E. (1996). Automatically generating extraction patterns from untagged text. In Proceedings of the 13th National Conference on Artificial Intelligence. AAAI Press, pp. 1044-1049. Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks. In Proceedings of the 11th National Conference on Artificial Intelligence (AAAI-93). AAAI Press/The MIT Press, pp. 811-816. Sætre, R., Sagae, K. & Tsujii, J. (2007). Syntactic features for protein-protein interaction extraction. In Proceedings of the 2nd International Symposium on Languages in Biology and Medicine (LBM-2007). Singapore, pp. 6.1-6.14. Salton, G. & McGill, M.J. (1983). Introduction to Modern Information Retrieval, McGraw-Hill. Sarawagi, S. (2008). Information extraction. Foundations and Tends, 1(3), pp. 261377. Sarmento, L., Jijkuon, V., de Rijke, M. & Oliveira, E. (2007). “More like these”: Growing entity classes from seeds. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM). Lisbon, Portugal, pp. 959-962. Savova, G.K. Kipper-Schuler, K.C., Buntrock, J.D. & Chute, C.G. (2008). UIMAbased clinical information extraction system. 
In Proceedings of LREC 2008 Workshop: Towards Enhanced Interoperability for Large HLT Systems UIMA for NLP. 189 Scheuermann, F. & Guimarães Pereira, A. (2008). Towards a Research Agenda on Computer-based Assessment Challenges and needs for European Educational Measurement. Schwartz, L., Aikawa, T. & Pahud, M. (2004). Dynamic language learning tools. In Proceedings of the InSTIL/ICALL Symposium. Sekine, S. (2006). On-demand information extraction. In Proceedings of the COLING/ACL on Main Conference Poster Sessions. Association for Computational Linguistics, pp. 731-738. Sekine, S. & Nobata, C. (2004). Definition, dictionary and tagger for extended named entities. In Proceedings of the 4th International Conference on Language Resources and Evaluation. Lisbon, Portugal. Sekine, S. (1998). NYU: Description of the Japanese NE System used for MET-2. In Proceedings of the Seventh Message Understanding Conference (MUC-7). Virginia, USA. Shinyama, Y. & Sekine, S. (2006). Preemptive information extraction using unrestricted relation discovery. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. pp. 304-311. Shinyama, Y. & Sekine, S. (2004). Named entity discovery using comparable news articles. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04). Association for Computational Linguistics, pp. 848-853. Siegfried, J.J. & Kennedy, P.E. (1995). Does pedagogy vary with class size in introductory economics? American Economic Review, American Economic Association, 85(2), pp. 347-351. Skalban, Y. (2009). Improving the output of a multiple-choice test generator: Analysis and proposals. University of Wolverhampton. 190 Smith, S., Kilgarriff, A., Sommers, S., Wen-liang, G. & Guang-zhong, W. (2009). Automatic cloze generation for English proficiency testing. In Proceedings of the LTTC Conference. Taipei, Taiwan. Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1), pp. 233-272. Soderland, S., Fisher, D. & Aseltine, J. (1995). CRYSTAL: Inducing a conceptual dictionary. In Proceedings of 14th International Joint Conference on Artificial Intelligence (IJCAI-95). pp. 1314-1319. Soderland, S. & Lehnert, W. (1994). Corpus-driven knowledge acquisition for discourse analysis. In Proceedings of AAAI’1994. pp. 827-832. Sparck, J.K. & Galliers, J. (1996). Evaluating natural language processing systems: An analysis and review. Lecture Notes in Artificial Intelligence, 1083, Berlin, Germany: Springer Verlag. Stevenson, M. & Greenwood, M. (2009). Dependency pattern models for information extraction. Research on Language and Computation, 7(1), pp. 13-39. Stevenson, M. & Greenwood, M. (2006). Comparing information extraction pattern models. In Proceedings of the Information Extraction Beyond the Document Workshop (COLING/ACL 2006). Sydney, Australia, pp. 12-19. Stevenson, M. & Greenwood, M. (2005). A semantic approach to IE pattern induction. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL-05). Ann Arbor, Michigan, pp. 379-386. Stevenson, M. & Ciravegna, F. (2003). Information extraction as a semantic web technology: Requirements and promises. In Proceedings of the 14th European Conference on Machine Learning (ECML 2003) Workshop: Adaptive Text Extraction and Mining (ATEM-03). Cavtat-Dubrovnik, Croatia. 191 Stiggins, R.J. (2001). The unfulfilled promise of classroom assessment. 
Educational Measurement Issues and Practice, 20(3), pp. 5-15. Sudo, K., Sekine, S. & Grishman, R. (2003). An improved extraction pattern representation model for automatic IE pattern acquisition. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics. Sapporo, Japan: Association for Computational Linguistics, pp. 224-231. Sudo, K., Sekine, S. & Grishman, R. (2001). Automatic pattern acquisition for Japanese information extraction. In Proceedings of the 1st International Conference on Human Language Technology Research. Association for Computational Linguistics, pp. 1–7. Sumita, E., Sugaya, F. & Yamamoto, S. (2005). Measuring non-native speakers’ proficiency of English by using a test with automatically-generated fill-in-the-blank questions. In Proceedings of the 2nd Workshop on Building Educational Applications using NLP, June, pp. 61-68. Szpektor, I., Tanev, H., Dagan, I. & Coppola, B. (2004). Scaling web-based acquisition of entailment relations. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP). Barcelona, Spain, pp. 41–48. Tapanainen, P. & Järvinen, T. (1997). A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing. Washington, DC: Association for Computational Linguistics, pp. 64-74. Tateno, J., Sano, H., Aizawa, H., Nakamura, T. & Morita, Y. (2005). Producing English educational materials from the BNC and releasing them on the Web. IEIC Technical Report (Institute of Electronics, Information and Communication Engineers), 105(437), pp. 7-12. 192 Tjong Kim Sang, E. & Meulder, F.D. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the CoNLL2003. Edmonton, Canada, pp. 142-147. Tjong Kim Sang, E. (2002). Introduction to the CoNLL-2002 shared task: Languageindependent named entity recognition. In Proceedings of the CoNLL-2002. Taipei, Taiwan, pp. 155-158. Tsuruoka, Y., Tateishi, Y., Kim, J-D., Ohta, T., McNaught, J., Ananiadou, S. & Tsujii, J. (2005). Developing a robust part-of-speech tagger for biomedical text. Advances in Informatics – 10th Panhellenic Conference on Informatics, LNCS, 3746, pp. 382-392. Tsuruoka, Y. & Tsujii, J. (2005). Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT-05). Association for Computational Linguistics, pp. 467-474. Turney, P.D. & Littman, M. (2003). Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems TOIS, 21(4), pp. 315-346. Turney, P.D. (2001). Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning (ECML-2001). Freiburg, Germany, pp. 491-502. Ureel, L., Forbus, K., Riesbeck, C. & Birnbaum, L. (2005). Question generation for learning by reading. In Proceedings of the AAAI Workshop on Textual Question Answering. Pittsburgh, Pennsylvania. Vanderwende, L. (2008). The importance of being important: Question generation. In Proceedings of the Workshop on the Question Generation Shared Task and Evaluation Challenge. Arlington, VA. 193 Vanderwende, L. (2007). Answering and questioning for machine reading. In Proceedings of the 2007 AAAI Spring Symposium on Machine Reading. Stanford, CA. Walker, M.A., Rambow, O. & Rogati, M. (2001). SPoT: A trainable sentence planner. 
In Proceedings of the 2nd Annual Meeting of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 1-8. Wang, X., Mohanty, N. & McCallum, A. (2005). Group and topic discovery from relations and text. In Proceedings of the 3rd International Workshop on Link Discovery (LinkKDD-05). ACM Press, pp. 28-35. Weeds, J. (2003). Measures and applications of lexical distributional similarity. PhD thesis, University of Sussex. Weller, M. (2002). Assessment Issues on a web-based course. Assessment & Evaluation in Higher Education, 27(2), pp. 109-116. Wilbur, J. & Smith, L. (2007). Biocreative 2. gene mention task. In Proceedings of the 2nd Biocreative Challenge Evaluation Workshop. pp. 7–16. Wong, L. (2001). PIES, a Protein Interaction Extraction System. Pacific Symposium on Biocomputing, 531, pp. 520-531. Yangarber, R. (2003). Counter-training in discovery of semantic patterns. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (ACL-03). pp. 343-350. Yangarber, R. (2000). Scenario customization of information extraction. PhD thesis, New York University. Yangarber, R. & Grishman, R. (2000). Machine learning of extraction patterns from unannotated corpora: position statement. In Proceedings of 14th European Conference 194 on Artificial Intelligence (ECAI 2000), Workshop on Machine Learning for Information Extraction. Berlin, Germany, pp. 76-83. Yangarber, R., Grishman, R. & Tapananien, P. (2000). Unsupervised discovery of scenario-level patterns for information extraction. In Proceedings of the 6th Conference on Applied Natural Language Processing. Association for Computational Linguistics, pp. 282–289. Yeh, A., Morgan, A., Colosimo, M. & Hirschman, L. (2005). BioCreAtIvE Task 1A: Gene mention finding evaluation. BMC Bioinformatics, 6(Suppl 1), S2. Yuret, D. & Yatbaz, M.A. (2010). The noisy channel model for unsupervised word sense disambiguation. Computational Linguistics, 36(1), pp. 111-127. Zelenko, D., Aone, C. & Richardella, A. (2003). Kernel methods for relation extraction. Journal of Machine Learning Research, 3(6), pp. 1083-1106. 195 View publication stats