This paper describes a methodology for the construction of WordNets based on machine translation of an English sense-tagged corpus. For the construction of such a corpus we use two freely available resources: the SemCor Corpus and the Princeton WordNet Gloss Corpus. This methodology is being used for the construction of the Spanish and Catalan WordNet 3.0. In our first experiments we used a simple word alignment algorithm, obtaining precision results comparable to those obtained by methods based on bilingual dictionaries. We then used a freely available statistical word alignment algorithm (the Berkeley Aligner), obtaining better results. This methodology is suitable for languages with an available statistical machine translation system and can be used both for constructing WordNets from scratch and for enlarging existing WordNets.
This document describes the criteria that determine the set of multiwords included in the electronic resource WordNet 3.0 (http://adimen.si.ehu.es/cgi-bin/wei/public/wei.consult.perl). Since this is an applied proposal restricted to a specific resource, it does not aim to be a theoretical proposal resolving the distinction between compositional lexical strings and lexical strings that are multiwords. The criteria defined are neutral, in the sense that they are applicable to the recognition of multiwords in both Catalan and Spanish. Since these two languages are very close and share many linguistic and syntactic features, the criteria presented here apply to both. Given that multiword recognition is an open problem in linguistics, the established criteria are understood as a set of filters that distinguish the lexical strings that act as multiwords from the rest of the set of multiword candidates. In this way, each filter applied to the multiword candidates marks whether a string is a multiword and, at the same time, reduces the candidate set. After applying the filters, ideally no lexical string should remain to be treated, but the filters will probably not account for all candidates. * The research behind this work was funded [in part] by the Ministerio de Ciencia e Innovación through the project Representación del Conocimiento Semántico (SKR), reference TIN2009-14715-C0403.
Spelling and grammar checking has become a daily activity for almost all word processor users. Usually these tools offer limited information about the misspelling or grammar error and in certain cases suggest one or more possible alternatives. Sometimes users make the same mistakes day after day because they do not know the real reason for the mistake. In these cases, extended information about the error could be very handy: detailed grammatical information and exercises to improve the user's writing skills. In this paper we present AVI.cat, a grammar checker and virtual assessor for the improvement of writing skills in Catalan. The tool is based on the LanguageTool grammar checker and integrates fully into the OpenOffice/LibreOffice word processor. Along with the grammar checker with extended information collected from the web, the tool also offers an automatic evaluator and assessor that can perform an automatic assessment of a collection of texts written by the user. The assessor can also report on the user's progress and suggest exercises for further improvement.
In this paper we present a methodology for WordNet construction based on the exploitation of parallel corpora with semantic annotation of the English source text. We are using this methodology for the enlargement of the Spanish and Catalan versions of WordNet 3.0, but it can also be used for other languages. As large parallel corpora with semantic annotation are not usually available, we explore two strategies to overcome this problem: using monolingual sense-tagged corpora and machine translation, on the one hand, and using parallel corpora and automatic sense tagging of the source text, on the other. With these resources, the problem of acquiring a WordNet from parallel corpora can be seen as a word alignment task. Fortunately, this task is well known, and some alignment algorithms are freely available.
This paper presents a state of the art on the use of Wikipedia for tasks related to Natural Language Processing, along with three applications we have developed for enriching a wide-coverage language resource: WordNet version 3.0 for Catalan and Spanish. Researchers in this area have long sought ways to enrich applications with large quantities of world knowledge, in a more or less structured way, since such knowledge has proved crucial for many language processing tasks. Wikipedia can provide this information with some important advantages: free access and constant updating.
This paper presents a review of methods for building WordNets following the expand model, that is, by translating the English variants of the Princeton WordNet. Only free resources available online have been used in the construction process. The paper also presents the evaluation of the techniques applied in the construction of the Spanish and Catalan WordNets 3.0. These techniques can also be used to build WordNets for other languages. Keywords: WordNet, lexical resources, semantics
This paper presents a set of programs that facilitate the creation of WordNets from bilingual dictionaries following the expand model. The programs are written in Python and are therefore multiplatform. Although they do not have a graphical user interface, they are very easy to use. These programs have been successfully used in the Know2 project for the creation of the Catalan and Spanish WordNet 3.0. The programs are published under the GNU-GPL licence and can be freely downloaded from http://lpg.uoc.edu/wn-toolkit.
At times it is difficult to automatically identify the most representative terms in a specialized corpus and to validate them as correct, due to the similarity of words and terms. In order to identify the most representative terms in a corpus in a way that can be easily adapted to any language or terminology extraction tool, we explore the combination of token slot extraction and ranking metrics to select term candidates with a high likelihood of being terminological units. This paper presents the results obtained using four statistical measures. We observe high term detection in English corpora (a precision of 76.92% and a recall of 79.09%) and Spanish corpora (a precision of 60% and a recall of 70.48%) using token slot detection together with four ranking metrics: Dice, True Mutual Information, T-score and Log-likelihood. In conclusion, token slot detection extracts terminological patterns in term candidates to reduce lists of candidates, and ranking metrics improve results and reduce the number of candidates to be evaluated manually. We will evaluate the algorithm's performance in other domains and for other user profiles and needs.
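Of the four ranking metrics named above, Dice is the simplest to state: twice the frequency of the candidate pair divided by the sum of its components' frequencies. As a rough illustration only (the toy corpus and the counting conventions here are invented, not the paper's token slot method), a bigram candidate can be scored like this:

```python
from collections import Counter

def dice(f_bigram, f_w1, f_w2):
    """Dice association score for a two-word term candidate:
    2 * f(w1 w2) / (f(w1) + f(w2)). Higher values mean the two
    words co-occur more exclusively with each other."""
    return 2.0 * f_bigram / (f_w1 + f_w2)

# Toy corpus, invented for illustration.
tokens = "hard disk drive stores data on a hard disk".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# "hard" and "disk" always appear together here, so Dice is maximal.
score = dice(bigrams[("hard", "disk")], unigrams["hard"], unigrams["disk"])
print(score)  # 1.0
```

In a real extractor the candidate list would first be filtered by the token slot patterns, and metrics such as T-score or Log-likelihood would be computed over the same counts.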
This paper presents a set of methodologies and algorithms to create WordNets following the expand model. We explore dictionary- and BabelNet-based strategies, as well as methodologies based on the use of parallel corpora. Evaluation results for six languages are presented: Catalan, Spanish, French, German, Italian and Portuguese. Along with the methodologies and evaluation, we present an implementation of all the algorithms grouped in a set of programs, or toolkit. These programs have been successfully used in the Know2 project for the creation of the Catalan and Spanish WordNet 3.0. The toolkit is published under the GNU-GPL license and can be freely downloaded from http://lpg.uoc.edu/wn-toolkit.
In this paper we present the evaluation results for the creation of WordNets for five languages (Spanish, French, German, Italian and Portuguese) using an approach based on parallel corpora. We have used three very large parallel corpora for our experiments: DGT-TM, EMEA and ECB. The English part of each corpus is semantically tagged using FreeLing and UKB. After this step, the process of WordNet creation is converted into a word alignment problem, where we want to align WordNet synsets in the English part of the corpus with lemmata in the target-language part of the corpus. The word alignment algorithm used in these experiments is a simple most-frequent-translation algorithm implemented in the WN-Toolkit. The precision values obtained are quite satisfactory, but the overall number of extracted synset-variant pairs is too low, leading to very poor recall values. In the conclusions, the use of more advanced word alignment algorithms, such as GIZA++, fast_align or the Berkeley Aligner, is suggested.
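The most-frequent-translation idea can be sketched in a few lines. This is not the WN-Toolkit's actual implementation; the data layout and the placeholder synset id are assumptions made for illustration:

```python
from collections import Counter, defaultdict

def most_frequent_translation(aligned):
    """Given aligned sentence pairs whose English side carries synset
    tags, pick for each synset the target-language lemma that
    co-occurs with it most often.

    aligned: list of (en_synsets, tgt_lemmata) pairs."""
    counts = defaultdict(Counter)
    for en_synsets, tgt_lemmata in aligned:
        for synset in en_synsets:
            counts[synset].update(tgt_lemmata)
    return {syn: c.most_common(1)[0][0] for syn, c in counts.items()}

# Invented toy data: "syn-house" is a placeholder id, not a real
# WordNet offset.
corpus = [
    (["syn-house"], ["casa", "grande"]),
    (["syn-house"], ["casa", "nueva"]),
    (["syn-house"], ["edificio"]),
]
pairs = most_frequent_translation(corpus)
print(pairs)  # {'syn-house': 'casa'}
```

Because only the single most frequent co-occurring lemma is kept, precision tends to be high while recall stays low, which matches the trade-off reported above; statistical aligners such as GIZA++ model alignment probabilities instead of raw co-occurrence counts.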
The InLéctor project aims to promote reading in the original language, offering an interactive scenario that facilitates foreign language teaching and self-learning, as well as the study of literature. To achieve this aim, the project develops computational techniques for the automatic generation of bilingual e-books, incorporating dictionaries, ontologies and audio. Bilingual reading offers the option of moving from the original text to the translated one with a single click. The audio support is based on a human reading of the work in its original language. Moreover, it is planned to incorporate an interactive dictionary service that will show in advance the most difficult words in the text according to the language level of the user. Augmented information will also provide links to encyclopedic entries, images and audio directly related to the fragment of the text being read. The project ensures the creation and distribution of these parallel books in different output formats (HTML, EPUB and MOBI), which are compatible with the most common tablets and e-book readers today. The literary works and their translations are published in the public domain. The programs developed are based on free software and will also be published under a free license.
In this paper, the methodology and a detailed evaluation of the results of the expansion of the Galician WordNet using the WN-Toolkit are presented. This toolkit allows the creation and expansion of wordnets following the expand model. In our experiments we have used strategies based on dictionaries and parallel corpora. The evaluation of the results has been performed both automatically and manually, allowing a comparison of the precision values obtained with both evaluation procedures. The manual evaluation also provides details about the source of the errors. This information has been very useful both for improving the toolkit and for correcting errors in the reference WordNet for Galician. Keywords: WordNet, lexical information acquisition, parallel corpora, multilingual resources
In this paper an automatic morphology learning system for complex and agglutinative languages is presented. We process the complex agglutinative morphology of Indian languages using Adaptor Grammars and linguistic rules of morphology. Adaptor Grammars are a compositional Bayesian framework for grammatical inference, in which we define a morphological grammar for agglutinative languages and morphological boundaries are inferred from corpora of plain text. Once the morphological segmentation is produced, regular expressions encoding orthography rules are applied to achieve the final segmentation. We test our algorithm on three complex languages from the Dravidian family, evaluate the results against other state-of-the-art unsupervised morphology learning systems, and show significant improvements.
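The regex post-processing step can be illustrated with a minimal sketch. The rule below is invented for illustration (it is not one of the paper's actual orthography rules): it undoes a consonant doubling introduced at a morpheme boundary in a '+'-delimited segmentation.

```python
import re

# Hypothetical orthography rule: a doubled consonant immediately
# before a morpheme boundary "+" is reduced to a single consonant.
RULES = [(re.compile(r"(\w)\1\+"), r"\1+")]

def apply_orthography(segmented):
    """Post-process a '+'-delimited segmentation with regex rules."""
    for pattern, repl in RULES:
        segmented = pattern.sub(repl, segmented)
    return segmented

print(apply_orthography("katt+il"))  # kat+il  (invented example word)
```

In a real system there would be one such rule per sandhi phenomenon, applied in a fixed order after the Adaptor Grammar proposes boundaries.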
WordNet is a standard semantic resource for several Natural Language Processing tasks and is available for an increasing number of languages. The Croatian WordNet (CroWN) was a relatively small resource, with 10,026 synsets and 31,367 synset-variant pairs covering only 45.91% of the so-called Core WordNet. Comparing these figures with the size of the Princeton WordNet for English version 3.0, which has 117,659 synsets and 206,975 synset-variant pairs, it is clear that CroWN should be expanded. The first experiments for the expansion of CroWN were performed using the WN-Toolkit, a set of Python programs for wordnet creation and expansion using dictionary-, BabelNet- and parallel-corpora-based strategies. The WN-Toolkit had previously been successfully applied to other languages such as Spanish, Catalan and Galician. After this first expansion, CroWN reached 70.63% of the Core WordNet. In the second step we used CroDeriv, a derivational database for Croatian, and the manual creation of 1,457 synset-variant pairs until reaching 100% of the Core WordNet. After the second step was completed, CroWN reached 23,137 synsets and 47,931 synset-lemma pairs.
In this paper we describe a method to morphologically segment highly agglutinative and inflectional languages from the Dravidian family. We use a nested Pitman-Yor process to segment long agglutinated words into their basic components, and a corpus-based morpheme induction algorithm to perform morpheme segmentation. We test our method on two languages, Malayalam and Kannada, and compare the results with Morfessor.
In this paper we present TMX (Translation Memory eXchange), the standard interchange format for translation memories. We review the concept of a translation memory and its uses, which make it one of the translator's main resources. We look at strategies for quickly retrieving the segments most similar to the one being translated, and at mechanisms for ranking the retrieved segments by their similarity to the segment to be translated. We analyse the internal translation memory formats of the main computer-assisted translation tools and show the importance of having an interchange format that is standard, versatile and able to evolve to meet new needs. We briefly present the specifications of the TMX format and its different levels, and analyse the degree of acceptance of this format among computer-assisted translation tools. Finally, we present some proposals for the future of this format.
In this paper we present an extension of the dictionary-based strategy for wordnet construction implemented in the WN-Toolkit. This strategy allows the extraction of information for polysemous English words if definitions and/or semantic relations are present in the dictionary. The WN-Toolkit is a freely available set of programs for the creation and expansion of wordnets using dictionary-based and parallel-corpus-based strategies. In previous versions of the toolkit, the dictionary-based strategy was only used for translating monosemous English variants. In the experiments we have used OmegaWiki and Wiktionary, and we present automatic evaluation results for 24 languages that have wordnets in the Open Multilingual Wordnet project. We have used these existing wordnet versions to perform an automatic evaluation.
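The baseline monosemous case is easy to picture. The sketch below is an assumption-laden illustration, not the WN-Toolkit's code: the dictionary, the synset index and ids such as "dog-n-1" are all invented placeholders.

```python
def monosemous_entries(dictionary, synsets_of):
    """Direct assignment for monosemous English lemmas: if a lemma
    belongs to exactly one synset, each of its dictionary
    translations can be attached to that synset without
    disambiguation. Polysemous lemmas are skipped here; handling
    them requires definitions or semantic relations."""
    pairs = []
    for lemma, translations in dictionary.items():
        synsets = synsets_of.get(lemma, [])
        if len(synsets) == 1:  # monosemous: safe direct assignment
            pairs.extend((synsets[0], t) for t in translations)
    return pairs

# Invented toy data; "dog-n-1" etc. are placeholder synset ids.
dictionary = {"dog": ["perro"], "bank": ["banco", "orilla"]}
synsets_of = {"dog": ["dog-n-1"], "bank": ["bank-n-1", "bank-n-2"]}
entries = monosemous_entries(dictionary, synsets_of)
print(entries)  # [('dog-n-1', 'perro')]
```

The extension described above would replace the `len(synsets) == 1` shortcut with a step that matches dictionary definitions or relations against the candidate synsets, so that "bank" could also contribute entries.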
This refresher course is conceived as a general introduction to the concepts and tools needed to manage translation projects: determination of human and computing resources, calculation of volume and cost, formats, quality control, workflows, etc.
This paper describes a methodology for the construction of WordNets based on machine translation ... more This paper describes a methodology for the construction of WordNets based on machine translation of an English sense tagged corpus. For the construction of such a corpus we use two freely available resources: the Semcor Corpus and the Princeton WordNet Gloss Corpus. This methodology is being used for the construction of Spanish and Catalan WordNet 3.0. In our first experiments we used a simple word alignment algorithm obtaining precision results comparable to those obtained by methods based on bilingual dictionaries. After that, we used a freely available statistical word alignment algorithm (Berkeley Aligner) obtaining better results. This methodology can be suitable for those languages with an available statistical machine translation system and can be used for constructing WordNets from the scratch and for enlarging existing WordNets.
En este documento, se describen los criterios que determinan el conjunto de multipalabras ('multi... more En este documento, se describen los criterios que determinan el conjunto de multipalabras ('multiwords') recogidas en el recurso electrónico WordNet 3.0 (http://adimen.si.ehu.es/cgi-bin/ wei/public/wei.consult.perl). Puesto que se trata de una propuesta aplicada y restringida a un recurso en concreto, se aleja de la idea de ser una propuesta teórica que pretenda resolver la distinción entre cadenas léxicas composicionales y cadenas léxicas que son multipalabras. Los criterios denidos son neutros, en el sentido que son aplicables en el reconocimiento de multipalabras tanto del catalán como del castellano. Dado que estas dos lenguas son muy cercanas y comparten muchos rasgos lingüísticos y sintácticos, los criterios aquí presentados son aplicables a las dos lenguas. Dado que el reconocimiento de multipalabras es una cuestión pendiente de resolver en Lin-güística, los criterios establecidos se entienden como un conjunto de ltros que permiten discernir las cadenas léxicas que actúan como multipalabra del resto del conjunto de candidatos a ser mul-tipalabra. De esta manera, cada ltro aplicado sobre los candidatos de multipalabra marca si la cadena es una multipalabra y, al mismo tiempo, reduce este conjunto de candidatos. Después de de la aplicación de los ltrosidealmente' no debería quedar ninguna cadena léxica para tratar, pero es probable que los ltros no permitan explicar todos los candidatos * La investigación de la que dimana este trabajo ha sido nanciada [en parte] por el Ministerio de Ciencia e Innovación mediante el proyecto Representación del Conocimiento Semántico (SKR), con referencia TIN2009-14715-C0403.
Spelling and grammar checking has become a daily activity for almost all text processor users. Us... more Spelling and grammar checking has become a daily activity for almost all text processor users. Usually these tools offer limited information about the misspelling or the grammar error and in certain cases suggest one or more possible alternatives. Sometimes users make the same mistakes one day after the other because they don't know the real reason of the mistake. In these cases extended information about the error could be very handy: detailed grammatical information and exercises to improve the user's writing skills. In this paper we present AVI.cat, a grammar checker and virtual assessor for the improvement of writing skills in Catalan. The tool is based on LanguageTool grammar checker and fully integrates into OpenOffice/LibreOffice text processor. Along with the grammar checker with extended information collected from the web, the tool also offers an automatic evaluator and assessor, that can perform an automatic assessment from a collection of texts written by the user. The assessor can also give information about the use progress and suggests exercises for further improvement.
In this paper we present a methodology for WordNet construction based on the exploitation of para... more In this paper we present a methodology for WordNet construction based on the exploitation of parallel corpora with semantic annotation of the English source text. We are using this methodology for the enlargement of the Spanish and Catalan versions of WordNet 3.0, but the methodology can also be used for other languages. As big parallel corpora with semantic annotation are not usually available, we explore two strategies to overcome this problem: to use monolingual sense tagged corpora and machine translation, on the one hand; and to use parallel corpora and automatic sense tagging on the source text, on the other. With these resources, the problem of acquiring a WordNet from parallel corpora can be seen as a word alignment task. Fortunately, this task is well known, and some aligning algorithms are freely available.
Resum En aquest article presentem l'estat de la qüestió en l'ús de la Viquipèdia per a tasques re... more Resum En aquest article presentem l'estat de la qüestió en l'ús de la Viquipèdia per a tasques relacionades amb el processament del llenguatge natural i tres aplicacions que hem creat per a l'enriquiment d'un recurs lingüístic de gran abast: el WordNet versió 3.0 per al català i castellà. Els investigadors en aquesta àrea fa anys que cerquen vies perquè les aplicacions integrin informació sobre coneixement del món, d'una manera més o menys estructurada, ja que aquest tipus de coneixement ha demostrat ser molt important per a resoldre de manera satisfactòria moltes tasques de processament del llenguatge. La Viquipèdia pot respondre perfectament a aquesta demanda d'informació amb l'avantatge del seu accés lliure i la seva actualització constant.
Abstract This paper presents a state of the art on the use of Wikipedia for tasks related to Natural Language Processing and three applications we have developed for enriching a wide-coverage language resource: WordNet version 3.0 for Catalan and Spanish. Researchers in this area have sought for years ways for enriching applications with large quantities of world knowledge in a more or less structured way since it has proved to be crucial for many language processing tasks. Wikipedia can provide such information with some important advantages: free access and constant updating.
Este artículo ofrece una revisión de métodos para la construcción de WordNets siguiendo la estrat... more Este artículo ofrece una revisión de métodos para la construcción de WordNets siguiendo la estrategia de expansión, es decir, mediante la traducción de las variants inglesas del Princeton WordNet. En el proceso de construcción se han utilizado recursos libres disponibles en Internet. El artículo presenta también los resultados de la evaluación de las técnicas en la construcción de los WordNets 3.0 para el castellano y catalán. Estas técnicas se pueden utilizar para la construcción de WordNets para otras lenguas. Palabras clave: WordNet, recursos léxicos, semántica
This paper presents a review of methods for building WordNets follo-wing the expand model, that is, by translating the English variants of the Princeton WordNet. Only free resources available online have been used. The paper also pre-sents the evaluation of the techniques applied in the construction of Spanish and Catalan WordNets 3.0. These techniques can be also used for other languages.
En aquest article presentem un conjunt de programes que faciliten la creació de WordNets a partir... more En aquest article presentem un conjunt de programes que faciliten la creació de WordNets a partir de diccionaris bilingües mitjançant l'estratègia d'expansió. Els programes estan escrits en Python i són per tant multiplataforma. El seu ús, tot i que no disposen d'una interfície gràfica d'usuari, és molt senzill. Aquests programes s'han fet servir amb èxit en el projecte Know2 per a la creació de les versions 3.0 dels WordNets català i espanyol. Els programes estan publicats sota una llicència GNU-GPL i es poden descarregar lliurement de http://lpg.uoc.edu/~wn-toolkit.
This paper presents a set of programs to facilitate the creation of WordNet from bilingual dictionaries following the expand model. The programs are written in Python and are therefore multiplatform. The programs are very easy to use although they don’t have a graphical user interface. These programs have been successfully used in the Know2 Project for the creation of Catalan and Spanish WordNet 3.0. The programs are published under the GNU-GPL licence and can be freely downloaded from http://lpg.uoc.edu/wn-toolkit.
At times it is difficult to automatically identify the most representative terms ina ... more At times it is difficult to automatically identify the most representative terms ina specialized corpus and to validate them as correct due to the similarity of words and terms. In order to identify the most representative terms in a corpus that can be easily adapted to any language or terminology extraction tool, we explore the combination of token slot extraction and ranking metrics to select term candidates with a high likelihood of being terminological units. This paper presents the results we have identified using four statistical measures. We observe high term detection in English corpora (a precision of 76.92% and a recall of 79.09%) and Spanish corpora (a precision of 60% and a recall of 70.48%) using token slot detection together with four ranking metrics: Dice, True Mutual Information, T-score and Log-likelihood. In conclusion, token slot detection extracts terminological patterns in term candidates to reduce lists of candidates, and ranking metrics improve results and reduce the number to be evaluated manually. We will evaluate the algorithm’s performance in other domains and for other user profiles and needs .
This paper presents a set of methodolo-gies and algorithms to create WordNets following the expan... more This paper presents a set of methodolo-gies and algorithms to create WordNets following the expand model. We explore dictionary and BabelNet based strategies, as well as methodologies based on the use of parallel corpora. Evaluation results for six languages are presented: Catalan, Spanish, French, German, Italian and Por-tuguese. Along with the methodologies and evaluation we present an implementation of all the algorithms grouped in a set of programs or toolkit. These programs have been successfully used in the Know2 Project for the creation of Catalan and Spanish WordNet 3.0. The toolkit is published under the GNU-GPL license and can be freely downloaded from http: //lpg.uoc.edu/wn-toolkit.
In this paper we present the evaluation results for the creation of WordNets for five languages (Spanish, French, German, Italian and Portuguese) using an approach based on parallel corpora. We have used three very large parallel corpora for our experiments: DGT-TM, EMEA and ECB. The English part of each corpus is semantically tagged using Freeling and UKB. After this step, the process of WordNet creation is converted into a word alignment problem, where we want to align WordNet synsets in the English part of the corpus with lemmata in the target-language part of the corpus. The word alignment algorithm used in these experiments is a simple most-frequent-translation algorithm implemented in the WN-Toolkit. The precision values obtained are quite satisfactory, but the overall number of extracted synset-variant pairs is too low, leading to very poor recall values. In the conclusions, the use of more advanced word alignment algorithms, such as Giza++, Fast Align or Berkeley Aligner, is suggested.
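The most-frequent-translation aligner is described only at a high level. A minimal sketch, under the assumption that the English side of each sentence pair carries synset tags and the target side is lemmatized, could look like this (function names and thresholds are illustrative, not the WN-Toolkit's actual API):

```python
from collections import Counter, defaultdict

def most_frequent_translation(tagged_pairs, min_count=2, min_ratio=0.5):
    """tagged_pairs: iterable of (english_synsets, target_lemmas), one per
    aligned sentence pair. Returns a synset -> target lemma mapping for
    synsets whose most frequent co-occurring lemma is frequent and dominant
    enough (hypothetical thresholds)."""
    cooc = defaultdict(Counter)
    for synsets, lemmas in tagged_pairs:
        for s in synsets:
            # Count each lemma once per sentence pair.
            cooc[s].update(set(lemmas))
    result = {}
    for s, counter in cooc.items():
        lemma, count = counter.most_common(1)[0]
        total = sum(counter.values())
        if count >= min_count and count / total >= min_ratio:
            result[s] = lemma
    return result
```

Low recall follows naturally from this design: a synset-variant pair is only extracted when the same translation dominates across enough sentence pairs, which discards many correct but infrequent candidates.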
The InLéctor project aims to promote reading in the original language, offering an interactive scenario which facilitates foreign language teaching and self-learning, as well as the study of literature. In order to achieve this aim, the project develops computational techniques for the automatic generation of bilingual e-books, incorporating dictionaries, ontologies and audio. Bilingual reading offers the option of moving from the original text to the translated one with a single click. The audio support is based on a human reading of the work in its original language. Moreover, it is planned to incorporate an interactive dictionary service which will show in advance the most difficult words in the text according to the language level of the user. Augmented information will also provide links to encyclopedic entries, images and audio directly related to the fragment of text being read. The project ensures the creation and distribution of these parallel books in different output formats (html, epub and mobi), which are compatible with the most common tablets and e-book readers available today. The literary works and their translations are published in the public domain. The programs developed are based on free software and will also be published under a free license.
Abstract: In this paper the methodology and a detailed evaluation of the results of the expansion of the Galician WordNet using the WN-Toolkit are presented. This toolkit allows the creation and expansion of wordnets using the expand model. In our experiments we have used methodologies based on dictionaries and parallel corpora. The evaluation of the results has been performed both in an automatic and in a manual way, allowing a comparison of the precision values obtained with both evaluation procedures. The manual evaluation provides details about the source of the errors. This information has been very useful for the improvement of the toolkit and for the correction of some errors in the reference WordNet for Galician. Keywords: WordNet, lexical information acquisition, parallel corpora, multilingual resources.
In this paper an automatic morphology learning system for complex and agglutinative languages is presented. We process the complex agglutinative morphology of Indian languages using Adaptor Grammars and linguistic rules of morphology. Adaptor Grammars are a compositional Bayesian framework for grammatical inference; we define a morphological grammar for agglutinative languages, and morphological boundaries are inferred from a corpus of plain text. Once the morphological segmentation is produced, regular expressions for orthography rules are applied to obtain the final segmentation. We test our algorithm on three complex languages from the Dravidian family, evaluate the results against other state-of-the-art unsupervised morphology learning systems, and show significant improvements.
Wordnet is a standard semantic resource for several Natural Language Processing tasks and it is available for an increasing number of languages. The Croatian Wordnet (CroWN) was a relatively small resource with 10,026 synsets and 31,367 synset-variant pairs, covering only 45.91% of the so-called Core WordNet. Comparing these figures with the size of the Princeton WordNet for English, version 3.0, which has 117,659 synsets and 206,975 synset-variant pairs, it is clear that CroWN should be expanded. First experiments for the expansion of CroWN were performed using the WN-Toolkit, a set of Python programs for wordnet creation and expansion using dictionary, BabelNet and parallel-corpora based strategies. The WN-Toolkit had previously been successfully applied to other languages such as Spanish, Catalan and Galician. After this first expansion, CroWN reached 70.63% of the Core WordNet. In the second step we used CroDeriv, a derivational database for Croatian, and the manual creation of 1,457 synset-variant pairs until reaching 100% of the Core WordNet. After the second step was completed, CroWN reached 23,137 synsets and 47,931 synset-lemma pairs.
In this paper we describe a method to morphologically segment highly agglutinating and inflectional languages from the Dravidian family. We use a nested Pitman-Yor process to segment long agglutinated words into their basic components, and a corpus-based morpheme induction algorithm to perform morpheme segmentation. We test our method on two languages, Malayalam and Kannada, and compare the results with Morfessor.
In this paper we present TMX (Translation Memory eXchange), the standard format for exchanging translation memories. We review the concept of a translation memory and its uses, which make it one of the main resources for translators. We look at strategies for quickly retrieving the segments most similar to the one being translated, and at mechanisms for ranking the retrieved segments by their similarity to the segment to be translated. We analyse the internal translation-memory formats of the main computer-assisted translation tools and discuss the importance of having an exchange format that is standard, versatile and able to evolve to meet new needs. We briefly present the specifications of the TMX format and its different levels, and analyse the degree of acceptance of the format among computer-assisted translation tools. Finally, we present some proposals for the future of the format.
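As an illustration of the structure the paper discusses, a minimal TMX document (a header followed by translation units, each with one variant per language) can be generated with the Python standard library. The header attribute values below are placeholders, not mandated values:

```python
import xml.etree.ElementTree as ET

def build_tmx(units, srclang="en"):
    """units: list of dicts mapping a language code to a segment text.
    Returns a minimal TMX 1.4 document as a string."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX; the values here are placeholders.
    ET.SubElement(tmx, "header", {
        "creationtool": "example", "creationtoolversion": "1.0",
        "segtype": "sentence", "o-tmf": "example",
        "adminlang": "en", "srclang": srclang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for unit in units:
        tu = ET.SubElement(body, "tu")          # one translation unit
        for lang, text in unit.items():
            tuv = ET.SubElement(tu, "tuv")      # one variant per language
            tuv.set("xml:lang", lang)
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

For example, `build_tmx([{"en": "Hello", "es": "Hola"}])` yields a document with a single `<tu>` containing an English and a Spanish `<tuv>`, which any TMX-aware CAT tool should be able to import.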
In this paper we present an extension of the dictionary-based strategy for wordnet construction implemented in the WN-Toolkit. This strategy allows the extraction of information for polysemous English words if definitions and/or semantic relations are present in the dictionary. The WN-Toolkit is a freely available set of programs for the creation and expansion of wordnets using dictionary-based and parallel-corpus based strategies. In previous versions of the toolkit the dictionary-based strategy was only used for translating monosemous English variants. In the experiments we have used OmegaWiki and Wiktionary, and we present automatic evaluation results for 24 languages that have wordnets in the Open Multilingual Wordnet project. We have used these existing wordnet versions to perform the automatic evaluation.
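The abstract does not detail how dictionary definitions are matched against the senses of a polysemous word. One common approach to such definition-based sense selection is a Lesk-style gloss overlap, sketched below; this is a hypothetical helper for illustration, not the WN-Toolkit's actual code:

```python
STOPWORDS = frozenset({"a", "an", "the", "of", "to", "in", "and", "or", "is"})

def content_words(text):
    """Lowercased tokens with a small illustrative stopword list removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def best_synset(dict_definition, candidate_synsets):
    """candidate_synsets: list of (synset_id, gloss) pairs for a polysemous
    English word. Returns the synset whose gloss shares the most content
    words with the dictionary definition, or None if no gloss overlaps."""
    target = content_words(dict_definition)
    best, best_overlap = None, 0
    for synset_id, gloss in candidate_synsets:
        overlap = len(target & content_words(gloss))
        if overlap > best_overlap:
            best, best_overlap = synset_id, overlap
    return best
```

Semantic relations present in the dictionary entry (e.g. a listed hypernym) could be used analogously, checking them against the relations of each candidate synset.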
This paper presents a set of programs to facilitate the creation of WordNets from bilingual dictionaries following the expand model. The programs are written in Python and are therefore multiplatform. Although they do not have a graphical user interface, they are very easy to use. These programs have been successfully used in the Know2 Project for the creation of Catalan and Spanish WordNet 3.0. The programs are published under a GNU-GPL licence and can be freely downloaded from http://lpg.uoc.edu/wn-toolkit.
This refresher course is conceived as a general introduction to the concepts and tools needed to manage translation projects: determination of human and computing resources, calculation of volume and cost, formats, quality control, workflows, etc.
This book presents a clear and in-depth overview of the technologies applied today in the world of translation: computer-assisted translation tools, machine translation, and terminology extraction and management. The work presents both the operating principles of the main tools and the resources that are essential for every translator: translation memories and terminological databases. It is an indispensable work for all professionals interested in getting the most out of these technologies in their daily work. The author has extensive experience in the use, design and teaching of translation tools.
Papers by Antoni Oliver
Abstract: This paper presents a state of the art on the use of Wikipedia for tasks related to Natural Language Processing, and three applications we have developed for enriching a wide-coverage language resource: WordNet version 3.0 for Catalan and Spanish. Researchers in this area have for years sought ways to enrich applications with large quantities of more or less structured world knowledge, since such knowledge has proved to be crucial for many language processing tasks. Wikipedia can provide this information with some important advantages: free access and constant updating.
This paper presents a review of methods for building WordNets following the expand model, that is, by translating the English variants of the Princeton WordNet. Only free resources available online have been used. The paper also presents the evaluation of the techniques applied in the construction of Spanish and Catalan WordNets 3.0. These techniques can also be used for other languages.
This paper presents a set of programs to facilitate the creation of WordNets from bilingual dictionaries following the expand model. The programs are written in Python and are therefore multiplatform. Although they do not have a graphical user interface, they are very easy to use. These programs have been successfully used in the Know2 Project for the creation of Catalan and Spanish WordNet 3.0. The programs are published under the GNU-GPL licence and can be freely downloaded from http://lpg.uoc.edu/wn-toolkit.