Linguística, Informática e Tradução: Mundos Que Se Cruzam
7 (1) / 2015
Issue editors:
Alberto Simões
Anabela Barreiro
Diana Santos
Rui Sousa-Silva
Stella E. O. Tagnin
Reviewers:
Alberto Simões
Alexandre Rademaker
Ana Maria Brito
Anabela Barreiro
Brett Drury
Cláudia Freitas
Cristina Mota
Diana Santos
Eckhard Bick
Eugénio Oliveira
Françoise Bacquelaine
Fátima Oliveira
Hugo Gonçalo Oliveira
Isabel Galhano
Jorge Teixeira
José João Dias de Almeida
João Veloso
Luís Costa
Luís Miguel Cabral
Luís Trigo
Maria José Finatto
Miriam Leite
Mário J. Silva
Paula Carvalho
Paulo Rocha
Pavel Brazdil
Rui Sousa-Silva
Signe Oksefjell Ebeling
Stella E. O. Tagnin
Sílvia Araújo
Thomas Hüsgen
Contents
Mundos que se Cruzam
Alberto Simões, Anabela Barreiro, Diana Santos, Rui Sousa-Silva and Stella E. O. Tagnin
Thomas J. C. Hüsgen
Françoise Bacquelaine
Anabela Barreiro
Rui Sousa-Silva
As WordNets do Português
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 1-8. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
But human communication also encompasses other areas, as the article "A tool at hand: gestures and rhythm in listing events. Case studies of European and African Portuguese speakers" demonstrates. Isabel Galhano Rodrigues describes an ethnographic study of gestures and body language in interaction. Drawing on a corpus of four interactions between speakers from different cultures, the author analyses morphological aspects and rhythmic patterns as a means of detecting regularities and (cultural) differences in gestures.
As Belinda proposed and studied, as a pioneer of corpus-based work, one of the functions of language is to express emotions and feelings. While oral, face-to-face communication can draw on body language and intonation, in written communication sentiment becomes harder to detect, and computational sentiment detection is currently one of the busiest areas of research.
Accordingly, Brett Drury and Alneu Lopes, in "The identification of indicators of sentiment using a multi-view self-training algorithm", write about identifying sentiment with learning algorithms, reporting results that reached precision levels of 70%.
Also related to this topic, in "SentiLex-PT: Principais características e potencialidades", Paula Carvalho and Mário Silva present a lexicon with sentiment information and demonstrate its application to two distinct corpora.
This is not the only type of study that can benefit from ontologies, an area that has always been among Belinda's favourites. In "As Wordnets do Português", Hugo Gonçalo Oliveira et al. present several of the existing WordNets for Portuguese, discussing their main differences and similarities and how these initiatives can create synergies for improving the respective resources.
Finally, returning to applications and the study of concrete texts, the article by Pavel Brazdil et al., "Affinity mining of documents sets with network analysis enriched by keywords and summaries", presents some experiments that use text mining algorithms to detect similarities between documents. In turn, in "Reporter fired for plagiarism: a forensic linguistic analysis of news plagiarism", Rui Sousa-Silva detects another, more problematic type of similarity, namely in the field of journalistic plagiarism. Drawing on cases from recent years, the author shows that news text is frequently the object of plagiarism and illustrates how a forensic linguistic approach makes it possible to detect it.
dedications
Giving the floor to the editors:
I met Belinda in person in foreign lands, at LREC 2002 in Palma de Mallorca, and I was immediately caught up in her enthusiasm and her determination to improve the state of computational linguistics in Portugal, all the more surprising or moving because she is English and not Portuguese. We fell into each other's arms, as we say in good Portuguese (though obviously not literally), and from then on we developed a professional relationship and a friendship of which I am very proud. I think that, as Portuguese speakers, we should not be ashamed to acknowledge that one of the people in Portugal who did most for translation studies, contrastive studies and computational linguistics in our country was an Englishwoman from Porto. This tribute is therefore the least that it seems natural to offer her. I would have done more, had the deadline for producing this book not been so short.
Diana Santos
Dearest Belinda, I can no longer recall when we met. If memory serves, it was at the first CULT in Bertinoro, back in 1997. What an excellent conference! Even then I realised that we had similar approaches in our work with translation students. From then on we kept in close contact, and in 1998 you were in São Paulo for an ABRAPT conference. In 2002 I had the pleasure of having a contribution of yours in Cadernos de Tradução, published by UFSC, in a special issue on Corpus Linguistics. In 2003 we crossed paths at Corpus Linguistics in Lancaster and you gave us a beautiful tour of the English countryside; it was your birthday, remember? Our closest time together was in 2004. Do you remember the ABRAPT congress in Fortaleza, preceded by that wonderful weekend in Guajiru? Unforgettable, wasn't it? On the other hand, how we made you work at that congress! That same year we presented a paper, "Ideias que cruzam o oceano" ("Ideas that cross the ocean"), at the EST congress in Lisbon, addressing what we have in common. That was when you kindly invited Franco and me to stay at your home and showed us around the north of Portugal. What a wonderful guide and driver you are! I only regret that we are on opposite sides of the Atlantic. How I would have liked the chance to work more closely with you, an admirable, eclectic and generous yet down-to-earth researcher, always seeking to develop studies and tools (read: Corpógrafo) with practical application in translation and terminology. Enjoy your retirement to the full.
Stella E.O. Tagnin
OSLa volume 7(1), 2015
At the moment of her retirement, I could not fail to pay tribute to a distinguished master of translation in Portugal, Professor Belinda Maia. I do so through two articles, one on the anonymisation of named entities in a corpus of professional translation in the legal and financial domain, the other on the tasks of pre-editing text for machine translation and evaluating the machine-translated output, both of which require the participation of specialists in the languages involved. Professor Belinda Maia's academic career was devoted essentially to the teaching of translation in Portugal, the country where she chose to live and work and where her work in contrastive linguistics and translation leaves an indelible mark. As her former doctoral student, I had the privilege of witnessing at close hand her scientific rigour and her genuineness, and of benefiting from the generosity with which she shares her knowledge with her students and colleagues. May this modest tribute symbolise the passion with which she defends the involvement of linguists and professional translators in the machine translation process.
Anabela Barreiro
"Who's to say what's proper? What if it were agreed that 'proper' meant wearing a codfish on your head? Would you wear it?"
Lewis Carroll, Alice in Wonderland
More than 20 years have passed since I met Professor Belinda Maia, as a third-year translation student at the Faculty of Arts of the University of Porto. It was in one of those first classes that, to the great perplexity of many, Belinda applied to translation a life lesson to which few were accustomed: "There is no black and white in translation studies." Since that lesson in translation and in life, my personal and professional admiration for Belinda has not stopped growing. For her energy. For her courage. For her enormous capacity for work. For her diplomacy. For her ability to break conventional barriers. And, above all, for her boldness. On that afternoon in 1994, I did not yet know that the two of us would share a long road, first in the supervision of my undergraduate internship, then in the supervision of my Master's, and later in the co-supervision of my PhD, just as I did not know that this Professor would support me through many of the intellectual and professional challenges and dilemmas that I would come to face along the way. At this important moment in her life, I could not fail to pay tribute to her. I do so in my own name and on behalf of all my colleagues whose lives Belinda touched and who, for whatever reason, are not taking part in this volume. But I do so above all for the friendship we share.
Rui Sousa-Silva
acknowledgements
We thank CLUP, in the person of its director, João Veloso, for the financial support provided for the publication of this volume. We also thank ILOS, at the University of Oslo, for the financial support provided for the print edition.
Our thanks to the editors of OSLa, Atle Grønn and Dag Haug, for their prompt help and for accommodating the very tight deadlines, and to Nuno Carvalho for his great help in converting and formatting the articles.
Our greatest thanks go to all those who took part in this initiative, as authors and as reviewers.
contacts
Alberto Simões
Linguateca e CEH, Universidade do Minho
ambs@ilch.uminho.pt
Anabela Barreiro
INESC-ID e Linguateca
anabela.barreiro@inesc-id.pt
Diana Santos
Linguateca e Universidade de Oslo
d.s.m.santos@ilos.uio.no
Rui Sousa-Silva
Centro de Linguística da Universidade do Porto
r.sousa-silva@lflab.pt
Stella E.O. Tagnin
Universidade de São Paulo
seotagni@usp.br
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 9-19. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
The need to re-map the concepts of translation studies has become evident. In this context, Sonia Vandepitte (2008) proposes a renewed thesaurus, which this article analyses, discusses and completes as a constructive contribution to a new ontology of translation.
Commenting on Sonia Vandepitte's proposal, "Remapping Translation Studies: Towards a Translation Studies Ontology". Paper delivered on 7 February 2014 at the seminar "(Re)Cartografar os Estudos de Tradução no Século XXI", organised by the Centro de Estudos de Comunicação e Cultura of the Universidade Católica.
It is in this context that I see my possible contribution to the ontology proposed by Vandepitte, which I will organise according to points 1 and 2 referred to above, namely [1] the addition of terms and [2] the identification of further levels in the established hierarchies or the proposal of level reductions. Within this core argument, aspects of synonym modification (point 3) or even the suggestion of new associated relations (point 4) may arise case by case.
Before doing so, however, I would like to make clear that I do not intend to formulate an alternative organisation, though one is certainly possible, but rather to propose some changes that, in my understanding, increase its operability and hence its suitability for the aim, expressed by the author, of developing an alternative to the organisation proposed by Holmes (1987), identifying classification categories according to precise criteria that will allow all kinds of studies in the field of translation to be presented on a new, coherent and consistent map (cf. Vandepitte 2008, p. 573).
[1] addition of terms
This distinction seems to me debatable given the entrenched confusion in the use of the concepts strategy, method, procedure and technique. According to Venuti (1998), strategies involve fundamental decisions related to textual macrostructures, which the author classifies as domesticating (domestication) or foreignizing (foreignization). Jääskeläinen (1993), in turn, distinguishes global strategies, which concern general principles and modes of action, from local strategies, which refer to more specific choices related to decision-making in the context of problem-solving at the level of textual microstructures: "global strategies refer to general principles and modes of action and local strategies refer to specific activities in relation to the translator's problem-solving and decision-making" (Jääskeläinen 1993, p. 16). Newmark (1988) proposes a distinction between translation method and procedure, distinguishing them much as Jääskeläinen differentiates global from local strategies: "while translation methods relate to whole texts, translation procedures are used for sentences and the smaller units of language" (Newmark 1988, p. 81). The differentiation between macro- and micro-structural choices seems to me important in the context of studies oriented towards describing the process, and I believe it should be visible in the presentation of a thesaurus of this kind. It would therefore be necessary, in my view, to reorder Vandepitte's list to make it more consistent and clearer, according to the following proposal:
(Vandepitte's list)
NT: studies of translation strategies
RT: adaptation
RT: domestication
RT: equivalence
RT: explicitation
RT: foreignization
RT: free translation
RT: imitation
RT: literal translation
UF: word-for-word translation
UF: metaphrase
RT: paraphrase
RT: sense-for-sense translation
NT: studies of linguistic translation techniques
NT: compensation
RT: shifts of translation

(author's proposal)
NT: studies of translation strategies
RT: adaptation
RT: domestication
RT: foreignization
RT: free translation
RT: imitation
RT: literal translation
UF: word-for-word translation
UF: metaphrase
RT: sense-for-sense translation
NT: studies of linguistic translation procedures
NT: compensation
RT: shifts of translation
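The revision can also be expressed as data, which makes the difference between the two lists explicit. The sketch below is only an illustration: it assumes that the first list is Vandepitte's and the second the author's proposal, it keeps the entries flat because the nesting is not recoverable from the layout, and the names `ORIGINAL`, `revise` and `diff` are hypothetical.

```python
# Relation codes as used in the lists: NT = narrower term,
# RT = related term, UF = "use for" (a non-preferred synonym).
ORIGINAL = [
    ("NT", "studies of translation strategies"),
    ("RT", "adaptation"),
    ("RT", "domestication"),
    ("RT", "equivalence"),
    ("RT", "explicitation"),
    ("RT", "foreignization"),
    ("RT", "free translation"),
    ("RT", "imitation"),
    ("RT", "literal translation"),
    ("UF", "word-for-word translation"),
    ("UF", "metaphrase"),
    ("RT", "paraphrase"),
    ("RT", "sense-for-sense translation"),
    ("NT", "studies of linguistic translation techniques"),
    ("NT", "compensation"),
    ("RT", "shifts of translation"),
]

# Changes read off the proposal: three entries dropped, one renamed.
DROPPED = {"equivalence", "explicitation", "paraphrase"}
RENAMED = {
    "studies of linguistic translation techniques":
        "studies of linguistic translation procedures",
}


def revise(entries, dropped, renamed):
    """Return a new entry list with dropped labels removed and renames applied."""
    return [
        (relation, renamed.get(label, label))
        for relation, label in entries
        if label not in dropped
    ]


def diff(old, new):
    """Return (labels only in old, labels only in new), each sorted."""
    old_labels = {label for _, label in old}
    new_labels = {label for _, label in new}
    return sorted(old_labels - new_labels), sorted(new_labels - old_labels)


PROPOSAL = revise(ORIGINAL, DROPPED, RENAMED)
removed, added = diff(ORIGINAL, PROPOSAL)
```

Keeping the relation code next to each label means the same structure could later hold an explicit parent for every entry, once the hierarchy is confirmed against the published thesaurus.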
Transference
Naturalization
Cultural equivalent
Functional equivalent
Descriptive equivalent
Componential analysis
Synonymy
Through-translation
natural translation performed by bilinguals in spontaneous everyday situations, as compared with professional translation by translators with advanced training in highly structured situations:
"Bilingualism and all forms of translation, whether the 'natural translation' done in everyday circumstances by bilinguals who have had 'no special training for it' (Harris 1976, p. 96) or the professional translation of those with advanced translation degrees working in today's language industry, are necessarily connected at a very fundamental cognitive level." (Shreve 2012, p. 1).
The fact that bilingualism in translation has so far been somewhat neglected in Europe, unlike in the USA (cf. Antonini 2010), may explain the absence of this entry from Vandepitte's proposal but naturally does not justify it, not least because Europe too is seeing growing interest in networks of volunteer interpreters and translators who, in the short term, will significantly change the translation landscape worldwide.
Vandepitte's proposal (2008, pp. 584-585).
Types by subject
NT: single-focus translation studies
NT: process-oriented translation studies
(incl. cognitive processes)
NT: studies of translation competence
[...]
[2] extending the hierarchies established by vandepitte
I share with Flynn & Gambier (2011, pp. 92-93) the assumption that quantitative and qualitative methods, alone or in combination, are applied to all work on the main factors of the translation process:
"To recapitulate, two main methods of analysis can be used to study any of the four factors outlined above: quantitative or qualitative or a combination of the two. Listed under quantitative methods we have noted surveys, (cloze) tests, corpus analyses, key-logging, eye-tracking, screen-logging and related statistical analyses. Under qualitative methods we have noted various forms of text and discourse analysis, narrative and related studies, interviews with individuals or focus groups, think-aloud protocols, ethnographies, inquiries into power, gender and other sets of relations."
In light of this understanding, the types organised by general research methods need to be reorganised hierarchically as follows:
Types by method
Types by general research methods
NT: inductive translation studies
NT: corpus(-based) translation studies
NT: qualitative approaches
NT: quantitative approaches
NT: hermeneutic approaches
NT: deductive translation studies
NT: experimental translation studies
RT: think-aloud protocol studies
UF: TAP studies
NT: speculative approaches
Types by method
Types by general research methods
NT: quantitative methods
NT: qualitative methods
RT: inductive approaches
RT: deductive approaches
RT: corpus(-based) approaches
NT: hermeneutic approaches
RT: experimental approaches
RT: think-aloud protocol studies
UF: TAP studies
NT: speculative approaches
(author's proposal)
The change from "inductive translation studies", "corpus(-based) translation studies", "deductive translation studies" and "experimental translation studies" to "inductive approaches", "corpus(-based) approaches", "deductive approaches" and "experimental approaches" is justified, in my opinion, on grounds of internal coherence, given that the items in question define possible approaches in which the methods identified may find application.
conclusion
Building a thesaurus of this kind is a complex and demanding task and, by its nature, debatable or even controversial, because the different approaches, practices and aims in translation studies do not always reach consensus on what can be considered core and what complementary (not in the sense of importance, but of scope), even within each branch of research. For that reason, I do not propose here an alternative structural organisation of the field but, rather, favour a stance of constructive cooperation in optimising a taxonomic solution/proposal.
However, I do not want to omit the possibility of considering an alternative ontological structuring of translation studies. Flynn & Gambier (2011), for example, consider four fundamental and interrelated factors in describing and explaining translation activity, namely 1. discourses (in the broad sense, including translations and all multilingual interaction related to the translated text), 2. practices (besides translation practice itself, the multiple factors that in some way influence it), 3. contexts (in which translations are produced) and 4. actors (including all those who take part, directly or indirectly, in translation activity) (cf. Flynn & Gambier 2011, pp. 89-93). In attempting to group studies in this way by the common aim of seeking to discover what it is to be a translator (translatorship), one could perhaps move from a multidisciplinary perspective, which does not necessarily seek the integration of knowledge, to a field that Snell-Hornby et al. (1994) called an interdisciplinary discipline, which in turn seeks to bring together all knowledge around what we might call translatorship. An ontology based on this assumption would thus have as its alternative structuring foundation the factors that define per se translation as a communicative and intentional human activity (cf. Vandepitte 2008, p. 570). This might be a step in the direction, felt by many to be desirable, of a reconciliation between theory and practice.
acknowledgements
I thank my colleague Anabela Barreiro for her careful review of this article.
references
Aitchison, Jean, Alan Gilchrist & David Bawden. 2000. Thesaurus construction and use: A practical manual. Aslib IMI.
Antonini, Rachele. 2010. Natural translator and interpreter. In Yves Gambier & Luc van Doorslaer (eds.), Handbook of translation, vol. 2, 102-104. John Benjamins.
Baker, Mona (ed.). 1998. Routledge Encyclopedia of Translation Studies. Routledge.
Catford, John Cunnison. 1965. A linguistic theory of translation: An essay in applied linguistics. Oxford University Press.
Flynn, Peter & Yves Gambier. 2011. Methodology in translation studies. In Yves Gambier & Luc van Doorslaer (eds.), Handbook of translation, vol. 2, 88-96. John Benjamins.
Gentzler, Edwin. 2001. Contemporary translation theories. 2nd edn. Multilingual Matters.
Harris, Brian. 1976. The importance of natural translation. Working Papers in Bilingualism 12. 96-114.
Harris, Brian & Bianca Sherwood. 1978. Translating as an innate skill. In David Gerver & H. Wallace Sinaiko (eds.), Language, interpretation and communication, 155-170. Plenum.
Hervey, Sándor & Ian Higgins. 1992. Thinking translation. Routledge.
Holmes, James. 1987. The name and nature of translation studies. In Gideon Toury (ed.), Translation across cultures. Bahri Publications.
Jääskeläinen, Riitta. 1993. Investigating translation strategies. In John Laffling & Sonja Tirkkonen-Condit (eds.), Recent Trends in Empirical Translation Research, 99-120. University of Joensuu.
Kuhiwczak, Piotr & Karin Littau (eds.). 2007. A companion to translation studies. Multilingual Matters.
Munday, Jeremy (ed.). 2009. The Routledge Companion to Translation Studies. Routledge.
Newmark, Peter. 1988. Approaches to translation. Prentice Hall.
Popovič, Anton. 1976. Dictionary for the analysis of literary translation. Department of Comparative Literature, The University of Alberta.
contacts
Thomas J. C. Hüsgen
Faculdade de Letras da Universidade do Porto
thusgen@letras.up.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 21-37. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
The Corpógrafo results from interdisciplinary collaboration between linguists and computer engineers under Belinda Maia's direction. This user-friendly tool for building and using tailor-made corpora allows not only for terminology extraction and management, but also for any research based on monolingual, comparable or parallel corpora. This paper presents the Corpógrafo's evolution from the first to the fourth version, and two experiences of its use in three languages (English, French and Portuguese). The first experience is in the field of Bluetooth technology terminology extraction and management. The second deals with four Portuguese structures containing the universal quantifier cada and expressing progression, dropper, proportion between two sets of events or entities and proportion between a set and a subset of events or entities. These experiences show the strengths, weaknesses and limits of the Corpógrafo.
[1] introduction
Françoise Bacquelaine
[...] One concerns the development of a trilingual BDT (terminology database) in English, French and Portuguese in the field of Bluetooth wireless telecommunication technology. The other revealed the predominance of the Portuguese universal quantifier cada over the plural universal quantifier todos (os) in certain specialised corpora, whereas each and chaque are less frequent than all or tous (les) in English and French, whatever the corpus. These two examples illustrate two of the main pedagogical and scientific applications of the Corpógrafo from the user's perspective.
[2] corpógrafo
edit the text in order to clean it and divide it semi-automatically into segments, according to the user's needs; (2) build one or more corpora from selections of files; (3) carry out frequency studies and collocation searches (Sarmento & Maia 2003, p. 27).
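The third function, frequency studies and collocation searches, can be illustrated with a minimal sketch. It does not reproduce the tool's implementation: the tokeniser, the window-based notion of collocate, the function names and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def collocates(tokens, node, window=2):
    """Count the words found within `window` positions of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# Made-up sample text, used only to exercise the functions.
sample = "Each device joins a piconet. Each piconet has one master device."
tokens = tokenize(sample)
freq = Counter(tokens)                       # frequency study
near_piconet = collocates(tokens, "piconet", window=1)  # collocation search
```

Raw co-occurrence counts are the simplest choice here; a real study would usually normalise them with an association measure.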
In 2004, thanks to its modular architecture, terminology extraction and management tools could be added to the GC's functions to facilitate the terminographer's work: the Corpógrafo was born (Sarmento et al. 2004). The current structure of the Corpógrafo was taking shape. The main menu (figure 1) today offers four options, the first two inherited from the GC: (1) Gestor (Manager), for creating and managing corpora; (2) Pesquisa (Search), for analysing corpora using various types of query; (3) Centro de Conhecimento (Knowledge Centre), for creating and managing databases and semantic relations; (4) Centro de Comunicação (Communication Centre), where the user can find information about the Corpógrafo. Besides these four main-menu options, which give access to various functions, the user has four buttons (figure 2) for (1) accessing the bin of deleted files; (2) getting help on the function in use; (3) sending comments; (4) editing their profile.
3. According to our calculations, this frequency is even higher: 4000/110.961 = 0,36%.
4. Key Word In Context.
As for the Concordância Janela option (figure 7), it sorts the results alphabetically according to the atoms that precede or follow the query expression. Version 4 includes a fifth function for exploring parallel corpora, but we have never used it.
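Sorting a window concordance by the atoms on either side of the query can be sketched as follows. This is an assumption-laden illustration, not the tool's code: the function name, the `width` and `sort_by` parameters and the sample text are hypothetical.

```python
import re


def window_concordance(text, query, width=2, sort_by="right"):
    """Return (left context, hit, right context) triples, sorted alphabetically."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == query:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, token, right))
    if sort_by == "left":
        # Compare the nearest preceding atom first, then move outwards.
        hits.sort(key=lambda hit: hit[0][::-1])
    else:
        hits.sort(key=lambda hit: hit[2])
    return hits


# Made-up Portuguese sample containing three occurrences of "cada".
text = "cada vez mais rápido; a cada dia que passa; cada um dos casos"
by_right = window_concordance(text, "cada", sort_by="right")
by_left = window_concordance(text, "cada", sort_by="left")
```

Sorting on the right context groups recurring sequences such as cada vez or cada um, which is what makes this view useful for spotting patterns around a quantifier.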
But what earned the GC its change of name is the ability to create and manage BDTs. Terminology extraction is based on n-gram analysis and a set of lexical restrictions drawn from an electronic dictionary. The user can create BDTs whose model is based on ISO 12620. Each headword or entry in the BDT corresponds to a record comprising various fields that may or may not be filled in (language, morphological data, source, definition, examples in context, semantic relations with
elements needed to understand them, notably the antecedents of pronouns. The cleaning need not be perfect, since extracts can be corrected in the BDT function. Care must nevertheless be taken that terms are spelled correctly so that they are recognised by the various functions. Four comparable corpora (English, French, European Portuguese and Brazilian Portuguese) were built from the files containing the cleaned documents. These corpora were then exploited using the functions of the Pesquisa and Centro de Conhecimento options.
The various search functions allow the search to be restricted to a corpus selected from a drop-down menu. They work very well whatever the language. We used three of the five search functions: Concordância Frase, Concordância Janela and Concordância KWIC. The first made it possible to find definitions and contexts for acronyms of fewer than four letters, which are not recognised by the functions of the BDT tool. Recognising them would slow the tool down, which is contrary to the interest of most users. These three functions were very useful for identifying variants from the co-text and compound terms built on terminological cores such as protocol, layer, link, channel or logical transport.
The BDT tool extracts term candidates from a selected corpus. During the second phase of the Corpógrafo's development, filters were put in place for English and Portuguese to reduce the noise caused by punctuation, pronouns, prepositions, auxiliaries, determiners, etc. An option lets the user access, with a single click, the context and the references of the source file before selecting the candidate. Selecting it automatically creates the corresponding record, with several data items inserted automatically: the language and the references of the term's source file. Semi-automatic terminology extraction works better in English and Portuguese than in French, but noise remains considerable given the size of the corpora. However, the list of English terms to include in the sample was drawn up in consultation with Professor Almeida on the basis of his filmed lecture, Schiller's handbook (2003) and his online PPT presentation (2008). In all, 35 of 122 EN terms, 47 of 146 PT terms and 5 of 205 FR terms, i.e. a little over 18% of the terms, were inserted semi-automatically. The other records were created as the need arose.
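Candidate extraction from n-grams with lexical filters can be sketched in a few lines. The sketch is a simplification under stated assumptions: the stop list, the rule that a candidate may not begin or end with a function word, the frequency threshold and the sample sentences are all invented for the example and are not taken from the Corpógrafo.

```python
import re
from collections import Counter

# Hypothetical stop list standing in for the tool's lexical restrictions.
STOPWORDS = {"the", "a", "an", "of", "is", "are", "it", "and", "to", "in", "by"}


def term_candidates(text, max_n=3, min_freq=2):
    """Return n-grams (n <= max_n) seen at least min_freq times, after filtering."""
    counts = Counter()
    # Split on punctuation first so that no n-gram spans a punctuation mark.
    for chunk in re.split(r"[.,;:!?()]", text.lower()):
        tokens = chunk.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Filter: a candidate may not start or end with a function word.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return {gram: count for gram, count in counts.items() if count >= min_freq}


# Made-up sample built around terminological cores named in the text.
sample = (
    "The link manager protocol configures the link. "
    "The link manager protocol is carried by the logical transport."
)
candidates = term_candidates(sample)
```

Even on this tiny sample the output contains fragments such as manager protocol alongside the full term, which illustrates why the candidate list still needs manual selection.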
The Corpógrafo terminology record offers ten main fields: Dados Gerais (General data), Pesquisadores (Researchers), Autores (Authors), Fontes (Sources), Morfologia (Morphology), Definições (Definitions), Contextos (Contexts), Relações Semânticas (Semantic relations), Termos Relacionados (Related terms) and Equivalentes de Tradução (Translation equivalents). We filled in all of them except Pesquisadores, which is needed only when several researchers work on the same BDT.
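The ten fields can be pictured as a simple record type. The sketch below is hypothetical: the attribute names are English glosses chosen for the example, the value types are guesses, and the sub-fields of Dados Gerais follow the characteristics the article lists for that field; none of this reflects the tool's actual storage model.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional, Tuple


@dataclass
class GeneralData:
    """Dados Gerais: the headword and its main characteristics."""
    headword: str
    language: str
    term_type: Optional[str] = None         # acronym, abbreviation, ...
    status: Optional[str] = None            # standardised, accepted, ...
    register: Optional[str] = None          # general, technical, ...
    frequency_of_use: Optional[str] = None
    origin: Optional[str] = None            # loanword, neologism, ...


@dataclass
class TermRecord:
    """One BDT record with the ten main fields; every field but the first is optional."""
    general_data: GeneralData                                            # Dados Gerais
    researchers: List[str] = field(default_factory=list)                 # Pesquisadores
    authors: List[str] = field(default_factory=list)                     # Autores
    sources: List[str] = field(default_factory=list)                     # Fontes
    morphology: Dict[str, str] = field(default_factory=dict)             # Morfologia
    definitions: List[str] = field(default_factory=list)                 # Definições
    contexts: List[str] = field(default_factory=list)                    # Contextos
    semantic_relations: List[Tuple[str, str]] = field(default_factory=list)  # Relações Semânticas
    related_terms: List[str] = field(default_factory=list)               # Termos Relacionados
    translation_equivalents: Dict[str, List[str]] = field(default_factory=dict)  # Equivalentes de Tradução


FIELD_NAMES = [f.name for f in fields(TermRecord)]

# L2CAP is one of the acronyms mentioned in the article.
record = TermRecord(GeneralData(headword="L2CAP", language="EN", term_type="acronym"))
```

Leaving `researchers` empty by default mirrors the single-researcher case described above, where that field is not needed.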
The Dados Gerais field contains the headword and its main characteristics: language, type (acronym, abbreviation, etc.), status (standardised, accepted, etc.), register (general, technical, etc.), frequency of use and origin (loanword, neologism, etc.).
The Autores and Fontes fields identify, on the one hand, the authors of the files from which the headwords, definitions and/or contexts come and, on the other, the public or private bodies to which these authors belong. This information appears automatically if the semi-automatic terminology extraction function has been applied, but it can also be entered manually from drop-down menus listing the authors and bodies recorded in the first stage using the Gestor functions.
The design of the Morfologia field appears to rest on the assumption that most terms belong to the noun class, and only the gender and number of the term can be defined by the terminographer. Yet some technical fields, such as knitting or crochet (see note 10), contain many verbs that are terms, and Bluetooth terminology includes several adjectives and numerous acronyms, which do correspond to nominal entities but some of which combine letters with digits and sometimes even punctuation (L2CAP, IEEE 802.15). Complex terms are segmented automatically (but not acronyms), and the part of speech of each component can be selected from a drop-down menu offering the options NC (common noun), NP (proper noun), AJ (adjective), VB (verb), PP (preposition) and AD (adverb). This closed system limits the classification possibilities, and the segmentation of the term is imperfect and cannot be improved. For example, the contraction of preposition and definite article in French and Portuguese, and the elided singular definite article or preposition de before a noun beginning with a vowel in French, are treated as a single word (and therefore a single atom), which causes classification problems. The Latin pronoun hoc in réseau ad hoc, and articles and conjunctions such as the definite articles and the conjunction et in interface entre l'hôte et le contrôleur, cannot be classified. Admittedly, articles and conjunctions are rather rare in terminologies, and this tool was programmed for English and Portuguese. The apostrophe has very little chance of being used in English terminologies and is very rare in Portuguese. The problem arises only for complex terms, which are increasingly represented by acronyms (see note 11). One solution might be to classify them as a whole, as a nominal, verbal, adjectival or adverbial phrase, and morphological features of other word classes should be allowed in this field.
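The segmentation problem with French elision can be made concrete with a small sketch. It contrasts splitting on whitespace, which yields one atom for an elided article plus noun, with a hypothetical alternative that separates the elided form; neither function is the tool's code, and the handling of only l' and d' is an assumption made for brevity.

```python
import re

# Closed tag set offered by the drop-down menu described in the text.
TAGS = {"NC", "NP", "AJ", "VB", "PP", "AD"}

# Elided article or preposition (l', d') followed by the rest of the word.
ELISION = re.compile(r"^([ldLD])['\u2019](.+)$")


def whitespace_atoms(term):
    """Segment a complex term on whitespace only."""
    return term.split()


def split_elisions(term):
    """Segment on whitespace, then separate an elided l' or d' from its noun."""
    atoms = []
    for atom in term.split():
        match = ELISION.match(atom)
        if match:
            atoms.extend([match.group(1) + "'", match.group(2)])
        else:
            atoms.append(atom)
    return atoms


term = "interface entre l'hôte et le contrôleur"
coarse = whitespace_atoms(term)
fine = split_elisions(term)
```

Finer segmentation alone would not solve the classification problem, since the closed tag set has no label for articles or conjunctions.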
10. Our Licence dissertation in Germanic Philology is entitled Deutsch-französische Terminologie des Strickens und Häkelns (Bacquelaine 1980), and most of the terms selected for analysis are verbs with a separable or inseparable particle.
11. Sablayrolles even speaks of "siglomanie" (acronym mania) in neology (2000, p. 263).
The next two fields contain the definition(s) and the context(s) of use, which make it possible to identify the collocations and other phraseological units specific to the term and to the domain. Most of the definitions and contexts were extracted automatically from the corpora prepared for this purpose. A few Portuguese definitions were transcribed from the audiovisual document and a few English ones from Schiller's book (2003). Others were reformulated from several sources.
Given the complexity of semantic relations in this domain, we did not systematically establish the semantic relations between terms in the BDT. We preferred to build three micro-structures from the documentation and the interviews with the expert: Bluetooth ad hoc networks (Bacquelaine 2009, p. 29), the modes, states and addresses of Bluetooth-compatible devices (idem, p. 31) and the Bluetooth core system (idem, p. 77). These three micro-structures give an idea of the relations between most of the conceptual cores designated by the terms in the BDT. By contrast, synonymy relations (Termos relacionados) were established systematically, because they make it possible to determine the number of competitors for the same concept. A few rare cases of antonymy were also recorded.
Finally, the last field contains the translation equivalents, which are selected by language from a drop-down menu. This function has since been modified. The drop-down menu grew longer as the BDT was enriched, and it now contains only the upper-case or lower-case initials of the terms recorded in the BDT. This change has the advantage of shortening the drop-down menu but the drawback of hiding the terms: the terminographer can no longer choose the term from the menu and must know what he or she is looking for in order to find it. The record for the equivalent must therefore have been created beforehand, and our three-stage organisation (English, then French, then Portuguese) proved very practical. It is also possible to move from one record to another through hyperlinks between synonyms, antonyms and translation equivalents. This function, added in 2006 at the users' request, is very useful for checking that no competing term has been forgotten.
Another very useful function of the BDT is the one that provides statistics on each term. It distinguishes not only the number of occurrences in the singular and in the plural, except in French, but also the number of occurrences per file. These data make it possible to compare usage across authors, text types or registers. General statistics by language and by corpus can also be obtained. The latter take into account only terms of more than three letters that were extracted and inserted semi-automatically, so we could not use these results effectively, given the many three-letter acronyms and the many terms (81.6% of the total) inserted manually.
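The kind of per-term statistics described above can be illustrated with a minimal Python sketch. It is not the Corpógrafo's implementation: the function name `term_statistics`, its parameters and the sample file names are hypothetical, and the plural form is passed in explicitly because plural formation is language-specific. The `min_letters` default only mirrors the four-letter minimum mentioned in the text.

```python
import re

def term_statistics(term, plural, files, min_letters=4):
    """Count singular and plural occurrences of a term in each file.

    `files` maps a file name to its text. Terms shorter than
    `min_letters` are skipped (returns None), mirroring the
    four-letter minimum described in the text.
    """
    if len(term) < min_letters:
        return None
    # Whole-word, case-insensitive matching for both forms.
    singular_re = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    plural_re = re.compile(rf"\b{re.escape(plural)}\b", re.IGNORECASE)
    stats = {}
    for name, text in files.items():
        stats[name] = {
            "singular": len(singular_re.findall(text)),
            "plural": len(plural_re.findall(text)),
        }
    return stats

# Hypothetical two-file corpus.
files = {
    "a.txt": "A piconet links devices. Two piconets form a scatternet.",
    "b.txt": "Piconets and piconets; one piconet.",
}
print(term_statistics("piconet", "piconets", files))
```

With such a threshold, a three-letter acronym is excluded from the counts altogether, which is the limitation the author reports.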
One or more media files (images, films or digitised sound recordings) can also be attached to each record, but a few problems still need to be resolved. On the one hand, no field is provided to indicate the presence of these files; on the other hand, they were not exported with the other data in the BDT.
OSLa volume 7(1), 2015
Overall, this terminographic experience was very positive. While the management and search functions perform remarkably well, the functions for semi-automatic extraction of terms and definitions could be improved, particularly for French, although they did make the terminographic task easier. Certain constraints, such as the minimum of four letters per term required to obtain statistical results or the limited options for morphological classification, should be removable or adjustable by the user, as is already the case for semantic relations. Since not all users need all the fields provided in the terminological records, these could be activated according to each user's needs. The FR and PE (European Portuguese) corpora created for this first experiment were reused in the second. This possibility of recycling is another advantage of the Corpógrafo.
[4] Phraseology
The attestations of the three quantifiers, in the feminine and in the masculine, in French and in Portuguese, were extracted using the Concordância Frase function. These raw results were copied and pasted into an Excel spreadsheet, where the relevant segments 12 were selected and then classified according to the nouns on which they operate, with a view to the qualitative and quantitative analysis of these data. Contrary to the expectations of the native Portuguese speakers to whom these results were presented 13, it turned out that, in absolute terms, cada is more frequent than chaque, whatever the co-text or the register (everyday, legal or technical-scientific).
The Concordância Janela function sorts the results alphabetically by co-text, that is, by the words that precede or follow the query expression (punctuation is obviously of no interest). This function thus revealed the particular affinities of each quantifier with certain nouns. For example, the affinity of cada with the noun vez is very strong. It also brought to light a peculiarity of cada, which can operate on a noun quantified by a cardinal numeral greater than one, something that neither chaque nor each can do.
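A window concordance sorted by co-text can be sketched in a few lines of Python. This is only an illustration of the principle, not the Corpógrafo's own code: the function name `concordance_window`, the `width` and `sort_by` parameters and the sample sentence are assumptions made for the example.

```python
import re

def concordance_window(text, query, width=3, sort_by="right"):
    """Return (left, node, right) windows around each hit of `query`,
    sorted alphabetically by the chosen side of the co-text."""
    # Keep word tokens only, so punctuation plays no part in sorting.
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == query.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    if sort_by == "right":
        key = lambda h: h[2]
    else:
        # Sort on the left co-text, read outwards from the node.
        key = lambda h: h[0][::-1]
    return sorted(hits, key=key)

sample = "Cada vez mais aparelhos. Uma piconet de cada vez, e cada aparelho."
for left, node, right in concordance_window(sample, "cada"):
    print(" ".join(left), "[" + node + "]", " ".join(right))
```

Sorting on the right co-text groups together all the hits followed by the same noun, which is how a strong affinity such as cada + vez becomes visible at a glance.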
These findings determined the choice of the subject of our ongoing doctoral research. The particularly high frequency of cada in the Bluetooth corpus is partly explained by the frequency of four semi-fixed sequences whose translations 14 into English and French contain neither each nor chaque and which belong to phraseology in the broad sense. Examples (1) to (4) illustrate these four sequences:
(1)
(2)
(3)
12. The results of this research were presented at the international colloquium Traduction, terminologie et rédaction technique : des ponts entre le français et le portugais in January 2011, and the article Apports de la sémantique et de la syntaxe à la traduction des quantificateurs universels français et portugais was accepted in July 2011 for publication in the proceedings, which are still awaited at the time of writing.
13. This is the semantics group of the CLUP, led by Professor Fátima Oliveira and including, among others, Fátima Silva, Luís Filipe Cunha, António Leal, Purificação Silvano, Idalina Ferreira and Joaquim Barbosa.
14. The translation equivalents were confirmed by a later study of parallel corpora available online.
(4)
In (1), cada vez combines with a comparative (mais, menos, maior, menor, melhor or pior 15) to express progression in one direction or the other. Progression is expressed by other means in English (e.g. more and more, ever more, or lexicalisation of the concept through to increase, increasing or increasingly to express quantitative increase or qualitative intensification) and in French (e.g. de plus en plus, de moins en moins, de mieux en mieux, de mal en pis, or various lexicalisations of the concept such as se multiplier or croissant). We have named the relation expressed in (2) "compte-gouttes" (literally, dropper): here uma corresponds to a cardinal numeral restricting to one the number of piconets in which the Bluetooth device can communicate de cada vez (at a time and à la fois are the most frequent equivalents). Examples (3) and (4) illustrate the last two types of relations quantified by cada in Portuguese. In (3), um em cada três parlamentares trabalhistas (one in (every) three Labour MEPs and un député travailliste sur trois) expresses a proportion between a set and a subset, whereas in (4), um canal LCH por cada 3 tramas (one LHC channel for every three frames and un canal LHC pour trois trames) expresses a proportion between two distinct sets.
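The four semi-fixed sequences described above lend themselves to a rough pattern-based search. The sketch below is purely illustrative and was not part of the study: the labels, the `classify_cada` function and the short numeral list are assumptions, and real corpus data would need far richer patterns (inflected numerals, intervening modifiers, and so on).

```python
import re

# Comparatives listed in the text for sequence (1).
COMPARATIVES = r"(?:mais|menos|maior|menor|melhor|pior)"
# Deliberately small, illustrative set of numerals (an assumption).
NUMERAL = r"(?:\d+|um|uma|dois|duas|três|quatro|cinco)"

PATTERNS = [
    ("progression", re.compile(rf"\bcada vez {COMPARATIVES}\b", re.I)),
    ("compte-gouttes", re.compile(rf"\b{NUMERAL}\b.*?\bde cada vez\b", re.I)),
    ("set/subset proportion", re.compile(rf"\b{NUMERAL} em cada {NUMERAL}\b", re.I)),
    ("set/set proportion",
     re.compile(rf"\b{NUMERAL} \w+(?: \w+)? por cada {NUMERAL}\b", re.I)),
]

def classify_cada(sentence):
    """Return the label of the first matching sequence, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(sentence):
            return label
    return None

print(classify_cada("um em cada três parlamentares trabalhistas"))
print(classify_cada("um canal LCH por cada 3 tramas"))
```

A sentence in which cada simply quantifies a noun, with none of the four sequences, returns None and would be left for ordinary analysis.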
Relations of proportion are the most complex, and the prepositions em or por can give rise to either type of proportion. The point here, however, is not to go into the details of how these four relations quantified by cada are expressed, but rather to demonstrate the performance and usefulness of the Corpógrafo's search functions, which reveal unsuspected aspects of natural languages.
[5] Conclusion
As Gouadec (2003) points out, users prefer terminological repertoires that are as simple as possible and therefore in total contravention of all the rules of terminography (1618 1620 16). As we have said, the possibility of selecting the fields of the terminological records according to each user's needs would improve the presentation of the exported BDT. The handling of attached media should be improved by flagging their presence and including them on export.
Admittedly, ready-made corpora, whether comparable or parallel, are freely available and usable online, but they do not always contain what one needs, particularly where specialised terminology or phraseology is concerned. Admittedly, other tools make it possible to explore the immense corpus of the Web (for example, WebCorp 17 or even Google) or to build corpora from keywords associated with the domain (for example, BootCat 18) 19, but the Corpógrafo is probably the only free tool, accessible online 20 or downloadable, that was specially designed for the automatic processing of Portuguese and that also works for other languages written in Roman characters. It also makes it possible to include files that are not available on the Internet, such as those supplied by the experts at FEUP. Its modular architecture gives it great flexibility and the ability to adapt to the needs expressed by users. This flexibility makes it akin to a laboratory in which leads are followed through to the end or abandoned along the way if the results prove negative. The file and corpus management functions and the search functions are very efficient and require little effort in preparing the raw materials. The possibility of correcting the text of definitions and contexts in the databases, and that of creating semantic relations, are further strengths of the Corpógrafo.
Initially designed as a tool to support research and training in Portuguese linguistics and in terminography for translation, it cannot compete with functions built into translation aids such as Termbase or Multiterm, which are far more practical for professional translators. It nevertheless remains an effective teaching tool for introducing students to the canons of terminography and lexicography. It also proves a faithful and useful ally for any research on one or more corpora. Tailor-made corpora thus represent a long-term investment: the same corpora can be reused, or others created from the same files, for all kinds of language-based research and all kinds of comparisons between languages or registers in the broad sense (spoken or written, everyday or specialised language, text types, etc.), whatever the field of research (Terminology, Translation, Linguistics, NLP, Sociology, etc.). Since the project lost its funding, which is regrettable, the Corpógrafo has received only occasional adjustments, and its smooth running now depends on the goodwill of a few people. Our thanks go to them.
16. This is an audio document.
17. http://www.webcorp.org.uk/live/.
18. http://bootcat.sslmit.unibo.it/.
19. These two competing tools, WebCorp and BootCat, were pointed out to us by Sílvia Araújo, whom we thank for her meticulous revision and judicious advice.
20. http://labclup.letras.up.pt/corpografo/
References
Almeida, Nuno. 2007. A tecnologia Bluetooth.
Bacquelaine, Françoise. 1980. Deutsch-französische Terminologie des Strickens und Häkelns. Université de Liège, unpublished.
Bacquelaine, Françoise. 2006. L'euphémisme, un obstacle à la traduction. Revista da Faculdade de Letras : Línguas e Literaturas XXIII. 463–487.
Bacquelaine, Françoise. 2009. La terminologie Bluetooth en anglais, en français et en portugais. Étude de néonymie comparée. Porto : Faculdade de Letras da Universidade do Porto. MA thesis. Version of September 2008, revised after the defence.
Bacquelaine, Françoise. 2015. La terminologie Bluetooth en anglais, en français et en portugais : Étude de néonymie comparée. Presses académiques francophones.
Branco, António, Amália Mendes, Sílvia Pereira, Paulo Henriques, Thomas Pellegrini, Hugo Meinedo, Isabel Trancoso, Paulo Quaresma, Vera Lúcia Strube de Lima & Fernanda Bacelar. 2012. A língua portuguesa na era digital – The Portuguese Language in the Digital Age. Springer.
Gouadec, Daniel. 2003. Terminologie et traduction. Audio document, talk and discussion: 105- 3230. http://archives.diffusion.ens.fr/diffusion/audio/2003_10_17_terminologie_02.mp3.
Maia, Belinda & Sérgio Matos. 2008. Corpógrafo V4 - Tools for Researchers and Teachers using Comparable Corpora. In Pierre Zweigenbaum, Éric Gaussier & Pascale Fung (eds.), LREC 2008 Workshop on Comparable Corpora (LREC 2008), 79–82. ELRA.
Maia, Belinda & Luís Sarmento. 2003. GC - Integrated Web Environment for Corpus Linguistics. Presentation at Corpus Linguistics 2003 (CL2003). http://www.linguateca.pt/documentos/cl2003.pdf.
Maia, Belinda & Luís Sarmento. 2005. The Corpógrafo - an Experiment in Designing a Research and Study Environment for Comparable Corpora Compilation and Terminology Extraction. In Proceedings of eCoLoRe / MeLLANGE Workshop, Resources and Tools for e-Learning in Translation and Localisation, 45–48.
Maia, Belinda & Luís Sarmento. 2006. Corpógrafo - Applications. In Third International Workshop on Language Resources for Translation Work Research & Training, Satellite event of LREC 2006 (LR4Trans-III), 55–58.
Sablayrolles, Jean-François. 2000. La néologie en français contemporain. Examen du concept et analyse de productions néologiques récentes (Lexica. Mots et Dictionnaire 4). Champion.
Santos, Diana. 2005. Relatório da Linguateca de 15 de Maio de 2004 a 14 de Maio de 2005. Tech. rep. Linguateca. http://www.linguateca.pt/documentos/RelatorioLinguatecaMaio2005.pdf.
Sarmento, Luís & Belinda Maia. 2003. Gestor de Corpora – Um ambiente Web integrado para Linguística baseada em Corpora. In José João Almeida (ed.), Corpora Paralelos, Aplicações e Algoritmos Associados (CP3A), 25–30.
Sarmento, Luís, Belinda Maia & Diana Santos. 2004. The Corpógrafo - a Web-based environment for corpora research. In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa & Raquel Silva (eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC2004), 449–452.
Sarmento, Luís, Belinda Maia, Diana Santos, Ana Pinto & Luís Cabral. 2006. Corpógrafo V3 : From Terminological Aid to Semi-automatic Knowledge Engine. In Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk & Daniel Tapias (eds.), Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC2006), 1502–1505.
Schiller, Jochen. 2003. Mobile Communications. Harlow (GB) : Pearson Education Limited 2nd edn.
Schiller, Jochen. 2008. Wireless LANs. The version consulted on 15/09/2008 is no longer available; an undated revised version is available at https://www.iith.ac.in/~tbr/teaching/docs/wireless_lans.pdf.
Contacts
Françoise Bacquelaine
Faculdade de Letras, Universidade do Porto
franba@letras.up.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 39–56. (ISSN 1890-9639 / ISBN 978-82-9139812-9)
http://www.journals.uio.no/osla
abstract
In this paper, we propose an exploratory study of the usefulness of multilingual corpora in areas related to the study of language, translation and, in particular, simultaneous interpreting. After a brief overview of corpus-based interpreting studies and of some existing electronic interpreting corpora, we describe the compilation stages of a bidirectional multimedia corpus (PT→EN/EN→PT). This is followed by an example of how the corpus can be explored, focusing on anaphoric relations. The aim of this study is twofold: on the one hand, to convey the relevance of this type of resource as a repository of authentic simultaneous interpreting data; and, on the other hand, to demonstrate that analysing it from a linguistic perspective may make it possible to identify sensitive areas in simultaneous interpreting (e.g. anaphora), which may prove an important contribution to interpreter training.
[1] Corpus linguistics applied to interpreting
The term interpreting should be understood here as the oral activity that involves conveying a message from one language into another, in either simultaneous or consecutive mode (Bendazzoli 2010).
As we have seen, corpus linguistics methodology has been used, albeit in a hand-crafted way, to study interpreting. Given the intrinsic nature of interpreting, corpora compiled from interpreted material, whether manual or electronic, are necessarily parallel, that is, they contain at least two language versions of the same text (original and translation). It follows that the parallel corpus is highly useful for the study of translation, an inference that can and should be extended to the context of interpreting (Ginezi 2014).
Since the call launched by Shlesinger (1998), several projects have been devoted to compiling electronic interpreting corpora, among which the pioneering European Parliament Interpreting Corpus (EPIC) stands out. EPIC was undoubtedly a strong catalyst for research in this area and has served as the basis for numerous dissertation projects in Italian academia (Russo 2010). The members of the team behind EPIC have focused on directionality and its impact on interpreters' performance, together with the possible differences arising from interpreting between Romance languages, on the one hand, and between a Romance and a Germanic language, on the other (Monti et al. 2005). Another noteworthy study comes out of the European Parliament Translation and Interpreting Corpus (EPTIC), a project derived from EPIC, which re-examines the concept of lexical simplification as a translation/interpreting universal using quantitative measures supplied by the corpus (Bernardini et al. 2013).
In the wake of EPIC (Bendazzoli & Sandrelli 2005), other interpreting corpora emerged, covering a wider variety of languages, modes and interpreting contexts. Particularly noteworthy is the work carried out by the Hamburger Zentrum
The interpreting corpus that forms the empirical basis of the study of anaphora presented in the next section derives from the Per-Fide corpus (Araújo et al. 2010; Almeida et al. 2014). To extend that corpus, which consists exclusively of parallel (written) texts, the aim is now to add an oral dimension by compiling an interpreting corpus. This corpus will consist of transcripts of the plenary-session speeches of Portuguese and English MEPs. It should be stressed that our transcripts differ from those in the Europarl corpus (Koehn 2005). For the interpreting corpus, we started from the Compte-Rendu in Extenso (CRE), i.e., the verbatim report of the European Parliament's plenary sessions, which gave rise to Europarl, but we sought to bring it closer to what is actually said by the MEPs and interpreters. Although limited to Portuguese and English, this interpreting corpus has the advantage of specifying the source language of each speech.
Besides being a by-product of the Per-Fide project, the corpus presented here is directly influenced by EPIC, the project that inspired it. We now describe the Corpus de Interpretação/Per-Fide, highlighting some of its characteristics and listing the various compilation stages, without going into the more technical details. Before that description, however, figure 1 gives a visual summary of the process, which may help clarify the procedures involved in compiling the corpus.
As shown in figure 1, the creation of the corpus can be broken down into three main stages: pre-processing, processing and post-processing. Before the material to be included in the corpus can be processed, it must be transcribed while still in audiovisual format (pre-processing).
Pre-processing
As mentioned above, the Corpus de Interpretação/Per-Fide is an interpreting corpus made up of transcripts of the speeches delivered by Portuguese and English MEPs and of the corresponding interpretations. The speeches collected cover six months of interventions by Portuguese MEPs, from January to June 2011. To balance the data quantitatively, the interventions by their English counterparts are limited to only three of those six months. The corpus is bidirectional in that it includes original speeches in both languages. It can in fact be divided into two subcorpora to illustrate its bidirectional nature: subcorpus 1) original Portuguese → interpreted English; subcorpus 2) original English → interpreted Portuguese. This creates a crossed structure which, as figure 2 below shows, demonstrates that the corpus is both parallel and comparable (at two levels).
The speeches (i.e., originals and interpretations) in this corpus have specific features that stem from the context in which they occur. The system for allocating speaking time in the European Parliament is quite rigid, so that each MEP has, on average, two minutes for a speech. To make the most of this minimal airtime, the
Processing
As noted, we are currently revising the transcripts already produced by students on the Applied Languages degree. After this stage, a set of technical procedures will follow, aimed at structuring the corpus and making it freely available online. Once the transcripts are complete, the originals will need to be aligned with the corresponding transcripts. This alignment involves sentence-level segmentation, which can be automated thanks to the XML codes generated by the Partitur-Editor. The segmentation will serve as the basis for aligning the bitexts (i.e., original + interpretation) and also for segmenting the corresponding video/audio files, which will then be synchronised with the text. Hence the multimedia character of this corpus, like the Veiga multimedia subtitling corpus (Dios & Guinovart 2012), developed at the Linguistics Centre of the University of Vigo. The corpus will thus allow users to search aligned bitexts with access to the corresponding audiovisual material. These two features represent a step forward from EPIC, although EPIC already lists the inclusion of aligned bitexts and audiovisual material as future work. It is important to note that the bitexts will be aligned and part-of-speech tagged, which increases the corpus's potential for research purposes. Both tasks will be carried out semi-automatically, combining alignment software (Simões & Almeida 2007) and tagging software (Schmid 1994; Brants 2000), whose output will be manually revised.
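The semi-automatic workflow described above (automatic segmentation and pairing, followed by manual revision) can be illustrated with a deliberately naive Python sketch. It does not reproduce the alignment software cited in the text, nor does it read the Partitur-Editor's XML: it splits plain text on sentence-final punctuation, pairs sentences by position and flags doubtful pairs. The function names, the `max_ratio` threshold and the sample sentences are all assumptions made for the example.

```python
import re

def split_sentences(text):
    """Split on sentence-final punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def pair_and_flag(original, interpretation, max_ratio=2.0):
    """Pair sentences by position and flag suspicious pairs.

    A pair is flagged for manual revision when one side is missing or
    when the longer sentence exceeds `max_ratio` times the shorter one
    (lengths measured in characters).
    """
    src = split_sentences(original)
    tgt = split_sentences(interpretation)
    pairs = []
    for i in range(max(len(src), len(tgt))):
        s = src[i] if i < len(src) else ""
        t = tgt[i] if i < len(tgt) else ""
        shorter, longer = sorted((len(s), len(t)))
        needs_review = shorter == 0 or longer / shorter > max_ratio
        pairs.append({"source": s, "target": t, "review": needs_review})
    return pairs

pt = "Este é um relatório ambicioso. Felicito o relator."
en = "This is an ambitious report. I congratulate the rapporteur. Thank you."
for pair in pair_and_flag(pt, en):
    print(pair)
```

Interpreters frequently split, merge or omit sentences, so positional pairing of this kind will drift quickly on real data; the flag simply marks where a human reviser would need to step in.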
Post-processing
The final stage concerns making the corpus available and involves building a dedicated search interface, which will in due course be accessible through the Per-Fide corpus website.
At present, the corpus is not yet available for consultation. It was nevertheless possible to analyse a subset of already transcribed speeches that will be part of the corpus. This analysis showed that anaphora is an important linguistic phenomenon in interpreting, since it can affect the cohesion and coherence of the discourse produced by interpreters.
[4] Anaphora in simultaneous interpreting
[4.1]
(1a)
Muito obrigado, Senhor Presidente, caros Colegas. Quero felicitar os relatores pelo excelente trabalho que efectuou e que se traduz num relatório que permite dar um bom início à construção do próximo quadro financeiro plurianual e [que] constitui um desafio para a Comissão e para o Conselho. Este é um relatório ambicioso e em simultâneo um relatório realista.
(1b)
Yes, colleagues, I would like to congratulate the rapporteur for his excellent work because he permits us to really do things in the right way. A good start, financially speaking, within the framework of the MFF and of course what we have here is a very ambitious report, but a very realistic one as well.
In this interpreted speech, the overall meaning of the original is preserved, even though the most salient reference chain ((the rapporteur) → his (excellent work) → he (permits)) now centres on the noun phrase the rapporteur and no longer on the antecedent terms set out in (1a). In other words, the interpreted speech no longer gives primacy to the product (i.e., the report) but opens the reference chain running through excerpt (1b) with the author of that product (i.e., the rapporteur). This is in fact a rendering of the information by metonymic modulation (Chuquet & Paillard 1987, p. 31), which consists in favouring the cause (rapporteur) for the effect (report), and not the reverse.
Anaphoric reference is undoubtedly a basic condition for constructing any communicative act. It contributes to textual organisation in that it ensures thematic progression. It should be noted, however, that excessive use of anaphoric elements can be an obstacle to clarity. Portuguese MEPs do tend to lengthen their sentences through recurrent subordination mechanisms (using the relative pronoun que), as illustrated in example (2a):
(2a)
O longo processo de trabalho que este importante relatório exigiu, incluindo os muitos compromissos alcançados, tornou-o num documento bastante amplo e equilibrado dos diversos interesses que a PAC tem de dar resposta. Este relatório constitui uma boa orientação para as propostas legislativas, pelo que felicito o seu relator.
(3b)
The AU, the African Union, could do far more. We have heard many platitudes from the AU but we've seen little concrete action so far.
(4b)
Desde a sua adesão à União, em 2007, que quer a Bulgária, quer a Roménia tinham a expectativa legítima dos seus cidadãos se tornarem cidadãos comunitários de pleno direito e poderem usufruir dos mesmos direitos de todos os outros cidadãos comunitários, onde se inclui a liberdade de circulação no interior do Espaço Schengen. É, pois, a cidadania europeia que reforçamos ao alargar o Espaço Schengen. Sexta e última nota, Senhor Presidente: [eles] trabalharam bem. É evidente que ambos os países estão de parabéns pelo esforço que realizaram para cumprir todos os requisitos de Schengen.
The referential interpretation of this pronoun [eles], which is not overtly expressed, depends exclusively on its anaphoric relation with the NP (quer a Bulgária, quer a Roménia) in the preceding linguistic context. In the interpreted speech, this thematic progression is broken (antecedent: Bulgária e Roménia → anaphor: [pro=eles]), since the interpreter opts for a second-person personal pronoun (you've done a good job) where the third-person pronoun (they've done) would be expected:
Since their accession (to the) European Union, Bulgaria and Romania
have legitimately wanted their citizens to become European citizens, with
full rights, and that of course includes the right to move freely within the
Schengen area. So European citizenship is what we are about strengthening here by enlarging the/ this Schengen area. Sixth point: you've done
a good job. It's clearly that/ it's clear that these countries should be congratulated for all the efforts they've made to comply with the Schengen
requirements, […]
So I think that all of the institutions have learned their lesson from the
major financial crisis that we've experienced. We are a united Europe, a
Europe of solidarity and we are trying to converge, to converge our policies.
The restrictive negation marker that appears, in (6a), in the last element of the reference chain (só uma Europa unida → estará à altura) seems to us important for reinforcing the idea that the inference presented does in fact sum up the main lesson to be retained and applied in order to achieve a better Europe. By removing this marker from the speech and opting for a verb form in the simple present (are), the interpreter does not convey exactly the same meaning as the original and leads us to believe that Europe has already learned its lesson. Here it is clear that breaking the original anaphoric chain has semantic repercussions in the target speech.
The shift in meaning caused by this break could have been mitigated if the verb form are (We are a united Europe, clause 1) had been preceded, for example, by a verb such as need (We need to be a united Europe), which marks a necessity implicitly expressed in the original utterance: for Europe to live up to its ideals, we need to make it more cohesive. Imposed by the current crisis, this need for (greater) economic coordination ought to prompt a new way of acting on the part of European actors. The verb try in the progressive form (we are trying to converge, to converge our policies, clause 2) undoubtedly denotes the effort that has been made in that direction. However, because of the multiple values the conjunction and can take on in context, joining the two clauses results in a semantic ambiguity between a purpose reading (we are a united Europe and, to that end, we are trying to act in a more coordinated way) and a causal reading (since we are a united Europe, we are trying to act in a more coordinated way). This operation loses the conditional value expressed in the original (Europe will only live up to its ideals if it is united), whose realisation lies in the future, unlike the semantic readings described above, both of which are anchored in the present (are/are trying).
Everyone can now see that a default in Greece is coming, except the euro
zone finance ministers who, 13 months after uselessly committing 110 billion euros, now seem set to commit a further sum almost as large.
(7b)
Acknowledgements
This work was supported by research grant SFRH / BD / 88142 / 2012, funded by the Fundação para a Ciência e Tecnologia under the Programa Operacional Potencial Humano of the Quadro de Referência Estratégico Nacional (Formação Avançada), co-financed by the European Social Fund and by national funds from the Ministério da Educação e Ciência.
References
Almeida, José João, Sílvia Araújo, Nuno Carvalho, Idalete Dias, Ana Oliveira, André Santos & Alberto Simões. 2014. The Per-Fide corpus: a new resource for corpus-based terminology, contrastive linguistics and translation studies. In Tony Berber Sardinha & Telma São Bento Ferreira (eds.), Working with Portuguese Corpora, 177–200. Bloomsbury Academic.
Araújo, Sílvia, José João Almeida, Alberto Simões & Idalete Dias. 2010. Apresentação do projecto Per-Fide: Paralelizando o Português com seis outras línguas. Linguamática 2(2). 71–74.
Bendazzoli, Claudio. 2010. Il corpus DIRSI: creazione e sviluppo di un corpus elettronico per lo studio della direzionalità in interpretazione simultanea: Alma Mater Studiorum Università di Bologna. PhD thesis.
Bendazzoli, Claudio & Annalisa Sandrelli. 2005. An approach to corpus-based interpreting studies: Developing EPIC (European Parliament Interpreting Corpus). In Heidrun Gerzymisch-Arbogast & Sandra Nauert (eds.), MuTra Challenges of Multidimensional Translation: Conference proceedings, 1–12.
Bernardini, Silvia, Adriano Ferraresi & Maja Miličević. 2013. From EPIC to EPTIC: building and using an intermodal corpus of translated and interpreted texts. Presentation at the 46th Annual Meeting of the Societas Linguistica Europaea (SLE 2013).
Brants, Thorsten. 2000. TnT – a statistical part-of-speech tagger. In 6th Applied NLP Conference, ANLP-2000, 224–231.
Bührig, Kristin, Ortrun Kliche, Birte Pawlak & Bernd Meyer. 2012. The corpus Interpreting in Hospitals: Possible applications for research and communication training. In Thomas Schmidt & Kai Wörner (eds.), Multilingual Corpora and Multilingual Corpus Analysis. Hamburg Studies in Multilingualism (14), 305–315. John Benjamins.
Campos, Maria Henriqueta Costa & Maria Francisca Xavier. 1991. Sintaxe e Semântica do Português. Universidade Aberta.
OSLa volume 7(1), 2015
contacts
Sílvia Araújo
Instituto de Letras e Ciências Humanas, Universidade do Minho
saraujo@ilch.uminho.pt
Ana Correia
Instituto de Letras e Ciências Humanas, Universidade do Minho
ana.moutinho@ilch.uminho.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 57-77. (ISSN 1890-9639 / ISBN 978-82-9139812-9)
http://www.journals.uio.no/osla
abstract
This paper studies the field of admiração in its two meanings in Portuguese, namely veneration/respect and surprise, using the framework suggested in Belinda Maia's PhD thesis (Maia 1994/1996). After briefly presenting her findings and methodology, we investigate (i) the distribution of the vague words of the field of admiração across the two meanings, and discuss the heuristics used in their rule-based distinction; (ii) the distribution of admiração by genre, tense and person; (iii) its presence in negative sentences; and (iv) its antonym(s).
This article studies admiração (in its two senses, astonishment/surprise and veneration/respect) in Portuguese, using the theoretical framework proposed by Belinda Maia (Maia 1994/1996) in her doctoral thesis, and is thus intended to be
In her 1994 thesis (note that, in accordance with the author's wishes, we take the revised 1996 version as the authoritative one), Belinda Maia devoted herself to the linguistic study of emotion in her two languages. For this she used a comparable corpus of literary texts (collected, digitized and analysed by her) of 778,500 words in English and 819,500 in Portuguese, which yielded about 25 thousand examples of emotion (adding the two languages together). After reviewing the extensive literature on emotions in language, she opts for a linguistic approach to their categorization, that is, one that starts not from psychological or philosophical postulates but from the way the two languages work. She is inspired above all by Ortony et al. (1988), whom she takes as her starting point:
I propose to adopt Ortony et al's (1988) classification of emotion groups, and add to it when necessary. (Maia 1994/1996, section 4.5)
One of the reasons given is that their theory is not tied to English, since it distinguishes between the emotion situation and the emotion word, and therefore offers more possibilities for a bilingual analysis.
Very briefly, Belinda Maia uses the following working tools in mapping emotions in Portuguese and in English:
(i) she assumes that an emotion always involves the senser (sentidor) and the phenomenon (which inspires the emotion);
(ii) regarding the senser, she finds it worthwhile to distinguish between the expression of one's own emotions and of those of others;
(iii) regarding the phenomenon or stimulus, she proposes eleven different types of stimuli: the first is unspecified, the next four are associated with the senser, the following four are associated with the other in the dialogue sense, and the last two refer either to an object (hence non-human) or to a proposition.
Belinda Maia studied 17 emotions, conceptually divided into four groups, namely anger (3), appreciation (3), disappointment (2), dislike (4), distress (1), fear (2), gratitude (3), hope (2), joy (1), liking (4), pride (3), relief (2), reproach (3), resentment (1), satisfaction (2), self-reproach (3), sorry for (1). The groups, marked above simply by their number, are described as follows: 1) reactions to events (where, interestingly, according to her, joy at the happiness or unhappiness of others[1] does not exist in the two languages studied); 2) reactions to projected events (where once again the logically possible group of confirmed fears has no lexical expression in either language); 3) reactions to agents; 4) reactions to objects, about which Belinda Maia comments that love and hate, archetypal emotions for a layperson, are not considered basic emotions by several theorists[2]. It is worth noting that the fields of ingratitude and courage, which in our opinion are possible ones, are not considered emotions, whereas saudade is a subtype of distress and shame is included in the self-reproach/remorse group.
In addition, she also mentions and studies, while indicating that they fall outside the scheme of Ortony et al. (1988), the following possible emotions: surprise, desire, and generic emotion (comprising words such as sentir, emocionar, emoção, sentimento).
To give a brief idea of the kind of data provided in Belinda Maia's thesis, we present here her analysis of surprise, with the published tables on the distribution of lexemes. In English: surprise (48.5%), amaze
[1] This does not mean that it is impossible to feel these emotions without words dedicated to them. In fact, Belinda Maia expresses her joy precisely at the fact that Portuguese has no lexical equivalent of the word gloat, which was in any case also very rare in her English corpus. But happy for is merely a variant of happy, and therefore does not require a separate field.
[2] Although English pays grammatical attention to the distinction between animate and inanimate, Portuguese does not, so it might make some sense to merge groups 3 and 4 into a single group, say we, who are speakers of Portuguese.
English      66.7   33.3   22.3
Portuguese   62.5   37.5   22.1
Another issue mentioned, and annotated, by Belinda Maia was that of deliberate (deliberadamente), when a given phenomenon is intended to provoke the emotion.
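English (ten non-empty rows in source order; the stimulus-type label of each row is not recoverable)
S focus   P focus   total      %
      3         -       3    0.8
      5         -       5    1.4
     37         7      44   12.1
      7         3      10    2.8
      5         3       8    2.2
     11         6      17    4.7
      8         7      15    4.1
    135        43     178   49
     25        42      67   18.5
      6        10      16    4.4
    242       121     363   Total

Portuguese, by stimulus type
Type    S focus   P focus   total      %
1            19         7      26    5.6
2             1         3       4    0.9
3             4         -       4    0.9
4            12         7      19    4.1
5             5         4       9    2
6             9        27      36    7.8
7             7         8      15    3.3
8             6         3       9    2
9           165        53     218   47.3
10           49        56     105   22.8
11           11         5      16    3.5
Total       288       173     461

Senser-oriented forms (non-empty rows in source order)
Total EN: 4 (1.1%), 16 (4.4%), 133 (36.6%), 1 (0.3%), 88 (24.2%)
Total PT: 8 (1.7%), 6 (1.3%), 15 (3.3%), 107 (23.2%), 110 (23.9%), 42 (9.1%)

Phenomenon-oriented forms (row labels: P-adj-att, P-adj-pr, P-pp-att, P-pp-pr, P-adv, P-n, P-v, P-v-se; non-empty rows in source order)
Total EN: 28 (7.7%), 26 (7.1%), 21 (5.9%), 16 (4.4%), 30 (8.3%)
Total PT: 8 (1.7%), 5 (1.1%), 3 (0.7%), 69 (15%), 59 (12.8%), 29 (6.3%)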
This can be appreciated more easily if we disregard the difference between adjective and past participle, which is equally vague in the two languages (Santos 1998), and add the S-adj and S-pp rows: 153 cases in English (En) and 136 in Portuguese (Pt).
also, that much can be put to use in repeating and examining in greater detail, on the Portuguese side, which is the avowed aim of the present article. But we think it is obvious to any reader that there are many other treasures in her thesis and, we stress, reasons to read it, both for the clarity of its discussion and for its very interesting opinions on Portuguese grammar itself.
[3] admiração in the AC/DC corpora
The first thing we would like to do is a comparison with the Portuguese data compiled in Belinda Maia's thesis, to see whether more data yield more knowledge or whether her sample was already sufficiently representative.
In principle, we can study either literary texts alone or all the language we have access to; see Santos (2014a) for a brief description of the framework.
Some problems arise, however. First, we consider that the data mentioned above refer only to surprise (since they were all subjected to rigorous scrutiny by Belinda Maia), whereas, for the time being, part of what is marked as surprise may actually refer to veneration or respect, since the distinction between the two has not yet been fully operationalized and revised; see section [4].
The notion of veneration may therefore, if it has a very different syntactic profile and is sufficiently frequent, confound the figures for the syntactic distribution of surprise that we present in table 5. As this disambiguation is incorporated into the AC/DC, we will be able to obtain more reliable values.
The other difficulty, which is harder to get around, is that hundreds of millions of words do not allow such a detailed analysis in terms of, for example, orientation towards the senser and towards the phenomenon. The comparison therefore has to be made with coarser categories, in particular by merging tables 3 and 4 by grammatical category only, in the Maia column.
Category   Maia      %   Literary      %      All      %
V           130   28.2       4490   30.0    60925   28.6
N           179   38.8       3845   25.6    55047   25.9
ADJ          27    5.8       4462   29.8    79349   37.3
PP          122   26.5       1033    6.9     6725    3.2
ADV           3   0.65       1160    7.7    10723    5.0
total       461             14990          212769
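The percentages in table 5 can be recomputed from the raw counts. The following is a minimal Python sketch of that arithmetic; the dictionary keys "Literary" and "All" are our English labels for the literary corpora and for all corpora, and the function name is an assumption made for illustration.

```python
# Counts per grammatical category, copied from table 5.
counts = {
    "Maia":     {"V": 130, "N": 179, "ADJ": 27, "PP": 122, "ADV": 3},
    "Literary": {"V": 4490, "N": 3845, "ADJ": 4462, "PP": 1033, "ADV": 1160},
    "All":      {"V": 60925, "N": 55047, "ADJ": 79349, "PP": 6725, "ADV": 10723},
}


def shares(row):
    """Return the percentage of each category within one data source."""
    total = sum(row.values())
    return {category: 100 * n / total for category, n in row.items()}


percentages = {source: shares(row) for source, row in counts.items()}

for source, row in percentages.items():
    print(source, {category: round(p, 1) for category, p in row.items()})
```

Laid out this way, the contrast discussed in the text is easy to see: nouns and past participles weigh more in Maia's sample, while adjectives weigh more in the larger corpora.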
Note that we can look either at the tense of the emotion verbs or at the tense of the sentences in which an emotion word occurs, and that these two kinds of figures will, or may, be completely different. Here we chose the one that is easier to investigate, namely the verb forms, although this may introduce some bias, since it does not cover the semantic field as a whole.
In figure 2, we compare all verbal emotions with admiração (on a logarithmic scale). The most interesting cases are, naturally, those in which the distribution is not similar. From inspection of the figure, the most striking differences appear to be more cases of the perfect and fewer cases of the imperfect than the average for the emotions, and significantly more cases of the passive in the infinitive.
MP 1a        44867    10740    71186    21   17    26
MP 3a       121171    23713   185539     3    6    31
CONDIV 1a    18880    19514    69480    27   27    60
CONDIV 3a   336832   104791   610251   367   63   642
[7] To make our results easier to reproduce, we give the query used to extract the data:
[pos="V.*"& pessnum="1S.*"& sema=".*emomin:(surpresa|admirar).*"]
The data in the tables that follow refer to the September 2014 version of the Museu da Pessoa. When revising the article in January 2015, we realized that the (automatic) quantitative picture had changed radically, because we had added, perhaps temporarily, the concept of respect (and hence all the words associated with it) to the notion of admiração. But since this question is not relevant to the distinction between the two senses of admiração (which is our focus in the present article) and since, to have reliable values, we would still have to carry out a new disambiguation between respect-veneration and respect-fear, we decided not to incorporate the new values into the article.
The attentive observer may notice that in the first person there are more cases in total than the sum of S and P. This is due to the cases marked by PALAVRAS as 1/3S, which were arbitrarily counted in this table as first person. Likewise, the excess cases in the third person refer to the use of the impersonal infinitive, which is marked with neither S nor P.
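For readers who want to vary the query given in the footnote above, the following Python sketch assembles the same query string from its parts. The attribute names (pos, pessnum, sema) and the value "1S" come from the query itself; the function name is hypothetical, and "3S" as a value for the third person singular is an assumption that should be checked against the corpus documentation.

```python
def emotion_verb_query(person_number, concepts):
    """Build a token query for verbs of a given person/number that are
    annotated with any of the given emotion concepts."""
    concept_alternatives = "|".join(concepts)
    return (
        f'[pos="V.*"& pessnum="{person_number}.*"'
        f'& sema=".*emomin:({concept_alternatives}).*"]'
    )


# Reproduces the query from the footnote.
first_singular = emotion_verb_query("1S", ["surpresa", "admirar"])

# "3S" is an assumed value, used here only for illustration.
third_singular = emotion_verb_query("3S", ["surpresa", "admirar"])

print(first_singular)
print(third_singular)
```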
The word desrespeito denotes lack of respect, not as an attitude or emotion but as an action. One cannot say ele tinha desrespeito por ela or senti uma grande desrespeito, but only Isso/essa ação foi um grande desrespeito.
             medo    coragem   admiração
negated      1646       2033         633
total     164,046    194,959      71,674
%             1.0       1.04        0.88
Table 7: The proportion of (verbal) emotions, negated or not, in all the corpora taken together
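The percentages in table 7 follow directly from the two rows of counts. A minimal Python sketch of the calculation (the variable names are ours):

```python
# Negated occurrences and total occurrences per emotion, copied from table 7.
negated = {"medo": 1646, "coragem": 2033, "admiração": 633}
total = {"medo": 164046, "coragem": 194959, "admiração": 71674}

# Share of negated occurrences, in percent.
negation_rate = {
    emotion: 100 * negated[emotion] / total[emotion] for emotion in negated
}

for emotion, rate in negation_rate.items():
    print(f"{emotion}: {rate:.2f}%")
```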
lexically well separated, and which, for ease of understanding, we will from now on always refer to in the present article as surpresa (surprise) and veneração (veneration).
We begin by trying to distinguish between the two kinds of admiração in the case of the base admirar. Although it would be extremely interesting to know the history of this word or word family, we will not consider it here.
Used reflexively, admirar-se almost always means surprise (although a person can admirar-se in the mirror, with respect to their appearance). Used transitively, admirar denotes an attitude (of admiration) towards a person, work or action, when the subject is human. When the subject is an action or situation and the object is human, we are again dealing with surprise, cf. O comportamento dele admirou-a. In the passive[9], the auxiliaries estar and ficar are associated with astonishment, while ser indicates the mental attitude. In other cases, more than syntactic structure is needed to distinguish between the two senses of admirar, as in participial clauses (with no overt auxiliary) (exs. 1-2) or in the nominalization itself, admiração (exs. 3-5), although in some cases, as in ter/inspirar admiração, the support verb makes disambiguation easy. Note also that the prepositions por and de, associated with the verb and the noun respectively, are ambiguous, as the examples once again illustrate.
(1) D. Ana Perpétua ficou fascinada pelo espírito fulgurante do poeta, admirado por todos, e admitido na intimidade da família na quinta de Arroios, em Colares.
(2) Langdon parou, admirado por ela conhecer a obscura publicação sobre os movimentos dos planetas e seu efeito sobre as marés.
(3) Grande, porém, senão dolorosa, foi a admiração de Salazar, quando, anos depois, lendo o primeiro tomo da edição das obras de Camões, (...)
(4) O relativamente obscuro general Suharto tem gozado desde então da admiração de Washington.
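The syntactic cues described above (reflexive use, passive auxiliary, humanness of subject and object) can be sketched as a small rule cascade. This is only an illustration of the stated heuristics in Python, not the rules actually implemented in the AC/DC; the function name, the argument names and the order of the checks are assumptions.

```python
from typing import Optional


def classify_admirar(
    reflexive: bool = False,
    passive_auxiliary: Optional[str] = None,
    subject_is_human: Optional[bool] = None,
    object_is_human: Optional[bool] = None,
) -> Optional[str]:
    """Return 'surpresa', 'veneração', or None when syntax alone cannot decide."""
    # admirar-se almost always expresses surprise.
    if reflexive:
        return "surpresa"
    # Passive with estar/ficar is associated with astonishment,
    # passive with ser with the mental attitude.
    if passive_auxiliary in ("estar", "ficar"):
        return "surpresa"
    if passive_auxiliary == "ser":
        return "veneração"
    # Transitive use with a human subject: attitude of admiration.
    if subject_is_human is True:
        return "veneração"
    # Action or situation as subject and human object: surprise.
    if subject_is_human is False and object_is_human is True:
        return "surpresa"
    # Participial clauses without an auxiliary and nominalizations stay open.
    return None


print(classify_admirar(reflexive=True))
print(classify_admirar(passive_auxiliary="ser"))
print(classify_admirar())
```

Cases that return None correspond to those the text says need more than syntactic structure, such as examples 1 to 4.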
[9] We are of course well aware that passive is, once again, not a consensual term, but we refer the reader to Santos (2014c) for a description of the various choices and possible alternatives, and of the one chosen in the present framework.
emomin+emomax
186385+21053
39571
+medo
4938
+tristeza
362
Lemas dif.
175 (98)
14
Merging differences between variants such as estupefacto and estupefato or atónito and atônito, and removing many obvious cases of error in the original text (such as imprevisibildade or subito) or of parser error (for example, espasmo as derived from pasmo, or surpreendedora as a noun instead of the feminine form of the adjective surpreendedor).
Inicial
17
148
Final
46
118
Dvidas
2 maravilhar
2 maravilhar
http://www.linguateca.pt/Reve/
(15) O meu amigo dividiu a dor com o público; e, se enterrou a mulher sem aparato, não deixou de lhe mandar esculpir na Itália um magnífico mausoléu, que esta cidade admirou exposto, na Rua do Ouvidor, durante perto de um mês.
Indeed, while in 13 only the pleasure of seeing or gazing can be understood, 15 can also easily be interpreted as admiration.
[5] the opposite of admirar
Another interesting observation concerns the opposite of admirar in the sense of venerate: is it possible to choose between desprezar (despise) and invejar (envy)? In invejar, the notion that the object is good is retained, but the feeling it inspires is bad, whereas in desprezar it is simply the (opposite) attitude that is mentioned.
To try to answer this question empirically, relying on the hypothesis that antonymy is also a textual property, as argued by Justeson & Katz (1991, 1992), we measured the co-occurrence of these concepts (lemmas) in all the corpora, and obtained 72 cases of co-occurrence of desprezar and admirar, and 148 of invejar and admirar. Although the balance therefore tips towards envy, we observed that envy is sometimes mentioned as a type of admiration and not as its opposite, cf. examples 16 and 17.
(16) (...) ela via com santa inveja e admiração as sobre-humanas forças que imaginava no frade (...)
(17) Entusiástico fã de Mário Soares, José Aparecido de Oliveira revela que o que mais admira, inveja mesmo, no nosso Presidente não é cultura, não é a comunicabilidade, não é a inteligência, não é a glória.
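The co-occurrence measurement described above can be illustrated with a short Python sketch. The text does not say which unit of co-occurrence was used, so the sentence is assumed here; the function name and the toy lemmatized sentences are invented for illustration and are not corpus data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences, targets):
    """Count the sentences in which each pair of target lemmas occurs together."""
    pair_counts = Counter()
    for lemmas in sentences:
        present = sorted(set(lemmas) & set(targets))
        for pair in combinations(present, 2):
            pair_counts[pair] += 1
    return pair_counts


# Toy lemmatized sentences (invented, not taken from the corpora).
toy_sentences = [
    ["ele", "admirar", "e", "invejar", "o", "rival"],
    ["ela", "desprezar", "quem", "admirar", "ele"],
    ["todos", "admirar", "o", "poeta"],
]

pairs = cooccurrence_counts(toy_sentences, {"admirar", "invejar", "desprezar"})
print(pairs[("admirar", "invejar")], pairs[("admirar", "desprezar")])
```

As examples 16 and 17 show, a raw count of this kind cannot tell whether the two lemmas are being contrasted or equated, so it has to be complemented by reading the contexts.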
[6] final remarks
http://alfarrabio.di.uminho.pt/vercial/
acknowledgments
We thank Cláudia Freitas for annotating the cases mentioned in the article, and Eugénio Oliveira and Maria José Finatto for their pertinent comments.
appendix
List, in decreasing order of frequency, of the lemmas considered to have the sense of admiration/respect/veneration:
fã, admirável, admirador, reverência, venerar, deslumbrar, veneração, deslumbramento, admirar, venerável, admiração, reverente, endeusar, admiradora, reverentemente, admiravelmente.
references
Baayen, R. Harald & Antoinette Renouf. 1996. Chronicling the Times: Productive Lexical Innovations in an English Newspaper. Language 72(1). 69-96.
Freitas, Cláudia, Eduardo Motta, Ruy Luiz Milidiú & Juliana César. 2014. Sparkling Vampire... lol! Annotating Opinions in a Book Review Corpus. In Sandra Aluísio & Stella E. O. Tagnin (eds.), New Language Technologies and Linguistic Research: A Two-Way Road, 128-146. Cambridge Scholars Publishing.
Freitas, Cláudia & Diana Santos. 2015. Blogs, Amazônia e a Floresta Sintá(c)tica: um corpus de um novo gênero? In Simone Sarmento, Tony Berber Sardinha, Livia Pretto Mottin & Ana Maria T. Ibaños (eds.), Pesquisas e perspetivas em linguística de corpus, 123-150. Mercado de Letras.
Justeson, John S. & Slava M. Katz. 1991. Co-occurrences of Antonymous Adjectives and Their Contexts. Computational Linguistics 17(1). 1-19.
Justeson, John S. & Slava M. Katz. 1992. Redefining Antonymy: The Textual Structure of a Semantic Relation. Literary and Linguistic Computing 7(3). 176-184.
Maia, Belinda. 1994/1996. A Contribution to the Study of the Language of Emotion in English and Portuguese: FLUP. PhD thesis. Revised version: 1996.
Maia, Belinda & Diana Santos. 2012. Who is afraid of ... what? - In English and in Portuguese. In Signe Oksefjell Ebeling, Jarle Ebeling & Hilde Hasselgård (eds.), Aspects of corpus linguistics: compilation, annotation, analysis 12, n.p.
Ortony, Andrew, Gerald L. Clore & Allan Collins. 1988. The Cognitive Structure of Emotions. Cambridge University Press.
Santos, Diana. 1998. A relevância da vagueza para a tradução, ilustrada com exemplos de inglês para português / The relevance of vagueness for translation: Examples from English to Portuguese. TradTerm 5. 41-70, 71-78.
Santos, Diana. 2014a. First steps of Gramateca: a corpus-based grammar initiative for Portuguese, driven by Linguateca. Presentation at the University of Oslo. http://www.linguateca.pt/Diana/download/GramatecaOslo.pdf.
Santos, Diana. 2014b. Gramateca: corpus-based grammar of Portuguese. In Jorge Baptista, Nuno Mamede, Sara Candeias, Ivandré Paraboni, Thiago A.S. Pardo & Maria das Graças Volpe Nunes (eds.), International Conference on Computational Processing of Portuguese (PROPOR2014), 214-219. Springer.
Santos, Diana. 2014c. Podemos contar com as contas? In Sandra Aluísio & Stella Tagnin (eds.), New language technologies and linguistic research: a two-way road, 194-213. Cambridge Scholars Publishing.
Santos, Diana. 2015. Comparando corpos orais (transcritos) e escritos no âmbito da Gramateca. In Proceedings from the conference Parler les langues romanes/Parlare le lingue romanze/Hablar las lenguas romances/Falando línguas românicas (The ninth GSCP International Conference), University Press Università di Napoli L'Orientale.
Santos, Diana, Rui Pedro Ribeiro Marques, Cláudia Freitas, Cristina Mota & Alberto Simões. 2015. Comparando anotações na Gramateca, Atas do ELC2014 (preliminary title). In preparation.
Santos, Diana & Cristina Mota. 2010. Experiments in human-computer cooperation for the semantic annotation of Portuguese corpora. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner & Daniel Tapias (eds.), Proceedings of the International Conference on Language Resources and Evaluation (LREC 2010), 1437-1444. ELRA.
Santos, Diana & Cristina Mota. 2015. Emotions in natural language: a broad-coverage perspective. Under review.
Silva, Augusto Soares da. 2008. O corpus CONDIV e o estudo da convergência e divergência entre variedades do português. In Luís Costa, Diana Santos & Nuno Cardoso (eds.), Perspectivas sobre a Linguateca / Actas do encontro Linguateca : 10 anos, 25-28. Linguateca.
Zampieri, Marcos & Martin Becker. 2013. Colonia: Corpus of historical portuguese. In Marcos Zampieri & Sascha Diwersy (eds.), Non-standard data sources in corpus-based research, vol. 5 ZSM Studien, 77-84. Shaker.
contacts
Diana Santos
Linguateca and University of Oslo
d.s.m.santos@ilos.uio.no
Cristina Mota
Linguateca
cmota@ist.utl.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 79-99. (ISSN 1890-9639 / ISBN 978-82-9139812-9)
http://www.journals.uio.no/osla
abstract
This paper first advocates an onomasiological, concept-based and socio-cognitive approach to lexical borrowing, expanding the current loanword research from lexical items towards concepts. Second, it presents a corpus-based
of loanwords in European Portuguese and Brazilian Portuguese and their
impact on diachronic lexical variation between the two national varieties.
In the first part, the main topics and contributions of the Cognitive Sociolinguistic perspective on borrowability, and concept-based sociolectometrical
methods of measuring variation in the success of loanwords are highlighted. In the second part, English and French loanwords in the field of football and clothing terminologies are analyzed through possible receptor Portuguese equivalents and advanced corpus-based sociolectometrical measures, such as featural measures (calculating the proportion of terms possessing a special feature) and uniformity measures (calculating onomasiological homogeneity and convergence/divergence between language varieties).
These measures are based on onomasiological profiles, i.e. sets of alternative synonymous terms, together with their frequencies. As a development
of our previous research on lexical convergence and divergence between
European and Brazilian Portuguese (Soares da Silva 2010), the data include
thousands of observations of the usage of alternative terms to refer to 43
football and clothing concepts. Corpus material was extracted from sports
newspapers and fashion magazines from the 1950s, 1970s and 1990s/2000s,
Internet chats related to football, and labels and price tags pictured from
clothes shop windows. Football and clothing concepts confirm the hypothesis that the influence of foreign languages is stronger in the Brazilian variety
than in the European variety. The use of loanwords has contributed towards
onomasiological heterogeneity within and across the two national varieties
in the last 60 years.
Cognitive Sociolinguistics (Kristiansen & Dirven 2008; Croft 2009; Soares da Silva 2009; Geeraerts 2010; Geeraerts et al. 2010; Soares da Silva 2014b) is an emerging extension of Cognitive Linguistics (Geeraerts & Cuyckens 2007) as a meaning-oriented and usage-based model. It aims to investigate the interrelation between the social and the conceptual dimensions of language-internal variation through advanced quantitative and multivariate empirical methods. It represents the convergence of the research interests of Sociolinguistics and Cognitive Linguistics, and contributes both to bringing the social aspects of language into the agenda of Cognitive Linguistics and to incorporating the conceptual aspects of language-internal variation into the agenda of Sociolinguistics. The contribution of Cognitive Sociolinguistics to the study of pluricentric languages is set out in Soares da Silva (2014b).
Three specific contributions of Cognitive Sociolinguistics to sociolinguistic research stand out, all of which show the importance of semantics in variationist studies: (i) the analysis of the variation of meaning, that is, the various ways in which meaning interacts with the other sources of linguistic variation, namely form and context; (ii) the treatment of the methodological problem of semantic equivalence, a prerequisite for the socio-variationist study of the lexicon and of grammar; and (iii) the study of the meaning of variation, or the cognitive representation of language-internal variation, in its components of perception, categorization and attitudinal evaluation of linguistic diversity.
[3] onomasiological and sociolectometric method
In this study we adopt an onomasiological perspective on lexical borrowing, in the sense that we take the concept expressed by the loanword as our starting point. The analysis focuses on onomasiological variation between semantically equivalent words (denotational synonyms), of which the loanword is one. Recall the classic distinction between semasiology and onomasiology, established in the European tradition of lexical semantics (Baldinger 1964). Whereas the semasiological perspective takes the word as its starting point and analyses its various senses or referents, the onomasiological perspective starts from the concept and analyses the different words or other expressions that designate it. Semasiology deals with phenomena such as polysemy (Soares da Silva 2006), while onomasiology studies phenomena such as synonymy and lexicogenetic mechanisms such as word formation or borrowing.
Onomasiological variation may involve conceptual and/or social differences. Lexical choices may thus be determined either by conceptual factors or by dialectal, sociolectal or idiolectal (in a word, lectal) factors. For example, the choice between guarda-redes and goleiro, or between equipa and time, is a choice between forms that express the same concept but belong to different national varieties; and the choice between morrer and falecer is a choice between forms that express the same concept but are stylistically differentiated. We can call this variation between denotational synonyms formal onomasiological variation, as opposed to conceptual onomasiological variation, such as that between guarda-redes and jogador (the first term being a hyponym of the second), which involves conceptual differences (Geeraerts et al. 1994). Formal onomasiological variation is thus due not to a different conceptual classification of the same entity, but to the use of different words that refer to the same concept and are associated with different regions, social groups or registers, that is, denotational synonyms. This onomasiological variation is sociolinguistically relevant, precisely because denotational synonyms reveal the very existence of, and the competition between, lectal varieties.
It should be noted that the distinction between formal and conceptual onomasiological variation is not dichotomous, and that it is not easy to establish a relation of semantic equivalence between different expressions. Indeed, subtle conceptual differences may exist. For concrete lexical items, semantic equivalence is easier to establish, insofar as we can control the referents and thus check whether the referent is the same or not. In this study, the denotational synonyms for items of clothing were determined on the basis of photographs of the items in question; in the case of football terms, the images and/or the context made it possible to determine the relation of denotational synonymy objectively. The difficulties increase when we move from concrete lexical items to abstract items and to grammatical constructions. However, what needs to be determined is not
Term     W      P50                       B50
                abs     rel   rel*W       abs     rel   rel*W
1        0      109     3.7     0.0         0     0.0     0.0
2        1       24     0.8     0.8       528    38.8    38.8
3        0.5      0     0.0     0.0       111     8.1     4.1
4        0.4      0     0.0     0.0        66     4.8     1.9
5        0.5   1841    61.9    31.0         0     0.0     0.0
6        0      204     6.9     0.0        26     1.9     0.0
7        0      795    26.7     0.0       631    46.3     0.0
Sum                            31.8                      44.8
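The figures in this profile are consistent with a weighted featural measure: each term's relative frequency (in percent) is multiplied by its weight W, and the products are summed. The Python sketch below reproduces that arithmetic under this reading; the function name is ours, and the term labels are left out because they are not given here.

```python
def featural_measure(abs_freqs, weights):
    """Weighted share (in percent) of the terms that carry a given feature.

    abs_freqs: absolute frequencies of the alternative terms for one concept
    weights:   degree (0 to 1) to which each term carries the feature
    """
    total = sum(abs_freqs)
    return sum(100 * freq / total * w for freq, w in zip(abs_freqs, weights))


# Weights and absolute frequencies copied from the profile above.
weights = [0, 1, 0.5, 0.4, 0.5, 0, 0]
p50_abs = [109, 24, 0, 0, 1841, 204, 795]
b50_abs = [0, 528, 111, 66, 0, 26, 631]

a_p50 = featural_measure(p50_abs, weights)
a_b50 = featural_measure(b50_abs, weights)
print(round(a_p50, 1), round(a_b50, 1))
```

Weights between 0 and 1 allow a term to count only partially towards the feature being measured, which is how the profile arrives at 31.8 for P50 and 44.8 for B50.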
Term                    P50    (P50)²     B50    (B50)²
1                       8.8      77.8    36.6    1340.7
2                      71.6    5128.8     0.9       0.9
3                       0.0       0.0    48.9    2393.5
4                      19.2     369.2     6.8      45.8
5                       0.1       0.0     5.2      27.4
6                       0.3       0.1     1.5       2.4
Sum of squares / 100             55.8              38.1
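The squared columns above suggest that the homogeneity of a profile is obtained by squaring each relative frequency (in percent), summing the squares and dividing by 100. The Python sketch below follows that reading; the function name is an assumption, and because the inputs are the rounded percentages shown in the table, the results agree with the published totals only to within rounding.

```python
def internal_uniformity(rel_freqs):
    """Sum of squared relative frequencies (in percent), divided by 100.

    The value is 100 when a single term is always used and falls as
    usage is spread over more alternative terms.
    """
    return sum(freq * freq for freq in rel_freqs) / 100


# Relative frequencies (in percent) copied from the table above.
p50_rel = [8.8, 71.6, 0.0, 19.2, 0.1, 0.3]
b50_rel = [36.6, 0.9, 48.9, 6.8, 5.2, 1.5]

i_p50 = internal_uniformity(p50_rel)
i_b50 = internal_uniformity(b50_rel)
print(round(i_p50, 1), round(i_b50, 1))
```

On this measure the P50 profile, dominated by one term at 71.6%, is more homogeneous than the B50 profile, where two terms share most of the usage.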
The linguistic data for the present study were collected from the lexical fields of football and fashion/clothing, because of the popularity of the concepts involved and because they are susceptible to the influence of foreign languages. The corpus materials were extracted from three sources: (i) sports newspapers and fashion magazines from the early years of the 1950s, 1970s and 1990s-2000s; (ii) Internet language from chats associated with football clubs; and (iii) clothing labels from clothes shops. The materials in (i) make it possible to answer the diachronic question of whether the influence of foreign languages is greater in Brazilian Portuguese (PB) or in European Portuguese (PE) and whether it has increased or
7.1%    <   A_Ing (B50)    18%
9.8%    <   A_Ing (B70)    17.1%
10.2%   <   A_Ing (B00)    16.2%
13.2%   <   A_Ing (B50)    18.5%
15%     <   A_Ing (B70)    20.4%
12.8%   <   A_Ing (B00)    20.3%
13.9%   <   A_estr (B50)   23.5%
17.9%   <   A_estr (B70)   22.8%
18.5%   <   A_estr (B00)   23.3%
15.9%   <   A_estr (B50)   23.4%
18%     <   A_estr (B70)   25%
15.9%   <   A_estr (B00)   25%
17,6%
15,9%
10,2%
16,7%
20,6%
16,1%
3,3%
5,8%
16,9%
7,7%
16%
19,5%
22,4%
22,1%
28,2%
27,3%
38,1%
37,9%
=
<
=
<
=
<
=
<
AF r (B50)
AF r (B70)
AF r (B00)
AF r (B50)
AF r (B70)
AF r (B00)
AIng (B50)
AIng (B70)
AIng (B00)
AIng (B50)
AIng (B70)
AIng (B00)
Aestr (B50)
Aestr (B70)
Aestr (B00)
Aestr (B50)
Aestr (B70)
Aestr (B00)
18,5%
18,1%
7,9%
22,2%
17,6%
16,9%
4,2%
7,6%
16,8%
6,7%
15%
27,1%
23,8%
26,7%
24,9%
33%
34,4%
44,4%
6%
7,9%
8,9%
7,5%
7,5%
7,8%
=
<
<
=
<
<
AIng.adapt (B50)
AIng.adapt (B70)
AIng.adapt (B00)
AIng.adapt (B50)
AIng.adapt (B70)
AIng.adapt (B00)
2,8%
16,9%
16%
3,8%
16,5%
15,8%
10,2%
16,2%
14,3%
12,8%
20,3%
12,3%
<
<
<
=
<
14,3%
26,9%
26,9%
12,3%
19,7%
19,7%
16,9%
16,8%
18%
19,5%
27,1%
17,9%
10,2%
7,9%
9,7%
16,1%
16,9%
14,1%
=
>
>
=
>
<
=
>
18%
10,1%
10,1%
17,9%
22,6%
22,6%
9,7%
10,4%
10,4%
14,1%
11,3%
11,3%
acknowledgments
This study was funded by the Fundação para a Ciência e a Tecnologia, as part of the strategic project PEst-OE/FIL/UI0683/2011 of the Centro de Estudos Filosóficos e Humanísticos of the Universidade Católica Portuguesa. I thank the reviewers Rui Sousa Silva and Luís Trigo for their illuminating and stimulating comments and suggestions.
appendix
Football profiles
ÁRBITRO: apitador, árbitro, director da partida, juiz, juiz de campo, ref(eree), referi, refre
ÁRBITRO AUXILIAR: árbitro auxiliar, árbitro assistente, auxiliar, 2.º/3.º/4.º árbitro, bandeirinha, fiscal de linha, juiz de linha, liner
AVANÇADO: atacante, avançado, avante, dianteiro, forward, ponta-de-lança
BALIZA: arco, baliza, cidadela, goal, gol(o), malhas, marco, meta, rede, redes, vala
BOLA: balão, bola, couro(inho), esfera, esférico, pelota
CANTO: canto, chute de canto, corner, córner, escanteio, esquinado, pontapé de canto, tiro de canto
DEFESA: (full-)back, beque, bequeira, defensor, defesa, lateral, líbero, zagueiro
EQUIPA: conjunto, formação, eleven, equipa/e, esquadra, esquadrão, grupo, match, onze, onzena, plantel, quadro, team, time, turma
EXTREMO: ala, extremo, ponta, ponteiro
FALTA: carga, falta, foul, golpe, infra(c)ção, obstru(c)ção, transgressão, violação (das regras)
FINTA: corte, drible(ing), engano, feint, finta, firula, ginga, lesa, manobra enganadora, simulação
FORA DE JOGO: adiantamento, banheira, deslocação, fora-de-jogo, impedimento, offside, posição irregular
GOLO: bola, goal, gol, golo, ponto, tento
GRANDE PENALIDADE: castigo máximo, castigo-mor, falta máxima, grande penalidade, penalidade, penalidade máxima, penálti (pênalti, pénalti), penalty
GUARDA-REDES: arqueiro, goal-keeper, goleiro, golquíper, guarda-meta, guarda-rede, guarda-redes, guarda-vala, guarda-valas, guardião, keeper, porteiro, quíper, vigia
JOGADA: jogada, lance
JOGO: batalha, choque, combate, competição, confronto, desafio, disputa, duelo, embate, encontro, jogo, justa, luta, match, partida, peleja, prélio, prova, pugna
MÉDIO: alfe, central, centro-campista, centro-médio, half, interior, médio, meia, meio-campista, meio-campo, midfield, trinco, volante
PONTAPÉ LIVRE: chute (in)direto, falta, free(-kick), livre (directo, indirecto), pontapé livre, tiro dire(c)to, tiro livre (direto, indireto)
PONTAPÉ: chute, chuto, kick(-off), panázio, pelotada, pontapé, quique, shoot, tiro
TREINADOR: mister, professor, técnico, treinador
Clothing profiles
BLUSA F: blouse, blusa, blusinha, bustier, camisa, camisa-body, camisão, camiseiro(inho), camiseta/e, (blusa) chmisier, (blusa) chemisi
BLUSÃO M/F: blazer, blizer, bliser, blusão, bluson, camurça, camurcine, camisa esporte, casaco de pele (de ganga, etc.), colete, parka
CALÇAS M/F: calça, calças, pantalona
CALÇAS CURTAS M/F: bermuda(s), calças-capri, calça(s) corsário, calça(s) curta(s), calças 3/4, calções, cool pants, corsários, hot pants, knikers, pantacourt, pedal pusher, short(s), short cuts, short shorts, shortinho, slack(s)
CALÇAS JUSTAS F: fuseau(x), fus, legging(s)
CAMISA M: blusão, camisa, camisa de gravata, camisa de manga curta, camisa desportiva, camisa esporte(iva), camisa jeans, camisa social, camiseta, camisete, camisette
CAMISOLA M/F: blusa, blusão, blusinha, body, cachemir, camisa, camisa-de-meia, camiseta, camisinha, camisola, camisolinha, canoutier, canouti, malha, malhinha, moleton, pull, pullover, pulôver, suéter, sweat, sweat shirt, sweater
references
Baldinger, Kurt. 1964. Sémasiologie et onomasiologie. Revue de Linguistique Romane 28. 249–272.
Croft, William. 2001. Radical Construction Grammar. Oxford University Press.
Soares da Silva, Augusto. 2014a. Measuring and comparing the use and success of loanwords in Portugal and Brazil: A corpus-based and concept-based sociolectometrical approach. In Eline Zenner & Gitte Kristiansen (eds.), New Perspectives on Lexical Borrowing: Onomasiological, methodological and phraseological innovations, 101–141. De Gruyter.
Soares da Silva, Augusto. 2014b. Pluricentricity: Language Variation and Sociocognitive Dimensions. De Gruyter.
Soares da Silva, Augusto. 2014c. The pluricentricity of Portuguese: A sociolectometrical approach to divergence between European and Brazilian Portuguese. In Augusto Soares da Silva (ed.), Pluricentricity: Language Variation and Sociocognitive Dimensions, 143–188. De Gruyter.
Speelman, Dirk, Stefan Grondelaers & Dirk Geeraerts. 2003. Profile-based linguistic uniformity as a generic method for comparing language varieties. Computers and the Humanities 37. 317–337.
Taylor, John R. 1995. Linguistic Categorization. Prototypes in Linguistic Theory. Oxford University Press 2nd edn.
Zenner, Eline. 2013. Cognitive Contact Linguistics. The macro, meso and micro influence of English on Dutch: University of Leuven PhD dissertation.
Zenner, Eline & Gitte Kristiansen (eds.). 2014. New Perspectives on Lexical Borrowing: Onomasiological, methodological and phraseological innovations. De Gruyter.
Zenner, Eline, Dirk Speelman & Dirk Geeraerts. 2012. Cognitive Sociolinguistics meets loanword research: Measuring variation in the success of anglicisms in Dutch. Cognitive Linguistics 23(4). 749–792.
contacts
Augusto Soares da Silva
Universidade Católica Portuguesa
assilva@braga.ucp.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 101–124. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
This article presents the automatic anonymisation of named entities in a new searchable parallel corpus of the legal-financial domain for the Portuguese-English language pair. The corpus results from translation memories used in professional translation. It contains about 40,000 aligned sentence pairs, that is, sentences that are translations of each other. Named entities were annotated with special Constraint Grammar rules optimised for the legal-financial domain, which achieved a recall, balanced against precision, of almost 90% for the candidate named entities (person, organisation, address and personal identifiers), and a considerably higher recall with heuristic modifications optimised for production. The corpus is intended for translation studies and computational linguistics (statistical machine translation) and will be publicly searchable, allowing users to look up a word or expression and returning the search results in context, both in the search language and in translation.
[1] introduction
High quality parallel corpora are useful for many natural language processing (NLP) applications and represent an important resource for language and translation learning. However, parallel corpora available for research are scarce, and when available, they may not be of good quality. Many parallel corpora contain mistakes resulting from lexical variation or inappropriate use of the lexicon and terminology, which carries over into semantic errors and unsuitable translations (Barreiro 2009). Despite quantity and quality limitations, researchers use parallel corpora for cross-language retrieval and for mining terms for human and machine translation (MT), among other applications. For languages like Portuguese, the few parallel corpora available may be specific to a certain subject matter or domain, but normally do not exist for technical texts. Given the lack of parallel data available to train NLP systems, the corpus described in this paper represents an effort in making trustworthy technical data available for research pur-
Corpora resources represent the driving force behind NLP systems and the source
of data to train SMT systems. Several resources and corpora tools allow studying
human translation and performing contrastive studies between Portuguese and
English (cf. Santos (1996, chp. 8), Maia (2008), and Tagnin et al. (2009), among
others). Tools for searchable corpora make it possible, for example, to search for a word or
expression in Portuguese and see how that word or expression was translated into
English in different contexts. Searches can be simple text searches or advanced
context searches exploiting categories like part of speech, syntactic function or
semantics, and will often allow quantitative analysis, providing frequency lists,
and so on.
There are several parallel corpora available for Portuguese as one of the languages involved in the corpus translation pair, among them: the EuroParl[1], JR
[1] http://www.statmt.org/europarl
http://langtech.jrc.it/JRC-Acquis.html
http://linguateca.di.uminho.pt/nat
http://nilc.icmc.usp.br/dispara/CorTrad/
http://www.linguateca.pt/COMPARA/
http://www.linguee.com/
EN-UK
h) Those, who have management or supervisory duties in five companies, excepting law firms, firms of
official auditors and official auditors, subject in the
latter case to the provisions of article 76 of Decree-Law no. 487/99, of the 16th of November;
i) Official auditors, who are in any of the other circumstances of incompatibility provided in the corresponding legislation;
knowable, which is why they have to be anonymised against each other (hence
the internal confusion measure). By contrast, a text corpus does not come with
clear referents and needs NER just to identify the data records themselves. So
in principle, the anonymisation background is the entire population, making the
task less challenging in this regard. In addition, without a database structure, an
internal confusion measure such as the k-value is not practically applicable. All
in all, textual anonymisation is quite different from the anonymisation of data
fields, with its own added problems, such as vagueness, importance of context,
lack of consistency, among others. In the following sections ([3] to [6]), we will
discuss how these issues can be addressed with NLP tools.
[3] corpus
Taking into consideration that the quality of SMT is ultimately dependent on the adequacy of the parallel corpora used for the task, and that good quality translations for a specialised domain are difficult or impossible to obtain when training MT systems on another, or more general, domain, we have prepared such a specialized parallel corpus for the legal-financial domain. Apart from SMT researchers, we are also targeting human translators in need of contextualized and idiomatic translation examples. The corpus is based on translation memories used in the Metatrad[7] agency's professional translation activities, and comprises 40,000 sentences in Portuguese and English, corresponding to about 1 million tokens each.
[7] http://www.metatrad.com
[4] the PALAVRAS NER framework
The PALAVRAS parser (Bick 2000) is a rule-based parser using the Constraint Grammar paradigm, specifically the open source CG3 compiler[8]. PALAVRAS uses contextual disambiguation and mapping rules on morphologically multi-tagged input, where each token receives one or more reading lines (a so-called cohort). The core version of the system covers part-of-speech (POS), inflection, syntactic function and dependency links or constituent structure. However, various special grammar modules have been added over time for specific research projects or applications, such as semantic roles, semantic prototypes, valency, anaphora and NER (Bick 2014). The parser has been applied to a host of Portuguese language corpora (among others, all Linguateca[9] corpora), and research versions have addressed transcribed speech, historical text and various non-standard written domains.
PALAVRAS NER participated twice in Linguateca's joint NER tasks, and performed at the top of the field. The first version (avalia SREC, Bick 2003), taking a more static approach, tried to fix multi-word names (MWEs) before running the system's grammars, either by simple lexicon lookup or by pattern recognition in the preprocessor, and the only allowed post-grammar token alteration was fusion of adjacent name chains. This technique was replaced by a more dynamic, grammar-based NE chunking approach in the second version (Bick 2006), used for the HAREM shared task (Santos et al. 2006). In this system, which we are using here, preprocessor-generated name candidate MWEs are fed to the morphological analyzer not as a whole, but in individual token parts. Thus, parts of unknown name candidates will be individually tagged for word class, inflection and, most importantly, semantic prototype class, which is a prime trigger for NE classification and is used by the NE type mapping rules (cf. [5.3]). In addition, each part is tagged as either @prop1 (leftmost part) or @prop2 (middle and rightmost parts), and both tag types can be added or removed by contextual rules. At the same time, the NE category set was expanded from 6 super-categories to 41 fine-grained categories with a functional rather than lexematic definition. For our anonymisation task, we internally maintained the fine-grained set, but selected the individual human category @hum as the anonymisation category <NAME_PERSON> and lumped the membership group category with administrative/institutional organisations and companies into @org (anonymisation category <NAME_ORGANIZATION>).
[8] http://visl.sdu.dk/constraint_grammar.html
[9] http://www.linguateca.pt/ (2000-2014)
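The interplay between @prop1/@prop2 chain tags and the two anonymisation categories can be illustrated with a minimal Python sketch. It is not the PALAVRAS implementation, which works with Constraint Grammar rules; the token triples, the `PLACEHOLDERS` table and the `anonymise` function are hypothetical names introduced only for illustration, and the example sentence is invented.

```python
from typing import List, Tuple

# Hypothetical token representation: (word form, chain tag, NE category).
# "@prop1" opens a name chain, "@prop2" continues it, "" marks an ordinary token.
Token = Tuple[str, str, str]

# Simplified NE categories mapped to the anonymisation placeholders named in the text.
PLACEHOLDERS = {"hum": "<NAME_PERSON>", "org": "<NAME_ORGANIZATION>"}


def anonymise(tokens: List[Token]) -> List[str]:
    """Fuse @prop1/@prop2 chains and replace typed chains with placeholders."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        word, tag, category = tokens[i]
        if tag != "@prop1":
            out.append(word)
            i += 1
            continue
        # Collect the leftmost part plus all directly following @prop2 parts.
        chain = [word]
        j = i + 1
        while j < len(tokens) and tokens[j][1] == "@prop2":
            chain.append(tokens[j][0])
            j += 1
        # Categories without a placeholder (e.g. place names) keep their surface form.
        out.append(PLACEHOLDERS.get(category, " ".join(chain)))
        i = j
    return out


sentence = [
    ("O", "", ""), ("Dr.", "", ""),
    ("João", "@prop1", "hum"), ("da", "@prop2", "hum"), ("Silva", "@prop2", "hum"),
    ("representa", "", ""), ("a", "", ""),
    ("Embraer", "@prop1", "org"),
    ("em", "", ""), ("Lisboa", "@prop1", "civ"),
]
print(" ".join(anonymise(sentence)))
```

In this toy run the person and company chains are replaced, while the town name is left untouched because no placeholder is defined for its category.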
Preprocessed
corpus
EN-UK
ENG-UK
So far as trainees&92 opinion regarding the
possibility that CETs will be seen by the general public as &93;second rate&94; courses
ENG-UK So far as trainees opinion regarding the possibility that CETs will be seen by
the general public as second rate courses
table 2: Preprocessing.
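The residual codes in the raw text (&92, &93;, &94;) look like leftover character references. One plausible reading, which is only an assumption here and not something the table states, is that the numbers are hexadecimal Windows-1252 byte values for typographic quotes. Under that assumption, a preprocessing step could be sketched as follows; the function name `clean` and the choice to normalise to ASCII quotes are illustrative.

```python
import re

# Assumption: the numbers are Windows-1252 byte values in hexadecimal
# (0x91/0x92 = single curly quotes, 0x93/0x94 = double curly quotes).
ASCII_EQUIVALENT = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}


def clean(text: str) -> str:
    """Replace residual &91..&94 codes with plain ASCII quotes."""
    def decode(match):
        char = bytes([int(match.group(1), 16)]).decode("cp1252")
        return ASCII_EQUIVALENT.get(char, char)

    # The trailing semicolon is optional because both forms occur in the raw text.
    return re.sub(r"&(9[1-4]);?", decode, text)


raw = "So far as trainees&92 opinion ... as &93;second rate&94; courses"
print(clean(raw))
```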
PT-PT
Parser output
Post-processed
corpus
PROP
F S
anonymisation have to be inserted as <NAME_....> place-holders and the respective token removed. For certain unclassified name tokens, the postprocessor performs its own heuristic anonymisation (cf. [5.3.2]), treating all-uppercase names
as organisations and compound names as person names.
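The postprocessor fallback just described (all-uppercase names as organisations, compound names as persons) can be sketched in a few lines of Python. This is a simplified illustration under the assumption that the name string has already been isolated; the function name `heuristic_placeholder` is hypothetical, and the real postprocessor operates on parser output rather than raw strings.

```python
def heuristic_placeholder(name: str) -> str:
    """Fallback anonymisation for a name the grammar left unclassified."""
    letters = [ch for ch in name if ch.isalpha()]
    # All-uppercase names are treated as organisations.
    if letters and all(ch.isupper() for ch in letters):
        return "<NAME_ORGANIZATION>"
    # Remaining compound (multi-token) names are treated as person names.
    if len(name.split()) > 1:
        return "<NAME_PERSON>"
    # Single-token, mixed-case names are left as they are.
    return name


for candidate in ["EDP", "GALP ENERGIA", "Maria Santos", "Lisboa"]:
    print(candidate, "->", heuristic_placeholder(candidate))
```

The uppercase test is applied first, so an all-uppercase compound name is treated as an organisation rather than a person.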
Note that the extract illustrated in Table 3, apart from two ACNE, contains a
third NE, Lisboa, which has also been classified as civitas <civ>. Geographical
locations were considered public domain in our current scheme, but could easily
be anonymised, given the full NER mark-up, or, in this case, fused into the address
ACNE.
% Uppercase words: 14.45%, 16.61%, 29.08%, 29.51%, 37.61%
The most difficult cases are those where the initial-uppercase clue (for namehood) is lost because the whole word is in uppercase, e.g. EVITA COSTA (V: evitar, N: costa). Still, many extreme cases (e.g. (vi)) can be ruled out heuristically, even in otherwise uppercased context. For instance, rules can exclude multiple derivation or forbid certain affixes specifically.
Strings of this and similar type, such as AS (article or A.S.?), CET and PAI need to be contextually disambiguated. Thus, rules exploit the fact that an all-uppercase word in parentheses is more likely a name abbreviation than, say, a function word. On the other hand, a plural article or a plural ending in -s helps discard a company name in favour of a noun abbreviation (e.g. os CET, os SPVs).
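The two contextual cues mentioned above can be made concrete with a small Python sketch. It only illustrates the decision logic and does not reproduce the Constraint Grammar rules actually used; the function name, the article list and the three output labels are assumptions made for the example.

```python
# Plural articles that, as a left neighbour, point to a noun abbreviation.
PLURAL_ARTICLES = {"os", "as", "uns", "umas"}


def classify_uppercase(prev: str, word: str, nxt: str) -> str:
    """Guess the reading of an all-uppercase token from its immediate context."""
    has_plural_s = word.endswith("s") and word[:-1].isupper()
    if prev.lower() in PLURAL_ARTICLES or has_plural_s:
        return "noun-abbreviation"   # e.g. "os CET", "os SPVs"
    if prev == "(" and nxt == ")":
        return "name-abbreviation"   # all-uppercase word in parentheses
    return "undecided"               # left to further contextual rules


print(classify_uppercase("os", "CET", "foram"))
print(classify_uppercase("(", "CET", ")"))
```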
This is definitely true for the proper noun part of person names, while categories like HAREM's OFFICIAL, or titles without proper nouns (e.g. Sr. Dr. Juz) have no great need for anonymisation. The only exception for the proper noun person names are cases where a name is used to denote works of art (e.g. listen to Mozart) and possibly names in publications, where we follow HAREM conventions in using a different category, PUBLICATION.
Organization Names
Internally, our grammar distinguishes between different types of organisation
according to the PALAVRAS and HAREM schemes 1-7.
(i) organisation (@org) - the umbrella category, e.g. international, NGO;
(ii) company (@company): e.g. Embraer, A.S., Ltda.;
(iii) administrative units12 (@admin): government, parliament, assembly;
(iv) institution (@inst): institute, laboratory, museum, university;
[11] The rule has been simplified; real rules often have multiple exceptions to cover special cases. Here, the brand case is constrained to <foreign>-marked proper nouns, there is a town name context exception for São, and the PROP chaining also allows the preposition de.
[12] This is a HAREM category and was also used for countries and towns, if they functioned as agents or cognizers. The distinction is not upheld by PALAVRAS, but only mapped later using semantic role inference, where desired. Furthermore, PALAVRAS tags place-bound administrative units as institutions, alongside shops, hotels etc.
Tail tokens also occur with person names, but they are rare (e.g. Neto, Neta, Filho), unless one also counts prepositional phrases like da Silva, dos Santos, etc.
Addresses
Though the existing PALAVRAS NER module already treated addresses as a separate NER category, it did not perform well on the bilingual legal-financial domain corpus at hand, in part simply because international address formats (e.g.
English, Dutch, etc.) appeared next to the known Portuguese ones (e.g. 10a
Belmont Street, NW1 8HH, Londres), but also because of the large orthographical variation in the corpus, possibly caused by OCR or keyboard (typewriter?)
limitations. Thus, there were around 20 different variants of n, to name just one
example, including n., n, n.e, no.s, no., n9., n*, n", n,, etc. plus uppercase
variants, with similar variation in ordinals before words like piso and andar, or
as affixes (e.g. 89-3), as well as use of ordinal abbreviations in other words (e.g.
2o dt, Esq). In order to identify address NE, we again defined head nouns and
tail words, as illustrated in rules (xvi) and (xvii).
(xvi) LIST N-ADDRESS = <Lpath> "Av" "Av." "Av.a" "Av. [A-Z].*"r ...
"rua" "R." "Ra" "Via" ...
(xvii) LIST N-ADDRESS-POST = "Avenue" "Bd" "Boulevard" "Rd" "Road"
"St" "Street" "Sq" "Square" ....
The latter was necessary, because English addresses place the closed-class
part of street names last (e.g. Hampton Road), while Portuguese (and other Romance languages) have closed-class material first (e.g. Via Appia). A third possibility is seen in German and Dutch addresses where the closed-class items are not
separate words, making the use of regular expressions necessary (Bergstrasse,
Meulengracht). In addition, Portuguese/Continental and English addresses place
street number differently, so they mark either right or left boundaries of street
addresses. @prop2 rules were used to let addresses span right over further uppercase material, added numerical material and subaddress words (e.g. casa,
lote, piso, esq., r/c), allowing also interfering commas, letters, hyphens, slashes,
the preposition de, articles and the n token in all its variants. Though identified
as such, person names inside addresses were not allowed to prevent address strings from growing right, i.e. from the head Avenida to the last part Esq. or Piso across the highlighted person names in examples (xviii) and (xix). This means that it is the larger address NE that gets marked rather than the smaller person NE inside it (Júlio Dinis and Fernão de Magalhães, in the examples).
[14] Provided, of course, the parser has correctly disambiguated a as not being a preposition.
(xviii) Avenida *Júlio Dinis*, n. 2 3o Esq.
(xix) Avenida *Fernão de Magalhães*, n 1862.-14 Piso
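The three address layouts discussed in this section (closed-class word first, closed-class word last, and closed-class material fused into the street name) can be approximated with regular expressions. The Python sketch below is a rough stand-in for the head-noun and tail-word lists in (xvi) and (xvii), not a transcription of the CG rules; the word lists are abbreviated, and the pattern labels and the function name `find_addresses` are assumptions made for illustration.

```python
import re

# Abbreviated word lists, loosely following rules (xvi) and (xvii).
HEAD = r"(?:Avenida|Av\.?|Rua|R\.|Via)"                          # Portuguese/Romance: head first
TAIL = r"(?:Avenue|Boulevard|Bd|Road|Rd|Street|St|Square|Sq)"    # English: tail word last
# A capitalised name chain that may contain the preposition "de".
NAME = r"[A-Z][\w.]*(?:\s+(?:de\s+)?[A-Z][\w.]*)*"

PATTERNS = [
    ("head-first", re.compile(rf"\b{HEAD}\s+{NAME}")),
    ("tail-last", re.compile(rf"\b(?:\d+[a-z]?\s+)?{NAME}\s+{TAIL}\b")),
    # German/Dutch: the closed-class item is a suffix, so a regex on the word is needed.
    ("fused", re.compile(r"\b[A-Z][a-z]+(?:strasse|gracht)\b")),
]


def find_addresses(text: str):
    """Return (pattern label, matched string) pairs for all address-like spans."""
    hits = []
    for label, pattern in PATTERNS:
        hits += [(label, match.group(0)) for match in pattern.finditer(text)]
    return hits


print(find_addresses("Avenida Fernão de Magalhães, n 1862"))
print(find_addresses("10a Belmont Street, NW1 8HH, Londres"))
print(find_addresses("Meulengracht"))
```

Because the name chain is matched as part of the address span, a person name such as Fernão de Magalhães ends up inside the larger address match, which mirrors the behaviour described for examples (xviii) and (xix).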
A special topic concerns town names with postal area codes, which were treated as addresses when appearing on their own, but otherwise fused into adjacent
address strings. Internationally, postal codes vary a lot, and number-only codes
in particular need a recognized place name or address as context. Conversely,
once identified, postal codes can help identify lexically unknown place names. In
some cases, address heads or tail words are identified in connection with proper
nouns, but without a number extension, subaddress or postal code. These are first
tagged ambiguously as @address @site, and later treated by the disambiguation
grammar with full context, lumping these cases together with other site words
such as estao, estdio, mina, and shopping, among others. Corpus-wise, we
decided that street names, etc. used on their own are not precise enough to need
anonymisation.
           Cases   Recall   Precision   F1-score
hum          263    87.83       87.50      87.66
org          871    93.69       86.53      89.97
address       38    81.58       91.18      86.11
nameid        54    60.71       87.18      71.58
all         1229    88.32       86.88      86.68
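As a reading aid, the F1-score is the harmonic mean of precision and recall. The short Python sketch below recomputes it for the four individual categories from the recall and precision figures reported above; the function and variable names are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) per category, as reported in the evaluation.
reported = {
    "hum": (87.83, 87.50),
    "org": (93.69, 86.53),
    "address": (81.58, 91.18),
    "nameid": (60.71, 87.18),
}

for category, (recall, precision) in reported.items():
    print(category, round(f1_score(precision, recall), 2))
```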
Recall
90.71
88.32
Precision
89.22
86.88
F1 -score
89.96
87.59
88.36
86.26
86.91
84.84
87.63
85.54
untyped, chunked
typed, chunked
86.02
84.19
84.61
82.81
85.31
83.49
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_sw/muc_sw_manual.html
http://www.alias-i.com/lingpipe/
http://www.cnts.ua.ac.be/conll2003/ner/
(iii) treating single-token PROP as ACNEs, if the parser marked them as <foreign>.
Again, these cases covered a mixture of person/organisation types.
(iv) treating camel case as ACNE (of <org> type).
(v) treating all numerical expressions as ACNEs. These were mostly of <nameid>
type, but <address> in cases where uppercase letters were followed by digits.
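Strategies (iii) to (v) can be sketched as a simple token-level fallback. The Python code below illustrates the decision order only and is not the system's implementation; the regular expressions, the `is_foreign_prop` flag (standing in for the parser's <foreign> mark) and the label "acne" for untyped candidates are assumptions made for the example.

```python
import re

# Camel case: at least two capitalised segments inside one token, e.g. "LingPipe".
CAMEL_CASE = re.compile(r"^[A-Z][a-z]+(?:[A-Z][a-z]+)+$")
# Uppercase letters followed by digits, e.g. the postal code part "NW1".
UPPER_THEN_DIGITS = re.compile(r"^[A-Z]+\d")


def fallback_category(token: str, is_foreign_prop: bool = False):
    """Return a fallback anonymisation category, or None if no heuristic applies."""
    if any(ch.isdigit() for ch in token):
        # (v) numerical expressions: mostly ids, addresses if letters precede digits.
        return "address" if UPPER_THEN_DIGITS.match(token) else "nameid"
    if CAMEL_CASE.match(token):
        return "org"     # (iv) camel case is treated as an organisation
    if is_foreign_prop:
        return "acne"    # (iii) single-token foreign PROP, left untyped
    return None


for tok in ["NW1", "501234567", "LingPipe", "Lisboa"]:
    print(tok, "->", fallback_category(tok))
```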
The above strategies capture 88.8% of all false negatives. Of the remaining 13 cases, one was partially recognized already (person name within organisation name), and would thus get anonymised anyway; the rest consisted of ordinary words used as names (e.g. Tranquilidade) or ambiguous with names at sentence start (e.g. Marques), names with case errors (e.g. o opbbr uma sociedade, o Oi) or mistyped/untyped PROP, the latter sometimes as part of what the parser regarded as a longer PROP chain. Given this distribution of cases, almost total anonymisation recall could be achieved by treating all PROP-tagged strings as ACNEs. Table 7 below shows how the individual strategies affect recall and, for the non-numerical types, precision.
The price in precision loss for applying the above strategies is, of course, fairly high. The safest strategy is all-uppercase PROP, where recall gain outweighs precision loss 5:1 and where recall for the main affected category, <org>, climbed to over 96%. Treating all complex PROP as ACNEs is much less safe, and would sink precision into the 50% bracket. However, only applying this strategy to complex PROP not otherwise categorized still matches most false positives of this type[18], while leading to a more tolerable precision loss, only a little above the corresponding recall gain. It is beneficial especially for person names (8% recall gain), bringing them on par with <org> coverage. Camel case and the <foreign> tag are much more expensive in precision terms, and risk including typos and, for the latter, a good portion of ordinary English words (> 40%). General numerical anonymisation, finally, captures virtually all id and address information and is unproblematic to use, irrespective of precision loss, because textual cohesion suffers much less from digit replacement than it does when uppercase noun chains and proper nouns are replaced with dummies.
We conclude from the above that, apart from numerical anonymisation, two fallback strategies are cost-efficient enough to be used: treating remaining all-uppercase as <org> and unclassified compound proper nouns as <hum>. All in all, this achieves a recall for ACNEs of 98.24%, arguably good enough for purely
[18] The target group of compound names, person names, are mostly cases where all elements of the MWEs are individually proper nouns, while compound names with uppercase noun elements often belong to other classes. It is exactly this trait that makes it likely that the parser already has found a classification for them, based on its knowledge about semantic noun classes.
no recall heuristics
all-upper PROP
compound PROP
numerical expressions
uppercase + numerical
<org>
<hum>
<address>
<nameid>
<foreign> PROP
<camelcase> PROP
other PROP
other
30
41
20
4
False
negative
0.50%
0.17%
0.84%
2.35%
3.43%
1.76%
0.25%
R gain
0.47%
5.51%
P loss
3.77%
0.50%
6
2
10
3
Cumulative
recall,
untyped
90.29%
92.64%
96.23%
97.91%
98.24%
99.13%
98.10%
100.00%
100.00%
98.83%
98.91%
99.75%
table 7: Effect of recall heuristics.
<org> 96.16%
<hum> 95.82%
<nameid> 100%
<address> 92.16%
Typed recall
effect for
main category
2.97%
7.99%
39.29%
10.58%
Typed
recall
gain
PT-PT
(A) As Partes são duas sociedades constituídas sob o domínio integral da <NAME1_ORGANISATION>, sociedade adjudicatária da Fase A do denominado Concurso das Eólicas, conforme Contrato celebrado com a (agora designada) <NAME2_ORGANISATION_ADMIN> (<NAME3_ORGANISATION_ADMIN>) em <NAME4_DATE>, nos termos do qual, e dos respectivos anexos, a <NAME5_ORGANISATION> e a <NAME6_ORGANISATION> assumiram os direitos e obrigações relacionados com as actividades de promoção dos Parques Eólicos e do Projecto Industrial previstos no mesmo Contrato com a <NAME7_ORGANISATION_ADMIN>, respectivamente;
EN-UK
(A) The Parties are two companies incorporated under the exclusive control of <NAME1_ORGANISATION>, a company, which has been awarded the contract for Phase A of the Wind power Tender, in accordance with a Contract with the <NAME2_ORGANISATION_ADMIN> (<NAME3_ORGANISATION_ADMIN>), as it is now designated, signed on the <NAME4_DATE>. According to the terms of the said Contract with the <NAME7_ORGANISATION_ADMIN> and the annexes thereof, <NAME5_ORGANISATION> and <NAME6_ORGANISATION> respectively assumed the rights and obligations in relation to the promotion of the Wind Parks and Industrial Project envisaged in the said Contract;
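In the example above, the same entity carries the same numbered placeholder on both sides of the aligned pair, even where the word order differs. One way to obtain this behaviour is to share a single index between the two sentences, as in the following Python sketch. It assumes that an entity has the same surface form in both languages and that entity spans and categories are already known; the function name, the data layout and the company names are hypothetical.

```python
from typing import Dict, List, Tuple


def number_placeholders(sentence: str,
                        entities: List[Tuple[str, str]],
                        index: Dict[str, int]) -> str:
    """Replace each (surface form, category) pair with a numbered placeholder.

    The index maps surface forms to numbers and is shared across the aligned
    sentences, so a given entity gets the same number on both sides.
    """
    for surface, category in entities:
        number = index.setdefault(surface, len(index) + 1)
        sentence = sentence.replace(surface, f"<NAME{number}_{category}>")
    return sentence


shared_index: Dict[str, int] = {}
pt = number_placeholders(
    "A Alfa, S.A. e a Beta Lda. assinaram o Contrato",
    [("Alfa, S.A.", "ORGANISATION"), ("Beta Lda.", "ORGANISATION")],
    shared_index,
)
en = number_placeholders(
    "Beta Lda. and Alfa, S.A. signed the Contract",
    [("Beta Lda.", "ORGANISATION"), ("Alfa, S.A.", "ORGANISATION")],
    shared_index,
)
print(pt)
print(en)
```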
[7] conclusions and future work
acknowledgments
We would like to thank Metatrad for making it possible to create the corpus described here, and for allowing us to make it publicly available for searching. We would also like to thank Hugo Gonçalo Oliveira and Miriam Leite for relevant comments that helped improve this paper. Anabela's work was funded by FCT through grant SFRH/BPD/91446/2012.
references
Barreiro, Anabela. 2009. Make it Simple with Paraphrases: Automated Paraphrasing for Authoring Aids and Machine Translation: Universidade do Porto PhD dissertation.
Bick, Eckhard. 2000. The Parsing System Palavras: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework: Aarhus University PhD dissertation.
Bick, Eckhard. 2003. Multi-Level NER for Portuguese in a CG Framework. In Jorge Baptista, Isabel Trancoso, Maria das Graças Volpe Nunes & Nuno J. Mamede (eds.), Computational Processing of the Portuguese Language: 6th International Workshop, PROPOR 2003. Faro, Portugal, June 2003 (PROPOR 2003), 118–125. Springer.
Bick, Eckhard. 2006. Functional Aspects in Portuguese NER. In Renata Vieira, Paulo Quaresma, Maria das Graças Volpe Nunes, Nuno J. Mamede, Cláudia Oliveira & Maria Carmelita Dias (eds.), Computational Processing of the Portuguese Language, Proceedings of PROPOR 2006, 80–89. Springer.
Bick, Eckhard. 2014. PALAVRAS, a Constraint Grammar-based Parsing System for Portuguese. In Tony Berber Sardinha & Thelma de Lurdes São Bento Ferreira (eds.), Working with Portuguese Corpora, 279–302. Bloomsbury Academic.
Maia, Belinda. 2008. Corpógrafo V4 - Tools for Educating Translators. In Elia Yuste Rodrigo (ed.), Topics in Language Resources for Translation and Localisation, 57–70. John Benjamins Pub. Co.
Medlock, Ben. 2006. An Introduction to NLP-based Textual Anonymisation. In Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk & Daniel Tapias (eds.), Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 1051–1056.
Sang, Erik F. Tjong Kim & Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL 2003.
Santos, Diana, Nuno Seco, Nuno Cardoso & Rui Vilela. 2006. HAREM: An Advanced NER Evaluation Contest for Portuguese. In Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk & Daniel Tapias (eds.), Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 1986–1991.
Santos, Diana Maria de Sousa Marques Pinto dos. 1996. Tense and aspect in English and Portuguese: a contrastive semantical study: Instituto Superior Técnico, Universidade Técnica de Lisboa PhD dissertation.
contacts
Eckhard Bick
University of Southern Denmark
eckhard.bick@mail.dk
Anabela Barreiro
INESC-ID
anabela.barreiro@inesc-id.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 125–137. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
The Portuguese were the first Europeans to establish contact with Japan, in the 16th century. They wrote about what they saw there, which marked the start of a long history of documenting Japan in Portugal. Even though the relationship between the two countries has had its ups and downs over time, the fascination with Japan among the Portuguese seems to continue. The goal of the study reported in this article was to identify which aspects of Japan most drew the attention of the Portuguese media in the 1990s. Concordances and frequencies from the CETEMPúblico corpus, which contains texts published in the Portuguese daily newspaper PÚBLICO in the 1990s, were used for that purpose, in a combination of automatic and manual processes.
[1] introduction
The content-analysis studies of Portuguese texts that I was able to find are based on much smaller samples. See, for example, Ferro (2011), Lobo (2010) and Magalhães (2004), which studied 161, 159 and 73 news items respectively, and de Almeida Menezes (2011), which studied 10 interviews. Only Magalhães (2004) and de Almeida Menezes (2011) state that computational tools were used to support the analysis of the content of the texts. In both cases, tools from WordSmith Tools (Scott 1996) were used. Lobo (2010) states explicitly that no computer programs were used to support the textual analysis, and Ferro (2011) makes no reference to their use.
For other languages there are various examples of the use of computational tools to support the analysis of much larger amounts of text. See, for example, Kutter & Kantner (2012), who worked on a corpus of half a million texts in English, Dutch, French and German to analyse how the media of different European countries cover wars and military interventions, and Baker et al. (2008), who used English texts containing 140 million words to study how the British press reports on matters related to refugees, asylum seekers, immigrants and migrants.
For the work described in this article, given the amount of text and the breadth of the object of study, a mixture of computational and manual processes was used. As for the computational processes, concordances and distributions were the main instruments. Both can be obtained through the AC/DC corpus query service (Costa et al. 2009). This service allows searches over a set of corpora with different characteristics, CETEMPúblico being one of them.
Concordances are simply examples extracted from a corpus of texts that match a given search expression. For example, querying the CETEMPúblico corpus in AC/DC with the search expression [sem="93b" & word="Soares"] (this expression indicates that all occurrences of the word Soares in the second half of 1993 are wanted) yields 2,176 concordances, including the following examples.
(1)
(3) par=ext111822-pol-93b-1: Soares respondeu que não se mostrava nada impressionado com a mensagem, pois acabava precisamente de chegar de óptimos momentos de convívio com um imperador o do Japão .
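The logic of a search expression such as [sem="93b" & word="Soares"] can be illustrated with a toy Python sketch: each token carries attributes, and a sentence is returned as a concordance when at least one of its tokens satisfies all conditions. This only illustrates the matching principle, not the AC/DC service or its query engine; the data layout, the function name `concordances` and the miniature corpus are invented for the example.

```python
from typing import Dict, Iterator, List

Token = Dict[str, str]  # token attributes, e.g. {"word": "Soares", "sem": "93b"}


def concordances(corpus: List[List[Token]], **conditions: str) -> Iterator[str]:
    """Yield every sentence containing a token that meets all attribute conditions."""
    for sentence in corpus:
        if any(all(token.get(attr) == value for attr, value in conditions.items())
               for token in sentence):
            yield " ".join(token["word"] for token in sentence)


toy_corpus = [
    [{"word": "Soares", "sem": "93b"}, {"word": "respondeu", "sem": "93b"}],
    [{"word": "Soares", "sem": "94a"}, {"word": "viajou", "sem": "94a"}],
    [{"word": "O", "sem": "93b"}, {"word": "Japão", "sem": "93b"}],
]

for line in concordances(toy_corpus, sem="93b", word="Soares"):
    print(line)
```

Only the first toy sentence is returned, because it is the only one in which a single token has both the requested word form and the requested semester.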
acknowledgments
I am grateful to Diana for challenging me to write this article and, above all, for giving me the opportunity to start working for Linguateca in 2002. Besides learning a great deal, I met many interesting people, among them the person honoured in this book, Belinda, with whom I also had the pleasure of working.
references
de Almeida Menezes, Danielle. 2011. Discurso sobre literaturas de língua inglesa: uma análise baseada em ferramentas da linguística de Corpus. Trabalhos em Linguística Aplicada 50(1). 97–118.
Baker, Paul, Costas Gabrielatos, Majid Khosravinik, Michal Krzyzanowski, Tony McEnery & Ruth Wodak. 2008. A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse and Society 19(3). 273–305.
Bick, Eckhard. 2000. The Parsing System Palavras: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework: Aarhus University. PhD thesis.
Costa, Luís, Diana Santos & Paulo Alexandre Rocha. 2009. Estudando o português tal como é usado: o serviço AC/DC. In The 7th Brazilian Symposium in Information and Human Language Technology (STIL 2009), 150–153.
Ferro, Hugo. 2011. A construção mediática da saúde mental e da doença mental: o caso do Público e do Correio da Manhã entre 1990 e 2010: Faculdade de Letras da Universidade de Coimbra. Master's thesis.
Fróis, Luís. 1976-1984. Historia de Japam. Biblioteca Nacional de Lisboa. 5 volumes. Annotated edition by José Wicki.
Janeira, Armando Martins. 1981. Figuras de Silêncio - A Tradição Cultural Portuguesa no Japão de Hoje. Junta de Investigações Científicas do Ultramar.
Janeira, Armando Martins. 1988. O Impacto Português sobre a Civilização Japonesa. Publicações Dom Quixote 2nd edn.
Kutter, Amelia & Cathleen Kantner. 2012. Corpus-Based Content Analysis: A Method for Investigating News Coverage on War and Intervention. International Relations Online Working Paper. Stuttgart University. http://www.uni-stuttgart.de/soz/ib/forschung/IRWorkingPapers/IROWP_Series_2012_1_Kutter_Kantner_Corpus-Based_Content_Analysis.pdf.
Lobo, Mafalda. 2010. Cobertura mediática de África na imprensa europeia, no contexto da II Cimeira UE-África. http://www.bocc.uff.br/pag/silva-mafalda-cobertura-mediatica-de-africa-na-imprensa-europeia.pdf.
Magalhães, Célia. 2004. Interdiscursividade e conflito entre discursos sobre raça em reportagens brasileiras. Linguagem em (Dis)Curso 4. 35–60.
de Moraes, Wenceslau. 1993. Antologia. Vega. Texts selected by Armando Martins Janeira.
Rocha, Paulo & Diana Santos. 2000. CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa. In Maria das Graças Volpe Nunes (ed.), Actas do V Encontro para o processamento computacional da língua portuguesa escrita e falada (PROPOR), 131–140.
Rodrigues, João. 1604. Arte da Lingoa de Iapam. Collegio de Iapão da Companhia de Iesu.
Scott, Mike. 1996. Wordsmith tools. Oxford University Press.
contacts
Luís Fernando Costa
Yamaguchi University and Linguateca
luis.f.kosta@gmail.com
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 139–152. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
research in education:
(qualitative?) perspectives on
the exploration of large corpora
MIRIAM LEITE AND CLÁUDIA FREITAS
abstract
Research methods in Education usually rely on qualitative analysis, focusing on samples of individuals or small groups. On the other hand, it is well known that education also deals with large-scale issues: in Brazil, the planning of public policies must take into account the more than 50 million students enrolled in Basic Education. However, the quantitative approach is viewed with suspicion in Education, which has led to very little development of large-scale studies. Since such studies can be based on written texts, the dialogue between Education and corpus-based approaches becomes highly valuable. In this paper, we briefly present the results of two studies based on corpora specifically designed to foster educational research: (i) a corpus of blogs created and maintained by public schools; (ii) a corpus of teaching materials used in public schools. In discussing the results of these studies, we draw attention to the crucial role played by corpus tools, and to the risks and potential of the corpus-based approach in Education.
One need not be a specialist in Education to know that this field deals with questions that arise on a large scale: according to the Censo Escolar da Educação Básica (the Basic Education school census), 50.04 million enrolments were registered in 2013 in the country's public and private school systems. On the other hand, no great expertise is needed to see that the microcosm of Education must also be considered, both by academic research and by public policy. The abstraction of more than 50 million enrolments translates into lived life when each of them acquires a first name and a surname and imposes the singularity of its geographical and cultural location, family history, physical or mental disability, and so on. Research in Education is therefore interested in qualitative studies that focus on such contingencies, but also in investigations and reflections that work with massive data, which are certainly just as pertinent to the field.
However, the controversies surrounding quantitative approaches that marked academic research, especially in the 1980s and 1990s, still seem to reverberate in Education, where little development of large-scale studies can be observed
tativas (Lüdke & André 2008), it is announced right on the back cover: "Research in education is currently undergoing a phase of great development, broadening its focus of interest and its methods beyond the traditional survey-type or experimental studies that were its strongest inclinations over the last three or four decades."
However, Gatti (2004) cites studies indicating that research in Education was quite limited until then and that, within this restricted universe, only 29% worked with quantitative data. What can be observed, though, is that, with or without empirical support, a robust prejudice against quantitative studies spread through the educational field. This leads the author to find a similar picture almost a decade after the publication of that article: "everything that comes from qualitative approaches is good; what comes from quantitative approaches is bad" (Gatti 2012, p. 30).
This hinders the construction of a more consistent critique that would allow a less impassioned identification of the limits and potential of research with massive data. As a result, Education researchers are noticeably absent when such studies are carried out; these are frequently conducted by professionals from other areas, such as computing specialists, economists and business administrators.
Nevertheless, many academic voices have already spoken out to nuance this understanding and to argue against the reductionism of an a priori qualitative/quantitative antagonism. Brandão (2002), for example, in a text published more than ten years ago, argues that:
The incommensurability of social practices does not mean, however,
that quantitative approximations of phenomena cannot and should not be attempted. The quantitative/qualitative and
micro/macrosocial antagonisms are therefore unfounded; objective information and data,
as well as testimonies and in-depth interviews,
can be produced from a positivist perspective; without prior conceptualization and an a posteriori reconstruction, no research material
escapes the superficiality of bad journalism. (Brandão 2002,
pp. 28–29).
In other words, the a priori association between academic work based on large-scale empirical data and homogenizing, simplistic approaches to the social contexts studied by research in Education does not hold. Recognizing that a contingent social event is unrepeatable can lead us to study the singular, but it can also benefit from looking at a larger number of singular cases.
Santos (2014) makes another point that we consider of even greater interest for this discussion: the dichotomy between qualitative and quantitative is a false
http://www.rioeduca.net
Research project "O grêmio e outros espaços-tempos de diálogo político na escola: possibilidades contemporâneas", funded under the call "Apoio à Melhoria do Ensino em Escolas da Rede Pública Sediadas no Estado do Rio de Janeiro 2014".
Also within the research project "Diferença e desigualdade na educação escolar do jovem adolescente: desconstruções", (Romão 2014) carried out a study of the repetitions and displacements surrounding the meanings of the feminine in the workbooks distributed by SME/RJ for the final years of ensino fundamental (7th, 8th and 9th grades) under the name Cadernos Pedagógicos. This teaching material is widely used in Rio de Janeiro's public school system, since its content guides the municipal and national external assessments, which in turn determine rankings and the associated material and subjective rewards.
The workbooks for all subjects and for the four school terms of 2013 were available [9] during that period and were organized by subject into corpora containing the full content of the Cadernos Pedagógicos. Although not as large as the corpora usually found in linguistic studies, their exploration with dedicated tools once again showed the potential of this kind of approach.
Drawing on propositions by the feminist theorist Judith Butler (Butler 2003, 1997), it was understood that gender identity is constructed performatively, that is, it does not result from biological marks but from the constant and diffuse repetition of what is socially conceived as characteristic of each gender. Of interest, therefore, were not only the passages of the workbooks in which gender was explicitly addressed but also, and above all, those in which the modes of the feminine in our society were reaffirmed and/or displaced in a naturalized way. Exploring the teaching material in its entirety thus seemed especially important. Below we highlight two of the conclusions made possible by this approach, which we believe exemplify the potential of a different way of reading large text collections in educational research.
The first highlight concerns the Science corpus (ApostilasSME/RJCienc). In an exploratory reading of these workbooks, we noticed that the words brasileira/brasileiras occurred almost as often as their masculine counterparts. However, when we examined the contexts in which these words were used, by reading the concordance lines, we identified a flagrant inequality in the political and cultural value of these references.
While the feminine form qualified the population living in the country, native species and culinary practices, its masculine version recalled, in the
[9] http://www.rio.rj.gov.br/web/sme/material-pedagogico
However, as Sampson (2001) shows, an emphasis on the objectivity of data obtained from corpora is not always associated with a corpus-driven perspective, nor is the latter necessarily tied to the application of statistical tests.
references
Anthony, Laurence. 2012. AntConc (version 3.3.5). http://www.antlab.sci.waseda.ac.jp.
Arrojo, Rosemary (ed.). 1992. O signo desconstruído. Pontes.
de Beaugrande, Robert. 2002. Descriptive linguistics at the millennium: corpus data as authentic language. Journal of Language and Linguistics 1(2). 91–131.
Brandão, Zaia. 2002. Pesquisa em educação: conversas com pós-graduandos (Coleção Teologia e ciências humanas). Editora PUC-Rio.
Butler, Judith. 1997. Excitable speech. A politics of the performative. Routledge.
Butler, Judith. 2003. Problemas de gênero: feminismo e subversão da identidade. Editora Civilização Brasileira. Translated by Renato Aguiar.
Costa, Luís, Diana Santos & Paulo Alexandre Rocha. 2009. Estudando o português tal como é usado: o serviço AC/DC. In The 7th Brazilian Symposium in Information and Human Language Technology (STIL 2009), s/pp.
Davis, Claudia Leme Ferreira, Gisela Lobo Baptista Pereira Tartuce, Patrícia C. Albieri de Almeida & Ana Paula Ferreira da Silva. 2013. Os esquecidos anos finais do ensino fundamental: políticas públicas e a percepção de seus atores. In Anais da 36a Reunião Anual da ANPEd.
Freitas, Cláudia. 2014. Corpus, Linguística Computacional e as Humanidades Digitais. In Miriam Leite & Carmen Gabriel (eds.), Linguagem, Discurso, Pesquisa e Educação, 22–51. DP et Alii.
Gatti, Bernardete. 2004. Estudos quantitativos em educação. Educação e Pesquisa 30(1). 11–30.
Gatti, Bernardete. 2012. A construção metodológica da pesquisa em educação: desafios. Revista Brasileira de Política e Administração da Educação 28(1). 13–34.
Leite, Miriam. 2013. Blogs SME/RJ. http://www.ddeej.com.
Leite, Miriam. 2014. Adolescência e juventude em desconstrução: textos e contextos na educação escolar. In Miriam Leite & Carmen Gabriel (eds.), Linguagem, Discurso, Pesquisa e Educação, 281–307. DP et Alii.
Leite, Miriam. 2015. Pesquisa em educação e cibercultura: questões de metodologia e política. In Edma Oliveira & Maria Luiza Oswald (eds.), Educação, cibercultura e redes sociais em tempos de mobilidade, in press.
contacts
Miriam Soares Leite
Universidade do Estado do Rio de Janeiro
miriamsleite@yahoo.com.br
Cláudia Freitas
PUC-Rio
claudiafreitas@puc-rio.br
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 153–181. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
encadear
automatic chaining of news
CARLA ABREU, JORGE TEIXEIRA AND EUGÉNIO OLIVEIRA
abstract
This work aims at defining and evaluating different techniques to automatically build temporal news sequences. The proposed approach is composed of three steps: (i) near-duplicate document detection; (ii) keyword extraction; (iii) news sequence creation. This approach is based on Natural Language Processing, Information Extraction, Named Entity Recognition and supervised learning algorithms. The proposed methodology achieved a precision of 93.1% in the creation of news chains.
[1] introduction
Sequence of characters.
[3.1] Similarity
We approach similarity between news articles in four distinct steps: (i) normalization of the news content; (ii) identification of the elements to compare; (iii) comparison between pairs of news items; (iv) decision making.
Normalization
Text normalization is a traditional step in NLP that simplifies later analysis. We carry out the following normalization tasks:
1) Removal of punctuation symbols, such as: <,>, /, ,, (, ), -;
2) Removal of redundant patterns that are not informative in the scope of this work, such as: Lusa - Esta notícia foi escrita nos termos do Acordo Ortográfico;
3) Removal of stop-words, using a list made available by Snowball [2] (for Portuguese);
4) Reduction of words to their root using the Porter Stemmer for Portuguese, made available by PTStemmer (Oliveira 2008).
Table 1 shows an example of normalization, from the original news item to its normalized version.
[2] https://snowball.tartarus.org
Example
Original news item: Nova Deli, 02 jan (Lusa) - A Índia anunciou que vai permitir a cidadãos estrangeiros investirem no seu mercado de ações.
1 - Punctuation: Nova Deli 02 jan Lusa A Índia anunciou que vai permitir a cidadãos estrangeiros investirem no seu mercado de ações.
2 - Patterns:
3 - Stop-words:
4 - Stemming:
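The four normalization tasks can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the stop-word set is a tiny hand-made sample rather than the Snowball list, the pattern list contains a single hypothetical agency tag, and `toy_stem` is a crude stand-in for a real Portuguese stemmer such as the one in PTStemmer.

```python
import re

# Tiny illustrative stop-word sample (assumption); the article uses the Snowball list.
STOP_WORDS = {"a", "que", "vai", "no", "seu", "de"}

# Redundant patterns to drop (assumption: only the agency tag in parentheses).
PATTERNS = [re.compile(r"\(Lusa\)")]


def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a trailing plural 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def normalize(text):
    # Patterns are removed first here because this one contains punctuation.
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation removal
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]  # stop-words
    return [toy_stem(t) for t in tokens]  # stemming


print(normalize("Nova Deli, 02 jan (Lusa) - A Índia anunciou que vai permitir"))
```

Lower-casing is an extra choice made in this sketch; the article does not say whether case is preserved.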
News body: the title or the body of a news item, as isolated components, may not be sufficient to determine similarity. We identify the news lead, typically the first paragraph, as an additional element to consider when computing similarity between news items (see Figure 3). This lead often corresponds to a summary of the news item and is therefore very informative.
News Comparison
Different metrics can be used to compute similarity. In this work we consider the following: Hamming (He et al. 2004), Levenshtein (Levenshtein 1965) and Jaro (Bilenko et al. 2003).
So that the results of these metrics are comparable, they must be normalized. We apply the following formula (Expression 1) to the results returned by the edit-distance methods.
\[ D'(s, t) = 1 - \frac{D(s, t)}{\max(|s|, |t|)}, \qquad D' \in \mathbb{Q} \mid D' \in [0; 1] \tag{1} \]
Where:
D(s, t) is the distance between strings s and t given by the edit-distance metric;
max(|s|, |t|) is the length of the longer of the two strings s and t;
D'(s, t) is the normalized distance between s and t.
D' is computed for each pair of news items. The decision on similarity is taken in the next step.
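As an illustration, the normalization can be applied to a plain Levenshtein distance. The sketch below reads Expression 1 as one minus the raw distance divided by the length of the longer string, which is an assumption about the original formula; the function names are hypothetical.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized(distance, s, t):
    """Map a raw distance to [0, 1]; 1 means the strings are identical."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - distance / longest


s, t = "kitten", "sitting"
print(levenshtein(s, t), normalized(levenshtein(s, t), s, t))
```

With this reading, a higher normalized value means more similar strings, so a threshold on the value can drive the decision step.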
Relevant Expressions
Relevant expressions are n-grams that appear explicitly in the news content and that can convey, in simplified form, relevant information contained in the text.
To extract this element from the text, an intermediate step was added to the approach presented in section [3.3.1]. After normalization, a filter was applied to obtain expressions from the text. The expressions are n-grams that follow certain grammatical patterns, such as sequences of nouns (Domingos Paciência) or noun and adjective (homens encapuzados), among others.
In this case the frequency analysis is carried out over the patterns. Its result tells us which expressions are relevant for the news item in question. The last step assigns the relevant expressions to the news item.
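The frequency step can be sketched as follows. This is a simplified illustration: the grammatical-pattern filter is represented by a hypothetical `keep` predicate that accepts everything by default, and the token list is an invented example.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token sequences."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relevant_expressions(tokens, n=2, top=2, keep=lambda gram: True):
    """Most frequent n-grams that pass `keep` (a stand-in for the grammatical patterns)."""
    counts = Counter(g for g in ngrams(tokens, n) if keep(g))
    return counts.most_common(top)


# Hypothetical normalized tokens of one news item.
tokens = "homens encapuzados assaltaram banco homens encapuzados fugiram".split()
print(relevant_expressions(tokens, n=2, top=1))
```

In a fuller version, `keep` would inspect part-of-speech tags so that only patterns such as noun plus adjective survive.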
Entities
Recognizing named entities, in particular the names of personalities, is essential when extracting key terms and expressions from news.
Several resources are available for named entity recognition in Portuguese, such as those listed by Linguateca [3]. However, in the scope of this work we face a very dynamic domain, news, in which new entities constantly appear (Charlie Hebdo, Fukushima). We chose to implement a system that adapts to these characteristics.
We implemented an algorithm that first checks which words in the text begin with a capital letter. Among the words found, if the capitalized word is at the beginning of a sentence, we check whether it is a stop-word and, if so, it is discarded. For the words that pass the previous stage, we check whether they are preceded
[3] http://www.linguateca.pt/LivroSegundoHAREM/
https://store.services.sapo.pt/pt/Catalog/other/free-api-information-retrieval-verbetes
\[ D_1(a, b) = 0.3\,\frac{|k_a \cap k_b|}{\max(|k_a|, |k_b|)} + 0.7\,\frac{\sum_{i=1} \left( \sum_{j=1,\; a_j = b_i} W_{ka_j}\, W_{kb_i} \right)}{|k_a \cap k_b|} \tag{2} \]
\[ D_2(a, b) = \frac{|k_a \cap k_b|}{\max(|k_a|, |k_b|)} \cdot \frac{\sum_{i=1} \left( \sum_{j=1,\; a_j = b_i} W_{ka_j}\, W_{kb_i} \right)}{|k_a \cap k_b|} \tag{3} \]
Where:
W_{ka_j} is the weight of keyword j in cluster a;
W_{kb_i} is the weight of keyword i in cluster b;
|k_a ∩ k_b| is the number of keywords that clusters a and b have in common;
max(|k_a|, |k_b|) is the maximum number of distinct keywords.
The distances D1(a, b) and D2(a, b) take into account the percentage of terms shared by the two clusters and the relation between the weights that the shared terms have in their clusters. D1(a, b) weights the two terms of the sum, giving more prominence to the one that measures the relation between the weights of the shared words; in D2(a, b) no weights are attached to the two terms, which are instead related to each other.
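A minimal sketch of the two distances follows, under explicit assumptions: the weights of a shared keyword are multiplied, the weight term is averaged over the shared keywords, and D2 is the product of the overlap term and the weight term. These readings, the function names and the sample clusters are assumptions, not a statement of the authors' implementation.

```python
def overlap_term(wa, wb):
    """Share of common keywords relative to the larger keyword set."""
    largest = max(len(wa), len(wb))
    return len(wa.keys() & wb.keys()) / largest if largest else 0.0


def shared_weight_term(wa, wb):
    """Mean product of the weights of keywords present in both clusters (assumed reading)."""
    shared = wa.keys() & wb.keys()
    if not shared:
        return 0.0
    return sum(wa[k] * wb[k] for k in shared) / len(shared)


def d1(wa, wb):
    # Weighted sum: 0.3 on the overlap, 0.7 on the weight relation.
    return 0.3 * overlap_term(wa, wb) + 0.7 * shared_weight_term(wa, wb)


def d2(wa, wb):
    # No explicit weights: the two terms are combined by a product (assumption).
    return overlap_term(wa, wb) * shared_weight_term(wa, wb)


# Hypothetical clusters mapping keyword -> weight.
a = {"india": 0.5, "mercado": 0.5}
b = {"india": 1.0, "bolsa": 0.2, "acoes": 0.4}
print(d1(a, b), d2(a, b))
```

Both values grow with the overlap and with the weights of the shared keywords, so larger values indicate more closely related clusters under these assumptions.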
A different approach was used to compute the similarity between relevant expressions. In this case, normalization included an additional step, the removal of stop-words. After this task, a string was built with all the expressions belonging to each cluster; for this type of relevant term the weight is not considered. The similarity between expressions was computed with an edit-distance algorithm, q-grams (Ullmann 1977) (q = 3).
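For illustration, one common formulation of a q-gram distance compares the counts of all substrings of length q in the two strings. The sketch below uses that formulation with q = 3; it is an assumption that it matches the variant used in this work, and the function names are hypothetical.

```python
from collections import Counter


def qgram_profile(text, q=3):
    """Count every substring of length q."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(s, t, q=3):
    """Sum of absolute differences between the two q-gram count profiles."""
    ps, pt = qgram_profile(s, q), qgram_profile(t, q)
    return sum(abs(ps[g] - pt[g]) for g in ps.keys() | pt.keys())


print(qgram_distance("domingos paciencia", "domingo paciencia"))
```

Because only local substrings are compared, the measure is tolerant to reordering of expressions inside the concatenated string, which suits strings built from unordered sets of expressions.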
Figure 7 shows that most groups consist of 2 similar news items. The number of groups decreases as the number of news items per group increases.
exist. Recall indicates, in this context, the rate of duplicate news items found relative to those that actually exist but that we were unable to identify manually. The F1 measure relates precision and recall. Accuracy gives an overall assessment of the system.
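These measures follow the standard definitions, which can be written down directly. In the sketch below the function names and the counts in the usage lines are illustrative; the precision and recall passed to `f1` are the values reported for Levenshtein in experiment 1.1 of Table 5.

```python
def precision(tp, fp):
    """Share of detected duplicates that are real duplicates."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of real duplicates that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def accuracy(tp, tn, fp, fn):
    """Share of all decisions that were correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


print(round(f1(0.941, 0.761), 3))  # matches the F value reported for that row
```

Accuracy can stay very high even when recall is modest, because the vast majority of news pairs are not duplicates and are trivially classified as such.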
The evaluation of relevant terms focused on determining which of the extracted terms are in fact representative of the news item. The evaluation was carried out using Expression 4. The overall evaluation of the system is given by the sum of the percentages of representative terms over the news items analysed, divided by the number of news items (Expression 5).
\[ E(n_i) = \frac{\mathit{RepresentativeTerms}}{\mathit{AssignedTerms}} \tag{4} \]
\[ \mathit{Evaluation} = \frac{\sum_{i=1}^{\|N\|} E(n_i)}{\|N\|} \tag{5} \]
Where:
RepresentativeTerms is the number of relevant terms or entities assigned by the method that really represent the news content;
AssignedTerms is the total number of relevant terms or entities assigned to the document;
||N||: number of news items in collection N;
n_i: the news item with index i in the set of news items N.
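Expressions 4 and 5 amount to a per-item ratio followed by an average. A minimal sketch, with hypothetical function names and invented judgement counts:

```python
def term_score(representative, assigned):
    """Expression 4: share of assigned terms judged representative."""
    return representative / assigned if assigned else 0.0


def overall_score(per_news):
    """Expression 5: mean of the per-news scores over the collection."""
    return sum(term_score(r, a) for r, a in per_news) / len(per_news)


# Hypothetical judgements: (representative terms, assigned terms) per news item.
judgements = [(3, 4), (5, 5), (1, 2)]
print(overall_score(judgements))
```

Averaging per news item gives every item the same influence regardless of how many terms were assigned to it.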
Algorithms    t       f       c
LHJ           0.60    0.60    0.60
LHJ           0.70    0.60    0.60
LHJ           0.70    0.70    0.60
LHJ           0.70    0.70    0.70
LHJ           0.80    0.70    0.70
LHJ           0.80    0.80    0.70
LHJ           0.80    0.80    0.80

Words            D1    D2    D1    D1
Entities         D2    D2    D1    D2
Personalities    D1    D1    D1    D2
[6.1] Experiments
Similarity - Edit Distance Algorithms
The results obtained in this experiment (Exp1) can be seen in Table 5. We excluded from this table the results for the Jaro algorithm, because its performance was constant.
        Levenshtein                Hamming
Exp     P       R       F          P       R       F
1.1     0.941   0.761   0.841      0.941   0.289   0.442
1.2     0.950   0.655   0.775      0.940   0.284   0.436
1.3     0.951   0.645   0.769      0.940   0.284   0.436
1.4     0.972   0.637   0.770      0.940   0.284   0.436
1.5     0.965   0.507   0.665      0.939   0.279   0.430
1.6     0.964   0.483   0.643      0.939   0.279   0.430
1.7     0.962   0.463   0.625      0.938   0.279   0.430
figure 10: F1 values obtained by the different algorithms over the different time intervals.
                 P       R       F1      A
Decision Tree    0.863   0.679   0.760   0.998
SVC              0.931   0.508   0.657   0.997
SVC Linear       0.938   0.561   0.702   0.998
Random Forest    0.803   0.542   0.647   0.998
table 6: Average of the evaluation metrics obtained with k-fold cross-validation.
Evaluation       0.732   0.762   0.804

SVC              0.931   0.921   0.906   0.931
Decision Tree    0.849   0.821   0.764   0.834
Random Forest    0.859   0.852   0.824   0.858
call, which means that it detects more cases than Hamming. One reason for this is a particularity of the latter algorithm, namely that it compares strings of the same length; Levenshtein also achieves a better result on the F1 metric. From the analysis of these three algorithms we can conclude that Levenshtein is the most suitable algorithm for computing similarity between pairs of news items.
We developed a web interface that lets the reader navigate between news chains. The interface can be seen in Figure 11.
The interface is composed of five distinct sections. The first section lets the user define the characteristics of the news chains to display: the time interval, the news category and the keywords. The second section tells the user which characteristics the stories shown in the interface have.
The stories are represented visually in the third section. The chart representing the stories can be split into three interconnected elements. Starting from the bottom of the chart, in 3.3, the lines represent the existing news clusters. The length of these bars varies with the number of news items in each cluster. At the top of the chart, in 3.1, the arcs represent the existing links between the
To detect duplicate news items we used a semantics-based approach to computing similarity between news items. A supervised learning algorithm was also used to determine their resemblance. In addition, news items carry temporal information and, as we expected, there is an interval in which news items whose degree of similarity points to (near) duplication are more likely to appear. Our study indicated that news items considered duplicates tend to appear within an interval of less than 24 hours. Our approach for determining news items whose degree of similarity classifies them as (near) duplicates, within a 24-hour interval, achieved a precision of 93.8% when the Levenshtein and SVC Linear pair was used.
To create links between groups of similar news items, our approach measured the degree of resemblance between the different groups. For this step, we propose a new distance measure that takes into account the shared terms and the weight of each term in the clusters of similar news items. Supervised learning algorithms were also used to determine the links. The approach proposed for this second task has a precision of 93.1%. This result does not, however, represent the overall precision of the system, since error propagates between the various stages.
As future work, it will be important to create more exhaustive and objective tests for the news chains. Such tests will consist, among other improvements, of measuring the reader's familiarity with a specific topic before and after using the platform and of measuring the error propagated by the system.
We also intend to improve the system through: (i) the introduction of news summaries, (ii) the detection of new facts and (iii) the organization of news items into a hierarchy.
acknowledgments
We thank Labs SAPO UP for their collaboration in making available the data used in this work.
references
Allan, James, Jaime G. Carbonell, George Doddington, Jonathan Yamron & Yiming Yang. 1998a. Topic detection and tracking pilot study final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, 194–218.
Allan, James, Ron Papka & Victor Lavrenko. 1998b. On-line new event detection and tracking. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, 37–45. ACM.
Banerjee, Somnath, Krishnan Ramanathan & Ajay Gupta. 2007. Clustering short texts using Wikipedia. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 07, 787–788. ACM.
Bilenko, Mikhail, Raymond Mooney, William Cohen, Pradeep Ravikumar & Stephen Fienberg. 2003. Adaptive Name Matching in Information Integration. IEEE Intelligent Systems 18(5). 16–23.
Elmagarmid, Ahmed K., Panagiotis G. Ipeirotis & Vassilios S. Verykios. 2007. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19(1). 1–16.
Garcia, Marcos & Pablo Gamallo. 2013. FreeLing e TreeTagger: um estudo comparativo no âmbito do Português. Technical report. ProLab Technical Report, vol. 01. http://gramatica.usc.es/~gamallo/artigos-web/PROLNAT_Report_01.pdf.
He, Matthew X., Sergei V. Petoukhov & Paolo E. Ricci. 2004. Genetic code, Hamming distance and stochastic matrices. Bulletin of mathematical biology 66(5). 1405–1421.
contacts
Carla Abreu
Faculdade de Engenharia da Universidade do Porto
cfma@fe.up.pt
Jorge Teixeira
Faculdade de Engenharia da Universidade do Porto
jft@fe.up.pt
Eugénio Oliveira
Faculdade de Engenharia da Universidade do Porto
eco@fe.up.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 183–206. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
Finding people with similar interests within a domain can be an important aid in managing research centres. Since academic output is easily obtained from bibliographic and academic databases, these can be used to discover affinities between researchers that are not already evident from co-authorship. This discovery process relies on text analysis techniques, based on the terms used in the respective documents. Affinity can be represented as a network in which the nodes represent the articles of each researcher and the links represent similarity between the different researchers. Each node can be characterized through several network centrality measures, and community detection algorithms make it possible to identify groups with similar interests. Each node is further characterized by a set of keywords and summaries discovered automatically with the help of advanced techniques. This article gives more details on the methods adopted and/or developed, some of which were implemented in our prototype. The methods described are general and applicable to many different domains, including documents describing R&D projects, documents associated with legislation, court cases or medical procedures. We therefore believe that this work can be useful to a relatively wide audience.
[1] introduction
This section presents the main steps undertaken to uncover previously unknown information about affinities. The method involves the following steps:
(i) Identification of institutions and retrieval of researchers' names;
(ii) Use of web/text mining to process researchers' publications;
(iii) Elaboration of a similarity matrix and its visualization as a graph;
(iv) Discovery of potential communities linked by affinities;
(v) Elaboration of a co-authorship graph and differential analysis of graphs;
(vi) Identification of important nodes (researchers) in the graph;
(vii) Characterization of nodes using keywords.
The details of all these steps are given in the following sub-sections. Additional functionalities that are not part of the implemented prototype include:
(i) Characterization of nodes using summaries;
(ii) Learning to generate shortened sentences for summaries.
The details of these additional functionalities are given in section [3].
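To make step (iii) concrete, the sketch below builds a similarity matrix from raw term counts with cosine similarity and keeps the pairs above a threshold as graph edges. It is a minimal illustration only: the term weighting, the threshold of 0.5, the function names and the three researchers are assumptions, and the prototype itself is described as being written in R.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(docs):
    """Return sorted names and the pairwise similarity matrix of their texts."""
    names = sorted(docs)
    vectors = {n: Counter(docs[n].lower().split()) for n in names}
    return names, [[cosine(vectors[a], vectors[b]) for b in names] for a in names]


# Hypothetical researchers and the concatenated text of their publications.
docs = {
    "ana": "text mining of news",
    "rui": "text mining of patents",
    "eva": "protein folding",
}
names, matrix = similarity_matrix(docs)
edges = [(names[i], names[j])
         for i in range(len(names))
         for j in range(i + 1, len(names))
         if matrix[i][j] > 0.5]  # threshold is an assumption
print(edges)
```

The resulting edge list is the input that a graph library would need for the visualization and community detection steps.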
not match the name used in the bibliographic database. Also, as researchers may
have several variants of their name, several entries may exist in the bibliographic
database for the same researcher. So these issues need to be resolved.
It could be argued that the researchers' names need not be retrieved from the web pages of a particular research institution / R&D center, as they appear in the articles. This approach has, however, the disadvantage that the set of research institutions / R&D centers would grow as more articles are encountered and processed. We prefer to restrict the number of R&D centers to a certain pre-defined set.
Another problem is that we may have several investigators with the same name in the bibliographic database. One of the techniques used by Bugla (2009) is the following. To determine whether a given publication of an author P′ in some bibliographic database should be attributed to person P on a given site, a check is made whether both (i.e. P and P′) have the same home institution. Various other researchers have investigated how to determine whether several variants of one name belong to the same author, and various methods have been proposed (e.g. Santos & Ribeiro 2011).
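The home-institution check can be sketched as follows. This is a minimal illustration under stated assumptions: the name-compatibility rule (same surname, same first initial) and the record fields `name` and `institution` are hypothetical and are not taken from Bugla (2009).

```python
import unicodedata


def normalize_name(name):
    """Lower-case a name, strip diacritics and periods, and split it into tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    return ascii_only.lower().replace(".", "").split()


def names_compatible(a, b):
    """Hypothetical rule: same surname and same first initial."""
    ta, tb = normalize_name(a), normalize_name(b)
    if not ta or not tb:
        return False
    return ta[-1] == tb[-1] and ta[0][0] == tb[0][0]


def same_person(site_person, db_author):
    """Attribute a database author to a site person only if the names are
    compatible AND both records share the same home institution."""
    return (
        names_compatible(site_person["name"], db_author["name"])
        and site_person["institution"] == db_author["institution"]
    )
```

With this rule, a database entry "J. Gama" is attributed to the site person "João Gama" only when both records list the same institution; a namesake at another institution is rejected.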
As the bibliographic database we chose Authenticus, which was developed by the University of Porto, because it retrieves publications from several other bibliographic databases (incl. SCOPUS, Google Scholar, ISI Web of Science, DBLP and Orcid). In the work reported here, we were able to skip many of the web/text mining steps just described, as we were provided with a database that included all the relevant information.
figure 1: Researcher affinity network for R&D center LIAAD of INESC Tec
Visualization tools
Visualization tools play an important role in data analysis, as visual information
organization enables the analyst/user to interpret and detect patterns or other
relevant information faster and more effectively. This requires developing tools
that show the information in an intuitive and interactive way.
The web application prototype we developed, Affinity Miner [2], is based primarily on the R language and an appropriate set of packages. We used R as the platform both for the implementation and for representing the data, which needs to be conveniently indexed.
For this task we use the shiny package (RStudio, Inc 2014), a web application framework. In this way, our web application can react instantly to user inputs by changing the output displayed to the user.
[2] See http://gallicyadas.pt/affinity-miner/.
Another requirement is that the output be available in remote locations, using standardized frameworks and software (e.g. HTML, JavaScript etc.). The best way of doing this is to present the output, including network graphs, in a web browser. For this task we chose the sigma.js library, a JavaScript library dedicated to graph drawing (Jacomy 2013). It enables networks to be displayed on web pages and can be used to integrate network exploration into rich web applications.
figure 2: Researcher affinity network for the 5 R&D centers of INESC Tec and identified communities
(2014). This involves constructing a graph that represents basically the difference
between the two graphs.
The following two figures illustrate this. Figure 3 shows a part of the co-authorship graph that includes some researchers of LIAAD. Figure 4 shows a part of the differential graph resulting from the differential analysis. It shows all the affinity links that do not have a corresponding link in the co-authorship graph.
For example, we note that Márcia Oliveira has just one co-authorship link, with João Gama, while the differential graph shows three other affinity links, to Alípio Jorge, Pedro Campos and Pedro Quelhas Brito. These links have been revealed by the differential analysis. Such links may be of interest firstly to the researchers involved, but also to the management when creating new teams for a new project.
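The differential analysis amounts to a set difference over undirected edges. The sketch below uses the names from the example above; the edge lists are a toy reconstruction for illustration, and the affinity link between Márcia Oliveira and João Gama is an assumption.

```python
def undirected(edges):
    """Represent each edge as a frozenset so that (a, b) equals (b, a)."""
    return {frozenset(edge) for edge in edges}


def differential_graph(affinity_edges, coauthorship_edges):
    """Affinity links that have no corresponding co-authorship link."""
    return undirected(affinity_edges) - undirected(coauthorship_edges)


# Toy data built from the example in the text (illustrative only).
affinity = [
    ("Márcia Oliveira", "João Gama"),
    ("Márcia Oliveira", "Alípio Jorge"),
    ("Márcia Oliveira", "Pedro Campos"),
    ("Márcia Oliveira", "Pedro Quelhas Brito"),
]
coauthorship = [("João Gama", "Márcia Oliveira")]

hidden_links = differential_graph(affinity, coauthorship)
```

Here `hidden_links` holds the three affinity links that are not backed by a joint publication, which are the candidates for new collaborations.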
Pavel Brazdil (R): Data Mining and Decision Support; Algorithm Selection via Metalearning and Planning; Meta-Learning; Web Mining, Text Mining and Web Intelligence; Artificial Intelligence
Pavel Brazdil (S): classification algorithm; logic programming; inductive logic programming; knowledge discovery; data mining; artificial intelligence
João Gama (R): Data Mining and Decision Support; Knowledge Discovery from Data Streams; Artificial Intelligence
João Gama (S): data stream; decision tree; change detection; knowledge discovery; data mining; sensor network; artificial intelligence; classification algorithm; computer science; sensor data; decision support system
table 1: Comparison of the automatically selected keywords (S) with their real keywords (R) obtained from web pages
Table 1 shows that several keywords agree well with the real ones, identified by the researchers on their web pages. It appears that the real expressions
are more meaningful and would lead to better thematic assessment. In this area
it is important to avoid both too general keywords (e.g. computer science) and
too specific ones. This reveals the need for further studies in this area, which is
related to the problem of summarization using short sentences or snippets discussed next.
[3] characterization of nodes (researchers) using summaries
[3.1]
The training data for supervised methods takes the form of a list of sentences $S_1, \ldots, S_m$, each characterized by a set of $n$ features and a score, which represents the target variable.
$$\langle S_{1i}, f_{11i}, \ldots, f_{1ni}, \mathit{score}_{1i} \rangle$$
$$\langle S_{2i}, f_{21i}, \ldots, f_{2ni}, \mathit{score}_{2i} \rangle$$
$$\vdots$$
$$\langle S_{mk}, f_{m1k}, \ldots, f_{mnk}, \mathit{score}_{mk} \rangle$$
Machine Learning $\rightarrow$ $\mathit{Model}_u$
figure 6: Training data for creating a model for a given document set DS
This scheme is illustrated in Figure 6. The index i (or k) represents a particular document, and the index u a particular human summarizer who supplied the gold-standard summaries.
Various features have been proposed in the past. The features of Ouyang et al. (2011) included sentence length without stop-words, sentence position, average tf-idf and sentence-to-query similarity, among others.
Valizadeh & Brazdil (2014) enriched this set with features derived from the graph-based representation, such as the sum of similarities between the current sentence and other sentences, the number of nonzero links and the sentence rank of T-LexRank, among others, which led to marked improvements in the quality of summaries.
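To make the training-data format concrete, the sketch below computes three of the features named above for each sentence and pairs them with a score. It is an illustration only: the stop-word list is a tiny placeholder, Jaccard overlap stands in for the unspecified sentence-to-query similarity, and the scores are assumed to be supplied from gold-standard summaries.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the one used by the authors).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def content_words(text):
    """Lower-cased alphabetic tokens with stop-words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def sentence_features(sentences, index, query):
    """Feature vector for one sentence of a document."""
    words = content_words(sentences[index])
    query_words = set(content_words(query))
    overlap = len(set(words) & query_words)
    union = len(set(words) | query_words)
    return [
        len(words),                            # sentence length without stop-words
        index / max(len(sentences) - 1, 1),    # relative sentence position
        overlap / union if union else 0.0,     # Jaccard similarity to the query
    ]


def training_rows(sentences, query, scores):
    """Rows of the form (sentence, [f1, ..., fn], score)."""
    return [
        (sentence, sentence_features(sentences, i, query), scores[i])
        for i, sentence in enumerate(sentences)
    ]
```

Rows of this shape, collected over all documents of a document set, are what a supervised learner would take as input to produce a model that predicts sentence scores.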
Enhancing the coherence of summaries by detecting actor-object relationship (AOR) between sentences
Ideally, the sentences selected for the summary on the basis of their scores should be coherent and complement each other in meaning. One way to model this is to detect a special case of direct anaphora, which was studied by Valizadeh & Brazdil (2015). This occurs when one sentence introduces an object that plays the role of an actor or subject in another sentence. This relationship is referred to, for short, as the actor-object relationship (AOR). Sentences that satisfy this relationship have their score enhanced.
To be able to do this, it is necessary to use a parser. The authors opted for the Stanford dependency parser, as it is freely available (de Marneffe et al. 2006). The parser returns, for each sentence, a set of relations of the type tag(ti, tj), where tag characterizes the relationship between the terms ti and tj. The tags exploited in this work include dobj, representing the direct object of the verb, and nsubj, representing a nominal subject/actor of the action. One example of a relation is dobj(seize 47, compound 51). The items 47 and 51 are identifiers indicating where the words seize and compound appear in the parse tree.
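A minimal sketch of AOR detection from such relations is given below. It assumes that the relations are available as strings in the form shown above (a hyphen between word and identifier is also accepted) and that comparing word forms is enough to link an object to a later subject; the function names are hypothetical and this is not the authors' implementation.

```python
import re

# tag(head id, dependent id); the word/id separator may be a space or a hyphen.
RELATION = re.compile(
    r"(\w+)\(\s*([^\s,()-]+)[-\s](\d+)\s*,\s*([^\s,()-]+)[-\s](\d+)\s*\)"
)


def parse_relation(text):
    """Parse one relation string into (tag, (head, id), (dependent, id))."""
    match = RELATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a dependency relation: {text!r}")
    tag, head, head_id, dep, dep_id = match.groups()
    return tag, (head, int(head_id)), (dep, int(dep_id))


def direct_objects(relations):
    """Words that occur as the dependent of a dobj relation."""
    return {dep for tag, _, (dep, _) in relations if tag == "dobj"}


def subjects(relations):
    """Words that occur as the dependent of an nsubj relation."""
    return {dep for tag, _, (dep, _) in relations if tag == "nsubj"}


def has_aor(first, second):
    """True if an object introduced in `first` acts as a subject in `second`."""
    return bool(direct_objects(first) & subjects(second))
```

For instance, if one sentence yields dobj(seize 47, compound 51) and another yields an nsubj relation whose dependent is compound, `has_aor` returns True for that ordered pair, and the scores of both sentences could then be enhanced.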
summary with human summaries and the latter tend to be more coherent than
the ones generated previously by automatic methods.
figure 8: The pipeline architecture for learning sentence reduction rules from
web news text.
figure 10: Two sentence reduction cases with three kinds of features highlighted.
The learning process yields a relatively large set of reduction rules, which can then be applied to new sentences. A combination, or even a composition, of several reduction rules can be applied to a single sentence. The reduction rules incorporate different conditions, for example a restriction on the length of the eliminated segment. In addition, the reduced version should remain grammatical. To check this we use statistical lexical and syntactic models, automatically constructed from corpora. Two examples of generated rules are shown in Figure 11.
Conclusions
We have presented a framework that uncovers research communities, real or potential, based on their scientific production. This is done by retrieving publication titles for a given set of researchers, representing them in corresponding text files and constructing a similarity matrix. This in turn can be used to construct a network of affinities.
Further processing leads to representations in the form of graphs. Community detection algorithms are used to uncover sub-graphs representing real or potential communities. These can be compared to the formal organizational structure.
In our prototype we have devoted special attention to the visualization of the graph of communities, as well as to the characterization of its nodes (researchers). For this we have reused existing automatic techniques for selecting relevant keywords from texts.
Further steps involve differential analysis based on the affinity and co-authorship graphs. This analysis enables us to identify people who could potentially benefit from working together.
Future work
In the future we intend to process abstracts or even full articles. We will also consider a substantially higher number of research centers and thus include more researchers. This poses some challenges for the construction of the similarity matrix and the corresponding network. To overcome these, we plan to use incremental / data-streaming approaches (Gama 2010).
It would also be useful to incorporate into our prototype certain techniques of update summarization explored recently by Costa (2014), who is a member of our group. This would make it possible to determine in what way a particular node differs from others.
As was shown earlier our current prototype is capable of characterizing each
node with a set of keywords. In sections [3.1] and [3.2] we have discussed some
aspects of our research in the area of automatic summarization. So far, these
techniques have been implemented in the form of stand-alone programs. We plan
to incorporate them in our prototype (Affinity Miner). This will lead to a more
comprehensive characterization of nodes (researchers), which may be of interest
to users.
A validation step needs to be added to our methodology. We plan to carry out a survey among some of the researchers included in our study, asking them which colleagues conduct research most similar to their own. The outcome will be compared to the predictions obtained from the graph generated by our system.
An important issue that could be addressed stems from the fact that different researchers may use different vocabulary/terminology to describe the same concepts. This happens frequently when the researchers belong to different communities. This problem is difficult to overcome. It is possible to use, as others have done, WordNet and DBpedia (Leal et al. 2012) to identify synonyms and related terms. This may be difficult for some specific domains, which may require the use of specific dictionaries, or the use of techniques that can identify potential synonyms (e.g. Grigonytė et al. 2010).
Another line of research that we will follow exploits linguistic knowledge. We note that sentence reduction can be attained by transforming an adverbial finite clause into a prepositional or adverbial phrase or a non-finite clause. Consider, for instance, quando anoiteceu = à noite. In this example, the number of words is the same but the number of characters has been reduced, yielding a simpler and equivalent expression. Another example is the transformation of a relative clause into a gerundive or participial clause (e.g. as garrafas que continham cerveja = as garrafas contendo cerveja). Since the same relations of meaning can be inferred in different types of structures, it is possible to use shorter sequences to convey the same meaning as the longer ones. To account for different semantic values of sentences, we will use a theoretical framework
acknowledgments
This work has been partially funded by FCT/MEC through PIDDAC and ERDF/ON2 within project NORTE-07-0124-FEDER-000059 and through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project FCOMP-01-0124-FEDER-037281.
We wish to thank Fernando Silva and his collaborators, who are responsible for the Authenticus bibliographic database, for providing us with the data needed for this study: titles of publications of INESC Tec researchers.
We also wish to thank our colleagues at FLUP who carry out research in linguistics (Fátima Oliveira, M. da Purificação Silvano and António Leal) for their interest in abstractive summarization and their willingness to contribute. This may open up possibilities for interesting new advances in the future.
references
Asher, Nicholas & Alex Lascarides. 2003. Logics of Conversation. Cambridge University Press.
Bugla, Sylwia. 2009. Name identification in scientific publications. University of Porto MSc thesis.
Carbonell, Jaime & Jade Goldstein. 1998. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 335–336.
Choobdar, Sarvenaz, Pedro Ribeiro, Sylwia Bugla & Fernando Silva. 2012. Comparison of Co-authorship Networks Across Scientific Fields Using Motifs. In Proceedings of the International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 147–152.
Clarke, James & Mirella Lapata. 2006. Constraint-based Sentence Compression: an Integer Programming Approach. In Proceedings of the COLING/ACL, 144–151.
Cohn, Trevor & Mirella Lapata. 2008. Sentence Compression Beyond Word Deletion. In Proceedings of the 22nd International Conference on Computational Linguistics, 137–144.
Cohn, Trevor & Mirella Lapata. 2009. Sentence Compression As Tree Transduction. Journal on Artificial Intelligence Research 34(1). 637–674.
Cordeiro, João, Gaël Dias & Guillaume Cleuziou. 2007a. Biology Based Alignments of Paraphrases for Sentence Compression. In Proceedings of the Workshop on Textual Entailment and Paraphrasing, 177–184.
Cordeiro, João, Gaël Dias & Pavel Brazdil. 2007b. New Functions for Unsupervised Asymmetrical Paraphrase Detection. Journal of Software 2(4). 12–23.
Cordeiro, João, Gaël Dias & Pavel Brazdil. 2013. Rule induction for sentence reduction. In Luís Correia, Luís Paulo Reis & José Cascalho (eds.), Progress in Artificial Intelligence, vol. 8154, 528–539. Springer.
Costa, Vitor. 2014. Update Summarization. Universidade do Porto MSc thesis.
Erkan, Güneş & Dragomir R. Radev. 2004. LexRank: Graph-based Lexical Centrality As Salience in Text Summarization. Journal on Artificial Intelligence Research 22(1). 457–479.
Feldman, Ronen & James Sanger. 2007. Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
Galley, Michel & Kathleen McKeown. 2007. Lexicalized Markov Grammars for Sentence Compression. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, 180–187.
Gama, João. 2010. Knowledge Discovery from Data Streams. Chapman & Hall/CRC.
Grigonytė, Gintarė, João Cordeiro, Gaël Dias, Rumen Moraliyski & Pavel Brazdil. 2010. Paraphrase Alignment for Synonym Evidence Discovery. In Proceedings of the 23rd International Conference on Computational Linguistics, 403–411.
Iacobucci, Dawn. 1994. Graphs and Matrices. In Social Network Analysis, 92–166. Cambridge University Press.
Jacomy, Alexis. 2013. sigma js. http://sigmajs.org.
Knight, Kevin & Daniel Marcu. 2002. Summarization Beyond Sentence Extraction: A Probabilistic Approach to Sentence Compression. Artificial Intelligence 139(1). 91–107.
contacts
Pavel Brazdil
LIAAD-INESC Tec; FEP, Univ. of Porto
pbrazdil@inescporto.pt
Luís Trigo
LIAAD-INESC Tec
lptrigo@inescporto.pt
João Cordeiro
LIAAD-INESC Tec; Univ. of Beira Interior
jpaulo@di.ubi.pt
Rui Sarmento
LIAAD-INESC Tec
rui_sarmento@hotmail.com
Mohammadreza Valizadeh
LIAAD-INESC Tec; Univ. of Ilan
valizadehmr@gmail.com
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 207–222. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
This paper describes two machine translation tasks that require language
expertise: (1) paraphrasing as a technique to prepare texts for translation
and a method for linguistic quality assurance, and (2) the evaluation of translation produced by machine translation systems. These tasks will be exemplified through support verb constructions, a subtype of multiword units
that machine translation systems have difficulty translating. The paper raises awareness of the need to integrate enhanced linguistic knowledge in machine translation systems and the need to place the human factor as a core
value in order to ensure translation quality.
[1] introduction
anabela barreiro
only marginally in machine translation research. The still poorly explored integration of linguistic resources into essentially statistical systems is largely responsible for the gross errors found in the translations produced by online machine translation systems, which prevents these translations from being used for commercial purposes without a significant post-editing effort. In the case of grammar-based systems, the lack of linguistic resources to feed the systems' databases also creates serious gaps, mostly lexical in origin. It is not yet known which hybrid approach will be the most effective in the long run and lead to superior translation quality.
While researchers seek to advance the state of the art and improve the technology by creating and developing systems that translate better and better, machine translation is a reality that can no longer be ignored in the world of professional translation either, and it is now part of translators' training and curricula (Maia 2005). Despite results that are still not very reliable, machine translation is becoming part of the daily routine of a growing number of clients and markets, which make up for its shortcomings by training systems on specific domains, using corpora based on texts professionally translated for those domains (Bick & Barreiro 2015), and by using automatic post-editing tools on machine-translated texts (Vieira & Specia 2011). Consequently, in the sphere of professional translation, human intervention is essential in the process of correcting machine translation and certifying its linguistic quality control. Another form of intervention, which has been less explored from the standpoint of natural language processing, is paraphrasing used as a pre-editing technique for the source-language text, sometimes leading to a controlled language used in technical and scientific texts. We wish to stress here that high-quality machine translation will not be achievable without the human factor, namely without the intervention of specialists in the languages involved in the translation and their participation in the tasks that target the quality of both the text to be translated and the translated text.
This article presents two important machine translation tasks that require the participation of experts with deep linguistic knowledge of the languages of translation. The first task consists of paraphrasing as a method of preparing the source-language text, so as to ensure better translation quality for that text. The second task is the evaluation of the translation produced by machine translation systems. Both tasks will be exemplified through support verb constructions, a type of multiword lexical unit that current machine translation systems are unable to translate with quality.
[2] support verb constructions in machine translation
between the elements even when they are far apart in the sentence. For example, deu [muitos e longos] passeios pela [N] or não fez [absolutamente nenhum] comentário sobre [N] represent non-adjacent support verb constructions, with insertions between the support verbs dar and fazer and the non-verbal predicates passeios and comentário, respectively. An insertion is any word that occurs between two elements of the multiword lexical unit, unless that word is a definite or indefinite article preceding a predicate noun. In general, the more insertions and morphosyntactic variability a support verb construction has, the harder it is to translate automatically. The studies already cited also mention the linguistic variety displayed by stylistic or paraphrastic variants (fazer um estudo = realizar/efetuar/desenvolver um estudo or fazer um trabalho = elaborar um trabalho, among others), which use non-elementary support verbs (Ranchhod 1990). These stylistic variants may show different degrees of variability, ranging from constructions that allow a considerably large number of insertions between the support verb and the predicate noun to semi-fixed or fully fixed idiomatic expressions (dar o braço a torcer = ceder) [1]. Non-adjacent support verb constructions are difficult to process, align and translate, and remain one of the greatest contrastive challenges for machine translation systems.
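The definition of an insertion can be made concrete with a small sketch. It assumes a tokenized sentence in which the positions of the support verb and the predicate noun are already known, and it reads the exception as applying to an article immediately before the predicate noun; the article list and the function name are illustrative assumptions.

```python
# Portuguese definite and indefinite articles.
ARTICLES = {"o", "a", "os", "as", "um", "uma", "uns", "umas"}


def insertions(tokens, verb_index, noun_index):
    """Words between the support verb and the predicate noun, not counting
    an article that directly precedes the predicate noun."""
    between = tokens[verb_index + 1:noun_index]
    if between and between[-1].lower() in ARTICLES:
        between = between[:-1]
    return between


# "cidade" is a placeholder filling the [N] slot of the example.
walks = "deu muitos e longos passeios pela cidade".split()
comment = "não fez absolutamente nenhum comentário".split()
study = "fez um estudo".split()

walk_insertions = insertions(walks, 0, 4)
comment_insertions = insertions(comment, 1, 4)
study_insertions = insertions(study, 0, 2)
```

The first two constructions have three and two insertions respectively, while fez um estudo has none, since um is an article before the predicate noun. A count of this kind gives a rough indicator of how hard a construction may be to align and translate.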
[3] the human factor in linguistic quality control
Ever since statistical machine translation systems began to be trained on large amounts of data, namely on the millions upon millions of parallel corpora available on the internet, the effect of grammatical errors has started to fade, with a gradually smaller impact on translations that are increasingly robust from a lexical standpoint. In commercial translation, the lower costs involved in post-editing justify the use of machine translation, and an important role played by translators has been the correction of grammatical errors in machine-translated texts. However, many of the linguistic problems of machine translations are rooted in the poor quality of the source-language text. In general, linguistic quality control of source-language texts has been relegated to the background, and there are no robust tools that support text editing and revision through paraphrasing. To address this, in previous work we presented a scientific approach based on paraphrasing that aims to improve machine translation (Barreiro 2009), stressing the need
[1] Idiomatic expressions are understood here as non-transparent expressions, not understood/translated literally, in which the meaning of the expression differs from the individual meanings of the words that compose it. Idiomaticity can be regarded as a gradient, ranging from the slightly non-literal to the very obscure. Some idiomatic expressions take on a figurative value that is known only through common usage; others end up fossilizing over time.
[3.1]
free syntactic constructions, such as the coordination of noun phrases and the passive, among others. The linguistic information relevant to the construction of the paraphrases generated (as a result of that research) was formalized in dictionaries and grammars developed in the NooJ linguistic environment and used in several natural language processing tasks, from both a monolingual and a bilingual standpoint. The Portuguese-English bilingual resources of Port4NooJ, available in the public domain [2], integrate the SAL ontology of the OpenLogos model and were built as the foundation of that study. The continuation of that work gave rise to the systems ReEscreve, ReWriter, ParaMT and eSPERTo, presented in (Barreiro 2008, 2009, 2011; Barreiro & Cabral 2009; Barreiro et al. 2011). eSPERTo is a System for Paraphrasing in Editing and Revision of Text (Sistema de Parafraseamento para Edição e Revisão de Texto), currently under development within a project of the same name [3]. The project aims to develop a web platform for generating linguistically complex paraphrases. The paraphrases will be generated by applying a hybrid technique for acquiring linguistic knowledge, based on statistics and grammar rules. Integrating phrasal knowledge and multiword lexical units into the system will allow an optimized mapping of semantically equivalent constructions, structures and sentences, which will support the teaching of writing and the production and revision of texts in Portuguese. This linguistic knowledge can also be used in pre-editing for machine translation, so as to ensure higher quality of the texts to be translated and of their translations.
[2] http://www.linguateca.pt/Repositorio/Port4NooJ/
[3] http://esperto.l2f.inesc-id.pt/
having made it possible to contrast a system of pattern-based rules with a statistical system, allowed us to diagnose and qualitatively evaluate translation errors in very specific linguistic phenomena.
literal translation of the support verb and wrong lexical choice for the predicate noun, prepositions and determiners. These problems require little post-editing effort, since the words involved are very short. The quantitative results, illustrative examples and detailed qualitative evaluations for all language pairs can be found in Barreiro et al. (2014). We now describe in particular detail the translation errors in support verb constructions for the English-Portuguese pair, which were only superficially covered in the earlier work.
EN - These specifications gave insight into the space of possible case-based systems, and elucidated human interaction properties.
PT - Estas especificações deu uma *visão *para o espaço de possíveis sistemas baseados em casos, e elucidou Propriedades interação humana.
In the case of less idiomatic constructions, the errors generally affect only one or two elements of the construction, such as the support verb or the preposition. For example, in (iii) the support verb makes was translated literally as faz instead of torna. In (iv), the preposition for was translated as para instead of por. In (v), the preposition to was translated as the preposition para instead of a.
(iii) EN - On the one hand, such a rich grammatical theory makes it possible to write grammars that contain very rich linguistic knowledge.
PT - Por um lado, uma teoria tal gramatical rica *faz possível escrever gramáticas que contêm o conhecimento linguístico muito rico.
(iv) EN - Schafer testified he believed his bureau chief in Beirut, Lester Coleman, was responsible for his photo appearing as part of the Pan Am affidavit.
PT - Schafer atestou que ele acreditou no seu chefe de escritório em Beirut, Lester Coleman, foi responsável *para sua fotografia que aparece enquanto a parte da panela declaração.
(v) EN - The new Government which came to power in April 1984 has expressed a desire to give priority to agriculture development and to remove past obstacles.
PT - O governo novo que assumir poder em Abril 1984 exprimiu um desejo de dar *a prioridade *para o desenvolvimento de agricultura e de retirar-se por obstáculos.
The Google Translate system shows several agreement errors in constructions that the OpenLogos system manages to translate correctly. These errors may occur between the subject of the sentence and the support verb (vi), or between the subject of the sentence and the predicate adjective of the support verb construction ((vii) and (viii)).
(vi)
EN
PT
(vii)
EN
PT
(viii)
EN
PT
[4] conclusion and future work
Previous studies reveal important gaps in the annotation, identification, representation, recognition, processing and evaluation of support verb constructions. Current machine translation systems are unable to translate with quality the linguistic phenomena that support verb constructions present. One important task that can lead to better translation of support verb constructions is paraphrasing them. A system that maps support verb constructions to their semantic equivalents, be they stylistic variants, paraphrastic variants or verbs, is an asset for translation (human and machine). Among other benefits, paraphrasing can serve as an aid in the stylistic transformation of texts, making it possible to convert a wordy text into a semantically equivalent one that uses fewer words and a more controlled language and is therefore easier for a machine to translate.
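The idea of mapping a support verb construction to its semantic equivalents can be sketched with a small lookup table. The entries below are the equivalences cited earlier in the article; the table, the function name and the choice of the shortest variant are illustrative assumptions and do not describe how eSPERTo or any of the other systems mentioned here works.

```python
# Equivalences taken from the examples in the text (infinitive forms only).
PARAPHRASES = {
    "dar o braço a torcer": ["ceder"],
    "fazer um estudo": [
        "realizar um estudo",
        "efetuar um estudo",
        "desenvolver um estudo",
    ],
    "fazer um trabalho": ["elaborar um trabalho"],
}


def shortest_paraphrase(expression):
    """Return the equivalent with the fewest words; the original expression
    is kept when no listed variant is shorter."""
    candidates = [expression] + PARAPHRASES.get(expression, [])
    return min(candidates, key=lambda candidate: len(candidate.split()))
```

With this table, dar o braço a torcer is reduced to the single verb ceder, while fazer um estudo is left unchanged because its stylistic variants are no shorter. A real system would also have to handle inflected forms and insertions, which this sketch ignores.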
Another highly relevant task for improving translation systems is the evaluation of the translation of support verb constructions. The errors found in the translation of these constructions by two important machine translation systems, OpenLogos and Google Translate, lead to the conclusion that multiword lexical units remain an open problem in machine translation, regardless of the type of approach the system adopts. The errors found inside the translated constructions could be minimized if multiword lexical units were treated as indivisible units. The handling of the non-compositionality of multiword lexical units, namely of support verb constructions, is further compromised by the lack of qualified human intervention in the task of aligning the bilingual or multilingual segments used to train machine learning systems. Although the quality of the alignments of the various sentence elements is highly pertinent in statistical systems, this topic is still little explored from a linguistic and computational standpoint, which is why we chose not to include it in this article. Nevertheless, we must point out that the inability of statistical machine translation systems to align multiword lexical units whose component elements are non-adjacent is one of the reasons for the failure of machine translation systems. In this task too, the involvement of a human factor specialized, or specializing, in translation will be decisive for the process of automatically learning the linguistic knowledge that will lead to quality translation of these expressions, a topic that deserves due attention in future work.
acknowledgments
I thank Diana Santos and Stella Tagnin for their pertinent comments, which helped improve this article. This work was partially funded by FCT through a post-doctoral grant (SFRH/BPD/91446/2012).
contacts
Anabela Barreiro
INESC-ID
anabela.barreiro@inesc-id.pt
OSLa volume 7(1), 2015
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 223–234. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
This paper contrasts some texts that deal with the history of terminology research in Brazil, especially research aimed at extracting or recognizing terminology in corpora, a practice that became widespread among us only after 2000, with pioneering texts on this topic produced by Portuguese researchers, represented here by Belinda Maia. The intention is to recognize her role as a disseminator of corpus-based methodologies. The paper then shows how dialogue between Terminology studies in Brazil and Portugal is important for promoting the Portuguese language in the global scenario of scientific and technical communication.
[1] introduction
This text compares some publications that serve as testimony to the trajectory of terminology research in Brazil, especially research oriented towards extracting or recognising terminologies from corpora, a working method that only became widespread nationally among us from the 2000s onwards, with pioneering texts on this mode of research presented by Portuguese scholars, represented here by Belinda Maia. The intention of this work is thus to give due recognition to Belinda Maia's role as a disseminator of the idea of working with corpora, at a time when our first studies on Corpus Linguistics were only just beginning to gain some repercussion and recognition in Brazil (Sardinha 2000). Since then, this working methodology, which in Brazil has brought together Corpus Linguistics (Linguística de Corpus, LC) and Natural Language Processing (Processamento de Linguagem Natural, PLN), has remained highly challenging, especially among the community of linguist researchers who even today have little contact with computational techniques.
On the Portuguese side, we review two texts: Maia (2003), Using Corpora for Terminology Extraction: Pedagogical and computational approaches, produced for a 2001 event (PALC), and Maia (2002), Corpora for terminology extraction: the differing perspectives and objectives of researchers, teachers and language services providers, produced for a 2002 event (LREC). On the Brazilian side, I discuss a text of my own (Finatto 2003), published in a bulletin of the Associação Brasileira de Linguística (ABRALIN), in which, together with [...]
[...] computational
Building a corpus, as we well know, involves a whole set of procedures, rather arduous but ultimately very rewarding, so that the carefully assembled collection can reliably represent a given state of language use. Maia (1997) already offered a text on how this task could be tackled in a relatively calm, productive and collaborative way, pooling the efforts of different people with similar research interests around this work.
Later, in 2004, as Maia et al. (2004, p. 45) pointed out in a text dealing precisely with cooperation between Brazilians and Portuguese around corpora for teaching, translation teaching, translation and terminology research, once a corpus had been built, defined roughly as a collection of texts in digital format, whether annotated or raw, there was a whole range of tools for observing and analysing language use in that set of texts. These tools, as we saw them at the time in Brazil, seemed like magic. After all, they allowed many data to be observed at once, instead of the old but familiar work of reading each occurrence of a word or expression line by line through one text or several texts available only in print. I myself, in 1998, was still examining the texts of hundreds of Brazilian environmental laws in this way, with pencil, coloured highlighter and paper, to produce a dictionary of their terminology.
For those "magic" computational tools already offered statistical information, which could later be analysed for specific purposes. And, as Maia et al. (2004) taught, large monolingual corpora made it possible to study language at the lexical and syntactic levels, which would greatly help anyone interested, for example, in identifying terminologies in large collections of scientific or technical texts.
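As an illustration of what such tools do, the following is a minimal sketch of a frequency list and a keyword-in-context (KWIC) concordance. It is not any of the tools discussed in the texts reviewed here; the function names and the two sample sentences are invented for the example, and the tokeniser is deliberately naive.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase word tokens (hyphenated words stay whole)."""
    return re.findall(r"\w+(?:-\w+)*", text.lower())


def frequency_list(texts):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def kwic(texts, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence."""
    hits = []
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits


# Two invented sentences standing in for a corpus of environmental laws.
corpus = [
    "A lei protege o meio ambiente e os recursos naturais.",
    "O dano ao meio ambiente gera responsabilidade.",
]

counts = frequency_list(corpus)
hits = kwic(corpus, "ambiente", width=2)
for left, word, right in hits:
    print(f"{left:>12} [{word}] {right}")
```

All occurrences of a word are seen side by side with their contexts, which is exactly what reading line by line on paper could not offer.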
Unfortunately, even today in Brazil, in 2015, many fellow linguists are unaware of any corpus-based methodology, even though we live in a time of widespread computerisation, when one no longer even needs to buy software to do the work of these tools. After all, there is no shortage of free options and ready-to-use online tools. As we stated in recent work (Novodvorski & Finatto 2014), Corpus Linguistics in Brazil, and by extension work with corpora and with tools for exploring them, has been associated with different research adventures and has rejected practically nothing in terms of working partnerships. Dialogue has been a constant feature, even with those who conceive of Corpus Linguistics merely as a computational and quantitative modus operandi. Despite that impression, it should have become clear, at least over these first 10 years of corpus work in Brazil, celebrated in 2015 with a decade since the publication of the article by Sardinha (2000), that we have gone far beyond merely counting words.
[3] recognising terminologies in corpora
In the original: Terminology is not the simple accumulation of words, their equivalents in other languages,
definitions and a certain amount of grammatical information. Nor is it the simple matching of term
to concept. One has to deal with all the usual problems of language - social, geographical, historical,
political, and other aspects of style and register. At the level of standardisation, one can even become
involved in authentic battles between academics or commercial companies who want to see the words
they use to describe their particular theories or products prevail.
graxos (BR) and ácidos gordos (PT), disbarismo (BR) and embolia gasosa (PT), among other concrete cases that can be checked, for example, in the Glossário Panlatino de Pneumopatias Ocupacionais/Profissionais [3].
In short, we recommend that Portuguese corpora be handled separately from Brazilian corpora, and always clearly identified, especially if the extracted data are to feed translation products. There are naturally many points in common, but the differences cannot be ignored; nor should these differences, once duly catalogued, be taken to suggest that it is unfeasible to write scientific and technical knowledge in Portuguese as well. In a work that brings together the sources and terminologies of both countries, the shared denominations would thus mark an international Portuguese, while the differing usages of Portugal (PT) or Brazil (BR) are always flagged.
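The recommendation above can be pictured with a small data-structure sketch: term lists extracted separately from PT and BR corpora are merged per concept, and each entry is flagged as shared or variety-specific. This is only an illustration under stated assumptions: the concept identifiers (C1, C2, C3) are hypothetical, and the third entry is an invented example of a shared denomination, not data from the glossary cited.

```python
def merge_terminologies(pt_terms, br_terms):
    """Merge two concept->term mappings, flagging shared denominations."""
    merged = {}
    for concept in sorted(set(pt_terms) | set(br_terms)):
        pt = pt_terms.get(concept)
        br = br_terms.get(concept)
        merged[concept] = {
            "PT": pt,
            "BR": br,
            # Shared only when both varieties attest the very same term.
            "shared": pt is not None and pt == br,
        }
    return merged


# Hypothetical concept identifiers; C3 is an invented shared example.
pt_terms = {"C1": "ácidos gordos", "C2": "embolia gasosa", "C3": "pneumoconiose"}
br_terms = {"C1": "ácidos graxos", "C2": "disbarismo", "C3": "pneumoconiose"}

merged = merge_terminologies(pt_terms, br_terms)
for concept, entry in merged.items():
    label = "international" if entry["shared"] else "PT/BR variants"
    print(concept, entry["PT"], "|", entry["BR"], "->", label)
```

Keeping the two source mappings apart until the merge step mirrors the advice to identify the origin of every corpus before its data reach a translation product.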
Such a stance is what Santos (2014) discusses when addressing linguistic differences between Portugal and Brazil in her work entitled Como estudar variantes do português e, ao mesmo tempo, construir um português internacional? The author holds that it is important to have Portuguese corpora without a separation of variants, considering the idea of an international Portuguese. In the context of that question, at least within Terminology, I believe it matters to describe these variants and bring them into contact, together, yet individually identified, so that we all know about one another and about each other's specific usages.
From another angle, equally interesting for a reflection that encompasses and goes beyond the work of Santos (2014), we have the study by Coulthard (2005). In that work, the author already shows, on the basis of an extensive corpus study, an influence of the writing style of English-language articles on the original writing of articles in Portuguese by Paediatrics researchers in Brazil. Thus the style of Brazilian Portuguese, at least in scientific Paediatrics articles, already appears anglicised, perhaps even to facilitate translation of the text into lingua franca English. The parallel Paediatrics corpus compiled by Coulthard (2005), which we have expanded, is available for consultation in different formats and resources [4].
[4] corpora in terminology and in terminologies
In 2013, a collection of articles edited by Tagnin & Bevilacqua (2013) was published in Brazil, which in principle serves to attest to the good fit and success of terminological work based on or guided by corpora. The aim of the work is in fact to reaffirm, for us in Brazil, that there is already a productive and [...]
[3] Available free of charge at http://www.oqlf.gouv.qc.ca/ressources/bibliotheque/dictionnaires/panlatin_pneumopathies20130124.pdf.
[4] See http://www.ufrgs.br/textecc/textped/Dicionarios/DicPed/.
It is clear today, among us Brazilian linguists who deal with Terminology and Terminography, that a specialised term is, above all, a value activated in discourse (the term discourse, for me, is not exactly a synonym of text, but this is not the place for that discussion). We owe this conception mainly to the Theory [...]
acknowledgements
I thank Diana Santos for the opportunity to take part in this publication, and also FAPERGS, CAPES, under the Stic-AmSud Programme (project 047/2013), and CNPq, research funding institutions in Brazil, for supporting my study and research initiatives.
references
Coulthard, Robert James. 2005. The application of corpus methodology to translation: the jped parallel corpus and the pediatrics comparable corpus. Universidade Federal de Santa Catarina. Master's thesis.
Finatto, Maria José Bocorny. 2003. Sobre o enfoque lingüístico-terminológico de manuais acadêmicos de Química Geral. In Associação Brasileira de Linguística ABRALIN (ed.), II congresso internacional da ABRALIN, 2001, 184–186.
Finatto, Maria José Bocorny. 2007. Exploração terminológica com apoio informatizado: perspectivas, desafios e limites. In Aparecida Negri Isquerdo & Ieda Maria Alves (eds.), As Ciências do Léxico. Lexicologia, Lexicografia, Terminologia. Volume III, 447–458. Editora da UFMS/Humanitas.
Finatto, Maria José Bocorny & Marcos Goldnadel. 2013. Formação de terminólogos: experiência com corpus em uma graduação em tradução. In Stella Tagnin & Cleci Bevilacqua (eds.), Corpora na Terminologia, 87–112. HUB Editorial.
Lopes, Lucelene. 2012. Extração automática de conceitos a partir de textos em língua portuguesa. Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS). Doctoral thesis.
Maciel, Anna Maria Becker. 2013. Terminologia e corpus. In Stella Tagnin & Cleci Regina Bevilacqua (eds.), Corpora na Terminologia, 29–45. HUB Editorial.
Maia, Belinda. 1997. Do it yourself corpora... with a little bit of help from your friends! In B. Lewandowska-Tomaszczyk & P. J. Melia (eds.), Practical applications in language corpora, 403–410. Lodz: Lodz University Press.
Maia, Belinda. 2002. Do-it-yourself, disposable, specialised mini corpora - where next? Reflections on teaching translation and terminology through corpora. Cadernos de Tradução 1(9). 221–235.
Maia, Belinda. 2003. Using Corpora for Terminology Extraction: Pedagogical and computational approaches. In Barbara Lewandowska-Tomaszczyk (ed.), PALC 2001: practical applications in language corpora, 56–68. P. Lang.
Maia, Belinda, Luís Sarmento, Stella E. O. Tagnin & Sandra Maria Aluísio. 2004. Idéias que cruzam o Oceano. CROP - Revista da Área de Língua e Literatura Inglesa e Norte-Americana 10. 43–64.
Marcolin, Paula, Aline Evers, Maria José Bocorny Finatto & Marcos Goldnadel. 2010. Pneumopatologias: formação em terminologia em curso de tradução no Brasil. In Actas da RiTerm 2010, 254–278.
Novodvorski, Ariel & Maria José Bocorny Finatto. 2014. Linguística de Corpus no Brasil: uma aventura mais do que adequada. Letras & Letras - UFU 30(2). 7–16.
Santos, Diana. 2014. Como estudar variantes do português e, ao mesmo tempo, construir um português internacional? http://www.linguateca.pt/Diana/download/VariantesPIGSCP.pdf.
Sardinha, Tony Berber. 2000. Lingüística de Corpus: histórico e problemática. DELTA 16(2). 323–367.
contacts
Maria José Bocorny Finatto
Instituto de Letras, Universidade Federal do Rio Grande do Sul, Brasil
maria.finatto@gmail.com
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 235–252. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
ensinador paralelo:
foundations for a new pedagogy
DIANA SANTOS AND ALBERTO SIMÕES
abstract
After outlining some of Belinda Maia's main ideas on how to use comparable corpora in translation teaching and learning, we present a new translator training tool: Ensinador Paralelo. It is an extension of Ensinador, originally developed for use with monolingual corpora (Simões & Santos 2011). This new tool produces exercises based on translations (previously done by professional translators or students, as we will see).
In order to make the text more interesting to Belinda Maia, we also critically study four translations of Lewis Carroll's children's books.
[1] introduction
But with both authors involved in more and more new parallel corpora, as we show below, the time seemed right to extend the idea, and the functionality, to the many cases that already exist.
Unlike Ensinador, which was designed to rely exclusively on the AC/DC corpora (given their size and coverage, using more material did not yet seem necessary), ParaEnsinador (the name of the Ensinador for parallel corpora) is intended to be usable at least on the Linguateca corpora and on the Per-Fide corpora (Araújo et al. 2010). This required more care in its implementation, so as to allow easy installation on different systems as well as configurability, in order to handle several corpora, languages and different forms of encoding and annotation.
[2.1] Implementation
Although ParaEnsinador brings no major novelties in implementation compared with the monolingual Ensinador, we think it important to highlight its underlying technology in this document.
As with Ensinador, the corpora used by ParaEnsinador must, naturally, be encoded in the Open Corpus Workbench (OCWB) [1]. Since OCWB supports parallel corpora, ParaEnsinador relies on that information to run parallel searches.
Thus, for a parallel corpus to be usable by ParaEnsinador, each language must be encoded independently in OCWB, followed by the import of alignment data (which indicate, for each segment in one language, the corresponding segment in the target language) [2].
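To make the notion of alignment data concrete, here is a minimal sketch that reads aligned segment pairs from a TMX document, a common exchange format for translation memories. It is not the import procedure used by ParaEnsinador or OCWB: the function name, the sample file content and the language codes are assumptions made for the illustration, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, source_lang, target_lang):
    """Return (source segment, target segment) pairs from a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):  # one translation unit = one alignment link
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


# Hypothetical one-unit TMX sample built around a sentence pair from the corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Why is a raven like a writing-desk?</seg></tuv>
      <tuv xml:lang="pt"><seg>Por que um corvo se parece com uma escrivaninha?</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx_pairs(SAMPLE_TMX, "en", "pt")
for source, target in pairs:
    print(source, "<->", target)
```

Each pair states which target segment corresponds to which source segment, which is the information a parallel search needs in order to show a hit together with its translation.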
The Web interface is implemented with the Perl module Dancer2 [3], which can run under any Web server, from Apache to Starman.
The link between the Web interface and OCWB is made through the module CWB::CQP::More [4], which recently received an update for parallel corpora.
To make it possible to generate different types of exercises, the default syntax used by OCWB's Corpus Query Processor had to be changed, adding some extra attributes to it.
The changes to the OCWB syntax are detailed in the next section, together [...]
[1] See http://cwb.sourceforge.net/.
[2] Some researchers have used files in TMX (Translation Memory eXchange) format to store their parallel corpora. A TMX file can easily be imported into OCWB using the tmx2cwb tool from the Perl module XML::TMX::CWB, http://metacpan.org/release/XML-TMX-CWB.
[3] See http://metacpan.org/release/Dancer2.
[4] See http://metacpan.org/release/CWB-CQP-More.
It should be stressed that, unfortunately, it has not yet been possible to make the query language flexible enough for both languages of the parallel corpus. The user therefore has to choose one language for which the extended syntax is used, while for the other language only OCWB query expressions can be used.
It is hoped that in future, either by incorporating some extra search functionality on the OCWB side or through some intermediate solution, the extended language will become available for both languages.
.NOME
Lewis Carroll's Alice in Wonderland (Carroll 1865), like its sequel Through the Looking-Glass, and What Alice Found There (Carroll 1871), is a classic of British and world literature and, moreover, a cult book to this day. Belinda Maia does not hide her enthusiasm for it, shown by her invoking it in academic settings, as in Maia (2008a) on the occasion of Linguateca's ten years. But she is in good company: indeed, other texts in the field of translation also invoke, though in a different way, the genius of this mathematician-writer, as is the case of Chesterman (1998, pp. 5–6). One of the leading Portuguese sociologists, of world renown, also chose these books (or their main character) once again to name several of his projects: see Santos (1994) and the project alluded to in Santos (2014c).
After writing this article, we discovered that there were already at least two articles based on these same texts, fortunately analysing other questions (Silva & Fromm 2011, 2012). In addition, the English version has been used in several statistics books and articles, as is the case of Baayen (2008), or simply as a reference or quotation in anything that may have something to do with [...]
The biography of Alan Turing (Hodges 1983) is full of allusions, as are even university-level textbooks in Norway (Borge 2008).
See http://dinis.linguateca.pt/dispara/CorTrad/AutoresTradutoresCorTradlit.php#alice for detailed information about them.
The translation errors were not found systematically, but rather through our daily interaction with the corpus. This article does not aim to present a methodology for detecting or quantifying problems; it merely notes that a parallel analysis makes it possible to identify many problems.
A likely story indeed! said the Pigeon, in a tone of the deepest contempt.
Uma bela história, de fato! disse a Pomba com o mais profundo desprezo.
Uma história promissora, certamente, disse a Pomba, com um tom do mais profundo desprezo.
(2)
Just then she noticed that the Queen was close behind her, listening: so she went on likely to win, that it's hardly worth while finishing the game.
Justo neste momento, notou que a Rainha estava atrás dela, ouvindo tudo. Daí continuou: ... competente no jogo, que nem sei se vale a pena ir até o final da partida.
Exatamente neste instante ela percebeu que a Rainha estava bem ao seu lado, ouvindo, ... boa nesse jogo que vai ser muito difícil chegar ao final da partida.
[9]
This is a case that is not rare but whose importance, especially in a teaching context, cannot be overstated.
Let us now look at some famous logical-mathematical games from Carroll's books.
Then you should say what you mean, the March Hare went on. I do, Alice hastily replied; at least... at least I mean what I say... that's the same thing, you know.
Então você deve dizer o que pensa, continuou a Lebre de Março. Eu digo o que penso, Alice apressou-se em dizer, ou, pelo menos... pelo menos eu penso o que digo... é a mesma coisa, não é?
Então você pode dizer o que acha, a Lebre de Março continuou. E vou, Alice replicou rapidamente, pelo menos-pelo menos, eu acho o que digo o que é a mesma coisa, você sabe.
(5)
That's a great deal to make one word mean, Alice said in a thoughtful tone.
Uma grande coisa fazer uma palavra significar o que a gente quer! murmurou Alice pensativamente.
Isto é fazer uma só palavra exprimir muita coisa disse Alice num tom de voz duvidoso.
We take this example to also highlight what several researchers have mentioned before (see, for example, Caldas-Coulthard (1996)): Portuguese is considerably richer in verbs of saying than English, with its near-monopoly of say. We thus have murmurar in this example, and many other verbs translate say in these texts. On the other hand, the difficulty of converting English direct speech, for example by mixing the conventions of the two languages, is also evident in (6), a complexity discussed and exemplified by Santos (1998b).
Before leaving the question of meaning, central to linguistics, let us look at Humpty Dumpty's famous pronouncement and how it was tackled by the two (new [10]) translators.
(7)
Note that in this case the proper names were translated differently: the second translation took care to choose a word better suited to the target language but, in our opinion, lost the charm of the English name. Here the first translation sticks rigidly to the source text, while the second strives to speak as people do in spoken language, and seems to us much more successful. However, it adds the information that the words come to have another meaning, when Gorducho (or Humpty Dumpty) merely states, categorically, that they have that meaning.
We end with a case, (8), in which the translators disagree in their interpretation, but both produce barely intelligible sentences.
(8)
[10]
While the first translator produces something that makes no sense at all, which the reader cannot but interpret as utter nonsense, the second manages to convey at least part of the humour by using the same verb descobrir in two different senses, although losing the negation and the non-standard pronunciation.
In (11) we have another example of a negative neologism that nowadays is used in everyday English, unlike the Portuguese translation proposed here, which remains comical.
(11)
Unimportant, your Majesty means, of course, he said, in a very respectful tone, but frowning and making faces at him as he spoke.
Desimportante o que Vossa Majestade quer dizer, claro, disse em
tom muito respeitoso, embora franzindo as sobrancelhas e fazendo caretas enquanto falava.
Desimportante, o que Vossa Majestade quer dizer, claro, ele disse,
em um tom respeitoso, mas franzindo o cenho e fazendo caretas.
That's just what I complain of! You should have meant! What do you suppose is the use of a child without any meaning? Even a joke should have some meaning... and a child's more important than a joke, I hope. You couldn't deny that, even if you tried with both hands.
É o que me aborrece. Você vive julgando. Onde se viu uma simples criança julgar? Isso é bom para os juízes.
Pois isto é o pior! Você deveria ter a intenção! De que serve uma menina sem intenções? Até um passarinho que abre as asas tem intenção de voar; uma menina deve ter muito mais intenções que um passarinho! Você não pode negar isso, nem que tente com as duas mãos!
In this example, once again hard to translate into Portuguese given the two senses of meaning in play (note, incidentally, that each translator chose a different alternative), the first translator chooses to say something that completely contradicts the original sense, criticising a child for judging [11], while the second keeps the sense of reproach for the child having no intentions/opinions, but replaces the comparison of a child with a joke (a comparison that only makes sense if meaning is translated as sentido, of course) with the spurious introduction of a little bird to which a child is compared.
The second unexpected comparison in the same passage, namely trying to deny something with both hands, is kept satisfactorily by the second translator but omitted entirely by the first.
We end this article, which could go on almost indefinitely, with the discussion of the riddle that prompts a philosophical discussion of similarity in Chesterman (1998) [12]:
(13)
The Hatter opened his eyes very wide on hearing this; but all he said was
Why is a raven like a writing-desk?
O Chapeleiro arregalou os olhos ao ouvir isso, mas tudo o que disse foi:
Por que um corvo se parece com uma escrivaninha?
O Chapeleiro arregalou os olhos ao ouvir isso, mas, tudo que ele disse foi:
Por que um corvo se parece com uma escrivaninha?
The chosen translation was literal (clearly, the word secretária was passed over because it is ambiguous between a profession and a piece of furniture) and practically identical in both cases (just one more comma and one more personal pronoun in the second), which shows beyond doubt that the translators did not bother to solve or understand the riddle. They simply passed it on [...]
[11] This is so strange that we can even imagine that the censorship in force in Brazil at the time had something to do with it.
[12] One way of trying to understand the riddle would be to compare its translation in various languages and at least try to see whether any translator had arrived at a satisfactory answer. Chesterman, however, neither does this nor even proposes doing it.
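The kind of side-by-side comparison made for example (13) can also be sketched computationally. The snippet below is only an illustration, not a feature of Ensinador Paralelo: it uses the standard Python difflib module to list the word-level differences between the first lines of the two translations as they appear in (13); the function name and the whitespace tokenisation are assumptions of the sketch.

```python
import difflib


def token_diff(first, second):
    """List the word-level edits that turn the first text into the second."""
    a_tokens = first.split()
    b_tokens = second.split()
    matcher = difflib.SequenceMatcher(a=a_tokens, b=b_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replacements, deletions, insertions
            changes.append((tag, a_tokens[i1:i2], b_tokens[j1:j2]))
    return changes


# First line of each translation in example (13), as printed in this text.
first = "O Chapeleiro arregalou os olhos ao ouvir isso, mas tudo o que disse foi:"
second = "O Chapeleiro arregalou os olhos ao ouvir isso, mas, tudo que ele disse foi:"

changes = token_diff(first, second)
for tag, old, new in changes:
    print(tag, old, "->", new)
```

A listing of this kind could help a teacher spot at a glance where two translations of the same segment diverge, before deciding which divergences are worth discussing in class.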
In conclusion, we have sought to present a tool that can make it easier for the teacher to mediate between two cultures, two periods, two styles and two languages; and if the examples from Alice are exciting, the same richness can be found in technical translations or in books of other kinds. It suffices that students be directed to the most interesting and pedagogical cases in their field.
Ensinador Paralelo is just a tool to help the teacher, which we here dedicate to Belinda.
[13]
acknowledgements
We thank Flávia Santos da Silva and Guilherme Fromm for providing us with the Alice texts and their translations, and Jamilly Alvino and Stella Tagnin for revising their alignment for CorTrad. We are also very grateful to Signe Oksefjell and Brett Drury for their pertinent comments, which helped us improve this chapter.
literary works mentioned
Carroll, Lewis. 1865. Alice in Wonderland.
Carroll, Lewis. 1871. Through the Looking-Glass, and What Alice Found There.
contacts
Diana Santos
Linguateca and University of Oslo
d.s.m.santos@ilos.uio.no
Alberto Simões
Linguateca and CEHUM, Universidade do Minho
ambs@ilch.uminho.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 253–281. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
a tool at hand: gestures and rhythm in listing events
case studies of european and african portuguese speakers
ISABEL GALHANO RODRIGUES
abstract
This article explores gestures and body movements in face-to-face interaction from an ethnographic perspective within gesture studies. The analysis centres on the comparison between listing gestures and other means used to support the activity of drawing up a list. The aspects considered are the formal features and the rhythm of the gestures, and their coordination with the correlated lexical units of the utterances. The corpus collected for this analysis consists of four interactions with speakers from different cultures, whose listing activity was examined in terms of morphological features and rhythmic patterns, with the aim of detecting both regularities and (cultural) differences in listing gestures.
[1] introduction
The main question I will explore in this paper is how listing activities, so frequent in face-to-face interaction, are performed in different languages/cultures. Although the results of these case studies cannot be generalized, this paper offers some examples of different forms of making lists, of how hand gestures are coordinated with speech and of how these modalities work together: not only in making a list, but also in making the list visible for the hearer.
This article is divided into three parts: an overview of the theoretical background (section [2]), the description of listing gestures, their subdivisions and further aspects related to their use (section [3]), and the micro-analysis of some parts of the recorded corpus (section [4]). The corpus consists of four interactions: the first between European Portuguese speakers speaking Portuguese, the second between German speakers speaking German, the third and the fourth between Angolan speakers speaking Portuguese. The analysis considers speech (lexical items and prosody) and co-speech body movements, or kinesic modalities, above all gesture, head and trunk movements, and gaze orientation.
This linguistic approach to speech and gesture involves an interdisciplinary theoretical background: 1) several orientations of Conversation and Discourse Analysis (e.g. Sacks et al. (1974); Henne & Rehbock (1982); Roulet et al. (1985)) and Contextualization Theory (e.g. Gumperz (1982a, 1992)); 2) Interactional Linguistics (cf. Selting & Couper-Kuhlen (2000)); and 3) Gesture Studies (Ekman & Friesen (1969); Goodwin (1981); Hall (1974); Kendon (2004); McClave (2000, 2001); McNeill (1992, 2000); Müller et al. (2013, 2014)).
The first group made it possible to consider face-to-face interaction (a) as an activity that is reciprocally and simultaneously constructed by speaker and hearer; and (b) as a phenomenon comprising different levels: the level of thematic development, the level of structural relations between units, the level of emotion and modalization, and the level of the interpersonal relations between speaker and listener regarding their interactional roles (Galhano Rodrigues 1998, 2007). The second group offers the framework for the analysis of prosody. Its principles, developed from the Contextualization Theory of Gumperz (1982b), view prosodic phenomena as important contextualization cues for the encoding and decoding of speech. The categories of analysis within these theories were conceived to approach prosody from a pragmatic point of view, so they are flexible enough to explain prosodic variations caused by different kinds of spontaneous phenomena in the interactional context. Gesture Studies, in turn, provides the background for the description of gestures and other body movements in their relation to speech.
The following units and their subdivisions were taken into account for speech segmentation: the turn-taking system (Sacks et al. 1974), which corresponds to the exchange in Discourse Analysis theory (Sinclair & Coulthard 1975; Moeschler 1987, 1994); the turn (Goffman 1974, pg. 201); the conversational acts (Henne & Rehbock 1982, pg. 17); and the conversational signals (Galhano Rodrigues 1998).
For the description of prosody, the following categories and phenomena were considered: intonational unit, pitch, intensity, quantity, beat-clashes and rhythm (e.g. Auer & Couper-Kuhlen (1994)), silent pauses, filled pauses and sound elongations (e.g. Boomer & Dittman (1962); Goldman-Eisler (1972); Selting (1988); Uhmann (1992)).
Regarding gestures, a fundamental concept for their identification is the gesture unit, which is composed of gesture phrases (gestures) that can be divided into different phases: preparation, stroke and retraction (Kendon 1980, pg. 214; McNeill 1992, pg. 83). The identification of units in other body movements is more complex, as the various body parts have very different (and sometimes very subtle) features when it comes to movement shape and direction. The trunk is the body part that makes the least complex movements: it can only move forwards, backwards, and to both sides, according to two axes. Eye movements are slightly more complex, because they involve the direction one is looking at, as well as the position of the eyes in the ocular globe, eye-lid movements and the degree of eye opening. Linked to eye movement is eyebrow-raising, here included in the group of facial expressions. Due to technical constraints, only the movements of the mouth and eye region were taken into account, while the micro-movements of the face had to be left aside. Thus, in the case of less defined or more complex movements, movement units (as I call the units considered for the other parts of the body) are delimited by the points of greatest amplitude (which can, in fact, be minute) of their trajectory. Another unit is, for instance, the period of time a gaze is kept in a certain direction. In this case, we cannot talk about a movement, but about a movement-freezing, in other words, a static unit.
[3] listing gestures
It is common knowledge that when people make lists of items, be they objects, feelings, problems, situations, theories, etc., they tend to use some cues to inform the interaction partner(s) that they are listing a certain number of items. This quantity of items is supposed to be small enough to be counted on the fingers (from 5 to 10), or big enough to justify the use of a support that helps speakers organize their speech, so that the hearers know which elements of the utterance belong together and constitute a listing unit. This structuring support is given by different kinds of tools. One of these tools is prosody: prosodic cues like pitch, intensity and speech rate, as well as voice quality, are important discourse markers. They can show which parts of the utterances belong together. Asides, which are generally performed at a higher speech rate, a lower and constant pitch and a lower voice quality, are a good illustration of this. Prosodic cues are also important for the creation of rhythmic patterns and rhythm. A rhythmic pattern is established after the repetition of three similar prosodic patterns. Rhythm creates expectations in the hearers (cf. Auer & Couper-Kuhlen 1994, 82ff.; Galhano Rodrigues 2007, pg. 175), since after each unit in a rhythmic sequence the hearers expect to hear another unit with the same rhythmic pattern. In the case of lists, the prosodic pattern is characterized by an ascending pitch at the end of the intonational unit (in this case, the intonational unit coincides with the listing unit). This ascending pitch also indicates that something else is going to be said; in other words, its function is to keep the hearers' attention and to focus this attention on what is going to be said next (this ascending pitch can also be described as a conversational opening signal, cf. Galhano Rodrigues (2007, pg. 509)). Most of the time, prosodic prominence coincides with the countable item, i.e., the most important topic. According to Erickson (1992), listing events are characterized by the fact that each new item of information is introduced at a regular rhythm, with identical time intervals between the information units. As a rule, the primary accents fall on the most important topics of the list and
ger. When number five is reached (i.e., when the thumb of one hand touches the thumb of the other hand) the thumb bends against the palm of the hand (and the hand is closed). It is important to note that these remarks are not the result of a systematic study, but general empirical observations and annotations I have collected over the past few years. In fact, when listing is explored in a systematic way, other interesting details can be found, such as the regularity, intensity and amplitude of the movements in relation to the listed items, which, in turn, are correlated with the speaker's emotions and motivation in communicating.
To facilitate the description of listing gestures and account for their precise synchronization with speech and prosodic prominence, it is essential to distinguish between the different phases of a listing gesture. Here the listing gesture is defined as a gesture-unit composed of several gesture-phrases whose function is to enumerate instances, objects, events, etc. Each gesture accompanies a listing act; in other words, its function is to accompany the verbalization of one element within the set of elements to be counted. This act coincides with the listing unit, as mentioned above. In line with the subdivisions of the gesture-unit, these gestures are also composed of preparation, stroke and retraction. But in the case of two-hand listing gestures, the part of the stroke with the greatest amplitude is the moment when the index finger touches the finger of the other hand. For this reason, I use the term 'touch' instead of 'stroke'.
In the case of Portuguese, we may say that the most common form consists of the following phases:
Preparation: one hand is open, with the palm turned almost upwards (listable hand); the other (listing hand) is raised, with the palm downwards, index finger stretched, the other fingers relaxed or closed.
Touch: the index finger of the listing hand touches/presses/grasps the little finger of the listable hand, positioned with the palm upwards.
Retraction: the listing hand lets go of the finger and moves slightly upwards (together with the arm).
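The phases listed above, together with the little-finger-to-thumb order described for Portuguese, can be pictured as a small data model. The following Python sketch is only an illustration under stated assumptions: the names `Phase`, `ListingAct`, `FINGER_ORDER` and `plan_listing_gesture` are hypothetical and do not come from the chapter, and the idealized assignment of one item per finger (restarting at the little finger after the thumb) simplifies what speakers actually do.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    """Phases of one gesture phrase in a two-hand listing gesture."""
    PREPARATION = "preparation"
    TOUCH = "touch"  # the term used here instead of 'stroke'
    RETRACTION = "retraction"


# Order of the listable fingers in the form described for Portuguese.
FINGER_ORDER = ["little", "ring", "middle", "index", "thumb"]


@dataclass
class ListingAct:
    item: str      # the listed element being verbalized
    finger: str    # finger of the listable hand that carries the item
    phases: tuple  # phases of the gesture phrase, in order


def plan_listing_gesture(items):
    """Assign each listed item to one finger (assumption: an ideal list).

    After the thumb, the sequence restarts at the little finger, which is
    one of the options mentioned in the text (same hand again).
    """
    acts = []
    for position, item in enumerate(items):
        finger = FINGER_ORDER[position % len(FINGER_ORDER)]
        acts.append(ListingAct(item, finger, tuple(Phase)))
    return acts


if __name__ == "__main__":
    for act in plan_listing_gesture(["lunches", "dinners", "tidying up"]):
        print(act.finger, "->", act.item)
```

In the spontaneous interactions analysed in the case studies, speakers deviate from such an ideal plan, for instance by starting on a different finger or by holding one finger down across several acts.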
This sequence is repeated starting with the little finger, followed by the ring
finger, the middle finger, the index and the thumb; then, the same procedure can
be repeated with the same hand or the other way round, i.e., the listable hand becomes the listing hand. In an ideal listing activity, each topic or listed element is
isolated from the others and iconically located on one finger only. Therefore, the
fingers become markers for parts of speech and actively support discourse organization. Sometimes, when each topic involves more than one act (for instance a
longer sequence, with side sequences), fingers may be pressed and held down for
the entire period of time during which these acts are verbalized. Furthermore,
[4.1]
Three Portuguese female students talk about gender roles, using their own parents as an example. In the interaction interval transcribed below, the speaker, LV, the student sitting in the middle, is saying that women come home from work and have to do all the housework, while men come home from work and do nothing.[2]
[1] Announcements are metacommunicative preparatory and focusing acts, like the topographic opening signals (Galhano Rodrigues 2007, pgs. 200–203, 490–491, 502–504), one kind of conversational signal (Galhano Rodrigues 2001, pgs. 448–449). Conversational signals are polysemic and polyfunctional and can, in different proportions, assume interactive, topographic, modal and turn-taking functions (Galhano Rodrigues 1998, 70ff.). A conversational act is the communicative unit produced simultaneously by speaker and hearer (Galhano Rodrigues 2007, pg. 222).
[2] Prosodic transcription after the GAT system (Selting et al. 1998).
  speech: `tu`do;
  gloss: really, they do; do everything
  gesture: looks at VB, raises her head slightly
1-39
  speech: (0,115) -dE:sde=AlmOos `jan-tares em `CA::-:sa-
  gloss: from lunches and dinners at home
1-40
  gesture: raises hand (retraction)
  gloss: tidying up the houses
  gesture: touches middle finger with right thumb and presses it backwards; at 'casas' looks at VB
1-41
  speech: desde- ah'
  gloss: from ah
  gesture: always looking at VB, touches left index finger with right thumb, and holds it down; looks upwards; lowers hands, always holding right index.
1-42
  gloss: do much more
  gesture: looks ahead, continues pressing left index with right thumb; head and gaze towards VB.
1-43
  gloss: come from work
  gesture: moves hands to the front of the trunk; raises right hand at 'tra-', raises left hand at '-lhar'; turns head to the right and looks again at VB.
1-44
  speech: do`tra`balho>=s `SEte,
  gloss: comes from work at seven
  gesture: turns head to the front, looks to the front; moves left arm to the left side
  gloss: gets home, has to make the dinner
  gesture: raises head; moves hands apart
1-46
  speech: `depois-a`caba defa`zer o-jantar=`arrumar-TU::do
  gloss: then she finishes making dinner, tidies everything up
  gesture: turns head to the front; looks down, brings right hand close to left hand.
1-47 VB:
  gloss: your father sits down
  gesture: LV looks at VB, parts her hands and inhales
1-48 LV:
  speech: sen`ta dI::`nho,]
  gloss: father there sitting
  gesture: turns head to the front; leans trunk backwards and moves arms apart to both sides, hands with palms down
d) touches and holds back/down her index finger at 'desde', turns her head to the front, looks up, lowers her hands, always holding the index finger down; she then lets go of it at 'muito' (line 1-42);
Act d) is the beginning of a listing act on the index finger. However, the speaker could not remember more items to list. Her upward gaze and the lowering of the hands (a normal reaction during a hesitation) reveal a moment of increasing cognitive effort, when the speaker is trying to remember other items to list. The strategy to overcome this obstacle in speech production consists in summarizing the content of all these listing acts in a single one: 'fazem mesmo muito mais' (they really do much more). The prosodic features of this unit, typical of an emphatic speech style, focus on act 1-42, the solution for this problem, drawing the hearers' attention to it and, consequently, distracting them from the incomplete preceding listing act. Afterwards she goes on enumerating further activities by women in general. To introduce this sequence she makes opening gestures, raising first the right and then the left hand, with palms up, focusing on the verbalized act: 'elas vêm de trabalhar' (line 1-43). This act (line 1-43) contains a false start. To go on speaking, and again distracting hearers from this moment and drawing their attention to what is going to be said next, she makes another gesture with focusing/opening properties: she moves her arm slightly apart and gesticulates with one hand marking the noun phrase 'a minha mãe' (my mother). In this case, these elements function as the repair of the repairable form 'elas' (Schegloff et al. 1977). After having overcome this difficult moment, she puts her hands on her lap in a resting position, and goes on verbalizing the elements needed to contextualize a narration. The actions in the narration begin at 7 p.m., when her mother comes home from work. At this point, to announce the many things her mother does, which she is preparing to enumerate, she makes a new listing gesture. This time, however, she does not use her thumb but her right index as listing finger. Her right index:
e) touches and holds back/down the ring finger at 'fazer' and lets go of it after 'jantar' (line 1-44);
f) touches and holds back/down the middle finger at 'jantar' and lets go of it after 'tudo' (line 1-45).
We can see that the left hand finger is held down with the right index during the whole verbalization of the sequences 'fazer o jantar' (make dinner) and '[fazer o] jantar, arrumar tudo' ([prepare] dinner, tidy everything). We may ask whether the fact that she started listing on the ring finger has a logical explanation. In my opinion, the speaker perceives the preceding syntactic cluster 'ela chega a casa' as a first countable topic, though she fails to accompany it with a listing gesture. The use of the ring finger to accompany the verbalization of the next topic/cluster tem que
this point we could ask whether the listing gesture is more linked to prosody or to the topics expressed by words. It seems that, in this case, the modalities gesture and prosody are responsible for establishing a kind of hierarchical structure: the largest unit is structured by prosody, whereas the smaller units within this larger unit, which correspond to two topics, are accompanied by gestures. The morphological features of the listing gestures confirm what was said in Section [3] in relation to the sequential use of the listable fingers. As for the trajectory of movements, their reduced amplitude could be attributed to personality, gender and context alike, but could also be determined by cultural habits. Only quantitative research on this phenomenon could provide reliable data on the individual and cultural features of listing gestures. Nevertheless, some more easily observable aspects can be anticipated, i.e., the fingers used to list, the preferred order of the fingers and the kind of information allocated to the fingers.
[4.2]
Three German students, a man and two women, talk about adoption. The speaker
in this segment of interaction humorously narrates a recent event involving a
child, which illustrates his position regarding the theme.
Prosodic transcription
2-01 FH
  speech: ich Habe dann ein Nachm
  gloss: I have (spent) the afternoon-
  gesture: sitting leaning backwards, hands on the lap, palms on the belly, head turned to the front; turns head to the left, and raises left arm up and to the front, hand with palm up, thumb stretched out

  speech: IM `CAso`lare`JA.
  gloss: at casolare, right
  gesture: continues the movement to the left, simultaneously going up and down with the arm.
2-03 ST

  speech: kinder
  gloss: children of david, right; look after
2-05
  speech: (--)
  gesture: looks to the left to ST and again to the front; keeps the finger pressed.
2-06 AF ((laughter))
2-07 FH
  speech: dAnn will ER immer `SPIE:::len ja
  gloss: then he always wants to play, right
2-08 ST ((laughter))
2-09 AF ((laughter))

  speech: KAR`ten `wErfen;
  gloss: wants to throw the cards
  gesture: keeping left hand configuration (hand closed, thumb upwards) and posture (head and trunk leaning backwards), makes a gesture with right hand depicting the act of throwing forwards. Afterwards prepares the following gesture: lowers right hand, touches the relaxed index finger of the left hand.
2-11 ST ((laughter))
2-12 AF ((laughter))

  gesture: raises the right hand slightly, moves left hand upwards, thumb and index finger stretched out, and touches left index with right index. Makes a kind of head shake.
2-14 ST ((laughter))

  gesture: raises hand, looks at his hand and touches again the left index with the right index, making a head shake; at 'süssigkeiten' he leans head backwards, turned to the left, holding finger down.
2-16 ST ((laughter))
2-17 FH
  speech: -ECHT `die schnauze
  gloss: really fed up with children
2-18 ST ((laughter)) von kindern
2-19 AF
2-20 FH ((laughter))
2-21 ST ((laughter)) ((laughter))
enough to be able to yield a logical reason for such a distribution, but what matters is the regularity observed as well as the interruption of this regularity in order to make another type of gesture capable of transmitting the speaker's emotions and intentions in a more convincing way: he wants to justify why he is not interested in having children by resorting to the effective example of David.
  speech: -qu'existia o -VASco da GA:ma'
  gesture: preparation phase of gesture
3-02
  speech: (XXX)
  gesture: with right hand index touches and holds down little finger of the left hand, raising hands to the chest and lowering them again to the waist, turning thumb upwards
  speech: jogar BASquete'
  gloss: play basket
  gesture: lowers right hand and moves it to the right, stretching index finger upwards;
3-03
  speech: -fiquEI=a
  gesture: touches left little finger with right index and raises hands; lowers and moves hands apart at 'banquete', smiling and looking at hearers.
  speech: a sabEr
  gloss: I learned
  gesture: raises and stretches out right arm, pointing with the index at hearer on the right.
3-04
  speech: -qu'exisTIA=A (nA:::::me)'
  gloss: (name)
  gesture: holds ring finger down and raises hand almost to the chest
  gesture: beats four times with palm of one hand against back of the other hand, making trajectories of considerable amplitude.
  gesture: one identical beat and back to rest position
3-05 GF
  speech: ai ?
  gloss: really?
3-06 DS
  speech: `yah'
  gloss: yes
  gesture: hand at rest position, always looking at the hearers
3-07 GF ((laughter))
3-08 NP ((laughter))
  gesture: moves arms to both sides, palms turned upwards, turns head to DS and moves arms again to the front, to rest position, hands relaxed between his legs.
  gesture: leaning backwards, head turned to the speaker, arms relaxed, dangling on both sides of the chair.

  speech: -euFUI-
  gloss: I sware

  speech: DiJEI:::-
  gloss: I have been dj
4-04
  gesture: raises hands, left hand palm upwards, stretches right index out (preparation)
  speech: eu fui di `JEI:::
  gloss: I was a dj
  gesture: repetition of gesture, touches the same finger, and head movement
4-05
  speech: em minh`A casa- `tenho
  gloss: at home I have
4-06
  speech: `tinha `TEnho
  gloss: I had, I have
  gesture: moves both hands to the right, turning head to the left and looking down.
4-07
  speech: um aparelho
  gloss: damned
  gesture: turns head to the front; opens arms wide to both sides, depicting size.
  gloss: big
  gesture: brings hands together in front of the chest, maintaining elbows raised, at shoulder height
[5] discussion
The conclusions that can be drawn from the above analysis are the following:
(i) An ideal and complete listing activity is not to be found in these examples
of spontaneous interactions.
(ii) Listing activities are structured and marked as such by morphosyntactic,
prosodic, and nonverbal means. These means help the speaker structure
and organize his/her discourse and provide the listeners with interpretation cues that allow them to decode without effort the information conveyed by the speaker.
(iii) The nonverbal cues found are listing gestures, accompanied or not by head
movements.
(iv) The listing activities described are performed with both hands: the listing hand, whose index finger or thumb is used to list on the fingers of the other hand, which has been called the listable hand (with listable fingers).
appendix
[5.1] Case Study 1
[5.2] Case Study 2
[5.3] Case Study 3
[5.4] Case Study 4
references
Auer, Peter & Elisabeth Couper-Kuhlen. 1994. Rhythmus und tempo konversationeller alltagssprache. Zeitschrift für Literaturwissenschaft und Linguistik 96. 78–106.
Boomer, Dieter & Allen T. Dittman. 1962. Hesitation pauses and juncture pauses in speech. Language and Speech 5(4). 215–220.
Ekman, Paul & Wallace Friesen. 1969. The repertoire of nonverbal behavior: categories, origins, usage and coding. Semiotica 1(1). 49–98.
Erickson, Frederick. 1992. They know all the lines: Rhythmic organization and contextualization in a conversational listing routine. In Peter Auer & Aldo Di Luzio (eds.), The Contextualization of Language, 365–397. John Benjamins.
Galhano Rodrigues, Isabel (ed.). 1998. Os sinais conversacionais de alternância de vez. Granito Editores e Livreiros.
Galhano Rodrigues, Isabel. 2001. O papel da entoação na alternância de vez. In Actas do XVI Encontro Nacional da APL, 447–458.
Galhano Rodrigues, Isabel (ed.). 2007. O corpo e a fala. Sinais verbais e não-verbais na interacção face a face. FCG/FCT.
Galhano Rodrigues, Isabel. 2010. Gesture space and gesture choreography in European Portuguese and African Portuguese interactions: a pilot study of two cases. In Stefan Kopp & Ipke Wachsmuth (eds.), International Gesture Workshop 2009, 23–33. Springer.
Goffman, Erving (ed.). 1974. Frame Analysis. An Essay on the organization of experience. Harper Colophon Books.
Goldman-Eisler, Frieda. 1972. Pauses, clauses, sentences. Language and Speech 15(2). 103–113.
Goodwin, Charles (ed.). 1981. Conversational Organization. Interaction between speakers and hearers. Academic Press.
Gumperz, John (ed.). 1982a. Discourse Strategies. Cambridge University Press.
Müller, Cornelia, Alan Cienki, Ellen Fricke, Silva Ladewig, David McNeill & Sedinha Tessendorf (eds.). 2014. Body-Language-Communication. An International Handbook on Multimodality in Human Interaction, vol. 2. de Gruyter Mouton.
Roulet, Eddy, Antoine Auchlin, Jacques Moeschler, Christian Rubattel & Marianne Schelling (eds.). 1985. L'articulation du discours en français contemporain. Peter Lang.
Sacks, Harvey, Emanuel Schegloff & Gail Jefferson. 1974. A simplest systematics for the organization of turn-taking for conversation. Language 50. 696–735.
Schegloff, Emanuel, Gail Jefferson & Harvey Sacks. 1977. The preference for self correction in the organization of repair in conversation. Language 53. 361–382.
Selting, Margret. 1988. The role of intonation in the organization of repair and problem handling sequences in conversation. Journal of Pragmatics 12. 293–322.
Selting, Margret, Peter Auer, Birgit Barden, Jörg Bergmann, Elisabeth Couper-Kuhlen, Susanne Günthner, Christoph Meier, Uta Quasthoff, Peter Schlobinski & Susanne Uhmann. 1998. Gesprächsanalytisches Transkriptionssystem (GAT). Linguistische Berichte 173. 91–122.
Selting, Margret & Elisabeth Couper-Kuhlen. 2000. Argumente für die entwicklung einer interaktionalen linguistik. Gesprächsforschung - On-line-Zeitschrift zur verbalen Interaktion 1. 76–95.
Sinclair, John & Malcolm Coulthard (eds.). 1975. Towards an Analysis of Discourse. The English used by teachers and pupils. Oxford University Press.
Uhmann, Susanne. 1992. Contextualizing Relevance: On some forms and functions of speech rate changes in everyday conversation. In Peter Auer & Aldo di Luzio (eds.), The Contextualization of Language, 297–336. John Benjamins.
contacts
Isabel Galhano Rodrigues
Faculdade de Letras da Universidade do Porto
irodrig@letras.up.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 283–300. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
automatic translation in interaction with machines
ANTÓNIO TEIXEIRA, JOSÉ CASIMIRO PEREIRA, PEDRO FRANCISCO AND NUNO ALMEIDA
abstract
Automatic translation is usually associated with conversion between human languages. Nevertheless, new forms of translation have emerged in human-machine interaction scenarios. This work presents two examples. The first, from the area of Natural Language Generation, is a data-to-text system in which data about a medication plan, stored in a database, is translated into Portuguese. The second is a system addressing the transmission of information from humans to computers, showing that automatic translation can be useful in the development of systems that use voice commands for interaction and have multilingualism as a requirement. The examples presented, part of our recent work, demonstrate the growing number of application areas for automatic translation, an area that has received many valuable contributions from Belinda Maia.
[1] introduction
Automatic translation of language is generally associated with conversion between human languages. However, interaction with computers (or with systems that integrate them, such as robots) is essentially the transmission of information, and natural languages are the best means created so far by humankind to encode information; as Santos (1992) argues, for example, "Natural language is so far the most comprehensive tool for (humans to) encode and reason with knowledge". It is therefore natural that automatic translation should have roles to play in our interaction with machines (and in the machines' interaction with us).
Our interaction with machines is, in general, bidirectional. Take as an example a simple application that informs us about the weather forecast, running on one of the increasingly ubiquitous Smartphones (a term we do not dare to translate...). Natural language can be used to convey the forecast for the week, in the form of a text or even through the reading of that text by a speech synthesizer (two examples of what are usually called output modalities). Another use in interaction consists of navigating the various pieces of available information using voice commands, saying, for example, "I want to know the forecast for the next few days" (an exam-
At least with the most common technology. There are recent proposals for systems that integrate these two parts.
[2] translation in data-to-text conversion for portuguese
[2.1] Implementation
In general terms, the operation of the system is illustrated in Figure 2:
[Figure 2: block diagram. Labels: database with medication information; sentence generator; Moses (phrase-based variant and syntax-based variant); application.]
Corresponding sentence
Helena pode tomar agora o Seretaide.
Vai-se deitar então tome quatro comprimidos Primperan.
Antes de deitar senhor Lima não se esqueça da bomba de inalação.
Dona Teresinha está na hora de almoço tome os três comprimidos Ibuprofeno.
Results (1) and (2), and other similar ones, can be compared with the sentences generated by humans that are part of the corpus. Two examples of system output (marked with S) are presented below, aligned with sentences produced by humans (marked with H). These examples show the alignment, for the same input, between the human-written sentences, which serve as reference, and those generated by the system. The sentences are shown here in lowercase so that their differences can be highlighted, as explained next. In the alignments, the system's failure to include a word in the sentence is marked with ***. When words are swapped or added, these differences are highlighted in uppercase. This highlighting is aimed especially at the cases marked with ***.
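The marking convention described above can be sketched in Python. This is only an illustration, not the authors' alignment tool: it assumes a word-level alignment computed with the standard-library `difflib.SequenceMatcher`, and the function name `mark_alignment` and the system sentence used in the example are hypothetical.

```python
from difflib import SequenceMatcher


def mark_alignment(human, system):
    """Align two sentences word by word and mark their differences.

    Matching words stay in lowercase, differing words are uppercased,
    and a word missing on one side is shown as *** on that side.
    """
    h = human.lower().split()
    s = system.lower().split()
    h_out, s_out = [], []
    matcher = SequenceMatcher(a=h, b=s, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        h_part, s_part = h[i1:i2], s[j1:j2]
        if op == "equal":
            h_out += h_part
            s_out += s_part
        else:
            # Pad the shorter side with *** so both lines stay aligned.
            width = max(len(h_part), len(s_part))
            h_out += [w.upper() for w in h_part] + ["***"] * (width - len(h_part))
            s_out += [w.upper() for w in s_part] + ["***"] * (width - len(s_part))
    return " ".join(h_out), " ".join(s_out)


# Hypothetical pair: the system output drops the article "os".
h_line, s_line = mark_alignment("tome os três comprimidos", "tome três comprimidos")
print("H:", h_line)
print("S:", s_line)
```

In this sketch a dropped word appears as *** in the S line and in uppercase in the H line, which mirrors the convention used in the examples.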
(3)
(4)
In (3) the main difference results from the system using the singular for OS COMPRIMIDOS. In a scenario in which the generated sentence is the only information trans-
(6)
(8)
These examples reveal a poorer performance of the syntax-based variant of the system, possibly because of the rather limited size of the corpus used and because good performance of the syntactic annotation has not yet been achieved. The authors regard these results not as proof that this variant of the system has less potential, but as a challenge to improve the performance of the additional processes it involves.
[3] translation in support of multilingual voice interaction
As mentioned in the introduction, one way of configuring speech recognition and understanding is through grammars defined for the application at hand. We adopted this approach in developing several applications that support voice interaction, most notably the AALFred assistant (Saldanha et al. 2013; Teixeira et al. 2014b) of the AAL PaeLife project. So that interaction can take place in multiple languages (AALFred currently supports English, Portuguese, French, Hungarian and Polish), a base semantic grammar is defined and the grammars for the other languages are obtained by translation, followed by manual verification during development. The grammars needed by the speech recognizer are also derived automatically.
[3.1] Implementation
To meet the needs of distributed application development and to ease manual verification, the system was implemented as a web service and an associated web portal.
The system (Figure 3) is dual in functionality. It supports development and use in real interaction contexts, at which point the grammars in the various languages become available and are selected according to the language in use.
In interaction contexts, the system is responsible for natural language understanding, drawing on the grammars sent to the service during development. It receives the speech recognition output and returns the extracted semantic information. On request, it also returns the information about words and sentences needed to configure the speech recognizer.
Given the limitations of automatic translations, the service also supports manual revision and subsequent updating of grammars. This use is particularly suitable during the development phase of an application such as AALFred, as it allows each partner involved in the project to review and correct the automatically generated grammars.
All operations are performed through APIs², ensuring consistent and complete control of operation.
To allow new grammars to be introduced, a specific interface is needed for the developer. This interface makes it possible to submit a grammar and check the results of its translation, both in terms of the generated grammar and of the sentences it generates.
[2]
[3.2]
Phoenix (Ward 1990) was chosen as the parser, and its grammar specification format was also adopted. The choice was based on Phoenix's robustness to recognition errors and on the performance and versatility it has shown in a wide variety of applications.
The Phoenix semantic analysis system (Ward 1990) directly models the semantics of a specific domain using semantic grammars based on frames and slots. Each slot has an associated context-free grammar, which specifies the word-sequence patterns that match the slot and is compiled into a recursive transition network (RTN). Slots are filled by matching the word sequence of the sentences under analysis against these recursive networks (Tur & De Mori 2011, p. 51).
The aim of the parser is to extract the semantic annotations (tags), as defined in the semantic grammar. This operation is performed on the list of words supplied by the speech recognition system. After this task, the text is sent, together with its annotations, to the Interaction Manager for processing. Finally, the end result is used by the application.
Phoenix grammars contain the context-free rules that specify the word patterns. A small example grammar is presented in (9).
[Main]
([AGENDA])
([CONTACTS])
;
[CONTACTS]
(show this contact's [PHOTOS])
;
[PHOTOS]
(photographies)
(photos)
(pictures)
;
The rules, one per line, appear between round brackets. Names between square brackets denote non-terminals. Words in lowercase denote terminal symbols. It is possible to indicate that something is optional using * or that it can occur one or more times using +.
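To make the notation concrete, the following Python sketch reads a grammar written in this format and lists the sentences it generates. It is an illustration under stated assumptions, not the Phoenix implementation: the function names `parse_grammar` and `expand` are hypothetical, the grammar must be non-recursive, every referenced net must be defined (so the undefined [AGENDA] net is left out of the example), and `+` is simplified to a single occurrence.

```python
import itertools
import re


def parse_grammar(text):
    """Parse a Phoenix-style grammar into {net_name: [rule, ...]}.

    Each rule is a list of tokens (terminal words or [NET] references).
    """
    nets, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == ";":                        # end of the current net
            current = None
        elif re.fullmatch(r"\[\w+\]", line):   # net header such as [PHOTOS]
            current = line
            nets[current] = []
        elif line.startswith("(") and line.endswith(")") and current:
            nets[current].append(line[1:-1].split())
        else:
            raise ValueError(f"unexpected line: {line!r}")
    return nets


def expand(nets, symbol="[Main]"):
    """Yield every word sequence derivable from `symbol`."""
    for rule in nets[symbol]:
        choices = []
        for token in rule:
            optional = token.startswith("*")
            name = token.lstrip("*+")          # assumption: '+' means one occurrence
            if name.startswith("["):
                alternatives = list(expand(nets, name))
            else:
                alternatives = [[name]]
            if optional:
                alternatives = alternatives + [[]]
            choices.append(alternatives)
        for combination in itertools.product(*choices):
            yield [word for part in combination for word in part]


GRAMMAR = """
[Main]
    ([CONTACTS])
;
[CONTACTS]
    (show this contact's [PHOTOS])
;
[PHOTOS]
    (photographies)
    (photos)
    (pictures)
;
"""

sentences = [" ".join(words) for words in expand(parse_grammar(GRAMMAR))]
print(sentences)
```

Keeping rules as token lists, rather than compiling them into transition networks, is a deliberate simplification: it is enough to enumerate sentences for translation, but it would not serve as a parser.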
Translation
The translation process consists of submitting the result of the expansion (words plus their grammar rules/history) and receiving the resulting translated sentences (pairing the words in the translation with the corresponding words in the source).
For translation, the choice fell on the Bing translator (Microsoft 2014), used through the Microsoft Translator API (Microsoft 2015), because of its ability to provide information on word reordering. This information makes it easier to match the words of the translation with the source words, which is essential for reordering the words when the rules are reconstructed. In addition, this translator can return multiple translations per request, which makes it possible to expand an existing grammar to support several similar sentences without additional input. We can thus increase the coverage of our grammar automatically and effortlessly.
The sentences resulting from the translation of the expansion presented earlier, in (10), are shown in (11).
(11)
Grammar reconstruction
When the grammar is parsed (in order to expand it afterwards), a different object is created for each instance of any rule. As a result, for each terminal word present in the sentence resulting from the grammar expansion, we can determine exactly which rule gave rise to the path leading to it, after translation. As reordering information is available, we know which rules generated the text resulting from the translation.
The algorithm developed uses the grammar's expansion history and the translated sentences. It consists of analysing the ancestors' history information to rebuild the grammar. This is done by merging non-terminals of the same level across the whole grammar, in a top-down approach. Figures 4 and 5 illustrate the initial and final stages of this process.
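The idea can be sketched in Python as follows. This is a simplified illustration, not the authors' algorithm: the function names are hypothetical, the word alignment is assumed to arrive as (source index, target index) pairs (the actual format returned by the translation service is not reproduced here), and nets are identified by name only, whereas the text above states that a distinct object is created for each rule instance.

```python
from itertools import groupby


def project_history(source_history, target_words, alignment):
    """Give each translated word the rule path of its aligned source word."""
    tagged = [None] * len(target_words)
    for src, tgt in alignment:
        tagged[tgt] = (target_words[tgt], source_history[src])
    if None in tagged:
        raise ValueError("every target word needs an aligned source word")
    return tagged


def rebuild_rules(tagged, depth=0, rules=None):
    """Rebuild rules top-down from (word, path) pairs sharing path[:depth + 1]."""
    rules = {} if rules is None else rules
    net = tagged[0][1][depth]

    def child_net(item):
        path = item[1]
        return path[depth + 1] if len(path) > depth + 1 else None

    body = []
    # Consecutive words that belong to the same child net are merged into
    # one non-terminal reference; the others are terminals of this net.
    for child, group in groupby(tagged, key=child_net):
        group = list(group)
        if child is None:
            body.extend(word for word, _ in group)
        else:
            body.append(child)
            rebuild_rules(group, depth + 1, rules)
    rules.setdefault(net, []).append("(" + " ".join(body) + ")")
    return rules


# Source expansion "show this contact's photos" with the path of each word.
history = [
    ("[Main]", "[CONTACTS]"),
    ("[Main]", "[CONTACTS]"),
    ("[Main]", "[CONTACTS]"),
    ("[Main]", "[CONTACTS]", "[PHOTOS]"),
]
target = ["mostrar", "fotos", "deste", "contacto"]
alignment = [(0, 0), (1, 2), (2, 3), (3, 1)]  # assumed reordering information

rules = rebuild_rules(project_history(history, target, alignment))
print(rules)
```

With these assumed inputs, the translated words keep the net they came from even though "fotos" has moved, so the [CONTACTS] rule is rebuilt with [PHOTOS] in its new position.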
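[Figure 4: initial stage. Each translated word with its expansion history: mostrar: [Main] [CONTACTS]; fotos: [Main] [CONTACTS] [PHOTOS]; deste: [Main] [CONTACTS]; contacto: [Main] [CONTACTS].]
[Figure 5: final stage. Surviving labels: [PHOTOS], fotos, deste, contacto.]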
[AGENDA]
(agenda)
([CHANGEDATE])
(go to my agenda)
(*go *to [NEXT] [DATEELEMENT])
(*go *to [PREVIOUS] [DATEELEMENT])
[AGENDA]
(abre a minha agenda)
(abre [WEEKDAYS])
(abrir *a agenda)
(abrir a minha agenda)
(*abrir [WEEKDAYS])
(agenda)
([CHANGEDATE])
(*eu quero ver a minha agenda)
(ir para a minha agenda)
(ir para *a [NEXT] [DATEELEMENT])
(ir para *a [PREVIOUS] [DATEELEMENT])
(ir para [WEEKDAYS])
(mostra-me a minha agenda)
(mostra-me [WEEKDAYS])
(mostra a minha agenda)
(mostrar a minha agenda)
(mostrar [WEEKDAYS])
([NEXT] [DATEELEMENT])
([PREVIOUS] [DATEELEMENT])
(quero ver [WEEKDAYS])
;
[4] conclusions
of these technologies and the adoption of systems capable of holding a dialogue with the user, the differences between languages and cultures will have to be taken into account. For example³, in some languages there are more preliminaries, which should imply different dimensions for each block or even the need to add blocks that do not exist in the original language. We believe that the existing prototype can play a relevant role in the creation of comparable corpora, by allowing interactions to be collected in similar situations for different languages.
These two examples show the growing usefulness of automatic translation systems, even for our communication with machines. These possibilities only became feasible through the work of many, among whom Belinda stands out, so that translation, automatic or not, could evolve.
To conclude, we consider that the relationship between machines, humans and translation adds even more richness to the relationship that Belinda Maia has always considered beneficial between (human) translators and machines, in which machines can help humans with translation. In the examples presented, translation helps communication/interaction between those same humans and the same, or other, machines.
acknowledgements
The authors thank all those who contributed to the creation of the corpus and all those who took part in the evaluation of the sentences, which made the work on sentence generation possible. Special thanks go to Mário Rodrigues for his help in obtaining and using the syntactic parser for Portuguese.
Regarding the work on the translation of semantic grammars, the authors must thank all the partners of the AAL PaeLife project, and in particular the Microsoft Language Development Center (MLDC), for their help in defining requirements, for the feedback they provided, and for adopting this component in AALFred.
The authors thank Samuel Silva for his invaluable help in revising the text.
The work described in this article was partially funded by FEDER, COMPETE and FCT through the projects AAL/0015/2009, AAL PaeLife, QREN AAL4ALL and funding to the IEETA research unit (PEst-OE/EEI/UI0127/2014).
We also thank the Editors of this volume for the invitation, which greatly honours us, for their help, comments and availability throughout the process and, much more importantly, for dedicating themselves to this noble initiative.
[3]
references
Araújo, Roberto, Rafael Oliveira, Eder Novais, Thiago Tadeu, Daniel Pereira & Ivandré Paraboni. 2010. SINotas: the Evaluation of a NLG Application. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), 2388–2391.
Bateman, John & Michael Zock. 2003. Natural language generation. In Ruslan Mitkov (ed.), The Oxford Handbook of Computational Linguistics, 284–304. Oxford University Press.
Ferreira, Flávio, Nuno Almeida, Ana Filipa Rosa, André Oliveira, José Casimiro Pereira, Samuel Silva & António Teixeira. 2014. Elderly centered design for interaction - the case of the S4S medication assistant. In Procedia Computer Science, vol. 27, 398–408.
Hunter, James, Yvonne Freer, Albert Gatt, Ehud Reiter, Somayajulu Sripada, Cindy Sykes & Dave Westwater. 2011. BT-Nurse: computer generation of natural language shift summaries from complex heterogeneous medical data. Journal of the American Medical Informatics Association (JAMIA) 18. 621–624.
Koehn, Philipp. 2014. MOSES: Statistical Machine Translation System - User Manual and Code Guide. http://www.statmt.org/moses/manual/manual.pdf.
Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Wade Shen, Christine Moran, Richard Zens, Ondřej Bojar, Alexandra Constantin & Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In 45th annual meeting of the association for computational linguistics (demo and poster sessions), 177–180.
Langner, Brian. 2010. Data-driven Natural Language Generation: Making Machines Talk Like Humans Using Natural Corpora: Carnegie Mellon University. PhD thesis.
Langner, Brian & Alan W. Black. 2009. MOUNTAIN: A Translation-based Approach to Natural Language Generation for Dialog Systems. In First International Workshop on Spoken Dialogue Systems Technology (IWSDS), n.p.
Lemon, Oliver. 2010. Learning what to say and how to say it: joint optimization of spoken dialogue management and natural language generation. Computer Speech & Language 25. 210–221.
Microsoft. 2014. Bing translator. http://www.bing.com/translator/.
Microsoft. 2015. Microsoft translator API. http://www.microsoft.com/translator/translator-api.aspx.
Novais, Eder, Rafael Oliveira, Daniel Pereira & Thiago Tadeu. 2009. A Testbed for Portuguese Natural Language Generation. In Seventh Brazilian Symposium in Information and Human Language Technology, 154–157.
Pereira, José Casimiro, António Teixeira & Joaquim Sousa Pinto. 2012. Natural Language Generation in the context of Multimodal Interaction in Portuguese. Electrónica e Telecomunicações 5. 400–409.
Portet, François, Ehud Reiter, Albert Gatt, Jim Hunter, Somayajulu Sripada, Yvonne Freer & Cindy Sykes. 2009. Automatic generation of textual summaries from neonatal intensive care data. Artificial Intelligence 173. 789–816.
Reiter, Ehud & Robert Dale. 2000. Building natural language generation systems. Cambridge University Press.
Saldanha, Nuno, Jairo Avelar, Miguel Dias, António Teixeira, Daniel Gonçalves, Emmanuel Bonnet, Karine Lan, Németh Géza, Petra Csobanka & Artur Kolesinski. 2013. A Personal Life Assistant for natural interaction: the PaeLife project. In AAL Forum, poster presentation.
Santos, Diana. 1992. Natural Language and Knowledge Representation. In Proceedings of the ERCIM Workshop on Theoretical and Experimental Aspects of Knowledge Representation, 195–197.
Santos, Diana & Alberto Simões. 2015. Ensinador paralelo: Alicerces para uma pedagogia nova. In this volume.
Stent, Amanda & Martin Molina. 2009. Evaluating automatic extraction of rules for sentence plan construction. In Proceedings of the SIGDIAL 2009 Conference: 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 290–297.
Stent, Amanda, Rashmi Prasad & Marilyn Walker. 2004. Trainable sentence planning for complex information presentation in spoken dialog systems. In Proceedings of the 42nd annual meeting on association for computational linguistics, 79–86.
Teixeira, António, Pedro Francisco, Nuno Almeida, Carlos Pereira & Samuel Silva. 2014a. Services to support use and development of speech input for multilingual multimodal applications for mobile scenarios. In The Ninth International Conference on Internet and Web Applications and Services (ICIW), Track Web Services-based Systems and Applications, 41–46.
Teixeira, António, Annika Hämäläinen, Jairo Avelar, Nuno Almeida, Géza Németh, Tibor Fegyó, Csaba Zainkó, Tamás Csapó, Bálint Tóth, André Oliveira & Miguel Sales Dias. 2014b. Speech-centric multimodal interaction for easy-to-access online services. In Procedia Computer Science, vol. 27, 389–397.
contacts
António Teixeira
Departamento de Electrónica Telecomunicações e Informática/IEETA
Universidade de Aveiro
ajst@ua.pt
José Casimiro Pereira
Instituto Politécnico de Tomar
casimiro@ipt.pt
Pedro Goucha Francisco
IEETA, Universidade de Aveiro
goucha@ua.pt
Nuno Almeida
Departamento de Electrónica Telecomunicações e Informática/IEETA
Universidade de Aveiro
nunoalmeida@ua.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 301–322. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
Plagiarism has traditionally been classed as an immoral act that violates ethical norms, rather than as an illegal action (Garner 2009; Goldstein 2003), and journalistic plagiarism is no exception. As Coulthard & Johnson (2007) point out, text reuse by journalists, with no or inadequate attribution of authorship, is not usually considered plagiarism. In addition, the conventions on the reuse of news agency copy are not universal. However, the serious consequences of journalistic malpractice (such as the case of Jayson Blair, of The New York Times) show that the implications are not confined to the sphere of ethics but, on the contrary, have legal impact, including dismissal proceedings. One of the problems, however, lies in proving that a given instance of text reuse is plagiarism.
This study presents the results of a forensic linguistic analysis that can be used to prove suspected cases of plagiarism or to start investigating unsuspected texts. In order to identify which mechanisms journalists use, and how, to compose their own texts from news agency copy, this work compares news published in the World ("Mundo") section of Portuguese quality newspapers with possible sources published in English. The results of the analysis show that: (a) authorship attribution is often inadequate, even when quality newspapers cite their sources (usually well-known international agencies); (b) there is not always a direct correspondence between the plagiarising version and a single plagiarised source (indicating reuse of text from different international media and websites); and (c) news is plagiarised from texts published in other languages, which constitutes translingual plagiarism. It is concluded that forensic linguistic analysis has evidential and investigative potential in cases of plagiarism and copyright infringement, both monolingual and translingual.
[1] news plagiarism
News plagiarism has been perhaps one of the most challenging areas of research
into plagiarism. Unlike student plagiarism, text reuse by journalists with little or
rui sousa-silva
no attribution at all does not usually seem to be regarded as plagiarism (Angélil-Carter 2000; Coulthard & Johnson 2007), not even when substantial amounts of text are reused. This is one of the problems reported by Angélil-Carter (2000) in her discussion of the subject. As the borderline of plagiarism depends as much on its definition and on the author's intention as on the text genre, the use of large amounts of text by journalists with little or no attribution tends to be overlooked. This results from the underlying assumption that news pieces are expected to report on real-world facts and events. Since, for reasons of faithfulness, these facts and events cannot be reported differently, the more faithfully journalists report them, the more professionally they act, and the more textual overlap is to be expected. Therefore, texts reporting those facts and events can hardly be charged with plagiarism.
Another reason for this apparent leniency with news text lifting is that news
corporations frequently subscribe to paid newswire services whose contents they
are allowed to reuse. Additionally, when faced with the need to acknowledge their sources, journalists seem to have a double standard. On the one hand, they do not hesitate to cite their primary sources clearly (keeping their identity confidential when necessary to protect them) in order to ensure the truthfulness of the news piece. In some extreme cases, they even resist pressure to identify these sources. On the other hand, they often reuse text from other (secondary) sources to write their articles, while not always citing them. This is the case when they reuse text from other media organisations, or even from newswire services.
Notwithstanding these underlying assumptions, journalists have been punished for plagiarising. In February 2015, Jared Keller, the news director of the
news site Mic, was fired after he was found to have lifted passages of text from
other news sources. Keller reproduced the text literally or with minor changes,
with little or no reference to the source. Where he provided a reference, this was
made in passing. That same month, the columnist Tanveer Ahmed was dismissed by The Australian after a blogger accused him of plagiarising an American political website. Two years earlier, the New Yorker writer Jonah Lehrer was fired for recycling New Yorker blog posts, among other misdeeds. One of the most paradigmatic cases, however, is that of Jayson Blair, who in 2003 resigned from The New York Times after facing accusations of journalistic fraud, including plagiarism. In particular, he was accused of lifting material from newswire services and other newspapers, such as the Washington Post and The San Antonio Express-News. In 2007, a reader of the Portuguese quality newspaper Público found that the journalist Clara Barata had plagiarised from other sources, including Wikipedia. This case is even more complex than the others, as the texts were lifted not from an original in the same language, but from an original in another language. A similar case is that of a reporter of the Telegraph-Journal in Canada, who was fired in 2009 for lifting a news piece from L'Acadie Nouvelle.
This paper investigates how a forensic linguistic analysis can assist the detection and/or provision of evidence of news plagiarism. It builds on the assumption
that it is crucial to devise a method for identifying the textual elements that can
be used to flag a text as a potential instance of plagiarism, not only to raise suspicion about its originality, but also to develop translingual plagiarism detection
techniques (Sousa-Silva 2014). A method of this type is presented below.
[2] news, plagiarism, and lifting
Indeed, although a vast body of research into plagiarism has been published over
the last decades (Anderson 1998; Angélil-Carter 2000; Carroll 2001; Carroll & Appleton 2001; Jameson 1993; Lindey 1952; Pecorari 2008; Howard & Robillard 2008; Roig 2001; Scollon 1995; Howard 1995), it has focused mostly on academic plagiarism, to the detriment of other instances of text reuse. One of the reasons why academic plagiarism has attracted most research attention is that it is seen as an educational issue that needs to be identified during the students' academic path (Carroll 2001; Carroll & Appleton 2001), especially in order to teach students how to adopt appropriate academic conduct (Howard 1995). By contrast, comparatively little research has been conducted into news text reuse. This is supported by the strong views, usually matching the infringing journalists' argument, that writing news pieces is different from academic writing, and that citing all the secondary sources used is impractical if the readability of the article is to be preserved. Paradoxically, although the conventions and regulations applying to the use of newswire copy are not universal, they tend to be clear in this respect. Cases of such conventions and regulations abound. Agencies require that the source(s) be credited, and forbid the unacknowledged use of authored articles, i.e. news pieces signed by individual reporters, rather than being simply news wires.
The Reuters Handbook of Journalism (Reuters 2008), for example, describes plagiarism as a "cardinal sin". It strongly argues that, whereas ethical guiding principles contribute to better journalism, rigid rules restrict and constrain the ability to operate. The Reuters Style Guide states in addition that, in accordance with the Reuters Code of Conduct, the company's journalists are required to always search for and report the truth, fairly, honestly and unfailingly (Reuters 2008, p. 1). In addition to stating that plagiarism is a "cardinal sin", this style guide considers fabrication and plagiarism two of the "10 Absolutes of Reuters Journalism". Reuters journalists are, therefore, required to properly attribute material that is not theirs to its source, and are instructed that it is insufficient to label video or a photograph as a "handout"; on the contrary, it is a requirement that the source be clearly identified. This style guide further states that "it is essential for transparency that material we did not gather ourselves is clearly attributed in stories to the source, including when that source is a rival organisation" and concludes that "failure to do so may open us to charges of plagiarism" (Reuters 2008, p. 5).
Likewise, the International Federation of Journalists¹ (IFJ) and the Portuguese journalists' union (Sindicato dos Jornalistas²) consider plagiarism a serious professional offense. Similarly, the style guide of the main Portuguese quality newspaper, Público³, establishes that plagiarism is forbidden by the newspaper, and adds that all relevant information collected from other media organisations or news agencies must be attributed. In cases where the news piece is based on news wires of different agencies, these should be cited in the text in order of their contribution to the news article. When the news wires are used as mere sources, and the article is mainly written by the journalist, the agencies should be cited in the body of the news article. But if the article is based mainly on news wires, then a reference to these should be included. In addition, the style guide explicitly states that texts translated from other languages should be clearly marked as translations and include the translator's name.
It is then unsurprising that, in accordance with its policy, Público published an apology, in 2006, for one of its journalists, Clara Barata, who had published an article that was mainly translated from the New Scientist and Wikipedia. The suspicion was raised by a reader, who noticed that the text looked familiar when he first read it, and later identified the original sources. The newspaper initiated an investigation and found that the journalist had plagiarised 13 significant extracts using translation. The case was compared to that of the famous New York Times journalist Jayson Blair, who in 2003 was dismissed after the newspaper was challenged by other news organisations over accusations of plagiarism. Cases of news plagiarism have, however, long been reported. In 1996, another news organisation, the Portuguese news agency Lusa, submitted a complaint to the journalists' union, Sindicato dos Jornalistas, claiming that several Portuguese media organisations were plagiarising texts authored and signed by its own journalists, which were not included in newswire services.
Given the stance adopted by these organisations and media self-regulatory
measures, news plagiarism cases have unsurprisingly been addressed more often by self-regulation, codes of ethics and deontology than by the law. Yet this traditional perspective of journalism as being exempt from plagiarism has been challenged, not least by journalistic practice, as illustrated by the cases discussed above. It is thus evident that, despite reporting facts, news is subject to principles of originality as much as other text genres, including student assignments. News plagiarism therefore is not treated much
differently from academic plagiarism. Like academic plagiarism, it is not only
subject to internal rules and regulations, but also tends to be resolved internally
by the respective organisations.
[1] See http://www.ifj.org/en
[2] See http://www.jornalistas.online.pt/
[3] Available at http://static.publico.clix.pt/nos/livro_estilo/16p-palavras.html
In recent years, many people, from literary critics and copyright lawyers to teachers and forensic linguists, have shown a growing interest in the field of plagiarism and plagiarism detection, even if for different reasons (Coulthard & Johnson
2007). Whereas the literary critic may be interested in judging the literary quality of a literary work, the teacher is more interested in educating students and
hence concerned more with the moral values of plagiarism itself, than with the financial implications of the infringement (Howard 1995; Robillard & Howard 2008;
Scollon 1994, 1995). The copyright lawyer, on the contrary, is likely to be more interested in the financial implications of plagiarism and to seek the corresponding compensation.
Plagiarism has been traditionally considered an immoral, more than an illegal
act (Garner 2009). Consequently, it should be more appropriately addressed as an
ethical, rather than a legal offense (Goldstein 2003). This is especially so because
the works entitled to protection are immaterial and ubiquitous. As a result, they
can be simultaneously used by different people, thus compromising the original
author's ability to control the use of his/her own work (Pereira 2003, p. 20).
However, it has been demonstrated that plagiarism is indeed both immoral and illegal (Finnis 1991; Eiras & Fortes 2010), which makes it punishable by law (Pereira 2003). Plagiarism is thus more appropriately addressed as both a moral and a legal issue. As I argued elsewhere, "[o]n the moral side, plagiarism brings social implications, with the power to ruin the reputation of the plagiarist; on the legal side, it implies the infringement of moral rights, and often financial rights, both of which are punishable by law" (Sousa-Silva 2013, p. 61). Indeed, as these financial rights are more easily quantifiable than the respective moral rights, it is not surprising that they are the ones more promptly addressed by the courts. It is not uncommon for instances of plagiarism to bring serious legal implications. Nor are the cases brought before the courts of law restricted to those having financial implications. Many high-profile cases brought to the fore in recent years show that plagiarism is not only seen as a violation of codes of ethics, but is also punished. News plagiarism is not an exception, as the cases presented above demonstrate.
This makes plagiarism well suited for a Forensic Linguistics approach, as forensic linguists take as their research object the legal aspect of the act and the result of such an act. In legal cases, forensic linguists can, and do, not only assist the investigative procedures, by helping ethics committees, boards and decision makers determine lifting; they also provide linguistic evidence to a court as to whether two or more texts have been produced independently, or whether they build upon a previous original text.
Forensic linguistics is the field of linguistics that applies a linguistic analysis across all types of interaction in the legal context (Caldas-Coulthard 2014). In
other words, this field is above all focused on all aspects of the interaction between language and the law. However, linguists operating in forensic contexts
have contributed significantly to cases that span beyond the purely legal. In
the field of plagiarism in particular, linguistic analyses have made significant advances in recent years in the detection of same-language plagiarism and translingual plagiarism alike. It has been almost 20 years since Johnson (1997) compared a set of student texts to conclude that they were not original. By devising
a method that consisted of comparing only lexical items, rather than using string
matching techniques, she demonstrated that they were a result of collusion, i.e. a
sort of group plagiarism. Although the text strings were altered in order to produce slightly different versions, a comparison of the lexical items showed that the
texts had not been produced independently.
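The contrast between string matching and a comparison of lexical items can be sketched in a few lines of Python. This is only a minimal illustration and not Johnson's (1997) actual procedure: the short list of function words, the function names and the two example sentences are all assumptions introduced here.

```python
import re

# Tiny illustrative list of English function words (an assumption;
# a real analysis would use a fuller, language-specific list).
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "to", "and", "or", "is", "are",
    "was", "were", "it", "that", "this", "by", "for", "with", "as", "at",
}

def lexical_items(text):
    """Return the set of lowercased content-word types in a text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return {t for t in tokens if t not in FUNCTION_WORDS}

def shared_vocabulary(text_a, text_b):
    """Return the shared content words and their proportion of the joint vocabulary."""
    a, b = lexical_items(text_a), lexical_items(text_b)
    shared = a & b
    union = a | b
    return shared, (len(shared) / len(union) if union else 0.0)

# Hypothetical sentences: the string is reordered, the lexical items are not.
text_a = "The committee rejected the proposal in the spring."
text_b = "In the spring, the proposal was rejected by the committee."

shared, score = shared_vocabulary(text_a, text_b)
print(sorted(shared), score)
print(text_a in text_b)  # a naive verbatim string match fails
```

In this toy case the two sentences share all of their content words even though neither string contains the other, which is the kind of signal a purely string-based comparison would miss.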
Johnson's linguistic analysis did not involve the courts, but was sufficient
to demonstrate lifting among students. And more importantly, her analytical
methods were later applied in court cases. Turell (2004) built upon Johnson's
(1997) work to investigate whether a linguistic analysis that had previously been
tried and tested with student plagiarism could also be used to successfully deter-
[307]
The Internet World Stats website reports that in 2013 English was by far the most widely used language
on the Internet; see http://www.internetworldstats.com/stats7.htm
giarism. Owing to these constraints, there is currently no means of systematically
screening texts for translingual plagiarism in the same way as there is to detect
same-language plagiarism. As a result, such cases can be detected almost exclusively by
intuition, without any computer assistance.
In most cases, translingual plagiarism consists of texts that are translated
freely and informally from another language, without acknowledging the original author. This is rarely the case with literary texts, for which a professional and acknowledged translation is usually commissioned. But translations of other text
genres (e.g. news and blog comments, besides academic texts) without attribution can easily pass unnoticed. This is mainly because, unlike the texts in Turell's
study above, they do not plagiarise another translation in the same language, but
rather the original, in another language. The text is thus not lifted word-for-word,
which makes the plagiarism more difficult to monitor.
In this respect, a forensic linguistic analysis is crucial, not only to assist the
detection procedure, but also to demonstrate the extent of the borrowing, and
whether a text is an instance of plagiarism, or on the contrary whether the textual
reuse is acceptable. More importantly, this analysis is able to provide evidence
that a text (or more than one) was not produced independently. This will be
addressed in the next section.
[4] raising suspicion and detecting plagiarism
This paper first studies the detection of verbatim reuse of news articles. Subsequently, a method is proposed to raise suspicion that a text may have been plagiarised. Thirdly, it illustrates how to find evidence that a text has plagiarised
another text in another language. This research is based on a corpus of news
pieces that are publicly available, and which are supposed to have been produced
independently, although on similar topics.
[309]
verbatim plagiarised text is in italic typeface in both instances, and the underlined text in these two annexes shows minor changes introduced to the text (which,
however, do not alter the meaning of the text).
Extract 1: Jornal de Notícias
Os microscópicos grãos de pólen das plantas poderão vir a derrubar a ideia de que
ainda há crimes perfeitos, ao dar pistas seguras para deslindar casos que desafiam
os limites da investigação criminal. A PJ já recorreu a este tipo de análise para resolução de pelo menos três crimes. O que parece fazer parte dos domínios da fábula ou
da ficção científica é uma realidade já em prática por meia dezena de investigadores
forenses no mundo [, e]. Portugal faz parte dessa vanguarda através de Mafalda
Faria, que [desenvolve o seu trabalho] trabalha na Universidade de Coimbra e
no Instituto Nacional de Medicina Legal (INML). [A metodologia, fruto também
do engenho e arte de quem a vem desbravando, não é mais do que a] A análise
do pólen e de esporos de plantas que ficam agarrados ao corpo de pessoas e de objectos [vão ajudar] vai ajudar a reconstituir o percurso e locais de acção de criminosos e vítimas. Em homicídios, violações, roubos, contrafacção de medicamentos,
tráfico, contrabando e até no combate ao terrorismo a Palinologia, ciência oriunda
da Botânica, tem vindo a ajudar as ciências forenses a investigar e a explicar crimes.
A Inglaterra e a Nova Zelândia fazem da Palinologia uma prática corrente para casos mais complexos, e é aceite como prova pericial em tribunal. Nos EUA, Austrália
e Portugal tem dado uma ajuda à investigação criminal.
PJ já recorreu a análises do polén
O contributo dos estudos de Mafalda Faria, nos dois últimos anos, foi solicitado pela
Polícia Judiciária para ajudar a reconstituir crimes como os do jovem universitário
que em Coimbra assassinou a ex-namorada, no homicídio de um homem numa quinta
de Viseu ou em casos de tráfico de droga. Para certas situações, a Palinologia é a
única que pode resolver. Se, por exemplo, se encontra a arma do crime sem impressões digitais poderá ter pólen, não daquele local, mas da sua proveniência, explica a investigadora à agência Lusa, preconizando o seu alargamento a várias áreas
da investigação criminal. Depende do tipo de crime. Se for tráfico, contrafacção ou
contrabando, são os próprios produtos analisados. No homicídio tem de se ir ao local recolher amostras das plantas e solo para analisar. Na vítima são amostras no
cabelo, nas cavidades nasais e no vestuário, se tiver, explica a investigadora.
Potencial singular para investigação criminal
Os grãos de pólen apresentam características que lhe conferem um potencial singular para a investigação criminal. Pode ser encontrado agarrado em praticamente
qualquer objecto ou pessoa, e é altamente resistente à degradação mecânica, biológica e química. Os agressores podem lavar o sangue, mas não os grãos de pólen,
porque não os vêem, por serem microscópicos, afirma Mafalda Faria, frisando que
mesmo após lavagens das roupas será possível encontrá-los nelas. Por outro lado,
esses microscópicos grãos têm uma grande capacidade de transferência, das plantas para as pessoas e entre pessoas e, ao mesmo tempo, são bastante aderentes.
A Palinologia Forense é uma investigação pós-doutoramento que Mafalda Faria, da
Faculdade de Ciências e Tecnologia da Universidade de Coimbra (FCTUC), irá concluir no final do corrente ano, sob orientação do neozelandês Dallas Mildenhall e do
português Duarte Nuno Vieira, presidente do Instituto Nacional de Medicina Legal
(INML). É financiada pela Fundação para a Ciência e Tecnologia. Ela é o resultado do
bichinho pelas ciências forenses que a levou a concorrer, sem sucesso, a lugares na
Polícia Judiciária e no INML. Queria trabalhar em investigação forense em vestígios
não biológicos, para dar sequência à sua formação em ecologia.
Extract 2: TVI
O fim dos crimes perfeitos?
A palinologia, que analisa grãos de pólen, desafia dogmas e quer ajudar a
investigação criminal
Por: Redacção /PP
Os microscópicos grãos de pólen das plantas poderão vir a derrubar a ideia de que
ainda há crimes perfeitos, ao dar pistas seguras para deslindar casos que desafiam
os limites da investigação criminal, escreve a Lusa. O que parece fazer parte dos
domínios da fábula ou da ficção científica é uma realidade já em prática por meia
dezena de investigadores forenses no mundo, e Portugal faz parte dessa vanguarda
através de Mafalda Faria, que desenvolve o seu trabalho na Universidade de Coimbra
e no Instituto Nacional de Medicina Legal (INML).
A metodologia, fruto também do engenho e arte de quem a vem desbravando, não é
mais do que a análise do pólen e de esporos de plantas que ficam agarrados ao corpo
de pessoas e de objectos e vão ajudar a reconstituir o percurso e locais de acção de
criminosos e vítimas. Em homicídios, violações, roubos, contrafacção de medicamentos, tráfico, contrabando e até no combate ao terrorismo a Palinologia, esta ciência
oriunda da Botânica, tem vindo a ajudar as ciências forenses a investigar e a explicar
crimes. A Inglaterra e a Nova Zelândia fazem da Palinologia uma prática corrente
para casos mais complexos, e é aceite como prova pericial em tribunal. Nos EUA,
Austrália e Portugal tem dado uma ajuda à investigação criminal. O contributo dos
estudos de Mafalda Faria, nos dois últimos anos, foi solicitado pela Polícia Judiciária
para ajudar a reconstituir crimes como os do jovem universitário que em Coimbra
assassinou a ex-namorada, no homicídio de um homem numa quinta de Viseu ou em
casos de tráfico de droga.
A única resposta
Para certas situações, a Palinologia é a única que pode resolver. Se, por exemplo, se
encontra a arma do crime sem impressões digitais poderá ter pólen, não daquele local, mas da sua proveniência, explica a investigadora à agência Lusa, preconizando
o seu alargamento a várias áreas da investigação criminal.
The news piece published by JN (Extract 1) has a textual overlap of 96%, i.e.
527 out of a total of 554 words (the original piece published by Lusa was 550 words
long). The text published by TVI (Extract 2) has a textual overlap of 100%. This
online news piece reused all the 550 words of the text published by Lusa, although
a few additional words were added (the text published by TVI is 566 words long).
This is the result of the slight alterations made to the original news article published in the newspaper. It should be noted that Lusa is referenced in passing, as
quotes used in the text are attributed to the news agency. However, nowhere in
the article is the authorship of the original news piece attributed to Lusa.
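The overlap figures above can be approximated with a simple word-level count. The sketch below is an assumption about how such a percentage might be computed, not a description of the procedure actually used: it treats each text as a bag of word tokens, ignores word order, and divides by the length of the original. That denominator is itself an inference, although it is consistent with the reported figures, since 527 of Lusa's 550 words rounds to 96% and 550 of 550 gives 100%. The function names and the toy sentences are hypothetical.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercased word tokens (letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def reuse_percentage(original, suspect):
    """Percentage of the original's word tokens that also occur in the suspect text.

    Word order is ignored: this is a bag-of-words approximation.
    """
    orig_counts = Counter(tokens(original))
    susp_counts = Counter(tokens(suspect))
    reused = sum((orig_counts & susp_counts).values())
    total = sum(orig_counts.values())
    return 100.0 * reused / total if total else 0.0

# Toy texts (hypothetical): one longer copy and one slightly altered copy.
original = "pollen grains can solve crimes"
longer_copy = "microscopic pollen grains can help solve some crimes"
altered_copy = "pollen grains may solve crimes"

print(reuse_percentage(original, longer_copy))   # every original word is reused
print(reuse_percentage(original, altered_copy))  # one original word is replaced

# Arithmetic check on the figures reported in the text.
print(round(100 * 527 / 550), round(100 * 550 / 550))
```

As with the TVI text, a copy can be longer than the original and still reuse all of it, which is why the length of the original, rather than that of the suspect text, is taken as the reference here.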
The piece broadcast by TVI also references Lusa in passing, by attributing the
quotes to the agency, but goes further than JN in that it attributes authorship to its own reporter and the TV station newsroom (“Redacção/PP”). The
changes introduced to the TVI text are only minor, even compared to the ones
introduced by JN. Interestingly, there is one sentence in the original article that
lacks a word, and hence the reproduction of that error raises some issues of ungrammaticality: “Se, por exemplo, se encontra a arma do crime sem impressões
digitais poderá ter pólen, não daquele local, mas da sua proveniência”. In order
for the sentence to be grammatical, at least a pronoun is needed after “digitais”
and before “poderá”, such as “ela” or “esta”. However, neither JN nor TVI seems
to have noticed it, and both reproduced the grammatical error. This provides clear
evidence that the text is not original. Furthermore, chronological aspects show
the directionality of the lifting, i.e. that JN and TVI lifted the text from Lusa (or
from each other), but not the other way around.
[313]
parking of buses transferred to the field of Kitchen. In the tower of the old
wall with the inscription Here Born Portugal plans to establish a viewpoint
that is an ideal place to observe the new floor of the square, designed by the
plastic artist Ana Jotta, based on the same rocks of quartz and basalt now
available .
The assistance will be financed by EU funds after being approved an application to the program of urban regeneration of the NSRF in the value of 9.9
million.
Authority takes possession of convent
Well near the Toural, the former Convent of Dominica, in the seventeenth
century, will be incorporated in the project of Capital of Culture. The municipality approved yesterday by the declaration of ownership of the property
where usucapio are installed several cultural associations. In the building, now dilapidated, will be installed in the residence artists. The camera
will have to find an alternative site for the installation of the seats of Tertulia Nicolina and Child Center of Popular Culture, although not yet officially
have contacted the associations. The building for the House of Memory is
also flagged. This is an old industrial plastics, the Count of Margaride avenue, into the city. This partially empty factory has an area free in the back
so that the building is created from scratch.
Extract 4:
Iran rallies planned amid clampdown
Anti-government protesters in Iran have announced they are to hold another rally in the capital to dispute the veracity of a presidential election.
Supporters of candidate Mir Hossein Mousavi called on Wednesday for a
rally to go ahead at 5pm local time (13:30 GMT), despite the authorities imposing a ban on the opposition gatherings. Mahmoud Ahmadinejad, the incumbent president, was officially declared winner of Friday's election by a
margin of two-to-one over Mir Hossein Mousavi. Hossein, a reformist candidate who was the nearest rival to Ahmadinejad, a conservative, has accused
the authorities of rigging the vote. But Ahmadinejad has said that the result
proved he has popular support. The election result confirmed the work of
the ninth government which was based on honesty and service to the people, he said on Wednesday in a statement to Iran's ISNA news agency.
Violence on tape
Despite the restrictions placed by the government on the media, violent
scenes of police beating Mousavi supporters taken on mobile phones have
been broadcast on news bulletins across the world. The Revolutionary Guard
has warned the country's online media it will face legal action if it creates
tension. Within the country, mobile phone text services have been down
since the election. There is no access to Facebook, Twitter, or YouTube.
The interior ministry has ordered an investigation into an attack on university students in which it is claimed four people were killed. Anoushaka
Maraslian, a Middle East analyst in London, told Al Jazeera: University
cities in Iran have always been very active in political dissent. That's the
concern of the elders; that's the concern of the Guardian Council, and that's
why they're making conessions, because they realise that young Iranians are
leading the protests with parallels to [the revolution in] 1979. At least
seven people have been killed in recent clashes between the authorities and
the opposition movement, according to state media reports, while hundreds
more are thought to have been injured. For its part, the foreign ministry
summoned the Swiss ambassador, who represents US interests in Tehran, on
Wednesday to protest at interventionist US statements on Iran's election.
Obama told CNBC there appeared to be little difference in policy between
Ahmadinejad and Mousavi. Either way we are going to be dealing with an
Iranian regime that has historically been hostile to the United States, he
said. Mousavi has called on his supporters to hold peaceful demonstrations
or gather in mosques on Thursday in solidarity with people killed or hurt
in the post-election unrest. In the course of the past days and as a consequence of illegal and violent encounters with [people protesting] against
the outcome of the presidential election, a number of our countrymen were
wounded or martyred, Mousavi said on his website. I ask the people to
express their solidarity with the families by coming together in mosques
or taking part in peaceful demonstrations.
[315]
give the forensic linguist a clue as to whether the text might have originated
somewhere else, in which case it would be considered plagiarism. Extracts 5
and 6 illustrate this method.
Extract 5 reproduces the article that was originally published in Portuguese.
The news piece does not attribute the text to any news agency in particular; on
the contrary, only a general reference to “Agencies” is initially made. After translating this text into English, a few sentences were selected to perform an Internet search using lexical items as keywords, while discarding function words.
These lexical items were therefore used as filtered n-grams (Maia et al. 2008). The
search based on these parameters returned two relevant articles: one was
published by The Australian newspaper [5], and the other was published on the
Channel News Asia website [6]. With the exception of minor differences in details
related to dates (e.g. “Sunday” or “weekend”) and a paragraph used by Channel
News Asia that was left out by The Australian, the two articles were entirely
identical. In both cases, authorship was attributed to the same source, Agence
France Presse (AFP) and, in the case of Channel News Asia, to “ls/yb”.
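The keyword selection step described above can be illustrated with a short sketch. It is a hypothetical reconstruction rather than the tool used in this study: the list of function words, the function name, the cap on the number of terms, and the example sentence (a plain English rendering of one sentence from Extract 5, not the actual machine translation output) are all assumptions. The resulting string would then be pasted into a search engine by hand.

```python
import re

# Illustrative list of English function words to discard (an assumption).
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "to", "and", "or", "is", "are",
    "was", "were", "it", "that", "this", "by", "for", "with", "as", "at",
}

def keyword_query(sentence, max_terms=6):
    """Build a search query from the content words of a translated sentence.

    Function words and repeated words are dropped; the remaining lexical
    items are kept in their original order and capped at max_terms.
    """
    seen, terms = set(), []
    for tok in re.findall(r"[^\W\d_]+", sentence.lower()):
        if tok in FUNCTION_WORDS or tok in seen:
            continue
        seen.add(tok)
        terms.append(tok)
    return " ".join(terms[:max_terms])

# Hypothetical English rendering of a sentence from the Portuguese article.
sentence = "Time is an essential factor in the process"
query = keyword_query(sentence)
print(query)
```

Keeping the lexical items in their original order preserves some of the phrasing of the source, while discarding function words makes the query more tolerant of the differences that translation inevitably introduces.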
Extract 6 transcribes the text published originally by The Australian. Since the
two texts are reproduced in Extracts 5 and 6 in their original language, the comparison focused on identifying the strings with overlapping ideas, rather than the
strings of identical text. The underlined text shows the overlapping strings. The
numbers at the beginning of the underlined strings show the matching strings in
the other text.
Extract 5: The Público news article
Encontro com Abbas em Washington
Obama defende um Estado palestiniano e o fim da expansão dos colonatos
2009-05-28 23:25:00 PÚBLICO, Agências
O Presidente Barack Obama defendeu hoje a criação de um Estado palestiniano. [01]No fim do seu primeiro encontro com o presidente da Autoridade
Palestiniana, o líder norte-americano repetiu uma vez mais o seu [02]apelo a
Israel [02]para que ponha fim à construção nos colonatos erguidos dos Territórios Palestinianos e honre os compromissos que assumiu. As duas partes,
afirmou Obama na Casa Branca, têm [05]obrigações face ao roteiro - o
plano internacional de 2003 para a resolução do conflito israelo-palestiniano.
Nestas inclui-se parar com a colonização. [04]Durante a discussão com o
novo primeiro-ministro israelita, Benjamin Netanyahu, a semana passada,
fui muito claro quanto à necessidade de travar a colonização, esclareceu
ainda Obama. Os palestinianos devem por seu turno fazer progressos na
[5] http://www.theaustralian.news.com.au/story/0,25197,25555182-5018557,00.html
[6] http://www.channelnewsasia.com/stories/afp_world/view/432503/1/.html
melhoria das suas forças de segurança e na redução do incitamento anti-Israel, defendeu. Sou um grande crente da solução de dois estados, disse
ainda Obama, afirmando-se confiante na possibilidade de progressos em
direcção à paz entre israelitas e palestinianos. Nas curtas declarações à imprensa que tiveram lugar depois do encontro de Washington, Mahmoud Abbas sublinhou, por seu turno, a urgência de tais progressos, declarando que
[03]o tempo [é] um factor essencial no processo. O apelo ao fim da colonização na Cisjordânia e em Jerusalém Oriental já tinha sido feito na véspera
pela secretária de Estado, Hillary Clinton: [06]Nenhuns colonatos, nenhumas excepções de crescimento natural. E já hoje, antes do encontro entre
Abbas e Obama, Israel reagira pela voz do porta-voz do Governo, que explicou que o futuro dos colonatos só será decidido através das negociações com
os palestinianos. [07]Entretanto, temos de permitir que a vida continue
normalmente nestas comunidades, disse Mark Regev. O que isso significa é
que mesmo que não sejam construídos novos colonatos, a expansão dos já
existentes poderá prosseguir.
[317]
based on just a conversation that we had last week, Mr Obama said. Because obviously Prime Minister Netanyahu has to work through these issues
in his own government, in his own coalition. The US president also called
on Mr Abbas to offer security improvements to Israel and to quell anti-Israel
incitement in Palestinian mosques and schools. Mr Abbas warned that all
parties should work to alleviate the plight of the Palestinians and move towards statehood. I would like to take this opportunity to affirm to you that
we are fully committed to all of our [05]obligations under the roadmap, from
the A to the Z, he said. Mr Abbas added that he had shared ideas with Mr
Obama based on the roadmap and the 2002 Saudi peace plan backed by the
Arab league. The US-backed roadmap calls for a halt to Jewish settlement
activity in Palestinian territories and an end to Palestinian attacks against
Israel but has made little progress since it was drafted in 2003. Ms Clinton
said Mr Obama wants to see a stop to settlements. [06]Not some settlements, not outposts, not natural growth exceptions. But Israel dismissed
the blunt US call. [07]Normal life will be allowed in settlements in the
occupied West Bank, government spokesman Mark Regev said, using a euphemism for continuing construction to accommodate population growth.
He added the fate of settlements will be determined in final status negotiations between Israel and the Palestinians and in the interim, normal life
must be allowed to continue in those communities. The Palestinian Authority has ruled out restarting peace talks with Israel unless it removes all
roadblocks and freezes settlement activity. Mr Netanyahu told Mr Obama
last week at their first White House meeting that he was willing to immediately relaunch the peace talks but failed to publicly back the creation of a
Palestinian state or to freeze settlement activity. The Israeli prime minister
told his cabinet at the weekend he did not intend to build new settlements
but that it makes no sense to ask us not to answer to the needs of natural
growth and to stop all construction, aides said. The Abbas meeting represented Mr Obama's latest attempt to revive the stalled Middle East peace
process, which have included talks with Jordan's King Abdullah II, Mr Netanyahu and in London with Saudi King Abdullah. Next week, Mr Obama
will meet the Saudi King in Riyadh and deliver a long-awaited address to the
Muslim world in Cairo. But he said he would not lay out his long-awaited
peace plan in the speech, which he said was designed to lay out a path for a
better US relationship with the Islamic world.
AFP
The shallow linguistic analysis above shows that some sentences containing
overlapping ideas consist of quotations, and hence tend to be appropriately used
in the text. As they quote someone else's direct speech, they are the type of facts
that cannot be subject to plagiarism. The analysis also reveals that the order of
the ideas differs in the two texts, so overlapping strings are used in different sections of the article. This might suggest that the text was produced independently.
Additionally, the Portuguese article was published on 28 May, whereas the articles
published in The Australian and broadcast by Channel News Asia were both published on 29 May. Although prior authorship is a strong indicator of originality,
this does not mean that the Portuguese article does not derive from the original
AFP newswire, especially considering that the two World section news articles
(which attribute authorship to an international news agency, AFP) greatly overlap. Although access to the original AFP newswire is restricted, comparison with
the two articles published on 29 May suggests that the Portuguese article also
derives, at least partly, from the same source. The comparison also shows
that many strings in the article that is supposed to have been produced independently overlap with strings in the text whose authorship is attributed to AFP.
Strikingly, the sentence “Ms Clinton said Mr Obama wants to see a stop to settlements. Not some settlements, not outposts, not natural growth exceptions” is
attributed to Hillary Clinton in the Portuguese text, but AFP describes it as Obama's
reported speech.
[5] why oddness matters
The results of the analysis provide evidence that news plagiarism exists and can
be detected, even in texts reporting facts. It is also forbidden and
seriously punished by news corporations. The cases discussed demonstrate that, although quality newspapers are more careful in citing their sources
(usually well-known international agencies), attribution is often incomplete, inadequate, or vague. In the cases presented in this paper, for instance, JN made
no attribution at all, Público attributed authorship to “Agencies” without naming
any agency in particular, and TVI lifted the original text entirely and passed it off
as its own. These commonly represent a violation of the established standards
and ethics policies, when regularly enforced. For instance, although Público has
a clear ethics policy and instructions on when and how to cite, it published an
article vaguely attributing authorship to “Agencies”. In this respect, news plagiarism is not much different from academic plagiarism, with the exception that
the latter is done by people training as writers, whereas the former is done by
professional writers.
The analysis of the texts also shows that (free) machine translation tools are
a good resource to test suspect cases of translingual plagiarism. In the case discussed, the machine translation of a non-suspect article enabled the selection of some sentences that were used to conduct an Internet search. After discarding the function words and focusing on the lexical items, two articles published by different news companies were found that were likely to derive from
the same source. Although it could be argued that the contrastive analysis of
the Portuguese (suspect) text against the text whose authorship is attributed to
AFP is not enough to sustain the claims of plagiarism, it clearly shows that the
Portuguese version has not been produced independently, despite the lack of a
one-to-one match between the Portuguese and the English versions. What this
suggests is that there is a high likelihood that the same piece of news includes
different releases from the foreign press and international websites.
[6] conclusion
The research presented in this article, despite being built upon a shallow linguistic analysis, supported the design of a new approach to translingual plagiarism
detection, whose potential was previously demonstrated (Sousa-Silva 2014). It
adds to an extensive body of research conducted over the last decades, which
demonstrates that forensic linguistics has investigative and evidential potential in cases of plagiarism, as well as in cases of copyright infringement. On the
investigative side, a forensic linguistic analysis has assisted in the development
of methods, tools and procedures to reveal and detect instances of plagiarism.
On the evidential side, this approach has long demonstrated and proved why a
certain instance of reused text is plagiarism, or conversely why a certain text is
falsely accused. The latter, in particular, is an area that requires a more in-depth
linguistic analysis, which is beyond the scope of this article.
The forensic nature of plagiarism has often been challenged, on the grounds
that most cases of plagiarism (such as academic) do not involve legal instances.
Indeed, academic plagiarism cases tend to be managed by the academy, as much
as news plagiarism cases tend to be addressed by the media corporations involved.
Therefore, they are usually, but not always, judged as a moral, more than a
legal, issue, and settled outside the courts of law. The involvement of the courts of
law in plagiarism cases (including academic ones) is not new, especially as a means of
rescinding degrees. Nevertheless, given that accusations of plagiarism can and do
have serious implications for the suspected plagiarist's life, proving or disproving an
instance of plagiarism can be unquestionably relevant, both within and outside
the courts of law.
The future for research into plagiarism is anything but dull, and clearly shows
a great opportunity for collaborative research involving forensic as well as computational linguists and engineers. Although strong methods of linguistic research into plagiarism have been developed, there is always room for improvement, not only by designing new analytic methods, but also by adapting existing ones (whose relevance has been demonstrated) to new challenges. Computational forensic linguistics is definitely an area from which plagiarism detection
can greatly benefit. Although those systems that use linguistic information are
good performers, simple string matching software often returns disappointing results. In this respect, Maia et al.'s (2008, pg. 83) argument for collaboration
between linguists and engineers remains as valid today as it was then: “[w]hat is
needed is good will and serious attempts by both sides to understand each other's
point of view. If this can be made to happen, everyone will benefit and the results
for research will be far greater than if they continue to work separately”. Like
Alice, one cannot but become “curiouser and curiouser”.
[7] acknowledgments
references
Anderson, Judy. 1998. Plagiarism, Copyright Violation and Other Thefts of Intellectual Property: An Annotated Bibliography with a Lengthy Introduction. McFarland
& Company, Inc.
Angélil-Carter, Shelley. 2000. Stolen language? Plagiarism in writing (Real Language
Series). Longman.
Caldas-Coulthard, Carmen Rosa. 2014. ReVEL na Escola: o que é a Linguística
Forense? ReVEL 12(23). 16.
Carroll, Jude. 2001. What kinds of solutions can we find for plagiarism?
http://www.gla.ac.uk/media/media_13513_en.pdf.
Carroll, Jude & John Appleton. 2001. Plagiarism: A Good Practice Guide. Oxford
Brookes University.
Coulthard, Malcolm & Alison Johnson. 2007. An Introduction to Forensic Linguistics:
Language in Evidence. Routledge.
Eiras, Henrique & Guilhermina Fortes. 2010. Dicionário de Direito Penal e Processo
Penal. Quid Juris.
Finnis, John. 1991. Intention and side-effects. In Raymond G. Frey & Christopher W. Morris (eds.), Liability and responsibility: Essays in law and morals, chap. 2,
32–64. Cambridge University Press.
Garner, Bryan A. 2009. Black's Law Dictionary. 9th edn. West.
Goldstein, Paul. 2003. Copyright's highway: from Gutenberg to the celestial jukebox.
Stanford University Press.
Howard, Rebecca. 1995. Plagiarisms, Authorships, and the Academic Death
Penalty. College English 57(7). 788–806.
Howard, Rebecca Moore & Amy E. Robillard. 2008. Pluralizing Plagiarism: Identities,
Contexts, Pedagogies. Boynton/Cook.
Jameson, Daphne A. 1993. The Ethics of Plagiarism: How Genre Affects Writers'
Use of Source Materials. Bulletin of the Association for Business Communication
56(2). 18.
Johnson, Alison. 1997. Textual kidnapping - a case of plagiarism among three
student texts? The International Journal of Speech, Language and the Law 4(2).
210–225.
Lindey, Alexander. 1952. Plagiarism and originality. Harper & Brothers.
Maia, Belinda, Rui Sousa Silva, Anabela Barreiro & Cecília Fróis. 2008. N-grams in
search of theories. In Barbara Lewandowska-Tomaszczyk (ed.), Corpus Linguistics, Computer Tools, and Applications - State of the Art (PALC 2007), vol. 17, Peter
Lang.
Pecorari, Diane. 2008. Academic Writing and Plagiarism: A Linguistic Analysis. Continuum.
Pereira, Alexandre Libório Dias. 2003. Problemas actuais da gestão do direito
de autor: gestão individual e gestão colectiva do direito de autor e dos direitos conexos na sociedade da informação. In Estudos em Homenagem ao Professor Doutor Jorge Ribeiro de Faria, 17–37. Faculdade de Direito da Universidade do
Porto.
Reuters. 2008. Reuters Handbook of Journalism.
http://handbook.reuters.com/index.php/Main_Page.
Robillard, Amy E. & Rebecca Moore Howard. 2008. Plagiarisms. In Rebecca Moore
Howard & Amy E. Robillard (eds.), Pluralizing plagiarism: Identities, contexts, pedagogies, 17. Boynton/Cook.
Roig, Miguel. 2001. Plagiarism and Paraphrasing Criteria of College and University
Professors. Ethics and Behavior 11(3). 307–323.
Scollon, Ron. 1994. As a matter of fact: The changing ideology of authorship and
responsibility in discourse. World Englishes 13(1). 33–46.
Scollon, Ron. 1995. Plagiarism and ideology: Identity in intercultural discourse. Language in Society 24. 1–28.
Sousa-Silva, R. 2014. Detecting translingual plagiarism and the backlash against translation plagiarists. Language and Law / Linguagem e Direito 1(1). 70–94.
Sousa-Silva, Rui. 2013. Detecting Plagiarism in the Forensic Linguistics Turn: School of Languages and Social Sciences, Aston University PhD dissertation.
Turell, M Teresa. 2004. Textual kidnapping revisited: the case of plagiarism in literary translation. The International Journal of Speech, Language and the Law 11(1). 1–26.
contacts
Rui Sousa-Silva
Centro de Linguística da Universidade do Porto
r.sousa-silva@lflab.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 323–336.
http://www.journals.uio.no/osla
ISSN 1890-9639 / ISBN 978-82-91398-12-9
abstract
Rhotics are probably the consonant class of Portuguese that has undergone the largest number of changes in the last century. The literature usually refers to the observations of Viana (1883, 1903) on the beginning of the gradual replacement of the alveolar trill by the uvular trill. In this article, we try to identify and date other, later changes that have altered the configuration and general organization of Portuguese rhotics: (i) in the subclass of trills, we refer to the introduction of fricative consonants (and, in the Brazilian varieties of the language, of glottal consonants as well) in the place of the uvular rhotic that began to enter Portuguese at the end of the 19th century; (ii) in the subclass of flaps, we refer to the emergence of retroflex variants, which have been acknowledged for Brazilian Portuguese for some decades now (mainly as a result of sociolinguistic variation) and which, in European Portuguese, seem to be starting to take hold in the speech of educated young speakers from some urban centres. These data find support in some recent studies and, as will be highlighted in this text, in the corpus of the Arquivo Dialetal do Centro de Linguística da Universidade do Porto.
João Veloso
[1] introduction
This study analyses the main changes that have taken place in the organization of the rhotic system of Portuguese over the last century, broadly speaking.
The first of these changes was the introduction of a uvular trill ([ʀ]), replacing the traditional Romance trill (alveolar [r]), which started towards the end of the 19th century, perhaps as a phonemic borrowing from French.
That is not the end of the story, as we shall see, and many subsequent changes have taken place since then. I will propose that the most recent of these changes is the emergence of a retroflex flap ([ɽ]), in short "Belinda's R", which is becoming more and more frequent in certain phonological contexts and under given sociolinguistic conditions, maybe as the result of another phonemic borrowing, now from English. Different varieties of Portuguese, with special emphasis on European and Brazilian Portuguese, will be taken into consideration.
I will divide my text into three parts: in section [2], I shall concentrate on the
(supposedly) first steps of the changes that will be considered here and try to formulate the main questions to be analysed; in section [3], a brief description of the
rhotics systems of European and Brazilian Portuguese will be given; section [4]
will focus on some ongoing changes that can be observed in Contemporary Portuguese. A section with some final remarks will end the chapter.
[2] the first major change: the emergence of "vicious" [ʀ]
[2.1] R U [r] or [ʀ]?
In 1883 and 1903, Gonçalves Viana, the father of modern Portuguese phonetics, wrote about the (then) recent introduction of a new rhotic in European Portuguese: the uvular trill [ʀ], which, according to him, was gradually replacing the original Romance [r], described by the author as "the most original, most genuine, still most expanded" in his century's language (Viana 1883, pg. 20; 1903, pg. 19).
In his colourful, suggestive language, Gonçalves Viana depicts what nowadays would be described, in sociolinguistic terms, as an ongoing change showing the main features of most sound changes in the world's languages: it had a sudden start among urban (supposedly educated) speakers, it had a sociolinguistic motivation (it is reasonable to assume that its introducers wanted to sound more cosmopolitan and more sophisticated¹), and, little by little, it spread to new communities of speakers:
[1]
Contrastingly, Barbosa (1983, pg. 193) denies that [ʀ] was a direct borrowing from French and that it corresponded to a prestigious articulation, on the basis of the following main arguments: (i) there is no evidence that the change originated in royal circles, in spite of frequent marriages between Portuguese princes and French princesses; (ii) the French adjective vicieux, used by Viana (1903), has a very negative meaning; and (iii) similar changes took place in other languages, suggesting that phonetic rather than sociolinguistic variables were the real triggers of the phenomenon.
Indeed, in the space of a few decades, uvular [ʀ] became the standard trill of European Portuguese. This is confirmed by observations found in the most authoritative grammatical and phonological descriptions of the language (Barbosa 1983²; 1994; Barroso 1999; Mateus & D'Andrade 2000; Mateus et al. 2003; Emiliano 2009), which give it as the unmarked Portuguese vibrante múltipla³, confining the alveolar trill [r] (i.e., the original trill of Portuguese, common to most Romance languages) to a minority of speakers (see, e.g., Mateus et al. 2003, pg. 1000).
So, it seems reasonably safe to assume that, from a purely phonological, descriptive point of view, /ʀ/ might be considered the most recent (the "youngest") phonemic segment of European Portuguese⁴. Its admission to the phonological system of the language was relatively fast and, to some extent at least, socially motivated, as suggested above.
[2.2]
As hinted at above, the birth of /ʀ/ in European Portuguese, as witnessed by Viana (1883, 1903), is not the last step of the recent historical changes involving Portuguese rhotics. In this section, it is my aim to highlight some further developments that have taken place within the subsystem of Portuguese rhotics⁵ since such early observations. In part, these developments could even lead us to question the appropriateness of continuing to regard rhotics as a true natural class in Portuguese, although such a discussion will not be developed in this paper.
In the following sections of this chapter, I will focus on two different, though inter-related, issues concerning the rhotics of Portuguese. I will try to show that the changes referred to by Viana (1883, 1903) or by Barbosa (1983) are just part of a story involving important changes that have altered not only the phonetic nature of European Portuguese trills, but have also affected the other subclass of Portuguese rhotics (flaps), both in the European varieties of Portuguese and in other, non-European dialects of the language. That is to say, the changes that Viana (1883, 1903) identified with respect to the emergence of a uvular trill [ʀ] gradually replacing the alveolar [r] should most likely be seen just as the first step of a major historical change altering the whole system of rhotics in this language. Some of its effects are still taking place in Contemporary Portuguese. In the development of these observations, I shall concentrate on two main specific questions:
(i) what has happened to Portuguese trills since Viana's (1903) "vicious" [ʀ]?
(ii) what is happening, in the current stage of the language, within the specific subset of Portuguese flaps?
In this analysis, data from both European and Brazilian Portuguese (EP and BP, respectively) will be taken into consideration; a brief mention will also be made of another variety of Portuguese, spoken on the Atlantic island of São Tomé.
[3] the organisation of rhotics in modern portuguese
In this section, I shall start by giving a general overview of how rhotics are organized within the consonant system of Portuguese, not paying special attention to
[5]
For the sake of simplicity and terminological ease, "rhotics" is used throughout this chapter as a phonetically/phonologically motivated class of sounds and as an appropriate label to name them. Nevertheless, it is borne in mind that it is extremely difficult to identify a set of stable characteristics that keep such sounds objectively apart as a specific phonetic/phonological class. The following words by Ladefoged & Maddieson (1996) illustrate this issue very clearly; note that the authors point out, as the singularity which most probably is the main privative feature shared by all members of this class, the (extralinguistic, accidental) fact that rhotic sounds are written with Roman "r" or Greek "ρ", and practically nothing else: "This chapter describes the class of sounds that are sometimes labeled rhotics, or more informally, r-sounds. Most of the traditional classes referred to in phonetic theory are defined by an articulatory or auditory property of the sounds, but the terms rhotic and r-sound are largely based on the fact that these sounds tend to be written with a particular character in orthographic systems derived from the Greco-Roman tradition, namely the letter 'r' or its Greek counterpart rho. The International Phonetic Alphabet provides a wide selection of symbols based on plain, rotated, turned or otherwise modified lower-case and capital versions of the letter 'r', including r, ɾ, ɹ, ɻ, ɽ, ʀ, ʁ, [. . .]" (Ladefoged & Maddieson 1996, pg. 215). For additional information regarding the discussion about the motivation of rhotics as a natural class, see the arguments by Ladefoged & Maddieson (1996) referred to in footnote 6.
the historical and variationist data that form the core of this study.
Supposedly, rhotics form a special class of consonants, belonging to the subset of sonorants in Portuguese. From a phonetic point of view, they are usually voiced and formed by a brief contact (or a short series of brief contacts) between two articulators within the oral cavity⁶. This brief contact is not enough to cause real obstruction of the airflow, though, and as such it does not give rise to any inharmonic noise component. In fact, from a phonetic point of view, these consonants show high levels of harmonic energy and spectrographic patterns which make them very similar to vowels and glides (Lindau 1985, pg. 160 ff.; Ladefoged & Maddieson 1996, pg. 215 ff.). In close relation to this, they have high degrees of inherent sonority, which, in turn, makes them prone, in most languages, to occur in syllable codas and, in a significant number of languages, too, as syllabic nuclei. In a rather simplified SPE fashion, they are [+cons], [+son] (being distinguished from other sonorants, in the standard model of generative phonology, by the negative marks [-nas], [-lat]). In languages like Portuguese, they correspond to [-syll], whereas, in languages like Czech, Sanskrit and others (perhaps English), they can receive the mark [+syll].
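The feature specification just given can be made concrete with a small sketch. The Python fragment below is only an illustration: the segment inventory, the helper names (`SEGMENTS`, `RHOTIC_CLASS`, `matches`, `members`) and the feature values assigned to the non-rhotic segments are textbook-style assumptions introduced here, not data or notation taken from this chapter. It encodes segments as SPE-style feature bundles and checks which of them satisfy [+cons, +son, -nas, -lat].

```python
from typing import Dict, List

Features = Dict[str, bool]

# Hypothetical mini-inventory with simplified SPE-style values (assumed for
# illustration only; True stands for "+", False for "-").
SEGMENTS: Dict[str, Features] = {
    "r": {"cons": True, "son": True, "nas": False, "lat": False, "syll": False},   # alveolar trill
    "ɾ": {"cons": True, "son": True, "nas": False, "lat": False, "syll": False},   # alveolar flap
    "l": {"cons": True, "son": True, "nas": False, "lat": True, "syll": False},    # lateral
    "n": {"cons": True, "son": True, "nas": True, "lat": False, "syll": False},    # nasal
    "χ": {"cons": True, "son": False, "nas": False, "lat": False, "syll": False},  # uvular fricative
}

# The class description from the text: [+cons], [+son], [-nas], [-lat].
RHOTIC_CLASS: Features = {"cons": True, "son": True, "nas": False, "lat": False}


def matches(segment: Features, natural_class: Features) -> bool:
    """Return True if the segment carries every feature value the class requires."""
    return all(segment.get(name) == value for name, value in natural_class.items())


def members(natural_class: Features) -> List[str]:
    """List the segments of the mini-inventory that belong to the class."""
    return [symbol for symbol, feats in SEGMENTS.items() if matches(feats, natural_class)]


print(members(RHOTIC_CLASS))  # → ['r', 'ɾ']
```

Under these assumed values, a fricative such as [χ] falls outside the class because it is [-son], which is the kind of mismatch between phonetic realization and phonological classification that the later sections of the chapter discuss.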
A common distinction that is found in many languages (or, at least, in the description of many languages) keeps rhotics formed by one single contact of two oral articulators (= flaps or taps) apart from those where a series of rapid contacts of this kind takes place within a very short time window (= trills).
In Modern European Portuguese (henceforth: MEP), it is traditionally assumed that rhotics contrast at the surface level⁷: one flap, allegedly invariant and common to all speakers, phonetically realized as coronal [ɾ], vs. one trill. This contrast occurs word-medially, in pairs such as the ones found in example (1); the question most often raised has to do with the trill's phonetic realization. As said before, according to the literature, in MEP the standard trill is the voiced uvular [ʀ] (that is to say, Gonçalves Viana's prophecy has been fulfilled!), whilst alveolar [r] still survives in a minority of speakers (Barbosa 1983, 1994; Barroso 1999; Mateus & D'Andrade 2000; Mateus et al. 2003; Emiliano 2009). This is the main reason why I chose [ʀ], instead of [r], to transcribe all the trills in example (1).
[6]
"The most prototypical members of the class of rhotics are trills made with the tip or blade of the tongue (IPA r). These central members of the class show phonological relationships to the heterogeneous set of taps, fricatives and approximants which form the remainder of the class. In addition to tongue tip and blade articulations, trills and other continuants made at the uvular place are also classed as rhotics. [. . .] It is not therefore the manner of articulation that defines this group of sounds. Neither is there a particular place involved, as both Coronal and Dorsal articulations are included. Consequently an issue for phoneticians is whether the class membership is based only on synchronic and diachronic relationships between the members of the class, or whether there is indeed a phonetic similarity between all rhotics that has hitherto been missed. [. . .]" (Ladefoged & Maddieson 1996, pgs. 215–216; my italics).
[7]
As for the arguable phonological status of these surface contrasts, see again footnote 4.
(1)
[4] ongoing changes and variation in portuguese rhotics
After the general survey given in the previous section, with the essentials about rhotics as a specific class of sounds in Portuguese and other languages, I will return to the specific topic of this paper and to the data mentioned in the introduction: the ongoing changes that have been affecting Portuguese rhotics for several decades.
In this section, as previously announced, my observations will be split into two main directions: trills (again...) and taps.
[4.1] An even more "vicious" trill: in Portuguese, sonorant rhotics are becoming (phonetically) non-sonorants (fricatives and glottals)!
I began this chapter by recalling how critical Viana (1883, 1903) sounded about the change of [r] into [ʀ], which seemed to be completely accomplished within a few decades, as outlined above.
In this section, I shall draw attention to a further development of this phonetic change. What is particularly interesting to notice, nowadays, is that the innovative [ʀ] seems to be undergoing a subsequent, more drastic change in Portuguese. A growing number of speakers are replacing [ʀ] by a fricative (that is to say, by an obstruent, typically behaving not as a sonorant but more similarly to, say, a stop or an affricate, acoustically speaking), within a range of choice which includes, in EP, velars (unvoiced [x] or voiced [ɣ]) and uvulars (unvoiced [χ] and voiced [ʁ]).
Even though these realisations are not yet fully recognized as phonemes, or at least as the most common or standard allophones of the Portuguese vibrante múltipla, several phonological descriptions of EP explicitly admit their occurrence and their frequency. Barbosa (1994, pg. 107) identifies Barbosa's (1983) work as the first to have ever noticed the emergence of a phonetic fricative in the place of a phonological vibrante. Barbosa's (1983) exact words are as follows:⁸
[8]
Following a non-IPA convention which used to be very common among Portuguese linguists just a few decades ago, Barbosa (1983) transcribes the uvular trill as /ρ/ (after the Greek letter ρ, rho), instead of /ʀ/.
Frequency scale of the phonetic realizations of phonological trills of Modern European Portuguese in the corpus of the Arquivo Dialetal do Centro de Linguística da Universidade do Porto (ap. Rennicke & Martins 2013):
[ʁ] (76%) > [χ] (24%) > [x] (16%) > [r] (11%) > [ʀ] (11%)
Very interestingly, all these data show:
(i) That the "vicious" [ʀ] that Viana (1903) identified as the fastest-spreading variant in 19th-century Portuguese is, in the current stage of the language, the least represented allophone of the phonological multiple trill, with the same percentage of occurrence as its direct competitor in Viana's (1883; 1903) writings (the original Romance alveolar trill [r], which has not completely disappeared from spoken Portuguese);
(ii) That fricatives seem to be, at the current stage of EP, the most representative realizations of Portuguese rhotics: according to these data, [ʁ] is by far the most frequent of the trill allophones. This corroborates the previously mentioned impressionistic observations of Barbosa (1983, 1994), Barroso (1999) and Mateus & D'Andrade (2000);
(iii) That BP has gone one step further in this change, replacing phonetic trills rather uniformly by fricatives (as in EP) and by glottals as well (Silva 2002).
So far, on the basis of all the data that were taken into consideration here, we
could trace a rough chronology and genealogy of Portuguese trills (3).
(3)
Portuguese trills (EP and BP) since the early observations by Gonçalves Viana (Viana 1883, 1903):
Pre- and early 19th century: alveolar trill /r/
19th–20th century: uvular trill /ʀ/
EP: fricatives [ʁ] > [χ] > [x]
BP: [χ, ɣ, h, ɦ] (Silva 2002)
NB: /r/ and /ʀ/ have not disappeared completely from Modern EP or Modern BP (see information in the text itself). In the table, only the innovative allophones are considered on the timeline, according to the supposed date of their emergence in the language.
The main conclusion to be drawn from these data and arguments is that the story (and the history) of Portuguese trills does not end with Viana's (1903) observations; from that moment onwards, other changes have altered the inventory and the relations between phonemic segments and their allophonic realizations within this class. The most drastic of the recent changes affecting this phonetic subclass has been the emergence of fricatives (and, in BP, of glottals, too) as phonetic counterparts of phonemic segments generally assumed to be sonorant rhotics, in a way that can be found, quite strikingly, in other languages as well (as seems to be the case in Italian, according to Ladefoged & Maddieson (1996, pg. 219)).
[4.2] Trills are not the end of the story, yet. Retroflex flaps, or "Belinda's R"
I shall now focus on another side of the story of rhotic change in Portuguese: the emergence of a retroflex flap ([ɽ]), occurring in the place of the alveolar flap (supposedly invariant across all the speakers of EP (= [ɾ]), according to the literature).
To my knowledge, only a few previous studies refer to the existence of this new flap in EP, in addition to the phonetic transcriptions of the Arquivo's materials (under the responsibility of Pedro Tiago Martins)¹¹, which identify and transcribe a large number of realizations of /ɾ/ as an approximant retroflex ([ɻ]). Rennicke & Martins (2013, pg. 520), based on their analysis of the same corpus, are certainly among the first to acknowledge such a phonetic realization in EP. None of the aforementioned authoritative descriptions of EP phonology (see, for instance, Barbosa (1983, 1994), Barroso (1999), Mateus & D'Andrade (2000), Mateus et al. (2003)) even acknowledges the existence of this consonant in EP.
The lack of reference to a retroflex flap in such phonological descriptions of EP contrasts with the work of Rennicke & Martins (2013) and with a careful analysis of the materials made available by the Arquivo; it also contrasts with my own strong linguistic intuitions. As a native speaker of Portuguese in daily contact with the Northern varieties of the language, mainly with the varieties spoken in Oporto by young, educated speakers, and as an attentive linguist particularly keen on variation phenomena, my impression is that a retroflex [ɽ] (maybe [ɻ]) is becoming more and more common among these groups of speakers in the city of Oporto. It seems to be more frequent among young, educated female speakers than among males. Its rough distributional pattern seems to be the following: the retroflex flap occurs mainly in syllable codas (very seldom in onsets), most often in stressed word-final position (examples: professor 'professor' [pɾufˈsoɽ]; fazer 'to do' [fɐˈzeɽ]; amor 'love' [ɐˈmoɽ]).
[11]
The phonetic transcriptions found on the Arquivo's website (http://cl.up.pt/arquivo) were subject to double-checking verification and validation, according to the Inter-Judge Agreement methodology described by Martins & Veloso (2012).
If this intuition proves correct, as the Arquivo's materials and the study by Rennicke & Martins (2013) suggest, we could be witnessing a phenomenon quite similar to the one Viana (1883, 1903) described regarding the emergence of [ʀ] about one hundred years ago. Some parallels between the two changes should be highlighted here:
(i) both may have started as urban innovations;
(ii) most likely, both result from phonemic borrowing: [ʀ] could have been borrowed from French, the dominant foreign language among educated Portuguese in the 19th century (even though Barbosa (1983, pg. 193), as seen above, disagrees with this interpretation); [ɽ] could probably be the result of a borrowing from English, the main foreign language among educated Portuguese youngsters.
Actually, [ɽ] and [ɻ] are also the most frequent realizations of /ɾ/ by foreign learners and speakers of Portuguese who have English as their mother tongue.
As for retroflex flaps in BP, they behave differently from EP retroflex flaps. First of all, and contrary to what happens among Portuguese authors, many phonological descriptions of BP explicitly refer to a retroflex variant of flaps (see, for instance, and among many others: Netto 2001, pg. 99–100; Silva 2002, pg. 34, 49; Rennicke 2011). The main reason for this probably resides in a series of interrelated facts:
(i) retroflex realizations of flaps in BP are much more widespread than in EP, and occur in a larger number of prosodic contexts (stressed and unstressed, final and non-final syllables; filling either syllable onsets or codas). This contributes to making this realization more salient from a perceptual point of view;
(ii) in addition to the spread of retroflexion, retroflex flaps have for a long time been socially identified, and often stigmatized, as belonging to a specific speech style generally associated with non-urban, low-educated speakers; this realization even has a specific current designation: "R caipira" (= "caipira R", caipira meaning, in a slightly judgmental way, an inhabitant of the most remote rural areas of the country, typically characterized by low levels of education¹²).
As for this particular topic, we can conclude that, whereas retroflex [ɽ] is emerging in EP, even if completely ignored by the most prominent phonologists of this variety of the language, it has been a common phonetic realization in BP for some time, as recognized by the phonological descriptions of this variety¹³.
[12]
Nevertheless, the current geographic and social distribution of retroflex flaps in BP is much more widespread; they are very often heard in urban contexts and produced by highly educated speakers of the language (see, e.g., Rennicke 2011).
[5] final remarks
To conclude, we could say that Portuguese rhotics are perhaps the consonants which have been undergoing the most stunning phonetic and phonological changes over the last decades. Viana's (1883; 1903) and Barbosa's (1983) remarks about the emergence and stabilization of a uvular trill [ʀ] following the historical [r] have to be viewed as the first steps in a process which is not yet complete.
Linguistic, social and geographical factors seem to interact in the sound changes and substitutions that have been taking place for more than one century. At the present moment, no one can be entirely sure how the story of Portuguese "R" will really end, and phonologists should pay special attention to a theoretical issue that will arise from the following steps of the process: given the desonorantization of trills (mostly realized as [-son] fricatives, in EP and BP, and also as glottals, in BP), and bearing in mind that they are acquired differently from flaps in some prosodic contexts (Almeida 2011; Amorim 2014), will it make sense to insist on postulating a class of rhotics in Portuguese? This is a question that is left for future research.
To sum up, I include a final table putting together all the attested changes affecting all rhotics of Portuguese (trills and flaps) in the two main varieties of Portuguese (EP and BP). In a way, this table completes the one given in (3), which included trills only.
acknowledgments
I thank Diana Santos for the invitation and encouragement to publish in this volume. Thanks are also due to Pedro Tiago Martins, who read and commented on an early draft of this text and corrected some parts of it. Part of this research was funded by Portugal's Fundação para a Ciência e a Tecnologia, through CLUP, the Centre of Linguistics of the University of Porto (Strategic Project PEst-OE/LIN/UI0022/2014).
[13]
Quite interestingly, some varieties of Portuguese show opposite tendencies, towards a fortition of flaps, which become (uvular) trills. This is the case of some varieties spoken around the Portuguese city of Setúbal (Southern dialects) and of São Tomé Portuguese (STP), where flaps do not exist at all. In the segmental positions where in other varieties a flap is expected, speakers articulate a uvular [ʀ] (examples: laranja 'orange', EP Standard [lɐˈɾɐ̃ʒɐ], STP [lɐˈʀɐ̃ʒɐ]; prato 'dish', EP Standard [ˈpɾatu], STP [ˈpʀatu]).
(4)
Change of Portuguese rhotics (EP and BP) since the early observations by Viana (1883, 1903).
Trills
Pre- and early 19th century: alveolar trill /r/
19th–20th century: uvular trill /ʀ/
EP: fricatives [ʁ] > [χ] > [x]
BP: [χ, ɣ, h, ɦ] (Silva 2002)
Flaps
19th and early 20th century: alveolar flap [ɾ]
From mid 20th century:
EP: [ɾ]; emergence of [ɽ] in certain contexts
BP: [ɾ]
NB:
(i) [ɾ], [r] and [ʀ] have not disappeared completely from Modern EP or Modern BP (see information in the text itself). In the table, only the innovative allophones are considered on the timeline, according to the supposed date of their emergence in the language.
(ii) No specific assumption is made about the exact date of emergence of [ɽ] in EP or BP. It is hypothesized that it emerged, in EP, sometime in the 20th century, given the lack of explicit references to this realization, especially in studies regarding this variety of the language.
(iii) In BP, according to many sources, both trills and flaps can be completely deleted (/ʀ/, /ɾ/ > ∅) in some speech styles and under some prosodic conditions. Such deletion is also possible, less frequently and affecting only /ɾ/, in EP (e.g., in a final stressed syllable before a word with an initial consonant: falar baixo 'to keep his/her own voice down' [fɐˈla(ɾ)bajʃu]).
references
Almeida, Letícia. 2011. Acquisition de la structure syllabique en contexte de bilinguisme simultané portugais-français: University of Lisbon PhD dissertation.
Amorim, Clara. 2014. Padrão de aquisição de contrastes do PE: a interação entre traços, segmentos e sílabas: University of Porto PhD dissertation.
Barbosa, Jorge Morais. 1983. Études de Phonologie Portugaise. Universidade de Évora 2nd edn.
Barbosa, Jorge Morais. 1994. Introdução ao Estudo da Fonologia e Morfologia do Português. Almedina.
Barroso, Henrique. 1999. Forma e Substância da Expressão da Língua Portuguesa. Almedina.
Bonet, E. & J. Mascaró. 1997. On the representation of contrasting rhotics. In F. Martínez-Gil & A. Morales-Front (eds.), Issues in the Phonology and Morphology of the Major Iberian Languages, 103–126. Georgetown University Press.
Câmara, Joaquim Mattoso. 1977. Para o Estudo da Fonêmica Portuguesa. Padrão.
Emiliano, António. 2009. Fonética do Português Europeu. Descrição e Transcrição. Guimarães.
Ladefoged, Peter & Ian Maddieson. 1996. The Sounds of the World's Languages. Oxford.
Lindau, Mona. 1985. The story of /r/. In Victoria A. Fromkin (ed.), Phonetic Linguistics: Essays in honor of Peter Ladefoged, Academic Press.
Martins, Pedro Tiago & João Veloso. 2012. Inter-Judge Agreement in Transcribing Dialectal Data: A Study of a Corpus of Dialectal Portuguese.
Mateus, Maria Helena & Ernesto D'Andrade. 2000. The Phonology of Portuguese. Oxford University Press.
Mateus, Maria Helena Mira, Ana Maria Brito, Inês Duarte, Isabel Hub Faria, Sónia Frota, Gabriela Matos, Fátima Oliveira, Marina Vigário & Alina Villalva. 2003. Gramática da Língua Portuguesa. Caminho 5th edn.
Netto, Waldemar Ferreira. 2001. Introdução à Fonologia da Língua Portuguesa. Hedra.
Rennicke, Iiris. 2011. The retroflex r of Brazilian Portuguese: theories of origin and a case study of language attitudes in Minas Gerais. Linguística. Revista de Estudos Linguísticos da Universidade do Porto 6(1). 149–170.
Rennicke, Iiris & Pedro Tiago Martins. 2013. As realizações fonéticas de /R/ em português europeu: análise de um corpus dialetal e implicações no sistema fonológico. In F. Silva, I. Falé & I. Pereira (eds.), Textos Selecionados do XXVIII Encontro Nacional da Associação Portuguesa de Linguística, 509–523. Coimbra: Associação Portuguesa de Linguística.
Silva, Thaïs Cristófaro. 2002. Fonética e Fonologia do Português. Roteiro de Estudos e Guia de Exercícios. Contexto 6th edn.
Veloso, João & Pedro Tiago Martins. 2013. O Arquivo Dialetal do CLUP: disponibilização on-line de um corpus dialetal do português. In F. Silva, I. Falé & I. Pereira (eds.), Textos Selecionados do XXVIII Encontro Nacional da Associação Portuguesa de Linguística, 673–692. Associação Portuguesa de Linguística.
Viana, Aniceto dos Reis Gonçalves. 1883. Essai de phonétique et de phonologie de la langue portugaise d'après le dialecte actuel de Lisbonne. Romania 12. 29–98.
Viana, Aniceto dos Reis Gonçalves. 1903. Portugais. Phonétique et phonologie. Morphologie. Textes. Teubner.
contacts
João Veloso
Faculdade de Letras, Universidade do Porto
jveloso@letras.up.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 337–357. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
The article returns to a much-discussed topic in the syntactic literature: the question of whether European Portuguese has dative alternation. It will be proposed that in this language there are two base-generated syntactic structures for ditransitive constructions and that European Portuguese therefore has dative alternation, but in a sense very different from that of English and other Germanic languages. It will be proposed that the applicative node is not justified in this language and that the preposition a is, in both constructions, the same type of preposition, essentially a dative case marker. The reasons for the proposal are certain facts concerning word order, fronting, binding and scope.
[1] introduction
Several Germanic languages have dative alternation, because they exhibit two
synonymous constructions: a prepositional construction with to (1-a) in the order
Direct Object (DO) + Indirect Object (IO) and the Double Object Construction (DOC)
(1-b), characterized by the existence of two NPs with certain order restrictions:
only the pattern V + goal / beneficiary + theme is accepted:
(1)
a.
b.
As Romance languages have special prepositions for the expression of the dative (a, à), it is classically assumed (see, among others, Kayne (1984)) that these languages have no DOC. Of course, there are many languages without prepositions. Among them, Bantu languages deserve special attention, because they have applicative constructions, where verbs may add or apply a new argument to the verb root with the help of a special infix, an applicative morpheme. Connected to this view is the idea, shared by several linguists, that there are no true ditransitive verbs, but only verbs that select an internal argument and that may add a new participant, the so-called IO. These reasons motivated the proposal, made in different ways by Baker (1988), Marantz (1993) and Pylkkänen (2002), of a related analysis of the DOC and of applicative constructions.
(2) V NP a NP (V DO IO)
A Maria deu um livro ao João.
the Mary gave a book to.the John
'Mary gave a book to John.'
[2] About argument and non-argument datives in European Portuguese see, among others, Vilela (1992), Brito (2009), Miguel et al. (2011), Gonçalves & Raposo (2013), especially pp. 1173–1181.
[3] I will use the following category symbols: NP (Noun Phrase), VP (Verb Phrase), PP (Prepositional Phrase), ApplP (Applicative Phrase).
(3) V a NP NP (V IO DO)
A Maria deu ao João um livro.
the Mary gave to.the John a book
'Mary gave John a book.'
(4)
In EP clitic doubling is possible with a personal pronoun, mainly in an oral register; see (6) and (7) versus (5):
dative clitic doubling
(5)
(6)
(7)
Many authors who have analysed the IO in EP and other Romance languages have noticed the special status of the IO: it behaves as an NP (marked by dative case) for the purposes of binding theory [4] and it behaves as a PP for the purposes of predication [5], where the presence of the preposition a is mandatory (see, for Portuguese, Duarte (1987), Duarte (2003), Gonçalves (1990, 2002, 2004), Torres Morais (2006), Torres Morais & Lima-Salles (2010)).
Another important aspect of ditransitive constructions is word order.
In the two sentences (2) and (3), what differs is the word order and the information structure, V DO IO being the unmarked order and V IO DO the marked one. The proposal that the unmarked order in EP is V DO IO is supported by several facts (cf. Costa (2009)): only (2), not (3), would be an adequate (redundant) answer to a wh-question like (8):
[4] As Gonçalves (2002, pg. 336) writes, the preposition a is a case marker of the sole argument, the IO, with verbs like telefonar ('to phone'), and a case assigner of an extra NP with pedir, dar ('to ask', 'to give') as the main Vs.
[5] Cf. Masullo (1992) for Spanish.
(3), with the order V IO DO, has a contrastive focus reading and is an adequate word order in a context like the one described in (9); therefore, scrambling [6] of the IO over the DO seems justified (see, for Spanish, Demonte (1995)). [7]
(9)
O que aconteceu?
What happened?
It has been noticed (see, for instance, Duarte (2003, pg. 287, 290)) that, when the
DO is a clause or a complex NP, as in (11-a), the order is typically V IO DO and not
V DO IO, as in (11-b), which is marginal:
(11)
a.
b.
Even if we have a question with focus on the IO, as in (12), it is the order V IO DO
that we expect, as in (11-a), and not the order V DO IO, as in (11-b), despite the
fact that the IO is the information focus:
(12)
We may conclude that the order V IO DO is possible when one of the following
factors is present: the IO is a contrastive focus; the DO is a complex, heavy constituent.
[6] The notion of scrambling is due to Ross (1967) and refers to the movement operation responsible for changing the basic word order of a language for pragmatic and discourse reasons.
[7] An alternative to scrambling could be the proposal, inspired by Belletti (2004), according to which the left periphery of the verbal domain (vP) has room for discourse-related functional categories, like TopP and FocP.
(15)
What all these examples show is that two word order patterns are possible in Portuguese ditransitive constructions. It is true that in idioms and in some constructions with dar 'to give' as a light verb it is impossible to separate the V and the DO, as in dar uma lição a alguém 'to teach a lesson to someone' ((16-a) and (16-b)), showing that the link between the V and the DO cannot be broken; if this word order is changed, the literal meaning of 'to teach a lesson' is expressed (16-b): [9]
(16)
a.
syntax
[3.1]
The structure of ditransitive constructions has been the subject of much discussion. In the early days of Generative Syntax, the structure (17) was proposed as a way to describe the selection of two internal arguments by ditransitive verbs; but (17) respects neither binary branching nor X-bar theory. [10]
(17) [VP V DO IO]
Moreover, (17), where the DO and the IO occupy parallel positions, does not account for some data related to fronting, binding and scope; (18) [11] and (19) [12] were then proposed:
[10] The structures proposed in this paper will be very simplified; we will use syntactic functions in the representations as a way to describe the theme NP (the Direct Object, DO) and the beneficiary / goal / origin NP / PP (the Indirect Object, IO).
[11] (18) was used by Xavier (1989) for Portuguese. For English, (18) was proposed because of fronting and ellipsis, where the V forms a constituent with the DO, as in (i), although other fronting data are possible (see (ii), (iii) and (iv)): (i) and [give candy] he did to children on his birthday; (ii) John intended to give candy to children on his birthday; (iii) and [give candy to children on his birthday] he did; (iv) and [give candy to children] he did on his birthday (cf. Phillips (2003), Costa (2009, pgs. 87–88)).
[12] (19) was proposed for English because of the superiority of the DO over the IO in sentences like (i) John gave nothing to any of the children on his birthday; in contrast with (ii) *John gave anything to none of the children on his birthday (cf. Phillips (2003), Costa (2009, pgs. 87–88)).
(18) [VP [V′ V DO] IO]
(19) [VP DO [V′ V IO]]
However, (18) and (19) are not sufficient, because the existence of the DOC in many languages and phenomena related to the binding of pronouns and scope in certain occurrences favour a structure where the IO is higher than the DO, as in (20): [13]
(20) [VP [V′ V [VP IO [V′ V DO]]]]
Supposing then that (19) and (20) are adequate, the immediate question is whether (19) and (20) are base-generated structures or whether they are derivationally related.
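The structural asymmetry at stake can be made concrete with a small sketch. The Python snippet below is only an illustration and not part of the analysis: it encodes two simplified binary-branching trees as nested tuples (an assumption about the relevant parts of a DO-high and an IO-high structure, with shell layers omitted) and checks which object c-commands the other. The names `DO_HIGH`, `IO_HIGH`, `dominates` and `c_commands` are hypothetical.

```python
# Trees are nested tuples: (label, child, child, ...); leaves are plain strings.
# Simplified assumptions: one structure with the DO higher, one with the IO higher.
DO_HIGH = ("VP", "DO", ("V'", "V", "IO"))
IO_HIGH = ("VP", "IO", ("V'", "V", "DO"))


def dominates(node, target):
    """True if `target` is `node` itself or occurs anywhere inside it."""
    if node == target:
        return True
    if isinstance(node, tuple):
        # node[0] is the category label, so only the children are searched.
        return any(dominates(child, target) for child in node[1:])
    return False


def c_commands(tree, a, b):
    """True if leaf `a` c-commands leaf `b`, i.e. a sister of `a` dominates `b`."""
    if not isinstance(tree, tuple):
        return False
    children = tree[1:]
    if a in children:
        sisters = [child for child in children if child != a]
        if any(dominates(sister, b) for sister in sisters):
            return True
    # Otherwise keep looking for `a` further down the tree.
    return any(c_commands(child, a, b) for child in children)
```

Under these assumptions, `c_commands(DO_HIGH, "DO", "IO")` holds while `c_commands(DO_HIGH, "IO", "DO")` does not, and the reverse pattern is found with `IO_HIGH`, which is the kind of asymmetry that binding and scope data are sensitive to.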
[13] Cf. Barss & Lasnik (1986) and Larson (1988, pgs. 336–8), for English; see section 5 for Portuguese.
[14] For an overview of different approaches see, among others, Ormazabal & Romero (2010) and Oyharçabal (2010).
a.
b.
More recently, Rappaport Hovav & Levin (2008) and Ormazabal & Romero (2010) have shown that the dative alternation in English is not necessarily associated with differences in the meaning of the two variants; in particular, the differences found above are mainly due to differences in the lexical meaning of verbs: verbs like to give only have a caused possession meaning, while verbs like to send have both a caused motion and a caused possession meaning, which means that to send has a path dimension that is absent in to give.
Meanwhile, other proposals have been suggested.
One of the most important is the neo-constructionist approach, where Syntax determines what is considered the argument structure of a lexical predicate. The neo-constructionist approach generally proposes two different structures for the DOC and for the prepositional construction, based on the idea that the two variants have different meanings, as mentioned above (Marantz 1993; Pesetsky 1995; Harley 2002; Anagnostopoulou 2003; Pylkkänen 2002; Cuervo 2003, 2010, among others).
(iii) There are also hybrid treatments like the one proposed by Ormazabal
& Romero (2010), where the framework based on event structure by Ramchand
(2008) is combined with a derivational analysis.
verbs that select a true second argument, the indirect object, the so-called ditransitive verbs, like dar 'to give', prometer 'to promise', and that there are some non-argument datives. Other approaches assume that the IO is always an applied, extra or incorporated argument and that there are no ditransitive verbs (Marantz 1993, Cuervo 2003, 2010, among others).
Developing the idea that datives are not internal arguments of the verb, Marantz
(1993, pg. 116) explicitly calls the DOC in English an applicative construction,
which means that the dative is some sort of extra argument that is applied to / incorporated into a verbal predicate. He proposes a structure where the applicative
head is the light v, which takes an event as its argument, licensing the IO as its
specifier and taking it as a participant in the event (23):
(23) [VP NP (affected object, e.g., benefactive) [V′ V (Appl) VP]]
Developing Marantz's idea, Pylkkänen (2002) proposes that English and Bantu languages are similar in the sense that the DOC is a type of applicative construction; but they are different in the sense that they project an Appl head in different positions. Bantu languages allow unergative verbs (like to run) or transitive verbs (like to give) to appear in an applicative construction, with a beneficiary / maleficiary argument, and for this reason have high applicatives; in English, on the contrary, in order to have a DOC, it is necessary that the applied argument has some semantic relation with the verb (to give, to bake), so the applicative node is a low projection. [15]
At first sight, this sort of analysis would be rejected for Romance languages
because they have no DOC, they have a special preposition to express the dative
case and they have dative personal pronouns. However, Romance languages have
been described by several authors as languages with dative alternation, with a
construction similar to the DOC and with an applicative head. It is the case of
Cuervo (2003, 2010) for Spanish, Torres Morais (2006) and Torres Morais & Lima-Salles (2010) for EP and Diaconescu & Rivero (2005) for Romanian.
Clearly influenced by Demonte (1995) and Cuervo (2003) for Spanish and interested in the differences between Brazilian Portuguese (BP) [16] and EP, Torres Morais (2006) and Torres Morais & Lima-Salles (2010) proposed an analysis according to which EP has dative alternation, with two base-generated constructions: a construction where a dative NP argument is projected in the specifier position of a low applicative head, as in (24), and another configuration, where there is a true preposition a, similar to para, that selects the IO as a complement, as in (25): [17]
(24)
(25)
In (24) a is a dative case marker and the NP receives inherent case in the specifier of ApplP; as a low applicative, the head Appl receives the meaning of possession, which corresponds to the beneficiary interpretation, licensing the dative argument and relating it to the theme. [18]
This possibility differs from a true prepositional construction, possible in (25),
where a could be replaced by para as a way to mean the final goal of the event of
sending the letter. In this second possibility, the possessive relation may also be
built, but it is subordinated to the goal / transfer meaning of the preposition; a
clitic is impossible here because directional locatives are never realized as clitics (Torres Morais & Lima-Salles 2010, pg. 198).
The main questions that this analysis raises are the following: are there any
semantic differences that justify the two structures? Are there two prepositions
a in dative constructions? And is there a justification for an applicative head in
this sort of dative construction?
[16] In Brazilian Portuguese the dominant preposition is para ('to', 'for'); and in certain geographical and social varieties even the DOC may be used (see Torres Morais & Lima-Salles (2010)). In Mozambican Portuguese the DOC is very common (see Gonçalves 1990, 2002, 2004); for a general presentation of the variation of the IO in non-European varieties of Portuguese see Brito (2008).
[17] For details see Torres Morais & Lima-Salles (2010).
[18] The treatment is similar with lhe (O João enviou-lhe uma carta, 'John sent him a letter'), with subsequent movements that explain the final word order.
Notice that the notion of possession transfer is always stronger with Vs like dar 'to give', emprestar 'to loan', alugar 'to rent', vender 'to sell' (cf. Ormazabal & Romero 2010, pgs. 208–9, from whom we adapt some of the examples); in fact, (26) is odd, because the contrastive clause denies the implication of the main clause:
(26)
# A minha tia deu / emprestou algum dinheiro ao irmão, mas ele nunca o recebeu.
the my aunt gave / lent some money to the brother, but he never it got
'My aunt gave / lent some money to her brother, but he never got it.'
On the contrary, with verbs like prometer 'to promise', oferecer 'to offer', enviar 'to send', ensinar 'to teach', lançar 'to throw', the situation is different and the transfer may fail to be successful (Ormazabal & Romero 2010, pg. 209):
(27)
(29)
(30)
(31)
(32)
Also, if the classical notion of ditransitive verbs is maintained, an applicative head as a low verbal category also seems unjustified for EP: the idea is that some verbs, like dar 'to give', select two true internal arguments. [19]
We have seen before that ditransitive constructions justify two syntactic structures; but these two syntactic structures should not be based either on different
meanings or on the different nature of the preposition. In the next section we will
investigate some fronting, ellipsis, binding and scope phenomena and we will see
that two base-generated ditransitive constructions may be justified in EP, a proposal already made for Portuguese by Costa (2009).
[5] argument ditransitives in european portuguese: two base-generated structures
Fronting and ellipsis, illustrated in (34-a) (examples from Costa (2009)), provide evidence in favor of an analysis where the V and the DO form a constituent, and therefore this example may justify a structure like (20): [20]
(34)
[19] Miguel et al. (2011), analysing benefactive non-argument datives (a mãe preparou uma refeição à filha / a mãe preparou-lhe uma refeição, 'the mother prepared a meal for her daughter / the mother prepared her a meal') and possessive datives (doem as costas ao João / doem-lhe as costas, 'John's back hurts / his back hurts'), propose that they are merged, along with DP-Themes, under the internal argument, broadly interpreted as a Possessive DP and exhibiting a predicative structure; according to this analysis, no applicative head is justified.
[20] Costa uses these data in favor of the structure (18).
Barss & Lasnik (1986) and Larson (1988) noticed, for English, that there are some asymmetries in binding that call into question not only a tripartite configuration of ditransitive constructions (as in (17)) but also a bipartite configuration where the IO is lower than the DO. This is why Larson proposes a derivational analysis of the DOC in English, where the raised IO (the beneficiary/goal) would c-command the DO (the theme) after movement.
Let us see the distribution of anaphors in ditransitive constructions in EP; the examples are inspired by Demonte's (1995) study for Spanish (Costa 2009; Brito 2010):
(35)
a.
b.
The two variants are possible, similarly to what has been proposed for other Romance Languages (Giorgi & Longobardi 1991, pg. 42 for Italian), but the sentence
with the low reflexive expression (35-b) is slightly better than the sentence where
the reflexive expression is higher than its antecedent (35-a).
Let us see now the same phenomenon with clitic doubling:
(36)
a.
[21] According to Adger (2003, pgs. 124–125), in English it is not possible to make VP preposing with the V and the DO: (i) *Benjamin said he would give the cloak to Lee and [give the cloak] he did to Lee. For him, ellipsis seems to give the same results: (ii) Who gave the cloak to Lee? *Benjamin (did) to Lee. As for coordination, although we can have (iii) Benjamin [gave the cloak] and [sent the book] to Lee, this is possible only with a substantial pause after cloak as well as odd intonation on the PP to Lee, suggesting that we have a case of deletion: (iv) Benjamin [gave the cloak 0] and [sent the book to Lee]. Adger considers that the behaviour of reflexives favors a binary branching analysis for ditransitive constructions in the prepositional construction under a shell structure with vP, although he considers that there is weak evidence from constituency in favour of this treatment. Notice that Adger's judgements for English are different from the ones by Phillips (2003), who admits VP preposing (see footnote 11).
Due to the presence of clitic doubling, there is here a contrastive focus and a
marked interpretation; nevertheless, the sentence (36-b) with the low reflexive
expression is slightly better than the sentence where the reflexive expression is
higher than its antecedent (36-a).
Let us see how EP behaves as regards other phenomena of binding of pronouns. In EP null possessives with a bound reading are always better than the ones with the possessive seu, sua; moreover, seu, sua is frequently interpreted as the second person, related to você; both facts interfere with these phenomena (Brito 2001). Nevertheless, the data favour a higher position of the antecedent over the expression that contains the possessive, regardless of whether the antecedent is the DO or the IO:
(37)
a.
b.
c.
d.
These phenomena suggest a shell structure and the idea that the highest argument is base-generated; the same proposal is reinforced by other examples where
binding and scope of quantifiers are involved (cf. again Costa 2009): [22]
(38)
a.
b.
d.
[22] Bruening (2001), for English, also proposes that there are two available structures for ditransitives and that there is no scrambling in order to explain the V IO DO order.
Costa (2009, pgs. 95–96) argues that these phenomena support a structure where the antecedent / the highest argument is base-generated, regardless of whether it is the DO or the IO.
What all these data suggest is that EP has two base-generated ditransitive constructions, like (19) and (20), justified by the word order data already presented above and by fronting, binding and scope phenomena, and not by different meanings of each variant or the existence of two different values of a. [23] Through both structures, the ditransitive verb builds its argument structure, in one discharging first the theme, in the other discharging first the goal / beneficiary.
[6] summary and conclusions
In this paper EP ditransitive constructions with dar 'to give' and enviar 'to send' were studied in some of their syntactic dimensions: EP has dative pronouns and a special preposition a, and exhibits two word order patterns, V DO IO and V IO DO. We have seen that the order V IO DO is due to two reasons: contrastive focus on the IO or the complexity of the DO. This conclusion was reinforced by the analysis of many utterances in the CetemPúblico corpus. Nevertheless, there is a strong link between the V and the DO in certain constructions with dar as a light verb that cannot be broken.
I reviewed some of the literature on IO / datives and on the DOC. Specifically, I commented on Torres Morais & Lima-Salles's (2010) analysis, according to which EP has dative alternation, in the sense that in one of the structures a is a dative case marker and in the other it is a low true preposition, similar to para. According to these authors the two constructions are not absolutely synonymous. On the contrary, I proposed that a is the same dative marker in both positions;
[23] As we saw above, there have been different proposals in the literature to describe the two variants. Costa (2009) adopts Phillips's (2003) framework, according to which structure is built incrementally from left to right, preserving c-command and allowing two base-generated structures to be built. Brito (2014) adopts a treatment inspired by the framework of Alexiadou et al. (2011), according to which a (verb) root is dominated by different functional categories which build syntactic structure; but, contrary to Marantz, Pylkkänen, Cuervo, and Torres Morais & Lima-Salles, who use the Appl head in order to explain the incorporation of the IO, the author still makes a distinction between argument datives and non-argument datives and therefore no Appl head is proposed. For the details of the analysis see Brito (2014).
acknowledgments
I thank Paula Carvalho for helping me to collect the examples from the CetemPúblico corpus. This research was carried out at the Centro de Linguística da Universidade do Porto (CLUP), of which I am a member, with support from FEDER / POCTI U0022/2003.
annex i
From CetemPúblico, relevant occurrences in bold.
par=ext989232-pol-96a-2: Temos de dar a Samper uma saída, disse o senador conservador Eduardo Pizano, citado pela Reuter, como quem antevê o caos depois da tempestade.
par=ext127620-nd-91a-1: Não é por acaso que agora, no seu primeiro projecto pessoal, deu a Price o papel de Inventor.
par=ext578006-des-95a-1: Duas vitórias sucessivas dão a um jogador muita confiança, confessou Muster após a final, em que, mais uma vez, demonstrou as suas qualidades físicas.
par=ext472583-soc-95b-2: É possível sustentar a tese de que essa é uma maneira oblíqua e astuta de o ferir, inclusive porque torna mais difícil o divórcio e dá a Diana melhores condições se, apesar de tudo, este vier a acontecer.
par=ext694500-pol-91b-3: Os raptores deram a Bona um prazo de 48 horas para fornecer informações sobre o estado de saúde dos irmãos Hamadi, dois xiitas libaneses detidos na Alemanha sob acusações de terrorismo.
par=ext711722-soc-91b-2: Como a versão tinta, mas sem gralhas, como ironiza Augusto Deodato, a agenda apresenta uma selecção que visa dar a quem resida ou venha a Lisboa a oportunidade de gerir melhor os interesses nesta cidade.
par=ext796563-soc-96b-1: É por estas e por outras, concluiu Lobo Fernandes, que o prestigiado Guia Verde da Michelin dá a Braga a nota mais baixa (uma estrela) na classificação das cidades que apresenta no seu roteiro turístico.
par=ext1344780-nd-91a-1: Para Setembro, deverá ter obtido sinais de reactivação que dêem a Carlos Menem uma vitória nas eleições legislativas, o que para muitos peronistas é uma missão impossível.
par=ext856353-clt-96b-2: O objectivo é dar a professores, alunos e outros funcionários a possibilidade de consultarem um árbitro para resolverem os seus diferendos pessoais ou institucionais.
par=ext660500-nd-98b-2: A evolução do escândalo Monica Lewinsky deu a Hyde uma enorme notoriedade nacional e enquanto o caso não for fechado de vez o senador de pensamento conservador (que há 30 anos teve um caso extraconjugal) vai continuar a estar sob os holofotes.
par=ext121571-soc-94b-1: O Governo português só deu a Bruxelas a informação que lhes convinha, não enviando sequer os pareceres produzidos no âmbito da consulta pública feita sobre o Estudo de Impacte Ambiental (...).
par=ext755655-clt-96b-3: Na sequência final, a suprema crueldade de Wilder dava a Cecil B. de Mille a oportunidade de domar, pela última vez, a beleza da sua ave do paraíso enlouquecida.
par=ext320712-pol-94a-1: Onde é que ia arranjar dinheiro para dar a esses homens a comida, as roupas e o sabão de que necessitariam?, perguntou indignado o general Niha, primeiro secretário da Frelimo na província de Nampula.
par=ext582831-des-92a-2: Uma sondagem Público-Norma realizada no domingo no Estádio da Luz, por ocasião do jogo Benfica-FC Porto, deu a Jorge de Brito a maioria absoluta para as eleições de 24 de Abril.
par=ext585073-pol-98a-4: Mas uma sondagem divulgada no fim-de-semana dá a Cardoso uma confortável margem: 40 por cento, contra 35 por cento para todos os seus rivais somados.
par=ext677371-pol-92b-2: O Congresso terá que assumir a responsabilidade de dar a Itamar a possibilidade de organizar o Estado, que foi desorganizado nos últimos seis anos.
par=ext1371639-pol-93a-1: Dia importante, este 27 de Abril de 1993 ainda mais que aquele, no Verão de há três anos, em que Gorbatchov deu a Bush luz verde para a coligação anti-Iraque.
par=ext221520-clt-94b-2: Lestat, quando viu o que Louis tinha feito, deu a Claudia um pouco do seu sangue a beber, transformando-a também em vampiro, para a oferecer a Louis.
par=ext1405400-nd-94b-1: Muñoz Molina manifesta uma categórica afinidade com aqueles que dão a Lisboa e a Portugal a forma e o conteúdo da nossa peculiar identidade.
par=ext1180148-pol-97b-2: A Assembleia da República recusou dar a Pacheco Pereira a prerrogativa de depor apenas por escrito num processo por abuso de liberdade de imprensa que lhe foi movido pela actual directora do vespertino A Capital, Helena Sanches Osório.
par=ext269933-soc-91a-1: João Paulo II não deixou de dar a este debate o seu contributo.
par=ext403476-nd-93b-1: O Estado Novo, dentro dos limites consentidos pelas suas opções estratégicas, deu a Pacheco meios quase ilimitados de concretizar o seu voluntarismo modernizador.
par=ext732008-pol-93a-1: A campanha eleitoral começou a dar os primeiros passos logo no sábado, após a dissolução oficial do Parlamento, que apanhou os desprevenidos os deputados que não esperavam que a moção de censura contra o Governo de Hanna Suchocka fosse aprovada, dando a Walesa o pretexto que ele esperava para dissolver o Parlamento.
par=ext670069-pol-93b-2: Dar a cada cubano a possibilidade de possuir, legalmente, a moeda do inimigo, o dólar, será assim quebrar um dogma.
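A search of the kind that yields occurrences like those above can be approximated with a simple pattern. The Python sketch below is an assumption about how such a search might be done, not a description of how the examples in this annex were actually collected: it relies on a hand-listed and deliberately incomplete set of forms of dar and flags sentences in which the verb is immediately followed by the preposition a. The names `DAR_FORMS`, `PATTERN` and `find_v_io_candidates` are hypothetical.

```python
import re

# Hypothetical, deliberately incomplete list of inflected forms of "dar".
DAR_FORMS = ["dar", "dá", "dão", "deu", "deram", "dava", "dando", "dêem"]

# A form of "dar" immediately followed by "a" and at least one more word.
# Contracted forms such as "ao", "à" or "aos" are not covered by this sketch.
PATTERN = re.compile(
    r"\b(?:" + "|".join(DAR_FORMS) + r")\s+a\s+\w+",
    flags=re.IGNORECASE,
)


def find_v_io_candidates(sentences):
    """Return (sentence, matched span) pairs where 'dar' is directly followed by 'a'."""
    hits = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match:
            hits.append((sentence, match.group(0)))
    return hits


sample = [
    "Os raptores deram a Bona um prazo de 48 horas.",
    "A Maria deu um livro ao João.",
]
hits = find_v_io_candidates(sample)
```

Such a pattern only returns candidates for the V IO DO order: each hit still has to be checked by hand, since a after dar may also introduce an infinitive or belong to a different construction.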
annex ii
Some proverbs with dar 'to give', from Machado (1996, pgs. 161–165) and Parente (2005, pg. 184).
(i) V DO IO order:
Dá Deus as nozes a quem não tem dentes.
Dá Deus toucinho a quem não tem espeto.
Dá honra a quem não a tem.
Dá Nosso Senhor campos a quem não aproveita os toucinhos.
references
Adger, David. 2003. Core syntax: A minimalist approach (Core Linguistics). Oxford University Press.
Alexiadou, Artemis, Gianina Iordăchioaia & Florian Schäfer. 2011. Scaling the variation in Romance and Germanic nominalizations. In Petra Sleeman & Harry Perridon (eds.), The noun phrase in Romance and Germanic: Structure, variation, and change (Linguistik Aktuell), 25–40. John Benjamins Publishing Company.
Anagnostopoulou, Elena. 2003. The syntax of ditransitives: Evidence from clitics (Studies in Generative Grammar). Mouton de Gruyter.
Baker, Mark Cleland. 1988. Incorporation. A theory of grammatical function changing. The University of Chicago Press.
Barss, Andrew & Howard Lasnik. 1986. A Note on Anaphora and Double Objects. Linguistic Inquiry 17. 347–354.
Belletti, Adriana. 2004. Aspects of the low IP area. In Luigi Rizzi (ed.), The Structure of CP and IP: The Cartography of Syntactic Structures, vol. 2 (Oxford Studies in Comparative Syntax). Oxford University Press.
Brito, Ana Maria. 2001. Presença/ausência de artigo antes de possessivo no Português do Brasil. In Actas do XVI Encontro da Associação Portuguesa de Linguística, 551–575. APL/Colibri.
Brito, Ana Maria. 2008. Grammar variation in the expression of verb arguments: the case of the Portuguese Indirect Object. Phrasis 2008. 31–58.
Brito, Ana Maria. 2009. Construções de objecto indirecto preposicionais e não preposicionais: uma abordagem generativo-constructivista. In A. Fiéis & A. Coutinho (eds.), Textos Seleccionados do XXIV Encontro da Associação Portuguesa de Linguística, 141–159. Colibri.
Brito, Ana Maria. 2010. Do European Portuguese and Spanish have the double object construction? In Encuentrogg. V Encuentro de Gramática Generativa (2009), 81–114.
Brito, Ana Maria. 2014. As construções ditransitivas revisitadas. Alternância dativa em Português Europeu? In António Moreno, Fátima Silva, Isabel Falé, Isabel Pereira & João Veloso (eds.), Textos selecionados: XXIX Encontro Nacional da Associação Portuguesa de Linguística, 103–119.
Bruening, Benjamin. 2001. QR obeys superiority: frozen scope and ACD. Linguistic Inquiry 32(2). 233–273.
Costa, João. 2009. A focus-binding conspiracy. Left-to-right merge, scrambling and binary structure in European Portuguese. In Jeroen van Craenenbroeck (ed.), Alternatives to cartography, 87–108. De Gruyter Mouton.
Cuervo, Maria Cristina. 2003. Datives at Large: Massachusetts Institute of Technology PhD dissertation.
Cuervo, Maria Cristina. 2010. Against ditransitivity. Probus 22. 151–180.
Demonte, Violeta. 1995. Dative alternation in Spanish. Probus 7. 5–30.
Diaconescu, Constanta Rodica & Maria Luisa Rivero. 2005. An applicative analysis of double constructions in Romanian. In Actes du Congrès annuel de l'Association Canadienne de Linguistique, 1–11.
Duarte, Inês. 1987. A construção de topicalização na gramática do português: regência, ligação e condições sobre movimento: Universidade de Lisboa PhD dissertation.
Duarte, Inês. 2003. Relações gramaticais, esquemas relacionais e ordem de palavras. In M. Helena Mira Mateus, Inês Duarte & Isabel Hub Faria (eds.), Gramática da língua portuguesa, 275–321. Caminho 5th edn.
Giorgi, Alessandra & Giuseppe Longobardi. 1991. The Syntax of Noun Phrases: Configuration, Parameters and Empty Categories. Cambridge University Press.
Gonçalves, Anabela & Eduardo Paiva Raposo. 2013. Verbo e sintagma verbal. In Eduardo Paiva Raposo, Maria Fernanda Bacelar do Nascimento, Antónia Coelho da Mota, Luísa Segura & Amália Mendes (eds.), Gramática do português, vol. 2, 1155–1218. Fundação Calouste Gulbenkian.
Gonçalves, Perpétua. 1990. A Construção de uma Gramática do Português em Moçambique: Aspectos da Estrutura Argumental dos Verbos: Universidade de Lisboa PhD dissertation.
Pesetsky, David. 1995. Zero Syntax: Experiencers and Cascades. The MIT Press.
Phillips, Colin. 2003. Linear order and constituency. Linguistic Inquiry 34(1). 37–90.
Pujalte, Mercedes. 2008. Sobre frases aplicativas y complementos dativos en el español del Rio de Plata. Cuadernos de Lingüística 15. 139–156.
Pujalte, Mercedes. 2009. Condiciones sobre la Introducción de argumentos. El caso de la alternancia dativa en Español. Universidad Nacional del Comahue, Escuela Superior de Idiomas MSc thesis.
Pylkkänen, Liina. 2002. Introducing Arguments: Massachusetts Institute of Technology PhD dissertation.
Ramchand, Gillian. 2008. Verb Meaning and the Lexicon: a first phase syntax. Cambridge University Press.
Rappaport Hovav, Malka & Beth Levin. 2008. The English dative alternation: the case for verb sensitivity. Journal of Linguistics 44. 129–167.
Ross, John Robert. 1967. Constraints on variables in Syntax: Massachusetts Institute of Technology dissertation.
Torres Morais, Maria Aparecida. 2006. Um cenário para o núcleo aplicativo no português europeu. ABRALIN 5. 239–266.
Torres Morais, Maria Aparecida & Heloísa Lima-Salles. 2010. Parametric change in the grammatical encoding of indirect objects in Brazilian Portuguese. Probus 22. 181–209.
Vilela, Mário. 1992. Gramática de Valências. Teoria e aplicação. Almedina.
Xavier, Maria Francisca. 1989. Argumentos Preposicionados em Construções Verbais. Um estudo contrastivo das preposições a, de e to, from: Universidade Nova de Lisboa PhD dissertation.
contacts
Ana Maria Brito
Faculdade de Letras da Universidade do Porto
ambarrosbrito@gmail.com
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 359–377. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
corpus-driven glossaries
in translator training courses
STELLA ESTHER ORTWEILER TAGNIN
abstract
Corpus Linguistics has proved to be a valuable resource for extracting candidate terms and phraseological units from specialized corpora (Bowker & Pearson 2002). It is in fact a relatively new approach, since most glossaries are generally based on similar, previously existing material. Although there are many glossaries on the market, few have been compiled to meet the needs of translators, whose main task in technical translation is to produce a natural and fluent text, whether in their native language or in a foreign language. For this reason, a glossary that consists simply of a list of terms and their equivalents will not be satisfactory for the translator. As text producers, translators need to know how a word is used, that is, which words it combines with (Firth 1957; Sinclair 1991). In addition, technical language includes terms consisting of several words as well as even longer phraseological units. Glossary compilation was addressed in the Specialization Course in Translation at the University of São Paulo as a methodology for improving students' specialized knowledge. After some experience, it became clear that the approach was in line with what Shreve (2006) called deliberate practice, a methodology that contributes to the development of students' research and translation skills, leading to the acquisition of specialized knowledge and techniques (Maia 1997, 2002; Tagnin 2002) on which learners will be able to draw in any area in which they may come to work. This paper will describe how this was carried out on several occasions, that is, by means of a corpus-based approach, and will illustrate the steps followed with examples from several projects.
[1] introduction
These courses were discontinued in 2005, that is, the last group completed the course in 2007.
This paper reports on the decisions made regarding what to teach in a translator training course and describes how Corpus Linguistics can be used for terminological work.
[2] corpora in the translation classroom
The use of corpora in translator training courses has been a fact for over two
decades (Maia 1997, 2002; Tagnin 2002). In Brazil it was introduced as a methodology
for the compilation of technical glossaries in the Specialization Course in
Translation at the University of São Paulo in 2001. During a course on Technical
Translation, students were divided into thematic groups and instructed to build
an English-Portuguese comparable corpus in a specialized area, that is, a corpus
with original texts in both languages. They were then to extract the technical
terms, identify equivalents and collect examples in both languages. Glossaries
resulting from this activity were made available at the course's site [2] under
Trabalhos de alunos - Glossário (Student works - Glossary). In 2005, students were
asked to build a bilingual glossary along the lines of a series of technical glossaries
brought out by a local publisher. Each group could choose one field of study, and
the best works would be submitted to the publisher for possible publication. In
2008, as part of a similar course [3], it was suggested that the whole class engage in
one collective project for the construction of a Photography glossary. This project
is discussed in detail in Section [4].
[2.1]
Before deciding on the format of the glossaries to be produced, it was deemed necessary
to determine the translators' terminological needs (Teixeira 2008; Fromm
2008). When one reflects on this, what immediately comes to mind is that a
translator needs equivalents, which is only partially true. As González-Jover
& Sierra (2004) have already pointed out, terminology materials should help
translators make decisions that are part of their daily practice. And their daily
practice involves much more than just finding an equivalent.
A survey carried out by Fromm (2008) with professional translators on the
features of the bilingual dictionaries they mostly use showed (see Table 1) that
the dictionaries translators find most valuable, apart from the ones that present
all of the above, are the ones that provide a translation as well as
examples. It is this preference that formed the basis on which the template
for our entries was built.
[2] http://citrat.fflch.usp.br/node/18
[3] This was a single extracurricular discipline, also called Technical Translation, but no longer part of a full-fledged course.
Respondents   Percentage
14            8%
34            19%
19            11%
23            13%
22            12%
22            12%
41            23%
[2.2]
Given that translators are, above all, text producers and that their goal in technical translation is to produce a natural text, they need, in addition to equivalents,
examples that contextualize a certain term found in the source text as well as information about its textual and linguistic patterns. In other words, they need to
know the term's collocations and phraseologies (Tagnin 2002). For terms which
do not have equivalents in the target language, translators would need other
translation possibilities or even suggestions for adaptation. On such occasions,
cultural information may help them to choose adequate substitutions.
Let us illustrate this with an example taken from the area of Cooking. If a
translator needs to translate 1 large onion, finely chopped into Portuguese, he/she
would find it useful to have a glossary which would specify that the Portuguese
cognate for finely (finamente) does not usually occur in this context. Rather,
the most natural translation for finely into Portuguese would be the adverb bem
(= well), which renders bem picada (*well chopped). Another option would be the
diminutive picadinha, with or without the adverb bem. Thus, the glossary would
specify that the best translation options are 1 cebola grande, bem picada or 1 cebola
grande (bem) picadinha. In the case of finely grated Parmesan cheese, the glossary
should provide the information that the usual translation is simply queijo parmesão
(= parmesan cheese), since in Brazil this kind of cheese is customarily finely grated.
Thus, the texture is only specified when the cheese should be coarsely grated,
which would be ralado grosso in Portuguese. The cultural gap becomes even more
evident when the translator encounters the term buttermilk. Although the Portuguese
language has a corresponding term, leitelho, it is not used, mainly because
this product does not exist in our country. Thus, the glossary could add an
explanatory note or even suggest that buttermilk can be replaced by a mixture of
equal parts of milk and plain yogurt (Teixeira & Tagnin 2008).
However, much of the material available on the market does not meet these
needs and is often limited to a mere list of monolexical terms and their equivalents
in the target language, without providing examples or other linguistic information
that can help the translator to make adequate decisions and create a text in
which naturalness (Sinclair 1984) prevails. Thus, as mentioned before, it is necessary
to create a model for a glossary that meets the needs of the translator. In this
sense, as Krieger & Finatto (2004) have suggested, translators can be instrumental
in creating new methodologies for the production of reliable terminological
sources of information.
In this paper we claim that a methodology relying on the premises of Corpus
Linguistics can provide this much-needed reliable terminological source of
information for translators.
[3] corpus linguistics
As we know, Corpus Linguistics is an empirical approach based on the observation of a large number of texts. These texts, always authentic, constitute a corpus, which can be investigated by means of specific computational programs that
produce, among other data, concordance lines (see Figure 1). Concordance lines
show the search word with its surrounding co-text, and allow investigators to
identify recurrent patterns, terms and phraseological units. Concordance lines
can also be sorted alphabetically by the words to the right or to the left of the
search word, which makes identifying recurrent patterns even easier by grouping them together. The first example (Figure 1) is a selection of concordance lines
for the Portuguese word imagem (= image), taken from the Photography corpus.
1   imagem  ampliada. 7Pressione
2   imagem  ampliada (zoom de
3   imagem  ampliada Utilizando a
4   imagem  ampliada (zoom de
5   imagem  ampliada. 7Pressione
6   imagem  ampliada (zoom de
7   imagem  ampliada (zoom de
8   imagem  captada pelas lentes
9   imagem  captada no modo Adobe
10  imagem  captada por uma
11  imagem  capturada. Além disso
12  imagem  capturada com
13  imagem  capturada em película
14  imagem  capturada. Além disso
15  imagem  capturada primeiro
figure 1: A selection of concordance lines for imagem, sorted by 1st word to the
right.
The above concordance lines show the recurrence of three collocations: imagem ampliada, imagem captada and imagem capturada, which might indicate that
figure 2: Selection of concordance lines for image, sorted by 1st and 2nd word on
the left.
Another method to extract terminological units is by using a list of n-grams
(Guinovart & Simões 2009; Maia et al. 2008). These lists show all combinations
of two words (bigrams), three words (trigrams) or even longer combinations, depending on how the researcher adjusts the settings of the program being used.
Again, however, these lists need to be examined by the researcher in order to
decide which combinations are, in fact, terminological units.
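The n-gram listing described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the tool used in the courses: the tokenizer, the function names and the sample sentence are assumptions introduced here, and the counts ignore sentence boundaries.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes allowed inside a word)."""
    return re.findall(r"[^\W_]+(?:'[^\W_]+)?", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample text, not taken from the Photography corpus.
sample = ("Press the shutter button halfway. "
          "Press the shutter button fully to take the picture.")

bigrams = ngram_counts(tokenize(sample), 2)
trigrams = ngram_counts(tokenize(sample), 3)

# The most frequent combinations are only candidates; a human still decides
# which of them are terminological units.
for gram, freq in bigrams.most_common(3):
    print(" ".join(gram), freq)
```

Changing the value of n corresponds to adjusting the program settings mentioned above; the resulting list still has to be examined by the researcher.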
Corpus Linguistics can be used in two ways to compile glossaries: as a methodology or as an approach. In the first case, we refer to it as corpus-based Terminology; in the second, as corpus-driven Terminology. It is the latter that was used in
our courses.
N    Word      Freq.     %      Texts   %
1    THE       13,665    7.91   10      100
2    #         13,197    7.64   10      100
3    TO        4,173     2.42   10      100
4    AND       2,705     1.57   10      100
5    IN        2,621     1.52   10      100
6    A         2,560     1.48   10      100
7    CAMERA    2,216     1.28   10      100
8    IS        2,168     1.26   10      100
9    OR        2,164     1.25   10      100
10   OF        2,111     1.22   10      100
11   IMAGE     1,697     0.98   10      100
12   ON        1,643     0.95   10      100
13   YOU       1,576     0.91   10      100
14   WITH      1,309     0.76   10      100
15   FOR       1,284     0.74   10      100
16   BUTTON    1,187     0.69   10      100
17   IMAGES    1,156     0.67   10      100
18   MODE      1,043     0.60   10      100
19   YOUR      973       0.56   10      100
20   WHEN      946       0.55   10      100
table 2: WordList: 20 most frequent words in the Camera subcorpus of the Photography project.
Key word       Freq.    %      RC. Freq.   Keyness
CAMERA         2,216    3.71   46          9,941.70
IMAGE          1,697    2.84   220         6,623.78
BUTTON         1,187    1.98   37          5,230.30
IMAGES         1,156    1.93   49          5,009.45
MODE           1,043    1.74   11          4,759.94
SELECT         828      1.38   94          3,284.80
FLASH          703      1.18   17          3,130.42
PHOTOGRAPHS    478      0.80   29          2,968.63
OR             2,164    3.62   4,022       2,937.68
MENU           703      1.18   61          2,874.92
EXPOSURE       636      1.06   39          2,684.16
BATTERY        587      0.98   12          2,629.87
SHUTTER        554      0.93   2           2,564.53
PRESS          835      1.40   415         2,400.90
CARD           655      1.10   144         2,338.61
KODAK          485      0.81   1           2,253.63
FILM           446      0.75   433         2,201.97
PHOTOGRAPHIC   333      0.56   2           2,196.38
DIGITAL        508      0.85   49          2,053.56
LIGHT          370      0.62   238         2,017.40
RC. % (only five values survive; their row alignment is not recoverable): 0.01, 0.25, 0.03, 0.03, 0.01
camera. This
camera with its
camera. The
Camera Manager
Camera Manager is
Camera Manager
Camera Manager
Camera Manager's
camera, bracketin
camera as you
camera batteries.
camera batteries
camera battery
camera battery
camera
camera This
camera User's guid
camera dock or
camera dock, or
camera dock, Koda
camera dock or
camera acts as a
camera using the
camera, you'll be
camera that cost
The above sequence of activities was followed on various occasions during Technical
Translation courses at the University of São Paulo. The most recent ones
took place in 2005 and 2008, as mentioned before. For the sake of illustration,
we will concentrate on the 2008 project on Photography, but will resort to other
areas from the 2005 project when they provide better examples to illustrate the
procedures being discussed.
Number of words
72,665
72,864
36,668
59,803
72,716
Total: 314,716
Extracting patterns
Let us remember that recurrent patterns in concordance lines may be candidate
terms. Figure 4 shows some of these patterns for the word photographs.
The Figure 4 concordance lines allow us to identify nominal collocations such
as albumen photographs, colo[u]r photographs, digital photographs and family photographs, as well as verbal collocations like clean photographs, display photographs and
even longer phraseological units like water-damaged photographs.
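To make the procedure concrete, the following is a minimal sketch of how concordance (KWIC) lines can be generated and sorted by the first word to the left, so that recurrent patterns group together. It is not the concordancer used in the project: the function names, the context width and the sample sentences are assumptions, and sentence boundaries are ignored.

```python
import re


def kwic(text, node, width=3):
    """Return (left, match, right) context triples for every token matching `node`.

    `node` may contain '*' as a wildcard, e.g. 'picad*'.
    """
    tokens = re.findall(r"[^\W_]+(?:[-'][^\W_]+)*", text)
    pattern = re.compile(re.escape(node).replace(r"\*", r"\w*"), re.IGNORECASE)
    lines = []
    for i, tok in enumerate(tokens):
        if pattern.fullmatch(tok):
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def sort_by_left(lines):
    """Sort concordance lines by the first word to the left of the node."""
    return sorted(lines, key=lambda line: line[0][-1].lower() if line[0] else "")


# Hypothetical sample sentences built around collocations named in the text.
sample = ("Do not display photographs in direct sunlight. "
          "Store family photographs in a cool spot. "
          "Clean photographs gently.")

for left, node, right in sort_by_left(kwic(sample, "photographs")):
    print(f"{' '.join(left):>25}  {node}  {' '.join(right)}")
```

Sorting by the word on the left brings collocates such as clean, display and family next to the node, which is what makes candidate terms visible to the student.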
Extracting relevant context (examples)
Once all relevant terms and phraseologies had been identified, examples were
retrieved from the concordance lines to be inserted in the entries. If the concordance line did not show the full context, a double click on it led to the full source
text. Part of it is shown below for concordance line 25 in Figure 5.
Identifying equivalents
One way to identify possible equivalents is to compare the lists of keywords in
both languages. Figure 6 illustrates this procedure for an English-Portuguese
Cooking glossary (Teixeira & Tagnin 2008).
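Keyword lists of this kind rank words by how unusually frequent they are in the study corpus compared with a reference corpus. The text does not state which statistic produced the keyness scores, so the sketch below assumes the log-likelihood (G2) statistic, one common choice; the function name and all counts in the example are hypothetical.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) keyness of a word in a study corpus vs. a reference corpus."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts: (frequency in study corpus, frequency in reference corpus).
study_size, ref_size = 60_000, 1_600_000
counts = {"camera": (2_000, 50), "press": (800, 400), "the": (4_500, 120_000)}

ranked = sorted(counts,
                key=lambda w: log_likelihood(counts[w][0], study_size, counts[w][1], ref_size),
                reverse=True)
print(ranked)
```

A word with the same relative frequency in both corpora scores zero, which is why function words tend to drop out of keyword lists while domain words rise to the top.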
figure 4: Selection of concordance lines for photographs sorted by 1st word to the
left.
Important photographs should be matted to museum standards,
using archival matting and backboard. Check with a professional
in a good framing store.
Do not display photographs in direct sunlight or under bright
lights, and keep them away from heat vents and damp locations.
Store prints in a cool and dry spot; basements, attics, and garages
are not suitable locations for storage because their temperature
and humidity levels vary too much.
figure 5: Expanded context in source text. Relevant concordance line highlighted
by author.
Once a pair is identified, concordance lines should be generated to check whether
the selected equivalents occur in similar contexts. When there is no such
prima facie (literal) equivalent, the search can be pursued through the word's collocates or
context (Tagnin 2007). For example, if we wish to find the equivalent for finely,
the most frequent adverb in a Cooking corpus, we will realize that it is not
finamente, the Portuguese cognate for finely, because this adverb displays a very
low frequency in the Portuguese Cooking corpus. So, we can look at the collocates
of finely and see with which words they occur in the target language corpus. One
of these collocates is chopped, picado in Portuguese. The concordance lines will
show that picado co-occurs with bem, yielding the collocation bem picado, but they
also show a typical Portuguese term, picadinho, which may also occur with bem:
bem picadinho (Figure 7).
2 cebolas médias bem picadas
dente de alho bem picado
junte os tomates pelados bem picados.
Calabresa picadinha
100 g de bacon picadinho
2 dentes de alho picadinhos
Polvilhar salsa bem picadinha
figure 7: Selection of some concordance lines for picad*, sorted by 1st word to the
left.
If even this procedure does not reveal an equivalent, it may be because there
is no equivalent in the target language. Thus, in such instances, it would be useful to suggest an adaptation or insert an explanatory note, as was the case for
buttermilk, mentioned earlier in this paper. Because we are dealing with a comparable corpus, with original texts in both languages, this kind of information may
be retrieved from the corpus itself.
Building entries
To meet translators' needs, as discussed above, entries contained the following
information:
(1) head word (part-of-speech)
(2) Example in English
(3) equivalent
(4) example in Portuguese
(5) Comments (if necessary)
(6) cross-reference
Here are a few sample entries from the Photography glossary:
(1) acid-free (adj.)
(2) For added protection, acid-free envelopes and boxes are available
from conservation suppliers.
(3) de pH neutro
(4) Só são aceitáveis para embalagens de arquivo de fotografias
papéis de pH neutro ou próximo de neutro, isentos de lignina
e sem corantes.
(5) Termo usado quando um produto contém nível de pH acima
de 7.0. Indica que em sua composição não foi utilizado nenhum
componente com reação ácida ou que, com o passar do tempo
se decomponha produzindo resíduos ácidos que causam sérios
danos às fotografias.
(1) adapter card (n.)
(2) The adapter card may have multiple ports.
(3) cartão adaptador
(4) Conecte a extremidade de 6 pinos do cabo em qualquer
port disponível ao cartão adaptador IEEE 1394 do
computador.
(1) additional development (n.)
(6) development, additional
At the end of this process, students had built their bilingual glossaries, which
were examined by the instructor and returned with comments and suggestions.
This way, students had the opportunity to revise their work and make any necessary changes, adjustments or additions. Only the final version was evaluated.
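The six-field entry template described above maps naturally onto a small record type. The sketch below is one possible representation, not the format used by the students or the publisher; the class name, field names and the render method are assumptions, while the sample data come from the entries shown above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GlossaryEntry:
    headword: str
    part_of_speech: str
    example_en: Optional[str] = None
    equivalent: Optional[str] = None
    example_pt: Optional[str] = None
    comments: Optional[str] = None
    cross_references: List[str] = field(default_factory=list)

    def render(self):
        """Format the entry with the numbered fields of the template, skipping empty ones."""
        fields = [
            (1, f"{self.headword} ({self.part_of_speech})"),
            (2, self.example_en),
            (3, self.equivalent),
            (4, self.example_pt),
            (5, self.comments),
            (6, "; ".join(self.cross_references) or None),
        ]
        return "\n".join(f"({n}) {value}" for n, value in fields if value)


adapter_card = GlossaryEntry(
    headword="adapter card",
    part_of_speech="n.",
    example_en="The adapter card may have multiple ports.",
    equivalent="cartão adaptador",
)
cross_ref = GlossaryEntry(
    headword="additional development",
    part_of_speech="n.",
    cross_references=["development, additional"],
)
print(adapter_card.render())
print(cross_ref.render())
```

Keeping the optional fields empty rather than omitting them makes it easy to check, before submission, which entries still lack an example or an equivalent.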
[5] resulting products
As mentioned above, this procedure was carried out on two occasions, 2005 and
2008. From the glossaries produced by the 2005 class, one on Chemistry was published in 2007 (Perrotti-Garcia & Rebechi 2007).
A Cooking glossary built along the same lines was produced by a former translation student and co-authored by me (Teixeira & Tagnin 2008). Although not part
of either the 2005 or the 2008 project, it is an offspring of a glossary on Cooking
spices and condiments compiled in the 2001 course. After finishing the Translation course, Teixeira pursued her master's degree with a thesis on the translation of cooking recipes (Teixeira 2004) and her PhD with a dissertation on a proposal for a Cooking dictionary aimed at a translator's textual production (Teixeira
2008) [6].
The results of the Photography project, unsurprisingly, were a bit uneven.
One group excelled and one presented very poor material. The work of the other
groups was good but needed some improvement. As the aim was to submit high
quality material to a publisher and only one glossary met this requirement, after
grades had been assigned, the instructor called a meeting of those who would be
interested in pursuing the project on their own time and making all necessary
adjustments for the work to be suitable for submission to the publisher. A group
of five students [7] decided to embrace the project and the final material was submitted in early 2009. As it is the publisher's policy to have all technical glossaries
revised by a professional in the area, the material was examined by a professional
photographer who returned it with a few comments and suggestions. These were
worked on by the group and the Vocabulário para fotografia was eventually published in 2013 (Tagnin 2013).
[6] an interesting outcome
A couple of years ago I participated in a round table on the teaching of translation. One of my colleagues, Fabio Alves, from the Federal University of Minas
Gerais, presented the concept of deliberate practice. It goes something like this:
for students to acquire translation competence, their training should aim at developing specific skills that will contribute to their optimal learning and expert
performance in a certain field (Ericsson & Charness 1997). This requires certain
conditions to be met, among which the most mentioned one is subjects' motivation to attend to the task and exert effort to improve their performance (Ericsson
et al. 1993, p. 367). This is developed by Shreve (2006, p. 29), who states that for
deliberate practice to occur, the following requirements must be met:
[6]
[7]
[7] final remarks
references
Alves, Fábio & Stella Esther Ortweiler Tagnin. 2010. Corpora e ensino de tradução:
o papel do auto-monitoramento e da conscientização cognitivo-discursiva no
processo de aprendizagem de tradutores novatos. In Vander Viana, Stella
Esther Ortweiler Tagnin & Fábio Alves (eds.), Corpora no ensino de línguas estrangeiras, 189–203. HUB Editorial.
Bowker, Lynne & Jennifer Pearson. 2002. Working with Specialized Language: A Practical Guide to Using Corpora. Routledge.
Ericsson, Anders, Ralf Th. Krampe & Clemens Tesch-Römer. 1993. The Role of Deliberate Practice in the Acquisition of Expert Performance. Psychological Review
100. 363–406.
Ericsson, K. Anders & Neil Charness. 1997. Cognitive and developmental factors
in expert performance. In P. J. Feltovich, K. M. Ford & R. R. Hoffman (eds.),
Expertise in context: Human and machine, 3–41. MIT Press.
Firth, John Rupert. 1957. Papers in linguistics 1934-1951. Oxford University Press.
Fromm, Guilherme. 2008. Votec: A construção de vocabulários eletrônicos para aprendizes de tradução. São Paulo: Universidade de São Paulo PhD dissertation.
González-Jover, Adelina Gómez & Chelo Vargas Sierra. 2004. Aspectos metodológicos para la elaboración de diccionarios especializados bilingües destinados al
traductor. In L. González & P. Hernúñez (eds.), Las palabras del traductor: Actas
del II Congreso El español, lengua de traducción, 365–398.
Guinovart, Xavier Gomez & Alberto Simões. 2009. Parallel corpus-based bilingual
terminology extraction. In Marie-Claude L'Homme & Sylvie Szulman (eds.), 8th
international conference on terminology and artificial intelligence.
Tagnin, Stella Esther Ortweiler & Cleci Regina Bevilacqua. 2013. Corpora na terminologia. HUB Editorial.
Teixeira, Elisa Duarte. 2004. Receitas qualquer um traduz. Será? - a Culinária como
área técnica de tradução. Universidade de São Paulo MSc thesis.
Teixeira, Elisa Duarte. 2008. A Linguística de Corpus a serviço do tradutor: Proposta
de um dicionário de Culinária voltado para a produção textual: Universidade de São
Paulo PhD dissertation.
Teixeira, Elisa Duarte & Stella Esther Ortweiler Tagnin. 2008. Vocabulário para
Culinária inglês-português. Série Mil & Um Termos. SBS.
contacts
Stella Esther Ortweiler Tagnin
Universidade de São Paulo
seotagni@usp.br
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos
que se Cruzam, Oslo Studies in Language 7(1), 2015. 379–395. (ISSN 1890-9639 / ISBN 978-8291398-12-9)
http://www.journals.uio.no/osla
abstract
This paper presents a multi-view self-training algorithm that identifies indicators
of sentiment by: 1. extracting causal relations, 2. classifying the causal
relations into a sentiment category, 3. grouping common causes and 4. assigning
sentiment categories to common causes to create a sentiment distribution for
each common cause. A global manual evaluation of the strategy found that it had
a precision of 70.00%.
[1] introduction
Sentiment analysis has become an increasingly popular area of research. It
typically relies upon the detection of words that have a sentiment
orientation. Sentiment analysis is used in time-dependent tasks such as reputation
management and stock trading. Reputation management identifies positive
or negative statements in documents published on the Internet to gauge the value of a brand.
Sentiment analysis in stock trading identifies positive, negative or neutral statements
in news or blog posts to identify buy or sell signals for specific stocks or
financial indexes. These tasks are time dependent because they rely upon sentiment
to make inferences about future events, for example profit warnings or
sales figures. Once the event has happened, information related to the event is
worthless. In time-dependent sentiment analysis, the further ahead in time sentiment
about a future event can be identified, the more valuable the information.
This paper presents an algorithm for identifying indicators of sentiment. Indicators
of sentiment for the purposes of this paper are noun phrases that indicate
the existence of sentiment at some time in the future.
The algorithm relies upon the detection of causal relations and the sentiment
classification of the effect part of the causal relation. The algorithm groups together common causes and the associated sentiment classifications. The sentiment classifications are aggregated into a probability distribution. This sentiment probability distribution is an indicator of future sentiment implied by a
mention of a cause in a text.
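The aggregation step can be illustrated with a short sketch. It assumes that causal relations have already been extracted and that each effect has already been assigned a sentiment label; grouping causes by exact string match is a simplification introduced here, and the function name and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def sentiment_distributions(relations):
    """Map each cause to a probability distribution over the sentiment labels of its effects.

    `relations` is an iterable of (cause, sentiment_label) pairs.
    """
    counts = defaultdict(Counter)
    for cause, label in relations:
        counts[cause][label] += 1
    return {
        cause: {label: n / sum(labels.values()) for label, n in labels.items()}
        for cause, labels in counts.items()
    }


# Hypothetical (cause, sentiment of effect) pairs.
relations = [
    ("geadas", "negative"),
    ("geadas", "negative"),
    ("geadas", "neutral"),
    ("mercado internacional", "positive"),
]

distributions = sentiment_distributions(relations)
print(distributions["geadas"])
```

Under these assumptions, a later mention of a cause in a text can be read against its distribution as an indication of the sentiment likely to follow.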
The related work will discuss the following: causation in text, causal relation extraction, sentiment classification and prediction of future texts from information
in past documents.
[2.1]
Causation in text
Causal relations in text can be seen as a relation that exists between two events if
one event is the cause of the other (Altenberg 1984). Altenberg (1984) stated that
three conditions must hold before a causative relation can exist in written or spoken
language. The three conditions are: 1. encapsulate the two members of the
relationship, 2. express the type of relationship between the relation's members
and 3. identify the members in a coherent sequence. An alternate definition of
causative relation was provided by Baron (1974), who stated: Causation is a relationship
between two states of affairs, X at time T1 and X' at time T2, and a
cause Z that provides the necessary conditions for causing the change from X to
X'. Baron (1974) provided four areas that should be considered when analyzing
causative grammar: 1. what is represented by the causative relation, 2. what
mechanisms the language has to represent causation, 3. at what level in the
grammar causation is represented and 4. what syntactic/semantic parameters
define the relationship between elements in causative constructions (Baron
1974). Baron (1974) further states that causation can be seen as a relation between
entire propositions and/or sentences.
Two types of causation in text can be considered: explicit and implicit. Explicit
causation is when the causative link is explicitly stated, for example in the
generalization for causative verbs, NP V NP [1], that was provided by Levin (1993).
An example of explicit causation that fits the NP V NP pattern is "Smoking
causes cancer". Implicit causation is when the causal link is implied, for example,
"The sun was bright and I was sweating". The implied cause of the action of
sweating is the warmth of the sun.
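The difference between the two types can be shown with a deliberately naive pattern matcher for the explicit NP V NP case. This is a sketch only: the verb list is a hypothetical stand-in for a real list of causative verbs, and a regular expression over raw text is far cruder than matching over parsed noun phrases.

```python
import re

# Hypothetical, very small list of causative verb forms.
CAUSAL_VERBS = ("causes", "caused", "cause", "provokes", "provoked", "triggers", "triggered")

PATTERN = re.compile(
    r"^(?P<cause>.+?)\s+(?P<verb>" + "|".join(CAUSAL_VERBS) + r")\s+(?P<effect>.+?)\.?$",
    re.IGNORECASE,
)


def extract_explicit(sentence):
    """Return (cause, verb, effect) if the sentence fits the explicit pattern, else None."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return match.group("cause"), match.group("verb"), match.group("effect")


print(extract_explicit("Smoking causes cancer."))
print(extract_explicit("The sun was bright and I was sweating."))
```

The second sentence yields no match because it contains no causative verb: implicit causation has to be inferred from world knowledge rather than read off a surface pattern.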
[2.2]
Causal relation extraction methods can be grouped into two general types: manual and
automatic. Manual methods rely upon manually identified characteristics of language,
typically patterns, to detect a causative relation. The automatic approaches
tend to be supervised machine learning strategies. Supervised learning strategies
are methods where labelled data is used to induce a classification model that is
used to identify causal relations in unlabelled text.
[1]
Manual Approaches
A simple approach for manual strategies is to use hand-crafted patterns. These
patterns are typically created by human experts and can be domain specific, so that
they cannot be generalized to other domains. In addition, rule construction
can be time consuming. A number of approaches relied
upon domain knowledge and hand-crafted rules. One of the earliest examples
found in the literature was by Kaplan (1991). His system had a pipeline with
several stages: 1. hand-coded propositional representational parser,
2. semantic analysis component, 3. causal analysis and 4. knowledge base acquisition.
Each stage is dependent upon the previous stage. The causal analysis
component creates a causal chain of events based upon the output of the semantic
analysis component (SAC). The output of the SAC is a series of concept frames
that are represented as a structured inheritance network. The root node of the network
is known as thing, and the sub-nodes can be members of one of the following
classes: objects, actions, or relationships. The causal chain is constructed
by using an event seed pair, for example, air rising and air cooling. The effect
part of the pair is used as a part of the next causal pair. This process continues
until no more causal pairs can be made. The detection of causal pairs is achieved
with propositional clues. Joskowicz et al. (1989) identified causal links between
messages generated by equipment installed in navy ships. This approach also relied
upon a manual and domain-specific approach.
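The chain-building idea can be sketched as follows. This is not Kaplan's system: it assumes that causal pairs are already available as (cause, effect) tuples, that the effect of one pair becomes the cause of the next, and that each cause has at most one effect. Apart from the seed pair, the events are hypothetical.

```python
def build_chain(seed, pairs):
    """Follow cause -> effect links from a seed pair until no further pair can be attached."""
    effects_of = dict(pairs)  # assumption: at most one effect per cause
    chain = [seed[0], seed[1]]
    seen = set(chain)
    # Stop when the last event has no known effect or the chain would loop.
    while chain[-1] in effects_of and effects_of[chain[-1]] not in seen:
        chain.append(effects_of[chain[-1]])
        seen.add(chain[-1])
    return chain


pairs = [
    ("air rising", "air cooling"),          # seed pair from the text
    ("air cooling", "condensation"),        # hypothetical
    ("condensation", "cloud formation"),    # hypothetical
]
print(build_chain(("air rising", "air cooling"), pairs))
```

The loop guard matters in practice, since hand-built pair lists can easily contain cycles.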
Machine Learning
A popular supervised approach to extract causative relations is to use a sequence
classification strategy. There are a number of machine learning methods that can
be used in sequence classification strategies, for example Hidden Markov Models
(HMM) and Maximum Entropy Markov Models (MEMM). The research literature
indicates that one of the most common methods for causal relation extraction
is Conditional Random Fields (CRF). Mehrabi et al. (2013) used CRFs in a supervised
strategy to extract causative relations from texts about the Geriatric Care
domain. The authors used the following features: tokens, token categories, prefixes
and suffixes, and Part Of Speech (POS) tags. The CRF had three possible labels:
cause, effect and out.
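A sequence classifier of this kind is trained on one feature dictionary per token. The sketch below shows what such features might look like; it is not the feature extractor of Mehrabi et al. (2013). The definition of "token category", the affix length of three characters and the POS tags in the example are assumptions made for illustration.

```python
def token_features(tokens, pos_tags, i):
    """Feature dictionary for token i: token, coarse category, prefix, suffix and POS tag."""
    tok = tokens[i]
    if tok.isdigit():
        category = "number"
    elif tok[:1].isupper():
        category = "capitalized"
    elif tok.isalpha():
        category = "word"
    else:
        category = "other"
    return {
        "token": tok.lower(),
        "category": category,
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        "pos": pos_tags[i],
    }


tokens = ["Smoking", "causes", "cancer"]
pos_tags = ["NN", "VBZ", "NN"]  # hypothetical tags
features = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]
print(features[0])
```

A list of such dictionaries per sentence, paired with one label per token from the set cause, effect and out, is the usual input shape for CRF toolkits.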
Riaz & Girju (2014) used verbs and nouns as features for a classifier [2]. The features
were grouped as: lexical, semantic and structural. Lexical features were described
as verb, lemma of verb, noun phrase, lemma of all words of noun phrase,
head noun of noun phrase, lemmas of all words between verb and head noun of
noun phrase. The semantic features used were the nine noun hierarchies of
WordNet. The structural features were the subject and object of a verb.
[2]
A common alternative strategy is to propagate labels from labelled to unlabelled instances in a transductive strategy (Rossi et al. 2014).
narios from a causal event. Kunneman & Van den Bosch (2012) used Tweets about
Dutch football to predict future transfers of players.
[3] corpus
The corpus that we used for the experiments consisted of news stories about agriculture
in Brazil. These stories were gathered from various sources on the Internet
from 1995 until 2014. The data was not contiguous, and consequently there were
temporal gaps in the data. The stories were split into sentences and POS tagged
with the tagger of De Alencar (2010). The corpus contained 295,307 sentences.
[3.1]
Labelled data was required for the causal relation extraction and the sentiment
classification tasks. A random set of 394 sentences was selected from the corpus.
The data was categorized by a single annotator into two categories: causative and
non-causative. The non-causative category had 84 sentences and the causative
category had 310 sentences. Each word of the sentences in the causative category
was labelled with one of the following categories: cause, effect, causative link or
non-causative. The density of causative relations was high when compared to other
causative relation annotation exercises we have undertaken (Drury et al. 2014a).
This may be due to the type of text annotated, or the selection of sentences may
have been atypical.
The labelled causative data was sub-divided into three categories (neutral,
negative or positive) for the sentiment classification evaluation. The negative
category had 228 sentences, the neutral 37 and the positive 45 sentences. The
negative category was the majority class. This was unsurprising as most of the
agricultural news stories were negative. Examples of the labelled data can be
found in Table 1. The training data is available from http://goo.gl/IYP1t1. [4]
Category   Sentence
Negative   Recentemente, foram as geadas que afetaram os canaviais.
Negative   Fmc lança portal de informações sobre nematóides, praga que ameaça a cana de açúcar
Positive   o mercado internacional provocaram uma ligeira alta em o pregão de ontem
table 1: Example of causative labelled data.
[4] The annotation schema for the data is: NC = non-causative, CN = Cause Noun, EN = Effect Noun
and CV = Causal Verb.
The algorithm was designed to: 1. extract causal relations from text, 2. label
cause, effect and causal link of the relation and 3. classify the causal relation into
negative, neutral or positive categories.
A list of causative verbs generated by a previous version of this algorithm is freely available from the
resources described by Drury et al. (2014b).
was the majority class and simply guessing this class for all words would have
produced an accuracy of approximately 90.00% without correctly identifying any
causal relations. The accuracy figure was therefore based on the number of 1. effect
words, 2. causative links and 3. cause words classified correctly, penalized by the number
of non-causative words incorrectly classified as causative elements. The equation for
the hold-out function is $\frac{C_{cr}}{T_{cr} + E_{nc}}$, where $C_{cr}$ is the number of correct causal relation elements classified (cause, effect, causal link), $T_{cr}$ is the total number of
causal relation elements and $E_{nc}$ is the number of erroneous classifications of
non-causal words as a causal relation element.
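The measure above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `causal_accuracy` and the choice of parallel label lists are assumptions, while the labels NC, CN, CV and EN follow the annotation schema in note [4].

```python
def causal_accuracy(gold, predicted, non_causal="NC"):
    """Compute C_cr / (T_cr + E_nc) over two parallel lists of word labels.

    Hypothetical helper: the name and input format are assumptions.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # C_cr: causal relation elements (cause, effect, link) labelled correctly
    c_cr = sum(1 for g, p in zip(gold, predicted) if g != non_causal and g == p)
    # T_cr: all causal relation elements in the gold annotation
    t_cr = sum(1 for g in gold if g != non_causal)
    # E_nc: non-causal words wrongly labelled as a causal relation element
    e_nc = sum(1 for g, p in zip(gold, predicted) if g == non_causal and p != non_causal)
    denominator = t_cr + e_nc
    return c_cr / denominator if denominator else 0.0


gold = ["NC", "CN", "CV", "EN", "NC", "NC", "NC", "NC", "NC", "NC"]
all_non_causal = ["NC"] * len(gold)
# Plain per-word accuracy would reward this guess (7 of 10 words are NC),
# but the measure above gives it no credit.
print(causal_accuracy(gold, all_non_causal))
```

Because correct non-causative classifications never enter the numerator, a classifier that only guesses the majority class scores zero, which is the behaviour the text motivates.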
The solutions were ranked by accuracy and the bottom 50% of the solutions
were removed. The breeding strategy selected one surviving solution and randomly
chose another surviving solution to breed with it. The order of the features of
the breeding solutions was randomized, and 50% of each solution was selected for
the new solution. Duplicate features were removed. The mutation rate was 0.1,
meaning that 25 of the new solutions were mutated. The mutation strategy took
one feature of the solution and either changed its value or swapped it for a new
feature. The GA was limited to 35 generations because it was a time-intensive process. The results are displayed in Figure 1.
The diagram shows a steady increase in accuracy over the generations, with a number
of plateaus. We hypothesize that the plateaus were caused by a delay in the best
solutions influencing the population. The results represent a 14.28% relative increase over the best solution of the first generation. The results
were unimpressive because 1. we excluded correct non-causative classifications
from the fitness measure and 2. the limited amount of labelled data produced
weak models.
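The selection, breeding and mutation steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: a solution is assumed to be a dictionary from feature name to value, `feature_space` (a hypothetical name) lists the allowed values of every feature, the population is assumed to hold at least two solutions, and mutation is applied with probability 0.1 per child rather than to a fixed number of children.

```python
import random


def breed(parent_a, parent_b, rng):
    """Shuffle each parent's features, keep half of each, drop duplicates."""
    child = {}
    for parent in (parent_a, parent_b):
        names = list(parent)
        rng.shuffle(names)                       # randomize feature order
        for name in names[: max(1, len(names) // 2)]:
            child.setdefault(name, parent[name])  # a duplicate feature is kept once
    return child


def mutate(solution, feature_space, rng):
    """Either change the value of one feature or swap it for an unused feature."""
    mutated = dict(solution)
    name = rng.choice(sorted(mutated))
    unused = sorted(set(feature_space) - set(mutated))
    other_values = [v for v in feature_space[name] if v != mutated[name]]
    if unused and (not other_values or rng.random() < 0.5):
        del mutated[name]
        new_name = rng.choice(unused)
        mutated[new_name] = rng.choice(feature_space[new_name])
    elif other_values:
        mutated[name] = rng.choice(other_values)
    return mutated


def next_generation(population, fitness, feature_space, rng, mutation_rate=0.1):
    """Drop the bottom half, then breed each survivor with a random other survivor."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: max(2, len(ranked) // 2)]
    children = []
    for parent in survivors:
        mate = rng.choice([s for s in survivors if s is not parent])
        child = breed(parent, mate, rng)
        if rng.random() < mutation_rate:
            child = mutate(child, feature_space, rng)
        children.append(child)
    return survivors + children
```

In the setting of the paper, `fitness` would train a CRF with the features of a solution and return its hold-out score; any function from a solution to a number works for experimenting with the sketch.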
Figure 2: Examples of Word Dependencies in a Causal Relation for the Cause Candidate 'fumo'.
The categories of features selected by the GA strategy were: words ahead
(number of words ahead: 16, 4, 8), words behind (number of words behind: 1), and the word
features number, punctuation, start of sentence, sentiment value, stopword and
current word. An example of the features is provided in Figure 2, where the
word features are demonstrated for the cause candidate 'fumo'. The look-behind word is 'O' and the look-ahead words are 'do', 'momento' and 'Nervoso'. Each
of these words had a number of word-specific features. For example, the cause
candidate 'fumo' would have the following word features: IsStartOfSentence:
false, IsPunctuation: false, HasSentimentValue: false, IsStopword: false and CurrentWord: fumo. The word features of each look-ahead and look-behind word would also
be included in the features for the cause candidate 'fumo'.
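A rough sketch of how such a feature set could be assembled for one candidate word is given below. It is an assumption-laden illustration: the function names, the `IsNumber` key, the offset prefix on each feature name and the token sequence in the usage example are hypothetical, and only the remaining feature names come from the text.

```python
import string


def word_features(tokens, i, stopwords, sentiment_words):
    """Word-specific features for the token at position i."""
    word = tokens[i]
    return {
        "IsStartOfSentence": i == 0,
        "IsPunctuation": all(ch in string.punctuation for ch in word),
        "IsNumber": word.isdigit(),  # assumed name for the 'number' feature
        "HasSentimentValue": word.lower() in sentiment_words,
        "IsStopword": word.lower() in stopwords,
        "CurrentWord": word,
    }


def candidate_features(tokens, i, ahead=3, behind=1,
                       stopwords=frozenset(), sentiment_words=frozenset()):
    """Features of the candidate plus those of its look-behind and look-ahead words."""
    features = {}
    for offset in range(-behind, ahead + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            for name, value in word_features(tokens, j, stopwords, sentiment_words).items():
                # Prefix each feature with its offset, e.g. "-1:CurrentWord"
                features[f"{offset:+d}:{name}"] = value
    return features


# Hypothetical token sequence built from the words mentioned in the text.
tokens = ["O", "fumo", "do", "momento", "Nervoso"]
features = candidate_features(tokens, 1, stopwords={"o", "do"})
print(features["+0:CurrentWord"], features["-1:CurrentWord"], features["+3:CurrentWord"])
```

Prefixing each feature with its offset keeps the features of the neighbouring words distinct from those of the candidate itself, which is one simple way to pass a context window to a CRF.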
In addition to using feature selection to improve the performance of the CRF,
we evaluated the effectiveness of meta-learning. The meta-learning technique we
evaluated was stacking (Klugl et al. 2012), because the research literature suggests
that stacked CRFs outperform a single CRF. The stacking strategy we attempted
was to provide a separate random part of the training data to each individual CRF.
The CRFs then vote on each classification, with the majority vote being accepted
as the classification of the stacked CRF.
We performed a basic evaluation of stacks of 3 and 5 CRFs against a baseline of
1 CRF. The evaluation was a hold-out evaluation using the manually labelled data
described in Section [3.1]. The hold-out evaluation used an 80:20 split repeated 10 times: the
data was randomly separated into two partitions, 80% for training and 20% for
evaluation, the process was repeated 10 times, and an average accuracy was calculated. We found that a stack of 3 CRFs gained the highest accuracy on
the hold-out evaluation. We describe a more in-depth evaluation
later in the paper.
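The voting scheme and the repeated 80:20 hold-out can be sketched as below. This is a simplified illustration, not the evaluated system: `train` stands for any function that returns a per-item classifier (a CRF in the paper), the data is assumed to be a list of (item, label) pairs, all function names are hypothetical, and plain accuracy is used instead of the paper's causal-element measure.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Combine label sequences (one per model) by per-position majority vote.

    Ties go to the label seen first, i.e. the earliest model's prediction.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


def random_parts(data, n_parts, rng):
    """Split the data into n_parts disjoint random parts."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[k::n_parts] for k in range(n_parts)]


def holdout_split(data, rng, train_fraction=0.8):
    """Randomly separate the data into training and evaluation partitions."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def repeated_holdout(data, train, n_models=3, repeats=10, seed=0):
    """Average accuracy of a voting ensemble over repeated 80:20 hold-outs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train_set, test_set = holdout_split(data, rng)
        # Each model sees its own random part of the training data.
        models = [train(part) for part in random_parts(train_set, n_models, rng)]
        gold = [label for _, label in test_set]
        votes = majority_vote([[model(item) for item, _ in test_set] for model in models])
        scores.append(sum(g == v for g, v in zip(gold, votes)) / len(gold))
    return sum(scores) / len(scores)
```

Setting `n_models=1` gives the single-model baseline, so the same function can compare 1, 3 and 5 models on identical seeds.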
[4.2] Self-training
The labelled data described in Section [3.1] was limited, and consequently any
model produced from this data would likely be weak and produce errors. This
Name of Strategy                                         Accuracy (Classification)   Accuracy (Annotation)
Relative Link Classifier + Rule Labeller + Stacked CRF   0.81 ± 0.09                 0.67 ± 0.09
Relative Link Classifier + Rule Labeller                 0.61 ± 0.09                 0.64 ± 0.09
Relative Link Classifier + Rule Labeller + Single CRF    0.76 ± 0.09                 0.72 ± 0.09
Single CRF                                               0.13 ± 0.09                 0.00 ± 0.00
Negative
prejuízos, baixa, danos, perdas
geadas, quebra, diminuição, falta
http://code.google.com/p/rdflib/.
were tested on the same splits. The evaluation measure was accuracy. The results
are displayed in Table 4. The results clearly show that the guided self-training
strategy produced the superior results.
Accuracy
0.73 ± 0.04
0.84 ± 0.06
[5] Sentiment prediction
The last step in the strategy is to assign a sentiment probability to a cause. This is
achieved by grouping common causes and aggregating their sentiment categories
to produce a sentiment distribution for a specific cause. This grouping process
is illustrated in the following example. We have three causative sentences and
their sentiment categories: 1. 'chuva causa cheias no Porto', neutral, 2. 'chuva
causa danos em Minas Gerais', negative and 3. 'Chuva causa inundações e destrói
casa em Itapetininga', negative. The common cause is 'chuva', and its sentiment
distribution would be P = {Neu = 0.33, Neg = 0.66, Pos = 0.0}.
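The grouping and aggregation step can be sketched as follows. The function name, the category codes and the lower-casing used to group 'Chuva' with 'chuva' are assumptions made for illustration; the three relations are the ones from the example above.

```python
from collections import Counter, defaultdict

CATEGORIES = ("neg", "pos", "neu")


def sentiment_distributions(relations):
    """Group (cause, sentiment category) pairs by cause and normalize the counts."""
    counts = defaultdict(Counter)
    for cause, category in relations:
        # Assumed normalization: case-insensitive grouping of causes
        counts[cause.strip().lower()][category] += 1
    return {
        cause: {c: tally[c] / sum(tally.values()) for c in CATEGORIES}
        for cause, tally in counts.items()
    }


relations = [
    ("chuva", "neu"),  # chuva causa cheias no Porto
    ("chuva", "neg"),  # chuva causa danos em Minas Gerais
    ("Chuva", "neg"),  # Chuva causa inundações e destrói casa em Itapetininga
]
print(sentiment_distributions(relations)["chuva"])
```

For 'chuva' this yields one third neutral, two thirds negative and no positive mass, matching the distribution in the text up to rounding.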
[5.1] Experiments
The experiments for sentiment prediction manually evaluated the sentiment classifications for specific common causes. We ran the aforementioned causal relation extractor and sentiment classifier, grouped the relations
by cause and calculated their sentiment distributions. There were 4988
common causes. The most frequent cause events and their sentiment
distributions are displayed in Table 5.
No. Causal Rel.   Cause Event   Sent Dist.
116               seca          neg 0.66  pos 0.05  neu 0.28
95                estiagem      neg 0.58  pos 0.13  neu 0.29
76                chuvas        neg 0.41  pos 0.04  neu 0.55
73                cana açúcar   neg 0.16  pos 0.1   neu 0.74
70                chuva         neg 0.36  pos 0.01  neu 0.63
59                clima         neg 0.56  pos 0.12  neu 0.32
41                governo       neg 0.07  pos 0.17  neu 0.76
38                brasil        neg 0.13  pos 0.18  neu 0.68
35                crise         neg 0.63  pos 0.06  neu 0.31
30                cana          neg 0.13  pos 0.27  neu 0.6
[5.2] Evaluation
We performed a manual evaluation in which we randomly selected 10 cause event
groups and evaluated the causal relations that constitute each sentiment distribution. The evaluation tested whether 1. the relation was a
causal relation and 2. the sentiment category was correct.
The causal events chosen were: expansão, pessoas, petrobras, baixas temperaturas geadas, praga, homem, canais, conab, praticidade and aquecimento global.
The results are shown in Table 6.
The accuracy over the whole sample was 0.91 for causative relation detection
and 0.77 for sentiment classification. We can therefore calculate the overall
accuracy for extracting and classifying causal sentimental relations as 0.70 (0.91 × 0.77 ≈ 0.70).
The causal relation extraction strategy performed poorly when the common
cause event was Conab [7]. This was a special case because Conab is an organization that
1. made predictions about future events or 2. showed possible effects of a
cause. These statements had causal characteristics, but were not causal relations,
for example, 'Estudo da Conab mostra impacto do clima nas lavouras'.
The errors made by the sentiment classifier were confusions between 1. the negative
and neutral categories and 2. the positive and neutral categories. This type of error
is less serious than classifying a negative relation as positive or vice versa because
any inference based on this sentiment mistake will be ignored.
[6] Conclusion and future work
This work introduces a new type of sentiment analysis where we predict a sentiment distribution from a cause event. The initial results are encouraging as they
seem to make intuitive sense. For example, 'seca' [8] will be mainly negative for
agriculture because of lower future crop yields; however, it seems reasonable that
there may be some positive future news (for farmers) in the form of crop price
rises due to lower supply and constant demand, although this news could be seen
as negative for consumers.
[7] http://www.conab.gov.br.
The future work is to evaluate the predictive ability of sentiment distributions
of causes. This work is centred around agriculture, and causes such as falta de
chuva or seca are likely to have similar effects on crops in the future as they
have had in the past. It is reasonable to assume at least in this domain that we
can estimate the sentiment distribution of future news stories. This may allow the
improvement of time dependent sentiment tasks such as reputation management
and stock trading.
Acknowledgements
This work was supported by FAPESP grant number 11/20451-1.
References
Altenberg, Bengt. 1984. Causal linking in spoken and written English. Studia Linguistica 38(1). 20-69.
Ando, Rie Kubota & Tong Zhang. 2007. Two-view feature generation model for
semi-supervised learning. In Proceedings of the 24th international conference on
machine learning, 25-32. ACM.
Baron, Naomi S. 1974. The structure of English causatives. Lingua 33(4). 299-342.
De Alencar, Leonel Figueiredo. 2010. Uma ferramenta para anotação automática
de corpora usando o NLTK. In The 9th Brazilian corpus linguistics meeting, s/pp.
Drury, Brett, Paula C. F. Cardoso, Jorge Carlos Valverde-Rebaza, Alan Valejo, Fabio
Pereira & Alneu de Andrade Lopes. 2014a. An open source tool for crowdsourcing the manual annotation of texts. In Computational processing of the Portuguese language - 11th international conference, PROPOR, 268-273.
Drury, Brett, Paula C. F. Cardoso, Janie M. Thomas & Alneu de Andrade Lopes.
2014b. Lexical resources for the identification of causative relations in Portuguese texts. In Proceedings of workshop on tools and resources for automatically
processing Portuguese and Spanish, s/pp.
Drury, Brett & Alneu Lopes. 2014. A comparison of the effect of feature selection
and balancing strategies upon the sentiment classification of Portuguese news
stories. In Proceedings of ENIAC, s/pp.
[8] Table 5.
Contacts
Brett Drury
Universidade de São Paulo
Brett.Drury@gmail.com
Alneu de Andrade Lopes
Universidade de São Paulo
alneu@icmc.usp.br
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos
que se Cruzam, Oslo Studies in Language 7(1), 2015. 397-424. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
As wordnets do português
HUGO GONÇALO OLIVEIRA, VALERIA DE PAIVA,
CLÁUDIA FREITAS, ALEXANDRE RADEMAKER,
LIVY REAL AND ALBERTO SIMÕES
Abstract
Not many years ago it was usual to comment on the lack of an open lexical-semantic knowledge base for Portuguese along the lines of Princeton WordNet.
Today, the landscape has changed significantly, and researchers who need access to this specific kind of resource have not one,
but several alternatives to choose from. The present article describes the
wordnet-like resources currently available for Portuguese. It provides some
context on their origin, creation approach, size and licence for use.
Besides being an obvious starting point for those looking for a computational resource with information on the meaning of Portuguese words,
this article compares the resources available and lists some
plans for future work, sketching ideas for potential collaboration between
the projects described.
[1] Introduction
Lexical knowledge bases are organized repositories of lexical items. Among other information, these resources typically include
the possible senses of words, relations between senses, definitions and sentences that
exemplify their use. The wordnet model, created for WN.Pr with
English as the target language, is probably the most popular model for representing this kind of resource. Its flexibility led not only to growing acceptance
by the NLP community, but also to its adaptation to other languages,
making it almost a standard.
See http://globalwordnet.org/wordnets-in-the-world/
Noun
bird (warm-blooded egg-laying vertebrates characterized by feathers
and forelimbs modified as wings)
[direct hyponym]
dickeybird, dickey-bird, dickybird, dicky-bird (small bird; adults
talking to children sometimes use these words to refer to small
birds)
cock (adult male bird)
hen (adult female bird)
nester (a bird that has built (or is building) a nest)
night bird (any bird associated with night: owl; nightingale;
nighthawk; etc)
parrot (usually brightly colored zygodactyl tropical birds with
short hooked beaks and the ability to mimic sounds)
bird, fowl (the flesh of a bird or fowl (wild or domestic) used as food)
dame, doll, wench, skirt, chick, bird (informal terms for a (young) woman)
boo, hoot, Bronx cheer, hiss, raspberry, razzing, razz, snort, bird (a cry
or noise made to express displeasure or contempt)
shuttlecock, bird, birdie, shuttle (badminton equipment consisting of a
ball of cork or rubber with a crown of feathers)
Verb
bird, birdwatch (watch and study birds in their natural habitat)
[2.2]
See http://compling.hss.ntu.edu.sg/omw/
See https://www.wiktionary.org/
See http://www.omegawiki.org/
See http://www.wikidata.org/
There is no doubt that, besides the flexibility of its model, the public-domain character of WN.Pr was a key factor in its acceptance. Even so, not
all the resources that follow this model chose to make their results
freely available. Among these is WordNet.PT, the first wordnet
for Portuguese, which is available only for browsing through
its web page and cannot be downloaded for local use or integration in other projects. Besides WordNet.PT, this section describes
two other projects that resulted in the creation of a wordnet for Portuguese and
that, for some reason, are not available or, at least, not available
free of charge. These are WordNet.BR, an apparently unfinished project
for which only the synsets are available, in the form of the electronic thesaurus TeP; and MWN.PT, which can be browsed both through its own web page
and through the page of the MultiWordNet project, but can only be downloaded upon
payment of an academic or commercial licence.
[3.1] WordNet.PT
WordNet.PT (Marrafa 2001, 2002) (henceforth WN.PT) was presumably the first wordnet for Portuguese. Under development since 1998, it is a project coordinated by
Palmira Marrafa at the Centro de Linguística da Universidade de Lisboa, more precisely at the CLG, Grupo de Computação do Conhecimento Léxico-Gramatical,
in collaboration with the Instituto Camões.
It is built essentially by hand and follows the EuroWordNet model
(Vossen 1997); that is, WN.PT is created from scratch for Portuguese. Its
most recent version, WN.PT 1.6, dates from 2006 and covers several semantic relations, namely: general/specific (including hypernymy), whole/part, equivalence, opposition, categorization, as well as relations between the participants in an
event (including instrumento-para or lugar-para) and relations that define the structure of
an event (including estar-envolvido-em or lugar-para). The same version covers the
following semantic domains: artistic and professional activities, food, geographical and political regions, institutions, instruments, means of transport, communication
routes, works of art, health and medical acts, living beings and clothing.
More recently, this resource was expanded into WordNet.PT Global,
Rede Léxico-Conceptual das variedades do Português (Marrafa et al. 2011), which aims to include variants from other countries where Portuguese is an official language. According
to the information on its web page [6], WN.PT Global contains a network of 10 thousand
concepts, including nouns, verbs and adjectives, their lexicalizations in the different variants of Portuguese and their glosses. The concepts are integrated in
a network with more than 40 thousand relation instances. In 2014, a
[6] See http://cvc.instituto-camoes.pt/traduzir/wordnet.html
[3.2] WordNet.Br
WordNet.BR (Dias-da-Silva et al. 2002; Dias-da-Silva 2006) (henceforth WN.BR)
was developed under the coordination of Bento Dias da Silva, at the Faculdade de Ciências e Letras of the Universidade Estadual Paulista, with the aim of creating a wordnet
for the Brazilian variant of Portuguese. In a first development phase
(Dias-da-Silva et al. 2002), a team of three linguists analysed five dictionaries
of Brazilian Portuguese and two corpora in order to obtain information on synonymy and antonymy. This phase resulted in the manual creation of synsets and of
antonymy relations between them, as well as the writing of some glosses and the selection of example
sentences.
In a second phase, the WN.BR synsets were manually aligned with
WN.Pr (Dias-da-Silva 2006), in a process similar to the one followed in the
EuroWordNet project, which relied on bilingual dictionaries. After the alignment with
WN.Pr, the semantic relations holding between synsets with equivalents in
Portuguese and English were inherited.
Based on the process reported, the complete version of WN.BR is expected to cover the relations of hypernymy, part-of, cause and entailment.
However, this version is not available online, probably because the second development phase was not completed. On the other hand, the results of the first phase
can be browsed and downloaded under the name
TeP (Maziero et al. 2008), Thesaurus Eletrônico do Português. TeP is maintained by the
Núcleo Interinstitucional de Linguística Computacional (NILC) of the Universidade
de São Paulo, in São Carlos, Brazil. It includes more than 44 thousand lexical items, organized in 19,888 synsets, which are in turn connected by 4,276
antonymy relations.
[3.3] MultiWordNet.PT
MultiWordNet.PT, usually referred to as MWN.PT [7], is the Portuguese part
of the MultiWordNet project (Pianta et al. 2002). It was developed by the NLX - Natural Language and Speech Group, at the University of Lisbon, and can be purchased
through the catalogue of the European Language Resources Association [8].
According to its documentation [9], MWN.PT includes 17.2 thousand manually validated synsets, corresponding to approximately 21 thousand senses and 16
thousand lemmas, which cover both the European and the Brazilian variant
[7] See http://mwnpt.di.fc.ul.pt/
[8] See http://catalog.elra.info/
[9] See http://mwnpt.di.fc.ul.pt/features.html
Creating a wordnet by hand is a complex and time-consuming task.
Thus, during the 2000s, researchers in Portuguese NLP who needed a wordnet and had no access to WordNet.PT had to find
free alternatives, which, in most cases, were also simpler.
In this context, besides TeP (Maziero et al. 2008), already mentioned in section [3.2], the following stand out:
- OpenThesaurus.PT [12], the Portuguese version of the homonymous project, OpenThesaurus (Naber 2004), typically used to suggest
synonyms in word processors;
- PAPEL (Gonçalo Oliveira et al. 2008), a network automatically extracted
from a dictionary of Portuguese, which connects words related by a wide range of relations. More recently, PAPEL was
expanded into CARTÃO (Gonçalo Oliveira et al. 2011), by exploiting more dictionaries;
- Some of the resources developed within Port4Nooj (Barreiro 2010),
built in the NooJ linguistic development environment (Silberztein 2005) and initially extracted from the OpenLogos machine translation system (Barreiro et al. 2014). These resources include, for example, a set of definitions and semantic relations between words;
- Dicionário Aberto (Simões et al. 2012), in which relations between words are made available together with the dictionary.
A more detailed description of these resources, some of which are compared in Santos et al. (2010), is however beyond the scope of this article.
[10]
[11]
[5] Free wordnets for Portuguese
[5.1] Onto.PT
Onto.PT (first presented in (Gonçalo Oliveira & Gomes 2010), summarized in (Gonçalo Oliveira & Gomes 2014a), and described in detail in
(Gonçalo Oliveira 2013)) is a wordnet developed during the PhD
of Hugo Gonçalo Oliveira, supervised by Paulo Gomes, at the Centro de Informática e Sistemas da Universidade de Coimbra. The project started in late
2008, at a time when there was no free wordnet for Portuguese, nor
the human resources to create a new wordnet for this language. The goal was
always to create a wordnet fully automatically, taking advantage of
[12]
[13]
See https://pt.wiktionary.org/
Extraction
  gado, s.m.: conjunto de animais criados para diversos fins; rebanho
  triplo_1 = rebanho SINONIMO_DE gado
  triplo_2 = animal MEMBRO_DE gado
Clustering
  synset_1 = {manada, rebanho, mancheia, boiada}
  synset_1 + triplo_1 = {manada, rebanho, mancheia, boiada, gado}
Mapping
  synset_2 = {bicho, animal, alimal, béstia, minante}
  triplo_2 + synset_1 = synset_2 MEMBRO_DE synset_1
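The three steps in the example above (extraction, clustering, mapping) can be mimicked with a toy sketch. This is only an illustration under strong assumptions: the function names are hypothetical, synonymy triples are merged by simple set union, and word-level triples are attached to the first synset containing each argument, whereas the real construction process involves more elaborate clustering and disambiguation.

```python
def cluster(synsets, synonym_pairs):
    """Add synonymy pairs to existing synsets, merging synsets that share a word."""
    clusters = [set(s) for s in synsets]
    for a, b in synonym_pairs:
        merged = {a, b}
        remaining = []
        for c in clusters:
            if a in c or b in c:
                merged |= c        # the pair attaches to (and merges) these synsets
            else:
                remaining.append(c)
        clusters = remaining + [merged]
    return [frozenset(c) for c in clusters]


def map_triples(clusters, triples):
    """Lift word-level triples to synset-level relation instances."""
    mapped = []
    for arg1, relation, arg2 in triples:
        s1 = next((c for c in clusters if arg1 in c), None)
        s2 = next((c for c in clusters if arg2 in c), None)
        if s1 is not None and s2 is not None:
            mapped.append((s1, relation, s2))
    return mapped


synsets = [{"manada", "rebanho", "mancheia", "boiada"}, {"bicho", "animal", "alimal"}]
clusters = cluster(synsets, [("rebanho", "gado")])             # triplo_1
relations = map_triples(clusters, [("animal", "MEMBRO_DE", "gado")])  # triplo_2
print(relations)
```

With the data of the example, 'gado' joins the synset of 'rebanho', and the membership triple becomes a relation between the synset of 'animal' and the enlarged synset.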
On the one hand, the ECO approach makes it possible to obtain a large wordnet with little effort: version 0.6 includes about 169 thousand unique lexical items,
organized in about 117 thousand synsets, which are in turn connected by
about 174 thousand relation instances. On the other hand, there are consequences for the
quality of the contents. For example, for version 0.35 of the resource, it was estimated that
about 74% of the synsets were correct, there was no agreement between evaluators for 18%, and the remainder had at least one word that should not
belong to them (the evaluation is described in detail in (Gonçalo Oliveira 2013)). The
quality of the relations also varies sharply depending on their type. Considering that relations between incorrect synsets are also incorrect, about 65% of the
hypernymy relations were correct, a figure that rose to between 78% and
82% in a set that included the remaining relation types. Even so, among other tasks, Onto.PT has already been used for synonym expansion in information
retrieval (Rodrigues et al. 2012) and for creating lists of causal verbs (Drury
et al. 2014).
Because of its construction approach, Onto.PT is not a static resource
and may, from version to version, change significantly in the number and
size of its synsets. Thus, in its authors' view, it would not make sense to try to align it with WN.Pr. It should be added that Onto.PT is freely available [14] as an RDF/OWL model, inspired by an existing model
for representing WN.Pr (van Assem et al. 2006), but extended to
cover other relation types.
[14] See http://ontopt.dei.uc.pt/
[5.2] OpenWordNet-PT
OpenWordNet-PT (de Paiva et al. 2012; Rademaker et al. 2014), abbreviated as
OpenWN-PT, is a wordnet originally developed by Valeria de Paiva,
Alexandre Rademaker and Gerard de Melo as a syntactic projection [15] of the Universal
WordNet (UWN).
OpenWN-PT has been under development since 2010, with the main goal of
serving as a lexical resource for a system aimed at logical reasoning, whether
that system is built on description logics (currently being adapted) or on first-order logics, based on knowledge representation, for example using the SUMO ontology (Pease & Fellbaum 2010).
The construction process of OpenWN-PT, which follows from the creation process of UWN, uses machine learning to build relations between graphs
that represent information from versions of Wikipedia in multiple languages, as well as from open electronic dictionaries. Although it started as
a projection only at the level of Portuguese lemmas and their relations, OpenWN-PT has been continually improved through linguistically
motivated additions, made either manually or by exploiting large corpora, as is the
case of the lexicon of nominalizations that is part of OpenWN-PT (de Paiva et al. 2014b;
Freitas et al. 2014a). One of the characteristics of the construction of the latter resource
is the attempt to incorporate the different (good-quality) materials already produced and made available for Portuguese, regardless of variant.
OpenWN-PT combines three linguistic strategies in its lexical enrichment process: (i) translation; (ii) corpora; (iii) dictionaries. For translation,
lexicons and lists produced for other languages, such as English, French and
Spanish, are automatically translated and then revised. Incorporating corpus data contributes words or expressions in current use that
may be specific to Portuguese or that, for other reasons, may
be missing from other wordnets.
Like Onto.PT, OpenWN-PT is also available in RDF/OWL, following
and extending, when necessary, the mapping proposed by van Assem et al.
(2006). Both the OpenWN-PT data and the definitions of the RDF model (classes and properties) are freely available for download [16]. The philosophy of
OpenWN-PT is to keep a close connection with WN.Pr, while trying to remove the largest errors created by the automatic methods, using linguistic
knowledge. One consequence of this close connection with WN.Pr is the possibility of minimizing the impact of lexicographic decisions regarding
[15] By syntactic projection we mean a projection that simply uses the syntactic information of which
records correspond to Portuguese entries, without taking into account the semantic meaning of the record. Since these records are built automatically, there may be cases where the configuration was
mistaken, where the automatic unification process decided, for example, that a Catalan word was Portuguese.
[16] See https://github.com/arademaker/openWordnet-PT
See http://translate.google.com/about/intl/en_ALL/license.html
See http://logics.emap.fgv.br:10035/repositories/wn30
See http://compling.hss.ntu.edu.sg/omw/cgi-bin/wn-gridx.cgi?gridmode=grid
Universidade Federal do Espírito Santo
See https://sites.google.com/site/ufeswordnet/
Having described the various wordnets for Portuguese, this section presents a
comparison of their most recent versions, as far as possible, through a
set of tables in which these wordnets are placed side by side, followed by the same properties for WN.Pr. We stress that
this comparison is superficial and should not be seen as anything more than that. Many of the indicators are merely quantitative and do not consider the coherence
or usefulness of the contents.
[22] See http://wordnet.pt
[23] See http://mymemory.translated.net/
Table 1 presents the approach followed in creating and updating each
wordnet, and how it is made available. It is clear that the most common alternative
to the manual creation of a wordnet for Portuguese is translation: manual (MWN.PT), automatic (UfesWN.BR), through a syntactic projection (OpenWN-PT),
or through triangulation (PULO). Among these four approaches, PULO stands out for using not only WN.Pr as a pivot wordnet, but also the Spanish
and Galician wordnets included in the MCR. Unlike all the others, the structure of Onto.PT is learned fully automatically, based on the
extraction of relations from other textual resources or from other wordnets, and
on the discovery of clusters of synonyms, which give rise to the synsets.
One advantage of a fully manual approach is that it yields a resource that is virtually 100% correct. On the other hand, an
automatic approach avoids a large amount of tedious work
and makes it possible to obtain a larger resource in less time.
Regarding availability, recall that the public-domain character of WN.Pr was one of the factors that led to its success. However, not
all wordnets for Portuguese took that option, and only the four most
recent ones are completely free to use.
Wordnet     Creation: Synsets                  Creation: Relations                Update           Usage
WN.PT       manual                             manual                             manual           closed
WN.BR       manual                             transitivity                       manual?          free synsets
MWN.PT      manual translation?                transitivity                       ?                paid licence
Onto.PT     relation extraction + clustering   relation extraction + clustering   automatic        free
OpenWN-PT   UWN projection                     transitivity                       semi-automatic   free
UfesWN.BR   machine translation                transitivity                       ?                free
PULO        triangulation                      transitivity                       semi-automatic   free
WN.Pr       manual                             manual                             manual           free
Table 1: Portuguese wordnets and WN.Pr, their creation approach and availability. A ? is shown where we do not know
how the wordnet in question is updated.
Table 2 compares the size of the Portuguese wordnets in terms of the
number of lexical items covered, broken down by part of speech. Here
Onto.PT stands out by including more than three times as many lexical items as
the second-largest wordnet, OpenWN-PT. This confirms that a
fully automatic construction approach is the one most likely
to produce a large resource in a short time.
Lexical items
Wordnet     Noun      Verb     Adjective   Adverb   Total
WN.PT       9,813     633      485         0        10,931
MWN.PT      16,000    0        0           0        16,000
WN.BR       17,000    10,910   15,000      1,000    43,910
Onto.PT     97,531    32,958   34,392      3,995    168,876
OpenWN-PT   43,996    3,914    5,422       1,388    54,720
UfesWN.BR   20,646    3,769    9,066       1,498    34,979
PULO        10,260    4,032    3,166       173      17,631
WN.Pr       119,034   11,531   21,538      4,481    156,584
languages other than English), quality evaluation will always be a complex issue, since there is no gold-standard reference wordnet, and that is precisely
what one is trying to build. From this perspective, resources that make use of human
work have an advantage, even if we do not know exactly how
it can be measured.
For wordnets aligned with a wordnet for another language, the relations
between synsets can be obtained indirectly from the pivot wordnet, by
transitivity. This is the case for MWN.PT, OpenWN-PT, UfesWN.BR and
PULO. For WN.BR, the number of relations shown refers only to the relations distributed with TeP, all of which are antonymy relations. To
make the origin of these relations clearer, Table 3 also indicates whether there is some kind of alignment with another wordnet. Because of its creation approach, only Onto.PT is not aligned with WN.Pr.
As for WN.PT and WN.BR, we know that there was at least an intention to
define an alignment with WN.Pr, even though these alignments are not available: the authors of WN.PT repeatedly mention its development
within the EuroWordNet platform, and the authors of WN.BR list the alignment of their wordnet on the same platform as future work (Dias-da-Silva
2006).
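Obtaining relations by transitivity from a pivot wordnet can be sketched as below. The function name, the synset identifiers and the relation label are hypothetical and chosen only for illustration; the sketch assumes the alignment is a mapping from each target synset to one pivot synset.

```python
def inherit_relations(alignment, pivot_relations):
    """Project pivot relations onto target synsets through the alignment.

    alignment: dict from target synset id to pivot synset id.
    pivot_relations: iterable of (source, relation, destination) pivot triples.
    """
    by_pivot = {}
    for target_id, pivot_id in alignment.items():
        by_pivot.setdefault(pivot_id, []).append(target_id)
    inherited = set()
    for source, relation, destination in pivot_relations:
        # A relation is inherited only when both of its ends are aligned.
        for s in by_pivot.get(source, []):
            for d in by_pivot.get(destination, []):
                inherited.add((s, relation, d))
    return inherited


alignment = {"pt:ave": "en:bird", "pt:papagaio": "en:parrot"}
pivot_relations = [
    ("en:parrot", "hyponym_of", "en:bird"),
    ("en:hen", "hyponym_of", "en:bird"),   # 'en:hen' has no aligned target synset
]
print(inherit_relations(alignment, pivot_relations))
```

Relations whose pivot synsets lack an aligned counterpart are silently dropped, which is one reason why coverage in the target language limits what an aligned wordnet can inherit.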
An alignment of this kind can be important for obtaining additional knowledge, not only from WN.Pr but also from other wordnets aligned to it, which can be relevant in multilingual processing. Besides the
inheritance of relations, an alignment gives access to knowledge from other
extensions of WN.Pr, such as WordNet-domains (Magnini & Cavaglià 2000),
SentiWordNet (Baccianella et al. 2010) or TempoWordNet (Dias et al. 2014),
as well as alignments with other resources, some of which are mentioned in section [2.2]. On the other hand, a blind alignment may suffer from limited coverage in the target language, and ignores the fact that different languages
represent different socio-cultural realities, do not cover the same part of the
lexicon and, even where they seem to share concepts, lexicalize several of them
differently (Hirst 2004). See, for example, the problems mentioned in the
description of MWN.PT.
Finally, Table 4 lists a set of semantic relations and indicates which of them are present in each wordnet. Although some wordnets distinguish several subtypes of these relations, we opted for a purely Boolean comparison, which counts neither the number of subtypes of each relation nor the number of instances of each type. Only WN.PT and Onto.PT cover all the relations listed. In the case of Onto.PT, the set of relations was based on PAPEL which, in turn, was based on regularities found in dictionary definitions. Some relation names were even created specifically for one type of regularity, which makes
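[Table 3. The row labels (wordnet names) were lost in extraction; rows are numbered in their original order.]
Row   Word senses   Synsets    Relation instances   Alignment
1     ?             12,630     40,000+              WN.Pr?
2     21,000        17,200     68,735               WN.Pr
3     75,720        19,888     4,276+?              WN.Pr?
4     248,773       117,450    341,506              none
5     73,802        43,925     74,102               WN.Pr
6     63,096        48,981     238,413              WN.Pr
7     17,631        13,709     48,658               MCR
8     206,978       117,659    285,000
[Table 4. Relation columns: synonymy, antonymy, hypernymy, meronymy, cause, purpose, location, manner. The row labels and cell values were lost in extraction.]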
[7] final discussion
for Portuguese. In fact, using a wordnet in a project that targets Portuguese is less and less a problem with a makeshift solution, and more and more a matter of choosing among the available alternatives. This choice should take into account, among other things, the need for alignment with other wordnets, the tolerance to errors, the need for coverage, both in the relations included and in the lexical items covered, and even the available budget. Since each wordnet has characteristics that set it apart from the others, using more than one in the same project should not be ruled out either.
It is also pertinent to ask whether this number of alternatives makes sense, or whether it would be preferable for the interested authors to focus on building a single wordnet for Portuguese, trying to exploit the strengths of each of the projects described.
The authors of this article, who are responsible for Onto.PT, OpenWN-PT and PULO, believe that both options have advantages and will therefore follow an intermediate approach in the near future. That is, each wordnet will continue to be developed by the same team as before, but each team will follow the work of the others more closely. In this way, among other advantages, each project can benefit from what is done in the others, minimising duplicated work without losing sight of its own specific goals.
As expected, the three projects share the will to keep improving the coherence, quality and coverage of their resource. Besides the tasks already planned for the medium and long term, which are specific to each project, the authors of this article welcome future collaborations that build on what has already been done and that, in the long run, may even lead to an integration or alignment of their projects. Thus, after enumerating the individual goals, we outline potential collaborations that arose from discussions among the authors.
See http://logics.emap.fgv.br/wn/
apply some of its automatic procedures to suggest new content to both OpenWN-PT and PULO, namely: (i) new words for synsets; (ii) new instances of the relations covered; (iii) new relation types; (iv) glosses.
(ii) Search interfaces already exist for the authors' three wordnets, created by the authors themselves or by third parties. However, OpenWN-PT and PULO want to go further and offer an interface for suggesting and validating content. Both OpenWN-PT and PULO already have prototypes of such an interface, and it could be developed jointly.
(iii) It would be interesting to bridge other resources developed by the same authors. This would include, for example, aligning NomLex-PT with PULO, and Dicionário Aberto not only with PULO but also with OpenWN-PT. Dicionário Aberto could even be used as an additional source of Portuguese glosses for the synsets of any of the wordnets.
(iv) The contents of OpenWN-PT and PULO could be exploited by Onto.PT as an additional source for computing the numeric confidence value mentioned above. Moreover, as these resources reach a higher level of coherence, they could also be used as a reference in the evaluation of Onto.PT.
To understand to what extent these wordnets complement each other, and to what extent some kind of integration or alignment would make sense and be feasible, the starting point for a closer collaboration should be a more exhaustive comparison of the three, including, as far as possible, the other free wordnets. For the wordnets that are aligned with WN.Pr, the comparison will probably be easier and more direct.
The comparison could in turn start very simply, by adding to the interface of each wordnet a link that returns the results of the same search in the other wordnets. It could then move on to randomly selecting a set of words and analysing not only their presence in the various wordnets but also their senses. However, given how hard it is to evaluate a wordnet directly, a possible shortcut would be to select test pattern sentences, in natural language, that convey a given semantic relation objectively (for example, <x> é um tipo de <y>, '<x> is a kind of <y>', for hypernymy, or <x> tem um <y>, '<x> has a <y>', for meronymy). Variants can be generated from these sentences by replacing the two related words with the arguments of any relation of the same type. An evaluator should then indicate whether each resulting sentence remains semantically coherent. This was already proposed by Cruse (1986) and
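The sentence-pattern test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two template strings follow the patterns quoted in the text, while the relation triples, the dictionary `TEMPLATES` and the function `generate_test_sentences` are hypothetical and are not taken from any of the wordnets discussed.

```python
# Minimal sketch: instantiate natural-language test patterns with the
# arguments of relation instances, so that a human evaluator can judge
# whether each resulting sentence is semantically coherent.
# All names and sample data below are hypothetical.

TEMPLATES = {
    "hypernymy": "{x} é um tipo de {y}",  # <x> is a kind of <y>
    "meronymy": "{x} tem um {y}",         # <x> has a <y>
}


def generate_test_sentences(relations, templates=TEMPLATES):
    """Return (relation_type, sentence) pairs for every relation instance
    whose type has a test pattern; other relation types are skipped."""
    sentences = []
    for x, relation_type, y in relations:
        template = templates.get(relation_type)
        if template is None:
            continue
        sentences.append((relation_type, template.format(x=x, y=y)))
    return sentences


# Invented relation instances, as they might be exported from a wordnet.
sample_relations = [
    ("cão", "hypernymy", "animal"),
    ("carro", "meronymy", "motor"),
    ("quente", "antonymy", "frio"),  # no pattern defined, so it is skipped
]

for relation_type, sentence in generate_test_sentences(sample_relations):
    print(relation_type, "->", sentence)
```

Each generated sentence would then be shown to an evaluator, whose yes/no judgement serves as indirect evidence for or against the underlying relation instance.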
acknowledgements
Our thanks to Belinda Maia, co-organiser of the Workshop on Language Resources for Teaching and Research and of the Linguateca Summer School, both held at the Faculty of Arts of the University of Porto, where the first author of this article (Hugo) had the pleasure of being invited to present his work carried out within PAPEL, which would later give rise to his research on the construction of wordnets.
references
Amaro, Raquel. 2014. Extracting semantic relations from Portuguese corpora using lexical-syntactic patterns. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'14). ELRA.
van Assem, Mark, Aldo Gangemi & Guus Schreiber. 2006. RDF/OWL representation of WordNet. W3C Working Draft, World Wide Web Consortium. http://www.w3.org/TR/2006/WD-wordnet-rdf-20060619/.
Baccianella, Stefano, Andrea Esuli & Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the 7th International Conference on Language Resources and Evaluation, 2200–2204. ELRA.
Banerjee, Satanjeev & Ted Pedersen. 2002. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the 3rd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2002), vol. 2276 LNCS, 136–145. Springer.
Barreiro, Anabela. 2010. Port4NooJ: an open source, ontology-driven Portuguese linguistic system with applications in machine translation. In Proceedings of the 2008 International NooJ Conference (NooJ'08). Cambridge Scholars Publishing.
Barreiro, Anabela, Fernando Batista, Ricardo Ribeiro, Helena Moniz & Isabel Trancoso. 2014. OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation, 3774–3781. ELRA.
Bobrow, Daniel G, Bob Cheslow, Cleo Condoravdi, Lauri Karttunen, Tracy Holloway King, Rowan Nairn, Valeria de Paiva, Charlotte Price & Annie Zaenen. 2007. PARC's bridge and question answering system. In Tracy Holloway King & Emily M. Bender (eds.), Proceedings of the GEAF 2007 Workshop, 46–66. CSLI.
Bond, Francis & Ryan Foster. 2013. Linking and extending an open multilingual wordnet. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, vol. 1, 1352–1362. ACL.
Bond, Francis & Kyonghee Paik. 2012. A survey of wordnets and their licenses. In Proceedings of the 6th Global WordNet Conference, 64–71.
Cruse, Alan D. 1986. Lexical semantics. Cambridge University Press.
Dias, Gaël Harry, Mohammed Hasanuzzaman, Stéphane Ferrari & Yann Mathet. 2014. TempoWordNet for Sentence Time Tagging. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, 833–838.
Dias-da-Silva, Bento C. 2006. Wordnet.Br: An exercise of human language technology research. In Proceedings of the 3rd International WordNet Conference (GWC), 301–303.
Dias-da-Silva, Bento C., Mirna F. de Oliveira & Helio R. de Moraes. 2002. Groundwork for the Development of the Brazilian Portuguese Wordnet. In Advances in Natural Language Processing (PorTAL 2002), 189–196. Springer.
Drury, Brett, Paula C.F. Cardoso, Janie M. Thomas & Alneu de Andrade Lopes. 2014. Lexical resources for the identification of causative relations in Portuguese texts. In Proceedings of the 1st Workshop on Tools and Resources for Automatically Processing Portuguese and Spanish, 56–63.
Fellbaum, Christiane (ed.). 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.
Fellbaum, Christiane. 2010. WordNet. In Theory and Applications of Ontology: Computer Applications, chap. 10, 231–243. Springer.
Gonçalo Oliveira, Hugo & Paulo Gomes. 2014a. ECO and Onto.PT: A flexible approach for creating a Portuguese wordnet automatically. Language Resources and Evaluation 48(2). 373–393.
Gonçalo Oliveira, Hugo & Paulo Gomes. 2014b. Onto.PT: recent developments of a large public domain Portuguese wordnet. In Proceedings of the 7th Global WordNet Conference, 16–22.
Gonçalo Oliveira, Hugo, Leticia Antón Pérez, Hernani Costa & Paulo Gomes. 2011. Uma rede léxico-semântica de grandes dimensões para o português, extraída a partir de dicionários electrónicos. Linguamática 3(2). 23–38.
Gonçalo Oliveira, Hugo, Diana Santos, Paulo Gomes & Nuno Seco. 2008. PAPEL: A dictionary-based lexical ontology for Portuguese. In Proceedings of Computational Processing of the Portuguese Language - 8th International Conference (PROPOR), vol. 5190, 31–40. Springer.
Gonzalez-Agirre, Aitor & German Rigau. 2013. Construcción de una base de conocimiento léxico multilingüe de amplia cobertura: Multilingual Central Repository. Linguamática 5(1). 13–28.
Guinovart, Xavier Gómez & Alberto Simões. 2013. Retreading Dictionaries for the 21st Century. In José Paulo Leal, Ricardo Rocha & Alberto Simões (eds.), 2nd Symposium on Languages, Applications and Technologies, vol. 29, 115–126. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
Gurevych, Iryna, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer & Christian Wirth. 2012. UBY - a large-scale unified lexical-semantic resource. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 580–590. ACL Press.
Hirst, Graeme. 2004. Ontology and the lexicon. In Steffen Staab & Rudi Studer (eds.), Handbook on Ontologies (International Handbooks on Information Systems), 209–230. Springer.
Kilgarriff, Adam. 1997. I don't believe in word senses. Computers and the Humanities 31. 91–113.
Magnini, Bernardo & Gabriela Cavaglià. 2000. Integrating subject field codes into WordNet. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC), 1413–1418. ELRA.
Marrafa, Palmira. 2001. Wordnet do português: uma base de dados de conhecimento linguístico. Instituto Camões.
de Paiva, Valeria, Livy Real, Alexandre Rademaker & Gerard de Melo. 2014b. NomLex-PT: A Lexicon of Portuguese Nominalizations. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), 114–124. ELRA.
Pease, Adam & Christiane Fellbaum. 2010. Formal ontology as interlingua: the SUMO and WordNet linking project and global WordNet linking project. In Ontology and the Lexicon: A Natural Language Processing Perspective, chap. 2, 25–35. Cambridge University Press.
Pianta, Emanuele, Luisa Bentivogli & Christian Girardi. 2002. MultiWordNet: developing an aligned multilingual database. In Proceedings of the 1st International Conference on Global WordNet, 293–302.
Rademaker, Alexandre, Valeria De Paiva, Gerard de Melo, Livy Maria Real Coelho & Maira Gatti. 2014. OpenWordNet-PT: A Project Report. In Proceedings of the 7th Global WordNet Conference, 383–390.
Resnik, Philip. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, 448–453. Morgan Kaufmann.
Rodrigues, Ricardo, Hugo Gonçalo Oliveira & Paulo Gomes. 2012. Uma abordagem ao Págico baseada no processamento e análise de sintagmas dos tópicos. Linguamática 4(1). 31–39.
Sampson, Geoffrey. 2000. Review of Fellbaum (1998). International Journal of Lexicography 13(1). 54–59.
Santos, Diana, Anabela Barreiro, Cláudia Freitas, Hugo Gonçalo Oliveira, José Carlos Medeiros, Luís Costa, Paulo Gomes & Rosário Silva. 2010. Relações semânticas em português: comparando o TeP, o MWN.PT, o Port4NooJ e o PAPEL. In Textos seleccionados. XXV Encontro Nacional da Associação Portuguesa de Linguística, 681–700.
Silberztein, Max. 2005. NooJ: A Linguistic Annotation System for Corpus Processing. In Proceedings of HLT/EMNLP on Interactive Demonstrations, 10–11. ACL Press.
Simões, Alberto & Xavier Gómez Guinovart. 2014. Bootstrapping a Portuguese wordnet from Galician, Spanish and English wordnets. In Advances in Speech and Language Technologies for Iberian Languages, vol. 8854, 239–248. Springer.
contacts
Hugo Gonçalo Oliveira
CISUC, Universidade de Coimbra, Portugal
hroliv@dei.uc.pt
Valeria de Paiva
Nuance Communications, USA
valeria.depaiva@nuance.com
Cláudia Freitas
PUC-Rio, Brasil
claudiafreitas@puc-rio.br
Alexandre Rademaker
IBM Research e FGV/EMAp, Brasil
alexrad@br.ibm.com
Livy Real
IBM Research, Brasil
livyreal@gmail.com
Alberto Simões
CEH, Universidade do Minho e Linguateca
ambs@ilch.uminho.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 425–438. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
This paper describes the main characteristics of SentiLex-PT, a sentiment lexicon designed for the extraction of sentiment and opinion about human entities in Portuguese texts. The potential of this resource is illustrated by applying it to two types of corpora: SentiCorpus-PT, a social media corpus consisting of user comments on news articles, and a literary work from the early twentieth century, The Poor (Os Pobres), by Raul Brandão. The data were processed with UNITEX, a natural language processing system based on dictionaries and grammars.
[1] introduction
Automatic sentiment analysis (also known in the literature as opinion mining) deals with the computational treatment of opinions, sentiments and attitudes expressed in texts from a variety of sources, in particular social media (Liu 2015). Applications that exploit this analysis generally rely on sentiment lexicons, i.e. lexicons whose entries can be used to convey a given sentiment or emotion. In general, the sentiment information described in these resources corresponds to the semantic orientation, or polarity, of words or expressions. The most commonly used features are negative, positive and neutral. The last category has been adopted to describe cases in which the sentiment associated with a given expression is not clearly positive or negative, and depends fundamentally on the (syntactic, semantic and discursive) context in which it is used (e.g. uma subida surpreendente 'a surprising rise' vs. uma queda surpreendente 'a surprising fall').
One of the properties of natural languages is ambiguity or, from a broader perspective, vagueness (Santos 1998). As far as sentiment is concerned, the same word may have different polarities depending on the domain in which it occurs, which has motivated approaches to building domain-specific lexicons (e.g. Zhang & Singh 2014). This is, for example, the case of quente 'hot, warm', which, as a modifier of an edible noun such as sopa 'soup', can be analysed as a positive predicate (e.g. A sopa ainda está quente 'The soup is still hot'); however, when applied to a drinkable noun such as champanhe 'champagne', it conveys the opposite polarity (e.g. O champanhe está quente 'The champagne is warm').
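The quente example can be made concrete with a small lookup keyed on the semantic class of the modified noun. The sketch below is purely illustrative: the class labels, the two dictionaries and the function `contextual_polarity` are assumptions introduced here and do not reflect how SentiLex-PT or any domain-specific lexicon is actually implemented.

```python
# Minimal sketch: the polarity of a modifier depends on the semantic class
# of the noun it modifies. The classes and values mirror the "quente"
# example in the text; the data structures are hypothetical.

NOUN_CLASS = {"sopa": "edible", "champanhe": "drinkable"}

CONTEXT_POLARITY = {
    ("quente", "edible"): 1,      # hot soup: positive
    ("quente", "drinkable"): -1,  # warm champagne: negative
}


def contextual_polarity(modifier, noun, default=0):
    """Look up the polarity of `modifier` given the class of `noun`.
    Returns `default` (neutral) when the pair is not described."""
    noun_class = NOUN_CLASS.get(noun)
    return CONTEXT_POLARITY.get((modifier, noun_class), default)


print(contextual_polarity("quente", "sopa"))
print(contextual_polarity("quente", "champanhe"))
print(contextual_polarity("quente", "pedra"))
```

Falling back to a neutral default for undescribed pairs is one possible design choice; a real system might instead back off to a context-independent polarity.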
The first version of the lexicon was released back in 2010 (SentiLex-PT01). The currently available version can be obtained from: http://dmir.inesc-id.pt/project/SentiLex-PT_02.
The entries of the lexicon are human predicates, i.e. adjectives, nouns, verbs and verb-based idiomatic expressions that share the property of combining with human nouns, the heads of noun phrases that can function as subject or complement in a sentence. This is, for example, the case of frágil 'fragile', which, besides modifying a concrete noun (e.g. cobertura frágil 'fragile roof') or an abstract noun (e.g. posição frágil 'fragile position'), can also select and modify a human noun (e.g. indivíduo frágil 'fragile individual'). This adjective conveys a negative polarity, whatever the nature of the noun it combines with.
However, there are other cases in which the polarity of the predicate may differ depending on the syntactic-semantic specification of the arguments it combines with. For example, the adjective gordo 'fat' typically conveys a negative value when it modifies a human noun (e.g. indivíduo gordo 'fat individual'), but it can take on the opposite polarity when combined with a noun such as salário 'salary' (e.g. salário gordo 'fat salary').
There are also cases in which the same form may or may not be interpreted as a sentiment predicate, depending on the construction in which it occurs. For example, the adjective distinto should be classified as a sentiment predicate, with positive polarity, when combined with human subjects (e.g. médico distinto 'distinguished doctor'); in non-human constructions, however, the same form may convey no sentiment or polarity at all (e.g. estratégias distintas 'different strategies').
Thus, the development of any lexicon, and of sentiment lexicons in particular, should take into account the different syntactic-semantic contexts in which words can occur, so that the description of the entries is as accurate as possible, thereby enhancing its applicability in processing tasks.
aberração.PoS=N;TG=HUM:N0;POL:N0=-1;ANOT=MAN
bonito.PoS=Adj;TG=HUM:N0;POL:N0=1;ANOT=MAN
castigado.PoS=Adj;TG=HUM:N0;POL:N0=-1;ANOT=JALC
estimado.PoS=Adj;TG=HUM:N0;POL:N0=1;ANOT=JALC;REV=AMB
enganar.PoS=V;TG=HUM:N0:N1;POL:N0=-1;POL:N1=0;ANOT=MAN
engolir em seco.PoS=IDIOM;TG=HUM:N0;POL:N0=-1;ANOT=MAN
figure 1: Examples of SentiLex-lem-PT02 entries (lemmas).
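The entry format shown in Figure 1 is regular enough to be read with a few lines of code. The following is a minimal sketch, not the authors' tooling: the function `parse_lemma_entry` and the output keys `target_type` and `target_args` are hypothetical, and the interpretation of the fields (a target type followed by argument slots in `TG`, one polarity per argument in `POL:`) is inferred from the examples alone.

```python
# Minimal sketch of a parser for lemma entries in the format of Figure 1.
# The interpretation of the fields is inferred from the examples only.


def parse_lemma_entry(line):
    """Split 'lemma.PoS=...;TG=...;POL:N0=...;ANOT=...' into a dict."""
    lemma, separator, rest = line.strip().partition(".PoS=")
    if not separator:
        raise ValueError(f"not a lemma entry: {line!r}")
    entry = {"lemma": lemma, "polarity": {}}
    for field in ("PoS=" + rest).split(";"):
        key, _, value = field.partition("=")
        if key.startswith("POL:"):
            # e.g. 'POL:N0=-1' -> polarity of argument N0 is -1
            entry["polarity"][key[len("POL:"):]] = int(value)
        elif key == "TG":
            # e.g. 'HUM:N0:N1' -> target type HUM, arguments N0 and N1
            target_type, *arguments = value.split(":")
            entry["target_type"] = target_type
            entry["target_args"] = arguments
        else:
            entry[key] = value
    return entry


example = parse_lemma_entry(
    "enganar.PoS=V;TG=HUM:N0:N1;POL:N0=-1;POL:N1=0;ANOT=MAN"
)
print(example["lemma"], example["PoS"], example["polarity"])
```

Keeping the polarity values in a dictionary keyed by argument slot makes the difference between intransitive entries (only N0) and transitive entries (N0 and N1) explicit.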
aberração,aberração.PoS=N;FLEX=fs;TG=HUM:N0;POL:N0=-1;ANOT=MAN
bonita,bonito.PoS=Adj;FLEX=fs;TG=HUM:N0;POL:N0=1;ANOT=MAN
bonitas,bonito.PoS=Adj;FLEX=fp;TG=HUM:N0;POL:N0=1;ANOT=MAN
bonito,bonito.PoS=Adj;FLEX=ms;TG=HUM:N0;POL:N0=1;ANOT=MAN
bonitos,bonito.PoS=Adj;FLEX=mp;TG=HUM:N0;POL:N0=1;ANOT=MAN
engoliste em seco,engolir em seco.PoS=IDIOM;Flex=J2p|J2s;TG=HUM:N0;POL:N0=-1;ANOT=MAN
engolistes em seco,engolir em seco.PoS=IDIOM;Flex=J2p;TG=HUM:N0;POL:N0=-1;ANOT=MAN
engoliu em seco,engolir em seco.PoS=IDIOM;Flex=J4s|P3s;TG=HUM:N0;POL:N0=-1;ANOT=MAN
engulamos em seco,engolir em seco.PoS=IDIOM;Flex=Y1p|S1p;TG=HUM:N0;POL:N0=-1;ANOT=MAN
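Entries for inflected forms add the surface form before the lemma and an inflection field. As a hedged illustration of how such entries could support lookup in running text, the sketch below builds an index from surface form to lemma and N0 polarity. The helper names `parse_flex_entry` and `build_form_index` are hypothetical, and the three entry strings are copied from the examples above.

```python
# Minimal sketch: index inflected forms so that surface forms in a text can
# be mapped to a lemma and to the polarity of its N0 argument.
# The helper names are hypothetical.

FLEX_ENTRIES = [
    "bonita,bonito.PoS=Adj;FLEX=fs;TG=HUM:N0;POL:N0=1;ANOT=MAN",
    "bonitas,bonito.PoS=Adj;FLEX=fp;TG=HUM:N0;POL:N0=1;ANOT=MAN",
    "engoliu em seco,engolir em seco.PoS=IDIOM;Flex=J4s|P3s;TG=HUM:N0;POL:N0=-1;ANOT=MAN",
]


def parse_flex_entry(line):
    """Return (form, lemma, attributes) for one inflected-form entry."""
    head, _, rest = line.strip().partition(".PoS=")
    form, _, lemma = head.partition(",")
    attributes = dict(
        field.split("=", 1) for field in ("PoS=" + rest).split(";")
    )
    return form, lemma, attributes


def build_form_index(lines):
    """Map each surface form to its lemma, inflection codes and N0 polarity."""
    index = {}
    for line in lines:
        form, lemma, attributes = parse_flex_entry(line)
        # The examples spell the inflection key both 'FLEX' and 'Flex'.
        flex = attributes.get("FLEX") or attributes.get("Flex", "")
        index[form] = {
            "lemma": lemma,
            "flex": flex.split("|") if flex else [],
            "polarity_n0": int(attributes["POL:N0"]),
        }
    return index


index = build_form_index(FLEX_ENTRIES)
print(index["bonitas"])
print(index["engoliu em seco"])
```

Multiword forms such as engoliu em seco are stored under the full string, so matching them in text would require a phrase-level lookup rather than token-by-token matching.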
Polarity    No. of intransitive predicates    Examples
Negative    4,453                             arrogante; terror; morrer; não ter onde cair morto
Positive    1,396                             misericordioso; beleza; brilhar; levantar a cabeça
Neutral     1,396                             misterioso; simples; humilde; ingénuo

Polarity (N0)   Polarity (N1)   No. of transitive predicates   Examples
N0_Pos          N1_Pos          1                              estar à altura
N0_Pos          N1_Neg          162                            calar, vencer
N0_Pos          N1_Neu          29                             impressionar, salvar
N0_Neg          N1_Neg          22                             encobrir, insultar
N0_Neg          N1_Pos          9                              ceder, curvar-se
N0_Neg          N1_Neu          149                            espezinhar, faltar ao respeito
N0_Neu          N1_Neg          55                             desconfiar, ignorar
N0_Neu          N1_Pos          29                             admirar, acreditar
SentiCorpus-PT (row labels lost in extraction)
112,374 (8,591)
49,304 (8,544)
4,548 (1,805)
2,131 (650)
2,338 (998)
446 (181)
On the other hand, terms classified as positive may be used non-literally, for example to express irony, an extremely productive phenomenon in texts from social networks (Carvalho et al. 2009).
Table 4 lists the five SentiLex sentiment words with the highest number of occurrences in each corpus.
It is interesting to note that the words in question, which point directly to the themes addressed in each text, are different. In the literary text, the most frequent word, sonho 'dream', is the only one described in SentiLex as positive. By contrast, in the comments on the political debates, the most prominent place is taken by transitive predicates whose polarity is potentially positive for
[4] Text excerpt taken from the Dicionário de Língua Portuguesa com Acordo Ortográfico [online]. Porto: Porto Editora, 2003–2015. [Accessed: 2015-02-13]. Available at http://www.infopedia.pt/$raul-brandao.
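Literary text (words lost in extraction, except the most frequent one, sonho, named in the text)
Word      Polarity   Occurrences
sonho     Pos        109
?         Neg        89
?         Neg        61
?         Neg        56
?         Neg        50

SentiCorpus-PT
Word      Polarity   Occurrences
votar     Neu Pos    91
voto      Neu Pos    76
verdade   Pos        50
votos     Neu Pos    37
ganhou    Pos Neg    35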
Some of the constructions shown in the concordances are transitive, as is the case of encostar à parede 'to push against the wall', which has two distinct polarity values: positive for the argument that functions as subject (here, Sócrates) and negative for the argument that functions as direct object (here, Louçã). Note that this information can be processed correctly by applying SentiLex-PT to the texts, since the polarity information takes into account the distributional properties of the predicates, something that is usually ignored in the lexicons that have been built, both for Portuguese and for other languages. This information makes it possible, for example, to make sentiment extraction finer-grained and more rigorous. For example, the concordance below results from refining the previous search, requiring the presence of an idiomatic expression whose polarity is potentially positive for the subject and negative for the complement of the predicate.
u , questionou , barafustou e
.. E com razão , que Socrtaes
a sempre. O operário jerónimo
o acordo com o PP. O Socrates
ldades é claro que o Sócrates
lo , a Manuela Ferreira Leite
em meu entender. Paulo Portas
chegou para ela ! .. . . Quem
carro usado a Sócrates? Quem
C EM PAPEL SOLOFAN. Sócrates
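The argument-level polarity discussed above can be illustrated with a short sketch that propagates the polarity of a transitive predicate to the entities filling its subject and object slots. The polarity values for encostar à parede follow the description in the text; everything else (the dictionary `PREDICATE_POLARITY`, the function `entity_polarities` and the placeholder entity names) is hypothetical, and the sketch assumes that subject and object have already been identified by some earlier analysis step.

```python
# Minimal sketch: propagate the argument-level polarity of a transitive
# predicate to the entities that fill its subject (N0) and object (N1)
# slots. The triple and the entity names are invented.

PREDICATE_POLARITY = {
    "encostar à parede": {"N0": 1, "N1": -1},
}


def entity_polarities(subject, predicate, direct_object,
                      lexicon=PREDICATE_POLARITY):
    """Return {entity: polarity} for the arguments of a known predicate,
    or an empty dict if the predicate is not in the lexicon."""
    polarity = lexicon.get(predicate)
    if polarity is None:
        return {}
    return {subject: polarity["N0"], direct_object: polarity["N1"]}


print(entity_polarities("candidate_A", "encostar à parede", "candidate_B"))
print(entity_polarities("candidate_A", "cumprimentar", "candidate_B"))
```

A lexicon with a single polarity per entry could not distinguish the two participants here, which is the point the paragraph above makes about distributional properties.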
[5] final remarks
SentiLex-PT is a freely available resource that has been widely used by national and international research teams in various tasks of lexical expansion (notably, among others, the work of Gonçalo Oliveira et al. 2014) and sentiment analysis, for example in the political context (Tumitan & Becker 2014).
acknowledgements
A very special thank-you to Belinda Maia, co-responsible for the two Summer Schools organised by Linguateca, where we had the opportunity to meet and to embrace a project in the area of sentiment analysis, from which SentiLex-PT, among other resources, was born. A word of thanks also to Maria José Finatto and Hugo Gonçalo Oliveira, for reading the article and for their pertinent suggestions.
This work was partially supported by funding from Fundação para a Ciência e a Tecnologia (FCT), references UID/CEC/50021/2013, EXCL/EEI-ESS/0257/2012 (DataStorm), PTDC/CPJ-CPO/116888/2010 (POPSTAR), UTA-Est/MAI/0006/2009 (REACTION) and SFRH/BPD/45416/2008.
references
Balage, Pedro, Thiago Pardo & Sandra Aluísio. 2013. An Evaluation of the Brazilian Portuguese LIWC Dictionary for Sentiment Analysis. In Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology, 215–219.
Barreiro, Anabela. 2008. ParaMT: A paraphraser for machine translation. In Computational Processing of the Portuguese Language, 8th International Conference, PROPOR 2008, Aveiro, Portugal, September 8-10, 2008, Proceedings, 202–211.
Carvalho, Paula, Luís Sarmento, Mário J. Silva & Eugénio de Oliveira. 2009. Clues for Detecting Irony in User-generated Contents: Oh...!! It's So Easy;-). In Proceedings of the 1st International CIKM Workshop on Topic-sentiment Analysis for Mass Opinion, 53–56. ACM.
Carvalho, Paula, Luís Sarmento, Jorge Teixeira & Mário J. Silva. 2011. Liars and Saviors in a Sentiment Annotated Corpus of Comments to Political Debates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 2, 564–568.
Freitas, Cláudia. 2013. Sobre a construção de um léxico da afetividade para o pro
contacts
Paula Carvalho
Laureate International Universities & INESC-ID
pcc@inesc-id.pt
Mário J. Silva
Universidade de Lisboa, Instituto Superior Técnico & INESC-ID
mjs@inesc-id.pt
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, Oslo Studies in Language 7(1), 2015. 439–456. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
This paper, inspired by Stig Johansson's article on loving and hating in English and Norwegian (Johansson 1998), applies similar methods to the Portuguese–English pair.
Using translations in both directions in the ENPC, Johansson compared the verbs love and hate in English with their Norwegian counterparts elske and hate, concluding that there are differences in the use of these verbs, although they are highly correlated. The Norwegian verbs generally express a strong feeling, whereas the English verbs are also used in a weaker sense, which is more frequent in combination with non-human objects or complement clauses.
Based on a subset of COMPARA, the present study investigates what can be concluded from comparing English love and hate with the Portuguese verbs amar and odiar. The results are less clear: while the Portuguese verbs seem to side with the Norwegian ones in having a more restricted area of application than the English ones, the verb odiar is used much more with non-human objects than the Norwegian verb hate. This and other contrastive observations suggest that it is easier in Portuguese than in Norwegian to attribute strong feelings to non-human objects, whereas in English the verbs are used in a weaker sense.
[1] introduction
(1) I loved Natalie. (ENPC/FW1)
    Jeg elsket Natalie. (ENPC/FW1T)
(2) (ENPC/RD1)
    (ENPC/RD1T)
The current study seeks to establish to what extent conclusions similar to those drawn for English vs. Norwegian also apply to the language pair English-Portuguese. In other words, what is typically loved and hated in English and Portuguese? Are the Portuguese verbs closer to the English or the Norwegian verbs in terms of meaning and use? Answers to these questions will primarily be sought in material culled from a subset of the COMPARA corpus (see e.g. Frankenberg-Garcia & Santos (2003)).
Providing essential background information, both in terms of object of study and method, Section [2] outlines Johansson's study in more detail. The corpus used in the present investigation is presented in Section [3], while Section [4] contains the contrastive analysis proper. Some concluding remarks are offered in Section [5].
[2] background
Johansson's interest in the verbs under study was sparked when he noticed some odd uses of Norwegian hate appearing in the newspaper. The examples that triggered the original study are repeated here as (3) and (4), and were found to be direct translations from English.
(3)
(4)
Johansson's immediate reaction was that these were instances of anglicisms, inspired by the English source texts (1998, pg. 93), and therefore not considered idiomatic Norwegian. These observations made him want to take a closer look into the relationship between the English and Norwegian cognate verbs hate/hate. He also included in his investigation their more loveable opposites: love and elske.
In the material from the ENPC, he noticed that, in the original texts, the English verbs were more than three times as frequent as the Norwegian verbs. In the translated texts, however, the frequencies move in the direction of the frequencies found for the corresponding verbs in original texts in the source language (ibid., pg. 94), as shown in Table 1.
           Original texts   Translations
N hate     23               34
N elske    36               90
E hate     67               25
E love     100              62
table 1: Frequency figures for the four verbs in original and translated texts in the ENPC.
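The shift described above can be checked directly against the figures in Table 1. The sketch below is an illustration only: it assumes that the flattened table is read column by column (original texts first, then translations), and the dictionary and function names are introduced here for the example. For each verb, it tests whether the frequency in translated texts lies closer to the frequency of the counterpart verb in source-language originals than the frequency in target-language originals does.

```python
# Minimal sketch: check, for each verb in Table 1, whether its frequency in
# translated texts moves away from its frequency in original texts and
# towards the frequency of its counterpart in source-language originals.

ORIGINAL = {"N hate": 23, "N elske": 36, "E hate": 67, "E love": 100}
TRANSLATED = {"N hate": 34, "N elske": 90, "E hate": 25, "E love": 62}

# Counterpart of each verb in the other language (the source language of
# the translations in which the verb occurs).
COUNTERPART = {
    "N hate": "E hate", "N elske": "E love",
    "E hate": "N hate", "E love": "N elske",
}


def moves_towards_source(verb):
    """True if the translated frequency is closer to the source-language
    original frequency than the target-language original frequency is."""
    source_frequency = ORIGINAL[COUNTERPART[verb]]
    gap_original = abs(ORIGINAL[verb] - source_frequency)
    gap_translated = abs(TRANSLATED[verb] - source_frequency)
    return gap_translated < gap_original


for verb in ORIGINAL:
    print(verb, ORIGINAL[verb], "->", TRANSLATED[verb],
          moves_towards_source(verb))
```

Raw frequencies are compared here because the table reports raw counts; a fuller analysis would normalise by corpus size.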
The tendency for linguistic patterns to behave differently in original vs. translated texts may be caused by source language influence on the target language.
This phenomenon has been termed translationese (see e.g. Gellerstam (1986)),
and Johansson suggests that it is highly likely that the occurrences of Norwegian hate in examples (3) and (4) above are examples of translationese (Johansson 1998, pgs. 94–94).
Johansson moves on to discuss the overall translation patterns in the ENPC
material, and finds that Norwegian hate and elske are almost invariably translated by their English counterparts hate and love, while the English verbs often
have other renditions in Norwegian than hate and elske. This suggests that the
Norwegian verbs have a more restricted area of use than their English cousins.
           Original texts                      Translations
           Personal Obj.   Non-personal Obj.   Personal Obj.   Non-personal Obj.
N hate     65%             35%                 35%             65%
N elske    61%             39%                 35%             64%
E hate     27%             73%                 56%             44%
E love     46%             54%                 65%             35%
table 2: Type of object following the verbs in original and translated texts (in percent) in the ENPC (ibid., pg. 95).
Focusing on the original texts in the two languages, we can note that the Norwegian verbs prefer a personal object, while the English verbs prefer non-personal objects. Johansson comments on the translations and says that "[t]he translated texts again show a frequency pattern which reflects the source texts" (ibid.), thus a greater proportion of the Norwegian translations than expected used elske/hate with the weakened sense and the complementation patterns typical of the English love/hate.
Johansson's study continues with an analysis and a discussion of the Norwegian translation correspondences and he concludes that the differences between the English and Norwegian verbs come out very clearly both in the overall frequency of the verbs in original texts and in their translation patterns (ibid., pg. 101). He also notes that, the distribution differences between original and translated texts notwithstanding, the translators generally seem to be aware of these differences, as attested by the rich inventory of translation correspondences.
Nevertheless, the influence from English on the Norwegian language is pervasive and may lead to the use of Norwegian hate/elske in a weakened sense. In fact, Johansson suggests that the Norwegian verbs may be undergoing a semantic change. This new weakened use of the two Norwegian verbs has indeed been attested in two follow-up studies based on more recent corpus material (Hasselgård 2011; Ebeling 2014).
The present study adds another language to the equation, and will follow in Johansson's steps in the analysis with the aim of gaining insight which goes beyond the establishment of standard counterparts (Johansson 1998, pg. 103), viz. love/amar and hate/odiar.[3]
[3] In an article entitled "Loving and hating the movies in English, German and Spanish", Taboada et al. (2014) study evaluative language in the genre of movie reviews. Their focus is not specifically on the verbs love and hate, but they mention, referring to Johansson (1998), that "love and hate and their equivalents in German and Spanish are actually quite infrequent in our corpus, because they express Affect, which [...] is not very frequent in our corpus, in contrast to Appreciation" (Taboada et al. 2014, pg. 131).
As mentioned in the Introduction, the main source of data used in this investigation is a subset of the COMPARA corpus. COMPARA contains original texts in English and Portuguese with their translations into the other language, and is thus
similar to the ENPC in being a bidirectional translation corpus. Worth mentioning
in this context is that Portuguese was one of the languages that were added in the
multilingual extension of the ENPC, later known as the Oslo Multilingual Corpus
(OMC) (see Oksefjell (1999); Johansson (2007)). As the Portuguese part of the OMC
is unidirectional, i.e. it contains Portuguese translations of English texts but not
vice versa, COMPARA was a more natural choice of corpus for this study. However, some of the texts in the (English-Portuguese part of the) OMC and COMPARA
overlap.
In order to make this study as comparable as possible to Johansson's, a selection of texts available in COMPARA was made, according to the following criteria:
Original texts mainly from the 1980s and 1990s [4]
A maximum of two texts per author [5]
The version of COMPARA used here thus contains 20 original text extracts in
Portuguese, amounting to approx. 370,000 words and 14 original text extracts
in English, amounting to approx. 360,000 words, in addition to a similar amount
of text of their respective translations (see the Appendix for a full list of texts
included).[6] The fact that different varieties of both Portuguese and English are
[4] One Portuguese text (PPJS1) was published in the late 1970s (but so were some of the Norwegian texts in the ENPC).
[5] To ensure as balanced a corpus as possible in terms of size, three texts by one Brazilian author were included (PBPC).
[6] For comparison, the ENPC contains roughly 400,000 words of original text in each language.
OSLa volume 7(1), 2015
              Original texts                   Translations
              (E: 359,281 / P: 369,203)        (E: 412,704 / P: 350,607)
P odiar       39 (10.6 per 100,000 words)      16 (4.6 per 100,000 words)
P amar        54 (14.6 per 100,000 words)      30 (8.6 per 100,000 words)
E hate        37 (10.3 per 100,000 words)      49 (11.9 per 100,000 words)
E love        84 (23.4 per 100,000 words)      96 (23.2 per 100,000 words)
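The normalised figures in the table can be recomputed from the raw counts and the sub-corpus sizes given in the column headings. The Python sketch below is an illustration only: the helper name `per_100k` and the dictionary layout are assumptions made here, not part of the study's method. The recomputed values agree with the reported ones to within 0.1.

```python
# Illustrative sketch: relative frequency per 100,000 running words.
# The helper name and the data layout are assumptions for this example.

def per_100k(count: int, corpus_size: int) -> float:
    """Return the number of occurrences per 100,000 words."""
    return count / corpus_size * 100_000


# Sub-corpus sizes in words, as given in the table's column headings.
SIZES = {
    ("E", "original"): 359_281,
    ("P", "original"): 369_203,
    ("E", "translation"): 412_704,
    ("P", "translation"): 350_607,
}

# Raw counts as reported in the table: (language, verb, text type) -> count.
COUNTS = {
    ("P", "odiar", "original"): 39,
    ("P", "amar", "original"): 54,
    ("E", "hate", "original"): 37,
    ("E", "love", "original"): 84,
    ("P", "odiar", "translation"): 16,
    ("P", "amar", "translation"): 30,
    ("E", "hate", "translation"): 49,
    ("E", "love", "translation"): 96,
}

# Normalise each count by the size of the sub-corpus it was drawn from.
RATES = {
    (lang, verb, kind): per_100k(n, SIZES[(lang, kind)])
    for (lang, verb, kind), n in COUNTS.items()
}

for (lang, verb, kind), rate in RATES.items():
    print(f"{lang} {verb:<5} {kind:<11} {rate:5.1f} per 100,000 words")
```

Normalising by sub-corpus size is what makes the English and Portuguese counts comparable, since the four sub-corpora differ in length by up to roughly 60,000 words.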
See e.g. Collins Portuguese Dictionary and The Routledge Portuguese Bilingual Dictionary.
Henceforth, COMPARA refers to the subset used here.
data, however, the English translations seemed to be drawn towards the Norwegian source texts in being less frequently used.
As English love is more frequent than amar overall (both in original texts and
translations), it seems fair to suggest that love has a wider area of use than amar.
Odiar and hate, on the other hand, occur with a similar frequency in original texts,
while the use of odiar drops in translations. In contrast to the English-Norwegian
data, the difference in distribution of odiar and amar in original vs. translated text
does not seem to be a case of translationese, as their distribution is not pulled towards the use in the source language English. In fact, the reason for this discrepancy is hard to pin down, but, with regard to the former verb, could the notion of
odiar not being used as casually as hate play a role in the minds of the translators?
[4] contrastive analysis
Following Johansson's steps in the analysis, we will first take a look at the overall
translation patterns before moving on to the actual translation correspondences.
P odiar    NOT E hate    0 (out of 39)
P amar     NOT E love    7 (out of 54)
E hate     NOT P odiar
E love     NOT P amar
Não amava o próximo
He had no love for his fellow man
(PBRF1)
In one case being unloved has been used as a translation of não ser amado, while
the last example is a direct quotation from the Bible and has betrothed a wife as a
translation of ama uma mulher.
              Original texts                 Translations
              Personal    Non-personal       Personal    Non-personal
              objects     objects            objects     objects
P odiar       51.0%       49.0%              68.8%       31.2%
P amar        76.0%       24.0%              63.3%       36.7%
E hate        37.8%       62.2%              51.0%       49.0%
E love        57.1%       42.9%              53.1%       46.9%
table 5: Type of object following the verbs in original and translated texts (in percent) in COMPARA
If we look at the distribution of the English verbs first, we can note that hate
clearly favours a non-personal object in the original texts, while love prefers a
personal object. While the former observation is in line with Johansson's original
study, the latter is not; i.e. love was found to be slightly more common with a non-personal object in that study. However, as the distribution of love showed the least discrepancy
between personal vs. non-personal object in Johansson's study (see Table 2), the
choice seems to be arbitrary and most likely due to the subject matter of the individual texts.
In the Portuguese original texts there is a clear preference for personal objects
with amar, while in the case of odiar there is a less clear-cut division of labour between complementation patterns. A typical example of amar with a personal object is shown in example (6), while examples (7) and (8) show odiar with a personal
and non-personal object, respectively.
(6)
(7)
[PBRF1]
[PMMC2]
As can be seen from Table 5, the verbs tend to favour personal objects also in the
translations; this is true even of English hate, albeit only marginally so. This is
related to the use of odiar in the source texts (also with a slight overweight of
personal objects) and the fact that hate is always used as a translation of odiar in
the material at hand (see Table 4 and examples (7) and (8)). It is harder to explain
why the percentage of odiar with a personal object increases to the extent that it
does in the translations, but again it seems to be related to the fact that the other
main translation option of hate (besides odiar), detestar 'detest', takes care of
many of the instances where hate has a non-personal object in the original texts,
as exemplified in (9) where the object is realised by an infinitive clause.
(9)
[EBDL1T1]
Whether this suggests that the translators view odiar as being semantically too
strong or unidiomatic in contexts such as (9) is hard to determine, though. It is
also hard to determine what happens to amar and love in the translations, as both
show a slight decrease in personal objects and a slight increase in non-personal
objects. The reason for this may become clearer when we turn to the next step in
the contrastive analysis, focusing on the actual translation correspondences of
the four verbs under study.
(11)
[EBJT2]
Thus, where Norwegian was shown not to readily accept a hate-verb with clause
complementation, Portuguese has detestar. However, odiar does not seem to be
completely ruled out, as there were two instances of odiar + infinitive clause in the
Portuguese originals. A brief comparison of instances per million words (pmw) of
amar, odiar, love and hate followed by an infinitive in monolingual corpora shows
the following: amar + inf.: 0.16 pmw, odiar + inf.: 0.28 pmw (based on corpo todos
juntos through the AC/DC project);[9] love + to-inf.: 11.41 pmw, hate + to-inf.: 3.99
pmw (based on the British National Corpus, BNCWeb CQP edition).
Other non-personal objects
The other non-personal objects attested form a very homogeneous group,
consisting of a noun phrase in all but one of the 19 instances. 14 of these have
detestar in the translation, e.g. (12), while only four have odiar, e.g. (13). The one
instance without a following noun phrase is a passive construction translated by
odiado.
(12)
(13)
[EBJT1]
[EBDL1T1]
[9] http://linguateca.pt
Personal objects
When hate is followed by a personal object, the translators have chosen odiar
in 12 of the 14 cases. The remaining two have detestar. This reconfirms the impression that hate covers the area of use of two verbs in particular in Portuguese,
namely odiar and detestar. The relationship between hate and odiar is dependent
on type of object, and can be summed up as follows, when hate is used in the original texts:
Complement clause: no instances of Portuguese odiar
Other non-personal object: approx. 21% Portuguese odiar
Personal object: approx. 85% Portuguese odiar
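The two percentages in this summary follow from the counts reported in this section: 4 of the 19 other non-personal objects and 12 of the 14 personal objects have odiar in the translation. A minimal Python sketch of that arithmetic follows; the function name `share` and the dictionary keys are labels chosen for this illustration.

```python
# Illustrative sketch: share of odiar among the translations of hate,
# by type of object, using the counts reported in the running text.

def share(hits: int, total: int) -> float:
    """Return hits as a percentage of total."""
    return 100 * hits / total


# (instances translated by odiar, all instances of hate with this object type)
ODIAR_COUNTS = {
    "other non-personal object": (4, 19),
    "personal object": (12, 14),
}

SHARES = {label: share(hits, total) for label, (hits, total) in ODIAR_COUNTS.items()}

for label, pct in SHARES.items():
    print(f"{label}: {pct:.1f}% odiar")
```

Keeping the raw counts next to the percentages makes it easier to see how small the underlying samples are, which matters when comparing the proportions across object types.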
(15)
[EBDL1T1]
[EBJT2]
As hinted at in the translations of love in both (14) and (15), amar is not used as a
translation in any of the nine cases; instead adorar and gostar (de) are used, five
and four times, respectively. In other words, the tendency is similar to what was
noted for hate/odiar; other Portuguese verbs than the closest counterpart amar
take over when love is followed by a complement clause. Although both examples
show love in its weakened sense, example (13) deserves special attention. I believe
the combination modal + love + to-infinitive clause in particular bears witness to
the weakened sense of love when compared to amar (and also Norwegian elske). In
fact, Maia (1994/1996, section 7.5.2) draws attention to this in her discussion of
the use of modals with verbs of emotion, quoting Quirk et al. (1985, 3.64n) who
say that would in such contexts is used to indicate a tentative desire in polite requests, offers or invitations. Moreover, amar + complement clause is not attested
in the original texts of the COMPARA corpus.
[ESNG1]
(17)
(18)
(19)
(20)
And here he was, making himself sick because the pet he loved was stolen.
[EURZ2]
E aqui estava ele, doente porque a sua adorada ave de estimação tinha
sido roubada.
[EBJT2]
The Portuguese translator has interpreted the second instance of love in this sentence as a verb, while I
have interpreted it as a noun; it is thus not part of the material studied here.
It is not quite clear how Johansson (1998) classified instances of passive and intransitive constructions.
However, he says that "[i]n the few instances of intransitive use, the verbs are translated by their standard counterparts" (ibid., pg. 96). Since the number of instances in Johansson's study is not reduced after
mentioning this, I take it to mean that Johansson counted them as instances of the non-personal object
category. This is not as straightforward for the passive use, as there is very often an implied personal
object involved. However, the four instances of passive and intransitive love do not skew the results unduly. In addition, Maia (1994/1996, section 6.7) notes that passives with SFoc verbs like love and like are
extremely rare.
sitive pattern is Maia's (1994/1996, section 6.4) observation that amar differs from
love in this respect, i.e. intransitive amar is much more frequently attested than
intransitive love in her material.
In contrast to the translations of hate, we have seen that the translations of love
form a slightly less homogeneous group; instead of two main correspondences as
is the case for hate, there are three for love, in addition to a couple of marginal
ones. Moreover, the verb is not exclusively followed by a noun phrase. This suggests that love in English may have a wider area of use than amar.
Personal objects
Amar is used as a translation of love followed by a personal object in half of the
cases (24 out of 48), and is illustrated in example (21). The other frequent translation correspondence is gostar de, used in 17 cases, and illustrated in example (22).
Other, minor, correspondences include three instances of zero correspondence,
as in (23), three instances of adorar, e.g. (24), and one instance of estimar, e.g. (25).
(21)
Men of Athens, I honor and love you, but I shall obey God rather than
you.
[EUJH1]
Atenienses, honro-vos e amo-vos, mas devo obedecer a Deus antes de a
vós.
(22)
(23)
(24)
(25)
[EBJT1]
[EBJT1]
[EURZ1]
As was the case in Johansson's material, there are two main translation correspondences of love with a personal object in the English-Portuguese material. Another
similarity is that there is no tendency as to what kind of personal relationship is
described, that between man-woman, parent-child, friend-friend etc. (i.e. the
senser and phenomenon in Maia's (1994/1996) terms).
This study has followed in the footsteps of Johansson's article concerning the relationship between the typical verbs of love and hate in English and Norwegian.
The aim was to shed light on the relationship between similar verbs in English and
Portuguese. The COMPARA data seem to paint a more complex picture of the use
of these verbs across the two languages. In some respects, the Portuguese verbs
behave in ways similar to the Norwegian verbs, particularly in that they seem to
have a more restricted area of use than their English counterparts.
In other respects, the Portuguese verbs differ from the Norwegian verbs. In
original texts, odiar, for example, is shown to combine more easily with non-personal objects than Norwegian hate. These and other cross-linguistic observations suggest that the Portuguese verbs may more easily combine the 'strong
feeling' meaning with non-personal objects than the Norwegian verbs, while the English
verbs are more often used in a weakened sense. Alternatively, it could point to
a middle position for Portuguese, where Norwegian hate expresses the strongest
feeling of hate, English hate the weakest, with Portuguese odiar somewhere in between.
The Portuguese translations of love and hate reveal some clear patterns: the
English verbs are tied to two or three Portuguese verbs each. Thus the inventory of correspondences is more restricted than the Norwegian correspondences
reported by Johansson (1998). The translators seem to be well aware of this division of labour between a small set of Portuguese verbs to cover the meanings
of love and hate. Again it is tempting to suggest that Portuguese amar and odiar
are in a middle position, in that the two English verbs have the widest area of use
and the Norwegian verbs the narrowest, with the Portuguese verbs somewhere
in between.
As was the case in the English-Norwegian data, the Portuguese translation patterns for love and hate are broadly in agreement in terms of complement types.
Neither amar nor odiar was found with a complement clause, and only around 20%
of the translations with other non-personal objects had amar or odiar. Personal
objects were favoured by both Portuguese verbs. However, in the original data
odiar was found to occur with a complement clause, which supports the suggestion that at least one of the Portuguese verbs may have a slightly more weakened
sense than its Norwegian counterpart. In this context it should be pointed out
that studies of Norwegian elske and hate based on more recent data than the ENPC
found evidence of these constructions occurring naturally in (untranslated) Norwegian (Hasselgård 2011; Ebeling 2014). In other words, Norwegian elske and hate
were attested with complement clauses. In the original study, Johansson's immediate reaction was that these were anglicisms (1998, pg. 93). While I believe that
his observation is right, it is also a fact that this construction is on the increase
in Norwegian, and what we are witnessing is a language change due to influence
from English (Ebeling 2014).
As pointed out by Johansson (1998, pg. 102), "[c]hanges of this kind are natural
wherever there are languages in contact, but it is important to be aware of what
is going on". Whether similar changes, due to influence from English, are also
taking place in Portuguese is hard to determine on the basis of the COMPARA material. To gain insight into the development of the complement patterns of amar
and odiar, diachronic Portuguese material (including material of a more recent
date) has to be consulted; this will have to await future research.
acknowledgements
I would like to thank Cristina Mota and Stella Tagnin for their valuable and constructive comments on a previous version of this paper.
appendix
Overview of the subset of COMPARA used.12
Corpus ID Author
Translator
EBDL1T1 Lodge, David
Figueira, Maria do Carmo
[12]
Title
Title (trans.)
Therapy
Terapia
Place of pub./Publisher
Place of pub./ Publisher (trans.)
London: Secker & Warburg
Lisbon: Gradiva
Year of pub.
Year of pub. (trans.)
1995
1995
EBIM1
McEwan, Ian
Rodrigues, Fernanda Pinto
Black Dogs
Cães Pretos
London: Picador
Lisbon: Gradiva
1992
1993
EBIM2
McEwan, Ian
Bastos, Ana Falcão
Amsterdam
Amesterdão
London: Vintage
Lisbon: Gradiva
1998
1999
EBJB1
Barnes, Julian
Amador, Ana Maria
Flaubert's Parrot
O papagaio de Flaubert
London: Picador
Lisbon: Quetzal
1985
1988
EBJB2
Barnes, Julian
Lima, José Vieira de
1989
1990
[a]
[b]
EBJT1
Trollope, Joanna
Bastos, Ana Falcão
Next of Kin
Parentes próximos
London: Black Swan
Lisbon: Gradiva
1996
1998
EBJT2
Trollope, Joanna
Bastos, Ana Falcão
A Spanish Lover
Um Amante Espanhol
London: Bloomsbury
Lisbon: Gradiva
1993
1999
EBKI1
Ishiguro, Kazuo
Rodrigues, Fernanda Pinto
The Unconsoled
Os Inconsolados
1995
1995
EBKI2
Ishiguro, Kazuo
Rodrigues, Fernanda Pinto
The Remains of the Day
Os Despojos do Dia
1989
1991
ESNG1
Gordimer, Nadine
Ferraz, Geraldo Galvão
My Son's Story
A história do meu filho
1990
1992
ESNG3
Gordimer, Nadine
Reis, Paula
July's People
A Gente de July
EUJH1
Heller, Joseph
Rodriguez, Cristina
Picture This
Imaginem que
EURZ1
Zimler, Richard
Lima, José
1998a
1996
EURZ2
Zimler, Richard
Lima, José
Angelic Darkness
Trevas da Luz
2000b
1998
1990
1991
table 6: English original texts and their translations into Portuguese in the COMPARA subset (359,281 English words; 350,607 Portuguese words).
Corpus ID Author
Translator
PAJA1
Agualusa, José Eduardo
Zenith, Richard
PAJA2
Title
Title (trans.)
A Feira dos Assombrados
Shadow Town
Place of pub./Publisher
Place of pub./ Publisher (trans.)
Lisbon: Vega
Prague: Trafika
Year of pub.
Year of pub. (trans.)
1992
1994
Lisbon: Vega
1990
PBCB1
Buarque, Chico
Landers, Clifford
Benjamim
Benjamin
1995
1997
PBCB2
Buarque, Chico
Bush, Peter
Estorvo
Turbulence
1991
1992
PBJS1
Soares, Jô
Landers, Clifford
1995
1997
PBMR1
Rey, Marcos
Landers, Clifford
Memórias de um Gigolô
Memoirs of a Gigolo
1986
1987
PBPC1
Coelho, Paulo
Clarke, Alan
O alquimista
The Alchemist
1988
1993
PBPC2
Coelho, Paulo
Clarke, Alan
O Diário de um Mago
The Pilgrimage: a contemporary quest for ancient wisdom
Rio de Janeiro: Rocco
New York: HarperCollins
1987
1992
PBPC3
Coelho, Paulo
Landers, Clifford
O Monte Cinco
The Fifth Mountain
1996
1998
PBPM1
Melo, Patrícia
Landers, Clifford
O elogio da mentira
In praise of lies
1998
1999
PBPM2
Melo, Patrícia
Landers, Clifford
O Matador
The Killer
1995
1998
PBRF1
Fonseca, Rubem
Landers, Clifford
1988
1997
PBRF2
Fonseca, Rubem
Watson, Ellen
A Grande Arte
High Art
PMMC1
Couto, Mia
Brookshaw, David
Vozes Anoitecidas
Voices Made Night
1987
1990
PMMC2
Couto, Mia
Brookshaw, David
1990
1993
Fitton, Mary
Lisbon: Edições O Jornal,
Publicações Projornal, Lda.
London: John M. Dent
1983
PPJS1
Sena, Jorge de
Byrne, John
Sinais de Fogo
Signs of Fire
1978
1999
PPJSA1
Saramago, José
Pontiero, Giovanni
Ensaio Sobre a Cegueira
Blindness
Lisbon: Caminho
London: Harvill Press
1995
1997
PPJSA2
Saramago, José
Pontiero, Giovanni
A História do Cerco de Lisboa
The History of the Siege of Lisbon
Lisbon: Caminho
London: Harvill Press
1989
1996
PPLJ1
Jorge, Lídia
Costa, Natália and Ronald W. Sousa
A Costa dos Murmúrios
The Murmuring Coast
PPMC1
1986
1994
1997
table 7: Portuguese original texts and their translations into English in the COMPARA subset (369,203 Portuguese words; 412,704 English words)
references
Altenberg, Bengt. 1999. Adverbial connectors in English and Swedish: Semantic and lexical correspondences. In Hilde Hasselgård & Signe Oksefjell (eds.), Out of Corpora: Studies in Honour of Stig Johansson, 249-268. Rodopi.
Ebeling, Signe Oksefjell. 2014. Does corpus size matter? Revisiting ENPC case studies with an extended version of the corpus. Paper presented at Languages in Contrast - A symposium in celebration of the 20th anniversary of the Nordic Parallel Corpus project, Lund, 5 December.
Frankenberg-Garcia, Ana & Diana Santos. 2003. Introducing COMPARA: the Portuguese-English Parallel Corpus. In Federico Zanettin, Silvia Bernardini & Dominic Stewart (eds.), Corpora in Translator Education, 71-87. St. Jerome.
Gellerstam, Martin. 1986. Translationese in Swedish novels translated from English. In Lars Wollin & Hans Lindquist (eds.), Translation Studies in Scandinavia, 88-95. CWK Gleerup.
Hasselgård, Hilde. 2011. Loving and hating in English and Norwegian speech. Paper presented at the Jan Svartvik Birthday Symposium, Lund, 19 August.
Johansson, Stig. 1998. Loving and hating in English and Norwegian: A corpus-based contrastive study. In Dorte Albrechtsen, Birgit Henriksen, Inger M. Mees & Erik Poulsen (eds.), Perspectives on Foreign and Second Language Pedagogy, 93-103. Odense University Press.
Johansson, Stig. 2007. Seeing through Multilingual Corpora: On the Use of Corpora in Contrastive Studies, vol. 26 Studies in Corpus Linguistics. John Benjamins.
Johansson, Stig & Knut Hofland. 1994. Towards an English-Norwegian Parallel Corpus. In Udo Fries, Gunnel Tottie & Peter Schneider (eds.), Creating and Using
contacts
Signe Oksefjell Ebeling
University of Oslo
s.o.ebeling@ilos.uio.no
Simões, Barreiro, Santos, Sousa-Silva & Tagnin (eds.) Linguística, Informática e Tradução: Mundos
que se Cruzam, Oslo Studies in Language 7(1), 2015. 457-470. (ISSN 1890-9639 / ISBN 978-82-91398-12-9)
http://www.journals.uio.no/osla
abstract
In this article, we discuss a long-debated problem concerning the aspectual nature
of certain predications, classified as Activities and Accomplishments
(Vendler 1957, and others). The problem has already been raised informally
by several authors, who pointed out the complexity of Accomplishments,
but only more recently have there been attempts at a formalization that would explain the
alternation between these aspectual types, which is triggered by the denotational
properties of one of the arguments of certain verbs.
Taking into account some data from European Portuguese, we propose that verbs may carry lexical information that is relevant for determining the
presence or absence of telicity in the predications in which they occur. Thus,
certain verbal features constrain the aspectual composition of the predication, but
there are cases in which the aspectual profile is defined by the compositional process involved, since the verb is not marked with those features.
In this work, only the contribution of certain internal arguments was considered, taking into account their denotational nature (cumulative / non-cumulative).
We further propose that, in cases where verbs are not lexically marked with the features mentioned above, the predication cannot be classified from the outset as an Activity or an Accomplishment.
[1] introduction
In many aspectual classification proposals, Accomplishments are considered a particularly problematic class (cf. Verkuyl 1972; Mourelatos 1978; Bach 1986; Tenny
1987, among many others), as this class raises several problems not only from a
theoretical point of view but also from a data analysis point of view. Although
there have been earlier proposals on how to formalize their semantics
(cf. Verkuyl 1993), particular attention has been paid to Accomplishments recently (cf. Rothstein 2004, 2012; Piñón 2006, among others).
From a theoretical point of view, it should be pointed out that, in the majority
of aspectual class proposals, the class of Accomplishments[1] presents the greater
[1] The term accomplishment is used originally in Vendler (1957) in a proposal describing the different
types of situations based on Aristotle and Kenny (1963). There are, however, other proposals, like Mourelatos (1978), Bach (1986), Moens (1987), Smith (1991). These proposals are all based on Vendler's classification. For a different proposal built, according to the author, specifically for Portuguese, see Santos
(1996).
John walked a mile/ to the park (in an hour) (Dowty 1979, pg. 60)
Moreover, Dowty (1979) considers that any Activity verb can behave, in the right
contexts, as an Accomplishment and that some verbs classified as Accomplishments
can be classified as Activities when the direct object is an indefinite plural or a
mass noun, as was already pointed out in Mourelatos (1978, pg. 427).
These considerations lead to the question of how to classify these verbs (see
also Verkuyl 1993, who elaborates on the ideas put forward in Verkuyl 1972) and
moreover how to establish a relation between two predications like (2-a), classified as an Activity (an atelic event), and (2-b), classified as an Accomplishment (a
telic event). Putting it in another way, would it be the case that the verb beber
(to drink) projects an eventuality of the type Activity which is subsequently commuted to an Accomplishment via a quantized direct object (see Krifka 1992, 1998),
or would it be the case that the same verb projects an Accomplishment which is
commuted to an Activity via a cumulative direct object?
(2) a. O Rui bebeu água
       The-Rui drank water
       'Rui drank water'
    b. O Rui bebeu um copo de água
       The-Rui drank a glass of water
       'Rui drank a glass of water'
A way to avoid this problem is to assume that the aspectual classes are defined
at verb phrase level and not at verb level, so that (2-a) would be a basic Activity
while (2-b) would be a basic Accomplishment (cf. de Swart 1998; Rothstein 2004
among others). Nevertheless, this does not explain the relation between the two
sentences and it does not explain either why we do not see a parallel behaviour
with other types of verbs where the contrast cumulative/quantized direct objects
does not trigger aspectual shift (cf. Rothstein 2012). This can be illustrated by
(3), where the contrast between quantized and cumulative direct object o carrinho/areia (the cart/ sand) does not produce any aspectual change:
(3) a. O Rui empurrou o carrinho (*em 5 minutos/ durante 5 minutos)[2]
       The-Rui pushed the cart (in 5 minutes/ for 5 minutes)
       'Rui pushed the cart (*in 5 minutes/ for 5 minutes)'
    b. O Rui empurrou areia (*em 5 minutos/durante 5 minutos)
       The-Rui pushed sand (in 5 minutes/for 5 minutes)
       'Rui pushed sand (*in 5 minutes/for 5 minutes)'
Moreover, the simple assumption that aspectual classes are defined at verb
phrase level does not explain why the quantized/cumulative direct object alternation does not give rise to aspectual shifts involving other aspectual classes. This
can be illustrated in (4), where (4-a) and (4-b) are states, irrespective of the direct
object being uma mulher (quantized direct object) or poesia moderna (cumulative
direct object), and (4-c) and (4-d) are degree achievements (cf. Dowty 1979; Hay
[2] We use the standard written symbol * to mark the ungrammaticality of the examples, and # to point
out that the example is acceptable but does not exhibit the relevant interpretation.
(4) a. O João adorou uma mulher
       The-João adored a woman
       'João adored a woman.'
    b. O João adorou poesia moderna
       The-João adored modern poetry
       'João adored modern poetry.'
    c. O João aqueceu um prato de sopa.
       The-João warmed up a bowl of soup
       'João warmed up a bowl of soup.'
    d. O João aqueceu sopa.
       The-João warmed up soup
       'João warmed up some soup.'
[2] Filip (1999) and Piñón (2006) proposals and EP data
Filip (1999) puts forward a proposal for solving this problem.[3] According to her,
the verbs with an incremental theme argument belong to a particular type of
eventuality, the incremental eventuality. This type of eventuality is of a lexical nature
in the sense that this classification is ascribed to a verb as a non-saturated predicate, that is, a predicate with only variables in its argument positions. In Filip's
proposal verbs can be classified as [- quantized] or [+ quantized], the former corresponding
to States and Activities and the latter to the other events. However,
the incremental eventualities are based on a verb specified as [ quantized],
that is, this kind of verb is specified with an indeterminate value for quantization. So, these predicates would be telic or atelic according to the quantized or
cumulative nature of their incremental theme argument, or any other incremental argument satisfying a homomorphism to the argument event.
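As a rough illustration of the mechanism just described, the Python sketch below encodes a verb's lexical quantization value and, for verbs left underspecified, lets the incremental theme decide. All names (`VerbQuantization`, `predicted_telic`, the `BEBER` entry) are assumptions made for this illustration, and reading [+ quantized] as telic and [- quantized] as atelic is a simplification adopted here; this is not Filip's formalization.

```python
from enum import Enum


class VerbQuantization(Enum):
    """Lexical quantization value of a verb (labels chosen for this sketch)."""
    MINUS = "[- quantized]"       # States and Activities
    PLUS = "[+ quantized]"        # the other events
    UNSPECIFIED = "unspecified"   # verbs heading incremental eventualities


def predicted_telic(verb: VerbQuantization, theme_is_quantized: bool) -> bool:
    """Predict whether a predication is telic under the simplified reading.

    Assumption of this sketch: a lexically specified verb fixes the value,
    and only an underspecified verb consults its incremental theme.
    """
    if verb is VerbQuantization.PLUS:
        return True
    if verb is VerbQuantization.MINUS:
        return False
    # Underspecified verb: a quantized theme yields a telic predication,
    # a cumulative theme an atelic one.
    return theme_is_quantized


# Hypothetical lexicon entry: beber 'to drink' treated as underspecified.
BEBER = VerbQuantization.UNSPECIFIED

print(predicted_telic(BEBER, theme_is_quantized=False))  # beber água
print(predicted_telic(BEBER, theme_is_quantized=True))   # beber um copo de água
```

The sketch only captures the compositional step; it has no way of representing a predication that is quantized yet atelic.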
This proposal, based on the notion of quantization, faces some problems when
we look at some EP data. In Filip's (1999) proposal the incremental eventualities
are related to the property of quantization, but a sentence like (5), for instance,
denotes a quantized predicate (as there is no proper part of vaguear até à praia
(wander up to the beach) that is itself vaguear até à praia) but it is not telic, as we can see
from the compatibility test with temporal adverbials.
[3] For different perspectives or proposals, see Mourelatos (1978); Declerck (1979); Carlson (1981); Tenny
(1987); Dowty (1991); Depraetere (1995); Ramchand (1997); Krifka (1998), among others.
O rapaz vagueou até à praia
Another problem is the existence of verbs that project eventualities of incremental type, but the alternation of the quantized/cumulative status of the incremental theme argument does not cause any change in the telicity of the predication,
as in (6).
(6)
a.
b.
Examples like (6) show that, in incremental eventualities, it is not always possible
to associate the quantization of the incremental theme to telicity and its cumulativity to atelicity.
A similar idea, namely that there are not just two classes of durative events (Activities and Accomplishments) but possibly three, is also developed in Piñón
(2006). Based on data from Hungarian, Piñón proposes a division of Accomplishments into Strong Accomplishments and Weak Accomplishments. The former are incompatible with bare plural direct objects and give rise to two readings (presuppositional and scalar) when they occur in the scope of operators like almost. The
latter are compatible with bare plural direct objects and give rise only to
the presuppositional reading when they occur in the scope of operators like almost, being in this respect similar to Activities. However, the data from EP do
not confirm this kind of division, as there is no restriction on the type of direct
object, unlike in Hungarian, as we can see by the contrast between (7-a)
and (8-a) where the first one admits two possible interpretations (as shown in (7a) and (7-a)) but the second one only admits one interpretation (see (8-a)). On
the other hand, when a verb combines with a bare plural (cf. (8-a)), the test with
almost shows only the presuppositional reading (similar to Hungarian), but the
predication is not telic, since it does not combine with in x time, but with for x time
only (cf. (8-b)).
a.
(8)
a.
[3] accomplishable activities
In order to find a way towards solving this puzzle, we propose[4] that verbs do carry
some information concerning the telicity of the predications they project. And
we consider telicity as the property of the predications that denote eventualities
having a set terminal point and a consequent state (cf. Garey 1957; Moens 1987,
among others)[5] associated with it.
In other words, there are eventualities whose final boundaries can only be set
in an arbitrary way, since these eventualities can extend in time indefinitely. But
there are also eventualities whose final boundaries are an intrinsic characteristic
of their aspectual profile. In this case, if that final boundary is not achieved, then
the predication is not appropriate for describing it.
[4]
[5]
We can see, in (9), that the predication is compatible only with the adverbial
for x time and not with in x time, regardless of the occurrence of a prepositional
phrase with the semantic role of Goal (até à praia), which usually favours a telic
reading of predications with movement verbs (cf. Krifka 1998; Rothstein 2004; Zwarts
2005, among others). In other words, in the inner aspect (verb and its arguments), as
much as in the outer aspect (with certain non-argument expressions), the predication is
atelic, that is, it is an Activity, and this is related, according to our proposal, to
the fact that the verb carries lexical information that imposes atelicity on the
predication. Thus a predication with this kind of verb will be classified as an Activity.
On the other hand, there are verbs that are lexically marked as [+ telic],⁶ i.e.,
verbs that carry telicity information in the lexicon, which implies that the
predications in which they occur cannot be compositionally defined as atelic. For this
reason, when these verbs occur with atelicity triggers, such as cumulative nouns in
argument position, it is not the case that predications
[6] We are assuming a point of view similar to that of Engelberg (2002), who claims, on the
basis of German data, that a certain type of verb, such as promovieren (to do a Ph.D) or
dinieren (to dine), comes from the lexicon as a quantized predicate, contrary to other
authors, such as Krifka (1998), who claims that, from a strictly lexical point of view,
all verbs are cumulative predicates.
a.
b.
Almoçar (to have lunch/to lunch) is a verb that is lexically marked as [+ telic].
So, the predication this verb projects must also be telic. In other words, a predication
with almoçar (to have lunch) is, as far as the inner aspect is concerned, an
Accomplishment. Thus, in (10-a), the occurrence, as a direct object, of the cumulative
noun sopa (soup) does not interfere with the telicity of the predication (which remains
telic), as we can verify by the occurrence of the adverbial in x time. The
occurrence of an adverbial such as for x time, as in (10-b), does not shift the aspectual
profile of the predication and, as a consequence, the predications in (10) correspond
to Accomplishments, irrespective of the adverbials.
Finally, there are verbs lexically specified as [± telic], which means that these
verbs are lexically underspecified with respect to telicity. It is in these cases
that the internal arguments of the verbs partially determine the aspectual profile
of the predications. See (11).
(11)
a.
[7] A referee considered examples (10) ungrammatical. However, the Web has several examples
with this kind of combination. See, for instance, the following ones.
(a) a Dona Constança também acredita que uns dias antes o Sócrates almoçou durante 3h com
Pinto Monteiro para falar de livros
'Dona Constança also believes that a few days earlier Sócrates had lunch for 3 hours with
Pinto Monteiro to talk about books'
(http://www.tvi24.iol.pt/opiniao/constanca-cunha-e-sa/entrevista-a-tvi-prenuncia-nova-estrategia-de-defesa-de-socrates)
(b)
We can see in (11) that the occurrence of a non-count bare noun in the direct
object position determines the atelicity of the predication, while the occurrence
of a measure function such as um copo de (a glass of) determines the telicity of the
predication, as is confirmed by the different possibilities of combination with the
adverbials in x time and for x time.
The same happens with verbs of movement plus a Goal prepositional phrase,
as in (12).
(12)
a. O atleta correu para a meta (em 10 minutos/#durante 10 minutos)
   The athlete ran to the finish line (in 10 minutes/for 10 minutes)
   The athlete ran to the finish line (in 10 minutes/for 10 minutes)
b. O atleta transportou a tocha para o estádio (em 2 h./#durante 2 h.)
   The athlete carried the torch to the stadium (in 2 h./for 2 h.)
   The athlete carried the torch to the stadium (in 2 h./for 2 h.)
However, the class of [± telic] verbs does not seem to be uniform. Instead, it
seems that there is a scale of (a)telicity. For instance, verbs like beber (to drink)
or correr (to run) seem to be more telic than verbs like discutir (to discuss) or
estudar (to study), since the latter, but not the former, allow not only telic readings
but also atelic ones when the direct object is a quantized predicate. In fact,
when the internal argument of these verbs is realized as a quantized predicate,
an Activity reading and an Accomplishment reading of the verbal predicate are
both possible,⁸ in contrast with the majority of the previous cases, where
an Accomplishment reading is usually mandatory. In these circumstances, these
[8] The difference between these two readings seems to rely on some notion of completeness
that is related to telicity. For instance, Rothstein (2008) argues that the telic/atelic
distinction bears on the denotation of verbal predicates: telic predicates denote sets of
atomic entities, whereas atelic predicates denote sets of non-atomic entities. The
difference between these sets depends on the existence of criteria for what counts as one
entity. If we say that the deputies discussed the law in 2 hours, this means that the
discussion came to an end, i.e., the discussion had a predetermined procedure that was
completed, and this procedure defines what counts as one event of discussing the law. This
interpretation does not arise with the for x time adverbial. This is very different from
what Dahl (1981) suggests for the relation between the P property and the T property, that
is, the relation between telic/atelic and bounded/unbounded. For a discussion of this
latter proposal, see also Depraetere (1995).
Os deputados discutiram a lei (durante 2 horas/em 2 horas)
The deputies discussed the law (for 2 hours/in 2 hours)
The deputies discussed the law (for 2 hours/in 2 hours)
[9]
[10]
[11]
To sum up, we propose that [+ telic], [- telic] and [± telic] are lexical verbal
features. Furthermore, these features differ with respect to aspectual composition:
[+ telic] and [- telic] determine the telicity of the basic predication, irrespective
of the nature of the arguments, whereas [± telic] allows the telicity of the
predication to be determined during the derivation process. In this case, the
quantized/cumulative properties of the homomorphic argument are aspectually relevant.
According to this proposal, based on EP data, the telicity of the predication is not
determined at V level in all cases (cf., for instance, Tenny 1994), nor solely at VP
level (cf., for instance, Rothstein 2012). It is possible that languages diverge in the
way they compute telicity (cf., for instance, Filip & Rothstein 2006, for Slavic
languages).
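As an informal illustration only, the composition rule just summarized can be sketched in
Python. This is a minimal sketch under stated assumptions: the names VerbTelicity,
ArgumentType, LEXICON and basic_predication_is_telic are hypothetical labels introduced
here, not part of the proposal, and the lexical value given to beber follows the
discussion of underspecified verbs above. The sketch models inner aspect only and
deliberately ignores the scale of (a)telicity noted for verbs like discutir and estudar,
which also allow an Activity reading with a quantized object.

```python
from enum import Enum


class VerbTelicity(Enum):
    """Lexical telicity feature carried by the verb (hypothetical encoding)."""
    TELIC = "+telic"            # telicity fixed in the lexicon
    ATELIC = "-telic"           # atelicity fixed in the lexicon
    UNDERSPECIFIED = "±telic"   # value decided during the derivation


class ArgumentType(Enum):
    """Denotational property of the homomorphic argument."""
    QUANTIZED = "quantized"     # e.g. a measure phrase such as 'um copo de ...'
    CUMULATIVE = "cumulative"   # e.g. a bare non-count noun such as 'sopa'


# Toy lexicon, assumed for illustration from the verbs discussed in the text.
LEXICON = {
    "almoçar": VerbTelicity.TELIC,
    "beber": VerbTelicity.UNDERSPECIFIED,
}


def basic_predication_is_telic(verb: VerbTelicity, argument: ArgumentType) -> bool:
    """Return True if the basic predication (inner aspect) is telic."""
    if verb is VerbTelicity.TELIC:
        # The argument cannot make the predication atelic.
        return True
    if verb is VerbTelicity.ATELIC:
        # The argument cannot make the predication telic.
        return False
    # Underspecified verbs: the homomorphic argument decides.
    return argument is ArgumentType.QUANTIZED
```

On this sketch, almoçar with the cumulative object sopa still yields a telic predication,
whereas beber yields a telic predication only when its object is quantized.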
[4] final remarks
The problem we concentrated on was the aspectual status of some predications, in
particular with regard to telicity. These predications can be classified differently
according to the quantized/cumulative nature of one of their arguments. That is, they
can be classified as Accomplishments or Activities in their inner aspect. This is a
long-standing debate, as we pointed out when mentioning some of the most relevant
literature. The two proposals that we briefly discussed (Filip 1999; Piñón 2006) do not
seem to solve some of the problems raised by the EP data presented.
We then proposed that verbs carry some aspectual information concerning the telicity
of the predication they project. So, based on EP data, we suggest that there are three
possible values: [+ telic], [- telic] and [± telic]. The first two determine the
telicity or atelicity of the predication irrespective of the nature of the arguments.
The last one does not. In this case, the telicity of the predication will rely on other
elements. We only discussed cases where an argument establishes a homomorphic relation
to the event. When this relation holds, the argument determines whether the predication
is telic or atelic, depending on its denotational properties.
We also propose that, when the verbs are [± telic] and none of the relevant arguments
is realized, the predication will be atelic and consequently it is classified
as an Activity. So, when the predications projected by [± telic] verbs are not
saturated, they are in fact neither Accomplishments nor Activities, but Accomplishable
Activities, i.e., Activities that can have a culmination.
acknowledgments
This paper is dedicated to Belinda Maia, whose activities and accomplishments are
of great importance.
CLUP is supported by FCT, PEst-OE/LIN/UI0022/2014.
references
Bach, Emmon. 1986. The Algebra of Events. Linguistics and Philosophy 9. 5–16.
Carlson, Lauri. 1981. Aspect and Quantification. In Philip Tedeschi & Annie Zaenen
(eds.), Syntax and Semantics, vol. 14, chap. 3, 31–64. Academic Press.
Dahl, Östen. 1981. On the Definition of the Telic-Atelic (Bounded-Nonbounded)
Distinction. In Philip Tedeschi & Annie Zaenen (eds.), Syntax and Semantics,
vol. 14, chap. 5, 79–90. Academic Press.
Declerck, Renaat. 1979. Aspect and the bounded/unbounded (telic/atelic) distinction.
Linguistics 17. 761–794.
Depraetere, Ilse. 1995. On the necessity of distinguishing between
(un)boundedness and (a)telicity. Linguistics and Philosophy 18(1). 1–19.
Dowty, David (ed.). 1979. Word Meaning and Montague Grammar. The Semantics of
Verbs and Times in Generative Semantics and in Montague's PTQ. Reidel.
Dowty, David. 1991. Thematic Proto-Roles and Argument Selection. Language
67(3). 547–619.
Engelberg, Stefan. 2002. Intransitive accomplishments and the lexicon: the role
of implicit arguments, definiteness and reflexivity in aspectual composition.
Journal of Semantics 19. 369–416.
Filip, Hana (ed.). 1999. Aspect, Eventuality Types and Nominal Reference. Garland
Publishing Inc.
Filip, Hana & Susan Rothstein. 2006. Telicity as a semantic parameter. In James
Lavine, Steven Franks, Mila Tasseva-Kurktchieva & Hana Filip (eds.), Formal
Approaches to Slavic Linguistics, vol. 14, 139–156. Ann Arbor.
Garey, Howard. 1957. Verbal Aspect in French. Language 33(2). 91–110.
Hay, Jen, Christopher Kennedy & Beth Levin. 1999. Scalar structure underlies
telicity in degree achievements. In Tanya Matthews & Devon Strolovitch
(eds.), Proceedings of SALT 9, 127–144.
Kennedy, Christopher & Beth Levin. 2008. Measure of Change: The Adjectival
Core of Degree Achievements. In Louise McNally & Christopher Kennedy (eds.),
Adjectives and Adverbs: Syntax, Semantics and Discourse, chap. 7, 156–182. Oxford
University Press.
Kenny, Anthony (ed.). 1963. Action, Emotion and Will. Humanities Press.
contacts
Fátima Oliveira
Faculdade de Letras da Universidade do Porto
moliv@letras.up.pt
António Leal
Faculdade de Letras da Universidade do Porto
jleal@letras.up.pt