09:00 - 09:30 Opening by Elia YUSTE, Workshop Chair
09:30 - 10:00 Silvia HANSEN & Elke TEICH, Computational Linguistics and Translation
and Interpreting Departments (respectively), Saarland University,
Saarbrücken, Germany
The creation and exploitation of a translation reference corpus
15:00 - 15:30 Elia YUSTE, Centre for Computational Linguistics, University of Zurich,
Language Resources and the Language Professional
16:00 - 16:40 Natalie KÜBLER, Intercultural Centre for Studies in Lexicology, University
Paris 7, France
Creating a Term Base to Customize an MT System: Reusability of Resources
and Tools from the Translator's Point of View
16:40 - 17:00 Afternoon coffee break
The creation and exploitation of a translation reference corpus
While in many branches of linguistics monolingual reference corpora are widely used, in translation research as well as translation
practice the concept of a translation reference corpus has not yet assumed a similarly important role. In this paper, we present the
design of a German-English and French-English translation corpus and explore its use as a reference corpus for translatologists as
well as translators. First, we introduce the basic computational techniques needed to build such a translation reference corpus,
covering the preparation of the corpus as well as its linguistic annotation. Second, discussing some typical translation problems that
occur in English-German and English-French translations, we show how the corpus can be queried making use of the linguistic
1. Introduction
In the last decade or so natural language corpora have assumed an increasingly important role in
descriptive linguistics. Not only are they employed to inform lexicologists, lexicographers and
grammarians in the construction of dictionaries and grammars, but they are also gaining importance as
works of reference for linguists more generally. There are many corpora, especially for English (e.g.,
BNC1, ICE2, Bank of English3), that have been made accessible via the Internet with special user
interfaces which allow one to query a corpus by means of KWIC concordances.
Also in translation research, corpora have started to become acknowledged as an important source of
information in the investigation of theoretical issues in translatology, such as the question about the
status of translations as a special kind of text with specific, possibly universal, properties. Here, the
typical corpus is a parallel corpus consisting of two subcorpora, one containing source language (SL)
original texts and the other containing translations of those texts into a target language (TL), where SL
and TL texts are aligned (e.g., the Chemnitz corpora4). Some researchers advocate a three-way corpus
design, where original texts in the TL are included as well (e.g., the Oslo corpora5 as well as the work
carried out at Saarbrücken (Teich & Hansen, 2001; Teich, 2001)), the latter being called a comparable
corpus (cf. Baker, 1995; 1996). Also in translation practice, parallel corpora are increasingly being used
in the form of translation memories. The compilation of such translation memories is supported by
translation corpus workbenches. Thus, parallel corpora assume an increasingly important role both in
theory and practice.
In this paper we explore the role of translation corpora as works of reference for translatologists as
well as translators. It seems to us that there is a lack of interaction between the developers of corpus
tools and researchers and practitioners in the field of translation. The goal of the present paper is to
initiate such an exchange. We proceed in the following way. First, we discuss the basic computational
techniques needed to make a corpus usable as a translation reference corpus (Section 2). We show how a
corpus needs to be prepared (alignment, encoding) and how it should be enriched with linguistic
information, so that it becomes possible to pose queries to it that are interesting and relevant from a
translation point of view. Second, we show how a translation corpus can be queried with a parallel
concordancing tool. We illustrate the use of an English-German-French translation reference corpus for
solving some typical translation problems that occur in translating from English into German and from
English into French (Section 3). Section 4 concludes the paper with a summary and some issues for
future work.

2. Computational techniques
Corpus preparation. For the creation of a translation reference corpus, a parallel corpus needs to be
aligned. For this purpose, an alignment program must be applied. One such program is Déjà Vu (Atril,
2000). Figure 1 shows a German SL and an English TL text aligned with this tool.
Figure 1: Multilingual corpus alignment
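A sentence-aligned text of this kind amounts to a sequence of source/target sentence pairs. The following is a minimal sketch of reading such pairs in Python. It assumes, purely for illustration, a plain tab-separated layout with one pair per line; the function name `read_aligned_pairs` and the sample data layout are hypothetical and do not describe the alignment tool's actual export.

```python
import csv
import io

# Hypothetical sample: one aligned pair per line, source sentence and
# target sentence separated by a tab (layout assumed for illustration).
SAMPLE = (
    "Als Kurt Lukas erwachte, lagen das Messer und vier Münzen in seinem Schoß.\t"
    "Kurt Lukas awoke to find the knife and four coins on his lap.\n"
    "Er blinzelte in ein Licht.\t"
    "He blinked, dazzled by a beam of light.\n"
)


def read_aligned_pairs(handle):
    """Return a list of (source, target) sentence pairs from tab-separated text."""
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


pairs = read_aligned_pairs(io.StringIO(SAMPLE))
for source, target in pairs:
    print(source, "=>", target)
```

Keeping each pair as a tuple makes it straightforward to attach further information (for example, tags per token) to either side at a later stage.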
Déjà Vu aligns a text and its translation on a sentence basis, storing the aligned texts in one file or
in two separate files depending on the requirements of the query tool used in later stages of analysis.
Files can be exported to translation workbenches and to Microsoft Excel and Access. Figure 2 shows
Déjà Vu output in a TSV (tab-separated values) format.

Als Kurt Lukas erwachte, lagen das Messer und vier Münzen in seinem Schoß.    Kurt Lukas awoke to find the knife and four coins on his lap.
Er blinzelte in ein Licht.    He blinked, dazzled by a beam of light.
'Ich bin es, Homobono Narciso' - der Polizeichef stand an seinen Jeep gelehnt -, 'fast hätte ich Sie überfahren. Sie liegen unglücklich da.'    'It's me, Homobono Narciso.' The chief of police was leaning against his jeep.
Er half Kurt Lukas auf die Beine, Messer und Münzen fielen herunter, Narciso hob sie auf.    The knife and the coins fell to the ground when he helped Kurt Lukas up.

Figure 2: Déjà Vu alignment format

Also, we encode each text of the corpus in terms of a header that provides meta-information such as
title, author, publication, translator, etc. as well as text type/register information (domain, tenor and
mode of discourse). This is important to enable corpus queries according to register or other
independent variables.
Text files are encoded in XML using a modified version of the Text Encoding Initiative (TEI)
standard6 (a short header including meta-information is illustrated in Figure 3) and employing a
standard XML editor (here: XML Spy7). The text body is annotated for headings, sentences, paragraphs,
etc.

<subcorpus>fiction (trans_en)</subcorpus>
<name>J. M. Brownjohn</name>
<name>Bodo Kirchhoff</name>
<encodingDesc>Modified TEI</encodingDesc>
</teiHeader>
<body> </body>
</text>

Corpus annotation. A translation reference corpus should at least be annotated with part-of-speech and
syntactic information. Part-of-speech tagging is carried out fully automatically, using either a
rule-based or a statistical approach, with statistical approaches prevailing recently. For multilingual
applications, it is important that the tagger can be used for more than one language. Analyzing a corpus
in terms of syntactic structure is still a challenging task and cannot yet be carried out automatically
with satisfactory accuracy. Recently, researchers in computational linguistics who are interested in the
accurate parsing of large amounts of text have been promoting what has been called interactive parsing,
where a parser carries out a shallow parse and a human may correct or add information to the proposed
parse. For example, the parser assigns syntactic labels to the elements of a clause, but does not resolve
syntactic ambiguities of particular kinds, such as PP-attachment, leaving this to the human to deal with.
One system which combines part-of-speech tagging and shallow parsing is the ANNOTATE system
(Plaehn & Brants, 2000) under development in the TIGER8 and NEGRA9 projects. ANNOTATE uses the
TnT tagger (Brants, 2000), which can be applied multilingually and has been trained on a number of
languages, including English and German. The tag set used for English is the Susanne tag set (Sampson,
1995); the one for German is based on the Stuttgart-Tübingen tag set (Hinrichs et al., 1995).
ANNOTATE carries out an analysis of phrase categories as well as grammatical functions using a
program based on Cascaded Markov Models (CMM; Brants, 1999a, 1999b). During the interactive
annotation with ANNOTATE (see Figure 4), terminal nodes are labeled for parts-of-speech and
morphology, non-terminal nodes are labeled for phrase categories and edges are labeled for grammatical
functions.

Figure 4: Interactive annotation with ANNOTATE

The tagged and parsed corpus data are stored in the form of a relational database, but can be exported
to text
can be employed. Its query processor (CQP) allows queries for words and/or annotation tags on the
basis of regular expressions. For an example of a query executed on a parallel English-German corpus
see Figure 5.

# Query: DE_EN; passives-de = [pos="VB.*"] [] {0,1} [pos="VVN.*"];
#--------------------------------------------------------------------------------------
729: newspaper . A ferry had <been sunk> just off the island . ' I
-->de_de: In den Gewässern vor der Insel war eine Fähre gesunken .
850: country ' s future will <be decided> today . Yours too , perha
-->de_de: Zukunft des Landes entscheidet sich heute .
927: nced , because shots had <been fired> at a remote polling stati
-->de_de: Der Schriftsteller und er müßten aufbrechen , in einem

Figure 5: Sample query with CQP

3. Solving translation problems with a translation reference corpus
With a corpus annotated in the way described in the preceding section, we now have available a
translation resource that is searchable in a meaningful way. While with a raw text corpus we can only
formulate string searches, we can now make use of the annotations in querying the corpus. In the
following, we discuss some examples of translation problems between English, German and French.
The examples are taken from two genres, narrative and factual writing. For querying the corpora
selected, we use CQP (cf. Section 2).

What can be seen here is that in translations into German, the translational choice is in fact often
one-to-one, but also, past tense or present subjunctive is used. In the French parallel texts, we find
direct translations, but also passé antérieur and venir de.

English reduced relative clauses. Reduced relative clauses are a typical feature of English and French,
but not so much of German. We can thus expect translational problems from English into German. A
concordance query to a parallel corpus shows the translational options available (cf. Figure 7).

# Query: DE_EN; [pos="N.*"] [pos="VVN"];
#--------------------------------------------------------------------------------------
197: g away under tin roofs . <Carcasses suspended> from chains
-->de_de: An Ketten hängend , bluteten zuckende Rinder aus . Schweine
2180: ed behind on his own . A <crucifix reposed> on his lap in place of
-->de_de: An Stelle des Buchs lag ein Kreuz in seinem Schoß .
2833: And the mountains wore <cloud-caps frayed> at the edges by
-->de_de: Und die Berge trugen Wolkenhüte , die zur Sonne hin ausfransten .

# Query: FR_EN; [pos="N.*"] [pos="VVN"];
#--------------------------------------------------------------------------------------
1864: of him . This time , the <instrument provided> by Providence was
-->fr_fr: L ' instrument de la Providence fut cette fois un passe-temps
2812: the presence of all the <people gathered> on the Blata , and in his
-->fr_fr: ' Le cheikh Francis et le patriarche se donnèrent l ' accolade devant le peuple réuni sur la Blata , et dans son sermon , sayyedna parla

Figure 7: Parallel concordances for English reduced relative clauses

The concordance shows that for compensation a focus particle or adverb (e.g., 'gerade') can be used to
signal the syntactic focus.
4. Summary and conclusions
In this paper, we have suggested that translation corpora can assume the role of works of reference for
translators and translatologists. In order for translation corpora to serve this purpose, they need to be
enriched with linguistic information (Section 2). We have shown that some minimal linguistic
annotation (part-of-speech, shallow phrase structure) can already make a translation corpus a valuable
resource for dealing with some typical translation problems (Section 3).
While parallel concordancing tools operating on the basis of syntactic annotations already offer useful
information, there are a number of further developments that can increase the value of a translation
corpus. First, in corpus searches, it may be useful to be able to express constraints on the target
language expression as well. Only a few parallel concordance programs allow for this. Second, it could
be very useful to be able to refer to a comparable TL corpus as well for a comparison of the
translations with original TL texts. Third, for dealing with more complex kinds of translation problems,
a translation corpus should be annotated with more abstract kinds of linguistic information, e.g.,
semantic and discourse information. This requires more comprehensive annotation methods and more
sophisticated query facilities, both of which are current research issues in computational linguistics
(cf. Teich et al., 2001).
Finally, from the perspective of the developers of corpus tools, translation corpora are an invaluable
source for testing the applicability of such tools in multilingual

Hinrichs, E., H. Feldweg, M. Boyle-Hinrichs, and R. Hauser, 1995. Abschlußbericht ELWIS.
Korpusunterstützte Entwicklung lexikalischer Wissensbasen für die Computerlinguistik. Technical
report, University of Tübingen.
Plaehn, O., and T. Brants, 2000. Annotate - An Efficient Interactive Annotation Tool. In Proceedings
of the Sixth Conference on Applied Natural Language Processing (ANLP-2000). Seattle.
Sampson, G., 1995. English for the Computer. Oxford: Oxford University Press.
Teich, E., S. Hansen, and P. Fankhauser, 2001. Representing and querying multi-layer corpora. In
Proceedings of IRCS Workshop on Linguistic Databases. Philadelphia.
Teich, E., and S. Hansen, 2001. Methods and techniques for a multi-level analysis of multilingual
corpora. In Proceedings of Corpus Linguistics 2001. Lancaster.
Teich, E., 2001. Contrast and commonality in English and German system and text. A methodology
for the investigation of the contrastive-linguistic properties of translations and multilingually
comparable texts. Habilitationsschrift (submitted for publication), Saarland University.
Comparable Corpora in Translation Research: Overview of Recent Analyses
Using the Translational English Corpus
Maeve Olohan
Centre for Translation and Intercultural Studies
PO Box 88
M60 1QD
This paper discusses the use of a comparable corpus in translation research, where a comparable corpus comprises, on the one hand, a
corpus of translations and on the other hand a corpus of non-translated texts, both corpora being similar in composition, size and other
attributes. The Translational English Corpus, housed at the Centre for Translation and Intercultural Studies in Manchester, is presented
as an example of a comparable corpus used in researching translation. The rationale for using a corpus of this kind to research
translation is addressed. Results of a number of empirical analyses are then summarised, and the potential development and future
exploitation of this corpus resource are outlined.
1. Corpora and Translation Studies
According to Michael Stubbs (2001: 151), corpus linguistics is concerned with what frequently and
typically occurs, as opposed to isolated, unique instances of language: "Corpus linguistics [...]
investigates relations between frequency and typicality, and instance and norm. It aims at a theory of
the typical, on the grounds that this has to be the basis of interpreting what is attested but unusual."
The corpus-based approach to studying translation has rapidly gained in popularity over the past eight
to ten years, with a wealth of data now emerging from studies using parallel corpora, multilingual
corpora and comparable corpora. In addition, corpora, whether of the ad-hoc or the reference kind, are
proving a useful tool in the translator training classroom. Furthermore, most specialised translators
would now be lost without their translation memory system, i.e. essentially an aligned parallel corpus
of source texts and their translations.
This paper focuses on the first of these applications of corpora, namely corpora in translation
research. The special issue of Meta on this topic published in 1998 is useful for an overview of work in
this area, as is Chapter 3 of Kenny (2001). Olohan (forthcoming b) highlights some of the strengths and
limitations of corpus-based translation studies, based primarily on views put forward by Maria
Tymoczko (1998) and Ian Mason (2001). This paper therefore does not present an overview of the
literature nor does it address the criticisms levelled at corpus-based translation studies. Instead it
assumes an understanding of corpus-based translation studies as the application of corpus analysis
techniques, both quantitative and qualitative, to the study of aspects of the product and process of
translation. Built into this is the recognition that there are differing opinions as to what aspects of
translation we can apply these techniques to, and that the methodology requires refinement through
application, discussion of findings and critical assessment. This process is now being undertaken by an
ever-growing number of scholars in translation studies and it will ultimately lead to a better
understanding of the scope, significance, usefulness and appropriateness (or not) of corpora to study
translation processes and products.

2. Translation as Process and Product
The empirical study of the translation process emerged almost twenty years ago in translation studies,
following on the heels of developments in second language research. It has since involved the
identification, description and analysis of what happens during translation, i.e. of the mental steps
taken by translators between, and including, reception of the source text and production of the target
text. Introspection (in particular the think-aloud method) has been the principal methodological tool
used in investigations of the translation process, and the introspective studies carried out to date have
been largely data-based and descriptive, often focusing on specific aspects of the translation process
(e.g. use of reference material, decision-making criteria). While a number of researchers have carried
out descriptive empirical research in this area using the think-aloud method, there are methodological
difficulties with research of this nature and, as a result, these attempts to investigate the cognitive
processes at work during translation have met with scepticism from some quarters. Criticism has
focused in particular on the methodology for data elicitation and collection, including its inability to
provide access to thought processes which are subconscious or automated, but also on issues of scale
and object of investigation.
While translation process researchers have readily acknowledged the potential shortcomings of this
data elicitation method, it has been welcomed as a means of gaining some insight into something which
is otherwise not accessible to the researcher. However, an alternative approach to translation process
research is suggested by Bell (1991), who proposes that a model can and should be developed through a
combination of induction (i.e. inferring processes from the product) and deduction (i.e. using
introspective data such as diaries) (ibid.: 29). He suggests describing translation competence in terms
of generalizations based on inferences drawn from the
observation of translator performance (ibid.: 39). He proposes to observe translator performance by
analysing the translation product: by finding features in the data of the product which suggest the
existence of particular elements and systematic relations in the process (ibid.). This approach lends
support for the suggestion that the compilation and use of corpora of translations would allow us to
analyse features of translation products which can provide evidence of translation processes, both
conscious and subconscious, particularly if we can investigate relations between frequency and
typicality, and instance and norm, as advocated by Stubbs (2001: 151).

3. TEC - Translational English Corpus
TEC (Translational English Corpus) is a corpus of translated English held at the Centre for Translation
Studies in Manchester. It consists of contemporary written translations into English of texts from a
range of source languages and it was designed specifically for the purpose of studying translated texts.
There are currently just under 7 million words in the corpus, made up of full running texts falling into
four text types (fiction, biography, newspaper articles and in-flight magazines), with fiction
representing more than 80% of the total. The translations are by native speakers of English, both male
and female, and mostly date from 1983 onwards. In addition to the texts themselves, information is
held on the translator and translation process, compiled via questionnaires to translators and
publishers, and stored in header files.
One of the fundamental concepts in corpus-based translation studies has been the notion of
comparable corpus, defined by Mona Baker (1995: 234) as "two separate collections of texts in the
same language: one corpus consists of original texts in the language in question and the other consists
of translations in that language from a given source language or languages ... both corpora should
cover a similar domain, variety of language and time span, and be of comparable length". Baker's
initial groundbreaking work posited a number of features of translation which could be investigated
using comparable corpora (Baker, 1996), for example, that translations tend to be more explicit on a
number of levels than original texts, and that they simplify and normalise or standardise in a number of
ways.
Many of the empirical analyses carried out thus far have focused on the literary component of TEC,
namely fiction only, or fiction and biography. Thus, the corpus of original English put together for use
as a comparable corpus is a set of texts selected from the imaginative writing section of the British
National Corpus (BNC). It has been constructed specifically to match TEC in terms of both composition
and date of publication (from 1981 onwards). As in the case of TEC, the BNC texts are produced by
both male and female authors, all native speakers of English. Unlike TEC, however, some of the texts
in the BNC subcorpus are extracts, albeit as long as 40,000 words. This was not deemed a significant
difference in the current studies as they investigate intrasentential patterns. The Translational English
Corpus is being added to all the time, which means that successive studies present data from TEC at
different stages in its growth and the composition of the BNC subcorpus is modified accordingly.
Given that TEC and the BNC subcorpus are comparable in terms of parameters such as size and
composition, features of the language of translation identified in the corpus of translation may thus be
compared with features of non-translated language as found in the BNC subcorpus. Much of the work
with TEC carried out to date has focused on syntactic or lexical features of translated and original texts
which may provide evidence of the processes of explicitation, simplification or normalisation
mentioned previously. It is possible to catch glimpses of these processes in think-aloud protocols where
the translators are conscious of them and are employing them as part of controlled cognitive processes.
However, corpus data may provide evidence which may constitute the result of such processes
operating on a more subconscious level too.

4. Examples of Comparable Corpus Analyses
It is beyond the scope of this paper to present in detail the studies which have been carried out thus far
using TEC and a BNC subcorpus. However, the results of some recent studies are summarised here,
followed by an outline of some future directions for translation research using comparable corpora.

4.1. Optional Reporting that
The first large-scale empirical study using TEC and the BNC subcorpus indicated a substantially
heavier use of the reporting that with verbs SAY and TELL in constructions such as examples [1] to [4]
in TEC than in the BNC subcorpus, and it was suggested that this may be evidence for a tendency
towards explicitation in translated English (Olohan and Baker, 2000).

[1] He says that the ship is now forty-eight hours overdue and he wants explanations (BNC)

[2] He says the whole army is unsettled because it's known that Famagusta will never give up while it
expects a relieving ship to arrive (BNC)

[3] I told him that I didn't know who it was he wanted to speak to, but he was quite insistent that he
had seen you come in (TEC)

[4] I told him I thought it was a stupid thing for him to do (BNC)

Explicitation has long been considered a feature of translation and has been investigated by a number
of scholars (e.g. Vanderauwera, 1985, Blum-Kulka, 1986) who have identified different means or
techniques by which translators make information explicit, e.g. using supplementary explanatory
phrases, resolving source text ambiguities, making greater use of repetitions and other cohesive
devices. In general, explicitation has referred to the spelling out in the target text of information which
is only implicit in a source text. In these corpus-based studies, however, we are interested in the
making explicit in a translation of information which is less likely to be made explicit in a
non-translated text of the same language.
Scott Burnett (1999) examined the behaviour of some
forms of other verbs of this type, and Olohan (2001) looked at PROMISE, which can also take an
optional that. The same pattern of heavier use of that in TEC compared with BNC was also found in
these smaller-scale studies.

4.2. Other Optional Syntactic Features
Olohan (2001 and forthcoming a) presents a broad overview of some other optional syntactic features
in English and their occurrence in TEC and the BNC. Since the focus of the research was subconscious
processes of explicitation and their realisation in linguistic forms in translated texts, optional syntactic
features were pinpointed, based on the hypothesis that, if explicitation is genuinely an inherent feature
of translation, translated text might manifest a higher frequency of the use of optional syntactic
elements than written works in the same language, i.e. translations may render grammatical relations
more explicit more often (and perhaps in linguistic environments where there is no obvious justification
for doing so) than authors in English.
Working with untagged corpora only, the analysis focused predominantly on frequency of occurrence
of optional features and less so on the relationship between occurrence and omission. It can thus be
regarded as a first step only. However, initial findings certainly encourage more detailed examination,
suggesting for example that the use of the relative pronoun which is twice as frequent in TEC as in the
BNC subcorpus. Similarly, a study of who (in the following constructions: who is, who's, who've, who
have, who'd, who did, who had and who would) found that TEC has a significantly higher overall
occurrence of the who form. Closer investigation of the co-text, which would be required to
differentiate interrogative from relative usage, and to determine the optional vs. non-optional nature of
the relative pronoun in each case, has not yet been carried out for all of these forms. However, in the
case of who is and who's, a separation into interrogative and non-interrogative use showed that 44% of
BNC occurrences were interrogative, as opposed to only 15% of TEC occurrences.
The occurrence of the complementiser to, which is optional following HELP, was analysed (see
examples 5 and 6).

[5] You have special skills and experience which will help us to achieve our objective. (BNC)

[6] She only wished Antonia were there with her to help her think over all the things Thomas said.
(BNC)

The data showed that although the word form help is more frequent in TEC, its verbal use in both
corpora is quite similar. Of these verbal uses, the complementiser to is used in 37.5% of TEC instances,
compared with only 26% of the BNC occurrences.
The use of while preceding a gerundial, i.e. while *ing, and after preceding having + participle was
measured in both corpora. While *ing was seen to occur more than twice as often in TEC as in BNC. A
count of after *ing *ed (which obviously does not take irregularly formed past participles into account)
also shows a tendency for TEC to use this construction more frequently than BNC, although the
construction was relatively rare in both corpora.
Finally, in order may be omitted before to and may occasionally be omitted before for or that. While
the investigation of every instance of the items to, that and for to see whether an in order has been
omitted is not practical, it is possible to measure usage of in order to, in order for and in order that and
compare results from the two corpora. This investigation showed a marked difference in usage of in
order to, with 250 instances in BNC compared with 1,225 in TEC. The other forms, in order for and in
order that, were infrequent in the two corpora but both occurred more often in TEC than in the BNC
subcorpus.

4.3. Personal Pronouns
A small-scale study of the use of personal pronouns in both corpora is also presented in Olohan
(forthcoming a). Frequencies of personal pronouns occurring with verb forms will, have, am, is, has
and are, both within verb contractions and within non-contracted forms, were recorded. The data show
that, when used in conjunction with these particular verb forms, personal pronouns I, you, he, she, we
and they are more common in the BNC subcorpus than in TEC. The differences are extremely striking
in the case of I (23,409 in BNC; 16,178 in TEC), and also quite marked in the case of you, she and we.
The pronouns he and they occur with these verbs with almost the same frequency in the two corpora.

4.4. Contractions
As reported in Olohan and Baker (2000), the linguistics literature on use and omission of that with a
range of verbs indicated that omission was more likely in informal contexts. Preliminary analysis of
co-occurrence of that omission and contracted forms (as a crude measure of informality) revealed a
definite correlation in both corpora between use of contracted forms and omission of that. Thus,
despite lower incidence of contractions in TEC and higher incidence of that omission in BNC, the
likelihood of co-occurrence of a contracted form and omission of that (in the same concordance line)
was very similar in both corpora. In other words, the BNC texts were more likely to omit that and use
contractions; the TEC texts were more likely to include that and not use contractions. This correlation
suggested that contractions merited further investigation.
Further detailed analysis of all contracted forms in the corpora revealed that there are higher
occurrences and a greater variety of contracted forms in BNC than in TEC. In many cases, the number
of occurrences of a form in BNC is double that seen in TEC. (It is worth noting again at this point that
the corpora under investigation are extremely similar in terms of size and composition.) In addition,
there was a general preference for contracted forms over the corresponding long forms in BNC, while
the TEC data showed a general tendency to use the long form in preference to the contracted one. For
example, for all 's contractions (not including the possessives, thus for the following forms: it's, that's,
he's, there's, she's, what's, let's, who's, where's, here's, how's), the contracted form is significantly
more common than the long form in BNC. This is not true for TEC, where the long form is the more
frequent in 8 out of the 11 forms. In TEC, the contracted form is more frequent only for that's, what's,
and let's, but in these cases represents a smaller proportion of the
combined total occurrences of long and contracted forms than does the long form in BNC.

Splitting the analysis into verbs, we can see from Graphs 1, 2 and 3 that there is a greater incidence of contracted forms with personal pronouns in BNC than in TEC for present-tense forms of BE, HAVE and WILL.

[Graph 1: Contractions of BE in BNC and TEC, represented as percentage of combined total for contracted and long forms. Series: I'm, he's, we're, they're; x-axis: BNC, TEC; y-axis: 0 to 100.]

[Graph: Contractions of HAVE in BNC and TEC. Series: I've, you've, she's, he's, we've; y-axis: 0 to 100.]

to Biber et al.'s finding of 75% for fiction. In TEC, on the other hand it is 58%, thus considerably lower.

4.5. Dialectal features
Most of the contractions which featured in the analysis above were of verbs BE, HAVE and WILL or of the negation not. However, the BNC subcorpus had a selection of other types of contractions. Many are typical of spoken English, such as the contraction of multisyllabic modifiers e.g. actu'lly, accident'lly, contradict'ry, prob'ly, fav'rite, gen'rous. Some interjections also had contracted forms, e.g. ah'm and fuck'em, again characteristic of the spoken language, as were contractions of ing (e.g. bleed'n), and (e.g. this'n) and than (better'n). Some contractions were also clearly dialectal or sociolectal, with indicators of regional variations such as the dropped h in be'aviour, be'ind, ware'ouse, or the Scottish doesna and havna (where there is, in fact, no elision between the two words). There were 102 occurrences of 'e's in BNC (dialectal version of he's) and none at all in TEC. Finally, other forms found were d' (= do), y' (= you), th' (= thou or thy) and t' (= to or to the). All occur considerably more frequently in BNC than in TEC, e.g. y'know occurs 22 times in BNC and only once in TEC; d'you occurs 362 times in BNC, compared with 72 occurrences in TEC. The last two in particular indicate regional variation and do not occur at all in TEC; by contrast, t', representing to, to the or the occurs in front of 99 different nouns or modifiers in BNC (see examples 7 and 8), and th' occurs 137 times (see example 9).
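The measure plotted in the contraction graphs (contracted forms as a percentage of the combined total of contracted and long forms) can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the function names, the sample sentence and the simple pattern matching are assumptions made for illustration. The sketch assumes straight apostrophes in the text, and it ignores the ambiguity of forms such as he's (he is or he has), which would need tagging or manual checking.

```python
import re

def count_form(text, form):
    """Count whole-form occurrences of `form` in `text`, ignoring case."""
    # Escape each word; allow any whitespace between the words of a long form.
    body = r"\s+".join(re.escape(part) for part in form.split())
    # Lookarounds stop matches that are embedded inside a longer word.
    pattern = r"(?<!\w)" + body + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def contraction_share(text, contracted, long_form):
    """Contracted forms as a percentage of contracted + long forms.

    Returns None when neither form occurs in the text.
    """
    n_contracted = count_form(text, contracted)
    n_long = count_form(text, long_form)
    total = n_contracted + n_long
    if total == 0:
        return None
    return 100.0 * n_contracted / total

# Hypothetical sample text standing in for a corpus file.
sample = "I'm here. I am there. I'm fine. We are late."
for contracted, long_form in [("I'm", "I am"), ("we're", "we are")]:
    print(contracted, contraction_share(sample, contracted, long_form))
```

Running the same function over each subcorpus and each pronoun and verb pair would give one percentage per form and corpus, which is the kind of figure the graphs compare.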
Biber. These preliminary findings seem to indicate that TEC fiction is not as typical of fiction in English as the works of fiction in the BNC subcorpus. Furthermore, some of the results suggest that TEC fiction may exhibit features more typical of academic prose in English. If this is borne out by future investigations it may contribute to an understanding of the nature of literary translation and its reception in the British literary system. However, there are many features to be investigated in the future to shed further light on this issue.

A criticism sometimes levelled at translation scholars is that we focus too much on literary text and literary translation. One area in which this research can be broadened is to add other genres to TEC. A subcorpus of non-fictional translated works of social science, politics, history etc. would provide an interesting contrast to the fiction subcorpus. Similarly, a bigger biography component would enable useful analyses of that genre to be carried out, taking into account in particular its position somewhere on the continuum between fictional and factual writing.

One aspect of research of this kind which has not been discussed in this paper is the investigation of individual translators. Due to the design of TEC and the incorporation of more than one translation by several translators, it is possible to compare translators and their practices; for example, Baker (2000) discusses the development of a methodology for investigating the style of a literary translator and Olohan (forthcoming b) examines the contraction patterns of two well-known translators across a number of translated works. There is much scope for further research of this kind.

At a conference workshop such as LREC where the emphasis is on practical application of technology in the translation process, one might question the relevance of this kind of detailed analyses of lexical or syntactic patterns in translated language. However, if studies of this nature ultimately give us a better understanding of how translators use language, i.e. how translators translate and what (cognitive) processes are involved, it will be of relevance, not just in the teaching of translation but also in the development of effective technological resources for translators in the future.

Language Engineering, in Honour of Juan C. Sager. Amsterdam and Philadelphia, John Benjamins, 175-186.
Baker, M. (2000) Towards a Methodology for Investigating the Style of a Literary Translator, Target, 12(2): 241-266.
Bell, R. T. (1991) Translation and Translating: Theory and Practice, London and New York: Longman.
Biber, D. (1988) Variation across Speech and Writing. Cambridge: CUP.
Biber, D. (1995) Dimensions of Register Variation: A Cross-Linguistic Study. Cambridge: CUP.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999) Longman Grammar of Spoken and Written English. London: Longman.
Blum-Kulka, S. (1986) Shifts of Cohesion and Coherence in Translation, in J. House and S. Blum-Kulka (eds.) Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies. Tübingen: Gunter Narr, 17-35.
Burnett, S. (1999) A Corpus-based Study of Translational English. Manchester: unpublished MSc dissertation, UMIST.
Kenny, D. (2001) Lexis and Creativity in Translation: A Corpus-based Study. Manchester: St. Jerome.
Mason, I. (2001) Translator Behaviour and Language Usage, Hermes 26: 65-80.
Meta 43(4) (1998) Special Issue: The Corpus-based Approach,
Olohan, M. (2001) Spelling out the Optionals in Translation: A Corpus Study, UCREL Technical Papers, 13: 423-432.
Olohan, M. (forthcoming a) Leave it out! Using a Comparable Corpus to Investigate Aspects of Explicitation in Translation. In Cadernos de Tradução, Vol. VI.
Olohan, M. (forthcoming b) How Frequent are the Contractions? A Study of Contracted Forms in the Translational English Corpus.
Olohan, M. and M. Baker (2000) Reporting that in Translated English: Evidence for Subconscious Processes of Explicitation?, Across Languages and Cultures, 1(2): 141-158.
Corpora in Translation Practice
Federico Zanettin
Università per Stranieri di Perugia
Palazzo Gallenga, Piazza Fortebraccio, 4 - Perugia
The aim of this paper is to trace links between work in the corpus linguistics community and the world of practicing translators. The
relevance to translation work of corpora in general, and bilingual and parallel corpora in particular, is evaluated by comparing corpora
and translation memories and by drawing an analogy between different types of corpora and more traditional reference tools, i.e.
dictionaries. Corpus resources available to translators are placed along a cline going from robust, stable corpora (e.g. large reference
corpora such as the BNC) to virtual, ephemeral corpora (e.g. DIY web corpora). Finally, a few suggestions are put forward in order
to encourage a wider diffusion of corpora and concordancing software among professional translators.
segment. But it can also be seen as a parallel corpus which translators manually query for parallel concordances of (already translated) specific terms or patterns. Aligned translation units are conveniently displayed on the screen, offering the translator a range of similar contexts from a corpus of past translations. A translation memory is, however, a very specific type of parallel corpus in that:
a) it is proprietary: TMs are created individually or collectively around specific translation projects. They are highly specialized and very useful when used for the translation or localization of program updates (indeed that is their origin) but are not much help when starting a new translation project on a different topic or text type.
b) TMs tend to closure, to progressively standardize and restrict the range of linguistic options. This may be an advantage from the point of view of terminological consistency and of processing costs for clients or translation agency managers, but is often detrimental for readability (texts translated using a Workbench can become very repetitive) and the translator's eyesight (translators using a well-known Workbench often testify to a "yellow-and-blue-eye-syndrome").

Translation workbenches and translation memories have indeed become the most successful technological product to be created for professional translators, but, as often happens with MT products, their use is best limited to specific text types, such as online help files, manuals and all types of reference work which do not require sequential reading and for which the scope of translation can be limited to the sentence or phrase level (and thus left to a machine). When dealing with other types of texts translators are perhaps better off with a different kind of language resource, i.e. the type of corpora which are more familiar to lexicographers and linguists and which are only now beginning to enter the selection of tools available to professional and trainee translators.

3. Corpora as translation aids
The respective potential uses on the part of professional translators of monolingual target corpora, bilingual comparable corpora, and of parallel corpora can be illustrated drawing an analogy with other respected tools of the trade, i.e. dictionaries: Monolingual target corpora can be compared to monolingual target language dictionaries, and comparable source corpora to monolingual source language dictionaries. While dictionaries favor a synthetic approach to lexical meaning (via a definition), corpora offer an analytic approach (via multiple contexts).2 Translators can use target monolingual corpora alongside target monolingual dictionaries to check the meaning and usage of translation candidates in the target contexts. Like source language dictionaries, source language corpora can be consulted for source text analysis and understanding. Large reference corpora (BNC, CORIS/CODIS, etc.) can function as general dictionaries, while smaller, specialized and bilingual comparable corpora can be seen as analogous to specialized monolingual dictionaries (either or both in the source and in the target language).

2 So-called production dictionaries, which focus on usage information, can be thought of as standing somehow in between the two.

Parallel corpora can instead be compared to bilingual dictionaries, with a few important differences: bilingual dictionaries are repertoires of lexical equivalents (general dictionaries) or terms (specialized dictionaries and terminologies) established by dictionary makers which are offered as translation candidates. Parallel corpora are repertoires of strategies deployed by past translators, as well as repertoires of translation equivalents. In selecting a translation equivalent from a general bilingual dictionary a translator has to assess the appropriateness of the candidate to the new context by starting from a definition and a few usage examples. A parallel corpus will offer a repertoire of translation strategies past translators have resorted to when confronted with similar problems to the ones that have prompted a search in a parallel corpus.

Parallel corpora can provide information that bilingual dictionaries do not usually contain. They can not only offer equivalence at the word level, but also non-equivalence, i.e. cases where there is no easy equivalent for words, terms or phrases across languages. A parallel corpus can provide evidence of how actual translators have dealt with this lack of direct equivalence at word level. For example, in the translations by two different Italian translators of a number of novels by Salman Rushdie (Zanettin, 2001b), the word "edges", which usually collocates with a preposition, as in the phrases "around the edges", or "at the edges", was never translated literally, but rather omitted:
1. biting the skin around the edges of a nail
   mordicchiandosi la pelle attorno all'unghia
2. around the edges of Gibreel Farishta's head
   intorno alla testa di Gibreel Farishta
3. around the edges of the circus-ring
   intorno alla pista da circo
4. and there was a fluidity, an indistinctiness, at the edges of them
   vicinissime a loro c'erano una fluidità e un'indeterminatezza
5. the horses grew fuzzy at the edges
   i cavalli diventavano sempre più sfocati
6. blurred at the edges, my father
   con la mente annebbiata, mio padre
7. looking somewhat ragged at the edges
   con l'aria di un uomo distrutto
8. Mrs Qureishi, too, was beginning to fray at the edges
   anche Mrs Qureishi si stava consumando
In all these cases, the two professional translators have consistently chosen to resort to zero-equivalence, which, being a translation strategy rather than a case of comparative linguistic knowledge, would hardly be reported in any bilingual dictionary.

4. Corpus resources for translators
Not all dictionaries are the same, nor are all corpora. Apart from translation memories, corpus resources which are of potential use for professional translators could be classified along a scale which goes from "robust" to "virtual". A corpus is a collection of electronic texts assembled according to explicit design criteria which usually aim at representing a larger textual population. Robust corpora are ready-made corpora created and
distributed by the research community and the language industry on CD-ROM or accessible through the Internet. Prototypical examples are large reference national corpora, such as the British National Corpus (BNC) for British English, and the Dynamic Corpus of Written Italian (CORIS/CODIS) for Italian. This type of resource, which requires a large building effort, is only now becoming available to the wider public outside the (corpus) linguistics community, and will probably require some customisation effort in order to become more widespread among language services providers.

Parallel corpora are usually smaller and even less available to the general public than monolingual corpora. Their construction requires more work than that of monolingual corpora. Among other factors, text pairs (rather than single texts) have to be located, and before they can be used they need to be aligned, at least at the sentence level (cf. Véronis, 2000).

There are of course varying degrees of robustness, according to the effort and care which has been put in achieving a balanced and representative selection of texts, in providing explicit linguistic and extralinguistic information (corpus annotation) and the means (the software) to query the corpus for that information (McEnery & Wilson, 1996). Corpus design criteria also vary according to the purpose for which a corpus is built, e.g. a comparable monolingual corpus for descriptive translation research. In this sense, the less robust (i.e. the more virtual) corpora are the most truly professional type, with reference to translators, since they are rough-and-ready products created for a specific translation project. A distinction is usually made by corpus linguists between corpora and archives of electronic texts. An archive is simply a repository of electronic texts: in this sense the WWW is an immense (multimedia) text archive. Virtual or disposable corpora are created by a translator using the WWW as a source archive. The WWW and HTML documents need not be the only source for small, specialized DIY corpora, and textual archives of various types and targeted to various users (newspapers, collections of laws, encyclopedias, etc.) are available on CD-ROM. The WWW is however certainly the most familiar and user-friendly environment for translators: it is always available; it is the most comprehensive source of electronic texts, and corpus creation, management and analysis can be a relatively straightforward operation (Austermühl, 2001; Zanettin, forthcoming). Building a corpus of web pages basically involves an information retrieval operation, conducted by browsing the Internet to locate relevant and reliable documents which can then be saved locally and made into a corpus to then be analysed with the help of concordancing software. The additional time required by creating and consulting a corpus is compensated for by savings in other translation-related tasks, such as dictionary consultation (both on paper and electronic), paper documentation (often in the form of parallel texts, e.g. Williams, 1996), help from experts, and by the fact that the corpus contains information not available elsewhere. Moreover, the effort is rewarded by improved quality in terms of terminological and phraseological accuracy (Friedbichler & Friedbichler, 2000).

A number of studies have reported on experiments in translation and language teaching classes with DIY corpora, either made of disposable web pages (e.g. Varantola, 2000, forthcoming; Maia, 1997, 2000, forthcoming; Zanettin, forthcoming; Pearson, 2000) or of texts taken from other electronic sources such as newspapers (Zanettin, 2001a) or magazines (Bowker, 1998) on CD-ROM. Corpora created from sources other than web pages can require more time and effort to be built, and can be more or less disposable depending on the size of the translation project and on the resources available to create and manage them.

Reports on the use of corpora by professional translators are fewer: Friedbichler & Friedbichler, drawing on their experience as translators of medical texts and trainers of technical translators, suggest that domain-specific target language corpora may usefully complement dictionaries and the Web as resources in the translation process, filling the gap between the two. Jääskeläinen and Mauranen (2000) report on an experimental study involving a team of researchers from the University of Savonlinna and a team of professional translators translating for the timberwood industry. The researchers created a corpus from a variety of sources (web sites, PDF documents, etc.) following suggestions from the translators, and then trained them in using concordancing software (WS Tools, Scott, 1996) to analyse the corpus. In exchange, the translation team agreed to answer a questionnaire. One of the results of the study was learning that translators often complained that the user-friendliness of the concordancing software was very low. This complaint was seconded by translator trainees in other studies with disposable corpora where students, usually working in groups, collected a corpus of HTML documents and used them to help them translate a specific text.

These studies have underlined, nonetheless, the value of corpus building as a way of getting acquainted with the content and terminology of the translation. They have stressed the importance of type and topic of the text to be translated as well as of the target language (some text types, topics, and target languages are better helped with corpora than others) and also of adopting sound criteria in choosing suitable texts for inclusion in the corpus. Most of the corpora in these experiments were target monolingual corpora, though some use of bilingual comparable and even parallel corpora was reported.

The main benefits and shortcomings of DIY corpora may be summed up as follows:
Benefits:
- They are easy to make.
- They are a great resource for content information.
- They are a great resource for terminology and phraseology in restricted domains and topics.
Shortcomings:
- Not all topics, not all text types, not all languages are equally suitable or available.
- The relevance and reliability of documents to be included in the corpus needs to be carefully assessed.
- Existing concordancing software is not well equipped to handle HTML or XML files, i.e. web pages.
- There are no or few parallel corpora, since while some parallel texts (i.e. source texts + translations) can be found on the Internet, hardly all of them could be included in a parallel corpus designed to provide instances of professional standards (Maia, forthcoming).

DIY web corpora stand midway between the WWW itself, which can be used as if it were a corpus, and robust, proper
corpora. As for the Web, a quasi-concordance view of documents indexed and retrieved is provided by search engines such as Google or Copernic. Corpus linguistics-oriented software currently being constructed for browsing the WWW as a corpus, such as KWiCFinder (Fletcher, 2001) and WebConc (Kilgarriff, 2001), will certainly prove a useful tool for translators among other language professionals. However, while this "web as corpus" approach certainly has advantages in terms of time over DIY web corpora (the corpus is always already there), it necessarily loses in precision and reliability.

The advantages of robust corpora over virtual corpora can instead be summed up as follows:
- They are usually more reliable.
- They are usually larger.
- They may be enriched with linguistic and contextual information.
- If parallel, they are already aligned.
- They come with user-friendly, customised software (though, again, not necessarily targeted to the needs of professional translators).

5. Conclusions
Translators can tolerate the learning curve necessary to adopt corpora and concordancing software among their everyday working tools only if they derive benefits. These benefits lie in the fact that corpora provide information not available elsewhere at an affordable cost.

As a way of concluding, I would like to point out possible improvements for existing corpora and concordancing software:
a) Robust reference corpora need to become more accessible: for instance, a BNC license is still relatively expensive and the interrogation software might do with some customization; the CORIS/CODIS corpora and others have limited access.
b) In order for virtual corpora to become more widespread among translators, concordancing software for work with small monolingual corpora has to become capable of dealing with HTML and, increasingly, XML texts. For example, it may be useful to interface the concordancing software with the Internet browser to provide facilities for file downloading and management, and for allowing the user to switch between concordance lines and full text view, in order to take advantage of multimedia features of electronic texts.
c) Bilingual and parallel corpora are scarcely available and usually of limited size. Bilingual concordancers require bilingual corpora, and given what it takes to locate and align text pairs, it is not very likely that individual translators will resort to consulting parallel concordances unless parallel (aligned) corpora are already available. The creation of more corpora of this kind is a matter of computational resources (especially parallel concordancers and efficient aligning utilities) as well as of more awareness of the usefulness of this resource among translators and language resources providers.

Baker, M. (1993). Corpus linguistics and translation studies. Implications and applications. In M. Baker, G. Francis & E. Tognini-Bonelli (eds.) Text and technology. Philadelphia/Amsterdam: John Benjamins, 233-252.
BNC web site,
Bowker, L. (1998). Using specialized monolingual native-language corpora as a translation resource: a pilot study, in META 43:4, 631-651.
CORIS/CODIS web site,
Fletcher, W. (2001). Concordancing the web with KWiCFinder, presentation given at the Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23-25 March 2001. Available at
Friedbichler, I. & Friedbichler, M. (2000). in S. Bernardini & F. Zanettin (eds.) I corpora nella didattica della traduzione. Corpus Use and Learning to Translate, Bologna: CLUEB, 107-116.
Jääskeläinen, R. & Mauranen, A. (2000) Work Package 5: Development of a Corpus on the Timber Industry - Final Report, Project SPIRIT MLIS-programme: MLIS-3008 SPIRIT 24637, University of Joensuu, Savonlinna School of Translation Studies.
Johansson, S. (forthcoming). Reflections on corpora and their uses in cross-linguistic research, in F. Zanettin, S. Bernardini, & D. Stewart (eds.) Corpora in translator education.
Kilgarriff, A. (2001). Web as corpus. In P. Rayson, A. Wilson, T. McEnery, A. Hardie and S. Khoja (eds.) Proceedings of the Corpus Linguistics 2001 conference, UCREL Technical Papers: 13. Lancaster University, 342-344.
Maia, B. (1997). Do-it-yourself corpora ... with a little bit of help from your friends! in B. Lewandowska-Tomaszczyk & P. J. Melia (eds.) PALC '97 Practical Applications in Language Corpora. Lodz: Lodz University Press, 403-410.
Maia, B. (2000) Making corpora: A learning process, in S. Bernardini & F. Zanettin (eds.) I corpora nella didattica della traduzione. Corpus Use and Learning to Translate, Bologna: CLUEB, 47-60.
Maia, B. (forthcoming) Training translators in terminology and information retrieval using comparable and parallel corpora, in F. Zanettin, S. Bernardini & D. Stewart (eds.) Corpora in translator education.
Malmkjær, K. (forthcoming). On a pseudo-subversive use of corpora in translator training, in F. Zanettin, S. Bernardini & D. Stewart (eds.) Corpora in translator education.
McEnery, T. & Wilson, A. (1996) Corpus linguistics. Edinburgh: Edinburgh University Press.
Pearson, J. (2000). Surfing the Internet: teaching students to choose their texts wisely. In Lou Burnard and Tony McEnery (eds.) Rethinking Language Pedagogy from a Corpus Perspective. Frankfurt am Main et al: Peter Lang, 235-239.
Scott, M. (1996). Wordsmith Tools. Oxford: Oxford University Press.
Sinclair, J. McH. (1996) EAGLES Preliminary recommendations on Corpus Typology, EAG--TCWG--
BancTrad: a web interface for integrated access to parallel annotated corpora
Toni Badia, Gemma Boleda, Carme Colominas, Agnès González, Mireia Garmendia, Mart
Universitat Pompeu Fabra
Rambla 30-32 ,
E-08002 Barcelona
The goal of BancTrad is to offer the possibility to access and search through (parallel) annotated corpora via the Internet. This paper
presents the design of the whole process, from text compilation and processing to actually performing queries via the web, and also
describes its technical architecture.
The languages we work with are Catalan, Spanish, English, German and French. Queries are possible from any of these languages to
Spanish and Catalan and vice versa (but not between the language pairs formed by French, German and English). The texts go first
through a pre-processing and mark-up stage, then through linguistic analysis and are finally formatted, indexed and made ready to be
consulted. The web interface has been created through the integration of some ad hoc applications and some ready-to-use ones. It
provides three different levels of query expertise: basic, intermediate and expert.
The paper is structured as follows: section 1 gives an overview of the project; section 2 describes the text compilation process;
section 3 explains the corpora building and parsing stages; section 4 details the search machine architecture; finally, section 5
describes foreseen applications of BancTrad.
This project is running under the auspices of the Programa d'Innovació Docent (Educational Innovation Program) sponsored by our university (Universitat Pompeu Fabra) and has also been partially financed by the Spanish Government and by the 2001FI 00582 grant from the autonomous Government of Catalonia.

Figure 1: MS Word form used for the mark-up of extralinguistic features of the texts
This mark-up takes the following parameters into account:
- name of the person who introduced the aligned texts (i. a., in order to track translation quality)
- source and target languages
- original and translation references
- publication date (for both the original and the translation)
- register (colloquial, standard, learned, etc.)
- type of text (normative, descriptive, literary, etc.)
- subject matter (economy, science, politics, etc.)
- degree of specialisation (low, middle, high).

Besides these parameters, and bearing in mind that BancTrad was originally conceived as a tool with pedagogic applications, we include information on certain aspects such as idioms, metaphors, puns, degree of difficulty, etc. All of these parameters, as well as the information coded within them, were agreed upon with the teachers and researchers of the FTI. It is relevant to note that this mark-up allows us not to make a rigid classification of the texts in the corpus (see section 3).

By clicking on the Acceptar (Accept) button, the options selected in the form are marked in the text in SGML format and a script tags the paragraph structure of the document. Otherwise, this very valuable piece of information on the text structure would be lost in the alignment step.

Texts are aligned at a sentence level with the align tool of the Déjà Vu Database Maintenance, software by Atril. Déjà Vu aligns texts and allows editing in quite a user-friendly way.

The tasks described so far, although only semi-automatic, require neither special skills in computing nor much time (the time to go through them for a 400-word-long text, both source and target texts, is 5 to 10 minutes). We could have chosen to tackle the alignment task fully automatically instead, but the error rate of automatic aligners (notably errors in sentence identification) would have increased the error rate in the subsequent linguistic analysis too much. However, it should be kept in mind that, according to our architecture, the use of a particular tool for the mark-up and alignment is independent of the rest of the process, so that other tools could be used in the future.

Finally, the texts are transferred to our Linux server to proceed with the text processing, which from this moment on will be completely automatic.

3. Linguistic Processing and Corpus Building
Once the texts are in the server, they undergo two further steps: linguistic tagging and corpus formatting. Both steps are completely automatic.

3.1. Linguistic Processing
Each language follows a different tagging process. On the one hand, Catalan texts are parsed with CATCG (Badia et al. 2000), a Catalan shallow morphosyntactic parser based on a constraint grammar developed by the Computational Linguistics group at UPF. Spanish texts will be handled with a Spanish version of it in a year's time. On the other hand, the linguistic analysis for English, German and French texts is made with TreeTagger, a part-of-speech tagger developed at the IMS (see Schmid 1995, 1997). Both CATCG and TreeTagger are shallow parsers.

It is important to note that, despite the use of different tagging tools for exploiting the linguistic information of our texts, all languages receive a minimum of uniform information: lemma and POS tag (syntactic function is only there for Catalan). Thus, all the languages can be processed and queried in the same fashion, independently of the tagging tool used. This favours modularity, for the linguistic processing of a certain language can be modified without changing either the other linguistic processes or the interface. We now proceed to roughly characterize CATCG and TreeTagger.

  La noia de el port de Barcelona dorm
  the girl of the harbour of Barcelona sleeps

  <s id=1>
  La         el         Det   AFS     DN>
  noia       noi        Nom   N5-FS   Subj
  <contrac forma=del>
  de         de         Prep  P       <NA
  el         el         Det   AMS     DN>
  </contrac>
  port       port       Nom   N5-MS   <P
  de         de         Prep  P       <NA
  <enty>
  Barcelona  Barcelona  Nom   N4G6S   <P
  </enty>
  dorm       dormir     Verb  VRR2S-  VPrin
  .          .          .     .       PT
  </s>
Figure 2: Input and output of CATCG

3.1.1. CATCG
CATCG is a linguistic-based parser that assigns each word a lemma, a POS tag and a syntactic function. It uses three major devices:
a) a Perl module for the preprocessing
b) a morphological tag mapping tool that uses a word-form dictionary created with a morphological generator developed at UPF (Badia et al. 1997)
c) three grammars using the Constraint Grammar formalism developed at the University of Helsinki (Karlsson et al. 1995, Tapanainen 1996), which perform the morphosyntactic disambiguation task and the partial syntactic analysis.
Fig. 2 gives an example of the input and output of our system. The SGML tags are the result of the preprocessing, and in the example they mark a contracted form, an entity and the sentence boundaries. The columns list the linguistic information: word form, lemma, part of speech tag, complete morphological information in a compressed tag and syntactic function (in order of appearance). The last piece of information is shallow and partial in the sense that it doesn't fully indicate dependency: note that the
preposition de (from) in the PP de Barcelona gets a tag indicating that it modifies a noun to its left (<NA, left adjoining Nominal Adjunct); however, no clue is given about whether it modifies Barcelona or port.
3.1.2. TreeTagger
TreeTagger is a probabilistic tagger that uses decision trees. It provides each word with a lemma and a POS tag (at the moment, no syntactic information is given).
3.2. Corpus formatting
After being annotated, the text files are formatted and processed with the Corpus WorkBench (CWB) tools, a set of tools for exploiting linguistic information developed at the IMS in Stuttgart (Christ 1994; Christ et al. 1999). Thus we build the actual corpora, making them ready to be consulted with CQP, the Corpus Query Processor, a tool from the CWB. This tool allows very flexible and expressive queries on any of the pieces of information encoded (be it the word form, lemma, POS tag or syntactic function). In fact, as long as the corpora are given an adequate structure, one can have as many attributes as one pleases.
One of the most significant features of the CWB (to us) is that it can process aligned corpora. Not only is it possible to view the aligned sentences, but it is also possible to place restrictions on both the source and the target language in a query (see section 5). The special module that lets CQP interact with the web has also been crucial to us (see next section).
4. The search machine and the web Interface
Technically speaking, the novelty of BancTrad is the integration of several tools that make parallel annotated corpora available via the Internet. This entails that the system has to be able to (1) interpret the query made by the user, (2) search for the query, (3) present the results. For this purpose, two devices were needed: a graphical user interface (GUI) with a fill-in form and an external program interface (to allow browser/server communication)
The GUI is intended to be adaptable to the user's expertise, to have open access and to be platform independent. To achieve the last two features, an HTML-based interface seemed to be the best option. To qualify for the first one, the interface had to offer at least three search possibilities: common, intermediate and expert mode (see next section for details).
b) The external program interface
This is the module of the architecture that actually processes the query. It interprets the user's query, searches for it in the corpora and returns the result. The program that does the work is commonly called a cgi (Common Gateway Interface, a term whose original sense has been extended to mean external program interface). Our cgi is composed of the following packages:
i) Common Gateway Interface (CGI)
The CGI (properly so named) is a standard device to interface with information servers (such as HTTP servers). It passes a web user's request on to an application program and gives the resulting data back to the user. Herewith the server interprets the user's query.
ii) HTML::Entities
This formatting package ensures that special characters (tildes, cedillas, etc.) are properly transferred during the client/server session.
iii) WebCqp::Query, a web adapted version of the CQP
This package was designed by the creators of the CWB (see above) to let it interact with the web. It can perform the same kind of queries that CQP performs in its PC-Linux version. It thus allows powerful query settings through regular expressions, access to linguistic tags (through the defined number of features in the corpora) and aligned corpus querying.
5. Exploiting BancTrad
This section outlines different ways in which to exploit BancTrad, from two different but related perspectives regarding its potential users. It describes the search possibilities that BancTrad offers (section 5.1), which relate to the user's level of expertise. Besides, it sketches some possible applications for which BancTrad is indicated (section 5.2), which relate to the user's professional or academic profile.
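The three steps listed in section 4 (interpret the query, search for it, present the results) can be pictured with a small sketch. It is written in Python rather than in the Perl packages named above, and everything in it is an assumption made for illustration: the form field names attr and value, the toy token list, and the function names. It is not BancTrad's code and it does not call CQP; the character-reference encoding in the last step merely plays the role that the text assigns to HTML::Entities.

```python
from html import escape
from urllib.parse import parse_qs

# Toy stand-in for an annotated corpus: (form, lemma, pos) triples.
# The tokens and tag names are invented for illustration.
CORPUS = [
    ("El", "el", "DET"),
    ("vaixell", "vaixell", "N"),
    ("surt", "sortir", "V"),
    ("de", "de", "P"),
    ("Barcelona", "Barcelona", "NP"),
    ("façana", "façana", "N"),
]

# Position of each searchable attribute inside a token triple.
COLUMNS = {"form": 0, "lemma": 1, "pos": 2}


def interpret(query_string):
    """Step 1: turn the submitted form data into an attribute/value pair."""
    fields = parse_qs(query_string)
    attribute = fields.get("attr", ["form"])[0]
    value = fields.get("value", [""])[0]
    return attribute, value


def search(attribute, value):
    """Step 2: return the word forms whose chosen attribute equals the value."""
    column = COLUMNS[attribute]  # raises KeyError for an unknown attribute
    return [token[0] for token in CORPUS if token[column] == value]


def present(hits):
    """Step 3: build an HTML list, escaping markup and non-ASCII characters."""
    items = []
    for hit in hits:
        # Escape markup first, then replace non-ASCII characters by
        # numeric character references so they survive the transfer.
        safe = escape(hit).encode("ascii", "xmlcharrefreplace").decode("ascii")
        items.append("<li>" + safe + "</li>")
    return "<ul>" + "".join(items) + "</ul>"


def handle_request(query_string):
    """Run the three steps for one request and return the HTML fragment."""
    attribute, value = interpret(query_string)
    return present(search(attribute, value))
```

Calling handle_request("attr=pos&value=N") on the toy data returns a list with vaixell and façana, the latter with its cedilla written as a character reference. Keeping the three steps in separate functions mirrors the modularity the text describes: the search step could be swapped for a real corpus engine without touching the other two.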
quadruples (form, lemma, morphosyntactic tag, and syntactic function), including the iteration of identical elements
Fig. 4 is a screenshot of a search in this mode: it searches for causative constructions from Catalan into English, that is, for the causative verb fer followed by any verb (see next section for the results).
to the left and/or right sides of the query target. Of course all the capabilities listed so far are indebted to the Corpus Query Processor that we use as a searching engine.
Fig. 6 shows some of the results for the query on causative constructions made on section 5.1.1:
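As a rough illustration of what a query over token quadruples involves, the sketch below searches a tagged sentence for the causative verb fer followed by any verb. It is a Python approximation, not CQP and not the BancTrad query itself; the Token fields, the tag labels, the example sentence and the function name find_matches are all assumptions made for the example.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    pos: str
    func: str


def find_matches(tokens, pattern):
    """Return start indices where consecutive tokens satisfy the pattern.

    `pattern` is a list of dicts, one per token position, mapping an
    attribute name to a regular expression that must match the whole
    attribute value.
    """
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            re.fullmatch(regex, getattr(token, attr))
            for token, constraint in zip(window, pattern)
            for attr, regex in constraint.items()
        ):
            hits.append(start)
    return hits


# Invented example sentence; the tag labels are placeholders,
# not the CATCG tagset.
sentence = [
    Token("Em", "jo", "PRON", "OBJ"),
    Token("va", "anar", "AUX", "AUX"),
    Token("fer", "fer", "V", "MAIN"),
    Token("riure", "riure", "V", "MAIN"),
]

# The lemma "fer" immediately followed by any token whose POS starts with V.
causative = [{"lemma": "fer"}, {"pos": "V.*"}]
print(find_matches(sentence, causative))  # prints [2]
```

Each position in the pattern can constrain any of the four attributes, which is the property that lets a query mix word forms, lemmas, tags and syntactic functions.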
In fact, these kinds of applications just follow from the examples described above and the characteristics of the corpora in BancTrad. On the one hand, as the corpora are real translated texts (see section 2), and given the search possibilities sketched above, BancTrad appears to be a useful tool for professional translators. They could look for evidence of previous translation decisions and even see who was in charge of that translation.
On the other hand, linguists and translation theorists (see work done by Baker, M. and Teubert, W.) could also take advantage of this search engine. In fact, this is something we have already been doing in the grammar-development task we have been carrying out for the last three years. We can retrieve data such as most frequent readings, syntactic structures, etc. This helps us concentrate on problems arising when dealing with written text and develop more data-driven linguistic-based grammars. It is also interesting to note that searches can be made on a single language, that is, they need not be bilingual.
Other possible applications for BancTrad include creating further Language Resources, such as multilingual dictionaries, chunkers, stochastic-based machine translation systems, etc.
5.2.3. An added value
Finally, it is important to note that an added value of BancTrad's web interface is that it can incorporate other corpora (also monolingual ones) with little work. This would enable our users to query several corpora, not only the ones prepared at the FTI, in a user-friendly and familiar web interface. For instance, we already have the British National Corpus as part of our searchable corpora and we are planning to integrate the Frankfurter Rundschau corpus soon as well.
8. References
Badia, T., À. Egea & T. Tuells (1997) CATMORF: Multi-two level steps for Catalan morphology. In Demo Proceedings of the Conference on Applied Natural Language Processing. Washington
Badia, T., Boleda, G., Bofias, E. & Quixal, M. (2001) A modular architecture for the processing of free text. Proceedings of the Workshop on 'Modular Programming applied to Natural Language Processing' at EUROLAN 2001. Iasi, Romania.
Christ, Oliver (1994) A modular and flexible architecture for an integrated corpus query system, COMPLEX'94, Budapest
Christ, Oliver, Schulze, Bruno M. and König, Esther (1999) Corpus Query Processor (CQP). User's Manual, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Stuttgart
Karlsson, F. et al. (1995) Constraint Grammar: a Language-Independent Formalism for Parsing Unrestricted Text, Mouton De Gruyter: Berlin/New York
Schmid, Helmut (1995) Improvements in Part-of-Speech Tagging with an Application to German, in Proceedings of the ACL SIGDAT-Workshop, pp. 47-
Schmid, Helmut (1997) Probabilistic Part-of-Speech Tagging Using Decision Trees, in Daniel Jones and Harold Somers, editors, New Methods in Language Processing Studies in Computational Linguistics, UCL Press, London, pp. 154-164
Tapanainen, P. (1996) The Constraint Grammar Parser CG-2, Department of General Linguistics, University of Helsinki, Helsinki, Publications, number 27.
ParaConc: Concordance Software for Multilingual Parallel Corpora
Michael Barlow
Rice University
Dept. of Linguistics
Houston, TX 77005
Parallel concordance software provides a general purpose tool that permits a wide range of investigations of translated texts, from the
analysis of bilingual terminology and phraseology to the study of alternative translations of a single text. This paper outlines the main
features of a Windows concordancer, ParaConc, focussing on alignment of parallel (translated) texts, general search procedures,
identification of translation equivalents, and the furnishing of basic frequency information. ParaConc accepts up to four parallel texts,
which might be four different languages or an original text plus three different translations. A semi-automatic alignment utility is
included in the program to prepare texts that are not already pre-aligned. Simple text searches for words or phrases can be performed
and the resulting concordance lines can be sorted according to the alphabetical order of the words surrounding the searchword. More
complex searches are also possible, including context searches, searches based on regular expressions, and word/part-of-speech
searches (assuming that the corpus is tagged for POS). Corpus frequency and collocate frequency information can be obtained. The
program includes features for highlighting potential translations, including an automatic component Hot words, which uses
frequency information to provide information about possible translations of the searchword.
The heading PARALLEL TEXTS at the top of the
dialogue box is followed by a number in the range 2-4
(i.e., two to four different languages). The FORMAT buttons
allow the user to describe the form of headings,
paragraphs, and sentences, as discussed above. Filenames
can be reordered by dragging them to the appropriate
the translation of head. It could simply be accidental that
tête is found in the French sentence corresponding to the
English sentence containing head.
The idea behind dual KWIC display is to let the user
move from English to French and back again, sorting and
resorting the concordance lines, and inspecting the results
to get a sense of the connections between the two
languages at whatever level of granularity is relevant for a
particular analysis.
Figure 3: Parallel KWIC displays
4. Hot Words
In the previous section, we described the use of SEARCH QUERY to locate possible translations in the second window. In this section we will look at a utility in which possible translations and other associated words (collocates) are suggested by the program itself. We will refer to these words as hot words. First we position the cursor in the lower (French) half of the results window and click using the right mouse button. If we used SEARCH QUERY earlier, we need to select CLEAR SEARCH QUERY and then choose HOT WORDS, which invokes a procedure that calculates the frequency of all the words in the French results window and then brings up a dialogue box containing the ranked list of hot words. The ranked list of candidates for hot words based on head is displayed as shown in Figure 4.
To select words as hot words, the program looks at the frequency of each word in the results window and ranks the words according to the extent to which the observed frequency deviates from the expected frequency, based on the original corpus. The words at the top of the list might include translations of the searchword, translations of the collocates of the searchword, and collocations of translations of the searchword.
In addition to the basic display of hot words, a paradigm option (if selected) promotes to a higher ranking those words whose form resembles other words in the ranked list. This is a simple attempt to deal with morphological variation without resorting to language-particular resources.
Some or all the hot words can be selected. Clicking on OK will highlight the selected words in the results window, and again the words can be sorted in various ways.
5. Frequency information
ParaConc furnishes a variety of frequency statistics, but the two main kinds are corpus frequency and collocate frequency. The command CORPUS FREQUENCY DATA in the FREQUENCY menu creates a word list for the whole corpus (or parallel corpora), according to the settings in FREQUENCY OPTIONS. The results can be displayed in alphabetical or frequency order and the usual options (such as stop lists) are available.
Choosing COLLOCATE FREQUENCY DATA from the FREQUENCY menu displays the collocates of the search term ranked in terms of frequency. In ParaConc, the collocate frequency calculations are tied to a particular search word and so the frequency menu only appears once a search has been performed. The collocation data produced by the COLLOCATE FREQUENCY DATA command is organised in four columns, spanning the word positions 2nd left to 2nd right. The columns show the collocates in descending order of raw frequency.
One disadvantage of the simple collocate frequency table is that it is not possible to gauge the frequency of collocations consisting of three or more words. To calculate the frequency of three word collocations, it is necessary to choose ADVANCED COLLOCATION from the FREQUENCY menu and select one or more languages. The top part of the dialogue box associated with ADVANCED COLLOCATION allows the user to choose from up to three word positions, for example, SEARCHWORD 1ST RIGHT, 2ND RIGHT. The program counts and displays the three-word collocations based on the selected pattern.
6. Workspace
The loading and processing of a parallel corpus in particular can take some time since the program has to process alignment and annotation data before searching and analysis can begin. Since the same sets of corpus files are often loaded each time ParaConc is started, it makes sense to freeze the current state of the program, at will, and return to that state at any time, rather than starting ParaConc and reloading the parallel corpora afresh. This is the idea behind a workspace. A workspace is saved as a special (potentially large) ParaConc Workspace file (.pws), which can then be opened at any time to restore
ParaConc to its previous state, with the corpus loaded ready for searching. Searches and frequency data are, however, not included in the saved workspace. (Only the search histories are saved.)
A workspace can be saved at any time by selecting the command SAVE WORKSPACE or SAVE WORKSPACE AS from the FILE menu. The usual dialogue box appears and the name and location of the workspace file can be specified in the normal way. Once a filename for the saved workspace has been entered, the user is asked to choose some different workspace options. The line/page and the tracked tag info can be saved as part of the workspace. (The saved workspace consists of a saved file and an associated folder of the same name.)
7. Advanced Search
The simple searches described in Section 3 will suffice for many purposes and are especially useful for exploratory searches. The basic TEXT SEARCH is also very useful when used in conjunction with a sort-and-delete strategy. Particular sort configurations can be chosen to cluster unwanted examples (words preceded by a and the perhaps), which can then be selected and deleted. For more complex searches, however, we need to use the ADVANCED SEARCH command. This command brings up a more intricate dialogue box (displayed in Figure 5), which at the top contains the text box in which the search query is entered.
The third option in the advanced search dialogue box is TAG SEARCH, which allows the user to specify a search query consisting of a combination of words and part-of-speech tags, with the special symbol & being used to separate words from tags in the search query. This search syntax is used whatever particular tag symbols are used in the corpus. (Thus it is necessary to enter the form of the tags in TAG SETTINGS before a tag search can be performed.) To give an example: the search string that&DD finds instances of that tagged as a demonstrative pronoun, which may appear in the corpus as that<w DD>. Similarly, a tag search for &JJ of& will find all instances of adjectives followed by the word of. (The dialogue box in Figure 5 contains a variety of other options controlling the search function, which will not be discussed in this paper.)
Finally, one kind of search tailored for use with parallel texts is a parallel search, which is one of the options within the SEARCH menu. This type of search, shown in Figure 6, allows a search to be constrained based on the occurrence of particular strings in the different parallel texts.
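The word&tag syntax of TAG SEARCH can be made concrete with a short sketch. It is a Python approximation inferred only from the two examples in the text (that&DD and &JJ of&), so the reading of an empty side of & as "anything", the case-insensitive word comparison, the function names and the toy tokens (including the tags VBZ, IO and CST) are assumptions, not ParaConc's actual behaviour.

```python
def parse_tag_query(query):
    """Split a query such as '&JJ of&' into (word, tag) pairs.

    An empty word or tag is stored as None, meaning "match anything".
    """
    constraints = []
    for item in query.split():
        word, _, tag = item.partition("&")
        constraints.append((word or None, tag or None))
    return constraints


def tag_search(tokens, query):
    """Return start positions where consecutive (word, tag) tokens match."""
    constraints = parse_tag_query(query)
    if not constraints:
        return []
    hits = []
    for start in range(len(tokens) - len(constraints) + 1):
        window = tokens[start:start + len(constraints)]
        if all(
            (word is None or word.lower() == tok_word.lower())
            and (tag is None or tag == tok_tag)
            for (word, tag), (tok_word, tok_tag) in zip(constraints, window)
        ):
            hits.append(start)
    return hits


# Invented tagged tokens; only DD and JJ are tags mentioned in the text.
tokens = [
    ("that", "DD"),
    ("is", "VBZ"),
    ("full", "JJ"),
    ("of", "IO"),
    ("that", "CST"),
]

print(tag_search(tokens, "that&DD"))  # prints [0]
print(tag_search(tokens, "&JJ of&"))  # prints [2]
```

The first query matches only the occurrence of that tagged DD, not the one tagged CST; the second matches the adjective full because it is followed by the word of.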
The main factor impinging on the usefulness of the
software is probably the availability of aligned parallel
corpora and of parallel corpora in general.
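The hot-word ranking of Section 4 compares the frequency observed in the results window with the frequency expected from the corpus as a whole. The paper does not say which statistic ParaConc uses, so the sketch below substitutes a simple standardised difference, (observed - expected) / sqrt(expected); that choice, the function name rank_hot_words and the four toy French sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_hot_words(result_lines, corpus_lines, top_n=5):
    """Rank words by how far their observed frequency exceeds expectation.

    Assumes the result lines are a subset of the corpus lines; words that
    do not occur in the corpus counts are skipped.
    """
    observed = Counter(w for line in result_lines for w in line.lower().split())
    corpus = Counter(w for line in corpus_lines for w in line.lower().split())
    # Share of the corpus that the results window represents, in tokens.
    share = sum(observed.values()) / sum(corpus.values())
    scores = {}
    for word, count in observed.items():
        if corpus[word] == 0:
            continue
        expected = corpus[word] * share
        scores[word] = (count - expected) / math.sqrt(expected)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Invented target-language corpus; the first two lines stand for the
# sentences aligned with hits for the English searchword "head".
corpus = [
    "il a hoché la tête",
    "la tête me fait mal",
    "il a ouvert la porte",
    "elle a fermé la porte",
]
results = corpus[:2]

print(rank_hot_words(results, corpus, top_n=1))  # prints ['tête']
```

Frequent function words such as la score zero here because they occur in the results about as often as expected, while tête rises to the top, which is the effect the section describes.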
Corpora for Terminology Extraction - the Differing Perspectives and Objectives
of Researchers, Teachers and Language Services Providers
Belinda Maia
Faculdade de Letras
Universidade do Porto
Via Panorâmica s/n
4150-563 Porto
Using corpora to find correct terminology is an activity that is interpreted rather differently according to the final objectives of those
involved. This paper will try to show how the perspectives and objectives of researchers, teachers and language services providers do
not always coincide, and how this lack of mutual appreciation and understanding can sometimes cause confusion. We shall first look at
the more speculative aspects of current terminology research for the possibilities they offer in the future, even though some of this
work is not directly related to translation, and consider the reasons why correct terminology is growing in importance in the lives of
both domain specialists and language services providers. We shall then briefly consider both the older prescriptive notions of
standardisation and the descriptive approach made feasible by technology and corpora today. Corpora in the broadest sense - from
formally constructed and officially approved collections of texts to the disposable, do-it-yourself corpora anyone can now collect off
the Internet for information on a specific subject - come as part of the information revolution provided by technology. They provide
possibilities for any user of language and knowledge that were unthinkable a few years ago, but there are also problems and
1. Introduction
The compilation of terminology used to consist largely of collecting the words and phrases considered to be specific to a certain domain and bringing them together to form glossaries, with or without definitions or information on how or where the information was gathered. Since translators often had a vested interest in finding, or providing recognised equivalents in several languages, these glossaries would often become bi- or multilingual at a later stage. With the increase in availability of electronic text, the advantages of using corpora for term extraction are now generally recognised, particularly since the prescriptive view of terminology work has given way to a more descriptive approach, and the storage of definitions and other information on the terms has been made possible by relational databases.
This paper assumes that there are three classes of people with a particular interest in this terminology work. First there are the researchers in various areas of linguistics in general, as well as more specific terminology research. Many, but not all of these people, are also the teachers who try to train the professional language services providers needed today. The word linguist as someone proficient in two or more languages has become ambiguous since the advent of linguistics as an academic discipline, and the tasks required of someone with a good knowledge of languages are increasingly varied. I have therefore chosen the term language services provider to refer to those who not only provide traditional translation and interpreting services, but also those who write and revise texts professionally, specialise in localisation, sub-titling, dubbing and making web pages, create terminological databases and translation memories, work with machine translation, and both use and take advantage of the information technology now available for a wide variety of projects and customers.
2. Terminology research
Those involved in this workshop on translation work and research will tend to see terminology research as primarily interested in supplying the needs of the translator for specialised terminology, but this is only one aspect of the overall picture. A good deal of terminology research is monolingual in nature and directed at the standardisation and categorisation of the relationship between concepts belonging to certain domains of knowledge and the terms used to describe them. This type of work is typically carried out by the domain experts, with or without the assistance of linguists, and, more often than not, in major languages like English, French and German. The subsequent translation of these standardised terms into other languages is by no means as simple or as well organised as it might be, despite official efforts to the contrary.
Standardisation of terminology has a long history, and its objectives have typically been to prevent confusion in the transmission of knowledge, with all the economic, social, legal and political consequences involved. Some areas of knowledge, like engineering, have a long-standing tradition in producing standardised terminology, but even they find it difficult to keep up with technical and scientific developments. Many other domains have little or no organised terminology resources and what exists is often local in nature, in the sense that it is the property of certain organisations, companies and other entities, of varying size and importance.
The information revolution caused by the Internet, however, has led to demands for better systematisation of knowledge and improved accessibility. For this reason, the computational side of terminology research today is increasingly orientated towards facilitating information retrieval and knowledge engineering (see Budin, 1996, and Charlet et al, 2001). Traditional terminology work tends to be painstaking and slow, and is not adapted to coping with the exploding need for retrieving knowledge. For this reason, efforts are being made by computational linguists and computer scientists to speed up the process of identifying, extracting and processing terminology (see Bourigault et al (Eds.) 2001, and Veronis (Ed). 2000).
3. Computational terminology
So much information is now processed in computer-readable form that there are obvious advantages to be drawn from this for machine (assisted) translation, translation memories and their related terminology databases. The corpora required for this type of research need to consist of texts that are not just well written, in the sense that they represent texts normally produced in a particular domain of knowledge: they need to use terms that are generally accepted in the community that works in that domain. When translations exist of these texts, they, too, need to conform to the same standards of text and terminology in the target language if one is to produce good aligned parallel corpora.
The experimental work done in computational terminology usually involves standardised texts in which both originals and translations are considered to be of high quality. Some of these texts have been provided by organisations like XEROX (see Bourigault 1994). The texts are often chosen for their linear compatibility (See Blank, 2001), which allows for easy alignment at, at least, sentence level, and the standardisation of their technical terminology. This is understandable, since it will only be possible to proceed with the analysis of a wider variety of texts when some sort of procedure has been worked out on the basis of these controlled corpora - rather as machine translation is better at translating controlled language than Shakespeare.
There is, of course, a lot of textual material that apparently conforms to the needs of this type of research. The European Commission has worked hard at making as many of its multilingual texts available as possible. In order to do this, the translation services have effectively created enormous translation memories full of texts translated by themselves, and one can presume that the terminology used is usually supported by the EURODICAUTOM database, which is itself the result of many years of effort by a large number of people. The large multinational companies that have invested heavily in translation memory software and terminology databases could also provide a vast amount of material. Organisations like the International Standards Organisation could provide invaluable material once its standards are efficiently translated in other languages. After all, not only do these standards and their translations represent ideal parallel corpora, but the very purpose of the texts themselves is to standardise the terminology used.
4. Real-life terminology
There can be no doubt that a lot of the work to which we have just referred is impressive and of high quality and, therefore, a reliable source of information for the most necessary function of all these texts - the communication of knowledge. However, anyone who has worked seriously on producing terminology with the collaboration of experts will realise that the notion of one concept = one term is an ideal, not a reality. International classifications that do exist have sometimes tried to escape the problems of normal language in different ways, as when natural species are classified in Latin, or chemical and mathematical concepts use formulas and symbols.
There are various reasons why the one concept = one term notion is an ideal. It is easy enough for the linguist to understand the fluidity of the lexicon. After all, one of the perennial problems of general linguistics is how to deal with it in an easily classifiable way, hence all the work with projects like Wordnet (at:
On the other hand, experts in any particular domain are also aware of the fluidity of concepts and probably spend a good deal of time arguing about how to stabilise them for practical purposes - and stable terminology is only one aspect of this problem. In practice, they often resort to diagrams, images and other pictorial representations in order to circumvent or supplement the limitations of language. The general public, however, likes to believe in the stability of both language and concepts, and, for the practical purposes of communication, we all accept that there has to be some sort of social contract whereby we agree to this stability in order to understand each other.
Prescriptive terminology has usually aimed at providing this stability in an organised fashion and most specialised dictionaries and glossaries are the result. The technology of databases, however, allows for a more descriptive approach, with all the implications this has for including all the information terminologists collect in the course of their work. When one is no longer limited by space on paper - a major factor in previous lexicographical work - the prospects of including all the information available and/or prescribed by international standards for terminological databases are, to say the least, tempting. These prospects may seem unnecessary to the more immediate problems of communication, but they contribute in no small way to various visions of the systematisation and documentation of knowledge.
Terminology is not the simple accumulation of words, their equivalents in other languages, definitions and a certain amount of grammatical information. Nor is it the simple matching of term to concept. One has to deal with all the usual problems of language - social, geographical, historical, political, and other aspects of style and register. At the level of standardisation, one can even become involved in authentic battles between academics or commercial companies who want to see the words they use to describe their particular theories or products prevail.
5. Real-life corpora
When one is not working for the interests of computational terminology, one will probably not have access to the type of standardised corpora already described, except for the online documentation of the European Commission. Besides this, these standardised texts, no matter how well written or translated, tend to reflect a degree of deliberate homogenisation of style and register across languages. In the more routine terminology work carried out in universities and other institutions, every terminology project will come up against a different situation, and circumstances will play an important role.
First of all, one has to find what texts are available in the domain one is studying and it is more than likely that the most important ones will not be in digital form. We have found that this is often the case when one wants to use first-class academic texts published by well-known publishers. Working with industrial or commercial institutions or companies is one way of obtaining texts, but we have not yet tried this, partly because it will
require careful negotiation, and partly because we have Corpora have always been obligatory elements of our
found several academic partners interested in cooperating project work but, although we have collected quite a lot of
on a serious and more unbiased basis. specialised mini-corpora over the years, we admit that
One can always scan texts, and there are, of course, they have not always been the most successful part of the
plenty of texts already in digital form. It is often easy projects. There are various reasons for this. On the one
enough to obtain permission to use these texts if one hand, perhaps the biggest enemy of terminology related
explains why one needs them and what one intends to do corpora work is the large number of existing on-line
with them, as there is plenty of interest among domain glossaries on everything under the sun that our students
experts to see their terminology systematized. The soon discover from each other. One can, of course, argue
Internet, as we all know, can provide an enormous amount that these glossaries, which are often easy to copy or
of material in certain areas, but is less useful in others. For download, are in themselves language resources of the
example, we have found it of limited interest for certain type we are discussing here. However, they are usually
engineering terminology projects because both the high monolingual, largely in English, often rather general in
level expert-to-expert type of academic article and the scope, and infrequently backed up by any form of official
more didactically orientated teaching text are not freely recognition. When the glossaries are good, complete, and
available to the general public. Too often one ends up officially recognised, adding Portuguese terminology to
with commercial sites trying to sell certain types of them is usually beyond the scope of an undergraduate
engineering equipment, and the information thus obtained project. Of course, one might argue that beginners could
is not necessarily very reliable. In the area of population do worse than discover how to convert them into their
geography, however, where one is dealing with a subject own languages.
that cuts across the disciplines of geography, sociology The big problem here is that such work merely
and demography, one project group was able to find a encourages the idea that finding the right word is
sizeable amount of material in several languages, of both a enough. This means they miss out on the didactic
parallel and comparable nature, precisely because there strengths of making mini corpora - the understanding of
are plenty of official or governmental institutions who the subject itself, brought about by having to find and read
want to publish such material on-line. The other texts, the appreciation of different types and styles of text
interesting aspect of this area is that the subject is gained while doing this, and the extraction of terms in
relatively new and the relative instability of the context. Although students are encouraged to use software
terminology was observable in the texts found. like Wordsmith to look for keywords and to study
As our projects must have a Portuguese component, concordances of both general language words and
one of the problems we have found is that some languages specialised terminology, there is always a preliminary
are more equal than others. If the languages involved are stage when the actual reading of the texts is necessary at
English, French or German, there is a chance that one will least from a pedagogical point of view. If they are lucky,
be able to find reliable texts of a parallel or comparable they will also find definitions in the texts, although these
nature, but the same will not be true of less used are not as frequent, or as reliable, as the literature on the
languages. We have found this to be true at all levels of subject would have us believe.
text we look for. We have also found that the translations There are successful types of glossary work that do not
of websites - whatever the original language - are often of poor require corpora, such as some excellent ones our students
quality and cannot be used as parallel corpora. have done on tools of various types - e.g. carpentry and
gardening tools - in which the corpora were largely
6. Teaching and Project work catalogues with images, and students had to work hard to
The type of project work we have done over the years make the words in both languages match the pictures
started as a typical translation exercise in vocabulary provided, a process that involved plenty of questioning of
research that owed much of its dynamics to the fact that individuals, but little text work.
the translation classroom contained PCs connected to the
Internet. Our curriculum had been formulated by believers 7. Conclusions
in the notion that general translation, together with six Corpora and terminology research can work well
months' placement at the end of the course, was sufficient together, but they are not always equal partners. Ideally,
for training Modern Languages students to become students should be able to find good texts and extract
translators. Our experience, and that of our graduates, terms, definitions and other information from them. When
soon told us that this was far from enough and we mini-corpora form the basis for terminology work, the
developed specialised subject project work as a way of process of producing the terminology project is
training students in LSP (see Maia, 1997 and Maia, 2000) didactically more valuable, and it is an easy step from
within the limitations of the curriculum. We have now collecting and aligning texts, and then using
moved on to interdisciplinary postgraduate training in concordancing, to understanding the theory behind
terminology and translation work, working with translation memories and other software and making them
professors from the Engineering Faculty and History and work in practice. As we have said, however, valuable
Geography departments. Our early wordlists processed in terminology work can be done without resort to corpora.
Word have now developed into more sophisticated Perhaps the most important attitude to adopt towards
terminology work in Excel and Multiterm, and include project work is flexibility, since each domain brings its
definitions, sources, images and other data fields. We own circumstances and problems. If at the end of the
soon hope to have our own database system and make it experience our undergraduate students have learned how
available online. to take special languages seriously, the main objective has
been achieved. Our postgraduate students already know
how important they are and need to learn how to progress
further, and perhaps even join the process of research into
computational processes that will speed up the
accumulation of valuable resources for all of us who do
not want to see the world speaking only one language.
Working Together: A Collaborative Approach to DIY Corpora
Lynne Bowker
School of Translation and Interpretation, University of Ottawa,
70 Laurier Avenue East, Room 401, Ottawa, Ontario, K1N 6N5, Canada
Corpora can be invaluable resources for translation students, but creating DIY corpora on a frequent basis can be a time-consuming
exercise. This paper describes an experiment whereby the students in a translation class worked in collaboration to build corpora for
use in their technical translation course. The guidelines used for this collaborative approach are outlined, and the results of the
experiment are discussed. A general discussion on the value of the World Wide Web as a resource for building DIY corpora is also
contributed by each student per corpus, c) quality of texts,
1. Introduction d) time frame, and e) file format.
Researchers such as Zanettin (1998), Yuste (2000),
and Bowker and Pearson (2002) have amply demonstrated 2.1. Coordinators
the value of using corpora as translation resources in the For each corpus, two students would act as
context of translator training. However, there are coordinators. When students were acting as coordinators,
relatively few ready-made or off-the-shelf corpora they did not have to contribute texts to the corpus (but
available for use in specialized domains, so translator they still had to do the actual translation homework).
trainers and/or students typically need to construct their Essentially, the coordinators were to act as a sort of
own. This paper outlines an experiment that was clearing house. Students in the class would e-mail their
conducted with 4th-year undergraduate students in a texts to a special account set up for the coordinators, who
French-to-English technical translation course. The would 1) evaluate these texts for relevance, and 2)
purpose of this experiment was to see if it was possible for eliminate duplications (i.e., cases where the same text had
the class to collectively build DIY or disposable been submitted by multiple students). The remaining texts
corpora (Varantola, forthcoming) that could be used as would then be collated into a single corpus that would be
resources for their translation course work. posted on the class Web site.
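
The coordinators' duplicate-elimination step (Section 2.1) could in principle be partly automated. The following Python sketch is offered only as an illustration and was not part of the experiment described here. The function names and the assumption that duplicate submissions are identical apart from letter case and white space are hypothetical; judging relevance would still be left to the coordinators.

```python
import hashlib
import re


def fingerprint(text):
    """Hash a text after collapsing white space and case (an assumed notion of 'same text')."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def collate(submissions):
    """Keep the first copy of each distinct text from (student, text) pairs."""
    seen = set()
    corpus = []
    for student, text in submissions:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            corpus.append(text)
    return corpus


# Hypothetical submissions: students A and B sent the same text.
submissions = [
    ("A", "A cookie is a small file stored by a browser."),
    ("B", "A cookie is a small  file stored by a browser.\n"),
    ("C", "Encryption protects data in transit."),
]
corpus = collate(submissions)
```

Near-duplicates, such as the same article republished with a different header, would not be caught by exact hashing and would still need manual inspection.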
My previous experiments with corpus building had
proceeded following either a teacher-centred approach or 2.2. Number of texts contributed by each
a learner-centred approach. Both of these approaches had student per corpus
a number of drawbacks. In the case of the teacher-centred Each student (with the exception of the coordinators)
approach, the translator trainer was responsible for would try to identify three relevant texts that would make
constructing all the corpora, a job which proved to be a good addition to the corpus. Given a class of between 20
very time consuming (resulting in relatively small and 30 students (this class had 22 students), this number
corpora) and which excluded the students from the design was considered to be a reasonable goal; however, it was
phase of the corpus building process. In the case of the not an absolute. If a student could only identify two
learner-centred approach, each student was individually suitable texts, these would still be welcome; likewise, if a
responsible for building his or her own corpora. This student located four or five relevant texts, they could all
approach also proved to be inefficient, with students be submitted.
building corpora that were often small and generally
poorly designed.
2.3. Quality of texts
It was hoped that by adopting what Kiraly (1999 and
2000) and Yuste (2001) refer to as a learning-centred and The students agreed to put some time and care into
collaborative approach, the resulting corpora would be selecting their three texts. It was noted that if everyone
larger and more useful, and the students would engage in were to simply submit the texts corresponding to the first
active discussions with the trainer and with each other and three hits that came up using a Web search engine, then
would move towards becoming empowered critical there would be a lot of duplication and the texts may not
thinkers and more independent learners. be pertinent, which would limit the value of the corpus.
Table 1: A brief description of the corpora produced as part of the collaborative corpus building exercise.
Finally, the source text on cookies consisted of an or instructional texts that are popularized. More
entry taken from a technical encyclopedia. Once again, specialized material and different text types can be
there were relatively few submissions (41 texts), coupled accessed via the Web, but such information is often
with a high degree of duplication (only 22 texts were available only by paid subscription. This means that while
retained). This was because there are a limited number of the Web can be a valuable resource for constructing corpora
electronic technical encyclopedias that could serve as that deal with popularized informative texts, it may prove
comparable texts. Furthermore, it was observed that the less helpful for constructing corpora that must comprise
entries in such encyclopedias tend to consist of short texts, other types of texts.
which resulted in a relatively low word count for the A similar observation was made about the languages
corpus as a whole. of texts available on the Web. The students in this class
were attempting to compile comparable corpora
5. General observations about using the containing English-language texts, of which there are
Web as a resource for building DIY many on the Web; however, they noted that for translators
corpora working in less widely-used languages, there may be
In addition to discussing particular problems that came fewer texts available (at least for the present, though
up when creating specific corpora, the class also discussed hopefully this will change over time).
a number of more general points, many of which The very nature of the Web gave rise to two other
concerned the nature of the Web and its suitability as a observations. Firstly, the idea behind hypertext is that
resource for building DIY translation corpora. For people can jump from page to page to view associated
example, it was noted that there are many texts on the information. Good Web design dictates that there should
Web that are of poor quality and which therefore do not be a limited amount of information on each page so that
make good translation resources. When asked to reflect on people are not required to scroll unnecessarily; related
potential reasons for this poor quality, students came up pieces of information should be provided on separate
with the following possibilities. Firstly, they noted that pages with relevant links between them. When compiling
anyone can post information on the Web, including non- a corpus from the Web, each page must be copied/saved
subject field experts and non-native speakers, and that separately and then later amalgamated into a corpus.
Web documents are not always subject to an editing Therefore, from a corpus builder's point of view, it would
process in the same way that printed documents usually be preferable to have a single page containing a lot of
are. Furthermore, the Web is seen by many as an information, as this page could be copied/saved in one
ephemeral resource; people are interested in operation, rather than having that same information spread
communicating information, but unlike the case with over several pages, which would then need to be
printed documents, this information may not be preserved copied/saved separately. This basically means that good
for long (i.e., a Web page can be revised, updated or Web design is not conducive to easy corpus building!
removed very easily) and so people are less willing to Secondly, the multimedia nature of the Web is another
invest much time or effort in formulating that information. characteristic that is not always conducive to building
In other words, many people feel that a Web page does text-based corpora. On a number of occasions, students
not need to be elegant (or even grammatically correct!) as rejected Web pages that would have been extremely
long as it adequately conveys the essential information. useful sources of information but which could not easily
Another comment focused on the types of texts that be incorporated into a text-based corpus because their
are commonly found on the Web. Given that the Web is primary value resided in their graphical or audio content.
most often used as a means of disseminating information This raises an important point: a corpus can be an
to a non-expert audience, it contains primarily informative invaluable resource, but it is not a panacea. There are
many other complementary types of resources that can
also provide helpful information, and these should not be type, nature) for use as a resource for the translation at
ignored. hand.
Finally, the sheer volume of information that is
available on the Web made students aware of the 7. Acknowledgements
importance of formulating search queries carefully in The work described here has been partially funded by
order to be able to focus in on relevant material. As grants awarded to Lynne Bowker by the Faculty of Arts of
previously mentioned, students tended to read the source the University of Ottawa and the University of Ottawa
text first in order to get ideas for potential key words. Research Fund.
These words were then entered into a search engine, and
the resulting hits were examined for relevancy as well as 8. References
for ideas for other key words that could be used for further
searches. In addition to key words that dealt with the Bergeron, Manon and Susan Larsson, 1999. Internet
subject matter, students also found that it could be useful Search Strategies for Translators. The ATA Chronicle
to enter key words relating to the text type. For instance, a 28(7): 22-25.
search using only the subject key word cookie returned Bowker, Lynne, 2000. Towards a Methodology for
many irrelevant texts such as recipes; however, a more Exploiting Specialized Target Language Corpora as
carefully formulated search that combined subject and text Translation Resources. International Journal of Corpus
type key words, such as +cookie +computer Linguistics 5(1): 17-52.
+encyclopedia, returned hits for entries for cookie in Bowker, Lynne and Jennifer Pearson, 2002. Working with
resources such as The Grand Encyclopedia of Computer Specialized Language: A Practical Guide to Using
Terminology, TechEncyclopedia and PC Webopedia. Corpora. London: Routledge.
Other tricks, such as remembering to search for alternate Kiraly, Don, 1999. From Teacher-centered to Learning-
spellings (e.g., encyclopedia/encyclopaedia) also helped to centered classrooms in translator education: control,
increase the number of relevant hits. In addition, as chaos or collaboration? In Innovation in Translator and
mentioned previously, the students also found it useful to Interpreter Training (ITIT), an online symposium held
conduct a search using a variety of different search from January 17-25, 2000.
engines or a meta-search engine. Bergeron and Larsson
(1999) provide additional tips for effective Internet search Kiraly, Don, 2000. A Social Constructivist Approach to
strategies for translators. Translator Education. Manchester: St. Jerome.
Pearson, Jennifer, 2000. Surfing the Internet: Teaching
6. Concluding Remarks Students to Choose their Texts Wisely. In L. Burnard
and T. McEnery (eds), Rethinking Language Pedagogy
Overall, the collaborative corpus building exercise from a Corpus Perspective. Frankfurt: Peter Lang.
proved to be a worthwhile experience. The students Rogers, Margaret and Khurshid Ahmad, 1994.
demonstrated that they were eminently capable of working Computerised Terminology for Translators: The Role of
together to construct valuable translation resources, which Text. In M. Brekke, O. Andersen, T. Dahl and J. Myking
they could then consult to identify relevant lexical, (eds), Applications and Implications of Current LSP
phraseological and stylistic information. Not surprisingly, Research, Vol. II. Norway: Fagbokforlaget.
of the seven collective corpora that were built, the larger Varantola, Krista, Forthcoming. Translators and
ones, such as those on antivirus programs and encryption, disposable corpora. In F. Zanettin, S. Bernardini and D.
tended to contain a greater number of examples. Of more Stewart (eds), Corpora in Translator Education,
interest, however, is the fact that even the small corpora, Manchester: St. Jerome.
such as those on steganography and cookies, contained WordSmith Tools:
useful information. This supports the point made by
researchers such as Rogers and Ahmad (1994), who note SmithTools3.0/download.html
that when working in specialized fields, it is not necessary Yuste, Elia. 2000. Translation Instruction in the Y2K
to have the sort of multimillion word corpora that are Electronic Corpora, Internet and Translation
typically required for general language work. Technology. In CD-ROM Proceedings of the Seventh
In addition to furnishing students with an opportunity Conference of the International Society for the Study of
to explore the merit of corpora as translation resources, European Ideas (ISSEI 2000), Workshop 501 -
this exercise also provided a valuable opportunity for a Teaching Translation in the Information Age.
shift in pedagogical strategy. The collaborative corpus University of Bergen, Norway.
building exercise made it relatively easy for the trainer to Yuste, Elia. 2001. Technology-Aided Translation
take on the role of facilitator (rather than information Training. Hieronymous (3). Bern, Switzerland: ASTTI.
provider), which in turn allowed the students to become Zanettin, Federico. 1998. Bilingual Comparable Corpora
independent learners and critical thinkers, who were and the Training of Translators. In Meta 43(4), 616-
encouraged to reflect on the characteristics of different 630.
text types and on the suitability of the World Wide Web as
a translation resource. Acting as both contributors and
coordinators, students learned to identify relevant features
of texts and to be more discerning with regard to the
appropriateness of a text (e.g., in terms of quality, text
Language resources and the language professional
Elia Yuste
Computerlinguistik (CL)
Institut für Informatik (IfI) der Universität Zürich
Winterthurerstrasse 190, CH-8057 ZÜRICH, Switzerland
This paper aims at raising awareness about electronic language resources (henceforth LR) in the translation community at large.
Examining how technological advances in the profession have transformed the notion of translating itself and what is expected from a
qualified translator today, the paper goes on to focus on resources, rather than tools. It then discusses what type of LR should feature in
the training of professional translators, and how these should be tackled in various translation-training settings. The article offers
useful pointers throughout and an extensive bibliography covering the issues it addresses.
Keywords: translation profession, language professional, qualified translator, translation training, tools, resourceful, resources,
language resources (LR), corpora, translation technology and HLT, academic training, vocational training, collaborative approach,
real-life scenarios, translation workflow, multi-user access, corporate language, content management, resource creation / maintenance /
evaluation / validation / exchange, exchange standards
Textual and terminological bridgeheads for traversing the language gap
Marita Kristiansen, Magnar Brekke
Norwegian School of Economics
and Business Administration
Department of Languages, Helleveien 30, N-5045 Bergen, NORWAY
We describe here the basic modules of a concept-oriented bilingual text-and-term-based knowledge management system (KB-NHH) to
which students, teachers, researchers, domain experts, terminologists, linguists, translators and writers of various categories can turn
for content learning, reference and documentation. The aim is to ensure that the interface between English and Norwegian is being
handled with efficiency and consistency.
The primary user context of the implementation described here is an on-campus e-learning system. The aim is to facilitate the
representation, learning, teaching and dissemination of relevant domain knowledge, to monitor changes in and development of the
subdomain languages and to document all through authentic citations. Conceptual linkage of terms and authentic segments in the text
bank allows source inspection and evaluation by the user. The focus is on corpus-based term extraction, definitions, terminological
representations, Norwegian-English equivalence problems and contrastive phraseology.
This paper makes a distinct contribution by proposing the integration of a conceptual knowledge-base with the textual manifestation of
its underlying domain knowledge and its terminological representation in one or more languages, all in the context of a standard e-
learning system. This should greatly facilitate learning by bridging the language gap experienced by native and non-native students
alike in approaching a new knowledge domain.
3.2. Knowledge representation. Table 2: Top of System Quirk's standard frequency list.
Terminological research is normally based on the
onomasiological principle, the grouping of terms Some of these one-word units of fairly general scope
according to their conceptual meaning. Thus any can be identified as Economics terms, which is useful but
knowledge subdomain can be characterized by a of limited value. SystemQuirk provides two different
(partially) structured set of basic concepts which are functions for enhancing frequency lists to improve on our
represented linguistically through domain-focal terms (cf. term enquiry.
Brekke, 2000). Establishing or extending conceptual
systems (cf. module 8) becomes essential in achieving 3.3.1. Weirdness.
authentic representations of the knowledge which SQ exploits a weirdness-function based on a
constitutes a given subdomain. This activity presupposes comparative ratio which expresses the likely occurrence of
close cooperation between a domain expert and a trained a given item in the text being scrutinized compared to the
terminologist (cf. module 2 & 3) in identifying and same for a large general corpus. Where the latter
delimiting what the basic concepts are, conventional term occurrence is zero the ratio will of course be infinite,
usage, acceptable synonymy etc. The repository for their indicating either a typo, a nonce word, or in fact a
work is a termbank (cf. module 9) holding terminological technical term, which is also indicated by a very high
units defined, classified as to subdomains, and mapped to ratio. As a result a number of items occurring only once in
their respective key concepts and conceptual hierarchies in a given text will be brought to the top of the frequency
module 8. Using the concept as a term record pivot (as is list, and such lists usually give significant inputs to the
done in e.g. Trados MultiTerm, which is employed in the ensuing frequency studies. Table 3 (over) reveals a typical
pilot project) facilitates the inclusion of other language situation. It should be noted in table 3 that of the top 30
equivalents (French, German and Spanish are obvious items on the list, 2/3 of them occur only once, which
candidates for inclusion later on). would effectively drown them out of the investigator's
attention had not the weirdness-function been active (cp
3.3. Term extraction (cf. module 5). table 2).
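
To illustrate the weirdness ratio described in Section 3.3.1, the following Python sketch compares the relative frequency of each item in the text under scrutiny with its relative frequency in a general corpus. It is not SystemQuirk's own implementation: the exact normalisation, the simple tokenizer and the names `tokenize`, `weirdness` and `rank` are assumptions made for illustration, and the two sample texts are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weirdness(special_tokens, general_tokens):
    """Map each item of the special text to the ratio of its relative frequencies."""
    special = Counter(special_tokens)
    general = Counter(general_tokens)
    n_special = sum(special.values())
    n_general = sum(general.values())
    scores = {}
    for item, freq in special.items():
        if general[item] == 0:
            # Item unseen in the general corpus: the ratio is infinite ("inf!").
            scores[item] = math.inf
        else:
            scores[item] = (freq / n_special) / (general[item] / n_general)
    return scores


def rank(scores):
    """Order items by descending weirdness, breaking ties alphabetically."""
    return sorted(scores, key=lambda item: (-scores[item], item))


# Invented toy data standing in for the special text and the general corpus.
special_text = "the solvency of the bond"
general_text = "the the the the the the the bond bond of"
scores = weirdness(tokenize(special_text), tokenize(general_text))
ranking = rank(scores)
```

In this toy case the item absent from the general corpus rises to the top of the ranking although it occurs only once, which is the effect the weirdness function is meant to achieve.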
The slow time-honored technique of excerption has long since been supplemented by increasingly
sophisticated computational methods. Many of the results are impressive but have not allowed us to
dispense entirely with the services of the domain expert in tandem

While both tables contain terms which are immediately recognizable by an economist they only share
one (investment), and those on the Weirdness-list are clearly of a more specific domain-related
scope (and presumably less recognizable by a nonexpert). Table 3 has 12 inf!-terms, i.e. items not
occurring in a large corpus of general English, while the remainder occur between 151 and 3 times
more often than they would in that corpus. Thus their degree of specialization is approaching
general usage.

3.3.2. Terms as strings of content words.
The other tool offered by SQ for sniffing out potential multi-word terms, aptly named Ferret, is
based on a very simple algorithm: It takes a general list of function words as boundary signals and
proceeds to identify any string of content words uninterrupted by such boundary signals as a term
candidate. Table 4 displays the results obtained from examining the same text as above.

3.3.3. Equivalence checking: Plugging the terminological holes.
In economic domains the terminological pressure from English has increased in proportion to the
rapid globalization processes seen through the nineties and continuing unabated, while the
readiness to invest in professional means for handling the textual interface has been lacking. Most
of the recent efforts have gone into developing a speech interface, and the systematic monitoring
and creation of suitable terminology for use in translating economic texts has been left to private
initiative. Some subdomains thus appear well looked after, while others tend to end up with
haphazard and ad hoc equivalents for newly formed concepts and terms from
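
The boundary-signal algorithm described in Section 3.3.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Ferret's actual code: the short function-word list, the treatment of punctuation as an additional boundary, the two-word minimum and the name `ferret_candidates` are all hypothetical, and the sample text is invented.

```python
import re
from collections import Counter

# A deliberately small, assumed list of function words used as boundary signals.
FUNCTION_WORDS = {
    "a", "an", "and", "as", "at", "by", "for", "from", "in", "is",
    "of", "on", "or", "that", "the", "to", "with",
}


def ferret_candidates(text, function_words=FUNCTION_WORDS, min_words=2):
    """Count strings of content words not interrupted by a boundary signal."""
    counts = Counter()
    # Assumption: punctuation also ends a candidate string.
    for segment in re.split(r"[.,;:!?()\[\]\"]+", text.lower()):
        run = []
        # The trailing None is a sentinel that flushes the last run of a segment.
        for token in re.findall(r"[a-z0-9]+", segment) + [None]:
            if token is None or token in function_words:
                if len(run) >= min_words:
                    counts[" ".join(run)] += 1
                run = []
            else:
                run.append(token)
    return counts


# Invented sample text.
sample = (
    "The capital markets of the world bank and the capital markets. "
    "Share prices in emerging economies."
)
candidates = ferret_candidates(sample)
```

With so simple a procedure the quality of the candidates depends heavily on the boundary list: any function word missing from it will surface inside candidate strings.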
Frq   Match                            SL/GL Ratio
10    capital markets                  inf!
3     business cycle                   inf!
2     annual report                    inf!
2     central bank                     inf!
1     new york stock exchange          inf!
1     dow jones industrial average     inf!
1     cost of capital                  inf!
1     capital stock                    inf!
1     european union                   inf!
1     institutional investor           inf!
1     balance sheet                    inf!
1     fiscal policy                    inf!
1     solvency                         151.2382
6     equity                           88.5297
1     annuity                          75.6191
1     takeover                         75.6191
1     futures                          50.4127
31    investment                       35.1849
2     premium                          32.7001
2     inventory                        31.8396
1     liquidity                        30.2476
1     diversification                  27.4978
1     downstream                       19.5146
1     revenues                         10.2534
3     yield                            8.0303
4     bond                             7.8565
1     float                            6.8745
1     commodity                        5.5500
1     options                          4.4482
1     margin                           3.2700

Table 3: Top of System Quirk's frequency list with weirdness-function active.

8  capital markets     7  pension funds             7  mutual funds
5  see chart           5  less than                 5  life insurers
5  past decade         5  information technology    4  institutional investors
4  share prices        4  s economy                 3  this year
3  on average          3  recent years              3  since america
3  point out           3  but there                 3  this survey
3  other countries     3  retail sales              3  but even
3  cost savings        3  poorest countries         3  world bank
3  foreign             3  emerging economies        3  supply chain
3  short term          3  b2b e                     3  s gdp
3  hedge funds         3  state street              3  an annual average

Table 4: Ferreted strings

For reasons which are unclear Ferret missed two of the occurrences of capital markets, and it does
seem to invite some obvious refinements of its list of boundary signals, but otherwise the high end
of the frequency list does throw up some promising term candidates.

cultures. Since the two languages have very close historical and lexical affinities, one should not
be surprised to encounter a variety of terminological misfits, from simple (and humorous) folk
translations through cognate shifts to serious false friends which may create hazardous and
expensive mistakes.
Cognates constitute a rich quarry for terminological misfits. Consider the following examples:

1. Federal Reserve Bank of Minneapolis President Gary Stern warned on Friday against the "moral
hazard" that may prompt banks to undertake too much risk amid excessive confidence of government
safety nets.

Anyone connected professionally with hedging and insurance will recognize the special term (in
bold). While each member of the phrase has a cognate with several
meanings in Norwegian, it is rather obvious that the definitions etc. and go from there into the text bank to
connotations they bring along are quite different from the inspect authentic text samples illustrating usage,
English ones. Nevertheless the temptation to use the phraseology etc. This is particularly useful for a non-
direct method is clearly irresistible, as the following native student. Alternatively the student may proceed
sample (from a sizable collection) will show: directly from conceptual system to the text samples, and
from there via clickable text-embedded terms across to the
2. Kombinasjonen av usikrede lokale banker, moralsk full term-bank representation of the desired concepts to
hasard i utlandet, av kortsiktige utenlandske study definitions, synonyms, acronyms etc.
kapitalplasseringer og Pengefondets Students approaching a new knowledge universe will
innstrammingspolitikk, ga kraftige negative utslag. easily detect concepts not adequately covered or
explained. All searches will be logged to allow a study of
A linguistically sensitive person familiar with the user behavior and user needs, with a view to enhancing
concept underlying the original expression in 1 (including the intuitiveness of the user interface. Following an
their use as separate English words) will realize that the unsuccessful search the user will be asked (through
calque in 2 creates undesirable connotations. automatic routines) to report unfound terms and submit a
Unfortunately many will fail to see the problem, which relevant text segment with source reference, and will have
allows the emergence of a Norwenglish (quasi- a chance to include responses or comments. It will be
Norwegian) terminology lacking professional and cultural considered whether users also should be invited to join an
quality assurance. Arriving at the Norwegian equivalent official discussion group. Success in engaging the user
åtferdsrisiko requires professional handling, time, and in such dynamic interaction will not only provide a way of
relevant domain knowledge (another subdomain prefers monitoring a continuous growth of the collection but may
subjektiv risiko). It also requires an efficient also create greater user identification with the KB-NHH,
dissemination channel to ensure its adoption and use. which in turn may have a standardizing effect.
Equivalence checking is thus serious and important
business for anyone purporting to traverse the knowledge 5. Maintenance and development.
gap as well as the language gap through translation or
related forms of text production. It appears to be one stage New concepts are constantly being created in the
of the bridge building which cannot easily dispense with professional community and migrate towards general
the bilingual human expert/terminologist or their term usage, sometimes even grabbing front page headlines:
creation principles, be they linguistically, politically or unit-link, derivatives and hedge funds have recently
culturally motivated. In other words, the bridge heads on enjoyed such instant attention. At the time of writing e-
either side must be anchored in their respective business is very much in vogue (along with almost any
professional context, and the quality of work assured noun with an e- prefix), and creative accounting is already
through a content and terminology management system. a cliché in the financial headlines.
Only then can our efficient computer-based tools for This implies that simply registering the constitutive
processing and dissemination come into their own. concepts of a given domain, including their manifestation
through the terminology of one or more national
languages, is not done once and for all. What is required is
4. Dissemination and use. a more or less continual monitoring of the entire life cycle
At the outset the material held in the KB-NHH of any given term, from creation through extension and
database will form the basis for student oriented bilingual expansion to disuse and eventual death. The above are
domain glossaries with definitions, as well as genre- random examples of an ongoing process which is in fact
related material for learning and teaching. Both textbank quite normal, although the speed and intensity may vary
and termbank will be SGML conformant, adhering as far with the times and the subdomain. Ideally the new or
as possible to the TEI guidelines, which allows interactive altered terms would need to be absorbed by writers, their
access via a Web-browser or ftp downloading. In addition underlying concepts defined and systematized by domain
all or parts of the termbank may be distributed on CD- experts and terminologists, standardized by professional
ROM. Printed versions are possible, but the main bodies, and their usage documented through carefully
emphasis will be on interactive use via electronic vetted citations. At the receiving end of this process would
networks. This will take full advantage of the dynamic be speakers of other languages (be they experts,
aspects of electronic media, allowing e.g. fuzzy matching journalists or textbook authors) who would ideally have to
of any search to the nearest form. establish procedures for finding or creating equivalent
The diagram referred to in Appendix outlines the terms and determine proper usage.
current architecture of KN-NHH, a proof-of-concept
implementation of the e-learning oriented application of 6. Outlook
TERMINEC. The student enters the e-learning system (cf.
module 11, a Blackboard-type system) via a standard This paper makes a distinct contribution by proposing
web-browser (cf. module 10), accesses the course catalog the integration of a conceptual knowledge-base with the
and proceeds to the description/presentation of the course textual manifestation of its underlying domain knowledge
content in either English or Norwegian. All domain focal and its terminological representation in one or more
terms have active links to the central conceptual system. languages, all in the context of a standard e-learning
At this point the student may follow the link to the system. This should greatly facilitate learning by bridging
relevant term record in the desired source language, study the language gap experienced by native and non-native
students alike in approaching a new knowledge domain. A
well documented and web-accessible clearinghouse for
English-Norwegian economics text and terminology as
envisaged here would also establish a significant point of
reference for empirically based term-formation and
possibly standardization, thus providing Norwegian
export-oriented corporations with a much needed quality
assurance of the linguistic interface. The same would hold
for Norway's administrative and political cooperation
with the outside world, as well as for the global language
industry, which depends on the availability of multilingual
databases and some form of translation. The realism in
trying to stem the flood of English usage in conducting the
professional affairs of people whose normal mode of
communication is something other than English is highly
debatable, but the virtue of avoiding linguistic domain
losses in Norwegian is not.
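
Section 4 mentions fuzzy matching of any search to the nearest form. The paper does not say how KB-NHH would implement this, so the following is only a minimal sketch using Python's standard `difflib` module. The headword list and the function name `nearest_terms` are hypothetical and stand in for real termbank entries.

```python
import difflib

# Hypothetical headwords; the actual KB-NHH term records are not given in the text.
TERMBANK = ["hedge fund", "derivative", "unit-link", "creative accounting", "e-business"]

def nearest_terms(query, headwords=TERMBANK, limit=3, cutoff=0.6):
    """Return up to `limit` headwords closest to `query`, best match first."""
    # Compare in lower case but return the headword as stored.
    lowered = {h.lower(): h for h in headwords}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=limit, cutoff=cutoff)
    return [lowered[h] for h in hits]

if __name__ == "__main__":
    print(nearest_terms("hedge funds"))
    print(nearest_terms("creativ acounting"))
```

The `cutoff` value controls how tolerant the search is; a real termbank would probably also need lemmatisation, which this sketch leaves out.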
[Appendix diagram: architecture of KB-NHH. Recoverable labels: 1. subdomain knowledge; 2. Domain-expert; 3. Terminologist; KB-NHH; Copyright; User panel; concept hierarchies, ontologies, thesauri; Tagging; XML coding; Course, Course content, Course unit; Web interface; 10. Search engine; 11. e-learning]
Creating a Term Base to Customise an MT System: Reusability of Resources
and Tools from the Translator's Point of View
Natalie Kübler
Intercultural Centre for Studies in Lexicology
University Paris 7
2, Place Jussieu, 75251 Paris Cedex 05, France
This paper addresses the issue of combining existing tools and resources to customise dictionaries used for machine translation (MT)
with a view to providing technical translators with an effective time-saving tool. It is based on the hypothesis that customising MT
systems can be achieved using unsophisticated tools, so that the system can produce output of sufficient quality for post-translation
proofreading. Corpora collected for a different purpose, together with existing on-line glossaries, can be reused or reapplied to build a
bigger term base. The Systran customisable on-line MT system (Systranet) is tested on technical documents (the Linux operating
system HOWTOs), without any specialised dictionary. Customised dictionaries, existing glossaries completed by adding corpus-
based information using terminology extraction tools, are then incorporated into the system and an improved translation is produced.
The dictionary will be augmented and corrected as long as modifications generate significant results. This process will be described
in detail. The resulting translation is good enough to warrant proofreading in the normal way. This last point is important because
MT results require specialised editing procedures. Compared with the time taken to produce a translation manually, this methodology
should prove useful for professional translators.
of bilingual entries in the specialised area of computing, though they partly have the same
headwords. Three glossaries were selected initially, because they contain terms that do not cross
LSPs because they are domain-specific. They were downloaded, corrected, and formatted, to be
compiled as customised dictionaries in Systranet. Here is the list of selected glossaries and the
number of headwords for each:
- The HOWTO translation project glossary [4]: a small glossary of 200 words discussed and agreed
  upon in the project discussion list.
- Netglos Internet Glossary [5]: a multilingual glossary of Internet terminology compiled in a
  voluntary, collaborative project, containing 282 terms.
- The RETIF [6] site glossary. This short glossary contains 73 terms approved of by the French
  Governmental Terminology Commission for Computing and the Internet.

2.2. Corpora
Corpora make up the core resource exploited by the Systran team. Smaller corpora, exploited with
simple tools, produce interesting results on a more individual scale. The smaller corpora used in
the experiment had been collected to teach computer science English to French-speakers (Foucou &
Kübler 2000). The texts used are highly technical and freely available on the Web:
- Internet RFC [7]: 8.5 million words: monolingual English corpus. This corpus consists of the
  Internet Request For Comments available on the RFC documentation site.
- Linux HOWTOs: English to French aligned corpus, ca. 500 000 words. The English HOWTOs and their
  translations in several languages are available on the Linux documentation site [8].
The above-mentioned corpora are embedded in a Web-based environment that can be accessed on our
Wall [9] site.

2.3. The Web
The Internet has become a necessary resource for linguists, lexicographers, translators, and other
language researchers, providing them with on-line dictionaries, reference documents, newsgroups.
The Web can also be considered as an open-ended, unstructured corpus which can be queried using
search engines, though these are not tailored for linguistic search. A specific linguistic search
tool is WebCorp [10] (Kehoe & Renouf, forthcoming), which provides users with concordances,
collocates, and lists of words found on Web pages; we have used this for a variety of purposes. A
Web-based search strategy should be used in conjunction with the off-line, finite, corpus-based
approach, since they yield complementary information.

2.4. Tools
The first tool used is an on-line concordancer featuring perl-like [11] regular expressions, which
gives access to aligned paragraphs of French and English texts from which a concordance has been
extracted. Another on-line tool is a tokeniser, which allows the user to sort the words of a text
in alphabetical order, or by frequency.
As the general philosophy of this experiment was to use simple tools, a commercially available
term extraction tool was selected: Terminology Extractor [12], which works for French and English.
It uses a dictionary to lemmatise the vocabulary of a text and produce four different output
types:
- Canonical forms: recognised by the program and sorted by alphabetical order or by frequency; the
  most frequent forms are to be considered as potential terms.
- Non words: not recognised by the system; most of them are specialised terms.
- Collocations. Collocational extraction is based on a very simple principle: any sequence of at
  least two -- and at most ten -- words, that is repeated at least once is considered as a
  collocation. Stop words are discarded to avoid sequences, such as sauvegarde de la [save the], in
  which la is a determiner preceding the second part of the term, as in sauvegarde de la
  configuration [save the settings]. Collocates are good candidates for terms.
- KWIC (key word in context): for the combined three lists. This feature is used to extract
  lexico-grammatical information, on verb structures, for example.

3. Systranet: customisable dictionaries
Systran MT has been much improved in recent years (Sennelart et al. 2001). Systranet is an on-line
service offered by Systran. Users have access to a dictionary manager which allows them to create
and upload their own multilingual linguistically-coded dictionaries into Systran, in order to
improve translation results. These multilingual dictionaries contain a list of subject-specific
terms that are analyzed prior to using Systran in-house dictionaries. This feature is based on the
assumption, demonstrated by Lange & Yang (1999), that domain selection and terminology restriction
are beneficial to translation results.
Linguistic information, such as part-of-speech, number and gender, subcategorisation, or low-level
semantics can be added to the user's dictionary entries. Once the dictionary has been compiled,
its accuracy and linguistic coverage can be tested by translating subject-specific texts.
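
The collocation principle described under 2.4 (any sequence of two to ten words that occurs again at least once) can be sketched as follows. This is not Terminology Extractor's actual implementation, which the paper does not describe. The stop-word list, the rule of rejecting candidates that begin or end with a stop word, and the function names are assumptions made for illustration, and sentence boundaries are ignored for brevity.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); a real tool would use a fuller one.
STOP_WORDS = {"de", "la", "le", "les", "des", "du", "un", "une", "the", "a", "of"}

def tokenise(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def collocation_candidates(text, min_len=2, max_len=10, min_count=2):
    """Return word sequences of min_len..max_len words seen at least min_count times."""
    tokens = tokenise(text)
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # Assumed use of stop words: skip sequences that start or end with
            # one, so that "sauvegarde de la" is never proposed on its own.
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            counts[gram] += 1
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_count}

if __name__ == "__main__":
    sample = ("Sauvegarde de la configuration. "
              "Puis sauvegarde de la configuration du noyau.")
    print(collocation_candidates(sample))
```

On the sample text only the full sequence sauvegarde de la configuration survives, which matches the behaviour the paper describes for its example.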
[11] Perl is a particularly appropriate programming language for handling word strings or finding
language patterns.

The translation results can be improved by modifying the dictionary, a recurrent process which can
be continued so long as the modifications produce significant improvement. Systranet offers
specific features that allow
the user to see which terms have been translated using customised dictionaries, and which terms
are not recognised at all. It allows the user to check whether the dictionary entries have really
improved the translation results as expected. Another feature used to complete the dictionary is
the non-word feature: all the words that have not been recognised by Systran or the user's
dictionaries appear in red. They can then be integrated into the user's dictionary.

4. Experiment and methodology
We chose technical documents written by experts for experts, the Linux HOWTOs, which are the user
manual of the Linux operating system. This experiment is part of a larger project that consists in
translating all the new HOWTOs using MT. HOWTOs are documents of various size, describing the way
to install the system and software related to it. Existing software is constantly updated and
augmented, so the corresponding documents are updated and new documents are written with each new
program. These documents have been translated into several languages by the various Linux
communities. The French Linux community has developed a translation project [13] in which the
translation is usually done by non professional, voluntary translators. People choose the document
they want to translate and do the job. Today, most HOWTOs have been translated, which makes it
possible to align the French translations with the English source and use them as a parallel
corpus.
The task set for the experiment was to provide a complete and appropriate dictionary to translate
the remaining untranslated Linux HOWTOs. This is based on the assumption that the initial
dictionaries will be augmented in the light of each new text to translate. Since a comparative
study of the translation results -- with and without customised dictionaries -- had to be
established, each text was first translated without using any specific dictionary.

4.1. Creating the dictionaries
The methodology is a combinatorial approach, recycling data and using terminology extraction
tools. First, the three glossaries mentioned above were downloaded and converted into dictionary
files, augmented with linguistic information, giving more than 500 entries. These glossaries were
selected when translating a HOWTO. Then, a more complete and corpus-based approach was applied. It
produced two types of dictionary: step-one dictionary and step-two dictionary.

4.1.1. Step-one dictionaries
The step-one dictionaries were created using term extraction software, corpora, and a
concordancer. This sort of dictionary can be produced using large corpora, but the most efficient
solution for the individual user is to apply it to the texts to be translated.
The candidate texts were processed using Terminology Extractor. Initial candidates for headwords
in the dictionaries were selected from the non-word and collocation lists. Unlike the existing
glossaries, Terminology Extractor outputs do not provide French equivalents for the English words.
On-line term banks, such as Le Grand Dictionnaire Terminologique [14] or Termium [15] proved
insufficient for translating most terms. A corpus-driven approach was adopted to find French
equivalents: the RFC corpus was used to find more information about context, the aligned HOWTO
corpus was queried with the regular expressions concordancer (Wall) to find appropriate
translations, as illustrated below.
The term README in the computing context is used as a noun, as shown in the following context, in
which the term is the head of a subject NP:

    links which Linus describes in the README are set up correctly. In general, if a

Figure 1. The noun README in context

The term addon was in the non word list, but by using the HOWTO corpus, we found contexts and a
French translation:

    The FWTK does not proxy SSL web documents but there is an addon for it written by Jean-Christophe
    Le fwtk ne route pas les documents web SSL, mais il existe un module complémentaire écrit par Jean-

Figure 2. The noun addon and its French translation

This stage was necessarily completed by using Web search engines to verify some translations found
in the HOWTOs, or to deduce new translations from indirect queries. Since the documents are
translated by various people who are usually not professional translators, but computing experts,
the French versions of the HOWTO are not homogeneous. This means that one English term can be
translated by several different words that are true synonyms in French. Only one equivalent must
be chosen for the MT dictionary. Another problem is the case of borrowings. In spoken computing
French, the English term is often used. Even in written texts, and especially in translations,
usage leads translators to keep the English term and give the French equivalent once at the
beginning of the document.
When no answer can be found in the HOWTO corpus, WebCorp can provide solutions. By looking for
collocates and concordances for an English term in French language documents, possible
translations can be traced back to the French sites. The collocates of network in French-speaking
sites, for instance, allowed us to trace back home network and the French réseau domestique
(Kübler, forthcoming).

4.1.2. Step-two dictionaries
Once a set of dictionaries has been produced for each HOWTO, it must be tested not only to correct
possible
errors in the entries, but also to add the new words that are neither in Systran's nor in the
customised dictionaries. The more HOWTOs are translated, the fewer words have to be added until
the dictionaries are saturated, i.e. no new word can be added to improve translation results.
Step two is illustrated with the Home-Network-Mini-HOWTO, one of the not yet translated HOWTOs.
Below is an example of translation results with and without customised dictionaries:

Source text:          This page contains a simple cookbook for setting up Red Hat 6.X as an
                      internet gateway for a home network or small office network.
Without cust. dict.:  Cette page contient un cookbook simple pour le chapeau rouge 6X
                      d'établissement en tant que Gateway d'Internet pour un réseau à la maison
                      ou le petit réseau de bureau.
With cust. dict.:     Cette page contient un cookbook simple pour l'établissement Red Hat 6.X en
                      tant que passerelle Internet pour un réseau domestique ou un petit réseau
                      de bureau.

Fig. 3: Comparing translation results with and without customised dictionaries

In the next table, the customised dictionaries were completed with the words badly or not at all
translated with the first version of customised dictionaries.

Source text:     This page contains a simple cookbook for setting up Red Hat 6.X as an internet
                 gateway for a home network or small office network.
Step-one dict.:  Cette page contient un cookbook simple pour l'établissement Red Hat 6.X en tant
                 que passerelle Internet pour un réseau domestique ou un petit réseau de bureau
Step-two dict.:  Cette page contient des recettes simples pour l'installation Red Hat 6.X en tant
                 que passerelle Internet pour un réseau domestique ou un petit réseau de bureau.

[...] in the same context. Most errors in this particular MT system are due to the same syntactic
failures and can easily be corrected by the translator, once recognised.
Conjunction and disjunction are two of the main problems in MT systems that have yet to be solved.
The garbled translation is however easily corrected, since the errors are similar each time a
conjunction or a disjunction appears in an NP context:

Source text                          Translation result                     Correct transl.
Your internal and external networks  votre interne et des réseaux externes  vos réseaux interne et externe
a fulltime Cable or ADSL connection  une connexion en continu d'AADSL       une connexion en continu par le câble ou l'ADSL

Fig. 5: Conjunction and disjunction in an NP context

Another characteristic of MT systems is the overgeneralisation of transfer rules which leads to
errors. Again, it is quite easy to check and correct those errors, for instance, the system
translates a zero article in English by a definite article in French, although, in most cases, it
should be the indefinite article:

Source text                      Translation result                   Correct transl.
decoded by specific individuals  décodé par les individus spécifiques  décodé par des individus spécifiques

Fig. 6: An example of transfer rule overgeneralisation

4.3. Human vs machine?
We selected two HOWTO totalling 9357 words in English. The expansion coefficient (15% in French)
brings the total up to 10 750, i.e. ca. 36 standardised pages. This should take a professional
translator from 5 to 7 days, depending on the tools used. Systranet took less than two minutes to
produce an outcome. Professional translators assess the proofreading necessary at ca. 2 days. MT
can therefore be included in the set of tools professional translators can actually use.
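
The figures in 4.3 can be checked with a short calculation. The word count, expansion coefficient and day estimates are taken from the text; the value of 300 words per standardised page is an assumption (the paper does not state the page size), chosen because it reproduces the quoted figure of ca. 36 pages.

```python
SOURCE_WORDS = 9357          # English words in the two HOWTOs (from the text)
EXPANSION = 0.15             # expansion coefficient for French (from the text)
WORDS_PER_PAGE = 300         # assumed size of a "standardised page"
MANUAL_DAYS = (5, 7)         # manual translation estimate (from the text)
PROOFREADING_DAYS = 2        # proofreading of MT output (from the text)

def estimate():
    """Return target word count, page count and the range of days saved."""
    target_words = round(SOURCE_WORDS * (1 + EXPANSION))
    pages = target_words / WORDS_PER_PAGE
    # The MT run itself takes under two minutes, so it is ignored here.
    saved = tuple(days - PROOFREADING_DAYS for days in MANUAL_DAYS)
    return target_words, pages, saved

if __name__ == "__main__":
    words, pages, saved = estimate()
    print(f"{words} target words, about {pages:.0f} pages")
    print(f"days saved: {saved[0]} to {saved[1]}")
```

The computed target count (10 761) is close to the rounded 10 750 quoted in the text, and under these assumptions MT plus proofreading saves roughly three to five working days on this sample.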
Further work will focus on reusing customised dictionaries to translate cross-LSP texts, such as
digital cameras. More testing on the coding of Systranet customisable dictionaries is currently
being done with students to improve coding rules and their applications.

Zanettin, F. 2000. Parallel Corpora in Translation Studies: Issues in Corpus Design and Analysis.
In Olohan M. (ed.) Intercultural Faultlines. Manchester: St Jerome Publishing.
Evaluating Translation Memory Systems
Angelika Zerfass
Freelance Translation Tools Consultant
Holzemer Str. 38
53343 Wachtberg
Since the mid 1980s, translation tools have taken over more and more of the daily lives of translators and translation project managers.
But a lot of time now has to be spent on evaluation, training and administrative tasks.
Translation tools were designed to make the translator's work easier, faster and more efficient. They range from conversion utilities to
terminology management, translation memories, machine translation as well as workflow and project management systems.
They were developed with the aim of reducing repetitive translation work, but on the other hand they add new tasks to the workload,
such as database administration.
This presentation will give an overview of one area of translation tools - the different translation memory systems on the market today
and the technologies they use. It includes a comparison of common basic features like word count, analysis/statistics function and
pre-translation, some tools' specialities as well as the description of data exchange possibilities between the systems by use of the TMX format.
As there is no one best tool for everything, the aim of this workshop is not to recommend one tool, but to provide some guidelines
for evaluating translation memory systems according to individual requirements.
[Figure: TM system workflow. Recoverable labels: Read in project files; Old source; Old target; New source file; TM system; Compare new source with old source; Fill in translation of same or similar segments]
The database model on the other hand stores all translations ever made in one database,
independent of context, which is useful if the same or similar segments appear in different
projects and document types. Most of the commonly used translation memory systems are able to work
with any language installed on the user's machine and they usually also allow the user to add
project or user specific information to each translation.
[Figure: database model. Recoverable labels: New source file; TM system; Look up segments; Offer translation; Save new translation to database]
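
The database model described above (look up a segment, offer a stored translation, save the new translation) can be sketched in a few lines. This is an illustration only and not the design of any commercial system: the class name `TranslationMemory`, the 0.75 threshold and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions.

```python
from difflib import SequenceMatcher

class TranslationMemory:
    """Toy database-model TM: one store of segments, independent of context."""

    def __init__(self, fuzzy_threshold=0.75):
        self.segments = {}                  # source segment -> target segment
        self.fuzzy_threshold = fuzzy_threshold

    def save(self, source, target):
        """Save a new (or corrected) translation to the database."""
        self.segments[source] = target

    def lookup(self, source):
        """Return (stored_source, target, score) for the best match, or None."""
        best = None
        for stored, target in self.segments.items():
            score = SequenceMatcher(None, source.lower(), stored.lower()).ratio()
            if score >= self.fuzzy_threshold and (best is None or score > best[2]):
                best = (stored, target, score)
        return best

if __name__ == "__main__":
    tm = TranslationMemory()
    tm.save("Save the settings.", "Sauvegardez la configuration.")
    print(tm.lookup("Save the settings."))    # identical segment
    print(tm.lookup("Save the setting."))     # similar segment
    print(tm.lookup("Completely unrelated sentence"))
```

A score of 1.0 corresponds to an identical segment and anything between the threshold and 1.0 to a similar one; real systems compute match percentages in their own ways, which is one of the points an evaluation would compare.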
Language Resources at the Languages Service of the United Nations Office at Geneva
Marie-Josée de Saint Robert
United Nations Office at Geneva
1211 Geneva 10
The language staff at the United Nations makes a very selective use of language technologies. So far no computer-assisted translation
software has been installed on translators' workstations even though tests have been conducted for several years on the two major
computer-assisted translation (CAT) systems at United Nations Headquarters in New York, for instance. The aim of this paper is
twofold: 1) to show why CAT systems are not considered potential sources of improvement in either the quality or the quantity of translation
work at the United Nations, and 2) to present the kind of language resources that are considered essential for the adequate rendering of
content in any of the six official languages of the United Nations (Arabic, Chinese, English, French, Russian and Spanish). This paper
analyzes the particular linguistic and technical constraints specific to an international setting and argues in favour of a selected number
of language resources used at the United Nations other than translation tools readily available on the market. Among such language
resources, one finds search engines, government and research institutions' websites, and, in the not too distant future, institutional
knowledge bases.
1. Introduction
In an international, multilingual environment such as the United Nations, surprisingly enough,
translators and language staff in general are not considered on the same footing as substantive
departments, which prepare reports and organize conferences. Wherever technological innovations
are designed and developed, the primary concern is the diplomatic community or the international
community at large, not the language staff. Although translators do have a major role to play in
the preparation of parliamentary documentation, their needs, such as prompting automatic alignment
of two language versions of the same document whenever desirable, are very seldom taken into
consideration by United Nations designers and developers. This low profile for linguists may well
explain why so few technological innovations have made their way through to the translator and the
terminologist. More reasons can be found in the very nature of the translation process in
multilateral diplomatic settings where linguistic and technical constraints play an important
role.

2. Linguistic Constraints
Several linguistic constraints are obstacles to the straightforward application of language
technologies to translation work. Some are quite obvious, while others are specific to
international organizations.

2.1. Word Choice
Translation cannot be reduced to the mechanical substitution of one set of terms in one language
by a similar set in another language.

2.1.1. Semantic Adequacy
The sentence starting with (1) should not be translated into French by (2) no matter how common
that phrase is but by (3):
(1) the report shows
(2) le rapport montre que
(3) il ressort du rapport que
Also, the correct rendering in French of the English phrase (4) is not (5) but (6):
(4) abusive sexual practices that may affect very young girls
(5) pratiques sexuelles abusives qui peuvent affecter les très jeunes filles
(6) pratiques sexuelles dont peuvent être victimes les très jeunes filles
It is not always clear with CAT whether faulty phrases such as (2) and (5) would be offered by the
system, as it may only keep the first instance found and disregard other instances of the same
phrases found subsequently, and whether the translator in haste may not accept the phrases in (2)
and (5) since both look correct from the grammatical point of view but are incorrect from the
semantic point of view [1]. Maybe more accurate information on what CAT systems do is needed. Yet
it remains to be seen whether distributed management of translation memories can be efficiently
organized on a large scale, with fifty translators having the right to update the translation
memory on a permanent basis in each language pair.

[1] In (2) an inanimate noun is used with an animate verb; in (5) it is as though sexual practices
would be divided into two categories: abusive and non-abusive, which is wrong in the case of very
young girls.

2.1.2. Lexical Variety
Translations serve the purpose of a specific communication need and should not be considered as
models for translators to replicate across the board. Such is also the case for terminology in any
target language. Mere electronic bilingual dictionaries or glossaries cannot
satisfactorily capture variation, not only in the original language but also in the target
language, if based upon the assumption that a notion corresponds to a term in English and one or
several terms in French, for instance. Names given to human rights are a case in point. A
terminologist would very happily collect the names of all rights, starting with the right to food,
to adequate housing, and to education, while a translator would resent it. Such rights are indeed
referred to under different names by different speakers, and a too rigid list of rights would miss
the needed subtleties while discussions are still under way. Should "adequate housing" be rendered
in French by "logement convenable", "logement adéquat", "logement suffisant", or "logement
satisfaisant", all four equivalents being found in United Nations legal instruments or
resolutions, and not by "bonnes conditions de logement" or "se loger convenablement" when the
context allows or requires it? Translators want to preserve flexibility, when present-day
translation systems propagate rigidity and, as a lurking consequence, poverty of style and
vocabulary. For Fernando Peral (2002), a translator at the International Labour Organization: "The
main operational problems of semi-automatic translation [i.e., translation with the help of
translation memory systems] are linked to the quality of the output and to a process of
de-training of the translator, who becomes less and less used to the mental process of searching
for proper solutions in terms of functional equivalence and relies more and more on the machine's
decisions, which inevitably affects professional development and job satisfaction."

2.2. Linguistic Insecurity
Document originators at the United Nations are nationals from over a hundred and twenty countries.
In most cases their native language is not one of the official languages of the Organization, and
document drafters erroneously think they have to use English, which may prevent them from using
their main language, even when it is an official language, and producing better originals.
Documents may also be submitted to the United Nations by officials or experts working for Member
States that do not have any of the official languages of the Organization as their main language.
Syntactic, semantic and morphological mistakes are therefore not rare in documents, and in most
cases only translators are in a position to detect mistakes and rebuild faulty sentences in the
original text. Only they are required to work in their native language that is one of the official
languages. Due to lack of resources at the United Nations, only a small portion of all documents
is edited prior to being translated (e.g., documents prepared by the Commission on Human Rights).
Translators consequently do act as filters for grammatical correctness and language consistency as
they work on the texts to be translated. As a result, they often improve original texts whenever
the drafters or submitting officers accept their changes in the original documents. A translation
memory that straightforwardly processes a document to be translated prior to the perusal of a
translator may not detect inappropriate use of terms or syntactic errors in the original language.
Even when an automatic term-checking system is appended to the translation memory, it may not be
as efficient as a human eye either. The fear therefore is that a computer-assisted translation
system may add more mistakes to the original ones, which will then be even harder to detect and
correct.

2.3. Different Stylistic Rules
Document drafters use a variety of writing rules and styles to convey meaning. For instance, among
writing styles one can mention the fact that repetitious words are not considered as poor style in
English but are definitely considered poor style in French. The English sentence (7) presents a
repetition of the word "aircraft" which the French rendering in (8) would avoid:
(7) the shooting down of civil aircraft by a military aircraft
(8) la destruction d'aéronefs civils par un appareil militaire

2.4. Functional Adequacy
Each Committee or Body has specific ways of expressing an idea in order to reach a consensus
within its respective audience or circle. Underlying references to protagonists, former meetings,
earlier decisions discussed by Committee members but not explicitly mentioned in the text play an
important role in translation. Sometimes the reasoning of a rapporteur, a speaker or an author, or
an amalgam of lengthy sentences couched in simple terms that are perfectly unintelligible to the
outsider, i.e., someone who has not participated from the beginning in the discussions, has to be
left untouched in the original.
Acceptability of a translated text does not come solely from its grammatical and semantic
well-formedness. It must also be appropriate within the United Nations context. A translated text
must, like its original, follow a highly standardized path: it must convey the impression of
having been written by a long-time member, perfectly familiar with the background in which the
text has been drafted, even if it is deliberately vague or obscure. In fact most United Nations
texts cannot be interpreted without prior knowledge of the particular political framework in which
they appear. The sociopolitical motivation and rationale behind a text are part of the unwritten
constraints imposed on communicative competence at the United Nations. Developments in artificial
intelligence are not perceived to have reached this level of refinement. As Fernando Peral (2002)
puts it: "translation is based on finding functional equivalences that require linguistic,
intertextual, psychological and narrative competence; only human beings are capable of determining
functional equivalences; productivity in translation is therefore intrinsically linked to the
capacity of the translator to find the adequate functional equivalence, i.e., it is based on the
quality of the translator."
These constraints conflict with the concept of translation reuse for translation purposes on which
most commercially available alignment tools and translation memory systems are based, especially
when document traceability (i.e., the capacity of retrieving the complete document from which a
sentence is extracted by the translation memory system) is not guaranteed.

3. Technical Constraints
Quality requirements are not always met in translated documents for technical reasons.

3.1. Time Constraints
When deadlines for document submission are not respected, translation cannot be performed under
the required conditions. Feeding translation memories with texts that have not been properly
revised for lack of time appears to be useless, even when such texts are considered basic texts
in an area. The underlying assumption is that basic texts can be improved over and over as they
are cited in other texts, but no one can guarantee that this will indeed be the case, as
translators are more and more often required to work under emergency conditions, keeping revision
at a very low level.
This explains why most documents are not considered by translators as authoritative sources for
official denominations in either the source or the target languages. Most official names of
international and national organizations, bodies and institutions are referred to under several
names in various documents and sometimes even within the same document. Alignment tools and
translation memories that would provide precedents in two languages to translators might
perpetuate the number of variants and the confusion rather than help translators to use the right
equivalent, unless quality assessment is performed, which is a rather slow and uneconomical
process looked down upon in an era of search for productivity gains. The problem is even more
complex when it comes to designating a body whose name may be official in one or two languages
but not in other languages. Chances are that transliterated names in English, French or Spanish
rarely reappear under the same denomination unless a rather time-consuming compilation is done to
provide the best possible equivalents across official languages for use by translators. Yet as
George Steiner (1975) rightly puts it: Languages appear to be much more resistant than originally
expected to rationalization, as well as to the benefits of homogeneity and technical
formalization. Languages resist because human beings resist.

3.2. Digital Divides
Other technical constraints make the use of CAT systems difficult: 1) non-submission of documents
in electronic form: many documents are submitted on paper with last-minute written corrections,
linguistic insecurity or a changing appreciation of political requirements being the main causes
of last-minute changes; 2) non-availability of reference corpora: some official references may
exist in one or two languages and have to be translated into other languages, and reference
documents that are considered authoritative in one language pair may not be so in another, thus
the task of building translation memories is labour-intensive, language pair by language pair;
3) scarcity of digitalized language resources in some languages: translators cannot completely
switch to ready-made technological innovations, and expertise in conventional research means
should be kept.

3.3. Lack of Preparedness
CAT tools are known to be most efficient with repetitive texts. So far, since at the United
Nations not all texts are available in electronic form, it is hard to assess the amount of
repetition and thus to ascertain whether or not CAT is an efficient tool in this environment.
Proper training also has to be given to translators to make certain they know how to utilize the
tools that they are given. The fear is that translators are no longer assessed only for their
linguistic and narrative competence and performance, but also for their computer skills.
Finally, equipment used in an international organization has to be compatible with the equipment
required by a particular CAT software package.

4. Tools for Translators
Translators at the United Nations make use of internal glossaries and terminologies developed
within the specific institutional constraints.

4.1. In-house Glossaries
A dictionary look-up tool commonly used by translators at the United Nations provides a list of
equivalents to remind translators of all possible synonyms, as is the case for "significant" in
English and its possible renderings into French:
Significant - Accusé, appréciable, assez grave/long, caractéristique, certain, considérable, de
conséquence, d'envergure, de grande/quelque envergure, digne d'intérêt, d'importance, de poids,
de premier plan, distinctif, efficace, élevé, éloquent, explicatif, expressif, grand, important,
indicatif, instructif, intéressant, large, louable, lourd de sens, manifeste, marquant, marqué,
net, non négligeable, notable, palpable, parlant, particulier, pas indifférent, perceptible, plus
que symbolique, positif, pour beaucoup, probant, qui compte, qui influe sur, réel, remarquable,
représentatif, révélateur, sensible, sérieux, soutenu, significatif, spécial, substantiel,
suffisant, symptomatique, tangible, valable, vaste, véritable, vraiment; a significant
proportion: une bonne part; in any significant manner: un tant soit peu; not significant: guère;
the developments that may be significant for: les événements qui peuvent présenter un intérêt
pour; to be significant: ne pas être le fait du hasard (Organisation des Nations Unies, 2000).
Access to validated and standardized terminology is considered more important than access to
tools for document reuse, other than the basic cut and paste function applied to documents
carefully selected by the translator and not automatically provided by the system. Dictating
sentences afresh, once proper terminology has been identified, is also considered a less
time-consuming process than reading and correcting all, or a selection of all, possible
renderings of a sentence found in previously translated documents by a context-based translation
tool. Language resources used by United Nations translators are thus primarily terminology
search engines that facilitate the search for adequacy given the specific
context in which the document has been drafted, rather than any previous context.

4.2. Web resources
Language resources used by translators also include online dictionaries and the websites of
government and research institutions that translators have learned to identify and query for
information extraction and data mining. Portals have been designed to help translators locate the
best language and document sources on the Internet.

4.3. Alignment Tools
Additional tools are document alignment tools by language pairs. Indexing of large text corpora
for retrieval of precedents is felt to be preferable to tools that provide text segments, be they
paragraphs, sentences or sub-units, with their respective translations but without any indication
of date, source, context, originator, or name of translator and reviser, all of which are needed
to assess adequacy and reliability in an environment where many translators are involved.

4.4. Knowledge Base
The construction of a knowledge base is envisaged to help translators perform their task in a
more efficient manner. Ideally it would capture all knowledge generated by United Nations bodies
and organs and by various organizations and institutions working in related fields (i.e., any
subject from outer space to microbiology tackled by the United Nations), as well as the knowledge
and know-how of an experienced translator well trained in United Nations matters and that of an
experienced documentalist knowing which documents are the most referred to. Such a knowledge base
would, for instance, predict instances where "guidelines" should be translated in French by
"directives", as given by most dictionaries, and where "principes directeurs" would be a more
appropriate translation. In statistical documents at the United Nations, one finds
"recommendations", a term which is translated by "recommandations" in French and refers to rules
to be followed, and "guidelines", translated as "principes directeurs", which are mere
indications to be taken into consideration. If the term "directives" were used in such a context,
it would suggest a document of a more prescriptive nature than "recommandations", although the
latter are actually more binding. Such instances of translation are best captured by a knowledge
base that refines contexts and provides the best reference material on any topic in the text to
be translated. The knowledge base would provide not only adequate referencing and documentation
of the original, but also the basic understanding of any subject that arises in a United Nations
document.
Such a knowledge base ideally would reduce the choices offered to the translator rather than list
all possibilities. The easier it is for the translator to make the decisions he or she needs to
make, the faster he or she delivers.
The knowledge base would also offer the translator past alternatives, as in the case of "sexual
harassment", translated into French by "harcèlement sexuel". Other French equivalents were tested
before this rendering was coined and accepted. They may arise in a French original to be
translated into other languages and thus should be retrievable: assiduités intempestives, avances
(sexuelles) importunes, privautés malvenues, tracasseries à connotation sexuelle. The knowledge
base would refer, too, to associated terms: attentat à la pudeur, outrages.

5. Conclusion
In conclusion, United Nations translators are very cognizant of the limitations of automated
tools for translation and are more inclined to rely on easily accessible, structured information
concerning the history and main issues in a particular subject matter in order to be completely
free to choose the best translation equivalents.

6. References
Organisation des Nations Unies (2000). Division de traduction et d'édition. Service français de
traduction. Vade-Mecum du traducteur (anglais-français), SFTR/15/Rev.3, septembre 2000.
Peral, F. (2002). The Impact of New Technologies on Language Services: Productivity Issues in
Translation. Paper for the Joint Inter-agency Meeting on Computer-assisted Translation and
Terminology (JIAMCATT), 24-26 April 2002. World Meteorological Organization. Geneva.
Steiner, G. (1975). After Babel. Aspects of language and translation. (first published in 1975,
reedited in 1998 by Oxford University Press).

Global Content Management: Challenges and Opportunities for Creating and
Using Digital Translation Resources
Gerhard Budin
University of Vienna
Department of Translation and Interpretation
Gymnasiumstraße 50, A-1190 Vienna
In this paper the concepts of content management and cross-cultural communication are combined under the perspective of translation
resources. Global content management becomes an integrative paradigm in which specialised translation is taking place.
Now we should return to the aspect of cultural diversity and the way it determines content
management. Global content design, accordingly, is an activity of designing content for different
cultures as target groups and is cognizant of the fact that content design itself is a
culture-bound process, as shown above.
From the field of cultural studies we can benefit when looking at definitions of what culture is:
a specific mind set, collective thinking and discourse patterns, assumptions, world models, etc.
Examples for types of culture are corporate cultures, professional, scientific cultures, notably
going well beyond the national level of distinguishing cultures.
Cultural diversity is both a barrier and at the same time an asset and certainly the raison
d'être for translation, localization, etc.
The following model shows the various dimensions of Global Content Management discussed above.
The term element "global" stands for all the cross-cultural activities such as translation,
localization, but also customization, etc. Content includes terminologies and ontologies as its
infrastructures, products and their design, user documentation, but also pieces of art, etc. And
the management component includes all the processes such as markup and modelling, processing, but
also quality management, communication at the meta level, etc. Usability engineering is crucial
for all these components:

[Figure 2, diagram labels. Global: localization, translation, internationalization,
customization, personalization. Content: terminologies, ontologies, product design,
documentation/reports, pieces of art, etc. Management: markup, collaborative work, dissemination,
quality management, corporate.]
Figure 2: the three components of global content management with individual processes and
components, all three nowadays determined by usability engineering imperatives

5. Pragmatic Issues in Global Content
Content management processes cannot do without appropriate knowledge organization and content
organization. Terminological concept systems are organized into Knowledge Organization Systems
(KOS) that can be used for this purpose of content organization:
- Thesauri, Classification Systems, and other KOSs, also conceptualized as (extrinsic) ontologies
- (Intrinsic) Ontologies (language-related, e.g. WordNet), domain-specific (medicine, etc.)
In order to establish and maintain the interoperability among heterogeneous content management
systems, federation and networking of different content organization systems are necessary in
order to facilitate topic-based content retrieval and exchange of content in B2B interactions.
Global Content Management may have very different manifestations. In the area of Cultural Content
Management, for instance, cultural heritage technologies have developed in order to build up
digital libraries, digital archives and digital museums.
Other applications of Global Content Management systems are:
- ePublishing (single source methodologies)
- eLearning (managing teaching content)
- Cyber Science (Collaborative Content Creation)
- Digital Cities and other Virtual Communities
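
The federation of content organization systems mentioned above, as a basis for topic-based
content retrieval, can be illustrated with a minimal Python sketch. It assumes two hypothetical
content management systems, each indexing documents under its own thesaurus terms, and a
hand-made mapping table that links equivalent terms to a shared topic identifier. All names
(TOPIC_MAP, LOCAL_INDEXES, federated_search, the sample terms and document identifiers) are
illustrative assumptions and do not describe any system discussed in this paper.

```python
from collections import defaultdict

# Hypothetical mapping from (system, local term) to a shared topic identifier.
TOPIC_MAP = {
    ("cms_en", "machine translation"): "topic:mt",
    ("cms_de", "maschinelle Übersetzung"): "topic:mt",
    ("cms_en", "terminology management"): "topic:term",
    ("cms_de", "Terminologieverwaltung"): "topic:term",
}

# Hypothetical local indexes: each system files documents under its own terms.
LOCAL_INDEXES = {
    "cms_en": {
        "machine translation": ["en-001", "en-007"],
        "terminology management": ["en-003"],
    },
    "cms_de": {
        "maschinelle Übersetzung": ["de-012"],
        "Terminologieverwaltung": ["de-004", "de-009"],
    },
}


def build_federated_index(topic_map, local_indexes):
    """Merge the local indexes into one index keyed by shared topic identifier."""
    federated = defaultdict(list)
    for system, index in local_indexes.items():
        for term, doc_ids in index.items():
            topic = topic_map.get((system, term))
            if topic is None:
                continue  # unmapped terms remain local to their own system
            federated[topic].extend((system, doc_id) for doc_id in doc_ids)
    return dict(federated)


def federated_search(topic, federated_index):
    """Return (system, document id) pairs for a topic, sorted for stable output."""
    return sorted(federated_index.get(topic, []))


FEDERATED_INDEX = build_federated_index(TOPIC_MAP, LOCAL_INDEXES)
print(federated_search("topic:mt", FEDERATED_INDEX))
```

In this sketch the mapping table plays the role of the federation layer: each system keeps its
own terminology, and only the shared topic identifiers need to be agreed upon.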
On the pragmatic level of maintaining content management systems we observe problems similar to
those at the level of knowledge management: a corporate culture of knowledge sharing has to be
developed and nurtured, special communicative and informational skills are needed to share
knowledge across cultures, and the dynamic changes in content require a management philosophy
that is fully cognizant of the daily implications of these constant changes.
Translation resources such as translation memories and other aligned corpora, multilingual
terminological resources, reference resources, etc. are typical examples of content that needs
to be managed in such global action spaces.

6. Outlook
On the technological level a number of enabling
technologies for global content management have
emerged that are converging into Semantic Web
technologies. Intelligent information agents are integrated
into such systems. They are combined with knowledge
organization systems (in particular multilingual
ontologies). Semantic interoperability has also become a
major field of research and development in this respect.
In the field of the so-called content industry different
business models have developed that could not be more
diverse: on the one hand open source and open content
approaches are rapidly gaining momentum, also facilitated
by maturing Linux-based applications. On the other hand
national, regional and international legislation concerning
intellectual property rights is becoming more and more
strict and global players are buying substantial portions of
cultural heritage for digitisation and commercial
exploitation that might eventually endanger the public
nature of cultural heritage.
Epistemological issues of global content management
will have to be addressed, and best practices studied in
detail, in order to develop advanced methods for
these complex management tasks. Managing cultural
diversity in a dynamic market with rapidly changing
consumer interests and preferences, with new technologies
to be integrated, also requires a strategy for sustainable
teaching and training initiatives (based on knowledge
management teaching and training initiatives) in this
fascinating field.
