Transcription Guidelines
Version: 2.9
Last updated: 05/29/2019
3.1 Speaker Labelling
3.2 Non-speech sound inventory
3.3 Music only
Part 2: Transcription Quality Requirements
1. Minimum accuracy scores for passing validation are 95% word accuracy (measured at the word
level) and 90% tag accuracy (measured at the tag level). For non-tokenized languages, an
equivalent accuracy will be targeted.
2. If the transcriptions fail validation, they should be reworked and then re-validated until
they pass the accuracy thresholds.
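The two thresholds above can be expressed as a small check. This is only an illustrative sketch: the function name and the use of fractions between 0 and 1 are assumptions, and how the accuracy scores themselves are computed is not specified in these guidelines.

```python
# Thresholds taken from the quality requirements above.
WORD_ACCURACY_MIN = 0.95
TAG_ACCURACY_MIN = 0.90


def passes_validation(word_accuracy: float, tag_accuracy: float) -> bool:
    """Return True only if both scores (as fractions in [0, 1]) meet their minimum."""
    return word_accuracy >= WORD_ACCURACY_MIN and tag_accuracy >= TAG_ACCURACY_MIN
```

A transcription that scores 97% on words but 88% on tags would still need rework, because both thresholds must be met.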
● Transcriptions must be produced strictly and solely by human transcribers. Use of any
Automatic Speech Recognition (ASR) systems (online or otherwise) is strictly prohibited.
If it turns out that the transcriptions were generated by an ASR system (either fully,
partially, or even as a starting point), the transcriptions will be rejected.
● Transcription should represent all words as spoken – including hesitations, filler words,
and false starts.
● Transcription must be orthographic, not phonetic. Use the American Heritage Dictionary
as a reference: https://ahdictionary.com/
● Transcription should include only upper and lowercase letters, apostrophes, tildes,
hyphens, periods, question marks, commas, exclamation points, and spaces. No numbers
or other special characters.
● Segments should not last longer than 15 seconds. If a single speaker talks for more than
15 seconds, segment based on sentence level or pauses in speech. Longer speech
segments are strongly preferred over short speech segments.
● Represent unintelligible words with double parentheses without spaces: (())
● Non-speech events are represented with square brackets: [ ]
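As a rough illustration of the character and segment-length rules above, the sketch below flags characters outside the permitted set and segments longer than 15 seconds. It assumes English text in plain ASCII, and it assumes that annotation markup used elsewhere in these guidelines (double parentheses, square-bracket tags, angle-bracket tags, and the # filler prefix) should be ignored before checking characters. The function names are hypothetical.

```python
import re
import string

MAX_SEGMENT_SECONDS = 15.0

# Letters plus the punctuation explicitly permitted by the guidelines.
ALLOWED_CHARS = set(string.ascii_letters + "'~-.?!, ")

# Assumption: annotation markup is stripped before the character check.
MARKUP = re.compile(r"\(\(|\)\)|\[[a-z-]+\]|</?[A-Za-z:]+>|#")


def disallowed_characters(text):
    """Return the sorted list of distinct characters that are not permitted."""
    plain = MARKUP.sub("", text)
    return sorted({ch for ch in plain if ch not in ALLOWED_CHARS})


def segment_too_long(start_seconds, end_seconds):
    """Return True if the segment exceeds the 15-second limit."""
    return (end_seconds - start_seconds) > MAX_SEGMENT_SECONDS
```

For example, a transcript containing "90210" would be flagged for its digits, which should have been spelled out.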
● "Issall well n' good darlin'." = "It's all well and good darling."
● "Call your representive." = "Call your representative."
However, if a word is deliberately mispronounced, such as for comedic effect, do represent the
variation in the transcription.
● "The volcano said: I lava you." = "The volcano said I lava you."
If the spelling of a word is unclear, use the American Heritage Dictionary as a standard
reference: https://ahdictionary.com/. To check the names of songs, movies, TV shows,
brands, etc., use http://google.com/.
At the sentence level, transcribe a speaker's utterances verbatim, even when they do not
conform to the standard grammar of the language. Do not correct grammatical "mistakes" or
variations made by the speaker.
2.2.1.1 Contractions
Standard contractions must be transcribed as pronounced, including the apostrophe, such as
"isn't", "where's", "you're", "y'all". Transcribe the following as a single word:
● gimme
● gonna
● gotta
● lemme
● wanna
● watcha
● kinda
2.2.1.2 Abbreviations
Do not introduce abbreviations in the transcription. Always spell out the full word when
pronounced as such. Transcribe abbreviations only if the abbreviation is explicitly articulated by
the speaker. Do not add a period after abbreviated words (unless it's at the end of a sentence).
Note that in English, the titles Ms, Mrs, Mr, and Mx that prefix a person's name are not
abbreviations (they are listed as nouns in the dictionary) and should therefore be transcribed as
such. However, use the spelled-out forms mister or missus when these titles are used without a
name, as in direct address.
● "Mr. Smith this way please." = "Mr. Smith, this way please."
● "Hey mister can you help me with this survey?" = "Hey, mister, can you help me with
this survey?"
● "Directions to the… to the… the hotel" = "Directions to the to the the hotel."
● "Ale… Alexa play Janet Jackson… no wait…" = "Ale~ Alexa, play Janet Jackson. No,
wait."
● "N… n… no. It's Ch… Chom… Chomsky who said that." = "N~ n~ no. It’s Ch~ Chom~
Chomsky who said that."
● #uh
● #um
● #ah
● #er
● #hm
2.2.1.5 Interjections
Interjections are words or expressions that speakers use within an utterance to express
affirmation, surprise, or negation. Each language has its own specific set of interjections that
speakers can use. When transcribing interjections, use language-specific standardized spellings.
Interjections do not require any special symbol. For English, we transcribe only the following
interjections:
● eee
● ew
● huh
● hm
● jeez
● mm
● mhm
● nah
● oh
● uh-huh
● uh-oh
● whoa
● whew
● yay
● yep
2.2.1.6 Overlapping speech
If there is overlapping speech where multiple speakers are talking at the same time, then only
transcribe the most dominant voice that you can clearly understand. If all or multiple voices are
dominant and it's difficult to isolate one person's voice over the others, then simply tag the
overlapping speech as [overlap] and refrain from transcribing any speaker's speech.
Note: Within a single-channel audio file where only one speaker is the target of the recording,
overlapping speech might still occur (e.g. as background noise when there are other people
nearby or in the same room speaking). In these cases, transcribe only the speech of the target
speaker.
2.2.1.7 Letters spoken as letters
When a proper name is spelled out, transcribe the spoken letters as capital letters, separated by a
space.
● "My name is John – jay, oh, aitch, en." = "My name is John J O H N."
This does not apply to initialisms (e.g. IBM, FBI, etc.). Initialisms are covered in
Section 2.2.5.
2.2.2 Punctuation
Use punctuation as required by the grammar rules. When transcribing a language other than
English, use punctuation symbols and rules that are appropriate for that language. For example,
in Spanish, ¿? is used as in standard orthography.
● Use end-punctuations (full stop, question mark, exclamation mark) to indicate the end of
a complete sentence.
● Use punctuation symbols that are an essential part of the word, such as apostrophes and
hyphens.
● Use commas to break up long stretches of speech. This is to facilitate reader
comprehension.
● AVOID: semi-colons, quotation marks.
Of the permissible punctuation marks, we expect that commas and exclamation marks will be
the most difficult to apply consistently. We understand that you will have to make some
relatively subjective and stylistic decisions on the use of the comma and exclamation mark,
and disagreements are not necessarily errors.
2.2.2.1 Commas
Use a comma when it is necessary to make a transcript more readable. Below are some
suggestions of when a comma should be used:
● To separate items in a list of three or more, using the serial (aka Oxford) comma (i.e., the
comma before the conjunction that joins the last two elements):
o I enjoy skydiving, snowboarding, and mountain biking.
● To set off a direct address:
o Maryam, listen to me carefully.
o I'm not calling you, my friends, just to whine about my life.
● To break up compound and complex sentences:
o I would like to join you, but I'm afraid I have class at that time.
o Marcos and I couldn't go to the jazz concert, so we watched it on TV instead.
● To set off introductory words and phrases:
o Therefore, they cancelled their trip.
o After taking a break, the team resumed their meeting.
● Around parenthetical phrases:
o That report in the New York Times was, to say the least, a bombshell.
o Getting a hotel by the sea, like the one we stayed at last year, would be superb.
2.2.2.3 Apostrophes
Use apostrophes in contractions, possessives of individual letters, possessive "s", or as part of a
person's name.
2.2.2.4 Hyphens
Use hyphens according to standard orthographic rules of the language. If it is not clear if a
compound word should be spelled with a hyphen or not, use the American Heritage Dictionary
as a reference. Here are a few examples of English compound words that can/must use hyphens:
● a-line
● d-day
● ex-boyfriend, ex-drummer, ex-girlfriend, ex-husband, ex-wife
● extra-loud
● self-aware
● t-shirt
● u-turn
● v-neck
● x-ray
For product names, only use hyphens if they are parts of the official product names.
2.2.2.5 Tildes
Use tildes to indicate truncated words, whether at the beginning or the end. Use tildes also to
represent false starts and stuttering.
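To make the tilde convention concrete, the following sketch lists the tilde-marked fragments in a transcript line. The regular expression and function name are illustrative assumptions, not part of the guidelines, and the pattern only covers ASCII letters and apostrophes.

```python
import re

# A tilde may mark truncation at the beginning (~cause) or the end (Chom~) of a word.
TRUNCATED = re.compile(r"~[A-Za-z']+|[A-Za-z']+~")


def truncated_words(text):
    """Return tilde-marked fragments in the order they appear."""
    return TRUNCATED.findall(text)
```

Applied to "Ale~ Alexa, play Janet Jackson. No, wait.", this returns the single fragment "Ale~".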
2.2.3 Capitalization
Capitalization should follow orthographic conventions. Capitalize the first word of a sentence
and all proper names. Proper names include human names (Jeff Bezos), place names (France),
product names (iPad, Xbox), company names (eBay), acronyms (POTUS), initialisms (IBM), and so on.
2.2.4 Numbers
Numbers should never be represented numerically. They should always be written out
alphabetically. Ordinal numbers should be represented as pronounced.
● "5" = "five"
● "5th" = "fifth"
● "306" = "three hundred and six", "three oh six", or "three zero six", depending on how it
was pronounced.
● "Play radio 109.4 FM" = "play radio one oh nine point four F. M."
● "Beverly Hills, 90210" = "Beverly Hills nine oh two one oh"
When spelling out numbers, use hyphens as required by the rules of the language. In
English, numbers from twenty-one through ninety-nine are spelled with hyphens. Others are not
hyphenated.
● "twenty-five"
● "three hundred"
● "five hundred fifty-two"
● "nineteen forty-five"
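The hyphenation rule for spelled-out English numbers can be sketched as follows. This is an illustrative helper covering 0 to 999 only, and it produces just one possible reading (for example "three hundred six"); the transcriber must still follow what the speaker actually said, such as "three oh six" or "three hundred and six". All names here are assumptions.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def spell_below_hundred(n):
    """Spell 0-99; compound numbers from 21 to 99 get a hyphen."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def spell_number(n):
    """Spell 0-999 without 'and'; larger values are out of scope for this sketch."""
    if not 0 <= n <= 999:
        raise ValueError("only 0-999 are supported in this sketch")
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return spell_below_hundred(rest)
    if rest == 0:
        return f"{ONES[hundreds]} hundred"
    return f"{ONES[hundreds]} hundred {spell_below_hundred(rest)}"
```

Only the part below one hundred is ever hyphenated, which matches the examples "twenty-five" and "five hundred fifty-two" above.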
2.2.5 Initialisms
Initialisms are terms spoken as a series of letters (e.g., IBM, IMDB, HTTP). Initialisms
should be written as upper case letters enclosed within the <initial> and </initial> tags. Leave
a space between each tag and the letters it encloses. Use periods only for initials standing for
given names (e.g. E. B. White, George W. Bush). Otherwise, no period is needed in initialisms.
● "I work for IBM." = "I work for <initial> IBM </initial>."
● "I like ZZ Top." = "I like <initial> ZZ </initial> Top."
● "http://www.google.com" = "<initial> HTTP </initial> colon slash slash <initial> WWW
</initial> dot google dot com."
● "George W Bush paints now" = "George <initial> W. </initial> Bush paints now."
Transcribe a plural initialism with an "s" following the end tag </initial>. Transcribe a possessive
on an initialism with an apostrophe and an "s" after the end tag </initial>.
● "The SATs are nerve-wracking." = "The <initial> SAT </initial>s are nerve-wracking."
● "George W's dog was a Scottish Terrier." = "George <initial> W. </initial>'s dog was a
Scottish Terrier."
Proper names that are spelled out are not initialisms and don't require the <initial> </initial> tags.
See Section 2.2.1.7 above for an example.
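The tagging rules for initialisms, including plurals and possessives, can be illustrated with a small helper and a spacing check. Both functions are hypothetical sketches; the check only verifies that every tag belongs to a correctly spaced span of capital letters and periods.

```python
import re

# A correctly spaced span: "<initial> IBM </initial>" or "<initial> W. </initial>".
INITIAL_SPAN = re.compile(r"<initial> [A-Z][A-Z. ]*? </initial>")


def tag_initialism(letters, suffix=""):
    """Wrap letters in <initial> tags; a plural or possessive suffix goes after the end tag."""
    return f"<initial> {letters.upper()} </initial>{suffix}"


def initialisms_well_formed(text):
    """Return True if every <initial> and </initial> tag is part of a correctly spaced span."""
    spans = len(INITIAL_SPAN.findall(text))
    return spans == text.count("<initial>") == text.count("</initial>")
```

For instance, `tag_initialism("SAT", "s")` yields the plural form shown in the example above, with the "s" placed after the end tag.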
2.2.6 Unintelligible words and phrases
If a word cannot be understood within a larger phrase, transcribe all segments that are
understandable, and use double parentheses (()) to mark the unintelligible word. There should be
a space before and after the double parentheses, but not within the parentheses themselves.
● "Alexa play ???? on spotify." = "Alexa, play (()) on Spotify."
If you have a guess of what the word/phrase might be but are not sure, include the guess within
the double parentheses.
● "Alexa read ????? from audible." = "Alexa, read ((Cat In The Hat)) from Audible."
● "Alexa turn the ????" = "Alexa, turn the ((lights off))."
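The double-parentheses convention above can be checked mechanically. The sketch below, with hypothetical function names, extracts the content of each unintelligible marker (an empty string when no guess is given) and flags markers that contain spaces just inside the parentheses.

```python
import re

UNINTELLIGIBLE = re.compile(r"\(\((.*?)\)\)")


def unintelligible_guesses(text):
    """Return the guess inside each (( )) marker; '' means no guess was given."""
    return UNINTELLIGIBLE.findall(text)


def has_inner_padding(text):
    """Return True if any marker has a space right after '((' or right before '))'."""
    return any(guess != guess.strip() for guess in unintelligible_guesses(text))
```

A marker written as "(( lights off ))" would be flagged, since the guidelines ask for no spaces within the parentheses themselves.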
● "You have to finish todo esto, porque. I have other things to do." = "You have to finish
<lang:Spanish> todo esto, porque </lang:Spanish>. I have other things to do."
● "I'd like to tell her que ya no la quiero." = "I'd like to tell her <lang:Foreign> (())
</lang:Foreign>."
In cases when a speaker switches from a target language to a foreign language but continues to
use grammatical affixes of the target language with the foreign word stem, include the target
language affix within the foreign language tags. For example, when transcribing Tamil data, if a
speaker switches to the English word "engineering" but with a Tamil suffix ல, transcribe it as
<lang:English> engineeringல </lang:English>.
Some loanwords have been grammaticalized in English and should be transcribed as normal
English words without the <lang:Foreign> tag. If it is unclear whether a word is a loanword or
not, consult a dictionary like the American Heritage Dictionary: https://www.ahdictionary.com/.
A listing in the dictionary is strong grounds for treating a word as an established loanword,
even if it is of foreign origin.
If a recording consists of nothing but foreign speech, add the tag <lang:Foreign> and refrain
from annotating.
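As an illustration of the paired language tags used in the examples of this section, the sketch below extracts each tagged span together with its language name. The function name is an assumption, and the pattern expects the same spacing as the examples (one space inside each tag boundary). It does not cover the lone <lang:Foreign> tag used for recordings that contain only foreign speech.

```python
import re

# The backreference \1 requires the closing tag to name the same language as the opening tag.
LANG_SPAN = re.compile(r"<lang:(\w+)> (.*?) </lang:\1>")


def language_spans(text):
    """Return (language, content) pairs for every paired language tag."""
    return LANG_SPAN.findall(text)
```

A span whose opening and closing tags name different languages is not matched, which makes mismatched tags easy to spot.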
2.3 Non-speech (acoustic event) transcription
2.3.1 Non-speech sound inventory
Insert the following labels at the location where the sound occurs. If it happens in the middle
of a word, add the tag immediately before the word in which it occurred.
● [lipsmack] Lipsmacks, tongue-clicks
● [breath] Inhalation and exhalation between words, yawning
● [cough] Coughing, throat clearing, sneezing
● [laugh] Laughing, chuckling
● [click] Machine or phone click
● [ring] Telephone ring
● [dtmf] Noise made by pressing a telephone keypad
● [sta] At the start of continuous background noise (static)
● [cry] Crying/sobbing
● [prompt] IVR prompts or voice recordings commonly found at the beginning of calls
Do not split words to insert a non-speech sound tag, even if it occurs this way in the audio.
● "I will abso-ring-lutely open it" = "I will [ring] absolutely open it."
Use the [noise] tag for all other non-speech sounds not covered by the list of non-speech tags
(e.g., screaming, raining, punching, etc.). For additional non-speech tags for 16kHz data, see
Section 3.2.
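A transcript can be checked against the tag inventory with a short script. The sketch below is an assumption about how such a check might look: the tag set combines the labels listed in this section with [noise], [overlap], [no-speech], and [music] from other sections, and [applause] is accepted only when the hypothetical sixteen_khz flag is set.

```python
import re

BASE_TAGS = {
    "lipsmack", "breath", "cough", "laugh", "click", "ring", "dtmf", "sta",
    "cry", "prompt", "noise", "overlap", "no-speech", "music",
}
EXTRA_16KHZ_TAGS = {"applause"}

TAG = re.compile(r"\[([^\[\]]*)\]")


def unknown_tags(text, sixteen_khz=False):
    """Return square-bracket labels that are not in the permitted inventory."""
    allowed = (BASE_TAGS | EXTRA_16KHZ_TAGS) if sixteen_khz else BASE_TAGS
    return [label for label in TAG.findall(text) if label not in allowed]
```

A label such as [sneeze] would be reported, because sneezing is covered by [cough] in the inventory.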
2.3.2 No speech
A time-stamped speech segment may contain periods with no speech. For any period greater than
one second in which there is no speech, add the label [no-speech]. Even if there are some
foreground sounds, just use the [no-speech] tag if there is no actual speech for more than one
second.
Note: In single-channel audio files, you can only hear one side of the conversation at a time. As
a result, there will be segments in these audio files that contain either no speech or only
non-speech sounds (e.g. laughing, breathing, etc.). These silent segments do not need to be
transcribed. They should be removed from the transcription file entirely.
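The one-second rule for [no-speech] can be sketched as below. The guidelines do not say that word-level timestamps are available, so the input format here (a list of start, end, text tuples sorted by start time) is purely an assumption for illustration, as are the function and parameter names.

```python
NO_SPEECH_THRESHOLD_SECONDS = 1.0


def insert_no_speech(words, segment_start, segment_end):
    """Join word texts, adding [no-speech] wherever a gap exceeds one second.

    words: list of (start_seconds, end_seconds, text), sorted by start time.
    """
    tokens = []
    cursor = segment_start  # end time of the speech seen so far
    for start, end, text in words:
        if start - cursor > NO_SPEECH_THRESHOLD_SECONDS:
            tokens.append("[no-speech]")
        tokens.append(text)
        cursor = max(cursor, end)
    # Trailing silence at the end of the segment.
    if segment_end - cursor > NO_SPEECH_THRESHOLD_SECONDS:
        tokens.append("[no-speech]")
    return " ".join(tokens)
```

Gaps of one second or less are left unmarked, since the label applies only to periods greater than one second.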
2.3.3 Music only
If there is music playing in the foreground or in the background and there is no other information
to transcribe, such as if a customer is put on hold and there's hold music playing, transcribe it
with the [music] label.
3. Additional Transcription Conventions for 16kHz Data
3.1 Speaker Labelling
Each identifiable speaker must have a unique speaker label. The speaker label must be consistent
throughout the entire file. When applicable and if known, provide the following data for each
identifiable speaker in the “speakers” metadata field at the end of each transcription file:
● role
● gender
● native dialect
● enter the appropriate speaker label when you can accurately identify the speaker;
● enter “unknown” when you cannot accurately identify the speaker;
● enter “multiple” when the segment contains overlapping speech that is not transcribed,
i.e., the content of the segment is marked only with the [overlap] label.
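The three labelling cases above can be summarized in a short function. This is a sketch under assumptions: the function name is hypothetical, and an unidentified speaker is represented by None.

```python
def segment_speaker_label(transcript, speaker=None):
    """Choose the speaker field for one segment.

    transcript: the segment's transcription text.
    speaker: the identified speaker label, or None if the speaker cannot be identified.
    """
    if transcript.strip() == "[overlap]":
        # Overlapping speech that is not transcribed.
        return "multiple"
    if speaker is None:
        return "unknown"
    return speaker
```

The overlap check comes first because a segment marked only with [overlap] is labelled "multiple" regardless of whether any individual speaker could be identified.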
Note: Specific projects might not require speaker labelling. Please refer to specific project
requirements or consult the project lead for further information.
3.2 Non-speech sound inventory
Use all the non-speech tags listed in Section 2.3.1 plus the following tag(s) specific
to 16kHz data:
● [applause] Clapping to show approval or praise. Add it exactly at the location where it
occurred.