Transcription (Complex) Guidelines

Terminology
Term        Definition
Batch       A batch of transcription work is a single, continuous audio file consisting
            of many utterances.
Page        Batches are usually presented in pages of 20 utterances each. One
            batch may consist of several pages of utterances.
Utterance   A full recording of speech audio (batch) is segmented into multiple
            short utterances.
            An utterance is a single unit of transcription. Each utterance has its own
            text input box and waveform. Each utterance needs to be saved before
            you can move on to the next utterance.
            Also called an utt.
Tag         Tags are an easy, standardised way to insert additional information
            about the audio into transcription.
Timestamp   A type of tag inserted on the waveform and represented in the
            transcription to indicate a specific time point within the utterance.
Writing
Transcription should follow the standard conventions of the target language.
To check the spelling of song titles, movies, TV shows, brands, etc., you can do a
quick Google search.
Punctuation
Use of punctuation will vary depending on the specific project. You should always refer to
the project-specific transcription guidelines to see how punctuation should be used.
Special Characters
Do not use special characters or symbols such as quotation marks, dollar signs, etc.
Please transcribe all full words spoken.
Example
$ → dollar
% → percent
Example - speaker pronounces the word "slash"
You hear: it was great slash weird
You transcribe: it was great slash weird
INCORRECT: it was great/weird
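A quick automated check can help catch symbols that slipped into a transcript. The sketch below is only an illustration: the helper name `find_symbols` and the symbol list are assumptions, not part of any project tooling, and the suggested words are hints only, since you must always transcribe what was actually spoken.

```python
# Hypothetical helper: flag symbols that should have been written as spoken words.
# The symbol list is an assumption for illustration; project guidelines take priority.
SPOKEN_HINTS = {"$": "dollar", "%": "percent", "/": "slash", '"': None}


def find_symbols(transcript):
    """Return (position, symbol, hint) for every listed symbol in the text."""
    return [
        (i, ch, SPOKEN_HINTS[ch])
        for i, ch in enumerate(transcript)
        if ch in SPOKEN_HINTS
    ]


print(find_symbols("it was great/weird"))        # [(12, '/', 'slash')]
print(find_symbols("it was great slash weird"))  # []
```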
Capital letters
Named entities (e.g. person names, place names, some time words) should be spelled
with a capital letter as per usual writing conventions for the target language.
Example
George
Monday
If a business name is spelled with a capital letter in the middle of the word, this is okay.
Example
eBay
iPhone
YouTube
Do not use a capital letter if the only reason is that the word is at the start of a sentence.
Example - the first word is only capitalised if it is a proper name
they think Sydney is a beautiful city
Sydney is a beautiful city
what are you doing on Tuesday night
Numbers
Do not use any digits (e.g. 1 2 3 4 5 ...). All numbers must be spelled out as full words in
the way they were pronounced.
Example - the number '2012' may be pronounced in many different ways:
2012 ==> two zero one two
2012 ==> two oh one two
2012 ==> two thousand and twelve
2012 ==> twenty twelve
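Because digits are never allowed, leftover digits are easy to detect mechanically. The following is a minimal sketch, assuming the transcript is available as plain text; `find_digits` is a hypothetical name. It only finds digits. Choosing the right spelling still depends on how the speaker pronounced the number.

```python
import re


def find_digits(transcript):
    """Return every run of digits found in the transcript (hypothetical helper)."""
    return re.findall(r"\d+", transcript)


print(find_digits("she was born in 2012"))           # ['2012']
print(find_digits("she was born in twenty twelve"))  # []
```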
Abbreviations
Do not use any abbreviations. Words must be spelled out in full.
Example
Mr Johnson ==> Mister Johnson
Dr Smith ==> Doctor Smith
Elizabeth St ==> Elizabeth Street
The only exception is if someone pronounces the word as an abbreviation.

Example
Appen Butler Hill Inc ==> Appen Butler Hill Inc (if the person pronounced 'Inc' as
'Inc', not 'Incorporated')
Acronyms
An acronym is a word made up of the first letters of other words that is spoken as a word
(e.g. NASA, FIFA). Acronyms are spelled using capital letters joined with no space.
Example
NASA
FIFA
Initialisms
An initialism is an abbreviation made up of the first letters of other words where each
letter is pronounced separately (e.g. IBM, CPU, ADHD). Initialisms are spelled using
capital letters joined by underscores.
Example
I_B_M
C_P_U
A_D_H_D
Spelled Letters
Spelled letters are where a word is pronounced letter by letter (e.g. L I A I S E). Spelled
letters are transcribed using capital letters joined by underscores.
Example
my name is Jayme and it's spelled J_A_Y_M_E
For single stand-alone spelled letters, transcribe them with an underscore after the letter.
Ensure that there is a space after the underscore so that it is not linked to the following
word.
Example for single stand-alone spelled letters
my blood type is B_ positive
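The underscore rules for initialisms and spelled letters can be expressed as a small function. This is a sketch only; `spell_letters` is a hypothetical name, and it assumes the input is just the letters that were pronounced one by one.

```python
def spell_letters(letters):
    """Join letters with underscores; a lone letter gets a trailing underscore."""
    letters = letters.upper()
    if len(letters) == 1:
        return letters + "_"
    return "_".join(letters)


print(spell_letters("ibm"))    # I_B_M
print(spell_letters("Jayme"))  # J_A_Y_M_E
print(spell_letters("b"))      # B_
```

When the single-letter form is used in a sentence, it must still be followed by a space, as in "B_ positive".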
Mixed Initialisms
Mixed initialisms involve combinations of words, letters, and numbers. When a single
concept is expressed, all parts are written together with an underscore. Models like 4S
(below) are written separately from the brand name. Numbers in a proper name are
capitalised when written out.
Example
iPhone four_S
Seven_Eleven
A_B_forty-eight
M_P_three
Email and website addresses

Transcribe email and website addresses following the conventions above.
Example
www.amazon.com ==> W_W_W dot Amazon dot com
[email protected] ==> J_ Smith at Gmail dot com
Fragments
When a speaker pronounces only part of a word, write that part of the word and attach a
hyphen to it. Make sure there is a space after the hyphen.
Example - someone begins to say 'motorcycle' but stops after 'moto'
she came to work today by moto- I mean car
Example: someone begins to say 'onions' but stops after 'on-' and then
repeats the word in full
my eyes hurt when I cut on- onions
If it is not clear what the full word was going to be, do not transcribe the word and
instead use the unintelligible tag (see the section on using tags).
Tags
Tags are used to add additional information to transcriptions of speech. Tags can be used
to add information about the audio. These may include noise events, sections of silence,
fillers, foreign speech, and more. As each project may be different, it is important that
you follow the project specific guidelines, as tag usage may differ from what is detailed
below.
Standalone tags are inserted independently into the text box. In Ampersand, these tags
appear as images. In the examples below, these tags are represented in text format
using < > brackets.
Span tags are used to highlight transcription in the text box.
Speaker Tags
In some projects, you may hear multiple speakers in a batch and each speaker may need
to be identified with a unique Speaker ID tag throughout the batch.
• The speaker ID tagging must be consistently applied to the same speaker
  throughout the audio recording. In other words, throughout the entire batch, use
  the same speaker ID for the same speaker.
• The first speaker you hear in a new batch of data is speaker 1, the next new
  speaker you hear is speaker 2, and so on.
• If you are unsure who the speaker is for any given utterance, use the speaker ID
  tag that is most likely and sensible in the context of the speech.
When to use Speaker IDs will vary depending on the project. You should always refer to
the project specific guidelines for information on when and where to use a Speaker ID
tag.
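The numbering rule (first voice heard is speaker 1, the next new voice is speaker 2, and the same voice always keeps its number) can be sketched as follows. The voice labels and the function name `assign_speaker_ids` are made up for illustration. In practice you identify the voices by ear.

```python
def assign_speaker_ids(voices):
    """Map each voice label to a speaker number in order of first appearance."""
    ids = {}
    for voice in voices:
        if voice not in ids:
            ids[voice] = len(ids) + 1  # next unused speaker number
    return [ids[voice] for voice in voices]


# One label per utterance, in the order the utterances are heard.
print(assign_speaker_ids(["anna", "ben", "anna", "carla", "ben"]))  # [1, 2, 1, 3, 2]
```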
Fillers
Fillers are the sounds people make while they are thinking of what to say next, for
example "um", "ah", "er".
Whenever you hear a filler, insert the filler tag that best represents the sound made.
Example: speaker says "um" after "was"
I was <um> just wondering
Interjections
Interjections are very common in spoken language, but strictly speaking they are not
'words' and would be unlikely to show up in a dictionary or a newspaper article.
Interjections should be transcribed according to the project specific guidelines. In most
transcription projects, interjections should be highlighted with a highlighting span tag.
Overlapping Speech
Overlapping speech is when two or more people are talking at the same time and at a
similar volume.
In some projects, overlapping speech may need to be tagged. How to tag and/or
transcribe overlapping speech will vary depending on the project. You should always
refer to the project specific guidelines for information on how to approach overlapping
speech.
Noises overlapping with speech do not constitute an overlap. Only mark overlaps when
two foreground speakers are speaking at once.
Foreign words
You may hear someone speaking in a foreign language. If you cannot understand the
foreign speech, just place a <foreign> tag in place of the words you cannot understand.
Example
no she said <foreign> which means goodbye in Croatian
If someone uses just the occasional foreign word and you know how to spell it, write out
the word and then highlight it using the "foreign word" highlighting tag.
Example: the word in bold is highlighted because it is not English
no she said arrivederci which means goodbye in Italian

Note: foreign names (people's names, place names, festival names, etc.) do NOT
constitute foreign words and should be spelled out as usual. If you are unsure of the
spelling, make your best guess and highlight it. If a word is particularly difficult to spell,
you could search for it in Google to find the most common spelling variant.
Similarly, you must consider whether the 'foreign' word is in fact a 'loanword', meaning
that it could be considered part of the language now. This often happens when another
language is widely spoken in a community (e.g. English words in the Netherlands), when
a word is needed for a modern concept such as a computer mouse, or when a language
does not have a word of its own to describe a concept. In English the word
'schadenfreude' may be considered a loanword from German, i.e. it is NOT foreign and
can appear unmarked in an English transcript.
If a word of foreign origin is commonly used and/or understood by speakers (or a
community of speakers) in the language you are transcribing, it should be transcribed.
It is very important that we are consistent in the treatment of loanwords, so when in
doubt, choose to spell the word and highlight it rather than inserting the 'foreign' tag in
place of the word.
Mispronounced words
When it is obvious a speaker has mispronounced a word, use the mispronounced tag to
highlight the word. When you type the mispronounced word, use the normal correct
spelling.
Example - you hear the speaker say "expresso" instead of "espresso"
YOU TRANSCRIBE: espresso
Words pronounced with a regional accent are NOT considered mispronounced. If you are
unsure, imagine asking the person after they spoke if they made a mistake. If that
person would admit they made a mistake, then the word was mispronounced.
Unintelligible Speech
If you come across a word or several words that are not clear because there is
interference, audio problems, or because the person is not talking clearly, enter the
<unintelligible> tag in place of the unintelligible speech.
Of course you should try your best to listen and determine what was said, but in natural
speech there will often be unintelligible words. As a guide, you should try at least three
times to understand what was being said. If it is still not clear, insert the tag and move on.
Example - speaker mumbles something after "her"
well I already told her <unintelligible> you know I told her
Thought Continues
Sometimes, a thought or sentence in the current utterance may continue into the next
utterance. In such cases, you should:
• Insert the <continued> tag at the end of the first utterance where the thought is
  cut off.
  o If you have inserted a comma at the end of the first utterance, the
    <continued> tag must be placed after the comma.
  o An utterance cannot end with a comma.
Example
UTTERANCE 1: clouds gathered today over the mountains and <continued>

UTTERANCE 2: we are expecting rain for the next few days .
Note: do not use the continued tag if the sentence or thought has ended where the
utterance ends.
Example
UTTERANCE 1: clouds gathered today over the mountains .

UTTERANCE 2: we are expecting rain for the next few days .
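The ending rules for the continued tag lend themselves to a simple check. The sketch below assumes each utterance is a plain string with tags written in < > brackets, as in the examples above. `check_ending` is a hypothetical name, and treating the tag as the very last item is a simplification that a given project may not share.

```python
CONTINUED = "<continued>"


def check_ending(utterance):
    """Return a list of problems with how the utterance ends (empty if none)."""
    text = utterance.rstrip()
    problems = []
    if text.endswith(","):
        problems.append("ends with a comma; add <continued> after it")
    if CONTINUED in text and not text.endswith(CONTINUED):
        problems.append("<continued> must be the last item in the utterance")
    return problems


print(check_ending("clouds gathered today over the mountains and <continued>"))  # []
print(check_ending("clouds gathered today over the mountains ,"))
```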
Truncations
If a word gets cut off at the end of an utterance because the computer program has not
cut up the audio correctly, this is called a truncation. This is different from a fragment
(where the person stops talking part way through a word). In a truncation, the recording
has cut someone off while they were saying a word. Therefore, truncations only occur at
the start or end of an utterance.
When you hear a truncation at the end of an utterance, write out the truncated word in
full followed by the <truncation> tag. In the following utterance, insert the
<truncation> tag and then continue to transcribe the rest of the sentence.
Example – “probably” has been truncated and is split across two utterances.
UTTERANCE 1: in that case we should probably <truncation>
UTTERANCE 2: <truncation> consider other options
If you can tell that a word was truncated but you don't know what the word is, simply
insert the <unintelligible> tag in place of the word and the <truncation> tag after
the <unintelligible> tag.
Example - the word at the end of the utterance has been truncated but you
couldn't make out the truncated word
UTTERANCE 1: in that case we should <unintelligible> <truncation>

UTTERANCE 2: <truncation> consider other options
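Truncation tags come in pairs across two neighbouring utterances, which makes them easy to verify. This is a minimal sketch, assuming the utterances of a batch are available in order as plain strings. `check_truncation_pairs` is a hypothetical name, and the check ignores any other tag (such as a speaker ID) that a project might place before the truncation tag.

```python
TRUNCATION = "<truncation>"


def check_truncation_pairs(utterances):
    """Return indexes i where utterance i ends with the tag but i + 1 does not start with it."""
    missing = []
    for i in range(len(utterances) - 1):
        ends = utterances[i].rstrip().endswith(TRUNCATION)
        starts = utterances[i + 1].lstrip().startswith(TRUNCATION)
        if ends and not starts:
            missing.append(i)
    return missing


batch = [
    "in that case we should probably <truncation>",
    "<truncation> consider other options",
]
print(check_truncation_pairs(batch))  # []
```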
No Speech
If an entire utterance contains no speech (e.g. there is only silence or noises) insert the
<no-speech> tag only and move on. The noises in such utterances should not be
tagged.
Unintelligible speech, fillers and interjections ARE considered speech. All other noises
(human and non-human) are NOT considered speech.
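Since the no-speech tag must stand alone in its utterance, a one-line check is enough to catch misuse. The function name `no_speech_tag_is_alone` is hypothetical, and the tag spelling follows the text format used in this document.

```python
NO_SPEECH = "<no-speech>"


def no_speech_tag_is_alone(utterance):
    """True if the utterance does not use <no-speech>, or contains nothing else."""
    tokens = utterance.split()
    return NO_SPEECH not in tokens or tokens == [NO_SPEECH]


print(no_speech_tag_is_alone("<no-speech>"))          # True
print(no_speech_tag_is_alone("<no-speech> <cough>"))  # False
```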
Pause
Whenever there is a pause in speech, insert the <pause> tag. In most transcription
projects, pauses of 1 second or more should be tagged. However, you should always
refer to the project specific guidelines for guidance on when to tag pauses.
Example - speaker takes a two second pause between "just" and "feels"
I don't know why it just <pause> feels different now
Use the tag for pauses within speech (between words) and for silence before the
person commences speaking or after they finish.
If noises occur in the foreground during pauses of 1 second or more within speech, do
not tag these noises - simply put only a pause tag.
If there is no speech at all within an utterance, use the 'no speech' tag (see above).
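The one-second rule for pauses between words can be illustrated with word timings. The sketch below is hypothetical: it assumes each word comes with start and end times in seconds, which is not something the transcription tool is stated to provide, and the 1.0 second default stands in for whatever threshold the project guidelines set. It covers only pauses between words, not silence before or after speech.

```python
def insert_pauses(words, threshold=1.0):
    """words: list of (text, start_sec, end_sec) tuples in time order."""
    out = []
    previous_end = None
    for text, start, end in words:
        # Tag the gap if the silence since the previous word reaches the threshold.
        if previous_end is not None and start - previous_end >= threshold:
            out.append("<pause>")
        out.append(text)
        previous_end = end
    return " ".join(out)


words = [("it", 0.0, 0.2), ("just", 0.3, 0.6), ("feels", 2.6, 3.0), ("different", 3.1, 3.6)]
print(insert_pauses(words))  # it just <pause> feels different
```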
Speaker noises
All noises made by the main speaker should be tagged with the appropriate noise tag.
Common speaker noise tags that you may see in a transcription project are shown in the
table below. You should always refer to the project specific guidelines for information on
the noise tags used for the project and when to use them.
• Insert the tag exactly where the noise first occurs.
• If it occurs at the same time as a word, put the tag BEFORE the word.
• If the noise occurs more than once in sequence, you only need a single tag.
Tag          When to use it
<lipsmack>   • lip smacks
             • tongue clicks
<breath>     • loud inhalation and exhalation between words
             • yawning
<cough>      • coughing
             • throat clearing
             • sneezing
<laugh>      • laughing
             • chuckling
Other noises
Insert the relevant tag when you hear a noise that is not made by the speaker, and which
is at a comparable volume to the speech. Common noise tags that you may see in a
transcription project are shown in the table below. You should always refer to the project
specific guidelines for information on the noise tags used for the project and when to use
them.
• Insert the tag exactly where the noise first occurs.
• If it occurs at the same time as a word, put the tag BEFORE the word.
• If the noise occurs more than once in sequence, you only need a single tag.
Tag             When to use it
<click>         Any interference from the phone line (e.g. crackling sounds).
<ring>          The sound of a phone ringing.
<DTMF>          The sound made by pressing the telephone keypad (DTMF
                stands for Dual Tone Multi-Frequency).
<short_noise>   Any other short noises that do not continue over several words
                (generally lasting less than one second), for example: door
                slams, a loud cough by a person in the background, car horns.
<long_noise>    Any other long noises that continue over longer periods of
                time and perhaps multiple words (generally lasting more than
                one second), for example: wind, rain, background speech or
                music. This tag is used when the noise begins. The point at
                which the stationary noise ends is not marked. Low level
                background sounds are expected and do not need to be
                tagged.
Timestamping
In most transcription projects, you'll see a waveform in Ampersand for each utterance.
Timestamps are placed on the waveform to divide the audio into segments.
Timestamps are generally used for two purposes:
• Segment periods of non-speech from speech.
• Segment speech based on where the speaker changes.
However, you should always refer to the project specific guidelines for information on
when and where to use timestamps.
Please also refer to 'How to use Ampersand - Timestamping Projects' for generic
guidelines on how to place and manipulate timestamps in Ampersand.
Where timestamps are placed

Please place timestamps between 0.1 and 0.3 seconds before the start of speech
or noise events and 0.1 to 0.3 seconds after the end of speech or noise events,
as we want to minimise the amount of non-speech audio accompanying speech.
Timestamps must be placed after the first speaker has fully produced the final sound of
the word. Sometimes you will hear a little puff of air or a whisper sound at the end of a
word, and that must be included within the timestamp.
• The faint vertical lines on the waveform typically represent 0.1 second
  intervals. You can use these lines to guide decisions about where to place
  timestamps.
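The 0.1 to 0.3 second margin can be worked through numerically. The sketch below is an illustration under stated assumptions: `place_timestamps` is a hypothetical helper, the default margin of 0.2 seconds is simply the midpoint of the allowed range, and clamping at 0.0 (the start of the utterance audio) is an added assumption for speech that begins almost immediately.

```python
def place_timestamps(speech_start, speech_end, pad=0.2):
    """Return (before, after) timestamp positions in seconds."""
    if not 0.1 <= pad <= 0.3:
        raise ValueError("pad must be between 0.1 and 0.3 seconds")
    # Never place a timestamp before the start of the utterance audio.
    before = max(0.0, round(speech_start - pad, 3))
    after = round(speech_end + pad, 3)
    return before, after


# Speech runs from 1.50 s to 3.20 s within the utterance.
print(place_timestamps(1.50, 3.20))  # (1.3, 3.4)
```

On a waveform where the faint vertical lines mark 0.1 second intervals, a 0.2 second margin corresponds to two grid lines before the speech starts and two after it ends.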
The table below shows examples of when a timestamp may be required. However, you
should always refer to the project specific guidelines for information on when and where
to use timestamps.
Event / What to do

General
• There can be multiple timestamps in an utterance.
• Make sure your cursor is in the correct position in the Result field. When you insert the
  timestamp on the waveform, the timestamp tag will then be inserted into the correct
  position in the text.

When there is a change in speaker
• Place the timestamp on the waveform to indicate that there is a change in speaker.
  The timestamp needs to be placed before the <speaker_ID> tag, just before the
  beginning of speech.

Before speech at the beginning of an utterance
• If an utterance starts with a pause (as defined by the project guidelines), place
  a timestamp on the waveform to show where the speech begins.
• Place the timestamp after the <pause> tag.

After speech at the end of an utterance
• If an utterance ends with a pause (as defined by the project guidelines), place
  a timestamp on the waveform to show where the speech ends.
• Place the timestamp before the <pause> tag.

At the start and end of utterance
• If an utterance starts and ends with a pause (as defined by the project
  guidelines), use a combination of the approaches above according to duration of
  silence.

Pause between speech
• If an utterance contains a pause (as defined by the project guidelines) between
  speech segments, place a timestamp when speech finishes, as well as before the
  speech begins again.
• Always place the timestamps directly around the pause tag.

Continuous background noise
• If there is continuous background noise during a pause before/after/during speech,
  place a timestamp when speech begins/finishes, excluding the background noise.
• Continuous background noise is treated the same as silence in this case, because it's
  non-speech and the duration is significant.
