LOFT System Guidelines
Task Objective
You will listen to a dialogue that will likely contain multiple speakers. Your job is to identify and mark when
each speaker is speaking and transcribe the corresponding audio. Some of the audio will contain background
noise, background music, and ringtones; these should be marked too, following the instructions below.
**IMPORTANT**
1) Once you are done transcribing a task, you MUST hit the completed button:
2) Transcribe ALL speech according to the hi_in WDCs. This includes regular speakers, pre-recorded
speakers, and synthesized speakers.
3) For transliteration, please use English.
4) Please read the section “How to handle ‘Difficult Cases’” carefully.
5) Do not skip any tasks unless they are completely silent.
6) Be aware that a segment should NEVER have more than 0.5 seconds of silence (see step 10 of the Guidelines).
7) Turns shouldn’t be more than half a minute long (30 seconds).
8) Noise, PII, music, ringtones, laughter and DTMF should be labeled with the annotation option.
9) Unintelligible and foreign speech should be labeled with a new turn (and not with an annotation).
10) Do not “name” the speakers. Please keep speaker labels as numbers only (speaker 1, speaker 2, pre
recorded speaker 1, pre recorded speaker 2, etc.)
11) All speaker labels should be consistently formatted. Speaker labels should always: be in all
lowercase, be spelled correctly, and should not contain underscores or hyphens.
Example: “speaker 1” is correct; “Speaker 1” is not.
12) Do NOT transcribe PII (definition at the bottom of the document). When PII is heard, add an annotation
and choose “PII” from the drop-down menu. See the illustrated example below.
13) Please transcribe all English words in Latin script and all Hindi words in Devanagari script. If a
‘Hinglish’ word is used, please transcribe it in Latin script. See the transliteration section for more details.
14) Unintelligible and foreign audio must be entered as a separate turn. Every time you hear
unintelligible or foreign speech, please end the current turn and create a new turn (for the same speaker) to
label the audio, and add [unintelligible] or [foreign speech] to the transcription box (example below).
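The speaker label rules above (numbers only, all lowercase, no underscores or hyphens) can be checked mechanically. The following is a minimal Python sketch and is not part of the LOFT tool: the function name `is_valid_label` and the regular expression are assumptions made for illustration. It also accepts the unnumbered label “unidentifiable speaker”, which the Guidelines allow when a speaker cannot be identified.

```python
import re

# Assumed pattern: "speaker N" or "pre recorded speaker N", all lowercase,
# single spaces, no underscores or hyphens, numbering starting at 1.
LABEL_RE = re.compile(r"(?:pre recorded )?speaker [1-9][0-9]*")


def is_valid_label(label: str) -> bool:
    """Return True if the label follows the formatting rules listed above."""
    # The unidentifiable speaker label is never numbered.
    if label == "unidentifiable speaker":
        return True
    return LABEL_RE.fullmatch(label) is not None


if __name__ == "__main__":
    for candidate in ["speaker 1", "Speaker 1", "pre-recorded speaker 1", "speaker_2"]:
        print(candidate, is_valid_label(candidate))
```

Running a check like this over every turn before hitting the completed button is one way to catch capitalization or hyphenation slips early.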
Guidelines
1) Create a new Transcription box for the entire section of audio where you hear speech, using the “add
turn” option. Audio events such as noise, PII, music, laughter, ringtone or DTMF should be added to
the transcription as an annotation, using the “add annotation” option.
2) Every time you create a turn, you will have to assign a name to that turn. Use ONLY the below options:
a) speaker #
b) pre recorded speaker #
i) This can be either a recording or a synthesized speaker
3) For the categories below ONLY, please add an annotation and select from the drop-down menu.
Please do not use any category other than the ones listed here. In particular, please do not use the
unintelligible or foreign speech annotations. Unintelligible and foreign speech should be added
as a separate turn and have a speaker assigned to it (more instructions below). Use only the
categories below:
a) Noise
b) PII
c) Music
d) Ringtone
e) Laughter
f) DTMF
i) Stands for: Dual Tone Multi Frequency. Also referred to as “touchtone.”
ii) Example: If the operator says “Press 1 to speak to a representative” and you hear a
‘beep’ (the caller pressing the button), that beep is what should be labeled as “DTMF”.
4) Identify the speaker by listening to the audio. The first speaker in the audio should be labeled
speaker 1.
Every time this speaker speaks throughout the rest of the audio, it should be labeled as speaker 1.
a) The next new speaker that is introduced should be labeled as speaker 2. The next different
speaker is speaker 3, then speaker 4, and so on.
b) Pre recorded speakers should be labeled the same way. The first pre recorded speaker that is
heard should be labeled as “pre recorded speaker 1”. The next pre recorded speaker that is
different from “pre recorded speaker 1” is “pre recorded speaker 2”, and so on.
c) Listen back through the audio to be sure you are not creating a duplicate speaker. When in
doubt, use “unidentifiable speaker”. Do not number the unidentifiable speakers.
d) Note that:
i) Lyrics in music should not be labeled. If there is background music and it has lyrics, just
add an annotation and select “MUSIC” from the drop-down.
ii) If there is a pre-recorded greeting/advertisement with music in the background, a turn
should be added as 'pre recorded speaker #' and an annotation “MUSIC” as an overlapping
section (see step 12 for overlapping audio).
5) When to create a new segment?
a) When one speaker stops speaking and a new speaker starts speaking, create a new turn for
the new speaker.
b) When one speaker stops speaking and pauses for more than 0.5 seconds, create a new turn
for when this speaker resumes talking. If the pause is for less than 0.5 seconds, then do not
create a new turn (see step 10 for more info on this).
c) When speech goes over 30 seconds without a long pause. A turn must never be longer than
30 seconds. In this case, the turn should end at 30 seconds and a new turn should be
created to transcribe the remaining speech.
d) When unintelligible audio or foreign speech is heard, please end the current turn and create a
new turn and add [unintelligible] or [foreign speech] in the transcription box. Unintelligible and
foreign speech should be linked to the appropriate speaker. *Please do not use the
unintelligible or foreign speech annotations.
e) The only annotations allowed for this project are: noise, PII, music, ringtone, laughter and DTMF.
Enter all PII using the annotation option. When other audio events are heard, add a new
segment by selecting “add annotation” and choose an option from the drop-down menu. PII
should never be transcribed.
6) Edit the transcription time range, using the horizontal red line to help locate the start and end
times of the turn you are editing.
a) Note: the horizontal red line will not automatically provide you with the correct start for a
transcription segment. Hitting the “+” button will create a new segment two seconds prior to
where the red line is.
b) Helpful tip: Use the red line to show you the exact time stamp, then use ctrl+; to copy it as the
start time. Use ctrl+’ to copy it as the end time.
c) Manually adjust the start and end times by editing the time stamps in the transcription box.
7) For each new segment you create, you will have to assign a name to the segment. Click the dropdown
menu to select an existing option, or to create a new one.
8) Where it says “text” in the transcription box, transcribe what you hear in the audio for regular speakers
AND pre recorded speakers.
9) For other audio categories such as noise, PII, music, ringtone, laughter and DTMF, simply add an
annotation and select the category from the drop-down menu. PII must always be entered using the
annotation option and never transcribed.
10) A segment should NEVER have more than 0.5 seconds of silence.
a) Use the audio wave to identify periods of silence.
b) Anytime there is more than 0.5 seconds of silence, be sure there is no segment over that time
period.
i) Here is a good example of skipping 0.5 seconds of silence.
ii) Here is a bad example, do not do this!
11) If the speech is unintelligible, in a foreign language, OR sung, create a separate turn to label the
unintelligible, foreign speech or singing audio. Add [unintelligible] or [foreign speech] or [singing] to the
transcription. These segments should be linked to the appropriate speaker label. Please do not create a
separate speaker label for unintelligible or foreign speech. Note that this is not an overlapping turn (see
image below).
If the ENTIRE audio is sung or is in a foreign language, create a [singing] or [foreign speech] speaker
label, and create one segment lasting the entire duration of the audio. Then mark the task as
completed.
12) If there is overlapping speech, or overlapping audio events, the individual segments should overlap
each other. Example:
● Important Notes
○ PLEASE BE SURE SPEAKER LABELING IS CONSISTENTLY FORMATTED.
○ Words should not be cut off when annotating start and end points of an utterance.
○ Speaker names should be distinct. A new speaker id should only be created when a new
speaker is heard.
○ A segment should NEVER have a period of silence greater than 0.5 seconds.
○ When there is unintelligible or foreign speech, end the current turn, create a separate turn with
[unintelligible] or [foreign speech] in the transcription, and select the appropriate speaker.
○ When there is overlapping speech, create a Transcription box for both speakers for the same
timeframe.
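The timing rules above (no more than 0.5 seconds of silence inside a segment, and no turn longer than 30 seconds) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `plan_turns` and the input format (a sorted list of start and end times, in seconds, of the speech stretches of ONE speaker) are hypothetical and do not come from the LOFT tool. The sketch cuts long turns at exactly 30 seconds; in practice the cut should be moved slightly earlier where needed so that no word is cut off.

```python
MAX_GAP = 0.5    # longest silence (seconds) allowed inside a single turn
MAX_TURN = 30.0  # longest allowed turn (seconds)


def plan_turns(speech, max_gap=MAX_GAP, max_turn=MAX_TURN):
    """Group one speaker's speech stretches into turns.

    speech: sorted list of (start, end) times in seconds for ONE speaker.
    Returns a list of (start, end) turns.
    """
    merged = []
    for start, end in speech:
        if merged and start - merged[-1][1] <= max_gap:
            # Pause is not longer than max_gap: keep it in the same turn.
            merged[-1][1] = end
        else:
            # Pause is longer than max_gap (or first stretch): new turn.
            merged.append([start, end])

    turns = []
    for start, end in merged:
        # Cut any turn that runs past max_turn into consecutive pieces.
        while end - start > max_turn:
            turns.append((start, start + max_turn))
            start += max_turn
        turns.append((start, end))
    return turns


if __name__ == "__main__":
    # A 0.3 s pause is kept inside one turn; a 1.0 s pause starts a new one;
    # the 35 s stretch is cut at the 30 s mark.
    print(plan_turns([(0.0, 2.0), (2.3, 4.0), (5.0, 40.0)]))
```

Overlapping speech is not covered here: each speaker's turns are planned separately, and turns from different speakers are allowed to overlap in time.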
How to handle ‘Difficult Cases’
This section’s purpose is to show rules from the WDCs, and then provide an answer as to how such
cases should be treated for this project. If a WDC rule does not appear in the list below, then follow
whatever is stated in the WDCs.
If the prompt cannot be understood, skip it (tag it as [skip]). It is preferable to skip rather than mistranscribe.
● Do not skip; use the [unintelligible] tag, see step #11 above
Skip the utterance if it: contains at least some word(s) that cannot be understood; is in a different language
typically not understood; contains no speech; contains only laughter; contains singing; contains only
synthesized speech (e.g. the voices of Google Now or Siri) and/or pre-recorded speech (e.g. TV or radio).
● Can’t be understood: Use the [unintelligible] tag, see step #11 above
● Different language: Use the [foreign speech] tag, see step #11 above
● Laughter: Use the annotation option and select Laughter from the dropdown menu
● Singing: Use the [singing] tag, see step #11 above
For utterances that contain both user-generated speech and pre-recorded or synthesized speech, transcribe
user-generated speech and ignore the pre-recorded/synthesized speech.
● TRANSCRIBE ALL SPEECH
If a prompt contains nonsense words, search them on the internet. If no clear results are found and the word is
unintelligible (there is no single obvious spelling), [skip] it.
● Do not skip; use the [unintelligible] tag, see step #11 above
If the speaker sings, [skip]. Use the tag [music] if an entire utterance is music from an instrument, radio, TV,
etc.
● Singing: Use the [singing] tag, see step #11 above
● Music: Use the annotation option and select Music from the dropdown menu
[skip] if audio contains only laughter. Ignore laughter that is interspersed with speech (transcribe only the
speech).
● Laughter: Use the annotation option and select Laughter from the dropdown menu
Profanity should be fully transcribed. However, feel free to skip a sentence that you feel uncomfortable
transcribing.
● Profanity should be fully transcribed. Otherwise, use the [unintelligible] tag, see step #11 above
If the context of an alpha-digit sequence suggests it may be a password, credit card number, social security
number, etc., then use [skip].
● For instances of PII, use the annotation option and select PII from the dropdown menu
If an utterance is in a foreign language, tag with [skip], unless it is an easily identifiable media title or a foreign
language phrase commonly understood in the transcription language. Stick to the capitalization and
punctuation conventions of your target language.
● Use the [foreign speech] tag, see step #11 above
If words in a foreign language are included in a sentence of your target language, transcribe only if commonly
understood by speakers of your language. Otherwise, [skip]. Foreign words that are commonly used (and
therefore should be transcribed) can include names of foreign foods or places, pop culture phrases like
"capisce", and greetings or thank yous in prominent world languages.
● Please follow the transliteration instructions below
Only transcribe foreground speech. A user's speech may go from the foreground to the background or vice
versa (determined by change in volume) and can be accompanied by change in speaker audience.
● TRANSCRIBE ALL SPEECH
If one person clearly speaks in the foreground and someone speaks in the background, transcribe the main
speaker and ignore the rest.
● TRANSCRIBE ALL SPEECH
If one person clearly speaks in the foreground and someone interrupts at roughly the same volume with a brief
(less than a second) overlapping speech segment, transcribe the main speaker and ignore the rest.
● TRANSCRIBE ALL SPEECH
If two or more people are speaking at once with no one clearly in the foreground, tag as [overlapping]. Do this
for overlaps longer than one second. Use this tag even when one person is a bit louder than the other(s) and
you can tell what they're saying.
● TRANSCRIBE ALL SPEECH
Transcribe repeated words as many times as uttered, but skip if it is more than 5 times.
● Transcribe all speech exactly as it is heard. Do not leave any words untranscribed.
Write media titles as they are most commonly written. Movie titles and English book titles should be written in
Devanagari.
● Movie titles and English book titles should be written in Devanagari script.
________________________________________________________________________________________
Transliteration
All English words should be transcribed using a Latin keyboard.
All Hindi words should be transcribed using a Devanagari keyboard.
All Hinglish words should be transcribed using a Latin keyboard.
Hinglish definition: Hinglish is a hybrid language that mixes English and Hindi together. Hinglish words are
neither English words nor Hindi words. Hinglish can mean different things to
different people, so please use your best judgement when transcribing.
Definitions
Task: A random set of letters used to identify the audio wave you are transcribing. The list of
eligible tasks can be seen in the Task section of LP.
Speaker ID: The speaker id is used to identify the speaker in the audio. Use the same speaker id for
the same person throughout the task.
Speaker turn: One continuous contribution to dialogue by a single speaker. It may consist of a single
word or multiple utterances.
Category: A category is used when there is audio that we do not need transcribed but only labeled.