Annotation Project

This document provides instructions for annotating audio data to improve AI speech recognition accuracy. Participants are asked to listen to short audio clips and either transcribe any clear human speech or discard the clip if the speech is unintelligible. Guidelines are provided on editing audio clips and transcribing speech into text while following conventions like capitalization and omitting punctuation. The goal is to extract only the most clearly spoken parts of audio for transcription to enhance AI training.

Uploaded by

WinterDiary
Copyright
© © All Rights Reserved

Annotation Project- External

About the project


What's the task?

 Task

Cut a section of clear human speech from the audio and transcribe the
audio into text. AI recognized text is available.


Why are we doing it?

To improve AI recognition accuracy

How to use the tool?

On the TCS platform, open queues to edit the audio section and the recognized text.
Look at the screenshot below:

  On the top, you see a sound track with a gray highlighted section. The gray highlighted part is
the target audio. The rest of the audio is there to give you context.
 Below the sound track, there are 5 buttons:

o start cut-s: "s" is the shortcut to mark the cutting start point
o end cut-e: "e" is the shortcut to mark the cutting end point
o play cut-a: "a" is the shortcut to start playing the gray section from start
o play-1: "1" is the shortcut to play the audio
o pause-2: "2" is the shortcut to pause the audio

  Ignore the "default cut" and "current cut" labels; they only indicate where the cut starts and
ends.
  Right above the text box, there is a single-choice "audio classes" selector. By default, "speech" is
selected. If you find that the audio isn't intelligible, select "discard" and then submit;
the audio will be marked as an invalid task.
  The text box: Here you enter the speech. You'll see auto-recognized speech there. Listen to
the audio and correct the text. A few things to note:

o Don't add any punctuation. The apostrophe is not treated as punctuation and is
required where necessary.
o Remove words that are not spoken in the gray highlighted audio.
o A space is required between words. Extra spaces are OK, so you don't need to worry
if additional spaces are introduced.
o No need to worry about letter case unless the word is a personal name or a known brand
name, such as Richard or TikTok.
 On the right side of the text box, there is a player. You can ignore it as it doesn't affect the
workflow.

How to conduct the edit?

Hi transcribers, welcome to the ASR correction project!


Before you start, you'll need a couple of apps ready:

1. Download Lark and sign up. Ask your vendor manager to invite you to the relevant Lark
organization.
2. Download TCS and sign up using the Lark account you've created.

That's all the preparation you'll need to do. However, under the
hood, a lot of things happened:

 Your vendor manager has created a Lark team or a Lark department to manage all
transcribers participating in this project;
  Once the Lark team was registered, we granted it permission to access our
internal permission management portal, Hodor;
 Hodor is the place where we grant Lark teams permission to TCS queues. On Hodor, we
assign different queues to Lark teams so that you, as transcriber, can work on the project.

At this point, you're all set to start working on this project. Here are
the next steps you'll follow:

1. Ask your manager for the domain name. For this project, it'll be tcs-sg.bytelemon.com for
everyone;
2. Ask your vendor manager for the queue IDs; then search for a queue ID to find the queue and
click on the star sign to subscribe to it. You'll then be able to find it under the My Interests
tab;
3. Now go to the My Interests tab and you'll see the queue there. Click on

  "Moderate" to enter the working environment;
  "QA" to QA the work done by the first-round moderator;
  "Evaluate" to assess the accuracy after Moderation & QA.

4. You'll now see a page as shown in the screenshot above:

   1. Play the audio, listen to it, correct the text errors, adjust the audio start and end points,
   and finally click "Submit" to go to the next task;
   2. If, after listening to the audio, you find it unintelligible, set the audio class to
   "Discard" and then submit; you'll go to the next task;
   3. When you're done editing the text and audio and don't want to continue working,
   click "Submit and Leave" to go back to the previous page.

You may occasionally run into a few issues:

1. On the TCS app, if you see the error message "User not found...", it means you're not logged in on
Lark;
2. If you can't find your queue by searching the queue ID, check whether you're using the wrong
domain: the correct domain is sg (tcs-sg.bytelemon.com), not va (tcs-va.bytelemon.com);
3. If steps 1 and 2 don't fix your login or access issue, ask your vendor manager to check
whether you have been added to the Lark team.

When to perform the task?

June 25 ~ October 31

Action Steps
1. Listen to the intercepted audio highlighted in gray and classify it as either [speech]
or [discard]:

   1. If [discard], submit it directly and move on to the next task.
   2. If [speech], move on to steps 2 and 3.

2. First, decide whether to edit the current cut range.
3. Then transcribe the clear human speech into text.

Annotation Guidelines
1. Categorize Audio

There are 2 options for audio categories: [speech] and [discard]:

Speech:

1. The text is required to be transcribed from the audio.


2. You can further cut the audio to reflect only the clear human voice part in the default cut
(grey area) range.

Discard:

1. The entire audio is not in English.


2. The entire audio is silent.
3. The entire audio is non-human/ non-audible speech or sound (e.g. noise, modal words,
melodies, sounds from animals or nature).
4. The entire audio is singing or humming (e.g. songs with lyrics, rap).
5. The entire audio is overlapping speech (2 or more speakers are talking about different things
simultaneously).
6. The duration of the clear human speech is less than 1 second (the transcribed text should
contain at least 2 words, and at least one of them should be a non-modal word).
7. The duration of the clear human speech is longer than 30 seconds.
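
The discard conditions above can be read as a simple checklist. The following is a minimal sketch of that checklist, not part of the TCS tool: the `ClipFacts` fields are hypothetical names for what the annotator judges by ear, and treating rule 6 as "under 1 second, or fewer than 2 words, or no non-modal word" is one reading of the guideline.

```python
from dataclasses import dataclass


@dataclass
class ClipFacts:
    """Annotator's judgment of one clip (all field names are hypothetical)."""
    clear_speech_seconds: float      # duration of the clear human speech
    word_count: int                  # words in the transcribed text
    non_modal_word_count: int        # words that are not modal words
    is_english: bool = True          # False if the entire audio is not English
    is_silent: bool = False          # entire audio is silent
    is_non_speech: bool = False      # entire audio is noise, melody, animals, etc.
    is_singing: bool = False         # entire audio is singing or humming
    is_fully_overlapping: bool = False  # entire audio is overlapping speech


def classify(clip: ClipFacts) -> str:
    """Return "discard" if any discard condition holds, otherwise "speech"."""
    discard = (
        not clip.is_english
        or clip.is_silent
        or clip.is_non_speech
        or clip.is_singing
        or clip.is_fully_overlapping
        or clip.clear_speech_seconds < 1.0    # rule 6: too short
        or clip.word_count < 2                # rule 6: at least 2 words
        or clip.non_modal_word_count < 1      # rule 6: at least 1 non-modal word
        or clip.clear_speech_seconds > 30.0   # rule 7: too long
    )
    return "discard" if discard else "speech"


print(classify(ClipFacts(clear_speech_seconds=3.2, word_count=6, non_modal_word_count=5)))
print(classify(ClipFacts(clear_speech_seconds=0.6, word_count=1, non_modal_word_count=1)))
```

The conditions are "entire audio" checks; a clip that is only partly unclear is handled by cutting, as described in the next section.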

2. Cut Speech

If part of the audio is unclear and cannot be transcribed, edit the
current cut within the default cut (grey area). Always keep the
longer part with clear English speech to avoid multiple sections.

Cut out non-English speech

 If part of the speech is not in English, cut it out and transcribe the clear English speech.

Cut out non-audible noise

 Cut it out when the non-audible sound is at the beginning or the end of the audio.

 If the noise is in the middle and does not disturb the clear human speech, it is ok to ignore it.

Avoid excessive modal words

  The transcribed text can start or end with up to 2 modal words.
  If the speech starts or ends with clear modal words (you can clearly count them), cut out
the extra modal words and transcribe the rest of the speech. For example, if there are about
10 "ha" at the beginning of the audio, cut out the extra "ha" and keep only 2 of them.
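
As an illustration of the modal word rule above, here is a small sketch that trims a transcript to at most 2 modal words at each end and leaves modal words in the middle untouched. The `MODAL_WORDS` set is an illustrative subset chosen for this example, and `trim_edge_modals` is a hypothetical helper, not a feature of the tool; in practice the audio cut has to be adjusted to match.

```python
# Illustrative subset of modal words; the real list is longer (assumption).
MODAL_WORDS = {"ha", "oh", "um", "uh", "mmm", "ah", "eh", "wow", "oops"}


def trim_edge_modals(text: str, max_edge: int = 2) -> str:
    """Keep at most `max_edge` modal words at the start and at the end."""
    words = text.split()

    # Count the run of modal words at the beginning.
    lead = 0
    while lead < len(words) and words[lead].lower() in MODAL_WORDS:
        lead += 1

    # Count the run of modal words at the end (not overlapping the lead run).
    trail = 0
    while trail < len(words) - lead and words[-1 - trail].lower() in MODAL_WORDS:
        trail += 1

    # Keep the modal words closest to the speech so the cut stays contiguous.
    start = max(0, lead - max_edge)
    end = len(words) - max(0, trail - max_edge)
    return " ".join(words[start:end])


print(trim_edge_modals("ha ha ha ha ha that is so funny"))
print(trim_edge_modals("that was ha ha ha great oh oh oh"))
```

The first call keeps two leading "ha"; the second keeps all three "ha" in the middle and only two of the trailing "oh".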

Cut out music with lyrics

 If part of the audio is a song with lyrics (at the beginning or the end of the audio), cut it off
and transcribe the clear English speech.

Notes on overlapping speech

 If part of the speech overlaps:

o If 2 or more speakers repeat the same words simultaneously and the words sound
clear, transcribe them directly.
o If 2 or more speakers talk about different things simultaneously, cut it off and keep
only the clear English speech.
o If there is a main voice in a group conversation and the others are low or fuzzy, cut
out the unclear part and keep only the clear speech of the main speaker.

Silent part

 Cut it out if it is at the beginning or the end of the audio


 Ignore the silent part if it is in the middle of the audio

Unknown words (Informal words or words not found in the dictionary)

 Cut it out if it is at the beginning or the end of the audio


 If in the middle, keep the clear speech that is longer.

Note:

 While editing the cut, the edited part is shown in blue;


 Ignore the completeness of the sentence while cutting.

3. Transcribe the Audio into Text


DOs
 Transcribe the audio based on the AI recognized text.
 ONLY transcribe the gray highlighted section (default cut). The rest of the audio can be
referred to as context.
 Transcribe the speech word for word, including obvious grammatical mistakes.
 Words in the text should preferably conform to American spelling.
 Space is required between words.
  Capitalize the first letter of names of people and places. Common abbreviations should be
all caps (e.g. USA/FBI/CPC).
DON'Ts
  DON'T use punctuation! (Except for the apostrophe and hyphen, as in I'm or I've,
"COVID-Nineteen" or "nose-picking".)
  DON'T split the text into paragraphs!

Special requirements for some cases:

Numbers

 Write numbers in full English words. If you hear "one", write "one" instead of "1".
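
A quick self-check before submitting can catch the most common violations of the DOs, DON'Ts and the number rule above. The sketch below is an assumption about how such a check could look; `find_issues` is a hypothetical helper and is not part of the TCS platform.

```python
import re


def find_issues(text: str) -> list:
    """Return a list of rule violations found in a transcript."""
    issues = []
    if "\n" in text:
        issues.append("paragraph break")   # DON'T paragraph the text
    if re.search(r"\d", text):
        issues.append("digit")             # numbers must be written as words
    # Anything other than letters, digits, whitespace, apostrophe, hyphen.
    if re.search(r"[^\w\s'-]|_", text):
        issues.append("punctuation")       # only apostrophe and hyphen allowed
    return issues


print(find_issues("I'm sure COVID-Nineteen is over"))
print(find_issues("well, that is 1"))
```

An empty list means none of these three checks fired; it does not check spelling or capitalization.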

Half-pronounced words

  Half-pronounced words at the beginning or at the end of the speech

o If the half-pronounced part is not a separate word, cut it out. For example, in "I am a
stu...(student)", the word "student" is half pronounced and "stu" is not a separate
word. The transcribed text should be "I am a".
o If the half-pronounced part is a separate word, transcribe it directly. For example, in
"well that is the super...(supermarket)", the word "supermarket" is half pronounced
and "super" is a separate word. The transcribed text should be "well that is the
super".

  Half-pronounced words in the middle of the speech

o If the half-pronounced part is not a separate word, correct the word. For example, in "I
still mi..mi..miss you", the word "miss" is half pronounced because of a stammer but
"mi" is not a separate word. The transcribed text should be "I still miss you".
o If the half-pronounced part is a separate word, transcribe the speech word for word. For
example, in "The super..supermarket is over there", the word "supermarket" is half
pronounced because of a stammer but "super" is a separate word. The transcribed text should
be "The super supermarket is over there".
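
The four half-pronounced cases above reduce to two questions: is the fragment a separate word, and is it at the edge or in the middle of the speech? The sketch below encodes that decision. The function name, the "edge"/"middle" labels, and the tiny `VOCABULARY` set standing in for a dictionary are all assumptions made for illustration.

```python
# Tiny stand-in for a real dictionary lookup (assumption).
VOCABULARY = {"i", "am", "a", "still", "miss", "you", "super", "supermarket"}


def half_word_action(fragment: str, position: str, vocabulary: set) -> str:
    """Decide what to do with a half-pronounced fragment.

    position: "edge" (start or end of the speech) or "middle".
    Returns "transcribe", "cut out", or "correct".
    """
    if position not in {"edge", "middle"}:
        raise ValueError("position must be 'edge' or 'middle'")
    if fragment.lower() in vocabulary:
        return "transcribe"   # a separate word is written as heard
    # Not a separate word: cut it at the edges, correct it in the middle.
    return "cut out" if position == "edge" else "correct"


print(half_word_action("stu", "edge", VOCABULARY))
print(half_word_action("mi", "middle", VOCABULARY))
```

"correct" here means writing the intended full word once, as in "I still miss you".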

Whisper or missing syllables

 If the speaker whispers or misses certain syllables of a word and you are sure what the word
is, transcribe the correct word in the text.

Repeat

 Repeated words and sentences must be transcribed strictly according to how many times
they are repeated (except for modal words).

Modal words

  Transcribe modal words as many times as they are repeated, e.g. when you hear 3
"ha" in the middle of the audio, the transcribed text should be "ha ha ha".
  Spelling: oops, oh, gee, geez, um, wow, uh, ahem, yoo hoo, hooray, mmm, ouch, yuck, eew,
ugh, phew, aha, gosh, my, eh, hey, ah, ok.

Proper nouns
 Transcribe the common proper nouns (name of a person/place/product/organization) if you
are sure what they are. If not, cut this part out.

Informal words/abusive words/non-standard pronunciation

 If the speaker uses simplified form/informal/ abusive English words, transcribe them word
for word.
  If the pronunciation is not standard but you can tell the correct word, transcribe the correct
word.

Words with the same pronunciation (Homophone)

  Listen to the audio around the default cut to confirm what the whole sentence is, and write
down the correct word based on context.

o e.g. "hole/whole": when you hear "The whole town disagreed with the mayor" in
the audio, write down "whole" according to the context.

  If there are multiple homophones and their meanings all match the context, go with any one.

o e.g. "where is my deer/dear": both words match the meaning of the sentence, so
either of them is OK.

Term & Definition

Term                 Definition
Speech               Clear human voice
Discard              Get rid of
Intercept            Cut out / seize
Transcribe           To make a written copy of speech
Modal words          Oops; Ahem; Yoo-hoo; Hooray; Mmm-hmm; Uh-huh; Ouch; ha ha ha
Default cut          A piece of audio intercepted by default (by the system)
Current cut          The audio that you cut
Overlapping speech   Two or more people speak at the same time
Homophone            Two or more words having the same pronunciation but different meanings or spelling
Dialect              A language used by the people of a specific area/district
Accelerated          Increase in speed
Whisper              Speak in a low/quiet voice
Slang                Informal expressions in spoken language, especially used by a particular group of
                     people, e.g. children, criminals, soldiers, etc.
Proper nouns         Names, places, terminology (the set of technical words or expressions used in a
                     particular subject)

FAQ
Submit your FAQ weekly
https://wenjuan.feishu.cn/m?t=sZ7a5HDuibui-ccdh

Q: Curse word
A: Transcribe.

Q: Background music
A: If it has lyrics, discard; if it has no lyrics, transcribe.

Q: In Evaluate, can we fix the error?
A: No, only choose the error type; there is no need to fix it.

Q: For the error options (written error, intercept error, etc.), the system defaults to only one of them. If there are multiple errors, can we only select one of the tags?
A: Yes, only select one of the error types.

Q: For abbreviations, using "NASA" as an example, what form should we transcribe if the audio is pronounced as "n" "a" "s" "a", or if the audio is pronounced as "/'nesər/"? And do we need to add 3 spaces when it reads "n" "a" "s" "a"? (transcribe "n a s a", or "N A S A")
A: If it is pronounced "n" "a" "s" "a", transcribe "N A S A" (3 spaces); if it is pronounced /'nesər/, transcribe "NASA" (no space).

Q: In the audio, a person stutters and does not say the complete word the first time. For example, for "basketball" he says it twice: "bask, basketball". In this kind of situation, does "bask" also need to be written out?
A: Yes, "bask" is a separate word, so it should be written out as "bask basketball". There is a rule about half-pronounced words in the guideline.

Q: The policy for modal words needs clarifying: one section says the transcribed text can start or end with up to 2 modal words, but another section says to transcribe the modal words according to the times they are repeated, e.g. when you hear 3 "ha" in the audio, the transcribed text should be "ha ha ha".
A: If the modal words are at the start or the end, keep only two of them and cut out the rest. If they are in the middle, keep all of the modal words and transcribe them as many times as they're repeated.

Q: Children or adults sing lyrics, with no background music and no melody, and the words are clearly heard. Is this speech or discard?
A: Discard. As long as the whole audio is a song with lyrics, we discard it.

Q: Should we transcribe other sounds like bush / boosh, etc.?
A: Only transcribe the clear human voice part.

Q: For the short form of "because" in the audio, do we spell it as cause, coz or cuz?
A: Only "cause" is the correct form; the transcribed words should be found in the dictionary.

Q: For expressions like "voila", "bon appetit", "adieu", and "bon voyage", do we need to transcribe them or should we just cut them out?
A: Non-English parts can be transcribed if you can be sure the transcription is correct.

Q: If an audio is "bye bye", is "bye bye" a modal word?
A: Transcribe it; "bye" is not a modal word.

Q: Are a.m. and p.m. written AM? a.m.? Am?
A: Only "AM" is the correct form, following the rule that common abbreviations should be all capitalized.

Q: In the audio, the characters' voices are changed, like electronic sounds, but the words are clear. Do we choose speech or discard?
A: If it is not lyrics, choose speech.

Q: For YouTube and similar words, should Y and T be capitalized? Is it OK to transcribe it as youtube or Youtube?
A: Write "YouTube"; it is a brand name.

Q: How do we transcribe COVID-19 and Glock-18? COVID-Nineteen or COVID-nineteen? Glock-Eighteen or Glock-eighteen?
A: COVID-Nineteen; Glock-Eighteen.

Q: How do we transcribe Mr? Mr, mr, mister or Mister?
A: Mister.

Q: ok, OK, Okay: is there a fixed writing requirement?
A: All are fine.


Bad cases

ASR Annotation Bad Cases

How to export dashboard data

1. Go to Task data under your queue data dashboard.
2. Choose the date range, filter, then export.
3. Find the data under "My export" and click download.
4. This is what the file looks like:

   1_Catgory: the labeling result
   1_duration: the audio duration

5. Filter on the "speech" category, then calculate the total duration.
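
The final filtering and summing step can also be done outside a spreadsheet. The sketch below assumes the export is a CSV file with the two columns named above and that durations are plain numbers; the file format, the duration unit, and the helper name `total_speech_duration` are assumptions, so adjust the column names to whatever your export actually contains.

```python
import csv
import io


def total_speech_duration(rows, category_col="1_Catgory", duration_col="1_duration"):
    """Sum the duration column over rows labeled as speech."""
    total = 0.0
    for row in rows:
        # Tolerate stray spaces and letter case in the label.
        if row[category_col].strip().lower() == "speech":
            total += float(row[duration_col])
    return total


# Made-up sample standing in for the downloaded export.
sample = "1_Catgory,1_duration\nspeech,2.5\ndiscard,7.0\nspeech,4.0\n"
rows = csv.DictReader(io.StringIO(sample))
print(total_speech_duration(rows))
```

For a real file, replace `io.StringIO(sample)` with `open(path, newline="")`, where `path` points to the downloaded export.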
