Transcription Guide - Introduction, Labelling and Segmentation
· History
· Home and Gardening
· Legal and Courtroom
· Money and Finance
· Pets and Animals
· Politics and Current Affairs
· Religion and Spirituality
· Science and Nature
· Sports
· Technology
· Travel and Hospitality
· Trivia
· Weather
· 1. Introduction
· 2. Segmentation
· 2.1. Creating Segments
· 2.1.1. General Segmentation Requirements
· 2.1.2. Specific Requirements for Each Segment Type
· 2.1.2.1. Speech
· 2.1.2.2. Babble
· 2.1.2.3. Overlap
· 2.1.2.4. Music
· 2.1.2.5. Noise
· 2.2. Segmentation Examples
· 2.2.1. Example 1 - Segmenting an Audio File with Split-Channel Conversation Telephony
· 2.2.2. Example 2 - Segmenting a Co-Channel Media File
· 2.3. Labelling Segments
· 2.3.1. All Segments
· 2.3.2. Speech Segments Only
· 3. Transcription Conventions
· 3.1. Characters and Special Symbols
· 3.2. Spelling and Grammar
· 3.2.1. Dialectal Pronunciations
· 3.2.2. Mispronounced Words
· 3.2.3. Non-Standard Usage
· 3.3. Capitalization
· 3.4. Abbreviations
· 3.5. Contractions
· 3.6. Interjections
· 3.7. Individual Spoken Letters
· 3.8. Numbers
· 3.9. Punctuation
· 3.10. Acronyms and Initialisms
· 3.11. Disfluent Speech
· 3.11.1. Stumbled Speech, Repetitions, and Truncated Words
· 3.11.2. Filler Words
· 3.12. Overlapping Speech
· 3.12.1. Conversational Telephony
· 3.12.2. Media
· 3.13. Unintelligible Speech
· 3.14. Non-Target Languages
· 3.15. Non-Speech
· 3.15.1. Non-Speech Noises
· 3.15.2. Silence/Pauses
· 4. Metadata Labelling
· 4.1. Labelling the Transcribed File
· 4.1.1. File-level Values
· 4.1.2. Annotator Information
· 4.2. Labelling Speakers in the Transcribed File
· 5. Appendix A: The Complete Set of Non-Speech Tags and Other Markup Tags
1. Introduction
Transcription is the commitment of an audio signal to textual representation. This can include
representing speech data as well as other sound types such as phones ringing or music.
2. Segmentation
Segmentation is the process of "timestamping" the audio file for each given speaker. It involves
marking structural boundaries within the audio file, such as changes of sound type, conversational
turns, utterances, and phrases. Segment boundaries also facilitate transcription by letting the
transcriptionist listen to manageable chunks of segmented speech at a time.
2.1. Creating Segments
2.1.1. General Segmentation Requirements
· Create segments (i.e., timestamp the audio file) according to the five primary segment types
described in Section 2.1.2. The five primary types are:
· Speech
· Babble
· Overlap
· Music
· Noise
· Timestamp each segment to the millisecond. Timestamps must be positive floating-point numbers
in the format seconds.milliseconds (e.g., 12.345 for 12 seconds and 345 milliseconds).
· Each segment should have only one primary sound type, which will be listed as the primaryType — one
of the segment objects — in the transcription JSON. See Section 2.1.2 for the required sound types and
their requirements.
· Create each segment tight around its targeted sound type. Leave out continuous stretches of
silence/white noise that last two or more seconds at the beginning, in the middle, or at the end of the
segment.
· Transcription is needed only for Speech segments.
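The general requirements above can be checked mechanically. The following is a minimal sketch, not the guide's official tooling: only the primaryType name and the seconds.milliseconds format come from this guide, while the "start" and "end" field names and the choice to store timestamps as strings are assumptions made for illustration.

```python
import re
from decimal import Decimal

# The five primary types listed in Section 2.1.1.
PRIMARY_TYPES = {"Speech", "Babble", "Overlap", "Music", "Noise"}

# seconds.milliseconds with exactly three millisecond digits, e.g. 12.345.
# A leading minus sign is rejected because timestamps must be positive.
_TIMESTAMP = re.compile(r"[0-9]+\.[0-9]{3}")


def is_valid_timestamp(text):
    """Return True for strings such as '12.345'."""
    return _TIMESTAMP.fullmatch(text) is not None


def validate_segment(segment):
    """Return a list of problems found in one segment dict (empty if none).

    'start' and 'end' are hypothetical field names; the guide only
    names primaryType.
    """
    problems = []
    if segment.get("primaryType") not in PRIMARY_TYPES:
        problems.append("primaryType must be one of the five primary types")
    for key in ("start", "end"):
        if not is_valid_timestamp(str(segment.get(key, ""))):
            problems.append(f"{key} must look like 12.345")
    # Compare the boundaries only when everything else is well formed.
    if not problems and Decimal(str(segment["start"])) >= Decimal(str(segment["end"])):
        problems.append("start must be earlier than end")
    return problems
```

For example, validate_segment({"primaryType": "Speech", "start": "0.500", "end": "3.638"}) returns an empty list, while a segment whose primaryType is not one of the five types yields one problem message.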
2.1.2.2. Babble
· Create Babble segments for audio signals that consist of speech or isolated vocal noise (e.g., coughing,
laughing) from one or more background speakers (e.g., people standing nearby or in the same room),
even if the speech is partially intelligible.
2.1.2.3. Overlap
· Create Overlap segments for audio signals that consist of overlapping speech between two or more
unintelligible foreground speakers or between three or more foreground speakers, regardless of
intelligibility. Also use Overlap when two or more speakers overlap but it is difficult to
distinguish foreground speakers from background speakers.
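The Overlap rule combines three conditions, which can be easier to follow as a small decision function. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, and it assumes the annotator has already established that at least two speakers are talking at the same time.

```python
def is_overlap(foreground_count, intelligible, roles_clear=True):
    """Decide whether overlapping speech should become an Overlap segment.

    foreground_count: number of foreground speakers talking at once.
    intelligible:     False if the overlapping foreground speech
                      cannot be understood.
    roles_clear:      False if foreground and background speakers
                      cannot be told apart.
    """
    # Foreground and background cannot be distinguished.
    if not roles_clear:
        return True
    # Three or more foreground speakers, regardless of intelligibility.
    if foreground_count >= 3:
        return True
    # Two or more unintelligible foreground speakers.
    if foreground_count >= 2 and not intelligible:
        return True
    # Any other case is not covered by the Overlap rule in this section.
    return False
```

Under this sketch, two intelligible foreground speakers talking over each other do not produce an Overlap segment; how that case is handled is described elsewhere in the guide (see Section 3.12).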
2.1.2.4. Music
· Create Music segments for audio signals that consist of music, songs, singing, or sounds from musical
instruments. This includes theme songs or characters singing songs.
2.1.2.5. Noise
· Create Noise segments for audio signals that consist of any isolated non-speech noise (e.g., applause,
phone ring).
Notes: The term "foreground speaker(s)", or "speaker(s) of interest", refers to the speaker(s) that a
particular recording is intended to capture. For split-channel conversation telephony (i.e., one speaker,
one channel), the foreground speaker is either the caller/agent or the call-receiver/customer. For
co-channel media data (i.e., one channel, multiple foreground speakers), the foreground speakers will
vary depending on the domain. In a political debate, for example, the foreground speaker(s) could
include the host, the debaters, and potentially members of the audience with questions; in a reality
television show, the foreground speaker(s) would include all of the protagonists featured.
See Section 2.2 below for some segmentation examples.
2.2. Segmentation Examples
The following examples visualize the desired segmentation based on the segmentation requirements
outlined above. Each visualization has six rows:
Row Description
0 Audio signals
3 Segment ID
5 Transcription
2.2.1. Example 1 - Segmenting an Audio File with Split-Channel Conversation Telephony
1. Segmentation is tight around each targeted primary type (i.e., Speech in this example).
2. Long stretches of silence/white noise are left out (e.g., between 3.638 and 8.910 seconds).
3. Each segment is less than 15 seconds.
4. Segment 001 consists solely of unintelligible speech from the foreground speaker. It is still classified
as Speech, and the speech is transcribed as a best guess.
5. Each Speech segment consists of speech that is conversationally or linguistically related.
a. Segment 001 and Segment 002 each consist of a single speaker turn, followed by a pause.
b. Segment 003 consists of a complete sentence. The end of the segment constitutes a sentence
break.
c. Segment 004 consists of another complete sentence, with a 1.5-second pause transcribed as
[no-speech]. The sentence is not broken up into two segments at the pause because that would
have resulted in a segment with speech that is not linguistically or conversationally related (i.e., "#ah,
we're going to talk about #um").
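The timing side of this example (leave out silence of two seconds or more, keep shorter pauses inside the segment, stay under 15 seconds) can be sketched as follows. This is an illustration only: the function names are hypothetical, it covers the silence rule alone, and it does not attempt the linguistic grouping described in point 5, which still requires a human judgment.

```python
MAX_GAP = 2.0      # silence of two or more seconds is left out (Section 2.1.1)
MAX_LENGTH = 15.0  # the example keeps each segment under 15 seconds


def group_intervals(intervals, max_gap=MAX_GAP):
    """Merge (start, end) sound intervals separated by less than max_gap seconds."""
    segments = []
    for start, end in sorted(intervals):
        if segments and start - segments[-1][1] < max_gap:
            # Short pause: extend the current segment across it.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            # Long silence (or first interval): start a new segment.
            segments.append((start, end))
    return segments


def too_long(segments, max_length=MAX_LENGTH):
    """Return the segments that reach or exceed max_length seconds."""
    return [s for s in segments if s[1] - s[0] >= max_length]
```

With the intervals (0.5, 3.638), (8.910, 10.2), and (11.7, 14.0), where only 3.638 and 8.910 are taken from the example and the other values are made up, the first gap exceeds two seconds and splits the audio, while the 1.5-second pause is kept inside the second segment.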