Transfer Learning for ASR to Deal with Low-Resource Data Problem

Technical Report · March 2022

Maryam Asadolahzade Kermanshahi

Amirkabir University of Technology

All content following this page was uploaded by Maryam Asadolahzade Kermanshahi on 11 March 2022.

Iran University of Science and Technology
Deep Learning (Spring 2019)

Transfer Learning for ASR to Deal with Low-Resource Data Problem

Maryam Asadolahzade
Department of Computer Engineering
Iran University of Science and Technology
[email protected]

Abstract

End-to-end models are the state of the art for Automatic Speech Recognition
(ASR) systems. Despite their advantages, they suffer from a significant problem:
huge amounts of training data are required to achieve good performance. This
is a serious challenge for low-resource languages such as Persian, so methods
and techniques are needed to overcome it. Transfer learning is an effective
method that can address this problem. Our aim is to build a phoneme recognition
system for the Persian language and explore the effect of transfer learning. To
this end, we first train the network on an English corpus. Then, we transfer the
trained network and fine-tune it on a Persian corpus. Our experiments on the
FarsDat corpus indicate that transfer learning with a few hours of Persian data
can reduce the Phoneme Error Rate (PER) by 7.88% compared with a model trained
from scratch. Moreover, this method attains PER improvements of 2.08% and 1.52%
over the strong DNN-HMM and DNN-HSMM baselines, respectively.
Keywords: Automatic speech recognition, Phoneme (phone) recognition, Transfer
learning, Phoneme Error Rate (PER), Low-resource language, Persian (Farsi)
language.

1 Introduction
The task of Automatic Speech Recognition (ASR) is to convert a spoken signal into text. In this
field, end-to-end approaches are the current state of the art. An important point is that such
approaches need huge amounts of training data [1], [2]. For low-resource languages such as Persian,
for which sufficient training data is not available, this can be a serious issue. One technique for
dealing with this problem is transfer learning, which has been applied successfully in previous
studies [1], [2]. To the best of our knowledge, transfer learning for ASR has not been applied to the
Persian language. Therefore, our goal is to explore the effect of transfer learning for this language.
Unfortunately, there is very limited research on DNN-based, and in particular end-to-end, ASR systems
for the Persian language. The only paper that attempted to use an end-to-end model for Persian is [3],
which implemented a phoneme recognition system. The motivation of our work is to publish results
for end-to-end Persian phoneme recognition to facilitate future studies in this area and to provide a
framework for comparison for other researchers working on Persian ASR. Furthermore, we aim to
find out to what extent we can improve current Persian ASR performance using transfer learning.
We evaluate the methods using the Persian FarsDat corpus [4].

(LaTeX template borrowed from NeurIPS 2019)

The rest of the paper is organized as follows. Section 2 reviews previous ASR systems with transfer
learning. Section 3 describes the architecture, its input and output, and the approach used in
the paper. Section 4 describes the datasets and the performance measure used. Section 5 presents the
experimental results and discusses them. Finally, Section 6 discusses the results and future research
directions.

2 Related work/Background

The first attempt at transfer learning in a DNN-based ASR system was made in [1]. The authors used
four European languages to build a multilingual DNN in which all hidden layers are shared and only
the last softmax layer is language-specific: each language has its own softmax layer.
After training with the four languages, they evaluated transfer learning with two target languages:
American English and Mandarin Chinese. American English is phonetically close to the European
languages used for the initial network, while Mandarin Chinese is different from them.
The results showed that transferring the hidden layers shared across languages can improve accuracy
for both new languages. They also concluded that when the training data for the target language is
limited, it is better to train only the softmax layer than to retrain more layers. The authors in [2]
used a CNN-based end-to-end ASR system. They trained the model on English data and then transferred
it to German data, showing better performance than a German model trained from scratch. In [5], a
multilingual model was built with 10 BABEL languages and tested on 4 other BABEL languages using a
transfer learning approach. Their results showed that transfer learning from the multilingual model
improves over monolingual models. A more recent work [6] proposed a language-adversarial transfer
learning method, which helps the shared features contain less unnecessary language-dependent
information. The authors demonstrate promising results on the IARPA Babel datasets.

3 Proposed method

Our aim is to build an ASR system for the Persian language. To do this, we first train an ASR
system for English. Then, we transfer the trained network and fine-tune it with Persian data. The
overall diagram of the proposed method is shown in Figure 1.

• Input: The input of the neural network is the spectrogram extracted from the audio signal.
We first segment the signal into frames, using 32 ms frame windows taken every 8 ms.
Then we apply the short-time Fourier transform (STFT) and mel-scale the result.
We normalize the features per input sequence to have zero mean and unit variance.
• Output: The output of the network is the phoneme sequence corresponding to the input
audio signal.
• Architecture: We stack 11 1D-convolutional layers, following the architecture introduced
in [2], [7]. We use zero-padding in each layer to preserve the dimension of the input.
The activation function for the first 10 layers is ReLU. The last layer uses a softmax
activation to obtain a probability distribution over phonemes; it has 30 output channels,
each of which corresponds to one of the phonemes. The network is trained with the
Connectionist Temporal Classification (CTC) loss [8].
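
The framing and normalization steps in the Input item can be sketched in a few lines of plain
Python. This is only an illustration, not the code used in the report: the 16 kHz sample rate is
an assumption (it is the LibriSpeech rate given in Section 4.1), the 8 ms value is read as the
frame shift, frames are counted without padding, and the normalization statistics are computed over
all values of one sequence rather than per feature dimension, which the text leaves open. The
function names are hypothetical.

```python
import math

SAMPLE_RATE = 16_000  # Hz; assumed, not stated for the network input
WINDOW_MS = 32        # frame length from the text
HOP_MS = 8            # frame shift, as we read "every 8 ms"


def frame_params(sample_rate=SAMPLE_RATE):
    """Return (window, hop) expressed in samples."""
    window = sample_rate * WINDOW_MS // 1000
    hop = sample_rate * HOP_MS // 1000
    return window, hop


def num_frames(num_samples, window, hop):
    """Count full frames that fit in the signal (no padding assumed)."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def normalize_sequence(features):
    """Scale one sequence of frames to zero mean and unit variance.

    `features` is a list of frames, each a list of floats. Statistics are
    taken over every value in the sequence (an assumption).
    """
    values = [v for frame in features for v in frame]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) or 1.0  # avoid dividing by zero on constant input
    return [[(v - mean) / std for v in frame] for frame in features]


window, hop = frame_params()
frames_in_one_second = num_frames(SAMPLE_RATE, window, hop)
```

Under these assumptions a window covers 512 samples and consecutive windows overlap by 75%, so
the network sees one feature vector per 8 ms of audio.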

After training, the network predicts the probability of each phoneme for each input frame.
We did not use beam search, a decoder, or a language model in our experiments. We
also use the implementation of [2] for our project.
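
The report does not state how the per-frame probabilities are turned into a phoneme sequence. One
common choice for a CTC-trained network without beam search or a language model is best-path
(greedy) decoding: take the most probable label in each frame, merge consecutive repeats, and drop
the CTC blank. The sketch below illustrates that rule under stated assumptions; the function name
and the blank index 0 are hypothetical, and whether the 30 output channels include a blank is not
specified in the text.

```python
def greedy_ctc_decode(frame_probs, blank=0):
    """Best-path CTC decoding.

    `frame_probs` is a list of per-frame probability lists.
    `blank` is the index of the CTC blank label (assumed to be 0 here).
    """
    # Most probable label index in every frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]

    decoded = []
    previous = None
    for label in best_path:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Five frames over three labels (index 0 is the blank).
example = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
decoded_example = greedy_ctc_decode(example)
```

In the example the blank frame separates two runs of label 1, so both are kept and the result is
the label sequence 1, 1, 2.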

4 Experimental Setup

4.1 Dataset

We use two corpora which are explained in the following:

Figure 1: The diagram of the proposed method.

English corpus: For training on English, we use the LibriSpeech corpus [9]. This corpus
consists of 1000 hours of read speech, sampled at 16 kHz, from the domain of audio books. It is
released under the CC BY 4.0 license and is freely available for download.
Persian corpus: We use FarsDat [4] to train the system for the Persian language. It contains the
utterances of 304 female and male speakers from 10 dialect regions in Iran. Each speaker uttered 20
sentences in two sessions. This corpus is about 5 hours in size. The utterances of the first 250
speakers were used as the training set and the utterances of the remaining 54 speakers as the test
set. Our experiments are speaker-independent because the speakers in the training set are different
from those in the test set.
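
The speaker-independent split described above can be sketched as follows. The speaker identifiers
are hypothetical placeholders (the actual FarsDat naming scheme is not given here); only the counts
of 304, 250, and 54 come from the text.

```python
NUM_SPEAKERS = 304        # total speakers in FarsDat, from the text
NUM_TRAIN_SPEAKERS = 250  # first 250 speakers form the training set


def split_speakers(speaker_ids):
    """Split an ordered list of speaker IDs into (train, test) by position."""
    train = speaker_ids[:NUM_TRAIN_SPEAKERS]
    test = speaker_ids[NUM_TRAIN_SPEAKERS:]
    return train, test


# Hypothetical IDs, used only to make the example runnable.
speaker_ids = [f"spk{n:03d}" for n in range(1, NUM_SPEAKERS + 1)]
train_speakers, test_speakers = split_speakers(speaker_ids)

# Speaker independence: no speaker appears on both sides of the split.
assert not set(train_speakers) & set(test_speakers)
```

Splitting by speaker rather than by utterance is what keeps a test speaker's voice out of the
training data.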

4.2 Evaluation metric

To evaluate a phoneme recognition system, the common measure is the phoneme error rate
(PER), which compares the recognized phoneme label sequence with the reference sequence.
The two strings are aligned using dynamic programming. With the numbers of substitution,
deletion, and insertion errors denoted by S, D, and I, respectively, the PER is
defined as:

\[ \mathrm{PER} = \frac{S + D + I}{N} \]

where N is the total number of phonemes in the reference label sequence [10].
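
As an illustration of this definition, the sketch below aligns two phoneme sequences with the
standard edit-distance dynamic program and reports the PER together with the S, D, and I counts.
It is not the scoring tool used in the report; the function name and the example sequences are
hypothetical.

```python
def phoneme_error_rate(reference, hypothesis):
    """Return (PER, (S, D, I)) for two phoneme sequences."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")
    n, m = len(reference), len(hypothesis)

    # dp[i][j] = (errors, S, D, I) for reference[:i] against hypothesis[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)  # only insertions

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost, s, d, ins = dp[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (cost, s, d, ins)          # match, no error
            else:
                diagonal = (cost + 1, s + 1, d, ins)  # substitution
            cost, s, d, ins = dp[i - 1][j]
            deletion = (cost + 1, s, d + 1, ins)
            cost, s, d, ins = dp[i][j - 1]
            insertion = (cost + 1, s, d, ins + 1)
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])

    errors, s, d, ins = dp[n][m]
    return errors / n, (s, d, ins)


reference = ["s", "a", "l", "a", "m"]
hypothesis = ["s", "a", "a", "m", "e"]
per, (subs, dels, inss) = phoneme_error_rate(reference, hypothesis)
```

Here the best alignment deletes "l" and inserts "e", so S = 0, D = 1, I = 1, and with N = 5 the
PER is 2/5 = 0.4. Because insertions are counted, the PER can exceed 1 when the hypothesis is much
longer than the reference.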

5 Results
In this section, we perform three experiments to evaluate the performance. First, we investigate
the effect of the number of frozen layers in the transferred network. Second, we evaluate the
effectiveness of transfer learning by comparing the results obtained with transfer learning against
an ASR system trained without the transferred network. Third, we compare the results with some strong
baseline methods.

Figure 2: The performance for different values for K (number of frozen layers).

5.1 The effect of the number of frozen layers

As mentioned earlier, our network has 11 layers. To fine-tune the network with the Persian corpus, we
freeze the K lowest layers, so that only the weights of the 11-K top layers are updated during
training. To find a suitable value for K, we evaluate the performance for different values of K in
Figure 2.
With 10 frozen layers, the PER is not satisfactory, and the gap between 10 and 9 frozen layers
is large. Freezing fewer layers, for example zero or one, gives better results.
The reason is that English and Persian have some audio properties in common, but it is still
necessary to fine-tune all the layers to reach the best results. Our results are consistent with
the findings in [2].
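
The freezing scheme can be written down independently of any deep learning framework. The sketch
below only computes which of the 11 layers would be updated for a given K; it does not train
anything, and the function name is hypothetical. Layer index 0 is taken to be the lowest layer
(closest to the input).

```python
NUM_LAYERS = 11  # depth of the network, from the text


def trainable_flags(num_frozen, num_layers=NUM_LAYERS):
    """Return one boolean per layer: True if the layer is fine-tuned.

    The `num_frozen` lowest layers keep their transferred English weights;
    the remaining `num_layers - num_frozen` top layers are updated.
    """
    if not 0 <= num_frozen <= num_layers:
        raise ValueError("num_frozen must be between 0 and num_layers")
    return [index >= num_frozen for index in range(num_layers)]


# K = 10 leaves only the output layer trainable; K = 0 fine-tunes everything.
only_top_layer = trainable_flags(10)
all_layers = trainable_flags(0)
```

In a framework such as PyTorch, flags like these would typically be applied by setting
`requires_grad` to False on the parameters of each frozen layer; that step is an assumption about
tooling and is not described in the report.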

5.2 The effect of transfer learning

In this experiment, we compare the results of transfer learning with a model trained without
transfer, i.e. a network trained on the Persian corpus only. We show the effect of transfer in terms
of both training time and accuracy in Figure 3.
The results show that training from scratch converges more slowly than all of the transferred
networks, even the network with 10 frozen layers. In addition, its accuracy is lower than that of the
transferred networks. To compare the results precisely, we report them in Figure 4. The results are
based on 10 epochs of training.
As expected, transfer learning outperforms the model trained from scratch, even with a small number
of frozen layers.

5.3 Comparing with baselines

For a better comparison with other methods, we compare our end-to-end model with some
baselines, including the Gaussian mixture model-hidden Markov model (GMM-HMM), the deep neural
network-hidden Markov model (DNN-HMM), and the deep neural network-hidden semi-Markov model
(DNN-HSMM), which are explained in detail in our previous work [11].
From the results, we can see that the end-to-end system without transfer performs worse
than the baselines, even though end-to-end models are the state of the art, because the Persian
corpus is extremely small. However, we observe a significant improvement for the system trained with
transfer learning, which is expected.

Figure 3: Comparing the results of transfer learning and learning from scratch.

Figure 4: Comparing the results of transfer learning and learning from scratch.

Figure 5: Comparing the results of transfer learning with some baselines.

6 Discussion

This work yields three benefits. Firstly, we provide a comparative framework and an
end-to-end model for future studies in Persian ASR. Secondly, we reduce training time using transfer
learning. Transfer learning can be seen as a kind of weight initialization: instead of training the
network from random weights, we start learning from suitable weight values that were
trained previously on a huge corpus. Lastly, we obtain better results
than the model trained from scratch and the baseline methods.
For future work, we will use a language model during decoding, which may yield additional
improvement.

References

[1] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-language knowledge transfer using multilingual
deep neural network with shared hidden layers,” in ICASSP, 2013, pp. 7304–7308.
[2] J. Kunze, L. Kirsch, I. Kurenkov, A. Krug, J. Johannsmeier, and S. Stober, “Transfer learning for speech
recognition on a budget,” in 2nd ACL Workshop on Representation Learning for NLP, 2017.
[3] S. Alisamir, S. M. Ahadi, and S. Seyedin, “An end-to-end deep learning model to recognize Farsi speech
from raw input,” in 4th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS), 2018,
pp. 1–5.
[4] M. Bijankhan, J. Sheikhzadegan, and M. R. Roohani, “FARSDAT-The speech database of Farsi spoken
language,” in Proceedings of the Australian Conference on Speech Science and Technology, 1994.
[5] J. Cho et al., “Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and
language modeling,” in Spoken Language Technology Workshop (SLT), 2018, pp. 521–527.
[6] J. Yi, J. Tao, Z. Wen, and Y. Bai, “Language-adversarial transfer learning for low-resource speech
recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 27, no. 3,
pp. 621–630, 2019.
[7] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2Letter: an End-to-End ConvNet-based Speech
Recognition System,” arXiv:1609.03193, Sep. 2016.
[8] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling
Unsegmented Sequence Data with Recurrent Neural Networks,” in Proceedings of the 23rd International
Conference on Machine Learning, New York, NY, USA, 2006, pp. 369–376.
[9] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain
audio books,” in ICASSP, 2015, pp. 5206–5210.
[10] S. Young et al., The HTK book, vol. 3. 2002.
[11] M. Asadolahzade Kermanshahi and M. M. Homayounpour, “Improving Phoneme Sequence Recognition
using Phoneme Duration Information in DNN-HSMM,” Journal of AI and Data Mining, vol. 7, no. 1,
pp. 137–147, 2019.
