Text Augmentation for Neural Networks

Anna V. Mosolova¹, Vadim V. Fomin², and Ivan Yu. Bondarenko¹

¹ Novosibirsk State University
[email protected]
[email protected]
² National Research University Higher School of Economics
[email protected]

Abstract. This study considers the problem of training neural networks on
small text datasets. We explore data augmentation, a method used for image
and sound datasets to increase the performance of models trained on them,
and propose a method for augmenting text that is based on synonymy.

Keywords: NLP, neural networks, small datasets, synonymy

1 Introduction

Natural language processing is an actively developing area today. Machine
learning is widely applied in this area, and its approaches require large
amounts of labeled data, which costs many hours of human work. There is
therefore a need to increase the amount of data that has already been labeled.
Methods that do this are called data augmentation; they are a common way to
increase the performance of a model, avoid overfitting, and improve the model's
robustness. Such methods already exist in other areas of machine learning, such
as image classification and speech and sound recognition, but the techniques
used for images and sounds are not suitable for text because of the danger of
losing the sense of a sentence. In this paper, we suggest a method for text
augmentation that can improve performance, is not computationally expensive,
and preserves the sense of the sentence. The paper is organized as follows:
Related Work presents some augmentation techniques, Methods describes the
algorithm, Dataset provides information about the data used in the experiments,
and Experiments and Results reports the results of our work. In the Conclusion,
some future goals are outlined.

2 Related Work

Data augmentation is widely studied in many areas of machine learning because
it enables models to generalize better. This is crucial for fields where good
generalization is a challenging task, e.g. those where one must rely on small
datasets. Data augmentation has a history of research in specific tasks, such
as image recognition and speech recognition. For instance, [1] suggests methods
of image augmentation based on cropping, rotating, and flipping input images.
The authors also suggest using GANs to generate images of different styles, and
a new method that allows a neural network to choose which augmentations will
improve the classifier. An approach to sound augmentation is suggested in [2]:
it changes the speed of an audio signal, producing 3 versions of the original
with speed factors of 0.9, 1.0, and 1.1. A Python library for sound data
augmentation has been suggested in [3]. All of these methods increase the
performance of a model without using any additional data. However, to our
knowledge, comparable open-source methods for text augmentation are lacking, so
our proposal is one of the first open-source solutions.

3 Methods

3.1 Algorithm

We propose an approach to text augmentation which consists in replacing original
words with synonymous ones without losing the sense of the sentence. The process
of augmentation runs in several steps, which we describe below. The first step
excludes all words that should not be replaced, such as pronouns, conjunctions,
prepositions, and articles, so that they remain intact. The necessity to leave
such words intact is explained by the fact that they tend to have no synonyms
and by their function, which is to define the relationships between other words
rather than to carry a meaning of their own. The second step is to find synonyms
for some words in the sentence. The words to be substituted are picked randomly,
and their number depends on the initial settings (specifically, the percentage
of replaced words). For example, if a sentence consists of 10 words and
augmentation is set to 25%, the algorithm will substitute 2 words in the
sentence. The choice of synonyms is also random. During the next step, the new
words are put in the places of the replaced words and the algorithm returns the
resulting sentence. The algorithm has a parameter that multiplies the number of
sentences in the corpus n times. For example, Figure 1 shows a 7-times
augmentation. There is also one additional option that preserves writing in
capital letters (this option is presented in Figure 2).

Fig. 1. An example of a 7-times augmentation with 25% changes

Fig. 2. The algorithm of augmentation
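
The steps described in this section can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions, not the authors'
implementation: the synonym table `TOY_SYNONYMS`, the exclusion list `EXCLUDED`
and the function names are hypothetical stand-ins (the paper uses WordNet and
POS tags instead), the number of replaced words is assumed to be rounded down
(consistent with the 10-word, 25% example), and an n-times augmentation is
assumed to mean the original sentence plus n - 1 augmented copies.

```python
import random

# Hypothetical stand-in for WordNet lookups: a tiny hand-written synonym table.
TOY_SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "dog": ["hound"],
    "jumps": ["leaps"],
}

# Function words that are never replaced (illustrative and incomplete).
EXCLUDED = {"the", "a", "an", "and", "or", "but", "over", "in", "on", "he", "she", "it"}


def augment_sentence(sentence, share=0.25, rng=random):
    """Return a copy of `sentence` with a `share` of its words replaced by synonyms."""
    words = sentence.split()
    # Step 1: keep only words that are not excluded and have a known synonym.
    candidates = [
        i for i, w in enumerate(words)
        if w.lower() not in EXCLUDED and w.lower() in TOY_SYNONYMS
    ]
    # Step 2: the number of replacements is a share of the sentence length,
    # rounded down (assumption) and capped by the number of candidates.
    n_replace = min(int(len(words) * share), len(candidates))
    for i in rng.sample(candidates, n_replace):
        synonym = rng.choice(TOY_SYNONYMS[words[i].lower()])
        # Optional behaviour: preserve the capitalization of the original word.
        if words[i].isupper():
            synonym = synonym.upper()
        elif words[i][0].isupper():
            synonym = synonym.capitalize()
        # Step 3: put the synonym in the place of the original word.
        words[i] = synonym
    return " ".join(words)


def augment_corpus(sentences, times=7, share=0.25, seed=0):
    """Return each original sentence followed by `times - 1` augmented copies."""
    rng = random.Random(seed)
    result = []
    for sentence in sentences:
        result.append(sentence)
        result.extend(augment_sentence(sentence, share, rng) for _ in range(times - 1))
    return result


if __name__ == "__main__":
    for line in augment_corpus(["the quick dog jumps over a happy cat and sleeps"]):
        print(line)
```

The sketch splits on whitespace only, so a word with attached punctuation
(such as "dog.") is simply left unchanged; a real implementation would
tokenize the sentence first.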

3.2 Synonymy
The procedure of replacing words with synonyms described in Section 3.1 was
implemented by means of WordNet [4]. WordNet contains sets of synonymous words
and also records other relations between words. We used WordNet as one of the
modules in NLTK [5]. We also used POS tagging from NLTK to disambiguate the part
of speech of each word in the sentence. This caused some problems because the
POS tags in NLTK differ from the POS tags in WordNet, so we added a module that
converts them to the desired form.
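
A conversion of this kind might look like the sketch below. It assumes that the
tagger returns Penn Treebank tags (the tagset of NLTK's default `pos_tag`) and
relies on the single-letter part-of-speech codes that WordNet uses; the function
name `penn_to_wordnet` is hypothetical and the paper does not show the authors'
module.

```python
# WordNet part-of-speech codes; in NLTK these are the values of
# wordnet.NOUN, wordnet.VERB, wordnet.ADJ and wordnet.ADV.
WN_NOUN, WN_VERB, WN_ADJ, WN_ADV = "n", "v", "a", "r"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code.

    Returns None for tags with no WordNet counterpart (pronouns, determiners,
    prepositions, ...), which can double as a signal to leave the word intact.
    """
    if tag.startswith("JJ"):   # JJ, JJR, JJS: adjectives
        return WN_ADJ
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return WN_VERB
    if tag.startswith("NN"):   # NN, NNS, NNP, NNPS: nouns
        return WN_NOUN
    if tag.startswith("RB"):   # RB, RBR, RBS: adverbs
        return WN_ADV
    return None


if __name__ == "__main__":
    for tag in ["NNS", "VBD", "JJR", "RB", "PRP", "DT"]:
        print(tag, "->", penn_to_wordnet(tag))
```

With NLTK and its WordNet data installed, the returned code can be passed as the
`pos` argument of `wordnet.synsets(word, pos=...)` to restrict the lookup to
synonyms of the right part of speech.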

3.3 Implementation
We implemented the augmentation in Python 3.6 with the NLTK library, because it
provides access to the WordNet database.

4 Dataset
The Toxic Comment Classification Challenge, a competition launched on Kaggle,
inspired us to test text data augmentation as an approach. The aim of the
competition was to classify comments written by Wikipedia users in 6 binary
classification tasks, each representing a certain type of toxicity. We therefore
used the dataset from this competition for our experiments and evaluation. The
dataset is available at
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data.
The train set consisted of 159,571 samples, each of which was assigned 6 class
labels, according to the 6 classification tasks. The test set consisted of
153,164 samples. The 6 binary classifications correspond to the classes of
toxic, severe toxic, and obscene comments, threats, insults, and identity hate.
Each class is illustrated by examples 1–6 correspondingly:
(1) Bye! Don’t look, come or think of comming back! Tosser.
(2) SHUT UP, YOU FAT POOP, OR I WILL KICK YOUR ASS!!!
(3) A pair of jew-hating weiner nazi schmucks.
(4) Hi! I am back again! Last warning! Stop undoing my edits or die!
(5) Hey, you freaking hermaphrodite. Please unprotect your user page; I would
like to move it to a more suitable title or three.
(6) Bla bla bla....suck it Irishguy =)

The number of samples and their percentage for each class in the dataset is
presented in Table 1.

Type            Samples   Percentage

Toxic             15249         9.6%
Severe toxic       1959         1.2%
Obscene            8449         5.3%
Threat              478         0.3%
Insults            7877         4.9%
Identity hate      1405         0.9%
Overall          159571         100%

Table 1. Dataset structure

Every comment that is marked as severe toxic is also labeled toxic. This is not
the case with the other classes. Table 1 shows that the classes are far from
balanced: for example, the class ‘toxic’ contains roughly 32 times as many
samples as the class of comments that contain threats. It is also evident from
the examples above that, while some classes depend largely on the presence of
specific words in comments, others depend on the meaning of the sentence as a
whole. For instance, the classes ‘identity hate’ and ‘obscene’ rely on the
presence of words that either insult a nation, a political view, etc. or are
obscene.
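
The percentages in Table 1 and the degree of imbalance can be checked with a few
lines of arithmetic. The snippet below is only a sanity check on the published
counts; the names `TOTAL`, `COUNTS` and `class_share` are illustrative.

```python
TOTAL = 159571  # size of the train set

# Per-class sample counts as given in Table 1.
COUNTS = {
    "Toxic": 15249,
    "Severe toxic": 1959,
    "Obscene": 8449,
    "Threat": 478,
    "Insults": 7877,
    "Identity hate": 1405,
}


def class_share(count, total=TOTAL):
    """Share of the train set carrying a given label, in percent."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    for name, count in COUNTS.items():
        print(f"{name:<15}{count:>7}{class_share(count):>7}%")
    # Ratio between the most and the least frequent class.
    print(round(COUNTS["Toxic"] / COUNTS["Threat"], 1))
```

Because a comment can carry several labels at once, the per-class shares are
not expected to sum to 100%.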

5 Experiments and results

5.1 Model

To solve the problem posed in the competition, we used a convolutional neural
network with 128 feature maps, 6 convolutional layers, 6 pooling layers, 2 dense
layers, and a dropout of 0.5. The feature representation of a sentence that was
used as input to the neural network consisted of the vector representations of
each word in the sentence. As a source of vector representations, we tried two
embeddings trained on the training set described in Section 4: a word2vec model
as a word embedding, and a word2vec model trained to predict character n-grams
as a character embedding. The results are presented in Table 2.
5.2 Metrics

The competition used ROC-AUC as its metric. The score of an algorithm is the
average of the individual AUCs of each predicted type of toxicity.
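
As an illustration of this metric, the sketch below computes ROC-AUC for each
class as the probability that a randomly chosen positive sample receives a
higher score than a randomly chosen negative one (ties count as one half), and
then averages over the classes. The function names are hypothetical, and the
pairwise loop is quadratic, so it is meant for small examples only; on real
data a library routine would normally be used.

```python
def roc_auc(labels, scores):
    """ROC-AUC of one binary task via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


def mean_column_auc(label_rows, score_rows):
    """Average the per-class AUCs; every row holds one value per toxicity class."""
    n_classes = len(label_rows[0])
    aucs = [
        roc_auc([row[c] for row in label_rows], [row[c] for row in score_rows])
        for c in range(n_classes)
    ]
    return sum(aucs) / n_classes


if __name__ == "__main__":
    # Four comments, two classes (toy values, not taken from the dataset).
    labels = [[0, 1], [0, 0], [1, 1], [1, 0]]
    scores = [[0.1, 0.9], [0.4, 0.2], [0.35, 0.7], [0.8, 0.1]]
    print(mean_column_auc(labels, scores))
```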

Model                                          Public score   Private score

CNN with character embeddings                        0.9065          0.8933
CNN with character embeddings and with a
6 times augmentation for 25% of all words            0.9436          0.9446
CNN with word embeddings                             0.9752          0.9742
CNN with word embeddings and with a
6 times augmentation for 25% of all words            0.9743          0.9721

Table 2. Results of the Kaggle competition

6 Analysis

The evaluation presented above shows that text data augmentation made character
embeddings more useful for classification (the private score rose from 0.8933
to 0.9446) but left the model with word embeddings essentially unchanged (0.9742
without augmentation and 0.9721 with it). We attribute this to the fact that the
vector representations of synonymous words in word embeddings are very close. As
a result, the artificial samples from the augmented training set are very close
to the existing ones, which is not the case for character embeddings.
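
The character-level half of this argument can be illustrated with a toy check:
two synonyms usually share few or no character n-grams, so a substitution
produces an input that a character-based model sees as new. The sketch below is
an illustration only; the helper names are hypothetical, and padding each word
with `<` and `>` is an assumption rather than a detail taken from the paper.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # A synonym pair shares no trigrams, while an inflected form shares some.
    print(jaccard(char_ngrams("big"), char_ngrams("large")))
    print(jaccard(char_ngrams("big"), char_ngrams("bigger")))
```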

7 Conclusion and Future Work

Data augmentation has been shown to be a promising way to increase the accuracy
of classification. In this paper, we proposed an algorithm that worked well in a
Kaggle competition; it can be used by researchers, as it is freely distributed
on GitLab. We plan to develop our augmentation model further and to add the
possibility of augmenting Russian texts using synonyms from Wiktionary.

References

1. Wang, J., Perez, L.: The effectiveness of data augmentation in image
classification using deep learning (No. 300). Technical report (2017)
2. Ko, T., Peddinti, V., Povey, D., Khudanpur, S.: Audio augmentation for
speech recognition. In Sixteenth Annual Conference of the International
Speech Communication Association (2015)
3. Salamon, J., MacConnell, D., Cartwright, M., Li, P., Bello, J. P.: Scaper: A
library for soundscape synthesis and augmentation. In Applications of Signal
Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on
(pp. 344-348). IEEE (2017)
4. Miller, G. A.: WordNet: a lexical database for English. Communications of
the ACM, 38(11), 39-41 (1995)
5. Bird, S., Loper, E.: NLTK: the natural language toolkit. In Proceedings
of the ACL 2004 on Interactive poster and demonstration sessions (p. 31).
Association for Computational Linguistics (2004)
