Text Augmentation For Neural Networks
Abstract. This study addresses the problem of training neural networks
on small text datasets. We explore data augmentation, a method used for
image and sound datasets to improve the performance of models trained
on them. We propose an augmentation method that is based on synonymy.
1 Introduction
2 Related Work
3 Methods
3.1 Algorithm
3.2 Synonymy
The replacement of words with synonyms described in paragraph 2.1 was
implemented by means of WordNet [4]. WordNet contains sets of synonymous
words and also records other relations between words. We accessed WordNet
through its module in NLTK [5]. We also used the POS tagger from NLTK to
disambiguate the part of speech of each word in the sentence. This caused some
problems because the POS tags in NLTK differ from the POS tags in WordNet, so
we added a module that converts them to the required form.
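The tag conversion mentioned above can be sketched as follows. This is a minimal illustration, not the module used in our implementation: it assumes that the tagger returns Penn Treebank tags (the default for nltk.pos_tag) and that WordNet uses the single-letter codes 'n', 'v', 'a' and 'r'. The function name penn_to_wordnet is hypothetical.

```python
# WordNet POS codes; in NLTK these are the values of wordnet.NOUN,
# wordnet.VERB, wordnet.ADJ and wordnet.ADV.
WN_NOUN, WN_VERB, WN_ADJ, WN_ADV = "n", "v", "a", "r"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code.

    Returns None for tags that have no WordNet counterpart
    (determiners, prepositions, punctuation and so on).
    """
    if tag.startswith("NN"):   # NN, NNS, NNP, NNPS
        return WN_NOUN
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return WN_VERB
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return WN_ADJ
    if tag.startswith("RB"):   # RB, RBR, RBS
        return WN_ADV
    return None
```

Words whose tag maps to None would simply be left unchanged, since WordNet has no synonym sets for them.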
3.3 Realization
We implemented the augmentation in Python 3.6 with the NLTK library,
because it provides access to the WordNet database.
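The general shape of synonym-based augmentation can be sketched as below. The sketch rests on several assumptions: a small hand-written dictionary (TOY_SYNONYMS, hypothetical) stands in for the WordNet lookup so that the example runs without NLTK, and every word that has a synonym is replaced, which may differ from the replacement policy of the actual algorithm.

```python
import random

# Hypothetical stand-in for WordNet: (lowercased word, POS code) -> synonyms.
TOY_SYNONYMS = {
    ("big", "a"): ["large"],
    ("quickly", "r"): ["rapidly"],
}


def augment(tagged_tokens, synonyms, rng):
    """Build an artificial sample by swapping words for synonyms.

    tagged_tokens: list of (word, pos_code) pairs.
    Words without an entry in `synonyms` are copied unchanged.
    """
    result = []
    for word, pos in tagged_tokens:
        candidates = synonyms.get((word.lower(), pos), [])
        result.append(rng.choice(candidates) if candidates else word)
    return result


sentence = [("a", None), ("big", "a"), ("dog", "n"),
            ("ran", "v"), ("quickly", "r")]
augmented = augment(sentence, TOY_SYNONYMS, random.Random(0))
print(" ".join(augmented))
```

Passing the random generator explicitly keeps the augmentation reproducible when several synonyms are available for a word.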
4 Dataset
The Toxic Comment Classification Challenge, a competition launched by
Kaggle, inspired us to test text data augmentation as an approach. The aim
of the competition was to classify comments written by Wikipedia users in
6 binary classification tasks, each representing a certain type of
toxicity. We therefore used the dataset from this competition for our
experiments and evaluation. The dataset is available at
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data.
The train set consisted of 159 571 samples, each of which was assigned
6 class labels, one for each of the 6 classification tasks. The test set
consisted of 153 164 samples. The 6 binary classifications correspond to the
classes toxic, severe toxic, obscene, threat, insult and identity hate.
Each class is illustrated by examples 1–6, respectively:
(1) Bye! Don’t look, come or think of comming back! Tosser.
(2) SHUT UP, YOU FAT POOP, OR I WILL KICK YOUR ASS!!!
(3) A pair of jew-hating weiner nazi schmucks.
(4) Hi! I am back again! Last warning! Stop undoing my edits or die!
(5) Hey, you freaking hermaphrodite. Please unprotect your user page; I would
like to move it to a more suitable title or three.
(6) Bla bla bla....suck it Irishguy =)
The number of samples and their percentage for each class in the dataset are
presented in Table 1.
Every comment that is marked as severe toxic is also labeled toxic. This is not
the case with the other classes. The table above shows that the classes are
far from balanced, e.g. the number of samples in the class ‘toxic’ is
drastically larger than the number of comments that contain threats. The
examples above also show that, while some classes depend largely on the
presence of specific words in comments, others depend on the meaning of the
sentence as a whole. For instance, the classes ‘identity hate’ and ‘obscene’
rely on the presence of words that either insult a nation, a political view,
etc., or are obscene.
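The label structure described above can be checked with a few lines of code. The sketch below uses three invented rows rather than the real data, and it assumes that the label columns are named toxic, severe_toxic, obscene, threat, insult and identity_hate; the helper names are hypothetical.

```python
# Assumed names of the 6 binary label columns.
LABELS = ["toxic", "severe_toxic", "obscene",
          "threat", "insult", "identity_hate"]

# Toy rows invented for illustration; they are not taken from the dataset.
rows = [
    {"toxic": 1, "severe_toxic": 1, "obscene": 1,
     "threat": 0, "insult": 1, "identity_hate": 0},
    {"toxic": 1, "severe_toxic": 0, "obscene": 0,
     "threat": 1, "insult": 0, "identity_hate": 0},
    {"toxic": 0, "severe_toxic": 0, "obscene": 0,
     "threat": 0, "insult": 0, "identity_hate": 0},
]


def class_counts(samples):
    """Number of positive samples for each of the 6 classes."""
    return {label: sum(s[label] for s in samples) for label in LABELS}


def severe_implies_toxic(samples):
    """True if every 'severe_toxic' sample is also labeled 'toxic'."""
    return all(s["toxic"] == 1 for s in samples if s["severe_toxic"] == 1)


counts = class_counts(rows)
consistent = severe_implies_toxic(rows)
```

Run over the full train set, the same two functions would give the per-class counts and confirm the relation between the severe toxic and toxic labels.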
5.1 Model
6 Analysis
The evaluation presented above shows that text data augmentation made
character embeddings more relevant for classification but did not affect
the usefulness of word embeddings. This is because the vector
representations of synonymous words in word embeddings are very close. As a
result, the artificial samples from the augmented training set are very
close to the existing ones, which is not the case for character embeddings.
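The character-level side of this argument can be illustrated with a simple surface measure. The sketch below is not the character embedding model from our experiments; it only assumes that character trigram overlap is a reasonable proxy for how different two samples look to a character-level model, and the sentences are invented.

```python
def char_ngrams(text, n=3):
    """Set of all character n-grams in the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


original = "a big dog"
augmented = "a large dog"   # 'big' replaced by a synonym

# Overlap between the two sentences at the character level.
sentence_overlap = jaccard(char_ngrams(original), char_ngrams(augmented))

# Overlap between the two synonyms themselves.
word_overlap = jaccard(char_ngrams("big"), char_ngrams("large"))
```

The two synonyms share no trigram at all, so a single replacement already changes the character-level input noticeably, whereas in a word embedding space the two words would lie close together.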
Data augmentation has been shown to be a promising way to increase accuracy
in classification tasks. In this paper, we proposed an algorithm that
worked well in the Kaggle competition; it can be used by other researchers,
as it is freely distributed on GitLab. We plan to develop our augmentation
model further and to add the possibility of augmenting Russian texts using
synonyms from Wiktionary.
References
1. Wang, J., Perez, L.: The effectiveness of data augmentation in image
classification using deep learning (No. 300). Technical report (2017)
2. Ko, T., Peddinti, V., Povey, D., Khudanpur, S.: Audio augmentation for
speech recognition. In Sixteenth Annual Conference of the International
Speech Communication Association (2015)
3. Salamon, J., MacConnell, D., Cartwright, M., Li, P., Bello, J. P.: Scaper: A
library for soundscape synthesis and augmentation. In Applications of Signal
Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on
(pp. 344-348). IEEE (2017)
4. Miller, G. A.: WordNet: a lexical database for English. Communications of
the ACM, 38(11), 39-41 (1995)
5. Bird, S., Loper, E.: NLTK: the natural language toolkit. In Proceedings
of the ACL 2004 on Interactive poster and demonstration sessions (p. 31).
Association for Computational Linguistics (2004)