MALAYALAM CORPUS
FATHIMA MURSHIDA K
M.Tech CSE Department
MES College of Engineering, Kuttipuram
Kerala, India
[email protected]
Abstract—Natural languages such as Malayalam are highly inflectional and agglutinative, which is a problem for NLP-based Malayalam applications. To improve the performance of these applications, the word embeddings learned from a Malayalam corpus need to be improved. The improvement proposed here is to convert the words in the corpus into a standardised form by removing all inflectional parts, i.e. keeping only root words, so what is essentially needed is a stemmer. In this project a Malayalam morphological analyser is used to obtain the root of every word in the existing Malayalam corpus. The advantage of removing the inflectional parts of all words is that sparsity in the corpus is reduced, the frequency of the remaining words rises sharply, and the space and time complexity of the word embedding representation of the corpus therefore decreases. Zipf's law is a discrete probability distribution that gives the probability of encountering a word in a given corpus; by this reasoning, increasing the frequency of words should improve the performance of the neural word embedding, so an improvement of the Malayalam word embedding is expected. Here fastText is used to learn dense word vector representations of the Malayalam corpus, reducing the dimensionality of the sparse word co-occurrence matrix. The improvement is mainly intended for wordnet-, analogy- and ontology-based Malayalam applications.

Index Terms—Morphological Analyzer, Zipf's law, Preprocessing, Testing, Training

I. INTRODUCTION
Natural language processing (NLP) is the sub-field of AI focused on enabling computers to understand and process human languages. Computers cannot yet truly understand natural languages in the way that humans do, but they can already do a lot, and applying NLP techniques to a project can save a great deal of time.
This project works with the Malayalam language. Malayalam is a Dravidian language spoken in the Indian state of Kerala. It is one of the 22 scheduled languages of India and was designated a Classical Language of India in 2013. The earliest script used to write Malayalam was the Vatteluttu script, and later the Kolezhuttu, which derived from it. The oldest literary works in Malayalam, distinct from the Tamil tradition, are the Paattus, folk songs dated between the 9th and 11th centuries. Grantha script letters were adopted to write Sanskrit loanwords, which resulted in the modern Malayalam script. Malayalam is a highly inflectional and agglutinative language, and algorithmic interpretation of Malayalam words and their formation rules remains a largely untackled problem. The word order is generally subject-object-verb, although other orders are often employed for reasons such as emphasis. Nouns are inflected for case and number, whilst verbs are conjugated for tense, mood and causativity (and in archaic language also for person, gender, number and polarity). Being the linguistic successor of the macaronic Manipravalam, Malayalam grammar is also based on Sanskrit. Because of this complexity, Malayalam is a challenging language to work with.

Morphological analyzers and morphological generators are two essential and basic tools for building any language processing application. Morphological analysis is the process of providing the grammatical information of a word given its suffix. A morphological analyzer is a computer program that takes a word as input and produces its grammatical structure as output: it returns the root/stem word along with grammatical information depending on the word category. For nouns it provides gender, number and case information, and for verbs it provides tense, aspect and modality. In this project, inflections are removed from each word in the corpus so that the frequency of words in the resulting corpus increases and the resulting corpus contains only root words. By increasing the frequency of words, the complexity of the word embedding decreases, because sparsity is higher when inflected word forms are kept. Even with limited resources, better output can be obtained provided that no morphosyntactic information is required: this project uses only conceptual similarity, i.e. the co-occurrence of conceptually similar words, so that Malayalam wordnet-, ontology- and analogy-based applications can improve their performance. A minimal sketch of this root-word normalisation step is shown below.
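The interface of the morphological analyser is not specified in the paper, so the following is only an illustrative sketch under stated assumptions: the analyser is replaced by a lookup table ROOT_TABLE mapping inflected forms to roots, and the file names are hypothetical. It shows how the existing corpus could be rewritten so that every token is replaced by its root word.

# Stand-in for the Malayalam morphological analyser used in the project:
# here it is just a lookup in a (hypothetical) table {inflected_form: root}.
ROOT_TABLE = {}   # in practice, populated from the analyser's output

def root_word(word):
    """Return the root form of `word`, falling back to the word itself."""
    return ROOT_TABLE.get(word, word)

def normalise_corpus(in_path, out_path):
    """Rewrite the corpus so every token is replaced by its root word."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(root_word(tok) for tok in line.split()) + "\n")

normalise_corpus("malayalam_corpus.txt", "malayalam_corpus_roots.txt")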
In natural language processing we want to make computer programs that understand, generate and, more generally speaking, work with human languages. One challenge immediately jumps out: we humans communicate with words and sentences, while computers only understand numbers. For this reason we have to map those words (and sometimes whole sentences) to vectors, that is, to arrays of numbers; this is called text vectorization, and it is also termed feature extraction. The two broad ways of converting text into numbers are sparse vector representations and dense vector representations, illustrated in the sketch below.
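As a concrete illustration (not taken from the paper), the sketch below builds a sparse bag-of-words vector with scikit-learn's CountVectorizer and contrasts it with a dense vector looked up from a small, randomly initialised embedding table; the toy sentences and the 8-dimensional embedding size are arbitrary choices.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat sat on the mat", "the dog sat on the log"]

# Sparse representation: one dimension per vocabulary word, mostly zeros.
vectorizer = CountVectorizer()
sparse = vectorizer.fit_transform(sentences)
print(sparse.toarray())            # shape: (2, vocabulary size)

# Dense representation: each word mapped to a short real-valued vector.
vocab = vectorizer.get_feature_names_out()
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=8) for w in vocab}   # 8 dimensions, illustrative
print(embedding["cat"])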
II. LITERATURE SURVEY

A. State of the art
1) Word2Vec: Word2Vec is a statistical method for efficiently learning standalone word embeddings from a text corpus. It was developed by Tomas Mikolov et al. at Google in 2013 to make neural-network-based training of embeddings more efficient, and it has since become the de facto standard for developing pretrained word embeddings. In contrast to other deep learning models, it is a shallow model with only one layer and no non-linearities. The paper by Mikolov et al. introduced two architectures for learning word embeddings from a large corpus of text in an unsupervised way. The first architecture, CBOW, tries to predict the center word from the sum of the context vectors within a specific window. The second, and more successful, architecture is called skip-gram; it does the exact opposite and tries to predict each of the context words directly from the center word. The pretrained word2vec embeddings used here were trained with the skip-gram algorithm. This algorithm is also the inspiration for the algorithms behind the GloVe and fastText embeddings. A minimal training sketch follows.
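A minimal sketch of skip-gram training with gensim, a common open-source implementation; the corpus path, vector size and window are illustrative values, not settings reported in the paper.

from gensim.models import Word2Vec

# One tokenised sentence per line; path and hyper-parameters are illustrative.
sentences = [line.split() for line in open("malayalam_corpus.txt", encoding="utf-8")]

model = Word2Vec(
    sentences,
    vector_size=300,   # embedding dimensionality
    window=5,          # context window
    sg=1,              # 1 = skip-gram, 0 = CBOW
    min_count=5,
    workers=4,
)
model.wv.save("word2vec_malayalam.kv")
print(model.wv.most_similar(model.wv.index_to_key[0], topn=5))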
2) GloVe: Global Vectors for Word Representation (GloVe) by Pennington, Socher, and Manning was inspired by the skip-gram algorithm and approaches the problem from a different direction. Pennington, Socher, and Manning show that the ratio of co-occurrence probabilities of two specific words contains semantic information. The idea is similar to TF-IDF, but it is used for weighing the importance of a context word during the training of word embeddings. Classical vector space model representations of words were developed using matrix factorization techniques such as Latent Semantic Analysis (LSA), which do a good job of using global text statistics but are not as good as learned methods like word2vec at capturing meaning and demonstrating it on tasks such as calculating analogies. The GloVe algorithm works by gathering all co-occurrence statistics into a large sparse matrix X, in which each element counts how many times word i co-occurs with word j within a window, similar to skip-gram; the word embeddings are then defined in terms of this co-occurrence matrix. A small sketch of building such a matrix appears below.
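To make the co-occurrence matrix concrete, here is a small sketch (an illustration, not the GloVe reference implementation) that counts symmetric co-occurrences within a fixed window over a tokenised corpus; the toy sentences are arbitrary.

from collections import defaultdict

def cooccurrence_counts(sentences, window=5):
    """Count how often word i co-occurs with word j within `window` tokens."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    counts[(w, tokens[j])] += 1.0
    return counts

sentences = [["a", "b", "c", "a"], ["b", "a", "d"]]
X = cooccurrence_counts(sentences, window=2)
print(X[("a", "b")])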
3) fastText: fastText is a library for the efficient learning of word representations and sentence classification. It is written in C++ and supports multiprocessing during training. fastText allows training supervised and unsupervised representations of words and sentences, and these representations (embeddings) can be used for numerous applications, from data compression to features for additional models, candidate selection, or initializers for transfer learning. Bojanowski et al. introduced the fastText embeddings by extending the skip-gram algorithm so that words are not treated as atomic but as bags of character n-grams. Their idea was inspired by the work of Schütze in 1993, who learned representations of character four-grams through singular value decomposition (SVD). One of the main advantages of this approach is that word meaning can now be transferred between words, and thus embeddings of new words can be extrapolated from the embeddings of n-grams already learned. The length of the n-grams is controlled by the -minn and -maxn options, which give the minimum and maximum number of characters to use, respectively. The model is considered a bag-of-words model because, aside from the sliding window of n-gram selection, no internal structure of a word is taken into account for featurization: as long as the characters fall inside the window, the order of the character n-grams does not matter. The sketch below shows how a word decomposes into character n-grams.
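The following sketch (illustrative only) shows the bag of character n-grams that fastText would extract for a single word, using the library's convention of wrapping the word in angle brackets; the minn/maxn values here are arbitrary.

def char_ngrams(word, minn=3, maxn=6):
    """Return the bag of character n-grams, fastText-style: '<' and '>' mark word boundaries."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i : i + n])
    return grams

print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']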
4) Paragram: Wieting et al. introduced a method for tuning existing word embeddings using paraphrase data. The focus of their paper is not on creating entirely new word embeddings from a large corpus; instead, the authors take existing pretrained GloVe embeddings and tune them so that words in similar sentences compose in the same manner. Their training data consists of a set of P phrase pairs (p1, p2), where p1 and p2 are assumed to be paraphrases. The objective function they use aims to increase cosine similarity, i.e. the similarity of the angles between the composed semantic representations of two paraphrases. It is important to mention that Wieting et al. express similarity in terms of angle and not in terms of actual distance. Additionally, Wieting et al. only explored one algebraic composition function, namely averaging of the word vectors.

The data used by the authors to tune the embeddings is the PPDB, the Paraphrase Database by Ganitkevitch, Van Durme, and Callison-Burch; specifically, they used version XXL, which contains 86 million paraphrase pairs. An example of a short paraphrase is "thrown into jail", which is semantically similar to "taken into custody". Wieting et al. published their pretrained embeddings, called Paragram-Phrase XXL, which are in fact tuned GloVe embeddings, alongside their paper. These embeddings also have a dimensionality of 300 and a limited vocabulary of 50,000. To apply the embeddings, according to Wieting et al., they should be combined with Paragram-SL999, which is tuned on the SimLex dataset.
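As a small illustration of the composition-plus-cosine objective described above (not Wieting et al.'s code), the sketch below averages the word vectors of two phrases and computes their cosine similarity; the toy embeddings are random placeholders.

import numpy as np

def phrase_vector(phrase, embeddings):
    """Compose a phrase by averaging its word vectors (the composition Wieting et al. use)."""
    vectors = [embeddings[w] for w in phrase.split() if w in embeddings]
    return np.mean(vectors, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings; real Paragram vectors are 300-dimensional.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=4) for w in "thrown into jail taken custody".split()}
sim = cosine(phrase_vector("thrown into jail", embeddings),
             phrase_vector("taken into custody", embeddings))
print(sim)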
III. PROPOSED METHOD

Increasing the accuracy of the word embedding representation of the Malayalam language has a great impact on ontology- and analogy-based NLP applications. The goal is improved word vectors for Malayalam, which increase the accuracy of pretrained Malayalam word embeddings in NLP applications. The main aim is to improve the word embeddings without needing any labeled data. The improvement is obtained by converting the Malayalam corpus into a standardised form by removing the inflections from each word.
Any large volume of text can be fed to the model to obtain word embeddings, without any kind of labeling. In this project the fastText word embedding is used; the word embeddings are derived by training a model on a large text corpus.
A. System Architecture

The system architecture mainly consists of five modules: Corpus Creation, Data Preprocessing, Training Using Neural Networks, Model Evaluation, and Dataset Visualisation.
1) Corpus Creation: Data was collected from Malayalam news articles to create a Malayalam corpus containing thousands of sentences. A small sketch of assembling such a corpus is shown below.
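The paper does not describe the collection script, so the following is only a sketch under the assumption that the scraped news articles are stored as plain-text files in a directory named article_texts (a hypothetical path); it writes the corpus with one sentence per line, splitting naively on simple end-of-sentence punctuation.

import glob
import re

# Hypothetical layout: one plain-text news article per file in article_texts/.
SENTENCE_END = re.compile(r"[.?!]\s+")   # naive sentence boundaries

with open("malayalam_corpus.txt", "w", encoding="utf-8") as out:
    for path in glob.glob("article_texts/*.txt"):
        text = open(path, encoding="utf-8").read()
        for sentence in SENTENCE_END.split(text):
            sentence = sentence.strip()
            if sentence:
                out.write(sentence + "\n")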
2) Data Preprocessing: Data preprocessing is an important part of this project, because data in the real world is incomplete, noisy and inconsistent, and Malayalam is highly inflectional and agglutinative. Here, a Malayalam morphological analyser is used to generate root words, i.e. to remove the inflectional parts from the existing Malayalam corpus. Using sandhi rules, the morphological analyser generates the root of each word in the corpus, so in the resulting corpus the frequency of Malayalam words increases. According to Zipf's law, in a large corpus of a natural language such as Malayalam the frequency of any word is inversely proportional to its rank in the frequency table, where frequency is the number of times a word appears in the corpus. By removing inflections the frequency of words increases, and by this reasoning the performance of the neural word embedding improves. The frequency-rank behaviour of the corpus can be inspected as in the sketch below.
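A simple way to see this effect, and the kind of plot shown in the Results section, is to compute word frequencies and plot frequency against rank on log-log axes; the sketch below (file names are illustrative) does this for the corpus before and after root-word normalisation.

from collections import Counter
import matplotlib.pyplot as plt

def rank_frequency(path):
    """Return word frequencies sorted from most to least frequent."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return sorted(counts.values(), reverse=True)

for label, path in [("original", "malayalam_corpus.txt"),
                    ("root words", "malayalam_corpus_roots.txt")]:
    freqs = rank_frequency(path)
    plt.loglog(range(1, len(freqs) + 1), freqs, label=label)

plt.xlabel("rank")
plt.ylabel("frequency")
plt.legend()
plt.show()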
3) Training Using Neural Networks: This project uses the fastText word embedding. fastText is a word embedding method that extends the word2vec model: instead of learning vectors for words directly, fastText represents each word by its character n-grams. This helps capture the meaning of shorter words and allows the embeddings to understand suffixes and prefixes. Once a word has been represented using character n-grams, a skip-gram model is trained to learn the embeddings. fastText can be used both for classification and for word-embedding creation. Here it generates the improved word embeddings of the new Malayalam corpus, which contains root words only. A minimal training sketch is given below.
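A minimal sketch of unsupervised fastText training on the root-word corpus, using the fasttext Python bindings; the file name and hyper-parameter values are illustrative, not the settings used in the paper.

import fasttext

# Train skip-gram embeddings with character n-grams on the root-word corpus.
model = fasttext.train_unsupervised(
    "malayalam_corpus_roots.txt",  # one sentence per line
    model="skipgram",
    dim=300,      # embedding dimensionality
    minn=2,       # shortest character n-gram
    maxn=6,       # longest character n-gram
    epoch=5,
)
model.save_model("fasttext_malayalam.bin")
print(model.get_nearest_neighbors(model.words[0], k=5))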
4) Model Evaluation: Word embeddings should capture the relationships between words in natural language. In the word similarity and relatedness task, word embeddings are evaluated by comparing the word similarity scores computed for a pair of words with human labels for the similarity or relatedness of that pair. Cosine similarity measures the similarity between two vectors of an inner product space: it is the cosine of the angle between the two vectors and indicates whether they point in roughly the same direction. It is often used to measure document similarity in text analysis. In this project the cosine similarity measure is used to find similar words, as in the sketch below.
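For example, with the fastText model trained above (a hedged sketch; in practice the probe words would be Malayalam root words):

import numpy as np
import fasttext

model = fasttext.load_model("fasttext_malayalam.bin")

def cosine_similarity(w1, w2):
    u, v = model.get_word_vector(w1), model.get_word_vector(w2)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Compare a word pair; with real data these would be Malayalam root words.
word_a, word_b = model.words[0], model.words[1]
print(word_a, word_b, cosine_similarity(word_a, word_b))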
5) Model Visualisation: PCA and t-SNE are two techniques for visualising a dataset in 2D or 3D space. PCA performs a linear mapping of the data to a lower-dimensional space, whereas t-distributed stochastic neighbor embedding (t-SNE) is a non-linear dimensionality reduction technique well suited to the visualisation of high-dimensional datasets. For that reason t-SNE is used in this project: in the t-SNE visualisation it can be seen that similar words tend to lie close to each other and dissimilar words tend to lie far apart. A visualisation sketch follows.
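A sketch of the visualisation step with scikit-learn's TSNE; the cut-off of 500 words and the perplexity of 30 are illustrative choices.

import fasttext
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

model = fasttext.load_model("fasttext_malayalam.bin")
words = model.words[:500]                       # illustrative cut-off
vectors = np.array([model.get_word_vector(w) for w in words])

# Project the 300-dimensional vectors onto 2D for plotting.
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1], s=4)
for word, (x, y) in list(zip(words, points))[:30]:   # label a few points only
    plt.annotate(word, (x, y), fontsize=7)
plt.show()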
IV. RESULTS

Fig. 1. Graphical result of the existing dataset after applying Zipf's law.
Fig. 2. Graphical result of the new Malayalam dataset after applying Zipf's law.

V. CONCLUSION AND FUTURE WORK

Using the TensorFlow embedding projector it can be visualised that similar Malayalam words, or synonyms, lie close to each other. The Euclidean and cosine distances between similar words are smaller than with the pretrained fastText word vectors for Malayalam. It can therefore be predicted that, by exploiting Zipf's law, the Malayalam word embedding can be improved. The sparsity problem in the existing Malayalam corpus is also reduced using limited resources, namely the conceptual similarity of words, and the space and time complexity of the new Malayalam corpus is lower. The improved word vectors can be used in all Malayalam NLP applications to improve their efficiency, accuracy and speed. In future this model can be used as a base model for improving Malayalam machine translation, and morphosyntactic information can be learned on top of it.