
Text Analysis for Human Emotion Recognition in Low-resourced Indian Languages

Ayush Kumar1, Satwik Das2, Abhilash Mohanty3

Department of Computer Science and Engineering, Siksha ‘O’ Anusandhan (Deemed to be) University,
Bhubaneswar, Odisha, India

Abstract: This project presents a sentiment analysis system that uses deep learning and natural
language processing to determine the sentiment of textual data. The system combines GloVe word
embeddings with a Bidirectional Long Short-Term Memory (BiLSTM) model to process and classify
sentiment in large text datasets. Model training and evaluation are carried out with TensorFlow
and Keras, leading to high accuracy in sentiment classification tasks.

The pipeline begins with preprocessing of the input data, which includes cleaning and tokenizing
the text from the provided dataset. GloVe embeddings convert words into dense vector
representations that capture semantic meaning and help the model understand context. The BiLSTM
model, which considers both past and future context in the text, is trained on these embeddings
so that it can capture the sentiment conveyed by a whole sentence. Training uses the Adam
optimizer, and the trained model is evaluated for accuracy and reliability.

Once the model is trained, it can predict the sentiment of new text inputs, including those in different
languages through translation. The implementation of the system involves various preprocessing
techniques such as text cleaning, tokenization, padding, and one-hot encoding of sentiments. By combining
these techniques with the powerful BiLSTM model and GloVe embeddings, the project offers a highly
accurate and efficient solution for sentiment analysis, addressing the challenges of understanding and
classifying sentiments in diverse and complex text data.

Keywords: Sentiment Analysis, Low-resource Languages, GloVe Embeddings, Bidirectional LSTM.
1 Introduction

1.1 Motivations

The primary motivation behind developing this sentiment analysis system is to improve the
understanding and processing of textual data in our digital age. With vast amounts of text generated
daily on social media, reviews, and customer feedback, there is a crucial need for reliable tools to
interpret and classify sentiments accurately. Using advanced techniques like GloVe embeddings
and Bidirectional LSTM models, our system aims to provide businesses, researchers, and
developers with a robust solution for gaining insights from text data.

Understanding the emotional tone behind text is essential for applications such as enhancing
customer service, monitoring social media sentiment, and improving user experiences. Traditional
methods often fail to capture the nuances of human language. Our motivation is to overcome these
limitations by employing deep learning technologies that grasp contextual information and offer a
deeper understanding of sentiments, thereby improving accuracy and handling diverse text data
effectively.

Additionally, the system's ability to predict sentiments in different languages through translation
underscores its global applicability. By creating a tool that accurately analyzes sentiments across
languages and cultures, we aim to foster better communication and inclusivity in our digital
world. This project not only addresses technical challenges but also enhances human-computer
interaction and communication in an increasingly digital society.
1.2 Objectives

1. Ensure Accurate Sentiment Analysis: Utilize RoBERTa tokenization for better accuracy in text
processing and sentiment classification, ensuring that the system captures the nuances and context of
human language effectively.

2. Integrate Advanced Embeddings: Implement GloVe word embeddings to convert words into dense
vector representations, enhancing the model's ability to understand semantic meanings and improve
sentiment prediction.

3. Leverage Deep Learning Models: Use Bidirectional LSTM models to consider both past and future
contexts in text data, providing a comprehensive understanding of sentiments.

4. Optimize Model Performance: Employ the Adam optimizer to ensure efficient training and high
accuracy of the sentiment analysis model.

5. Facilitate Multilingual Analysis: Enable the system to predict sentiments in different languages
through translation, making it versatile and globally applicable.

6. Enhance Preprocessing Techniques: Implement robust preprocessing methods, including text cleaning,
tokenization, padding, and one-hot encoding, to prepare data effectively for sentiment analysis.

7. Support Diverse Applications: Provide a reliable solution for various applications such as improving
customer service, monitoring social media sentiment, and enhancing user experiences by accurately
interpreting and classifying sentiments.

Original Contributions

Name                Contributions to this Project

Ayush Kumar         Literature Survey, Problem Identification, Project Implementation, Documentation
Satwik Das          Literature Survey, Work-flow Diagram, Project Implementation, Result Analysis
Abhilash Mohanty    Literature Survey, Documentation, Project Implementation

1.3 Paper Layout

In our paper, Section 1 presents the introduction. Section 2 provides a literature survey on sentiment analysis
techniques and models. Section 3 details our proposed solution, which uses RoBERTa tokenization, GloVe
embeddings, and Bidirectional LSTM models to achieve high accuracy in sentiment analysis. Section 4 lists the
system requirements. Section 5 presents the experimentation and results of our project. Section 6 concludes and
explores future possibilities and potential enhancements for our sentiment analysis system.

2 Literature Survey

Ref. No. 1
Authors: Aditya Joshi, Balamurali AR, Pushpak Bhattacharyya
Title: A Fall-back Strategy for Sentiment Analysis in Hindi: a Case Study
Results: Accuracy: in-language sentiment analysis 78.14; MT-based sentiment analysis 65.96.
Technique:
• Training a classifier on an annotated Hindi corpus and using it to classify a new document.
• Machine Translation (MT)-based sentiment analysis.
• Hindi-SentiWordNet (H-SWN): lexical resource for Hindi.
Findings: A fall-back strategy is proposed where first an in-language classifier is trained on the same
language dataset, as very rough and less data is available.

Ref. No. 2
Authors: Naman Bansal and Umair Z. Ahmed. Advisor: Amitabha Mukherjee, IIT Kanpur
Title: Sentiment Analysis in Hindi
Results: It also shows that decision tree gives best accuracy of 90.85% in case of TF-IDF representation,
and in case of unigram representation we have achieved 79% accuracy through voting classifier.
Technique: Deep belief network
Findings: Trained a DBN on a small percentage of labeled data and assign polarity to unlabeled data. They
used semi-supervised learning because supervised polarity classification systems are domain-specific and
hence systems trained on one dataset typically perform much worse on a different dataset. They also stated
that annotating a large amount of data could be an expensive process.

Ref. No. 3
Authors: Loitongbam Sanayai Meetei, Thoudam Doren Singh, Samir Kumar Borgohain, Sivaji Bandyopadhyay
Title: Low resource language specific pre-processing and features for sentiment analysis task
Results: The highest average overall result is obtained with TF-IDF values as the features in 10-fold CV.
Technique: Transliteration and TF-IDF (term frequency - inverse document frequency)
Findings: Dataset used in experiment is collected from the local daily newspapers. Then transliterated
dataset which is in the Meetei Mayek and Bengali script to the Roman script because of very close
similarity and roman script having pre tuned tokenizer and other lexical tools.

Ref. No. 4
Authors: Joanito Agili Lopo
Title: Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian
Local Languages
Results: The use of this parallel corpus and bilingual lexicon is expected to accelerate the process of NLP
research in low resource settings.
Technique: Mixed technique for dataset creation.
Findings: Used open source translated datasets such as Tatoeba Dataset, and NusaX Lexicon Dataset. Used
statistical machine translation keeping its context.
3 Proposed Systems

3.1 Methodologies Used

RoBERTa Tokenization: RoBERTa (A Robustly Optimized BERT Pretraining Approach) is a pretrained language
model that refines BERT by optimizing the training strategy, increasing the amount of training data, and
modifying key hyperparameters. Its tokenizer uses byte-level byte-pair encoding (BPE), which splits rare or
unseen words into subword units instead of mapping them to a single unknown token, so more of the nuance and
context of the text is preserved than with simple whitespace tokenization.
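
RoBERTa's tokenizer is based on byte-pair encoding (BPE) applied at the byte level. The sketch below is a toy,
character-level version of the BPE idea only: it learns merge rules from a handful of made-up words and then
applies them to split a word into subword units. It is not the RoBERTa tokenizer and does not use its
vocabulary; the corpus, the function names, and the number of merges are assumptions chosen for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy corpus)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Split `word` into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, not taken from the project's dataset.
corpus_words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = learn_bpe(corpus_words, num_merges=6)
tokens = apply_bpe("lowest", merges)
unseen = apply_bpe("newer", merges)  # a word that never appeared in the corpus
print(tokens, unseen)
```

Because an unseen word such as "newer" is still decomposed into known pieces, a subword tokenizer never has to
fall back to a single unknown token for it.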

GloVe Embeddings: GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm
for obtaining vector representations for words. By mapping words into dense vector spaces where semantically
similar words are positioned closely, GloVe embeddings enhance the model's ability to understand semantic
relationships within the text. This methodology provides a rich representation of words, crucial for improving
the performance of sentiment analysis models.
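
As a minimal sketch of how pretrained GloVe vectors could be wired into a model, the code below parses the
plain-text GloVe format (one word per line followed by its vector components), compares two words by cosine
similarity, and builds an embedding matrix indexed by a vocabulary. The three-dimensional vectors and the small
vocabulary are invented for illustration; real GloVe files have many more dimensions and would be read from
disk.

```python
import math

# Made-up 3-dimensional vectors in GloVe's text format: "<word> <v1> <v2> ...".
SAMPLE_GLOVE = """\
good 0.9 0.1 0.0
great 0.8 0.2 0.1
bad -0.7 0.3 0.0
"""


def load_glove(lines):
    """Parse GloVe-style lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_matrix(vocab, vectors, dim):
    """Build a row-per-token matrix; tokens without a GloVe vector keep a zero row."""
    matrix = [[0.0] * dim for _ in range(len(vocab))]
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix


vectors = load_glove(SAMPLE_GLOVE.splitlines())
vocab = {"<pad>": 0, "good": 1, "awful": 2}  # hypothetical vocabulary
matrix = embedding_matrix(vocab, vectors, dim=3)

print(cosine(vectors["good"], vectors["great"]))  # close to 1: similar words
print(cosine(vectors["good"], vectors["bad"]))    # negative: dissimilar words
```

A matrix built this way is what an embedding layer would typically be initialised with, so that token index i
looks up row i.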

Bidirectional LSTM (BiLSTM) Models: Bidirectional Long Short-Term Memory (BiLSTM) networks are a
type of recurrent neural network that can process data in both forward and backward directions. This capability
allows the model to capture dependencies and contextual information from both past and future states within a
text sequence, leading to a more comprehensive understanding of sentiment. BiLSTM models are particularly
effective for sentiment analysis due to their ability to grasp long-term dependencies in text data.
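
To make the forward and backward passes concrete, here is a small pure-Python sketch of a bidirectional LSTM
forward computation. It is only an illustration of the mechanics: the weights are random and untrained, the
dimensions are tiny, and the class and function names are hypothetical. The project itself builds its model
with TensorFlow and Keras, which this sketch does not reproduce.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class LSTMCell:
    """A single LSTM layer with input, forget, output and candidate gates."""

    GATES = ("i", "f", "o", "c")

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        width = input_dim + hidden_dim  # each gate sees [x_t, h_{t-1}]
        self.W = {
            gate: [[rng.uniform(-0.5, 0.5) for _ in range(width)] for _ in range(hidden_dim)]
            for gate in self.GATES
        }
        self.b = {gate: [0.0] * hidden_dim for gate in self.GATES}

    def step(self, x, h, c):
        z = x + h  # list concatenation of input and previous hidden state
        pre = {
            gate: [a + b for a, b in zip(matvec(self.W[gate], z), self.b[gate])]
            for gate in self.GATES
        }
        i_gate = [sigmoid(v) for v in pre["i"]]
        f_gate = [sigmoid(v) for v in pre["f"]]
        o_gate = [sigmoid(v) for v in pre["o"]]
        cand = [math.tanh(v) for v in pre["c"]]
        c_new = [f * c_old + i * g for f, c_old, i, g in zip(f_gate, c, i_gate, cand)]
        h_new = [o * math.tanh(cv) for o, cv in zip(o_gate, c_new)]
        return h_new, c_new

    def run(self, sequence):
        """Return the hidden state at every time step."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x in sequence:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs


def bilstm(sequence, forward_cell, backward_cell):
    """Concatenate forward states with backward states computed on the reversed sequence."""
    forward = forward_cell.run(sequence)
    backward = backward_cell.run(sequence[::-1])[::-1]  # re-align to original order
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
fwd = LSTMCell(input_dim=3, hidden_dim=2, rng=rng)
bwd = LSTMCell(input_dim=3, hidden_dim=2, rng=rng)

# Three time steps of made-up 3-dimensional word vectors.
sequence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
states = bilstm(sequence, fwd, bwd)

# Changing only the last token alters the backward half of the first state,
# which is how "future" context reaches earlier positions.
altered = sequence[:-1] + [[9.0, 9.0, 9.0]]
states_altered = bilstm(altered, fwd, bwd)
```

Each output state has twice the hidden size, since it joins the forward summary of the words so far with the
backward summary of the words still to come.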

Adam Optimizer: The Adam optimizer is a variant of stochastic gradient descent used for training deep
learning models. It combines the advantages of two other extensions of stochastic gradient descent, namely
AdaGrad and RMSProp. Adam keeps running estimates of the first and second moments of the gradients and uses
them to adapt the step size of each parameter, which typically gives faster convergence and improved
performance of the sentiment analysis model.
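
The Adam update rule can be written out in a few lines. The sketch below applies it to a simple two-parameter
quadratic rather than to the sentiment model; the objective, the starting point, and the hyperparameter values
(learning rate 0.1, 500 steps) are assumptions chosen so the example runs quickly, while beta1, beta2 and
epsilon follow the commonly used defaults.

```python
import math


def adam_minimize(grad_fn, theta, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise a function given its gradient, using the Adam update rule."""
    theta = list(theta)
    m = [0.0] * len(theta)  # running mean of gradients (first moment)
    v = [0.0] * len(theta)  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            # Bias correction compensates for m and v starting at zero.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            # Per-parameter step: large past gradients shrink the effective step size.
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def grad(theta):
    """Gradient of f(x, y) = (x - 3)^2 + (y + 1)^2, whose minimum is at (3, -1)."""
    x, y = theta
    return [2 * (x - 3), 2 * (y + 1)]


result = adam_minimize(grad, [0.0, 0.0])
print(result)  # approaches [3, -1]
```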

Preprocessing Techniques: Robust preprocessing techniques are employed to prepare the text data for analysis.
This includes text cleaning, tokenization, padding, and one-hot encoding of sentiments. Text cleaning involves
removing unwanted characters and noise, while tokenization splits the text into individual words or tokens.
Padding ensures that all text sequences are of uniform length, and one-hot encoding converts sentiment labels
into a binary matrix representation, facilitating accurate model training and evaluation.
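
The four preprocessing steps can be sketched end to end in plain Python. The cleaning rule, the special
`<pad>` and `<unk>` tokens, the choice to pad at the end of the sequence, the three sentiment classes, and the
sample sentences are all assumptions made for this illustration; the cleaning regex in particular is
English-oriented, and other scripts would need their own rule.

```python
import re


def clean_text(text):
    """Lowercase, replace punctuation with spaces, and collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text into word tokens."""
    return clean_text(text).split()


def build_vocab(token_lists):
    """Assign an integer id to every token; ids 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids, using <unk> for tokens outside the vocabulary."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]


def pad(ids, max_len, pad_id=0):
    """Truncate or right-pad a sequence so it has exactly max_len entries."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


def one_hot(label, classes):
    """Encode a label as a binary vector with a single 1 at the label's position."""
    return [1 if c == label else 0 for c in classes]


# Hypothetical sample data.
texts = ["I LOVE this movie!!!", "Worst service ever..."]
labels = ["positive", "negative"]
classes = ["negative", "neutral", "positive"]

token_lists = [tokenize(t) for t in texts]
vocab = build_vocab(token_lists)
X = [pad(encode(tokens, vocab), max_len=5) for tokens in token_lists]
Y = [one_hot(label, classes) for label in labels]

print(X)  # [[2, 3, 4, 5, 0], [6, 7, 8, 0, 0]]
print(Y)  # [[0, 0, 1], [1, 0, 0]]
```

Every row of X now has the same length and every row of Y has one entry per class, which is the shape a
classifier with a softmax output expects.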

3.2 Schematic Layout model

4.1 System Requirements

The minimum system requirements for running our sentiment analysis project, which involves preprocessing
text data, using RoBERTa tokenization, GloVe embeddings, and training Bidirectional LSTM models with
TensorFlow and Keras, typically include:
Server/Hosting:

• CPU: Dual-core processor or higher
• RAM: 8 GB or more
• Storage: 20 GB SSD or more (for hosting the operating system, applications, and storing model data
and embeddings)
• Operating System: Linux (recommended), Windows, or macOS

Client Devices:

• Web Browser: Google Chrome, Firefox, or any browser compatible with the Jupyter notebook or Colab
environment
• Operating System: Windows 7 or later, macOS 10.12 or later, Linux distributions

Networking:

• Internet Connection: Broadband internet connection with sufficient bandwidth for downloading
embeddings, model training, and interaction with cloud-based services if used (e.g., Google Colab for model
training)

These specifications ensure that the system can efficiently handle the computational demands of text
preprocessing, embedding generation, and deep learning model training for sentiment analysis.

5 Experimentation and Model Evaluation

5.1 Depiction Results

The sentiment analysis project achieved several notable results, showcasing the
effectiveness and reliability of combining RoBERTa tokenization, GloVe embeddings,
and Bidirectional LSTM models. Key outcomes include:

1. Enhanced Sentiment Analysis Accuracy:
   o The integration of RoBERTa tokenization significantly improved the model's ability
     to understand and process complex language nuances, leading to more accurate
     sentiment classification.
   o The use of GloVe embeddings provided dense vector representations of words,
     which enhanced the model's semantic understanding and overall performance in
     sentiment prediction.
2. Efficient Model Training and Performance:
   o The Bidirectional LSTM models, trained with the Adam optimizer, effectively
     captured context from both past and future states in the text, resulting in a
     comprehensive understanding of sentiments.
   o The robust preprocessing techniques, including text cleaning, tokenization, padding,
     and one-hot encoding, ensured that the data was prepared efficiently, leading to
     streamlined training processes and high model accuracy.
3. Reliable and Scalable System:
   o The system demonstrated reliable performance in processing and analyzing large
     volumes of text data, meeting the expected benchmarks for sentiment analysis
     applications.
   o The modular architecture of the project, incorporating TensorFlow and Keras for
     model training, provides a strong foundation for future enhancements and
     scalability.
4. Support for Quality Journalism:
   o The sentiment analysis system ensures that only reliable and authenticated sources
     contribute to news sentiment assessments, thereby upholding journalistic standards
     and integrity.
   o By leveraging advanced natural language processing techniques, the system helps
     combat misinformation by providing accurate sentiment analysis of news content,
     promoting transparency and trustworthiness in journalism.
5. Scalability and Future Enhancements:
   o Potential improvements, such as the integration of additional pre-trained models,
     enhanced preprocessing techniques, and broader support for different languages,
     were identified to further increase the system's accuracy and usability.
   o The project lays the groundwork for future developments in sentiment analysis,
     providing a scalable and adaptable framework for various text processing and sentiment
     classification applications.
6. User Adoption and Feedback:
   o Initial user feedback highlighted the system's ease of use and the perceived increase
     in accuracy and reliability of sentiment analysis results.
   o Users appreciated the comprehensive preprocessing and advanced model integration,
     which simplified the analysis process while ensuring high accuracy and
     performance.
5.2 Validation/System Performance Evaluation

Fig: Real time Sentiment analysis

Fig: Model Accuracy and Loss
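
For readers who want to see what the accuracy and loss curves measure, the following sketch computes both
quantities for a small batch. It assumes one-hot encoded labels and a model that outputs one probability per
class, with categorical cross-entropy as the loss; the three-class labels and the predicted probabilities are
invented for illustration and are not results from this project.

```python
import math


def accuracy(y_true, y_prob):
    """Fraction of samples whose highest-probability class matches the one-hot label."""
    correct = sum(
        1 for truth, probs in zip(y_true, y_prob)
        if truth.index(1) == probs.index(max(probs))
    )
    return correct / len(y_true)


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for truth, probs in zip(y_true, y_prob):
        # eps guards against log(0) when a probability is exactly zero.
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(truth, probs))
    return total / len(y_true)


# Made-up one-hot labels and predicted class probabilities for three samples.
y_true = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
y_prob = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]

acc = accuracy(y_true, y_prob)
loss = categorical_cross_entropy(y_true, y_prob)
print(acc, loss)
```

Here two of the three predictions are correct, and the third sample contributes the largest share of the loss
because the true class received a probability of only 0.4.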

6 Conclusion and Future Scope

With its robust integration of RoBERTa tokenization, GloVe embeddings, and Bidirectional LSTM models, this
sentiment analysis project is poised for significant advancements. Future possibilities include enhancing
multilingual support to encompass a wider range of languages, refining preprocessing techniques for even more
accurate sentiment predictions, and exploring the integration of newer, more efficient deep learning
architectures. These advancements promise to elevate the system's capability to interpret and classify
sentiments across diverse textual data, furthering its utility in various applications and scenarios.
