
Received 28 November 2023, accepted 25 December 2023, date of publication 1 January 2024, date of current version 16 January 2024.

Digital Object Identifier 10.1109/ACCESS.2023.3349097

Live Event Detection for People's Safety Using NLP and Deep Learning

AMRIT SEN1, GAYATHRI RAJAKUMARAN1, MIROSLAV MAHDAL2, SHOLA USHARANI1, VEZHAVENDHAN RAJASEKHARAN3, RAJIV VINCENT1, AND KARTHIKEYAN SUGAVANAN1
1 School of Computer Science and Engineering, Vellore Institute of Technology, Chennai 600127, India
2 Department of Control Systems and Instrumentation, Faculty of Mechanical Engineering, VSB-Technical University of Ostrava, 708 00 Ostrava, Czech Republic
3 School of Mechanical and Building Sciences, Vellore Institute of Technology, Vellore 632014, India
Corresponding author: Miroslav Mahdal ([email protected])
This work was supported by the European Union under the REFRESH—Research Excellence For REgion Sustainability and High-Tech Industries via the Operational Program Just Transition under Project CZ.10.03.01/00/22_003/0000048.
The associate editor coordinating the review of this manuscript and approving it for publication was Felix Albu.

ABSTRACT Today, humans themselves pose one of the greatest threats to society through involvement in robbery, assault, and homicide. Such circumstances particularly endanger people working alone at night in remote areas, especially women. A real-time threat of this kind is almost always accompanied by a characteristic sound or noise, which can be exploited for early detection. Numerous existing measures are available, but none of them is sufficiently effective, owing to limited accuracy and delays in predicting the threat. Hence, a novel software-based prototype is developed to detect threats from the sound or noise in a person's surroundings and to automatically alert the registered contacts of the victim by sending email, SMS, and WhatsApp messages through their smartphone, without any additional hardware components. Audio signals from a Kaggle dataset are visualized and analyzed using Exploratory Data Analytics (EDA) techniques. Feeding the EDA outcomes into deep learning models, namely Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNN), yields an accuracy of 96.6% in classifying the audio events.

INDEX TERMS Natural language processing (NLP), deep learning, audio, recording, CNN, LSTM,
classification, prediction.

I. INTRODUCTION
In the physical world, the occurrence of any physical event carries with it a sound particular to that event, be it the sound of a stone falling to the ground, a river flowing, a bird chirping, a person walking, or a road being constructed. It is thus fair to assume that, just as events with which positivity can be associated (crowds cheering, friends greeting each other, celebratory fireworks, etc.) have sounds paired with them, negative events (a road accident, a landslide, a gunshot, etc.) also have specific sounds associated with them. It is worth pondering how valuable a system would be that could detect ambient noise and judge whether it relates to something positive or something negative. Even if direct applications of this kind are not yet available for mankind's everyday use, the technology required to make them a reality already exists in one form or the other. Before directly discussing the prospects of natural language processing in the domain of security, one needs to understand the current scenario, and where and how sound/text detection and analysis is done by NLP and used by mankind in modern times.

Modern smartphones equipped with Artificial Intelligence based voice recognition systems are classic examples of the application of speech-based natural language processing in our day-to-day lives. With the likes of Google's Voice Assistant [23], [24] available on every Android-powered smartphone, which searches the web and brings the necessary information to us without our typing a single search keyword, the evolution of Apple's Siri [24] through the various versions of the iconic iPhone series, which primarily works as a personal assistant for its users, the availability of Amazon's Alexa [26] as a smart home voice assistant that can be used to control virtually any smart electrical device in our home, or the development of Samsung's own artificial intelligence-based personal assistant named Bixby [25],

natural language processing has already come a long way in the domain of voice/text-based language processing for the betterment of human lives, all while keeping pace with the latest technologies and hardware.

Taking a few instances of how natural language processing with artificial intelligence has impacted our lives in recent times, one can consider how Google's Voice Assistant works. If one needs to set an alarm, one can literally say so, and the alarm is set by the AI-based assistant. If one is driving a car and wants to look up directions to a destination, one can simply ask for them after saying "Hey, Google". The examples of such applications are numerous. It is virtually unnecessary to touch the phone even to dial another person's number; one can just ask Google to do so, and it is done in no time.

Another example of the application of natural language processing is in the field of textual data analysis and classification. NLP can be used to determine the voice, tense, and type of a sentence, which can in turn be used to determine either the next part of the entire text or the emotion [28], [29] that the text is trying to convey. Google, for instance, uses a very similar application in its search engine, which starts suggesting search queries to users as soon as they start typing something into the browser's search bar. The relevancy of the search results is also determined by NLP itself. Taking another instance, social media platforms like Twitter and Instagram use NLP for sentiment analysis and emotion detection and use the acquired insight to show their users similar posts on their timelines.

Now, observing the scenario presented above, NLP can be used to enhance the safety and security of individuals or a population as a whole in many different ways. Both sound-based and text-based NLP can be used in multiple scenarios to provide different kinds of security solutions.

Taking the scenario of NLP on social media platforms [27], they can be of great help to security and emergency workers in times of crisis such as a natural calamity (a flood, landslide, tsunami, cyclone, etc.) or a man-made disaster (terrorist attacks, hostage situations, aviation emergencies, civil accidents, etc.). Since events of such kinds stir heavy discussions on social media, NLP techniques can be put to work to keep track of posts of similar kinds, and if any post related to an emergency is detected on the social media platforms, it can be redirected to the appropriate emergency services with the help of customized algorithms. This way, emergency responders would have entire situational awareness, would be able to act relatively more quickly than was possible before, and would be able to help the victims of the situation in a more informed and better way, possibly minimizing further damage as much as possible.

A similar solution can be attained with the help of sound-based NLP techniques, where the sound detected from the surroundings would be able to draw insights into the actual situation of the victim. For instance, if a fire breaks out in a building and residents get trapped inside, then with the help of sound detectors in appropriate locations inside the building, an automated application based on NLP could be used to detect whether a person is in immediate danger of getting burned, from the ambient noise of the fire and the screams of the victim, and emergency services could be sent in his/her direction with a definitive motive, without wasting time looking everywhere for the victim.

Natural disasters aside, individual human beings also face many dangers and perish too, whether as a direct result of an unfortunate accident or of a peril caused by one human being to another (such as a homicide). In most such situations, the victim dies because the situation is not conveyed to the emergency services on time. People working or walking late at night, especially women in remote areas, also face the danger of being robbed, assaulted, or murdered. Situations like these call for a system that can access the ambient noise around a person and detect whether he/she is in danger or not. A system similar to the one described here has been worked upon by the researchers in [1], but there, the system was primarily based on a hardware model and had to be worn at all times in order for it to work properly.

In this research, a software-based system is built, as depicted in Figure 1, which is able to detect whether a person is in a dangerous situation or not by analyzing his/her surrounding noise. If a dangerous situation is detected, then an automatic and immediate alert is issued to the registered contacts (or the emergency services).

The research starts with understanding related studies in the process of event classification and the adoption of various machine learning and deep learning models in the Technical Background (Section II), which concludes by mentioning research gaps in the current literature. The proposed live event detection methodology of Section III depicts the overall procedure for identifying and classifying real-time events based on surrounding noise/sound, and the understanding and explanation of the dataset is given in Section IV; the dataset consists of over 9000 different audio clips spread across 13 different classes of audio. The dataset is explored in Section IV by looking into its time-domain form, before transforming it to the frequency domain using the Fast Fourier Transform and sampling it at 44.1 kHz. The Decibel Spectrogram and the Mel-Spectrogram are used to visualize the data. The Mel-Spectrogram form of the data is used by the deep learning models.

The data is then cleansed in Section IV-A, and for each audio signal, an audio envelope is created for better analysis by the deep learning models. The next step is to train the three deep learning models (1D-CNN, 2D-CNN, and LSTM) in Sections IV-B, IV-C, and IV-D on the cleansed dataset, and to analyze the output.

Lastly, the live audio recording module is integrated with the prediction module in Section IV-E, which listens for any sound from the person's surroundings and sends the recorded


sound to the prediction module. The prediction module then classifies the sound as related to danger or not. If the sound is classified as related to a dangerous environment, then an automatic alert message is sent to the registered contact or emergency services via email (with an attachment containing the recorded audio), SMS, and WhatsApp. Lastly, the results and the conclusion are discussed in Section V and Section VI respectively.

FIGURE 1. Flow of live event detection.

II. TECHNICAL BACKGROUND
The topic of audio analysis and sound classification is at the heart of speech recognition technology in modern times. However, this technology can also be used to increase the safety of individuals or a population from dangers around them. The current literature related to the techniques adopted for the prediction, classification and detection of live input audio is summarized below.

For instance, the system in [1] is a Raspberry-Pi-based IoT device that includes a camera, sound sensor, GPS (Global Positioning System), and GSM (Global System for Mobile Communications) module. When a scream is recorded by the sound sensor, an SVM algorithm is able to recognize it. The camera is then turned on, records a 30-second video clip, and sends it to the closest police station or emergency services. The GPS module continuously tracks the victim's location, which is then transmitted to the emergency services together with the camera footage using the GSM module.

While the above method focuses more on the application of IoT (Internet of Things) along with sound detection, the researchers in [2] have based their experiments on the similarity of the input to a group of learned prototypes in a latent space and utilize a frequency-dependent similarity measure that is built by taking into account different time-frequency resolutions in the feature space. Voice, music, and background noise are three different sound categorization tasks that the proposed model is capable of handling. Here, a deep learning model containing a Prototype Layer, a Similarity Measure and Weighted Sum Layer, and a Fully-Connected Layer is utilized to extract insightful information from the input sound. Although this research is not furthered towards the application of the designed system in the field of individual security, it holds great potential for the same.

Utilizing an ARM (Advanced RISC Machine) controller and an Android application, the system proposed in [3] synchronizes the device and the smartphone using Bluetooth so that each may be turned on independently. Every two minutes, the device may send alert calls and messages to the pre-set contacts along with its current location, record audio for further analysis, and be followed in real time through a mobile application. An additional distinctive element of the system that one might employ to protect their privacy is a hidden camera detector. Its major advantage is that it can be used to protect women against crimes including stalking, domestic violence, physical assault, and intrusive hidden cameras.

Focusing more on the domain of security in the urban scenario, one can study the research done in [4], where the researchers have approached the domain of security in a slightly unorthodox manner. By identifying Unmanned Aerial Vehicles (UAVs) in loud outdoor and indoor contexts, which have recently been utilized to carry out or support terrorist operations, the proposed effort aims to increase urban safety. Deep neural network-based techniques that can identify a UAV's spectral signature are used to detect UAVs, in addition to sensors that measure the sound they make.

With the approach provided by [5], a method for categorizing sounds, a thorough classification of the various noises present in an urban area is achievable. The result can be used to generate insights into the different kinds of activities going on in an area, and can further be used to detect whether any individual or group of people is in a dangerous situation or not. The log-Mel spectrogram's FBank feature is first developed for auditory representation. A series of FBank feature vectors created from distinct acoustic signal frames are then used as input to a Convolutional Neural Network (CNN) for urban noise identification. Here, the traditional LPCC (Linear Prediction Cepstral Coefficients) and MFCC (Mel-Frequency Cepstral Coefficients) acoustic features, the FBank image feature, the hierarchical extreme learning machine (H-ELM), and the multilayer extreme learning machine are integrated


with the support vector machine (SVM) and the extreme learning machine (ML-ELM).

For emotion recognition from speech, [6] looked at how noise affects two popular SER (Speech Emotion Recognition) architectures, Acoustic Features and End-to-end, as well as the potential benefits of implementing speech enhancement in SER applications, particularly at low SNRs. This system's ability to recognize speech (a sort of noise) even at very low Signal-to-Noise Ratios (SNRs), that is, for poor input sound quality, is a significant benefit. In this research, a number of SER techniques based on SVMs and openSMILE features are employed. The approach is based on stacked residual blocks of 2D convolution layers, which have been shown in the past to efficiently learn rich representations of input signals.

A major usage of natural language processing is also in the field of sentiment analysis and emotion detection. In [7], a method for sentiment (feeling) analysis that is non-predictive a priori and can handle audio recordings of arbitrary length is proposed. The Mel spectrogram and Mel Frequency Cepstral Coefficients are used as audio description tools, and a Fully Convolutional Neural Network (FCN) architecture is recommended as a classifier, in order to classify audio files of any length and recognize emotions in close to real time. An FCN is primarily a CNN without fully connected layers that employs just convolutional layers and up- or down-scales input data to enable the system to accept variable input data.

A study similar to the aforementioned one is conducted by [8]. Built on significant elements gathered from several case studies, a Generative Model for NLP Applications is provided in this paper. The generative model serves as a unified framework for several NLP disciplines and may address specific difficulties in reading text, hearing speech, comprehending it, gauging mood, and determining the essential elements. The study proposes a model for a smart virtual assistant that might include the best traits from each case study looked at, for an improvement over the present NLP models, in addition to having the ability to understand more challenging languages such as Chinese. The system's capacity to recognize speech and emotion in multiple languages can be a huge benefit in the security sector, since it allows for the detection of threatening or abusive speech delivered to a person in different languages.

A study on emotion detection from text/speech is also conducted by [9], where neural networks are used to automatically determine the speaker's emotions by analyzing vocal cues. In order to analyze texts/speech with multilingual forms utilizing cross-language functions and the lexical level function, a hybrid neural network made up of CNN and Bi-LSTM subnets is utilized. This network also identifies emotions in cross-language vocals/writings. The system may be utilized as a software-only program to automatically identify threat calls and so improve a person's security.

Event detection by analyzing posts on social media is also one of the major applications of NLP, and can be used to enhance security measures and operations by providing early information to emergency workers, who in turn can reach out to the victims before the severity of the situation becomes graver. A similar attempt has been made by the research done in [10]. In order to select the language processing models striking the best balance between accuracy and processing speed for text-based natural language processing in the urban context, the researchers in [10] conducted a preemptive evaluation by contrasting several baseline language models previously used by researchers for event classification. To achieve the desired results, a number of algorithms are applied along with the pre-defined NLP models, including MNB (Multinomial Naive Bayes Classifier), CNB (Complement Naive Bayes Classifier), RF (Random Forest Classifier), Multiple Regression Analysis, General Regression Statistics, and ANOVA.

Similar to the aforementioned method of event detection, [11] presents a simple yet effective method for social event recognition that mostly utilizes natural language processing. The researchers look at the distinctive characteristics of social media's natural language in order to select the most suitable characteristics. Second, they mix fundamental machine learning techniques with NLP methods to do classification and extract features. The bag-of-words (BoW) model, one of the methods employed in this paper, may be used to describe a text using the frequency of terms found in a dictionary. BoW completely disregards word order and structure, yet remains a highly powerful approach to representing messages. The Support Vector Machine (SVM) algorithm is another one that is applied in this work. This system has the capacity to identify social events from short, hazy, and nonstandard English-written social media messages.

In [12] too, the researchers present an analytical framework for the analysis of tweets in order to identify and categorize specific information about a disaster, such as affected people, damaged infrastructure, and disrupted services, and to distinguish impact areas and time periods, as well as the relative prominence of each category of disaster-related information across space and time. Here, Latent Dirichlet Allocation (LDA) is employed in an unsupervised multi-label categorization of tweets utilizing LSTM (Long Short-Term Memory) networks.

The scope of NLP in the domain of security can be further extended to its usage inside modern AI-powered self-driving vehicles as well. In [13], an image and audio-based solution is provided as a service to increase the security and trust within an autonomous shuttle. It is backed by special Artificial Intelligence (AI) algorithms. The two modalities allow for the real-time identification of small criminal scenarios, such as screaming, bag stealing, altercations, and vandalism. They also provide notifications to authorized personnel for necessary action. For audio classification, a two-dimensional Convolutional Neural Network (CNN) is employed, and for


visual analysis, an LSTM classifier that can perform binary or multi-class SoftMax classification is used. The rider's safety is maintained as an advantage of this system.

Short sounds/noises, which need to be detected very quickly, are often linked with dangerous situations. Thus, a system is called for that has the capability to do so. Reference [14] proposes pre-trained audio neural networks (PANNs) that were trained on the substantial AudioSet dataset. These PANNs take on additional audio-related duties, modeling the computational complexity of PANNs and investigating their performance using a variety of convolutional neural networks. Convolutional Neural Networks (CNN) and other methods for data balancing and augmentation are the major techniques utilized for building PANNs. This study specifically uses the augmentation methods Mix-up and SpecAugment. PANNs can identify sounds with enormous accuracy, which significantly decreases the work required from humans to accomplish the same. The speed at which PANNs operate also makes them more suited for swiftly and precisely classifying common noises.

Returning to the classification of sound/noise, which can classically be used as the base for the detection of danger from the noise around an individual or population, [15] demonstrates that sound categorization performance can still be improved by swapping out the recurrent architecture for a parallel processing structure during feature extraction. The research processes the huge data and uses it to develop the model using deep learning algorithms, namely CNN (Convolutional Neural Networks) and LSTM (Long Short-Term Memory). A stack of L identical blocks with their own set of training parameters makes up the feature-extraction model used. This study compares SVM to LR (Logistic Regression) and KNN (K-Nearest Neighbor), two other classifiers, and discusses the advantages of SVM as a classifier. The studies' findings demonstrated that the suggested technique may greatly improve sound classification accuracy, further enhancing the cause of improving individual security.

Exploring some unconventional usage of NLP, [30] provides a fitting example. After perimeter defenses (such as a firewall and network-based intrusion detection system) have failed or been circumvented, a host-based intrusion detection system (HIDS) is a useful final line of defense against cyber security threats. Since Security Operation Centers (SOC) of enterprises rank HIDS as one of the top two security tools, HIDS is widely employed in business. For industrial companies, having a highly effective and efficient HIDS is ideal; however, when sophisticated attack patterns evolve, HIDS performance deteriorates due to various issues (e.g., a high false alarm rate that wears out SOC employees). An increasing number of HIDS are utilizing the advancements in Natural Language Processing (NLP) techniques, which have demonstrated effective and efficient performance in accurately detecting low-footprint, zero-day attacks and predicting an attacker's next steps. This is because NLP methods are better suited for identifying complex attack patterns. An integrated and thorough body of information about NLP-based HIDS is required given the current research trend of using NLP in HIDS. Notwithstanding the rapidly increasing usage of NLP in HIDS development, not much effort has been made to systematically examine and compile the peer-reviewed literature that is currently accessible in order to comprehend the role that NLP plays in HIDS development. Reference [30] conducted a Systematic Literature Review (SLR) of the works on the end-to-end pipeline of the application of NLP in HIDS development, since there was a dearth of a synthesis and a complete body of information on this crucial issue. Reference [30] identifies, taxonomically classifies, and systematically compares the state-of-the-art NLP techniques used in HIDS, the attacks identified by these NLP methods, and the datasets and evaluation metrics that are used to assess NLP-based HIDS for the end-to-end NLP-based HIDS development pipeline. To assist HIDS developers, [30] emphasizes the pertinent best practices, issues, benefits, and drawbacks, and also provides planned future research paths for the development of NLP-based HIDS.

In [31], the researchers provide another instance where deep learning is used for sound classification in the urban landscape. They claim that building habitable and sustainable cities is severely challenged by the world's rapid urbanization and population expansion. Urban noises are increasing and becoming more diverse as a result of this growth. Since noise is central to the idea of smart cities, [31] turned these noises into information rather than merely letting them be heard. Two fundamental techniques are utilized to categorize urban noises for this purpose. In the first of these, the sounds are subjected to signal processing techniques in order to extract hand-crafted qualities. The alternative approach uses deep learning models to classify sounds based on their visual representation. This study looked at how different variables utilized in both approaches, individual and hybrid, affect how urban sounds are classified. Furthermore, a CNN model for hybrid feature classification was developed. The outcomes demonstrated that both strategies were successful in classifying data. Mel-spectrogram, scalogram, and spectrogram pictures yielded the best categorization success rate among the visual representation techniques. Accuracy was positively impacted by using the SVM classifier, mel-spectrogram, and auditory features. Datasets from UrbanSound8k and ESC-10 were used for the experiments. When utilizing the AVCNN model with the scalogram and acoustic characteristics, ESC-10 achieved the greatest accuracy of 98.33%. By utilizing the SVM classifier to categorize the mel-spectrogram and acoustic characteristics derived from the AVCNN model, the maximum accuracy of 97.70% was achieved for UrbanSound8k.

The researchers in [32] focus their research on the Synthetic Polyphonic Ambient Sound Source (SPASS) dataset, a freely accessible source of synthetic polyphonic audio. SPASS was created to efficiently train deep neural networks for the purpose of detecting polyphonic sound events (PSED)


in urban sound environments. The five virtual areas that make up SPASS are park, square, street, market, and waterfront. Following a hierarchical class taxonomy, a variety of monophonic sound sources were curated, virtual environments were set up using the RAVEN software library, all stimuli were created, and the data was processed to produce synthetic recordings of polyphonic sound events along with their corresponding metadata. The collection has 25,000 stimuli of 10 seconds each, or 5,000 audio clips per environment, virtually recorded at a 44.1 kHz sampling rate.

In this research, the audio analysis techniques adopted are the Fourier Transform and the Mel-Spectrogram (similar to [31]), and the audio was sampled at 44.1 kHz (just like in [32]) for further processing. Post-cleaning, the sound data is subjected to three different deep learning models (1D-CNN, 2D-CNN, and LSTM) for the classification of sound from a person's surroundings (the likes of which have been used in various pieces of research cited above, for example [15]), and to detect a threat from it. If a threat is detected, then an automatic alert message is sent to the registered help or the emergency services. Moreover, in general, research works like [34], [35], [36], [37], and [38] have been referred to during the course of this work in order to generate more insights into how audio analysis and classification are done using different analysis techniques and deep learning models respectively.

The research gaps identified from the existing literature are depicted below:
• Starting with [1], it provides the necessary solution for the problem of detection of threat around an individual, but it comes with bulky hardware, which poses a difficulty in carrying it around for regular use.
• On the other hand, in [2], [15], and [32], the researchers use several techniques to analyze audio signals, but they do not further their work to provide a practical solution to the problem of danger detection around an individual.
• The research done in [3] is somewhat close to what has been achieved in this research, where the researchers have built a system to provide real-time feedback from a person's surroundings; however, this also comes with an additional hardware component in addition to a smartphone.
• In [4], the researchers use noise detection to detect Unmanned Aerial Vehicles (UAVs) which might be used for criminal activities. However, this approach is not favorable for application at an individual level, and would not be suitable for detection of threat around an individual human being.
• Reference [5] also used similar techniques to detect threats for an individual or group in an urban context, but is unable to provide a user-friendly solution at an individual level without the use of any hardware.
• The research done in [6] and [7] focuses on the detection of emotion from speech, which can be helpful in determining whether a person is in agony or not, or whether a person is being verbally threatened by another fellow human being or not. Systems like these, although beneficial, are unable to address the problem of the physical safety of an individual.
• A very similar system is proposed by [8], where the speech detection is done for multiple languages to detect verbal threats, but not physical ones.
• Some other systems, like those proposed in [9], [10], [11], and [12], detect emotions and events from texts/speech, social media posts and tweets respectively. Although these are unique approaches to determine an individual's or group's live situation, they again fail to address the challenge of physical individual safety.
• Moving further, [13] provides a unique sound-detection-based approach towards the safety of travelers inside a vehicle, but does not address the safety concerns of those individuals who are alone and not inside any vehicle.
• Reference [14] provides yet another approach towards the detection and classification of short sounds/noises, which can actually become very useful for approaches like the one that this research has proposed, but the researchers in [14] do not further their research towards any practical solution to the problem of physical individual safety.
• References [30] and [31] use noise detection for perimeter defense techniques (like intrusion detection) and for the conversion of urban sounds into information respectively, but do not address how an individual can be helped with respect to physical threats.

III. PROPOSED LIVE EVENT DETECTION METHODOLOGY FOR INPUT AUDIO CLASSIFICATION
The main objective of the proposed system is to detect and classify the victim's live audio signals for immediate rescue. The system is intended to be delivered as an application on any smartphone, and it uses the phone's default microphone configuration. On detection of suspicious audio patterns in the live input audio from the microphone, the geographical location of the victim is shared with the emergency contacts in the phone as well as with the police patrol (an illustrative sketch of such an alert dispatch is given below). The drawbacks inferred from current violence detection scenarios related to audio event detection and classification accuracy are addressed for effective functioning, which plays a vital role in avoiding false event classifications while ensuring the victim's safety through high classification accuracy. To carry out accurate prediction, training and testing of the Kaggle dataset is carried out on three machine learning models, namely LSTM, 1D CNN and 2D CNN, as illustrated in Figure 2.

The audio dataset used in this work consists of 13 classes (types) of audio signals, namely air conditioner, car horn, children playing, dog bark, drilling, engine idling, fire crackling, glass breaking, gunshot, jackhammer, scream, siren, and street music. Of these 13 classes, fire crackling, glass breaking, gunshot, and screaming are identified as audio types related to a potentially dangerous environment.
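The paper does not list the alerting code itself, so the following is only a minimal sketch of how the alert-dispatch step described above could look in Python, using the standard-library smtplib module to email the registered contacts with the recorded clip attached. Every address, credential, and file path here is a hypothetical placeholder, and SMS/WhatsApp delivery would additionally require a third-party messaging gateway, which is not shown.

```python
import smtplib
import ssl
from email.message import EmailMessage

def send_danger_alert(wav_path, predicted_class, location,
                      sender, password, recipients):
    """Email the registered contacts with the recorded clip attached.

    All values passed in (SMTP host, credentials, paths) are hypothetical
    placeholders, not values taken from the paper.
    """
    msg = EmailMessage()
    msg["Subject"] = f"DANGER ALERT: '{predicted_class}' detected"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        f"A potentially dangerous sound ('{predicted_class}') was detected.\n"
        f"Last known location: {location}\n"
        "The recorded audio clip is attached."
    )
    # Attach the recorded 1-second clip as evidence.
    with open(wav_path, "rb") as f:
        msg.add_attachment(f.read(), maintype="audio",
                           subtype="wav", filename="evidence.wav")
    with smtplib.SMTP_SSL("smtp.gmail.com", 465,
                          context=ssl.create_default_context()) as server:
        server.login(sender, password)
        server.send_message(msg)

# Hypothetical usage:
# send_danger_alert("clip.wav", "gunshot", "12.84 N, 80.15 E",
#                   "[email protected]", "app-password", ["[email protected]"])
```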


The dataset described here is a customized dataset and is built by taking into account and combining parts of data from three different audio datasets in order to meet the requirements of the problem statement under research. The three datasets are the UrbanSound8K dataset [16]; the dataset in [17], which contains audio data for 50 different environmental sounds; and the dataset in [18], which contains the "screaming" noise, as this sound type is a crucial one for the requirement of the project, which focuses on the detection of threat from the ambient noise.

Although the three original datasets from which the dataset for this project has been built have over 50 different classes of audio signals, only 13 are kept for the final analysis, as 13 classes take less computational time to train on the three deep learning models built during the course of this project than a dataset with over 50 classes, taking into account the fact that this research focuses more on the successful development of a working prototype than on production-level software.

FIGURE 2. Training the dataset using ML models for live event detection.

IV. RESEARCH METHODOLOGY FOR VISUALIZING AND CLASSIFYING AUDIO SIGNALS
After training the model with the Kaggle dataset, exploratory data analytics needs to be carried out on the audio signals for easy visualization and classification of the general audio categories. A change in a certain quantity over time is referred to as a signal. A few of the considered categories of audio signals are depicted in Figure 3. Air pressure is the variable quantity for audio. The air pressure may be measured over time with samples. In general, audio data is sampled at various rates, but most frequently at a rate of 44100 Hz, or 44100 samples per second, and becomes very difficult to visualize and work with in the time domain.

FIGURE 3. The audio signals in the time domain for one sample audio each from 6 of the 13 classes considered in the research.

Thus, audio data is preferred to be transformed to the frequency domain using a mathematical method called the Fast Fourier Transform (FFT), and the plot obtained from it is called the periodogram. This is so because numerous sound waves, each with a single frequency, make up an audio signal, and the resulting amplitudes are simply measured while periodically sampling the signal. The Fourier Transform makes it possible to separate a signal into its individual frequencies and amplitudes. To put it another way, it changes the signal from the time domain to the frequency domain; the result is referred to as a spectrum. This is achievable because any signal can be decomposed into a set of sine and cosine waves that sum to the original signal, which is the precise claim of Fourier's Theorem. The Fast Fourier Transform (FFT) is an algorithm for quickly computing the Fourier Transform and is frequently employed in signal processing. The general formula for the Fourier Transform is given below, and the same can be computed efficiently on machines via the FFT:

\[ F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt \tag{1} \]

Another concept which comes into play while sampling audio signals is the Nyquist frequency, which is exactly half of the sampling rate. In this research, the initial audio data is sampled down to 16000 Hz from 44100 Hz for easier analysis, and thus the Nyquist frequency achieved in this case is 8000 Hz. In other words, when the live audio recording module is used, the highest frequency captured by the microphone would be 8000 Hz, and all frequencies above this threshold would be discarded. The periodogram of this signal would reach a maximum of 8000 Hz on the frequency axis (X-axis). Figures 4 and 5 show the audio signal for a gunshot in the time domain and in the frequency domain after the application of the Fast Fourier Transform.
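As a concrete illustration of the down-sampling and FFT step just described, the following minimal sketch (assuming the librosa and NumPy libraries, with a hypothetical file path) loads one clip at 16 kHz mono and computes its one-sided periodogram, whose frequency axis ends at the 8000 Hz Nyquist limit.

```python
import numpy as np
import librosa

TARGET_SR = 16000  # down-sample from 44.1/48 kHz to 16 kHz, as described above

# The path is a hypothetical example, not the paper's dataset layout.
signal, sr = librosa.load("clean_data/gunshot/sample_001.wav",
                          sr=TARGET_SR, mono=True)

# One-sided FFT periodogram: frequencies run from 0 Hz up to the
# Nyquist frequency, i.e. TARGET_SR / 2 = 8000 Hz.
magnitude = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / TARGET_SR)

print(f"Highest frequency bin: {freqs[-1]:.1f} Hz")  # approximately 8000 Hz
```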


FIGURE 4. The audio signal for a gunshot in the time domain with a sampling period of 1 second.

It can be verified that the Nyquist frequency achieved from the above gunshot representations is 8000 Hz, which is exactly half of the sampling rate considered here, i.e., 16000 Hz. The Fast Fourier Transform is a useful tool for examining a signal's frequency content, but the approach falls short when the signal's frequency content shifts over time. This is how the bulk of audio signals behave, including speech and music; the scientific community calls these non-periodic signals. To show how the spectrum of these signals evolves over time, a technique is needed, called the Short-Time Fourier Transform (STFT), which computes many spectra by applying the FFT to a number of windowed signal segments. To put it another way, the Short-Time Fourier Transform takes a large number of FFTs computed on overlapping windowed portions of the signal (the audio signal in this research) and stacks all the periodograms to produce a new visual metric known as the decibel spectrogram. The general formula for the STFT is given below:

\[ X(\tau, \omega) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-i\omega t}\, dt \tag{2} \]

A spectrogram, sometimes known as a decibel spectrogram, is a visual depiction of the "loudness" or signal intensity (amplitude) over time at the various frequencies contained in a particular waveform. A spectrogram is effectively a collection of FFTs stacked on top of one another. Some more information is processed in the background while the spectrogram is being calculated; this may be regarded as a log scale of the amplitude. The color dimension is converted to decibels, and the y-axis is changed to a log scale. This is because humans can only sense a very limited and restricted range of frequencies and amplitudes. The quantity of energy at various frequencies, such as 2 Hz vs. 10 Hz, as well as how it varies over time, may be seen. In several scientific disciplines, spectrograms are widely used to display the frequencies of sound waves produced by humans, machinery, animals, whales, airplanes, etc. and recorded by microphones. In order to distinguish and categorize distinct earthquake types or other ground vibrations, the seismic community is increasingly using spectrograms to analyze the frequency content of continuous signals acquired by a single seismometer or a group of them.

The decibel spectrogram helps in visualizing how the signal changes over time (on the x-axis) along with its intensity and frequency (on the y-axis). It is worth noting that the number of samples considered while creating the decibel spectrogram using the STFT is exactly half the number of samples considered for the FFTs (in accordance with the Nyquist Theorem), which is depicted in Figure 6.

FIGURE 5. The audio signal for a gunshot in the frequency domain with a sampling period of 1 second on applying the FFT.

From Figure 6, it can be seen that as the time changes along the x-axis, the intensity (loudness) changes along the y-axis (for example, the four vertical lines in the plot for the class "scream" signify that at those parts of the audio signal the person screams at four distinct instances).

It is crucial to take the Mel FilterBank into account to further the discussion of audio data analysis. Studies have shown that humans do not perceive frequencies on a linear scale. Humans can discriminate between lower frequencies more easily than higher frequencies: they can readily differentiate between 600 and 1200 Hz but will find it challenging to discern between 15,000 and 15,600 Hz, even though the difference between the two pairs is the same. However, for deep learning models to operate accurately, it is essential for them to be able to distinguish even between signals with the smallest differences in frequency at the higher end of the audible spectrum too. For this purpose, the Mel FilterBank on the Mel scale can be used to rephrase and rescale the audio signal, aiding in the accurate training of deep learning models. Stevens, Volkmann, and Newman [19] developed a unit of pitch in 1937 so that the listener would perceive equivalent distances in pitch as equal lengths; it is known as the Mel scale. A mathematical mapping is needed in order to translate frequencies to the Mel scale.

The working of the Mel spectrogram for a sample audio is depicted in Figure 7. To understand better how the Mel scale is related to frequency: for lower frequencies, as the frequency changes, the difference reflected on the Mel scale is quite considerable. However, as the frequency increases, large differences in frequency tend to yield smaller changes on the Mel scale, i.e., the lower frequencies are given more importance than the higher frequencies.
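A minimal sketch of how the decibel and Mel spectrograms described above could be computed is shown below. It assumes the librosa library and the 128 Mel bands mentioned in the text; the window length, hop length, and file path are illustrative assumptions rather than the paper's exact settings, and the Mel mapping noted in the comment is one common formulation, not necessarily the authors' choice.

```python
import numpy as np
import librosa

SR = 16000          # sampling rate used in this work
N_MELS = 128        # number of Mel bands mentioned in the text
N_FFT = 512         # STFT window length (illustrative assumption)
HOP = 128           # hop length between windows (illustrative assumption)

signal, _ = librosa.load("clean_data/scream/sample_001.wav", sr=SR, mono=True)

# Decibel spectrogram: magnitude STFT converted to a dB (log-amplitude) scale.
stft = librosa.stft(signal, n_fft=N_FFT, hop_length=HOP)
db_spec = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Mel spectrogram: the same power spectrogram passed through a bank of
# N_MELS triangular filters spaced on the Mel scale
# (one common mapping is m = 2595 * log10(1 + f / 700)).
mel_spec = librosa.feature.melspectrogram(y=signal, sr=SR, n_fft=N_FFT,
                                          hop_length=HOP, n_mels=N_MELS)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)

print(db_spec.shape, mel_db.shape)  # (frequency bins, frames), (128, frames)
```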


FIGURE 6. The decibel spectrogram for one sample audio each from 12 of the 13 classes considered in the research (the same as considered in the time domain plots above).

Thus, the Mel FilterBank is a sequence of several triangular filters, whose number can be varied as per requirement (128 bands are used in this research; a change in the number of bands affects the memory requirement of the analysis), which are passed over the audio signal in the frequency domain (periodogram). Each of the triangular filters in the sequence corresponds to a specific frequency band, and the FilterBank as a whole decomposes the entire audio signal into separate frequency bands on the Mel frequency scale.

To summarize, when the frequencies are translated to the Mel scale, the spectrogram is called a Mel spectrogram. When audio signals are windowed in time, the Mel-Spectrogram adds a bank of frequency-domain filters to the signals.

FIGURE 7. The Mel spectrogram for one sample audio each from 12 of the 13 classes considered in the research.

Furthermore, a Mel-Spectrogram has two main differences as compared to the regular decibel spectrogram: (i) the Mel scale is used instead of frequency on the Y-axis, and (ii) the decibel scale is used instead of amplitude to define the colors in the plot. Research has shown that Mel-Spectrograms perform better than decibel spectrograms as inputs to deep learning models when considering audio data, and the same has been implemented in this research as well.

A. CLEANING THE SOUND DATA
For the ease of specifying the class of each of the audio data files, the audio files from the dataset are kept in respective folders, where the name of the folder specifies the class of the files inside. For this research, 13 folders are created (as 13 audio classes are being considered for training), and these class-wise folders are kept inside a parent folder, from which the audio files are later fetched for cleaning, pre-processing and training purposes.

Also, since audio files in the ".wav" format are very quick to load into memory, all the audio files are converted to this format beforehand. However, on the downside, since audio files in the ".wav" format are uncompressed, they tend to take up a lot of space in memory.
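The folder-per-class layout and the ".wav" conversion just described could be scripted roughly as follows. This is a sketch under the assumption that librosa and soundfile are used; the directory names are hypothetical, and the mono/16 kHz conversion described in the next paragraphs is folded into the same pass for brevity.

```python
from pathlib import Path
import librosa
import soundfile as sf

RAW_DIR = Path("raw_data")   # hypothetical parent folder with 13 class sub-folders
WAV_DIR = Path("wav_data")   # hypothetical output folder mirroring the class layout
TARGET_SR = 16000

for class_dir in RAW_DIR.iterdir():
    if not class_dir.is_dir():
        continue
    out_dir = WAV_DIR / class_dir.name          # e.g. wav_data/gunshot/
    out_dir.mkdir(parents=True, exist_ok=True)
    for audio_path in class_dir.glob("*"):
        # Load as mono, down-sample, and write 16-bit PCM ".wav".
        signal, _ = librosa.load(audio_path, sr=TARGET_SR, mono=True)
        sf.write(out_dir / (audio_path.stem + ".wav"),
                 signal, TARGET_SR, subtype="PCM_16")
```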


Moreover, as the sound files contain 16-bit audio, the data type considered is the NumPy 16-bit integer.

One of the problems faced with audio data is that most of the audio is concentrated in one general area of the audio stream, and as the audio progresses, the magnitude becomes so low that a significant part of the signal looks the same, since there is a lot of silent area in the audio. The silent zones or dead spaces in the audio can be removed by creating a customized signal envelope, which is essentially what it sounds like: the envelope tracks the signal to study how it changes and considers only the magnitudes above a particular threshold (20 is considered as the threshold for this research). This is implemented by first converting the signal into a sequence, obtaining the absolute (positive) value of the signal at any given point of time, and then applying a rolling window over the signal with a specific window length (set as 20 in this case) to obtain the maximum magnitude at any time instance; the signal envelope created with the specified threshold is depicted in Figure 8.

FIGURE 8. The signal envelope for one of the audio signals corresponding to "glass breaking". Here, since the threshold magnitude is considered as 20, any part of the signal below that threshold is not considered, and this can be verified with the help of the non-silent signal envelope created around the signal (visible in purple).

The dataset used in this project, as already described, is a collection of audio files across 13 classes. Since all these audio files are recordings of real-world environmental sounds, the format of the audio across the audio files is inconsistent. For most of the instances of audio data, the data available has more than one channel of audio signal, with a sampling rate between 44100 Hz and 48000 Hz.

Since this variance across the data is a challenge to the uniform analysis of the audio data, which in turn might lead to erroneous classification results after training, the data is passed through a cleaning process. In this step, the audio data in each of the audio files is converted to a mono-channel signal, and the sampling rate is down-sampled to 16000 Hz for the ease of analysis during the training phase, including a reduced computation time.

The focus of down-sampling the data is to feed the deep learning models with clean data to train on, so as to have as accurate a learning as possible. However, when the prediction module is run, it is fed with the raw data, on which it then performs a similar data-cleaning step, as would be the case for the practical application of the model on real-world audio data.

Once the down-sampling and the removal of dead space from the audio signals are done (with the help of the custom signal envelope), each audio signal is split into separate instances with a fixed specified time interval. This time interval has been referred to as the "delta time" (set as 1 second in this work) throughout this paper, and the models train over each of these intervals for learning purposes.

Once the data is cleaned, it is saved in its new form inside a separate directory with the same folder hierarchy (with 13 different directories, each denoting a separate class) as the parent directory used for the original audio signals. These new cleaned audio files from this new "clean data" directory are used later for training the deep learning models.
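A minimal sketch of the envelope-based dead-space removal and the one-second "delta time" splitting described in this subsection is given below. It assumes pandas, NumPy, librosa, and soundfile, and the rolling-window implementation is one plausible reading of the description (window length 20, threshold 20 on the 16-bit integer scale), not the authors' exact code; the file paths are hypothetical.

```python
import numpy as np
import pandas as pd
import librosa
import soundfile as sf

SR = 16000          # target sampling rate
DELTA_T = 1.0       # "delta time" in seconds
THRESHOLD = 20      # magnitude threshold from the text
WINDOW = 20         # rolling-window length from the text

def envelope(signal, window=WINDOW, threshold=THRESHOLD):
    """Boolean mask keeping only regions whose rolling maximum magnitude
    exceeds the threshold (a plausible reading of the described envelope)."""
    series = pd.Series(signal).apply(np.abs)
    rolling_max = series.rolling(window=window, min_periods=1, center=True).max()
    return (rolling_max > threshold).to_numpy()

def clean_and_split(in_path):
    signal, _ = librosa.load(in_path, sr=SR, mono=True)
    signal = (signal * 32767).astype(np.int16)   # 16-bit integer audio
    signal = signal[envelope(signal)]            # drop the dead space
    step = int(SR * DELTA_T)
    # Split the non-silent signal into fixed 1-second instances.
    return [signal[i:i + step] for i in range(0, len(signal) - step + 1, step)]

# Hypothetical usage: write the 1-second chunks into the "clean data" directory.
for n, chunk in enumerate(clean_and_split("wav_data/scream/sample_001.wav")):
    sf.write(f"clean_data/scream/sample_001_{n}.wav", chunk, SR, subtype="PCM_16")
```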


B. APPLIED DEEP LEARNING MODELS
In this project, three different deep learning techniques have been explored and implemented, namely LSTM (Long Short-Term Memory), 1D-CNN (1-Dimensional Convolutional Neural Network), and 2D-CNN (2-Dimensional Convolutional Neural Network). The primary reason behind implementing three different deep learning models is that each of them has its own set of advantages and disadvantages, which cater to different requirements with respect to the training, analysis and classification of the data, based on the different properties and attributes associated with the dataset being used. The prediction module has an option to choose between these three deep learning models for running the prediction on the real-world audio signal, based on the requirement.

Once the data is cleaned, training is the next phase, for which the data has to be loaded into the models with appropriate parameters. For this, a custom data generator is built, both for training and for validation, which prepares (generates) the data to be loaded into the deep learning models for training, after it has been split into training and testing parts (a 90-10 split has been used in this work). The input to this data generator is the paths to the clean audio files (created earlier), split into intervals of 1 second, and the corresponding classes of the signals.

Beginning with the best practice of loading data into the system before analyzing it with a deep learning technique, it is computationally more efficient to load the data in batches, until all the epochs are completed, instead of loading the entire data into memory at once. The way this is recommended to be done in TensorFlow, from the Keras perspective, is to inherit from a class called "Sequence". This class provides the functionality to load the data in batches and to use multiple GPUs (Graphics Processing Units) to process the data through multiprocessing.

Three essential functions are implemented here for generating the final processed data on which the analysis is run later, namely "__len__", "__getitem__", and "on_epoch_end" [20]. The "__len__" function is used to specify the number of batches per epoch (the total number of samples divided by the batch size, which can be 16, 32, 64, and so on). The "__getitem__" method outputs an "X" matrix (a time-series format of the audio data) whose 1st dimension is the batch size, 2nd dimension is the number of channels, and 3rd dimension is the total number of data points considered over the specified time interval, and a "Y" matrix which in turn is the output in the form of a SoftMax layer (the probabilities of occurrence of the different classes); the 1st dimension of this matrix is the batch size and the 2nd dimension is the number of classes considered in the research (13 in this work). The "Y" matrix is later used to build a one-hot-encoded matrix using the "to_categorical" method of TensorFlow. The output from the "__getitem__" method is used as input to the Mel-Spectrogram layer in the deep learning models, where the data is fed as audio signals with one channel and 16000 data points for one second of time-series data (the sampling rate). Lastly, the "on_epoch_end" method is typically used for data augmentation in deep learning projects; however, in this work, it has mainly been used for shuffling the data between epochs, so that there is a different distribution of data across batches and the models achieve a more holistic learning in the end.

As previously mentioned, since a deep learning model tends to perform better with Mel-Spectrogram audio inputs, the Mel-Spectrogram is applied to the input audio data before the training of the models begins. For all three models, after feeding the input audio signals, with one channel and a sampling rate of 16000 Hz, a Mel-Spectrogram layer is included, and the output from this layer is normalized with a 2D normalization layer, which rescales the data to zero mean, for further processing by the models. This is a custom layer added to the Keras model(s) (the deep learning model(s)) with the help of another Python library called Kapre [20], an audio pre-processing library in Python that allows the implementation of various custom signal processing techniques like the STFT (Short-Time Fourier Transform), inverse STFT, Mel-Spectrogram, etc. If, instead of using Kapre, the Mel-Spectrogram were computed separately, it would have to be performed offline and stored separately in memory before the analysis could be run; if some parameters were then to be changed, the entire process would have to be re-iterated, taking up a lot of computational time.

The details of the three models used in this work are discussed in the upcoming subsections.

1) THE 1D-CNN (1-DIMENSIONAL CONVOLUTIONAL NEURAL NETWORK) MODEL
A 1D-CNN performs fairly well when shorter (fixed-length) segments of the full dataset are anticipated to provide interesting features and the feature's location within the segment is not particularly important. This is pertinent to the analysis of time sequences of sensor data, such as that from an accelerometer or gyroscope. It also applies to the analysis of audio signals, as well as any other signal data that has been gathered over a predefined period of time. Another use is in natural language processing (NLP), albeit LSTM networks have more promise in this field because word proximity is not always a trustworthy indicator of a trainable pattern.

The 1D-CNN uses time distributed layers to wrap the 1D convolutions over time. The input to the model is the time-series format of the data, where the channels are the first dimension. For consistency, a permute layer has been implemented on the dimensions of the input data, which is important because a time distributed layer is being used here: any time-based layer expects the dimensions of the data to be batched by time, features and channels, so the permute layer simply swaps the features and time, and time becomes the 1st dimension.

Next, a time distributed 1D convolution is implemented with a small kernel size and a hyperbolic tangent as the first activation function. The activation used in the following time distributed layers is the ReLU (Rectified Linear Unit), which allows a deep learning model to be non-linear, addresses the vanishing gradients problem, helps prevent the exponential increase in computation required to run the neural network, and only takes into account half of the input data.

Before going any further, it is important to note that the goal here is to develop a classification functionality, which can be done by using a few deep learning layers to build out features from the data that is available (sound data in this case) and reducing them down to the point where a classifier of some sort can be built in the last few layers of the Convolutional Neural Network, which are called the head of the neural network.

The same is the case for the 1D time distributed CNN model used here: the number of dimensions is gradually reduced as the model progresses through the layers, and the number of tuned parameters increases as the number of features is increased in each layer. In other words, the training initially starts with a limited number of features (a general start), and as the training progresses through the layers, more features are added to the network to specify with more granularity what exactly the network should learn; this is what the last layer, with 128 features (the maximum number), does. Wrapping this procedure with a time distributed layer with respect to sound data means that the network is instructed to go along the time dimension, and as it progresses, all the different frequencies from the input audio data are fed to the network, and the neural network can identify useful features from specific frequencies and use all these features together to learn from them, which is practically not possible for human beings to do.

VOLUME 12, 2024 6465


A. Sen et al.: Live Event Detection for People’s Safety Using NLP and Deep Learning

frequency spectrogram as a whole, and building features from


it for learning purposes.
With steadily increasing numbers of activation units (8, 16,
32. . . ), a sequence of Conv2D (2D Convolution) layers are
utilized here, similar to 1D-CNN, with the goal of incorpo-
rating more particular characteristics as the model advances
through the levels. All of these layers use ReLU as the activa-
tion function, with the exception of the first layer, which uses
a hyperbolic tangent with the same S-shape as the sigmoid
activation function. This function accepts any real value as
FIGURE 9. On running for 35 epochs, the maximum accuracy reached for
the 1D-CNN model is found to be 95.2%, with a maximum validation
an input and returns values between −1 and 1, or values cen-
accuracy of 90.2%. The metric used here for considering the best results tered around 0. Recurrent neural networks perform best when
is the validation loss. The Training (train) vs Validation (test) can be seen combined with the hyperbolic tangent activation function
in this figure.
for tasks requiring speech recognition and natural language
processing. All the convolution layers have a MaxPooling
layer in between them, which creates a downscaled (pooled)
for training. The last 4 epochs of 1D-CNN is depicted in
feature map, by determining the maximum value for patches
Figure 9.
of a feature map.
Furthermore, a 2D GlobalMaxPooling layer is used, which
At the end of the last convolution layer, a flattening layer is
takes only the latest feature size (the last dimension) for
used, which considers the product of the number of remaining
training and learning purposes, instead of flattening it out
useful and tuned features from the previous layer for calcu-
by considering a product of all the features (dimensions)
lating the final number of actual features to be considered for
available for the data – it is not concerned with the part of
learning purposes – this approach is somewhat different than
the neural network these features are obtained from, but just
the GlobalMaxPooling layer used in the 1D-CNN approach,
uses these features to implement the classification.
which considers only the last specified number of features,
Towards the end of the network, a dropout and a regular-
and discards the rest.
ization layer are used to prevent overfitting, and the flattened
It is interesting to note that if the delta time (time intervals)
data from the GlobalMaxPooling layer is then be passed
at which the audio files were segregated at the time of data
through a dense layer (with 64 activation units considered
cleaning and pre-processing (considered as 1 second in this
in the first dense layer, and the sound classes used in this
research) is changed, then it would cause the number of
research as the features in the 2nd dense layer) to build
Mel-bands (1st dimension of the input data) to reduce or
a classifier, and the output is presented using a SoftMax
the time dimension (2nd dimension of the input data) to
activation function (which converts the initial outputs of
increase in value – this is important to consider because
the neural network into a probability vector), which helps
this would determine the final dimension achieved at the
in the prediction of the class by calculating the probabil-
end of the flattening layer, which is essential to be of a
ity of each possible outcome. This layer also creates a
small size, otherwise would create a huge number of param-
hot encoded matrix for the implementation of categorical
eters to train on, and take up a lot of dynamic memory
classification.
for processing, which in turn would require more compu-
tational power and time, and would eventually make the
2) THE 2D-CNN (2-DIMENSIONAL CONVOLUTIONAL network slow. The last 4 epochs of 2D-CNN are depicted in
NEURAL NETWORK) MODEL Figure 10.
The initial implementation of the standard Convolution Neu-
ral Network was made possible by the Lenet-5 design [22].
Conv2D is frequently applied to picture data. It is referred
to as a two-dimensional CNN since the kernel moves along
two dimensions on the data. The key advantage of using a
2D-CNN is that it can recover spatial information from the
input using its kernel, unlike other networks.
The 2D-CNN is quite similar to the 1D-CNN, however the
approach can be associated with a Computer Vision based
approach, as a similar architecture is used. So, in this case,
no time distributed implementation is done, but the entire
frequency spectrogram is looked at as a whole, and the neural FIGURE 10. On running for 35 epochs, the maximum accuracy reached for
network learns from features that are next to each other, the 2D-CNN model is found to be 96.3%, with a maximum validation
accuracy of 92.7%. The metric used here for considering the best results
their interaction with one another, how they generate meaning is the validation loss. The Training (train) vs Validation (test) can be seen
through a pattern in their occurrence etc. – looking at the in this figure.

6466 VOLUME 12, 2024


A. Sen et al.: Live Event Detection for People’s Safety Using NLP and Deep Learning

The last three layers are essentially the same as that used also in a reverse manner – this helps in obtaining better gradi-
in the 1D-CNN model – with a dropout and a regularization ent descent updates. This layer has lesser number of features
layer to minimize overfitting, with two dense layers (with (32 in this case) as compared to that in the previous layer. This
64 activation units considered in the first dense layer, and the is so because LSTM networks do not need a large number of
sound classes used in this research as the features in the 2nd features (nodes) to learn relevant in-formation as opposed to
dense layer) for building the classifier (using a hot encoded the 1D/2D-CNN networks previously discussed. The output
matrix) and obtaining the output (prediction probabilities) of this layer is returned as a sequence, and the feature size
using a SoftMax activation layer. returned in this case is double of what was provided in the
input (64 in this scenario).
3) THE LSTM (LONG SHORT-TERM MEMORY) MODEL The next procedure done in the LSTM implementa-
Unlike conventional feedforward neural networks, LSTM tion, which is quite common across networks like LSTMs,
has feedback connections. Such a Recurrent Neural Network is called Skip-Connection, where the output features from
(RNN) is capable of analyzing both single da-ta points, such the time-distributed dense layer are concatenated with the
as photos, as well as whole data sequences, like audio or output of the Bidirectional LSTM layer – so in this case,
video. This characteristic makes LSTM networks ideal for 64 features from the time-distributed dense layer are concate-
managing and anticipating data. For example, voice recog- nated with 64 features from the output of the Bidirectional
nition, machine translation, speech activity detection, robot LSTM layer to produce a total feature set of 128. This
control, video gaming, and healthcare are some applications step lets the neural network take decisions based on both
of LSTM. Applications like connected, unsegmented hand- sets of features learnt before and after entering the LSTM.
writing identification and others are also possible to utilize The last 4 epochs for the LSTM model are depicted in
it for. Numerous RNNs may pick up long-term dependen- Figure 11.
cies, which is very useful for challenges involving sequence As the network progresses, feature engineering is imple-
prediction. In addition to processing single data points like mented with the help of two more dense layers with 64 and
pictures, LSTM also features feedback links that enable it to 32 activation units respectively, with a MaxPooling, and
process the full data stream. a Flattening layer in between. It is worth noting that the
A memory cell in an LSTM model is referred to as a ‘‘cell MaxPooling Layer used in this case is a 1D-MaxPooling
state’’ and performs a crucial role in the model by maintaining (instead of 2D-MaxPooling) as the channel in-formation was
its state over time. In LSTMs, gates regulate the insertion and deliberately lost beforehand when the input was reshaped.
deletion of data from the cell state. Information may be able After the flattening layer, a similar approach to building a
to enter and leave the cell through these gates. The method is classifier is implemented as was done in case of 1D-CNN and
aided by a layer of sigmoid neural networks and a pointwise 2D-CNN, with a dropout and a regularization layer to mini-
multiplication function. An LSTM’s sigmoid layer outputs mize overfitting, with two dense layers (with 32 activation
integers in the range of 0 and 1, where 0 means that nothing units considered in the first dense layer, and the sound classes
should pass through and 1 means that everything should. used in this research as the features in the 2nd dense layer)
The LSTM neural network is specifically designed to study for building the classifier (using a hot encoded matrix) and
all the features and how they change over time. As done in the obtaining the output (prediction) using a SoftMax activation
1D-CNN model, the input is again batched by time, features layer.
and channels, and a permute layer is used to switch between
the time and the feature dimensions. However, since a channel
cannot be fed as an input to the LSTM network (mono channel
sound being used in this research), it is combined with the
feature dimension using a reshape layer.
Before entering the LSTM layers, a time distributed dense
layer is used for some initial feature learning (which is not
done in standard LSTM networks), using an appropriate num-
ber of activation units (64 in this case), and the activation
function as a hyperbolic tangent. For this, it uses the feature
dimension (128 features considered in this case), and reduces
FIGURE 11. On running for 35 epochs, the maximum accuracy reached for
it by half so that the some more relevant features could be the LSTM model is found to be 96.6%, with a maximum validation
learnt about the data even before progressing into the LSTM accuracy of 90.3%. The metric used here for considering the best results
is the validation loss. The Training (train) vs Validation (test) can be seen
layers. in this figure.
The next layer used is a Bidirectional LSTM layer, which
computes the gradient descent of learning for the data by
going through the time dimension in both the for-ward and C. TRAINING THE MODELS ON THE DATASET
the backward direction, which basically means that the model The previously described dataset that is being used in this
not only studies the sound data only in a forward direction, but project is subjected to a training phase with respect to all

VOLUME 12, 2024 6467


A. Sen et al.: Live Event Detection for People’s Safety Using NLP and Deep Learning

the three deep learning models discussed above. The training similar to the down-sampling step performed during the
on each of the three models takes approximately 3 hours data cleaning procedure. Moreover, just like the non-silent
(approximately 9 hours for all the three models) for the customized signal envelope was created during data pre-
dataset being considered in this project. processing, the same step is performed on the live-recorded
While training the three models, a call-back is imple- audio data, so that while performing prediction on the sig-
mented to log the results (validation accuracy/loss and nal, the maximum of the non-silent part of the signal is
training accuracy/loss) from each epoch (35 epochs con- considered.
sidered for each model) into a specific CSV file for each For running the prediction on the live recorded audio,
of the models (used later for visualization of the validation after all the above pre-processing, the audio data is batched
and training accuracies). Along with this, a check-pointer up so that predictions can be made using the argmax
is implemented which would monitor the specified param- function of NumPy on the output of the previously men-
eter (set as validation loss during this research – can be tioned hot-encoded probabilities. The prediction is then
set as validation accuracy as well) from the values logged, achieved by taking 1 second intervals (specified as delta
and the model would be saved accordingly – since the time previously) within the audio on which the prediction
monitoring parameter was set as validation loss, the low- is run i.e., it considers every single second of the audio,
est validation loss would be saved on the models being and sums all the probabilities and takes the average of
trained (1D-CNN, 2D-CNN, LSTM), and the models would them.
work in the best possible way while performing the actual Furthermore, it is configured in the prediction module
predictions. itself that if the class of the recorded audio is identified as
On training, the 1D-CNN model turned out to have the one related to a potentially hostile environment (like ‘‘fire
least overfitting. This can be attributed to the fact that the crackling’’, ‘‘glass breaking’’, ‘‘gunshot’’, ‘‘scream’’ etc.),
features considered in the 1D-CNN are a lot weaker as then an automatic alert message is immediately sent to the
compared to the 2D¬CNN or the LSTM model. However, registered contact via e-mail, SMS, and WhatsApp. In the
looking from another perspective, the LSTM network can case of the e-mail message, it also contains the audio file
also be attributed to being the best network of all the three saved from the live re-cording, which the audio recording
networks built here, even though the deviation between the module records, saves and prompts the prediction module to
training and testing accuracies is higher as compared to that identify the class of.
in the 1D-CNN. This gap discrepancy would get reduced
further in the LSTM as the size of the data increases, E. THE LIVE AUDIO RECORDING AND PREDICTION
whereas the 1D¬CNN would perform poorly in that case. MODULE
The following plots show how the three models perform on The audio recording module records live audio from the envi-
training, with respect to their training and testing (validation) ronment, and stops the recording when no more noise comes.
accuracies. This module also performs a set of other pre-processing steps
From the above Figures 9, 10, and 11, although, the on the recorded audio, apart from the prediction module,
1D-CNN model seems to over fit the least for the given which is invoked later.
dataset and the number of classes considered as compared As part of pre-processing, the recorded audio is trimmed
to the other two models (the gap between training and off of any silence at both ends of the recording, then
testing is the least), the LSTM model, which achieved the it is normalized i.e., the volume of the audio signal is
highest training accuracy among the three (96.6%), would aver-aged over the entire length of the recording, and then
perform the best when the size of the dataset and the num- an audio padding of 0.5 seconds is added to both ends of
ber of classes considered would increase. Once the three the recording so that different media players can play the
models are trained on the given dataset, the models are audio without losing any crucial audio data which might
saved with their training information in the ‘‘.h5d’’ format. be present at the very beginning or the ending of the audio
The user would be able to choose the model to be used signal.
for audio class prediction for the real-world audio signal, After the pre-processing is done, the audio data is saved
by specifying the path to the necessary ‘‘.h5d’’ file (for 1D- as a ‘‘.wav’’ file at a specified location, and the prediction
CNN, 2D-CNN, and LSTM) as saved during the training module is automatically prompted to run the prediction to
phase. identify the class of the recorded audio, using the deep learn-
ing model as specified by the user (among the three models
D. THE PREDICTION MODEL built during this research). After the prediction module is
Since the prediction module takes in real-world audio data called by the audio recording module, it performs all the steps
for identifying the class of the sound, it needs to pre-process as mentioned automatically.
the data before running the analysis on it. For this pur-
pose, the sound data received, after the recording module V. RESULTS
records and saves the audio data, is down-Sampled to a The problem statement of this research was to identify
mono channel with a sampling rate of 16000 Hz. This is the type of real-world audio data using deep learning

6468 VOLUME 12, 2024


A. Sen et al.: Live Event Detection for People’s Safety Using NLP and Deep Learning

based multi-class classification technique(s), and determine


whether the identified sound type corresponds to a poten-
tially dangerous/threatful environment. Furthermore, the
focus had been to use the resultant prediction to develop
a threat alert system, which would immediately and auto-
matically issue an alert message to the registered help
via e-mail (with an attachment containing the live audio
recording of the victim’s surrounding), SMS, and What-
sApp. This has been successfully achieved, with the help
of the recording module for recording a clear audio sig-
nal, which in turn is pre-processed as per the requirements
for running the prediction model, and on a detection of a
potentially dangerous/threatful environment, the registered
help is issued an immediate and automatic threat alert as
FIGURE 12. The confusion matrix attained in live event prediction.
well.
With respect to the deep learning models developed and
used for classifying the audio data in this research, all the up by a significant margin, the difference would be visible
three models work approximately equally well. However, the in the amount of time taken by each model specifically, and
resultant accuracy achieved by each model might signifi- the effect of the dataset’s size and complexity on its accuracy.
cantly vary as the size of the dataset increases as well as the The differences in accuracies achieved by the three models
number of audio classes considered. This is one of the prime on changing the size of the dataset would become a domain
reasons as to why three different deep learning models have of further research by itself. Furthermore, an increase in
been used in this research. the number of classes considered for training would further
While the accuracy achieved by the LSTM model is as improve the products ability to identify sound correspond-
high as 96.6\%, the 1D-CNN and the 2D-CNN are also not ing to a threatful environment form a longer list of such
very far behind, with their accuracies at 95.2\% and 96.3\% classes.
respectively. With respect to the aspect of analysis, more feature extrac-
Another metric used for the evaluation of the three models tion can be done from the sound data, like amplitude analysis,
is the Confusion Matrix. The Confusion Matrix is determined phase analysis, harmonic distortion analysis etc., which in
with the help of over 9000 records, across the 13 classes. turn can provide more insights into the audio data, and
The Confusion Matrix achieved in this research showcases facilitate the development of more complex and better mod-
how accurately the models are able to predict the classes of els, suited to the requirements which this research tries to
the audio signal files against their actual classes. It verifies address.
the accuracy rate achieved by each of the three models, On one side this research limits its study to the develop-
and the number of correct and incorrect classifications are ment of a 1D-CNN, 2D-CNN, and an LSTM model for the
also visually identifiable. For instance, it can be seen from construction of a system which would detect threats from
the Confusion Matrix for the LSTM model depicted in a person’s surroundings, and alert the registered help and
Figure 12 is the model which is able to predict the classes emergency services, more complex RNN (Recurrent Neural
for most of the audio files, while providing a maximum Networks) models like GRU (Gated Recurrent Unit), Bi-
misprediction (31) for the class ‘‘gunshot’’ as a ‘‘dog bark’’ GRU, Bi-LSTM, CTRNN (Continuous Time RNN), HRNN
(which are actually quite similar in type, with an audio signal (Hierarchical RNN) etc. can also be researched upon to
having very short duration, and a high frequency in both meet the needs tried to be satisfied by this research. These
cases). models would provide even better results as compared to
Although the working model built during the course of this the current output, as these models use even more complex
research works accurately and as planned during the initial architectures.
phases of the research, there are still a lot of scope for future Considering the functionalities provided by the product
research on the same given problem. developed during the course of this research, the current
With respect to the dataset being used in the project cur- model provides an immediate and automatic threat alert
rently, it consists of audio data corresponding to 13 classes message to the registered help if the sound is predicted to
only. However, this number can be increased or de-creased as be that of a potentially harmful environment. This message
per the requirement which would be needed to be satisfied by is sent via e-mail, SMS, and WhatsApp, and in-forms the
the product of this research as a whole. The user’s choice of registered contact of the type of sound identified from the vic-
three different deep learning models for training and predict- tim’s environment, with the e-mail consisting of the recorded
ing the class of an audio signal would come to actual use once audio file as an attachment as well. This aspect has a lot
the size of the dataset increases significantly. Even though the of scope for future work and further research, and a lot
overall computational time required by the models would go more integrations can be tried to be done to improve and

VOLUME 12, 2024 6469


A. Sen et al.: Live Event Detection for People’s Safety Using NLP and Deep Learning

diversify the end-deliverables of the project. For instance, The recorded audio along with timestamp and geographical
the location of the victim can also be shared along with the location could also serve as evidence against the person
alert message, which can be done with further research on the committing crime.
intricacies of the technical implementation. Likewise, further
functionalities can be added as well in order to improve VI. CONCLUSION
the communication about the victim’s distress, the assistance A new software-based threat identification system has been
provided to him/her during the time of crisis, and the time developed to the dangerous situation of an individual from
taken from the first distress signal for the help to reach the his/her ambient noise, and provide immediate assistance
victim. to the victim by automatically informing their emergency
To summarize, on one hand, the research leverages the contacts about the situation. The functionality of record-
power of the Fourier Transform and Mel-Spectrogram anal- ing the live audio on the victim’s side, identifying the
ysis to facilitate the development of three different models type of the sound and predicting threatful situation is car-
(1D- CNN, 2D-CNN, and LSTM) to detect threats from ried out using the 1D-CNN, 2D-CNN, and LSTM models
a person’s surroundings and alert register/emergency ser- to achieve an accuracy of 95.2\%, 96.3\%, and 96.6\%
vices through email, SMS, and WhatsApp, it also has a respectively, with an average ac-curacy of 96.03\%. In addi-
few limitations. Other complex models, the likes of which tion to providing the user with three options to choose
(GRU, Bi-GRU, CTRNN, HRNN etc.) have already been from for the prediction of the class of the live audio, the
mentioned before, can also be used to yield even bet- choice of more than one model becomes useful as and
ter results and leaves a scope for further research on the when the size and complexity of the dataset and the live
topic. recorded audio increases. In conclusion, the requirement of
Although email, SMS, and WhatsApp are great ways to the research has been successfully achieved, fulfilling the
communicate about a person’s whereabouts, the research can target of providing an essential solution to one of the great-
be furthered to include more information about the victim, est practical problems humankind faces in today’s world
like their location, type of distress, intensity of the emer- i.e., threat detection and alert system, all while explor-
gency, physical parameters (like heart rate, oxygen level etc.), ing and delivering on the domain of deep learning and
environmental parameters (like altitude, temperature etc.), audio analysis for the detection of live events from ambient
which in turn would allow the emergency services to pre- sounds.
pare better for the rescue of the victim from their perilous
situation. REFERENCES
A very distinct advantage of the proposed system is that [1] T. P. Suma and G. Rekha, ‘‘Study on IoT based women safety devices with
screaming detection and video capturing,’’ Int. J. Eng. Appl. Sci. Technol.,
it does not require any special external hardware/wearable vol. 6, no. 7, pp. 257–262, 2021.
devices (like apart watches), but can be implemented with [2] P. Zinemanas, M. Rocamora, M. Miron, F. Font, and X. Serra, ‘‘An inter-
the help of a mere smartphone configured accordingly. pretable deep learning model for automatic sound classification,’’ Elec-
tronics, vol. 10, no. 7, p. 850, Apr. 2021.
In most modern IoT (Internet of Things) systems, the sys- [3] D. G. Monisha, M. Monisha, G. Pavithra, and R. Subhashini, ‘‘Women
tem comes with associated hardware, like smartwatches [33], safety device and application-FEMME,’’ Indian J. Sci. Technol., vol. 9,
which constantly monitor a person’s health parameters. no. 10, pp. 1–6, Mar. 2016.
[4] G. Ciaburro and G. Iannace, ‘‘Improving smart cities safety using sound
A person would not have to carry bulky hardware around events detection based on deep neural network algorithms,’’ Informatics,
with them, but only remember to carry their phones with vol. 7, no. 3, p. 23, Jul. 2020.
them. However, whenever, more and more modules would [5] J. Cao, M. Cao, J. Wang, C. Yin, D. Wang, and P.-P. Vidal, ‘‘Urban noise
recognition with convolutional neural network,’’ Multimedia Tools Appl.,
be tried to be integrated along with the existing sys- vol. 78, no. 20, pp. 29021–29041, Oct. 2019.
tem, the inclusion of a special hardware might become an [6] A. Triantafyllopoulos, G. Keren, J. Wagner, I. Steiner, and B. W. Schuller,
issue. ‘‘Towards robust speech emotion recognition using deep residual net-
works for speech enhancement,’’ in Proc. INTERSPEECH, Graz, Austria,
The benefits of proposed solution from the fetched results Sep. 2019.
are depicted as below: [7] M. T. García-Ordás, H. Alaiz-Moretón, J. A. Benítez-Andrades,
As crime rates, and the frequency of natural and man-made I. García-Rodríguez, O. García-Olalla, and C. Benavides, ‘‘Sentiment
analysis in non-fixed length audios using a fully convolutional neural
disasters have increased significantly in modern days, it is of network,’’ Biomed. Signal Process. Control, vol. 69, Aug. 2021,
prime essence to have emergency help and services at the Art. no. 102946.
victim’s assistance as quickly as possible, with immediate [8] A. Bhardwaj, P. Khanna, S. Kumar, and Pragya, ‘‘Generative model for
NLP applications based on component extraction,’’ Proc. Comput. Sci.,
and accurate information about the victim’s situation shared vol. 167, pp. 918–931, Jan. 2020.
with them automatically. The yielded solution helps greatly [9] I. S. Malova and D. V. Tikhomirova, ‘‘Recognition of emotions in ver-
to avoid any such type of hazard like a robbery, homicide bal messages based on neural networks,’’ Proc. Comput. Sci., vol. 190,
pp. 560–563, Jan. 2021.
or other threats to any individual as it helps greatly in the [10] A. Hodorog, I. Petri, and Y. Rezgui, ‘‘Machine learning and natural lan-
accurate prediction of surrounding sound/noise of victims guage processing of social media data for event detection in smart cities,’’
in minimum time and alerts the emergency contact list of Sustain. Cities Soc., vol. 85, Oct. 2022, Art. no. 104026.
[11] D. D. Nguyen, M. S. Dao, and T. V. T. Nguyen, ‘‘Natural language
victims. Through this accurate prediction, the rate of crimes processing for social event classification,’’ in Knowledge and Systems
would go down for the fear of getting caught by the police. Engineering. Cham, Switzerland: Springer, 2015, pp. 79–91.

6470 VOLUME 12, 2024


A. Sen et al.: Live Event Detection for People’s Safety Using NLP and Deep Learning

[12] M. A. Sit, C. Koylu, and I. Demir, ‘‘Identifying disaster-related tweets [34] P. Upretee and M. E. Yüksel, ‘‘Accurate classification of heart sounds for
and their semantic, spatial and temporal context using deep learning, disease diagnosis by using spectral analysis and deep learning methods,’’
natural language processing and spatial analysis: A case study of hur- in Data Analytics in Biomedical Engineering and Healthcare, New York,
ricane irma,’’ Int. J. Digit. Earth, vol. 12, no. 11, pp. 1205–1229, NY, USA: Academic, 2021, pp. 215–232.
Nov. 2019. [35] K. Presannakumar and A. Mohamed, ‘‘Deep learning based source identifi-
[13] D. Tsiktsiris, A. Vafeiadis, A. Lalas, M. Dasygenis, K. Votis, and cation of environmental audio signals using optimized convolutional neural
D. Tzovaras, ‘‘A novel image and audio-based artificial intelligence service networks,’’ Appl. Soft Comput., vol. 143, Aug. 2023, Art. no. 110423.
for security applications in autonomous vehicles,’’ Transp. Res. Proc., [36] A. M. Tripathi and A. Mishra, ‘‘Self-supervised learning for environmental
vol. 62, pp. 294–301, Jan. 2022. sound classification,’’ Appl. Acoust., vol. 182, Nov. 2021, Art. no. 108183.
[14] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, [37] M. Mohaimenuzzaman, C. Bergmeir, I. West, and B. Meyer, ‘‘Environmen-
‘‘PANNs: Large-scale pretrained audio neural networks for audio pat- tal sound classification on the edge: A pipeline for deep acoustic networks
tern recognition,’’ IEEE/ACM Trans. Audio, Speech, Language Process., on extremely resource-constrained devices,’’ Pattern Recognit., vol. 133,
vol. 28, pp. 2880–2894, 2020. Jan. 2023, Art. no. 109025.
[15] L. Yang and H. Zhao, ‘‘Sound classification based on multihead attention [38] S. Dong, Z. Xia, X. Pan, and T. Yu, ‘‘Environmental sound classification
and support vector machine,’’ Math. Problems Eng., vol. 2021, pp. 1–11, based on improved compact bilinear attention network,’’ Digit. Signal
May 2021. Process., vol. 141, Sep. 2023, Art. no. 104170.
[16] Urban Sound Datasets. Accessed: Dec. 1, 2023. [Online]. Available:
https://urbansounddataset.weebly.com/download-urbansound8k.html
[17] Environmental Sound Classification 50. Accessed: Dec. 1, 2023.
[Online]. Available: https://www.kaggle.com/datasets/mmoreaux/
environmental-sound-classification-50?select=audio AMRIT SEN received the B.Tech. degree in com-
[18] Audio Dataset of Scream and Non Scream. Accessed: Dec. 1, 2023. puter science and engineering with a specialization
[Online]. Available: https://www.kaggle.com/datasets/aananehsansiam/ in cyber-physical systems from the Vellore Insti-
audio-dataset-of-scream-and-non-scream tute of Technology, Chennai. His research interests
[19] Writing Your Own Callbacks. Accessed: Dec. 1, 2023. [Online]. Available: include data science, machine learning, and deep
https://www.tensorflow.org/guide/keras/custom callback learning. During the course, he worked on sev-
[20] The Architecture of LeNet-5. Accessed: Dec. 1, 2023. [Online]. Avail- eral in-house projects. He also worked on a deep
able: https://www.analyticsvidhya.com/blog/2021/03/the-architecture-of- learning project, named ‘‘A Brain Tumour Seg-
lenet-5/ mentation and Classification System using Mini
[21] Understanding the Mel Spectrogram. Accessed: Dec. 1, 2023. [Online].
Batch K-Means Clustering and CNN’’ and carried
Available: https://medium.com/analytics-vidhya/understanding-the-mel-
out Industrial Research Internship with Samsung and worked on a team
spectrogram-fca2afa2ce53
project titled ‘‘Graph Embedding Generation for Link Prediction and User
[22] Kapre. Accessed: Dec. 1, 2023. [Online]. Available: https://kapre.
Classification.’’
readthedocs.io/
[23] Google Search by Voice: A Case Study. Accessed: Dec. 1, 2023. [Online].
Available: https://research.google.com/pubs/archive/36340.pdf
[24] M. Assefi, G. Liu, M. P. Wittie, and C. Izurieta, ‘‘An experimental evalu-
ation of apple Siri and Google speech recognition,’’ in Proc. ISCA SEDE, GAYATHRI RAJAKUMARAN received the bach-
2015, p. 118. elor’s degree from the Rajiv Gandhi College of
[25] A. L. Nobles, E. C. Leas, T. L. Caputi, S.-H. Zhu, S. A. Strathdee, and Engineering and Technology, in 2009, the mas-
J. W. Ayers, ‘‘Responses to addiction help-seeking from alexa, siri, Google
ter’s degree from the Pondicherry Engineering
assistant, cortana, and bixby intelligent virtual assistants,’’ npj Digit. Med.,
College, in 2011, and the Ph.D. degree from the
vol. 3, no. 1, p. 11, Jan. 2020.
Vellore Institute of Technology (VIT), Chennai,
[26] I. Lopatovska, K. Rink, I. Knight, K. Raines, K. Cosenza, H. Williams,
P. Sorsche, D. Hirsch, Q. Li, and A. Martinez, ‘‘Talk to me: Exploring user in 2020, under cloud security specialization. She is
interactions with the Amazon Alexa,’’ J. Librarianship Inf. Sci., vol. 51, currently affiliated with VIT Chennai as an Assis-
no. 4, pp. 984–997, Dec. 2019. tant Professor (Senior) with the Department of
[27] A. Farzindar, D. Inkpen, and G. Hirst, Natural Language Process- Computer Science and Engineering. Her research
ing for Social Media. San Rafael, CA, USA: Morgan Claypool, interests include cloud security, information and cyber security, the IoT, and
2015. machine learning. She has published numerous journals in high indexed
[28] S. Zad, M. Heidari, J. H. J. Jones, and O. Uzuner, ‘‘Emotion detection journals and holding two patents related to domains agriculture and the IoT.
of textual data: An interdisciplinary survey,’’ in Proc. IEEE World AI IoT She is a Reviewer of The Journal of Super Computing. She played the role of
Congr. (AIIoT), May 2021, pp. 0255–0261. an Editor and the author for publishing the books Grid and Cloud Computing,
[29] W. Graterol, J. Diaz-Amado, Y. Cardinale, I. Dongo, E. Lopes-Silva, and Cloud Computing, and Cloud Security. She was the Sponsorship Chair
C. Santos-Libarino, ‘‘Emotion detection for social robots based on NLP of International Conference on Big Data and Cloud Computing (ICBCC)
transformers and an emotion ontology,’’ Sensors, vol. 21, no. 4, p. 1322, 2018 and attracted fund from a government funding agency. She is the current
Feb. 2021. Linux Club Co-Coordinator with VIT Chennai.
[30] Z. T. Sworna, Z. Mousavi, and M. A. Babar, ‘‘NLP methods in host-based
intrusion detection systems: A systematic review and future directions,’’
J. Netw. Comput. Appl., vol. 220, Nov. 2023, Art. no. 103761.
[31] T. Özseven, ‘‘Investigation of the effectiveness of time-frequency domain MIROSLAV MAHDAL is currently the Vice-Dean
images and acoustic features in urban sound classification,’’ Appl. Acoust.,
for Science, Research and Doctoral Studies
vol. 211, Aug. 2023, Art. no. 109564.
with the Faculty of Mechanical Engineering,
[32] R. Viveros-Muñoz, P. Huijse, V. Vargas, D. Espejo, V. Poblete, J. P. Arenas,
VSB-Technical University of Ostrava, and an
M. Vernier, D. Vergara, and E. Suárez, ‘‘Dataset for polyphonic sound
event detection tasks in urban soundscapes: The synthetic polyphonic Associate Professor with the Department of Con-
ambient sound source (SPASS) dataset,’’ Data Brief, vol. 50, Oct. 2023, trol Systems and Instrumentation. His research
Art. no. 109552. interests include the control of mechatronic sys-
[33] A. B. Shrestha, B. Khanal, N. Mainali, S. Shrestha, S. Chapagain, tems, control systems, automatic control theory,
T. P. Umar, and V. Jaiswal, ‘‘Navigating the role of smartwatches in wireless technologies, artificial intelligence, cloud
cardiac fitness monitoring: Insights from physicians and the evolving computing, optimization methods, and the pro-
landscape,’’ Current Problems Cardiology, vol. 49, no. 1, Jan. 2024, gramming of control systems. He has nearly more than 80 articles to his
Art. no. 102073. credit.

VOLUME 12, 2024 6471


A. Sen et al.: Live Event Detection for People’s Safety Using NLP and Deep Learning

SHOLA USHARANI received the Ph.D. degree RAJIV VINCENT received the master’s degree
from the Vellore Institute of Technology (VIT), in computer science and engineering from the
Chennai, in 2020. She is currently an Associate College of Engineering Guindy, Anna Univer-
Professor with the School of Computing Science sity, Chennai, India. He is currently an Assistant
and Engineering, VIT Chennai. Her research inter- Professor Senior with the School of Comput-
ests include embedded systems, the IoT, machine ing Science and Engineering, Vellore Institute of
learning, computer networks, cloud computing, Technology (VIT), Chennai. He has been an Aca-
and security. She has published articles in Scopus demician for the past 11 years and a System
indexed journals, guided more number of UG and Administrator for two years. He has published a
PG projects in the area of embedded systems, the book in machine learning titled as Image Process-
IoT, machine learning, and security. She is an active ACM Member. She ing for Machine Learning (ISBN: 978-93-5445-509-4). He has published
received Research Award, from 2014 to 2015. She is a Reviewer in journals, many research articles in reputed Scopus indexed and Web of Science
such as IGI Global journal systems and Book Systems. She is acting as an journals, also two Indian patents and got one grant in international patent.
Android Club Coordinator with VIT University. His academic and research expertise covers a wide range of subject area,
including deep learning, image processing, and web technologies.

VEZHAVENDHAN RAJASEKHARAN received


the Bachelor of Engineering degree in mechan-
ical engineering from the University of Madras,
the Master of Technology degree in manu- KARTHIKEYAN SUGAVANAN is currently pur-
facturing engineering and management, and suing the degree with the Vellore Institute of
the Ph.D. degree in enterprise transformation. Technology (VIT), Chennai. With a keen interest
He began his career as a hardcore manufacturing in data science and its intersection with AI and
Engineer and ventured into entrepreneurship to IoT, his academic journey has been marked by out-
manufacture mill boards from waste paper. Later, standing achievements and a thirst for knowledge
he joined the Vellore Institute of Technology, in cutting-edge technologies. His dedication to this
Vellore Campus, where he is currently a Senior Associate Professor. He is field and desire to make a meaningful impact on
also a Distinguished Mechanical Engineer and a Researcher with a good people’s lives led him to collaborate with esteemed
background in the application of artificial intelligence (AI) and machine mentors, Dr. Shola Usharani and Dr. Gayathri
learning (ML) in various engineering domains. His current research interests Rajakumaran. Inspired by the potential of NLP and deep learning, he has
include quality management, entrepreneurship, AI, and ML. He has collab- been actively involved in their applications for live event detection for
orated actively with researchers in several disciplines of AI, ML, and the people’s safety. This innovative project aims to harness the power of natural
IoT. His expertise, coupled with his passion for advancing the boundaries language processing and advanced machine learning algorithms to ensure the
of AI and ML, continues to inspire the next generation of researchers and security and well-being of individuals during live events.
engineers in their pursuit of innovation and excellence.

6472 VOLUME 12, 2024

You might also like