Live Event Detection for People's Safety Using NLP and Deep Learning
ABSTRACT Today, humans pose a grave threat to society through involvement in robbery, assault, and homicide. Such circumstances endanger people working alone at night in remote areas, especially women. A real-time threat of this kind is almost always accompanied by a sound or noise that can be used for early detection. Numerous measures already exist, but none of them is efficient, owing to limited accuracy and delays in predicting the threat. Hence, a novel software-based prototype is developed to detect threats from a person's surrounding sound/noise and to automatically alert the victim's registered contacts by sending email, SMS, and WhatsApp messages through a smartphone, without any additional hardware components. Audio signals from a Kaggle dataset are visualized and analyzed using Exploratory Data Analysis (EDA) techniques. Feeding the EDA outcomes into deep learning models, namely Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN), yields an accuracy of 96.6% in classifying the audio events.

INDEX TERMS Natural language processing (NLP), deep learning, audio, recording, CNN, LSTM, classification, prediction.
Natural language processing has already come a long way in the domain of voice/text-based language processing for the betterment of human lives, all while keeping pace with the latest technologies and hardware. Taking a few instances of how natural language processing with artificial intelligence has impacted our lives in recent times, one can consider how Google's Voice Assistant works. If one needs to set an alarm, one can literally say so, and the alarm is set by the AI-based assistant. If one is driving a car and wants to look up directions to a destination, one can simply ask for them after saying ''Hey, Google''. The examples of such applications are numerous. It is not even necessary to touch the phone to dial another person's number: one can just ask Google to do so, and it is done in no time.

Another example of the application of natural language processing is in the field of textual data analysis and classification. NLP can be used to determine the voice, tense, and type of a sentence, which can in turn be used to determine either the next part of the text or the emotion [28], [29] that the text is trying to convey. Google, for instance, uses a very similar application in its search engine, suggesting search queries to users as soon as they start typing something into the browser's search bar. The relevance of the search results is also determined by NLP itself. Taking another instance, social media platforms like Twitter and Instagram use NLP for sentiment analysis and emotion detection, and use the acquired insight to show their users similar posts on their timelines.

Now, observing the scenarios presented above, NLP can be used to enhance the safety and security of individuals, or of a population as a whole, in many different ways. Both sound- and text-based NLP can be used in multiple scenarios to provide different kinds of security solutions.

Taking the scenario of NLP on social media platforms [27], it can be of great help to security and emergency workers in times of crisis, such as a natural calamity (a flood, landslide, tsunami, cyclone, etc.) or a man-made disaster (terrorist attacks, hostage situations, aviation emergencies, civil accidents, etc.). Since events of such kinds stir heavy discussions on social media, NLP techniques can be put to work to keep track of posts of similar kinds, and if any post related to an emergency is detected on the social media platforms, it can be redirected to the appropriate emergency services with the help of customized algorithms. This way, emergency responders would have full situational awareness, would be able to act more quickly than was possible before, and would be able to help the victims of the situation in a more informed and better way, possibly minimizing further damage as much as possible.

A similar solution can be attained with the help of sound-based NLP techniques, where the sound detected from the surroundings gives insight into the actual situation of the victim. For instance, if a fire breaks out in a building and residents get trapped inside, then, with the help of sound detectors in appropriate locations inside the building, an automated application based on NLP could be used to detect whether a person is in immediate danger of getting burned, from the ambient noise of the fire and the screams of the victim, and emergency services could be sent in his/her direction with a definitive motive, without wasting time looking everywhere for the victim.

Natural disasters aside, individual human beings also face many dangers and perish too, as a direct consequence of either an unfortunate accident or a peril caused by one human being to another (like a homicide). In most such situations, the victim dies because the situation is not conveyed to the emergency services on time. People working or walking late at night in remote areas, especially women, also face the danger of being robbed, assaulted, or murdered. Situations like these call for a system which would be able to access the ambient noise of a person and detect whether he/she is in danger or not. A system similar to the one described here has been worked upon by the researchers in [1], but there the system was primarily based on a hardware model and had to be worn at all times in order for it to work properly.

In this research, a software-based system is built, as depicted in Figure 1, that is able to detect whether a person is in any dangerous situation or not by analyzing his/her surrounding noise. If a dangerous situation is detected, then an automatic and immediate alert is issued to the registered contact (or the emergency services).

The research starts with understanding related studies on event classification and the adoption of various machine learning and deep learning models in the Technical Background (Section II), which concludes by mentioning research gaps in the current literature. The proposed live event detection methodology of Section III depicts the overall procedure for identifying and classifying real-time events based on surrounding noise/sound, and the understanding and explanation of the dataset, which consists of over 9000 different audio clips spread across 13 different classes of audio, is given in Section IV. The dataset is explored in Section IV by looking into its time-domain form, before transforming it to the frequency domain using the Fast Fourier Transform and sampling it at 44.1 kHz. The decibel spectrogram and the Mel-spectrogram are used to visualize the data, and the Mel-spectrogram form of the data is used by the deep learning models.

The data is then cleansed in Section IV-A, and for each audio signal an audio envelope is created for better analysis by the deep learning models. The next step is to train the three deep learning models (1D-CNN, 2D-CNN, and LSTM) in Sections IV-B, IV-C, and IV-D on the cleansed dataset, and to analyze the output. Lastly, the live audio recording module is integrated with the prediction module in Section IV-E, which listens for any sound from the person's surroundings and sends the recorded audio to the prediction module for classification.
… with the support vector machine (SVM) and the extreme learning machine (ML-ELM).

For emotion recognition from speech, [6] looked at how noise affected two popular SER (Speech Emotion Recognition) architectures, acoustic features and end-to-end, as well as the potential benefits of implementing speech enhancement in SER applications, particularly at low SNRs. This system's ability to recognize speech (a sort of noise) even at very low Signal-to-Noise Ratios (SNRs), or for poor input sound quality, is a significant benefit. In that research, a number of SER techniques based on SVMs and openSMILE features are employed. The approach is based on stacked residual blocks of 2D convolution layers, which have been shown to efficiently learn rich representations of input signals in the past.

A major usage of natural language processing is in the field of sentiment analysis and emotion detection too. In [7], a method for sentiment (feeling) analysis that is non-predictive a priori and can handle audio recordings of arbitrary length is proposed. The Mel spectrogram and Mel Frequency Cepstral Coefficients are used as audio description tools, and a Fully Convolutional Neural Network (FCN) architecture is recommended as a classifier. The FCN architecture is suggested by this study in order to classify audio files of any length and recognize emotions in close to real time. An FCN is primarily a CNN without fully connected layers; it employs just convolutional layers and up- or down-scales the input data to enable the system to accept variable input.

A study similar to the aforementioned one is conducted by [8]. Built on significant elements gathered from several case studies, a generative model for NLP applications is provided in this paper. The generative model serves as a unified framework for several NLP disciplines and may address specific difficulties in reading text, hearing speech, comprehending it, gauging mood, and determining the essential elements. The study proposes a model for a smart virtual assistant that might include the best traits from each case study looked at, for an improvement over the present NLP models, in addition to having the ability to understand more challenging languages like Chinese. The system's capacity to recognize speech and emotion in multiple languages can be a huge benefit in the security sector, since it allows for the detection of threatening or abusive speech delivered to a person in different languages.

A study on emotion detection from text/speech is also conducted by [9], which uses neural networks to automatically determine the speaker's emotions by analyzing vocal cues. In order to analyze texts/speech with multilingual forms utilizing cross-language functions and the lexical-level function, a hybrid neural network made up of CNN and Bi-LSTM subnets is utilized. This network also identifies emotions in cross-language vocals/writings. The system may be utilized as a software-only program to automatically identify threat calls and so improve a person's security.

Event detection by analyzing posts on social media is also one of the major applications of NLP, and can be used to enhance security measures and operations by providing early information to emergency workers, who in turn can reach out to the victims before the severity of the situation becomes graver. A similar attempt has been made by the research done in [10]. In order to select the language processing models striking the best balance between accuracy and processing speed for text-based natural language processing in the urban context, the researchers in [10] conducted a preemptive evaluation by contrasting several baseline language models previously used by researchers for event classification. To achieve the desired results, a number of algorithms are applied along with the pre-defined NLP models, including MNB (Multinomial Naive Bayes Classifier), CNB (Complement Naive Bayes Classifier), RF (Random Forest Classifier), Multiple Regression Analysis, General Regression Statistics, and ANOVA.

Similar to the aforementioned method of event detection, [11] presents a simple yet effective method for social event recognition that mostly utilizes natural language processing. The researchers look at the distinctive characteristics of social media's natural language in order to select the most suitable characteristics. Second, they mix fundamental machine learning techniques with NLP methods to do classification and extract features. The bag-of-words (BoW) model, one of the methods employed in this paper, may be used to describe a text using the frequency of terms found in a dictionary. BoW completely disregards word order and structure, yet it is a highly powerful approach to represent messages. The Support Vector Machine (SVM) algorithm is another one that is applied in this work. This system has the capacity to identify social events from short, hazy, and nonstandard English-written social media messages.

In [12] too, the researchers present an analytical framework for the analysis of tweets in order to identify and categorize specific information about a disaster, such as affected people, damaged infrastructure, and disrupted services, and to distinguish impact areas and time periods, as well as the relative prominence of each category of disaster-related information across space and time. Here, Latent Dirichlet Allocation (LDA) is employed in an unsupervised multi-label categorization of tweets utilizing LSTM (Long Short-Term Memory) networks.

The scope of NLP in the domain of security can be further extended to its usage inside modern AI-powered self-driving vehicles. In [13], an image- and audio-based solution is provided as a service to increase the security and trust within an autonomous shuttle. It is backed by special Artificial Intelligence (AI) algorithms. The two modalities allow for the real-time identification of small criminal scenarios, such as screaming, bag stealing, altercations, and vandalism. They also provide notifications to authorized personnel for necessary action. For audio classification, a two-dimensional Convolutional Neural Network (CNN) is employed, and for visual analysis, an LSTM classifier that can perform binary or multi-class SoftMax classification is used.
A rider's safety is maintained as an advantage of this system.

Short sounds/noises, which need to be detected very quickly, are often linked with dangerous situations, which calls for a system with the capability to do so. Reference [14] proposes pre-trained audio neural networks (PANNs) that were trained on the substantial AudioSet dataset. These PANNs are transferred to additional audio-related tasks; the work models the computational complexity of PANNs and investigates their performance using a variety of convolutional neural networks. Convolutional Neural Networks (CNN) and other methods for data balancing and augmentation are the major techniques utilized for building PANNs; this study specifically uses the augmentation methods mixup and SpecAugment. PANNs can identify sounds with enormous accuracy, which significantly decreases the work required from humans to accomplish the same, and the speed at which PANNs operate also makes them well suited to swiftly and precisely classifying common noises.

Returning to the classification of sound/noise, which can be classically used as the basis for the detection of danger from the noise around an individual or population, [15] demonstrates that sound categorization performance can still be improved by swapping out the recurrent architecture for a parallel processing structure during feature extraction. The research processes large amounts of data and uses it to develop the model using deep learning algorithms, namely CNN (Convolutional Neural Networks) and LSTM (Long Short-Term Memory). A stack of L identical blocks, each with its own set of training parameters, makes up the feature-extraction model used. This study compares SVM to LR (Logistic Regression) and KNN (K-Nearest Neighbor), two other classifiers, and discusses the advantages of SVM as a classifier. The studies' findings demonstrated that the suggested technique may greatly improve sound classification accuracy, further advancing the cause of improving individual security.

Exploring some unconventional usage of NLP, [30] provides the perfect example. After perimeter defenses (such as a firewall and network-based intrusion detection system) have failed or been circumvented, a host-based intrusion detection system (HIDS) is a useful final line of defense against cyber security threats. Since Security Operation Centers (SOC) of enterprises rank HIDS as one of the top two security tools, HIDS is widely employed in business. For industrial companies, having a highly effective and efficient HIDS is ideal; however, when sophisticated attack patterns evolve, HIDS performance deteriorates due to various issues (e.g., a high false alarm rate that wears out SOC employees). An increasing number of HIDS are utilizing the advancements in Natural Language Processing (NLP) techniques, which have demonstrated effective and efficient performance in accurately detecting low-footprint, zero-day attacks and predicting an attacker's next steps. This is because NLP methods are better suited for identifying complex attack patterns. An integrated and thorough body of information about NLP-based HIDS is required given the current research trend of using NLP in HIDS. Notwithstanding the rapidly increasing usage of NLP in HIDS development, not much effort has been made to systematically examine and compile the peer-reviewed literature that is currently accessible in order to comprehend the role that NLP plays in HIDS development. Reference [30] conducted a Systematic Literature Review (SLR) of the works on the end-to-end pipeline of the application of NLP in HIDS development, since there was a dearth of a synthesis and a complete body of information on this crucial issue. Reference [30] identifies, taxonomically classifies, and systematically compares the state-of-the-art NLP techniques used in HIDS, the attacks identified by these NLP methods, the datasets, and the evaluation metrics that are used to assess NLP-based HIDS across the end-to-end NLP-based HIDS development pipeline. To assist HIDS developers, [30] emphasizes the pertinent best practices, issues, benefits, and drawbacks, and also provides planned future research paths for the development of NLP-based HIDS.

In [31], the researchers provide another instance where deep learning is used for sound classification in the urban landscape. They claim that building habitable and sustainable cities is severely challenged by the world's rapid urbanization and population expansion. Urban noises are increasing and becoming more diverse as a result of this growth. Since noise is central to the idea of smart cities, [31] turned these noises into information rather than merely letting them be heard. Two fundamental techniques are utilized to categorize urban noises for this purpose. In the first of these, the sounds are subjected to signal processing techniques in order to extract hand-crafted qualities. The alternative approach uses deep learning models to classify sounds based on their visual representation. This study looked at how different variables utilized in both approaches (individual and hybrid) affect how urban sounds are classified. Furthermore, a CNN model for hybrid feature classification was developed. The outcomes demonstrated that both strategies were successful in classifying the data. Mel-spectrogram, scalogram, and spectrogram pictures yielded the best categorization success rate among the visual representation techniques. Accuracy was positively impacted by using the SVM classifier, the mel-spectrogram, and auditory features. Datasets from UrbanSound8k and ESC-10 were used for the experiments. When utilizing the AVCNN model with the scalogram and acoustic characteristics, ESC-10 achieved the greatest accuracy of 98.33%. By utilizing the SVM classifier to categorize the mel-spectrogram and acoustic characteristics derived from the AVCNN model, the maximum accuracy of 97.70% was achieved for UrbanSound8k.

The researchers in [32] focus their research on the Synthetic Polyphonic Ambient Sound Source (SPASS) dataset, a freely accessible source of synthetic polyphonic audio. SPASS was created to efficiently train deep neural networks for the purpose of detecting polyphonic sound events (PSED) in urban sound environments.
The five virtual areas that make up SPASS are park, square, street, market, and waterfront. Following a hierarchical class taxonomy, a variety of monophonic sound sources were curated, virtual environments were set up using the RAVEN software library, all stimuli were created, and the data was processed to produce synthetic recordings of polyphonic sound events along with their corresponding metadata. The collection has 25,000 stimuli of 10 seconds each, or 5,000 audio clips per environment, virtually recorded at a 44.1 kHz sampling rate.

In this research, the audio analysis techniques adopted are the Fourier Transform and the Mel-spectrogram (similar to [31]), and the audio was sampled at 44.1 kHz (just like in [32]) for further processing. Post-cleaning, the sound data is subjected to three different deep learning models (1D-CNN, 2D-CNN, and LSTM) for the classification of sound from a person's surroundings (the likes of which have been used in various pieces of research cited above, for example [15]), and to detect a threat from it. If a threat is detected, then an automatic alert message is sent to the registered help or the emergency services. Moreover, research works like [34], [35], [36], [37], and [38] have been referred to during the course of this work in order to generate more insight into how audio analysis and classification is done using different analysis techniques and deep learning models.

The research gaps identified from the existing literature are as follows:
• Starting with [1], it provides the necessary solution for the problem of detecting a threat around an individual, but it comes with bulky hardware, which makes it difficult to carry around for regular use.
• On the other hand, in [2], [15], and [32], the researchers use several techniques to analyze audio signals, but they do not extend their work to provide a practical solution to the problem of danger detection around an individual.
• The research done in [3] is somewhat close to what has been achieved in this research, where the researchers have built a system to provide real-time feedback from a person's surroundings; however, this also comes with an additional hardware component in addition to a smartphone.
• In [4], the researchers use noise detection to detect Unmanned Aerial Vehicles (UAVs) which might be used for criminal activities. However, this approach is not favorable for application at an individual level, and would not be suitable for detecting a threat around an individual human being.
• Reference [5] also uses similar techniques to detect threats for an individual/group in an urban context, but is unable to provide a solution that is friendly to the user at an individual level without the use of any hardware.
• The research done in [6] and [7] focuses on the detection of emotion from speech, which can be helpful in determining whether a person is in agony or not, or whether a person is being verbally threatened by another fellow human being. Systems like these, although beneficial, are unable to address the problem of the physical safety of an individual.
• A very similar system is proposed by [8], where speech detection is done for multiple languages to detect verbal threats, but not physical ones.
• Some other systems, like those proposed in [9], [10], [11], and [12], detect emotions and events from texts/speech, social media posts, and tweets respectively. Although these are unique approaches to determining an individual's or group's live situation, they again fail to address the challenge of physical individual safety.
• Moving further, [13] provides a unique sound-detection-based approach towards the safety of travelers inside a vehicle, but does not address the safety concerns of those individuals who are alone and not inside any vehicle.
• Reference [14] provides yet another approach towards the detection and classification of short sounds/noises, which can actually become very useful for approaches like the one this research has proposed, but the researchers in [14] do not further their research towards any practical solution to the problem of physical individual safety.
• References [30] and [31] use noise detection for perimeter defense techniques (like intrusion detection) and for the conversion of urban sounds into information respectively, but do not address how an individual can be helped with respect to physical threats.

III. PROPOSED LIVE EVENT DETECTION METHODOLOGY FOR INPUT AUDIO CLASSIFICATION
The main objective of the proposed system is to detect and classify the victim's live audio signals for immediate rescue. The system is intended to run as an application on any smartphone and uses the phone's default microphone configuration. On detection of suspicious audio patterns in the live input audio from the microphone, the geographical location of the victim is shared with the emergency contacts in the phone as well as with the police patrol. The drawbacks inferred from current violence detection work related to audio event detection and classification accuracy are addressed so that the system functions effectively, which plays a vital role in avoiding false event classifications while ensuring the victim's safety through high classification accuracy. To carry out accurate prediction, training and testing on the Kaggle dataset are carried out with three deep learning models, namely LSTM, 1D-CNN, and 2D-CNN, as illustrated in Figure 2.

The audio dataset used in this work consists of 13 classes (types) of audio signals, namely air conditioner, car horn, children playing, dog bark, drilling, engine idling, fire crackling, glass breaking, gunshot, jackhammer, scream, siren, and street music. Of these 13 classes, fire crackling, glass breaking, gunshot, and scream are identified as audio types related to a potentially dangerous environment.
FIGURE 3. The audio signals in the time domain for one sample audio
each from 6 of the 13 classes considered in the research.
FIGURE 4. The audio signal for a gunshot in the time domain with a
sampling period of 1 second on applying FFT.
Signals of this kind are called non-periodic signals. To show how the spectrum of these signals evolves over time, a technique is needed, called the Short-Time Fourier Transform, which is a method for computing many spectra by applying the FFT to a number of windowed signal segments. To put it another way, the Short-Time Fourier Transform (STFT) takes a large number of FFTs and stacks the resulting periodograms to produce a new visual metric known as the decibel spectrogram, which is created when the FFT is calculated on overlapping windowed portions of the signal (the audio signal in this research). The general formula for the STFT is given below:

X(τ, ω) = ∫_{−∞}^{∞} x(t) w(t − τ) e^{−iωt} dt    (2)

A spectrogram, sometimes known as a decibel spectrogram, is a visual depiction of the ''loudness'' or signal intensity (amplitude) over time at the various frequencies contained in a particular waveform. A spectrogram is effectively a collection of FFTs stacked on top of one another. Some more information is processed in the background while the spectrogram is being calculated: the amplitude is mapped onto a log scale, the color dimension is converted to decibels, and the y-axis is changed to a log scale, because humans can only sense a very limited and restricted range of frequencies and amplitudes. The quantity of energy at various frequencies, such as 2 Hz vs. 10 Hz, as well as how it varies over time, can be seen. In several scientific disciplines, spectrograms are widely used to display the frequencies of sound waves produced by humans, machinery, animals, whales, airplanes, etc., and recorded by microphones. In order to distinguish and categorize distinct earthquake types or other ground vibrations, the seismic community is increasingly using spectrograms to analyze frequency content.
in a particular waveform. A spectrogram is effectively a pose, the Mel FilterBank on the Mel scale can be used to
collection of FFTs stacked on top of one another. Some more rephrase and rescale the audio signal for aiding in the accurate
information is being processed in the back-ground while the training of deep learning models. Reference [19] Stevens,
spectrogram is being calculated. This may be regarded as the Volkmann, and Newmann developed a unit of pitch in 1937 so
amplitude’s log scale. The color dimension is converted to that the listener would perceive equivalent distances in pitch
decibels, and the y-axis is changed to a log scale. This is as equal lengths. It is known as the Mel scale. In order
because humans can only sense a very limited and restricted to translate frequencies to the Mel scale, mathematics is
range of frequencies and amplitudes. The quantity of energy needed.
at various frequencies, such as 2 Hz vs. 10 Hz, as well as The working of Mel spectrogram for a sample audio is
how it varies over time, may be seen. In several scientific depicted in Figure 7. To understand better how the Mel
disciplines, spectrograms are widely used to display the fre- scale is related to the frequency, for lower frequencies as the
quencies of sound waves produced by humans, machinery, frequency changes, the difference reflected on the Mel scale
animals, whales, airplanes, etc. and recorded by microphones. is quite considerable. However, as the frequency increases,
In order to distinguish and categorize distinct earthquake the large differences in frequencies tend to yield smaller
types or other ground vibrations, the seismic community is changes on the Mel scale i.e., the lower frequencies are
increasingly using spectrograms to analyze the frequency given more importance than the higher frequencies. Thus,
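The paper does not state which Mel mapping it relies on; the variant most commonly implemented in audio libraries, and therefore a reasonable assumption here, is

m(f) = 2595 · log10(1 + f / 700),

where f is the frequency in hertz. Under this mapping, 1,000 Hz corresponds to roughly 1,000 Mel; the 600 to 1,200 Hz pair from the example above is about 430 Mel apart, whereas the 15,000 to 15,600 Hz pair is only about 42 Mel apart.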
FIGURE 7. The Mel spectrogram for one sample audio each from 12 of
the 13 classes considered in the research.
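Continuing the librosa-based sketch from above (librosa remains an assumption; the paper itself only names Kapre for the in-model layer described later), the Mel spectrogram visualized in Figure 7 can be computed as follows:

import numpy as np
import librosa

y, sr = librosa.load("sample_scream.wav", sr=44100)  # hypothetical file

# STFT energies pooled through a Mel filter bank, then converted to decibels
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

print(mel_db.shape)  # (n_mels, n_frames)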
As the audio files contain 16-bit audio, the data type considered is the NumPy 16-bit integer.

One of the problems faced with audio data is that most of the audio is concentrated in one general area of the audio stream, and as the audio progresses the magnitude becomes so low that a significant part of the signal looks the same, because there is a lot of silent area in the audio. The silent zones or dead spaces in the audio can be removed by creating a customized signal envelope, which is essentially what it sounds like: this signal envelope tracks the signal to study how it changes, and considers only the magnitudes above a particular threshold (20 is the threshold considered for this research). This is implemented by first converting the signal into a sequence, obtaining the absolute (positive) value of the signal at any given point in time, and then applying a rolling window over the signal with a specific window length (set as 20 in this case) that considers the maximum magnitude at any time instance. The creation of the signal envelope with the specified threshold is depicted in Figure 8.

The dataset used in this project, as already described, is a collection of audio files across 13 classes. Since all these audio files are recordings of real-world environmental sounds, the format of the audio across each audio file is inconsistent. For most instances of the audio data, the data available has more than one channel of audio signal, with a sampling rate between 44,100 Hz and 48,000 Hz. Since this variance across the data is a challenge to the uniform analysis of the audio data, which in turn might lead to erroneous classification results after training, the data is passed through a cleaning process. In this step, the audio data in each of the audio files is converted to a signal with a mono channel, and the sampling rate is down-sampled to 16,000 Hz for ease of analysis during the training phase, including a reduced computation time. The focus of down-sampling is to feed the deep learning models with clean data to train on.
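A minimal sketch of the cleaning and envelope steps described above, under the assumption that librosa, pandas, and soundfile are used (the paper does not name these libraries); the threshold of 20 and window length of 20 follow the values quoted in the text, while the file paths are placeholders:

import numpy as np
import pandas as pd
import librosa
import soundfile as sf

def envelope(signal, threshold=20, window=20):
    """Boolean mask keeping only the regions whose rolling maximum magnitude exceeds the threshold."""
    magnitudes = pd.Series(np.abs(signal))
    rolling_max = magnitudes.rolling(window=window, min_periods=1, center=True).max()
    return (rolling_max > threshold).to_numpy()

# Load one raw clip, forcing a mono channel and down-sampling to 16 kHz
wav, rate = librosa.load("raw/gunshot_001.wav", sr=16000, mono=True)  # hypothetical path

# librosa returns floats in [-1, 1]; rescale to 16-bit integers, since the text assumes int16 data
wav = (wav * np.iinfo(np.int16).max).astype(np.int16)

mask = envelope(wav, threshold=20, window=20)
clean = wav[mask]                                   # silent zones / dead spaces removed
sf.write("clean/gunshot_001.wav", clean, rate)      # hypothetical output path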
B. APPLIED DEEP LEARNING MODELS
In this project, three different deep learning techniques have been explored and implemented, namely LSTM (Long Short-Term Memory), 1D-CNN (1-Dimensional Convolutional Neural Network), and 2D-CNN (2-Dimensional Convolutional Neural Network). The primary reason behind the implementation of three different deep learning models is that each of these models has its own set of advantages and disadvantages, which cater to different requirements with respect to the training, analysis, and classification of the data, based on the different properties and attributes associated with the dataset being used. The prediction module offers an option to choose between these three deep learning models for running the prediction on the real-world audio signal, based on the requirement.

Once the data is cleaned, training is the next phase, for which the data has to be loaded into the models with appropriate parameters. For this, a custom data generator is built, both for training and validation, which prepares (generates) the data to be loaded into the deep learning models for training, after it has been split into training and testing parts (a 90-10 split has been done in this work). The input to this data generator is the paths to the clean audio files (created earlier), split into intervals of 1 second, and the corresponding classes for the signals.

Beginning with the best practice of loading data into the system before analyzing it with a deep learning technique, it is computationally more efficient to load the data in batches, until all the epochs are completed, instead of loading the entire dataset into memory at once. The way this is recommended to be done in TensorFlow, from Keras' perspective, is to inherit from a class called ''Sequence''. This class provides the functionality to load the data in batches, and to use multiple GPUs (Graphics Processing Units) to process the data through multiprocessing.

Three essential functions are implemented here for generating the final processed data on which the analysis will be run later, namely ''__len__'', ''__getitem__'', and ''on_epoch_end'' [20].
The ''__len__'' function is used to specify the number of batches per epoch (the total number of samples divided by the batch size, which can vary between 16, 32, 64, and so on). The ''__getitem__'' method outputs an ''X'' matrix (a time-series format of the audio data) whose 1st dimension is the batch size, 2nd dimension is the number of channels, and 3rd dimension is the total number of data points considered over the specified time interval, and a ''Y'' matrix which in turn is the output in the form of a SoftMax layer (the probabilities of occurrence of the different classes); the 1st dimension of this matrix is the batch size and the 2nd dimension is the number of classes considered in the research (13 in this work). The ''Y'' matrix is later used to build a one-hot encoded matrix using the ''to_categorical'' method of TensorFlow. The output from the ''__getitem__'' method is used as input to the Mel-spectrogram layer in the deep learning models, where the data is fed as audio signals with one channel and 16,000 data points for one second of time-series data (the sampling rate). Lastly, the ''on_epoch_end'' method is typically used for data augmentation across deep learning projects; however, in this work it has mainly been used for shuffling the data in between epochs, so that there is a different distribution of data for different batches and the models achieve a more holistic learning in the end.
As previously mentioned, a deep learning model tends to perform better with Mel-spectrogram audio inputs, so the concept of the Mel-spectrogram is applied to the input audio data before the training of the models begins. For all three models, after feeding the input audio signals, with one channel and a sampling rate of 16,000 Hz, a Mel-spectrogram layer is included, and the output from this layer is then normalized with a 2D normalization layer which rescales the data to zero mean, for further processing by the models. This is a custom layer added to the Keras model(s) (the deep learning model(s)) with the help of another Python library called Kapre [20], an audio pre-processing library for Python which allows the implementation of various custom signal processing techniques such as the STFT (Short-Time Fourier Transform), the inverse STFT, and the Mel-spectrogram. If, instead of using Kapre, the Mel-spectrogram were computed separately, it would have to be computed offline and stored separately in memory before the analysis could be run; in case some parameters were to be changed, the entire process would have to be re-iterated and would take up a lot of computational time.

The details of the three models used in this work are discussed in the upcoming subsections.

1) THE 1D-CNN (1-DIMENSIONAL CONVOLUTIONAL NEURAL NETWORK) MODEL
A 1D-CNN performs fairly well when shorter (fixed-length) segments of the full dataset are anticipated to provide interesting features and the feature's location within the segment is not particularly important. This is pertinent to the analysis of time sequences of sensor data, such as that from an accelerometer or gyroscope. It also applies to the analysis of audio signals, as well as any other signal data that has been gathered over a predefined period of time. Another use is in natural language processing (NLP), albeit LSTM networks have more promise in this field, because word proximity isn't always a trustworthy indicator of a trainable pattern.

The 1D-CNN uses time-distributed layers to wrap the 1D convolutions over time. The input to the model is the time-series format of the data, where the channels are the first dimension. For consistency, a permute layer has been implemented on the dimensions of the input data, which is important because a time-distributed layer is being used here: any time-based layer expects the dimensions of the data to be batched by time, features, and channels, so the permute layer simply swaps the features and time, and time becomes the 1st dimension.

Next, a time-distributed 1D convolution is implemented with a small kernel size and a hyperbolic tangent as the first activation function. The activation used in the following time-distributed layer is the ReLU (Rectified Linear Unit), which allows a deep learning model to be non-linear, addresses the vanishing gradients problem, helps prevent the exponential increase in computation required to run the neural network, and only takes into account half of the input data.

Before going any further, it is important to note that the goal here is to develop a classification functionality, which can be done using a few deep learning layers to build out features from the data that is available (sound data in this case), and reduce them down to the point where a classifier of some sort can be built in the last few layers of the Convolutional Neural Network, which are called the head of the neural network.

The same is the case for the 1D time-distributed CNN model used here: the number of dimensions is gradually reduced as the model progresses through the layers, and the number of tuned parameters increases as the number of features is increased in each layer. In other words, the training initially starts with a limited number of features (a general start), and as the training progresses through the layers, more features are added to the network to specify with more granularity what exactly the network should learn; this is what the last layer with 128 features (the maximum number) does. Wrapping this procedure with a time-distributed layer with respect to sound data means that the network is instructed to go along the time dimension, and as it progresses, all the different frequencies from the input audio data are fed to the network, and the neural network can identify useful features from specific frequencies and use all these features together to learn from them, which is practically not possible for human beings to do. The purpose of wrapping a 1D-CNN in time-distributed layers is therefore that the model looks at the frequency spectrum of the audio signal over time.
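A schematic Keras reconstruction of this description is shown below. It is a sketch under stated assumptions, not the paper's published model: the in-model (Kapre) Mel-spectrogram and normalization layers are omitted, so the input is taken to be an already-computed Mel-spectrogram, and the kernel sizes and intermediate widths are indicative only.

from tensorflow.keras import layers, models, regularizers

N_MELS, N_FRAMES, N_CLASSES = 128, 32, 13   # indicative shapes; the paper does not list them explicitly

def build_1d_cnn(n_classes=N_CLASSES):
    inp = layers.Input(shape=(N_MELS, N_FRAMES, 1))          # (features, time, channels)
    x = layers.Permute((2, 1, 3))(inp)                        # swap features and time: (time, features, channels)
    x = layers.TimeDistributed(layers.Conv1D(8, 3, activation="tanh", padding="same"))(x)
    x = layers.TimeDistributed(layers.Conv1D(16, 3, activation="relu", padding="same"))(x)
    x = layers.TimeDistributed(layers.MaxPooling1D(2))(x)
    x = layers.TimeDistributed(layers.Conv1D(32, 3, activation="relu", padding="same"))(x)
    x = layers.TimeDistributed(layers.Conv1D(64, 3, activation="relu", padding="same"))(x)
    x = layers.TimeDistributed(layers.Conv1D(128, 3, activation="relu", padding="same"))(x)  # widest layer
    x = layers.TimeDistributed(layers.GlobalMaxPooling1D())(x)  # collapse per-frame feature maps (an assumption)
    x = layers.Flatten()(x)
    # classifier head: dropout + regularized dense layer, then the 13-way SoftMax output
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(64, activation="relu", activity_regularizer=regularizers.l2(0.001))(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)

model = build_1d_cnn()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])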
The last three layers are essentially the same as those used in the 1D-CNN model: a dropout and a regularization layer to minimize overfitting, and two dense layers (with 64 activation units considered in the first dense layer, and the sound classes used in this research as the features of the 2nd dense layer) for building the classifier (using a one-hot encoded matrix) and obtaining the output (prediction probabilities) through a SoftMax activation layer.

3) THE LSTM (LONG SHORT-TERM MEMORY) MODEL
Unlike conventional feedforward neural networks, an LSTM has feedback connections. Such a Recurrent Neural Network (RNN) is capable of analyzing both single data points, such as photos, and whole data sequences, like audio or video. This characteristic makes LSTM networks ideal for managing and anticipating data; for example, voice recognition, machine translation, speech activity detection, robot control, video gaming, and healthcare are some applications of LSTM. Applications like connected, unsegmented handwriting identification are also possible uses. Numerous RNNs may pick up long-term dependencies, which is very useful for challenges involving sequence prediction. In addition to processing single data points like pictures, the LSTM also features feedback links that enable it to process a full data stream.

A memory cell in an LSTM model is referred to as a ''cell state'' and performs a crucial role in the model by maintaining its state over time. In LSTMs, gates regulate the insertion and deletion of data from the cell state; information may enter and leave the cell through these gates. The mechanism is aided by a layer of sigmoid neural networks and a pointwise multiplication function. An LSTM's sigmoid layer outputs values in the range of 0 to 1, where 0 means that nothing should pass through and 1 means that everything should.

The LSTM neural network is specifically designed to study all the features and how they change over time. As done in the 1D-CNN model, the input is again batched by time, features, and channels, and a permute layer is used to switch between the time and the feature dimensions. However, since a channel cannot be fed as an input to the LSTM network (mono-channel sound being used in this research), it is combined with the feature dimension using a reshape layer.

Before entering the LSTM layers, a time-distributed dense layer is used for some initial feature learning (which is not done in standard LSTM networks), using an appropriate number of activation units (64 in this case) and a hyperbolic tangent as the activation function. For this, it uses the feature dimension (128 features considered in this case) and reduces it by half, so that some more relevant features can be learnt about the data even before progressing into the LSTM layers.

The next layer used is a Bidirectional LSTM layer, which computes the gradient-descent learning for the data by going through the time dimension in both the forward and the backward direction, which basically means that the model studies the sound data not only in a forward direction but also in a reverse manner; this helps in obtaining better gradient descent updates. This layer has a smaller number of features (32 in this case) compared to the previous layer, because LSTM networks do not need a large number of features (nodes) to learn relevant information, as opposed to the 1D/2D-CNN networks previously discussed. The output of this layer is returned as a sequence, and the feature size returned in this case is double what was provided at the input (64 in this scenario).

The next procedure in the LSTM implementation, which is quite common across networks like LSTMs, is called a skip connection, where the output features from the time-distributed dense layer are concatenated with the output of the Bidirectional LSTM layer; in this case, 64 features from the time-distributed dense layer are concatenated with 64 features from the output of the Bidirectional LSTM layer to produce a total feature set of 128. This step lets the neural network take decisions based on both the sets of features learnt before and after entering the LSTM. The last 4 epochs for the LSTM model are depicted in Figure 11.

As the network progresses, feature engineering is implemented with the help of two more dense layers, with 64 and 32 activation units respectively, with a MaxPooling and a Flattening layer in between. It is worth noting that the MaxPooling layer used in this case is a 1D-MaxPooling (instead of a 2D-MaxPooling), as the channel information was deliberately dropped beforehand when the input was reshaped. After the flattening layer, a similar approach to building a classifier is implemented as was done in the case of the 1D-CNN and 2D-CNN: a dropout and a regularization layer to minimize overfitting, and two dense layers (with 32 activation units considered in the first dense layer, and the sound classes used in this research as the features of the 2nd dense layer) for building the classifier (using a one-hot encoded matrix) and obtaining the output (prediction) using a SoftMax activation layer.
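The LSTM branch described above can likewise be reconstructed schematically in Keras (again a sketch under the same assumptions as the 1D-CNN example: Mel-spectrogram front end omitted, indicative shapes, and a plain Concatenate layer standing in for the skip connection):

from tensorflow.keras import layers, models, regularizers

N_MELS, N_FRAMES, N_CLASSES = 128, 32, 13   # indicative shapes

def build_lstm(n_classes=N_CLASSES):
    inp = layers.Input(shape=(N_MELS, N_FRAMES, 1))
    x = layers.Permute((2, 1, 3))(inp)                       # (time, features, channels)
    x = layers.Reshape((N_FRAMES, N_MELS))(x)                # fold the mono channel into the feature axis
    d = layers.TimeDistributed(layers.Dense(64, activation="tanh"))(x)    # initial feature learning, 128 -> 64
    b = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(d)   # 32 units each way -> 64 features
    x = layers.Concatenate()([d, b])                         # skip connection: 64 + 64 = 128 features
    x = layers.Dense(64, activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)                            # 1D pooling, since the channel axis was dropped
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(32, activation="relu", activity_regularizer=regularizers.l2(0.001))(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)

model = build_lstm()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])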
FIGURE 11. On running for 35 epochs, the maximum accuracy reached for the LSTM model is found to be 96.6%, with a maximum validation accuracy of 90.3%. The metric used here for selecting the best results is the validation loss. The training (train) vs. validation (test) curves can be seen in this figure.

C. TRAINING THE MODELS ON THE DATASET
The previously described dataset used in this project is subjected to a training phase with respect to all three deep learning models discussed above.
The training of each of the three models takes approximately 3 hours (approximately 9 hours for all three models) for the dataset being considered in this project.

While training the three models, a callback is implemented to log the results (validation accuracy/loss and training accuracy/loss) from each epoch (35 epochs are considered for each model) into a specific CSV file for each of the models (used later for visualization of the validation and training accuracies). Along with this, a checkpointer is implemented which monitors a specified parameter (set as the validation loss during this research; it can be set as the validation accuracy as well) from the values logged, and the model is saved accordingly. Since the monitored parameter was set as the validation loss, the weights with the lowest validation loss are saved for each of the models being trained (1D-CNN, 2D-CNN, LSTM), so that the models perform in the best possible way during the actual predictions.
On training, the 1D-CNN model turned out to have the least overfitting. This can be attributed to the fact that the features considered in the 1D-CNN are a lot weaker as compared to the 2D-CNN or the LSTM model. However, looking from another perspective, the LSTM network can also be regarded as the best of the three networks built here, even though the deviation between the training and testing accuracies is higher than in the 1D-CNN. This gap would reduce further in the LSTM as the size of the data increases, whereas the 1D-CNN would perform poorly in that case. The following plots show how the three models perform on training, with respect to their training and testing (validation) accuracies.

From Figures 9, 10, and 11, although the 1D-CNN model seems to overfit the least for the given dataset and the number of classes considered as compared to the other two models (the gap between training and testing is the smallest), the LSTM model, which achieved the highest training accuracy among the three (96.6%), would perform the best when the size of the dataset and the number of classes considered increase. Once the three models are trained on the given dataset, they are saved with their training information in the ''.h5d'' format. The user is able to choose the model to be used for audio class prediction on the real-world audio signal by specifying the path to the necessary ''.h5d'' file (for 1D-CNN, 2D-CNN, or LSTM) saved during the training phase.

D. THE PREDICTION MODEL
Since the prediction module takes in real-world audio data for identifying the class of the sound, it needs to pre-process the data before running the analysis on it. For this purpose, the sound data received, after the recording module records and saves the audio data, is down-sampled to a mono channel with a sampling rate of 16,000 Hz. This is similar to the down-sampling step performed during the data cleaning procedure. Moreover, just like the non-silent customized signal envelope created during data pre-processing, the same step is performed on the live-recorded audio data, so that while performing prediction on the signal, the maximum of the non-silent part of the signal is considered.

For running the prediction on the live recorded audio, after all the above pre-processing, the audio data is batched up so that predictions can be made using the argmax function of NumPy on the output of the previously mentioned one-hot encoded probabilities. The prediction is then achieved by taking 1-second intervals (specified as the delta time previously) within the audio on which the prediction is run, i.e., it considers every single second of the audio, sums all the probabilities, and takes their average.
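A sketch of this averaging step (function and variable names are hypothetical; the model is assumed to contain its own Mel-spectrogram front end and to accept a (batch, channels, samples) input, as described earlier):

import numpy as np

def predict_class(model, wav, sr=16000, delta_time=1.0, class_names=None):
    """Average the per-second class probabilities of a mono 16 kHz signal and take the argmax."""
    step = int(sr * delta_time)
    probs = []
    for start in range(0, len(wav) - step + 1, step):     # assumes the clip is at least one second long
        window = wav[start:start + step].astype(np.float32)
        batch = window.reshape(1, 1, step)                 # (batch, channels, samples)
        probs.append(model.predict(batch, verbose=0)[0])
    mean_probs = np.mean(probs, axis=0)                    # average of the one-hot style probabilities
    idx = int(np.argmax(mean_probs))                       # NumPy argmax on the averaged output
    return class_names[idx] if class_names else idx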
Furthermore, it is configured in the prediction module itself that if the class of the recorded audio is identified as one related to a potentially hostile environment (like ''fire crackling'', ''glass breaking'', ''gunshot'', ''scream'', etc.), then an automatic alert message is immediately sent to the registered contact via e-mail, SMS, and WhatsApp. The e-mail message also contains the audio file saved from the live recording, which the audio recording module records and saves before prompting the prediction module to identify its class.
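Only the e-mail leg of the alert is sketched here, using Python's standard smtplib; SMS and WhatsApp delivery depend on third-party gateways that the paper does not name. The SMTP server, credentials, and addresses below are placeholders.

import smtplib
from email.message import EmailMessage

def send_alert_email(audio_path, predicted_class, sender, password, recipient):
    """E-mail the registered contact, attaching the recorded clip as evidence."""
    msg = EmailMessage()
    msg["Subject"] = f"ALERT: possible threat detected ({predicted_class})"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("A potentially dangerous sound was detected. The recording is attached.")
    with open(audio_path, "rb") as f:
        msg.add_attachment(f.read(), maintype="audio", subtype="wav", filename="recording.wav")
    with smtplib.SMTP_SSL("smtp.example.com", 465) as server:   # placeholder SMTP host
        server.login(sender, password)
        server.send_message(msg)

# send_alert_email("recording.wav", "gunshot", "alerts@example.com", "app-password", "contact@example.com")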
E. THE LIVE AUDIO RECORDING AND PREDICTION MODULE
The audio recording module records live audio from the environment and stops the recording when no more noise comes in. This module also performs a set of other pre-processing steps on the recorded audio, apart from the prediction module, which is invoked later.

As part of pre-processing, the recorded audio is trimmed of any silence at both ends of the recording; then it is normalized, i.e., the volume of the audio signal is averaged over the entire length of the recording; and then an audio padding of 0.5 seconds is added to both ends of the recording so that different media players can play the audio without losing any crucial audio data which might be present at the very beginning or end of the audio signal.

After the pre-processing is done, the audio data is saved as a ''.wav'' file at a specified location, and the prediction module is automatically prompted to run the prediction to identify the class of the recorded audio, using the deep learning model specified by the user (among the three models built during this research). After the prediction module is called by the audio recording module, it performs all of these steps automatically.
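The trimming, normalization, and 0.5-second padding can be sketched with pydub (an assumption, since the paper does not name the audio library used by this module; the silence threshold below is likewise a placeholder):

from pydub import AudioSegment, effects
from pydub.silence import detect_leading_silence

def preprocess_recording(in_path, out_path, pad_ms=500, silence_thresh_db=-40.0):
    audio = AudioSegment.from_wav(in_path)

    # Trim silence from both ends of the recording
    lead = detect_leading_silence(audio, silence_threshold=silence_thresh_db)
    tail = detect_leading_silence(audio.reverse(), silence_threshold=silence_thresh_db)
    trimmed = audio[lead:len(audio) - tail]

    # Normalize the volume over the whole clip
    normalized = effects.normalize(trimmed)

    # Add 0.5 s of silence to each end so media players do not clip the start or end
    padding = AudioSegment.silent(duration=pad_ms)
    (padding + normalized + padding).export(out_path, format="wav")

# preprocess_recording("live_recording_raw.wav", "live_recording.wav")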
V. RESULTS
The problem statement of this research was to identify the type of real-world audio data using deep learning.
diversify the end-deliverables of the project. For instance, The recorded audio along with timestamp and geographical
the location of the victim can also be shared along with the location could also serve as evidence against the person
alert message, which can be done with further research on the committing crime.
intricacies of the technical implementation. Likewise, further
functionalities can be added as well in order to improve VI. CONCLUSION
the communication about the victim’s distress, the assistance A new software-based threat identification system has been
provided to him/her during the time of crisis, and the time developed to the dangerous situation of an individual from
taken from the first distress signal for the help to reach the his/her ambient noise, and provide immediate assistance
victim. to the victim by automatically informing their emergency
To summarize, on one hand, the research leverages the contacts about the situation. The functionality of record-
power of the Fourier Transform and Mel-Spectrogram anal- ing the live audio on the victim’s side, identifying the
ysis to facilitate the development of three different models type of the sound and predicting threatful situation is car-
(1D- CNN, 2D-CNN, and LSTM) to detect threats from ried out using the 1D-CNN, 2D-CNN, and LSTM models
a person’s surroundings and alert register/emergency ser- to achieve an accuracy of 95.2\%, 96.3\%, and 96.6\%
vices through email, SMS, and WhatsApp, it also has a respectively, with an average ac-curacy of 96.03\%. In addi-
few limitations. Other complex models, the likes of which tion to providing the user with three options to choose
(GRU, Bi-GRU, CTRNN, HRNN etc.) have already been from for the prediction of the class of the live audio, the
mentioned before, can also be used to yield even bet- choice of more than one model becomes useful as and
ter results and leaves a scope for further research on the when the size and complexity of the dataset and the live
topic. recorded audio increases. In conclusion, the requirement of
Although email, SMS, and WhatsApp are great ways to the research has been successfully achieved, fulfilling the
communicate about a person’s whereabouts, the research can target of providing an essential solution to one of the great-
be furthered to include more information about the victim, est practical problems humankind faces in today’s world
like their location, type of distress, intensity of the emer- i.e., threat detection and alert system, all while explor-
gency, physical parameters (like heart rate, oxygen level etc.), ing and delivering on the domain of deep learning and
environmental parameters (like altitude, temperature etc.), audio analysis for the detection of live events from ambient
which in turn would allow the emergency services to pre- sounds.
pare better for the rescue of the victim from their perilous
situation. REFERENCES
A very distinct advantage of the proposed system is that [1] T. P. Suma and G. Rekha, ‘‘Study on IoT based women safety devices with
screaming detection and video capturing,’’ Int. J. Eng. Appl. Sci. Technol.,
it does not require any special external hardware/wearable vol. 6, no. 7, pp. 257–262, 2021.
devices (like apart watches), but can be implemented with [2] P. Zinemanas, M. Rocamora, M. Miron, F. Font, and X. Serra, ‘‘An inter-
the help of a mere smartphone configured accordingly. pretable deep learning model for automatic sound classification,’’ Elec-
tronics, vol. 10, no. 7, p. 850, Apr. 2021.
In most modern IoT (Internet of Things) systems, the sys- [3] D. G. Monisha, M. Monisha, G. Pavithra, and R. Subhashini, ‘‘Women
tem comes with associated hardware, like smartwatches [33], safety device and application-FEMME,’’ Indian J. Sci. Technol., vol. 9,
which constantly monitor a person’s health parameters. no. 10, pp. 1–6, Mar. 2016.
[4] G. Ciaburro and G. Iannace, ‘‘Improving smart cities safety using sound
A person would not have to carry bulky hardware around events detection based on deep neural network algorithms,’’ Informatics,
with them, but only remember to carry their phones with vol. 7, no. 3, p. 23, Jul. 2020.
them. However, whenever, more and more modules would [5] J. Cao, M. Cao, J. Wang, C. Yin, D. Wang, and P.-P. Vidal, ‘‘Urban noise
recognition with convolutional neural network,’’ Multimedia Tools Appl.,
be tried to be integrated along with the existing sys- vol. 78, no. 20, pp. 29021–29041, Oct. 2019.
tem, the inclusion of a special hardware might become an [6] A. Triantafyllopoulos, G. Keren, J. Wagner, I. Steiner, and B. W. Schuller,
issue. ‘‘Towards robust speech emotion recognition using deep residual net-
works for speech enhancement,’’ in Proc. INTERSPEECH, Graz, Austria,
The benefits of the proposed solution, based on the obtained results, are as follows. As crime rates and the frequency of natural and man-made disasters have increased significantly in recent years, it is of prime importance to bring emergency help and services to the victim as quickly as possible, with immediate and accurate information about the victim's situation shared with them automatically. The proposed solution helps to avert hazards such as robbery, homicide, or other threats to an individual, because it accurately classifies the sound/noise in the victim's surroundings in minimal time and alerts the victim's emergency contact list. Through such accurate and timely detection, crime rates could also fall, as offenders would fear being caught by the police.
The classification of the audio events was carried out using the 1D-CNN, 2D-CNN, and LSTM models, achieving accuracies of 95.2%, 96.3%, and 96.6%, respectively, with an average accuracy of 96.03%. In addition to giving the user three options to choose from when predicting the class of the live audio, having more than one model becomes useful as the size and complexity of the dataset and of the live recorded audio increase.
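As a rough illustration of this multi-model option, the hedged sketch below loads three separately trained classifiers and dispatches a prediction to whichever one the user selects; the file paths and the shape of the feature window are assumptions for illustration rather than the exact interface of the prototype.

# Hypothetical model-selection wrapper; file paths and feature shapes are
# assumed, not taken from the prototype itself.
import numpy as np
import tensorflow as tf

MODEL_PATHS = {                       # assumed locations of the trained models
    "1d_cnn": "models/cnn_1d.h5",
    "2d_cnn": "models/cnn_2d.h5",
    "lstm":   "models/lstm.h5",
}

def predict_event(features: np.ndarray, choice: str = "lstm") -> int:
    """Classify one feature window with whichever model the user selected."""
    model = tf.keras.models.load_model(MODEL_PATHS[choice])
    probs = model.predict(features[np.newaxis, ...], verbose=0)
    return int(np.argmax(probs, axis=-1)[0])    # index of the predicted class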
In conclusion, the requirement of the research has been successfully achieved, fulfilling the target of providing an essential solution to one of the greatest practical problems humankind faces in today's world, i.e., a threat detection and alert system, while exploring and delivering on the domain of deep learning and audio analysis for the detection of live events from ambient sounds.

REFERENCES
[1] T. P. Suma and G. Rekha, ‘‘Study on IoT based women safety devices with screaming detection and video capturing,’’ Int. J. Eng. Appl. Sci. Technol., vol. 6, no. 7, pp. 257–262, 2021.
[2] P. Zinemanas, M. Rocamora, M. Miron, F. Font, and X. Serra, ‘‘An interpretable deep learning model for automatic sound classification,’’ Electronics, vol. 10, no. 7, p. 850, Apr. 2021.
[3] D. G. Monisha, M. Monisha, G. Pavithra, and R. Subhashini, ‘‘Women safety device and application-FEMME,’’ Indian J. Sci. Technol., vol. 9, no. 10, pp. 1–6, Mar. 2016.
[4] G. Ciaburro and G. Iannace, ‘‘Improving smart cities safety using sound events detection based on deep neural network algorithms,’’ Informatics, vol. 7, no. 3, p. 23, Jul. 2020.
[5] J. Cao, M. Cao, J. Wang, C. Yin, D. Wang, and P.-P. Vidal, ‘‘Urban noise recognition with convolutional neural network,’’ Multimedia Tools Appl., vol. 78, no. 20, pp. 29021–29041, Oct. 2019.
[6] A. Triantafyllopoulos, G. Keren, J. Wagner, I. Steiner, and B. W. Schuller, ‘‘Towards robust speech emotion recognition using deep residual networks for speech enhancement,’’ in Proc. INTERSPEECH, Graz, Austria, Sep. 2019.
[7] M. T. García-Ordás, H. Alaiz-Moretón, J. A. Benítez-Andrades, I. García-Rodríguez, O. García-Olalla, and C. Benavides, ‘‘Sentiment analysis in non-fixed length audios using a fully convolutional neural network,’’ Biomed. Signal Process. Control, vol. 69, Aug. 2021, Art. no. 102946.
[8] A. Bhardwaj, P. Khanna, S. Kumar, and Pragya, ‘‘Generative model for NLP applications based on component extraction,’’ Proc. Comput. Sci., vol. 167, pp. 918–931, Jan. 2020.
[9] I. S. Malova and D. V. Tikhomirova, ‘‘Recognition of emotions in verbal messages based on neural networks,’’ Proc. Comput. Sci., vol. 190, pp. 560–563, Jan. 2021.
[10] A. Hodorog, I. Petri, and Y. Rezgui, ‘‘Machine learning and natural language processing of social media data for event detection in smart cities,’’ Sustain. Cities Soc., vol. 85, Oct. 2022, Art. no. 104026.
[11] D. D. Nguyen, M. S. Dao, and T. V. T. Nguyen, ‘‘Natural language processing for social event classification,’’ in Knowledge and Systems Engineering. Cham, Switzerland: Springer, 2015, pp. 79–91.
[12] M. A. Sit, C. Koylu, and I. Demir, ‘‘Identifying disaster-related tweets and their semantic, spatial and temporal context using deep learning, natural language processing and spatial analysis: A case study of Hurricane Irma,’’ Int. J. Digit. Earth, vol. 12, no. 11, pp. 1205–1229, Nov. 2019.
[13] D. Tsiktsiris, A. Vafeiadis, A. Lalas, M. Dasygenis, K. Votis, and D. Tzovaras, ‘‘A novel image and audio-based artificial intelligence service for security applications in autonomous vehicles,’’ Transp. Res. Proc., vol. 62, pp. 294–301, Jan. 2022.
[14] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, ‘‘PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,’’ IEEE/ACM Trans. Audio, Speech, Language Process., vol. 28, pp. 2880–2894, 2020.
[15] L. Yang and H. Zhao, ‘‘Sound classification based on multihead attention and support vector machine,’’ Math. Problems Eng., vol. 2021, pp. 1–11, May 2021.
[16] Urban Sound Datasets. Accessed: Dec. 1, 2023. [Online]. Available: https://urbansounddataset.weebly.com/download-urbansound8k.html
[17] Environmental Sound Classification 50. Accessed: Dec. 1, 2023. [Online]. Available: https://www.kaggle.com/datasets/mmoreaux/environmental-sound-classification-50?select=audio
[18] Audio Dataset of Scream and Non Scream. Accessed: Dec. 1, 2023. [Online]. Available: https://www.kaggle.com/datasets/aananehsansiam/audio-dataset-of-scream-and-non-scream
[19] Writing Your Own Callbacks. Accessed: Dec. 1, 2023. [Online]. Available: https://www.tensorflow.org/guide/keras/custom_callback
[20] The Architecture of LeNet-5. Accessed: Dec. 1, 2023. [Online]. Available: https://www.analyticsvidhya.com/blog/2021/03/the-architecture-of-lenet-5/
[21] Understanding the Mel Spectrogram. Accessed: Dec. 1, 2023. [Online]. Available: https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53
[22] Kapre. Accessed: Dec. 1, 2023. [Online]. Available: https://kapre.readthedocs.io/
[23] Google Search by Voice: A Case Study. Accessed: Dec. 1, 2023. [Online]. Available: https://research.google.com/pubs/archive/36340.pdf
[24] M. Assefi, G. Liu, M. P. Wittie, and C. Izurieta, ‘‘An experimental evaluation of Apple Siri and Google speech recognition,’’ in Proc. ISCA SEDE, 2015, p. 118.
[25] A. L. Nobles, E. C. Leas, T. L. Caputi, S.-H. Zhu, S. A. Strathdee, and J. W. Ayers, ‘‘Responses to addiction help-seeking from Alexa, Siri, Google Assistant, Cortana, and Bixby intelligent virtual assistants,’’ npj Digit. Med., vol. 3, no. 1, p. 11, Jan. 2020.
[26] I. Lopatovska, K. Rink, I. Knight, K. Raines, K. Cosenza, H. Williams, P. Sorsche, D. Hirsch, Q. Li, and A. Martinez, ‘‘Talk to me: Exploring user interactions with the Amazon Alexa,’’ J. Librarianship Inf. Sci., vol. 51, no. 4, pp. 984–997, Dec. 2019.
[27] A. Farzindar, D. Inkpen, and G. Hirst, Natural Language Processing for Social Media. San Rafael, CA, USA: Morgan & Claypool, 2015.
[28] S. Zad, M. Heidari, J. H. J. Jones, and O. Uzuner, ‘‘Emotion detection of textual data: An interdisciplinary survey,’’ in Proc. IEEE World AI IoT Congr. (AIIoT), May 2021, pp. 0255–0261.
[29] W. Graterol, J. Diaz-Amado, Y. Cardinale, I. Dongo, E. Lopes-Silva, and C. Santos-Libarino, ‘‘Emotion detection for social robots based on NLP transformers and an emotion ontology,’’ Sensors, vol. 21, no. 4, p. 1322, Feb. 2021.
[30] Z. T. Sworna, Z. Mousavi, and M. A. Babar, ‘‘NLP methods in host-based intrusion detection systems: A systematic review and future directions,’’ J. Netw. Comput. Appl., vol. 220, Nov. 2023, Art. no. 103761.
[31] T. Özseven, ‘‘Investigation of the effectiveness of time-frequency domain images and acoustic features in urban sound classification,’’ Appl. Acoust., vol. 211, Aug. 2023, Art. no. 109564.
[32] R. Viveros-Muñoz, P. Huijse, V. Vargas, D. Espejo, V. Poblete, J. P. Arenas, M. Vernier, D. Vergara, and E. Suárez, ‘‘Dataset for polyphonic sound event detection tasks in urban soundscapes: The synthetic polyphonic ambient sound source (SPASS) dataset,’’ Data Brief, vol. 50, Oct. 2023, Art. no. 109552.
[33] A. B. Shrestha, B. Khanal, N. Mainali, S. Shrestha, S. Chapagain, T. P. Umar, and V. Jaiswal, ‘‘Navigating the role of smartwatches in cardiac fitness monitoring: Insights from physicians and the evolving landscape,’’ Current Problems Cardiology, vol. 49, no. 1, Jan. 2024, Art. no. 102073.
[34] P. Upretee and M. E. Yüksel, ‘‘Accurate classification of heart sounds for disease diagnosis by using spectral analysis and deep learning methods,’’ in Data Analytics in Biomedical Engineering and Healthcare. New York, NY, USA: Academic, 2021, pp. 215–232.
[35] K. Presannakumar and A. Mohamed, ‘‘Deep learning based source identification of environmental audio signals using optimized convolutional neural networks,’’ Appl. Soft Comput., vol. 143, Aug. 2023, Art. no. 110423.
[36] A. M. Tripathi and A. Mishra, ‘‘Self-supervised learning for environmental sound classification,’’ Appl. Acoust., vol. 182, Nov. 2021, Art. no. 108183.
[37] M. Mohaimenuzzaman, C. Bergmeir, I. West, and B. Meyer, ‘‘Environmental sound classification on the edge: A pipeline for deep acoustic networks on extremely resource-constrained devices,’’ Pattern Recognit., vol. 133, Jan. 2023, Art. no. 109025.
[38] S. Dong, Z. Xia, X. Pan, and T. Yu, ‘‘Environmental sound classification based on improved compact bilinear attention network,’’ Digit. Signal Process., vol. 141, Sep. 2023, Art. no. 104170.
AMRIT SEN received the B.Tech. degree in computer science and engineering with a specialization in cyber-physical systems from the Vellore Institute of Technology, Chennai. His research interests include data science, machine learning, and deep learning. During the course, he worked on several in-house projects, including a deep learning project titled ‘‘A Brain Tumour Segmentation and Classification System Using Mini Batch K-Means Clustering and CNN.’’ He also carried out an industrial research internship with Samsung, where he worked on a team project titled ‘‘Graph Embedding Generation for Link Prediction and User Classification.’’

GAYATHRI RAJAKUMARAN received the bachelor's degree from the Rajiv Gandhi College of Engineering and Technology, in 2009, the master's degree from the Pondicherry Engineering College, in 2011, and the Ph.D. degree from the Vellore Institute of Technology (VIT), Chennai, in 2020, with a specialization in cloud security. She is currently an Assistant Professor (Senior) with the Department of Computer Science and Engineering, VIT Chennai. Her research interests include cloud security, information and cyber security, the IoT, and machine learning. She has published numerous articles in highly indexed journals and holds two patents related to the agriculture and IoT domains. She is a Reviewer of The Journal of Supercomputing. She has served as an editor and author of the books Grid and Cloud Computing, Cloud Computing, and Cloud Security. She was the Sponsorship Chair of the International Conference on Big Data and Cloud Computing (ICBCC) 2018 and attracted funding from a government funding agency. She is currently the Linux Club Co-Coordinator with VIT Chennai.

MIROSLAV MAHDAL is currently the Vice-Dean for Science, Research and Doctoral Studies with the Faculty of Mechanical Engineering, VSB-Technical University of Ostrava, and an Associate Professor with the Department of Control Systems and Instrumentation. His research interests include the control of mechatronic systems, control systems, automatic control theory, wireless technologies, artificial intelligence, cloud computing, optimization methods, and the programming of control systems. He has more than 80 articles to his credit.
SHOLA USHARANI received the Ph.D. degree from the Vellore Institute of Technology (VIT), Chennai, in 2020. She is currently an Associate Professor with the School of Computing Science and Engineering, VIT Chennai. Her research interests include embedded systems, the IoT, machine learning, computer networks, cloud computing, and security. She has published articles in Scopus-indexed journals and has guided numerous UG and PG projects in the areas of embedded systems, the IoT, machine learning, and security. She is an active ACM Member. She received a Research Award from 2014 to 2015. She is a Reviewer for journals and book series, such as those of IGI Global. She is the Android Club Coordinator with VIT University.

RAJIV VINCENT received the master's degree in computer science and engineering from the College of Engineering Guindy, Anna University, Chennai, India. He is currently an Assistant Professor (Senior) with the School of Computing Science and Engineering, Vellore Institute of Technology (VIT), Chennai. He has been an academician for the past 11 years and was a system administrator for two years. He has published a book on machine learning titled Image Processing for Machine Learning (ISBN: 978-93-5445-509-4). He has published many research articles in reputed Scopus-indexed and Web of Science journals, holds two Indian patents, and has been granted one international patent. His academic and research expertise covers a wide range of subject areas, including deep learning, image processing, and web technologies.