
2023 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE)

March 9-10, 2023, Amity University Dubai, UAE

Emotion Based Music Recommendation System


Micah Mariam Joseph, Diya Treessa Varghese, Lipsa Sadath, Ved Prakash Mishra
Department of Engineering, Amity University, Dubai, United Arab Emirates
[email protected] [email protected] [email protected] [email protected]

2023 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE) | 979-8-3503-3826-3/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICCIKE58312.2023.10131874

Abstract— A person often finds it difficult to choose which music to listen to from among different collections of music. A variety of mood-dependent suggestion frameworks have been made available for topics including music, festivals and celebrations. Our music recommendation system's main goal is to offer users recommendations that match their preferences. Understanding the user's present facial expression enables us to predict the user's emotional state, as humans frequently use their facial expressions to convey their intentions. More than 60% of users have at some point felt that their music playlists contain so many songs that they are unable to choose one to play. A suggestion system can help a user decide which music to listen to, allowing the user to feel less stressed. This work is a study on how to track and match the user's mood using a face detection mechanism, saving the user the time spent searching or looking up music. Emotion detection is performed with deep learning, a well-known approach in the facial recognition arena, and the convolutional neural network algorithm has been used for the facial recognition. We use Streamlit, an open-source app framework, to turn the model into a web application. The user's expression is captured using a webcam, the user is shown songs that match his or her mood, and appropriate music is then played according to that mood or emotion.

Keywords— emotion recognition, convolution neural network, streamlit.

I. INTRODUCTION

Nowadays, music services make vast amounts of music easily accessible. People are constantly attempting to enhance music arrangement and search management in order to alleviate the difficulty of selection and make discovering new music easier. Recommendation systems are becoming increasingly common, allowing users to choose acceptable music for any circumstance. Music recommendations can be used in a range of situations, including music therapy, sports, studying, relaxing, and supporting mental and physical activity [1]. However, in terms of personalization and emotion-driven recommendations, there is still a gap.

Humans have been massively influenced by music. Music is a key factor in various effects on humans, such as controlling mood, aiding relaxation, supporting mental and physical work, and helping with stress relief, and music therapy can be used in a variety of clinical contexts and practices to help people feel better. In this project, we are creating a web application that recommends music based on emotions. Emotion influences how people live and interact with one another; at times, it can seem that we are controlled by our emotions. The emotion we are experiencing at any given moment affects the decisions we make, the actions we undertake, and the impressions we form. Neutral, angry, disgust, fear, glad, sad, and surprise are the seven primary universal emotions, and the look on a person's face can reveal these basic emotions. This study presents a method for detecting these basic universal emotions from frontal facial expressions. After implementing the facial recognition machine learning model, we then turn it into a web application using Streamlit. Emotion detection is performed using deep learning, a well-known approach in the pattern recognition arena. The Keras library is used, along with the convolutional neural network (CNN) algorithm. A CNN is an artificial neural network with machine learning components; among other things, CNNs can be used to detect objects, perform facial recognition and process images. [2]

II. LITERATURE REVIEW

Humans frequently convey their emotions in a variety of ways, such as hand gestures, voice and tonality, but they do so mostly through facial expressions. An expert can determine the emotions being experienced by another person by observing or examining them. Nevertheless, with the technological advancement in today's world, machines are attempting to become smarter and to operate in an increasingly human-like way. By training a computer on human emotions, the machine becomes capable of performing analysis and reacting like a human. By enabling precise expression patterns with improved competence and error-free emotion calculation, data mining can assist machines in discerning emotions and acting more like humans. A music player that depends on emotions takes less time to find appropriate music that the user can resonate with. People typically have a lot of music on their playlists, which makes it difficult to choose an appropriate song. Random music does not make the user feel better, so with the aid of this technology, songs can be played automatically based on the user's mood. [3] The webcam records the user's image, and the pictures are stored; the system records the user's varied expressions to assess their emotions and select the apt music.

The ability to read a person's mood from their expression is important. A webcam is used to capture the facial expressions. This input can be used, among other things, to extract data from which a person's attitude can be inferred. Songs are suggested using the "emotion" inferred from this input, which removes the tedious job of manually classifying songs into various lists. The Facial Expression Based Music Recommender's main objective is to scan and analyze the data and then suggest music in line with the user's mood. [4]
By utilizing image processing, we have developed an emotion-based music system that allows the user to create and manage a playlist with less effort while delivering the best song for the user's current expression, giving listeners an outstanding experience. [5] The way a person is feeling can be inferred from their facial movements: the image of the person is captured using a webcam, and information is then extracted from the image.

Emotion detection research is influenced by a range of fields, including machine learning, natural language processing, neurology and others. Previous studies investigated several universal expressions of emotion in facial expressions, voice features and textual data. Categorizations of emotion include Happy, Sad, Disgust, Fury, Fear and Surprise. Later on, the image, audio and textual data are combined to improve the work; the combination of these data yields the most accurate result. [6]

A. Convolution Neural Networks

In recent years, the development of convolutional neural networks has had a considerable impact on the computer vision sector and marked a substantial step in the ability to recognize objects [6]. CNNs are a kind of neural network. Neural networks, a subtype of machine learning, are the backbone of deep learning approaches. They are composed of tiers of nodes, beginning with an input layer; each node has a distinct weight and threshold and is linked to the other nodes. A node becomes active and sends data to the network's next layer if its output exceeds a predetermined threshold; otherwise, no data is forwarded to the next tier of the hierarchy. Building a convolutional neural network is a practical approach for categorizing photographs using deep learning, and a Keras library module for Python is available that makes building a CNN incredibly simple. Computers view images as pixels, and the pixels in photos are typically connected: a certain collection of pixels can represent a pattern or an edge in an image, for instance. Convolutions exploit this to help with image recognition.

Neural networks come in several types, used for various use cases and data types [6]. For example, recurrent neural networks are mostly used for speech recognition and natural language processing, whereas convolutional neural networks (CNNs or ConvNets) are usually used for classification and other computer vision tasks. Before CNNs came into use, identifying objects in images required manual, extremely time-consuming feature extraction methods; now, for image classification and object recognition tasks, we can use convolutional neural networks, which offer a much more scalable approach. A CNN does its job by applying principles from linear algebra, notably matrix multiplication, to retrieve patterns from the image. CNNs can be computationally demanding, which is why graphics processing units (GPUs) are used for training the models [7].

Artificial intelligence includes machine learning as a subtype: we provide particular data to the system or machine, and the system derives certain patterns from the data presented. The system is then able to forecast solutions to a variety of similar problems. The neural network of the human brain is where the inspiration for the artificial neural network (NN) comes from. Computer vision is a field of artificial intelligence that focuses on image-related problems. A CNN paired with computer vision can execute various complex operations: it can classify images, provide solutions to scientific problems in astronomy, and help build self-driving vehicles [8]. A CNN is a combination of convolution layers and a neural network, and any neural network used for image processing contains various layers, such as the input layer, convolution layer, pooling layer and dense layer.

Convolution is a type of filter applied to images that helps extract features from them. The convolution uses a filter of a certain size (the default is 3×3) [8]. After the filter is created, element-wise multiplication is performed starting from the image's top-left corner; multiplying elements with the same index is known as element-wise multiplication. The computed values are added together to produce a pixel value, which is then stored in a new matrix, and this newly formed matrix is used in further processing [9]. Following the application of convolutions, there is a notion referred to as pooling: a method for reducing the size of an image. When building a neural network, the first convolutional layer requires the shape of the picture provided to it as input. After the image has been passed in, it is transmitted through all the convolutional and pooling layers before being sent to the dense layer [10].

B. Mediapipe

MediaPipe is an open-source framework by Google for building machine learning pipelines. Being multimodal, it can be applied to various media, including video and audio. A program that uses MediaPipe must be developed in a pipeline configuration: a pipeline is made up of components known as calculators, each connected by streams through which data packets pass. Developers can create their own applications by replacing or defining custom calculators anywhere in the pipeline [11].

C. Streamlit

Streamlit is an open-source app framework built on Python. It allows us to swiftly build data science and machine learning web apps. Scikit-learn, SymPy (LaTeX), Matplotlib, Keras, Pandas, PyTorch, NumPy and other Python libraries are compatible with it. [12]

D. OpenCV

OpenCV is a powerful tool for image processing and computer vision tasks. Object detection and object tracking are among the major functionalities of this open-source library. OpenCV plays a vital role today, as it can be used in various real-time applications; for example, it can detect faces, objects and handwriting in an image or video [13].
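The element-wise multiply-and-sum and pooling operations described above can be sketched in plain NumPy. This is an illustrative sketch, not the authors' code: the 6×6 test image and the vertical-edge kernel are assumptions chosen to make the arithmetic easy to follow.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image; at each position, multiply elements
    with the same index and sum them into a single output pixel."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2(x):
    """2x2 max pooling: keep the largest value in each 2x2 block,
    halving each spatial dimension."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "photo"
edge_kernel = np.array([[1., 0., -1.]] * 3)        # simple vertical-edge filter
features = conv2d_valid(image, edge_kernel)        # 4x4 feature map
pooled = max_pool2(features)                       # 2x2 after pooling
```

In a real CNN the kernel values are learned during training rather than hand-chosen; the sliding multiply-and-sum mechanics are the same.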
III. METHODOLOGY

This work takes into account the major challenges that a machine learning system faces; the core of the system is the data training part. The training portion of the system is trained using real data of people's facial emotions. For instance, for the system to determine an angry facial emotion, it must first be trained with the angry reaction; similarly, for the model to determine a happy facial emotion, it must first be trained with the happy emotion. To prepare the model for these emotion types, we make use of a re-training process, with re-training data assembled from the physical world. The re-training portion was the main challenge in the system, though various other parts are also challenging. Machine learning is an extremely powerful tool that allows more efficient and rapid processing of large databases, which makes emotion detection more accurate. The system is able to provide feedback in real time: the model need not wait for a final result later, and the captured photo need not be stored.

A. Data Collection

MediaPipe assigns different landmarks to different points on the face. The data contains the different landmark points of the face, with one row comprising all the key points (face key points, left-hand landmarks and right-hand landmarks), and the other samples have the same properties. We compare the differences in those landmarks during each emotion to train the model on the different emotions expressed by the user, such as happy, sad, etc. Hence, the model is able to classify each emotion passed by the user.

We use the video capture class to capture the video feed coming from the webcam. After capturing the video, the system reads each frame and shows it to the user.

We make use of the Holistic solution inside MediaPipe. The Holistic solution takes in a frame and returns all the key points, such as the face, left-hand and right-hand landmarks. The frame is first converted from the cv2 colour format to RGB, because cv2 reads images in its own integer format. Essentially, we call the process function on the Holistic object (holis), pass in the frame, and get the result out of it.

We then use the drawing utilities to draw the face landmarks, right-hand landmarks and left-hand landmarks of the result variable onto the frame, and we store all these drawings in a list.

The collected data for the various emotions is then stored in NumPy file format under specific names (such as happy.npy, sad.npy, angry.npy, neutral.npy, rock.npy and surprise.npy).

B. Data Training

Creating a Convolutional Neural Network (CNN) is an excellent method for categorizing images using deep learning, and we use the Keras library in Python to construct the CNN model. PCs recognize photos as pixels, and the pixels present in photos are normally related to each other: an edge in the image, or any related pattern, might be represented by a specific set of pixels, and convolution uses the pixels to recognize such patterns. In the convolution layer, the matrix of pixels is multiplied by a filter matrix (also known as a 'kernel') and the products are summed. After processing one portion of the pixels, the filter moves over to the next portion and processes it in the same way, continuing until the whole image has been covered.

We use a model of type Sequential, the easiest way to generate a model in Keras: the model is created layer by layer, and we use various layers in our model. The first two convolution layers work with the input images, which are represented as two-dimensional matrices. Another element of the algorithm is the activation function: we use ReLU (the rectified linear activation) for the initial two layers, as the ReLU activation has proved to work well with neural networks. A Flatten layer is present between the Conv2D layers and the dense layers; this flatten/reduce layer serves as the connection between the Conv2D layers and the Dense layer. The result is predicted based on the highest probability.

The next step is the compilation of the model. The model is compiled using three important parameters: metrics, optimizer and loss. Of the three, the learning rate is managed by the optimizer. For the loss function, categorical cross-entropy was used, a widely used option for classification: the lower the score, the better the performance of the system. To make the system's progress easier to understand while it runs, the 'accuracy' metric is used.

Initially, all the files that we created are searched (happy.npy, sad.npy, angry.npy, neutral.npy, rock.npy, surprise.npy), and the .npy file names are filtered using the split function. The files are then stored in an array (the X array), with the associated labels in another (the Y array). For example, in the case of happy.npy, all the data under happy.npy will be input to the model and stored in the X array. For a particular input, we require the model to predict something; here, that is the emotion happy, and this prediction data is present in the y variable. Once the initialization is completed, we concatenate the X and Y arrays: the input data is concatenated into the X array and the prediction targets into the y array. The name of each file has an integer label associated with it. The model then passes the data through the CNN algorithm to predict the emotion of the person.
C. Frontend

Earlier, we created a model that can detect different emotions such as sad, angry and happy. We then deploy this model into a web app. The trained model gives us a model.h5 file. It is important to note that the h5 file stores structured data (the weights and the model setup) rather than a model in and of itself; Keras stores the model in this manner because both can easily be kept in a single file. Keras is a powerful and user-friendly free open-source Python tool for building and evaluating deep learning models. The model.h5 file is used to create the web app with different Python libraries, mainly streamlit and streamlit-webrtc.

Fig 1: Streamlit Webapp

Streamlit is an open-source app toolkit based on Python. It allows us to create web applications for machine learning and data science quickly. Scikit-learn, Keras, PyTorch, SymPy (LaTeX), NumPy, Pandas, Matplotlib and other Python libraries are compatible with it [14].

Firstly, in the code we define two variables for capturing the language and the singer's name, so that songs can be recommended based on the user's choice. Then we use WebRTC to capture video of the user's face, so that songs can be recommended based on the emotions expressed. WebRTC (Web Real-Time Communication) is a technology that enables websites and web applications to capture, and possibly broadcast, audio and/or video content, as well as send arbitrary data between browsers without the need for a third party. streamlit-webrtc is a Python library used here for real-time video and audio streams over the network with Streamlit. [15]

We have used different Python libraries in our project, each performing a different function. We used load_model to load the model, and the mediapipe library to detect the landmarks of the face and hands. We used cv2 for the drawing functionality; cv2 is an open-source library that can be used for tasks such as face detection, object tracking, landmark detection and more. We also used the webbrowser module to recommend different songs from YouTube; this module provides users with a high-level interface for viewing web-based material.

WebRTC captures the video, from which we predict the emotion. [16] The predicted emotion is saved locally into a file, which we load when required; we use the numpy library to save the prediction for a particular emotion.

When the emotion is detected, a web browser tab is opened: a YouTube tab containing all the recommended songs based on the emotion, the language and the singer. To recommend songs from YouTube, we import the webbrowser module, which passes the URL. We inject all the keywords retrieved from the frontend into the URL query. [17] The user is then redirected to the YouTube page according to the URL query. For example, if we input the language as English and the singer as Taylor Swift, and the emotion is detected as happy, the web app will recommend happy songs by Taylor Swift.

Fig 2: Key words are received from the User

Fig 3: The user is redirected to the YouTube page

IV. RESULT AND DISCUSSION

The figure below shows how the model performs. After the language and singer are entered, the webcam starts capturing the emotion of the user. The captured emotion is then analyzed by the model, and the inputs "Language", "Singer" and "Emotion" are injected into the URL query. The user is then redirected to a YouTube page as required.
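Putting the training description together, the classifier behind these results might be assembled in Keras roughly as follows. This is a sketch, not the authors' code: the paper specifies only a Sequential model, two Conv2D layers with ReLU, pooling, a Flatten layer, Dense layers, categorical cross-entropy and the accuracy metric; the filter counts, the 48×48 grayscale input shape, the Adam optimizer and the seven output classes are illustrative assumptions.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Input(shape=(48, 48, 1)),                 # assumed grayscale input shape
    Conv2D(32, (3, 3), activation="relu"),    # first of the two ReLU conv layers
    MaxPooling2D((2, 2)),                     # pooling halves the feature map
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),                                # bridge between conv and dense layers
    Dense(128, activation="relu"),
    Dense(7, activation="softmax"),           # one probability per emotion class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # the loss named in the paper
              metrics=["accuracy"])             # the metric named in the paper
```

The predicted emotion is the class with the highest softmax probability, matching the "highest probability" rule described in Section III-B.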
Fig 4: Final Result

V. CONCLUSION AND FUTURE WORK

One of the essential areas of study is the identification of emotions from facial expressions, which has attracted a lot of interest. The difficulty of emotion recognition using image processing algorithms has been growing daily, and by utilizing various features and image processing techniques, researchers are constantly looking for solutions to this problem.

In this paper, we have implemented a system in which two predicates, language and singer, are used to understand the preference of the user. Once the predicates are entered, the webcam starts to capture the image of the user; the captured emotion is then analyzed by the model, and the inputs "Language", "Singer" and "Emotion" are injected into the URL query. The user is then redirected to a YouTube page as required.

This study presents a method for detecting the basic universal emotions from frontal facial expressions. After implementing the facial recognition machine learning model, we turn it into a web application using Streamlit. Emotion detection is performed using deep learning, a well-known approach in the field of pattern recognition. The Keras library is used, along with the convolutional neural network (CNN) algorithm. A CNN is an artificial neural network that includes machine learning components; among other things, CNNs can be used to detect objects, perform facial recognition and process images.

Some modifications that could be made to this system are:
 Add advice to help the user according to his or her emotions (for example, if the system detects that the user's emotion is sad, it would provide motivational quotes or other advice to cheer up the user).
 Provide small activities that would help improve the mood of the person.
 Improve the face detection accuracy.

VI. REFERENCES

[1] Rumiantcev, M. and Khriyenko, O., 2020. Emotion based music recommendation system. In Proceedings of Conference of Open Innovations Association FRUCT. FRUCT Oy.
[2] Ali, M.F., Khatun, M. and Turzo, N.A., 2020. Facial emotion detection using neural network. International Journal of Scientific and Engineering Research.
[3] Dureha, A., 2014. An accurate algorithm for generating a music playlist based on facial expressions. International Journal of Computer Applications, 100, pp. 33-39.
[4] James, H.I., Arnold, J.J.A., Ruban, J.M.M., Tamilarasan, M. and Saranya, R., 2019. Emotion based music recommendation system. Emotion, 6(03).
[5] Gupte, A., Naganarayanan, A. and Krishnan, M. Emotion based music player - XBeats. International Journal of Advanced Engineering Research and Science, 3, 236854.
[6] Ruchika, A. V. Singh, and M. Sharma, "Building an effective recommender system using machine learning based framework," in 2017 International Conference on Infocom Technologies and Unmanned Systems (Trends and Future Directions) (ICTUS), Dec 2017, pp. 215-219.
[7] L. Shou-Qiang, Q. Ming, and X. Qing-Zhen, "Research and design of hybrid collaborative filtering algorithm scalability reform based on genetic algorithm optimization," in 2016 6th International Conference on Digital Home (ICDH), Dec 2016, pp. 175-179.
[8] Will Hill, Larry Stead, Mark Rosenstein, George Furnas, and South Street. Recommending and evaluating choices in a virtual community of use. Mosaic: A Journal for the Interdisciplinary Study of Literature, pages 5-12, 1995.
[9] M.A. Casey, Remco Veltkamp, Masataka Goto, Marc Leman, Christophe Rhodes, and Malcolm Slaney. Content-based music information retrieval: current directions and future challenges. Proceedings of the IEEE, 96(4):668-696, 2008.
[10] Qing Li, Byeong Man Kim, Dong Hai Guan, and Duk Oh. A music recommender based on audio features. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 532-533, Sheffield, United Kingdom, 2004. ACM.
[11] Bhat, A. S., Amith, V. S., Prasad, N. S., & Mohan, M. (2014). An efficient classification algorithm for music mood detection in western and Hindi music using audio feature extraction. 2014 Fifth International Conference on Signal and Image Processing, pp. 359-364.
[12] Talele, M., Gurnani, Y., Rochani, H., Patil, M. and Soneja, K. Smart music player using mood detection.
[13] Ninad Mehendale, "Facial emotion recognition using convolutional neural networks (FERC)," 18 February 2020.
[14] Fan, X., Zhang, F., Wang, H., & Lu, X. (2012). The system of face detection based on OpenCV. In 24th Chinese Control and Decision Conference (CCDC), Taiyuan, China. IEEE.
[15] Gilda, S., Zafar, H., Soni, C., & Waghurdekar, K. (2017). Smart music player integrating facial emotion recognition and music mood recommendation. In 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India. IEEE.
[16] V. Bhandiwad, B. Tekwani. Face recognition and detection using neural networks. In International Conference on Trends in Electronics and Informatics (ICEI), Tirunelveli, India.
[17] MediaPipe Team, Face Mesh. MediaPipe, 2020 [online].

Authorized licensed use limited to: Amrita School of Engineering. Downloaded on September 04,2023 at 04:11:50 UTC from IEEE Xplore. Restrictions apply.