
MEDLEY: A REAL-TIME MUSIC PLAYER BASED ON EMOTION RECOGNITION

A PROJECT REPORT

Submitted by

SAKETH V (18113081)
AMAN PRASAD (18113104)
SAURABH JAISWAL (18113078)
Under the guidance of

DR. T. SUDALAI MUTHU


Professor

in partial fulfillment for the award of the degree of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING

HINDUSTAN INSTITUTE OF TECHNOLOGY AND SCIENCE


CHENNAI - 603 103

MAY 2022
BONAFIDE CERTIFICATE

Certified that this project report, Medley: A Real-Time Music Player Based on Emotion Recognition, is the bonafide work of Aman Prasad (18113104), Saketh V (18113081), and Saurabh Jaiswal (18113078), who carried out the project work under my supervision during the academic year 2021-2022.

Dr. J. THANGAKUMAR,                        DR. T. SUDALAI MUTHU,
HEAD OF DEPARTMENT                         SUPERVISOR
Department of CSE                          Professor, Department of CSE

INTERNAL EXAMINER EXTERNAL EXAMINER


Name: Name:

Designation: Designation:

Project Viva - voce conducted on


TABLE OF CONTENTS

CHAPTER NO.    TITLE    PAGE NO.
Acknowledgement i
Dedication ii
Abstract iii
List of Figures iv
List of Abbreviations v

1 INTRODUCTION 1
1.1 Overview 1
1.2 Motivation for the project 1
1.3 Problem Definition and Scenarios 2
1.4 Organization of the report 3
1.5 Summary 3

2 LITERATURE REVIEW 4
2.1 Introduction 4
2.2 Music Player for Android 4
2.3 Facial Expression Recognition Algorithm 5
2.4 Directing Physiology and through Music 5
2.5 Mood based Music Suggestion System 5
2.6 Emotion Based Music Player 6
2.7 Summary 6

3 PROJECT DESCRIPTION 7
3.1 Objective of the Project work 7
3.2 Existing System 7
3.3 Shortcomings of Existing System 8
3.4 Proposed System 8
3.5 Benefits of Proposed System 9

4 SYSTEM DESIGN 10
4.1 Architecture Diagram 10
4.2 Sequence Diagram 11
4.3 Use case Diagram 12
4.4 Activity Diagram 13

5 PROJECT REQUIREMENTS 14
5.1 Hardware and Software Specification 14
5.2 Technologies Used 15

6. MODULE DESCRIPTION 16
6.1 Modules 16
6.2 Face Recognition 16
6.3 Emotion Detection 20
6.4 Song Recommendation 23

7. IMPLEMENTATION 24
7.1 User Interface 24
7.2 Machine Learning Model 25
7.3 Backend 30

8. RESULT ANALYSIS 31
8.1 Results obtained 31

9. CONCLUSION AND FUTURE WORK 34


9.1 Conclusion 34
9.2 Future Work 35

10. INDIVIDUAL TEAM MEMBERS' REPORT 36
10.1 Individual Objective 36
10.2 Role of the Team Members 36
10.3 Contribution of Team Members 37

REFERENCES 38

APPENDIX A: SAMPLE SCREEN

APPENDIX B: SAMPLE CODE

APPENDIX C: PLAGIARISM REPORT

APPENDIX D: FUNDING / PATENT / PUBLICATION DETAILS

APPENDIX E: TEAM DETAILS


ACKNOWLEDGEMENT

First and foremost, we would like to thank ALMIGHTY who has provided
us the strength to do justice to our work and contribute our best to it.

We wish to express our deep sense of gratitude from the bottom of our
heart to our guide DR. T. Sudalai Muthu, Professor, Computer Science
and Engineering, for his motivating discussions, overwhelming
suggestions, ingenious encouragement, invaluable supervision, and
exemplary guidance throughout this project work.

We would like to extend our heartfelt gratitude to Dr. J. Thangakumar, Ph.D., Professor & Head, Department of Computer Science and Engineering, for his valuable suggestions and support in successfully completing the project.

We wish to thank our Project Coordinator and Panel members for keeping our project on the right track. We would also like to thank all the teaching, technical and non-technical staff of the Department of Computer Science and Engineering for their courteous assistance.

We thank the management of HINDUSTAN INSTITUTE OF TECHNOLOGY AND SCIENCE for providing us the necessary facilities and support required for the successful completion of the project.

As a final word, we would like to thank each and every individual who has been a source of support and encouragement and helped us achieve our goal and complete our project work successfully.

DEDICATION

This project is dedicated to our beloved parents, for their love,

Endless support, encouragement and sacrifices.

ABSTRACT
Human beings are social animals. Unlike other animals, which cannot convey all of their feelings through their facial expressions, humans can express their mental states through their faces. These facial expressions are very important cues in identifying a person's feelings and intentions. Humans can identify these cues very well, and recent developments and significant research in the fields of Artificial Intelligence (AI) and computer vision have now made it possible for a computer to identify such tiny details and label them as known human behaviors. These details can be identified from the live feed of a digital camera. Music has been a source of entertainment since early times, and recent studies have shown that it helps humans reinforce their minds and relax. Music is therefore helpful in soothing and relaxing a person's mind. Focusing on these two aspects, this project integrates them to create an application which plays music to calm the person based on their mood. This system can help to lower the user's stress levels through music therapy.

In recent times, people have been facing conditions like hypertension and anxiety, leading to very stressful lives, and the coronavirus pandemic has worsened the situation further. This poor state of mind is caused by issues like the balance between income and spending, stress at the workplace, and so on. Music has been a medium of entertainment and leisure since ancient times, and it reduces stress through the science of acoustics. For example, temples use bells and gongs whose frequencies create a calming effect on the people inside the temple premises. However, the effect of music is considerable only when the beats of the music match the mood of the listener. Through advances in technology, music is now available at our fingertips, but unfortunately there are no effective music applications which play music based on the user's mindset. To act on this problem, the proposed system defines a practical approach to solve this issue and create an effective music player which plays music based on the emotions of the user.

Music is very important not only in people's daily lives, but also in today's technological society. Users usually need to actively explore and select music playlists. Here we propose an efficient and accurate model for generating playlists according to the user's current emotional state and behavior. Existing approaches to automatic playlist creation are computationally inefficient, inaccurate, and may involve the use of additional equipment such as EEGs and sensors. Language is the oldest and most natural way to convey emotions, feelings and moods, but processing it is computationally intensive, time-consuming and costly.

LIST OF FIGURES

FIGURE NO.    TITLE    PAGE NO.

4.1 Architecture Diagram 10

4.2 Sequence Diagram 11

4.3 Use Case Diagram 12

4.4 Activity Diagram 13


6.1 Haar Kernels 17
6.2 Types of Haar Kernels 18
6.3 Kernel on Face detection 18
6.4 Emotion Detection with CNN 21
6.5 Schematic representation of CNN 21
6.6 Convolution diagram 22
6.7 Kernel matrix 22
6.8 Feature Pooling 23
7.1 User Interface 24
7.2 Application being initialized on local server 30
8.1 Accuracy plot of the model 31

LIST OF ABBREVIATIONS

CNN Convolutional Neural Network


CSS Cascading Style Sheet

HTML HyperText Markup Language

OpenCV Open Source Computer Vision


RR Recognition Ratio
UI User Interface

CHAPTER 1
INTRODUCTION

1.1 Overview
This application analyses the user’s emotions by reading their physical facial features
after which it plays music that matches the detected mood of the individual. For
instance, if the user is feeling sad, the application will play an upbeat happy song to
uplift his/her mood. The present systems employ algorithms which yield unpredictable results, leading to lower accuracy. They also run continuously in the background, which wastes resources and makes them inefficient. Audio analysis systems also cannot gather significant information from audio signals in a short time. The
existing designs even use supplementary sensing devices or use human speech which
is quite inefficient. These systems cannot properly associate the perception of the user
with music.
To determine an individual's facial expression, a comparison with reference expressions can be made. In 2005, Mary Duenwald published an article summarizing the findings of various studies conducted by scientists, which showed that there are essentially seven universal categories of facial expressions, of which the following are the basic ones:

i. Sadness: When a person is in a sad state, the person’s eyebrows come closer
while the inner part of the eyelids go up. The corners of the lips appear to be
in the shape of a downward circular arc. Also, the lower lip may push up to form a pout.

ii. Anger: When a person is angry, the upper and lower eyelids squeeze towards
each other keeping the eyeball visible through a small slit between the eyelids.
Additionally, the lower lip pushes up a little, the top and lower lips press
against each other, causing the jaw to move forward.

iii. Happiness: In order to identify if a person is happy, the side edges of the lips are raised and the mouth is shaped like the bottom arc of a semicircle. At the same time, the eyelids close slightly, the cheeks go up with the smile, and the eyebrows lower slightly.

iv. Calm: In a calm face, the mouth and the eyelids remain in the same position as in a neutral face. This is the default emotion for the project.

1.2 Motivation for the project
Music can balance a person's mental state when used in the right way. It can also help people to cope with stress and other psychological problems. Especially during the pandemic, many people have suffered from mental imbalance. Typically, the user is faced with the burden of manually searching through a playlist of music in order to make a selection. Here we propose an efficient and accurate model for generating playlists according to the user's current emotional state and behavior. Existing approaches to automatic playlist creation are computationally inefficient, inaccurate, and may require the use of additional equipment in the form of complex sensor integration. Language is the oldest and most natural way to convey emotions, feelings and moods, but processing it is computationally intensive, time-consuming and costly.

1.3 Problem Definition and Scenarios


With the recent advancements in the fields of application development, music has
become more accessible to people than ever before. Although these applications fulfill
the basic needs of the user, they do not possess the ability to play music based on the
user’s current mindset. Users themselves have to painstakingly go through the
seemingly endless collection of audio files to find their perfect songs. This system aims
to create a user-friendly web application which plays music based on the person’s
current mood by using a real-time camera for facial recognition which results in
uplifting the mood of the user.

The problems this project aims to overcome are:

o Most people, however, struggle with song choosing, particularly when it comes to
tunes that reflect their current emotions. Individuals will be less inclined to search
for the songs they wish to listen to if they see large lists of unsorted music.

o The majority of users will play songs in a random fashion which are not organized
and are played through a music player which just plays music and doesn’t provide
any personalization to the user.

o In most situations, the music that the user listens to is just played in random which
may not suit the present mindset of the user.

o For instance, consider that a person is feeling depressed. He/She might want to
listen to peppy beat tracks which creates positive reinforcement in the person. But,
not all audio tracks can have the same effect. So, the user needs to extensively
search for the music file and play it. This may be frustrating during such low
moments.

o Furthermore, consumers are becoming dissatisfied with the old method of searching for and picking songs, an approach that has been in use for years.

1.4 Organization of the report


This report is organized into several chapters. The first chapter is a lead-up to the project, in which the initiative is defined and the scope of the project is outlined; the fundamental issue is identified, and specific aims and objectives are established and documented. The second chapter is devoted to a literature study of the project's suggested technique: it evaluates existing music applications and the algorithms they use, and outlines the approach being proposed. The chapters on project description and system design form the most crucial part of the report: system requirements are gathered in the analysis step, then assessed and organized. The implementation chapters describe the selection of the programming language and illustrate implementation details such as sample screens and code. The result analysis chapter briefly touches upon system evaluation and testing, as this iterative process is essential to examine whether the application delivers the required output. In the final chapters, the project is summarized with the overall process and methodology followed through the course of developing this work, a conclusion is provided, and a brief description is given of the recommended future work that could be carried out to complement this project and further develop and expand it.

1.5 Summary
This report proposes an emotion-based music player, implemented as a web application. The application aims to suggest songs based on the user's emotions. To classify the emotion, the user's facial image is analyzed by the facial expression recognition model. Then, given the user's emotion, the application suggests relevant songs based on the user's preference mode. If the user selects the positive mode, the application will recommend positive songs; otherwise, it will recommend songs with a negative mood, because some users want to release their anger, sadness or stress. To improve the performance of the application, the plan is to explore more techniques to eliminate the effects of an uncontrolled environment. Moreover, the plan is to expand the song database in order to support more users' interests.

CHAPTER 2
LITERATURE REVIEW

2.1 Introduction
When starting work on the proposed system, we initially referred to scholarly articles and research papers published in various reputed journals. Out of the several sources and papers referred to, we found that the following research papers properly support our project. These proposed systems inspired us to come up with our current proposed system.

This application analyses the user's emotions by reading their physical face features, and then
plays music that corresponds to the individual's observed mood. If the user is upset, for
example, the program will play an upbeat joyful tune to cheer him up. The current systems
use algorithms that produce unexpected outcomes, resulting in lower accuracy. They also
operate continuously in the background, wasting resources and making the system
inefficient. Audio analysis technologies are likewise incapable of extracting substantial
information from audio signals in a shorter amount of time. Existing solutions even rely on
auxiliary sensing devices or human speech, both of which are inefficient. These technologies are unable to correctly associate the user's perception with music.

Taking into account the aforementioned shortcomings of existing systems, this paper
provides an effective method for overcoming them. The major goal of this article is to create
an application that plays songs based on the user's facial expression in real time when a digital
camera detects the user's emotion. Furthermore, the algorithm used here employs approaches
such as data augmentation, which decreases data overhead and computation time by
compressing and exporting the model, making it portable and accessible without the need for
an internet connection.

2.2 EMOSIC - An Emotion Based Music Player for Android (Ankita Mahadik, Shambhavi Milgir, Prof. Vijaya Bharathi Jagan, Vaishali Kavathekar, Janvi Patel)
In this paper they created an emotion-based music player for android making use of FaceAPI
and Viola-Jones algorithm to obtain an image map to identify the user’s facial features. These
mappings were then sent to an SVM classifier which then categorized the emotion. Microsoft
emotion API was used to process the emotions and retrieve the user emotion. To categorize
the songs, they performed feature extraction using Zero-crossing Rate (ZCR) calculation.
They identified the signal frequency and tempo to achieve it.

2.3 Efficient Facial Expression Recognition Algorithm Based on
Hierarchical Deep Neural Network Structure (JI-HAE KIM, BYUNG-
GYU KIM, PARTHA PRATIM ROY, DA-MI JEONG)
They proposed an efficient facial expression recognition algorithm, which employs a deep
neural network structure. Using this type of organization, they made it possible to re-classify
the emotion result, which frequently causes errors in accuracy. They constructed two neural
networks among which the first focuses on action units using the LBP feature. The other
network was then used to extract the geometrical changes of every action unit’s landmarks.
Building on these two results, the algorithm combined them using a dynamic function which
weighed the results and integrated them.

2.4 Directing Physiology and Mood through Music: Validation of an Affective Music Player (Marjolein D. van der Zwaag, Joris H. Janssen, and Joyce H.D.M. Westerink)

In this paper the authors performed a real-time experiment on a few employees to validate
the affective music player developed by Janssen et al. They used a skin-temperature
detection device to constantly monitor the temperature of the research participants. Further,
they tested whether the energy of a mood can be directed based on Skin Conductance Level
(SCL). The final results of this method showed that skin temperature is related to blood-vessel constriction, which is a consequence of sympathetic arousal. This supported the conclusion that music can improve the mood of a person.

2.5 Mood based Music Recommendation System (Ankita Mahadik, Shambhavi Milgir, Prof. Vijaya Bharathi Jagan, Vaishali Kavathekar, Janvi Patel)

They have proposed a mobile system which performs facial recognition using Bitmap graphic
image objects and utilizes Mobile net CNN architecture model. This architecture was selected
to be a lightweight framework to enable the user to run the model on a mobile device which
has fewer resources. They created music playlists based on moods like happy and sad. Some other approaches used varied algorithms and corresponding datasets. These include
the KDEF (Karolinska Directed Emotional Faces) dataset and VGG (Visual Geometry Group)
which when combined with CNN (Convolution Neural Network) model which gave an
accuracy nearing 80 per-cent. Similarly, another approach utilized Python 2.7, OpenCV, and
the CK (Cohn-Kanade) and CK+ (Extended Cohn-Kanade) databases to achieve an accuracy
of over 80 percent.

2.6 Emotion-Based Music Player (Charu Agrawal, Meghna Varma, Anish
Varshaney, Khushboo Singh, Chirag Advani, Dr. Diwakar Yagyasen)

The authors designed an emotion recognition system that constantly monitors the heart rate of the user and recommends playlists for playing songs from the database. This is an
extension of the idea from using an external device other than a digital camera to recognize
the emotions. This seems to be effective when the user is stationary, but not when they are
performing a physical activity and sometimes, the heart sensor miscalculates the rate which
leads to inaccuracy in the working of the application. This application is also specific to the
android platform which is in turn integrated using firebase at the backend.

2.7 Summary
We have looked into these methodologies and partially used concepts from some of these works to improve our system, while other existing techniques prompted us to think and come up with a new approach, resulting in the proposed system.

Thus, with this emotion-based music player, users have an alternative, more interactive and simpler way of selecting songs. Music lovers will not have to search through a long list of songs; the selection is instead matched to their emotion.

The emotion recognition of the photographs loaded into the suggested model is the most
important aspect of this study. The main goal is its emotion detection functionality.

The proposed approach aims to improve an individual's entertainment through the integration
of emotion recognition technology with a music player. The proposal can recognise four emotions in the photos placed into it: normal (calm), happy, sad, and angry. After the proposed model has
compared and detected the user's emotion, the music player will play the appropriate song(s).

CHAPTER 3

PROJECT DESCRIPTION

3.1 Objective of the Project work


The objective is a platform-independent web application which can play music in real time based on the person's mood, as monitored through a web camera: a user-friendly web application which plays music based on the person's current mood and utilizes a real-time camera for facial recognition, with the aim of uplifting the user's mood. Existing systems create a personalized list based on the user's previous choices; this system instead plays songs based on the user's mood, detected through facial analysis.

i. Propose a facial expression detection model for detecting and analyzing an individual's emotion.

ii. To be able to distinguish between the four basic emotions: normal (calm), happy, sad, and angry.

iii. Incorporating a music player into the suggested model to play music based on the
observed emotions.

3.2 Existing System


Emosic is an emotion-based music player for Android that makes use of FaceAPI and the Viola-Jones algorithm to obtain an image map and identify the user's facial features. Another mood-based music recommendation system performs facial recognition on a mobile device using Bitmap graphic image objects and the MobileNet architecture. A further approach designed an emotion recognition system that constantly monitors the heart rate of the user and recommends playlists for playing songs from the database.

Relationship between music and emotion research

Many scholars have conducted research and studies to see if music may affect people's
emotions. The findings of studies throughout the years have shown that different musical
styles can influence people in different ways.
For example, in 1994, Antoinette L. Bouhuys, Gerda M. Bloem, and Ton G. G. Groothuis conducted research on the association between individuals' facial expressions and
depressive music. The findings revealed that listening to melancholy music causes a
large increase in depressed mood and a significant decrease in joyful mood.
The study found that music has the ability to affect people's moods.

Aside from the studies mentioned above, Daniel T. Bishop, Costas I. Karageorghis, and
Georgios Loizou presented their findings on the use of music to manipulate the
emotional condition of young tennis players. This study enlisted the participation of
fourteen young tennis players. Participants will frequently choose to listen to music in
order to produce visual and auditory imagery, according to the findings. "Increasing the
pace and/or intensity of a musical snippet may raise the amplitude of an affective
response and attendant action tendencies," according to Frijda (1986).
Matthew Montague Lavy developed four key assumptions about music fans and their
connection with music in 2001. Music is first heard as a sound. When a person listens to
music, the ongoing monitoring of auditory stimuli is not turned off; rather, it is monitored
and evaluated exactly like any other input. Second, music is perceived as a human
expression. The ability to recognize and distinguish emotion in the shapes and timbres
of vocal utterances is possessed by everyone. Third, music is experienced in its
environment. Music is characterised as a vast and complex web of information, ideas,
and surroundings. All of these elements combine to create an emotional experience.
Finally, music is heard as a story. When listening to music, sounds, utterances, and
context are all integrated.

3.3 Shortcomings of Existing System


Emosic is platform-specific, as it is an Android application, and it utilizes a pre-fabricated API, which gives less control. The next approach did not fine-tune the model, which increased memory consumption and processing time. The skin-temperature-based approach did not accurately predict emotions, as there were complexities associated with temperature detection and its correlation with the emotion of the person. The last approach generates a playlist, which becomes a partially manual process from the user's perspective.

3.4 Proposed System


The core motive of this project is to design an application which plays songs based on the user's facial expression in real time, after detecting the emotion of the user using a digital camera. Moreover, the algorithm employed here makes use of techniques like data augmentation, which reduces data overhead and computation time, and the model is compressed and exported, making it portable and available without any internet connection.

On the user side, the user will be able to personalise the songs in each category to their preferences. When upset, some people favour romantic music, while others prefer country music. There will be no limit to the number of songs that can be saved in each category. To begin, the user must first launch the system. When the system is turned on, the user has the option of selecting tunes or processing the current feeling directly. After the machine has completed the interpretation, a list of music will be played automatically. After the list of songs has been played, the user can alter the current emotion by restarting the image loading or capturing operation.

3.5 Benefits of Proposed System


The proposed system utilizes a CNN, which is a lightweight algorithm, to process the data for labelling and assigning weights for better prediction. This system also uses techniques like data augmentation, which reduces the data overhead on the machine and increases the effectiveness of the algorithm, in turn reducing execution time. Additionally, compressing and exporting the model makes it portable and available without any internet connection.

CHAPTER 4

SYSTEM DESIGN

4.1 Architecture Diagram


Figure 4.1 : Architecture Diagram

4.2 Sequence Diagram

Figure 4.2: Sequence Diagram

4.3 Use Case Diagram

Figure 4.3 : Use Case Diagram

4.4 Activity Diagram

The activity flow is: face detection, select/capture mood, classify the image, identify the emotion, play suitable music, and stop.

Figure 4.4 : Activity Diagram

CHAPTER 5

PROJECT REQUIREMENTS

5.1 Hardware and Software Specifications

A. Hardware Requirements
Hardware comprises the most basic resources a computer needs to run the application. As the proposed system is a web application, it does not demand many resources on the user's system. The hardware requirements are:

• Minimum 4GB of RAM (for processing), this is required to ensure that the
application runs smoothly on the user end without any lag.
• Webcam to detect the face and emotions, a High Definition 720p camera is
recommended for a sharp image and to reduce noise in normal and low-light
conditions
• Any quad-core processor with a minimum 2.0 GHz clock rate (this is required for running the application smoothly and for processing the image through the neural network)
• Minimum 4 GB of available disk space for installing all the required python
modules (this requirement is for running the application on the local machine to
install Python and its related modules, Visual Studio, resources for the application)

B. Software Requirements
Software requirements are the prerequisites that need to be installed in the computer to
provide optimal function of an application. The software requirements are:
• A Chromium-based browser like Chrome or Edge, version 85.0+, to ensure that the application and its constituent libraries run without errors
• JavaScript enabled for frontend music player function; as the application controls
are JavaScript based and needs it to be enabled for the application to work as
expected
• Google Colab with TensorFlow for training the emotion classifier remotely, as the neural network and detection algorithms need substantial compute for training and for processing the image data, which Colab provides
• Visual Studio needs to be installed for accessing the CMake library, which is C++ based and is essential for the face detection module
• OpenCV for computer vision tasks which consists of the Haar-Cascade classifier
which is core to facial recognition module
• Python 3.6 with Keras installed for executing the python files
• Visual Studio Code (Atom/any other text editor) – we used VS code for creating,
editing and managing our files for building the application

5.2 Technologies Used


 Python 3.6 (and higher)
Python is a high-level programming language with numerous libraries whose modules make complex tasks like facial recognition, pattern detection and many other higher-order tasks easy. This is achieved through pre-defined functions whose code is already written and only needs to be executed on the user's input data. In this proposed system, Python plays a crucial role by performing the computer vision tasks. A minimum version of 3.6 is recommended to ensure compatibility with the other modules and dependencies in this application.

 Django
Django is a popular Python web framework used to create Python-based web applications. It is very convenient for the developer, as it bundles all the required files and dependencies within a package, which also makes it easy to configure the application as a whole. In addition, Django supports integrating machine learning models within a Django application. This gave us the ability to make our facial detection and emotion recognition modules work hand-in-hand with the web application.
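As an illustration only, the following minimal sketch shows one way such an integration could look: a Django view that loads the trained Keras model once and returns the detected emotion as JSON. The view name, the file name model.h5, the label order and the 48x48 input size are assumptions for illustration, not the project's actual code.

# views.py - a minimal sketch, assuming a trained Keras model saved as model.h5
import cv2
import numpy as np
from django.http import JsonResponse
from tensorflow.keras.models import load_model

EMOTIONS = ["angry", "calm", "happy", "sad"]       # assumed label order
model = load_model("model.h5")                     # loaded once at startup

def detect_emotion(request):
    """Read one webcam frame, classify it, and return the emotion as JSON."""
    cap = cv2.VideoCapture(0)
    ret, frame = cap.read()
    cap.release()
    if not ret:
        return JsonResponse({"error": "no frame captured"}, status=500)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    face = cv2.resize(gray, (48, 48)).astype("float32") / 255.0
    probs = model.predict(face.reshape(1, 48, 48, 1))[0]
    return JsonResponse({"emotion": EMOTIONS[int(np.argmax(probs))]})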

 HTML, CSS and JavaScript


These three web technologies were used to develop the user interface of the application. An HTML template was used to create the basic structure of the music player, while CSS was used for styling the UI. JavaScript is used to trigger the
various buttons from the application to enable the user to interact with the music
player.

CHAPTER 6

MODULE DESCRIPTION

6.1 Modules
This proposed system consists of modules which perform mini routines. These routines
are specific to each module and may work either individually or in harmony with each
other to perform specific tasks within the system. Our proposed system has three
constituent modules which are abstracted according to their functionalities. They work
together and bring life to the application. Dividing our project into modules made our
work easier while developing and testing. These modules reduced the time and effort
needed to finish this system.

This project consists of three major modules:

● Face Detection.
● Emotion Detection
● Song Recommendation

6.2 Face Detection Module


This module involves the ability to detect the user’s face in a frame. In order to achieve
this, this module consists of the Haar-Cascade object detection algorithm which is
offered through OpenCV library. The Haar Cascade algorithm utilizes the user’s
webcam to locate the face and performs edge to edge detection of the face.

This algorithm uses edge or line detection features to get an accurate representation of the user's face. This is done with the help of various kernels which sweep through the image, taking several pixels at once, and give each pixel a rating from 0.0 to 1.0, where 0.0 is white and 1.0 is black. When there is a significant difference between the pigmentation of two adjacent pixels, the model identifies it as an edge of the face.

These pigmentation values are then summed up separately for lighter areas and darker
areas. Consequently, the difference between the lighter and darker areas gives more
insight on identifying the features and edges. This process is cascaded throughout the
whole image.


Figure 6.1: Haar Kernels

Steps for Haar Cascade Algorithm:

● Using a webcam to get the data collected in the form of images.

● Image processing:
- Pre-processing stage: The collected images are converted into grayscale and then divided into two subsets: training data and testing data.

- Haar Cascade Algorithm:

Here we use face detection. First, the algorithm requires a lot of positive images (face images) and negative images (images without faces) to train the classifier. Next, we need to extract features from them. For this, the Haar features shown in Figure 6.2 are used. They are like convolution kernels. Each feature is a single value obtained by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle.

Figure 6.2 : Types of Haar Kernels

Now many features are calculated using all possible sizes and positions of each kernel. For each feature calculation, we need to find the sum of the pixels under the white and black rectangles. To solve this, the integral image was introduced: it reduces the calculation of a sum of pixels, however large the region, to an operation involving only four pixels, which makes things very fast.

Figure 6.3: Kernel on face detection
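The integral-image trick described above can be illustrated with a small NumPy sketch (a toy example under assumed values, not code from the project): the sum over any rectangle is recovered from just four entries of a cumulative-sum table.

# Toy illustration of the integral image; values are random, not project data.
import numpy as np

img = np.random.randint(0, 256, size=(6, 6))           # toy grayscale image
ii = img.cumsum(axis=0).cumsum(axis=1)                 # integral image

def rect_sum(ii, r1, c1, r2, c2):
    """Sum of img[r1:r2+1, c1:c2+1] using only four table lookups."""
    total = ii[r2, c2]
    if r1 > 0:
        total -= ii[r1 - 1, c2]
    if c1 > 0:
        total -= ii[r2, c1 - 1]
    if r1 > 0 and c1 > 0:
        total += ii[r1 - 1, c1 - 1]
    return total

assert rect_sum(ii, 1, 1, 4, 3) == img[1:5, 1:4].sum()  # matches the direct sum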

To do this, each feature is applied to all the training images. For each feature, the optimal threshold that classifies faces as positive or negative is found. Obviously, there will be errors and misclassifications, so the feature with the lowest error rate is selected; in other words, it is the feature that best separates images with faces from images without faces. The process is not that simple, however: each image is initially given an equal weight, and after each classification the weights of the misclassified images are increased. The same process then runs again, producing new error rates and new weights, and this continues until the required accuracy or error rate is achieved, or until the required number of features has been found.

Object Detection with OpenCV-Python Using Haar cascade classifier

Python has a wide range of uses in the field of computer vision, most notably in deep
learning. Computer Vision is a fascinating and hard field that ranges from running OCR
on documents to allowing robots to "see."
OpenCV is a cross-platform open source framework that was created as a toolkit for real-
time computer vision. Because it is cross-platform, you can use C++, Python, and Java to
interact with it regardless of your operating system.

Object Recognition with OpenCV
The initial stage in object recognition is to use OpenCV to read and display a picture.
The image is loaded using the imread() function, and the image is displayed using the
imshow() method. In the event when the size of the window and the picture differ, the
namedWindow() and resizeWindow() methods are used to build a bespoke window for
the image.

The waitKey() method waits for a key press for a specified number of milliseconds. A value of 0 indicates that OpenCV will leave the window open indefinitely until we close it with a keystroke. The destroyAllWindows() method instructs OpenCV to close all open windows.

OpenCV can draw rectangles, circles, and lines, among other shapes. To associate a label
with the shape, we can use the putText() method. Let's use the rectangle() method to draw
a simple rectangular form in the image, which accepts positional arguments, colour, and
the thickness of the shape.

The cv2.rectangle() method was used to fix the rectangle's position. These places must be
deduced from the image rather than guessed. That's when OpenCV comes in to save the
day! Once it does, we can instead use this function to create a rectangle around the detected
object.

Drawing rectangles (or circles) in this manner is a key step in Object Detection since it
allows us to clearly annotate (label) the things we find.
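A minimal sketch of the OpenCV calls just mentioned is shown below; the file name, window size, rectangle coordinates and label text are illustrative assumptions, not values from the project.

# Reading, annotating and displaying an image with OpenCV (illustrative sketch)
import cv2

img = cv2.imread("sample.jpg")                         # load an image from disk
cv2.namedWindow("Preview", cv2.WINDOW_NORMAL)          # custom, resizable window
cv2.resizeWindow("Preview", 640, 480)

# Annotate: a rectangle with a text label above it
cv2.rectangle(img, (50, 50), (200, 200), (0, 255, 0), 2)
cv2.putText(img, "object", (50, 40), cv2.FONT_HERSHEY_SIMPLEX,
            0.8, (0, 255, 0), 2)

cv2.imshow("Preview", img)
cv2.waitKey(0)                                         # wait until a key is pressed
cv2.destroyAllWindows()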

To lower the computational expense, we greyscale the image for the classifier. Colour does not really matter for this detection, because the patterns that define eyes look the same whether they are coloured or not. The cascade classifier is a CascadeClassifier object loaded with Haar features for eyes; an f-string is used to locate the XML file dynamically.
The real detection is done by the detectMultiScale() method, which can detect the same
object on an image regardless of scale. It returns a list of rectangles with the coordinates
of the discovered items (tuples). As a result, it's only natural to draw rectangles around
them. We can create a rectangle for each tuple of (x, y, width, height) in the detected
objects.
The minSize option specifies the smallest object that should be examined. The classifier will most likely pick up a lot of false positives in the image if the size is set too small. This is usually determined by the image resolution you are dealing with, as well as the average object size. In practice, it boils down to a series of small tests until the system performs well.
Let's set the min size to (0, 0) to see what gets picked up:
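A minimal sketch of that experiment is given below, assuming the bundled eye cascade that ships with OpenCV and an illustrative image path; it is not the project's exact code.

# Eye detection with minSize=(0, 0) - a sketch under assumed inputs
import cv2

cascade_path = f"{cv2.data.haarcascades}haarcascade_eye.xml"
eye_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("face.jpg")                           # illustrative image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)           # greyscale lowers cost

eyes = eye_cascade.detectMultiScale(gray, scaleFactor=1.3,
                                    minNeighbors=5, minSize=(0, 0))
for (x, y, w, h) in eyes:                              # one rectangle per detection
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)

cv2.imshow("Eyes", img)
cv2.waitKey(0)
cv2.destroyAllWindows()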

Because there isn't any other fluff in this image that could be misclassified as an eye, we
only have two misclassifications. One is in the eye, and the other is on the chin! Setting a
small size may result in wrongly highlighting a large amount of the image, depending on
the image's resolution and contents.

6.3 Emotion Detection Module

Subsequently, after detecting the face and obtaining a clear picture through the Haar-Cascade model, the CNN classifier categorizes the emotion into one of the following classes: happy, angry, sad or calm. This classifier model was trained using TensorFlow and Google Colaboratory, along with Keras. A handful of other models, such as SVM, are available as well, but CNN is the more suitable one here, as it requires less computational power to run on the system. It also gives good accuracy compared to other models, with a lightweight neural network. The dataset used for training was retrieved from Kaggle, from a research prediction competition titled "Challenges in Representation Learning". The dataset consists of 48x48 pixel grayscale images of faces. The faces have been automatically registered so that each face is more or less centered and occupies about the same amount of space in every image. There are more than 30,000 training and testing images covering around seven labeled expression classes, of which four expressions were used in the project. In addition to these images, image augmentation was performed to train the model with more data, transforming the images to further refine the model and increase accuracy. TensorFlow is used with Keras to train and test the models for the four classes: happy, angry, sad and calm. The model is trained for a maximum of 25 epochs to reduce the loss, and it reaches about 66% accuracy.
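The following is a minimal Keras sketch of the training setup just described (48x48 grayscale inputs, four classes, up to 25 epochs). The layer sizes, optimizer and directory names are illustrative assumptions rather than the project's exact configuration.

# Sketch of the four-class emotion classifier training (assumed layout/values)
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(48, 48, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(4, activation="softmax"),             # happy, angry, sad, calm
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

train_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/train", target_size=(48, 48), color_mode="grayscale",
    class_mode="categorical", batch_size=64)

model.fit(train_gen, epochs=25)                        # trained for up to 25 epochs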

Figure 6.4 Emotion Detection with CNN

How CNN classifies the mood:


The CNN architecture for classification includes a convolutional layer, a maxpooling layer,
and a fully connected layer. The convolution layer and maxpooling layer are used for feature
extraction. The convolutional layer is for feature detection, while the maxpooling layer is for
feature selection. The Maxpooling layer is used when the image does not require all the high
resolution details, or when the output is needed in a smaller area extracted by the CNN after
downsampling the input data. The output of the convolutional and pooling layers is sent to
the fully connected layers for classification. Examples of classification learning tasks that
use CNNs are image classification, object recognition, and face recognition. The following
figure shows the basic CNN architecture for image classification.


Figure 6.5 : Schematic representation of CNN

Convolution:
The purpose of the convolution is to extract the features of the object in the image locally. Convolution is an element-wise multiplication, and the concept is easy to understand. The computer scans a part of the image, usually of dimension 3x3, and multiplies it element-wise by a filter; the sum of this element-wise multiplication gives one entry of what is called a feature map. This step is repeated until the whole image has been scanned. Note that after the convolution, the size of the image is reduced.

Figure 6.6 : Convolution Diagram
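A toy NumPy example of this multiply-and-sum step is sketched below; the patch and kernel values are made up for illustration.

# One convolution step: element-wise multiply a 3x3 patch by a kernel, then sum
import numpy as np

patch = np.array([[1, 0, 2],
                  [3, 1, 0],
                  [0, 2, 1]])
kernel = np.array([[ 1, 0, -1],
                   [ 1, 0, -1],
                   [ 1, 0, -1]])           # a simple vertical-edge filter

feature_value = np.sum(patch * kernel)      # element-wise product, then sum
print(feature_value)                        # one entry of the feature map: 1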

There are a number of kernels that can be used; some of them are shown in Figure 6.7. Each filter has a specific purpose and approach for classifying and identifying images.

Figure 6.7 : Kernel Matrix

Pooling Operation
The purpose of pooling is to reduce the dimensionality of the input. This step is performed to reduce the computational complexity of the operation; by lowering the dimensionality, the network has fewer weights to calculate, which helps prevent overfitting. In this phase, you need to define the pooling window size and the stride. The standard way to pool the input is to take the maximum value in each window of the feature map. In the picture below, pooling scans the four submatrices of the 4x4 feature map and returns the maximum value of each: it takes the maximum of a 2x2 window and then shifts the window by 2 pixels. For example, if the first submatrix is [3,1,3,2], pooling returns its maximum value, 3.

Figure 6.8 : Feature Pooling
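This 2x2 max pooling with stride 2 can be sketched in NumPy as follows; the values are chosen so that the first submatrix matches the [3,1,3,2] example above, and are otherwise illustrative.

# 2x2 max pooling with stride 2 on a toy 4x4 feature map
import numpy as np

feature_map = np.array([[3, 1, 4, 2],
                        [3, 2, 2, 0],
                        [5, 0, 1, 2],
                        [2, 1, 3, 4]])

# Split into 2x2 blocks and take the maximum of each block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)       # [[3 4]
                    #  [5 4]]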

Fully connected layers:

The last step consists of building a traditional fully connected neural network. All neurons from the previous layer are connected to the next layer, and a softmax activation function is used to classify the emotion in the input image.

6.4 Song Recommendation Module:


The song is played from a local folder where the songs are stored under labels corresponding to the expressions. Labels were pre-defined for each expression: happy, sad, calm, angry. After successfully detecting the face, the classifier automatically detects the captured mood, then searches for the matching expression tag and plays a song that suits the person's mood. If the person is sad or angry, the player will play a song intended to make the user happy and calm. Psychological research after the pandemic indicates that many people are dealing with an existential crisis and depression. This application can help such people reinforce their mental condition and achieve a better state, by listening to binaural beats to calm their minds or to upbeat music to boost their mood.
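A minimal sketch of this selection step is shown below; the folder layout (songs/<emotion>/), the file extensions and the calming-track substitution rule are illustrative assumptions, not the project's exact code.

# Pick a song from the folder that matches (or counteracts) the detected mood
import os
import random

SONG_ROOT = "songs"                                   # songs/happy, songs/sad, ...

def pick_song(label: str) -> str:
    """Return the path of a random track stored under the given label."""
    folder = os.path.join(SONG_ROOT, label)
    tracks = [f for f in os.listdir(folder) if f.endswith((".mp3", ".wav"))]
    return os.path.join(folder, random.choice(tracks))

detected = "sad"                                      # output of the classifier
playlist_label = "calm" if detected in ("sad", "angry") else detected
print(pick_song(playlist_label))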

CHAPTER 7

IMPLEMENTATION

7.1 User Interface

Figure 7.1: Music Player User Interface

The user interface is designed to be minimalistic and user friendly. We have used HTML
and CSS for designing the layout of the page. A static HTML template is used to provide
the basic outline to the music player. A CSS color gradient is used for the background
of the music player, which changes according to the person's detected mood. For a calm mood, a green color is used to create a neutral visual.

If happy mood is detected, a vibrant yellow color is displayed on screen which gives a
positive vibe for the user. If sad mood is detected, a deep blue gradient with white is
displayed which is proven to be calming for people and is even used on airplanes to calm
passengers. For an angry mood, a light pink color is chosen to soothe the user.

The music controls are triggered by using JavaScript functions with icons selected to go
well with the interface.
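As an illustration of how the detected mood could drive the background, the sketch below maps each emotion to a CSS gradient string that the template could apply; the colour values and the function name are assumptions, not the project's actual stylesheet.

# Hypothetical mapping from detected mood to the background gradient
MOOD_GRADIENTS = {
    "calm":  "linear-gradient(135deg, #a8e063, #56ab2f)",   # neutral green
    "happy": "linear-gradient(135deg, #f7d060, #ffb347)",   # vibrant yellow
    "sad":   "linear-gradient(135deg, #e0ffff, #1e3c72)",   # white to deep blue
    "angry": "linear-gradient(135deg, #ffd1dc, #ffa6c1)",   # light pink
}

def background_for(mood: str) -> str:
    """Gradient string passed to the music-player template for the detected mood."""
    return MOOD_GRADIENTS.get(mood, MOOD_GRADIENTS["calm"])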

7.2 Machine Learning Model
We have implemented a machine learning model to detect the face and classify the emotion. We used the Haar Cascade model with OpenCV for detecting the face and a CNN to classify the emotion; Keras and TensorFlow were used to train the model.

Face Detection:

Face detection is a computer method that calculates the positions and sizes of human faces in arbitrary (digital) photos. It recognises only faces and ignores everything else, such as buildings, trees, and bodies. Face detection may be thought of as a broader version of face localization, whose goal is to determine the positions and sizes of a known set of faces.

Haar Feature:
OpenCV's algorithm uses Haar-like features (of the kinds shown in Figure 6.2) as the input to the basic classifiers.

Cascade of Classifiers:
Rather than applying all 6000 characteristics to a window at once, they are organised into distinct stages of classifiers and applied one at a time. (Normally, the first few stages contain only a small number of features.) If a window fails the first stage, it is discarded and the remaining features are not considered. If it passes, the next stage of features is applied and the process repeats. A window that passes all stages is a face region.

OpenCV Pre Trained Classifiers:

Many pre-trained classifiers for the face, eyes, smile, and other facial features are already
included in OpenCV. These XML files can be found in the folder
opencv/data/haarcascades/:

~/OpenCV/opencv/data/haarcascades$ ls

haarcascade_eye_tree_eyeglasses.xml haarcascade_mcs_leftear.xml
haarcascade_eye.xml haarcascade_mcs_lefteye.xml
haarcascade_frontalface_alt2.xml haarcascade_mcs_mouth.xml
haarcascade_frontalface_alt_tree.xml haarcascade_mcs_nose.xml
haarcascade_frontalface_alt.xml haarcascade_mcs_rightear.xml
haarcascade_frontalface_default.xml haarcascade_mcs_righteye.xml
haarcascade_fullbody.xml haarcascade_mcs_upperbody.xml
haarcascade_lefteye_2splits.xml haarcascade_profileface.xml
haarcascade_lowerbody.xml haarcascade_righteye_2splits.xml
haarcascade_mcs_eyepair_big.xml haarcascade_smile.xml
haarcascade_mcs_eyepair_small.xml haarcascade_upperbody.xml

Cascade Classifiers and Haar Features:


Object detection is accomplished using Cascade Classifiers and Haar Features.

It's a machine learning approach in which we use a large number of photos to train a
cascade function. There are two types of images: positive images with the goal object
and negative images without the target object.

Cascade classifiers come in a variety of shapes and sizes, depending on the target item.

To recognise the human face as the target object in our project, we will utilise a classifier
that considers the human face.

The goal of the Haar Feature Selection approach is to extract human face traits. Haar
features work in a similar way to convolution kernels. Different permutations of black
and white rectangles make up these features. We find the total of pixels under white and
black rectangles in each feature computation.

Steps to implement human face recognition with Python & OpenCV:

1. First, create a python file face_detection.py and paste the below code:
Imports:
import cv2
import os

2. Initialize the classifier:


cascPath=os.path.dirname(cv2.__file__)+"/data/haarcascade_frontalface_default.xml"
faceCascade = cv2.CascadeClassifier(cascPath)

3. Apply faceCascade on webcam frames:


video_capture = cv2.VideoCapture(0)
while True:
    # Capture frame-by-frame
    ret, frames = video_capture.read()
    gray = cv2.cvtColor(frames, cv2.COLOR_BGR2GRAY)
    faces = faceCascade.detectMultiScale(
        gray,
        scaleFactor=1.1,
        minNeighbors=5,
        minSize=(30, 30),
        flags=cv2.CASCADE_SCALE_IMAGE
    )

    # Draw a rectangle around the detected faces
    for (x, y, w, h) in faces:
        cv2.rectangle(frames, (x, y), (x + w, y + h), (0, 255, 0), 2)

    # Display the resulting frame
    cv2.imshow('Video', frames)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

4. Release the capture frames:


video_capture.release()
cv2.destroyAllWindows()

5. Now, run the project file using:


python3 face_detection.py

Data Pre-processing and Data Augmentation

To make the most of our limited training samples, we "augment" them using a series of random transformations, ensuring that our model never sees the exact same image twice. This helps the model generalise better and avoids overfitting. In Keras this can be done with the keras.preprocessing.image.ImageDataGenerator class. This class lets you configure random transformations and normalisation operations to be applied to your image data during training, and create augmented image batch generators (with their labels) via .flow(data, labels) or .flow_from_directory(directory). These generators can then be used with the Keras model methods fit_generator, evaluate_generator, and predict_generator, which all accept data generators as inputs. The main options used are listed below, followed by a short sketch.
Haar Cascade detection is a well-established face detection technique that existed long before deep learning became popular. Haar features were employed to recognise faces as well as eyes, lips, licence plates, and other features. The models are hosted on GitHub, and we can retrieve them through OpenCV.

 rotation_range is a degree range (0-180) within which to rotate photos at random.
 width_shift_range and height_shift_range are ranges within which images can be translated horizontally or vertically at random.
 rescale is a value we multiply the data by before any other processing. Our original photos have RGB coefficients ranging from 0-255, but such values would be too high for our models to handle (given a typical learning rate), so we scale them down to values between 0 and 1.
 zoom_range is used to zoom in and out of images at random.
 horizontal_flip is used to flip half of the photos horizontally at random, which is useful when there are no assumptions of horizontal asymmetry.
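A minimal sketch of an ImageDataGenerator configured with these options follows; the specific values and the directory name are illustrative assumptions.

# Augmented training batches built from a directory tree (assumed values)
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # bring 0-255 pixel values into the 0-1 range
    rotation_range=20,        # rotate photos at random within +/- 20 degrees
    width_shift_range=0.1,    # random horizontal translation
    height_shift_range=0.1,   # random vertical translation
    zoom_range=0.2,           # random zoom in and out
    horizontal_flip=True,     # flip half of the images horizontally
)

train_generator = train_datagen.flow_from_directory(
    "data/train", target_size=(48, 48), color_mode="grayscale",
    class_mode="categorical", batch_size=64)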

Training a Small Convnet

A convnet is the best tool for picture classification, therefore let's try to train one on our
data as a starting point. Because we only have a few samples, overfitting should be our
primary concern. Overfitting occurs when a model with too few examples learns
patterns that do not generalise to new data, i.e. when the model begins to make
predictions based on irrelevant attributes. For example, if you only see three
photographs of lumberjacks and three images of sailors, and only one of the lumberjacks
wears a cap, you may conclude that wearing a cap is an indicator of being a lumberjack
rather than a sailor.

Data augmentation is one method for combating overfitting, but it's insufficient because
our supplemented samples are still highly linked. The entropic capacity of your model, that is, how much information your model is allowed to hold, should be your major emphasis
while battling overfitting. A model that can store a lot of data has the potential to be
more accurate by utilising more features, but it also has a higher chance of storing
unnecessary data. Meanwhile, a model with limited storage must concentrate on the
most important properties detected in the data, which are more likely to be actually
meaningful and generalise better.

Entropic capacity can be modulated in a variety of ways. The most important is deciding
on the amount of parameters in your model, such as the number of layers and their sizes.
Weight regularisation, such as L1 or L2 regularisation, is another option, which involves
forcing model weights to adopt smaller values.

We'll utilise a modest convnet with a few layers and filters per layer, as well as data
augmentation and dropout, in our scenario. Dropout also helps to minimise overfitting
by preventing a layer from seeing the same pattern twice, similar to how data
augmentation works (you could say that both dropout and data augmentation tend to
disrupt random correlations occurring in your data).

Our first model is a basic stack of three convolution layers with a ReLU activation,
followed by max-pooling layers, as seen in the code sample below. This is extremely
similar to the image classification architectures recommended by Yann LeCun in the
1990s (with the exception of ReLU).

We place two fully connected layers on top of it. The model is finished with a single
unit and sigmoid activation, which is ideal for binary classification. To go along with it,
we'll train our model with the binary crossentropy loss.
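The configuration just described can be sketched as follows: three convolution + ReLU blocks with max-pooling, two fully connected layers on top, and a single sigmoid unit trained with binary cross-entropy. The layer widths here are assumptions, and the project's four-emotion classifier would end in a 4-way softmax with categorical cross-entropy instead.

# Sketch of the small convnet described above (assumed layer widths)
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(48, 48, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                       # dropout to limit overfitting
    layers.Dense(1, activation="sigmoid"),     # single unit for binary output
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])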

7.3 Backend

Figure 7.2 : Application Being initialized on the local server

The songs are stored in a local folder. A Python server is used to run the application on the local machine. The machine learning model is compressed and saved after being trained on Google Colab for 25 epochs. The model is then imported on the local machine, where its parameters are loaded and pre-processed. When the user's face is detected, the frame is streamed to the model for edge detection, and these features are sent through the neural network for processing. After this process, the detected emotion is sent back to the user interface, and the appropriate changes are reflected on the music player page.

CHAPTER 8

RESULT ANALYSIS

8.1 Results Obtained

Figure 8.1: Accuracy plot of the model

According to our problem statement, we were required to identify the current emotion of the user by observing their facial features and play songs which help the user feel better. After putting the proposed system into use, the facial features were identified accurately, which is the main dependency for the system to work properly, since the song dataset was pre-labelled so that the corresponding file could be retrieved and played. We were able to achieve an accuracy ranging from 65 to 70 percent in environments where the brightness was good and the user's face was well lit.

The front end / user interface of the application runs as expected. All the buttons and functionalities of the music player work and are triggered as soon as the user clicks the corresponding button. The facial recognition module works in real time without any lag or delay. When the application is first launched on the local server, it asks the user for permission to access the web camera. This is expected functionality, and it also respects the user's privacy when they are not using the application.
Emotion Accuracy Testing Result:

For comparison purposes, a set of photos for each emotion (happy, sad, angry, calm) is maintained in the suggested model. In order to recognize the user's emotion, the freshly loaded photos are compared against this saved dataset. The table below lists the photographs that were saved in the suggested model.

Images                   Emotion

(sample photograph)      Happy
(sample photograph)      Angry
(sample photograph)      Sad
(sample photograph)      Calm

Table 8.1: The dataset of images saved in the proposed model

Emotion    No. of Samples    No. of Recognized Samples    Recognition Rate

Happy      10                10                           100%
Sad        10                 8                            80%
Angry      10                 6                            60%
Calm       10                 9                            90%
Total      40                33                           82.5%

Table 8.2: The summary of result record

To find the recognition rate (RR), the following formula is applied to the collected results:

RR = (Number of correctly recognised samples / Total number of samples) * 100

RR = 33 / 40 * 100 = 82.5%

Based on the results above, the proposed model has a recognition rate (RR) of 82.5%.
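
The same computation, written as a short Python check against the per-emotion counts in Table 8.2:

# Recognised samples per emotion, out of 10 test photos each (Table 8.2)
recognised = {"Happy": 10, "Sad": 8, "Angry": 6, "Calm": 9}
total_samples = 10 * len(recognised)

rr = sum(recognised.values()) / total_samples * 100
print(f"Recognition rate: {rr:.1f}%")  # prints 82.5%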

CHAPTER 9

CONCLUSION AND FUTURE WORK

9.1 Conclusion
The significance of this project lies in detecting the emotion of the images loaded into the
proposed model; emotion detection is its core functionality. By integrating emotion detection
technology with a music player, the proposed model aims to improve the individual's
entertainment experience. The model is able to detect four emotions, i.e. calm, angry, happy,
and sad, in the images loaded into it. Once the model has compared the images and detected the
user's emotion, the music player plays songs accordingly; if the user is sad or angry, it tries
to lift the person's mood, for example by playing a calm song when the person is sad.

As for usability and accuracy, both system testing and emotion accuracy testing have been
carried out on the proposed model and returned satisfying results. The model recognised 33 out
of the 40 images loaded into it, giving a recognition rate of 82.5%, while the classification
accuracy over the full image dataset was about 66%. In addition, the proposed model is a
computer application that runs well on all common Windows computers.

Thus, with this emotion-based music player, users gain an alternative, simpler, and more
interactive way of selecting songs. Music lovers no longer have to search through a long list
of songs for something to play; instead, the selection is matched to their emotion.

9.2 Future Work
To further extend the proposed system, the songs could be fetched from a cloud API that lets
users stream and download music. This would give access to far more songs and remove the need
to store them in a local folder. At present only four emotions are examined; since users may
convey additional, mixed, and more complex emotions, the number of emotion classes could also
be increased.

First and foremost, the limitations of the emotion recognition must be reduced. As previously
stated, emotion detection has various drawbacks. Increasing the number of facial features
extracted for comparison can help enhance accuracy: at the moment the suggested model extracts
only the lips and eyes, and in the future other facial traits such as eyebrow and cheek
movement could be incorporated into the comparison.

Apart from the foregoing, the proposed model can be improved by incorporating automatic
adjustment of image resolution, brightness, and contrast. In the current application, the
quality of the loaded photographs has a significant impact on the accuracy of emotion
detection. With such automatic adjustment, the user could load images of any quality or capture
them with any webcam, and a future version of the model would normalise the image quality
before it is identified and processed. One possible approach is sketched below.
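
A minimal sketch of one such pre-processing step, assuming OpenCV: the face crop is resized to the network's input size and its contrast is equalised with CLAHE before prediction (the clip limit and tile size are arbitrary assumptions, not values used in the current system):

import cv2

def normalise_face(gray_face):
    """Resize a grayscale face crop and even out its brightness/contrast."""
    face = cv2.resize(gray_face, (48, 48))
    # Contrast Limited Adaptive Histogram Equalisation spreads out intensity
    # values locally, which reduces the effect of uneven lighting.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(face)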

In addition, the model can be trained to identify more complex and mixed emotions. This will
increase the effectiveness of the system and help ensure that the emotion is not captured
wrongly.

Furthermore, a real-time emotion recognition technique can be added to the model for improved
interaction between the user and the application. Once the app is activated, the future model
would continuously identify and extract facial features, allowing the emotion to be recognised
in real time. A rough sketch of such a loop follows.
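
A rough sketch of such a continuous capture loop, assuming OpenCV for webcam access and reusing a helper like the detect_emotion() function sketched in Section 7.3; both that helper and the polling interval are assumptions:

import time
import cv2

def run_realtime(detect_emotion, poll_seconds=2.0):
    """Grab webcam frames and re-detect the user's emotion every few seconds."""
    cap = cv2.VideoCapture(0)  # default webcam
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            emotion = detect_emotion(frame)
            if emotion is not None:
                print("Detected emotion:", emotion)  # e.g. hand over to the music player
            time.sleep(poll_seconds)
    finally:
        cap.release()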

CHAPTER 10

INDIVIDUAL TEAM MEMBERS REPORT

10.1 Individual Objective


In our project, the work was divided equally among the team members; however, everyone has a
clear idea about every module of the project.

[Saketh V] His main objective in this project was to build a machine learning model that
captures the human face and predicts the emotion from the given parameters. He trained the
algorithm so that it provides the maximum accuracy. Throughout the project he has learnt
various prediction tools and gained knowledge.

[Aman Prasad] His main goal in this project was to create the user interface for the music
player web app. He made the front end creative and interactive for the user in every possible
way. While doing this project he learnt new technologies and also integrated the user interface
with the machine learning model.

[Saurabh Jaiswal] His aim in this project was to create a back end that stores the dataset and
the music files. When the machine learning model detects the face and the CNN classifies the
emotion, the corresponding song is played from the database, where songs are stored with
emotion tags. He has also learnt about various tools for plugging the backend into the frontend
and the database.

10.2 Role of the Team Members


I am Saketh V. My role in this project was to develop a machine learning model based on Haar
Cascade, capable of detecting the face edge to edge so that the accuracy of our project is well
maintained. I trained the model with Keras and TensorFlow, and alongside the face detector I
applied a CNN to classify each image into the emotion classes angry, sad, happy, and calm.

I am Aman Prasad. My individual role in the project was to design the user interface with
front-end technologies such as HTML, CSS, JS, and Bootstrap, making the front end more creative
and interactive for the user. I also added various functionalities such as the previous button,
repeat, etc.

I am Saurabh Jaiswal. My contribution to this project is on the backend, for which I used
Python and Django. I integrated both the Haar Cascade ML model and the CNN classifier through
OpenCV, and I also trained the classifier using Google Colaboratory and TensorFlow.
10.3 Contribution of Team Members
All the team members contributed equally to the project, following best practices at every
stage. Although our roles differed, the contribution was equal, and every team member has a
proper understanding of the whole project.

Team member Saketh V put his effort into creating and training the ML model, while in parallel
working on the presentations, reports, and research paper.

Team member Aman Prasad gave his best in designing the front end and also helped the other team
members with the paper, report, and presentations.

Team member Saurabh Jaiswal took charge of the backend of the project while also working in
parallel on the presentations, report, and research paper.

All of the team members contributed equally to complete the project with full effort and
dedication.


APPENDIX A:
SAMPLE SCREEN SHOTS

Figure 1. Launching the application on a local server

Figure 2. Loading the application on a web browser

Figure 3. Calm Mood Detected by the application and playing suitable song

Figure 4. Calm Mood Detected by the application and playing suitable song

Figure 5. Calm Mood Detected by the application and playing suitable song

Figure 6. Data received by the model being processed on the backend with feedback for each
emotion detection cycle

Figure 7. Angry Mood Detected by the application and playing suitable song

Figure 8. Graphs of the accuracy of face detection

APPENDIX B:
SAMPLE CODE
Index.html:

{% load static %}

<!DOCTYPE html>
<html>

<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Emotion-Based Music Player</title>
<link rel="stylesheet" href="{% static 'css/styles.css' %}">
<link rel="shortcut icon" href="#">
</head>

<body>
<div class="music-container">
<div class="current-mood">
<h1 id="status"></h1>
</div>

<div class="current-song">
<div class="current-song-2">
<h5 id="current_song"></h5>
</div>
</div>

<div class="music-content">
<div id="bg-image">

<div id="volume-container">
<input type="range" class="slider" id="volumeslider" min="0" max="100"
value="100" step="1">
</div>

<div id="volume-low-button">
<img src="{% static 'imgs/volume-low.png' %}" id="volume-down">
</div>

<div id="volume-high-button">
<img src="{% static 'imgs/volume-high.png' %}" id="volume-up">
</div>

<div id="music-image">
<div id="circle-image">
<img src="{% static 'imgs/square.jpg' %}" alt="">
</div>
</div>

<div id="currentTime">
<span id="curtimetext">00:00</span>
<span id="durtimetext">00:00</span>
</div>

<div id="mute-button">
<button id="mutebtn" style="background: transparent; border: none;"><img
src="{% static 'imgs/speaker.png' %}" alt=""></button>
</div>

<div id="seek-container">
<input type="range" class="seekslider" id="seekslider" min="0" max="100"
value="0" step="1">
</div>
<div class="mood-button">
<button id="test" style="margin-top: 15px; border-radius: 5px; height: 30px; width:
150px; color: white; background: #333333; border: none;">Detect New Mood</button>
</div>
</div>

<!--Problem is the expression variable isn't being passed to here properly-->
<div id='imageCapture' style='display: none'></div>

<script src="https://code.jquery.com/jquery-3.5.1.min.js"></script>
<script type="text/javascript"
src='https://cdnjs.cloudflare.com/ajax/libs/webcamjs/1.0.25/webcam.js'></script>
<script src="{% static 'js/main.js' %}"></script>

</body>

</html>

Main.js

let mood, audio, playbtn, nextbtn, prevbtn, mutebtn, seekslider, volumeslider, seeking = false,
seekto,
curtimetext, durtimetext, current_song, dir, playlist, ext, agent, repeat, setvolume,
angry_playlist, angry_title,
angry_poster, happy_playlist, happy_title, happy_poster, calm_playlist, calm_title,
calm_poster, sad_playlist,
sad_title, sad_poster, playlist_index;

dir = "static/songs/"

angry_playlist = ["ACDC-BackinBlack", "OhTheLarceny-ManonaMission", "LedZeppelin-ImmigrantSong"];
angry_title = ["ACDC - Back in Black", "Oh The Larceny - Man on a Mission", "Led Zeppelin - Immigrant Song"];
angry_poster = ["static/song_imgs/back_in_black.jpg", "static/song_imgs/man_on_a_mission.jpg", "static/song_imgs/immigrant_song.jpg"];

happy_playlist = ["WillPharrell-Happy", "Kool&TheGang-Celebration", "RickAstley-NeverGonnaGiveYouUp"];
happy_title = ["Will Pharrell - Happy", "Kool & The Gang - Celebration", "Rick Astley - Never Gonna Give You Up"];
happy_poster = ["static/song_imgs/happy.jpg", "static/song_imgs/celebration.jpg", "static/song_imgs/never_gonna_give_you_up.jpg"];

calm_playlist = ["SmashMouth-AllStar", "DJOkawari-SpeedofLight", "BillieEilish-BadGuy"];
calm_title = ["Smash Mouth - All Star", "DJ Okawari - Speed of Light", "Billie Eilish - Bad Guy"];
calm_poster = ["static/song_imgs/all_star.jpeg", "static/song_imgs/speed_of_light.jpg", "static/song_imgs/bad_guy.jpg"];

sad_playlist = ["Adele-Hello", "CelineDion-MyHeartWillGoOn", "GaryJules-MadWorld"];
sad_title = ["Adele - Hello", "Celine Dion - My Heart Will Go On", "Gary Jules - Mad World"];
sad_poster = ["static/song_imgs/hello.JPG", "static/song_imgs/my_heart_will_go_on.jpg", "static/song_imgs/mad_world.jpg"];

playbtn = document.getElementById("playpausebtn");
nextbtn = document.getElementById("nextbtn");
prevbtn = document.getElementById("prevbtn");
mutebtn = document.getElementById("mutebtn");
seekslider = document.getElementById("seekslider");
volumeslider = document.getElementById("volumeslider");
curtimetext = document.getElementById("curtimetext");
durtimetext = document.getElementById("durtimetext");
current_song = document.getElementById("current_song");
repeat = document.getElementById("repeat");

audio = new Audio();


audio.loop = false;

Webcam.set({
width: 320,
height: 240,
image_format: 'jpeg',
jpeg_quality: 90
});
function fetchMusicDetails(mood) {
$("#playpausebtn img").attr("src", "static/imgs/pause.png");
switch (mood) {
case "Angry":
$("#circle-image img").attr("src", angry_poster[playlist_index]);
current_song.innerHTML = angry_title[playlist_index];
audio.src = dir + angry_playlist[playlist_index] + ext;
break;

case "Happy":
$("#circle-image img").attr("src", happy_poster[playlist_index]);
current_song.innerHTML = happy_title[playlist_index];
audio.src = dir + happy_playlist[playlist_index] + ext;
break;
case "Sad":
playlist_index = 0;
audio.src = dir + sad_playlist[0] + ext;
current_song.innerHTML = sad_title[playlist_index];
$("#circle-image img").attr("src", sad_poster[playlist_index]);
$("body").css("background-image", "linear-gradient(to bottom, rgba(14, 9, 121, 1)
69%, rgba(0, 189, 255, 1) 100%)");
break;
}
}

// After a short delay, request the detected emotion and load songs for that mood
// (getExpression is assumed to be defined elsewhere in the full main.js).
setTimeout(() => { getExpression() }, 2000);

Train classifier.ipynb:

# Imports assumed for the snippets below (Keras via TensorFlow)
import numpy as np
from PIL import Image
from tensorflow.keras import models
from tensorflow.keras.layers import (Conv2D, BatchNormalization, Activation,
                                     MaxPool2D, Dropout)
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ModelCheckpoint

train_dir = "train"
val_dir = "validation"
BATCH_SIZE = 64

# Since the data is separated into training and validation sets this time
# we can use the ImageDataGenerator directly

# Data augmentation for train only


train_datagen = ImageDataGenerator(rescale=1./255,
rotation_range=30,
shear_range=0.3,
width_shift_range=0.3,
height_shift_range=0.3,
zoom_range=0.3,
horizontal_flip=True,
fill_mode="nearest")

# The model predicted the sample image accurately


image = Image.open("train/Angry/0.jpg")
image = image.resize((48, 48))
arr = np.array(image)
x_data = [arr]
x_data = np.array(x_data, dtype = "float32")
x_data = x_data.reshape((len(x_data), 48, 48, 1))
x_data /= 255

pred_array = model.predict(x_data)
result = reverse_classes[np.argmax(pred_array)]  # reverse_classes is assumed to map a class index back to its emotion label
result

model = models.Sequential([
    # Block 1
    # Leave strides as default since images are small
    Conv2D(32, (3, 3), padding="same", kernel_initializer="he_normal", input_shape=(48, 48, 1)),
    BatchNormalization(),
    Activation("relu"),
    MaxPool2D((2, 2)),
    Dropout(0.25),
    # ... (the remaining convolutional blocks and the dense classifier head are omitted in this excerpt)
])

# Save only the best weights (lowest validation loss); the resulting
# emotion_classifier.h5 is the model file that the local backend later imports.
Checkpoint = ModelCheckpoint("emotion_classifier.h5",
                             monitor="val_loss",
                             mode="min",
                             save_best_only=True,
                             verbose=1)
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(16)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()
APPENDIX C:
PLAGIARISM REPORT

APPENDIX D:
PUBLICATION DETAILS

- The paper was accepted for publication at the 2022 Second International Conference on
  Computer Science, Engineering and Applications (ICCSEA), an IEEE conference.

- SUBMISSION OF PAPER:

- ACCEPTANCE OF PAPER:

APPENDIX E
TEAM DETAILS

NAME         Aman Prasad
ROLL NO      18113104
EMAIL        [email protected]
CONTACT NO   +91 7338967271

NAME         Saketh V
ROLL NO      18113081
EMAIL        [email protected]
CONTACT NO   +91 9908687873

NAME         Saurabh Jaiswal
ROLL NO      18113078
EMAIL        [email protected]
CONTACT NO   +91 9161727824
