
Automated Sentiment-Based Music Generation for Speech Content

Swathi Gowroju1, P. Srinivas Rao2, A. Yashwanth Kumar3, B. Sai Siddartha4, V. Sneharika5, K. Sriram6

1 Associate Professor, Dept. of CSE (AI&ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]
2 Associate Professor, Dept. of CSE (AI&ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]
3 Dept. of CSE (AI&ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]
4 Dept. of CSE (AI&ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]
5 Dept. of CSE (AI&ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]
6 Dept. of CSE (AI&ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]

Abstract. Automatic music generation is one of the latest innovations in artificial intelligence, in which background music is generated from the user's spoken content through sentiment analysis. Unlike existing systems that rely on explicit user input for analysis and music recommendation, the proposed method takes speech input from the user, converts it to text, and infers the user's sentiment using natural language processing techniques. The detected sentiment is then used to pair the voice with music that matches the emotional tone. This technique can be applied across multiple platforms, such as public speaking events, video podcasts, and social media, to play background music alongside speech in real time. The proposed algorithm is powerful and robust, and it is evaluated using different speech samples as input. The application is implemented in Python using speech processing and natural language processing libraries.
Keywords—Background music generation, Natural Language Processing, Artificial
Intelligence, Python Software.

I. Introduction
In digital content creation especially, demand for personalized and emotionally engaging multimedia experiences has grown rapidly in recent years. With the increasing use of podcasts, audiobooks, storytelling, and virtual assistants, there is a need to further enhance spoken content by integrating music that conveys the intended emotional tone. Background music can heighten emotional quality, set a specific mood, and draw listeners deeper into an experience. However, manually composing or selecting the right music for spoken content is often tedious and difficult, requiring a good understanding of the emotional undertone of the speech as well as knowledge of the range of music that supports that emotion.

These difficulties arise because speech emotions are often nuanced and varied, making it hard to locate or generate music that follows the mood of a particular conversation or narrative. Music selection through human intuition or pre-set playlists can lead to mismatches that degrade the listening experience. This gap between the emotional tone of speech and the music that accompanies it makes clear the need for an automated system that assesses speech content and generates complementary background music in real time.

Automated sentiment-based music generation is one option in the search for a solution to this problem. With advances in natural language processing (NLP) and machine learning, it is now possible to analyse the emotional content of speech as it happens. These technologies identify the underlying sentiment, whether joy, sorrow, anger, or tranquillity, and then either select or generate music aligned with the detected emotion. This ensures the music immediately reflects the emotional rhythm of the speech, enhancing listener engagement without manual intervention.

Sentiment-based music generation is nothing short of a game-changer for digital content creation across a range of applications: from the podcast format, where such a system could heighten the emotional contribution of the narrative, be it light-hearted comedy or grave investigative reporting, to virtual assistants, where music that adapts to the tone of the conversation makes the interaction feel more human and positively influences users.

Besides, sentiment-based real-time music creates opportunities for the creative industries. Musicians, content creators, and storytellers could employ such systems to create background scores that enhance their projects in an agile and responsive way for multimedia production. The entertainment industry, from gaming to film to advertising, can use it to build emotionally resonant soundscapes that evolve alongside the action unfolding in the narrative.

Automated sentiment-based music generation may therefore revolutionize the process of scoring speech content with background music. By using contemporary techniques in sentiment analysis and music generation, it becomes possible to create experiences that resonate with the listener and are fully personalized. This approach saves time and resources while turning the matching of sound to the emotional tone of voice content into a smooth and immersive process, yielding a richer multimedia experience across digital platforms.

Across the many circles of modern digital content creation, the pressing demand is for ever more personalized and affecting multimedia experiences. Setting background music that fits the emotional note of the spoken content is frequently a challenge: for podcasting, storytelling, and virtual assistants, composing or selecting the right music for a given speech by hand can be a considerable effort. This creates a demand for an automated system that pinpoints the emotions of a speech as it happens, in real time or close to it, and composes background music fit to that emotional tone within a split second. Such a system would deliver high-quality audio and make the listening experience far more immersive and alive without asking for any manual input.

II. Literature Survey


This paper [1] furnishes a model for generating multi-instrument symbolic music from emotional cues. The proposed algorithm uses continuous-valued emotions to condition music generation, allowing the formation of truly emotive compositions. Results from the study show that the model accurately predicted the next note in a sequence, demonstrating that it could understand and replicate emotional structures within the music. The study details the importance of emotion in musical composition and offers another way to use emotion as context for automatic music composition.
[2] This paper describes the theory of periodically time-variant (PTV) linear systems and establishes a framework for their analysis and comprehension. The authors offer algorithms for stability analysis, control design, system identification, and observer design for time-variant systems. The results show that the proposed algorithms can be applied in practice, validating their usefulness for real-world applications such as control system design.
This paper presents Seed-Music, an integrated framework under which music generation becomes amenable to greater control. Its seed-based approach enables the creation of high-quality music while allowing precise stipulations regarding the style, genre, and structure of the composition. The results show that the Seed-Music model can generate music of impressive quality across a wide range of genres and styles, providing a flexible tool for creators who need controlled music generation.
[3] The Stem Gen model is meant for generating music by listening to and analyzing input stems, taking the style and structural features of the source material into account. The authors propose a pioneering method in which the model does not merely generate music but can interpolate between different styles and produce a multitude of musical outputs. Results show that Stem Gen can produce both long and short compositions, is flexible enough to inspire musicians and producers, and allows cross-fading between different styles of music.
[4] This paper discusses several techniques used in converting speech to text and vice versa, including traditional methods such as hidden Markov models (HMMs) and Gaussian mixture models (GMMs) as well as more advanced deep-learning-based methods. The authors analyze these developing technologies and how they have improved the accuracy, naturalness, and efficiency of STT and TTS systems. They further provide examples of how researchers are applying deep learning to improve the accuracy of these systems and make them more broadly applicable and valuable [5].
This survey [6] offers a comprehensive discussion of various techniques for emotion detection from text. The authors review these methods, ranging from rule-based approaches to machine learning and deep learning models, and outline the strengths and weaknesses of each. The paper also emphasizes the applications of emotion detection, such as sentiment analysis, automated customer service, and mental health monitoring. The models reported so far show a great deal of progress in accurately identifying emotions from text, signifying the increasing importance accorded to emotion-aware technologies.
[7] The authors propose a fusion model that combines features extracted from both speech and text to enhance emotion recognition accuracy. By integrating the two modalities, the system achieves better emotion detection than unimodal approaches relying only on speech or text. The results reflect a substantial improvement in emotion recognition, showing how robust multimodal models are at grasping the complexity of human emotions. This approach paves the way for applications in domains such as virtual assistants, customer service, and emotion-aware interactive systems.
This letter [8] reviews various techniques employed in STT and TTS recognition systems, focusing on traditional models based on Hidden Markov Models. The review points out how recent advances have enhanced these systems' robustness and suitability for real-time applications.
The next review provides a comprehensive overview of the different methods employed for speech-to-text conversion, from traditional models to contemporary innovations in deep learning. The authors discuss multiple algorithms and their effectiveness in improving the accuracy, robustness, and efficiency of STT systems. They highlight how deep learning techniques, primarily neural networks, have improved the performance of STT models and made them applicable to many use cases such as voice assistants, transcription services, and real-time translation [9].
A further review elaborates on various annotation frameworks and approaches for sentiment analysis and emotion detection from text: rule-based methods, machine learning techniques, and deep learning models. The authors focus on developments in these domains, particularly how deep learning has improved the accuracy of emotion recognition and the contextual understanding of emotions in text [10-15]. Companies now analyse and act on what users are saying about them on social media and feedback platforms, highlighting the growing relevance of emotion-aware systems in modern technology using deep learning [16-23].

III. Proposed Methodology


The proposed methodology unfolds in multiple stages, beginning with the conversion of speech input to text using Automatic Speech Recognition (ASR). The ASR component records the audio input and converts it into text for later analysis. This step forms the basic foundation for understanding the content of what is being said, which is subsequently used to infer the speaker's emotional tone.
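As an illustration of this first stage, the following minimal sketch transcribes a recorded audio file using the Python SpeechRecognition package; the file name and the choice of the Google Web Speech recognizer are assumptions for demonstration, since the paper does not mandate a specific engine.

```python
# Minimal speech-to-text sketch using the SpeechRecognition package.
# "user_speech.wav" is a placeholder file name.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("user_speech.wav") as source:
    audio = recognizer.record(source)          # read the entire audio file

try:
    text = recognizer.recognize_google(audio)  # one possible ASR backend
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech could not be understood")
except sr.RequestError as err:
    print("ASR service unavailable:", err)
```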
The system processes the speech input through a number of distinct steps to generate an audio output that is dynamic and emotionally in tune. Voice-to-text conversion is the first step, in which speech recognition algorithms convert spoken input into text. By giving a written representation of the spoken words, this transcription guarantees that the speech is correctly recorded for further processing. The algorithm then analyzes the transcribed text using sentiment analysis to ascertain the speech's emotional tone. The system determines feelings such as positive, negative, or neutral and provides a confidence score using natural language processing algorithms. For example, a statement such as "I am feeling very happy today" would likely be categorized as confidently positive. The next phases are based on this detected sentiment, which guarantees that the music produced reflects the speech's emotional content.
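To make this labelling step concrete, the short sketch below scores the example sentence with TextBlob's polarity measure; TextBlob and the 0.1 thresholds are illustrative stand-ins, not the specific classifier used in this work.

```python
# Illustrative sentiment labelling with TextBlob (polarity is a float in [-1, 1]).
# The -0.1/0.1 thresholds are arbitrary choices for this sketch.
from textblob import TextBlob

def label_sentiment(text: str) -> str:
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        return "positive"
    if polarity < -0.1:
        return "negative"
    return "neutral"

print(label_sentiment("I am feeling very happy today"))  # expected: positive
```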
The system then creates music that fits the emotional tone based on the sentiment that was recognized. For instance, a track for a positive sentiment may be lively and joyful, whereas one for a negative sentiment would have a more sombre melody. At the audio mixing step, the system makes sure the speech and music are balanced and clear before combining the music with the original voice. Ultimately, a unified and emotionally impactful listening experience is created from the combined audio output. Multimedia production, personalized audio content development, and therapeutic applications are just a few of the fields in which this technology may be used.
After this, the speech is transcribed into text immediately after it is spoken. At this stage, NLP techniques are applied to extract the sentiment of the words. The algorithms in the NLP component examine the emotion of the speech, whether happy, sad, angry, excited, or simply neutral. This sentiment detection process is critical for ensuring that the generated background music fits the emotional context of the spoken content. The system is designed to recognize even small differences in emotion and to differentiate between emotions amidst complex speech patterns.
The proposed methodology for automatic sentiment-based music generation for speech content is explained in detail below; a minimal code sketch of the full pipeline is given after the list.
a) Input Speech: the module takes input through a user-recorded file.
b) Speech-to-Text Conversion: the application converts the input speech provided by the user to text using the SpeechRecognition module. Through Python's speech processing libraries, speech is converted into text using Hidden Markov Models and neural networks.
Overall, automatic sentiment-based music generation follows a number of steps, including sentiment analysis of the speech data and selection of background music to add to the original speech. The proposed architecture of the system is shown below:
Fig.3 System Architecture for proposed method
c) Sentiment Analysis: the text produced by the second module is analysed by a sentiment analysis algorithm and pretrained models to recognize the user's emotional tone. Sentiment analysis is performed using Python NLP libraries and pretrained models such as BERT.
d) Music Generation: emotion-based music is chosen according to the user sentiment detected in the text and played in the background. A specific music track is selected based on the detected sentiment.
e) Audio Mixing and Overlay: while mixing the two audio components, the main job is to balance volume and synchronize timing. Python, using the relevant libraries, overlays the music onto the speech during mixing.
f) Final Outcome: finally, a composite audio file containing the original voice and the background music accompaniment is created, as shown in Fig. 4 and Fig. 5. The combined result is stored on the local computer. This file is played back to assess the efficiency of the background music generator for the given speech data.
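A minimal end-to-end sketch of steps (a) to (f) follows. It assumes the SpeechRecognition, transformers, and pydub packages; the file names, the sentiment-to-track mapping, and the 12 dB music attenuation are illustrative placeholders rather than the exact configuration used by the authors.

```python
# Sketch of the full pipeline: transcribe -> detect sentiment -> pick a track -> overlay.
import speech_recognition as sr
from transformers import pipeline
from pydub import AudioSegment

SPEECH_FILE = "user_speech.wav"                      # placeholder input recording

# (a)-(b) Input speech and speech-to-text conversion
recognizer = sr.Recognizer()
with sr.AudioFile(SPEECH_FILE) as source:
    audio = recognizer.record(source)
text = recognizer.recognize_google(audio)

# (c) Sentiment analysis on the transcript (a BERT-family model via transformers)
classifier = pipeline("sentiment-analysis")
sentiment = classifier(text)[0]                      # e.g. {'label': 'POSITIVE', 'score': 0.92}

# (d) Music selection keyed on the detected sentiment (placeholder track names)
tracks = {"POSITIVE": "upbeat.mp3", "NEGATIVE": "melancholic.mp3"}
music = AudioSegment.from_file(tracks.get(sentiment["label"], "ambient.mp3"))

# (e) Mixing and overlay: attenuate, then loop/trim the music to the speech length
speech = AudioSegment.from_file(SPEECH_FILE)
music = music - 12                                   # quieter music keeps the speech intelligible
if len(music) < len(speech):
    music = music * (len(speech) // len(music) + 1)  # repeat until long enough
mixed = speech.overlay(music[:len(speech)])

# (f) Final outcome: composite audio saved locally
mixed.export("speech_with_background.wav", format="wav")
```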

IV. Result Analysis


This section presents the individual original speech and background music signals together with the final combined signal. Fig. 4 illustrates how the user's signal aligns with the background music by displaying two waveforms in different hues (red and blue). The y-axis shows the amplitude or intensity, while the x-axis shows a range of values (possibly frequency or duration, from -50 to 100). In the centre area the signals clearly overlap, signifying alignment.
To interpret voice input, determine its sentiment, and produce a combined audio output by superimposing background music, the proposed method combines several phases. The spoken input is converted into text by the speech-to-text component. This stage's performance depends on the correctness of the transcription, which can be affected by factors such as speech clarity, background noise, and accents. The system's efficacy depends on a low word error rate and real-time processing, although complicated speech patterns or loud settings may pose difficulties.

Fig. 4 Signal from user and background music aligning


Meaning and emotional context are extracted from the transcribed text through the text processing and sentiment analysis stages. The precision of sentiment categorization and the capacity for efficient context interpretation serve as indicators of their effectiveness. However, performance may be hampered by unclear or complicated language patterns; more sophisticated natural language processing methods can improve this stage's dependability. A waveform with amplitude as a function of time (x-axis) is depicted in Fig. 5. The graphic displays the combined signal that is produced when the background music and user input are merged. The waveform is dynamic, with amplitude changes over time; the x-axis appears to be in seconds based on the labeling.

Fig.5 Final Combined Signal
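Waveform figures of this kind can be reproduced from the exported audio with a short plotting script; the sketch below assumes the combined WAV file produced earlier and uses SciPy and Matplotlib, which the paper does not explicitly name.

```python
# Plot the combined signal's amplitude over time (similar in spirit to Fig. 5).
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

rate, samples = wavfile.read("speech_with_background.wav")  # placeholder file name
if samples.ndim > 1:                                        # mix stereo down to mono
    samples = samples.mean(axis=1)

t = np.arange(len(samples)) / rate
plt.plot(t, samples)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Final combined signal")
plt.show()
```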


The background music produced by the music creation module complements the
speech's emotional tone. The success of the system is largely dependent on the
caliber and applicability of the music that is produced. Although this stage
improves customisation, it is still difficult to adapt music in real time to dynamic
shifts in voice mood. The created music and the actual speech are combined
during the audio mixing and overlay stage. Maintaining voice intelligibility while
making sure the background music enhances the spoken words is the goal. Finding
the ideal balance between music and speech can be difficult in this situation,
particularly when voice volume varies. To solve these problems, efficient mixing
methods are necessary.
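One practical way to approach this balance, sketched below under the assumption that pydub is used for mixing, is to measure the average loudness (dBFS) of both signals and duck the music a fixed margin below the speech; the 15 dB margin and file names are illustrative choices, not values reported in the paper.

```python
# Loudness-aware ducking sketch: keep the music a fixed margin quieter than the speech.
from pydub import AudioSegment

speech = AudioSegment.from_file("user_speech.wav")   # placeholder file names
music = AudioSegment.from_file("background.mp3")

MARGIN_DB = 15                                       # illustrative headroom for intelligibility
gain = (speech.dBFS - MARGIN_DB) - music.dBFS        # how much to raise or lower the music
music = music.apply_gain(gain)

mixed = speech.overlay(music[:len(speech)])
mixed.export("balanced_mix.wav", format="wav")
```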
In conclusion, the suggested method exhibits great potential for improving
auditory experiences by generating music based on sentiments expressed through
speech. To get ideal performance, however, enhancements in sentiment analysis
dependability, transcription accuracy, and dynamic audio changes are required.
Applications for this technology can include interactive audio services and the
production of customized multimedia content.
Several Python NLP libraries can support this pipeline. NLTK (Natural Language Toolkit), which offers tools for fundamental text preprocessing such as tokenization, stemming, and lemmatization, is one of the most popular. Because of its ease of use and thorough documentation, NLTK is a popular option for tasks like named entity recognition and sentiment analysis, making it well suited to novices. Compared with more contemporary libraries, however, it may be somewhat slow.
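A short sketch of the kind of preprocessing and sentiment scoring NLTK supports is shown below; it uses the bundled VADER analyzer as one example, which is an assumption rather than the specific model used in this work.

```python
# Tokenization plus VADER sentiment scoring with NLTK.
import nltk
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("punkt", quiet=True)          # tokenizer data
nltk.download("vader_lexicon", quiet=True)  # VADER sentiment lexicon

text = "I am feeling very happy today"
print(word_tokenize(text))                  # ['I', 'am', 'feeling', 'very', 'happy', 'today']

scores = SentimentIntensityAnalyzer().polarity_scores(text)
print(scores)                               # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```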
Another well-known library, designed for performance and usability in production settings, is SpaCy. It supports several languages and provides powerful capabilities for named entity recognition, dependency parsing, and tokenization. Because of its great efficiency, SpaCy can be used in real-time applications such as recommendation systems and chatbots. Compared with more sophisticated frameworks like Hugging Face, its selection of pre-trained models is more constrained, despite its advantages.
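A brief sketch of SpaCy's tokenization, dependency parsing, and named entity recognition is given below; it assumes the small English model en_core_web_sm has been installed separately, and the example sentence is arbitrary.

```python
# Tokenization, dependency labels, and named entities with SpaCy.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes: python -m spacy download en_core_web_sm
doc = nlp("I recorded a happy podcast episode in Hyderabad yesterday.")

print([token.text for token in doc])                  # tokens
print([(token.text, token.dep_) for token in doc])    # dependency labels
print([(ent.text, ent.label_) for ent in doc.ents])   # named entities, e.g. GPE, DATE
```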
Using transformer-based models like BERT, GPT, and T5 for tasks like language
translation, text categorization, and summarization, Hugging Face Transformers is
the state-of-the-art in natural language processing. It offers pre-trained models that
provide state-of-the-art performance while drastically cutting down on training
time and resources. Hugging Face requires more processing power, which may be
a barrier for some projects, but it is especially well-suited for applications that
need extensive contextual awareness.
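The sketch below shows how a pre-trained transformer can be applied to the sentiment step through the Transformers pipeline API; the DistilBERT checkpoint named here is a commonly used English sentiment model and is an illustrative choice rather than the one the authors necessarily used.

```python
# Sentiment classification with a pre-trained transformer via Hugging Face Transformers.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # illustrative checkpoint
)
result = classifier("I feel so excited and thrilled today!")[0]
print(result["label"], round(result["score"], 2))              # e.g. POSITIVE 0.99
```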
Table 1: Various stages of the proposed system

Stage | Description
Speech-to-Text Conversion | Uses a voice recognition engine to translate spoken input into text.
Sentiment Analysis | Examines the transcribed text's emotional tone.
Music Generation | Creates music based on the sentiment that has been identified.
Audio Mixing and Overlay | Combines sentiment-matched background music with the speech to produce a single audio recording.
Final Audio Output Playback | Provides a finished audio output file with the music and voice ready for playback.

The Automated Sentiment-Based Music Generation for Speech Content method is a multi-step procedure that creates and superimposes background music corresponding to the speech's sentiment, transforming spoken material into an enhanced audio experience. Five separate steps make up the complete operation, each essential to producing the desired result. Speech-to-text conversion is the first step, in which a speech recognition engine transforms raw speech input into text. This phase entails properly transcribing the uttered words using advanced speech-to-text algorithms. If the input speech is "I feel so excited and thrilled today!", for example, the system provides the exact transcribed text for further processing. This step ensures that the substance of the spoken communication is captured for sentiment analysis.
The sentiment analysis stage then examines the text's emotional tone. The emotion is identified and a confidence score is assigned by the system using Natural Language Processing (NLP) techniques. With a 92% confidence level, the algorithm recognizes the sample input as expressing a positive emotion. The type of background music that will be produced is largely determined by this analysis.
Table 2: Generated output at each stage for a sample input

Input | Process | Output
Spoken input: "I feel so excited and thrilled today!" | The speech-to-text model applies recognition methods. | Text output: "I feel so excited and thrilled today!"
Text input: "I feel so excited and thrilled today!" | Sentiment is predicted using an NLP-based sentiment analysis algorithm. | Sentiment found: positive, with a 92% confidence score.
Detected sentiment: the speech is positive. | Music is selected or created using an AI music generation model. | Music generated: a lively and energizing background track.
Music: a lively music file | Balances clarity and loudness while combining the two audio sources. | Speech and music mixed into one audio file, with the music matching the sentiment.
Mixed audio file of speech and music | Produces a merged file that can be played back. | Final product: a voice recording with uplifting music playing in the background.

V. Conclusion
The proposed method was implemented successfully using Python and the relevant libraries. For the proposed method, the background music dataset from Kaggle.com was used. Using natural language processing techniques, speech was converted to text, from which the user's sentiment was identified for generating the background music. Based on the sentiment, a specific background music track is selected and mixed with the original speech to create a new recording with background music. In existing methods, background music is added to the original speech manually. The proposed methodology provides an advanced and efficient solution for automated sentiment-based generation of music for speech content, and it outperforms the existing state-of-the-art approaches to music generation for this task.

References
[1] Ferreira, Lucas N. and E. James Whitehead. “Learning to Generate Music With Sentiment.”
International Society for Music Information Retrieval Conference (2021).
[2] Jean-Pierre Briot, Gaëtan Hadjeres, and François Pachet. Deep learning techniques for music generation – a survey. arXiv preprint arXiv:1709.01620, 2017.
[3] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. DeepBach: a steerable model for Bach chorales generation. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1362–1371. JMLR.org, 2017.
[4] Ben Krause, Iain Murray, Steve Renals, and Liang Lu. Multiplicative LSTM for sequence
modelling. ICLR Workshop track, 2017.
[5] Sageev Oore, Ian Simon, Sander Dieleman, and Doug Eck. Learning to create piano
performances. In NIPS 2017 Workshop on Machine Learning for Creativity and Design, 2017.
[6] Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends
with one-class collaborative filtering. In Proceedings of the 25th International Conference on
World Wide Web, WWW ’16, pages 507–517, Republic and Canton of Geneva, Switzerland,
2016. International World Wide Web Conferences Steering Committee.
[7] Sixian Chen, John Bowers, and Abigail Durrant. 'Ambient Walk': A mobile application for mindful walking with sonification of biophysical data. In Proceedings of the 2015 British HCI Conference, British HCI '15, pages 315–315, New York, NY, USA, 2015. ACM.
[8] Hannah Davis and Saif M. Mohammad. Generating music from literature. In Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLfL), pages 1–10, 2014.
[9] Eduardo R. Miranda, Wendy L. Magee, John J. Wilson, Joel Eaton, and Ramaswamy Palaniappan. Brain-computer music interfacing (BCMI): from basic research to the real world of special needs. Music & Medicine, 3(3):134–140, 2011.
[10] Kristine Monteith, Tony R Martinez, and Dan Ventura. Automatic generation of music for
inducing emotive response. In International Conference on Computational Creativity, pages 140–
149, 2010.
[11] Briot, Jean-Pierre, and François Pachet. "Deep learning for music generation: challenges and
directions." Neural Computing and Applications 32, no. 4 (2020): 981-993.
[12] Conklin, Darrell. "Music generation from statistical models." In Proceedings of the AISB 2003
Symposium on Artificial Intelligence and Creativity in the Arts and Sciences, pp. 30-35. 2003.
[13] Ji, Shulei, Xinyu Yang, and Jing Luo. "A survey on deep learning for symbolic music generation:
Representations, algorithms, evaluations, and challenges." ACM Computing Surveys 56, no. 1
(2023): 1-39.
[14] Van Der Merwe, Andries, and Walter Schulze. "Music generation with markov models." IEEE
multimedia 18, no. 3 (2010): 78-85.
[15] Mangal, Sanidhya, Rahul Modak, and Poorva Joshi. "LSTM based music generation system."
arXiv preprint arXiv:1908.01080 (2019).
[16] Gowroju, Swathi, Sandeep Kumar, Aarti, and Anshu Ghimire. "Deep Neural Network for
Accurate Age Group Prediction through Pupil Using the Optimized UNet Model." Mathematical
Problems in Engineering 2022 (2022): 1-24.
[17] Swathi, A., Aarti, and Sandeep Kumar. "A smart application to detect pupil for small dataset with
low illumination." Innovations in Systems and Software Engineering 17 (2021): 29-43.
[18] Gowroju, Swathi, Aarti, and Sandeep Kumar. "Review on secure traditional and machine
learning algorithms for age prediction using IRIS image." Multimedia Tools and Applications 81,
no. 24 (2022): 35503-35531.
[19] Swathi Gowroju, “A novel implementation of fast phrase search for encrypted cloud storage”
(IJSREM-2019), volume-3-issue-09. ISSN: 2590-1892
[20] Swathi, A., and Shilpa Rani. "Intelligent fatigue detection by using ACS and by avoiding false
alarms of fatigue detection." In Innovations in Computer Science and Engineering: Proceedings
of the Sixth ICICSE 2018, pp. 225-233. Springer Singapore, 2019.
[21] Gowroju, Swathi, and Sandeep Kumar. "Robust deep learning technique: U-Net architecture for
pupil segmentation." In 2020 11th IEEE Annual Information Technology, Electronics and Mobile
Communication Conference (IEMCON), pp. 0609-0613. IEEE, 2020.
[22] Swathi, A., Aarti, V. Swathi, Y. Sirisha, M. Rishitha, S. Tejaswi, L. Shashank Reddy, and M.
Sujith Reddy. "A Reliable Novel Approach of Bio-Image Processing—Age and Gender
Prediction." In Proceedings of Fourth International Conference on Computer and
Communication Technologies: IC3T 2022, pp. 329-336. Singapore: Springer Nature Singapore,
2023.
[23] Swathi, A., and Shilpa Rani. "Intelligent fatigue detection by using ACS and by avoiding false
alarms of fatigue detection." In Innovations in Computer Science and Engineering: Proceedings
of the Sixth ICICSE 2018, pp. 225-233. Springer Singapore, 2019.
