Research Article
Design of Semantic Matching Model of Folk Music in
Occupational Therapy Based on Audio Emotion Analysis
Wensi Ouyang
School of Music, Shaanxi Normal University, Xi'an, Shaanxi 710119, China
Received 10 April 2022; Revised 17 May 2022; Accepted 26 May 2022; Published 16 June 2022
Copyright © 2022 Wensi Ouyang. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The main semantic symbol systems that people use to express their emotions are natural language and music. Analyzing and establishing the semantic association between language and music helps to provide more accurate retrieval and recommendation services for text and music. Existing research mainly focuses on the surface symbolic features and associations of natural language and music, which limits the performance and interpretability of applications based on this semantic association. Emotion is the main meaning that music expresses, and the semantic range of text includes emotion. In this paper, the semantic features of music are extracted from audio features, and a semantic matching model based on audio emotion analysis is constructed to analyze the emotion of ethnic music audio through the feature extraction ability of a deep structure. The model is based on an emotional semantic matching technology framework and realizes the emotional semantic matching of music fragments and words through a semantic emotion recognition algorithm. Multiple experiments show that when W = 0.65, the recognition rate of the multichannel fusion model is 88.42%, and the model can reasonably realize audio emotion analysis. When the spatial dimension of the music data changes, the classification accuracy reaches its highest at a spatial dimension of 25. Analyzing the semantic associations of audio promotes the application of folk music in occupational therapy.
1. Introduction

Emotion analysis aims to establish a harmonious human-machine environment by giving computers the ability to recognize, understand, express, and adapt to human emotions, making computers more comprehensively intelligent. Sentiment analysis is an important part of affective computing and a problem that must be solved in that field. Music is the carrier of emotion and an important aspect of affective computing. In the emotional analysis of music, people usually like sad music when they are sad and upset; in a happy mood, they like to listen to cheerful and dynamic music; and when they are calm, they tend to choose soothing, smooth light music, which is conducive to rationally solving problems and maintaining a healthy psychology. These potential "emotional tendencies" can expand the corresponding emotions of music listeners, and these emotions will generate some "motivation" to solve the corresponding problems [1]. Filmmakers like to add background music to a movie so that the audience and the characters in the movie resonate and share a common mood, giving the audience a "good impression" of the movie. Music and emotion are inextricably linked: music can affect emotions and also show emotions, and emotions can affect people's perception of music when they listen to songs. What makes music meaningful is the emotional connection between the participant and the listener. Emotion is the essential feature of music, and cognitive emotion has a corresponding relationship with the acoustic vibration and nonsemantic structure of music. Music not only provides entertainment but also has many social and psychological applications. Emotion analysis has become one of the most active research fields in natural language processing. It has also been extensively studied in the fields of data mining [2], text mining [3], content recommendation [4], and information retrieval [5]. Because of its commercial value, academic value, and importance to society as a whole, sentiment analysis has expanded from computer science to other disciplines.
The Internet has become an important medium for people to express their opinions and to get information, and the information conveyed by others can influence our opinions. Therefore, an automatic sentiment analysis system is required, which usually contains sentiment classification modules. According to the symbol system involved, sentiment analysis can be divided into text sentiment analysis [6], video sentiment analysis [7], and audio sentiment analysis [8]. Research on audio emotion recognition mainly deals with the audio signal; the widely used audio features include sound quality features, spectrum features, and rhythm features.

With the development of science and technology, models have significantly enhanced the ability to extract effective features from large amounts of data, and the topic of music therapy has attracted more and more attention. For music therapy workers, it is worth pondering and discussing how to develop music therapy by using rich local music resources. Chinese folk music culture can be effectively applied to music therapy and, through music therapy, carried forward to the world. Therapists guide patients to sing, play, and write music for therapeutic purposes. Applying folk music culture to music therapy supports both the internationalization of folk music culture and the development of music therapy with Chinese characteristics. Playing the guqin cultivates the mind and dispels troubles, and the patient becomes completely immersed in music, which brings good practical results for occupational therapy. Folk music culture is broad and profound, with a long history, far-reaching connotations, and a rich repertoire that can express various artistic conceptions and feelings. To apply folk music culture to music therapy, one needs to know one's own music culture well, and music therapy can then be applied over a very wide range [9]. Discrete models and dimensional models are both widely used in the field of music emotion cognition. The model includes four emotional adjectives, which are used to describe different emotional attributes in the music field and are grouped into similar emotional types according to their categories. Each link in the emotional circle is connected with its left and right neighbors in an emotional logic with a progressive relationship, and this progressive relationship represents the regular change of human emotions. The affective clusters of the affective ring have been widely recognized and applied in the field of musical emotion analysis. The selection of an emotion model depends on the emotion recognition method and the specific application scenario [10]. The emotional words in the model are extracted from people's perception of music and mainly reflect the psychological feelings of listeners toward the songs they hear, which is in line with the actual psychological interaction of musical emotions. How to handle nondescriptive music queries and music emotion analysis has become an important research topic for musicologists and psychologists.

2. Related Work

Musical emotion describes the inherent emotional expression of musical data and is widely used in music retrieval, music understanding, and other music-related applications. Audio data is the most important form of music data expression; a single note cannot show the beauty of music and cannot directly allow the composer and listener to communicate emotionally, and it can be said that without audio there is no music. Therefore, many researchers try to use audio processing to understand and analyze the emotion of music. Girgin proposed that music emotion classification based on audio data is an important method in music emotion analysis, segmenting audio data and extracting physical features. Lyrics are important information in a song and the carrier of its content, most intuitively expressing the original intention of the song [11]. Although social tags only briefly capture users' understanding of shared content, they are additional information about its semantic meaning and contain a potential resource classification. Social tags can be applied to the music field, including music classification by social tags, construction of semantic spaces, and other multiangle, multilevel applications. Juslin et al. proposed using the social labels of music resources to construct a semantic space for music retrieval and using tag clustering to improve retrieval in the tag space to obtain similar music [12]. If these tags can be well integrated into an application system, tags will play an important role in the music information retrieval system. On music resource sharing websites, a social tag records the user's understanding of the music, and its emotional interpretation is a freely assigned attribute of the music.

Music develops with the development of human society, but music has timeliness. Music retrieval needs to express music in the form of a score according to the tempo and pitch of the melody and to search music data according to melodic similarity. Gingras et al. have provided four search means. A modern music retrieval system can process hundreds of millions of music records, but a new music retrieval system should pay more attention to online retrieval performance, so that users from different social backgrounds can have a better experience retrieving music in different ways [13]. The content of music is very rich; it can be the rhythm of a melody or a whole piece. Content-based music information retrieval is a hot research topic at present, and more and more online music retrieval systems adopt it. Within content-based music retrieval, humming retrieval is one of the main research directions. Vaidya and Kalita based their humming music retrieval on audio processing analysis and obtained the fundamental frequency distribution of the input sound waves through autocorrelation measurement [14]. Xu et al. proposed extracting audio features with an improved melody contour extraction algorithm to remove the influence of noise from the environment and the audio input equipment on the humming query [15]. Pandeya and Lee proposed a feature kernel extraction method for semantic music retrieval, combined audio information and social context information for semantic music information retrieval, and developed a corresponding music retrieval system based on semantic understanding [16]. Based on the rhythm and timbre of music, Belyk et al. constructed a multilabel emotion recognition model based on principal component analysis, extracted audio features through a multilevel convolutional network, and learned vector representations of music labels for automatic music labelling [17]. Traditional music retrieval methods cannot meet current needs, so new retrieval methods and approaches must be found.
With more and more resources on the Internet, automatic text classification by computer has become an important research topic of natural language processing and artificial intelligence. Related research is still at an early stage, the technology still needs improvement, and music features are difficult to analyze while the representation of music remains immature, which affects the extraction of music features.

The main idea of the audio model based on emotional semantic matching proposed in this paper is to describe the audio in an emotional semantic space and then identify and match the audio through its emotional semantics, so as to produce the music list that best meets the emotional demands of occupational therapy.

3. Semantic Matching Model of Folk Music Based on Audio Emotion Analysis

3.1. Feature Extraction of Ethnic Music Audio. The automated description of music content is based on the feature extraction of computable time-frequency domain signals. The concept and extraction process of each feature are described below.

Frequency, for a simple sine curve, is defined as the number of cycles per second, in Hertz (Hz). For example, a sine wave with a frequency of 440 Hz completes 440 cycles per second. The reciprocal of frequency is the period, whose physical meaning is the duration in seconds of one oscillation, that is, the time interval of one cycle of the sinusoidal signal [18]. In the time domain, the analog signal is sampled at fixed intervals to obtain the digital signal. The spectrum of the time-domain signal is the expression of the audio signal in the frequency domain. The spectrum of a signal can be obtained by the Fourier transform, and the result of the Fourier transform is usually expressed by amplitude and phase, as shown in Figure 1.

Figure 1: Frequency spectrum (amplitude versus frequency in Hz) and spectrogram (frequency versus time) of an audio signal.

The figure shows the frequency-domain features of musical notes played by instruments; the positions of the sound frequencies are the same. Spectrum is an important factor determining sound quality, or timbre: complex sounds contain amplitude components at different frequencies. For the sampled signal, the discrete Fourier transform is computed. Frequency spectrum analysis of audio signals is usually carried out on a short segment, called a "frame," so that the short-time Fourier transform can capture changes of frequency content in time. Mathematically, this transformation multiplies the discrete signal by a window function, which is generally bell shaped and stationary over a short period of time.
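To make the framing and windowing step concrete, here is a minimal NumPy sketch of the short-time Fourier transform described above; the frame length of 2048 samples, hop of 512, and Hann window are illustrative assumptions rather than parameters reported in this paper.

```python
import numpy as np

def stft_spectrogram(signal, frame_len=2048, hop=512):
    """Short-time Fourier transform: split the signal into overlapping
    frames, taper each frame with a bell-shaped (Hann) window, and take
    the magnitude of the discrete Fourier transform of every frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of the real signal
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: one second of a 440 Hz sine sampled at 22050 Hz
sr = 22050
t = np.arange(sr) / sr
spec = stft_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (n_frames, frame_len // 2 + 1)
```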
3.2. Semantic Emotion Recognition Algorithm. In this study, the feature extraction and model training methods of a speech emotion recognition algorithm are used for recognition, together with the calculation and judgment of speech emotion through emotion dictionary queries combined with sentence structure. When semantic understanding is carried out through emotion dictionary queries, an emotion dictionary library must be established to obtain annotated emotion values. In this paper, several well-known emotion dictionaries are consolidated to establish the emotion dictionary library. The dictionary of negative words used in this paper is constructed by selecting common negative word dictionaries and adding popular negative expressions from the Internet. Emotion recognition is carried out by querying the emotion dictionary: emotion words and judgment words are annotated, and adverbs of degree are marked numerically. Different emotional words represent different emotional tendencies and strengths, so their annotated values also differ. Different degree adverbs have different functions and strengths, so determining the value of each degree adverb is also an important part of the emotion recognition process [18]. Through the modification by adverbs of degree, the intensity of an emotion changes. In this paper, degree adverbs are divided into four grades, and by setting a different emotional weight for each grade, degree words can modify the emotional words that follow them. By sorting out and processing the emotion word dictionary, the degree word dictionary, and the negative word dictionary, this paper builds an emotion dictionary base of common words on the basis of previous studies, which can be reused for text emotion recognition. Affective orientation is the value along the valence direction of the dimensional coordinate emotion model; it represents an emotional orientation and parameterizes whether the speaker's emotion is positive or negative and to what degree. A semantic emotion recognition system based on an emotion dictionary is quick to construct, with good recognition effect and fast recognition speed [19]. The proposed sentiment analysis algorithm is mainly composed of text segmentation and conversion, sentiment location, and sentiment aggregation. After the text information is obtained, it first needs to be segmented, splitting each sentence into word sets. Text segmentation is the basis of text information processing, and emotion recognition technology cannot do without it. In Chinese natural language processing, text segmentation uses language rules to find the boundaries between words and to mark parts of speech. Only through the multichannel fusion of multidomain knowledge can word segmentation be done well.
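The dictionary lookup, the four-grade degree weights, and the negation handling can be sketched as follows; the tiny word lists and numeric weights are hypothetical placeholders for the dictionaries described above, not the paper's actual resources.

```python
# A minimal sketch of dictionary-based sentence emotion scoring.
# The entries and weights below are hypothetical stand-ins for the
# emotion, degree-adverb, and negation dictionaries described above.
EMOTION_DICT = {"开心": 2.0, "快乐": 2.0, "悲伤": -2.0, "害怕": -1.5}
DEGREE_DICT = {"非常": 2.0, "很": 1.5, "比较": 1.2, "稍微": 0.8}  # four grades
NEGATION_WORDS = {"不", "没有", "并非"}

def sentence_emotion_value(words):
    """Scan a segmented sentence; degree adverbs scale, and negation
    words flip, the polarity of the emotion word that follows them."""
    score, scale, sign = 0.0, 1.0, 1.0
    for w in words:
        if w in DEGREE_DICT:
            scale *= DEGREE_DICT[w]
        elif w in NEGATION_WORDS:
            sign *= -1.0
        elif w in EMOTION_DICT:
            score += sign * scale * EMOTION_DICT[w]
            scale, sign = 1.0, 1.0  # reset modifiers after each emotion word
    return score

print(sentence_emotion_value(["我", "非常", "开心"]))  # 4.0
print(sentence_emotion_value(["我", "不", "开心"]))    # -2.0
```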
are also different. Different degree adverbs have different func-
tions and degrees, so determining the value of degree adverbs is 3.3. Semantic Matching Technology Framework Based on Audio
also an important part in the process of emotion recognition Emotion Analysis. In music retrieval, feature selection, repre-
[18]. In the process of emotion recognition, adverbs of degree sentation, and matching are the core techniques. Based on
need to be marked numerically. Through the modification of the research and analysis of music physical and perceptual
adverbs of degree, the intensity of emotion will also change. characteristics, this paper takes melody as the main feature
In this paper, degree adverbs are divided into four grades, and establishes melody representation model through pitch
and then by setting different emotional weights for each degree extraction and dynamic threshold segmentation algorithm to
adverb, degree words can modify the following emotional retrieve music data sets and input music samples. In order to
words through this processing. By sorting out and processing improve the retrieval accuracy, genetic algorithm was used to
the emotion word dictionary, degree word dictionary, and neg- align the template and correct the individual difference of
ative word dictionary, this paper builds an emotion dictionary matching input. The fusion Euclidean distance and similarity
base of common words on the basis of previous studies, which measure matching template is applied to enhance fault toler-
can be used for text emotion recognition in the future. Affective ance and generalization ability. Finally, the effectiveness of
orientation is the value of the valence direction in the dimen- the algorithm is verified by prototype system. Musical features
sional coordinate emotion model, which represents an emo- can be roughly divided into three levels—physical features,
tional orientation and represents the parameters of the acoustic features, and perceptual features. The semantic match-
positive and negative emotions and the positive and negative ing technical framework based on audio sentiment analysis is
degrees of the parties. The semantic emotion recognition sys- shown in Figure 2. Physical features mainly refer to the audio
tem based on emotion dictionary is quick to construct, with content recorded by physical carriers in a certain format, which
good recognition effect and fast recognition speed [19]. The is presented in the form of streaming media. Acoustic level fea-
proposed sentiment analysis algorithm is mainly composed of tures mainly include time and frequency domain features, such
text cutting and conversion, sentiment location, and sentiment as pitch frequency, short-term energy, zero crossing rate, LPC
aggregation. After obtaining the text information, the text infor- coefficient, and MFCC coefficient, which are the performance
mation needs to be converted into text cutting first, and the sen- characteristics of audio itself and are often used in each stage
Occupational Therapy International 5
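As an illustration of the acoustic-level features just listed, the following sketch computes per-frame short-term energy and zero crossing rate; the frame and hop sizes are assumptions.

```python
import numpy as np

def frame_features(x, frame_len=512, hop=256):
    """Per-frame short-term energy and zero crossing rate, two of the
    acoustic-level features listed above (frame sizes are assumptions)."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energy = float(np.sum(frame ** 2))                          # short-term energy
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))   # zero crossing rate
        feats.append((energy, zcr))
    return np.array(feats)

x = np.sin(2 * np.pi * 220 * np.arange(22050) / 22050)
print(frame_features(x)[:3])
```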
Music is a discrete sequence of notes that changes over time, yet it is perceived as a complete entity. The melody contour of music is the characteristic of pitch changing with time, and pitch is determined by the fundamental frequency of the music. Therefore, the melody contour can be extracted and described by extracting the pitch and describing it with an appropriate model. This paper proposes a melody representation model based on a standard template and an input template: it extracts the pitch template of the user's input audio file from the chord music file and relates it to the standard pitch frequency template; because both templates belong to the category of pitch frequency, their shapes are similar after normalization. On the basis of the above research, the melody representation model is further improved, and an appropriate matching algorithm is proposed to match the final retrieval results.
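The idea that the two pitch templates look similar after normalization can be illustrated with a simplified sketch: normalize both contours, then compare them by Euclidean distance. The genetic-algorithm alignment and the fused similarity measure used in the paper are omitted here; the crude truncation below is only a stand-in.

```python
import numpy as np

def normalize_contour(pitch):
    """Zero-mean, unit-variance normalization so that a constant pitch
    offset (transposition) does not affect the comparison."""
    pitch = np.asarray(pitch, dtype=float)
    return (pitch - pitch.mean()) / (pitch.std() + 1e-9)

def contour_distance(query, template):
    """Euclidean distance between two normalized contours; smaller
    means a better melody match."""
    q, t = normalize_contour(query), normalize_contour(template)
    n = min(len(q), len(t))  # crude truncation in place of GA alignment
    return float(np.linalg.norm(q[:n] - t[:n]))

query = [440, 494, 523, 494, 440]         # hummed input pitches (Hz)
template = [220, 247, 262, 247, 220]      # same melody an octave lower
print(contour_distance(query, template))  # ~0: contours match after normalization
```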
3.4. Construction of the Semantic and Audio Emotion Recognition Model. The model uses prosodic features, spectral features, voice quality, and audio features to carry out emotion recognition. With the development of speech denoising technology, the accuracy of speech recognition has improved. The text recognized from speech is highly credible, and semantic recognition by querying the dictionary can preliminarily determine the semantic emotional state of the speech information [20]. This paper establishes a speech emotion recognition system combining semantics; the overall system structure is shown in Figure 3. Phonetic features have a high recognition rate for emotional intensity but are limited in identifying emotional orientation. Semantic emotion recognition, on the other hand, performs very well in identifying emotional tendencies. The semantically combined speech emotion recognition model is set up to identify, respectively, the emotional tendency of the text and the emotional tendency and activation described by the acoustic characteristics of the voice; the text tendency and the acoustic identification are then fused at the decision level to obtain a multimodal result. The multichannel fusion of text emotion recognition and acoustic emotion recognition is carried out in the judgment of emotion orientation, namely, along the valence axis of the dimensional emotion model.

Figure 3: Structure diagram of the speech emotion recognition system with semantic combination: audio features are extracted from the audio file of the national music, feature words are extracted from the preprocessed text lyrics, and the outputs are the categories excitement, sadness, fear, anxiety, relief, and joy.

This gap is a relatively small difference in emotion recognition. When prosodic features make it difficult to identify positive emotion, the text emotion orientation is trusted and used to adjust the emotion value [21], and the emotional tendencies of the speech channel and the text channel are identified. The output data of the speech channel are the recognition result, the probability of a positive emotional orientation, and the probability of a negative emotional orientation. The output of the text channel is the affective tendency value obtained through calculation. The results of the two channels are used as input signals to the final model, and a decision-level fusion model fuses the two recognition results. Finally, recognition and classification with the discrete emotion model based on semantic combination are realized.

3.5. Feature Model Training. This design sets up a four-layer model and, based on it, uses audio features and lyric features, respectively, to classify emotions. The input layer of the network model is set to 100 and 130 nodes, respectively, according to the feature dimension of the input, and the learning rate is initialized to 0.05. The hidden layers compress the data layer by layer, with 400, 200, and 150 nodes, and the number of iterations is 150. The SoftMax layer at the end of the network takes the emotional category of the music as the output, with a total of 4 categories, so the number of nodes in the output layer is set to 4.
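Read literally, the four-layer classifier could be written as the following sketch; expressing it in PyTorch is my own choice, and only the layer sizes, the 4-way SoftMax output, and the 0.05 learning rate come from the text.

```python
import torch
import torch.nn as nn

def build_emotion_net(input_dim):  # 100 for audio features, 130 for lyric features
    """Four-layer classifier from Section 3.5: hidden layers of 400,
    200, and 150 nodes, followed by a 4-class SoftMax output layer."""
    return nn.Sequential(
        nn.Linear(input_dim, 400), nn.ReLU(),
        nn.Linear(400, 200), nn.ReLU(),
        nn.Linear(200, 150), nn.ReLU(),
        nn.Linear(150, 4),           # 4 emotion categories
        nn.Softmax(dim=1),
    )

audio_net = build_emotion_net(100)
optimizer = torch.optim.SGD(audio_net.parameters(), lr=0.05)  # learning rate 0.05
print(audio_net(torch.randn(8, 100)).shape)  # torch.Size([8, 4])
```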
Figure 4 shows the influence of the number of iterations on the matching error during network model training. The figure clearly shows that the matching error gradually decreases as the number of iterations increases. When the number of iterations increases from 1 to 100, the classification error decreases rapidly; between 150 and 200 iterations, the classification error no longer changes noticeably. This also indicates that the feature extraction ability of a deep belief network with a deep structure is stronger than that of a shallow network.

Figure 4: Matching error (function loss) versus number of iterations during network model training.
4. Experimental Results and Analysis

The music samples in this paper come from 1200 songs in the database, 1000 of which are used as training samples and the remaining 200 as test samples. The emotions of the sample songs were divided into 6 categories: excitement, sadness, fear, anxiety, relief, and joy. The audio was converted into a unified format, and the 30 s music fragment with the most emotional representation was selected for the classification of musical emotions. The one thousand training songs were randomly divided into five groups, and the five groups were then trained separately.

4.1. Recognition Rate of the Semantic Matching Model. The experimental training set carries out multichannel fusion of the two sets of data. In the recognition process, the recognition results of the speech channel must first be processed and converted into data that can be used as the input of multichannel fusion:

W_v = P_p − P_n, (1)

where W_v is the voice channel input, P_p is the probability that the speech recognition result tends to be positive, and P_n is the probability that the speech recognition result tends to be negative.

The output of this formula is a value from −1 to 1, and the output of the text channel is processed accordingly. By multiplying the result of text emotion recognition by a small weight, it is converted into the input data of the text channel:

W_s = 0.02 × A_s, (2)

where W_s is the text channel input and A_s is the sentence emotion value. In this way, the text channel input and the voice channel input are transformed to the same data format and order of magnitude, and the weighted multichannel fusion is then carried out on the two groups of data. The calculation formula is

W_e = W × W_s + (1 − W) × W_v, (3)

where W_e is the output of the multichannel fusion, W is the weight of the text channel input in the weighted multichannel fusion process, W_s is the text channel input, and W_v is the voice channel input.

Finally, the sign of the fused value is used to judge the emotion recognition result after multichannel fusion. By adjusting different weights, the recognition rate for each weight is counted, and the weight with the best recognition result is selected as the parameter of the multichannel fusion channel of the model. Figure 5 shows the recognition rate when different weights W are selected in the semantic matching model.
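Equations (1)-(3) and the weight search translate directly into code; in this sketch the channel outputs and gold labels are random placeholders, so only the sweep procedure, not the reported 88.42% figure, is reproduced.

```python
import numpy as np

def fuse(p_pos, p_neg, sentence_value, weight):
    w_v = p_pos - p_neg                        # Eq. (1): voice channel input in [-1, 1]
    w_s = 0.02 * sentence_value                # Eq. (2): scaled text channel input
    w_e = weight * w_s + (1 - weight) * w_v    # Eq. (3): weighted multichannel fusion
    return 1 if w_e > 0 else -1                # sign gives the fused orientation

# Placeholder channel outputs and gold orientations for illustration
rng = np.random.default_rng(0)
p_pos = rng.random(200); p_neg = 1 - p_pos
text_vals = rng.uniform(-50, 50, 200)
gold = rng.choice([-1, 1], 200)

best_w, best_acc = max(
    ((w, np.mean([fuse(p, n, s, w) == g
                  for p, n, s, g in zip(p_pos, p_neg, text_vals, gold)]))
     for w in np.arange(0.0, 1.01, 0.05)),
    key=lambda t: t[1])
print(best_w, best_acc)  # the paper reports W = 0.65 as optimal on its data
```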
According to the figure, when W = 0.65, the best recognition rate obtained by the multichannel fusion model is 88.42%. Under the optimal parameters, the recognition rates for excitement, sadness, fear, anxiety, relief, and joy are 80.3%, 82.6%, 87.5%, 83.4%, 81.2%, and 88.4%, respectively. The semantically integrated speech emotion recognition experiment finally assigns each sample to a specific emotion by combining the recognition result on the activation coordinate axis of each sample with the recognition result of the multichannel emotion orientation. The matching recognition rates were 85.3%, 82.5%, 80.3%, 79.5%, 76.2%, and 81.4%, respectively.

4.2. Evaluation of Audio Spatial Feature Vectors. The method of graph learning is used to compare a series of spatial feature vectors with the original feature vectors by integrating audio and text information, which demonstrates the validity of the musical space representation. In the original space, owing to the heterogeneity between different modal features, the test samples of each mode in the graph learning method can only search for training samples in their own space, while the spatial representation method allows the test samples to search across modes. The classification accuracy of the spatial learning methods is shown in Figure 6.
It can be seen from the figure that the classification accuracy using spatial features is better than that using the original features for most classes, which proves the validity of the spatial features. This validity arises precisely because spatial features make better use of the correlation between modal data than the original features do. When the spatial dimension of the music data changes, the classification accuracy reaches its highest at a spatial dimension of 25. The two curves show no obvious regular pattern as W varies. Such irregularity in the single mode, together with the improvement of results under multimode fusion, indirectly proves the effectiveness of the spatial representation in describing the correlation between modes.

4.3. Ability to Identify Emotional Tendencies. The audio in the data set is uniformly converted into WAV format. Each song is long and often repeats the same melody; therefore, 30 music clips of 15-45 seconds per song were used as audio data for emotional classification. Since the audio files in the data set cannot be directly input as training data, representative music features must be extracted from the audio files. The accuracy of the audio emotion construction and the coverage of the emotions affect the result of the emotion analysis, and in different scenarios the effect of audio sentiment analysis may not be as expected. The main purpose of the experiment is to verify the ability of the proposed music retrieval model based on affective semantic relevance to solve the problem of nondescriptive music query processing in a music retrieval system, and to verify the influence of different music matching methods on the model's recognition ability. The combined emotion recognition system is verified by comparing the results of three experiments, namely, emotion recognition based on voice acoustic features, speech emotion recognition based on text emotion recognition, and the semantic combination. The experimental results and data of the three experiments are shown in Figure 7.

Figure 7: Semantic matching speech emotion recognition experimental results data graph.

The data in the graph are expressed in the form of a matrix graph, and the recognition rates of the three emotion recognition algorithms can be seen intuitively. The recognition accuracy of the traditional support vector machine can be improved slightly by the dimensional-classification recognition method. On this basis, the recognition rate can be improved significantly by combining the semantic recognition results with multichannel fusion. Among them, the recognition rates of anxiety and fear, which have low activation, increased greatly, indicating that there is some complementarity between text information and phonological features in emotion recognition.

4.4. Emotional Semantic Matching Accuracy. Accuracy rate and recall rate were used as evaluation indexes. Accuracy is the number of correctly matched words divided by the total number of matched words.
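Under this definition, the two evaluation indexes can be computed as below; the word sets are illustrative, and the recall denominator (all words that should be matched) is my reading of the standard definition, since the original sentence is truncated in the source.

```python
def precision_recall(predicted, reference):
    """Precision: correctly matched words over all matched words.
    Recall: correctly matched words over all words that should match."""
    correct = len(set(predicted) & set(reference))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall

pred = ["joy", "calm", "fear"]        # words the model matched (illustrative)
ref = ["joy", "fear", "sad", "hope"]  # words that should be matched
print(precision_recall(pred, ref))    # (0.666..., 0.5)
```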
The reason sadness and joy were selected in the experiment is that they have higher emotional differentiation. Compared with the experimental results without emotion screening, the results are significantly improved; in particular, the effective information in the sequence is better captured. The module and the attention mechanism are effective in extracting sequence features: after training, they can effectively distinguish different emotions and are less affected by the intensity of the emotions. According to the experimental results, the music climax segment is selected as the music unit to validate the adaptive matching effect of the model, and the experimental results are shown in Figure 8.

It can be seen from Figure 8 that, using the climax-segment data, the model's accuracy in the six-emotion subject adaptability experiment is over 75%, and for the two emotions sadness and joy the accuracy on the theme is clearly above 82%. Therefore, although the accuracy of the model is limited, on the whole it can match music clips and words effectively. The matching effect for excitement and relief is better, with accuracy rates of 83.35% and 84.81%, respectively. The matching effect for anxiety is poor, and both anxiety and fear need to be improved. The model performs best in the emotional semantic matching between music and text: audio features can describe the audio characteristics of songs more effectively, and the feature vector model has higher accuracy and the best analysis effect.

5. Conclusion

The experiments verify the feasibility and advantages of the research results. In the process of participating in the establishment of the emotional speech database, the analysis of the annotation data summarizes the potential relationship between emotion types and dimension coordinates, so that the expression of emotion can be more detailed and comprehensive and the changing trend of emotion can be better reflected. By identifying the transformation model for training, the amount of training data relative to the training speed is increased effectively. The recognition accuracy of the traditional support vector machine can be improved slightly by the dimensional-classification recognition method; on this basis, the recognition rate can be improved significantly by combining semantic recognition results with multichannel fusion. However, there are some limitations in the research process, and further research is still needed in music classification. Emotion recognition technology has a wide range of application scenarios, has an extremely important influence and demand in many fields, and has great research value. In the future, I will continue to improve the semantic matching model and study the influence of emotional cognition on the music retrieval system.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.