MSC Data Science - 02 PDF
NCSR Demokritos
http://msc-data-science.iit.demokritos.gr
Semester: 3rd
Course: Multimodal Information Processing And Analysis
Lesson 2
Audio Representations
and Feature Extraction
Theodoros Giannakopoulos
Virtual Assistants
Music Search
Surveillance
Environmental Monitoring
Recommendation
analyze + recognize →
# music
# group:arctic_monkeys
# genre:indie
# genre:post_punk
# singer:alex_turner
# emotion_high_arousal
# emotion_negative_valence
# bpm:170
...
Demo code
What is sound?
- sound (physics): x(t)
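DFT
- Discrete Fourier Transform (DFT) of an N-sample signal x(n):
  $X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \quad k = 0, \dots, N-1$
- Inverse DFT:
  $x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2\pi k n / N}, \quad n = 0, \dots, N-1$
- I.e. the original signal can be written as a weighted sum of complex exponentials (the weights are the DFT coefficients, scaled by 1/N)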
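To make the two DFT formulas concrete, here is a minimal sketch that evaluates them directly as matrix products (O(N²), for illustration only) and checks the result against NumPy's FFT. The function names `dft` and `idft` are hypothetical helpers introduced here, not part of the course code.

```python
import numpy as np


def dft(x):
    """Naive DFT: X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)                      # column vector of frequency indices
    W = np.exp(-2j * np.pi * k * n / N)       # N x N matrix of complex exponentials
    return W @ x


def idft(X):
    """Inverse DFT: x[n] = (1/N) * sum_k X[k] * exp(+j*2*pi*k*n/N)."""
    X = np.asarray(X, dtype=complex)
    N = len(X)
    k = np.arange(N)
    n = k.reshape(-1, 1)                      # column vector of time indices
    W = np.exp(2j * np.pi * n * k / N)
    return (W @ X) / N


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)
    X = dft(x)
    print(np.allclose(X, np.fft.fft(x)))      # expected: True (same as NumPy's FFT)
    print(np.allclose(idft(X), x))            # expected: True (signal reconstructed)
```

In practice `np.fft.fft` / `np.fft.ifft` (O(N log N)) should be used; the explicit matrix form only shows that each x(n) is rebuilt as a weighted sum of complex exponentials.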
Why Mel?
```python
"""!
@brief Example 07
@details Frequency perceived discrimination experiment
@author Theodoros Giannakopoulos
"""
from __future__ import print_function
import os, time, scipy.io.wavfile as wavfile, numpy as np
from random import randint

try:
    input = raw_input  # Python 2: use raw_input() to read the answer as a string
except NameError:
    pass               # Python 3: input() already returns a string


def play_sound(freq, duration, fs):
    # synthesize a pure tone, write it to a WAV file and play it with SoX's `play`
    t = np.arange(0, duration, 1.0/fs); x = 0.5*np.cos(2 * np.pi * t * freq)
    wavfile.write("temp.wav", fs, x); os.system("play temp.wav -q")


if __name__ == '__main__':
    # base frequencies (Hz), frequency differences (Hz), trials per pair, sampling rate
    freqs, thres, n_exp, fs = [250, 500, 1000, 2000, 3000], [2, 5, 10, 20], 10, 16000
    answers = [[] for i in range(len(freqs))]
    for i_f, f in enumerate(freqs):
        for t in thres:
            answers[i_f].append(0)
            for i in range(n_exp):
                # randomly decide whether the higher tone (f + t) is played 1st or 2nd
                sequel = randint(1, 2)
                if sequel == 2:
                    play_sound(f, 0.5, fs); time.sleep(0.5); play_sound(f+t, 0.5, fs)
                else:
                    play_sound(f+t, 0.5, fs); time.sleep(0.5); play_sound(f, 0.5, fs)
                ans = int(input('Which was higher (1/2):'))
                if ans == sequel: answers[i_f][-1] += 1
    # print the fraction of correct answers per base frequency and frequency difference
    print("Freq\t", end='')
    for t in thres: print("{0:.1f}\t".format(t), end='')
    print("")
    for i_f, f in enumerate(freqs):
        print("{} Hz\t".format(f), end='')
        for i_t, t in enumerate(thres):
            print("{0:.1f}\t".format(answers[i_f][i_t] / float(n_exp)), end='')
        print("")
```

Thresholds: fraction of correct answers (out of 10 trials) per base frequency and frequency difference

Freq       2 Hz   5 Hz   10 Hz   20 Hz
250 Hz     0.7    1      1       1
500 Hz     0.4    0.8    0.9     1
1000 Hz    0.6    0.8    1       0.9
2000 Hz    0.5    0.4    0.9     1
3000 Hz    0.5    0.5    0.6     1
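The experiment above probes how well a fixed frequency difference is discriminated at different base frequencies, which is the motivation behind the Mel scale. As a sketch, the snippet below uses one common Hz-to-Mel formula, m = 2595 · log10(1 + f / 700) (several variants exist), and shows that the same 2 Hz step corresponds to fewer mels as the base frequency grows. The helper names `hz_to_mel` and `mel_to_hz` are illustrative, not from the course code.

```python
import numpy as np


def hz_to_mel(f_hz):
    """Hz -> Mel with the common formula m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(m):
    """Mel -> Hz, the exact inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


if __name__ == "__main__":
    # base frequencies and the smallest frequency difference used in the experiment
    freqs = [250, 500, 1000, 2000, 3000]
    delta = 2
    for f in freqs:
        d_mel = float(hz_to_mel(f + delta) - hz_to_mel(f))
        print("{} Hz: +{} Hz = +{:.3f} mel".format(f, delta, d_mel))
```

With this formula 1000 Hz maps to roughly 1000 mel, and the mel distance covered by a fixed Hz step shrinks monotonically with frequency.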
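[Figure: framing → short-term feature sequences f1, f2, … per frame (e.g. spectral centroid c1, c2, …, cN over frames 1, 2, …, N) → statistics of each sequence (e.g. σ²) → feature vector]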
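A minimal sketch of the pipeline in the figure, assuming a 50 ms window with a 25 ms hop (illustrative values), the spectral centroid as the single short-term feature, and mean plus standard deviation as the statistics. The functions `frame_signal`, `spectral_centroid` and `feature_vector` are hypothetical helpers, not the API of any particular library.

```python
import numpy as np


def frame_signal(x, win, step):
    """Split a 1-D signal into frames of `win` samples with hop `step` (needs len(x) >= win)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - win) // step
    return np.stack([x[i * step: i * step + win] for i in range(n_frames)])


def spectral_centroid(frame, fs):
    """Magnitude-weighted mean frequency (Hz) of a single frame."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    total = mag.sum()
    return float(np.sum(freqs * mag) / total) if total > 0 else 0.0


def feature_vector(x, fs, win_sec=0.050, step_sec=0.025):
    """Framing -> one feature value per frame (c1..cN) -> statistics over the frames."""
    win, step = int(round(win_sec * fs)), int(round(step_sec * fs))
    frames = frame_signal(x, win, step)
    c = np.array([spectral_centroid(fr, fs) for fr in frames])  # centroid sequence c1, ..., cN
    return np.array([c.mean(), c.std()]), c                     # [mean, std] summarizes it


if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs                 # 1 second of samples
    x = np.cos(2 * np.pi * 1000 * t)       # 1 kHz tone: centroid should be about 1000 Hz
    vec, c = feature_vector(x, fs)
    print(len(c), vec)
```

More short-term features (f2, f3, …) would each add their own statistics to the vector; variance (σ²) can replace the standard deviation if preferred.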
Time-domain features
- Energy
  - sum of squared samples, usually normalized by window length
  - high variation over successive speech frames (captured by the std statistic)
- Zero Crossing Rate
  - rate of sign changes during the frame
  - measure of noisiness
  - high values for noisy signals
- Energy Entropy
  - measure of abrupt changes in the signal's energy
  - divide each frame into K sub-frames and compute the (normalized) sub-frame energies (e_subframe_k)
  - compute the entropy of the e_subframe_k sequence
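The three time-domain features above can be sketched per frame as follows. Assumptions: energy is normalized by the frame length, zero samples are treated as positive when counting sign changes, K = 10 sub-frames by default, and a small constant `EPS` guards against division by zero and log(0). All names are illustrative.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero and log(0)


def energy(frame):
    """Sum of squared samples, normalized by the frame length."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sum(frame ** 2) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign differs (zeros count as positive)."""
    frame = np.asarray(frame, dtype=float)
    s = np.where(frame >= 0, 1, -1)
    return float(np.count_nonzero(np.diff(s)) / (len(frame) - 1))


def energy_entropy(frame, n_sub=10):
    """Entropy (bits) of the normalized sub-frame energies; low values mean abrupt changes."""
    frame = np.asarray(frame, dtype=float)
    L = len(frame) // n_sub                         # sub-frame length (tail samples dropped)
    sub = frame[:L * n_sub].reshape(n_sub, L)
    e_sub = np.sum(sub ** 2, axis=1)
    p = e_sub / (np.sum(e_sub) + EPS)               # normalized sub-energies
    return float(-np.sum(p * np.log2(p + EPS)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 16000
    t = np.arange(800) / fs                         # one 50 ms frame
    tone, noise = np.sin(2 * np.pi * 200 * t), rng.standard_normal(800)
    print("ZCR tone:", zero_crossing_rate(tone), "ZCR noise:", zero_crossing_rate(noise))
    burst = np.zeros(800); burst[400:440] = 1.0     # short energy burst inside the frame
    print("entropy flat:", energy_entropy(np.ones(800)), "burst:", energy_entropy(burst))
```

A constant-energy frame reaches the maximum entropy log2(K), while a frame whose energy is concentrated in one sub-frame gives an entropy close to 0.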
https://plot.ly/python/
Tempo estimation / beats detection
- tempo = 60 / period (bpm), with the beat period in seconds
- certain peaks that are not consistent with the estimated tempo are discarded
J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, M. B. Sandler, "A Tutorial on Onset Detection in Music Signals," IEEE Trans. Speech and Audio Processing, vol. 13, no. 5, pp. 1035-1047, Sep. 2005.
P. Desain, H. Honing, "Computational models of beat induction: The rule-based approach," J. New Music Research, vol. 28, no. 1, pp. 29-42, 1999.
E. D. Scheirer, "Tempo and beat analysis of acoustic musical signals," J. Acoust. Soc. Am., vol. 103, pp. 588-601, 1998.
M. E. Davies, M. D. Plumbley, "Context-dependent beat tracking of musical audio," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1009-1020, 2007.
M. F. McKinney, D. Moelants, "Ambiguity in tempo perception: What draws listeners to different metrical levels?," Music Perception: An Interdisciplinary Journal, vol. 24, no. 2, pp. 155-166, 2006.
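As an illustration of tempo = 60 / period, here is a minimal sketch of one simple approach (not necessarily the method behind the slide): a spectral-flux onset envelope, its autocorrelation, and the strongest lag inside an assumed 60 to 200 bpm range taken as the beat period. It does not implement the slide's step of discarding inconsistent peaks. The window/hop sizes and the names `onset_envelope` and `estimate_tempo` are assumptions.

```python
import numpy as np


def onset_envelope(x, fs, win=1024, hop=512):
    """Spectral-flux onset strength: summed positive magnitude increases between frames.
    Requires len(x) >= win + hop so that at least two frames exist."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - win) // hop
    window = np.hanning(win)
    mags = np.array([np.abs(np.fft.rfft(window * x[i * hop: i * hop + win]))
                     for i in range(n_frames)])
    flux = np.maximum(np.diff(mags, axis=0), 0.0).sum(axis=1)
    return flux, fs / hop                  # envelope and its rate (frames per second)


def estimate_tempo(x, fs, bpm_range=(60, 200), win=1024, hop=512):
    """Tempo (bpm) from the autocorrelation peak of the onset envelope."""
    env, env_rate = onset_envelope(x, fs, win, hop)
    env = env - env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]    # lags 0 .. len(env)-1
    min_lag = int(np.floor(env_rate * 60.0 / bpm_range[1]))   # shortest allowed period
    max_lag = int(np.ceil(env_rate * 60.0 / bpm_range[0]))    # longest allowed period
    lag = min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))
    period = lag / env_rate                                    # beat period in seconds
    return 60.0 / period                                       # tempo = 60 / period


if __name__ == "__main__":
    fs = 8000
    x = np.zeros(10 * fs)
    x[::fs // 2] = 1.0                     # synthetic click every 0.5 s (120 bpm)
    print(estimate_tempo(x, fs, win=400, hop=200))
```

The tempo resolution is limited by the hop size, and autocorrelation peaks at multiples of the true period can cause half/double-tempo errors, which relates to the metrical-level ambiguity studied by McKinney and Moelants.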
Usage example:
python3 example15.py ../data/musical_genres_small/hiphop/nwa_straght_outta_campton.wav
Pitch tracking
- f0:
  - fundamental frequency
  - a physical property of sound
  - speech: frequency of the glottal pulses
  - music: the lowest frequency of a note's harmonic series (e.g. the fundamental vibration frequency of a string), not necessarily its strongest component
- pitch:
  - a subjective, perceptual phenomenon (whereas f0 is open to measurement)
  - follows f0
- speech:
  - pitch is not always clear
  - VAD (voice activity detection) required
- music:
  - note transcription
  - polyphony
- Pitch tracking:
  - time-domain or spectral-domain methods
  - spectral: simple argmax? No! f0 is not always the frequency with the maximum magnitude in the spectrogram
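The warning against a simple spectral argmax can be checked on a synthetic tone. The sketch below assumes a harmonic signal with f0 = 200 Hz whose 2nd harmonic is the strongest, a 100 ms frame at 16 kHz, and an f0 search range of 50 to 500 Hz. It compares the naive argmax with a basic autocorrelation-based estimate, which is one common time-domain approach (real trackers add voicing decisions, interpolation and smoothing). Function names are illustrative.

```python
import numpy as np


def spectral_argmax_f0(frame, fs):
    """Naive estimate: frequency of the largest (Hann-windowed) spectral peak."""
    frame = np.asarray(frame, dtype=float)
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    return float(np.fft.rfftfreq(len(frame), d=1.0 / fs)[np.argmax(mag)])


def autocorr_f0(frame, fs, fmin=50.0, fmax=500.0):
    """Time-domain estimate: lag of the autocorrelation maximum within [1/fmax, 1/fmin]."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0 .. N-1
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return fs / lag


if __name__ == "__main__":
    fs = 16000
    t = np.arange(1600) / fs                          # 100 ms frame
    # harmonics of 200 Hz; the 2nd harmonic (400 Hz) is deliberately the strongest
    x = (0.3 * np.cos(2 * np.pi * 200 * t) + 1.0 * np.cos(2 * np.pi * 400 * t)
         + 0.6 * np.cos(2 * np.pi * 600 * t))
    print("spectral argmax:", spectral_argmax_f0(x, fs))
    print("autocorrelation:", autocorr_f0(x, fs))
```

Here the spectral argmax lands on the 400 Hz harmonic, while the autocorrelation peak at a lag of one period (80 samples) recovers f0 = 200 Hz.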