Tech Talk Supplementary
A Modal Synthesis Background

We adopt tetrahedral finite element models to represent any given geometry [O'Brien et al. 2002]. The displacements, x ∈ R3N, in such a system can be calculated with the following linear deformation equation:

Mẍ + Cẋ + Kx = f,  (1)

where M, C, and K respectively represent the mass, damping, and stiffness matrices. We approximate the damping matrix with Rayleigh damping: C = αM + βK, which is a well-established practice. The system can be decoupled into the following form:

q̈ + (αI + βΛ)q̇ + Λq = U^T f,  (2)

where Λ is a diagonal matrix. The solution to Eqn. 2 is a bank of modes, i.e. damped sinusoidal waves. The i'th mode is

qi = ai e−di t sin(2πfi t + θi),  (3)

where fi is the frequency of the mode, di is the damping coefficient, ai is the excited amplitude, and θi is the initial phase. (fi, di, ai) together define the feature of mode i.

The values in Eqn. 3 depend on the material properties, the geometry, and the run-time interactions: ai and θi depend on the run-time excitation of the object, while fi and di depend on the geometry and the material properties:

di = (1/2)(α + βλi),  (4)

fi = (1/2π) √( λi − ((α + βλi)/2)² ),  (5)

where the eigenvalues λi's are calculated from M and K, which in turn depend on the mass density ρ, Young's modulus E, and Poisson's ratio ν.

B Feature Extraction

We extract the features {fi, di, ai} from the example audio using a time-varying frequency representation called the power spectrogram. A power spectrogram P for a time-domain signal s[n] is obtained by first breaking the signal into overlapping frames and then performing windowing and a Fourier transform on each frame:

P[m, ω] = | Σn s[n] w[n − m] e−jωn |²,  (6)

where w is the window applied to the original time-domain signal.

… is below a certain threshold, we collect it in the set of extracted features, shown as the red cross in the feature space (Figure 2c, where only the frequency f and damping d are shown).

C Parameter Estimation

C.1 Optimization Framework

The material parameters are estimated through an optimization framework. We first create a virtual object that is roughly the same size and geometry as the real-world object whose impact sound was recorded. We then calculate its mass matrix M and stiffness matrix K and find the assumed eigenvalues λ0i's using some initial values for the Young's modulus, mass density, and Poisson's ratio: E0, ρ0, and ν0. The eigenvalue λi for general E and ρ is just a multiple of λ0i:

λi = (γ/γ0) λ0i,  (7)

where γ = E/ρ is the ratio of Young's modulus to density, and γ0 = E0/ρ0 is the same ratio using the assumed values. Applying a unit impulse on the virtual object, at a point corresponding to the actual impact point in the example recording, gives an excitation pattern of the eigenvalues as in Eqn. 3, where the excitation amplitude of mode j is a0j. If the actual (unknown) impulse is not a unit impulse, then the excitation amplitude is just scaled by a factor σ:

aj = σa0j.  (8)

Combining Eqn. 4, Eqn. 5, Eqn. 7, and Eqn. 8, we obtain a mapping from an assumed eigenvalue and its excitation (λ0j, a0j) to an estimated mode with frequency f̃j, damping d̃j, and amplitude ãj:

(λ0j, a0j) −−{α,β,γ,σ}−→ (f̃j, d̃j, ãj).  (9)

The estimated sound s̃[n] is generated by mixing all the estimated modes,

s̃[n] = Σj ãj e−d̃j(n/Fs) sin(2πf̃j(n/Fs)),  (10)

where Fs is the sampling rate.

The estimated sound s̃[n] can then be compared against the example sound s[n], and a difference metric can be computed. An optimization process is used to find the parameter set with the minimal difference-metric value.

C.2 Psychoacoustic Metric
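To make the mapping of Eqn. 9 concrete, the Python sketch below composes Eqns. 4, 5, 7, and 8 into estimated mode features and mixes them per Eqn. 10. This is a minimal illustration only: the function names are ours and the inputs are arbitrary placeholders, not values from any real finite element model.

```python
import numpy as np

def estimated_modes(lam0, a0, alpha, beta, gamma_ratio, sigma):
    """The {alpha, beta, gamma, sigma} mapping of Eqn. 9: assumed
    eigenvalues lam0 and excitations a0 -> estimated (f~, d~, a~)."""
    lam = gamma_ratio * lam0                   # Eqn. 7: lam = (gamma / gamma0) * lam0
    d = 0.5 * (alpha + beta * lam)             # Eqn. 4: damping coefficient
    f = np.sqrt(np.maximum(lam - d ** 2, 0.0)) / (2.0 * np.pi)  # Eqn. 5: frequency
    a = sigma * a0                             # Eqn. 8: impulse-scaled amplitude
    return f, d, a

def mix_modes(f, d, a, fs, n_samples):
    """Mix all estimated modes into the estimated sound s~[n] (Eqn. 10)."""
    t = np.arange(n_samples) / fs
    return sum(aj * np.exp(-dj * t) * np.sin(2.0 * np.pi * fj * t)
               for fj, dj, aj in zip(f, d, a))
```

In the optimization framework, (α, β, γ, σ) would be the free variables, and the difference metric between s̃[n] and the recorded s[n] drives their update.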
Online Submission ID: 0028
… to the critical band rate as described previously. The damping is transformed to duration, which is proportional to the inverse of the damping value. Figure 3 shows the effect of the transformation. A matching score can then be computed between the transformed point sets.

[Figure 3 image: two point-set plots; (a) axes frequency (kHz) vs. damping (1/sec); (b) axes X(f) vs. Y(d).]

Figure 3: Point set matching problem in the feature domain: (a) in the original frequency and damping, (f, d)-space; (b) in the transformed, (x, y)-space, where x = X(f) and y = Y(d). The blue crosses and red circles are the reference and estimated feature points, respectively. The three features having the largest energies are labeled 1, 2, and 3.

D Residual Compensation

Figure 4 illustrates the residual computation process. From a recorded sound (Figure 4a), the reference features are extracted (Figure 4b), with frequencies, dampings, and energies depicted as the blue circles (Figure 4f). After parameter estimation, the synthesized sound is generated (Figure 4c), with the estimated features shown as the red crosses (Figure 4g), which all lie on a curve in the (f, d)-plane. Each reference feature may be approximated by one or more estimated features, and its match ratio number is shown. The represented sound is the summation of the reference features weighted by their match scores, shown as the solid blue circles (Figure 4h). Finally, the difference between the recorded sound's power spectrogram (Figure 4a) and the represented sound's (Figure 4d) is computed to obtain the residual (Figure 4e).

Figure 4: Residual computation.

D.2 Residual Transfer

As discussed in previous sections, modes transfer naturally with geometries in the modal analysis process, and they respond to excitations at runtime in a physical manner. In other words, the modal component of the synthesized sounds already provides transferability of sounds due to varying geometries and dynamics. Hence, we compute the transferred residual under the guidance of modes. Algorithm 1 shows the complete feature-guided residual transfer algorithm.

Algorithm 1: Feature-guided residual transfer
  Input: source modes Φs = {φs,i}, target modes Φt = {φt,j}, and source residual audio s_residual,s[n]
  Ψ ← DetermineModePairs(Φs, Φt)
  foreach mode pair (φs,k, φt,k) ∈ Ψ do
      Ps′ ← ShiftSpectrogram(Ps, ∆frequency)
      Ps′′ ← StretchSpectrogram(Ps′, damping ratio)
      A ← FindPixelScale(Pt, Ps′′)
      Presidual,s′ ← ShiftSpectrogram(Presidual,s, ∆frequency)
      Presidual,s′′ ← StretchSpectrogram(Presidual,s′, damping ratio)
      Presidual,t′′ ← MultiplyPixelScale(Presidual,s′′, A)
      (ωstart, ωend) ← FindFrequencyRange(φt,k−1, φt,k)
      Presidual,t[m, ωstart, . . . , ωend] ← Presidual,t′′[m, ωstart, . . . , ωend]
  end
  s_residual,t[n] ← IterativeInverseSTFT(Presidual,t)

Parameter estimation: We estimate the material parameters from various real-world audio recordings: a wood plate, a plastic plate, a metal plate, a porcelain plate, and a glass bowl. For each recording, the parameters are estimated using a virtual object that is of the same size and shape as the one used to record the audio clips. When the virtual object is hit at the same location as the real-world object, it produces a sound similar to the recorded audio, as shown in Fig. 5 and the supplementary video.

Fig. 6 compares the reference features of the real-world objects and the estimated features of the virtual objects as a result of the parameter estimation.

Transferred parameters and residual: The parameters estimated, as well as the residuals, can be transferred to virtual objects with different sizes and shapes, as shown in Fig. 7. From an example recording of a porcelain plate (a), the parameters for the porcelain material are estimated, and the residual computed (b). The parameters and residual are then transferred to a smaller porcelain plate (c) and a porcelain bunny (d).

Comparison with real recordings: Fig. 8 shows a comparison of the transferred results with the real recordings. From a recording of a glass bowl, the parameters for glass are estimated (column (a)) and transferred to other virtual glass bowls of different sizes. The synthesized sounds ((b), (c), (d), bottom row) are compared with the real-world audio for these different-sized glass bowls ((b), (c), (d), top row). More examples of transferring the material parameters as well as the residuals are demonstrated in the supplementary video.

F Perceptual Study

We also designed an experiment to evaluate the auditory perception of the synthesized sounds of five different materials. Each subject is presented with a series of 24 audio clips: 8 are audio recordings of sound generated from hitting a real-world object; 16 are synthesized using the techniques described in this paper. For each audio clip, …
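The psychoacoustic transform of Section C.2 maps each feature point to the (x, y)-space of Figure 3: frequency to the critical band rate, and damping to a duration proportional to 1/d. The exact X(f) used in the metric is not reproduced here; the sketch below substitutes Traunmüller's Bark-rate approximation as an assumed stand-in, purely for illustration.

```python
def critical_band_rate(f_hz):
    """Bark-scale critical band rate via Traunmueller's approximation
    (an assumed stand-in for the transform X(f))."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def transform_feature(f_hz, d):
    """Map a (frequency, damping) feature to the transformed (x, y)-space
    of Figure 3, with y = Y(d) proportional to the inverse damping 1/d."""
    return critical_band_rate(f_hz), 1.0 / d
```

Matching is then performed between point sets in this perceptually motivated space rather than in the raw (f, d)-plane.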
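Two of Algorithm 1's primitives, ShiftSpectrogram and StretchSpectrogram, admit simple array implementations. The sketch below is a crude stand-in (integer-bin frequency shift, nearest-neighbour time resampling) rather than the actual operators used in the residual transfer pipeline.

```python
import numpy as np

def shift_spectrogram(P, bin_shift):
    """Shift a power spectrogram along the frequency axis (axis 0) by an
    integer number of bins; vacated bins are zero-filled. A crude stand-in
    for Algorithm 1's ShiftSpectrogram."""
    out = np.zeros_like(P)
    if bin_shift >= 0:
        out[bin_shift:, :] = P[:P.shape[0] - bin_shift, :]
    else:
        out[:bin_shift, :] = P[-bin_shift:, :]
    return out

def stretch_spectrogram(P, ratio):
    """Stretch (ratio > 1) or compress (ratio < 1) a spectrogram along the
    time axis (axis 1) with nearest-neighbour resampling; a crude stand-in
    for Algorithm 1's StretchSpectrogram."""
    n_frames = max(1, int(round(P.shape[1] * ratio)))
    idx = np.minimum((np.arange(n_frames) / ratio).astype(int), P.shape[1] - 1)
    return P[:, idx]
```

In Algorithm 1, the shift amount comes from the frequency difference of a mode pair and the stretch ratio from the ratio of their dampings, so the transferred residual tracks the target object's modes.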
Figure 5: Parameter estimation for different materials. For each material, the material parameters are estimated using an example recorded
audio (top row). Applying the estimated parameters to a virtual object with the same geometry as the real object used in recording the audio
will produce a similar sound (bottom row).
(a) wood plate (b) plastic plate (c) metal plate (d) porcelain plate (e) glass bowl
Figure 6: Feature comparison of real and virtual objects. The blue circles represent the reference features extracted from the recordings of
the real objects. The red crosses are the features of the virtual objects using the estimated parameters. Because of the Rayleigh damping
model, all the features of a virtual object lie on the depicted red curve on the (f, d)-plane.
Table 1: Material Recognition Rate Matrix: Recorded Sounds

Table 2: Material Recognition Rate Matrix: Synthesized Sounds Using Our Method
References

O'BRIEN, J. F., SHEN, C., AND GATCHALIAN, C. M. 2002. Synthesizing sounds from rigid-body simulations. In ACM SIGGRAPH Symposium on Computer Animation.
Figure 7: Transferred material parameters and residual: from a real-world recording (a), the material parameters are estimated and the residual computed (b). The parameters and residual can then be applied to various objects made of the same material, including (c) a smaller object with similar shape; (d) an object with different geometry. The transferred modes and residuals are combined to form the final results (bottom row).
Figure 8: Comparison of transferred results with real-world recordings: from one recording (column (a), top), the optimal parameters and residual are estimated, and a similar sound is reproduced (column (a), bottom). The parameters and residual can then be applied to different objects of the same material ((b), (c), (d), bottom), and the results are comparable to the real-world recordings ((b), (c), (d), top).