EE E6820: Speech & Audio Processing & Recognition
Lecture 1: Introduction & DSP
Dan Ellis <dpwe@ee.columbia.edu>, Mike Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
January 22, 2009

Outline
1 Sound and information
2 Course structure
3 DSP review: timescale modification

Sound and information
Sound is air pressure variation:
mechanical vibration → pressure waves in air → motion of a sensor → time-varying voltage v(t).
Transducers convert air pressure ↔ voltage.
What use is sound?
Footsteps examples: [two waveform plots, 0-5 s]
Hearing confers an evolutionary advantage:
- useful information, complementing vision ... at a distance, in the dark, around corners
- listeners are highly adapted to 'natural sounds' (including speech)

The scope of audio processing
[overview figure]

The acoustic communication chain
message → signal → channel → receiver → decoder
(synthesis → audio processing → recognition)
Sound is an information bearer.
Received sound reflects the source(s) plus the effect of the environment (the channel).

Levels of abstraction
Much processing
concerns shifting between levels of abstraction:
sound p(t) (concrete) → representation (e.g. time-frequency energy) → 'information' (abstract),
moving up by analysis and back down by synthesis.
Different representations serve different tasks: separating aspects, making things explicit, ...

Course structure
Goals:
- survey topics in sound analysis & processing
- develop an intuition for sound signals
- learn some specific technologies
Structure:
- weekly assignments (25%)
- midterm event (25%)
- final project (50%)
Text: Speech and Audio Signal Processing, Ben Gold & Nelson Morgan, Wiley, 2000. ISBN 0-471-35154-7.
Web-based
Course website: http://www.ee.columbia.edu/~dpwe/e6820/
- for lecture notes, problem sets, examples, ...
- plus student web pages for homework, etc.

Course outline
Fundamentals: L1 DSP; L2 Acoustics; L3 Pattern recognition; L4 Auditory perception.
Audio processing: L5 Signal models; L6 Music analysis/synthesis; L7 Audio compression; L8 Spatial sound & rendering.
Applications: L9 Speech recognition; L10 Music retrieval; L11 Signal separation; L12 Multimedia indexing.

Weekly assignments
Research papers:
- journal & conference publications
- summarize & discuss in class
- written summaries on web page + Courseworks discussion
Practical experiments:
- Matlab-based (+ Signal Processing Toolbox)
- direct experience of sound processing
- skills for the project
Book sections.

Final project
Most significant part of the course (50% of the grade).
Oral proposals mid-semester; presentations in the final class + website.
Scope:
- practical (Matlab recommended)
- identify a problem; try some solutions
- evaluation
Topic:
- few restrictions within the world of audio
- investigate other resources
- develop in discussion with me
Citation & plagiarism.

Examples of past projects
Automatic prosody classification; model-based note transcription.
DSP review: digital signals
Discrete-time sampling limits bandwidth; discrete-level quantization limits dynamic range:
x_d[n] = Q( x_c(nT) )
with sampling interval T, sampling frequency Ω_T = 2π/T, and a quantizer Q(·) that rounds to the nearest multiple of the step size ε.
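As a concrete illustration of x_d[n] = Q(x_c(nT)), here is a minimal Matlab sketch (not from the course materials; the signal and step size are made up) that samples a sinusoid and rounds it to steps of ε:

```matlab
% Sample and quantize a 440 Hz sinusoid (illustrative sketch).
fs    = 8000;          % sampling frequency, Hz (Omega_T = 2*pi*fs rad/s)
T     = 1/fs;          % sampling interval, s
eps_q = 0.05;          % quantizer step size (epsilon)

n  = 0:fs/10-1;                       % 100 ms worth of sample indices
xc = @(t) 0.9*sin(2*pi*440*t);        % the 'continuous-time' signal
xd = eps_q * round(xc(n*T)/eps_q);    % x_d[n] = Q(x_c(nT)): round to steps of eps_q

plot(n*T, xc(n*T), n*T, xd, '.');
xlabel('time / s'); legend('x_c(nT)', 'x_d[n]');
```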
The speech signal: time domain
Speech is a sequence of different sound types ("has a watch thin as a dime"):
- vowel: periodic ("has")
- fricative: aperiodic ("watch")
- glide: smooth transition ("watch")
- stop burst: transient ("dime")
[waveform close-ups of each sound type]

Timescale modification (TSM)
Can we modify a sound to make it 'slower', i.e. speech pronounced more slowly (e.g. to help comprehension or analysis), or more quickly for 'speed listening'?
Why not just slow it down?
x_s(t) = x_o(t/r), r = slowdown factor (> 1 ⇒ slower)
is equivalent to playback at a different sampling rate, so pitch and timbre shift as well.
[waveforms: original and 2x slower, r = 2]

Time-domain TSM
Problem: we want to preserve local time structure but alter global time structure.
Repeat segments? But that gives artifacts from the abrupt edges.
Cross-fade & overlap:
y^m[mL + n] = y^{m-1}[mL + n] + w[n] · x[⌊m/r⌋ L + n]
[illustration: windowed segments 1-6 overlapped into the stretched output]
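A minimal sketch of the cross-fade-and-overlap formula above, assuming a Hann window of length 2L with 50% overlap (window choice, frame counting, and function name are mine, not from the course code):

```matlab
function y = ola_stretch(x, r, L)
% Time-domain TSM by windowed overlap-add: output frame m is taken from
% input position floor(m/r)*L, so r > 1 slows the signal down.
% Plain OLA, so expect some artifacts from unaligned frames (see SOLA below).
N = 2*L;                                   % window length, 50% overlap
w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);        % periodic Hann cross-fade window
x = x(:);
nFrames = floor((length(x)/L) * r) - 2;
y = zeros(nFrames*L + N, 1);
for m = 0:nFrames-1
    src = floor(m/r)*L;                    % input frame start (0-based)
    if src + N > length(x), break; end
    seg = x(src+1 : src+N) .* w;           % windowed input segment
    y(m*L+1 : m*L+N) = y(m*L+1 : m*L+N) + seg;   % overlap-add into the output
end
end
```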
Synchronous overlap-add (SOLA)
Idea: allow some leeway in placing each window so as to optimize the alignment of the overlapping waveforms.
Hence
y^m[mL + n] = y^{m-1}[mL + n] + w[n] · x[⌊m/r⌋ L + n + K_m]
where the offset K_m is chosen by normalized cross-correlation over the overlap region (length N_ov):
K_m = argmax_{0 ≤ K ≤ K_u}  ( Σ_{n=0}^{N_ov} y^{m-1}[mL+n] · x[⌊m/r⌋L+n+K] )  /  sqrt( Σ_n (y^{m-1}[mL+n])² · Σ_n (x[⌊m/r⌋L+n+K])² )
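The K_m search maps directly onto a loop; a sketch building on the OLA code above, with maximum offset Ku and overlap length Nov as free parameters (the structure and names are mine):

```matlab
function y = sola_stretch(x, r, L, Ku, Nov)
% SOLA: like plain OLA, but each new segment may be shifted by up to Ku
% samples so that its first Nov samples best align (normalized
% cross-correlation) with what is already in the output buffer.
N = 2*L;  w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);  x = x(:);
nFrames = floor((length(x)/L) * r) - 2;
y = zeros(nFrames*L + N + Ku, 1);
for m = 0:nFrames-1
    src = floor(m/r)*L;
    if src + N + Ku > length(x), break; end
    yseg = y(m*L+1 : m*L+Nov);                 % existing output in the overlap region
    best = 0;  bestScore = -inf;
    for K = 0:Ku                               % K_m = argmax normalized cross-correlation
        xseg  = x(src+K+1 : src+K+Nov);
        denom = sqrt(sum(yseg.^2) * sum(xseg.^2)) + eps;
        score = (yseg' * xseg) / denom;
        if score > bestScore, bestScore = score; best = K; end
    end
    seg = x(src+best+1 : src+best+N) .* w;
    y(m*L+1 : m*L+N) = y(m*L+1 : m*L+N) + seg;
end
end
```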
The Fourier domain
Fourier series (periodic continuous x with period T), with Ω_0 = 2π/T:
x(t) = Σ_k c_k e^{jkΩ_0 t},   c_k = (1/T) ∫_{-T/2}^{T/2} x(t) e^{-jkΩ_0 t} dt
Fourier transform (aperiodic continuous x):
x(t) = (1/2π) ∫ X(jΩ) e^{jΩt} dΩ,   X(jΩ) = ∫ x(t) e^{-jΩt} dt

Discrete-time Fourier
Discrete-time Fourier transform (aperiodic sampled x):
x[n] = (1/2π) ∫_{-π}^{π} X(e^{jω}) e^{jωn} dω,   X(e^{jω}) = Σ_n x[n] e^{-jωn}
(X(e^{jω}) is periodic in ω with period 2π.)
Discrete Fourier transform (N-point x):
x[n] = (1/N) Σ_k X[k] e^{j2πkn/N},   X[k] = Σ_n x[n] e^{-j2πkn/N}

Sampling and aliasing
Discrete-time signals equal the continuous-time signal at the discrete sampling instants: x_d[n] = x_c(nT).
Sampling cannot represent rapid fluctuations:
sin( (Ω_M + 2π/T) Tn ) = sin(Ω_M Tn)  for all n ∈ Z
The Nyquist limit (Ω_T/2) follows from the periodic spectrum:
a component above it appears as an 'alias' of a 'baseband' signal, since the replicated spectrum G_p(jΩ) has images at ±Ω_M, Ω_T ± Ω_M, ...

Speech sounds in the Fourier domain
[time-domain and spectrum plots of the four sound types: vowel (periodic, "has"), fricative (aperiodic, "watch"), glide (transition, "watch"), stop (transient, "dime")]
dB = 20 log10(amplitude) = 10 log10(power)
A voiced spectrum has pitch harmonics plus formants.

Short-time Fourier transform
Want to localize energy in time and frequency:
- break the sound into short-time
pieces
- calculate the DFT of each one
[illustration: hop L between short-time windows; the DFT of each windowed frame gives one column of X[k, m]]
Mathematically,
X[k, m] = Σ_{n=0}^{N-1} x[n] w[n − mL] exp( −j 2πk(n − mL) / N )
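The formula translates directly into a loop over hops; a sketch assuming a Hann window and hop L (function name and frame counting are mine):

```matlab
function X = stft_frames(x, N, L)
% Short-time Fourier transform:
% X(k+1, m+1) = sum_n x[n] w[n-mL] exp(-j 2 pi k (n-mL) / N),
% i.e. the N-point DFT of each windowed, hop-L frame.
x = x(:);
w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);      % periodic Hann window
nFrames = floor((length(x) - N) / L) + 1;
X = zeros(N, nFrames);
for m = 0:nFrames-1
    frame = x(m*L+1 : m*L+N) .* w;       % window the m-th frame
    X(:, m+1) = fft(frame);              % DFT over k = 0..N-1
end
end
```

For example, imagesc(20*log10(abs(X(1:N/2+1,:)) + eps)); axis xy displays the magnitude of the positive-frequency bins as a spectrogram (next slide).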
The spectrogram
Plot the STFT magnitude |X[k, m]| as a gray-scale image (frequency vs. time, intensity in dB).
[spectrogram of the example utterance]

Time-frequency tradeoff
A longer window w[n] gains frequency resolution at the cost of time resolution:
- window = 48 points: wideband
- window = 256 points: narrowband
[wideband vs. narrowband spectrograms of the same utterance]

Speech sounds on the spectrogram
The spectrogram is the most popular speech visualization.
[spectrogram of "has a watch thin as a dime" with the vowel (periodic), fricative (aperiodic), glide (transition), and stop (transient) regions marked]
Wideband (short window) is better than narrowband (long window) for seeing formants.

TSM with the spectrogram
Just stretch out the spectrogram?
[original and time-stretched spectrograms]
But how do we resynthesize? The spectrogram is only the magnitude |Y[k, m]|.
The phase vocoder
Timescale modification in the STFT domain.
Magnitude from the 'stretched' spectrogram:
|Y[k, m]| = |X[k, m/r]|   (e.g. by linear interpolation)
But preserve the phase increment between slices:
θ̇_Y[k, m] = θ̇_X[k, m/r]   (e.g. by a discrete differentiator)
This does the right thing for a single sinusoid: the overlapped parts of the sinusoid stay aligned, because the phase advance over each synthesis hop matches the measured rate θ̇ = Δθ/ΔT.
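A simplified Matlab sketch of this idea (not the course's pvoc code): magnitudes are interpolated at fractional analysis frame m/r, and the synthesis phase is accumulated from the measured inter-frame phase increment. Window normalization and phase-unwrapping refinements are omitted.

```matlab
function y = pvoc_stretch(x, r, N, L)
% Phase-vocoder time-scale modification (simplified sketch).
x = x(:);
w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);          % periodic Hann window
nIn = floor((length(x) - N)/L) + 1;          % analysis STFT with hop L
X = zeros(N, nIn);
for i = 0:nIn-1
    X(:, i+1) = fft(x(i*L+1 : i*L+N) .* w);
end
nOut  = floor((nIn - 2) * r);
phase = angle(X(:, 1));                      % running synthesis phase
y = zeros(nOut*L + N, 1);
for m = 0:nOut-1
    t    = m / r;                            % fractional analysis frame index
    i0   = floor(t) + 1;  frac = t - floor(t);
    mag  = (1-frac)*abs(X(:, i0)) + frac*abs(X(:, i0+1));   % |Y[k,m]| by interpolation
    dphi = angle(X(:, i0+1)) - angle(X(:, i0));             % phase increment between slices
    frame = real(ifft(mag .* exp(1j*phase))) .* w;          % resynthesize this slice
    y(m*L+1 : m*L+N) = y(m*L+1 : m*L+N) + frame;
    phase = phase + dphi;                    % advance phase by the measured increment
end
end
```

With r = 2 this roughly doubles the duration while preserving pitch, at the cost of some 'phasiness'.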
General issues in TSM
- Time window: stretching a narrowband spectrogram.
- Malleability of different sounds: vowels stretch well; stops lose their nature.
- Not a well-formed problem? We want to alter time without altering frequency, but time and frequency are not separate, and a 'satisfying' result is a subjective judgment ⇒ the solution depends on auditory perception.

Summary
- Information in sound: lots of it, at multiple levels of abstraction.
- Course overview: survey of audio processing topics; practicals, readings, project.
- DSP review: digital signals, time domain; Fourier domain, STFT.
- Timescale modification: properties of the speech signal; time-domain methods; phase vocoder.
EE E6820: Speech & Audio Processing & Recognition
Lecture 2: Acoustics
Dan Ellis & Mike Mandel
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
January 29, 2009

Outline
1 The wave equation
2 Acoustic tubes: reflections & resonance
3 Oscillations & musical acoustics
4 Spherical waves & room acoustics

Acoustics & sound
Acoustics is the study of physical waves.
(Acoustic) waves transmit energy without permanently displacing matter (e.g. ocean waves).
The same math recurs in
many domains.
Intuition: a pulse going down a rope.

The wave equation
Consider a small section of the rope, with displacement y(x), tension S, and mass per unit length ε.
The lateral force on a section of length dx is
F_y = S sin(φ_2) − S sin(φ_1) ≈ S (∂²y/∂x²) dx

Wave equation (2)
Newton's law F = ma gives
S (∂²y/∂x²) dx = ε dx (∂²y/∂t²)
Call c² = S/ε (tension over mass-per-length); hence the wave equation
c² ∂²y/∂x² = ∂²y/∂t²
a partial differential equation relating curvature and acceleration.

Solution to the wave equation
If y(x, t) = f(x − ct) for any f(·), then
∂y/∂t = −c f′(x − ct),  ∂²y/∂t² = c² f″(x − ct),
∂y/∂x = f′(x − ct),     ∂²y/∂x² = f″(x − ct),
so the wave equation is satisfied; the same works for y
(x, t) = f(x + ct).
Hence the general solution of c² ∂²y/∂x² = ∂²y/∂t² is
y(x, t) = y⁺(x − ct) + y⁻(x + ct)

Solution to the wave equation (2)
y⁺(x − ct) and y⁻(x + ct) are traveling waves:
- the shape stays constant but changes position (Δx = c·T after time T)
- c is the traveling-wave velocity (Δx/Δt)
- y⁺ moves right, y⁻ moves left
- the resultant y(x) is the sum of the two waves

Wave equation solutions (3)
What is the form of y⁺, y⁻?
- any doubly-differentiable function will satisfy the wave equation
Actual waveshapes are dictated by boundary conditions:
- e.g. y(x) at t = 0, plus constraints on y at particular values of x
- e.g. input motion y(0, t) = m(t), rigid termination y(L, t) = 0
Terminations and reflections
System constraints:
- initial: y(x, 0) = 0 (flat rope)
- input: y(0, t) = m(t) (at the agent's hand), which launches y⁺
- termination: y(L, t) = 0 (fixed end)
- wave equation: y(x, t) = y⁺(x − ct) + y⁻(x + ct)
At the termination, y(L, t) = 0 ⇒ y⁺(L − ct) = −y⁻(L + ct),
i.e. y⁺ and y⁻ are mirrored in time and amplitude around x = L
⇒ inverted reflection at the termination.  [simulation travel1.m]
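The inverted reflection can be visualized in a few lines. The sketch below is in the spirit of the travel1.m simulation mentioned above, not the original file: it evaluates y(x,t) = y⁺(x − ct) + y⁻(x + ct), with y⁻ chosen as the mirrored, inverted image required by y(L, t) = 0 (pulse shape and constants are made up).

```matlab
% d'Alembert picture of a pulse on a rope with a rigid termination at x = L:
% the reflected wave is the incoming pulse mirrored in time and amplitude
% about x = L, so the sum always vanishes at the end point.
c = 1;  L = 1;
pulse = @(u) exp(-200*(u - 0.2).^2);     % initial right-going pulse shape
x = linspace(0, L, 500);
for t = 0:0.01:1.5
    yplus  =  pulse(x - c*t);            % y+(x - ct), moving right
    yminus = -pulse(2*L - x - c*t);      % image source beyond x = L: inverted reflection
    plot(x, yplus + yminus);  ylim([-1 1]);
    title(sprintf('t = %.2f   (y(L,t) = 0 at all times)', t));
    drawnow;
end
```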
Acoustic tubes
Sound waves travel down acoustic tubes:
- 1-dimensional; very similar to strings
Common situations: wind instrument bores, the ear canal, the vocal tract.

Pressure and velocity
Consider the air particle displacement ξ(x, t).
Particle velocity v(x, t) = ∂ξ/∂t, hence volume velocity u(x, t) = A v(x, t) for tube cross-section A.
(Relative) air pressure p(x, t) = −(1/κ) ∂ξ/∂x.

Wave equation for a tube
Consider an elemental volume of area dA and thickness dx (volume dA·dx, mass ρ·dA·dx), with force p·dA on one face and (p + ∂p/∂x·dx)·dA on the other.
Newton's law F = ma gives
−(∂p/∂x) dx dA = ρ dA dx (∂v/∂t)  ⇒  ∂p/∂x = −ρ ∂v/∂t
∴ c² ∂²ξ/∂x² = ∂²ξ/∂t²,  with c = 1/√(ρκ)

Acoustic tube traveling waves
Traveling waves in particle displacement:
ξ(x, t) = ξ⁺(x − ct) + ξ⁻(x + ct)
Define u⁺(α) = −cA (∂/∂α) ξ⁺(α) and Z_0 = ρc/A.
Then the volume velocity is
u(x, t) = A ∂ξ/∂t = u⁺(x − ct) − u⁻(x + ct)
and the pressure is
p(x, t) = −(1/κ) ∂ξ/∂x = Z_0 [ u⁺(x − ct) + u⁻(x + ct) ]
i.e. the (scaled) sum and difference of the traveling waves.

Acoustic traveling waves (2)
The two traveling waves give different resultants for pressure and volume velocity:
u(x, t) = u⁺ − u⁻  (volume velocity),   p(x, t) = Z_0 [u⁺ + u⁻]  (pressure).

Terminations in tubes
What is the equivalent of the rope's fixed point for tubes?
- A solid wall forces u(x, t) = 0, hence u⁺ = u⁻.
- An open end forces p(x, t) = 0, hence u⁺ = −u⁻.
An open end is like the fixed point on the rope: it reflects the wave back inverted.
Unlike the fixed point, a solid wall reflects the traveling wave without inversion.
Standing waves
Consider a (complex) sinusoidal input u_0(t) = U_0 e^{jωt}.
Pressure and volume velocity must then have the form K e^{j(ωt+φ)}, so the traveling waves are
u⁺(x − ct) = |A| e^{j(−kx + ωt + φ_A)},   u⁻(x + ct) = |B| e^{j(kx + ωt + φ_B)}
where k = ω/c is the spatial frequency (rad/m) and the wavelength is λ = c/f = 2πc/ω.
The pressure and volume-velocity resultants show a stationary pattern: standing waves,
even when |A| ≠ |B|.  [simulation sintwavemov.m]

Standing waves (2)
At kx = π (x = λ/2) the pressure is zero (a node) and the volume velocity is maximal (an antinode).
For a lossless termination (|u⁺| = |u⁻|) there are true nodes and antinodes.
Pressure and volume velocity are phase shifted, in space and in time.

Transfer function
Consider a tube excited by u_0(t) = U_0 e^{jωt}:
- sinusoidal traveling waves must satisfy the termination 'boundary conditions'
- these are satisfied by complex constants A and B in
  u(x, t) = u⁺(x − ct) + u⁻(x + ct) = A e^{j(−kx+ωt)} + B e^{j(kx+ωt)} = e^{jωt} (A e^{−jkx} + B e^{jkx})
- the standing wave pattern scales with the input magnitude
- the point of excitation makes a big difference ...

Transfer function (2)
For an open-ended tube of length L excited at x = 0 by U_0 e^{jωt}:
u(x, t) = U_0 e^{jωt} cos k(L − x) / cos kL,   k = ω/c
(matching the input at x = 0, maximal at x = L), i.e. a standing-wave pattern.
For a given ω, varying L changes the magnitude of U_L.

Transfer function (3)
Varying ω for a given L, i.e. looking at x = L:
U_L / U_0 = u(L, t) / u(0, t) = 1 / cos kL = 1 / cos(ωL/c)
which goes to infinity at ωL/c =
(2r + 1)π/2, r = 0, 1, 2, ...
The output volume velocity is always larger than the input, and it is unbounded for
L = (2r + 1) πc/(2ω) = (2r + 1) λ/4
i.e. resonance (the amplitude grows without bound).

Resonant modes
For a lossless tube with L = m λ/4, m odd (λ the wavelength), u(L)/u(0) is unbounded, meaning:
- the transfer function has a pole on the frequency axis
- energy at that frequency sustains indefinitely
L = λ_0/4 gives the lowest resonance ω_0; L = 3·λ_1/4 gives ω_1 = 3ω_0; compare to the time domain ...
e.g. a 17.5 cm vocal tract with c = 350 m/s ⇒ ω_0 = 2π · 500 Hz (then 1500, 2500, ... Hz).
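The 1/cos(ωL/c) gain can be plotted directly; a sketch assuming the 17.5 cm tube and c = 350 m/s quoted above:

```matlab
% Volume-velocity gain of an open-ended lossless tube driven at x = 0:
% |U_L/U_0| = 1/|cos(omega L / c)|, which blows up at f = (2r+1) c/(4L),
% i.e. 500, 1500, 2500, ... Hz for L = 0.175 m and c = 350 m/s.
L = 0.175;  c = 350;
f = 0:1:4000;
gain = 1 ./ abs(cos(2*pi*f*L/c));
semilogy(f, gain);
xlabel('frequency / Hz'); ylabel('|U_L / U_0|');
```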
Scattering junctions
At an abrupt change in area between sections k and k+1:
- pressure must be continuous: p_k(x, t) = p_{k+1}(x, t)
- volume velocity must be continuous: u_k(x, t) = u_{k+1}(x, t)
- but the traveling waves u⁺_k, u⁻_k, u⁺_{k+1}, u⁻_{k+1} will all be different
Solving, e.g., for u⁻_k and u⁺_{k+1} gives the scattering junction
u⁺_{k+1} = [2r/(1+r)] u⁺_k + [(r−1)/(r+1)] u⁻_{k+1},   u⁻_k = [(1−r)/(1+r)] u⁺_k + [2/(r+1)] u⁻_{k+1}
where r = A_{k+1}/A_k is the "area ratio".

Concatenated tube model
The vocal tract acts as a waveguide from the glottis (u_0(t), x = 0) to the lips (u_L(t), x = L).
Discrete approximation: a sequence of tube sections of varying diameter, (A_k, L_k), (A_{k+1}, L_{k+1}), ...

Concatenated tube resonances
Concatenated tubes → scattering junctions → a lattice filter of delays e^{−jωτ_k} and junction coefficients.
One can solve for the transfer function, which is all-pole.
Approximate vowel synthesis from resonances [sound example: ah ee oo].
Oscillations & musical acoustics
Pitch (periodicity) is the essence of music: why? and why music?
Different kinds of oscillators:
- simple harmonic motion (tuning fork)
- relaxation oscillator (voice)
- string traveling wave (plucked/struck/bowed)
- air column (with a nonlinear energy element)

Simple harmonic motion
Basic mechanical oscillation:
ẍ = −ω²x  ⇒  x = A cos(ωt + φ)
e.g. a spring (stiffness k) + mass m (+ damper ζ), with ω² = k/m, as in a tuning fork.
Not great for music:
- fundamental (cos ωt) only
- relatively low energy

Relaxation oscillator
A multi-state process:
- one state builds up potential (e.g.
pressure)
- switch to a second (release) state
- revert to the first state, etc.
e.g. the vocal folds: http://www.youtube.com/watch?v=ajbcJiYhFKY
The oscillation period depends on the force (tension):
- easy to change
- hard to keep stable
⇒ less used in music.

Ringing string
e.g. our original 'rope' example: tension S, mass/length ε,
ω² = π² S / (L² ε)
Many musical instruments: guitar (plucked), piano (struck), violin (bowed).
Control of period (pitch):
- change length (fretting)
- change tension (tuning a piano)
- change mass (piano strings)
Influence of excitation ... [pluck1a.m]

Wind tube
A resonant tube + energy input: a nonlinear element drives an acoustic waveguide, with scattering junctions (toneholes);
ω = πc/(2L)  (quarter wavelength)
e.g. the clarinet:
- lip pressure keeps the reed closed
- the reflected
pressure wave opens the reed
- the reinforced pressure wave passes through
Finger holes determine the point of first reflection ⇒ the effective waveguide length.

Room acoustics
Sound in free air expands spherically (radius r).
Spherical wave equation:
∂²p/∂r² + (2/r) ∂p/∂r = (1/c²) ∂²p/∂t²
solved by p(r, t) = (P_0 / r) e^{j(ωt − kr)};
energy ∝ p² falls as 1/r².

Effect of rooms (1): images
Ideal reflections are like multiple sources: virtual (image) sources reached by reflected paths from source to listener.
'Early echoes' in the room impulse response:
the direct path arrives first, then the early echoes in h_room(t)
(actual reflections may be h_r(t), not pure impulses δ(t)).

Effect of rooms (2): modes
Regularly-spaced echoes behave like acoustic tubes.
Real rooms have lots of modes:
- dense, sustained echoes in the impulse response
- a complex pattern of peaks in the frequency response

Reverberation
Exponential decay of reflections: h_room(t) ~ e^{−t/T}.
Frequency-dependent: greater absorption at high frequencies ⇒ faster decay.
Size-dependent: larger rooms → longer delays → slower decay.
Sabine's equation:
RT_60 = 0.049 V / (S ᾱ)
the time constant varies with size (volume V, surface area S) and average absorption ᾱ.
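Sabine's formula is easy to evaluate; the room dimensions and absorption below are made-up numbers for illustration (the 0.049 constant corresponds to V and S in feet-based units; the metric form uses 0.161).

```matlab
% Sabine reverberation time for a (made-up) shoebox room.
dims  = [30 20 10];                 % room dimensions in feet (illustrative)
V     = prod(dims);                 % volume, ft^3
S     = 2*(dims(1)*dims(2) + dims(2)*dims(3) + dims(1)*dims(3));   % surface area, ft^2
alpha = 0.25;                       % average absorption coefficient (illustrative)
RT60  = 0.049 * V / (S * alpha)     % reverberation time in seconds
```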
Summary
- Traveling waves
- Acoustic tubes & resonances
- Musical acoustics & periodicity
- Room acoustics & reverberation
Parting thought: musical bottles.

EE E6820: Speech & Audio Processing & Recognition
Lecture 3: Machine learning, classification, and generative models
Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
February 7, 2008

Outline
1 Classification
2 Generative models
3 Gaussian models
4 Hidden Markov models

Classification and generative models
[spectrogram with 'ay' and 'ao' tokens; scatter plot of F1/F2 formant values]
Classification:
- discriminative models
- a discrete, categorical random variable of interest
- a fixed set of categories
Generative models:
- descriptive models
- continuous or discrete random variable(s) of interest
- can estimate parameters
- Bayes' rule makes them useful for classification

Building a classifier
Define classes/attributes:
- could state explicit rules
- better to define them through 'training' examples
Define the feature space.
Define the decision algorithm:
- set its parameters from examples
Measure performance:
- a calculated (weighted) error rate
[Pols vowel formants: "u" (x) vs. "o" (o), F1-F2 scatter plot]
Classification system parts
sensor → signal → pre-processing/segmentation (e.g. STFT, locate vowels) → segment → feature extraction (e.g. formants) → feature vector → classification → class → post-processing (context constraints, costs/risk)

Feature extraction
The right features are critical:
- waveform vs. formants vs. cepstra
- invariance under irrelevant modifications
Theoretically equivalent features may act very differently in a particular classifier:
- representations make important aspects explicit
- and remove irrelevant information
Feature design incorporates 'domain knowledge':
- although more data ⇒ less need for 'cleverness'
A smaller 'feature space' (fewer dimensions) → simpler models (fewer parameters) → less training data needed → faster training.  [inverting MFCCs]
Optimal classification
Minimize the probability of error with the Bayes optimal decision
θ̂ = argmax_{θ_i} p(θ_i | x)
p(error) = ∫ p(error | x) p(x) dx = Σ_i ∫_{Λ_i} (1 − p(θ_i | x)) p(x) dx
where Λ_i is the region of x where θ_i is chosen; but p(θ_i | x) is largest in that region, so p(error) is minimized.

Sources of error
- Suboptimal threshold / regions (bias error): use a Bayes classifier.
- Incorrect distributions (model error): better distribution models / more training data.
- Misleading features ('Bayes error'): irreducible for a given feature set, regardless of the classification scheme.

Two roads to classification
The optimal
classifier is θ̂ = argmax_{θ_i} p(θ_i | x), but we don't know p(θ_i | x).
We can model that distribution directly:
- e.g. nearest neighbor, SVM, AdaBoost, neural nets
- maps from inputs x to outputs θ_i
- a discriminative model
Often it is easier to model the data likelihood p(x | θ_i):
- use Bayes' rule to convert to p(θ_i | x)
- a generative (descriptive) model

Nearest neighbor classification
[Pols vowel formants "u" (x), "o" (o), "a" (+), with a new observation x marked]
Find the closest match (nearest neighbor).
- A naive implementation takes O(N) time for N training points.
- As N → ∞, the error rate approaches twice the Bayes error rate.
- With K summarized classes, it takes O(K) time.
- Locality sensitive hashing gives approximate nearest neighbors in O(d n^{1/c²}) time (Andoni and Indyk, 2006).
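A one-nearest-neighbor classifier is only a few lines; this sketch uses made-up 2-D 'formant' points rather than the Pols data in the figure:

```matlab
% Naive 1-nearest-neighbour classification: O(N) distance computations
% per query, as noted above.
train  = [300 1000; 320 950; 500 850; 520 800];   % F1, F2 training points (made up)
labels = {'u', 'u', 'o', 'o'};
query  = [480 870];                               % new observation x

d2 = sum((train - query).^2, 2);                  % squared Euclidean distances
[~, idx] = min(d2);
predicted = labels{idx}                           % label of the closest training point
```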
Support vector machines
A 'large margin' linear classifier for separable data, with decision boundary (w·x) + b = 0 and margins (w·x) + b = ±1:
- regularization of the margin avoids over-fitting
- can be adapted to non-separable data (the C parameter)
- made nonlinear using kernels k(x_1, x_2) = Φ(x_1)·Φ(x_2)
It depends only on the training points near the decision boundary, the support vectors,
and has a unique, optimal solution for a given Φ and C.

Generative models
Describe the data using structured probabilistic models.
Observations are random variables whose distribution
depends on model parameters.
Source distributions p(x | θ_i):
- reflect variability in the features
- reflect noise in the observation
- generally have to be estimated from data (rather than known in advance)

Generative models (2)
Three things to do with generative models:
- Evaluate the probability of an observation, possibly under multiple parameter settings: p(x), p(x | θ_1), p(x | θ_2), ...
- Estimate model parameters from observed data: θ̂ = argmin_θ C(θ*, θ | x)
- Run the model forward to generate new data: x̃ ∼ p(x | θ̂)

Random variables review
Random variables have joint distributions, p(x, y).
Knowing one value in a joint distribution constrains the remainder.
Conditional distribution of x given y:
p(x | y) ≡ p(x, y) / p(y) = p(x, y) / ∫ p(x, y) dx
p(y ) p(x, y ) dy  Michael Mandel (E6820 SAPR)  Machine learning  σxy  µy σy σx  p(y)  Marginal distribution of y Z p(y ) = p(x, y ) dx  µx  x  p(x)  p(x,y) y Y  x p(x y = Y )  |  February 7, 2008  15 / 43   Source: http://www.doksinet  Bayes’ rule  p(x | y )p(y ) = p(x, y ) = p(y | x)p(x) ∴  p(y | x) =  p(x | y )p(y ) p(x)  ⇒ can reverse conditioning given priors/marginals terms can be discrete or continuous generalizes to more variables p(x, y , z) = p(x | y , z)p(y , z) = p(x | y , z)p(y | z)p(z) allows conversion between joint, marginals, conditionals  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  16 / 43   Source: http://www.doksinet  Bayes’ rule for generative models  Run generative models backwards to compare them Likelihood R p(x | θ) · p(θ) p(θ | x) = p(x | θ)p(θ) Posterior prob Evidence = p(x) Prior prob  Posterior is the classification we’re looking for combination of prior belief in each class with likelihood under our model R I
- normalized by the evidence (so the posteriors sum to 1)
Objection: priors are often unknown ... but omitting them amounts to assuming they are all equal.

Computing probabilities and estimating parameters
We want the probability of the observation under a model, p(x), regardless of the parameter settings.
Full Bayesian integral:
p(x) = ∫ p(x | θ) p(θ) dθ
This is difficult to compute in general, so approximate it as p(x | θ̂):
- Maximum likelihood (ML): θ̂ = argmax_θ p(x | θ)
- Maximum a posteriori (MAP), i.e. ML + prior: θ̂ = argmax_θ p(θ | x) = argmax_θ p(x | θ) p(θ)

Model checking
After estimating parameters, run the model forward.
Check that:
- the model is rich enough to capture the variation in the data
- the parameters are estimated correctly
- there aren't any bugs in your code
Generate data from the model
and compare it to the observations, x̃ ∼ p(x | θ):
- are they similar under some statistics T(x): R^d → R?
- can you find the real data set in a group of synthetic data sets?
Then go back and update your model accordingly.  Gelman et al. (2003, ch. 6)

Gaussian models
The easiest way to model distributions is via a parametric model:
- assume a known form, estimate a few parameters
The Gaussian model is simple and useful. In 1D:
p(x | θ_i) = 1/(σ_i √(2π)) · exp[ −(1/2) ((x − µ_i)/σ_i)² ]
Parameters: the mean µ_i and variance σ_i², fit to the data.
Gaussians in d dimensions
p(x | θ_i) = 1/((2π)^{d/2} |Σ_i|^{1/2}) · exp[ −(1/2) (x − µ_i)^T Σ_i^{−1} (x − µ_i) ]
described by a d-dimensional mean µ_i and a d × d covariance matrix Σ_i.
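Evaluating this density is a direct transcription of the formula; a sketch (names and Cholesky-based evaluation are mine) that works in log-probabilities for numerical safety:

```matlab
function lp = gauss_logpdf(X, mu, Sigma)
% Log of the d-dimensional Gaussian density for each row of X:
% log p(x) = -0.5 (x-mu)' inv(Sigma) (x-mu) - 0.5 log|Sigma| - (d/2) log(2 pi)
[n, d] = size(X);
Xc = X - mu(:)';                       % center the data (mu as a row vector)
R  = chol(Sigma);                      % Sigma = R' R, R upper triangular
z  = Xc / R;                           % each row z satisfies z*z' = xc*inv(Sigma)*xc'
lp = -0.5*sum(z.^2, 2) - sum(log(diag(R))) - 0.5*d*log(2*pi);
end
```

For example, gauss_logpdf(randn(5,2), [0 0], eye(2)) returns the log-density of five 2-D points under a standard Gaussian.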
Gaussian mixture models
Single Gaussians cannot model:
- distributions with multiple modes
- distributions with nonlinear correlations
What about a weighted sum?
p(x) ≈ Σ_k c_k p(x | θ_k)
where {c_k} is a set of weights and {p(x | θ_k)} is a set of Gaussian components:
- can fit anything given enough components
Interpretation: each observation is generated by one of the Gaussians, chosen with probability c_k = p(θ_k).

Gaussian mixtures (2)
e.g. a nonlinear correlation: the original data are covered by a set of Gaussian components whose weighted sum gives the resulting surface.
Problem: finding the c_k and θ_k parameters; this would be easy if we knew which θ_k generated each x.

Expectation-maximization (EM)
A general procedure for estimating model parameters when some data are unknown, e.g. which GMM component generated each point.
Iteratively update the model parameters θ to maximize Q, the expected log-probability of the observed data x and hidden data z:
Q(θ, θ_t) = ∫_z p(z | x, θ_t) log p(z, x | θ)
- E step: calculate p(z | x, θ_t) using θ_t
- M step: find the θ that maximizes Q using p(z | x, θ_t)
- one can prove p(x | θ) is non-decreasing, hence a maximum-likelihood model
- it finds a local optimum that depends on the initialization

Fitting GMMs with EM
We want to find:
- the parameters of the Gaussians, θ_k = {µ_k, Σ_k}
- the weights/priors on the Gaussians, c_k = p(θ_k)
that
maximize the likelihood of the training data x.
If we could assign each x to a particular θ_k, estimation would be direct.
Hence treat the mixture indices z as hidden:
- form Q = E[log p(x, z | θ)]
- differentiate with respect to the model parameters → equations for µ_k, Σ_k, c_k that maximize Q

GMM EM update equations
The parameters that maximize Q, with ν_nk ≡ p(z_k | x_n, θ_t):
µ_k = Σ_n ν_nk x_n / Σ_n ν_nk
Σ_k = Σ_n ν_nk (x_n − µ_k)(x_n − µ_k)^T / Σ_n ν_nk
c_k = (1/N) Σ_n ν_nk
Each involves ν_nk, the 'fuzzy membership' of x_n in Gaussian k;
each updated parameter is just a sample average, weighted by the fuzzy membership.
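The updates map onto a few matrix operations. Below is a minimal sketch for full-covariance components on row-vector data (initialization, the small covariance floor, and the fixed iteration count are my choices; a real implementation would add convergence checks and restarts):

```matlab
function [mu, Sigma, c] = gmm_em(X, K, nIter)
% EM for a K-component Gaussian mixture on the rows of X.
% E step: nu(n,k) = p(z_k | x_n); M step: fuzzy-membership-weighted averages.
[N, d] = size(X);
mu    = X(randperm(N, K), :);                 % initialize means at random data points
Sigma = repmat(eye(d)*var(X(:)), [1 1 K]);    % broad initial covariances
c     = ones(1, K) / K;                       % uniform initial priors
for it = 1:nIter
    % E step: responsibilities proportional to c_k * N(x_n; mu_k, Sigma_k)
    logp = zeros(N, K);
    for k = 1:K
        Xc = X - mu(k, :);
        R  = chol(Sigma(:,:,k));
        z  = Xc / R;
        logp(:,k) = log(c(k)) - 0.5*sum(z.^2, 2) - sum(log(diag(R))) - 0.5*d*log(2*pi);
    end
    nu = exp(logp - max(logp, [], 2));
    nu = nu ./ sum(nu, 2);                    % normalize rows: nu(n,k) = p(z_k | x_n)
    % M step: weighted means, covariances, and priors
    Nk = sum(nu, 1);
    c  = Nk / N;
    for k = 1:K
        mu(k, :) = (nu(:,k)' * X) / Nk(k);
        Xc = X - mu(k, :);
        Sigma(:,:,k) = (Xc' * (Xc .* nu(:,k))) / Nk(k) + 1e-6*eye(d);  % small floor
    end
end
end
```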
GMM examples
Vowel data fit with different mixture counts:
1 Gaussian, log p(x) = −1911; 2 Gaussians, log p(x) = −1864; 3 Gaussians, log p(x) = −1849; 4 Gaussians, log p(x) = −1840.
[contour plots of the fitted mixtures over the formant data]

Markov models
A (first-order) Markov model is a finite-state system whose behavior depends only on the current state:
"the future is independent of the past, conditioned on the present."
e.g. a generative Markov model over states {S, A, B, C, E}
[state diagram and transition matrix p(q_{n+1} | q_n): S moves to A with probability 1; A, B, C have self-loops of 0.8, 0.8, 0.7 and 0.1 transitions to adjacent states; C → E = 0.1; E is absorbing]
A sample state sequence: SAAAAAAAABBBBBBBBBCCCCBBBBBBCE
Hidden Markov models
A Markov model where the state sequence Q = {q_n} is not directly observable ('hidden').
But the observations X do depend on Q:
- x_n is a random variable that depends only on the current state, p(x_n | q_n)
[figure: a state sequence over {A, B, C}, the emission distributions p(x | q), and the resulting observation sequence]
We can still tell something about the state sequence ...

(Generative) Markov models
An HMM is specified by parameters Θ:
- states q^i
- transition probabilities a_ij ≡ p(q^j_n | q^i_{n−1})
- emission distributions b_i(x) ≡ p(x | q^i)
- (+ initial state probabilities π_i ≡ p(q^i_1))
[example: a left-to-right phone model '• k a t •' with its transition matrix and emission densities]
Markov models for sequence recognition
Independence of observations: observation x_n depends only on the current state q_n, so
p(X | Q) = p(x_1, x_2, ..., x_N | q_1, q_2, ..., q_N)
         = p(x_1 | q_1) p(x_2 | q_2) ··· p(x_N | q_N)
         = Π_{n=1}^{N} p(x_n | q_n) = Π_{n=1}^{N} b_{q_n}(x_n)
Markov transitions: the transition to the next state q_{n+1} depends only on q_n, so
p(Q | M) = p(q_1, q_2, ... | M)
         = p(q_N | q_{N−1} ... q_1) p(q_{N−1} | q_{N−2} ... q_1) ··· p(q_2 | q_1) p(q_1)
         = p(q_N | q_{N−1}) p(q_{N−1} | q_{N−2}) ··· p(q_2 | q_1) p(q_1)
         = p(q_1) Π_{n=2}^{N} p(q_n | q_{n−1}) = π_{q_1} Π_{n=2}^{N} a_{q_{n−1} q_n}
Θ̂ = argmaxΘj p(Θj | X ) I  Using Bayes’ rule to convert from p(X | Θj )  Sum over all Q??? Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  34 / 43   Source: http://www.doksinet  Summing over all paths Model M1 A  0.2  B  0.2  E  0.1  0.1  S • • • •  A B E 0.9 01 • 0.7 02 01 • 0.8 02 • • 1  S A B E  1  E States  0.9  S  xn Observations x1, x2, x3 p(x|B) p(x|A)  0.8  0.7  B  3  n  x1  x2  x3  A 2.5 02 01 q{ B 0.1 22 23  Paths 0.8 08 02 0.1 0.1 02 02  A S  2  Observation likelihoods p(x|q)  0.7  0.7  0.9  0 1 2 3 4 time n All possible 3-emission paths Qk from S to E q0 q1 q2 q3 q4  p(Q | M) = Πn p(qn|qn-1)  p(X | Q,M) = Πn p(xn|qn) p(X,Q | M)  S S S S  .9 x 7 x 7 x 1 = 00441 .9 x 7 x 2 x 2 = 00252 .9 x 2 x 8 x 2 = 00288 .1 x 8 x 8 x 2 = 00128 Σ = 0.1109  0.0022 2.5 x 02 x 01 = 005 2.5 x 02 x 23 = 115 0.0290 2.5 x 22 x 23 = 1265 03643 0.1 x 22 x 23 = 0506 00065 Σ = p(X | M) = 0.4020  A A A B  A A B B  A B B B  E E E E  Michael Mandel (E6820
The 'forward recursion'
A dynamic-programming-like technique to sum over all Q.
Define α_n(i) as the probability of getting to state q^i at time step n (by any path):
α_n(i) = p(x_1, x_2, ..., x_n, q_n = q^i) ≡ p(X_1^n, q_n^i)
α_{n+1}(j) can be calculated recursively from all possible predecessor states:
α_{n+1}(j) = [ Σ_{i=1}^{S} α_n(i) a_ij ] · b_j(x_{n+1})

Forward recursion (2)
Initialize with α_1(i) = π_i b_i(x_1).
Then the total probability is p(X_1^N | Θ) = Σ_{i=1}^{S} α_N(i).
This is a practical way to solve for p(X | Θ_j) and hence select the most probable model (recognition): compute p(X | M_1)·p(M_1), p(X | M_2)·p(M_2), ... for the observations X and choose the best.
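In matrix form the recursion is two lines; this sketch reuses the two emitting states A and B from the path-sum example, folding the non-emitting S and E states into an initial distribution and an end-transition vector (that packaging is mine, not from the slides):

```matlab
% Forward algorithm: alpha_{n+1}(j) = [sum_i alpha_n(i) a(i,j)] * b_j(x_{n+1}).
pi_   = [0.9  0.1];                 % p(q_1): transitions out of the start state S
A     = [0.7 0.2;  0.0 0.8];        % a(i,j) between emitting states {A, B}
toEnd = [0.1; 0.2];                 % transitions into the final state E
B     = [2.5 0.2 0.1;               % b_A(x_n) for n = 1..3
         0.1 2.2 2.3];              % b_B(x_n)
N     = size(B, 2);

alpha = pi_ .* B(:,1)';             % alpha_1(i) = pi_i * b_i(x_1)
for n = 2:N
    alpha = (alpha * A) .* B(:,n)'; % sum over all predecessor states i
end
pX = alpha * toEnd                  % total p(X | M) = 0.4020, matching the table
```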
Optimal path
We may be interested in the actual q_n assignments:
- which state was 'active' at each time frame, e.g. phone labeling (for training?)
The total probability is over all paths, but we can also solve for the single best path, the "Viterbi" state sequence.
The probability along the best path to state q^j_{n+1} is
α̂_{n+1}(j) = [ max_i { α̂_n(i) a_ij } ] · b_j(x_{n+1})
- backtrack from the final state to get the best path
- the final probability is a product only (no sum)
- the log-domain calculation is just a summation
The best path often dominates the total probability:
p(X | Θ) ≈ p(X, Q̂ | Θ)
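The Viterbi recursion replaces the forward sum with a max and keeps back-pointers; a sketch in the same two-state setting as above, working in the log domain (the tiny value standing in for the zero B→A transition is an implementation convenience of mine):

```matlab
% Viterbi: alpha_hat_{n+1}(j) = [max_i alpha_hat_n(i) a(i,j)] * b_j(x_{n+1}),
% computed in the log domain so the products become sums.
logPi  = log([0.9 0.1]);
logA   = log([0.7 0.2; 1e-12 0.8]);         % 1e-12 stands in for the zero B->A entry
logEnd = log([0.1; 0.2]);
logB   = log([2.5 0.2 0.1; 0.1 2.2 2.3]);
[S, N] = size(logB);

delta = logPi' + logB(:,1);                 % best log-score ending in each state
psi   = zeros(S, N);                        % back-pointers
for n = 2:N
    [best, prev] = max(delta + logA, [], 1);     % best predecessor for each state j
    delta = best(:) + logB(:,n);
    psi(:,n) = prev(:);
end
[logPbest, q] = max(delta + logEnd);        % fold in the transition to the end state
path = zeros(1, N);  path(N) = q;
for n = N:-1:2
    path(n-1) = psi(path(n), n);            % backtrack
end
states = 'AB';  disp(states(path))          % prints ABB: exp(logPbest) = 0.3643,
                                            % the dominant term in p(X|M) = 0.4020
```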
Interpreting the Viterbi path
The Viterbi path assigns each x_n to a state q^i:
- performing classification based on b_i(x)
- while at the same time applying the transition constraints a_ij
[figure: inferred classification of the observation sequence; Viterbi labels AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]
It can be used for segmentation:
- train an HMM with 'garbage' and 'target' states
- decode on new data to find 'targets' and their boundaries
It can also be used for (heuristic) training:
- e.g. forced alignment to bootstrap a speech recognizer
- e.g. train classifiers based on the labels

Aside: training and test data
A rich model can learn every training example (overtraining).
But the goal is to classify new, unseen data:
- sometimes a 'cross validation' set is used to decide when to stop training
For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly ...)

Aside (2): model complexity
More training data allows the use of larger models.
More model parameters create a better fit to the training data:
- more Gaussian mixture components
- more HMM states
For a fixed training set size, there will be some optimal model size that avoids
overtraining.

Summary
- Classification is making discrete (hard) decisions.
- The basis is comparison with known examples, explicitly or via a model.
- Classification models: discriminative models (SVMs, neural nets, boosters) directly learn the posteriors p(θ_i | x); generative models (Gaussians, GMMs, HMMs) model the likelihoods p(x | θ), and Bayes' rule lets us use them for classification.
- EM allows parameter estimation even with some data missing.
Parting thought: is it wise to use generative models for discrimination, or vice versa?

References
Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. IEEE Symposium on Foundations of Computer Science, pages 459–468, Washington, DC, USA, 2006. IEEE Computer Society.
Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, second edition, July 2003. ISBN 158488388X.
Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
Jeff A. Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, 1997.
Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, June 1998.
Classification Generative models Gaussian models Hidden Markov models  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  1 / 43   Source: http://www.doksinet  Outline  1  Classification  2  Generative models  3  Gaussian models  4  Hidden Markov models  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  2 / 43   Source: http://www.doksinet  Classification and generative models F2/Hz  f/Hz  4000  2000  3000  ay ao  x  2000 1000  1000  0 0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  x  1.0 time/s 0 0  1000  2000 F1/Hz  Classification discriminative models discrete, categorical random variable of interest I fixed set of categories I I  Generative models descriptive models continuous or discrete random variable(s) of interest I can estimate parameters I Bayes’ rule makes them useful for classification I I  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  3 / 43   Source: http://www.doksinet  Building a classifier Define classes/attributes I I  could
state explicit rules better to define through ‘training’ examples  Define feature space Define decision algorithm I  set parameters from examples  Measure performance I  calculated (weighted) error rate Pols vowel formants: "u" (x) vs. "o" (o) 1100  F2 / Hz  1000  900  800  700  600 250  Michael Mandel (E6820 SAPR)  300  350  400  450 F1 / Hz  Machine learning  500  550  600  February 7, 2008  4 / 43   Source: http://www.doksinet  Classification system parts Sensor signal Pre-processing/ segmentation  • STFT • Locate vowels  segment Feature extraction  • Formant extraction  feature vector Classification class Post-processing  Michael Mandel (E6820 SAPR)  • Context constraints • Costs/risk  Machine learning  February 7, 2008  5 / 43   Source: http://www.doksinet  Feature extraction Right features are critical I I  waveform vs formants vs cepstra invariance under irrelevant modifications  Theoretically equivalent features may act very differently in a
particular classifier I I  representations make important aspects explicit remove irrelevant information  Feature design incorporates ‘domain knowledge’ I  although more data ⇒ less need for ‘cleverness’  Smaller ‘feature space’ (fewer dimensions)  simpler models (fewer parameters)  less training data needed  faster training  [inverting MFCCs] Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  6 / 43   Source: http://www.doksinet  Optimal classification Minimize probability of error with Bayes optimal decision θ̂ = argmax p(θi | x) θ Z i p(error) = p(error | x)p(x) dx XZ = (1 − p(θi | x))p(x) dx i  Λi  I where Λi is the region of x where θi is chosen .   but p(θi | x) is largest in that region I so p(error) is minimized  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  7 / 43   Source: http://www.doksinet  Sources of error  p(x|ω1)·Pr(ω1)  Xω^ 1  Xω^ 2 p(x|ω2)·Pr(ω2)  Pr(ω2|Xω^ 1) = Pr(err|Xω^ 1)  x Pr(ω1|Xω^ 2) =
Pr(err|Xω^ 2)  Suboptimal threshold / regions (bias error) I  use a Bayes classifier  Incorrect distributions (model error) I  better distribution models / more training data  Misleading features (‘Bayes error’) I I  irreducible for given feature set regardless of classification scheme  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  8 / 43   Source: http://www.doksinet  Two roads to classification Optimal classifier is θ̂ = argmax p(θi | x) θi  but we don’t know p(θi | x) Can model distribution directly e.g Nearest neighbor, SVM, AdaBoost, neural net I maps from inputs x to outputs θi I a discriminative model  Often easier to model data likelihood p(x | θi ) I I  use Bayes’ rule to convert to p(θi | x) a generative (descriptive) model  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  9 / 43   Source: http://www.doksinet  Nearest neighbor classification Pols vowel formants: "u" (x), "o" (o), "a" (+) 1800
1600 1400  + F2 / Hz  1200 1000  o x  800  *  600 0  200  400  new observation x 600  800 1000 F1 / Hz  1200  1400  1600  1800  Find closest match (Nearest Neighbor) Naı̈ve implementation takes O(N) time for N training points As N  ∞, error rate approaches twice the Bayes error rate With K summarized classes, takes O(K ) time Locality sensitive hashing gives approximate nearest neighbors 2 in O(dn1/c ) time (Andoni and Indyk, 2006) Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  10 / 43   Source: http://www.doksinet  Support vector machines (w . x) + b = + 1  (w .x) + b = –1  “Large margin” linear classifier for separable data regularization of margin avoids over-fitting I can be adapted to non-separable data (C parameter) I made nonlinear using kernels k(x1 , x2 ) = Φ(x1 ) · Φ(x2 )  x1  x2  yi = +1  I  yi = – 1  w  (w .x) + b = 0  Depends only on training points near the decision boundary, the support vectors Unique, optimal solution for given Φ and C 
Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  11 / 43   Source: http://www.doksinet  Outline  1  Classification  2  Generative models  3  Gaussian models  4  Hidden Markov models  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  12 / 43   Source: http://www.doksinet  Generative models Describe the data using structured probabilistic models Observations are random variables whose distribution depends on model parameters Source distributions p(x | θi ) reflect variability in features reflect noise in observation I generally have to be estimated from data (rather than known in advance) I I  p(x|ωi) ω1 ω2 ω3 ω4 x  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  13 / 43   Source: http://www.doksinet  Generative models (2) Three things to do with generative models Evaluate the probability of an observation, possibly under multiple parameter settings p(x),  p(x | θ1 ),  p(x | θ2 ),  .  Estimate model parameters from observed data
θ̂ = argmin C (θ∗ , θ | x) θ  Run the model forward to generate new data x̃ ∼ p(x | θ̂)  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  14 / 43   Source: http://www.doksinet  Random variables review Random variables have joint distributions, p(x, y )  p(x,y) y  Knowing one value in a joint distribution constrains the remainder Conditional distribution of x given y p(x | y ) ≡  p(x, y ) p(x, y ) =R p(y ) p(x, y ) dy  Michael Mandel (E6820 SAPR)  Machine learning  σxy  µy σy σx  p(y)  Marginal distribution of y Z p(y ) = p(x, y ) dx  µx  x  p(x)  p(x,y) y Y  x p(x y = Y )  |  February 7, 2008  15 / 43   Source: http://www.doksinet  Bayes’ rule  p(x | y )p(y ) = p(x, y ) = p(y | x)p(x) ∴  p(y | x) =  p(x | y )p(y ) p(x)  ⇒ can reverse conditioning given priors/marginals terms can be discrete or continuous generalizes to more variables p(x, y , z) = p(x | y , z)p(y , z) = p(x | y , z)p(y | z)p(z) allows conversion between joint, marginals,
conditionals  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  16 / 43   Source: http://www.doksinet  Bayes’ rule for generative models  Run generative models backwards to compare them Likelihood R p(x | θ) · p(θ) p(θ | x) = p(x | θ)p(θ) Posterior prob Evidence = p(x) Prior prob  Posterior is the classification we’re looking for combination of prior belief in each class with likelihood under our model R I normalized by evidence (so posteriors = 1)  I  I  Objection: priors are often unknown .   but omitting them amounts to assuming they are all equal  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  17 / 43   Source: http://www.doksinet  Computing probabilities and estimating parameters Want probability of the observation under a model, p(x) I  regardless of parameter settings  Full Bayesian integral Z p(x) =  p(x | θ)p(θ) dθ  Difficult to compute in general, approximate as p(x | θ̂) I  Maximum likelihood (ML) θ̂ = argmax p(x | θ) θ  I 
Maximum a posteriori (MAP): ML + prior θ̂ = argmax p(θ | x) = argmax p(x | θ)p(θ) θ  Michael Mandel (E6820 SAPR)  θ  Machine learning  February 7, 2008  18 / 43   Source: http://www.doksinet  Model checking  After estimating parameters, run the model forward Check that model is rich enough to capture variation in data parameters are estimated correctly I there aren’t any bugs in your code I I  Generate data from the model and compare it to observations x̃ ∼ p(x | θ) are they similar under some statistics T (x) : Rd 7 R ? I can you find the real data set in a group of synthetic data sets?  I  Then go back and update your model accordingly Gelman et al. (2003, ch 6)  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  19 / 43   Source: http://www.doksinet  Outline  1  Classification  2  Generative models  3  Gaussian models  4  Hidden Markov models  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  20 / 43   Source: http://www.doksinet  Gaussian
models  Easiest way to model distributions is via a parametric model: assume a known form, estimate a few parameters
Gaussian model is simple and useful. In 1D:
p(x | θi) = (1 / (σi √(2π))) exp[ −(1/2) ((x − µi) / σi)² ]
Parameters: mean µi and variance σi (figure: bell curve p(x|ωi) with mean µi and width σi fit to data)  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  21 / 43   Source: http://www.doksinet  Gaussians in d dimensions
p(x | θi) = (1 / ((2π)^{d/2} |Σi|^{1/2})) exp[ −(1/2) (x − µi)^T Σi^{−1} (x − µi) ]
Described by a d-dimensional mean µi and a d × d covariance matrix Σi  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  22 / 43
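To tie the Gaussian likelihoods above back to the Bayes'-rule classification discussed earlier, here is a minimal NumPy/SciPy sketch (not from the original slides) that evaluates class posteriors p(θi | x) for two hypothetical classes; the priors, means, and covariances are made-up values for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two hypothetical classes with Gaussian models; all numbers are made up.
classes = [
    {"prior": 0.7, "mean": [0.0, 0.0], "cov": [[1.0, 0.3], [0.3, 1.0]]},
    {"prior": 0.3, "mean": [2.0, 1.0], "cov": [[0.5, 0.0], [0.0, 0.5]]},
]

def posteriors(x):
    # p(theta_i | x) = p(x | theta_i) p(theta_i) / p(x)
    weighted = np.array([c["prior"] * multivariate_normal.pdf(x, c["mean"], c["cov"])
                         for c in classes])
    return weighted / weighted.sum()     # evidence p(x) = sum of the numerators

x = [1.2, 0.4]
post = posteriors(x)
print(post, "-> class", int(np.argmax(post)))  # posteriors sum to 1; argmax = MAP class
```

The evidence p(x) is just the sum of the prior-weighted likelihoods, so the posteriors sum to 1 and the argmax gives the MAP class.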
Source: http://www.doksinet  Gaussian mixture models Single Gaussians cannot model distributions with multiple modes or distributions with nonlinear correlations
What about a weighted sum?  p(x) ≈ Σ_k ck p(x | θk)
where {ck} is a set of weights and {p(x | θk)} is a set of Gaussian components
can fit anything given enough components
Interpretation: each observation is generated by one of the Gaussians, chosen with probability ck = p(θk)  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  23 / 43   Source: http://www.doksinet  Gaussian mixtures (2) e.g. nonlinear correlation (figure: original data with nonlinear correlation, the fitted Gaussian components, and the resulting mixture surface)
Problem: finding the ck and θk parameters; easy if we knew which θk generated each x  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  24 / 43   Source: http://www.doksinet  Expectation-maximization (EM) General procedure for estimating model parameters when some are unknown, e.g. which GMM component generated a point
Iteratively update model parameters θ to maximize Q, the expected log-probability of observed data x and hidden data z:
Q(θ, θt) = ∫ p(z | x, θt) log p(z, x | θ) dz
E step: calculate p(z | x, θt) using θt
M step: find θ that maximizes Q using p(z | x, θt)
can prove p(x | θ) is non-decreasing, hence a maximum likelihood model
local optimum: depends on initialization  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  25 / 43   Source: http://www.doksinet  Fitting GMMs with EM  Want to find parameters of the Gaussians θk = {µk, Σk} and weights/priors on the Gaussians ck = p(θk) . . . that maximize likelihood of training data x
If we could assign each x to a particular θk, estimation would be direct
Hence treat mixture indices, z, as hidden
form Q = E[log p(x, z | θ)], differentiate w.r.t. model parameters → equations for µk, Σk, ck to maximize Q  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  26 / 43   Source: http://www.doksinet  GMM EM update equations Parameters that maximize Q:
νnk ≡ p(zk | xn, θt)
µk = Σn νnk xn / Σn νnk
Σk = Σn νnk (xn − µk)(xn − µk)^T / Σn νnk
ck = (1/N) Σn νnk
Each involves νnk, the ‘fuzzy membership’ of xn in Gaussian k
The updated parameter is just a sample average, weighted by the fuzzy membership  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  27 / 43
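To make the E and M steps concrete, here is a minimal NumPy sketch (not from the original lecture) of EM for a two-component 1-D GMM implementing the ν_nk / weighted-average updates above; the synthetic data, initialization, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 1-D data from two components, just to exercise the updates
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.0, 1.0, 700)])
N, K = len(x), 2

mu = rng.choice(x, K)                  # initialize means from the data
var = np.full(K, x.var())
c = np.full(K, 1.0 / K)                # mixture weights c_k = p(theta_k)

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E step: nu[n, k] = p(z_k | x_n, theta_t)   ('fuzzy membership')
    joint = c * gauss(x[:, None], mu, var)          # shape (N, K)
    nu = joint / joint.sum(axis=1, keepdims=True)
    # M step: weighted sample averages, as in the update equations above
    Nk = nu.sum(axis=0)
    mu = (nu * x[:, None]).sum(axis=0) / Nk
    var = (nu * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    c = Nk / N

print("means", mu, "variances", var, "weights", c)
print("log p(x) =", np.log((c * gauss(x[:, None], mu, var)).sum(axis=1)).sum())
```

Running it from different initializations illustrates the local-optimum point above: the log-likelihood never decreases, but where it ends up depends on where it starts.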
Source: http://www.doksinet  GMM examples  Vowel data fit with different numbers of mixture components (figure: 2-D vowel data fit with 1, 2, 3, and 4 Gaussians; log p(x) improves from −1911 to −1864, −1849, and −1840 as components are added)  [Example. . . ]  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  28 / 43   Source: http://www.doksinet  Outline  1  Classification  2  Generative models  3  Gaussian models  4  Hidden Markov models  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  29 / 43   Source: http://www.doksinet  Markov models  A (first order) Markov model is a
finite-state system whose behavior depends only on the current state
“The future is independent of the past, conditioned on the present”
e.g. a generative Markov model (figure: state diagram over states S, A, B, C, E with self-loop probabilities 0.8 on A, 0.8 on B, and 0.7 on C, plus the transition table p(qn+1 | qn); a sampled state sequence: SAAAAAAAABBBBBBBBBCCCCBBBBBBCE)  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  30 / 43   Source: http://www.doksinet  Hidden Markov models Markov model where the state sequence Q = {qn} is not directly observable (‘hidden’)
But, observations X do depend on Q: xn is an RV that depends only on the current state, p(xn | qn)
(figure: state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC, per-state emission distributions p(x | q) for q = A, B, C, and the resulting observation sequence xn over time steps n)
can still tell something about the state sequence. . .  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  31 / 43
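The generative view (running the model forward) is easy to demonstrate. Below is a minimal NumPy sketch (not from the slides) that samples a state sequence and state-dependent observations from a first-order Markov model; the transition matrix and emission means are assumed values loosely modeled on the figure, not the slide's exact numbers.

```python
import numpy as np

rng = np.random.default_rng(1)
states = ["S", "A", "B", "C", "E"]
# Assumed transition matrix p(q_{n+1} | q_n); values are illustrative only
# (self-loops of roughly 0.8 / 0.8 / 0.7 as in the figure), not the slide's table.
P = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # S always enters A
    [0.0, 0.8, 0.1, 0.1, 0.0],   # A
    [0.0, 0.0, 0.8, 0.1, 0.1],   # B
    [0.0, 0.0, 0.1, 0.7, 0.2],   # C
    [0.0, 0.0, 0.0, 0.0, 1.0],   # E is absorbing
])
emit_mean = {"A": 1.0, "B": 2.0, "C": 3.0}   # assumed per-state emission means

q, seq, obs = 0, ["S"], []
while states[q] != "E":
    q = rng.choice(len(states), p=P[q])
    seq.append(states[q])
    if states[q] in emit_mean:               # x_n depends only on the current state
        obs.append(rng.normal(emit_mean[states[q]], 0.3))
print("".join(seq))                          # an SAAAABBBBBCC...E-style sequence
print(np.round(obs, 2))
```

Dropping the emission lines gives a plain Markov chain; keeping them and hiding `seq` gives exactly the hidden-Markov setting of the slide above.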
Source: http://www.doksinet  (Generative) Markov models HMM is specified by parameters Θ:
- states q^i
- transition probabilities aij
- emission distributions bi(x)
(+ initial state probabilities πi)
aij ≡ p(q_n^j | q_{n−1}^i)    bi(x) ≡ p(x | q^i)    πi ≡ p(q_1^i)
(figure: a left-to-right ‘• k a t •’ model, with transition rows of roughly 0.9 self-loop / 0.1 advance for each emitting state, and emission densities p(x | q))  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  32 / 43   Source: http://www.doksinet  Markov models for sequence recognition Independence of observations: observation xn depends only on the current state qn
p(X | Q) = p(x1, x2, . . . xN | q1, q2, . . . qN) = p(x1 | q1)p(x2 | q2) · · · p(xN | qN) = ∏_{n=1}^{N} p(xn | qn) = ∏_{n=1}^{N} b_{qn}(xn)
Markov transitions: the transition to the next state qn+1 depends only on qn
p(Q | M) = p(q1, q2, . . . | M) = p(qN | qN−1 . . . q1) p(qN−1 | qN−2 . . . q1) · · · p(q2 | q1) p(q1)
= p(qN | qN−1) p(qN−1 | qN−2) · · · p(q2 | q1) p(q1) = p(q1) ∏_{n=2}^{N} p(qn | qn−1) = π_{q1} ∏_{n=2}^{N} a_{qn−1 qn}  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  33 / 43   Source: http://www.doksinet  Model-fit calculations From ‘state-based modeling’:
p(X | Θj) = Σ_{all Q} p(X | Q, Θj) p(Q | Θj)
For HMMs:  p(X | Q) = ∏_{n=1}^{N} b_{qn}(xn)    p(Q | M) = π_{q1} ∏_{n=2}^{N} a_{qn−1 qn}
Hence, solve for Θ̂ = argmax_{Θj} p(Θj | X), using Bayes’ rule to convert from p(X | Θj)
Sum over all Q???  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  34 / 43   Source: http://www.doksinet  Summing over all paths Model M1 (figure): states S, A, B, E with transitions p(A | S) = 0.9, p(B | S) = 0.1; p(A | A) = 0.7, p(B | A) = 0.2, p(E | A) = 0.1; p(B | B) = 0.8, p(E | B) = 0.2; p(E | E) = 1
Observation likelihoods p(x | q) for the three observations x1, x2, x3: state A: 2.5, 0.2, 0.1; state B: 0.1, 2.2, 2.3
All possible 3-emission paths Qk from S to E (q0 q1 q2 q3 q4), with p(Q | M) = Πn p(qn | qn−1) and p(X | Q, M) = Πn p(xn | qn):
S A A A E:  p(Q | M) = .9 × .7 × .7 × .1 = 0.0441,  p(X | Q, M) = 2.5 × 0.2 × 0.1 = 0.05,   p(X, Q | M) = 0.0022
S A A B E:  p(Q | M) = .9 × .7 × .2 × .2 = 0.0252,  p(X | Q, M) = 2.5 × 0.2 × 2.3 = 1.15,   p(X, Q | M) = 0.0290
S A B B E:  p(Q | M) = .9 × .2 × .8 × .2 = 0.0288,  p(X | Q, M) = 2.5 × 2.2 × 2.3 = 12.65,  p(X, Q | M) = 0.3643
S B B B E:  p(Q | M) = .1 × .8 × .8 × .2 = 0.0128,  p(X | Q, M) = 0.1 × 2.2 × 2.3 = 0.506,  p(X, Q | M) = 0.0065
Σ p(Q | M) = 0.1109;   Σ p(X, Q | M) = p(X | M) = 0.4020  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  35 / 43   Source: http://www.doksinet  The ‘forward recursion’ Dynamic-programming-like technique to sum over all Q
Define αn(i) as the probability of getting to state q^i at time step n (by any path):
αn(i) = p(x1, x2, . . . xn, qn = q^i) ≡ p(X_1^n, q_n^i)
αn+1(j) can be calculated recursively:
αn+1(j) = [ Σ_{i=1}^{S} αn(i) · aij ] · bj(xn+1)
(figure: trellis of model states q^i against time steps n, showing αn(i) and αn(i+1) feeding αn+1(j) through aij, ai+1,j and bj(xn+1))  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  36 / 43
Source: http://www.doksinet  Forward recursion (2)  Initialize α1(i) = πi bi(x1)
Then total probability p(X_1^N | Θ) = Σ_{i=1}^{S} αN(i)
Practical way to solve for p(X | Θj) and hence select the most probable model (recognition): evaluate p(X | M1)·p(M1), p(X | M2)·p(M2), . . . for the observations X and choose the best  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  37 / 43   Source: http://www.doksinet  Optimal path  May be interested in actual qn assignments: which state was ‘active’ at each time frame, e.g. phone labeling (for training?)
Total probability is over all paths . . . but can also solve for the single best path, the “Viterbi” state sequence
Probability along the best path to state q^j at time n+1:
α̂n+1(j) = [ max_i {α̂n(i) aij} ] · bj(xn+1)
backtrack from the final state to get the best path
final probability is a product only (no sum) → log-domain calculation is just summation
Best path often dominates total probability:  p(X | Θ) ≈ p(X, Q̂ | Θ)  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  38 / 43
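As a concrete check on the toy example above, here is a minimal NumPy sketch (not part of the slides) of the forward recursion and the Viterbi recursion for model M1; treating S and E as non-emitting entry/exit states is an implementation choice, not something the slides prescribe.

```python
import numpy as np

# Toy model M1 from the slide: emitting states A and B; S / E are entry / exit.
pi    = np.array([0.9, 0.1])              # p(q1 = A), p(q1 = B)   (leaving S)
A     = np.array([[0.7, 0.2],             # transitions among {A, B}
                  [0.0, 0.8]])
exit_ = np.array([0.1, 0.2])              # p(E | A), p(E | B)
B     = np.array([[2.5, 0.2, 0.1],        # observation likelihoods b_q(x_n)
                  [0.1, 2.2, 2.3]])       # rows: A, B; columns: x1, x2, x3
names = ["A", "B"]

# Forward recursion: alpha[n, j] = p(x_1..x_n, q_n = j)
alpha = np.zeros((3, 2))
alpha[0] = pi * B[:, 0]
for n in range(1, 3):
    alpha[n] = (alpha[n - 1] @ A) * B[:, n]
print("p(X | M1) =", alpha[-1] @ exit_)                 # ~0.402, as in the table

# Viterbi: replace the sum with a max, and backtrack
delta = np.zeros((3, 2))
psi = np.zeros((3, 2), dtype=int)
delta[0] = pi * B[:, 0]
for n in range(1, 3):
    scores = delta[n - 1][:, None] * A                  # scores[i, j]
    psi[n] = scores.argmax(axis=0)
    delta[n] = scores.max(axis=0) * B[:, n]
best = int(np.argmax(delta[-1] * exit_))
path = [best]
for n in (2, 1):
    path.append(int(psi[n][path[-1]]))
print("best path:", ["S"] + [names[q] for q in reversed(path)] + ["E"],
      "p(X, Q^ | M1) =", float((delta[-1] * exit_).max()))   # S A B B E, ~0.364
```

The forward sum reproduces p(X | M1) = 0.402 and the Viterbi pass reproduces the best-path probability 0.3643 from the path table, without enumerating all paths.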
Source: http://www.doksinet  Interpreting the Viterbi path Viterbi path assigns each xn to a state q^i, performing classification based on bi(x) . . . while at the same time applying transition constraints aij
(figure: observations xn over time with the inferred classification; Viterbi labels: AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC)
Can be used for segmentation: train an HMM with ‘garbage’ and ‘target’ states, decode on new data to find ‘targets’ and boundaries
Can use for (heuristic) training, e.g. forced alignment to bootstrap a speech recognizer, e.g. train classifiers based on labels  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  39 / 43   Source: http://www.doksinet  Aside: Training and test data A rich model can learn every training example (overtraining)  But the goal is to classify new, unseen data: sometimes use a ‘cross validation’ set to decide when to stop training
For evaluation results to be meaningful: don’t test with training
data! don’t train on test data (even indirectly.   )  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  40 / 43   Source: http://www.doksinet  Aside (2): Model complexity More training data allows the use of larger models  More model parameters create a better fit to the training data I I  more Gaussian mixture components more HMM states  For fixed training set size, there will be some optimal model size that avoids overtraining  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  41 / 43   Source: http://www.doksinet  Summary  Classification is making discrete (hard) decisions Basis is comparison with known examples I  explicitly or via a model  Classification models discriminative models, like SVMs, neural nets, boosters, directly learn posteriors p(θi | x) I generative models, like Gaussians, GMMs, HMMs, model likelihoods p(x | θ) I Bayes’ rule lets us use generative models for classification I  EM allows parameter estimation even with some data
missing  Parting thought Is it wise to use generative models for discrimination or vice versa?  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  42 / 43   Source: http://www.doksinet  References  Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc IEEE Symposium on Foundations of Computer Science, pages 459–468, Washington, DC, USA, 2006. IEEE Computer Society. Andrew Gelman, John B. Carlin, Hal S Stern, and Donald B Rubin Bayesian Data Analysis. Chapman & Hall/CRC, second edition, July 2003 ISBN 158488388X Lawrence R. Rabiner A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989 Jeff A. Bilmes A gentle tutorial on the em algorithm and its application to parameter estimation for gaussian mixture and hidden markov models, 1997. Christopher J. C Burges A tutorial on support vector machines for pattern recognition.
Data Min Knowl Discov, 2(2):121–167, June 1998  Michael Mandel (E6820 SAPR)  Machine learning  February 7, 2008  43 / 43   Source: http://www.doksinet  EE E6820: Speech & Audio Processing & Recognition  Lecture 4: Auditory Perception Mike Mandel <mim@ee.columbiaedu> Dan Ellis <dpwe@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820  February 10, 2009 1 2 3 4 5 6  Motivation: Why & how Auditory physiology Psychophysics: Detection & discrimination Pitch perception Speech perception Auditory organization & Scene analysis  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  1 / 44   Source: http://www.doksinet  Outline 1  Motivation: Why & how  2  Auditory physiology  3  Psychophysics: Detection & discrimination  4  Pitch perception  5  Speech perception  6  Auditory organization & Scene analysis  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  2
/ 44   Source: http://www.doksinet  Why study perception? Perception is messy: can we avoid it? No! Audition provides the ‘ground truth’ in audio what is relevant and irrelevant subjective importance of distortion (coding etc.) I (there could be other information in sound.   ) I I  Some sounds are ‘designed’ for audition I  co-evolution of speech and hearing  The auditory system is very successful I  we would do extremely well to duplicate it  We are now able to model complex systems I  faster computers, bigger memories  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  3 / 44   Source: http://www.doksinet  How to study perception? Three different approaches: Analyze the example: physiology  I  dissection & nerve recordings  Black box input/output: psychophysics  I  fit simple models of simple functions  Information processing models investigate and model complex functions e.g scene analysis, speech perception I  E6820 (Ellis & Mandel)  L4:
Auditory Perception  February 10, 2009  4 / 44   Source: http://www.doksinet  Outline 1  Motivation: Why & how  2  Auditory physiology  3  Psychophysics: Detection & discrimination  4  Pitch perception  5  Speech perception  6  Auditory organization & Scene analysis  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  5 / 44   Source: http://www.doksinet  Physiology  Processing chain from air to brain: Middle ear Auditory nerve Cortex Outer ear  Midbrain Inner ear (cochlea)  Study via: I I  anatomy nerve recordings  Signals flow in both directions  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  6 / 44   Source: http://www.doksinet  Outer & middle ear Ear canal Middle ear bones  Pinna  Cochlea (inner ear)  Eardrum (tympanum)  Pinna ‘horn’ I  complex reflections give spatial (elevation) cues  Ear canal I  acoustic tube  Middle ear I  bones provide impedance matching  E6820 (Ellis & Mandel)  L4: Auditory Perception 
February 10, 2009  7 / 44   Source: http://www.doksinet  Inner ear: Cochlea Oval window (from ME bones)  Basilar Membrane (BM) Travelling wave  Cochlea 16 kHz  Resonant frequency 50 Hz 0  Position  35mm  Mechanical input from middle ear starts traveling wave moving down Basilar membrane Varying stiffness and mass of BM results in continuous variation of resonant frequency At resonance, traveling wave energy is dissipated in BM vibration I  Frequency (Fourier) analysis  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  8 / 44   Source: http://www.doksinet  Cochlea hair cells Ear converts sound to BM motion I  each point on BM corresponds to a frequency Cochlea Tectorial membrane  Basilar membrane Auditory nerve  Inner Hair Cell (IHC)  Outer Hair Cell (OHC)  Hair cells on BM convert motion into nerve impulses (firings) Inner Hair Cells detect motion Outer Hair Cells? Variable damping?  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  9 / 44  
Source: http://www.doksinet  Inner Hair Cells  IHCs convert BM vibration into nerve firings Human ear has ∼3500 IHCs I  each IHC has ∼7 connections to Auditory Nerve  Each nerve fires (sometimes) near peak displacement Local BM displacement  50  time / ms  Typical nerve signal (mV)  Histogram to get firing probability Firing count Cycle angle  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  10 / 44   Source: http://www.doksinet  Auditory nerve (AN) signals Single nerve measurements Tone burst histogram Spike count  Frequency threshold dB SPL  80  100 60 40  Time  20  100 ms 1 kHz  100 Hz  Tone burst  10 kHz  (log) frequency  Rate vs intensity One fiber: ~ 25 dB dynamic range  Spikes/sec  300  200  100 Intensity / dB SPL 0  0  20  40  60  80  100  Hearing dynamic range > 100 dB  Hard to measure: probe living ANs? E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  11 / 44   Source: http://www.doksinet  AN population response All the
information the brain has about sound average rate & spike timings on 30,000 fibers  Not unlike a (constant-Q) spectrogram freq / 8ve re 100 Hz  PatSla rectsmoo on bbctmp2 (2001-02-18) 5 4 3 2 1 0 0  E6820 (Ellis & Mandel)  10  20  30  40  50  L4: Auditory Perception  60  time / ms  February 10, 2009  12 / 44   Source: http://www.doksinet  Beyond the auditory nerve  Ascending and descending Tonotopic × ? I  modulation, position, source??  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  13 / 44   Source: http://www.doksinet  Periphery models  IHC  Sound  IHC  Outer/middle ear filtering  Cochlea filterbank  SlaneyPatterson 12 chans/oct from 180 Hz, BBC1tmp (20010218) 60  Modeled aspects channel  outer / middle ear I hair cell transduction I cochlea filtering I efferent feedback? I  50 40 30 20 10 0  0.1  0.2  0.3  0.4  0.5  time / s  Results: ‘neurogram’ / ‘cochleagram’ E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  14 /
44   Source: http://www.doksinet  Outline 1  Motivation: Why & how  2  Auditory physiology  3  Psychophysics: Detection & discrimination  4  Pitch perception  5  Speech perception  6  Auditory organization & Scene analysis  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  15 / 44   Source: http://www.doksinet  Psychophysics  Physiology looks at the implementation Psychology looks at the function/behavior Analyze audition as signal detection: p(θ | x) psychological tests reflect internal decisions assume optimal decision process I infer nature of internal representations, noise, .    lower bounds on more complex functions I I  Different aspects to measure time, frequency, intensity tones, complexes, noise I binaural I pitch, detuning I I  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  16 / 44   Source: http://www.doksinet  Basic psychophysics Relate physical and perceptual variables e.g intensity  loudness frequency  pitch 
Methodology: subject tests: just noticeable difference (JND); magnitude scaling, e.g. “adjust to twice as loud”
Results for Intensity vs Loudness: Weber’s law ∆I ∝ I ⇒ log(L) = k log(I)
log2(L) = 0.3 log2(I) = 0.3 · (10 log10 I) / (10 log10 2) ≈ dB / 10
(figure: Hartmann (1993) classroom loudness scaling data, log loudness rating vs sound level in dB; textbook figure L ∝ I^0.3, power-law fit L ∝ I^0.22)  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  17 / 44   Source: http://www.doksinet  Loudness as a function of frequency Fletcher-Munson equal-loudness curves (figure: equal-loudness contours plotted as intensity in dB SPL vs frequency from 100 to 10,000 Hz, with rapid loudness growth at 100 Hz compared to 1 kHz; panels of equivalent loudness at 1 kHz vs intensity in dB)  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009
18 / 44   Source: http://www.doksinet  Loudness as a function of bandwidth Same total energy, different distribution e.g 2 channels at −6 dB (not −10 dB) freq  I0  I1  freq  B1  Loudness  B0  mag  mag  time Same total energy I·B  freq  . but wider perceived as louder ‘Critical’ Bandwidth B bandwidth  Critical bands: independent frequency channels I  ∼25 total (4-6 / octave)  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  19 / 44   Source: http://www.doksinet  Simultaneous masking  Intensity / dB  A louder tone can ‘mask’ the perception of a second tone nearby in frequency: absolute threshold  masking tone  masked threshold log freq  Suggests an ‘internal noise’ model: p(x | I)  p(x | I+∆I)  p(x | I)  σn I E6820 (Ellis & Mandel)  internal noise decision variable  L4: Auditory Perception  x  February 10, 2009  20 / 44   Source: http://www.doksinet  Sequential masking Backward/forward in time: masker envelope  Intensity / dB  simultaneous
masking ~10 dB masked threshold  time  backward masking ~5 ms  forward masking ~100 ms   Time-frequency masking ‘skirt’:  intensity  Masking tone freq  Masked threshold  time  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  21 / 44   Source: http://www.doksinet  What we do and don’t hear A  “two-interval forced-choice”:  X  B  X = A or B? time  Timing: 2 ms attack resolution, 20 ms discrimination I  but: spectral splatter  Tuning: ∼1% discrimination I  but: beats  Spectrum: profile changes, formants I  variables time-frequency resolution  Harmonic phase? Noisy signals & texture (Trace vs categorical memory) E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  22 / 44   Source: http://www.doksinet  Outline 1  Motivation: Why & how  2  Auditory physiology  3  Psychophysics: Detection & discrimination  4  Pitch perception  5  Speech perception  6  Auditory organization & Scene analysis  E6820 (Ellis & Mandel)  L4:
Auditory Perception  February 10, 2009  23 / 44   Source: http://www.doksinet  Pitch perception: a classic argument in psychophysics Harmonic complexes are a pattern on AN freq. chan  70 60 50 40 30 20 10  0  I  0.05  0.1  time/s  but give a fused percept (ecological)  What determines the pitch percept? I  not the fundamental  How is it computed? Two competing models: place and time E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  24 / 44   Source: http://www.doksinet  Place model of pitch AN excitation pattern shows individual peaks ‘Pattern matching’ method to find pitch AN excitation  broader HF channels cannot resolve harmonics  resolved harmonics  frequency channel  Correlate with harmonic ‘sieve’:  Pitch strength frequency channel  Support: Low harmonics are very important But: Flat-spectrum noise can carry pitch  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  25 / 44   Source: http://www.doksinet  Time model of pitch Timing
information is preserved in AN down to ∼1 ms scale  freq  Extract periodicity by e.g autocorrelation and combine across frequency channels per-channel autocorrelation  autocorrelation  time  Summary autocorrelation  0  10  20  30  common period (pitch)  lag / ms  But: HF gives weak pitch (in practice)  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  26 / 44   Source: http://www.doksinet  Alternate & competing cues Pitch perception could rely on various cues average excitation pattern summary autocorrelation I more complex pattern matching I I  Relying on just one cue is brittle I  e.g missing fundamental   Perceptual system appears to use a flexible, opportunistic combination Optimal detector justification? argmax p(θ | x) = argmax p(x | θ)p(θ) θ  θ  = argmax p(x1 | θ)p(x2 | θ)p(θ) θ  I  if x1 and x2 are conditionally independent  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  27 / 44   Source: http://www.doksinet  Outline
1  Motivation: Why & how  2  Auditory physiology  3  Psychophysics: Detection & discrimination  4  Pitch perception  5  Speech perception  6  Auditory organization & Scene analysis  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  28 / 44   Source: http://www.doksinet  Speech perception Highly specialized function I subsequent to source organization? .   but also can interact  Kinds of speech sounds glide freq / Hz  vowel  nasal fricative  stop burst 60  4000  50  3000  40 2000 30 1000  20  level/dB  0 1.4  has a  E6820 (Ellis & Mandel)  1.6  1.8  watch  2  2.2  thin  2.4  as  L4: Auditory Perception  a  2.6  time/s dime  February 10, 2009  29 / 44   Source: http://www.doksinet  Cues to phoneme perception Linguists describe speech with phonemes  ^  e  hε z  w  has a  tcl c θ  I n I  z  watch  thin  as a  I d  ay  m dime  Acoustic-phoneticians describe phonemes by  freq  • formants & transitions  • bursts & onset times  transition 
stop burst voicing onset time  vowel formants time  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  30 / 44   Source: http://www.doksinet  Categorical perception (Some) speech sounds perceived categorically rather than analogically I  e.g stop-burst and timing:  vowel formants  fb  time  T burst freq fb / Hz  stop burst  3000  2000  K  P  P  1000  i  e  ε  P a  c  freq  4000  o  u  following vowel  I I  tokens within category are hard to distinguish category boundaries are very sharp  Categories are learned for native tongue I  “merry” / “Mary” / “marry”  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  31 / 44   Source: http://www.doksinet  Where is the information in speech? ‘Articulation’ of high/low-pass filtered speech:  Articulation / %  high-pass  low-pass  80 60 40 20  1000  2000  3000  4000  freq / Hz  sums to more than 1.   Speech message is highly redundant e.g constraints of language, context  listeners can
understand with very few cues E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  32 / 44   Source: http://www.doksinet  Top-down influences: Phonemic restoration (Warren, 1970)  freq / Hz  What if a noise burst obscures speech? 4000 3000 2000 1000 0  1.4  1.6  1.8  2  2.2  2.4  2.6  time / s  auditory system ‘restores’ the missing phoneme .   based on semantic content .   even in retrospect Subjects are typically unaware of which sounds are restored  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  33 / 44   Source: http://www.doksinet  A predisposition for speech: Sinewave replicas  freq / Hz  freq / Hz  Replace each formant with a single sinusoid (Remez et al., 1981) 5000 4000 3000 2000 1000 0 5000 4000 3000 2000 1000 0  Speech  Sines 0.5  1  1.5  2  2.5  3  time / s  speech is (somewhat) intelligible people hear both whistles and speech (“duplex”) processed as speech despite un-speech-like What does it take to be speech? E6820
(Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  34 / 44   Source: http://www.doksinet  Simultaneous vowels dB  Mix synthetic vowels with different f0 s /iy/ @ 100 Hz  +  freq  =  /ah/ @ 125 Hz  Pitch difference helps (though not necessarily)  % both vowels correct  DV identification vs. ∆f0 (200ms) (Culling & Darwin 1993) 75 50 25  0 1/4 1/2 1 2 4  ∆f0 (semitones) E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  35 / 44   Source: http://www.doksinet  Computational models of speech perception Various theoretical-practical models of speech comprehension e.g Speech  Phoneme recognition  Lexical access  Words  Grammar constraints  Open questions: mechanism of phoneme classification mechanism of lexical recall I mechanism of grammar constraints I I  ASR is a practical implementation (?)  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  36 / 44   Source: http://www.doksinet  Outline 1  Motivation: Why & how  2 
Auditory physiology  3  Psychophysics: Detection & discrimination  4  Pitch perception  5  Speech perception  6  Auditory organization & Scene analysis  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  37 / 44   Source: http://www.doksinet  Auditory organization Detection model is huge simplification The real role of hearing is much more general: Recover useful information from the outside world  Sound organization into events and sources frq/Hz  Voice  4000  Stab 2000  Rumble 0  0  2  4  time/s  Research questions: what determines perception of sources? how do humans separate mixtures? I how much can we tell about a source? I I  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  38 / 44   Source: http://www.doksinet  Auditory scene analysis: simultaneous fusion  freq  Harmonics are distinct on AN, but perceived as one sound (“fused”)  time I I  depends on common onset depends on harmonicity (common period)  Methodologies: ask
subject how many ‘objects’ match attributes e.g object pitch I manipulate high level e.g vowel identity  I I  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  39 / 44   Source: http://www.doksinet  Sequential grouping: streaming Pattern / rhythm: property of a set of objects I  subsequent to fusion ∵ employs fused events?  frequency  TRT: 60-150 ms 1 kHz  ∆f: –2 octaves  time  Measure by relative timing judgments I  cannot compare between streams  Separate ‘coherence’ and ‘fusion’ boundaries Can interact and compete with fusion  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  40 / 44   Source: http://www.doksinet  Continuity and restoration  freq  Tone is interrupted by noise burst: what happened? +  ? +  +  time I  masking makes tone undetectable during noise  Need to infer most probable real-world events I I  observation equally likely for either explanation prior on continuous tone much higher ⇒ choose  Top-down
influence on perceived events.    E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  41 / 44   Source: http://www.doksinet  Models of auditory organization Psychological accounts suggest bottom-up input mixture  Front end  signal features (maps)  Object formation  discrete objects  Grouping rules  Source groups  freq onset time  period frq.mod  Brown and Cooke (1994) Complicated in practice formation of separate elements contradictory cues influence of top-down constraints (context, expectations, .   ) E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  42 / 44   Source: http://www.doksinet  Summary  Auditory perception provides the ‘ground truth’ underlying audio processing Physiology specifies information available Psychophysics measure basic sensitivities Sounds sources require further organization Strong contextual effects in speech perception Transduce Sound  Multiple represent'ns  Scene analysis  High-level recognition  Parting
thought Is pitch central to communication? Why?  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  43 / 44   Source: http://www.doksinet  References  Richard M. Warren Perceptual restoration of missing speech sounds Science, 167 (3917):392–393, January 1970. R. E Remez, P E Rubin, D B Pisoni, and T D Carrell Speech perception without traditional speech cues. Science, 212(4497):947–949, May 1981 G. J Brown and M Cooke Computational auditory scene analysis Computer Speech & Language, 8(4):297–336, 1994. Brian C. J Moore An Introduction to the Psychology of Hearing Academic Press, fifth edition, April 2003. ISBN 0125056281 James O. Pickles An Introduction to the Physiology of Hearing Academic Press, second edition, January 1988. ISBN 0125547544  E6820 (Ellis & Mandel)  L4: Auditory Perception  February 10, 2009  44 / 44   Source: http://www.doksinet  EE E6820: Speech & Audio Processing & Recognition  Lecture 5: Speech modeling Dan Ellis
<dpwe@ee.columbiaedu> Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820  February 19, 2009 1 2 3 4 5  Modeling speech signals Spectral and cepstral models Linear predictive models (LPC) Other signal models Speech synthesis  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  1 / 46   Source: http://www.doksinet  Outline  1  Modeling speech signals  2  Spectral and cepstral models  3  Linear predictive models (LPC)  4  Other signal models  5  Speech synthesis  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  2 / 46   Source: http://www.doksinet  The speech signal  w  e  has a  ^  hε z  tcl c θ  I n I  z  watch  thin  as a  I d  ay  m dime  Elements of the speech signal spectral resonances (formants, moving) periodic excitation (voicing, pitched) + pitch contour noise excitation transients (stop-release bursts) amplitude modulation (nasals, approximants) timing!
E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  3 / 46   Source: http://www.doksinet  The source-filter model Notional separation of source: excitation, fine time-frequency structure filter: resonance, broad spectral structure Formants Pitch  t  Glottal pulse train  Voiced/ unvoiced  Vocal tract resonances  +  Radiation characteristic  Speech  f  Frication noise t  Source  Filter  More a modeling approach than a single model  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  4 / 46   Source: http://www.doksinet  Signal modeling  Signal models are a kind of representation to make some aspect explicit for efficiency I for flexibility I I  Nature of model depends on goal classification: remove irrelevant details coding/transmission: remove perceptual irrelevance I modification: isolate control parameters I I  But commonalities emerge perceptually irrelevant detail (coding) will also be irrelevant for classification I modification domain will usually
reflect ‘independent’ perceptual attributes I getting at the abstract information in the signal I  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  5 / 46   Source: http://www.doksinet  Different influences for signal models Receiver I  see how signal is treated by listeners  cochlea-style filterbank models .    Transmitter (source) I  physical vocal apparatus can generate only a limited range of signals .    LPC models of vocal tract resonances  Making explicit particular aspects I  compact, separable correlates of resonances  cepstrum  I  modeling prominent features of NB spectrogram  I  addressing unnaturalness in synthesis   sinusoid models  Harmonic+noise model  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  6 / 46   Source: http://www.doksinet  Application of (speech) signal models Classification / matching Goal: highlight important information speech recognition (lexical content) speaker recognition (identity or class) I other signal
classification I content-based retrieval I I  Coding / transmission / storage Goal: represent just enough information I I  real-time transmission, e.g mobile phones archive storage, e.g voicemail  Modification / synthesis Goal: change certain parts independently I I  speech synthesis / text-to-speech (change the words) speech transformation / disguise (change the speaker)  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  7 / 46   Source: http://www.doksinet  Outline  1  Modeling speech signals  2  Spectral and cepstral models  3  Linear predictive models (LPC)  4  Other signal models  5  Speech synthesis  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  8 / 46   Source: http://www.doksinet  Spectral and cepstral models  Spectrogram seems like a good representation long history satisfying in use I experts can ‘read’ the speech I I  What is the information? I I  intensity in time-frequency cells typically 5ms × 200 Hz × 50 dB   Discarded
detail: I I  phase fine-scale timing  The starting point for other representations  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  9 / 46   Source: http://www.doksinet  Short-time Fourier transform (STFT) as filterbank View spectrogram rows as coming from separate bandpass filters f  sound  Mathematically:   2πk(n − n0 ) X [k, n0 ] = x[n]w [n − n0 ] exp −j N n X = x[n]hk [n0 − n] X  n  hk[n]  where hk [n] = w [−n] exp  j 2πkn N    Hk(ejω)  w[-n]  W(ej(ω − 2πk/N))  n 2πk/N  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  ω  10 / 46   Source: http://www.doksinet  Spectral models: which bandpass filters? Constant bandwidth? (analog / FFT) But: cochlea physiology & critical bandwidths  implement ear models with bandpass filters & choose bandwidths by e.g CB estimates  Auditory frequency scales I  constant ‘Q’ (center freq / bandwidth), mel, Bark, .    E6820 (Ellis & Mandel)  L5: Speech modeling  February 19,
2009  11 / 46   Source: http://www.doksinet  Gammatone filterbank Given bandwidths, which filter shapes? match inferred temporal integration window match inferred spectral shape (sharp high-freq slope) I keep it simple (since it’s only approximate) I I   Gammatone filters I I  2N poles, 2 zeros, low complexity reasonable linear match to cochlea  h[n] = nN−1 e −bn cos(ωi n) 0  2  -10  2  2  2  mag / dB  z plane  -20 -30 -40 -50  E6820 (Ellis & Mandel)  time   50  100  200  L5: Speech modeling  500  1000  2000  5000 freq / Hz  February 19, 2009  12 / 46   Source: http://www.doksinet  Constant-BW vs. cochlea model  Frequency responses  Spectrograms  Effective FFT filterbank  0  freq / Hz  Gain / dB  -10 -20 -30 -40 -50  0  1000  2000  3000  4000  5000  6000  7000  2000 0.5  1  1.5  2  2.5  3  Q=4 4 pole 2 zero cochlea model downsampled @ 64 5000 2000  freq / Hz  Gain / dB  4000  0 0  8000  -10 -20 -30 -40 -50  6000  Gammatone filterbank  0  FFT-based WB spectrogram (N=128) 
8000  1000 500 200 100  0  1000  2000  3000  4000  5000  6000  7000  8000  Freq / Hz  0  0.5  1  1.5  2  2.5  3  time / s  Magnitude smoothed over 5-20 ms time window  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  13 / 46   Source: http://www.doksinet  Limitations of spectral models  Not much data thrown away just fine phase / time structure (smoothing) little actual ‘modeling’ I still a large representation I I  Little separation of features e.g formants and pitch  Highly correlated features I  modifications affect multiple parameters  But, quite easy to reconstruct I  iterative reconstruction of lost phase  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  14 / 46   Source: http://www.doksinet  The cepstrum  Original motivation: assume a source-filter model: Excitation source g[n]  Resonance filter H(ejω)  n ω  n  Define ‘Homomorphic deconvolution’: source-filter convolution g [n] ∗ h[n] FT  product G (e jω )H(e jω ) log  sum log
G(e^jω) + log H(e^jω) IFT → separate fine structure cg[n] + ch[n] = deconvolution
Definition: Real cepstrum cn = idft(log |dft(x[n])|)  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  15 / 46   Source: http://www.doksinet  Stages in cepstral deconvolution Original waveform has excitation fine structure convolved with resonances
DFT shows harmonics modulated by resonances
Log DFT is the sum of a harmonic ‘comb’ and resonant bumps
IDFT separates out resonant bumps (low quefrency) and regular fine structure (the ‘pitch pulse’)
Selecting the low-n cepstrum separates resonance information (deconvolution / ‘liftering’)
(figure: waveform and minimum-phase IR; |DFT| and its liftered version; log |DFT| and its liftered version; real cepstrum and lifter, with the ‘pitch pulse’ marked at higher quefrency)  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  16 / 46
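To make the liftering operation concrete, here is a minimal NumPy sketch (not from the original lecture) of the real cepstrum and a low-quefrency lifter applied to a synthetic harmonic frame; the frame construction, lifter length, and pitch-peak search range are arbitrary illustrative choices.

```python
import numpy as np

def real_cepstrum(x):
    # c[n] = IDFT( log |DFT(x[n])| )
    return np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + 1e-10)))

def liftered_log_spectrum(x, n_keep=30):
    # Keep only the low-quefrency cepstrum (and its mirror) -> smooth envelope
    c = real_cepstrum(x)
    lifter = np.zeros_like(c)
    lifter[:n_keep] = 1.0
    lifter[-(n_keep - 1):] = 1.0
    return np.real(np.fft.fft(c * lifter))   # smoothed log-magnitude spectrum

# Synthetic 'voiced' frame: 100 Hz harmonics at fs = 8 kHz, Hann-windowed
fs, f0, N = 8000, 100, 400
t = np.arange(N) / fs
frame = np.hanning(N) * sum(np.cos(2 * np.pi * f0 * k * t) / k for k in range(1, 20))

env = liftered_log_spectrum(frame)           # broad resonance-like structure
c = real_cepstrum(frame)
q = np.argmax(c[20:N // 2]) + 20             # strongest high-quefrency peak
print("pitch-pulse quefrency:", q, "samples ->", fs / q, "Hz")   # ~80 -> ~100 Hz
```

The low-quefrency part gives the smooth envelope (the "resonance" information), while the isolated high-quefrency peak sits at the pitch period in samples, which is the basis of cepstral pitch estimation.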
Source: http://www.doksinet  Properties of the cepstrum Separate source (fine) from filter (broad structure) I  smooth the log magnitude spectrum to get resonances  Smoothing spectrum is filtering along frequency i.e convolution applied in Fourier domain  multiplication in IFT (‘liftering’)  Periodicity in time  harmonics in spectrum  ‘pitch pulse’ in high-n cepstrum Low-n cepstral coefficients are DCT of broad filter / resonance shape Z  dω  cn = log X (e jω ) (cos nω +  j sin nω) 5th order Cepstral reconstruction  Cepstral coefs 0.5 0.1  2 1  0  0 -1 0  1  2  3  4  5  -0.1 0  E6820 (Ellis & Mandel)  1000  2000  3000  L5: Speech modeling  4000  5000  6000  7000  February 19, 2009  17 / 46   Source: http://www.doksinet  Aside: correlation of elements Cepstrum is popular in speech recognition I  feature vector elements are decorrelated  Cepstral coefficients  Auditory spectrum  Features 20  20  16  15  12  10  8  5  4  Example joint distrib (10,15) -2 -3 -4 -5  20 
18 16 14 12 10 8 6 4 2  16 12 8 4 50  I  Covariance matrix  25  100  150 frames  5  10  15  20  3 2 1 0 -1 -2 -3 -4 -5  0  5  c0 ‘normalizes out’ average log energy  Decorrelated pdfs fit diagonal Gaussians I  simple correlation is a waste of parameters  DCT is close to PCA for (mel) spectra? E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  18 / 46   Source: http://www.doksinet  Outline  1  Modeling speech signals  2  Spectral and cepstral models  3  Linear predictive models (LPC)  4  Other signal models  5  Speech synthesis  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  19 / 46   Source: http://www.doksinet  Linear predictive modeling (LPC) LPC is a very successful speech model it is mathematically efficient (IIR filters) it is remarkably accurate for voice (fits source-filter distinction) I it has a satisfying physical interpretation (resonances) I  I  Basic math I  model output as linear function of prior outputs: ! p X s[n] = ak s[n −
k] + e[n] (k = 1 . . . p), i.e.
s[n] = ( Σ_{k=1}^{p} ak s[n − k] ) + e[n]
. . . hence “linear prediction” (p-th order)
e[n] is the excitation (input), AKA prediction error
⇒ S(z)/E(z) = 1/A(z) = 1 / ( 1 − Σ_{k=1}^{p} ak z^{−k} )
. . . all-pole modeling, ‘autoregression’ (AR) model  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  20 / 46   Source: http://www.doksinet  Vocal tract motivation for LPC Direct expression of source-filter model:
s[n] = ( Σ_{k=1}^{p} ak s[n − k] ) + e[n]
(figure: pulse/noise excitation e[n] → vocal tract H(z) = 1/A(z) → s[n], with |H(e^jω)| and the z-plane pole pattern)
Acoustic tube models suggest an all-pole model for the vocal tract
Relatively slowly-changing: update A(z) every 10-20 ms
Not perfect: nasals introduce zeros  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  21 / 46   Source: http://www.doksinet  Estimating LPC parameters Minimize short-time squared prediction error
E = Σ_{n=1}^{m} e²[n] = Σ_n ( s[n] − Σ_{k=1}^{p} ak s[n − k] )²
Differentiate w.r.t. ak to get equations for each k:
0 = Σ_n 2 ( s[n] − Σ_{j=1}^{p} aj s[n − j] ) ( −s[n − k] )
Σ_n s[n] s[n − k] = Σ_j aj Σ_n s[n − j] s[n − k]
φ(0, k) = Σ_j aj φ(j, k)
where φ(j, k) = Σ_{n=1}^{m} s[n − j] s[n − k] are correlation coefficients
→ p linear equations to solve for all the aj s . . .  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  22 / 46   Source: http://www.doksinet  Evaluating parameters Linear equations φ(0, k) = Σ_{j=1}^{p} aj φ(j, k)
If s[n] is assumed to be zero outside of some window,
φ(j, k) = Σ_n s[n − j] s[n − k] = rss(|j − k|)
where rss(τ) is the autocorrelation. Hence the equations become:
[ r(1) r(2) . . . r(p) ]^T = [ r(0) r(1) . . . r(p−1) ;  r(1) r(0) . . . r(p−2) ;  . . . ;  r(p−1) r(p−2) . . . r(0) ] · [ a1 a2 . . . ap ]^T
Toeplitz matrix (equal antidiagonals) → can use the Durbin recursion to solve
(Solve full φ(j, k) via Cholesky)
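A minimal sketch (not from the lecture) of the autocorrelation method above, using scipy.linalg.solve_toeplitz to solve the Toeplitz normal equations; the AR(2) test signal, frame position, and model order are made-up illustrations.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, p=12):
    """Autocorrelation-method LPC: solve the Toeplitz system R a = r."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # r_ss(0), r_ss(1), ...
    a = solve_toeplitz(r[:p], r[1:p + 1])      # Levinson-type Toeplitz solver
    return a                                    # s[n] ~ sum_k a[k-1] * s[n-k]

# Synthetic test: an AR(2) resonance (pole radius 0.9) driven by white noise
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
s = np.zeros_like(e)
true_a = [1.6, -0.81]
for n in range(2, len(s)):
    s[n] = true_a[0] * s[n - 1] + true_a[1] * s[n - 2] + e[n]

frame = s[1000:1400] * np.hamming(400)
print(lpc(frame, p=2))          # estimates should come out close to [1.6, -0.81]
```

The windowing means the estimates are slightly biased, but for a well-behaved frame the recovered coefficients sit close to the true AR parameters, which is the sense in which LPC "finds the resonances".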
E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  23 / 46   Source: http://www.doksinet  LPC illustration 0.1 0  windowed original  -0.1 -0.2  LPC residual -0.3  0  dB  50  100  150  200  250  300  350  400  time / samp  original spectrum  0  LPC spectrum -20 -40 -60  residual spectrum 0  1000  2000  3000  4000  5000  6000  7000 freq / Hz  Actual poles  z-plane E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  24 / 46   Source: http://www.doksinet  Interpreting LPC  Picking out resonances I  if signal really was source + all-pole resonances, LPC should find the resonances  Least-squares fit to spectrum minimizing e 2 [n] in time domain is the same as minimizing E 2 (e jω ) by Parseval  close fit to spectral peaks; valleys don’t matter I  Removing smooth variation in spectrum 1 A(z) is a low-order approximation to S(z) S(z) 1 I E (z) = A(z) I  I  hence, residual E (z) = A(z)S(z) is a ‘flat’ version of S  Signal whitening: white noise
(independent x[n]s) has flat spectrum  whitening removes temporal correlation I  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  25 / 46   Source: http://www.doksinet  Alternative LPC representations  Many alternate p-dimensional representations coefficients {a  P Qj } 1 − λj z −j = 1 − aj z −1 roots {λj }: I line spectrum frequencies.   I reflection coefficients {kj } from lattice form    I I  I  tube model log area ratios gj = log  1−kj 1+kj  Choice depends on: mathematical convenience / complexity quantization sensitivity I ease of guaranteeing stability I what is made explicit I distributions as statistics I I  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  26 / 46   Source: http://www.doksinet  LPC applications  Analysis-synthesis (coding, transmission) (z) S(z) = EA(z) hence can reconstruct by filtering e[n] with {aj }s I whitened, decorrelated, minimized e[n]s are easy to quantize .   or can model e[n] eg as simple pulse
train I  Recognition / classification I I  LPC fit responds to spectral peaks (formants) can use for recognition (convert to cepstra?)  Modification I I  separating source and filter supports cross-synthesis pole / resonance model supports ‘warping’ e.g male  female  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  27 / 46   Source: http://www.doksinet  Aside: Formant tracking Formants carry (most?) linguistic information Why not classify  speech recognition? e.g local maxima in cepstral-liftered spectrum pole frequencies in LPC fit  But: recognition needs to work in all circumstances I  formants can be obscured or undefined Original (mpgr1 sx419)  freq / Hz  4000 3000 2000 1000 0  Noise-excited LPC resynthesis with pole freqs  freq / Hz  4000 3000 2000 1000 0  0  0.2  0.4  0.6  0.8  1  1.2  1.4  time / s   need more graceful, robust parameters E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  28 / 46   Source: http://www.doksinet  Outline  1 
Modeling speech signals  2  Spectral and cepstral models  3  Linear predictive models (LPC)  4  Other signal models  5  Speech synthesis  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  29 / 46   Source: http://www.doksinet  Sinusoid modeling Early signal models required low complexity e.g LPC  Advances in hardware open new possibilities.   freq / Hz  NB spectrogram suggests harmonics model 4000 3000 2000 1000 0  0  0.5  1  time / s  1.5  ‘important’ info in 2D surface is set of tracks? harmonic tracks have ∼smooth properties I straightforward resynthesis I I  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  30 / 46   Source: http://www.doksinet  Sine wave models Model sound as sum of AM/FM sinusoids N[n]  s[n] =  X  Ak [n] cos(n ωk [n] + φk [n])  k=1 I I  Ak , ωk , φk piecewise linear or constant can enforce harmonicity: ωk = kω0  Extract parameters directly from STFT frames: time  mag  freq  I I  find local maxima of |S[k, n]| along
frequency; track birth/death and correspondence  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  31 / 46   Source: http://www.doksinet  Finding sinusoid peaks Look for local maxima along each DFT frame, i.e. |S[k − 1, n]| < |S[k, n]| > |S[k + 1, n]|
Want the exact frequency of the implied sinusoid; the DFT is normally quantized quite coarsely, e.g. 4000 Hz / 256 bands = 15.6 Hz/band
quadratic fit to 3 points → interpolated frequency and magnitude from the spectral samples
may also need interpolated unwrapped phase
Or, use the differential of phase along time (pvoc):
ω = (a ḃ − b ȧ) / (a² + b²)   where S[k, n] = a + jb  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  32 / 46
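Here is a minimal NumPy sketch (not from the lecture) of the quadratic three-point fit mentioned above, applied to a Hann-windowed test tone; the window choice, FFT size, and test frequency are arbitrary.

```python
import numpy as np

def interp_peak(mag, k):
    """Quadratic fit to |S[k-1]|, |S[k]|, |S[k+1]| (in dB) around a local maximum k.
    Returns (fractional bin offset, interpolated peak level in dB)."""
    a, b, c = 20 * np.log10(mag[k - 1:k + 2])
    p = 0.5 * (a - c) / (a - 2 * b + c)        # offset in bins, |p| <= 0.5
    return p, b - 0.25 * (a - c) * p

# Test: a sinusoid that falls between FFT bins
fs, N, f0 = 8000, 256, 1003.7
x = np.cos(2 * np.pi * f0 * np.arange(N) / fs) * np.hanning(N)
S = np.abs(np.fft.rfft(x))
k = int(np.argmax(S))                          # coarse peak: nearest bin
p, _ = interp_peak(S, k)
print("coarse:", k * fs / N, "Hz   interpolated:", (k + p) * fs / N, "Hz")
```

The coarse estimate is quantized to the 31.25 Hz bin spacing, while the parabolic interpolation recovers the tone frequency to within a small fraction of a bin, which is what makes per-frame sinusoid tracking workable.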
Source: http://www.doksinet  Sinewave modeling applications Modification (interpolation) and synthesis: connecting arbitrary ω and φ requires cubic phase interpolation (because ω = φ̇)
Types of modification: time and frequency scale modification . . . with or without changing the formant envelope; concatenation / smoothing boundaries; phase realignment (for crest reduction)
Non-harmonic signals? OK-ish (figure: spectrogram, 0–4000 Hz over about 1.5 s)  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  33 / 46   Source: http://www.doksinet  Harmonics + noise model Motivation to improve the sinusoid model: problems with analysis of real (noisy) signals, problems with synthesis quality (esp. noise), perceptual suspicions
Model:
s[n] = Σ_{k=1}^{N[n]} Ak[n] cos(n k ω0[n])  +  e[n] (hn[n] ∗ b[n])
(first term: Harmonics; second term: Noise)
sinusoids are forced to be harmonic; the remainder is filtered and time-shaped noise
‘Break frequency’ Fm[n] between H and N (figure: spectrum with harmonics below the harmonicity limit Fm[n] and noise above)  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  34 / 46   Source: http://www.doksinet  HNM analysis and synthesis Dynamically adjust Fm[n] based on
‘harmonic test’: 4000 3000 2000 1000 0  0  0.5  1  time / s 1.5  Noise has envelopes in time e[n] and frequency Hn  1000 0  0  2000  dB Hn[k] 40  freq / Hz 3000  e[n] 0  0.01  0.02  0.03 time / s  reconstruct bursts / synchronize to pitch pulses E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  35 / 46   Source: http://www.doksinet  Outline  1  Modeling speech signals  2  Spectral and cepstral models  3  Linear predictive models (LPC)  4  Other signal models  5  Speech synthesis  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  36 / 46   Source: http://www.doksinet  Speech synthesis  One thing you can do with models Synthesis easier than recognition? I listeners do the work .   but listeners are very critical  Overview of synthesis text  Phoneme generation Text normalization  Synthesis algorithm  speech  Prosody generation  normalization disambiguates text (abbreviations) phonetic realization from pronunciation dictionary I prosodic synthesis by
rule (timing, pitch contour) .   all control waveform generation I I  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  37 / 46   Source: http://www.doksinet  Source-filter synthesis Flexibility of source-filter model is ideal for speech synthesis  Pitch info Voiced/ unvoiced  t  Glottal pulse source  Phoneme info th ax k ae t  t  Vocal tract filter  + t  Speech  Noise source t  Excitation source issues voiced / unvoiced / mixture ([th] etc.) pitch cycles of voiced segments glottal pulse shape  voice quality?  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  38 / 46   Source: http://www.doksinet  Vocal tract modeling freq  Simplest idea: store a single VT model for each phoneme  time  th  ax  k  ae  t  but discontinuities are very unnatural freq  Improve by smoothing between templates  time  th  ax  k  ae  t  trick is finding the right domain E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  39 / 46   Source: http://www.doksinet 
Cepstrum-based synthesis Low-n cepstrum is compact model of target spectrum Can invert to get actual VT IR waveforms: cn = idft(log |dft(x[n])|) ⇒ h[n] = idft(exp(dft(cn ))) All-zero (FIR) VT response  can pre-convolve with glottal pulses Glottal pulse inventory  Pitch pulse times (from pitch contour)  ee  ae time  ah  I  cross-fading between templates OK  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  40 / 46   Source: http://www.doksinet  LPC-based synthesis Very compact representation of target spectra I  3 or 4 pole pairs per template  Low-order IIR filter  very efficient synthesis How to interpolate? cannot just interpolate aj in a running filter but lattice filter has better-behaved interpolation +  s[n] a1  z-1  a2  z-1  a3  z-1  e[n] +  +  + kp-1  z-1  -  +  e[n]  z-1  -  +  I I  s[n] -1 k0 z -1  What to use for excitation residual from original analysis reconstructed periodic pulse train I parametrized residual resynthesis I I  E6820 (Ellis & Mandel)
 L5: Speech modeling  February 19, 2009  41 / 46   Source: http://www.doksinet  Diphone synethsis Problems in phone-concatenation synthesis phonemes are context-dependent coarticulation is complex I transitions are critical to perception I I   store transitions instead of just phonemes e  hε z  w  ^  Phones  tcl c θ  I n I  z  I d  ay  m  Diphone segments  I I  ∼ 40 phones ⇒ ∼ 800 diphones or even more context if have larger database  How to splice diphones together? I I  TD-PSOLA: align pitch pulses and cross fade MBROLA: normalized multiband  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  42 / 46   Source: http://www.doksinet  HNM synthesis  High quality resynthesis of real diphone units + parametric representation for modification I I  pitch, timing modifications removal of discontinuities at boundaries  Synthesis procedure linguistic processing gives phones, pitch, timing database search gives best-matching units I use HNM to fine-tune pitch and timing
I cross-fade Ak and ω0 parameters at boundaries I I  freq  time  Careful preparation of database is key I I  sine models allow phase alignment of all units larger database improves unit match  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  43 / 46   Source: http://www.doksinet  Generating prosody The real factor limiting speech synthesis? Waveform synthesizers have inputs for intensity (stress) duration (phrasing) I fundamental frequency (pitch) I I  Curves produced by superposition of (many) inferred linguistic rules I  phrase final lengthening, unstressed shortening, .    Or learn rules from transcribed elements E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  44 / 46   Source: http://www.doksinet  Summary  Range of models I I  spectral, cepstral LPC, sinusoid, HNM  Range of applications general spectral shape (filterbank)  ASR precise description (LPC + residual)  coding I pitch, time modification (HNM)  synthesis  I I  Issues performance vs
computational complexity generality vs accuracy I representation size vs quality I I  Parting thought not all parameters are created equal.    E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  45 / 46   Source: http://www.doksinet  References  Alan V. Oppenheim Speech analysis-synthesis system based on homomorphic filtering. The Journal of the Acoustical Society of America, 45(1):309–309, 1969 J. Makhoul Linear prediction: A tutorial review Proceedings of the IEEE, 63(4): 561–580, 1975. Bishnu S. Atal and Suzanne L Hanauer Speech analysis and synthesis by linear prediction of the speech wave. The Journal of the Acoustical Society of America, 50(2B):637–655, 1971. J.E Markel and AH Gray Linear Prediction of Speech Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1982 R. McAulay and T Quatieri Speech analysis/synthesis based on a sinusoidal representation. Acoustics, Speech, and Signal Processing [see also IEEE Transactions on Signal Processing], IEEE
Transactions on, 34(4):744–754, 1986. Wael Hamza, Ellen Eide, Raimo Bakis, Michael Picheny, and John Pitrelli. The IBM expressive speech synthesis system. In INTERSPEECH, pages 2577–2580, October 2004.  E6820 (Ellis & Mandel)  L5: Speech modeling  February 19, 2009  46 / 46   Source: http://www.doksinet  EE E6820: Speech & Audio Processing & Recognition  Lecture 6: Nonspeech and Music Dan Ellis <dpwe@ee.columbiaedu> Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820  February 26, 2009 1 2 3 4  Music & nonspeech Environmental Sounds Music Synthesis Techniques Sinewave Synthesis  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  1 / 30   Source: http://www.doksinet  Outline  1  Music & nonspeech  2  Environmental Sounds  3  Music Synthesis Techniques  4  Sinewave Synthesis  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  2 / 30  
Source: http://www.doksinet  Music & nonspeech What is ‘nonspeech’ ? I I  according to research effort: a little music in the world: most everything  high  Information content  speech  low  music  animal sounds  wind & water natural  contact/ collision  Origin  machines & engines  man-made  attributes? E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  3 / 30   Source: http://www.doksinet  Sound attributes  Attributes suggest model parameters What do we notice about ‘general’ sound? psychophysics: pitch, loudness, ‘timbre’ bright/dull; sharp/soft; grating/soothing I sound is not ‘abstract’: tendency is to describe by source-events I I  Ecological perspective what matters about sound is ‘what happened’  our percepts express this more-or-less directly I  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  4 / 30   Source: http://www.doksinet  Motivations for modeling Describe/classify I  cast sound into model
because want to use the resulting parameters  Store/transmit I  model implicitly exploits limited structure of signal  Resynthesize/modify I  model separates out interesting parameters  Sound  E6820 (Ellis & Mandel)  Model parameter space  L6: Nonspeech and Music  February 26, 2009  5 / 30   Source: http://www.doksinet  Analysis and synthesis Analysis is the converse of synthesis: Model / representation  Synthesis  Analysis Sound  Can exist apart: I I  analysis for classification synthesis of artificial sounds  Often used together: encoding/decoding of compressed formats resynthesis based on analyses I analysis-by-synthesis I I  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  6 / 30   Source: http://www.doksinet  Outline  1  Music & nonspeech  2  Environmental Sounds  3  Music Synthesis Techniques  4  Sinewave Synthesis  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  7 / 30   Source: http://www.doksinet  Environmental Sounds
Where sound comes from: mechanical interactions contact / collisions rubbing / scraping I ringing / vibrating I I  Interest in environmental sounds carry information about events around us . including indirect hints I need to create them in virtual environments . including soundtracks I  Approaches to synthesis I I  recording / sampling synthesis algorithms  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  8 / 30   Source: http://www.doksinet  Collision sounds  Factors influencing: colliding bodies: size, material, damping local properties at contact point (hardness) I energy of collision I I  Source-filter model ”source” = excitation of collision event (energy, local properties at contact) I ”filter” = resonance and radiation of energy (body properties) I  Variety of strike/scraping sounds resonant freqs ∼ size/shape damping ∼ material I HF content in excitation/strike ∼ mallet, force I  I  I  (from Gaver, 1993)  E6820 (Ellis & Mandel)  L6:
Nonspeech and Music  February 26, 2009  9 / 30   Source: http://www.doksinet  Sound textures  What do we hear in: I I  a city street a symphony orchestra  How do we distinguish: waterfall rainfall I applause I static I I  Rain01 5000  4000  4000 freq / Hz  freq / Hz  Applause04 5000  3000  3000  2000  2000  1000  1000  0  0  1  2  3  4  0 time / s  0  1  2  3  4  time / s  Levels of ecological description. E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  10 / 30   Source: http://www.doksinet  Sound texture modeling (Athineos) Model broad spectral structure with LPC I  could just resynthesize with noise  Model fine temporal structure in residual with linear prediction in time domain y[n]  e[n] E[k] TD-LP FD-LP DCT y[n] = E[k] = ΣibiE[k-i] Σ a y[n-i] Whitened Residual i i Sound residual  Per-frame spectral parameters I I  spectrum  Per-frame temporal envelope parameters  precise dual of LPC in frequency ‘poles’ model temporal events Temporal envelopes (40
poles, 256 ms)
(figure: whitened residual waveform and fitted temporal envelopes, amplitude vs. time, 0–0.25 s)

Allows modification / synthesis?

E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  11 / 30

Source: http://www.doksinet

Outline

1  Music & nonspeech
2  Environmental Sounds
3  Music Synthesis Techniques
4  Sinewave Synthesis

E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  12 / 30

Source: http://www.doksinet

Music synthesis techniques

(figure: opening ('Allegro') of no. 44, the 'Hallelujah!' chorus from G. F. Handel's Messiah, via http://www.score-on-line.com)

What is music?
I could be anything → flexible synthesis needed!

Key elements of conventional music
I note-events (time, pitch, accent level)
I melody, harmony, rhythm
I patterns of repetition & variation

Synthesis framework:
I instruments: common framework for many notes
I score: sequence of (time, pitch, level) note events
(see the short sketch after the next slide)

E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  13 / 30

Source: http://www.doksinet

The nature of musical instrument notes

Characterized by instrument (register), note, loudness/emphasis, articulation.

(figure: spectrograms of single notes on piano, violin, clarinet, and trumpet; frequency 0–4000 Hz vs. time 0–4 s)

Distinguish how?
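To make the 'score + instrument' framework of the previous slide concrete, here is a minimal, hypothetical sketch (my own illustration, not from the course materials): a toy harmonic 'instrument' with an exponential decay renders each (time, pitch, level) note event, and the same instrument is reused for every note in a short score.

```python
# Toy note-event renderer: an illustrative sketch, not a production synthesizer.
# Assumptions (not from the slides): MIDI-style pitch numbers, an exponentially
# decaying harmonic "instrument", and a score given as (onset_s, pitch, level).
import numpy as np

fs = 16000  # sample rate, Hz

def midi_to_hz(pitch):
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)

def render_note(pitch, level, dur=0.5, n_harm=8):
    """One note from a generic harmonic 'instrument' with an exponential decay."""
    t = np.arange(int(dur * fs)) / fs
    f0 = midi_to_hz(pitch)
    env = level * np.exp(-4.0 * t)              # simple amplitude envelope
    y = np.zeros_like(t)
    for k in range(1, n_harm + 1):
        if k * f0 < fs / 2:                     # stay below Nyquist
            y += (1.0 / k) * np.sin(2 * np.pi * k * f0 * t)
    return env * y

def render_score(score, total_dur=3.0):
    """Mix note events (onset in seconds, MIDI pitch, level 0..1) into one signal."""
    out = np.zeros(int(total_dur * fs))
    for onset, pitch, level in score:
        note = render_note(pitch, level)
        start = int(onset * fs)
        out[start:start + len(note)] += note[:len(out) - start]
    return out

# "Score": a rising arpeggio of (time, pitch, level) events
score = [(0.0, 60, 0.8), (0.5, 64, 0.7), (1.0, 67, 0.7), (1.5, 72, 0.9)]
x = render_score(score)
print(x.shape, float(np.max(np.abs(x))))
```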
E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  14 / 30   Source: http://www.doksinet  Development of music synthesis Goals of music synthesis: I I  generate realistic / pleasant new notes control / explore timbre (quality)  Earliest computer systems in 1960s (voice synthesis, algorithmic) Pure synthesis approaches: 1970s: Analog synths 1980s: FM (Stanford/Yamaha) I 1990s: Physical modeling, hybrids I I  Analysis-synthesis methods: sampling / wavetables sinusoid modeling I harmonics + noise (+ transients) I I  others?  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  15 / 30   Source: http://www.doksinet  Analog synthesis  The minimum to make an ‘interesting’ sound Envelope Trigger Pitch  t Vibrato  Oscillator t  Filter  +  +  + Cutoff freq  f  Sound  Gain  Elements: harmonics-rich oscillators time-varying filters I time-varying envelope I modulation: low frequency + envelope-based I I  Result: I  time-varying spectrum, independent
pitch

E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  16 / 30

Source: http://www.doksinet

FM synthesis

Fast frequency modulation ⇒ sidebands:
cos(ωc t + β sin(ωm t)) = Σn=−∞…∞ Jn(β) cos((ωc + n ωm) t)
I a harmonic series if ωc = r · ωm

Jn(β) is a Bessel function:
(figure: J0 … J4 vs. modulation index β; Jn(β) ≈ 0 for β < n − 2)

Complex harmonic spectra by varying β
(figure: spectrogram of an FM tone as β increases, 0–4000 Hz over 0–0.8 s)
I ωc = 2000 Hz, ωm = 200 Hz
I what use?
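As a quick illustration of the sideband relation above, here is a minimal sketch (my own assumed parameters, not the course's code): it synthesizes an FM tone with ωc = 2000 Hz, ωm = 200 Hz and a ramped index β, then checks that the significant spectral components fall on multiples of 200 Hz.

```python
# Minimal FM synthesis sketch: x(t) = cos(2*pi*fc*t + beta(t) * sin(2*pi*fm*t)).
# With fc = r * fm the sidebands at fc + n*fm form a harmonic series; roughly
# the |n| up to about beta + 2 are significant (Jn(beta) ~ 0 for beta < n - 2).
import numpy as np

fs = 16000          # sample rate, Hz
dur = 0.8           # seconds
fc, fm = 2000.0, 200.0

t = np.arange(int(dur * fs)) / fs
beta = 8.0 * t / dur              # modulation index ramped 0 -> 8
x = np.cos(2 * np.pi * fc * t + beta * np.sin(2 * np.pi * fm * t))

# Inspect the last 100 ms, where beta is largest
seg = x[-int(0.1 * fs):] * np.hanning(int(0.1 * fs))
spec = np.abs(np.fft.rfft(seg))
freqs = np.fft.rfftfreq(len(seg), 1 / fs)
strong = freqs[spec > 0.1 * spec.max()]
print("significant components near:", np.round(strong[:12]))  # multiples of 200 Hz
```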
‘sustain’ section 0.2  0.2 0.1 0.174 0176 0  0.1  0.204 0206  0  -0.1  -0.1  -0.2 0  -0.2 0.1  0.2  0.3  0  time / s  0.1  0.2  Level: cross-fade different examples Soft  0.1  0.2  mix  0.2  0  -0.1 -0.2  I I  Loud  0.1  0  veloc 0  0.05  0.1  0.15 time / s  -0.1 -0.2 0  0.05  0.1  0.15  time / s  need to ‘line up’ source samples good & bad?  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  18 / 30   Source: http://www.doksinet  Outline  1  Music & nonspeech  2  Environmental Sounds  3  Music Synthesis Techniques  4  Sinewave Synthesis  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  19 / 30   Source: http://www.doksinet  Sinewave synthesis If patterns of harmonics are what matter, why notPgenerate them all explicitly: s[n] = k Ak [n] cos(k · ω0 [n] · n) I  particularly powerful model for pitched signals  Analysis (as with speech): find peaks in STFT |S[ω, n]| & track or track fundamental ω0 (harmonics /
autocorrelation) & sample STFT at k · ω0  set of Ak [n] to duplicate tone: I  freq / Hz  I  8000 6000  mag  4000  1  2000 0  2  0 5000  0  0.05  0.1  0.15  0.2  time / s  freq / Hz  0  0  0.1  0.2  time / s  Synthesis via bank of oscillators E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  20 / 30   Source: http://www.doksinet  Steps to sinewave modeling - 1 The underlying STFT: X [k, n0 ] =  N−1 X   x[n + n0 ] · w [n] · exp −j  n=0 I I  2πkn N    what value for N (FFT length & window size)? what value for H (hop size: n0 = r · H, r = 0, 1, 2 .   )?  STFT window length determines freq. resolution: Xw (e jω ) = X (e jω ) ∗ W (e jω ) Choose N long enough to resolve harmonics  2-3x longest (lowest) fundamental period I e.g 30-60 ms = 480-960 samples @ 16 kHz I choose H ≤ N/2  N too long  lost time resolution I  limits sinusoid amplitude rate of change  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  21 / 30  
Source: http://www.doksinet  Steps to sinewave modeling - 2 Choose candidate sinusoids at each time by picking peaks in each STFT frame: freq / Hz  8000 6000 4000 2000  level / dB  0 0 20  0.02  0.04  0.06  0.08  0.1  0.12  3000  4000  0.14  time / s  0.16  0.18  6000  7000 freq / Hz  0 -20 -40 -60 0  1000  2000  5000  Quadratic fit for peak: y ab2/4  0 -10  0 y = ax(x-b) x b/2  -20 400  600  phase / rad  level / dB  20 10  -5  -10  800 freq / Hz  400  600  800 freq / Hz  + linear interpolation of unwrapped phase E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  22 / 30   Source: http://www.doksinet  Steps to sinewave modeling - 3 Which peaks to pick? Want ‘true’ sinusoids, not noise fluctuations ‘prominence’ threshold above smoothed spectrum  I  level / dB  20 0  -20 -40 -60  0  1000  2000  3000  4000  5000  6000  7000  freq / Hz  Sinusoids exhibit stability. of amplitude in time of phase derivative in time  compare with adjacent time frames to test? I I 
E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  23 / 30   Source: http://www.doksinet  Steps to sinewave modeling - 4  freq  ‘Grow’ tracks by appending newly-found peaks to existing tracks: birth  existing tracks  death I  new time peaks  ambiguous assignments possible  Unclaimed new peak I I  ‘birth’ of new track backtrack to find earliest trace?  No continuation peak for existing track I I  ‘death’ of track or: reduce peak threshold for hysteresis  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  24 / 30   Source: http://www.doksinet  Resynthesis of sinewave models After analysis, each track defines contours in frequency, amplitude fk [n], Ak [n] (+ phase?) use to drive a sinewave oscillators & sum up  I  3  level  3  2  2  Ak[n]·cos(2πfk[n]·t) 1  Ak[n]  1  0 -1  600  freq  / Hz  0 700  -2  fk[n]  -3 0  0.05 01 015 02 time / s  500 0  n  0.05  0.1 015  time / s  0.2  6000  freq / Hz  freq / Hz  ‘Regularize’ to
exactly harmonic fk [n] = k · f0 [n] 4000 2000 0 0  E6820 (Ellis & Mandel)  0.05  0.1 015  0.2 time / s  700 650 600 550 0  L6: Nonspeech and Music  0.05  0.1 015  0.2 time / s  February 26, 2009  25 / 30   Source: http://www.doksinet  Modification in sinewave resynthesis Change duration by warping timebase I  may want to keep onset unwarped freq / Hz  5000 4000 3000 2000 1000 0 0  0.05  0.1  0.15  0.2  0.25  0.3  0.35  0.4  0.45  0.5  time / s  Change pitch by scaling frequencies either stretching or resampling envelope 40  level / dB  level / dB  I  30 20 10 0 0  1000 2000 3000 4000  40 30 20 10 0 0  1000 2000 3000 4000  freq / Hz  freq / Hz  Change timbre by interpolating parameters E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  26 / 30   Source: http://www.doksinet  Sinusoids + residual Only ‘prominent peaks’ became tracks remainder of spectral energy was noisy?  model residual energy with noise I  How to obtain ‘non-harmonic’ spectrum? I I 
zero-out spectrum near extracted peaks?
I or: resynthesize (exactly) & subtract waveforms
es[n] = s[n] − Σk Ak[n] cos(2πn · fk[n])
. . . must preserve phase!

(figure: magnitude spectra of the original, the extracted sinusoids, the residual, and an LPC fit to the residual, 0–7000 Hz)

Can model residual signal with LPC
→ flexible representation of noisy residual

E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  27 / 30

Source: http://www.doksinet

Sinusoids + noise + transients

Sound represented as sinusoids and noise:
s[n] = Σk Ak[n] cos(2πn · fk[n]) + hn[n] ∗ b[n]
I parameters are Ak[n], fk[n], hn[n]

(figure: spectrograms of the full sound, the extracted sinusoids, and the residual es[n])

Separate out abrupt transients in residual?
es[n] = Σk tk[n] + hn[n] ∗ b′[n]
I more specific → more flexible

E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  28 / 30
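To make the resynthesize-and-subtract idea concrete, here is a minimal sketch (an illustration under simplifying assumptions, not the exact method of the references): each track's frequency and amplitude contours are interpolated to the sample rate, phase is obtained by integrating instantaneous frequency, the oscillator outputs are summed, and the residual is the original minus that reconstruction. For real signals the measured phases would have to be tracked (the "must preserve phase!" point); this toy example sidesteps that by construction.

```python
# Oscillator-bank resynthesis from sinusoid tracks (fk[n], Ak[n]) plus residual.
# Tracks are given at the frame rate; interpolate to audio rate, integrate
# frequency to get phase, sum the oscillators, and subtract from the original.
import numpy as np

def resynth(tracks, n_samples, fs, hop):
    """tracks: list of (frame_freqs_hz, frame_amps) sampled every `hop` samples."""
    t_frames = np.arange(len(tracks[0][0])) * hop
    t_audio = np.arange(n_samples)
    y = np.zeros(n_samples)
    for freqs, amps in tracks:
        f_inst = np.interp(t_audio, t_frames, freqs)          # Hz, per sample
        a_inst = np.interp(t_audio, t_frames, amps)
        # running phase = 2*pi/fs * cumulative sum of instantaneous frequency
        phase = 2 * np.pi * np.cumsum(np.concatenate(([0.0], f_inst[:-1]))) / fs
        y += a_inst * np.cos(phase)
    return y

fs, hop, n_frames = 16000, 256, 60
n = n_frames * hop
t = np.arange(n) / fs
x = 0.8 * np.cos(2 * np.pi * 300 * t) + 0.4 * np.cos(2 * np.pi * 600 * t)

# Pretend these constant tracks came from the peak-picking / tracking stage
tracks = [(np.full(n_frames, 300.0), np.full(n_frames, 0.8)),
          (np.full(n_frames, 600.0), np.full(n_frames, 0.4))]
y = resynth(tracks, n, fs, hop)
residual = x - y        # es[n]; would be modeled with LPC / noise in practice
print("signal rms:", round(float(np.sqrt(np.mean(x**2))), 4),
      "residual rms:", round(float(np.sqrt(np.mean(residual**2))), 6))
```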
Source: http://www.doksinet  Summary  ‘Nonspeech audio’ I I  i.e sound in general characteristics: ecological  Music synthesis control of pitch, duration, loudness, articulation evolution of techniques I sinusoids + noise + transients I I  Music analysis. and beyond?  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  29 / 30   Source: http://www.doksinet  References  W.W Gaver Synthesizing auditory icons In Proc Conference on Human factors in computing systems INTERCHI-93, pages 228–235. Addison-Wesley, 1993 M. Athineos and D P W Ellis Autoregressive modeling of temporal envelopes IEEE Tr. Signal Proc, 15(11):5237–5245, 2007 URL http://www.eecolumbiaedu/~dpwe/pubs/AthinE07-fdlppdf X. Serra and J Smith III Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic Plus Stochastic Decomposition. Computer Music Journal, 14(4):12–24, 1990. T. S Verma and T H Y Meng An analysis/synthesis tool for transient signals that allows
aflexible sines+ transients+ noise model for audio. In Proc ICASSP, pages VI–3573–3576, Seattle, 1998.  E6820 (Ellis & Mandel)  L6: Nonspeech and Music  February 26, 2009  30 / 30   Source: http://www.doksinet  EE E6820: Speech & Audio Processing & Recognition  Lecture 7: Audio compression and coding Dan Ellis <dpwe@ee.columbiaedu> Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820  March 31, 2009 1  Information, Compression & Quantization  2  Speech coding  3  Wide-Bandwidth Audio Coding  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  1 / 37   Source: http://www.doksinet  Outline  1  Information, Compression & Quantization  2  Speech coding  3  Wide-Bandwidth Audio Coding  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  2 / 37   Source: http://www.doksinet  Compression & Quantization How big is audio data? What
is the bitrate?
Fs frames/second (e.g. 8000 or 44100)
× C samples/frame (e.g. 1 or 2 channels)
× B bits/sample (e.g. 8 or 16)
→ Fs · C · B bits/second (e.g. 64 Kbps or 1.4 Mbps)

(figure: bits/frame vs. frames/sec, marking Telephony 64 Kbps, Mobile ~13 Kbps, CD Audio 1.4 Mbps)

How to reduce?
I lower sampling rate → less bandwidth (muffled)
I lower channel count → no stereo image
I lower sample size → quantization noise

Or: use data compression

E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  3 / 37

Source: http://www.doksinet

Data compression: Redundancy vs. Irrelevance

Two main principles in compression:
I remove redundant information
I remove irrelevant information

Redundant information is implicit in remainder
I e.g. signal bandlimited to 20 kHz, but sampled at 80 kHz → can recover every other sample by interpolation:
(figure: in a bandlimited signal, the red samples can be exactly recovered by interpolating the blue samples)

Irrelevant info is unique
but unnecessary I  e.g recording a microphone signal at 80 kHz sampling rate  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  4 / 37   Source: http://www.doksinet  Irrelevant data in audio coding  For coding of audio signals, irrelevant means perceptually insignificant I  an empirical property  Compact Disc standard is adequate: I I  44 kHz sampling for 20 kHz bandwidth 16 bit linear samples for ∼ 96 dB peak SNR  Reflect limits of human sensitivity: 20 kHz bandwidth, 100 dB intensity sinusoid phase, detail of noise structure I dynamic properties - hard to characterize I I  Problem: separating salient & irrelevant  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  5 / 37   Source: http://www.doksinet  Quantization  Represent waveform with discrete levels 5 4 3 2 1 Q[x] 0 -1 -2 -3 -4 -5 -5 -4 -3 -2 -1 0 1 2 3 4 5  6  x[n]  Q[x[n]]  4 2 0 -2  error e[n] = x[n] - Q[x[n]] 0  5  10  15  20  25  30  35  40  x  Equivalent to
adding error e[n]: x[n] = Q[x[n]] + e[n]
I e[n] ∼ uncorrelated, uniform white noise
I p(e[n]) uniform on (−D/2, +D/2) → variance σe² = D²/12

E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  6 / 37

Source: http://www.doksinet

Quantization noise (Q-noise)

Uncorrelated noise has flat spectrum

With a B bit word and a quantization step D
I max signal range (x) = −(2^(B−1)) · D . . . (2^(B−1) − 1) · D
I quantization noise (e) = −D/2 . . . D/2

Best signal-to-noise ratio (power)
SNR = E[x²]/E[e²] = (2^B)²
or, in dB, 20 · log10(2) · B ≈ 6 · B dB

(figure: spectrum of a signal quantized at 7 bits, showing the flat quantization noise floor, 0–7000 Hz)

E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  7 / 37

Source: http://www.doksinet

Redundant information

Redundancy removal is lossless

Signal correlation implies redundant information
I e.g. if x[n] = x[n − 1] + v[n]
I x[n] has a greater amplitude range → uses more bits than v
[n] I sending v [n] = x[n] − x[n − 1] can reduce amplitude, hence bitrate  I  x[n] - x[n-1]  I  but: ‘white noise’ sequence has no redundancy .  Problem: separating unique & redundant  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  8 / 37   Source: http://www.doksinet  Optimal coding  Shannon information: An unlikely occurrence is more ‘informative’ p(A) = 0.5  p(B) = 0.5  p(A) = 0.9  p(B) = 0.1  ABBBBAAABBABBABBABB  AAAAABBAAAAAABAAAAB  A, B equiprobable  A is expected; B is ‘big news’  Information in bits I = − log2 (probability ) I  clearly works when all possibilities equiprobable  Optimal bitrate  av.token length = entropy H = E [I ] I  . equal-length tokens are equally likely  How to achieve this? transform signal to have uniform pdf nonuniform quantization for equiprobable tokens I variable-length tokens  Huffman coding I I  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  9 / 37   Source:
http://www.doksinet  Quantization for optimum bitrate Quantization should reflect pdf of signal: p(x < x0)  p(x = x0)  x' 1.0 0.8 0.6 0.4 0.2  -0.02 -0015 -001 -0005  I  0  0.005 001  0.015 002  0.025  x  0  cumulative pdf p(x < x0 ) maps to uniform x 0  Or, codeword length per Shannon log2 (p(x)): -0.02 -0.01 0 0.01 0.02 0.03  I  p(x)  Shannon info / bits Codewords 111111111xx 111101xx 111100xx 101xx 100xx 0xx 110xx 1110xx 111110xx 1111110xx 111111100xx 111111101xx 111111110xx 0  2  4  6  8  Huffman coding: tree-structured decoder  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  10 / 37   Source: http://www.doksinet  Vector Quantization Quantize mutually dependent values in joint space: 3  x1  2 1 0 -1 -2  x2 -6  -4  -2  0  2  4  May help even if values are largely independent I  larger space x1,x2 is easier for Huffman  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  11 / 37   Source: http://www.doksinet 
Compression & Representation  As always, success depends on representation Appropriate domain may be ‘naturally’ bandlimited I I  e.g vocal-tract-shape coefficients can reduce sampling rate without data loss  In right domain, irrelevance is easier to ‘get at’ I  e.g STFT to separate magnitude and phase  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  12 / 37   Source: http://www.doksinet  Aside: Coding standards Coding only useful if recipient knows the code! Standardization efforts are important Federal Standards: Low bit-rate secure voice: I I  FS1015e: LPC-10 2.4 Kbps FS1016: 4.8 Kbps CELP  ITU G.x series (also Hx for video) I I  G.726 ADPCM G.729 Low delay CELP  MPEG MPEG-Audio layers 1,2,3 (mp3) MPEG 2 Advanced Audio Codec (AAC) I MPEG 4 Synthetic-Natural Hybrid Codec I I  More recent ‘standards’ I I  proprietary: WMA, Skype. Speex .  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  13 / 37   Source:
http://www.doksinet  Outline  1  Information, Compression & Quantization  2  Speech coding  3  Wide-Bandwidth Audio Coding  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  14 / 37   Source: http://www.doksinet  Speech coding  Standard voice channel: I I  analog: 4 kHz slot (∼ 40 dB SNR) digital: 64 Kbps = 8 bit µ-law x 8 kHz  How to compress? Redundant I  signal assumed to be a single voice, not any possible waveform  Irrelevant I  need code only enough for intelligibility, speaker identification (c/w analog channel)  Specifically, source-filter decomposition I  vocal tract & f0 change slowly  Applications: I I  live communications offline storage  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  15 / 37   Source: http://www.doksinet  Channel Vocoder (1940s-1960s) Basic source-filter decomposition I I  filterbank breaks into spectral bands transmit slowly-changing energy in each band Encoder Bandpass filter 1 
Decoder Smoothed energy  Downsample & encode  E1  Bandpass filter 1 Output  Input Bandpass filter N  Smoothed energy  Voicing analysis  I  Downsample & encode  EN  Bandpass filter 1  V/UV Pitch  Pulse generator  Noise source  10-20 bands, perceptually spaced  Downsampling? Excitation? I I  pitch / noise model or: baseband + ‘flattening’.  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  16 / 37   Source: http://www.doksinet  LPC encoding  The classic source-filter model Encoder Filter coefficients {a } Input s[n]  Decoder  i  |1/A(ej!)|  Represent & encode  f  LPC analysis  Represent & encode  Residual e[n]  Excitation generator  ^ e[n]  All-pole filter  H(z) =  t  Output ^ s[n]  1 1 - "aiz-i  Compression gains: I I  filter parameters are ∼slowly changing excitation can be represented many ways 20 ms  Filter parameters Excitation/pitch parameters 5 ms  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009 
17 / 37

Source: http://www.doksinet

Encoding LPC filter parameters

For ‘communications quality’:
I 8 kHz sampling (4 kHz bandwidth)
I ∼10th order LPC (up to 5 pole pairs)
I update every 20-30 ms → 300 - 500 param/s

Representation & quantization
I {ai} - poor distribution, can’t interpolate
I reflection coefficients {ki}: guaranteed stable
I LSPs {fLi} - lovely!
(figure: histograms of ai, ki, and fLi values)

Bit allocation (filter):
I GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps
I FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps

E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  18 / 37

Source: http://www.doksinet

Line Spectral Pairs (LSPs)

LSPs encode LPC filter by a set of frequencies
I excellent for quantization & interpolation

Definition: zeros of
P(z) = A(z) + z^(−p−1) · A(z^(−1))
Q(z) = A(z) − z^(−p−1) · A(z^(−1))
I on the unit circle (z = e^jω, z^(−1) = e^(−jω)), |A(z)| = |A(z^(−1))|
I so P(z), Q(z) have (interleaved) zeros when ∠A(z) = ±∠{z^(−p−1) A(z^(−1))}
I reconstruct P(z), Q(z) = Πi (1 − ζi z^(−1)) etc., then A(z) = [P(z) + Q(z)]/2
(figure: z-plane showing the interleaved unit-circle zeros of P(z) and Q(z) alongside the roots of A(z) and A(z^(−1)))

E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  19 / 37

Source: http://www.doksinet

Encoding LPC excitation

Excitation already better than raw signal:
(figure: original signal and LPC residual waveforms, 1.3–1.7 s)
I save several bits/sample, but still > 32 Kbps

Crude model: U/V flag + pitch period
I ∼ 7 bits / 5 ms = 1.4 Kbps → LPC10 @ 2.4 Kbps
(figure: pitch period values at 16 ms frame boundaries, 1.3–1.75 s)

Band-limit then re-extend (RELP)

E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  20 / 37

Source: http://www.doksinet

Encoding
excitation Something between full-quality residual (32 Kbps) and pitch parameters (1.4 kbps)? ‘Analysis by synthesis’ loop: Input  Filter coefficients A(z)  LPC analysis  s[n]  Excitation code  Synthetic excitation control  c[n]  Pitch-cycle predictor + b·z–NL ‘Perceptual’ weighting W(z)= A(z) A(z/!)  MSError minimization  ^ x[n]  LPC filter 1 A(z)  – +  ^ x[n] *ha[n] - s[n]  Excitation encoding  ‘Perceptual’ weighting discounts peaks: gain / dB  20 10 0  W(z)= A(z) A(z/!)  1/A(z/!)  -10 -20 0  E6820 (Ellis & Mandel)  1/A(z)  500  1000  1500  2000  2500  L7: Audio compression and coding  3000  3500 freq / Hz  March 31, 2009  21 / 37   Source: http://www.doksinet  Multi-Pulse Excitation (MPE-LPC) Stylize excitation as N discrete pulses 15  original LPC residual  10 5 0 -5  mi  multi-pulse excitation  5 0 -5  I  0  20  ti  40  60  80  100  120  time / samps  encode as N × (ti , mi ) pairs  Greedy algorithm places one pulse at a time:   A(z) X (z) Epcp = − S(z)
A(z/γ) A(z) X (z) R(z) = − A(z/γ) A(z/γ) I I  R(z) is residual of target waveform after inverse-filtering cross-correlate hγ and r ∗ hγ , iterate  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  22 / 37   Source: http://www.doksinet  CELP  Represent excitation with codebook e.g 512 sparse excitation vectors Codebook  Search  index  000  001  002  003  004  005  006  007  008  009  010  011  012  013  014  015  0  I  excitation  60 time / samples  linear search for minimum weighted error?  FS1016 4.8 Kbps CELP (30ms frame = 144 bits): 10 LSPs 4x4 + 6x3 bits = 34 bits Pitch delay 4 x 7 bits = 28 bits Pitch gain 4 x 5 bits = 20 bits I 138 bits Codebk index 4 x 9 bits = 36 bits Codebk gain 4 x 5 bits = 20 bits E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  23 / 37   Source: http://www.doksinet  Aside: CELP for nonspeech? CELP is sometimes called a ‘hybrid’ coder: I I  originally based on source-filter voice model
CELP residual is waveform coding (no model) freq / kHz  Original (mrZebra-8k) 4 3 2  CELP does not break with multiple voices etc.  1 0  5 kbps buzz-hiss 4 3 2  I  just does the best it can  1 0  4.8 kbps CELP 4 3 2 1 0  0  1  2  3  4  5  6  7  8  time / sec  LPC filter models vocal tract; also matches auditory system? I  i.e the ‘source-filter’ separation is good for relevance as well as redundancy?  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  24 / 37   Source: http://www.doksinet  Outline  1  Information, Compression & Quantization  2  Speech coding  3  Wide-Bandwidth Audio Coding  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  25 / 37   Source: http://www.doksinet  Wide-Bandwidth Audio Coding Goals: I I  transparent coding i.e no perceptible effect general purpose - handles any signal  Simple approaches (redundancy removal) I  Adaptive Differential PCM (ADPCM) X[n]  + –  Xp[n]  I  +  D[n]  Adaptive quantizer
C[n] = Q[D[n]]  Predictor Xp[n] = F[X'[n-i]]  C[n]  X'[n] + +  + D'[n]  Dequantizer D'[n] = Q-1[C[n]]  as prediction gets smarter, becomes LPC e.g shorten - lossless LPC encoding  Larger compression gains needs irrelevance I  hide quantization noise with psychoacoustic masking  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  26 / 37   Source: http://www.doksinet  Noise shaping  Plain Q-noise sounds like added white noise I I  actually, not all that disturbing . but worst-case for exploiting masking 1  Have Q-noise scale with signal level  mu-law quantization  0.8  0.6  I  I  i.e quantizer step gets larger with amplitude  mu(x)  0.4  minimum distortion for some center-heavy pdf  0.2  x - mu(x)  0 0  0.2  0.4  0.6  0.8  1  Or: put Q-noise around peaks in spectrum key to getting benefit of perceptual masking level / dB  I  20  Signal  0  Linear Q-noise  -20 -40 -60  E6820 (Ellis & Mandel)  0  Transform Q-noise 1500  2000  2500  3000 
L7: Audio compression and coding  3500  freq / Hz  March 31, 2009  27 / 37   Source: http://www.doksinet  Subband coding  Idea: Quantize separately in separate bands Encoder Bandpass f Input  Decoder Downsample M  Analysis filters  Q-1[•]  M  M  f Reconstruction filters  (M channels)  f  I  Quantize Q[•]  Q-1[•]  Q[•]  Output +  M  Q-noise stays within band, gets masked gain / dB  ‘Critical sampling’  1/M of spectrum per band  I  0 -10 -20 -30 -40 -50 0  alias energy  0.1  0.2  0.3  0.4  0.5  0.6  0.7  normalized freq 1  some aliasing inevitable  Trick is to cancel with alias of adjacent band  ‘quadrature-mirror’ filters E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  28 / 37   Source: http://www.doksinet  MPEG-Audio (layer I, II) Basic idea: Subband coding plus psychoacoustic masking model to choose dynamic Q-levels in subbands Scale indices Scale normalize Input  Quantize  32 band polyphase filterbank  Format Bitstream & pack
bitstream Data frame Scale normalize  Psychoacoustic masking model  Control & scales  Quantize  32 chans x 36 samples  Per-band masking margins  24 ms  22 kHz ÷ 32 equal bands = 690 Hz bandwidth 8 / 24 ms frames = 12 / 36 subband samples I fixed bitrates 32 - 256 Kbps/chan (1-6 bits/samp) I scale factors are like LPC envelope? I  I  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  29 / 37   Source: http://www.doksinet  MPEG Psychoacoustic model Based on simultaneous masking experiments Difficulties: noise energy masks ∼10 dB better than tones masking level nonlinear in frequency & intensity I complex, dynamic sounds not well understood  I  pick ‘tonal peaks’ in NB FFT spectrum  I  remaining energy  ‘noisy’ peaks  I  apply nonlinear ‘spreading function’  I  sum all masking & threshold in power domain  E6820 (Ellis & Mandel)  SPL / dB  I  SPL / dB  Procedure  SPL / dB  I  60 50 40 30 20 10 0 -10  60 50 40 30 20 10 0 -10  60 50
40 30 20 10 0 -10  Tonal peaks  Non-tonal energy  Signal spectrum 1  3  5  7  9  11  13  15  17  19  21  23  25  23  25  23  25  Masking spread  1  3  5  7  9  11  13  15  17  19  21  Resultant masking  1  3  L7: Audio compression and coding  5  7  9  11 13 15 freq / Bark  17  19  21  March 31, 2009  30 / 37   Source: http://www.doksinet  MPEG Bit allocation Result of psychoacoustic model is maximum tolerable noise per subband SNR ~ 6·B  level  Masking tone Masked threshold Safe noise level freq  Quantization noise Subband N I  safe noise level  required SNR  bits B  Bit allocation procedure (fixed bit rate): pick channel with worst noise-masker ratio improve its quantization by one step I repeat while more bits available for this frame I I  Bands with no signal above masking curve can be skipped E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  31 / 37   Source: http://www.doksinet  MPEG Audio Layer III ‘Transform coder’ on top of ‘subband coder’
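Before the Layer III details, the greedy bit-allocation loop on the previous slide can be sketched in a few lines. This is an illustrative toy with made-up band levels and masking thresholds and a simple 6 dB-per-bit rule, not the normative MPEG procedure.

```python
# Greedy bit allocation: repeatedly give one more bit (~6 dB less quantization
# noise) to the subband with the worst noise-to-mask ratio, until bits run out.
# Band levels and masked thresholds below are invented illustrative numbers.
import numpy as np

signal_db = np.array([60.0, 55.0, 48.0, 40.0, 30.0, 18.0])   # per-band level
masked_db = np.array([38.0, 40.0, 35.0, 30.0, 28.0, 25.0])   # tolerable noise
bit_budget = 40
samples_per_band = 12        # +1 bit of resolution in a band costs 12 bits/frame

bits = np.zeros(len(signal_db), dtype=int)
while bit_budget >= samples_per_band:
    noise_db = signal_db - 6.0 * bits          # ~6 dB SNR per bit
    nmr = noise_db - masked_db                 # noise-to-mask ratio
    worst = int(np.argmax(nmr))
    if nmr[worst] <= 0:                        # all noise already inaudible
        break
    bits[worst] += 1
    bit_budget -= samples_per_band

print("bits per band:", bits.tolist(), "bits left:", bit_budget)
```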
Line Subband 575 31 Filterbank MDCT 32 Subbands 0  0  Window Switching  Coded Audio Signal 192 kbit/s  Distortion Control Loop Nonuniform Quantization Rate Control Loop  Huffman Encoding  Coding of Sideinformation  Bitstream Formatting CRC-Check  Digital Audio Signal (PCM) (768 kbit/s)  32 kbit/s  Psychoacoustic Model  FFT 1024 Points  External Control  Blocks of 36 subband time-domain samples become 18 pairs of frequency-domain samples more redundancy in spectral domain finer control e.g of aliasing, masking I scale factors now in band-blocks I I  Fixed Huffman tables optimized for audio data Power-law quantizer E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  32 / 37   Source: http://www.doksinet  Adaptive time window Time window relies on temporal masking I  single quantization level over 8-24 ms window  ‘Nightmare’ scenario: Pre-echo distortion  I  ‘backward masking’ saves in most cases  Adaptive switching of time window: normal  window level  1
 0.6 0.4 0.2 0  E6820 (Ellis & Mandel)  transition short  0.8  0  20  40  60  80  100 time / ms  L7: Audio compression and coding  March 31, 2009  33 / 37   Source: http://www.doksinet  The effects of MP3 Before & after: freq / kHz  Josie - direct from CD 20  15 10 5 0  freq / kHz  After MP3 encode (160 kbps) and decode 20  15 10 5 0  freq / kHz  Residual (after aligning 1148 sample delay) 20  15 10 5 0 0  2  4  6  8  10  time / sec  chop off high frequency (above 16 kHz) occasional other time-frequency ‘holes’ I quantization noise under signal I  I  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  34 / 37   Source: http://www.doksinet  MP3 & Beyond  MP3 is ‘transparent’ at ∼ 128 Kbps for stereo (11x smaller than 1.4 Mbps CD rate) I  only decoder is standardized: better psychological models  better encoders  MPEG2 AAC rebuild of MP3 without backwards compatibility 30% better (stereo at 96 Kbps?) I multichannel etc. I I  MPEG4-Audio I I
 wide range of component encodings MPEG Audio, LSPs, .  SAOL ‘synthetic’ component of MPEG-4 Audio complete DSP/computer music language! I how to encode into it? I I  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  35 / 37   Source: http://www.doksinet  Summary  For coding, every bit counts I I  take care over quantization domain & effects Shannon limits.  Speech coding I I  LPC modeling is old but good CELP residual modeling can go beyond speech  Wide-band coding I I  noise shaping ‘hides’ quantization noise detailed psychoacoustic models are key  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  36 / 37   Source: http://www.doksinet  References  Ben Gold and Nelson Morgan. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. Wiley, July 1999 ISBN 0471351547 D. Pan A tutorial on MPEG/audio compression IEEE Multimedia, 2(2):60–74, 1995 Karlheinz Brandenburg. MP3 and AAC explained In
Proc AES 17th International Conference on High Quality Audio Coding, pages 1–12, 1999. T. Painter and A Spanias Perceptual coding of digital audio Proc IEEE, 80(4): 451–513, Apr. 2000  E6820 (Ellis & Mandel)  L7: Audio compression and coding  March 31, 2009  37 / 37   Source: http://www.doksinet  EE E6820: Speech & Audio Processing & Recognition  Lecture 8: Spatial sound Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820  March 27, 2008 1  Spatial acoustics  2  Binaural perception  3  Synthesizing spatial audio  4  Extracting spatial sounds  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  1 / 33   Source: http://www.doksinet  Outline  1  Spatial acoustics  2  Binaural perception  3  Synthesizing spatial audio  4  Extracting spatial sounds  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  2 / 33   Source: http://www.doksinet  Spatial acoustics  Received sound = source
+ channel
I so far, only considered ideal source waveform

Sound carries information on its spatial origin
I ”ripples in the lake”
I evolutionary significance

The basis of scene analysis?
I yes and no: try blocking an ear

Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  3 / 33

Source: http://www.doksinet

Ripples in the lake

(figure: source radiating wavefronts at c m/s toward a listener; energy ∝ 1/r²)

Effect of relative position on sound
I delay = ∆r / c
I energy decay ∼ 1/r²
I absorption ∼ G(f) · r
I direct energy plus reflections

Give cues for recovering source position
Describe wavefront by its normal

Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  4 / 33

Source: http://www.doksinet

Recovering spatial information

Source direction as wavefront normal
I moving plane found from timing at 3 points
(figure: wavefront reaching sensors A, B, C; path difference ∆s = AB·cos θ corresponds to time difference ∆t = ∆s / c)
I need to solve correspondence

Space: need 3 parameters
I e.g. range r and 2 angles: elevation φ,
azimuth θ  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  5 / 33   Source: http://www.doksinet  The effect of the environment Reflection causes additional wavefronts  reflection  diffraction & shadowing  + scattering, absorption many paths  many echoes  I I  Reverberant effect causal ‘smearing’ of signal energy  I  + reverb from hlwy16 freq / Hz  freq / Hz  Dry speech 'airvib16' 8000 6000  8000 6000  4000  4000  2000  2000  0  0  0.5  Michael Mandel (E6820 SAPR)  1  1.5  0 time / sec Spatial sound  0  0.5  1  1.5 time / sec March 27, 2008 6 / 33   Source: http://www.doksinet  Reverberation impulse response Exponential decay of reflections: hlwy16 - 128pt window freq / Hz  ~e-t/T  hroom(t)  8000  -10 -20  6000 -30 -40  4000  -50 2000 -60  t  0  -70 0  0.1  0.2  0.3  0.4  0.5  0.6  0.7 time / s  Frequency-dependent I  greater absorption at high frequencies  faster decay  Size-dependent I  larger rooms  longer delays  slower decay  Sabine’s equation:
0.049V S ᾱ Time constant as size, absorption RT60 =  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  7 / 33   Source: http://www.doksinet  Outline  1  Spatial acoustics  2  Binaural perception  3  Synthesizing spatial audio  4  Extracting spatial sounds  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  8 / 33   Source: http://www.doksinet  Binaural perception  R  L  head shadow (high freq)  path length difference source  What is the information in the 2 ear signals? I I  the sound of the source(s) (L+R) the position of the source(s) (L-R)  Example waveforms (ShATR database) shatr78m3 waveform 0.1 Left 0.05 0 -0.05 Right -0.1 2.2  Michael Mandel (E6820 SAPR)  2.205  2.21  2.215  2.22  Spatial sound  2.225  2.23  2.235  time / s  March 27, 2008  9 / 33   Source: http://www.doksinet  Main cues to spatial hearing Interaural time difference (ITD) from different path lengths around head dominates in low frequency (< 1.5 kHz) I max ∼750 µs  ambiguous for freqs
> 600 Hz  I I  Interaural intensity difference (IID) I I  from head shadowing of far ear negligible for LF; increases with frequency  Spectral detail (from pinna reflections) useful for elevation & range Direct-to-reverberant useful for range freq / kHz  Claps 33 and 34 from 627M:nf90 20 15 10 5 0  0  0.2  Michael Mandel (E6820 SAPR)  0.4  0.6  0.8  1  Spatial sound  1.2  1.4  1.6  1.8  time / s  March 27, 2008  10 / 33   Source: http://www.doksinet  Head-Related Transfer Functions (HRTFs) Capture source coupling as impulse responses {`θ,φ,R (t), rθ,φ,R (t)} Collection: (http://interface.cipicucdavisedu/)  HRIR 021 Left @ 0 el  HRIR 021 Right @ 0 el 1  45  0  0  1  -45  0  Azimuth / deg  RIGHT  LEFT  -1  0  0.5  1  1.5  0  0.5  1  1.5  time / ms 0  HRIR 021 Left @ 0 el 0 az  HRIR 021 Right @ 0 el 0 az  0.5  1  1.5  time / ms  Highly individual! Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  11 / 33   Source: http://www.doksinet  Cone of confusion  azimuth θ 
Cone of confusion (approx equal ITD)  Interaural timing cue dominates (below 1kHz) I  from differing path lengths to two ears  But: only resolves to a cone I  Up/down? Front/back?  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  12 / 33   Source: http://www.doksinet  Further cues  Pinna causes elevation-dependent coloration  Monaural perception I  separate coloration from source spectrum?  Head motion I I  synchronized spectral changes also for ITD (front/back) etc.  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  13 / 33   Source: http://www.doksinet  Combining multiple cues Both ITD and ILD influence azimuth; What happens when they disagree? Identical signals to both ears  image is centered l(t)  r(t) t  1 ms  t  Delaying right channel moves image to left l(t)  r(t) t  t  Attenuating left channel returns image to center l(t)  r(t) t  t  “Time-intensity trading” Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  14 / 33   Source:
http://www.doksinet  Binaural position estimation Imperfect results: (Wenzel et al., 1993)  Judged Azimuth (Deg)  180  120  60  0  -60  -120  -180 -180  -120  -60  0  60  120  180  Target Azimuth (Deg)  listening to ‘wrong’ HRTFs  errors front/back reversals stay on cone of confusion  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  15 / 33   Source: http://www.doksinet  The Precedence Effect Reflections give misleading spatial cues  l(t)  direct reflected  R  t  r(t)  R/  c  t  But: Spatial impression based on 1st wavefront then ‘switches off’ for ∼50 ms .   even if ‘reflections’ are louder .   leads to impression of room Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  16 / 33   Source: http://www.doksinet  Binaural Masking Release Adding noise to reveal target Tone + noise to one ear: tone is masked +  t  t  Identical noise to other ear: tone is audible +  t  t  t  Binaural Masking Level Difference up to 12dB I  greatest for noise in phase, tone
anti-phase  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  17 / 33   Source: http://www.doksinet  Outline  1  Spatial acoustics  2  Binaural perception  3  Synthesizing spatial audio  4  Extracting spatial sounds  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  18 / 33   Source: http://www.doksinet  Synthesizing spatial audio  Goal: recreate realistic soundfield I I  hi-fi experience synthetic environments (VR)  Constraints resources information (individual HRTFs) I delivery mechanism (headphones) I I  Source material types I I  live recordings (actual soundfields) synthetic (studio mixing, virtual environments)  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  19 / 33   Source: http://www.doksinet  Classic stereo  L  R  ‘Intensity panning’: no timing modifications, just vary level ±20 dB I  works as long as listener is equidistant (ILD)  Surround sound: extra channels in center, sides, .   I  same basic effect: pan between pairs  Michael Mandel
(E6820 SAPR)  Spatial sound  March 27, 2008  20 / 33   Source: http://www.doksinet  Simulating reverberation Can characterize reverb by impulse response I I  spatial cues are important: record in stereo IRs of ∼1 sec  very long convolution  Image model: reflections as duplicate sources virtual (image) sources reflected path  source  listener  ‘Early echos’ in room impulse response: direct path early echos hroom(t)  t  Actual reflection may be hreflect (t), not δ(t) Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  21 / 33   Source: http://www.doksinet  Artificial reverberation Reproduce perceptually salient aspects early echo pattern ( room size impression) overall decay tail ( wall materials.   ) I interaural coherence ( spaciousness) I I  Nested allpass filters (Gardner, 1992) Allpass -g  H(z) =  z-k - g 1 - g·z-k  h[n]  1-g2 g(1-g2) 2 g (1-g2)  y[n]  x[n] +  z-k  +  k  g g,k  n  Synthetic Reverb +  30,0.7  a0 +  AP0 g  Michael Mandel (E6820 SAPR)  3k  -g 
Nested+Cascade Allpass 50,0.5 20,0.3  2k  Spatial sound  AP1  + a1  a2  AP2  LPF March 27, 2008  22 / 33   Source: http://www.doksinet  Synthetic binaural audio Source convolved with {L,R} HRTFs gives precise positioning .   for headphone presentation I can combine multiple sources (by adding)  Where to get HRTFs? measured set, but: specific to individual, discrete interpolate by linear crossfade, PCA basis set I or: parametric model - delay, shadow, pinna (Brown and Duda, 1998) I I  Delay  Shadow  Pinna  z-tDL(θ)  1 - bL(θ)z-1 1 - azt  Σ pkL(θ,φ)·z-tPkL(θ,φ)  Room echo  Source z-tDR(θ)  +  KE·z-tE  1 - bR(θ)z-1 1 - azt  Σ pkR(θ,φ)·z-tPkR(θ,φ)  + (after Brown & Duda '97)  Head motion cues? I  head tracking + fast updates  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  23 / 33   Source: http://www.doksinet  Transaural sound  Binaural signals without headphones? Can cross-cancel wrap-around signals I I  speakers SL,R , ears EL,R , binaural signals
BL,R Goal: present BL,R to EL,R BL BR M  SR  SL  −1 SL = HLL (BL − HRL SR )  HLR  −1 SR = HRR (BR − HLR SL )  HRL HRR  HLL  EL  ER  Narrow ‘sweet spot’ I  head motion?  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  24 / 33   Source: http://www.doksinet  Soundfield reconstruction Stop thinking about ears I  just reconstruct pressure + spatial derivatives ∂p(t)/∂y  ∂p(t)/∂z  ∂p(t)/∂x p(x,y,z,t)  I  ears in reconstructed field receive same sounds  Complex reconstruction setup (ambisonics)  I  able to preserve head motion cues?  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  25 / 33   Source: http://www.doksinet  Outline  1  Spatial acoustics  2  Binaural perception  3  Synthesizing spatial audio  4  Extracting spatial sounds  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  26 / 33   Source: http://www.doksinet  Extracting spatial sounds  Given access to soundfield, can we recover separate components? I I  degrees of freedom:
> N signals from N sensors is hard
I but: people can do it (somewhat)

Information-theoretic approach
I use only very general constraints
I rely on precision measurements

Anthropic approach
I examine human perception
I attempt to use same information

Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  27 / 33

Source: http://www.doksinet

Microphone arrays

Signals from multiple microphones can be combined to enhance/cancel certain sources

‘Coincident’ mics with different directional gains:
[m1; m2] = [a11 a12; a21 a22] [s1; s2]  ⇒  [ŝ1; ŝ2] = Â^(−1) m
(figure: two sources s1, s2 reaching coincident mics m1, m2 with gains a11, a12, a21, a22)

Microphone arrays (endfire)
(figure: endfire array of spaced, summed mics; directional response narrows as wavelength decreases from λ = 4D to 2D to D)

Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  28 / 33

Source: http://www.doksinet

Adaptive Beamforming & Independent Component Analysis (ICA)

Formulate mathematical criteria to optimize

Beamforming: Drive interference to zero
I cancel energy during nontarget intervals
 ICA: maximize mutual independence of outputs I  from higher-order moments during overlap  m1 m2  x  a11 a12 a21 a22  s1 s2 −δ MutInfo δa  Limited by separation model parameter space I  only N × N?  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  29 / 33   Source: http://www.doksinet  Binaural models  Human listeners do better? I  certainly given only 2 channels  Extract ITD and IID cues?  cross-correlation finds timing differences ‘consume’ counter-moving pulses I how to achieve IID, trading I vertical cues. I I  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  30 / 33   Source: http://www.doksinet  Time-frequency masking How to separate sounds based on direction? assume one source dominates each time-frequency point assign regions of spectrogram to sources based on probabilistic models I re-estimate model parameters based on regions selected I I  Model-based EM Source Separation and Localization  Mandel and Ellis (2007) I models include IID as Lω and
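Here is a minimal sketch of the cross-correlation idea in the bullet above: estimate the interaural delay as the lag that maximizes correlation between the two ear signals. This is a toy, noiseless setup with an integer-sample true delay, not a full binaural model.

```python
# Estimate ITD as the lag of maximum cross-correlation between left and right.
# Toy setup: the 'right' signal is simply a delayed copy of the 'left' one.
import numpy as np

fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(fs)                 # 1 s of noise-like source
true_delay = 8                                # samples (~0.5 ms at 16 kHz)

left = src
right = np.concatenate((np.zeros(true_delay), src[:-true_delay]))

max_lag = 16                                  # ~1 ms search range
lags = np.arange(-max_lag, max_lag + 1)
corr = [np.dot(left[max_lag:-max_lag],
               right[max_lag + lag: len(right) - max_lag + lag]) for lag in lags]
best = lags[int(np.argmax(corr))]
print("estimated ITD:", best, "samples =", 1000.0 * best / fs, "ms")
```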
IPD as arg Lω Rω Rω I independent of source, but can model it separately I  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  31 / 33   Source: http://www.doksinet  Summary  Spatial sound I  sampling at more than one point gives information on origin direction  Binaural perception I  time & intensity cues used between/within ears  Sound rendering I I  conventional stereo HRTF-based  Spatial analysis I I  optimal linear techniques elusive auditory models  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  32 / 33   Source: http://www.doksinet  References  Elizabeth M. Wenzel, Marianne Arruda, Doris J Kistler, and Frederic L Wightman Localization using nonindividualized head-related transfer functions. The Journal of the Acoustical Society of America, 94(1):111–123, 1993. William G. Gardner A real-time multichannel room simulator The Journal of the Acoustical Society of America, 92(4):2395–2395, 1992. C. P Brown and R O Duda A structural model for binaural
sound synthesis IEEE Transactions on Speech and Audio Processing, 6(5):476–488, 1998. Michael I. Mandel and Daniel P Ellis EM localization and separation using interaural level and phase cues. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 275–278, 2007. J. C Middlebrooks and D M Green Sound localization by human listeners Annu Rev Psychol, 42:135–159, 1991. Brian C. J Moore An Introduction to the Psychology of Hearing Academic Press, fifth edition, April 2003. ISBN 0125056281 Jens Blauert. Spatial Hearing - Revised Edition: The Psychophysics of Human Sound Localization. The MIT Press, October 1996 V. R Algazi, R O Duda, D M Thompson, and C Avendano The cipic hrtf database. In Applications of Signal Processing to Audio and Acoustics, 2001 IEEE Workshop on the, pages 99–102, 2001.  Michael Mandel (E6820 SAPR)  Spatial sound  March 27, 2008  33 / 33   Source: http://www.doksinet  EE E6820: Speech & Audio Processing & Recognition 
Lecture 9: Speech Recognition Dan Ellis <dpwe@ee.columbiaedu> Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820  April 7, 2009 1 2 3 4  Recognizing speech Feature calculation Sequence recognition Large vocabulary, continuous speech recognition (LVCSR)  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  1 / 43   Source: http://www.doksinet  Outline  1  Recognizing speech  2  Feature calculation  3  Sequence recognition  4  Large vocabulary, continuous speech recognition (LVCSR)  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  2 / 43   Source: http://www.doksinet  Recognizing speech “So, I thought about that and I think it’s still possible” Frequency  4000 2000  0  0  0.5  1  1.5  2  2.5  3  Time  What kind of information might we want from the speech signal? words phrasing, ‘speech acts’ (prosody) I mood / emotion I speaker identity I I  What kind of
processing do we need to get at that information? time scale of feature extraction signal aspects to capture in features I signal aspects to exclude from features I I  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  3 / 43   Source: http://www.doksinet  Speech recognition as Transcription Transcription = “speech to text” I  find a word string to match the utterance  Gives neat objective measure: word error rate (WER) % I  can be a sensitive measure of performance Reference:  THE CAT SAT ON THE MAT – Recognized: CAT SAT AN THE A MAT Deletion  Substitution  Insertion  Three kinds of errors: WER = (S + D + I )/N  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  4 / 43   Source: http://www.doksinet  Problems: Within-speaker variability Timing variation I  word duration varies enormously Frequency  4000 2000  0 0  0.5  s ow SO  I  1.0  1.5  2.0  ay  aa ax b aw ax ay ih k t s t ih l th dx th n th n ih I ABOUT I IT'S STILL THOUGHT THAT THINK AND
 2.5  3.0  p aa s b ax l POSSIBLE  fast speech ‘reduces’ vowels  Speaking style variation I I  careful/casual articulation soft/loud speech  Contextual effects I  speech sounds vary with context, role: “How do you do?”  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  5 / 43   Source: http://www.doksinet  Problems: Between-speaker variability Accent variation I  regional / mother tongue  Voice quality variation I  gender, age, huskiness, nasality  Individual characteristics  mbma0  mannerisms, speed, prosody freq / Hz  I  8000 6000 4000 2000 0 8000 6000  fjdm2  4000 2000 0 0  E6820 (Ellis & Mandel)  0.5  1  1.5  L9: Speech recognition  2  2.5  time / s  April 7, 2009  6 / 43   Source: http://www.doksinet  Problems: Environment variability Background noise I  fans, cars, doors, papers  Reverberation I  ‘boxiness’ in recordings  Microphone/channel I  huge effect on relative spectral gain  Close mic  freq / Hz  4000 2000 0 4000  Tabletop 2000 mic 0  E6820
(Ellis & Mandel)  0  0.2  0.4  0.6  0.8  L9: Speech recognition  1  1.2  1.4  time / s  April 7, 2009  7 / 43   Source: http://www.doksinet  How to recognize speech? Cross correlate templates? waveform? spectrogram? I time-warp problems I I  Match short-segments & handle time-warp later I I  model with slices of ∼10 ms pseudo-stationary model of words:  freq / Hz  sil  g  w  0.15  0.2  eh  n  sil  4000 3000 2000 1000 0  I  0  0.05  0.1  0.25  0.3  0.35  0.4  0.45 time / s  other sources of variation.    E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  8 / 43   Source: http://www.doksinet  Probabilistic formulation Probability that segment label is correct I  gives standard form of speech recognizers  Feature calculation: s[n]  Xm I  (m = Hn )  transforms signal into easily-classified domain  Acoustic classifier: p(q i | X ) I  calculates probabilities of each mutually-exclusive state q i  ‘Finite state acceptor’ (i.e HMM) Q ∗ = argmax p(q0 , q1 , .  
qL | X0 , X1 ,    XL ) {q0 ,q1 ,.qL }  I  MAP match of allowable sequence to probabilities: q0 = “ay” q1 . X  E6820 (Ellis & Mandel)  0 1 2 .  L9: Speech recognition  time  April 7, 2009  9 / 43   Source: http://www.doksinet  Standard speech recognizer structure Fundamental equation of speech recognition: Q ∗ = argmax p(Q | X , Θ) Q  = argmax p(X | Q, Θ)p(Q | Θ) Q  X = acoustic features p(X | Q, Θ) = acoustic model I p(Q | Θ) = language model I argmax Q = search over sequences I I  Questions: what are the best features? how do we do model them? I how do we find/match the state sequence? I I  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  10 / 43   Source: http://www.doksinet  Outline  1  Recognizing speech  2  Feature calculation  3  Sequence recognition  4  Large vocabulary, continuous speech recognition (LVCSR)  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  11 / 43   Source: http://www.doksinet  Feature Calculation Goal: Find a
representational space most suitable for classification waveform: voluminous, redundant, variable spectrogram: better, still quite variable I .? I I  Pattern Recognition: representation is upper bound on performance I I  maybe we should use the waveform.   or, maybe the representation can do all the work  Feature calculation is intimately bound to classifier I  pragmatic strengths and weaknesses  Features develop by slow evolution I  current choices more historical than principled  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  12 / 43   Source: http://www.doksinet  Features (1): Spectrogram Plain STFT as features e.g X Xm [k] = S[mH, k] = s[n + mH] w [n] e −j2πkn/N n  Consider examples: freq / Hz  Feature vector slice 8000 6000 4000 2000 0 8000 6000 4000 2000 0 0  0.5  1  1.5  2  2.5  time / s  Similarities between corresponding segments I  but still large differences  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  13 / 43   Source:
http://www.doksinet  Features (2): Cepstrum Idea: Decorrelate, summarize spectral slices: Xm [`] = IDFT{log |S[mH, k]|} I I  good for Gaussian models greatly reduce feature dimension  Male  8000  spectrum 4000  cepstrum Female  0 8 6 4 2 8000  spectrum 4000  cepstrum  0 8 6 4 2 0  E6820 (Ellis & Mandel)  0.5  1  1.5  L9: Speech recognition  2  2.5  time / s  April 7, 2009  14 / 43   Source: http://www.doksinet  Features (3): Frequency axis warp Linear frequency axis gives equal ‘space’ to 0-1 kHz and 3-4 kHz I  but perceptual importance very different  Warp frequency axis closer to perceptual axis I  mel, Bark, constant-Q .    X [c] =  uc X  |S[k]|2  k=`c Male  8000  spectrum 4000 0 15  audspec 10 5  Female  15  audspec 10 5 0  E6820 (Ellis & Mandel)  0.5  1  1.5  L9: Speech recognition  2  2.5  time / s  April 7, 2009  15 / 43   Source: http://www.doksinet  Features (4): Spectral smoothing Generalizing across different speakers is helped by smoothing (i.e blurring)
spectrum Truncated cepstrum is one way: I  MMSE approx to log |S[k]|  LPC modeling is a little different: MMSE approx to |S[k]|  prefers detail at peaks level / dB  I  50  audspec 30  Male  plp  40  0  2  4  6  8  10  12  14  16  18 freq / chan  15  audspec 10  5  15  plp 10 smoothed 5 0  E6820 (Ellis & Mandel)  0.5  1  1.5  L9: Speech recognition  2  2.5  time / s  April 7, 2009  16 / 43   Source: http://www.doksinet  Features (5): Normalization along time Idea: feature variations, not absolute level Hence: calculate average level and subtract it: Ŷ [n, k] = X̂ [n, k] − mean{X̂ [n, k]} n  Factors out fixed channel frequency response x[n] = hc ∗ s[n] X̂ [n, k] = log |X [n, k]| = log |Hc [k]| + log |S[n, k]| Male plp  15 10 5 15  mean 10 norm 5 Female 15 mean 10 norm 5 0  E6820 (Ellis & Mandel)  0.5  1  1.5  L9: Speech recognition  2  2.5  time / s  April 7, 2009  17 / 43   Source: http://www.doksinet  Delta features  Want each segment to have ‘static’ feature vals
I but some segments intrinsically dynamic!  calculate their derivativesmaybe steadier?  Append dX /dt (+ d 2 X /dt 2 ) to feature vectors Male 15 ddeltas 10  5  deltas  15 10 5 15  plp 10 (µ,σ norm) 5  0  0.5  1  1.5  2  2.5  time / s  Relates to onset sensitivity in humans?  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  18 / 43   Source: http://www.doksinet  Overall feature calculation MFCCs and/or RASTA-PLP Sound FFT X[k]  FFT X[k] spectra  Mel scale freq. warp  Key attributes:  Bark scale freq. warp  spectral, auditory scale  audspec log|X[k]|  log|X[k]|  decorrelation  IFFT  Rasta band-pass  smoothed (spectral) detail  Truncate  LPC smooth  Subtract mean  Cepstral recursion  CMN MFCC features  Rasta-PLP cepstral features  cepstra  E6820 (Ellis & Mandel)  smoothed onsets  normalization of levels  LPC spectra  L9: Speech recognition  April 7, 2009  19 / 43   Source: http://www.doksinet  Features summary Male  Female  8000  spectrum  4000 0 15  audspec 10 5
15  rasta 10 5 15  deltas 10 5 0  0.5  1  1.5  0  0.5  1  time / s  Normalize same phones Contrast different phones E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  20 / 43   Source: http://www.doksinet  Outline  1  Recognizing speech  2  Feature calculation  3  Sequence recognition  4  Large vocabulary, continuous speech recognition (LVCSR)  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  21 / 43   Source: http://www.doksinet  Sequence recognition: Dynamic Time Warp (DTW)  FOUR ONE TWO THREE  Reference  FIVE  Framewise comparison with stored templates: 70 60 50 40 30 20 10 10  20  30  40  50 time /frames  Test  I I  distance metric? comparison across templates?  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  22 / 43   Source: http://www.doksinet  Dynamic Time Warp (2) Find lowest-cost constrained path: matrix d(i, j) of distances between input frame fi and reference frame rj I allowable predecessors and transition costs Txy  I 
    D(i, j) = d(i, j) + min{ D(i-1, j) + T_10,  D(i, j-1) + T_01,  D(i-1, j-1) + T_11 }

i.e. lowest cost to (i, j) = local match cost + best predecessor (including transition cost)

[figure: grid of reference frames r_j vs. input frames f_i, showing the three allowed predecessors of cell (i, j)]

Best path via traceback from final state
- store predecessors for each (i, j)
(a small worked sketch of this recursion follows below)

E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  23 / 43

DTW-based recognition
Reference templates for each possible word
For isolated words:
- mark endpoints of input word
- calculate scores through each template (+ prune)

[figure: alignment paths through reference templates ONE / TWO / THREE / FOUR vs. input frames]

- continuous speech: link together word ends

Successfully handles timing variation
- recognize speech at reasonable cost

E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  24 / 43

Statistical sequence recognition
DTW limited because it's hard to optimize
- learning from multiple observations
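A small worked sketch of the D(i,j) recursion and traceback from the "Dynamic Time Warp (2)" slide, in Python/numpy; the transition costs T10, T01, T11 are illustrative values, not ones given in the slides.

import numpy as np

def dtw(d, T10=1.0, T01=1.0, T11=0.0):
    """d[i, j] = local distance between input frame i and reference frame j.
    Implements D(i,j) = d(i,j) + min(D(i-1,j)+T10, D(i,j-1)+T01, D(i-1,j-1)+T11)
    and returns the total cost plus the best path found by traceback."""
    I, J = d.shape
    D = np.full((I, J), np.inf)
    back = np.zeros((I, J), dtype=int)   # 0: from (i-1,j), 1: (i,j-1), 2: (i-1,j-1)
    D[0, 0] = d[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            cands = [D[i-1, j] + T10 if i > 0 else np.inf,
                     D[i, j-1] + T01 if j > 0 else np.inf,
                     D[i-1, j-1] + T11 if i > 0 and j > 0 else np.inf]
            back[i, j] = int(np.argmin(cands))
            D[i, j] = d[i, j] + cands[back[i, j]]
    # traceback from the final state, using the stored predecessors
    path, i, j = [(I-1, J-1)], I-1, J-1
    while (i, j) != (0, 0):
        i, j = [(i-1, j), (i, j-1), (i-1, j-1)][back[i, j]]
        path.append((i, j))
    return D[-1, -1], path[::-1]

# toy example: two short 1-D 'feature' sequences compared by absolute difference
x = np.array([0., 1., 2., 3., 3.])
r = np.array([0., 1., 3.])
cost, path = dtw(np.abs(x[:, None] - r[None, :]))
print(cost, path)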
interpretation of distance, transition costs?  Need a theoretical foundation: Probability Formulate recognition as MAP choice among word sequences: Q ∗ = argmax p(Q | X , Θ) Q  X = observed features Q = word-sequences I Θ = all current parameters I I  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  25 / 43   Source: http://www.doksinet  State-based modeling Assume discrete-state model for the speech: I I  observations are divided up into time frames model  states  observations: Model Mj  Qk : q1 q2 q3 q4 q5 q6 .  states time  X1N : x1 x2 x3 x4 x5 x6 .  observed feature vectors  Probability of observations given model is: X p(X | Θ) = p(X1N | Q, Θ) p(Q | Θ) all Q  I  sum over all possible state sequences Q  How do observations X1N depend on states Q? How do state sequences Q depend on model Θ? E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  26 / 43   Source: http://www.doksinet  HMM review  HMM is specified by parameters Θ: - states qi -
transition probabilities a_ij
- emission distributions b_i(x)
(+ initial state probabilities π_i)

[figure: left-to-right model for 'k a t' with an example transition matrix, rows (1.0 0.0 0.0 0.0), (0.9 0.1 0.0 0.0), (0.0 0.9 0.1 0.0), (0.0 0.0 0.9 0.1), and per-state emission distributions p(x|q)]

    a_ij ≡ p(q_n^j | q_{n-1}^i)      b_i(x) ≡ p(x | q^i)      π_i ≡ p(q_1^i)

E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  27 / 43

HMM summary (1)
HMMs are a generative model: recognition is inference of p(Q | X)
During generation, behavior of model depends only on current state q_n:
- transition probabilities p(q_{n+1} | q_n) = a_ij
- observation distributions p(x_n | q_n) = b_i(x)

Given states Q = {q_1, q_2, . . . , q_N} and observations X = X_1^N = {x_1, x_2, . . . , x_N},
the Markov assumption makes

    p(X, Q | Θ) = Π_n p(x_n | q_n) p(q_n | q_{n-1})

E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  28 / 43

HMM summary (2)
Calculate p(X | Θ) via forward recursion:

    p(X_1^n, q_n^j) = α_n(j) = [ Σ_{i=1}^{S} α_{n-1}(i) a_ij ] b_j(x_n)

Viterbi (best path) approximation:

    α*_n(j) = max_i { α*_{n-1}(i) a_ij } b_j(x_n)

- then backtrace . . .   Q* = argmax_Q p(X, Q | Θ)

Pictorially:
[figure: X observed; M*, Q* inferred; M, Q = {q_1, q_2, . . . , q_n} assumed, hidden]

E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  29 / 43

Outline
1  Recognizing speech
2  Feature calculation
3  Sequence recognition
4  Large vocabulary, continuous speech recognition (LVCSR)

E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  30 / 43

Recognition with HMMs
Isolated word
- choose best p(M | X) ∝ p(X | M) p(M)

[figure: input scored against Model M1 (w ah n), Model M2 (t uw), Model M3 (th r iy), each giving p(X | Mi)·p(Mi)]

Continuous speech
- Viterbi decoding of one large HMM gives words

[figure: decoding network sil, then one of the word models with priors p(M1), p(M2), p(M3), then sil]

E6820 (Ellis & Mandel)  L9: Speech recognition
n uw  r  iy  April 7, 2009  31 / 43   Source: http://www.doksinet  Training HMMs  Probabilistic foundation allows us to train HMMs to ‘fit’ training data i.e estimate aij , bi (x) given data I better than DTW.    Algorithms to improve p(Θ | X ) are key to success of HMMs I  maximum-likelihood of models.    State alignments Q for training examples are generally unknown I  . else estimating parameters would be easy  Viterbi training I I  ‘Forced alignment’ choose ‘best’ labels (heuristic)  EM training I  ‘fuzzy labels’ (guaranteed local convergence)  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  32 / 43   Source: http://www.doksinet  Overall training procedure Labelled training data  Word models  “two one”  one  “five”  two  “four three”  w t  three  Data  ah  th  n uw  r  iy  Models t  uw  th  r  f  ao  w  ah  n  th  r  iy  iy  Fit models to data  Repeat until  Re-estimate model parameters convergence E6820 (Ellis & Mandel)  L9:
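To make the Viterbi recursion and traceback from the preceding slides concrete (the same recursion underlies the 'forced alignment' used in Viterbi training), here is a log-domain sketch for a discrete-observation HMM; the toy model parameters are invented.

import numpy as np

def viterbi(obs, logA, logB, logpi):
    """Best state path: alpha*_n(j) = max_i alpha*_{n-1}(i) a_ij b_j(x_n),
    computed in the log domain, followed by traceback."""
    N, S = len(obs), logA.shape[0]
    alpha = np.full((N, S), -np.inf)
    back = np.zeros((N, S), dtype=int)
    alpha[0] = logpi + logB[:, obs[0]]
    for n in range(1, N):
        scores = alpha[n-1][:, None] + logA        # scores[i, j]
        back[n] = scores.argmax(axis=0)
        alpha[n] = scores.max(axis=0) + logB[:, obs[n]]
    path = [int(alpha[-1].argmax())]
    for n in range(N - 1, 0, -1):
        path.append(int(back[n, path[-1]]))
    return path[::-1], alpha[-1].max()

# toy 2-state left-to-right model, 3 discrete symbols (all numbers illustrative)
A  = np.array([[0.9, 0.1], [0.0, 1.0]])
B  = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.2, 0.7]])
pi = np.array([1.0, 0.0])
with np.errstate(divide='ignore'):                 # log(0) -> -inf is fine here
    path, logp = viterbi([0, 0, 1, 2, 2], np.log(A), np.log(B), np.log(pi))
print(path, logp)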
Speech recognition  April 7, 2009  33 / 43   Source: http://www.doksinet  Language models Recall, fundamental equation of speech recognition Q ∗ = argmax p(Q | X , Θ) Q  = argmax p(X | Q, ΘA )p(Q | ΘL ) Q  So far, looked at p(X | Q, ΘA ) What about p(Q | ΘL )? I I  Q is a particular word sequence ΘL are parameters related to the language  Two components: I I  link state sequences to words p(Q | wi ) priors on word sequences p(wi | Mj )  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  34 / 43   Source: http://www.doksinet  HMM Hierarchy  HMMs support composition I  can handle time dilation, pronunciation, grammar all within the same framework  ae1  k  ae2  ae  ae3  p(q | M) = p(q, φ, w | M)  t  aa  = p(q | φ) · p(φ | w )  CAT  ATE  · p(wn | w1n−1 , M)  THE SAT DOG  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  35 / 43   Source: http://www.doksinet  Pronunciation models Define states within each word p(Q | wi ) Can have unique states
for each word (‘whole-word’ modeling), or .   Sharing (tying) subword units between words to reflect underlying phonology more training examples for each unit generalizes to unseen words I (or can do it automatically.   ) I I  Start e.g from pronunciation dictionary: ZERO(0.5) ZERO(0.5) ONE(1.0) TWO(1.0)  E6820 (Ellis & Mandel)  z iy r ow z ih r ow w ah n tcl t uw  L9: Speech recognition  April 7, 2009  36 / 43   Source: http://www.doksinet  Learning pronunciations ‘Phone recognizer’ transcribes training data as phones I  align to ‘canonical’ pronunciations Baseform Phoneme String f ay v y iy r ow l d f ah ay v y uh r ow l  Surface Phone String I I  infer modification rules predict other pronunciation variants  e.g ‘d deletion’: d  ∅|`stop  p = 0.9  Generate pronunciation variants; use forced alignment to find weights E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  37 / 43   Source: http://www.doksinet  Grammar  Account for different likelihoods
of different words and word sequences p(wi | Mj ) ‘True’ probabilities are very complex for LVCSR I  need parses, but speech often agrammatic   Use n-grams: p(wn | w1L ) = p(wn | wn−K , .   , wn−1 ) e.g n-gram models of Shakespeare:  n=1 n=2 n=3 n=4  To him swallowed confess hear both. Which Of save on    What means, sir. I confess she? then all sorts, he is trim,    Sweet prince, Falstaff shall die. Harry of Monmouth’s grave   King Henry. What! I will go seek the traitor Gloucester     Big win in recognizer WER I I  raw recognition results often highly ambiguous grammar guides to ‘reasonable’ solutions  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  38 / 43   Source: http://www.doksinet  Smoothing LVCSR grammars n-grams (n = 3 or 4) are estimated from large text corpora I I  100M+ words but: not like spoken language  100,000 word vocabulary  1015 trigrams! I I  never see enough examples unobserved trigrams should NOT have Pr = 0!  Backoff to bigrams,
unigrams I I  p(wn ) as an approx to p(wn | wn−1 ) etc. interpolate 1-gram, 2-gram, 3-gram with learned weights?  Lots of ideas e.g category grammars p(PLACE | “went”, “to”)p(wn | PLACE) how to define categories? I how to tag words in training corpus? I  I  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  39 / 43   Source: http://www.doksinet  Decoding  How to find the MAP word sequence? States, pronunciations, words define one big HMM I  with 100,000+ individual states for LVCSR!   Exploit hierarchic structure I I  phone states independent of word next word (semi) independent of word history oy  DECOY  s  DECODES  ow  d  z  DECODES  axr  DECODER  k iy  d  uw  DECODE DO  root b  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  40 / 43   Source: http://www.doksinet  Decoder pruning  Searching ‘all possible word sequences’ ? need to restrict search to most promising ones: beam search sort by estimates of total probability = Pr (so far)+
lower bound estimate of remains I trade search errors for speed I  I  Start-synchronous algorithm: extract top hypothesis from queue: [Pn, {w1 , .   , wk }, n] pr. so far words next time frame I find plausible words {wi } starting at time n  new hypotheses: I  [Pn p(Xnn+N−1 | w i )p(w i | wk .  ), I I  {w1 , .   , wk , w i },  n + N]  discard if too unlikely, or queue is too long else re-insert into queue and repeat  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  41 / 43   Source: http://www.doksinet  Summary  Speech signal is highly variable I I  need models that absorb variability hide what we can with robust features  Speech is modeled as a sequence of features I I  need temporal aspect to recognition best time-alignment of templates = DTW  Hidden Markov models are rigorous solution I I  self-loops allow temporal dilation exact, efficient likelihood calculations  Language modeling captures larger structure pronunciation, word sequences fits directly into HMM
state structure I need to ‘prune’ search space in decoding I I  Parting thought Forward-backward trains to generate, can we train to discriminate? E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  42 / 43   Source: http://www.doksinet  References  Lawrence R. Rabiner A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989 Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88, 2002 Wendy Holmes. Speech Synthesis and Recognition CRC, December 2001 ISBN 0748408576. Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition Prentice Hall PTR, April 1993. ISBN 0130151572 Daniel Jurafsky and James H. Martin Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, January 2000 ISBN 0130950696
Frederick Jelinek. Statistical Methods for Speech Recognition (Language, Speech, and Communication). The MIT Press, January 1998 ISBN 0262100665 Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon. Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall PTR, April 2001. ISBN 0130226165  E6820 (Ellis & Mandel)  L9: Speech recognition  April 7, 2009  43 / 43   Source: http://www.doksinet  EE E6820: Speech & Audio Processing & Recognition  Lecture 10: Signal Separation Dan Ellis <dpwe@ee.columbiaedu> Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820  April 14, 2009 1 2 3 4  Sound mixture organization Computational auditory scene analysis Independent component analysis Model-based separation  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  1 / 45   Source: http://www.doksinet  Outline  1  Sound mixture organization  2  Computational
auditory scene analysis  3  Independent component analysis  4  Model-based separation  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  2 / 45   Source: http://www.doksinet  Sound Mixture Organization 4000 frq/Hz 3000  0  2000  -20 -40  1000 0  0  2  4  6  8  10  12  time/s  -60 level / dB  Analysis  Voice (evil)  Voice (pleasant) Stab  Rumble  Choir Strings  Auditory Scene Analysis: describing a complex sound in terms of high-level sources / events .   like listeners do  Hearing is ecologically grounded I I  reflects ‘natural scene’ properties subjective, not absolute  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  3 / 45   Source: http://www.doksinet  Sound, mixtures, and learning freq / kHz  Speech  Noise  Speech + Noise  4 3  +  2  =  1 0  0  0.5  1  time / s  0  0.5  1  time / s  0  0.5  1  time / s  Sound I I  carries useful information about the world complements vision  Mixtures .   are the rule, not the exception I medium is
‘transparent’, sources are many I must be handled!  Learning the ‘speech recognition’ lesson: let the data do the work I like listeners  I  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  4 / 45   Source: http://www.doksinet  The problem with recognizing mixtures  “Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman, 1990) Received waveform is a mixture I  two sensors, N signals .   underconstrained  Disentangling mixtures as the primary goal? I I  perfect solution is not possible need experience-based constraints  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  5 / 45   Source: http://www.doksinet  Approaches to sound mixture recognition  Separate signals, then recognize e.g Computational Auditory Scene Analysis (CASA),
Independent Component Analysis (ICA) I nice, if you can do it  Recognize combined signal I I  ‘multicondition training’ combinatorics.    Recognize with parallel models I I  full joint-state space? divide signal into fragments, then use missing-data recognition  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  6 / 45   Source: http://www.doksinet  What is the goal of sound mixture analysis?  CASA  ?  Separate signals? output is unmixed waveforms underconstrained, very hard .   I too hard? not required? I I  Source classification? I I  output is set of event-names listeners do more than this.    Something in-between? Identify independent sources + characteristics I  standard task, results?  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  7 / 45   Source: http://www.doksinet  Segregation vs. Inference Source separation requires attribute separation sources are characterized by attributes (pitch, loudness, timbre, and finer details) I need to
identify and gather different attributes for different sources.   I  Need representation that segregates attributes I I  spectral decomposition periodicity decomposition  Sometimes values can’t be separated e.g unvoiced speech I maybe infer factors from probabilistic model? p(O, x, y )  p(x, y | O) I  or: just skip those values & infer from higher-level context  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  8 / 45   Source: http://www.doksinet  Outline  1  Sound mixture organization  2  Computational auditory scene analysis  3  Independent component analysis  4  Model-based separation  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  9 / 45   Source: http://www.doksinet  Auditory Scene Analysis (Bregman, 1990) How do people analyze sound mixtures? break mixture into small elements (in time-freq) elements are grouped in to sources using cues I sources have aggregate attributes I I  Grouping ‘rules’ (Darwin and Carlyon, 1995) I  cues:
common onset/offset/modulation, harmonicity, spatial location, .    Onset map Sound  Frequency analysis  Elements  Harmonicity map  Sources Grouping mechanism  Source properties  Spatial map  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  10 / 45   Source: http://www.doksinet  Cues to simultaneous grouping  freq / Hz  Elements + attributes 8000  6000  4000  2000  0  0  1  2  3  4  5  6  7  8  9 time / s  Common onset I  simultaneous energy has common source  Periodicity I  energy in different bands with same cycle  Other cues I  spatial (ITD/IID), familiarity, .    E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  11 / 45   Source: http://www.doksinet  The effect of context Context can create an ‘expectation’ i.e a bias towards a particular interpretation I  A change in a signal will be interpreted as an added source whenever possible  frequency / kHz  e.g Bregman’s “old-plus-new” principle:  2 1  + 0 0.0  0.4  0.8  1.2  time / s  I  a
different division of the same energy depending on what preceded it

E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  12 / 45

Computational Auditory Scene Analysis (CASA)
[figure: sound mixture analyzed into Object 1 / Object 2 / Object 3 percepts]

Goal: Automatic sound organization
- systems to 'pick out' sounds in a mixture . . . like people do
- e.g. voice against a noisy background
- to improve speech recognition

Approach: psychoacoustics describes grouping 'rules' . . . just implement them?

E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  13 / 45

CASA front-end processing
Correlogram: loosely based on known/possible physiology
[figure: sound → cochlea filterbank → frequency channels → envelope follower → short-time autocorrelation, giving a freq × lag × time volume]
- linear filterbank cochlear approximation
- static nonlinearity
- zero-delay slice is like spectrogram
- periodicity from delay-and-multiply detectors

E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  14 / 45

Bottom-Up Approach (Brown and Cooke, 1994)
Implement psychoacoustic theory:
[figure: input mixture → front end → signal features (maps: freq/onset/time, period, frq. mod) → object formation → discrete objects → grouping rules → source groups]
- left-to-right processing
- uses common onset & periodicity cues

Able to extract voiced speech
[figure: spectrograms (0-8 kHz, level in dB) of the 'v3n7' original mix and the Hu & Wang 2002 resynthesis]

E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  15 / 45
Problems with 'bottom-up' CASA
[figure: time-frequency elements sketched on a freq vs. time plane]

Circumscribing time-frequency elements
- need to have 'regions', but hard to find

Periodicity is the primary cue
- how to handle aperiodic energy?

Resynthesis via masked filtering
- cannot separate within a single t-f element

Bottom-up leaves no ambiguity or context
- how to model illusions?
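For the correlogram front end sketched on the "CASA front-end processing" slide above, a toy Python/scipy version might look like the following; the Butterworth band-pass filters stand in for a proper cochlear (e.g. gammatone) filterbank, and the channel centers, frame size, and test signal are arbitrary choices.

import numpy as np
from scipy.signal import butter, lfilter

def correlogram(x, sr, centers=(300, 600, 1200, 2400), frame=0.025, max_lag=0.0125):
    """Toy correlogram: band-pass filterbank, half-wave rectification as the
    static nonlinearity, then short-time autocorrelation in each channel.
    Returns an array of shape (channels, frames, lags)."""
    N, L = int(frame * sr), int(max_lag * sr)
    out = []
    for fc in centers:
        lo, hi = 0.7 * fc / (sr / 2), 1.3 * fc / (sr / 2)
        b, a = butter(2, [lo, hi], btype='band')
        y = np.maximum(lfilter(b, a, x), 0.0)          # half-wave rectification
        frames = [y[i:i + N] for i in range(0, len(y) - N, N)]
        ac = [[np.dot(f[:N - l], f[l:]) for l in range(L)] for f in frames]
        out.append(ac)
    return np.array(out)

sr = 8000
t = np.arange(sr) / sr
x = np.sign(np.sin(2 * np.pi * 200 * t))               # 200 Hz pulse-like source
cg = correlogram(x, sr)
print(cg.shape)            # (4 channels, ~39 frames, 100 lags)
# the shared 200 Hz periodicity appears as a ridge near lag sr/200 = 40 samples
# in every channel, while the zero-lag slice behaves like an energy spectrogram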
E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  16 / 45   Source: http://www.doksinet  Restoration in sound perception Auditory ‘illusions’ = hearing what’s not there The continuity illusion & Sinewave Speech (SWS) 8000  Frequency  6000 4000 2000 0  0  0.5  1  1.5 Time  2  2.5  2 Time  2.5  3  Frequency  5000 4000 3000 2000 1000 0  I  0.5  1  1.5  3.5  duplex perception  What kind of model accounts for this? I  is it an important part of hearing?  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  17 / 45   Source: http://www.doksinet  Adding top-down constraints: Prediction-Driven CASA Perception is not direct but a search for plausible hypotheses Data-driven (bottom-up).   input mixture  I  Front end  signal features  Object formation  discrete objects  Grouping rules  Source groups  objects irresistibly appear  vs. Prediction-driven (top-down) hypotheses  Hypothesis management prediction errors input mixture  I I  Front end  signal
features  Compare & reconcile  Noise components Periodic components  Predict & combine  predicted features  match observations with a ‘world-model’ need world-model constraints.    E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  18 / 45   Source: http://www.doksinet  Generic sound elements for PDCASA (Ellis, 1996) Goal is a representational space that covers real-world perceptual sounds minimal parameterization (sparseness) I separate attributes in separate parameters I I  f/Hz  NoiseObject  profile  magProfile H[k]  4000  ClickObject  XC[n,k]  decayTimes T[k]  f/Hz  2000  4000 2000  1000  1000  WeftObject H W[n,k]  400  400  H[k]  200 0  200 f/Hz 400  XN[n,k]  0.010  0.00  magContour A[n]  0.005  0.01  2.2  2.4 time / s  0.0  0.1 time / s  0.000 0.0  0.1  200  p[n]  100  XC[n,k] = H[k]·exp((n - n0) / T[k])  50  Correlogram analysis  f/Hz Periodogram 400  0.2 time/s  200  XN[n,k] = H[k]·A[n]  100 50 0.0  0.2  0.4  0.6  0.8 time/s  XW[n,k] =
HW[n,k]·P[n,k]  Object hierarchies built on top.   E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  19 / 45   Source: http://www.doksinet  PDCASA for old-plus-new Incremental analysis t1  t2  t3 Input Signal  Time t1: Initial element created Time t2: Additional element required Time t2: Second element finished  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  20 / 45   Source: http://www.doksinet  PDCASA for the continuity illusion Subjects hear the tone as continuous .   if the noise is a plausible masker f/Hz  ptshort  4000 2000 1000  0.0  0.2  0.4  0.6  0.8  1.0  1.2  1.4 t/s  Data-driven analysis gives just visible portions:  Prediction-driven can infer masking:  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  21 / 45   Source: http://www.doksinet  Prediction-Driven CASA Explain a complex sound with basic elements f/Hz City 4000 2000 1000 400 200 1000 400 200 100 50 0  1  2  3  f/Hz Wefts1−4  4  5  Weft5  6  7  Wefts6,7 
8  Weft8  9  Wefts9−12  4000 2000 1000 400 200 1000 400 200 100 50  Horn1 (10/10) Horn2 (5/10) Horn3 (5/10) Horn4 (8/10) Horn5 (10/10) f/Hz Noise2,Click1 4000 2000 1000 400 200  Crash (10/10) f/Hz Noise1 4000 2000 1000  −40  400 200  −50 −60  Squeal (6/10) Truck (7/10)  −70 0  E6820 (Ellis & Mandel)  1  2  3  4  5  6  7  8  L10: Signal separation  9  time/s  dB  April 14, 2009  22 / 45   Source: http://www.doksinet  Aside: Ground Truth  What do people hear in sound mixtures? I  do interpretations match?   Listening tests to collect ‘perceived events’:  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  23 / 45   Source: http://www.doksinet  Aside: Evaluation Evaluation is a big problem for CASA what is the goal, really? what is a good test domain? I how do you measure performance? I I  SNR improvement tricky to derive from before/after signals: correspondence problem I can do with fixed filtering mask I differentiate removing signal from adding noise
I  Speech Recognition (ASR) improvement I  recognizers often sensitive to artifacts  ‘Real’ task? I  mixture corpus with specific sound events.    E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  24 / 45   Source: http://www.doksinet  Outline  1  Sound mixture organization  2  Computational auditory scene analysis  3  Independent component analysis  4  Model-based separation  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  25 / 45   Source: http://www.doksinet  Independent Component Analysis (ICA) (Bell and Sejnowski, 1995, etc.) If mixing is like matrix multiplication, then separation is searching for the inverse matrix a11 a12 a21 a22  s1  ×  s1 s2  x1 x2  a11  w11 w12 × w21 w22  a21 a12 a22  s2  x1 x2  s^1 s^2 −δ MutInfo δwij  i.e W ≈ A−1 I with N different versions of the mixed signals (microphones), we can find N different input contributions (sources) I how to rate quality of outputs? i.e when do outputs look separate? E6820
(Ellis & Mandel)  L10: Signal separation  April 14, 2009  26 / 45

Gaussianity, Kurtosis, & Independence
A signal can be characterized by its PDF p(x)
- i.e. as if successive time values are drawn from a random variable (RV)
- Gaussian PDF is 'least interesting'
- sums of independent RVs (PDFs convolved) tend to Gaussian PDF (central limit theorem)

Measures of deviations from Gaussianity: 4th moment is kurtosis ("bulging")

    kurt(y) = E[ ((y - µ)/σ)^4 ] - 3

[figure: example PDFs and kurtosis values: exp(-|x|) kurt = 2.5, Gaussian kurt = 0, exp(-x^4) kurt = -1]

- kurtosis of Gaussian is zero (this def.)
- 'heavy tails' ⇒ kurt > 0; closer to uniform dist. ⇒ kurt < 0
- directly related to KL divergence from Gaussian PDF

E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  27 / 45

Independence in Mixtures
Scatter plots & kurtosis values
s1 vs. s2  0.6  x1 vs. x2 0.6  0.4  0.4 0.2
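A quick numerical check of the central-limit point above, using the same excess-kurtosis definition (the Laplacian sources, mixing weights, and random seed are arbitrary illustrative choices):

import numpy as np

def excess_kurtosis(y):
    """kurt(y) = E[((y - mu)/sigma)^4] - 3 (zero for a Gaussian)."""
    z = (y - y.mean()) / y.std()
    return np.mean(z ** 4) - 3

rng = np.random.default_rng(0)
s1 = rng.laplace(size=100000)        # heavy-tailed source, excess kurtosis ~ 3
s2 = rng.laplace(size=100000)        # independent second source
x1 = 0.6 * s1 + 0.8 * s2             # one row of a mixing matrix applied
print(excess_kurtosis(s1), excess_kurtosis(s2), excess_kurtosis(x1),
      excess_kurtosis(rng.normal(size=100000)))
# the sources come out near +3, the mixture closer to 0, the Gaussian near 0;
# ICA exploits this by searching for unmixing directions that maximize |kurtosis|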
0.2  0 0  -0.2 -0.4  -0.2  -0.6 -0.4 -0.4  -0.2  E6820 (Ellis & Mandel)  -0.2  0  0.2  0.4  0.6  -0.5  0  0.5  p(s1) kurtosis = 27.90  p(x1) kurtosis = 18.50  p(s2) kurtosis = 53.85  p(x2) kurtosis = 27.90  -0.1  0  0.1  0.2  -0.2  L10: Signal separation  -0.1  0  0.1  0.2  April 14, 2009  28 / 45   Source: http://www.doksinet  Finding Independent Components Sums of independent RVs are more Gaussian  minimize Gaussianity to undo sums i.e search over wij terms in inverse matrix Kurtosis vs. θ  s2  0.6 0.4  kurtosis  mix 2  Mixture Scatter 0.8  s1  12 10 8  0.2 6 0 4 -0.2 2  -0.4  s1 -0.6 -0.3  -0.2  -0.1  0  0.1  0.2  0.3  0.4  0  0  0.2  0.4  0.6  mix 1  s2 0.8  θ/π  1  Solve by Gradient descent or Newton-Raphson: w + = E[xg (w T x)] − E[g 0 (w T x)]w w=  w+ kw + k  “Fast ICA”, (Hyvärinen and Oja, 2000) http://www.cishutfi/projects/ica/fastica/ E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  29 / 45   Source: http://www.doksinet  Limitations of ICA
Assumes instantaneous mixing I I  real world mixtures have delays & reflections STFT domain? x1 (t) = a11 (t) ∗ s1 (t) + a12 (t) ∗ s2 (t) ⇒ X1 (ω) = A11 (ω)S1 (ω) + A12 (ω)S2 (ω)  I  Solve ω subbands separately, match up answers  Searching for best possible inverse matrix I I  cannot find more than N outputs from N inputs but: “projection pursuit” ideas + time-frequency masking.    Cancellation inherently fragile ŝ1 = w11 x1 + w12 x2 to cancel out s2 sensitive to noise in x channels I time-varying mixtures are a problem I  I  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  30 / 45   Source: http://www.doksinet  Outline  1  Sound mixture organization  2  Computational auditory scene analysis  3  Independent component analysis  4  Model-based separation  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  31 / 45   Source: http://www.doksinet  Model-Based Separation: HMM decomposition (e.g Varga and Moore, 1990; Gales and Young,
1993) Independent state sequences for 2+ component source models model 2  model 1 observations / time  New combined state space q 0 = q1 × q2 I  need pdfs for combinations p(X | q1 , q2 )  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  32 / 45   Source: http://www.doksinet  One-channel Separation: Masked Filtering Multichannel  ICA: Inverse filter and cancel target s  s^  + -  mix  interference n  n^  proc  One channel: find a time-frequency mask mix s+n  spectro gram  stft  x  istft  s^  mask  proc  Frequency  Frequency  Cannot remove overlapping noise in t-f cells, but surprisingly effective (psych masking?): 0 dB SNR  8000  -12 dB SNR  -24 dB SNR  6000  Noise mix  4000 2000 8000 6000  Mask resynth  4000 2000 0  0  1  2 Time  3  E6820 (Ellis & Mandel)  0  1  2 Time  3  0  1  2 Time  L10: Signal separation  3  April 14, 2009  33 / 45   Source: http://www.doksinet  “One microphone source separation” (Roweis, 2001) State sequences  t-f estimates  mask
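A minimal sketch of the masked-filtering idea from the "One-channel Separation" slide above: build a binary time-frequency mask (here an 'oracle' mask, since this toy example knows the target and noise separately) and apply it to the mixture STFT before resynthesis. It uses scipy.signal.stft/istft; all signal parameters are illustrative.

import numpy as np
from scipy.signal import stft, istft

def oracle_mask_separation(target, noise, sr, nperseg=512):
    """Mix target + noise, keep only the mixture T-F cells where the target
    is locally stronger (a binary mask), and resynthesize."""
    mix = target + noise
    _, _, S = stft(target, sr, nperseg=nperseg)
    _, _, Nn = stft(noise, sr, nperseg=nperseg)
    _, _, X = stft(mix, sr, nperseg=nperseg)
    mask = (np.abs(S) > np.abs(Nn)).astype(float)   # 'oracle' binary mask
    _, xhat = istft(mask * X, sr, nperseg=nperseg)
    return xhat[:len(mix)], mask

# toy example: a tone 'target' buried in white noise (purely illustrative)
sr = 8000
t = np.arange(2 * sr) / sr
target = np.sin(2 * np.pi * 440 * t)
noise = 0.5 * np.random.default_rng(1).normal(size=len(t))
xhat, mask = oracle_mask_separation(target, noise, sr)
print(mask.mean())       # fraction of T-F cells assigned to the target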
Original voices  freq / Hz  Speaker 1  Speaker 2  4000 3000 2000 1000 0  Mixture  4000 3000 2000 1000 0  Resynthesis masks  State means  4000 3000 2000 1000 0 4000 3000 2000 1000 0  I I  0  1  2  3  4  5 6 time / sec  0  1  2  3  4  5 6 time / sec  1000 states/model ( 106 transition probs.) simplify by subbands (coupled HMM)?  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  34 / 45   Source: http://www.doksinet  Speech Fragment Recognition  (Barker et al., 2005) Signal separation is too hard! Instead: I I  segregate features into partially-observed sources then classify  Made possible by missing data recognition I  integrate over uncertainty in observations for true posterior distribution  Goal: Relate clean speech models P(X | M) to speech-plus-noise mixture observations .   and make it tractable  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  35 / 45   "#$$#%&!'()(!*+,-&%#)#-%  Source: http://www.doksinet  !!! !! ! !! .
Missing Data Recognition
Speech models p(x | m) are multidimensional . . .
- i.e. means, variances for every freq. channel
- need values for all dimensions to get p(·)

But: can evaluate over a subset of dimensions x_k (with x_u unobserved):

    p(x_k | m) = ∫ p(x_k, x_u | m) dx_u

[figure: joint density p(x_k, x_u) marginalized over the unobserved dimension x_u]
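For a diagonal-covariance Gaussian state model, the marginalization above reduces to simply dropping the unobserved dimensions. A hypothetical sketch (the channel count, mask, and values are made up):

import numpy as np

def present_data_loglik(x, present, mean, var):
    """log p(x_k | m) for a diagonal Gaussian state model (mean, var),
    evaluated only over the 'present' dimensions k; the unobserved
    dimensions are integrated out, i.e. just left out of the product."""
    x, mean, var = x[present], mean[present], var[present]
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# toy 5-channel example: channels 1 and 3 are masked as 'missing'
mean = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
var  = np.ones(5)
x    = np.array([0.1, 9.9, 2.2, -7.0, 3.8])   # corrupted in the missing channels
present = np.array([True, False, True, False, True])
print(present_data_loglik(x, present, mean, var))
print(present_data_loglik(x, np.ones(5, bool), mean, var))  # full-dimension score is much worse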
Hence, missing data recognition ('present data mask'):

    P(x | q) = P(x_1 | q) · P(x_2 | q) · P(x_3 | q) · P(x_4 | q) · P(x_5 | q) · P(x_6 | q)

taken over just the 'present' time-frequency dimensions
[figure: bounded marginalization p(x_k | x_u < y) vs. full marginal p(x_k), applied along the frequency (dimension) and time axes]
- hard part is finding the mask (segregation)

E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  36 / 45

Missing Data Results
Estimate static background noise level N(f)
Cells with energy close to background are considered "missing"

[figure: "1754" + factory noise → SNR mask → missing data classifier → "1754"; digit recognition accuracy (%) vs. SNR (dB) for a priori mask, missing data, and MFCC+CMN]
-
must use spectral features!  But: nonstationary noise  spurious mask bits I  can we try removing parts of mask?  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  37 / 45   Source: http://www.doksinet  Comparing different segregations Standard classification chooses between models M to match source features X M ∗ = argmax p(M | X ) = argmax p(X | M)p(M) M  M  Mixtures: observed features Y , segregation S, all related by p(X | Y , S) Observation Y(f) Source X(f) Segregation S  freq  Joint classification of model and segregation: Z p(X | Y , S) p(M, S | Y ) = p(M) p(X | M) dX p(S | Y ) p(X ) I  P(X ) no longer constant  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  38 / 45   Source: http://www.doksinet  Calculating fragment matches Z p(M, S | Y ) = p(M)  p(X | M)  p(X | Y , S) dX p(S | Y ) p(X )  p(X | M) – the clean-signal feature model p(X | Y ,S) – is X ‘visible’ given segregation? p(X )  Integration collapses some bands.   p(S|Y ) –
segregation inferred from observation I I  just assume uniform, find S for most likely M or: use extra information in Y to distinguish Ss.    Result: probabilistically-correct relation between clean-source models p(X | M) and I inferred, recognized source + segregation p(M, S | Y )  I I  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  39 / 45   Source: http://www.doksinet  Using CASA features p(S | Y ) links acoustic information to segregation I I  is this segregation worth considering? how likely is it?  Opportunity for CASA-style information to contribute periodicity/harmonicity: these different frequency bands belong together I onset/continuity: this time-frequency region must be whole I  Frequency Proximity  E6820 (Ellis & Mandel)  Common Onset  L10: Signal separation  Harmonicity  April 14, 2009  40 / 45   Source: http://www.doksinet  Fragment decoding Limiting S to whole fragments makes hypothesis search tractable: Fragments  Hypotheses w1  w5  w2  w6 w7 
w3  w8  w4  time T1  T2  T3  T4  T5  T6  I choice of fragments reflects p(S | Y )p(X | M) i.e best combination of segregation and match to speech models  Merging hypotheses limits space demands .   but erases specific history  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  41 / 45   Source: http://www.doksinet  Speech fragment decoder results Simple p(S | Y ) model forces contiguous regions to stay together I  big efficiency gain when searching S space  "1754" + noise  SNR mask  Fragments  Fragment Decoder  "1754"  Clean-models-based recognition rivals trained-in-noise recognition E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  42 / 45   Source: http://www.doksinet  Multi-source decoding Search for more than one source  q2(t) S2(t)  Y(t)  S1(t)  q1(t)  Mutually-dependent data masks disjoint subsets of cells for each source each model match p(Mx | Sx , Y ) is independent I masks are mutually dependent: p(S1 , S2 | Y )  I I 
Huge practical advantage over full search E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  43 / 45   Source: http://www.doksinet  Summary  Auditory Scene Analysis: I  Hearing: partially understood, very successful  Independent Component Analysis: I  Simple and powerful, some practical limits  Model-based separation: I  Real-world constraints, implementation tricks  Parting thought Mixture separation the main obstacle in many applications e.g soundtrack recognition  E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  44 / 45   Source: http://www.doksinet  References  Albert S. Bregman Auditory Scene Analysis: The Perceptual Organization of Sound The MIT Press, 1990. ISBN 0262521954 C.J Darwin and RP Carlyon Auditory grouping In BCJ Moore, editor, The Handbook of Perception and Cognition, Vol 6, Hearing, pages 387–424. Academic Press, 1995. G. J Brown and M P Cooke Computational auditory scene analysis Computer speech and language, 8:297–336, 1994.
D. P W Ellis Prediction–driven computational auditory scene analysis PhD thesis, Department of Electrtical Engineering and Computer Science, M.IT, 1996 Anthony J. Bell and Terrence J Sejnowski An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995. ftp://ftpcnlsalkedu/pub/tony/bellblindpsZ A. Hyvärinen and E Oja Independent component analysis: Algorithms and applications. Neural Networks, 13(4–5):411–430, 2000 http://www.cishutfi/aapo/papers/IJCNN99 tutorialweb/ A. Varga and R Moore Hidden markov model decomposition of speech and noise In Proceedings of the 1990 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 845–848, 1990. M.JF Gales and SJ Young Hmm recognition in noise using parallel model combination. In Proc Eurospeech-93, volume 2, pages 837–840, 1993 S. Roweis One-microphone source separation In NIPS, volume 11, pages 609–616 MIT Press, Cambridge MA, 2001. Jon
Barker, Martin Cooke, and Daniel P. W Ellis Decoding speech in the presence of other sources. Speech Communication, 45(1):5–25, 2005 URL http://www.eecolumbiaedu/~dpwe/pubs/BarkCE05-sfdpdf E6820 (Ellis & Mandel)  L10: Signal separation  April 14, 2009  45 / 45   Source: http://www.doksinet  EE E6820: Speech & Audio Processing & Recognition  Lecture 10: Music Analysis Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820  April 17, 2008 1  Music transcription  2  Score alignment and musical structure  3  Music information retrieval  4  Music browsing and recommendation  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  1 / 40   Source: http://www.doksinet  Outline  1  Music transcription  2  Score alignment and musical structure  3  Music information retrieval  4  Music browsing and recommendation  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  2 / 40   Source:
http://www.doksinet  Music Transcription Basic idea: recover the score Frequency  4000 3000 2000 1000 0 0  0.5  1  1.5  2  2.5  3  3.5  4  4.5  5  Time  Is it possible? Why is it hard? I music students do it .   but they are highly trained; know the rules  Motivations for study: what was played? highly compressed representation (e.g MIDI) I the ultimate restoration system.    I I  Not trivial to turn a “piano roll” into a score I I  meter determination, rhythmic quantization key finding, pitch spelling  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  3 / 40   Source: http://www.doksinet  Transcription framework Recover discrete events to explain signal Note events ? Observations {tk, pk, ik}  I  synthesis  X[k,n]  analysis-by-synthesis?  Exhaustive search? would be possible given exact note waveforms .   or just a 2-dimensional ‘note’ template? I  note template  2-D convolution I  but superposition is not linear in |STFT| space  Inference depends on all detected
notes I I  is this evidence ‘available’ or ‘used’ ? full solution is exponentially complex  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  4 / 40   Source: http://www.doksinet  Problems for transcription  Music is practically worst case! I  note events are often synchronized  I  notes have harmonic relations (2:3 etc.)   defeats common onset  collision/interference between harmonics I  variety of instruments, techniques, .    Listeners are very sensitive to certain errors .   and impervious to others  Apply further constraints I I  like our ‘music student’ maybe even the whole score!  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  5 / 40   Source: http://www.doksinet  Types of transcription  Full polyphonic transcriptions is hard, but maybe unnecessary Melody transcription I I  the melody is produced to stand out useful for query-by-humming, summarization, score following  Chord transcription I I  consider the signal holistically useful for
finding structure  Drum transcription I I  very different from other instruments can’t use harmonic models  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  6 / 40   Source: http://www.doksinet  Spectrogram Modeling Sinusoid model as with synthesis, but signal is more complex  Break tracks I  need to detect new ‘onset’ at single frequencies 0.06 0.04 0.02 0 0  0.5  1  1.5  3000  freq / Hz  I  2500 2000 1500 1000 500 0  0  1  2  3  4  time / s  time / s  Group by onset & common harmonicity find sets of tracks that start around the same time + stable harmonic pattern I  Pass on to constraint-based filtering.   Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  7 / 40   Source: http://www.doksinet  Searching for multiple pitches (Klapuri, 2005) At each frame: estimate dominant f0 by checking for harmonics cancel it from spectrum I repeat until no f0 is prominent I  I  Subtract & iterate  audio frame  level / dB  Harmonics enhancement  Predominant f0
estimation  f0 spectral smoothing  60 40  dB  dB  20 10  0 -20  0  20  30 0  1000  2000  3000  frq / Hz  20 10  10 0  0 0  -10 0  1000  2000  3000  4000 frq / Hz  0  0  100  200  300  1000  2000  3000  frq/Hz  f0 / Hz  Stop when no more prominent f0s  Can use pitch predictions as features for further processing e.g HMM Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  8 / 40   Source: http://www.doksinet  Probabilistic Pitch Estimates (Goto, 2001) Generative probabilistic model of spectrum as weighted combination of tone models at different fundamental frequencies: ! Z X p(x(f )) = w (F , m)p(x(f ) | F , m) dF m  ‘Knowledge’ in terms of tone models + prior distributions for f0 :  p(x,1 | F,2, µ(t)(F,2)) c(t)(1|F,2) (t)  c (2|F,2)  Tone Model (m=1) p(x | F,1, µ(t)(F,1))  c(t)(4|F,2)  c(t)(2|F,1)  fundamental frequency F+1200  Music analysis  c(t)(3|F,2)  c(t)(1|F,1) F  EM (RT) results:  Michael Mandel (E6820 SAPR)  Tone Model (m=2) p(x | F,2, µ(t)(F,2))  c(t)(3|F,1)
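A toy version of the estimate-and-cancel loop from the Klapuri slide above: score candidate fundamentals by summing spectral magnitude at their harmonics, take the strongest, cancel its harmonics from the spectrum, and repeat. The candidate range, harmonic count, and stopping threshold are arbitrary choices for illustration, not the published algorithm.

import numpy as np

def iterative_f0s(mag, sr, n_fft, fmin=80, fmax=500, n_harm=8, max_voices=4, thresh=0.2):
    """mag: magnitude spectrum of one frame (length n_fft//2 + 1).
    Repeatedly pick the candidate f0 with the largest harmonic sum, cancel
    its harmonics, and stop when no candidate is prominent any more."""
    spec = mag.copy()
    cands = np.arange(fmin, fmax, 1.0)
    f0s, first_score = [], None
    for _ in range(max_voices):
        scores = np.array([sum(spec[int(round(h * f0 * n_fft / sr))]
                               for h in range(1, n_harm + 1)
                               if h * f0 < sr / 2) for f0 in cands])
        best = scores.argmax()
        if first_score is None:
            first_score = scores[best]
        if scores[best] < thresh * first_score:
            break
        f0s.append(cands[best])
        for h in range(1, n_harm + 1):               # cancel the detected voice
            k = int(round(h * cands[best] * n_fft / sr))
            if k < len(spec):
                spec[max(k - 2, 0):k + 3] = 0.0
    return f0s

# toy frame containing two harmonic tones at 110 Hz and 165 Hz
sr, n_fft = 8000, 4096
t = np.arange(n_fft) / sr
x = sum(np.sin(2 * np.pi * f * h * t) / h for f in (110, 165) for h in range(1, 6))
mag = np.abs(np.fft.rfft(x * np.hanning(n_fft)))
print(iterative_f0s(mag, sr, n_fft))     # should report values near 110 and 165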
c(t)(4|F,1) F+1902 F+2400  x [cent]  x [cent]  April 17, 2008  9 / 40   Source: http://www.doksinet  Generative Model Fitting (Cemgil et al., 2006)  Multi-level graphical model rj,t : piano roll note-on indicators sj,t : oscillator state for each harmonic I yj,t : time-domain realization of each note separately I yt : combined time-domain waveform (actual observation)  I I  Incorporates knowledge of high-level musical structure and low-level acoustics Inference exact in some cases, approximate in others I  special case of the generally intractable switching Kalman filter  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  10 / 40   Source: http://www.doksinet  Transcription as Pattern Recognition (Poliner and Ellis) Existing methods use prior knowledge about the structure of pitched notes i.e we know they have regular harmonics  What if we didn’t know that, but just had examples and features? I  the classic pattern recognition problem  Could use music signal as evidence for
pitch class in a black-box classifier: Audio  I  Trained classifier  p("C0"|Audio) p("C#0"|Audio) p("D0"|Audio) p("D#0"|Audio) p("E0"|Audio) p("F0"|Audio)  nb: more than one class at once!  But where can we get labeled training data? Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  11 / 40   Source: http://www.doksinet  Ground Truth Data Pattern classifiers need training data i.e need {signal, note-label} sets i.e MIDI transcripts of real music   already exists?  Make sound out of midi I I  “play” midi on a software synthesizer record a player piano playing the midi  Make midi from monophonic tracks in a multi-track recording I  for melody, just need a capella tracks  Distort recordings to create more data I I  resample/detune any of the audio and repeat add in reverb or noise  Use a classifier to train a better classifier I I  alignment in the classification domain run SVM & HMM to label, use to retrain
SVM  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  12 / 40   Source: http://www.doksinet  Polyphonic Piano Transcription (Poliner and Ellis, 2007) Training data from player piano freq / pitch  Bach 847 Disklavier A6  20  A5 10  A4  0  A3 A2  -10  A1  -20 0  1  2  3  4  5  6  7  8  level / dB 9 time / sec  Independent classifiers for each note I  plus a little HMM smoothing  Nice results .   when test data resembles training Algorithm  Errs  False Pos  False Neg  d0  SVM Klapuri & Ryynänen Marolt  43.3% 66.6% 84.6%  27.9% 28.1% 36.5%  15.4% 38.5% 48.1%  3.44 2.71 2.35  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  13 / 40   Source: http://www.doksinet  Outline  1  Music transcription  2  Score alignment and musical structure  3  Music information retrieval  4  Music browsing and recommendation  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  14 / 40   Source: http://www.doksinet  Midi-audio alignment  Pattern classifiers need training
data i.e need {signal, note-label} sets i.e MIDI transcripts of real music   already exists?  Idea: force-align MIDI and original I I  can estimate time-warp relationships recover accurate note events in real music!  Also useful by itself I I  comparing performances of the same piece score following, etc.  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  15 / 40   Source: http://www.doksinet  Which features to align with? Transcription classifier  posteriors  Playback piano  Recorded Audio  chroma STFT  Feature Analysis Warped MIDI  tr  Reference MIDI  t (t ) w r  DTW MIDI score  MIDI to WAV  Warping Function  tw  Synthesized Audio  chroma  ^ t (t ) r w  STFT  Audio features: STFT, chromagram, classification posteriors Midi features: STFT of synthesized audio, derived chromagram, midi score Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  16 / 40   Source: http://www.doksinet  Alignment example Dynamic time warp can overcome timing variations  Michael Mandel
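One of the audio features fed to DTW in the alignment diagram above is the chromagram. A rough sketch (assuming A4 = 440 Hz tuning and ignoring the perceptual weighting a real implementation would use):

import numpy as np

def chromagram(mag_spec, sr, n_fft, a4=440.0):
    """Fold a magnitude spectrogram (frames x n_fft/2+1) onto 12 pitch classes:
    each bin above ~25 Hz is assigned to the chroma of its nearest semitone."""
    n_bins = mag_spec.shape[1]
    freqs = np.arange(n_bins) * sr / n_fft
    chroma_of_bin = np.full(n_bins, -1)
    valid = freqs > 25.0
    semis = 12 * np.log2(freqs[valid] / a4)                 # semitones re A4
    chroma_of_bin[valid] = (np.round(semis).astype(int) + 9) % 12   # 0 = C
    C = np.zeros((mag_spec.shape[0], 12))
    for pc in range(12):
        C[:, pc] = mag_spec[:, chroma_of_bin == pc].sum(axis=1)
    return C / (C.max(axis=1, keepdims=True) + 1e-9)        # per-frame normalization

# an A-major-ish test frame: tones at A4, C#5, E5 should light up chroma 9, 1, 4
sr, n_fft = 8000, 2048
t = np.arange(n_fft) / sr
x = sum(np.sin(2 * np.pi * f * t) for f in (440.0, 554.37, 659.26))
mag = np.abs(np.fft.rfft(x * np.hanning(n_fft)))[None, :]
print(chromagram(mag, sr, n_fft)[0].round(2))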
(E6820 SAPR)  Music analysis  April 17, 2008  17 / 40   Source: http://www.doksinet  Segmentation and structure Find contiguous regions that are internally similar and different from neighbors e.g “self-similarity” matrix (Foote, 1997) time / 183ms frames  DYWMB - self similarity 1800 1600 1400 1200 1000 800 600 400 200 0  0  500  1000  1500  time / 183ms frames  100  200  300  400  500  2D convolution of checkerboard down diagonal = compare fixed windows at every point I  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  18 / 40   Source: http://www.doksinet  BIC segmentation Use evidence from whole segment, not just local window Do ‘significance test’ on every possible division of every possible context  BIC :  log  L(X1 ; M1 )L(X2 ; M2 ) λ ≷ log(N)#(M) L(X ; M0 ) 2  Eventually, a boundary is found:  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  19 / 40   Source: http://www.doksinet  HMM segmentation Recall, HMM Viterbi path is joint
classification and segmentation e.g for singing/accompaniment segmentation  But: HMM states need to be defined in advance I I  define a ‘generic set’ ? (MPEG7) learn them from the piece to be segmented? (Logan and Chu, 2000; Peeters et al., 2002)  Result is ‘anonymous’ state sequence characteristic of particular piece U2-The Joshua Tree-01-Where The Streets Have No Name 33677 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 1 7 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 1 1 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1 5 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 3 3 3 3
3 3 3 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  20 / 40   Source: http://www.doksinet  Finding Repeats  Music frequently repeats main phrases Repeats give off-diagonal ridges in Similarity matrix (Bartsch and Wakefield, 2001) time / 183ms frames  DYWMB - self similarity 1800 1600 1400 1200 1000 800 600 400 200 0  0  500  1000  1500  time / 183ms frames  Or: clustering at phrase-level .   Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  21 / 40   Source: http://www.doksinet  Music summarization  What does it mean to ‘summarize’ ? compact representation of larger entity maximize ‘information content’ I sufficient to recognize known item I I  So summarizing music? short version e.g < 10% duration (< 20s for pop) sufficient to identify style, artist e.g chorus or ‘hook’ ? I  I  Why? browsing existing collection discovery among unknown works I commerce.    I I  Michael Mandel (E6820 SAPR) 
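The self-similarity matrix and checkerboard-kernel novelty used in the segmentation and repeat-finding slides above can be sketched as follows (cosine similarity between feature frames; the kernel size and the synthetic two-section 'song' are illustrative):

import numpy as np

def self_similarity(F):
    """F: (frames, dims) feature matrix. Cosine similarity between all frame pairs."""
    Fn = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-9)
    return Fn @ Fn.T

def checkerboard_novelty(S, w=8):
    """Correlate a 2w x 2w checkerboard kernel down the diagonal of the
    self-similarity matrix S; peaks suggest segment boundaries."""
    k = np.ones((2 * w, 2 * w))
    k[:w, w:] = -1
    k[w:, :w] = -1
    nov = np.zeros(S.shape[0])
    for i in range(w, S.shape[0] - w):
        nov[i] = np.sum(S[i - w:i + w, i - w:i + w] * k)
    return nov

# toy features: 40 frames of 'verse' followed by 40 frames of 'chorus'
rng = np.random.default_rng(0)
F = np.vstack([rng.normal(0, 0.1, (40, 6)) + np.array([1, 0, 0, 0, 0, 0]),
               rng.normal(0, 0.1, (40, 6)) + np.array([0, 1, 0, 0, 0, 0])])
nov = checkerboard_novelty(self_similarity(F))
print(int(np.argmax(nov)))     # boundary detected near frame 40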
Music analysis  April 17, 2008  22 / 40   Source: http://www.doksinet  Outline  1  Music transcription  2  Score alignment and musical structure  3  Music information retrieval  4  Music browsing and recommendation  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  23 / 40   Source: http://www.doksinet  Music Information Retrieval  Text-based searching concepts for music? “Google of music” finding a specific item I finding something vague I finding something new I I  Significant commercial interest Basic idea: Project music into a space where neighbors are “similar” (Competition from human labeling)  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  24 / 40   Source: http://www.doksinet  Music IR: Queries & Evaluation What is the form of the query? Query by humming I I  considerable attention, recent demonstrations need/user base?  Query by noisy example “Name that tune” in a noisy bar Shazam Ltd.: commercial deployment I database access is the
hard part? I I  Query by multiple examples I  “Find me more stuff like this”  Text queries? (Whitman and Smaragdis, 2002) Evaluation problems I I  requires large, shareable music corpus! requires a well-defined task  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  25 / 40   Source: http://www.doksinet  Genre Classification (Tzanetakis et al., 2001) Classifying music into genres would get you some way towards finding “more like this” Genre labels are problematic, but they exist Real-time visualization of “GenreGram”:  9 spectral and 8 rhythm features every 200ms 15 genres trained on 50 examples each I single Gaussian model  ∼ 60% correct I I  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  26 / 40   Source: http://www.doksinet  Artist Classification (Berenzweig et al., 2002) Artist label as available stand-in for genre Train MLP to classify frames among 21 artists Using only “voice” segments: I  Song-level accuracy improves 56.7%  649% Track
117 - Aimee Mann (dynvox=Aimee, unseg=Aimee)  true voice Michael Penn The Roots The Moles Eric Matthews Arto Lindsay Oval Jason Falkner Built to Spill Beck XTC Wilco Aimee Mann The Flaming Lips Mouse on Mars Dj Shadow Richard Davies Cornelius Mercury Rev Belle & Sebastian Sugarplastic Boards of Canada  0  50  100  150  200  time / sec  Track 4 - Arto Lindsay (dynvox=Arto, unseg=Oval) true voice Michael Penn The Roots The Moles Eric Matthews Arto Lindsay Oval Jason Falkner Built to Spill Beck XTC Wilco Aimee Mann The Flaming Lips Mouse on Mars Dj Shadow Richard Davies Cornelius Mercury Rev Belle & Sebastian Sugarplastic Boards of Canada  0  10  Michael Mandel (E6820 SAPR)  20  30  40  Music analysis  50  60  70  80  time / sec  April 17, 2008  27 / 40   Source: http://www.doksinet  Textual descriptions Classifiers only know about the sound of the music not its context So collect training data just about the sound I I  Have humans label short, unidentified clips Reward for being
relevant and thorough   MajorMiner Music game  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  28 / 40   Source: http://www.doksinet  Tag classification  Use tags to train classifiers I  ‘autotaggers’  Treat each tag separately, easy to evaluate performance No ‘true’ negatives I  approximate as absence of one tag in the presence of others  rap hip hop saxophone house techno r&b jazz ambient beat electronica electronic dance punk fast rock country strings synth distortion guitar female drum machine pop soft piano slow british male solo 80s instrumental keyboard vocal bass voice drums 0.4  Michael Mandel (E6820 SAPR)  Music analysis  0.5  0.6 0.7 0.8 0.9 Classification accuracy  1  April 17, 2008  29 / 40   Source: http://www.doksinet  Outline  1  Music transcription  2  Score alignment and musical structure  3  Music information retrieval  4  Music browsing and recommendation  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  30 / 40   Source:
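Tag classification as described above amounts to one independent binary classifier per tag. A hypothetical sketch using scikit-learn logistic regression on made-up features and labels, treating the absence of a tag on an otherwise-labeled clip as a negative example, as in the slide:

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_autotaggers(X, tag_matrix, tags):
    """X: (clips, features); tag_matrix: (clips, tags) 0/1 from human labels.
    Trains one binary classifier per tag; a clip counts as a negative for a
    tag if it received other tags but not this one."""
    models = {}
    labeled = tag_matrix.sum(axis=1) > 0            # clips with at least one tag
    for j, tag in enumerate(tags):
        y = tag_matrix[labeled, j]
        if 0 < y.sum() < len(y):                    # need both classes to train
            models[tag] = LogisticRegression(max_iter=1000).fit(X[labeled], y)
    return models

# made-up example: 200 clips, 5 audio features, 3 tags
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
tags = ["rock", "piano", "electronic"]
T = (X[:, :3] + rng.normal(0, 0.5, (200, 3)) > 0.5).astype(int)  # synthetic labels
models = train_autotaggers(X, T, tags)
print({t: float(m.score(X, T[:, tags.index(t)])) for t, m in models.items()})
# (scored on the training clips purely to show the interface)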
http://www.doksinet  Music browsing  Most interesting problem in music IR is finding new music I  is there anything on myspace that I would like?  Need a feature space where music/artists are arranged according to perceived similarity Particularly interested in little-known bands I I  little or no ‘community data’ (e.g collaborative filtering) audio-based measures are critical  Also need models of personal preference I I  where in the space is stuff I like relative sensitivity to different dimensions  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  31 / 40   Source: http://www.doksinet  Unsupervised Clustering (Rauber et al., 2002) Map music into an auditory-based space Build ‘clusters’ of nearby  similar music I  “Self-Organizing Maps” (Kohonen)  Look at the results: ga-iwantit ga-iwantit verve-bittersweetga-moneymilk ga-moneymilk verve-bittersweet party party  backforgood backforgood bigworld bigworld nma-poison nma-poison br-anesthesia whenyousay
br-anesthesia whenyousay youlearn youlearn  br-punkrock br-punkrock limp-lesson limp-lesson limp-stalemate limp-stalemate  addict addict ga-lie ga-lie ga-innocent ga-innocent pinkpanther pinkpanther ga-time ga-time vm-classicalgas vm-classicalgas vm-toccata vm-toccata  d3-kryptonite d3-kryptonite fbs-praise fbs-praise ga-anneclaire ga-anneclaire limp-rearranged korn-freak limp-rearranged limp-nobodyloves ga-heaven korn-freak limp-nobodyloves ga-heaven limp-show limp-show pr-neverenoughlimp-wandering pr-neverenough limp-99 limp-99 limp-wandering pr-deadcell pr-deadcell rem-endoftheworld pr-binge pr-binge wildwildwest wildwildwestrem-endoftheworld  “Islands of music” I  quantitative evaluation?  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  32 / 40   Source: http://www.doksinet  Artist Similarity  Artist classes as a basis for overall similarity: Less corrupt than ‘record store genres’ ? paula abdul abba roxette toni braxton westlife aaron carter erasure lara
fabian jessica simpson mariah carey new order janet jackson a ha whitney houston eiffel 65 celine dionpet shop boys christina aguilera aqua lauryn hill duran y spears sade soft cell all saints backstreet boys madonna prince spice girlsbelinda carlisle ania twain nelly furtado jamiroquai annie lennox cultur ace of base seal savage garden matthew sweet faith hill tha mumba  But: what is similarity between artists? I  pattern recognition systems give a number.    Need subjective ground truth: Collected via web site I  www.musicseercom  Results: 1800 users, 22,500 judgments collected over 6 months Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  33 / 40   Source: http://www.doksinet  Anchor space  A classifier trained for one artist (or genre) will respond partially to a similar artist A new artist will evoke a particular pattern of responses over a set of classifiers We can treat these classifier outputs as a new feature space in which to estimate similarity n-dimensional
vector in "Anchor Space"  Anchor Anchor  p(a1|x) Audio Input (Class i)  AnchorAnchor  p(a2n-dimensional |x) vector in "Anchor Space"  Anchor Audio Input (Class j)  GMM Modeling Similarity Computation  p(a1|x)p(an|x) p(a2|x)  Anchor  GMM Modeling KL-d, EMD, etc.  Conversion to Anchorspace p(an|x)  Conversion to Anchorspace  “Anchor space” reflects subjective qualities? Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  34 / 40   Source: http://www.doksinet  Playola interface ( http://www.playolaorg ) Browser finds closest matches to single tracks or entire artists in anchor space Direct manipulation of anchor space axes  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  35 / 40   Source: http://www.doksinet  Tag-based search ( http://majorminer.com/search ) Two ways to search MajorMiner tag data Use human labels directly I I  (almost) guaranteed to be relevant small dataset  Use autotaggers trained on human labels I I  can only train
classifiers for labels with enough clips once trained, can label unlimited amounts of music  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  36 / 40   Source: http://www.doksinet  Music recommendation Similarity is only part of recommendation I need familiar items to build trust .   in the unfamiliar items (serendipity)  Can recommend based on different amounts of history I I  none: particular query, like search lots: incorporate every song you’ve ever listened to  Can recommend from different music collections I I  personal music collection: “what am I in the mood for?” online databases: subscription services, retail  Appropriate music is a subset of good music? Transparency builds trust in recommendations See (Lamere and Celma, 2007)  Michael Mandel (E6820 SAPR)  Music analysis  April 17, 2008  37 / 40   Source: http://www.doksinet  Evaluation  Are recommendations good or bad? Subjective evaluation is the ground truth .   but subjects don’t know the bands being
Summary

Music transcription
- hard, but some progress

Score alignment and musical structure
- making good training data takes work, has other benefits

Music IR
- alternative paradigms, lots of interest

Music recommendation
- potential for biggest impact, difficult to evaluate

Parting thought: data-driven machine learning techniques are valuable in each case

References

A. P. Klapuri. A perceptually motivated multiple-F0 estimation method. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 291–294, 2005.
M. Goto. A predominant-F0 estimation method for CD recordings: MAP estimation using EM algorithm for adaptive tone models. In IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, volume 5, pages 3365–3368, 2001.
A. T. Cemgil, H. J. Kappen, and D. Barber. A generative model for music transcription. IEEE Transactions on Audio, Speech, and Language Processing, 14(2):679–694, 2006.
G. E. Poliner and D. P. W. Ellis. A discriminative model for polyphonic piano transcription. EURASIP J. Appl. Signal Process., pages 154–154, January 2007.
J. Foote. A similarity measure for automatic audio classification. In Proc. AAAI Spring Symposium on Intelligent Integration and Use of Text, Image, Video, and Audio Corpora, March 1997.
B. Logan and S. Chu. Music summarization using key phrases. In IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, volume 2, pages 749–752, 2000.
G. Peeters, A. La Burthe, and X. Rodet. Toward automatic music audio summary generation from signal analysis. In Proc. Intl. Symp. Music Information Retrieval, October 2002.
M. A. Bartsch and G. H. Wakefield. To catch a chorus: Using chroma-based representations for audio thumbnailing. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 2001.
B. Whitman and P. Smaragdis. Combining musical and cultural features for intelligent style detection. In Proc. Intl. Symp. Music Information Retrieval, October 2002.
A. Rauber, E. Pampalk, and D. Merkl. Using psychoacoustic models and self-organizing maps to create a hierarchical structuring of music by musical styles. In Proc. Intl. Symp. Music Information Retrieval, October 2002.
G. Tzanetakis, G. Essl, and P. Cook. Automatic musical genre classification of audio signals. In Proc. Intl. Symp. Music Information Retrieval, October 2001.
A. Berenzweig, D. Ellis, and S. Lawrence. Using voice segments to improve artist classification of music. In Proc. AES-22 Intl. Conf. on Virt., Synth., and Ent. Audio, June 2002.
P. Lamere and O. Celma. Music recommendation tutorial. In Proc. Intl. Symp. Music Information Retrieval, September 2007.
Analysis of Everyday Sounds
Dan Ellis and Keansub Lee
Laboratory for Recognition and Organization of Speech and Audio
Dept. of Electrical Engineering, Columbia University, NY, USA
dpwe@ee.columbia.edu

1. Personal and Consumer Audio
2. Segmenting & Clustering
3. Special-Purpose Detectors
4. Generic Concept Detectors
5. Challenges & Future

LabROSA Overview

[Figure: LabROSA overview — information extraction (recognition, separation, retrieval) from music, environment, and speech, built on machine learning and signal processing]

1. Personal Audio Archives

• Easy to record everything you hear
- <2 GB / week @ 64 kbps
• Hard to find anything
- how to scan? how to visualize? how to index?
• Need automatic analysis
• Need minimal impact
Personal Audio Applications

• Automatic appointment-book
- history fills in the when & where of movements
• "Life statistics"
- how long did I spend in meetings this week?
- most frequent conversations? favorite phrases?
• Retrieving details
- what exactly did I promise?
- privacy issues...
• Nostalgia
• ... or what?

Consumer Video

• Short video clips as the evolution of snapshots
- 10-60 sec, one location, no editing
- browsing?
• More information for indexing:
- video + audio
- foreground + background

Information in Audio

• Environmental recordings contain info on:
- location – type (restaurant, street, ...) and specific
- activity – talking, walking, typing
- people – generic (2 males), specific (Chuck & John)
- spoken content ... maybe
• but not:
- what people and things "looked like"
- day/night
- ... except when correlated with audible features
A Brief History of Audio Processing

• Environmental sound classification draws on earlier sound classification work as well as source separation.

[Figure: lineage diagram — speech recognition and speaker ID (GMM-HMMs), source separation (one-channel / multi-channel, model-based / cue-based), and music audio genre & artist ID, all feeding into soundtrack & environmental recognition]

2. Segmentation & Clustering

• Top-level structure for long recordings:
- where are the major boundaries? e.g. for a diary application
- support for manual browsing
• Length of fundamental time-frame
- 60 s rather than 10 ms?
- background more important than foreground
- average out uncharacteristic transients
• Perceptually-motivated features
- ... so results have perceptual relevance
- broad spectrum + some detail
MFCC Features

• Need "timbral" features: Mel-Frequency Cepstral Coefficients (MFCCs)
- auditory-like frequency warping
- log-domain
- discrete cosine transform = orthogonalization

[Figure: MFCC computation block diagram]
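A minimal sketch of the MFCC pipeline in the diagram (|FFT| → mel-spaced triangular filterbank → log → DCT), in Python/NumPy + SciPy; the frame length, filter count, and number of coefficients below are arbitrary assumptions rather than the settings used in the talk.

    import numpy as np
    from scipy.fftpack import dct

    def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
    def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
        """MFCCs for one frame: |FFT| -> mel filterbank -> log -> DCT."""
        spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
        # Triangular filters spaced evenly on the mel scale
        edges = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
        fbank = np.zeros(n_filters)
        for j in range(n_filters):
            lo, ctr, hi = edges[j], edges[j + 1], edges[j + 2]
            up   = np.clip((freqs - lo) / (ctr - lo), 0, 1)
            down = np.clip((hi - freqs) / (hi - ctr), 0, 1)
            fbank[j] = np.sum(spec * np.minimum(up, down))
        logfb = np.log(fbank + 1e-10)             # log-domain
        return dct(logfb, norm='ortho')[:n_ceps]  # DCT = orthogonalization

    # e.g. a 25 ms frame of a signal sampled at 16 kHz
    x = np.random.randn(400)
    print(mfcc_frame(x, sr=16000)[:5])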
Long-Duration Features (Ellis & Lee '04)

[Figure: six panels over a day-long recording (freq/bark vs. time/min) — average linear energy, normalized energy deviation, average log energy, log energy deviation, average spectral entropy, spectral entropy deviation]

• Capture both average and variation
• Capture a little more detail in subbands

Spectral Entropy

• Auditory spectrum: A[n, j] = Σ_{k=0..NF} w_jk · X[n, k]
• Spectral entropy ≈ 'peakiness' of each band:
  H[n, j] = − Σ_{k=0..NF} ( w_jk · X[n, k] / A[n, j] ) · log( w_jk · X[n, k] / A[n, j] )

[Figure: FFT spectral magnitude and auditory spectrum (energy/dB vs. freq/Hz), and the resulting per-band spectral entropies (rel. entropy/bits)]
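The auditory spectrum and per-band entropy above map directly onto a few lines of NumPy; the band-weighting matrix W (the w_jk) is assumed to be given, and the random arrays at the end are placeholders for real spectra.

    import numpy as np

    def band_spectral_entropy(X, W, eps=1e-12):
        """Auditory spectrum and per-band spectral entropy for one frame.
        X: (num_freq_bins,) spectral magnitudes/energies X[n, k].
        W: (num_bands, num_freq_bins) band weighting matrix w[j, k]."""
        WX = W * X[np.newaxis, :]           # w_jk * X[n, k]
        A = WX.sum(axis=1)                  # auditory spectrum A[n, j]
        P = WX / (A[:, np.newaxis] + eps)   # normalize each band to a distribution
        H = -np.sum(P * np.log2(P + eps), axis=1)   # entropy in bits per band
        return A, H

    # Toy example: 20 Bark-like bands over a 257-bin magnitude spectrum
    W = np.abs(np.random.randn(20, 257))    # stand-in for the real band weights
    X = np.abs(np.random.randn(257))
    A, H = band_spectral_entropy(X, W)
    print(A.shape, H.shape)                 # (20,), (20,)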
BIC Segmentation (Chen & Gopalakrishnan '98)

• BIC (Bayesian Information Criterion) compares models:
  log [ L(X1; M1) · L(X2; M2) / L(X; M0) ] ≷ (λ/2) · log(N) · Δ#(M)
  where M0 models the whole context window X, M1 and M2 model the two parts X1, X2 on either side of the candidate boundary, and Δ#(M) is the increase in the number of model parameters.

[Figure: average log auditory spectrum and BIC score vs. time for recording 2004-09-10-1023 — a boundary is declared where the BIC score rises above zero ("boundary passes BIC"), otherwise there is no boundary within the shorter context; X1/M1 runs from the last segmentation point to the candidate boundary, X2/M2 from the candidate boundary to the current context limit]
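A sketch of the BIC test under the usual single-full-covariance-Gaussian assumption for M0, M1, and M2; the penalty weight λ is a tunable parameter, and the toy data at the end are only a sanity check (this is not the system evaluated on the next slide).

    import numpy as np

    def gauss_loglik(X):
        """Log-likelihood of frames X (N x d) under an ML-fit full-covariance Gaussian."""
        N, d = X.shape
        cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        _, logdet = np.linalg.slogdet(cov)
        # For the ML Gaussian the quadratic term sums to approximately N*d, so:
        return -0.5 * N * (d * np.log(2 * np.pi) + logdet + d)

    def bic_score(X, t, lam=1.0):
        """BIC gain for splitting frames X at index t (positive => boundary supported)."""
        N, d = X.shape
        delta_params = d + d * (d + 1) / 2        # extra mean + covariance parameters
        gain = gauss_loglik(X[:t]) + gauss_loglik(X[t:]) - gauss_loglik(X)
        return gain - 0.5 * lam * np.log(N) * delta_params

    # Toy check: a real change-point should score higher than a spurious one
    X = np.vstack([np.random.randn(200, 6), np.random.randn(200, 6) + 3.0])
    print(bic_score(X, 200), bic_score(X, 50))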
BIC Segmentation Results

• Evaluate: 62 hr hand-marked dataset
- 8 days, 139 segments, 16 categories
- measure Correct Accept % @ False Accept = 2%:

Feature                  Correct Accept
μdB                      80.8%
μH                       81.1%
σH/μH                    81.6%
μdB + σH/μH              84.0%
μdB + σH/μH + μH         83.6%
mfcc                     73.6%

[Figure: ROC curves (sensitivity vs. 1 − specificity) for the same feature combinations]

Segment Clustering

• Daily activity has lots of repetition:
- automatically cluster similar segments
- 'affinity' of segments as KL2 distances

[Figure: segment-affinity matrix with hand-labelled categories (e.g. lecture, meeting, street, restaurant, library, campus, home, karaoke); labels garbled in extraction]
Spectral Clustering (Ng, Jordan, Weiss '01)

• Eigenanalysis of affinity matrix: A = U·S·V'
- SVD components u_k · s_kk · v_k'

[Figure: affinity matrix and its first four SVD components (k = 1 ... 4)]

• eigenvectors v_k give cluster memberships
• Number of clusters?
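A compressed Ng–Jordan–Weiss-style recipe, assuming the segment-to-segment distances (e.g. the KL2 distances above) are already available: convert distances to affinities, take the leading eigenvectors of the normalized affinity matrix, and run k-means on the row-normalized embedding. The kernel width sigma and the use of k-means are standard choices here, not details taken from the slides.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def spectral_cluster(D, k, sigma=1.0):
        """Cluster items given a pairwise distance matrix D (n x n)."""
        A = np.exp(-D**2 / (2 * sigma**2))           # affinity matrix
        np.fill_diagonal(A, 0.0)
        d = A.sum(axis=1)
        L = A / np.sqrt(np.outer(d, d) + 1e-12)      # normalized affinity D^-1/2 A D^-1/2
        vals, vecs = np.linalg.eigh(L)
        Y = vecs[:, -k:]                             # k leading eigenvectors as rows
        Y = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12)
        _, labels = kmeans2(Y, k, minit='++')
        return labels

    # Toy example: two blobs of 'segments', Euclidean distances -> 2 clusters
    pts = np.vstack([np.random.randn(20, 3), np.random.randn(20, 3) + 5])
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    print(spectral_cluster(D, 2))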
Clustering Results

• Clustering of automatic segments gives 'anonymous classes'
- BIC criterion to choose number of clusters
- make best correspondence to 16 ground-truth clusters
• Frame-level scoring gives ~70% correct
- errors when the same 'place' has multiple ambiences

Browsing Interface

• Browsing / diary interface
- links to other information (diary, email, photos)
- synchronize with note taking? (Stifelman & Arons)
- audio thumbnails
• Release tools + "how to" for capture

[Figure: screenshot of the diary interface — several days laid out against time of day, with segments labelled by cluster (office, outdoor, cafe, lecture, meeting, ...); labels garbled in extraction]
3. Special-Purpose Detectors: Speech

• Speech emerges as the most interesting content
• Just identifying speech would be useful
- goal is speaker identification / labeling
• Lots of background noise
- conventional Voice Activity Detection is inadequate
• Insight: listeners detect the pitch track (melody)
- look for voice-like periodicity in noise

[Figure: spectrogram of a coffeeshop excerpt (frequency vs. time)]

Voice Periodicity Enhancement (Lee & Ellis '06)

• Noise-robust subband autocorrelation
• Subtract local average
- suppresses steady background, e.g. machine noise
• 15 min test set; 88% accuracy (no suppression: 79%)
• also useful for enhancing speech by harmonic filtering
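A rough illustration of the idea, not the published detector: compute a short-time autocorrelogram in a few subbands, then subtract a running average over time so that steady background periodicity cancels and newly appearing, voice-like periodicity stands out. The filterbank, window lengths, and smoothing span are all arbitrary assumptions.

    import numpy as np
    from scipy.signal import butter, lfilter

    def subband_autocorr(x, sr, bands=((100, 400), (400, 1600), (1600, 4000)),
                         frame=0.032, hop=0.016, max_lag=0.02):
        """Short-time, energy-normalized autocorrelation per subband.
        Returns (num_bands, num_lags, num_frames)."""
        flen, fhop, nlag = int(frame * sr), int(hop * sr), int(max_lag * sr)
        starts = list(range(0, len(x) - flen, fhop))
        out = np.zeros((len(bands), nlag, len(starts)))
        for b, (lo, hi) in enumerate(bands):
            bb, aa = butter(2, [lo / (sr / 2), hi / (sr / 2)], btype='bandpass')
            xb = lfilter(bb, aa, x)
            for i, s in enumerate(starts):
                seg = xb[s:s + flen]
                ac = np.correlate(seg, seg, mode='full')[flen - 1:flen - 1 + nlag]
                out[b, :, i] = ac / (ac[0] + 1e-12)   # normalize by zero-lag energy
        return out

    def suppress_steady_background(acgram, win=31):
        """Subtract a running time-average from each band/lag trace, so periodicity
        that is always present (fans, hum) cancels and transient voicing stands out."""
        kernel = np.ones(win) / win
        smooth = np.apply_along_axis(lambda t: np.convolve(t, kernel, mode='same'),
                                     -1, acgram)
        return acgram - smooth

    # Hypothetical use on 2 s of mono audio at 16 kHz
    x = np.random.randn(2 * 16000)
    print(suppress_steady_background(subband_autocorr(x, 16000)).shape)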
Detecting Repeating Events (Ogle & Ellis '07)

• Recurring sound events can be informative
- indicate similar circumstances
• but:
- define "event" – sound organization
- define "recurring event" – how similar?
- ... and how to find them – tractable?
• Idea: use hashing (fingerprints)
- index points to other occurrences of each hash; the intersection of hashes points to a match – much quicker search
- use a fingerprint insensitive to background?

Shazam Fingerprints (A. Wang '06)

• Prominent spectral onsets are landmarks; use relations {f1, f2, Δt} as hashes

[Figure: spectrogram of a phone ring with its Shazam fingerprint landmarks]

• intrinsically robust to background noise
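A toy version of landmark hashing (not Wang's implementation): pick prominent spectrogram peaks, pair each with a few later peaks, and pack (f1, f2, Δt) into an integer key for a hash table of occurrence times; matching then means finding many hash hits that agree on a time offset. The peak-picking threshold and bit packing here are arbitrary choices.

    import numpy as np
    from collections import defaultdict

    def spectrogram_peaks(S, thresh_db=20):
        """Crude landmark picker: time-frequency bins that are local maxima and
        within thresh_db of their frame's maximum."""
        logS = 20 * np.log10(np.abs(S) + 1e-10)
        peaks = []
        for t in range(1, logS.shape[1] - 1):
            col = logS[:, t]
            for f in range(1, logS.shape[0] - 1):
                if (col[f] >= col.max() - thresh_db and col[f] > col[f - 1]
                        and col[f] > col[f + 1] and col[f] > logS[f, t - 1]
                        and col[f] > logS[f, t + 1]):
                    peaks.append((t, f))
        return peaks

    def landmark_hashes(peaks, fan_out=3, max_dt=63):
        """Pair each peak with up to fan_out later peaks; pack (f1, f2, dt) into an
        int (assumes f < 1024 and dt < 64 so the fields don't collide)."""
        hashes = []
        for i, (t1, f1) in enumerate(peaks):
            targets = [(t2, f2) for (t2, f2) in peaks[i + 1:] if 0 < t2 - t1 <= max_dt]
            for (t2, f2) in targets[:fan_out]:
                hashes.append(((f1 << 16) | (f2 << 6) | (t2 - t1), t1))
        return hashes

    # Index hash -> list of (recording id, onset frame); a query matches a stored
    # item when many of its hashes hit that item with a consistent time offset.
    index = defaultdict(list)
    S = np.abs(np.random.randn(64, 100))     # stand-in for a magnitude spectrogram
    for h, t in landmark_hashes(spectrogram_peaks(S)):
        index[h].append(("recording-1", t))
    print(len(index), "hash keys stored")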
Exhaustive Search for Repeats

• More selective hashes
- few hits required to confirm a match (faster; better precision)
- but less robust to background (reduced recall)
• Works well when exact structure repeats
- recorded music, electronic alerts
- no good for "organic" sounds, e.g. a garage door

[Figure: scatter plot from searching a scripted 45-minute recording against itself — each query event's time plotted against the times of its repeat occurrences; the self-matches form the main diagonal, and repeated events appear off-diagonal. Production audio and telephone rings are found easily, while single-tone bells and noise-like sounds are missed. Accompanying caption text garbled in extraction.]
Music Detector (Lee & Ellis '08)

• Two characteristic features for music
- strong, sustained periodicity (notes)
- clear, rhythmic repetition (beat)
- at least one should be present!

[Figure: block diagram — music audio → pitch-range subband autocorrelation → local stability measure, and rhythm-range envelope autocorrelation → perceptual rhythm model, fused into a music classifier]

• Noise-robust pitch detector looks for high-order autocorrelation
• Beat tracker ... from Music IR work

4. Generic Concept Detectors (Chang, Ellis et al. '07)

• Consumer Video application: how to assist browsing?
- system automatically tags recordings
- tags chosen by usefulness, feasibility
• Initial set of 25 tags defined:
- "animal", "baby", "cheer", "dancing", ...
- human annotation of 1300+ videos
- evaluate by average precision
• Multimodal detection
- separate audio + visual low-level detectors (then fused...)
MFCC Covariance Representation

• Each clip/segment → fixed-size statistics
- similar to speaker ID and music genre classification
• Full covariance matrix of MFCCs
- maps the kinds of spectral shapes present

[Figure: a video soundtrack spectrogram, its MFCC features, and the resulting MFCC covariance matrix]

• Clip-to-clip distances for an SVM classifier
- by KL divergence or a 2nd Gaussian model
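One way to realize the "KL or 2nd Gaussian model" bullet (an assumed construction, not necessarily the exact one used in the talk): summarize each clip as a single Gaussian over its MFCC frames, form a symmetrized-KL distance matrix, exponentiate it into a kernel, and train an SVM with a precomputed kernel. The kernel width gamma is a free parameter.

    import numpy as np
    from sklearn.svm import SVC

    def clip_gaussian(mfccs):
        """Summarize a clip's MFCC frames (n, d) as mean and full covariance."""
        return mfccs.mean(axis=0), np.cov(mfccs, rowvar=False) + 1e-6 * np.eye(mfccs.shape[1])

    def kl2(g0, g1):
        """Symmetrized KL divergence between two Gaussians given as (mu, cov)."""
        def kl(a, b):
            (m0, c0), (m1, c1) = a, b
            inv1, diff = np.linalg.inv(c1), m1 - m0
            return 0.5 * (np.trace(inv1 @ c0) + diff @ inv1 @ diff - len(m0)
                          + np.linalg.slogdet(c1)[1] - np.linalg.slogdet(c0)[1])
        return kl(g0, g1) + kl(g1, g0)

    def train_tag_svm(clips, labels, gamma=0.05):
        """Precomputed-kernel SVM over clip Gaussians: K = exp(-gamma * KL2)."""
        gaussians = [clip_gaussian(c) for c in clips]
        D = np.array([[kl2(a, b) for b in gaussians] for a in gaussians])
        K = np.exp(-gamma * D)
        return SVC(kernel='precomputed').fit(K, labels), gaussians, gamma

    # Toy usage: 20 clips of random 13-dim 'MFCC' frames with binary tags
    rng = np.random.default_rng(1)
    clips = [rng.normal(loc=(i % 2), size=(100, 13)) for i in range(20)]
    model, gaussians, gamma = train_tag_svm(clips, labels=[i % 2 for i in range(20)])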
GMM Histogram Representation

• Want a more 'discrete' description
- ... to accommodate nonuniformity in MFCC space
- ... to enable other kinds of models
• Divide up the feature space with a single global Gaussian Mixture Model
- ... then represent each clip by the components it uses

[Figure: global GMM fit over MFCC features, and a resulting per-category mixture-component histogram]
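A small sketch of this representation with scikit-learn: fit one global GMM on the pooled frames, then count which components each clip's frames fall into (hard assignment here; soft responsibilities via predict_proba would be the obvious variant). The component count and toy data are assumptions.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_global_gmm(all_frames, n_components=16):
        """Fit a single GMM to pooled MFCC frames from the whole collection."""
        return GaussianMixture(n_components=n_components,
                               covariance_type='full').fit(all_frames)

    def clip_histogram(gmm, clip_frames):
        """Represent a clip by the normalized count of the GMM component
        responsible for each of its frames."""
        comp = gmm.predict(clip_frames)                     # hard assignment per frame
        hist = np.bincount(comp, minlength=gmm.n_components).astype(float)
        return hist / hist.sum()

    # Toy usage: pool frames from several clips, then histogram each clip
    rng = np.random.default_rng(2)
    clips = [rng.normal(loc=rng.integers(0, 3), size=(200, 13)) for _ in range(5)]
    gmm = fit_global_gmm(np.vstack(clips))
    hists = np.array([clip_histogram(gmm, c) for c in clips])
    print(hists.shape)        # (num_clips, n_components)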
Latent Semantic Analysis (LSA) (Hofmann '99)

• Probabilistic LSA (pLSA) models each histogram as a mixture of several 'topics'
- ... each clip may have several things going on
• Topic sets optimized through EM:
  p(ftr | clip) = Σ_topics p(ftr | topic) · p(topic | clip)

[Figure: the clip-by-feature matrix p(ftr | clip) factored into p(ftr | topic) times p(topic | clip) over the GMM-histogram features]

• use p(topic | clip) as per-clip features
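The factorization above can be fit with a few lines of EM. This sketch implements basic pLSA on a clips × features count matrix (e.g. the GMM-component histograms), returning p(ftr | topic) and the per-clip p(topic | clip) used as features downstream; the topic count and iteration budget are arbitrary.

    import numpy as np

    def plsa(counts, n_topics=4, n_iter=100, seed=0):
        """Basic pLSA via EM on a (clips x features) count matrix.
        Returns p(ftr | topic) and p(topic | clip)."""
        rng = np.random.default_rng(seed)
        n_clips, n_ftrs = counts.shape
        p_f_z = rng.random((n_topics, n_ftrs));  p_f_z /= p_f_z.sum(axis=1, keepdims=True)
        p_z_c = rng.random((n_clips, n_topics)); p_z_c /= p_z_c.sum(axis=1, keepdims=True)
        for _ in range(n_iter):
            # E-step: responsibilities p(topic | clip, ftr), shape (clips, ftrs, topics)
            joint = p_z_c[:, None, :] * p_f_z.T[None, :, :]
            resp = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
            weighted = counts[:, :, None] * resp          # n(c, f) * p(z | c, f)
            # M-step: re-estimate p(ftr | topic) and p(topic | clip)
            p_f_z = weighted.sum(axis=0).T
            p_f_z /= p_f_z.sum(axis=1, keepdims=True) + 1e-12
            p_z_c = weighted.sum(axis=1)
            p_z_c /= p_z_c.sum(axis=1, keepdims=True) + 1e-12
        return p_f_z, p_z_c

    # Toy usage: GMM-component histograms (counts) for 10 clips over 16 components
    rng = np.random.default_rng(3)
    counts = rng.integers(0, 20, size=(10, 16)).astype(float)
    p_ftr_topic, p_topic_clip = plsa(counts)
    print(p_topic_clip.shape)     # (10, n_topics): the per-clip features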
Audio-Only Results

• Wide range of results:

[Figure: per-concept average precision for 1G + KL, 1G + Mahalanobis, GMM histogram + pLSA, and chance, over concepts from "one person", "group of 3+", "music", and "cheer" through "picnic" and "birthday"]

- audio concepts (music, ski) vs. non-audio concepts (group, night)
- large AP uncertainty on infrequent classes

How does it 'feel'?

• Browser impressions: how wrong is wrong?

[Figure: top 8 hits for "Baby"]

Confusion analysis

• Where are the errors coming from?

[Figure: (a) overlap of manual labels and (b) confusion matrix of classified labels across the 25 concepts]

Fused Results - AV Joint Boosting

• Audio helps in many classes

[Figure: per-concept average precision for the random baseline, video only, audio only, and A+V fusion, plus MAP]
5. Future: Temporal Focus

• Global vs. local class models
- tell-tale acoustics may be 'washed out' in statistics
- try iterative realignment of HMMs:

[Figure: spectrograms of clip "YT baby 002" (voice, baby, laugh) — old way: all frames contribute to every label; new way: each label claims a limited temporal extent, with a shared "background" (bg) model]

• "background" (bg) model shared by all clips

Handling Sound Mixtures

• MFCCs of mixtures ≠ mix of MFCCs
- recognition despite widely varying background?
- factorial models / Non-negative Matrix Factorization
- sinusoidal / landmark techniques

[Figure: spectrograms of a solo voice and an M+F voice mix, original vs. resynthesis from MFCCs + noise]
Larger Datasets

• Many detectors are visibly data-limited
- getting data is ~hard
- labeling data is expensive
• Bootstrap from YouTube etc.
- lots of web video is edited/dubbed – need a "consumer video" detector?
• Preliminary YouTube results disappointing
- downloaded data needed extensive clean-up
- models did not match the Kodak data
• (Freely available data!)

Conclusions

• Environmental sound contains information
- ... that's why we hear!
- ... computers can hear it too
• Personal audio can be segmented and clustered
- find specific sounds to help navigation/retrieval
• Consumer video can be 'tagged'
- ... even in unpromising cases
- audio is complementary to video
• Interesting directions for better models
References

• D. Ellis and K. S. Lee, "Minimal-Impact Audio-Based Personal Archives," First ACM Workshop on Continuous Archiving and Recording of Personal Experiences CARPE-04, New York, Oct. 2004, pp. 39-47.
• S. Chen and P. Gopalakrishnan, "Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion," in Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.
• A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in NIPS, MIT Press, Cambridge MA, 2001.
• K. Lee and D. Ellis, "Voice Activity Detection in Personal Audio Recordings Using Autocorrelogram Compensation," Proc. Interspeech ICSLP-06, pp. 1970-1973, Pittsburgh, Oct. 2006.
• J. Ogle and D. Ellis, "Fingerprinting to Identify Repeated Sound Events in Long-Duration Personal Audio Recordings," Proc. ICASSP-07, Hawai'i, pp. I-233-236.
• A. Wang, "The Shazam music recognition service," Communications of the ACM, 49(8):44–48, Aug. 2006.
• K. Lee and D. Ellis, "Detecting Music in Ambient Audio by Long-Window Autocorrelation," Proc. ICASSP-08, pp. 9-12, Las Vegas, April 2008.
• S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, and J. Luo, "Large-scale multimodal semantic concept detection for consumer video," Multimedia Information Retrieval Workshop, ACM Multimedia, Augsburg, Germany, Sep. 2007, pp. 255-264.
• T. Hofmann, "Probabilistic latent semantic indexing," in Proc. SIGIR, 1999.