Cocktail party algorithm SVD implementation... in one line of code?

In a slide from the introductory machine learning lecture that Stanford's Andrew Ng gives on Coursera, he presents the following one-line Octave solution to the cocktail party problem, given that the audio sources are recorded by two spatially separated microphones:

[W,s,v]=svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');

At the bottom of the slide is "Source: Sam Roweis, Yair Weiss, Eero Simoncelli", and at the bottom of an earlier slide is "Audio clips courtesy of Te-Won Lee". In the video, Professor Ng says:

"So you might look at unsupervised learning like this and ask, 'How complicated is it to implement this? It seems like in order to build this application, to do this audio processing, you would need to write a ton of code, or link into a bunch of C++ or Java libraries that process audio. It seems like a really complicated program to do this audio: separating out audio and so on.' It turns out the algorithm you just heard can be done with just one line of code, shown right here. It took researchers a long time to come up with this line of code. So I'm not saying this is an easy problem. But it turns out that when you use the right programming environment, many learning algorithms will be really short programs."

The separated audio results played in the video lecture are not perfect but, in my opinion, amazing. Does anyone have insight into why that one line of code performs so well? In particular, does anyone know of a reference that explains the work of Te-Won Lee, Sam Roweis, Yair Weiss, and Eero Simoncelli behind that one line of code?

Update

To demonstrate the algorithm's sensitivity to the microphone separation distance, the following simulation (in Octave) separates the tones from two spatially separated tone generators.

% define model
f1 = 1100;              % frequency of tone generator 1; unit: Hz
f2 = 2900;              % frequency of tone generator 2; unit: Hz
Ts = 1/(40*max(f1,f2)); % sampling period; unit: s
dMic = 1;               % distance between microphones centered about origin; unit: m
dSrc = 10;              % distance between tone generators centered about origin; unit: m
c = 340.29;             % speed of sound; unit: m / s


% generate tones
figure(1);
t = [0:Ts:0.025];
tone1 = sin(2*pi*f1*t);
tone2 = sin(2*pi*f2*t);
plot(t,tone1);
hold on;
plot(t,tone2,'r'); xlabel('time'); ylabel('amplitude'); axis([0 0.005 -1 1]); legend('tone 1', 'tone 2');
hold off;


% mix tones at microphones
% assume inverse square attenuation of sound intensity (i.e., inverse linear attenuation of sound amplitude)
figure(2);
dNear = (dSrc - dMic)/2;
dFar = (dSrc + dMic)/2;
mic1 = 1/dNear*sin(2*pi*f1*(t-dNear/c)) + ...
       1/dFar*sin(2*pi*f2*(t-dFar/c));
mic2 = 1/dNear*sin(2*pi*f2*(t-dNear/c)) + ...
       1/dFar*sin(2*pi*f1*(t-dFar/c));
plot(t,mic1);
hold on;
plot(t,mic2,'r'); xlabel('time'); ylabel('amplitude'); axis([0 0.005 -1 1]); legend('mic 1', 'mic 2');
hold off;


% use svd to isolate sound sources
figure(3);
x = [mic1' mic2'];
[W,s,v]=svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
plot(t,v(:,1));
hold on;
maxAmp = max(v(:,1));
plot(t,v(:,2),'r'); xlabel('time'); ylabel('amplitude'); axis([0 0.005 -maxAmp maxAmp]); legend('isolated tone 1', 'isolated tone 2');
hold off;
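
As a numerical check beyond eyeballing the plots, the recovered frequencies can be read off the FFT peaks of the two isolated columns. This snippet is my addition, not part of the original simulation:

% verify the isolated tones' frequencies from their FFT peaks
N = length(t);
f = (0:N-1)/(N*Ts);          % frequency axis; unit: Hz
half = floor(N/2);           % look only at positive frequencies
spec1 = abs(fft(v(:,1)));
spec2 = abs(fft(v(:,2)));
[~, i1] = max(spec1(1:half));
[~, i2] = max(spec2(1:half));
printf('isolated tone 1: %.0f Hz; isolated tone 2: %.0f Hz\n', f(i1), f(i2));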

After about 10 minutes of execution on my laptop, the simulation generates the following three figures, illustrating that the two isolated tones have the correct frequencies.

[Figure 1: source tones] [Figure 2: microphone mixtures] [Figure 3: isolated tones]

However, setting the microphone separation distance to zero (i.e., dMic = 0) causes the simulation to instead generate the following three figures, illustrating that it could not isolate a second tone (confirmed by the single significant diagonal term returned in svd's s matrix).

[Figures 1-3 with no mic separation]
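
The rank deficiency can also be checked directly from the s matrix returned by svd; a quick inspection after the svd call (my addition):

% print the largest singular values; with dMic = 0 only the first is significant
sv = diag(s);
disp(sv(1:4));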

I had hoped the microphone separation distance on a smartphone would be large enough to produce good results, but setting it to 5.25 inches (i.e., dMic = 0.1333 m) causes the simulation to generate the following, less encouraging, figures illustrating high-frequency components in the first isolated tone.

[Figures 1-3 with smartphone mic separation]


x(t) is the original voice from one channel/microphone.

X = (repmat(sum(x.*x,1),size(x,1),1).*x)*x' is an estimate of the power spectrum of x(t). Although X' = X, the intervals between the rows and the columns are not the same. Each row represents time, while each column represents frequency. I guess this is an estimate and a simplification of a stricter construction called a spectrogram.
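For what it's worth, assuming x is arranged with one row per microphone (d x N, as the kica-based answer below does), the matrix being decomposed works out to a norm-weighted sum of outer products, i.e., a fourth-order moment of the data. A small Octave check (variable names are mine):

% the svd argument equals the sum over samples of ||x_n||^2 * x_n * x_n'
X1 = (repmat(sum(x.*x,1),size(x,1),1).*x)*x';
X2 = zeros(size(x,1));
for n = 1:size(x,2)
    X2 = X2 + (x(:,n)'*x(:,n)) * (x(:,n)*x(:,n)');  % ||x_n||^2 times outer product
end
% X1 and X2 agree up to floating-point rounding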

A singular value decomposition of the spectrogram is used to factorize the signal into different components based on spectral information. The diagonal values in s are the magnitudes of the different spectral components. The rows in u and the columns in v' are the orthogonal vectors that map each frequency component, with its corresponding magnitude, into the space of X.

I don't have voice data to test with, but in my understanding, by means of the SVD, the components that fall onto similar orthogonal vectors can hopefully be clustered with the help of unsupervised learning. Say, if the first 2 diagonal magnitudes from s are clustered together, then u*s_new*v' will form the one-person voice, where s_new is the same as s except that all the elements at (3:end,3:end) are set to zero.
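
A minimal sketch of that truncation, assuming X and its SVD as defined above (names are mine):

[u, s, v] = svd(X);
s_new = zeros(size(s));            % same shape as s
s_new(1:2,1:2) = s(1:2,1:2);       % keep only the first two components
X_new = u*s_new*v';                % hopefully the one-person voice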

Two articles about the sound-derived matrix and the SVD are provided for your reference.

I was trying to figure this out as well, 2 years later. But I got my answers; hopefully they'll help someone.

You need 2 audio recordings. You can get audio examples from http://research.ics.aalto.fi/ica/cocktail/cocktail_en.cgi.

The reference for the implementation is http://www.cs.nyu.edu/~roweis/kica.html.

OK, here's the code:

[x1, Fs1] = audioread('mix1.wav');
[x2, Fs2] = audioread('mix2.wav');
xx = [x1, x2]';                    % 2 x N: one row per microphone
% center the data, then whiten it (zero mean, identity covariance)
yy = sqrtm(inv(cov(xx')))*(xx-repmat(mean(xx,2),1,size(xx,2)));
% svd of the norm-weighted fourth-moment matrix of the whitened data
[W,s,v] = svd((repmat(sum(yy.*yy,1),size(yy,1),1).*yy)*yy');


a = W*xx; % W is the unmixing matrix; its rows recover the separated sources
subplot(2,2,1); plot(x1); title('mixed audio - mic 1');
subplot(2,2,2); plot(x2); title('mixed audio - mic 2');
subplot(2,2,3); plot(a(1,:), 'g'); title('unmixed wave 1');
subplot(2,2,4); plot(a(2,:),'r'); title('unmixed wave 2');


audiowrite('unmixed1.wav', a(1,:), Fs1);
audiowrite('unmixed2.wav', a(2,:), Fs1);
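
One caveat that is not part of the original answer: audiowrite clips samples outside [-1, 1], and the unmixed rows of a are not guaranteed to stay in that range, so rescaling before writing may avoid distortion:

% rescale each unmixed signal into [-1, 1] before writing (my addition)
a1 = a(1,:) / max(abs(a(1,:)));
a2 = a(2,:) / max(abs(a(2,:)));
audiowrite('unmixed1.wav', a1, Fs1);
audiowrite('unmixed2.wav', a2, Fs1);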
