
A Probability Weighted Beamformer for Noise Robust ASR

2018, Interspeech 2018

We investigate a novel approach to spatial filtering that adapts to the conditions at different time-frequency (TF) points for noise removal by taking advantage of speech sparsity. Our approach combines a noise reduction beamformer with a minimum variance distortionless response (MVDR) beamformer or a generalized eigenvalue (GEV) beamformer through the TF posterior probabilities of speech presence (PPSP). To estimate the PPSP, we study both statistical model based and neural network based methods: in the former, we use complex Gaussian mixture modeling (CGMM) on temporally augmented spatial spectral features; in the latter, we use neural network (NN) based TF masks to initialize the speech and noise covariance matrices in CGMM. We have conducted experiments on the CHiME-3 task. On its real noisy speech test set, our methods of feature augmentation, TF-dependent spatial filtering, and NN-based mask initialization of the covariances for CGMM have yielded relative word error rate (WER) reductions of 8%, 16%, and 25%, cumulatively, over the original CGMM-based MVDR. The three methods have also produced consistent WER reductions on the real test data when GEV replaces MVDR.

Interspeech 2018, 2-6 September 2018, Hyderabad

A Probability Weighted Beamformer for Noise Robust ASR

Suliang Bu (1), Yunxin Zhao (1), Mei-Yuh Hwang (2), Sining Sun (3)
(1) Dept. of Electrical Engineering and Computer Science, University of Missouri-Columbia, USA
(2) Mobvoi AI Lab, Redmond, WA, USA
(3) School of Computer Science, Northwestern Polytechnical University, Xi'an, China
[email protected], [email protected], [email protected], [email protected]

An alternative method for further noise removal is the speech distortion weighted multichannel Wiener filter (SDW-MWF) [14, 15, 16], a generalization of multichannel Wiener filtering (MWF) that provides a tradeoff between noise reduction and speech distortion. SDW-MWF can be viewed as an MVDR beamformer followed by a time-invariant post-filter [16] that scales the MVDR output in each frequency bin, which may not be optimal for sparse signals and time-varying noise.

In this work, we investigate a TF-dependent spatial filtering approach and adapt the spatial filter design to speech and noise conditions at different TF points. To do so, we first derive separate filters with different aims: one aims at capturing target speech in a desired direction, which can be accomplished by MVDR or GEV, and the other aims at maximally reducing noise, which can be accomplished by a linear filter derived from the noise spatial covariance. We then combine the speech capture and noise reduction filters via the posterior probability of speech presence (PPSP) at each TF point to generate a TF-dependent spatial filter. Furthermore, to improve the estimation of statistical model based PPSP, we incorporate a differential temporal context into the spatial spectral vectors in CGMM, and derive parameter updating formulas based on the Expectation-Maximization (EM) algorithm. Additionally, we investigate using the NN-based TF masks of [9] to improve the initialization of the speech and noise covariance matrices for CGMM.

In Section 2, we briefly review MVDR, GEV, SDW-MWF, and the two TF mask estimation methods of [6, 9].
In Section 3, we describe the proposed TF-dependent filter and the differential context features for CGMM. We present experimental results on CHiME-3 [17] in Section 4, and draw conclusions in Section 5.

Index Terms: noise robust speech recognition, MVDR beamformer, GEV beamformer, noise reduction

1. Introduction

The performance of an automatic speech recognition (ASR) system may degrade significantly in noisy environments. Microphone array beamforming has shown great potential in improving ASR performance in noise [1, 2, 3, 4, 5]. In narrowband beamforming, to estimate a steering vector (SV) or a spatial filter, eigen-analyses can be made on the spatial spectral covariance matrices of speech and noise.
In the minimum variance distortionless response (MVDR) beamformer of [6, 4, 7], an SV was estimated as the eigenvector associated with the largest eigenvalue of the speech spatial covariance matrix in each frequency bin. In the generalized eigenvalue (GEV) beamformer of [8, 9], the filter was estimated as the generalized eigenvector with the largest eigenvalue involving both the speech and noise spatial covariances, so as to maximize the signal-to-noise ratio (SNR). Speech and noise spatial covariance matrices are usually estimated by using time-frequency (TF) masks of speech. The TF masks can be obtained by statistical model methods [10, 6, 11] or neural networks (NN) [9, 12, 13]. The former does not need stereo training data and usually estimates masks independently for each TF point, while the latter may require stereo data and jointly estimates masks over all frequency bins.

MVDR and GEV beamformers are well established as effective methods for enhancing speech in noise. However, in real conditions, noticeable noise may still remain in their enhanced signals, calling for further noise removal by alternatives such as SDW-MWF.

2. MVDR, GEV, SDW-MWF, and TF masks

In this paper, we use bold font for vectors and regular font for scalars, with matrices specified explicitly.

2.1. MVDR and GEV beamforming

Let y_{f,t} = [y_{f,t,1}, ..., y_{f,t,M}]^T denote the signal vector from M microphones, where y_{f,t,i} denotes the i-th microphone signal at frequency f and time t, and (·)^T denotes transpose. MVDR minimizes the output energy while keeping a fixed gain in the direction of the desired signal [18], whereas GEV maximizes the SNR of the output signal at the expense of speech distortion [8]. Given the spatial covariance matrices of speech and noise, Φ_xx(f) and Φ_nn(f), the GEV filter is the eigenvector with the largest eigenvalue of Φ_nn^{-1}(f) Φ_xx(f).
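As a minimal illustration (not the paper's estimator), the GEV filter just described can be computed with a generalized eigendecomposition in numpy; the steering vector h and the rank-1 speech covariance below are assumptions made for the sake of the example:

```python
import numpy as np

def gev_filter(phi_xx, phi_nn):
    """GEV beamformer for one frequency bin (Section 2.1): the eigenvector
    of phi_nn^{-1} phi_xx with the largest eigenvalue, which maximizes SNR."""
    vals, vecs = np.linalg.eig(np.linalg.solve(phi_nn, phi_xx))
    w = vecs[:, np.argmax(vals.real)]
    return w / np.linalg.norm(w)  # unit-norm spatial filter

# Toy check: with a rank-1 speech covariance h h^H and identity noise
# covariance, the GEV filter should align with the steering vector h.
rng = np.random.default_rng(0)
M = 6  # number of microphones
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
phi_xx = np.outer(h, h.conj())
phi_nn = np.eye(M, dtype=complex)
w = gev_filter(phi_xx, phi_nn)
alignment = abs(w.conj() @ h) / np.linalg.norm(h)  # 1.0 means perfectly aligned
```

In practice, Φ_xx and Φ_nn are estimated from TF-masked data as reviewed in Sections 2.3 and 2.4.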
On the other hand, given a unit gain on the desired signal and an SV h_f, the MVDR filter is

    w_{MVDR,f} = Φ_nn^{-1}(f) h_f / ( h_f^H Φ_nn^{-1}(f) h_f )    (1)

Given a spatial linear filter w_f, the enhanced signal x̂_{f,t} is obtained by

    x̂_{f,t} = w_f^H y_{f,t}    (2)

with (·)^H the conjugate transpose.

2.2. SDW-MWF

Using the Woodbury identity [19], SDW-MWF can be decomposed into MVDR followed by a post-filter as [16]:

    w_{SDWMWF,f} = w_{MVDR,f} · σ²_{dx,f} / ( σ²_{dx,f} + µ σ²_{dn,f} )    (3)

where σ²_{dx,f} and σ²_{dn,f} denote the speech and noise variances in the MVDR output signal, respectively, and µ is the tradeoff parameter between speech distortion and noise reduction: a larger µ results in more noise reduction at the cost of larger speech distortion.

The ideal binary masks for speech, IBM_X, and for noise, IBM_N, used as NN training targets in Section 2.4, are defined by

    IBM_X(t, f) = 1 if ||x|| / ||n|| > 10^{th_X(f)}, and 0 otherwise    (7)
    IBM_N(t, f) = 1 if ||x|| / ||n|| < 10^{th_N(f)}, and 0 otherwise    (8)

where ||·|| is the Euclidean norm, and th_X(f) and th_N(f) are two different thresholds. During test, the masks obtained for each channel are condensed to a single speech mask and a single noise mask using a median operation, which reduces the effect of outliers such as broken channels. The speech and noise masks are used as weights on the spatial spectral vectors in computing the spatial covariance matrices of speech and noise.

3. Proposed methods

We first describe the proposed TF-dependent spatial filter, and then explain the proposed temporal augmentation of the spatial spectral features for CGMM. In the following, we deal with narrowband beamformers by default, so the frequency index f is omitted when no ambiguity occurs.

2.3. CGMM-based mask estimation

For statistical model based TF mask estimation, we adopt the CGMM method of [6].
Let y_{f,t}, x_{f,t} and n_{f,t} denote the observed signal, speech signal(1), and noise signal at (f, t), respectively, with x_{f,t} = s^x_{f,t} r^x_f and n_{f,t} = s^n_{f,t} r^n_f, where s^x_{f,t} is the speech component, r^x_f is the acoustic transfer function (ATF) vector from the speech source to the M microphones, and s^n_{f,t} and r^n_f are defined similarly.

The variables s^x_{f,t} and s^n_{f,t} are assumed to have zero-mean complex Gaussian distributions, i.e., s^x_{f,t} ~ CN(0, φ^x_{f,t}) and s^n_{f,t} ~ CN(0, φ^n_{f,t}), with φ^x_{f,t} and φ^n_{f,t} the variances of speech and noise, respectively. Thus, x_{f,t} and n_{f,t} are modeled as x_{f,t} ~ CN(0, φ^x_{f,t} R^x_f) and n_{f,t} ~ CN(0, φ^n_{f,t} R^n_f), where R^x_f = r^x_f (r^x_f)^H and R^n_f = r^n_f (r^n_f)^H. Accordingly, y_{f,t} is modeled by a two-component CGMM, with one component for speech and the other for noise. In practice, to accommodate variations in speaker and microphone positions, R^x_f and R^n_f are treated as full-rank covariance matrices [20].

The CGMM parameters are estimated by the EM algorithm. For the speech component, the model parameters are iteratively updated as:

    φ^x_{f,t} = y_{f,t}^H (R^x_f)^{-1} y_{f,t} / M    (4)

    R^x_f = ( 1 / Σ_t λ^x_{f,t} ) Σ_t ( λ^x_{f,t} / φ^x_{f,t} ) y_{f,t} y_{f,t}^H    (5)

    λ^x_{f,t} = w^x_f p(y_{f,t} | 0, φ^x_{f,t} R^x_f) / [ w^x_f p(y_{f,t} | 0, φ^x_{f,t} R^x_f) + w^n_f p(y_{f,t} | 0, φ^n_{f,t} R^n_f) ]    (6)

3.1. Speech probability weighted spatial filter

Conventional beamformers, like MVDR or GEV, often use a time-invariant filter in each frequency bin [5, 9, 4]. Such filters are desirable if the target signal always exists in the frequency band. However, this is not true for speech, which is sparse in the TF domain. Therefore, the beamformed signals often require a follow-up postfiltering to reduce residual noise. Here, to further reduce noise, we exploit the sparsity of speech and investigate the following approach to spatial filter design: depending on the PPSP at a TF point, λ^x_{f,t}, we swing the spatial filtering objective between speech capture (as with the MVDR or GEV filters) and noise reduction.
For noise reduction, we consider the following filter w_n:

    w_n = argmin_w w^H Φ_nn w,  s.t. w^H w = 1    (9)

The eigenvector with the minimum eigenvalue of Φ_nn is a solution to Eq. (9). Admittedly, we could switch between w_MVDR / w_GEV and w_n with reference to a threshold on the PPSP. However, to avoid tuning such a threshold, we adopt a soft-switching approach. Specifically, we define the following spatial filter for each (f, t) point:

    w_t = (w_*)^{p_t} ⊙ (w_n)^{1-p_t}    (10)

where w_* can be w_GEV or w_MVDR, p_t is the PPSP at the (f, t) point, and (·)^p and ⊙ denote element-wise power and multiplication, respectively. Clearly, if p_t = 1, w_t equals w_*; if p_t = 0, w_t is w_n; for intermediate values of p_t, the combined local filter has a mixed effect of speech capture and noise reduction. For the i-th microphone channel, the filter's phase is a weighted linear interpolation of the phases of w_{*,i} and w_{n,i}, and its magnitude is the weighted geometric average of the magnitudes of w_{*,i} and w_{n,i}, with the weights being p_t and 1 - p_t, respectively.

In narrowband beamformers, usually only a single pair of speech and noise covariance matrices is used to estimate the spatial filter in each frequency bin [5, 9, 4], in which case the filter cannot adapt to the time-varying noises that often occur in real conditions. By using the PPSPs as combining weights, however, our proposed spatial filter (10) can adapt its objective over time.

In the CGMM of Section 2.3, w^x_f and w^n_f in Eq. (6) are the mixture weights. When EM converges, the posterior probability of speech, λ^x_{f,t}, is taken as the local PPSP in our proposed local filter method (Section 3). The noise parameters are updated similarly.

2.4. NN-based mask estimation

For NN based mask estimation, we review the bidirectional long short-term memory (BLSTM) network method of [9, 12] due to its good ASR performance. During its noise-aware training, binary masks are used as training targets: the ideal binary masks for speech and noise, IBM_X and IBM_N, defined in Eqs. (7) and (8).

(1) In [6], an observed signal was defined to consist of noisy speech and noise.
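The soft-switching combination of Eq. (10) above can be sketched as follows; the per-channel magnitude/phase form mirrors the interpretation given in the text (weighted geometric mean of magnitudes, weighted linear interpolation of phases). The example filter values are arbitrary assumptions:

```python
import numpy as np

def combined_filter(w_star, w_n, p_t):
    """TF-dependent filter of Eq. (10): w_t = (w_*)^{p_t} ⊙ (w_n)^{1 - p_t}.
    Per channel: magnitude is the weighted geometric mean, phase the weighted
    linear interpolation, with weights p_t and 1 - p_t."""
    mag = np.abs(w_star) ** p_t * np.abs(w_n) ** (1.0 - p_t)
    phase = p_t * np.angle(w_star) + (1.0 - p_t) * np.angle(w_n)
    return mag * np.exp(1j * phase)

# Endpoint behavior: p_t = 1 recovers the speech-capture filter w_*,
# and p_t = 0 recovers the noise-reduction filter w_n.
w_star = np.array([1.0 + 1.0j, 0.5 - 0.2j])  # arbitrary example values
w_n = np.array([0.3 + 0.1j, -0.4 + 0.6j])
w_speech = combined_filter(w_star, w_n, 1.0)
w_noise = combined_filter(w_star, w_n, 0.0)
```

Note that np.angle returns principal phases, so the interpolated phase follows the principal-value convention; the endpoint cases p_t ∈ {0, 1} are exact either way.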
To avoid confusion, we use the terms speech and noise instead.

Noise TF points were detected and used to initialize the noise covariance; these TF points were kept fixed as noise during the EM iterations, while the remaining points were used to initialize the speech covariance. In feature augmentation, the step size l was set to 2 to avoid temporal overlap between contextual vectors (the frame shift was 25% of the frame size). Note that the augmented features were not used in Eq. (2) or in parameter initialization.

We adopted the NN-based masks of [9, 12] due to their reported good performance for ASR. Empirically, we found that these masks could not be used directly as the PPSP in (10). This might be due to the 0-1 binary target setting in NN training (Eqs. (7), (8)), and to the separate estimation of speech and noise masks, which does not guarantee a sum-to-one constraint. Although the mask scores computed during test were continuous and normalized within [0, 1], they did not work well as PPSPs. On the other hand, to take advantage of the smooth TF masks produced by the NN, which estimates masks jointly across frequency bins, we investigated using the NN-based masks to initialize the speech and noise covariances for CGMM. Specifically, the NN-based masks were first used to detect noise TF points at the two ends of each utterance: points with a noise score larger than 0.9 were fixed as noise in the EM iterations of CGMM. These noise points, together with the rest of the NN-based masks, were used for noise and speech covariance initialization in CGMM.

The proposed spatial filter (10) is thus able to change its objective from speech capture to noise removal. Although it no longer guarantees a distortionless response for the desired signal, this weakness is compensated for by more effective noise reduction.

3.2. Spatial spectral feature augmentation in CGMM

As the accuracy of the local PPSP, λ^x_{f,t}, is important to the above filter composition method, it is desirable to improve the speech-noise discriminative power of CGMM.
In [6, 7], only the TF-specific spatial spectral vectors y_t were used in the local CGMMs. Since neighboring spectra may provide additional discriminative information, we augment each center spatial spectral vector with its temporal context. Specifically, a first-order time difference of y_t with step size l, ∆y_t = y_{t+l} - y_{t-l}, is also used as a feature:

    ∆y_t = ∆x_t + ∆n_t = ∆s^x_t r^x + ∆s^n_t r^n

We see that in ∆y_t, the ATFs of speech and noise remain unchanged. Therefore, ∆x_t and ∆n_t can be modeled by CN(0, ∆φ^x_t R^x) and CN(0, ∆φ^n_t R^n), respectively. For computational convenience, we adopt block-diagonal covariance matrices for speech and noise to model the augmented feature vector [y_t^T, ∆y_t^T]^T in CGMM. Specifically, for the speech component, the covariance matrix becomes:

    [ φ^x_{1,t} R^x        0        ]
    [       0        φ^x_{2,t} R^x  ]

and its CGMM parameter update formulas are derived as:

    φ^x_{1,t} = y_t^H (R^x)^{-1} y_t / M    (11)

    φ^x_{2,t} = ∆y_t^H (R^x)^{-1} ∆y_t / M    (12)

    R^x = ( 1 / (2 Σ_t λ^x_t) ) Σ_t λ^x_t ( y_t y_t^H / φ^x_{1,t} + ∆y_t ∆y_t^H / φ^x_{2,t} )    (13)

For noise, the covariance matrix and the parameter update formulas are defined and derived similarly.

4.2. Experiment Results

Our ASR results are summarized in word error rate (WER) for the simulated and real test data. We first evaluated the proposed feature augmentation when the CGMM-based masks were used alone, and compared it with the CHiME-3 baseline BeamformIt [22]. These results are given in Table 1, where "MVDR∆" and "GEV∆" denote using the augmented features.

Table 1: WERs (%) of baseline, MVDR, GEV, with and without feature augmentation

4. Experiments and Results

The CHiME-3 task covered four noisy environments: cafe (CAF), street (STR), public transport (BUS), and pedestrian area (PED). The real noisy speech data had 1600 utterances, supplemented by 7138 simulated noisy speech utterances for acoustic model training. The test data also had real and simulated noisy speech and consisted of the same 330 sentences as in the WSJ0 5k task.
Data details are described in [17].

              eval simu                         eval real
          BUS   CAF   PED   STR   AVG      BUS   CAF   PED   STR   AVG
baseline  8.7  13.1  12.9  14.9  12.4     18.8  10.5  10.3   9.8  12.4
MVDR      4.8   6.5   5.4   7.7   6.1     16.6   8.4   6.8   8.3  10.0
MVDR∆     4.4   5.4   5.6   8.2   5.9     15.6   6.9   6.2   8.1   9.2
GEV       4.6   5.3   5.6   7.8   5.8     14.0   7.4   7.0   7.5   9.0
GEV∆      4.2   4.8   5.5   7.4   5.5     12.2   7.8   6.8   7.8   8.6

In Table 1, the average WERs of our MVDR on the simulated and real data were 6.1% and 10.0%, respectively, which greatly lowered the baseline WERs, indicating the effectiveness of the approach of [6]. These two figures were also better than the corresponding figures of 6.96% and 10.37% reported in [6] with five microphones used in beamforming. A possible reason is that our noise spatial covariance initialization was more informative than the identity-matrix-based initialization in [6].

On the other hand, GEV performed better than MVDR. One likely reason is the better numerical stability of GEV over MVDR [12]: MVDR needs a matrix inversion, while GEV does not. In addition, we observed that in the GEV beamformed signals the lower frequency components appeared to be attenuated appreciably, which is beneficial in conditions with strong low-frequency noise, like BUS. This might be another reason why GEV was better than MVDR on this task. Moreover, comparing MVDR with MVDR∆, or GEV with GEV∆, feature augmentation further reduced the average WERs, suggesting its benefit in boosting the discriminative power of CGMM. In the subsequent experiments, feature augmentation was used in CGMM by default.

4.1. Experiment Setup

For speech recognition, we used the CHiME-3 baseline backend in Kaldi [21] without any modification. For beamforming, we evaluated our proposed methods in two cases: one used only CGMM to estimate the PPSP; the other used NN-based masks to initialize the speech and noise covariances for CGMM, and the converged CGMM was used to estimate the PPSP.
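The CGMM-based PPSP estimation used in this setup (Eqs. (4)-(6) of Section 2.3) can be sketched, for a single frequency bin, as below. This is a minimal illustrative pass, not the paper's full recipe: the toy steering vector, frame counts, and initialization are assumptions made for the example:

```python
import numpy as np

def cgmm_em_step(Y, R, w):
    """One EM pass of the two-component CGMM (Section 2.3), one frequency bin.
    Y: (T, M) observed spatial spectral vectors; R: {'x','n'} spatial
    covariances; w: {'x','n'} mixture weights. Returns the speech posteriors
    (Eq. (6)), the per-frame variances (Eq. (4)), and the re-estimated
    speech covariance (Eq. (5))."""
    T, M = Y.shape
    phi, log_lik = {}, {}
    for k in ('x', 'n'):
        Rinv = np.linalg.inv(R[k])
        # Eq. (4): phi_{k,t} = y_t^H R_k^{-1} y_t / M
        phi[k] = np.real(np.einsum('ti,ij,tj->t', Y.conj(), Rinv, Y)) / M
        # log CN(y; 0, phi R) = -y^H (phi R)^{-1} y - log det(pi phi R);
        # by the definition of phi above, the quadratic term equals M.
        _, logdet = np.linalg.slogdet(R[k])
        log_lik[k] = (np.log(w[k]) - M - M * np.log(np.pi)
                      - M * np.log(phi[k]) - logdet)
    # Eq. (6): posterior probability of speech presence (PPSP)
    lam = 1.0 / (1.0 + np.exp(log_lik['n'] - log_lik['x']))
    # Eq. (5): weighted re-estimate of the speech spatial covariance
    R_x = (Y.T * (lam / phi['x'])) @ Y.conj() / lam.sum()
    return lam, phi, R_x

# Toy data: 20 "speech" frames along an assumed steering vector h, followed
# by 20 diffuse-noise frames; the posteriors should separate the two groups.
rng = np.random.default_rng(1)
M = 4
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
s = 3.0 * (rng.standard_normal(20) + 1j * rng.standard_normal(20))
Y = np.vstack([s[:, None] * h[None, :],
               rng.standard_normal((20, M)) + 1j * rng.standard_normal((20, M))])
R0 = {'x': np.outer(h, h.conj()) + 0.01 * np.eye(M),
      'n': np.eye(M, dtype=complex)}
lam, phi, R_x = cgmm_em_step(Y, R0, {'x': 0.5, 'n': 0.5})
```

In the paper's full method, these updates iterate to convergence per frequency bin, the converged λ^x_{f,t} serves as the PPSP p_t in Eq. (10), and the noise component is updated in the same way.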
When CGMM alone was used to estimate the PPSP, we largely followed the settings in [6] but added the following refinements. Before CGMM initialization, the microphone channel with the highest SNR was determined. Within the first and last 25 frames of this microphone signal, and in each frequency bin, noise TF points were detected to initialize the noise covariance.

Compared with MVDR in Table 1, MVDR∆n* obtained a 25% relative WER reduction on the real test data. On the other hand, our GEV∆n had a 7.4% WER on the real test data, while the GEV of [12], which directly used the NN-based masks to calculate its filters, had a WER of 7.45%. Although using NN-based masks to initialize CGMM did not affect the WER of GEV, the probability weighted beamformer further reduced the WER to 7.0%.

Finally, we summarize the performance in WER on the real test data in Fig. 1 for MVDR/GEV and our proposed methods. With the successive introduction of feature augmentation, the TF-dependent filter, and NN-mask-based initialization, the WER decreased progressively for both MVDR and GEV, and the total relative WER reduction was 25% for MVDR and 22% for GEV.

In Table 2, we provide the WER of the proposed TF-dependent filters, where MVDR∆* and GEV∆* indicate that the MVDR and GEV filters, respectively, were used in (10). In addition, we compared MVDR∆* with SDW-MWF. Based on Eq. (3), our implementation of SDW-MWF was actually MVDR∆ followed by a post-filter, with the tradeoff parameter µ set to 0.5 and 1, denoted as MWF0 and MWF1, respectively. In the post-filter, the speech and noise power spectral density (PSD) estimation was based on [23]. For convenience of comparison, the results for MVDR∆ and GEV∆ are repeated.
Table 2: WER (%) of proposed local filtering and SDW-MWF

              eval simu                         eval real
          BUS   CAF   PED   STR   AVG      BUS   CAF   PED   STR   AVG
MVDR∆     4.4   5.4   5.6   8.2   5.9     15.6   6.9   6.2   8.1   9.2
MWF0      4.5   5.4   5.0   7.5   5.6     16.0   6.8   6.2   7.6   9.1
MWF1      4.2   5.4   5.9   7.9   5.9     16.8   6.8   6.2   8.2   9.5
MVDR∆*    5.4   5.9   5.3   7.7   6.1     13.8   6.1   6.5   7.2   8.4
GEV∆      4.2   4.8   5.5   7.4   5.5     12.2   7.8   6.8   7.8   8.6
GEV∆*     5.0   6.1   6.0   7.6   6.2     12.4   6.7   5.7   7.7   8.1

Comparing MWF0 and MWF1 with MVDR∆, it appeared that SDW-MWF did not reduce the WER significantly. A possible reason is that the noise was non-stationary, so the noise PSD was difficult to model. On the other hand, comparing MVDR∆ with MVDR∆*, or GEV∆ with GEV∆*, we found that our TF-dependent filters worked effectively on real data. In comparison with MVDR in Table 1, MVDR∆* obtained a 16% relative WER reduction on the real test data.

To better understand the positive effect of the noise reduction filter w_n on WER reduction, we examined the estimated filter component values and found that the magnitudes of the individual components correlated with the relative noise strength across channels: a larger magnitude was correlated with a lower level of noise in a channel. As a result, according to (10), a cleaner channel makes a larger contribution to the beamformed signal. That said, a more careful analysis is still needed in a future study.

On the other hand, the local filter methods slightly increased the WER on simulated data. Clearly, better performance on real data is more valuable for real applications. A further examination revealed that MVDR and GEV tended to remove noise much better in simulated data than in real data: the beamformed simulated data tended to have higher SNR than the beamformed real data. As a result, noise corruption was less of an issue in simulated data after MVDR or GEV, but speech distortion due to the TF-dependent filtering became noticeable, which might be significant if the PPSPs were inaccurate at some TF points.
This points to the need for further improving the robustness and design of our filter in Eq. (10). One possibility along this line is to derive a more accurate PPSP from the beamformed signal of MVDR or GEV, and then use that PPSP in Eq. (10).

Figure 1: WER comparison of (a) CGMM based MVDR/GEV; (b) with feature augmentation in CGMM; (c) TF dependent filter with feature augmentation; and (d) TF dependent filter with feature augmentation and NN mask based initialization.

It is worth noting that, presumably, using soft masks as targets in NN training might produce mask scores more compatible with the PPSPs for the filter composition in Eq. (10). On the other hand, unlike CGMM, NN-based methods such as [12] do not exploit the ATF model that may facilitate discrimination between speech and noise. Our approach of using NN masks to initialize CGMM provides a way to utilize both the spectral-temporal context-dependent scores provided by the NN and the explicit ATF modeling of CGMM.

5. Conclusions

In this paper, we have introduced a TF-dependent spatial filter that focuses on speech capture or noise reduction dynamically according to the PPSP at different TF points. This method takes the sparsity of speech in the TF domain into consideration and attempts to remove noise more aggressively than MVDR or GEV alone. To better estimate the PPSP under CGMM, we have augmented the spatial spectral vectors with their contextual vectors. We have further investigated using NN-based TF masks to initialize the speech and noise covariance matrices for CGMM. We have achieved word error reductions with each of these methods. On the real test set of the CHiME-3 task, our methods of feature augmentation, local spatial filtering, and NN-based mask initialization of the covariances for CGMM have cumulatively yielded relative word error rate reductions of 8%, 16%, and 25% over our implementation of the CGMM based MVDR of [6].
The three methods have also produced consistent word error rate reductions when GEV was used in place of MVDR on the real test data. In future work, we plan to further improve the probability weighted beamformer and investigate its performance in heavier reverberation conditions than those of CHiME-3.

Table 3: WER (%) of local filtering with NN-based masks for CGMM initialization

              eval simu                         eval real
          BUS   CAF   PED   STR   AVG      BUS   CAF   PED   STR   AVG
MVDR∆n    4.4   6.2   5.6   6.6   5.7     13.0   7.3   6.9   7.4   8.6
MVDR∆n*   4.2   6.2   5.9   6.7   5.7     11.8   6.0   5.9   6.5   7.5
GEV∆n     4.2   5.4   5.2   5.9   5.2     10.4   6.9   5.5   6.6   7.4
GEV∆n*    4.4   6.2   5.4   6.8   5.7      9.5   6.4   5.8   6.4   7.0

In Table 3, the NN-based TF masks were used to initialize the speech and noise covariance matrices for CGMM, and all methods are tagged with "n" to indicate this setting. Comparing the results of Table 3 with Table 2, we found that all methods benefited from this initialization, as it led to better PPSP estimates.

6. References

[1] K. Kumatani, T. Arakawa et al., "Microphone array processing for distant speech recognition: Towards real-world deployment," in APSIPA ASC, 2012, pp. 1-10.
[2] L. Pfeifenberger, T. Schrank et al., "Multi-channel speech processing architectures for noise robust speech recognition: 3rd CHiME challenge results," in Interspeech, 2015.
[3] T. Menne, J. Heymann et al., "The RWTH/UPB/FORTH system combination for the 4th CHiME challenge evaluation," in The 4th IWSPEE, 2016.
[4] T. Yoshioka, N. Ito et al., "The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices," in ASRU, 2015.
[5] H. Erdogan, T. Hayashi et al., "Multi-channel speech recognition: LSTMs all the way through," in CHiME-4 workshop, 2016.
[6] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise," in ICASSP, 2016.
[7] T. Higuchi, N.
Ito et al., "Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR," IEEE/ACM Trans. ASLP, vol. 25, no. 4, pp. 780-793, 2017.
[8] E. Warsitz and R. Haeb-Umbach, "Blind acoustic beamforming based on generalized eigenvalue decomposition," IEEE Trans. ASLP, vol. 15, no. 5, pp. 1529-1539, 2007.
[9] J. Heymann, L. Drude et al., "BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge," in ASRU, 2015.
[10] S. Araki, T. Nakatani, H. Sawada, and S. Makino, "Blind sparse source separation for unknown number of sources using Gaussian mixture model fitting with Dirichlet prior," in ICASSP, 2009.
[11] D. Vu and R. Haeb-Umbach, "Blind speech separation employing directional statistics in an expectation maximization framework," in ICASSP, 2010.
[12] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP, 2016.
[13] X. Xiao, S. Zhao et al., "On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition," in ICASSP, 2017.
[14] A. Spriet, M. Moonen, and J. Wouters, "Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction," Signal Processing, vol. 84, no. 12, pp. 2367-2387, 2004.
[15] S. Doclo, A. Spriet et al., "Speech distortion weighted multichannel Wiener filtering techniques for noise reduction," Speech Enhancement, pp. 199-228, 2005.
[16] S. Gannot, E. Vincent et al., "A consolidated perspective on multimicrophone speech enhancement and source separation," IEEE/ACM Trans. ASLP, vol. 25, no. 4, pp. 692-730, 2017.
[17] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines," in ASRU, 2015.
[18] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Springer, Berlin-Heidelberg-New York, 2008.
[19] G. H. Golub and C. F. Van Loan, Matrix Computations, 1996.
[20] N. Duong, E. Vincent, and R. Gribonval, "Under-determined reverberant audio source separation using a full-rank spatial covariance model," IEEE Trans. ASLP, 2010.
[21] D. Povey, A. Ghoshal et al., "The Kaldi speech recognition toolkit," in ASRU, 2011.
[22] X. Anguera, C. Wooters, and J. Hernando, "Acoustic beamforming for speaker diarization of meetings," IEEE Trans. ASLP, vol. 15, no. 7, pp. 2011-2022, 2007.
[23] M. Souden, J. Chen et al., "Gaussian model-based multichannel speech presence probability," IEEE Trans. ASLP, vol. 18, no. 5, pp. 1072-1077, 2010.