I Introduction
Deep Neural Networks (DNNs), the workhorse of speech-based user interaction systems, prove particularly effective when large amounts of data and plenty of computing resources are available. However, in many real-world applications, limited computing infrastructure, latency and power constraints during the operation phase effectively rule out most of the current resource-hungry DNN approaches. Therefore, there are several key challenges which have to be jointly considered to facilitate the usage of DNNs when it comes to edge-computing implementations:

Efficient representation: The model complexity measured by the number of model parameters should match the limited resources of the computing hardware, in particular regarding memory footprint.

Computational efficiency: The model should be computationally efficient during inference, exploiting the available hardware optimally with respect to time and energy. Power constraints are key for embedded systems, as the device lifetime for a given battery charge needs to be maximized.

Prediction quality: The focus is usually on optimizing the prediction quality of the models. For embedded devices, model complexity versus prediction quality tradeoffs must be considered to achieve good prediction performance while simultaneously reducing computational complexity and memory requirements.
DNN models use GPUs to enable efficient processing, where single-precision floating-point numbers are common for parameter representation and arithmetic operations. To facilitate deep models in today’s consumer electronics, the model usually has to be scaled down to be implemented efficiently on embedded or low-power systems. Most research emphasizes one of the following two techniques: (i) reduce model size in terms of number of weights and/or neurons [17, 16, 15, 60, 25, 68], or (ii) reduce arithmetic precision of parameters and/or computational units [10, 64, 35, 43]. Evidently, these two basic techniques are almost “orthogonal directions” towards efficiency in DNNs, and they can be naturally combined, e.g. one can both sparsify the model and reduce arithmetic precision. Both strategies reduce the memory footprint accordingly and are vital for the deployment of DNNs in many real-world applications. This is especially important as reduced memory requirements are one of the main contributing factors in reducing energy consumption [16, 8, 24]. Apart from that, model size reduction and sparsification techniques such as weight pruning [32, 17, 16], weight sharing [51], knowledge distillation [22] or special weight matrix structures [28, 9, 62] also impact the computational demand measured in terms of number of arithmetic operations. Unfortunately, this reduction usually does not directly translate into savings of wall-clock time, as current hardware and software are not well designed to exploit model sparseness [63]. Instead, reducing parameter precision proves quite effective for improving execution time on CPUs [55, 47] and specialized hardware such as FPGAs [52]. When the precision of the inference process is driven to the extreme, i.e. assuming binary or ternary weights in conjunction with binary inputs and binary activation functions, floating- or fixed-point multiplications are replaced by hardware-friendly logical XNOR and bitcount operations, i.e. DNNs essentially reduce to a logical circuit. Training such discrete-valued DNNs (due to finite precision, any DNN is in fact discrete-valued; we use the term here to highlight the extremely low number of values) is delicate as they cannot be directly optimized using gradient-based methods. However, the obtained computational savings on today’s computer architectures are of great interest, especially when it comes to human-machine interaction (HMI) systems where latency and energy efficiency play an important role.
Due to the steadily decreasing cost of consumer electronics, many state-of-the-art HMI systems include multi-channel speech enhancement (MCSE) as a preprocessing stage. In particular, beamformers (BF) spatially separate background noise from the desired speech signal. Common BF methods include the Minimum Variance Distortionless Response (MVDR) beamformer [56] and the Generalized Eigenvalue (GEV) beamformer [58]. Both MVDR and GEV beamformers are frequently combined with DNN-based mask estimators, which estimate a gain mask to obtain the spatial Power Spectral Density (PSD) matrices of the desired and interfering sound sources. Mask-based BFs are amongst the state-of-the-art beamforming approaches [21, 20, 19, 13, 39, 46, 40, 41, 67]. However, they are computationally demanding and need to be trimmed down to facilitate their usage in low-resource applications.
In this paper, we investigate the trade-off between performance and resources in MCSE systems. We exploit Bidirectional Long Short-Term Memory (BLSTM) architectures to estimate a gain mask from noisy speech signals, and combine it with both the GEV and MVDR beamformers [39]. We analyze the computational demands of the overall system and highlight techniques to reduce the computational load. In particular, we observe that the computational effort for mask estimation is orders of magnitude larger than that of the subsequent beamforming. Hence, we concentrate on efficient mask estimation in the remainder of the paper. We limit the numerical precision of the mask estimation DNN’s weights, and of the values after each processing step in the forward pass (i.e. inference), to either 8, 4 or 1 bits. This makes the MCSE system both resource and memory efficient. We report both perceptual audio quality and speech intelligibility of the overall MCSE system in terms of SNR improvement and Word Error Rate (WER) using the Google Speech-to-Text API [49]. In particular, we use the WSJ0 corpus [37] in conjunction with simulated room acoustics to obtain 6-channel data of multiple, moving speakers [36]. When reducing the numerical precision in the forward pass, the system still yields competitive results for single-speaker scenarios, with a slightly increased WER. Additionally, we show that reduced-precision DNNs can be readily exploited on today’s hardware, by benchmarking the core operation of binary DNNs (BNNs), i.e. binary matrix multiplication, on NVIDIA Tesla K80 and ARM Cortex-A57 architectures.
The paper is structured as follows. In Section II we introduce the MCSE system. We highlight both MVDR and GEV beamformers and introduce DNN-based mask estimators. Section III provides details about the computational complexity of the MCSE system. We introduce reduced-precision LSTMs and discuss efficient representations in detail. In Section IV we present experiments of the MCSE system. In particular, the experimental setup and the results in terms of SNR and WER are discussed. Section V concludes the paper.
II Multi-Channel Speech Enhancement System
The acoustic environment of our MCSE system consists of up to independent sound sources, i.e. human speech or ambient noise. The sound sources may be nonstationary, due to moving speakers, and their spatial and temporal characteristics are unknown.
The speech enhancement system itself is composed of a circular microphone array with microphones, a DNN to estimate gain masks from the noisy microphone observations, and a broadband beamformer to isolate the desired signal as shown in Figure (1).
The signal at the microphones is a mixture of all sources, i.e. in the short-time Fourier transform (STFT) domain
(1) 
where the time-frequency bins of all microphones are stacked into a vector . The vector represents the ^{th} sound source at all microphones at frequency bin and time frame . For the sake of brevity, the frequency and time frame indices will be omitted where the context is clear. Each sound source is composed of a monaural recording convolved with the Acoustic Transfer Function (ATF) , i.e.
(2) 
where models the acoustic path from the ^{th} sound source to the microphones, including all reverberations and reflections caused by the room acoustics [31]. In the near field of the array, the ATFs can be modeled by a finite impulse response (FIR) filter [5]. The filter characteristics vary with the movement of the speaker, i.e. they are nonstationary. Without loss of generality, we specify the first source to be the desired source, i.e. , and the interfering signal as the sum of the remaining sources, i.e. . The spatial PSD matrix for the desired signal is given as [26]
(3) 
and for the interfering signal
(4) 
The aim of beamforming is to recover the desired source while suppressing the interfering sources at the same time. We use a filter-and-sum beamformer [7], where each microphone signal is weighted with the beamforming weights , prior to summation into the result , i.e.
(5) 
where .
II-A MVDR Beamformer
The MVDR beamformer [5, 48] minimizes the signal energy at the output of the beamformer, while maintaining an undistorted response with respect to the steering vector , i.e. its weights are
(6) 
The steering vector guides the beamformer towards the direction of the desired signal. This direction can be determined using Direction Of Arrival (DOA) estimation algorithms [4, 14, 50, 38]. However, in real-world applications this is suboptimal, as it does not consider reverberations and multi-path propagation. Assuming that the PSD matrix of the desired source is known, the steering vector can be obtained in the signal subspace [48] using Eigenvalue decomposition (EVD) of the PSD matrix. In particular, the eigenvector belonging to the largest eigenvalue is used as steering vector.
II-B GEV Beamformer
An alternative to the MVDR beamformer is the GEV beamformer [58, 59]. It determines the filter weights to maximize the SNR at the beamformer output, i.e.
(7) 
where
(8) 
(9) 
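As an illustration of Sections II-A and II-B, the following NumPy sketch computes MVDR and GEV weights for a single frequency bin from given PSD matrices. It assumes the standard textbook forms: the MVDR solution w = Φ_NN⁻¹ v / (vᴴ Φ_NN⁻¹ v), with the steering vector v taken as the principal eigenvector of the speech PSD matrix, and the GEV solution as the principal generalized eigenvector of the pair (Φ_SS, Φ_NN). All function and variable names are illustrative assumptions, and the PSD matrices are random toy data rather than estimates from speech.

```python
import numpy as np

def steering_vector(phi_ss):
    """Principal eigenvector of the Hermitian speech PSD matrix (cf. Section II-A)."""
    _, vecs = np.linalg.eigh(phi_ss)      # eigenvalues are returned in ascending order
    return vecs[:, -1]

def mvdr_weights(phi_ss, phi_nn):
    """Textbook MVDR solution: w = phi_nn^{-1} v / (v^H phi_nn^{-1} v)."""
    v = steering_vector(phi_ss)
    num = np.linalg.solve(phi_nn, v)
    return num / (v.conj() @ num)

def gev_weights(phi_ss, phi_nn):
    """Principal generalized eigenvector of (phi_ss, phi_nn); maximizes the output SNR."""
    vals, vecs = np.linalg.eig(np.linalg.solve(phi_nn, phi_ss))
    return vecs[:, np.argmax(vals.real)]

def output_snr(w, phi_ss, phi_nn):
    """Ratio of speech power to interference power at the beamformer output."""
    return ((w.conj() @ phi_ss @ w) / (w.conj() @ phi_nn @ w)).real

# Toy example: 6 microphones, random Hermitian positive definite PSD matrices.
rng = np.random.default_rng(0)
M = 6
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_ss = A @ A.conj().T
phi_nn = B @ B.conj().T + M * np.eye(M)

w_mvdr = mvdr_weights(phi_ss, phi_nn)
w_gev = gev_weights(phi_ss, phi_nn)
print(output_snr(w_mvdr, phi_ss, phi_nn), output_snr(w_gev, phi_ss, phi_nn))
```

The MVDR weights satisfy the distortionless constraint with respect to the steering vector, while the GEV weights are only defined up to a complex scaling, which is why GEV beamformers are usually followed by a normalization stage.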
II-C PSD Matrix Estimation
The spatial PSD matrix can be approximated using
(10) 
and the gain mask for the speech signal. Analogously, can be estimated using the gain mask for the interfering signal. Note that the window length defines the number of time frames used for estimating the PSD matrices. For moving sources, has to be sufficiently large to obtain well estimated PSD matrices. If is too large, the estimated PSD matrices might fail to adapt quickly enough to changes in the spatial characteristics of the moving sources. An alternative is provided by recursive estimation, i.e.
(11)  
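The two estimation strategies above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the block estimate is written as a mask-weighted average of the outer products of the microphone vector over a window of frames, and the recursive estimate as an exponentially weighted running average with a forgetting factor `alpha`. The normalization, the value of `alpha` and all names are assumptions made for this sketch and may differ from the exact form of Eqs. (10) and (11).

```python
import numpy as np

def psd_block(Z, mask):
    """Mask-weighted PSD estimate over a window of frames.

    Z:    (L, M) complex STFT frames of one frequency bin (L frames, M microphones)
    mask: (L,) gain mask values in [0, 1] for the source of interest
    """
    outer = Z[:, :, None] * Z[:, None, :].conj()          # (L, M, M) per-frame z z^H
    return (mask[:, None, None] * outer).sum(axis=0) / max(mask.sum(), 1e-12)

def psd_recursive(Z, mask, alpha=0.95):
    """Exponentially weighted running estimate, one PSD matrix per frame."""
    L, M = Z.shape
    phi = np.zeros((M, M), dtype=complex)
    out = np.empty((L, M, M), dtype=complex)
    for l in range(L):
        phi = alpha * phi + (1.0 - alpha) * mask[l] * np.outer(Z[l], Z[l].conj())
        out[l] = phi
    return out

# Toy data: 50 frames, 4 microphones, random mask.
rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 4)) + 1j * rng.standard_normal((50, 4))
mask = rng.uniform(size=50)
phi_block = psd_block(Z, mask)
phi_track = psd_recursive(Z, mask)
```

A larger window (or an `alpha` closer to 1) gives smoother estimates but slower adaptation to moving sources, which mirrors the trade-off described above.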
II-D Recursive Eigenvector Tracking
If Eq. (10) is used, the generalized Eigenvalue decomposition in Eq. (9) has to be performed for every time-frequency bin. This expensive operation can be circumvented by recursive Eigenvector tracking using Oja’s method [18]; i.e.
(12)  
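To make the idea of recursive eigenvector tracking concrete, the sketch below applies the basic form of Oja's rule to a fixed Hermitian matrix. This is an illustration of the general technique only: the update w ← w + η (Φw − (wᴴΦw) w), the step size `eta` and the names are assumptions for this example, and the exact update in Eq. (12) (for instance for the generalized eigenproblem) may differ.

```python
import numpy as np

def oja_step(phi, w, eta=0.05):
    """One Oja update towards the principal eigenvector of a Hermitian matrix phi."""
    y = phi @ w
    rayleigh = np.vdot(w, y).real        # w^H phi w, real for Hermitian phi
    return w + eta * (y - rayleigh * w)

# Toy example: symmetric matrix with known eigenvalues 5, 1 and 0.5.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
phi = Q @ np.diag([5.0, 1.0, 0.5]) @ Q.T

w = rng.standard_normal(3)
w /= np.linalg.norm(w)
for _ in range(200):                     # in a tracking setting: one step per time frame
    w = oja_step(phi, w)

alignment = abs(Q[:, 0] @ w)             # |cosine| between w and the true eigenvector
```

Each step costs one matrix-vector product instead of a full eigenvalue decomposition, and the update keeps the vector close to unit norm without explicit normalization.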
II-E DNN-based Speech Mask Estimation
The DNN used to estimate the gain mask for the beamformer uses the noisy microphone observations as features. In particular, the features per time-frequency bin are defined as , where is a whitened and phase-normalized version of . Further details on whitening can be found in [39]. For microphones, contains real-valued elements. The DNN processes frequency bins at a time, hence each time frame uses the feature vector as input. It contains elements.
Figure 2 shows the architecture of the DNN consisting of Dense layers and BLSTM units. Similar architectures for speech mask estimation can be found in [12, 20, 66, 40].
The first BLSTM layer consists of two separate LSTM units [23], each with neurons for each frequency bin. While the first LSTM processes the data in forward direction (i.e. one time frame after another), the second LSTM operates in backward direction. The outputs of both LSTMs are then concatenated to an intermediate vector with elements. The second and third layers are Dense layers. The first three layers reduce the feature vector size from elements per time-frequency bin down to 1. Note that those layers have very few weights, as they consist of independent units with neurons each. The fourth layer is a BLSTM processing all frequency bins at a time. Finally, three separate Dense layers are used. The first Dense layer estimates the mask for the desired source, the second estimates the mask for the interfering sources, and the third estimates the mask for time-frequency bins which are not assigned to the other two classes. The activation function of this layer is a softmax, so that the three masks sum to 1 for each time-frequency bin, i.e. .
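The softmax coupling of the three output masks can be illustrated with a few lines of NumPy. The sketch assumes that the three Dense heads produce one logit each per time-frequency bin, stacked along the last axis; the array layout and the function name are assumptions for this example.

```python
import numpy as np

def softmax_masks(logits):
    """Map logits of shape (..., 3) to three masks that sum to 1 per time-frequency bin."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Toy example: 100 time frames, 257 frequency bins (both sizes chosen arbitrarily).
rng = np.random.default_rng(0)
masks = softmax_masks(rng.standard_normal((100, 257, 3)))
mask_desired, mask_interf, mask_weak = masks[..., 0], masks[..., 1], masks[..., 2]
```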
III Computational Efficiency of the MCSE System
III-A Complexity analysis of MCSE system
Table I shows both the computational complexity and the number of multiply-and-accumulate (MAC) operations for the proposed DNN-based mask estimator (cf. Section II-E). Overall, 5562e6 MAC operations are needed to compute a gain mask given a multi-channel signal with microphones, frequency bins and frames. Table II shows the MAC operations of a static and a dynamic beamformer, needed to infer the target speech. Static beamformers, which do not track moving targets, have a reduced computational overhead compared to dynamic variants, which compute the beamforming weights for every time step. However, the overall computational complexity of either variant is orders of magnitude lower than that of the DNN-based mask estimator. This indicates that significant computational savings can be obtained when optimizing DNNs with respect to resource efficiency.
Layer  Shape  Weights  MAC 
BLSTM  590976  295e6  
Dense layer  6156  3e6  
Dense layer  526338  263e6  
BLSTM  8421408  4211e6  
Dense layer  1579014  790e6  
Total  11123892  5562e6 
Mode  Layer  Complexity  MAC 
static  Eq. 10  18e6  
static  Eq. 9  0.1e6  
Total  18.1e6  
dynamic  Eq. 10  18e6  
dynamic  Eq. 9  55e6  
Total  73e6 
Reducing the precision of the DNN-based mask estimators reduces the computational complexity and memory consumption of the overall MCSE system. Reduced-precision DNNs can be realized via bit-packing schemes (see https://github.com/google/gemmlowp), with the help of processor-specific GEMM instructions [1], or can be implemented on a DSP or FPGA.
Computational savings for various 8-bit DNN models on both ARM processors and GPUs have been reported in [54, 53, 45, 1]. In particular, [55] reported that speech recognition performance is maintained when quantizing the neural network parameters to 8-bit fixed-point, while the system runs 3 times faster on an x86 architecture.
In order to demonstrate the advantages that binary computations achieve on other general-purpose processors, we implemented matrix-multiplication operators for NVIDIA GPUs and ARM CPUs. BNNs can be implemented very efficiently as 1-bit scalar products, i.e. multiplications of two vectors and of length reduce to a bitwise xnor() operation, followed by counting the number of set bits with popc(), i.e.
(13) 
where and denote the element of and , respectively. We use the matrix-multiplication algorithms of the MAGMA and Eigen libraries and replace float multiplications by xnor() operations, as depicted in Equation (13). Our CPU implementation uses NEON vectorization in order to fully exploit SIMD instructions on ARM processors. We report the execution times on GPUs and ARM CPUs in Table III. As can be seen, binary arithmetic offers considerable speedups over single precision with manageable implementation effort. This also affects energy consumption, since binary values require fewer off-chip accesses and operations. Performance results for x86 architectures are not reported because neither the SSE nor the AVX ISA extensions support vectorized popc().
arch  matrix size  time (float32)  time (binary)  speedup 
GPU  256  0.14ms  0.05ms  2.8 
GPU  513  0.34ms  0.06ms  5.7 
GPU  1024  1.71ms  0.16ms  10.7 
GPU  2048  12.87ms  1.01ms  12.7 
ARM  256  3.65ms  0.42ms  8.7 
ARM  513  16.73ms  1.43ms  11.7 
ARM  1024  108.94ms  8.13ms  13.4 
ARM  2048  771.33ms  58.81ms  13.1 
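The XNOR/popcount scalar product underlying Eq. (13) and the benchmark above can be sketched in plain Python. The sketch assumes the usual encoding in which a set bit represents +1 and a cleared bit represents -1, so that the scalar product of two length-N vectors equals 2·popc(xnor(x, y)) − N. The function names are illustrative; a real implementation would pack the bits into machine words and use hardware popcount instructions instead of Python integers.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an integer (bit i set encodes +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(x_bits, y_bits, n):
    """Scalar product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1                     # keep only the n valid bit positions
    xnor = ~(x_bits ^ y_bits) & mask        # bit set where both signs agree
    return 2 * bin(xnor).count("1") - n     # agreements minus disagreements

x = [1, -1, 1, 1, -1, -1, 1, -1]
y = [1, 1, -1, 1, -1, 1, 1, -1]
result = binary_dot(pack_bits(x), pack_bits(y), len(x))
reference = sum(a * b for a, b in zip(x, y))
```

Because each agreement contributes +1 and each disagreement -1, a single XNOR plus popcount replaces N multiplications and N-1 additions per machine word.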
III-B Reduced-Precision DNNs
We exploit reduced-precision weights and limit the numerical precision of a DNN-based mask estimator to either 8- or 4-bit fixed-point representations or to binary weights. Recently, there have been numerous extensions to train DNNs with limited precision [64, 61, 57, 11].
III-B1 DNN with Low-precision Weights
The weights and activations of a DNN often lie within a small range, making it possible to introduce quantization schemes. Implementations like [35, 47] use reduced precision for their DNN’s weights. In [55], a threefold improvement in inference speed has been reported for a fixed-point implementation on general-purpose hardware. Hence, we consider a fixed-point representation of the computed values in the forward pass of our DNN [11]. In particular, we use 8- and 4-bit weights in the Q2.6 and Q2.2 fractional formats, respectively. After each layer, we use batch normalization to ensure that the activations fit within . The accumulation of the values in the dot products and the batch normalization are performed with high precision, while the multiplication is performed at lower precision.
During training we compute the gradient and update the weights using float32, while the precision is only reduced in the forward pass (the derivative is computed with respect to the quantized weights as in [10, 64, 35]). This is known as the straight-through estimator (STE) [10, 64], where the parameter update is performed in full precision. Usually, when deploying the DNN in an application, only the forward-pass calculations are required. Hence, the reduced-precision weights can be used, reducing memory requirements by a factor of 4 or 8 compared to 32-bit weight representations. Figure 3 shows a reduced-precision LSTM cell. Besides the well-known gating and vector-matrix computations of LSTMs, bit clipping operations are introduced after each mathematical operation. Details of the LSTM cell can be found in [23].
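The fixed-point forward pass with a full-precision update can be illustrated as follows. The quantizer below rounds to a signed fixed-point grid and saturates at the range limits; it assumes that the sign bit is counted among the integer bits, so that Q2.6 occupies 8 bits with range [-2, 2 - 1/64] and Q2.2 occupies 4 bits with range [-2, 1.75]. This convention, the toy regression problem and all names are assumptions made for this sketch.

```python
import numpy as np

def quantize_q(x, int_bits, frac_bits):
    """Round to a signed fixed-point grid Q<int_bits>.<frac_bits> with saturation."""
    scale = 2.0 ** frac_bits
    lo = -2.0 ** (int_bits - 1)
    hi = 2.0 ** (int_bits - 1) - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

# One straight-through update on a toy linear model.
w = np.array([0.3, -1.27])             # full-precision (float) shadow weights
x, target = np.array([1.0, 2.0]), 0.5

w_q = quantize_q(w, 2, 6)              # the forward pass uses the Q2.6 weights
err = w_q @ x - target
grad = err * x                         # gradient w.r.t. w_q, passed straight through to w
w = w - 0.1 * grad                     # the update is applied in full precision
```

At deployment time only `w_q` needs to be stored, which is where the memory savings relative to 32-bit weights come from.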
III-B2 DNN with Binary Weights
In [10], binary-weight DNNs are trained using the STE, i.e. deterministic or stochastic rounding is used during forward propagation, and the full-precision weights are updated based on the gradients of the quantized weights. In [27], the STE is used to quantize both the weights and the activations to a single bit using sign functions. [33] trained ternary weights by setting weights below or above a certain threshold to , or zero otherwise. This has been extended in [65] by learning the factors and using gradient updates and by applying a different threshold.
When dealing with recurrent architectures such as LSTMs, [35] observed that recent reduced-precision techniques for BNNs [10, 27] cannot be directly extended to recurrent layers. In particular, a simple reduction of the precision of the forward pass to 1 bit in the LSTM layers suffers from severe performance degradation as well as the vanishing gradient problem. In [65, 3], batch normalization and weight scaling are applied to the recurrent weights to overcome this problem. We adopt this approach, i.e. we introduce a trainable scaling parameter , which maps the range of the recurrent activations to . Hence, each of the recurrent weight matrices and has its own scaling factor, i.e. . See also Fig. 4. This limits the recurrent weights to small values, preventing the LSTM from reaching unstable states, i.e. it avoids accumulating the cell states to large numbers. For binary weights, the LSTM cell equations are given as:
(14a)  
(14b)  
(14c)  
(14d)  
(14e)  
(14f) 
where and are binary versions (i.e. hard sigmoid and sign function [10]) of the well-known sigmoid and tanh activation functions. The weights and biases are the binary network parameters (i.e. with values of ), and are the scaling parameters for the recurrent network weights.
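As a rough illustration of such a cell, the following NumPy sketch performs one step of an LSTM with binarized weights, hard sigmoid gates, a sign function in place of tanh, and one scaling factor per recurrent weight matrix. It follows the standard LSTM structure and is not a transcription of Eqs. (14a) to (14f): the gate ordering, the hard sigmoid definition clip((x+1)/2, 0, 1), the on-the-fly binarization of real-valued shadow parameters, the omission of batch normalization and all names are assumptions made for this example.

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear replacement of the sigmoid (assumed form)."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def sign(x):
    """Binarization to +1/-1, used for weights and in place of tanh."""
    return np.where(x >= 0, 1.0, -1.0)

def binary_lstm_step(x, h, c, W, R, b, gamma):
    """One LSTM step with binary weights.

    W: (4n, d) input weights, R: (4n, n) recurrent weights, b: (4n,) biases,
    gamma: (4,) scaling factors, one per recurrent gate matrix.
    """
    n = h.shape[0]
    scale = np.repeat(gamma, n)                          # expand to one value per row of R
    pre = sign(W) @ x + scale * (sign(R) @ h) + sign(b)
    i, f, o, g = np.split(pre, 4)                        # input, forget, output, candidate
    c_new = hard_sigmoid(f) * c + hard_sigmoid(i) * sign(g)
    h_new = hard_sigmoid(o) * sign(c_new)
    return h_new, c_new

# Toy dimensions: 5 inputs, 4 hidden units.
rng = np.random.default_rng(0)
d, n = 5, 4
W, R, b = rng.standard_normal((4 * n, d)), rng.standard_normal((4 * n, n)), rng.standard_normal(4 * n)
gamma = np.full(4, 0.5)
h, c = np.zeros(n), np.zeros(n)
for _ in range(10):
    h, c = binary_lstm_step(rng.standard_normal(d), h, c, W, R, b, gamma)
```

Keeping the recurrent scaling factors small limits the contribution of the previous hidden state to the pre-activations, which is the mechanism the text describes for avoiding unstable cell states.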
IV Experiments
IV-A Experimental Setup
The performance of the multi-channel speech enhancement system is demonstrated by simulating a typical living room scenario with two static speakers S_{1} and S_{2}, two moving speakers D_{1} and D_{2}, and an isotropic background noise source I, similar to [39]. The floor plan of the setup is shown in Figure 5. The circular microphone array with microphones and a diameter of is shown in red, labeled as Mic. Head movements of the static speakers S_{1} and S_{2} are simulated by random 3D position changes within . The trajectories of the moving speakers D_{1} and D_{2} are random within a region of 2 m × 4 m on both sides of the microphone array. The movement velocity is constant at .
We specify five scenarios for our experiments using this shoebox model:

Random vs. isotropic (RI): A static speaker with head movements is the random source. The position is randomly selected in the room for each new utterance to prevent the model from learning the position of the speaker.

Static1 vs. isotropic (S_{1}I): A stationary speaker at fixed position S_{1} and an isotropic background noise are used in this scenario. The head movements cause a varying phase especially at higher frequencies.

Static1 vs. static2 + isotropic (S_{1}S_{2}I): Two simultaneously talking speakers at position S_{1} and S_{2} embedded in isotropic background noise are used in this scenario.

Dynamic1 vs. isotropic (D_{1}I): The speaker moving in region D_{1} has to be tracked in the presence of ambient background noise. This challenges the tracking capabilities of the DNN mask estimation.

Dynamic1 vs. dynamic2 + isotropic (D_{1}D_{2}I): The capability to separate two speakers moving in D_{1} and D_{2}, embedded in background noise, is analysed.
These experimental setups are summarized in Table IV:
Experiment #  Desired source  Interfering source(s) 
1  random R  isotropic I 
2  S_{1}  isotropic I 
3  S_{1}  S_{2}, isotropic I 
4  D_{1}  isotropic I 
5  D_{1}  D_{2}, isotropic I 
IV-B Data Generation
We use the Image Source Method (ISM) [36, 44] to simulate the ATFs in Eq. (2). This makes it possible to generate multi-channel recordings from a monaural source. The room is modeled as a shoebox with a reflection coefficient of for each wall. The reflection order is , which results in a reverberation time of . We generate a new set of ATFs every for the moving sources. The isotropic background noise is determined as
(15) 
where is the monaural noise source, , and denotes the eigenvalue and eigenvector matrices of the spatial coherence matrix for a spherical sound field [31]. The vector denotes a uniformly distributed phase between .
IV-C Training and Testing
We use 12776 utterances from the si_tr_s set of the WSJ0 [37] corpus for the speech sources in Eq. (2) for training. Additionally, 20 hours of different sound categories from YouTube [42] are used as isotropic background noise. All recordings are sampled at 16 kHz and converted to the frequency domain with bins and 75% overlapping blocks. The sources are mixed with equal volume. For testing, we use 2907 utterances from the si_et_05 set of the WSJ0 corpus mixed with YouTube noise.
The ground truth gain masks required for training can be obtained for the desired signal as:
(16) 
The mask for the interfering signals is given as:
(17) 
The weak signal components, which do not contribute to any of the PSD matrices, are obtained as:
(18) 
Parameter specifies the amount of energy per frequency bin required for the signal to be assigned to either the desired or interfering class label. Note that the calculation of the ground truth masks requires the corresponding signal energies and to be known, which is why we used the ISM rather than existing multichannel speech databases such as [34].
By setting for each time-frequency bin, we can use the cross-entropy as loss function. For each (B)LSTM or dense layer, a tanh activation and batch normalization [29] are applied. We train a separate DNN for each of the five scenarios in Table IV. Model optimization is done using stochastic gradient descent with ADAM [30], using the cross-entropy between the optimal binary mask and the estimated mask of the respective model. To avoid overfitting, we use early stopping by observing the error on the validation set every 20 epochs.
IV-D Performance evaluation
We use three different beamformers: the MVDR, GEV-BAN and GEV-PAN (see Section II) for each gain mask. The estimates of the PSD matrices are obtained using Eq. (10), where blocks. We apply the BeamformIt toolkit [2] as baseline. It uses DOA estimation [4] followed by an MVDR beamformer. To evaluate the performance of the enhanced signals , we use the Google Speech-to-Text API [49] to perform Automatic Speech Recognition (ASR). Furthermore, we determine the SNR improvement as:
(19) 
where the optimal binary mask is used to measure the energy of the desired and interfering components in the beamformer output and the noisy inputs , respectively. The can be computed without having access to the beamforming weights , as is the case of the BeamformIt toolkit.
IV-E Results
While improvements of memory footprint and computation time are independent of the underlying tasks, the prediction accuracy highly depends on the complexity of the data and the used neural network. Simple data sets allow for aggressive quantization without affecting prediction performance significantly, while binary quantization results in severe prediction degradation on more complex data sets.
Figure 6 shows speech mask estimations using (a) 32-, (b) 8-, (c) 4- and (d) 1-bit DNNs from the mixture of scenario (S_{1}I) of the WSJ0 utterance “When its initial public offering is completed Ashland is expected to retain a 46% stake” from si_et_05. As noted in Section II-E, the activation function of the output layer is a full-precision softmax function. The reduction of the weight precision introduces artifacts in (b), (c) and (d).
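[Figure 6: Speech mask estimates for scenario (S_{1}I); panels (a) 32-bit, (b) 8-bit, (c) 4-bit and (d) 1-bit DNN.]
[Figure 7: Log-spectrograms; panels (a) source signal, (b) noise, (c) mixture, (d) to (h) reconstructions using BeamformIt and 32-, 8-, 4- and 1-bit DNNs with a GEV-BAN beamformer.]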
Figure 7 shows the corresponding log-spectrograms. In particular, (a) shows the original source signal, (b) the noise, and (c) the mixture; (d) to (h) show the reconstructed source signals using BeamformIt and 32-, 8-, 4- and 1-bit DNNs using a GEV-BAN beamformer, respectively. Reduced-precision DNNs generate reasonable predictions compared to the single-precision baseline. BeamformIt is not able to remove the low-frequency components of the car noise. The reduced-precision DNNs are able to attenuate the car noise in the background in a similar way as the 32-bit baseline DNN.
This is also reflected in Table V, which shows the SNR improvement on the test set for experiments 1 to 5. Mask-based beamformers outperform BeamformIt in all five experiments. Reducing the bit-width slightly degrades the SNR performance, but it also reduces the memory footprint of the models. There is a small difference between the mask-based beamformers, i.e. GEV performs slightly better than MVDR. In general, 8-bit mask-based estimators achieve competitive SNR scores, comparable to the full-precision baseline.
bits  experiment 1  GEV BAN  GEV PAN  MVDR 
32  BeamformIt      0.57 
32  DNN  8.09  8.37  7.36 
8  DNN  7.61  8.00  6.77 
4  DNN  4.36  5.81  4.17 
1  DNN  5.47  6.30  4.96 
bits  experiment 2  GEV BAN  GEV PAN  MVDR 
32  BeamformIt      0.46 
32  DNN  8.57  8.76  7.95 
8  DNN  8.43  8.63  7.87 
4  DNN  7.50  8.05  6.21 
1  DNN  7.60  8.03  6.72 
bits  experiment 3  GEV BAN  GEV PAN  MVDR 
32  BeamformIt      0.20 
32  DNN  11.69  11.96  10.48 
8  DNN  10.09  10.53  10.88 
4  DNN  10.29  10.96  6.65 
1  DNN  10.71  11.21  8.83 
bits  experiment 4  GEV BAN  GEV PAN  MVDR 
32  BeamformIt      0.19 
32  DNN  8.44  8.72  7.63 
8  DNN  8.11  8.46  7.27 
4  DNN  7.01  7.78  5.49 
1  DNN  6.62  7.30  5.76 
bits  experiment 5  GEV BAN  GEV PAN  MVDR 
32  BeamformIt      0.35 
32  DNN  12.73  13.15  10.72 
8  DNN  12.09  12.62  9.50 
4  DNN  10.26  11.09  4.60 
1  DNN  11.25  12.01  7.31 
Table VI reports the word error rate (WER). We use the 6-channel data processed with 32-, 8-, 4- and 1-bit DNNs for speech mask estimation using GEV-PAN, GEV-BAN and MVDR beamformers. Ground-truth transcriptions were generated using the original WSJ0 recordings. Single-precision networks obtained the best overall WER in all experiments. In the case of reduced-precision networks, 8-bit DNNs produce competitive results when using MVDR beamformers. For the 4- and 1-bit variants the performance degrades. For experiments with more than one dominant source, BeamformIt fails. In general, results for single-speaker scenarios (experiments 1, 2 and 4) are better.
bits  experiment 1  GEV BAN  GEV PAN  MVDR 
32  BeamformIt      21.38 
32  DNN  9.14  11.71  9.69 
8  DNN  12.13  16.57  10.62 
4  DNN  22.14  21.89  15.24 
1  DNN  29.97  40.23  15.07 
bits  experiment 2  GEV BAN  GEV PAN  MVDR 
32  BeamformIt      22.77 
32  DNN  8.15  11.10  9.64 
8  DNN  8.89  10.67  10.17 
4  DNN  12.79  17.56  12.21 
1  DNN  10.88  15.57  11.20 
bits  experiment 3  GEV BAN  GEV PAN  MVDR 
32  BeamformIt      84.68 
32  DNN  15.38  17.48  16.24 
8  DNN  24.84  26.02  24.69 
4  DNN  21.94  29.18  25.76 
1  DNN  20.78  26.79  21.89 
bits  experiment 4  GEV BAN  GEV PAN  MVDR 
32  BeamformIt      22.95 
32  DNN  13.99  19.12  14.63 
8  DNN  16.00  21.72  16.47 
4  DNN  27.66  37.00  20.08 
1  DNN  26.69  38.37  19.68 
bits  experiment 5  GEV BAN  GEV PAN  MVDR 
32  BeamformIt      80.90 
32  DNN  19.80  27.01  21.04 
8  DNN  24.33  33.23  23.81 
4  DNN  37.21  49.52  43.03 
1  DNN  31.92  44.09  31.18 
V Conclusion
We introduced a resource-efficient approach for multi-channel speech enhancement using DNNs for speech mask estimation. In particular, we reduce the precision to 8, 4 and 1 bit. We use a recurrent neural network structure capable of learning long-term relations. Limiting the bit-width of the DNNs reduces the memory footprint and improves the computational efficiency while the degradation in speech mask estimation performance is marginal. When deploying the DNN in speech processing front-ends, only the reduced-precision weights and forward-pass calculations are required. This supports speech enhancement on low-cost, low-power and limited-resource front-end hardware. We conducted five experiments simulating various cocktail party scenarios using the WSJ0 corpus. In particular, different beamforming architectures, i.e. MVDR, GEV-BAN and GEV-PAN, combined with low bit-width mask estimators have been evaluated. MVDR beamformers using 8-bit reduced-precision DNNs for estimating the speech mask obtain competitive SNR scores compared to the single-precision baselines. Furthermore, the same architecture achieves competitive WERs in single-speaker scenarios, measured with the Google Speech-to-Text API. If multiple speakers are introduced, the performance degrades. In the case of binary DNNs, we show a significant reduction of memory footprint while still obtaining an audio quality which is only slightly lower compared to single-precision DNNs. We show that these trade-offs can be readily exploited on today’s hardware, by benchmarking the core operation of binary DNNs on NVIDIA and ARM architectures.
In the future, we aim to implement the system on target hardware and measure the resource consumption and run time.
References
[1] (2019-05) Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 111–122. ISSN 1530-2075.
[2] (2007-09) Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing 15 (7), pp. 2011–2021.
[3] (2019) Learning recurrent binary/ternary weights. In International Conference on Learning Representations.
[4] (2008) Microphone array signal processing. Springer, Berlin–Heidelberg–New York.
[5] (2008) Springer handbook of speech processing. Springer, Berlin–Heidelberg–New York.
[6] (2018) Exploring practical aspects of neural mask-based beamforming for far-field speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6697–6701.
[7] (2001) Microphone arrays. Springer, Berlin–Heidelberg–New York.
[8] (2016) Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, Piscataway, NJ, USA, pp. 367–379. ISBN 9781467389471.
[9] (2015) An exploration of parameter redundancy in deep networks with circulant projections. In International Conference on Computer Vision (ICCV), pp. 2857–2865.
[10] (2015) BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems (NIPS), pp. 3123–3131.
[11] (2015) Training deep neural networks with low precision multiplications. In International Conference on Learning Representations (ICLR) Workshop, Vol. abs/1412.7024.
[12] (2010) Binary coding of speech spectrograms using a deep autoencoder. In Interspeech, pp. 1692–1695.
[13] (2016) Improved MVDR beamforming using single-channel mask prediction networks. In Interspeech.
[14] (2001-08) Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Transactions on Signal Processing 49 (8).
[15] (2018) MorphNet: fast & simple resource-constrained structure learning of deep networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, pp. 1586–1595.
[16] (2016) Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR).
[17] (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143.
[18] (2009) Neural networks and learning machines. Third edition, Pearson Education.
[19] (2015) BLSTM supported GEV beamformer front-end for the 3RD CHiME challenge. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 444–451.
[20] (2016-03) Neural network based spectral mask estimation for acoustic beamforming. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 196–200.
[21] (2016) Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 4, pp. 5210–5214.
[22] (2015) Distilling the knowledge in a neural network. In Deep Learning and Representation Learning Workshop @ NIPS.
[23] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[24] (2014-02) 1.1 Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14. ISSN 0193-6530.
[25] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861.
[26] (2006) Acoustic MIMO signal processing. Springer, Berlin–Heidelberg–New York.
[27] (2016) Binarized neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 4107–4115.
[28] (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360.
[29] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In JMLR, pp. 448–456.
[30] (2015) Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
[31] (2009) Room acoustics. 5th edition, Spon Press, London–New York.
[32] (1989) Optimal brain damage. In Advances in Neural Information Processing Systems (NIPS), pp. 598–605.
[33] (2016) Ternary weight networks. CoRR abs/1605.04711.
[34] (2005-11) The multichannel Wall Street Journal audio visual corpus (MC-WSJ-AV): specification and initial experiments. In IEEE Workshop on Automatic Speech Recognition and Understanding, 2005, pp. 357–362.
[35] (2016) Recurrent neural networks with limited numerical precision. CoRR abs/1608.06902.
[36] (2007) Generating sensor signals in isotropic noise fields. The Journal of the Acoustical Society of America 122 (6), pp. 3464–3470.
[37] (1992) The design for the Wall Street Journal-based CSR corpus. In Proceedings of the Workshop on Speech and Natural Language, HLT ’91, Stroudsburg, PA, USA, pp. 357–362. ISBN 1558602720.
[38] (2014-05) Blind source extraction based on a direction-dependent a-priori SNR. In Interspeech.
[39] (2019-12) Eigenvector-based speech mask estimation for multichannel speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 2162–2172. ISSN 2329-9304.
[40] (2017-03) DNN-based speech mask estimation for eigenvector beamforming. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, pp. 66–70.
[41] (2017-08) Eigenvector-based speech mask estimation using logistic regression. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, 2017, pp. 2660–2664.
[42] (2018) PyTube – a lightweight, Pythonic, dependency-free library for downloading YouTube videos.
[43] (2019) Resource-efficient neural networks for embedded systems. JMLR submitted.
[44] (2017) Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. CoRR abs/1710.04196.
[45] (2018) Towards efficient forward propagation on resource-constrained systems. In European Conference on Machine Learning (ECML).
[46] (2016) Deep beamforming and data augmentation for robust speech recognition: results of the 4th CHiME challenge. In Proc. of the 4th Intl. Workshop on Speech Processing in Everyday Environments (CHiME 2016).
[47] (2016) Fixed-point performance analysis of recurrent neural networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 976–980.
[48] (2009-08) Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals. IEEE Transactions on Audio, Speech, and Language Processing 17 (6).
[49] (2018) SpeechRecognition – a library for performing speech recognition, with support for several engines and APIs, online and offline.
[50] (2009-05) Relative transfer function identification using convolutive transfer function approximation. IEEE Transactions on Audio, Speech, and Language Processing 17 (4).
[51] (2017) Soft weight-sharing for neural network compression. In International Conference on Learning Representations (ICLR).
[52] (2017) FINN: A framework for fast, scalable binarized neural network inference. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (ISFPGA), pp. 65–74.
[53] (2017) FINN: a framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’17, pp. 65–74.
[54] (2017) Streamlined deployment for quantized neural networks. CoRR abs/1709.04060.
[55] (2011) Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop @ NIPS.
[56] (1988-04) Beamforming: a versatile approach to spatial filtering. IEEE International Conference on Acoustics, Speech, and Signal Processing 5 (5), pp. 4–24.
[57] (2018) Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems 31, pp. 7675–7684.
[58] (2007) Blind acoustic beamforming based on generalized eigenvalue decomposition. In IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, pp. 1529–1539.
[59] (2008) Speech enhancement with a new generalized eigenvector blocking matrix for application in a generalized sidelobe canceller. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 73–76.
[60] (2018) ProdSumNet: reducing model parameters in deep neural networks via product-of-sums matrix decompositions. CoRR abs/1809.02209.
[61] (2018) Training and inference with integers in deep neural networks.
[62] (2015) Deep fried convnets. In International Conference on Computer Vision (ICCV), pp. 1476–1483.
[63] (2016) Cambricon-X: An accelerator for sparse neural networks. In International Symposium on Microarchitecture (MICRO), pp. 20:1–20:12.
[64] (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160.
[65] (2017) Trained ternary quantization. In International Conference on Learning Representations (ICLR).
[66] (2015) Representation learning for single-channel source separation and bandwidth extension. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (12), pp. 2398–2409. ISSN 2329-9290.
[67] (2018-04) Resource efficient deep eigenvector beamforming. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018.
[68] (2017) Learning transferable architectures for scalable image recognition. CoRR abs/1707.07012.