Archives of Acoustics, 39, 3, pp. 411-420, 2014
DOI: 10.2478/aoa-2014-0045

Two-Microphone Dereverberation for Automatic Speech Recognition of Polish

Mikolaj KUNDEGORSKI
School of Engineering and Computing Sciences, Durham University, Durham, UK

Philip J.B. JACKSON
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK

Bartosz ZIÓŁKO
Department of Electronics, AGH University of Science and Technology, Kraków, Poland

Reverberation is a common problem for many speech technologies, such as automatic speech recognition (ASR) systems. This paper investigates a novel combination of precedence, binaural and statistical-independence cues for enhancing reverberant speech prior to ASR, under adverse acoustical conditions when two microphone signals are available. Results of the enhancement are evaluated in terms of relevant signal measures and recognition accuracy for both English and Polish ASR tasks. These show inconsistencies between the signal and recognition measures, although in recognition the proposed method consistently outperforms all other cue combinations and the spectral-subtraction baseline.
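The spectral-subtraction baseline referred to above (Boll, 1979) can be sketched in a few lines: subtract an estimate of the noise magnitude spectrum from each short-time frame and resynthesise with the noisy phase. The function name, parameters and noise-estimation strategy below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def spectral_subtraction(x, frame_len=512, hop=256, noise_frames=5,
                         alpha=2.0, floor=0.01):
    """Basic magnitude spectral subtraction (in the spirit of Boll, 1979).

    The noise spectrum is estimated from the first `noise_frames` frames,
    which are assumed to contain no speech (an illustrative assumption).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    out = np.zeros(len(x))
    # Average noise magnitude spectrum from the leading (speech-free) frames.
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(window * x[i * hop:i * hop + frame_len]))
         for i in range(noise_frames)], axis=0)
    for i in range(n_frames):
        seg = window * x[i * hop:i * hop + frame_len]
        spec = np.fft.rfft(seg)
        mag = np.abs(spec)
        # Over-subtract the noise estimate, keeping a spectral floor to
        # limit musical-noise artefacts.
        clean = np.maximum(mag - alpha * noise_mag, floor * mag)
        # Resynthesise with the noisy phase and overlap-add.
        out[i * hop:i * hop + frame_len] += np.fft.irfft(
            clean * np.exp(1j * np.angle(spec)), frame_len)
    return out
```

Single-channel subtraction of this kind addresses additive noise rather than reverberant smearing, which is why the paper treats it only as a baseline against the two-microphone cues.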
Keywords: speech enhancement; reverberation; ASR; Polish.
Copyright © Polish Academy of Sciences & Institute of Fundamental Technological Research (IPPT PAN).

References

Alinaghi, A., Wang, W., Jackson, P. J. B., 2011. Integrating binaural cues and blind source separation method for separating reverberant speech mixtures. In: Proc. of ICASSP, Prague. pp. 209–212.

Blauert, J., 1997. Spatial Hearing: The Psychophysics of Human Sound Localization, 2nd Edition. MIT Press.

Boll, S., 1979. Suppression of acoustic noise in speech using spectral subtraction. Acoustics, Speech and Signal Processing, IEEE Trans. on 27 (2), 113–120.

Chien, J.-T., Lai, P.-Y., 2005. Car speech enhancement using a microphone array. Int. Journal of Speech Technology 8, 79–91.

Drgas, S., Kociński, J., Sęk, A., 2008. Logatom articulation index evaluation of speech enhanced by blind source separation and single-channel noise reduction. Archives of Acoustics 33 (4).

Fukumori, T., Nakayama, M., Nishiura, T., Yamashita, Y., Oct 2013. Estimation of speech recognition performance in noisy and reverberant environments using PESQ score and acoustic parameters. In: Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific. pp. 1–4.

Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., Pallett, D. S., Dahlgren, N. L., Zue, V., 1993. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, Philadelphia.

Gomez, R., Kawahara, T., 2010. Robust speech recognition based on derever- beration parameter optimization using acoustic model likelihood. Audio, Speech and Language Processing, IEEE Trans. on 18 (7), 1708–1716.

Grocholewski, S., 1998. First database for spoken Polish. In: Proc. of International Conference on Language Resources and Evaluation, Granada. pp. 1059–1062.

Hartmann, W. M., 1999. How we localize sound. Physics Today 52 (11), 24–29.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., Kingsbury, B., 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29 (6), 82.

Hummersone, C., Mason, R., Brookes, T., 2010. Dynamic precedence effect modeling for source separation in reverberant environments. Audio, Speech, and Language Processing, IEEE Trans. on 18 (7), 1867–1871.

Jeub, M., Schafer, M., Esch, T., Vary, P., 2010. Model-based dereverberation preserving binaural cues. Audio, Speech, and Language Processing, IEEE Trans. on 18 (7), 1732–1745.

Krishnamoorthy, P., Prasanna, S., 2009. Reverberant speech enhancement by temporal and spectral processing. Audio, Speech, and Language Processing, IEEE Trans. on 17 (2), 253–266.

Leonard, R., Doddington, G., 1993. TIDIGITS. Linguistic Data Consortium, Philadelphia.

Li, K., Guo, Y., Fu, Q., Yan, Y., Jan 2012. A two microphone-based approach for speech enhancement in adverse environments. In: Consumer Electronics (ICCE), 2012 IEEE International Conference on. pp. 41–42.

Litovsky, R., Colburn, H., Yost, W., Guzman, S., Oct. 1999. The precedence effect. J. Acoust. Soc. Am. 106, 1633–1654.

Mandel, M., Weiss, R., Ellis, D., 2010. Model-based expectation-maximization source separation and localization. Audio, Speech, and Language Processing, IEEE Trans. on 18 (2), 382–394.

Nakatani, T., Kinoshita, K., Miyoshi, M., 2007. Harmonicity-based blind dereverberation for single-channel speech signals. Audio, Speech, and Language Processing, IEEE Trans. on 15 (1), 80–95.

Naylor, P. A., Gaubitch, N. D., 2005. Speech dereverberation. In: Proc. of Int. Workshop Acoust. Echo Noise Control, Eindhoven.

Palomaki, K. J., Brown, G. J., Wang, D., 2004. A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation. Speech Communication 43 (4), 361–378.

Pearce, D., Hirsch, H., 2000. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: ISCA ITRW ASR. pp. 29–32.

Pearson, J., Lin, Q., Che, C., Yuk, D.-S., Jin, L., de Vries, B., Flanagan, J., 1996. Robust distant-talking speech recognition. In: Proc. of ICASSP, Atlanta. Vol. 1. pp. 21–24.

Sawada, H., Araki, S., Makino, S., 2007. A two-stage frequency-domain blind source separation method for underdetermined convolutive mixtures. In: Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. pp. 139–142.

Seltzer, M. L., Raj, B., Stern, R. M., 2004. Likelihood-maximizing beamforming for robust hands-free speech recognition. Speech and Audio Processing, IEEE Trans. on 12, 489–498.

Shi, G., Aarabi, P., 2003. Robust digit recognition using phase-dependent time-frequency masking. In: Proc. of ICASSP, Hong Kong. pp. 684–687.

Vincent, E., Gribonval, R., Fevotte, C., 2006. Performance measurement in blind audio source separation. Audio, Speech, and Language Processing, IEEE Trans. on 14 (4), 1462–1469.

Ward, D., Kennedy, R., Williamson, R., 2001. Constant directivity beamforming. In: Microphone Arrays. Springer-Verlag.

Wu, M., Wang, D., 2006. A two-stage algorithm for one-microphone reverberant speech enhancement. Audio, Speech, and Language Processing, IEEE Trans. on 14, 774–784.

Young, S. J., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P., 2006. The HTK Book Version 3.4. Cambridge University Press.

Ziółko, B., Manandhar, S., Wilson, R., Ziółko, M., Gałka, J., 2008. Application of HTK to the Polish language. In: Proc. of International Conference on Audio, Language and Image Processing, Shanghai.

Ziółko, M., Gałka, J., Ziółko, B., Jadczyk, T., Skurzok, D., Mąsior, M., 2011. Automatic speech recognition system dedicated for Polish. In: Proc. of Interspeech, Florence.
