Archives of Acoustics, 45, 3, pp. 419–431, 2020
DOI: 10.24425/aoa.2020.134058

A Study on the Impact of Lombard Effect on Recognition of Hindi Syllabic Units Using CNN Based Multimodal ASR Systems

Sadasivam UMA MAHESWARI
SSN College of Engineering
India

A. SHAHINA
SSN College of Engineering
India

Ramesh RISHICKESH
SSN College of Engineering
India

A. NAYEEMULLA KHAN
VIT University
India

Research on the design of robust multimodal speech recognition systems that exploit acoustic and visual cues, extracted using relatively noise-robust alternate speech sensors, has gained interest in recent times among the speech processing community. The primary objective of this work is to study the exclusive influence of the Lombard effect on the automatic recognition of confusable syllabic consonant-vowel units of the Hindi language, as a step towards building robust multimodal ASR systems for adverse environments in the context of Indian languages, which are syllabic in nature. The dataset for this work comprises 145 confusable consonant-vowel (CV) syllabic units of Hindi recorded simultaneously using three modalities that capture acoustic and visual speech cues: a normal acoustic microphone (NM), a throat microphone (TM), and a camera that captures the associated lip movements. The Lombard effect is induced by feeding crowd noise into the speaker’s headphones while recording. Convolutional Neural Network (CNN) models are built to categorise the CV units based on their place of articulation (POA), manner of articulation (MOA), and vowels, under clean and Lombard conditions. For validation, corresponding Hidden Markov Models (HMM) are also built and tested. Unimodal Automatic Speech Recognition (ASR) systems built using each of the three speech cues from Lombard speech show a loss in recognition of MOA and vowels, while POA recognition gets a boost in all the systems due to the Lombard effect. Combining the three complementary speech cues into bimodal and trimodal ASR systems reduces the recognition loss for MOA and vowels relative to the unimodal systems, while POA recognition remains better under the Lombard effect. A bimodal system using only the alternate acoustic and visual cues is proposed, which gives better discrimination of the place and manner of articulation than even the standard ASR system.
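As a rough illustration of the kind of CNN classifier described above, the sketch below (not the authors' actual architecture; layer sizes, input dimensions, and the 5-way output are illustrative placeholders) classifies a single-channel time-frequency map, such as a spectrogram patch of a CV unit, into a small set of articulation classes:

```python
import torch
import torch.nn as nn

# Hedged sketch of a CNN for CV-unit classification (e.g. place of
# articulation). All layer sizes and the number of classes are
# illustrative assumptions, not taken from the paper.
class CVClassifier(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),   # fixed-size output regardless of input length
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):                   # x: (batch, 1, freq_bins, time_frames)
        h = self.features(x)
        return self.classifier(h.flatten(1))  # class logits

model = CVClassifier()
logits = model(torch.randn(8, 1, 40, 100))  # batch of 8 spectrogram patches
print(logits.shape)                          # torch.Size([8, 5])
```

The adaptive pooling layer lets utterances of varying duration map to a fixed-length feature vector, which is one common way to handle variable-length syllable tokens.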
Among the multimodal ASR systems studied, the proposed trimodal system based on Lombard speech gives the best recognition accuracy of 98%, 95%, and 76% for the vowels, MOA and POA, respectively, with an average improvement of 36% over the unimodal ASR systems and 9% improvement over the bimodal ASR systems.
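The keywords mention late fusion; one common form of it, combining the class posteriors of the three unimodal systems by weighted averaging, can be sketched as follows (the posterior vectors below are toy placeholders, and uniform weighting is just one possible fusion rule, not necessarily the one used in the paper):

```python
import numpy as np

def late_fusion(posteriors, weights=None):
    """Decision-level fusion: combine per-modality class posteriors
    by (weighted) averaging and renormalise.

    posteriors: list of 1-D arrays, one per modality (e.g. NM, TM,
                visual), each a distribution over the same class set.
    """
    stacked = np.stack(posteriors)                    # (n_modalities, n_classes)
    if weights is None:
        weights = np.full(len(posteriors), 1.0 / len(posteriors))
    fused = np.asarray(weights) @ stacked             # weighted average per class
    return fused / fused.sum()                        # renormalise to a distribution

# Toy posteriors over 5 hypothetical MOA classes (illustrative values):
nm  = np.array([0.60, 0.20, 0.10, 0.05, 0.05])  # normal microphone
tm  = np.array([0.30, 0.40, 0.15, 0.10, 0.05])  # throat microphone
vis = np.array([0.50, 0.10, 0.25, 0.10, 0.05])  # lip movements

fused = late_fusion([nm, tm, vis])
print(int(fused.argmax()))  # class favoured after combining all three cues
```

Modality-dependent weights could instead be tuned on held-out data so that the cue least degraded by the Lombard effect dominates the fused decision.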
Keywords: Lombard speech; multimodal ASR; throat microphone; visual speech; Convolutional Neural Network; Hidden Markov Model; late fusion; intermediate fusion
Copyright © The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

References

Abdel-Hamid O., Mohamed A., Jiang H., Penn G. (2012), Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition, [in:] 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4277–4280, doi: 10.1109/ICASSP.2012.6288864.

Alexanderson S., Beskow J. (2014), Animated Lombard speech: motion capture, facial animation and visual intelligibility of speech produced in adverse conditions, Computer Speech & Language, 28(2): 607–618, doi: 10.1016/j.csl.2013.02.005.

Boril H. (2008), Robust speech recognition: Analysis and equalization of Lombard effect in Czech corpora, Ph.D. thesis, Czech Technical University in Prague, Czech Rep., https://personal.utdallas.edu/_hynek/.

Boril H., Hansen J.H. (2010), Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments, IEEE Transactions on Audio, Speech, and Language Processing, 18(6): 1379–1393, doi: 10.1109/TASL.2009.2034770.

Bou-Ghazale S.E., Hansen J.H. (1994), Duration and spectral based stress token generation for HMM speech recognition under stress, [in:] 1994 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-94), Vol. 1, pp. I/413–I/416, doi: 10.1109/ICASSP.1994.389268.

Davis C., Kim J., Grauwinkel K., Mixdorff H. (2006), Lombard speech: auditory (A), visual (V) and AV effects, [in:] Proceedings of the Third International Conference on Speech Prosody, Citeseer, pp. 248–252.

Drugman T., Dutoit T. (2010), Glottal-based analysis of the Lombard effect, [in:] Interspeech, pp. 2610–2613.

Garnier M., Henrich N. (2014), Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?, Computer Speech & Language, 28(2): 580–597, doi: 10.1016/j.csl.2013.07.005.

Garnier M., Henrich N., Dubois D. (2010), Influence of sound immersion and communicative interaction on the Lombard effect, Journal of Speech, Language, and Hearing Research, 53(3): 588–608, doi: 10.1044/1092-4388(2009/08-0138).

Graciarena M., Franco H., Sonmez K., Bratt H. (2003), Combining standard and throat microphones for robust speech recognition, IEEE Signal Processing Letters, 10(3): 72–74, doi: 10.1109/LSP.2003.808549.

Hansen J.H. (1994), Morphological constrained feature enhancement with adaptive cepstral compensation (MCE-ACC) for speech recognition in noise and Lombard effect, IEEE Transactions on Speech and Audio Processing, 2(4): 598–614, doi: 10.1109/89.326618.

Hansen J.H., Bria O.N. (1990), Lombard effect compensation for robust automatic speech recognition in noise, [in:] First International Conference on Spoken Language Processing, pp. 1125–1128, https://www.isca-speech.org/archive/icslp_1990/i90_1125.html.

Hansen J.H., Varadarajan V. (2009), Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition, IEEE Transactions on Audio, Speech, and Language Processing, 17(2): 366–378, doi: 10.1109/TASL.2008.2009019.

Heracleous P., Ishi C.T., Sato M., Ishiguro H., Hagita N. (2013), Analysis of the visual Lombard effect and automatic recognition experiments, Computer Speech & Language, 27(1): 288–300, doi: 10.1016/j.csl.2012.06.003.

Hinton G. et al. (2012), Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine, 29(6): 82–97, doi: 10.1109/MSP.2012.2205597.

Jou S.-C., Schultz T., Waibel A. (2004), Adaptation for soft whisper recognition using a throat microphone, [in:] Eighth International Conference on Spoken Language Processing, pp. 1493–1496, https://www.isca-speech.org/archive/interspeech_2004/i04_1493.html.

Junqua J.-C., Anglade Y. (1990), Acoustic and perceptual studies of Lombard speech: application to isolated-words automatic speech recognition, [in:] International Conference on Acoustics, Speech, and Signal Processing, ICASSP-90, Vol. 2, pp. 841–844, doi: 10.1109/ICASSP.1990.115969.

Khan A.N., Gangashetty S.V., Yegnanarayana B. (2003), Syllabic properties of three Indian languages: implications for speech recognition and language identification, [in:] International Conference on Natural Language Processing, pp. 125–134.

Lane H., Tranel B. (1971), The Lombard sign and the role of hearing in speech, Journal of Speech, Language, and Hearing Research, 14(4): 677–709, doi: 10.1044/jshr.1404.677.

Lombard E. (1911), The sign of the elevation of the voice [in French: Le signe de l’élévation de la voix], Annales des Maladies de l’Oreille, du Larynx, du Nez et du Pharynx, 37(2): 101–119.

Marxer R., Barker J., Alghamdi N., Maddock S. (2018), The impact of the Lombard effect on audio and visual speech recognition systems, Speech Communication, 100: 58–68, doi: 10.1016/j.specom.2018.04.006.

Palaz D., Collobert R., Magimai-Doss M. (2013), Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks, CoRR, Vol. abs/1304.1018, online, http://arxiv.org/abs/1304.1018.

Pisoni D., Bernacki R., Nusbaum H., Yuchtman M. (1985), Some acoustic-phonetic correlates of speech produced in noise, [in:] IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP’85, Vol. 10, pp. 1581–1584, doi: 10.1109/ICASSP.1985.1168217.

Rajasekaran P., Doddington G., Picone J. (1986), Recognition of speech under stress and in noise, [in:] IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP’86, Vol. 11, pp. 733–736, doi: 10.1109/ICASSP.1986.1169207.

Roucos S., Viswanathan V., Henry C., Schwartz R. (1986), Word recognition using multisensor speech input in high ambient noise, [in:] IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP’86, pp. 737–740, doi: 10.1109/ICASSP.1986.1169208.

Sainath T.N., Mohamed A., Kingsbury B., Ramabhadran B. (2013), Deep convolutional neural networks for LVCSR, [in:] 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8614–8618, doi: 10.1109/ICASSP.2013.6639347.

Shahina A. (2007), Processing throat microphone speech, Ph.D. thesis, Indian Institute of Technology, Madras.

Shahina A., Yegnanarayana B. (2007), Mapping speech spectra from throat microphone to close-speaking microphone: a neural network approach, EURASIP Journal on Advances in Signal Processing, 2007: 087219, doi: 10.1155/2007/87219.

Sadasivam U.M., Shahina A., Khan A.N., Divya J. (2015), Spectral transformation of Lombard speech to normal speech for speaker recognition systems, [in:] International Conference on Soft Computing Systems.



