A Study on the Impact of Lombard Effect on Recognition of Hindi Syllabic Units Using CNN Based Multimodal ASR Systems

Sadasivam UMA MAHESWARI; A. SHAHINA; Ramesh RISHICKESH; A. NAYEEMULLA KHAN

doi:10.24425/aoa.2020.134058

Authors

Sadasivam UMA MAHESWARI, SSN College of Engineering, India
A. SHAHINA, SSN College of Engineering, India
Ramesh RISHICKESH, SSN College of Engineering, India
A. NAYEEMULLA KHAN, VIT University, India

Abstract

Research on robust multimodal speech recognition systems that use acoustic and visual cues, extracted using relatively noise-robust alternate speech sensors, has been gaining interest in the speech processing research community. The primary objective of this work is to study the exclusive influence of the Lombard effect on the automatic recognition of confusable syllabic consonant-vowel units of the Hindi language, as a step towards building robust multimodal ASR systems for adverse environments in the context of Indian languages, which are syllabic in nature. The dataset comprises 145 confusable consonant-vowel (CV) syllabic units of Hindi recorded simultaneously using three modalities that capture acoustic and visual speech cues: a normal acoustic microphone (NM), a throat microphone (TM), and a camera that captures the associated lip movements. The Lombard effect is induced by feeding crowd noise into the speaker’s headphones during recording. Convolutional Neural Network (CNN) models are built to categorise the CV units by place of articulation (POA), manner of articulation (MOA), and vowel, under both clean and Lombard conditions. For validation, corresponding Hidden Markov Models (HMM) are also built and tested. Unimodal Automatic Speech Recognition (ASR) systems built using each of the three speech cues from Lombard speech show a loss in recognition of MOA and vowels, while POA recognition improves in all the systems due to the Lombard effect. Combining the three complementary speech cues to build bimodal and trimodal ASR systems reduces the recognition loss due to the Lombard effect for MOA and vowels compared to the unimodal systems, while POA recognition still benefits from the Lombard effect. A bimodal system is proposed that uses only the alternate acoustic and visual cues and gives better discrimination of the place and manner of articulation than even the standard ASR system. Among the multimodal ASR systems studied, the proposed trimodal system based on Lombard speech gives the best recognition accuracy of 98%, 95%, and 76% for the vowels, MOA, and POA, respectively, with an average improvement of 36% over the unimodal ASR systems and 9% over the bimodal ASR systems.

Keywords:

Lombard speech, multimodal ASR, throat microphone, visual speech, Convolutional Neural Network, Hidden Markov Model, late fusion, intermediate fusion
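
The keywords name late fusion as one way of combining the modalities, but the abstract does not state the fusion rule that was used. As an illustration only, the sketch below shows one common form of late fusion: each unimodal classifier (NM, TM, lip video) outputs class posteriors, and these are combined by a weighted sum of log-probabilities. The function name, the equal weights, and the posterior values are all hypothetical and are not taken from the paper.

```python
from math import exp, log


def late_fusion(posteriors, weights=None):
    """Fuse per-modality class posteriors by a weighted sum of log-probabilities.

    posteriors: list of per-modality probability lists, all of the same length.
    weights: optional per-modality weights; equal weights are assumed if omitted.
    """
    n_classes = len(posteriors[0])
    if weights is None:
        weights = [1.0 / len(posteriors)] * len(posteriors)
    eps = 1e-12  # guards against log(0)
    scores = [
        sum(w * log(p[c] + eps) for w, p in zip(weights, posteriors))
        for c in range(n_classes)
    ]
    # Normalise the fused scores back to a probability distribution.
    top = max(scores)
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical posteriors over three classes from the three modalities.
nm = [0.50, 0.30, 0.20]   # normal microphone
tm = [0.20, 0.60, 0.20]   # throat microphone
lip = [0.30, 0.50, 0.20]  # lip video

fused = late_fusion([nm, tm, lip])
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the normal microphone alone would pick class 0, but the fused decision follows the two modalities that agree on class 1. Unequal weights could be passed to favour a more reliable modality; how the weights would be chosen is outside what the abstract describes.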


