Archives of Acoustics, 43, 1, pp. 3–9, 2018
DOI: 10.24425/118075

Application of Teager Energy Operator on Linear and Mel Scales for Whispered Speech Recognition

Branko R. MARKOVIĆ
School of Electrical Engineering
Serbia

Jovan GALIĆ
School of Electrical Engineering
Serbia

Miomir MIJIĆ
School of Electrical Engineering
Serbia

This paper presents experimental results on whispered speech recognition using the Teager Energy Operator applied to linear and mel cepstral coefficients, combined with the Cepstral Mean Subtraction normalization technique. Four feature vectors are considered: Linear Frequency Cepstral Coefficients, Teager Energy based Linear Frequency Cepstral Coefficients, Mel Frequency Cepstral Coefficients, and Teager Energy based Mel Frequency Cepstral Coefficients. A speaker-dependent scenario is used. Recognition is performed with Dynamic Time Warping and Hidden Markov Models. The results show a respectable improvement in whispered speech recognition when the Teager Energy Operator is used together with Cepstral Mean Subtraction.
Keywords: Teager energy operator; cepstral mean subtraction; whispered speech recognition; linear scale; mel scale; dynamic time warping; hidden Markov models
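
For readers unfamiliar with the two operations named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the standard discrete Teager Energy Operator, psi[x(n)] = x(n)^2 - x(n-1) x(n+1), and the usual form of Cepstral Mean Subtraction, in which the mean of each cepstral coefficient over all frames of an utterance is subtracted from that coefficient. The function names and the list-of-frames layout are illustrative assumptions; where the operator is placed in the feature extraction chain is not stated in the abstract and is not shown here.

```python
import math


def teager_energy(x):
    """Discrete Teager Energy Operator (assumed standard form):
    psi[n] = x[n]^2 - x[n-1] * x[n+1], defined for interior samples only."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean taken over all frames.

    `frames` is a list of equal-length cepstral vectors, one per frame
    (hypothetical layout chosen for this sketch)."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Usage: for a pure tone A*cos(W*n) the operator output is constant, A^2 * sin(W)^2.
tone = [2.0 * math.cos(0.3 * n) for n in range(50)]
energy = teager_energy(tone)

# Usage: after normalization each coefficient has zero mean across frames.
normalized = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 6.0]])
```

The constant output for a pure tone illustrates why the operator is read as tracking both amplitude and frequency of a signal component, while the mean subtraction removes a fixed offset in the cepstral domain.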


Copyright © Polish Academy of Sciences & Institute of Fundamental Technological Research (IPPT PAN)