Archives of Acoustics, 50, 1, pp. 59-67, 2025
10.24425/aoa.2025.153652

The Influence of the Amplitude Spectrum Correction in the HFCC Parametrization on the Quality of Speech Signal Frame Classification

Stanislaw GMYREK, Robert HOSSA, Ryszard MAKOWSKI
Wrocław University of Science and Technology, Poland

The voiced parts of the speech signal are shaped by the glottal pulse excitation, the vocal tract, and radiation at the speaker's lips. The semantic information carried by speech is shaped mainly by the vocal tract. In HFCC (human factor cepstral coefficients) parametrization, however, the quasiperiodicity of the glottal excitation introduces ripples into the amplitude spectrum and is therefore one of the factors responsible for the large scatter of feature vector values. This paper proposes a method for reducing the effect of excitation quasiperiodicity on the feature vector. To this end, blind deconvolution was used to determine an estimator of the vocal tract transfer function and a corrective function for the amplitude spectrum. Statistical models of individual Polish phonemes were then built from the resulting HFCC parameters as Gaussian mixture models (GMMs), and the influence of the correction on the classification quality of speech frames containing Polish vowels was investigated. The aim of the correction was to narrow the GMM distributions, which, according to detection theory, reduces classification errors. The results obtained confirm the effectiveness of the proposed method.
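The idea of separating the vocal tract envelope from excitation ripples can be illustrated with a generic cepstral-smoothing sketch. This is not the authors' blind-deconvolution procedure; it is a minimal stand-in showing how a smooth envelope estimate yields a multiplicative corrective function for the amplitude spectrum (the frame length, FFT size, and lifter length below are illustrative assumptions):

```python
import numpy as np

def cepstral_envelope(frame, n_fft=512, lifter_len=30):
    """Estimate a smooth spectral envelope by low-time liftering of the
    real cepstrum; liftering suppresses the fast ripples introduced by
    the quasiperiodic glottal excitation."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) + 1e-12
    cepstrum = np.fft.irfft(np.log(spectrum), n_fft)
    lifter = np.zeros(n_fft)
    lifter[:lifter_len] = 1.0           # keep low quefrencies (envelope)
    lifter[-lifter_len + 1:] = 1.0      # mirror half of the real cepstrum
    envelope = np.exp(np.fft.rfft(cepstrum * lifter, n_fft).real)
    return spectrum, envelope

# Corrective function: ratio of the smooth envelope to the raw spectrum,
# applied multiplicatively before the filter-bank stage of parametrization.
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)        # placeholder 25 ms frame at 16 kHz
spectrum, envelope = cepstral_envelope(frame)
correction = envelope / spectrum
corrected = spectrum * correction       # the ripple-reduced spectrum
```

The corrected spectrum equals the smooth envelope by construction here; in the paper's setting the corrective function is derived from a vocal tract transfer function estimator instead.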
Keywords: automatic speech recognition; robust parametrization; amplitude spectrum correction; inverse filtering; GMM model; distance between GMM distributions
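The GMM-based frame classification described above can be sketched as follows. The toy two-class data and model sizes are assumptions for illustration only; the study fits one GMM per Polish phoneme (via the EM algorithm) on HFCC feature vectors and assigns each frame to the phoneme with the highest likelihood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy stand-in for HFCC feature vectors of two phoneme classes.
features = {
    "a": rng.normal(0.0, 1.0, size=(500, 13)),
    "e": rng.normal(3.0, 1.0, size=(500, 13)),
}

# One GMM per phoneme, trained with the EM algorithm.
models = {ph: GaussianMixture(n_components=4, random_state=0).fit(x)
          for ph, x in features.items()}

def classify(frame_vec, models):
    """Assign a frame to the phoneme whose GMM gives the highest
    log-likelihood (maximum-likelihood classification)."""
    scores = {ph: m.score_samples(frame_vec[None, :])[0]
              for ph, m in models.items()}
    return max(scores, key=scores.get)
```

Narrowing the per-phoneme distributions (the goal of the spectrum correction) increases the separation between the competing likelihoods in `classify`, which is what reduces classification errors.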
Copyright © 2025 The Author(s). This work is licensed under the Creative Commons Attribution 4.0 International CC BY 4.0.




