DOI: 10.24425/aoa.2025.153652
The Influence of the Amplitude Spectrum Correction in the HFCC Parametrization on the Quality of Speech Signal Frame Classification
References
Alku P. (1991), Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering, [in:] Proceedings 2nd European Conference on Speech Communication and Technology (Eurospeech 1991), pp. 1081–1084, https://doi.org/10.21437/Eurospeech.1991-257.
Alku P. (1992), Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering, Speech Communication, 11(2–3): 109–118, https://doi.org/10.1016/0167-6393(92)90005-R.
Bozkurt B., Doval B., D’Alessandro C., Dutoit T. (2005), Zeros of Z-transform representation with application to source-filter separation in speech, IEEE Signal Processing Letters, 12(4): 344–347, https://doi.org/10.1109/LSP.2005.843770.
de Cheveigné A., Kawahara H. (2002), YIN, a fundamental frequency estimator for speech and music, The Journal of the Acoustical Society of America, 111(4): 1917–1930, https://doi.org/10.1121/1.1458024.
Davis S., Mermelstein P. (1980), Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4): 357–366, https://doi.org/10.1109/TASSP.1980.1163420.
Dempster A.P., Laird N.M., Rubin D.B. (1977), Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. Series B (Methodological), 39(1): 1–38.
Dharanipragada S., Rao B.D. (2001), MCDR based feature extraction for robust speech recognition, [in:] IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 309–312, https://doi.org/10.1109/ICASSP.2001.940829.
Drugman T., Bozkurt B., Dutoit T. (2009), Complex cepstrum-based decomposition of speech for glottal source estimation, [in:] Proceedings of the Annual Conference of the International Speech Communication Association, InterSpeech, https://doi.org/10.21437/Interspeech.2009-27.
Drugman T., Bozkurt B., Dutoit T. (2011), A comparative study of glottal source estimation techniques, arXiv, https://doi.org/10.48550/arXiv.2001.00840.
Goldberger J., Aronowitz H. (2005), A distance measure between GMMs based on the unscented transform and its application to speaker recognition, [in:] 9th European Conference on Speech Communication and Technology, InterSpeech, pp. 1985–1988, https://doi.org/10.21437/Interspeech.2005-624.
Hermansky H. (1990), Perceptual linear predictive (PLP) analysis of speech, The Journal of the Acoustical Society of America, 87(4): 1738–1752, https://doi.org/10.1121/1.399423.
Hermansky H., Fousek P. (2005), Multi-resolution RASTA filtering for TANDEM-based ASR, [in:] Proceedings of the Annual Conference of the International Speech Communication Association, InterSpeech, pp. 361–364, https://doi.org/10.21437/Interspeech.2005-184.
Hossa R., Makowski R. (2016), An effective speaker clustering method using UBM and ultra-short training utterances, Archives of Acoustics, 41(1): 107–118, https://doi.org/10.1515/aoa-2016-0011.
Julier S.J., Uhlmann J.K. (2004), Unscented filtering and nonlinear estimation, Proceedings of the IEEE, 92(3): 401–422, https://doi.org/10.1109/JPROC.2003.823141.
Koehler J., Morgan N., Hermansky H., Hirsch H.G., Tong G. (1994), Integrating RASTA-PLP into speech recognition, [in:] Proceedings of ICASSP ’94. IEEE International Conference on Acoustics, Speech and Signal Processing, https://doi.org/10.1109/ICASSP.1994.389266.
Kuan T.-W., Tsai A.-C., Sung P.-H., Wang J.-F., Kuo H.-S. (2016), A robust BFCC feature extraction for ASR system, Artificial Intelligence Research, 5(2), https://doi.org/10.5430/air.v5n2p14.
Kullback S. (1968), Information Theory and Statistics, Dover Publications, New York.
Makowski R. (2011), Automatic Speech Recognition – Selected Problems [in Polish: Automatyczne Rozpoznawanie Mowy – Wybrane Zagadnienia], Oficyna Wydawnicza Politechniki Wrocławskiej.
Moritz N., Anemüller J., Kollmeier B. (2015), An auditory inspired amplitude modulation filter bank for robust feature extraction in automatic speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(11): 1926–1937, https://doi.org/10.1109/TASLP.2015.2456420.
Mrówka P., Makowski R. (2008), Normalization of speaker individual characteristics and compensation of linear transmission distortions in command recognition systems, Archives of Acoustics, 33(2): 221–242.
Murthi M.N., Rao B.D. (2000), All-pole modeling of speech based on the minimum variance distortionless response spectrum, IEEE Transactions on Speech and Audio Processing, 8(3): 221–239, https://doi.org/10.1109/89.841206.
Plumpe M.D., Quatieri T.F., Reynolds D.A. (1999), Modeling of the glottal flow derivative waveform with application to speaker identification, IEEE Transactions on Speech and Audio Processing, 7(5): 569–586, https://doi.org/10.1109/89.784109.
Prasad N.V., Umesh S. (2013), Improved cepstral mean and variance normalization using Bayesian framework, [in:] 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 156–161, https://doi.org/10.1109/ASRU.2013.6707722.
Quatieri T.F. (2002), Discrete-Time Speech Signal Processing: Principles and Practice, Pearson Education.
Qureshi T.M., Syed K.S. (2011), A new approach to parametric modeling of glottal flow, Archives of Acoustics, 36(4): 695–712, https://doi.org/10.2478/v10168-011-0047-3.
Rabiner L., Juang B.-H. (1993), Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs.
Raitio T. et al. (2011), HMM-based speech synthesis utilizing glottal inverse filtering, IEEE Transactions on Audio, Speech and Language Processing, 19(1): 153–165, https://doi.org/10.1109/TASL.2010.2045239.
Sharma G., Umapathy K., Krishnan S. (2020), Trends in audio signal feature extraction methods, Applied Acoustics, 158: 107020, https://doi.org/10.1016/j.apacoust.2019.107020.
Skowronski M., Harris J.G. (2003), Improving the filter bank of a classic speech feature extraction algorithm, [in:] Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS '03), pp. 281–284, https://doi.org/10.1109/ISCAS.2003.1205828.
Waaramaa T., Laukkanen A.M., Airas M., Alku P. (2010), Perception of emotional valences and activity levels from vowel segments of continuous speech, Journal of Voice, 24(1): 8–30, https://doi.org/10.1016/j.jvoice.2008.04.004.
Walker J., Murphy P. (2005), A review of glottal waveform analysis, [in:] Progress in Nonlinear Speech Processing, Workshop on Nonlinear Speech Processing, Lecture Notes in Computer Science.
Wong D., Markel J., Gray A. (1979), Least squares glottal inverse filtering from the acoustic speech waveform, IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(4): 350–355, https://doi.org/10.1109/TASSP.1979.1163260.
Yin H., Hohmann V., Nadeu C. (2011), Acoustic features for speech recognition based on Gammatone filterbank and instantaneous frequency, Speech Communication, 53(5): 707–715, https://doi.org/10.1016/j.specom.2010.04.008.
Zambrzycka A. (2021), Adaptation in automatic speech recognition systems [in Polish: Adaptacja w systemach automatycznego rozpoznawania mowy], Ph.D. Thesis, Wrocław University of Science and Technology.