DOI: 10.24425/aoa.2025.153652
The Influence of the Amplitude Spectrum Correction in the HFCC Parametrization on the Quality of Speech Signal Frame Classification
References
Alku P. (1991), Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering, [in:] Proceedings 2nd European Conference on Speech Communication and Technology (Eurospeech 1991), pp. 1081–1084, https://doi.org/10.21437/Eurospeech.1991-257.
Alku P. (1992), Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering, Speech Communication, 11(2–3): 109–118, https://doi.org/10.1016/0167-6393(92)90005-R.
Bozkurt B., Doval B., D’Alessandro C., Dutoit T. (2005), Zeros of Z-transform representation with application to source-filter separation in speech, IEEE Signal Processing Letters, 12(4): 344–347, https://doi.org/10.1109/LSP.2005.843770.
de Cheveigné A., Kawahara H. (2002), YIN, a fundamental frequency estimator for speech and music, The Journal of the Acoustical Society of America, 111(4): 1917–1930, https://doi.org/10.1121/1.1458024.
Davis S., Mermelstein P. (1980), Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4): 357–366, https://doi.org/10.1109/TASSP.1980.1163420.
Dempster A.P., Laird N.M., Rubin D.B. (1977), Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. Series B (Methodological), 39(1): 1–38.
Dharanipragada S., Rao B.D. (2001), MCDR based feature extraction for robust speech recognition, [in:] IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 309–312, https://doi.org/10.1109/ICASSP.2001.940829.
Drugman T., Bozkurt B., Dutoit T. (2009), Complex cepstrum-based decomposition of speech for glottal source estimation, [in:] Proceedings of the Annual Conference of the International Speech Communication Association, InterSpeech, https://doi.org/10.21437/Interspeech.2009-27.
Drugman T., Bozkurt B., Dutoit T. (2011), A comparative study of glottal source estimation techniques, arXiv, https://doi.org/10.48550/arXiv.2001.00840.
Goldberger J., Aronowitz H. (2005), A distance measure between GMMs based on the unscented transform and its application to speaker recognition, [in:] 9th European Conference on Speech Communication and Technology, InterSpeech, pp. 1985–1988, https://doi.org/10.21437/Interspeech.2005-624.
Hermansky H. (1990), Perceptual linear predictive (PLP) analysis of speech, The Journal of the Acoustical Society of America, 87(4): 1738–1752, https://doi.org/10.1121/1.399423.
Hermansky H., Fousek P. (2005), Multi-resolution RASTA filtering for TANDEM-based ASR, [in:] Proceedings of the Annual Conference of the International Speech Communication Association, InterSpeech, pp. 361–364, https://doi.org/10.21437/Interspeech.2005-184.
Hossa R., Makowski R. (2016), An effective speaker clustering method using UBM and ultra-short training utterances, Archives of Acoustics, 41(1): 107–118, https://doi.org/10.1515/aoa-2016-0011.
Julier S.J., Uhlmann J.K. (2004), Unscented filtering and nonlinear estimation, Proceedings of the IEEE, 92(3): 401–422, https://doi.org/10.1109/JPROC.2003.823141.
Koehler J., Morgan N., Hermansky H., Hirsch H.G., Tong G. (1994), Integrating RASTA-PLP into speech recognition, [in:] Proceedings of ICASSP ’94. IEEE International Conference on Acoustics, Speech and Signal Processing, https://doi.org/10.1109/ICASSP.1994.389266.
Kuan T.-W., Tsai A.-C., Sung P.-H., Wang J.-F., Kuo H.-S. (2016), A robust BFCC feature extraction for ASR system, Artificial Intelligence Research, 5(2), https://doi.org/10.5430/air.v5n2p14.
Kullback S. (1968), Information Theory and Statistics, Dover Publications, New York.
Makowski R. (2011), Automatic Speech Recognition – Selected Problems [in Polish: Automatyczne Rozpoznawanie Mowy – Wybrane Zagadnienia], Oficyna Wydawnicza Politechniki Wrocławskiej.
Moritz N., Anemüller J., Kollmeier B. (2015), An auditory inspired amplitude modulation filter bank for robust feature extraction in automatic speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(11): 1926–1937, https://doi.org/10.1109/TASLP.2015.2456420.
Mrówka P., Makowski R. (2008), Normalization of speaker individual characteristics and compensation of linear transmission distortions in command recognition systems, Archives of Acoustics, 33(2): 221–242.
Murthi M.N., Rao B.D. (2000), All-pole modeling of speech based on the minimum variance distortionless response spectrum, IEEE Transactions on Speech and Audio Processing, 8(3): 221–239, https://doi.org/10.1109/89.841206.
Plumpe M.D., Quatieri T.F., Reynolds D.A. (1999), Modeling of the glottal flow derivative waveform with application to speaker identification, IEEE Transactions on Speech and Audio Processing, 7(5): 569–586, https://doi.org/10.1109/89.784109.
Prasad N.V., Umesh S. (2013), Improved cepstral mean and variance normalization using Bayesian framework, [in:] 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 156–161, https://doi.org/10.1109/ASRU.2013.6707722.
Quatieri T.F. (2002), Discrete-Time Speech Signal Processing: Principles and Practice, Pearson Education.
Qureshi T.M., Syed K.S. (2011), A new approach to parametric modeling of glottal flow, Archives of Acoustics, 36(4): 695–712, https://doi.org/10.2478/v10168-011-0047-3.
Rabiner L., Juang B.-H. (1993), Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs.
Raitio T. et al. (2011), HMM-based speech synthesis utilizing glottal inverse filtering, IEEE Transactions on Audio, Speech and Language Processing, 19(1): 153–165, https://doi.org/10.1109/TASL.2010.2045239.
Sharma G., Umapathy K., Krishnan S. (2020), Trends in audio signal feature extraction methods, Applied Acoustics, 158: 107020, https://doi.org/10.1016/j.apacoust.2019.107020.
Skowronski M., Harris J.G. (2003), Improving the filter bank of a classic speech feature extraction algorithm, [in:] Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS '03), pp. 281–284, https://doi.org/10.1109/ISCAS.2003.1205828.
Waaramaa T., Laukkanen A.M., Airas M., Alku P. (2010), Perception of emotional valences and activity levels from vowel segments of continuous speech, Journal of Voice, 24(1): 8–30, https://doi.org/10.1016/j.jvoice.2008.04.004.
Walker J., Murphy P. (2005), A review of glottal waveform analysis, [in:] Progress in Nonlinear Speech Processing, Workshop on Nonlinear Speech Processing, Lecture Notes in Computer Science.
Wong D., Markel J., Gray A. (1979), Least squares glottal inverse filtering from the acoustic speech waveform, IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(4): 350–355, https://doi.org/10.1109/TASSP.1979.1163260.
Yin H., Hohmann V., Nadeu C. (2011), Acoustic features for speech recognition based on Gammatone filterbank and instantaneous frequency, Speech Communication, 53(5): 707–715, https://doi.org/10.1016/j.specom.2010.04.008.
Zambrzycka A. (2021), Adaptation in automatic speech recognition systems [in Polish: Adaptacja w systemach automatycznego rozpoznawania mowy], Ph.D. Thesis, Wrocław University of Science and Technology.