The Impact of Training Strategies on Overfitting in Vowel Classification Using PS-HFCC Parametrization for Automatic Speech Recognition
Abstract
This paper investigates the overfitting problem in the vowel classification task for automatic speech recognition (ASR). It employs pitch synchronized human factor cepstral coefficients (PS-HFCC) as the parametrization method, which outperforms traditional methods such as HFCC and mel-frequency cepstral coefficients (MFCC) in frame-level classification accuracy. While deep learning models dominate contemporary ASR systems, they often lack the explainability characteristic of classical classifiers. This study therefore examines the overfitting phenomenon using a range of classifiers with well-understood properties, analyzing the impact of different training strategies on classifier performance and comparing the susceptibility to overfitting of several widely used classifiers, including the Gaussian mixture model (GMM), a standard approach in speech recognition. The analysis of training strategies highlights the crucial role of the data splitting method, considering random, speaker-based, and cluster-based splits: although random splitting is commonly used, it can yield inflated accuracy due to overfitting. We demonstrate that speaker-independent splitting, in which the classifier is trained on one set of speakers and tested on a separate, unseen set, is essential for robust evaluation and for accurately assessing generalization to new speakers. The resulting insights may inform the future development and training of more reliable ASR systems.
Keywords:
automatic speech recognition, vowel classification, classifier training strategy, pitch synchronized human factor cepstral coefficient, overfitting, robust parametrization, speaker grouping
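To make the contrast between splitting strategies concrete, the sketch below illustrates random versus speaker-independent data splitting for a frame-level, one-GMM-per-class vowel classifier. This is a minimal illustration under assumed conditions, not the paper's implementation: the data, speaker IDs, number of classes, and model settings are synthetic placeholders, and scikit-learn's `train_test_split`, `GroupShuffleSplit`, and `GaussianMixture` stand in for whatever tooling the authors actually used.

```python
# Minimal sketch (illustrative, not the paper's code): random vs.
# speaker-independent splitting for frame-level vowel classification.
# X, y, and speakers are synthetic placeholders standing in for
# PS-HFCC frame vectors, vowel labels, and per-frame speaker IDs.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13))            # placeholder feature frames
y = rng.integers(0, 6, size=1000)          # placeholder vowel labels (6 classes)
speakers = rng.integers(0, 20, size=1000)  # placeholder speaker IDs

# Random split: frames of the same speaker can land in both the train
# and test sets, so speaker-dependent cues may inflate accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Speaker-independent split: every frame of a held-out speaker goes to
# the test set, measuring generalization to unseen speakers.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=speakers))

# Classical frame-level classifier: one GMM per vowel class, with the
# predicted class given by the highest log-likelihood.
classes = np.unique(y[train_idx])
models = [GaussianMixture(n_components=4, random_state=0)
          .fit(X[train_idx][y[train_idx] == c]) for c in classes]
log_lik = np.stack([m.score_samples(X[test_idx]) for m in models])
accuracy = np.mean(classes[log_lik.argmax(axis=0)] == y[test_idx])
```

With such a setup, the gap between the random-split and speaker-independent accuracies gives a rough, direct indication of how much of the measured performance is speaker-dependent overfitting rather than genuine vowel discrimination.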