The Impact of Training Strategies on Overfitting in Vowel Classification Using PS-HFCC Parametrization for Automatic Speech Recognition
Abstract
This paper investigates the overfitting problem in the vowel classification task for automatic speech recognition (ASR). It employs pitch synchronized human factor cepstral coefficients (PS-HFCC) as the parametrization method, which outperforms traditional methods such as HFCC and mel-frequency cepstral coefficients (MFCC) in frame-level classification accuracy. While deep learning models dominate contemporary ASR systems, they often lack the explainability characteristic of classical classifiers. This study therefore examines the overfitting phenomenon using a range of classifiers with well-understood properties, analyzing the impact of different training strategies on classifier performance and comparing the susceptibility to overfitting of several widely used classifiers, including the Gaussian mixture model (GMM), a standard approach in speech recognition. The analysis of training strategies highlights the crucial role of the data splitting method, considering random, speaker-based, and cluster-based splits: although random splitting is commonly used, it can yield inflated accuracy due to overfitting. We demonstrate that speaker-independent splitting, in which the classifier is trained on one set of speakers and tested on a separate, unseen set, is essential for robust evaluation and for accurately assessing generalization to new speakers. The resulting insights may inform the future development and training of more reliable ASR systems.
Keywords:
automatic speech recognition, vowel classification, classifier training strategy, pitch synchronized human factor cepstral coefficient, overfitting, robust parametrization, speaker grouping
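To make the contrast between splitting strategies concrete, the sketch below illustrates random versus speaker-independent data splitting for a frame-level, one-GMM-per-class vowel classifier. This is a minimal illustration under assumed conditions, not the paper's implementation: the data, speaker IDs, number of classes, and model settings are synthetic placeholders, and scikit-learn's `train_test_split`, `GroupShuffleSplit`, and `GaussianMixture` stand in for whatever tooling the authors actually used.

```python
# Minimal sketch (illustrative, not the paper's code): random vs.
# speaker-independent splitting for frame-level vowel classification.
# X, y, and speakers are synthetic placeholders standing in for
# PS-HFCC frame vectors, vowel labels, and per-frame speaker IDs.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13))            # placeholder feature frames
y = rng.integers(0, 6, size=1000)          # placeholder vowel labels (6 classes)
speakers = rng.integers(0, 20, size=1000)  # placeholder speaker IDs

# Random split: frames of the same speaker can land in both the train
# and test sets, so speaker-dependent cues may inflate accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Speaker-independent split: every frame of a held-out speaker goes to
# the test set, measuring generalization to unseen speakers.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=speakers))

# Classical frame-level classifier: one GMM per vowel class, with the
# predicted class given by the highest log-likelihood.
classes = np.unique(y[train_idx])
models = [GaussianMixture(n_components=4, random_state=0)
          .fit(X[train_idx][y[train_idx] == c]) for c in classes]
log_lik = np.stack([m.score_samples(X[test_idx]) for m in models])
accuracy = np.mean(classes[log_lik.argmax(axis=0)] == y[test_idx])
```

With such a setup, the gap between the random-split and speaker-independent accuracies gives a rough, direct indication of how much of the measured performance is speaker-dependent overfitting rather than genuine vowel discrimination.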