Archives of Acoustics, 49, 2, pp. 153–168, 2024

Speech Emotion Recognition Using a Multi-Time-Scale Approach to Feature Aggregation and an Ensemble of SVM Classifiers

Faculty of Computer Science, Białystok University of Technology

Sławomir Krzysztof ZIELIŃSKI
ORCID ID 0000-0002-3205-974X
Faculty of Computer Science, Białystok University of Technology

Owing to its practical real-life applications, the recognition of emotions from speech signals is a popular research topic. In traditional speech emotion recognition methods, audio features are typically aggregated using a fixed-duration time window, which may discard information conveyed by speech at other time scales. By contrast, in the proposed method, audio features are aggregated simultaneously using time windows of different lengths (a multi-time-scale approach), potentially making better use of the information carried at the phonemic, syllabic, and prosodic levels than the traditional approach. A genetic algorithm is employed to optimize the feature extraction procedure. The features aggregated over the different time windows are subsequently classified by an ensemble of support vector machine (SVM) classifiers. To improve the generalization of the method, a data augmentation technique based on pitch shifting and time stretching is applied. According to the obtained results, the developed method outperforms the traditional one for the selected datasets, demonstrating the benefits of a multi-time-scale approach to feature aggregation.
Keywords: speech emotion recognition; feature aggregation; ensemble classification
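
The multi-time-scale aggregation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (aggregate, multi_scale, majority_vote), the window lengths, the choice of mean and standard deviation as aggregation statistics, and the majority-vote rule for combining per-scale decisions are all assumptions made for illustration. The SVM classifiers, the genetic algorithm, and the data augmentation step are omitted.

```python
from collections import Counter
from statistics import fmean, pstdev


def aggregate(frames, window, hop=None):
    """Aggregate frame-level feature vectors over windows of `window` frames.

    frames: list of equal-length lists, one feature vector per frame.
    Returns one vector per window: per-feature means followed by
    per-feature (population) standard deviations. Assumed statistics.
    """
    hop = hop or window  # non-overlapping windows unless a hop is given
    aggregated = []
    for start in range(0, len(frames) - window + 1, hop):
        chunk = frames[start:start + window]
        columns = list(zip(*chunk))  # one tuple of values per feature
        means = [fmean(c) for c in columns]
        stds = [pstdev(c) for c in columns]
        aggregated.append(means + stds)
    return aggregated


def multi_scale(frames, windows=(2, 4, 8)):
    """Aggregate the same frames at several window lengths (hypothetical values)."""
    return {w: aggregate(frames, w) for w in windows}


def majority_vote(labels):
    """Combine per-scale class decisions; an assumed ensemble rule."""
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Four frames with two toy features each.
    frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]
    for window, vectors in multi_scale(frames, windows=(2, 4)).items():
        print(window, vectors)
    print(majority_vote(["angry", "happy", "angry"]))
```

In a complete pipeline, each window length would feed its own classifier, and the per-scale decisions would then be combined; the vote above merely stands in for that final step.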
Copyright © 2023 The Author(s). This work is licensed under the Creative Commons Attribution 4.0 International CC BY 4.0.


DOI: 10.24425/aoa.2024.148784