Speech Emotion Recognition Using a Multi-Time-Scale Approach to Feature Aggregation and an Ensemble of SVM Classifiers

Antonina STEFANOWSKA; Sławomir Krzysztof ZIELIŃSKI

doi:10.24425/aoa.2024.148784

Authors

Antonina STEFANOWSKA, Faculty of Computer Science, Białystok University of Technology, Poland
Sławomir Krzysztof ZIELIŃSKI, Faculty of Computer Science, Białystok University of Technology, Poland, ORCID: 0000-0002-3205-974X

Abstract

Because of its practical applications, the recognition of emotions from speech signals is a popular research topic. Traditional speech emotion recognition methods typically aggregate audio features over a fixed-duration time window, potentially discarding information that speech conveys at other time scales. By contrast, the proposed method aggregates audio features simultaneously over time windows of different lengths (a multi-time-scale approach), and hence may make better use of the information carried at the phonemic, syllabic, and prosodic levels. A genetic algorithm is employed to optimize the feature extraction procedure. The features aggregated over the different time windows are subsequently classified by an ensemble of support vector machine (SVM) classifiers. To improve the generalization of the method, data augmentation based on pitch shifting and time stretching is applied. According to the obtained results, the developed method outperforms the traditional one on the selected datasets, demonstrating the benefits of a multi-time-scale approach to feature aggregation.
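The idea of aggregating features over several window lengths at once can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `aggregate_multi_scale`, the window lengths, the use of non-overlapping windows, and the choice of mean and standard deviation as aggregation statistics are all assumptions made for illustration, and random numbers stand in for real frame-level audio features.

```python
import numpy as np

def aggregate_multi_scale(frames, window_lengths):
    """Aggregate frame-level features over windows of several lengths.

    frames: array of shape (n_frames, n_features).
    window_lengths: window sizes in frames (hypothetical values).
    Returns a dict mapping each window length to an array of shape
    (n_windows, 2 * n_features) with per-window means and standard deviations.
    """
    frames = np.asarray(frames, dtype=float)
    aggregated = {}
    for length in window_lengths:
        n_windows = frames.shape[0] // length  # incomplete tail is dropped
        if n_windows == 0:
            continue  # signal is shorter than this window
        trimmed = frames[: n_windows * length]
        # Split into non-overlapping windows (an assumption of this sketch).
        windows = trimmed.reshape(n_windows, length, frames.shape[1])
        stats = np.concatenate(
            [windows.mean(axis=1), windows.std(axis=1)], axis=1
        )
        aggregated[length] = stats
    return aggregated

# Random data standing in for 200 frames of 13 frame-level features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))
scales = aggregate_multi_scale(frames, window_lengths=(10, 50, 200))
for length, stats in scales.items():
    print(length, stats.shape)
```

In a setup like the one the abstract describes, each per-scale feature matrix could be passed to its own classifier and the outputs combined; how the ensemble actually combines them is not stated here, so that step is left out of the sketch.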

Keywords

speech emotion recognition, feature aggregation, ensemble classification
