Speech Emotion Recognition Using a Multi-Time-Scale Approach to Feature Aggregation and an Ensemble of SVM Classifiers

Authors

  • Antonina STEFANOWSKA Faculty of Computer Science, Białystok University of Technology, Poland
  • Sławomir Krzysztof ZIELIŃSKI Faculty of Computer Science, Białystok University of Technology, Poland ORCID ID 0000-0002-3205-974X

Abstract

Because of its practical applications, the recognition of emotions from speech signals is a popular research topic. Traditional speech emotion recognition methods typically aggregate audio features using a fixed-duration time window, potentially discarding information conveyed by speech at other time scales. By contrast, the proposed method aggregates audio features simultaneously using time windows of different lengths (a multi-time-scale approach), potentially making better use of information carried at the phonemic, syllabic, and prosodic levels. A genetic algorithm is employed to optimize the feature extraction procedure. The features aggregated over the different time windows are subsequently classified by an ensemble of support vector machine (SVM) classifiers. To improve the generalization of the method, data augmentation based on pitch shifting and time stretching is applied. According to the obtained results, the developed method outperforms the traditional one on the selected datasets, demonstrating the benefits of a multi-time-scale approach to feature aggregation.
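The idea of aggregating frame-level features over several window lengths at once can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of mean and standard deviation as aggregation statistics, the non-overlapping windows, and the toy window lengths (2, 4, and 8 frames) are all assumptions made for illustration. In a setup like the one described in the abstract, each time scale would presumably feed its own SVM classifier.

```python
from statistics import mean, pstdev

def aggregate(frames, window):
    """Aggregate frame-level feature vectors over non-overlapping windows.

    Returns one vector per window: the per-feature means followed by the
    per-feature (population) standard deviations. The statistics used here
    are an assumption, not taken from the paper.
    """
    aggregated = []
    for start in range(0, len(frames) - window + 1, window):
        chunk = frames[start:start + window]
        columns = list(zip(*chunk))  # one tuple of values per feature
        aggregated.append(
            [mean(c) for c in columns] + [pstdev(c) for c in columns]
        )
    return aggregated

def multi_time_scale(frames, windows):
    """Aggregate the same frames at several window lengths simultaneously."""
    return {w: aggregate(frames, w) for w in windows}

# Toy input: 8 frames with 2 hypothetical features each.
frames = [[float(i), float(i % 2)] for i in range(8)]
scales = multi_time_scale(frames, windows=(2, 4, 8))

for window, vectors in scales.items():
    print(window, len(vectors), vectors[0])
```

Shorter windows yield more vectors that capture local detail, while the longest window summarizes the whole excerpt in a single vector.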

Keywords

speech emotion recognition, feature aggregation, ensemble classification
