Speech Emotion Recognition Using a Multi-Time-Scale Approach to Feature Aggregation and an Ensemble of SVM Classifiers
Abstract
Due to its relevant real-life applications, the recognition of emotions from speech signals constitutes a popular research topic. In traditional speech emotion recognition methods, audio features are typically aggregated over a fixed-duration time window, potentially discarding information conveyed by speech at other signal durations. By contrast, in the proposed method, audio features are aggregated simultaneously over time windows of different lengths (a multi-time-scale approach), potentially making better use of information carried at the phonemic, syllabic, and prosodic levels than the traditional approach does. A genetic algorithm is employed to optimize the feature extraction procedure. The features aggregated over the different time windows are subsequently classified by an ensemble of support vector machine (SVM) classifiers. To improve the generalization of the method, a data augmentation technique based on pitch shifting and time stretching is applied. According to the obtained results, the developed method outperforms the traditional one on the selected datasets, demonstrating the benefits of the multi-time-scale approach to feature aggregation.

Keywords: speech emotion recognition, feature aggregation, ensemble classification
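To make the pipeline described in the abstract concrete, the following is a minimal, illustrative sketch of multi-time-scale feature aggregation with an ensemble of SVM classifiers, written with librosa [25] and scikit-learn [32]. It is not the authors' implementation (see the software repository cited as [42]); the window lengths, the use of MFCC statistics as features, the augmentation parameters, and the majority-voting scheme are assumptions made purely for illustration, and the genetic-algorithm optimization of the feature extraction step is omitted.

import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed window lengths (seconds), loosely matching phonemic, syllabic,
# and prosodic time scales; the paper tunes the extraction procedure with
# a genetic algorithm, which this sketch omits.
WINDOW_LENGTHS_S = [0.1, 0.5, 2.0]

def aggregate_features(y, sr, win_s, hop=512):
    # Frame-level MFCCs, aggregated as mean and standard deviation over
    # consecutive windows of win_s seconds, then averaged into one
    # fixed-length vector per recording (an illustrative aggregation).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    frames = max(1, int(win_s * sr / hop))
    stats = [np.concatenate([block.mean(axis=1), block.std(axis=1)])
             for block in (mfcc[:, i:i + frames]
                           for i in range(0, mfcc.shape[1], frames))]
    return np.mean(stats, axis=0)

def augment(y, sr):
    # Pitch-shifting and time-stretching augmentation, as described in the
    # abstract; the shift and stretch amounts are illustrative assumptions.
    yield y
    yield librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    yield librosa.effects.time_stretch(y, rate=1.1)

def train_ensemble(recordings, labels, sr):
    # Expand the training set with augmented copies, then train one
    # RBF-kernel SVM per time scale on the same (augmented) data.
    ys, labs = [], []
    for y, lab in zip(recordings, labels):
        for y_aug in augment(y, sr):
            ys.append(y_aug)
            labs.append(lab)
    ensemble = []
    for win_s in WINDOW_LENGTHS_S:
        X = np.array([aggregate_features(y, sr, win_s) for y in ys])
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        clf.fit(X, labs)
        ensemble.append((win_s, clf))
    return ensemble

def predict(ensemble, y, sr):
    # Majority vote across the per-time-scale classifiers.
    votes = [clf.predict([aggregate_features(y, sr, w)])[0]
             for w, clf in ensemble]
    return max(set(votes), key=votes.count)

In this sketch each time scale trains its own classifier on the same recordings, so the ensemble can weigh evidence from short (phonemic-length) and long (prosodic-length) windows independently before the votes are combined.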
References
1. Abdel-Hamid L. (2020), Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features, Speech Communication, 122: 19–30, https://doi.org/10.1016/j.specom.2020.04.005
2. Basu S., Chakraborty J., Bag A., Aftabuddin M. (2017), A review on emotion recognition using speech, [in:] International Conference on Inventive Communication and Computational Technologies (ICICCT), https://doi.org/10.1109/ICICCT.2017.7975169
3. Bogdanov D. et al. (2013), ESSENTIA: An audio analysis library for music information retrieval, [in:] International Society for Music Information Retrieval Conference (ISMIR’13), pp. 493–498.
4. Cao H., Cooper D.G., Keutmann M.K., Gur R.C., Nenkova A., Verma R. (2014), CREMA-D: Crowdsourced emotional multimodal actors dataset, IEEE Transactions on Affective Computing, 5(4): 377–390, https://doi.org/10.1109/TAFFC.2014.2336244
5. Cao X., Jia M., Ru J., Pai T. (2022), Cross-corpus speech emotion recognition using subspace learning and domain adaption, EURASIP Journal on Audio, Speech, and Music Processing, 2022: 32, https://doi.org/10.1186/s13636-022-00264-5
6. Chatterjee R., Mazumdar S., Sherratt R.S., Halder R., Maitra T., Giri D. (2021), Real-time speech emotion analysis for smart home assistants, IEEE Transactions on Consumer Electronics, 67(1): 68–76, https://doi.org/10.1109/TCE.2021.3056421
7. Choi W.Y., Song K.Y., Lee C.W. (2018), Convolutional attention networks for multimodal emotion recognition from speech and text data, [in:] Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pp. 28–34, https://doi.org/10.18653/v1/W18-3304
8. Ekman P. (1992), An argument for basic emotions, Cognition and Emotion, 6(3–4): 169–200.
9. Eskimez S.E., Duan Z., Heinzelman W. (2018), Unsupervised learning approach to feature analysis for automatic speech emotion recognition, [in:] IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5099–5103, https://doi.org/10.1109/ICASSP.2018.8462685
10. García-Martín E., Rodrigues C.F., Riley G., Grahn H. (2019), Estimation of energy consumption in machine learning, Journal of Parallel and Distributed Computing, 134: 75–88, https://doi.org/10.1016/j.jpdc.2019.07.007
11. Ghaleb E., Popa M., Asteriadis S. (2019), Metric learning-based multimodal audio-visual emotion recognition, IEEE MultiMedia, 27(1): 37–48, https://doi.org/10.1109/MMUL.2019.2960219
12. Guizzo E., Weyde T., Levenson J.B. (2020), Multi-time-scale convolution for emotion recognition from speech audio signals, [in:] IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP40776.2020.9053727
13. Haq S., Jackson P.J.B. (2011), Multimodal emotion recognition, [in:] Machine Audition: Principles, Algorithms and Systems, Wang W. [Ed.], pp. 398–423, IGI Global Press, Hershey, https://doi.org/10.4018/978-1-61520-919-4
14. Jadhav S., He H., Jenkins K. (2018), Information gain directed genetic algorithm wrapper feature selection for credit rating, Applied Soft Computing, 69: 541–553, https://doi.org/10.1016/j.asoc.2018.04.033
15. Jain D.K., Shamsolmoali P., Sehdev P. (2019), Extended deep neural network for facial emotion recognition, Pattern Recognition Letters, 120: 69–74, https://doi.org/10.1016/j.patrec.2019.01.008
16. Kanwal S., Asghar S. (2021), Speech emotion recognition using clustering based GA-optimized feature set, IEEE Access, 9: 125830–125842, https://doi.org/10.1109/ACCESS.2021.3111659
17. Kaya H., Karpov A.A. (2018), Efficient and effective strategies for cross-corpus acoustic emotion recognition, Neurocomputing, 275: 1028–1034, https://doi.org/10.1016/j.neucom.2017.09.049
18. Khalil R.A., Jones E., Babar M.I., Jan T., Zafar M.H., Alhussain T. (2019), Speech emotion recognition using deep learning techniques: A review, IEEE Access, 7: 117327–117345, https://doi.org/10.1109/ACCESS.2019.2936124
19. Kreyszig E. (1979), Advanced Engineering Mathematics, 4th ed., Wiley.
20. Lin Y.-L., Wei G. (2005), Speech emotion recognition based on HMM and SVM, [in:] Fourth International Conference on Machine Learning and Cybernetics, https://doi.org/10.1109/ICMLC.2005.1527805
21. Liu Z.-T., Xie Q., Wu M., Cao W.-H., Mei Y., Mao J.-W. (2018), Speech emotion recognition based on an improved brain emotion learning model, Neurocomputing, 309: 145–156, https://doi.org/10.1016/j.neucom.2018.05.005
22. Livingstone S.R., Russo F.A. (2018), The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English, PLOS ONE, 13(5): e0196391, https://doi.org/10.1371/journal.pone.0196391
23. Majkowski A., Kołodziej M., Rak R.J., Korczynski R. (2016), Classification of emotions from speech signal, [in:] Signal Processing Algorithms, Architectures, Arrangements and Applications (SPA), https://doi.org/10.1109/SPA.2016.7763627
24. Martin O., Kotsia I., Macq B., Pitas I. (2006), The eNTERFACE’05 audio-visual emotion database, [in:] 22nd International Conference on Data Engineering Workshops (ICDEW’06), https://doi.org/10.1109/ICDEW.2006.145
25. McFee B. et al. (2015), librosa: Audio and music signal analysis in Python, [in:] 14th Python in Science Conference, pp. 18–25, https://doi.org/10.25080/Majora-7b98e3ed-003
26. Milner R., Jalal M.A., Ng R.W.M., Hain T. (2019), A cross-corpus study on speech emotion recognition, [in:] IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), https://doi.org/10.1109/ASRU46091.2019.9003838
27. Mitchell M. (1996), An Introduction to Genetic Algorithms, MIT Press, Cambridge.
28. Mohino-Herranz I., Gil-Pita R., Alonso-Diaz S., Rosa-Zurera M. (2014), MFCC based enlargement of the training set for emotion recognition in speech, Signal & Image Processing: An International Journal (SIPIJ), 5(1), https://doi.org/10.48550/arXiv.1403.4777
29. Monaco A., Amoroso N., Bellantuono L., Pantaleo E., Tangaro S., Bellotti R. (2020), Multi-time-scale features for accurate respiratory sound classification, Applied Sciences, 10(23): 8606, https://doi.org/10.3390/app10238606
30. Omman B., Eldho S.M. (2022), Speech emotion recognition using bagged support vector machines, [in:] International Conference on Computing, Communication, Security and Intelligent Systems (IC3SIS), https://doi.org/10.1109/IC3SIS54991.2022.9885578
31. Pandey S.K., Shekhawat H.S., Prasanna S.R.M. (2019), Deep learning techniques for speech emotion recognition: A review, [in:] 29th International Conference Radioelektronika (RADIOELEKTRONIKA), https://doi.org/10.1109/RADIOELEK.2019.8733432
32. Pedregosa F. et al. (2011), Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, 12(85): 2825–2830.
33. Picard R.W. (1995), Affective computing, MIT Media Laboratory Perceptual Computing Section Technical Report No. 321.
34. Pichora-Fuller M.K., Dupuis K. (2020), Toronto emotional speech set (TESS) (V1), University of Toronto Dataverse, https://doi.org/10.5683/SP2/E8H2MF
35. Sahoo S., Routray A. (2016), MFCC feature with optimized frequency range: An essential step for emotion recognition, [in:] 2016 International Conference on Systems in Medicine and Biology (ICSMB), https://doi.org/10.1109/ICSMB.2016.7915112
36. Sayed S., Nassef M., Badr A., Farag I. (2019), A Nested Genetic Algorithm for feature selection in high-dimensional cancer microarray datasets, Expert Systems with Applications, 121: 233–243, https://doi.org/10.1016/j.eswa.2018.12.022
37. Seknedy M.E., Fawzi S. (2021), Speech emotion recognition system for human interaction applications, [in:] Tenth International Conference on Intelligent Computing and Information Systems (ICICIS), https://doi.org/10.1109/ICICIS52592.2021.9694246
38. Seknedy M.E., Fawzi S. (2022), Speech emotion recognition system for Arabic speakers, [in:] 4th Novel Intelligent and Leading Emerging Sciences Conference (NILES), https://doi.org/10.1109/NILES56402.2022.9942431
39. Shahin I. (2020), Emotion recognition using speaker cues, [in:] Advances in Science and Engineering Technology International Conferences (ASET), pp. 1–5, https://doi.org/10.48550/arXiv.2002.03566
40. Sidorov M., Brester C., Minker W., Semenkin E. (2014), Speech-based emotion recognition: Feature selection by self-adaptive multi-criteria genetic algorithm, [in:] Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 3481–3485, http://www.lrec-conf.org/proceedings/lrec2014/pdf/341_Paper.pdf (access: 10.11.2023).
41. Slaney M. (1998), Auditory Toolbox: A MATLAB Toolbox for Auditory Modeling Work, version 2, Interval Research Corporation.
42. Stefanowska A., Zielinski S.K. (2023), Software repository for speech emotion recognition using a multi-time-scale approach to feature aggregation and an ensemble of SVM classifiers, GitHub, https://github.com/antoninastefanowska/MTS-SVM-EmotionRecognition (access: 27.10.2023).
43. Su B.-H., Lee C.-C. (2021), A conditional cycle emotion GAN for cross-corpus speech emotion recognition, [in:] IEEE Spoken Language Technology Workshop (SLT), https://doi.org/10.1109/SLT48900.2021.9383512
44. Tamulevičius G., Korvel G., Yayak A.B., Treigys P., Bernatavičienė J., Kostek B. (2020), A study of cross-linguistic speech emotion recognition based on 2D feature spaces, Electronics, 9(10): 1725, https://doi.org/10.3390/electronics9101725
45. Tang D., Kuppens P., Geurts L., van Waterschoot T. (2021), End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network, EURASIP Journal on Audio, Speech, and Music Processing, 2021: 18, https://doi.org/10.1186/s13636-021-00208-5
46. Tao H., Shan S., Hu Z., Zhu C., Ge H. (2023), Strong generalized speech emotion recognition based on effective data augmentation, Entropy, 25(1): 68, https://doi.org/10.3390/e25010068
47. Tzirakis P., Zhang J., Schuller B.W. (2018), End-to-end speech emotion recognition using deep neural networks, [in:] IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/ICASSP.2018.8462677
48. Wang Y., Huo H. (2019), Speech recognition based on genetic algorithm optimized support vector machine, [in:] 6th International Conference on Systems and Informatics (ICSAI), https://doi.org/10.1109/ICSAI48974.2019.9010502
49. Yang K. et al. (2023), Behavioral and physiological signals-based deep multimodal approach for mobile emotion recognition, IEEE Transactions on Affective Computing, 14(2): 1082–1097, https://doi.org/10.1109/TAFFC.2021.3100868
50. Yildirim S., Kaya Y., Kılıç F. (2021), A modified feature selection method based on metaheuristic algorithms for speech emotion recognition, Applied Acoustics, 173: 107721, https://doi.org/10.1016/j.apacoust.2020.107721
51. Young S. et al. (2006), The HTK Book, Cambridge University Engineering Department.
52. Zacharatos H., Gatzoulis C., Charalambous P., Chrysanthou Y. (2021), Emotion recognition from 3D motion capture data using deep CNNs, [in:] IEEE Conference on Games (CoG), https://doi.org/10.1109/CoG52621.2021.9619065
53. Zhang S., Chen A., Guo W., Cui Y., Zhao X., Liu L. (2020), Learning deep binaural representations with deep convolutional neural networks for spontaneous speech emotion recognition, IEEE Access, 8: 23496–23505, https://doi.org/10.1109/ACCESS.2020.2969032
54. Zhao J., Mao X., Chen L. (2018), Learning deep features to recognise speech emotion using merged deep CNN, IET Signal Process, 12(6): 713–721, https://doi.org/10.1049/iet-spr.2017.0320
55. Zhao J., Mao X., Chen L. (2019), Speech emotion recognition using deep 1D & 2D CNN LSTM networks, Biomedical Signal Processing and Control, 47: 312–323, https://doi.org/10.1016/j.bspc.2018.08.035