Archives of Acoustics, 46, 2, pp. 271–277, 2021

Audio Feature Space Analysis for Emotion Recognition from Spoken Sentences

West Pomeranian University of Technology

Tomasz MAKA

An analysis of the low-level feature space for emotion recognition from speech is presented. The main goal was to determine how the statistical properties computed from contours of low-level features influence emotion recognition from speech signals. We conducted several experiments to reduce and tune our initial feature set and to configure the classification stage. To analyse the audio feature space, we employed univariate feature selection using the chi-squared test. Then, in the first stage of classification, a default set of parameters was selected for every classifier. For the classifier that obtained the best results with the default settings, hyperparameter tuning using cross-validation was applied. In the final step, we compared the classification results for two different languages to find out the difference between emotional states expressed in spoken sentences. The results show that the feature selection algorithm reduced the initial feature set of 3198 attributes by about 80%. The most dominant attributes selected at this stage were based on mel and Bark frequency-scale filterbanks, with their variability described mainly by variance, median absolute deviation, and standard and average deviations. Finally, the classification accuracy of the tuned SVM classifier was 72.5% and 88.27% for emotional spoken sentences in Polish and German, respectively.
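The univariate chi-squared selection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: it assumes non-negative feature values and scores each feature by comparing its per-class sums with the sums expected if the feature were independent of the class, which is one common formulation of the chi-squared score. The function names, the toy data, and the value of `k` are hypothetical and chosen only for illustration. In the study, a reduction of about 80% from 3198 attributes would correspond to keeping roughly 640 of them, although the exact count is not stated in the abstract.

```python
from collections import defaultdict


def chi2_scores(X, y):
    """Chi-squared score of each non-negative feature against the class labels."""
    n_samples = len(X)
    n_features = len(X[0])
    # Observed statistic: sum of each feature within each class.
    observed = defaultdict(lambda: [0.0] * n_features)
    class_count = defaultdict(int)
    for row, label in zip(X, y):
        class_count[label] += 1
        for j, value in enumerate(row):
            observed[label][j] += value
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for label, count in class_count.items():
        prior = count / n_samples  # class probability
        for j in range(n_features):
            # Expected per-class sum if the feature were independent of the class.
            expected = prior * feature_total[j]
            if expected > 0:
                scores[j] += (observed[label][j] - expected) ** 2 / expected
    return scores


def select_k_best(X, y, k):
    """Return the sorted indices of the k features with the highest scores."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy data: 4 utterances, 3 non-negative features, 2 emotion labels.
X_toy = [[5, 1, 0], [6, 1, 0], [0, 1, 4], [1, 1, 5]]
y_toy = ["anger", "anger", "sadness", "sadness"]
selected = select_k_best(X_toy, y_toy, k=2)
print(selected)
```

In this toy example the second feature is constant across both classes, so its score is zero and it is the one discarded. The retained feature indices would then be passed on to the classification stage.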
Keywords: speech analysis; classification; emotional speech


DOI: 10.24425/aoa.2021.136581

Copyright © Polish Academy of Sciences & Institute of Fundamental Technological Research (IPPT PAN)