Audio Feature Space Analysis for Emotion Recognition from Spoken Sentences

Lukasz SMIETANKA; Tomasz MAKA

doi:10.24425/aoa.2021.136581

Authors

Lukasz SMIETANKA West Pomeranian University of Technology, Poland
Tomasz MAKA West Pomeranian University of Technology, Poland

Abstract

An analysis of the low-level feature space for emotion recognition from speech is presented. The main goal was to determine how statistical properties computed from the contours of low-level features influence emotion recognition from speech signals. We conducted several experiments to reduce and tune our initial feature set and to configure the classification stage. To analyse the audio feature space, we employed univariate feature selection using the chi-squared test. Then, in the first stage of classification, a default set of parameters was selected for every classifier. For the classifier that obtained the best results with the default settings, the hyperparameters were tuned using cross-validation. Finally, we compared the classification results for two different languages to find out the difference between emotional states expressed in spoken sentences. The results show that feature selection reduced the initial feature set of 3198 attributes by about 80%. The most dominant attributes selected at this stage were based on mel and Bark frequency scale filterbanks, with their variability described mainly by variance, median absolute deviation, and standard and average deviations. The classification accuracy using the tuned SVM classifier was 72.5% and 88.27% for emotional spoken sentences in Polish and German, respectively.
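The processing chain outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of scikit-learn, the synthetic contours, the helper names (`contour_statistics`, `utterance_vector`, `build_search`), the 20% selection percentile (chosen only to mirror the reported reduction of about 80%), and the SVM parameter grid are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def contour_statistics(contour):
    """Summarize one low-level feature contour (one value per frame)."""
    median = np.median(contour)
    mean = np.mean(contour)
    return np.array([
        np.var(contour),                      # variance
        np.median(np.abs(contour - median)),  # median absolute deviation
        np.std(contour),                      # standard deviation
        np.mean(np.abs(contour - mean)),      # average (mean absolute) deviation
    ])


def utterance_vector(contours):
    """Concatenate the statistics of all contours, shape (n_contours, n_frames)."""
    return np.concatenate([contour_statistics(c) for c in contours])


def build_search(percentile=20):
    """Chi-squared univariate selection followed by an SVM tuned with cross-validation."""
    pipeline = Pipeline([
        ("scale", MinMaxScaler()),  # chi2 requires non-negative inputs
        ("select", SelectPercentile(chi2, percentile=percentile)),
        ("svm", SVC()),
    ])
    # Hypothetical grid; the paper's actual search space is not given in the abstract.
    grid = {
        "svm__C": [0.1, 1, 10],
        "svm__gamma": ["scale", 0.01],
        "svm__kernel": ["rbf", "linear"],
    }
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    return GridSearchCV(pipeline, grid, cv=cv)


# Synthetic stand-in for real speech: the spread of each contour depends on the class.
rng = np.random.default_rng(0)
n_utterances, n_contours, n_frames = 60, 10, 50
labels = np.repeat([0, 1, 2], n_utterances // 3)
X = np.stack([
    utterance_vector(rng.normal(0.0, 1.0 + label, size=(n_contours, n_frames)))
    for label in labels
])

search = build_search()
search.fit(X, labels)
n_selected = int(search.best_estimator_.named_steps["select"].get_support().sum())
print(X.shape, n_selected, search.best_params_)
```

Placing the scaler and the selector inside the pipeline means both are refitted on each training fold, so the cross-validated score is not influenced by the held-out utterances.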

Keywords:

speech analysis, classification, emotional speech

