Archives of Acoustics, 42, 3, pp. 375–383, 2017

Voiceless Stop Consonant Modelling and Synthesis Framework Based on MISO Dynamic System

Gražina KORVEL
Vilnius University

Audio Acoustics Lab., Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology

A framework for modelling and synthesising voiceless stop consonant phonemes is proposed, in which the phoneme is modelled separately in the low-frequency and high-frequency ranges. The phoneme signal is decomposed into sums of simpler basic components and described as the output of a linear multiple-input and single-output (MISO) system. The impulse response of each channel is a third-order quasipolynomial. Using this framework, the limit between the frequency ranges is determined. A new three-step algorithm for finding this limit point is given in this paper. Within this framework, the input of the low-frequency component is equal to one, and the impulse response generates the whole component. The high-frequency component appears when the system is excited by semi-periodic impulses. The filter impulse response of this component model is single period and decays after three periods. Application of the proposed modelling framework to the voiceless stop consonant phoneme has shown that the quality of the model is sufficiently good.
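The MISO structure described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the exact quasipolynomial form (a third-order polynomial envelope multiplied by a damped sinusoid), the function names `quasipolynomial_response` and `miso_output`, the sampling rate, the impulse positions, and all coefficient values are assumptions chosen only for illustration. The sketch does not reproduce the paper's limit point search or its "single period" response constraint.

```python
import numpy as np


def quasipolynomial_response(t, coeffs, damping, freq_hz, phase=0.0):
    # Assumed form: (c0 + c1*t + c2*t^2 + c3*t^3) * exp(-damping*t) * sin(2*pi*f*t + phase)
    envelope = np.polyval(coeffs[::-1], t)  # polyval expects the highest degree first
    return envelope * np.exp(-damping * t) * np.sin(2 * np.pi * freq_hz * t + phase)


def miso_output(inputs, responses):
    # Linear MISO system: convolve each input with its channel response, then sum.
    n = len(inputs[0])
    return sum(np.convolve(u, h)[:n] for u, h in zip(inputs, responses))


fs = 16000                      # hypothetical sampling rate, Hz
n = 800                         # 50 ms of signal at that rate
t = np.arange(n) / fs

# Low-frequency channel: a single unit impulse, so the response is the whole component.
u_low = np.zeros(n)
u_low[0] = 1.0
h_low = quasipolynomial_response(t, [0.0, 50.0, 0.0, -1.0e5], damping=150.0, freq_hz=300.0)

# High-frequency channel: impulses at slightly irregular (semi-periodic) positions.
u_high = np.zeros(n)
u_high[[0, 158, 321, 478, 642]] = 1.0
t_short = np.arange(160) / fs
h_high = quasipolynomial_response(t_short, [0.0, 200.0, 0.0, 0.0], damping=600.0, freq_hz=3000.0)

y_low = miso_output([u_low], [h_low])
y_high = miso_output([u_high], [h_high])
y = miso_output([u_low, u_high], [h_low, h_high])   # modelled signal: sum of both channels
```

Because the system is linear, `y` equals `y_low + y_high`, and the low-frequency output coincides with `h_low` itself, which mirrors the statement that the impulse response generates the whole low-frequency component.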
Keywords: speech synthesis; consonant phonemes; phoneme modelling framework; MISO system
Copyright © Polish Academy of Sciences & Institute of Fundamental Technological Research (IPPT PAN).


DOI: 10.1515/aoa-2017-0039