Archives of Acoustics, 39, 4, pp. 501-509, 2014

System for Automatic Transcription of Sessions of the Polish Senate

Krzysztof MARASEK
Polish-Japanese Institute of Information Technology

This paper describes the research behind a Large-Vocabulary Continuous Speech Recognition (LVCSR) system for transcribing Polish-language Senate speeches. The system comprises several components: a phonetic transcription system, language and acoustic model training systems, a Voice Activity Detector (VAD), an LVCSR decoder, and a subtitle generator and presentation system. Some modules rely on existing tools, while others had to be built from scratch; in each case the authors used the most advanced techniques available to them at the time. Finally, several experiments were performed to compare the performance of more modern and more conventional technologies.
Keywords: large vocabulary speech recognition, language modelling, transcription, transliteration, subtitles.
Copyright © Polish Academy of Sciences & Institute of Fundamental Technological Research (IPPT PAN).


DOI: 10.2478/aoa-2014-0054