Interspeech2014 @ Singapore – Highlights


This year my colleagues from SpeechLab and I were invited to the Interspeech2014 conference to present our work. The conference was held in Singapore, a great city with a multilingual culture, which made it a fitting choice for an NLP and speech conference. I was impressed by how well everything was organized, from the opening ceremony to the very end. It made the 24-hour flight from NYC to Singapore well worth it. The best part of the conference was the whole slew of interesting research papers in various areas of NLP.


A conference, in general, is a good place to get familiar with current state-of-the-art research. Interspeech2014 was no exception, featuring great presentations from researchers all over the world. Below is a short list of papers I came across and found interesting.

Word-Phrase-Entity Language Models: Getting More Mileage out of N-grams
Michael Levit, Sarangarajan Parthasarathy, Shuangyu Chang, Andreas Stolcke, Benoît Dumoulin

We present a modification of the traditional n-gram language modeling approach that departs from the word-level data representation and seeks to re-express the training text in terms of tokens that could be either words, common phrases or instances of one or several classes. Our iterative optimization algorithm considers alternative parses of the corpus in terms of these tokens, reestimates token n-gram probabilities and also updates within-class distributions. In this paper, we focus on the cold start approach that only assumes the availability of the word-level training corpus, as well as a number of generic class definitions. Applied to the calendar scenario in the personal assistant domain, our approach reduces word error rates by more than 13% relative to the word-only n-gram language models. Only a small fraction of these improvements can be ascribed to a larger vocabulary.
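A note on reading the "13% relative" figure above: a relative reduction is measured as a fraction of the baseline error rate, not in absolute percentage points. The short Python sketch below illustrates the arithmetic. The function name and the WER values are made up for illustration and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, NOT results from the paper.
baseline = 20.0  # baseline WER, in percent
improved = 17.0  # WER after the change, in percent

reduction = relative_wer_reduction(baseline, improved)
print(f"{reduction:.1%} relative reduction")  # prints "15.0% relative reduction"
```

In this made-up case the absolute drop is 3 percentage points, while the relative drop is 15%.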

Improving Spoken Document Retrieval by Unsupervised Language Model Adaptation Using Utterance-based Web Search
Robert Herms, Marc Ritter, Thomas Wilhelm-Stein, Maximilian Eibl

Information retrieval systems facilitate the search for annotated audiovisual documents from different corpora. One of the main problems is to determine domain-specific vocabulary like names, brands, technical terms etc. by using general language models (LM) especially in broadcast news. Our approach consists of two steps to overcome the out-of-vocabulary (OOV) problem to improve the spoken document retrieval performance. Therefore, we first separate the resulting transcript of a speech recognizer into blocks. Keywords are extracted from each transcribed utterance of a block for the search of web resources in an unsupervised manner in order to obtain adaptation data. These data are used to perform a block-specific adaptation of a general pronunciation dictionary and a general LM. The second step comprises the utilization of a certain adapted dictionary and LM in the speech recognizer to improve the vocabulary coverage and to regard the perplexity for a corresponding block at once. We evaluate this strategy on a dataset of summarized German broadcast news. Our experimental results show improvements of up to 11.7% for MAP of 18 different topics and 7.5% of WER in comparison to the base LM.

Prosody Contour Prediction with Long Short-Term Memory, Bi-Directional, Deep Recurrent Neural Networks
Raul Fernandez, Asaf Rendel, Bhuvana Ramabhadran, Ron Hoory

Deep Neural Networks (DNNs) have been shown to provide state-of-the-art performance over other baseline models in the task of predicting prosodic targets from text in a speech-synthesis system. However, prosody prediction can be affected by an interaction of short- and long-term contextual factors that a static model that depends on a fixed-size context window can fail to properly capture. In this work, we look at a recurrent formulation of neural networks (RNNs) that are deep in time and can store state information from an arbitrarily large input history when making a prediction. We show that RNNs provide improved performance over DNNs of comparable size in terms of various objective metrics for a variety of prosodic streams (notably, a relative reduction of about 6% in F0 mean-square error accompanied by a relative increase of about 14% in F0 variance), as well as in terms of perceptual quality assessed through mean-opinion-score listening tests.

Word Embeddings for Speech Recognition
Samy Bengio and Georg Heigold

Speech recognition systems have used the concept of states as a way to decompose words into sub-word units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how embeddings can still allow to score words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.
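To make the idea of "nearby in the Euclidean sense" concrete, here is a minimal Python sketch that picks the word whose embedding is closest to a query vector. It is only a toy illustration, not the authors' method: the two-dimensional vectors, the word list, and the `nearest_word` helper are all assumptions, and a real system would learn much higher-dimensional embeddings from data.

```python
import math

# Hypothetical 2-D embeddings, chosen by hand so that the similar-sounding
# words "cat" and "cap" lie close together. These are not learned values.
EMBEDDINGS = {
    "cat": (0.9, 0.1),
    "cap": (0.8, 0.2),
    "dog": (0.1, 0.9),
}


def nearest_word(query, embeddings):
    """Return the word whose embedding has the smallest Euclidean distance to query."""
    return min(embeddings, key=lambda word: math.dist(query, embeddings[word]))


# A hypothetical vector standing in for the embedding of an acoustic segment.
query = (0.85, 0.12)
print(nearest_word(query, EMBEDDINGS))  # prints "cat"
```

Because scoring only needs a vector for each candidate word, a word that was absent from the training dictionary can still be scored once it has been assigned an embedding.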

Show & Tell session

As a scientist, you must know the importance of collecting, annotating, and cleaning data. The most common approach is to hire workers through Amazon Mechanical Turk. However, that requires a person to commit time to working in front of a computer. Given how ubiquitous mobile platforms are, it was just a matter of time before users could perform such tasks on the go. A small startup, Crowdee, is doing exactly that. It is a useful idea that is worth trying in the future. Here is the abstract:

Crowdee: Mobile Crowdsourcing Micro-task Platform for Celebrating the Diversity of Languages
Babak Naderi, Tim Polzehl, André Beyer, Tibor Pilz, Sebastian Möller

This paper introduces a novel crowdsourcing platform provided to the community. The platform operates on mobile devices and makes data generation and labeling scenarios available for many related research tracks potentially covering also small and underrepresented languages. Besides the versatile ways for commencing studies using the platform, also active research on crowdsourcing itself becomes feasible. With special focus on speech and video recordings, the mobility and scalability of the platform is expected to stimulate and foster data-driven studies and insights throughout the community.