Large Scale Distributed Acoustic Modeling With Back-Off N-Grams
Ciprian Chelba
Tuesday, April 23, 2013
12:30 PM, Conference Room 5A
Abstract:
Google Voice Search is an application that provides a data-rich setup for both language and acoustic modeling research.
We revive an older approach to acoustic modeling that borrows from n-gram language modeling, in an attempt to scale up both the amount of training data and the model size (as measured by the number of parameters in the model) to approximately 100 times larger than current sizes used in automatic speech recognition.
Speech recognition experiments are carried out in an N-best list rescoring framework for Google Voice Search. We use 87,000 hours of training data (speech along with transcription) obtained by filtering utterances in Voice Search logs on automatic speech recognition confidence.
Models ranging in size from 20 to 40 million Gaussians are estimated using maximum likelihood training. They achieve relative reductions in word-error-rate of 11% and 6% when combined with first-pass models trained using maximum likelihood and boosted maximum mutual information, respectively. Increasing the context size beyond five phones (quinphones) does not help.
Bio:
Ciprian Chelba received his Diploma Engineer degree in 1993 from the Faculty of Electronics and Telecommunications at "Politehnica" University, Bucuresti, Romania, and his M.S. in 1996 and Ph.D. in 2000 from the Electrical and Computer Engineering Department at the Johns Hopkins University.
Between 2000 and 2006 he worked as a Researcher in the Speech Technology Group at Microsoft Research, after which he joined Google, where he is currently a Staff Research Scientist.
His research interests are in statistical modeling of natural language and speech, as well as related areas such as machine learning with an emphasis on large-scale data-driven modeling.
Recent projects include query stream language modeling for Google Voice Search, speech content indexing and ranking for search in spoken documents, discriminative language modeling for large vocabulary speech recognition, logs mining and large-scale acoustic modeling for large vocabulary speech recognition, language modeling for text input on soft keyboards for mobile devices, as well as speech and text classification.