In this paper we address the issue of pronunciation modeling for conversational speech synthesis. We experiment with two different HMM topologies (a fully connected state model and a forward connected state model) for sub-phonetic modeling, to capture the deletion and insertion of sub-phonetic states during the speech production process.
We show that both of these HMM topologies achieve a higher log likelihood than the traditional 5-state sequential model. We also study the first and second mentions of content words and their influence on pronunciation variation. Finally, we report phone recognition experiments using the modified HMM topologies.
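The three topologies compared here differ only in which state-to-state transitions are permitted. The following is a minimal sketch of that difference, not the authors' implementation. It assumes that "forward connected" means a state may loop on itself or jump to any later state, that "fully connected" means every state may reach every other state, and that non-emitting entry and exit states are ignored. The names `transition_mask` and `count_arcs` are hypothetical.

```python
def transition_mask(n_states, topology):
    """Return an n_states x n_states 0/1 matrix of allowed transitions.

    Assumed definitions (not taken from the paper):
      sequential: self-loop, or move to the next state only
      forward:    self-loop, or jump to any later state (states can be skipped)
      full:       any state can follow any state (states can be skipped or revisited)
    """
    if topology not in ("sequential", "forward", "full"):
        raise ValueError(f"unknown topology: {topology}")
    mask = []
    for i in range(n_states):
        row = []
        for j in range(n_states):
            if topology == "sequential":
                allowed = j in (i, i + 1)
            elif topology == "forward":
                allowed = j >= i
            else:  # "full"
                allowed = True
            row.append(1 if allowed else 0)
        mask.append(row)
    return mask


def count_arcs(mask):
    """Count the allowed transitions in a mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for name in ("sequential", "forward", "full"):
        mask = transition_mask(5, name)
        print(name, count_arcs(mask))
        for row in mask:
            print("  ", row)
```

Under these assumptions, the skip arcs of the forward model allow a sub-phonetic state to be bypassed (deletion), while the backward arcs of the fully connected model also allow a state to be revisited (insertion).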
1. INTRODUCTION
Modeling of pronunciation variations in conversational speech is essential for speech recognition as well as speech synthesis. State-of-the-art speech synthesis systems are built using unit selection databases of carefully read speech recorded in a controlled environment. While these systems produce high-quality natural speech, they convey little sense of a conversation and lack the genre and style of conversational speech. ... the pronunciation variations. Jande used a phonological rule system to adapt pronunciation to a faster speech rate.
Bennett et al. used acoustic models trained on a single-speaker database to label the alternate pronunciations of the words "to", "for", "a", and "the", and used a CART tree to predict the most probable pronunciation in a given context. There has been considerable research in the speech recognition field on capturing pronunciation variants.
Bates et al. showed that prosodic features derived from energy, F0, and duration could serve as cues for modeling pronunciation variability. Nedel et al. used a phone splitting technique to model the pronunciation variants of two phones, AA and IY.
Most of the work in speech recognition and speech synthesis uses multiple dictionary entries, generated either manually or by automatic means. Typically, an alternate entry for a word is generated by deleting, inserting, or substituting phones in the base form of the word. This type of modeling makes a binary decision, implying that the base form of a word undergoes a complete change as described by its pronunciation variant.
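The dictionary-based approach described above can be sketched as follows. This is an illustrative example, not a method from any of the cited systems. It enumerates every alternate entry that differs from the base form by exactly one phone deletion, insertion, or substitution. The function name `single_edit_variants`, the base form `T UW` for "to", and the three-phone inventory are all assumptions chosen for illustration.

```python
def single_edit_variants(base, phone_set):
    """Return all phone sequences one edit away from the base form.

    base:      list of phone labels for the dictionary base form
    phone_set: phones that may be inserted or substituted
    Each variant is a tuple, so the result can be stored in a set.
    """
    base = list(base)
    variants = set()

    # Deletion: drop the phone at position i.
    for i in range(len(base)):
        variants.add(tuple(base[:i] + base[i + 1:]))

    # Substitution: replace the phone at position i with a different phone.
    for i in range(len(base)):
        for phone in phone_set:
            if phone != base[i]:
                variants.add(tuple(base[:i] + [phone] + base[i + 1:]))

    # Insertion: add a phone before position i (or at the end).
    for i in range(len(base) + 1):
        for phone in phone_set:
            variants.add(tuple(base[:i] + [phone] + base[i:]))

    # The base form itself is not a variant.
    variants.discard(tuple(base))
    return variants


if __name__ == "__main__":
    # Hypothetical base form and phone inventory, for illustration only.
    base_form = ["T", "UW"]
    inventory = ["T", "UW", "AX"]
    for variant in sorted(single_edit_variants(base_form, inventory)):
        print(" ".join(variant))
```

Each generated entry either contains a given phone or does not, which is the binary decision noted above: a partially reduced phone cannot be represented as a dictionary entry.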
Recent studies have shown that a phone is not completely deleted or substituted but is modified only par-