As per news reports from July 2020, a group of researchers from the computer science department of the Indian Institute of Technology, Madras has been developing technology to enable text-to-speech conversion for 13 Indian languages: Assamese, Bodo, Bengali, Gujarati, Hindi, Kannada, Malayalam, Manipuri, Marathi, Odia, Rajasthani, Tamil and Telugu, along with their corresponding Indian English versions. A study in this regard was published in the journal IEEE/ACM Transactions on Audio, Speech, and Language Processing. The technique will make lectures originally presented in English available in all of these Indian languages.
Challenges
To make synthesised speech sound as natural as possible, punctuation marks need to be converted into pauses of suitable lengths. This approach works well when converting English text into synthesised speech, but problems arise when it is applied to Indian languages. One of the difficulties is that Indian-language text carries almost no punctuation apart from the full stop. Moreover, the longest sentences in Indian languages can last up to 30 seconds of speech, compared with about 6 seconds for English. Such long sentences are essentially phrase-based, with each phrase forming an almost complete unit in itself.
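As a rough illustration, the sketch below (not from the study; the pause durations and function name are assumed purely for illustration) shows why punctuation-driven pausing gives several cues for a typical English sentence but only a single cue for a long, sparsely punctuated Indian-language-style sentence.

```python
# A minimal sketch of punctuation-driven pause prediction. The pause lengths
# (in seconds) are hypothetical values, not figures from the study.
PAUSE_FOR_PUNCTUATION = {",": 0.25, ";": 0.35, ":": 0.35, ".": 0.6, "?": 0.6, "!": 0.6}

def pauses_from_punctuation(text: str) -> list[float]:
    """Return a pause duration for every punctuation mark found in the text."""
    return [PAUSE_FOR_PUNCTUATION[ch] for ch in text if ch in PAUSE_FOR_PUNCTUATION]

english = "After the lecture, which ran long, the students left; some stayed back."
indian_style = "A thirty-second sentence written as one long run of phrases ends with a single period."

print(pauses_from_punctuation(english))       # several cues: [0.25, 0.25, 0.35, 0.6]
print(pauses_from_punctuation(indian_style))  # only one cue: [0.6]
```

With only a sentence-final full stop available, the pause positions inside a long Indian-language sentence have to come from the phrase structure instead of the punctuation.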
Methodology of Study
The study involved voice professionals such as news readers and radio jockeys, who read out text carefully selected to be representative of various fields. The audio signal and the text were aligned, including pauses; the text was syllabified using rules, and syllables and pauses were identified in the audio from their acoustic properties. Because the text and audio were aligned at the syllable level, computing the syllable rate and the number of syllables between pauses was straightforward. In this way the researchers collected 10 hours of data for every language, with an hour of speech containing about 350-400 sentences. Five hours of data were used for forming the hypothesis, and a set of held-out sentences from the database was used for testing it. The text sentences were chosen so that the maximum number of domains was covered, including news, sports and fiction.
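The sketch below is a hypothetical illustration of how a syllable-level alignment makes these quantities easy to compute. The data layout, the "PAU" pause label and the example timings are assumptions for the sake of the example, not the study's actual format.

```python
# Assumed alignment format: (unit, start_time_s, end_time_s), with pauses
# marked by the label "PAU". Timings here are made up for illustration.
alignment = [
    ("na", 0.00, 0.18), ("ma", 0.18, 0.34), ("ste", 0.34, 0.58),
    ("PAU", 0.58, 0.90),
    ("dhan", 0.90, 1.12), ("ya", 1.12, 1.26), ("vaad", 1.26, 1.60),
]

syllables = [(u, s, e) for (u, s, e) in alignment if u != "PAU"]

# Syllable rate: syllables uttered per second of actual speech (pauses excluded).
speech_time = sum(e - s for (_, s, e) in syllables)
syllable_rate = len(syllables) / speech_time

# Count the syllables falling between successive pauses, i.e. within each phrase.
phrase_lengths, count = [], 0
for unit, _, _ in alignment:
    if unit == "PAU":
        phrase_lengths.append(count)
        count = 0
    else:
        count += 1
phrase_lengths.append(count)

print(f"syllable rate: {syllable_rate:.2f} syl/s, syllables per phrase: {phrase_lengths}")
```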
The results obtained from the experiments were tested on listeners by playing spoken as well as synthesised sentences in a random order; the listeners reported a uniform improvement across all the Indian languages. The phrases of the text were synthesised using the appropriate phrase-based synthesis system, and the synthesised waveforms were concatenated. This automated conversion of written text to spoken form is very useful, especially in this time of online classes.
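The phrase-by-phrase assembly described above can be pictured with a short hypothetical sketch: each phrase is synthesised on its own and the resulting waveforms are concatenated, with a short silence inserted at the phrase breaks. The `synthesise_phrase` placeholder stands in for whatever phrase-level synthesiser is used and is not part of the study's code.

```python
import numpy as np

SAMPLE_RATE = 16_000

def synthesise_phrase(phrase: str) -> np.ndarray:
    # Placeholder: a real system would return the synthesised waveform here.
    return np.zeros(int(0.05 * SAMPLE_RATE * len(phrase.split())), dtype=np.float32)

def synthesise_sentence(phrases: list[str], pause_s: float = 0.3) -> np.ndarray:
    """Synthesise each phrase separately and concatenate the waveforms."""
    silence = np.zeros(int(pause_s * SAMPLE_RATE), dtype=np.float32)
    pieces = []
    for i, phrase in enumerate(phrases):
        pieces.append(synthesise_phrase(phrase))
        if i < len(phrases) - 1:          # insert a pause only between phrases
            pieces.append(silence)
    return np.concatenate(pieces)

waveform = synthesise_sentence(["first phrase of a long sentence",
                                "second phrase",
                                "closing phrase"])
print(waveform.shape)
```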