
Overview of TTS at Interspeech 2022

We hope you enjoyed our first blog post about our highlights of Interspeech 2022. Here we’d like to cover the topic nearest and dearest to our work: speech synthesis! There was a ton of new research this year, with plenty of new datasets, new front-end text analysis, and new methods for creating expressive synthetic voices. We hope you enjoy!

Data

New language data

We love to see new non-English corpora released: this both enables the research community to test systems in other languages and, more importantly, helps promote formal investigations of under-resourced languages in speech science. As is (somewhat unfortunately) standard in TTS research, most papers we saw at Interspeech this year continued to use English-only data, though there is growth in Mandarin. We want to start by highlighting four new corpora from various other languages.

| Corpus (authors) | Language | Number of speakers | Hours | Licence | Data type |
| --- | --- | --- | --- | --- | --- |
| Tran et al. | Vietnamese | 1 (female) | 9.5 | Research only | Found data: YouTube audiobooks |
| REYD (Webber et al.) | Yiddish | | 38 | | Found data: archived audiobooks |
| BibleTTS (Meyer et al.) | 10 Sub-Saharan African languages: Akuapem Twi, Asante Twi, Chichewa, Ewe, Hausa, Kikuyu, Lingala, Luganda, Luo, Yoruba | 10 (one per language) | 2,047 (including unaligned) | Creative Commons ShareAlike | Bible verses recorded in a professional studio |
| J-MAC (Takamichi et al.) | Japanese | 39 | 31.5 | Research only | Found data: audiobooks |

Three of these four corpora were created from found audiobooks, reiterating the usefulness of this type of data for TTS.

Data collection for expressivity

This year, one growing theme in more expressive synthesis was research into designing and commissioning more expressive data. The designs explored in these papers were varied, lying on a spectrum from scripted to unscripted text for voice actors. On the fully scripted side was STUDIES by Saito et al. They wrote interactions between teachers and students to elicit empathetic dialogue and therefore more emotional speech, and found that explicitly giving a TTS system the emotion label of the interlocutor (the person the speaker is responding to) and a conversational context embedding produced speech rated as natural as that of a system using only the speaker’s own emotion label. O’Mahony et al. supplemented LJ Speech with less scripted data from the Spotify Podcast dataset. They were able to improve question intonation by incorporating question/answer pairs from the podcasts in their training data. Adigwe and Klabbers compared read-aloud sentences, performed dialogues, and semi-spontaneous dialogues. Beyond finding greater expressivity in the semi-spontaneous dialogues, they also asked voice actors for direct feedback on the different scripts and found a preference for the multi-speaker setups (performed and semi-spontaneous dialogues). Lastly, Delvaux et al. had their speakers tell fully unscripted stories about important memories in order to capture natural emotional speech. Though they expected different dimensions of these memories to yield differences in acoustic parameters, they mostly found the emotional effects to have “limited impact”.

It seems there is still too much we don’t know about fully spontaneous speech to use it alone as TTS training data. However, there are clear benefits to both naturalness and expressivity from moving beyond fully scripted texts and including dialogues. We’re excited to experiment with different ideas in our own data collections.

Data augmentation

We also saw a growing trend of using voice conversion (VC) to augment TTS training data. Comini et al. investigated a language-agnostic approach to augmenting 8-10 hours of neutral speech using only an hour of conversational speech, representing a low-resource setting. They use an F0 predictor to reproduce the conversational target speaker’s F0 and feed its output to a VC system to create additional training data. They are able to control F0 in the VC output using the predictor, and a TTS system trained on the augmented data produces high-quality synthesised speech that is on par with or better than the previous best augmented system.
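To make the pipeline concrete, here is a minimal sketch of the general idea, assuming placeholder model classes, sizes, and training details of our own (not the authors’ implementation): an F0 predictor trained on the small conversational set supplies target-style F0 contours, which condition a VC model that converts plentiful neutral recordings into extra expressive training data.

```python
# A minimal sketch of F0-conditioned VC data augmentation. All class names,
# shapes, and training details here are illustrative assumptions.
import torch
import torch.nn as nn

class F0Predictor(nn.Module):
    """Predicts a frame-level F0 contour in the conversational target style
    from phoneme embeddings (trained on the ~1h of conversational data)."""
    def __init__(self, d_phone=256, d_hidden=128):
        super().__init__()
        self.rnn = nn.GRU(d_phone, d_hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * d_hidden, 1)

    def forward(self, phone_emb):                 # (B, T, d_phone)
        h, _ = self.rnn(phone_emb)
        return self.proj(h).squeeze(-1)           # (B, T) log-F0 per frame

class VoiceConversionModel(nn.Module):
    """Placeholder VC model conditioned on an external F0 contour and a
    speaker embedding; converts neutral source mels into the target style."""
    def __init__(self, n_mels=80, d_spk=64, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + 1 + d_spk, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, n_mels),
        )

    def forward(self, source_mel, f0, spk_emb):   # (B, T, 80), (B, T), (B, d_spk)
        spk = spk_emb.unsqueeze(1).expand(-1, source_mel.size(1), -1)
        x = torch.cat([source_mel, f0.unsqueeze(-1), spk], dim=-1)
        return self.net(x)                         # converted mel, same length

# Augmentation loop: convert plentiful neutral recordings into extra
# "conversational" training data for the TTS model.
f0_predictor, vc = F0Predictor(), VoiceConversionModel()
neutral_mel = torch.randn(1, 400, 80)              # a neutral-style utterance
phone_emb = torch.randn(1, 400, 256)               # its (upsampled) phoneme embeddings
target_spk = torch.randn(1, 64)                    # conversational target speaker

with torch.no_grad():
    conversational_f0 = f0_predictor(phone_emb)    # target-style F0 contour
    augmented_mel = vc(neutral_mel, conversational_f0, target_spk)
# `augmented_mel` would then be added to the TTS training set.
```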

Terashima et al. similarly explored using VC to augment data for low-resource expressive TTS, but used algorithmic pitch-shifting of neutral data rather than training an F0 predictor on already-expressive data. They use this pitch-shifted data to train the VC model, whose output is then used as training data for a TTS system. They found higher naturalness and emotional similarity in speech synthesised by the pitch-shifting system than by one trained on the output of a VC model without pitch-shifting.

Instead of using VC for training-data augmentation, Biliński et al. used normalising flows to generate new speakers that were unseen in the training data, for both TTS and VC. This is particularly interesting for us because our dubbing use case often requires many different speakers for a single video. As the authors point out, adding new speakers for TTS usually requires costly recording time with a new voice actor, so research into alternatives is exciting to see.
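As a toy illustration of the generation idea (not the architecture in the paper), the sketch below fits a single affine-coupling flow to a set of speaker embeddings and then samples new, unseen “speakers” by drawing from the Gaussian prior and inverting the flow; all sizes and the training loop are our own assumptions.

```python
# A bare-bones normalising-flow sketch for sampling new speaker embeddings.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * (dim - self.half)))

    def forward(self, x):                          # embedding -> latent
        a, b = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(a).chunk(2, dim=-1)
        return torch.cat([a, b * torch.exp(log_s) + t], dim=-1), log_s.sum(-1)

    def inverse(self, z):                          # latent -> embedding
        a, b = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(a).chunk(2, dim=-1)
        return torch.cat([a, (b - t) * torch.exp(-log_s)], dim=-1)

dim = 64                                           # speaker-embedding size (assumed)
flow = AffineCoupling(dim)
speaker_embeddings = torch.randn(500, dim)         # embeddings of the training speakers

# Training: maximise the likelihood of real speaker embeddings under the flow.
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for _ in range(100):
    z, log_det = flow(speaker_embeddings)
    nll = 0.5 * (z ** 2).sum(-1) - log_det         # standard-normal prior + change of variables
    opt.zero_grad()
    nll.mean().backward()
    opt.step()

# Generation: sample the prior and invert the flow to get brand-new "speakers".
with torch.no_grad():
    new_speakers = flow.inverse(torch.randn(10, dim))
```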

We’re keen to continue watching the data augmentation space as it grows. Data collection is a huge barrier to scaling TTS, and any methods that alleviate this bottleneck can be beneficial to TTS research as a whole.

Linguistic front-end

In recent years, “end-to-end” TTS systems have become increasingly popular. In theory, by operating directly on graphemes rather than phonemes, they remove the need for a front-end text-analysis component. In practice, however, these systems fail to model the complex pronunciation patterns some languages exhibit, resulting in frequent pronunciation errors. This highlights the importance of pronunciation control, which often goes hand-in-hand with a robust linguistic front-end. In line with this, we saw new research on two main front-end problems: homograph disambiguation and prosodic structure prediction.

Homograph disambiguation

Homographs are a long-standing issue in grapheme-to-phoneme (G2P) conversion. Data-driven methods have traditionally viewed G2P as a word-level task where the pronunciation of a word can be predicted from its spelling alone; however, the key to correctly interpreting homographs is the linguistic context surrounding the word. In recent years, with the increasing popularity of transformer-based language models in NLP (e.g., the BERT family), we’ve seen a trend shift in homograph disambiguation research towards exploiting contextual information from pre-trained word embeddings, which the work described in this section builds on.

Zhang et al. proposed Polyphone BERT, a simple yet effective approach to polyphone disambiguation in Mandarin, i.e. homograph disambiguation specifically in the context of Chinese characters. Unlike BERT-based methods that make explicit use of word embeddings (e.g. as input to a classifier), the authors leverage the contextual information those embeddings encode by fine-tuning a pre-trained BERT model on a pseudo masked-language-modelling (MLM) task, achieving excellent results even in highly ambiguous settings.

Rather than filling in the blanks, the fine-tuning task consists of replacing polyphonic characters (the “masked” tokens in the input sentence) with contextually appropriate monophonic characters (i.e., their unambiguous counterparts), effectively turning MLM into homograph disambiguation. To this end, the authors established a mapping from polyphonic characters to new (made-up) monophonic characters, one for each possible Pinyin pronunciation (e.g., 泊1 and 泊2 for the Chinese polyphone 泊), and extended the pre-trained BERT vocabulary to include them. Since the new unambiguous characters have a one-to-one correspondence with Pinyin pronunciations, the model’s predictions can be used to retrieve the correct pronunciation from a Mandarin lexicon at inference time.
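Here’s a rough sketch of how that recipe could look in code, using Hugging Face Transformers; the pseudo-character names, the example sentence, and the training/inference details are our own illustrative choices rather than the paper’s exact setup.

```python
# Sketch of the Polyphone BERT recipe: extend the vocabulary with one
# "monophonic" pseudo-character per pronunciation of a polyphone, then
# fine-tune BERT as a masked-LM-style classifier over those new tokens.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# One new unambiguous token per Pinyin pronunciation of the polyphone 泊.
new_tokens = ["泊1", "泊2"]                       # e.g. bó (to moor) vs. pō (lake)
pinyin_of = {"泊1": "bo2", "泊2": "po1"}
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Fine-tuning example: the target at the polyphone's position is the id of the
# contextually correct pseudo-character; all other positions are ignored (-100).
sentence = "湖泊很美"                             # 泊 here is pronounced pō -> 泊2
enc = tokenizer(sentence, return_tensors="pt")
labels = torch.full_like(enc["input_ids"], -100)
poly_pos = 2                                      # [CLS] 湖 泊 很 美 [SEP]
labels[0, poly_pos] = tokenizer.convert_tokens_to_ids("泊2")

loss = model(**enc, labels=labels).loss           # standard MLM-style loss
loss.backward()

# Inference: read off the predicted pseudo-character and map it to Pinyin.
with torch.no_grad():
    logits = model(**enc).logits[0, poly_pos]
candidate_ids = tokenizer.convert_tokens_to_ids(new_tokens)
best = new_tokens[logits[candidate_ids].argmax().item()]
print(pinyin_of[best])                            # pronunciation for lexicon lookup
```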

BERT is also at the foundation of Chen et al.’s g2pW, another approach to polyphone disambiguation, which, like g2pM two years ago, comes with a user-friendly Python package! While Polyphone BERT adapts BERT’s learnt representations to the task in question, g2pW uses the pre-trained model as a feature extractor that feeds into a more complex architecture with two main components: a weighted-softmax classifier and a part-of-speech (POS) tagger.

The input sentence is encoded with BERT to obtain a contextual embedding for the polyphonic character, which is then fed to the POS tagger in order to predict the polyphone’s POS tag. The predicted tag, along with the polyphonic character itself, is in turn used to compute a set of weights for the weighted-softmax classifier, each corresponding to one possible pronunciation of the polyphone. The classifier then determines which candidate pronunciation has the highest probability given the BERT embedding and the conditional weights. This architecture appears to outperform state-of-the-art systems such as last year’s PDF on the public CPP dataset (from the g2pM paper), but it’s unclear how it compares to Polyphone BERT, as the latter was only evaluated on a small subset of CPP.
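A loose sketch of how such a conditionally weighted softmax might be wired up is shown below; the dimensions, the simple POS tagger, and the candidate masking are our own simplifications, not g2pW’s actual architecture.

```python
# Sketch of a g2pW-style conditional weighted softmax: candidate pronunciation
# scores from a BERT embedding are re-weighted by weights derived from the
# polyphonic character and its predicted POS tag.
import torch
import torch.nn as nn

class WeightedSoftmaxG2P(nn.Module):
    def __init__(self, d_bert=768, n_pron=1000, n_pos=32, n_char=5000):
        super().__init__()
        self.pos_tagger = nn.Linear(d_bert, n_pos)       # predicts polyphone's POS
        self.scorer = nn.Linear(d_bert, n_pron)          # raw pronunciation scores
        self.char_emb = nn.Embedding(n_char, n_pron)     # char -> per-pronunciation weights
        self.pos_emb = nn.Embedding(n_pos, n_pron)       # POS  -> per-pronunciation weights

    def forward(self, h_poly, char_id, candidate_mask):
        # h_poly: (B, d_bert) BERT embedding at the polyphone position
        pos_id = self.pos_tagger(h_poly).argmax(-1)       # (B,) predicted POS tag
        weights = self.char_emb(char_id) + self.pos_emb(pos_id)
        logits = self.scorer(h_poly) * torch.sigmoid(weights)   # conditionally weighted
        logits = logits.masked_fill(~candidate_mask, float("-inf"))
        return logits.softmax(-1)                         # probability over candidates

model = WeightedSoftmaxG2P()
h_poly = torch.randn(2, 768)
char_id = torch.tensor([42, 43])
mask = torch.zeros(2, 1000, dtype=torch.bool)
mask[:, :4] = True                                        # each character has few candidates
probs = model(h_poly, char_id, mask)
print(probs.argmax(-1))                                   # most likely pronunciation ids
```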

Classification-based approaches to homograph disambiguation like g2pW often require a separate G2P component to deal with non-homographs or, in the case of Polyphone BERT, retrieve the pronunciation of the word post-disambiguation. Ploujnikov and Ravanelli presented SoundChoice, a multi-task learning approach that integrates homograph disambiguation into a sentence-level G2P model with very promising results.

The input combines character embeddings and BERT embeddings, derived from the graphemes and words in the input sentence, respectively. This way, the model benefits from both low-level graphemic information, useful for G2P, and high-level contextual information, necessary for disambiguation. The authors also use a weighted homograph loss to amplify the contribution of homograph errors to the overall loss, an important consideration given that low error rates can be achieved without successfully disambiguating homographs — generally, most words aren’t homographs, and only a subset of sentences in the training data contain homographs. We recommend you check out the paper for more details if you’re interested and see the model in action on Hugging Face!
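To illustrate the weighting idea, here is a minimal sketch of a weighted homograph loss: a standard phoneme-level cross-entropy plus an up-weighted term restricted to the homograph’s positions. The exact weighting scheme shown is an assumption on our part, not SoundChoice’s formulation.

```python
# Minimal weighted homograph loss sketch for a sentence-level G2P model.
import torch
import torch.nn.functional as F

def weighted_homograph_loss(logits, targets, homograph_mask, homograph_weight=10.0):
    """
    logits:          (B, T, n_phonemes) sentence-level G2P predictions
    targets:         (B, T) gold phoneme ids
    homograph_mask:  (B, T) True at output positions covering the homograph
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"   # (B, T)
    )
    base = per_token.mean()
    homograph = (per_token * homograph_mask).sum() / homograph_mask.sum().clamp(min=1)
    return base + homograph_weight * homograph

logits = torch.randn(2, 20, 60, requires_grad=True)
targets = torch.randint(0, 60, (2, 20))
mask = torch.zeros(2, 20, dtype=torch.bool)
mask[:, 5:8] = True                                          # homograph spans these positions
loss = weighted_homograph_loss(logits, targets, mask)
loss.backward()
```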

Prosodic structure prediction

The role pronunciation plays in removing word-level ambiguity is analogous to that of prosody in conveying different meanings for the same sentence. An utterance consists of one or more prosodic phrases, typically arranged hierarchically and delimited by a pause or change in acoustic properties like F0 and duration. This phrasing can affect how we interpret sentences and is precisely what determines the prosodic structure of an utterance. Work on prosodic structure prediction (PSP) this year was characterised by the use of multi-task learning (MTL), a modelling framework frequently used in NLP to jointly solve two or more related subtasks, which, as SoundChoice above illustrates, has recently been adopted in linguistic front-end research with great success.

Chen et al.’s work on PSP in Mandarin differs from previous MTL approaches (e.g., Huang et al., Pan et al.) in how much context is provided to the model. Rather than linguistic features derived from the target sentence, which, by themselves, offer limited context on the prosody of the utterance, they condition the model’s predictions on groups of adjacent sentences (within a fixed-size window around the target sentence) — this is especially relevant in the context of long-form TTS, as inter-sentential linguistic information can affect the prosodic structure of the utterance.

To this end, their proposed architecture encodes character-, sentence- and discourse-level information from the input sentences into a combined representation. This multi-level contextual representation is then fed to an MTL decoder to predict the boundaries between prosodic constituents for each level of the Mandarin prosodic hierarchy — namely, prosodic word, prosodic phrase and intonational phrase, each regarded as a subtask of PSP in the MTL framework. In order to jointly optimise all subtasks, the authors condition those that correspond to higher-level prosodic constituents on those associated with lower-level ones (i.e., prosodic-phrase prediction gets conditioned on the prosodic-word subtask, and so on). This allows PSP to better model the hierarchical dependencies between the prosodic constituents in the hierarchy, contrary to conventional methods, where each task is viewed as an independent problem.
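The hierarchical conditioning can be sketched as follows, with each level of the prosodic hierarchy implemented as a binary boundary tagger conditioned on the predictions of the level below; feature sizes and the simple concatenation are our own assumptions, not the paper’s design.

```python
# Sketch of a hierarchically conditioned multi-task PSP decoder.
import torch
import torch.nn as nn

class HierarchicalPSPDecoder(nn.Module):
    def __init__(self, d_ctx=512, d_hidden=256):
        super().__init__()
        self.pw  = nn.GRU(d_ctx,     d_hidden, batch_first=True)   # prosodic word
        self.pph = nn.GRU(d_ctx + 1, d_hidden, batch_first=True)   # prosodic phrase
        self.iph = nn.GRU(d_ctx + 2, d_hidden, batch_first=True)   # intonational phrase
        self.heads = nn.ModuleList([nn.Linear(d_hidden, 1) for _ in range(3)])

    def forward(self, ctx):                        # (B, T, d_ctx) multi-level context
        h, _ = self.pw(ctx)
        pw_prob = torch.sigmoid(self.heads[0](h))                   # (B, T, 1)
        h, _ = self.pph(torch.cat([ctx, pw_prob], dim=-1))          # condition on PW
        pph_prob = torch.sigmoid(self.heads[1](h))
        h, _ = self.iph(torch.cat([ctx, pw_prob, pph_prob], dim=-1))  # condition on PW + PPH
        iph_prob = torch.sigmoid(self.heads[2](h))
        return pw_prob, pph_prob, iph_prob         # boundary probability per character

decoder = HierarchicalPSPDecoder()
context = torch.randn(4, 30, 512)                  # character-level contextual features
pw, pph, iph = decoder(context)
```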

Park et al. presented an MTL-based approach to pitch-accent prediction in Japanese, a task which depends on PSP but is typically treated as a separate problem. Pitch accent, like stress, is a lexical (word-level) feature, but its position can shift when the word is pronounced in context, affecting the prosodic structure of the utterance. Training independent PSP and pitch-accent prediction models thus makes it difficult to capture the relationship between them, which motivates the use of an MTL framework.

The authors formulate the problem as a sequence classification task where, for each word in the input text, the model simultaneously predicts whether there’s a phrase boundary and a pitch accent. As in the previous paper, this is achieved by sequentially conditioning the individual prediction subtasks according to the Japanese prosodic hierarchy (i.e., pitch-accent prediction is conditioned on accent-phrase prediction, and intonation-phrase prediction on accent-phrase prediction). The results show a significant improvement over conventional two-stage methods, suggesting that modelling the relationship between prosodic structure and pitch accent is beneficial in predicting the latter.

While PSP is particularly important for languages like Mandarin and Japanese, we’d be interested to see its potential in TTS systems for languages that are neither tonal nor pitch-accented, perhaps as a form of prosodic control, like the methods we describe in the next section.

Expressivity

The naturalness of TTS voices has improved massively in the past few years. Neural vocoders and sequence-to-sequence acoustic models were introduced 5-6 years ago, and we’ve had time to explore, experiment, and improve within these paradigms. While neither vocoding nor acoustic modelling is perfectly solved, prosody modelling stands out as the part of TTS that really underperforms. When TTS is used for long-form content, it becomes clear that expressivity is lacking and that the prosody does not capture meaningful information related to the context of an utterance.

Control

One way to improve the expressivity and appropriateness of synthetic prosody is to use a human-in-the-loop system for control. This year, we saw two primary approaches to creating these systems: 1) modifying TTS models to account for additional prosodic information, and 2) harnessing information already present in the existing architecture.

Modifying the architecture

On the additional annotations side, Shin et al. took a very similar approach to the style-tag paper by Kim et al. last year, in which crowd-sourced style-tag annotations were used as training data for a TTS system, allowing for free-form style-tag inputs at inference time. However, they adapt the original approach for a multi-speaker setting by adding new loss terms. Their work reiterates the benefits of using this type of approach for control, especially because it is much more flexible than something that relies on specific acoustic features like F0 or energy.

Seshadri et al. modify FastSpeech 2 with an additional emphasis predictor that has a similar architecture to the F0, energy, and duration predictors. They show that this modified system can generate perceivable emphasis at the word level whilst maintaining good-quality synthesis (audio samples here).
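For readers unfamiliar with FastSpeech 2’s variance adaptor, here is a small sketch of what an emphasis predictor in that style might look like; kernel sizes, dimensions, and the way the predicted value is fed back into the encoder output are illustrative assumptions, not the paper’s exact design.

```python
# Sketch of a FastSpeech 2-style variance predictor repurposed for emphasis.
import torch
import torch.nn as nn

class EmphasisPredictor(nn.Module):
    """Predicts a scalar emphasis value per encoder position; at inference the
    value can be overridden by a human to add word-level emphasis."""
    def __init__(self, d_model=256, d_hidden=256, kernel=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.drop = nn.Dropout(dropout)
        self.proj = nn.Linear(d_hidden, 1)
        self.emphasis_embed = nn.Linear(1, d_model)    # feeds the value back in

    def forward(self, x, target=None):                 # x: (B, T, d_model)
        h = self.drop(self.norm1(torch.relu(self.conv1(x.transpose(1, 2)).transpose(1, 2))))
        h = self.drop(self.norm2(torch.relu(self.conv2(h.transpose(1, 2)).transpose(1, 2))))
        emphasis = self.proj(h)                         # (B, T, 1) predicted emphasis
        value = target if target is not None else emphasis
        return x + self.emphasis_embed(value), emphasis # conditioned encoder output

predictor = EmphasisPredictor()
enc_out = torch.randn(2, 50, 256)
conditioned, emphasis = predictor(enc_out)              # pass ground truth as `target` in training
```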

Ju et al. look at directly controlling pitch. TriniTTS is an end-to-end system that, though inspired by VITS, differs greatly in that it includes pitch-related modules and does not use sampling. The authors demonstrate that this system is able to achieve better MOS scores in the pitch-control setting than FastPitch with a HiFi-GAN vocoder, and is also able to generate speech faster than VITS.

Harnessing the existing architecture

In contrast to the above approaches, Lenglet et al. target a specific prosodic parameter for control whilst preserving segmental variations, suprasegmental variations, and their co-variations in speech by using the existing embedding space of encoder-decoder TTS models. Specifically, they analyse the encoder embeddings with respect to the acoustics to figure out how to bias those embeddings and thereby control the speaking rate. We appreciate the benefits of such an approach: maintaining co-variation whilst simultaneously allowing for specific control could improve the efficiency and naturalness of control. This is something our Ctrl-P model, and other systems with explicit prosodic inputs like FastSpeech 2, do not currently do. Maintaining co-variation directly trades off against disentangled control over the acoustic features, and the optimal behaviour will depend on the use case of these controls.
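As a toy illustration of the “bias the existing embedding space” idea, the sketch below estimates a speaking-rate direction in the encoder-embedding space from a simple difference of means and shifts encoder states along it at inference; this is a drastic simplification of the analysis-driven procedure in the paper, and all quantities are our own assumptions.

```python
# Toy sketch: estimate a speaking-rate direction in encoder-embedding space
# and bias encoder states along it to speed up or slow down synthesis.
import torch

def speaking_rate_direction(embeddings, rates):
    """embeddings: (N, D) encoder embeddings pooled per utterance
       rates:      (N,)   measured speaking rates (e.g. syllables/second)"""
    fast = embeddings[rates > rates.median()].mean(0)
    slow = embeddings[rates <= rates.median()].mean(0)
    return fast - slow                           # direction of "faster speech"

def bias_encoder_states(encoder_states, direction, alpha):
    # encoder_states: (B, T, D); alpha > 0 speeds up, alpha < 0 slows down
    return encoder_states + alpha * direction

emb = torch.randn(200, 512)                      # pooled embeddings from training data
rates = torch.rand(200) * 3 + 3                  # their measured speaking rates
direction = speaking_rate_direction(emb, rates)
faster = bias_encoder_states(torch.randn(1, 40, 512), direction, alpha=0.5)
```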

Tae et al. harness score-based generative modelling to allow granular editing of content and pitch without any additional training, optimisation, or architectural modifications. They do this by perturbing the Gaussian prior space whilst also applying masks and softening kernels to focus the edits only in the target region. Check out their samples page here.

Context for prosody modelling

In addition to pursuing controllable TTS, it’s also necessary to improve the appropriateness of predicted prosody without human control. We were excited to see a range of papers incorporating new information into the prosody model.

The context provided for prosody modelling can involve new features to provide additional types of information for the current utterance, or information from the surrounding utterances. Recently, we’ve seen many works using pre-trained language models as an additional type of context information. We’ve also seen more papers incorporating the previous sentence as context to determine an appropriate prosodic rendition.

In particular, research from Mitsui et al. did both! They used BERT to summarise both the current sentence and the past 10 sentences. The two summary vectors (for the current sentence and the past context) are used to predict an utterance-level acoustic latent variable, and this latent is predicted with an LSTM over the sequence of sentences in a dialogue! This sentence-level style predictor for dialogue significantly improved MOS compared to a typical VITS TTS model, both for isolated sentences and for 1-2 minute dialogues. Seeing such an improvement for longer-form stimuli is amazing. One of our native Japanese speakers was very impressed by the quality of their samples, stating that the dialogue was natural and convincing. We’d be very interested in any follow-up analysis comparing their top-line performance (using oracle embeddings) to human speech.
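A rough sketch of this kind of context-aware style predictor is shown below: BERT summaries of the current sentence and its preceding context feed an LSTM that runs over the dialogue and predicts an utterance-level latent per turn. The choice of English BERT, the pooling, and the model sizes are our own assumptions, not the paper’s.

```python
# Sketch of a dialogue-level style predictor driven by BERT summaries.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class DialogueStylePredictor(nn.Module):
    def __init__(self, bert_name="bert-base-uncased", d_latent=64):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(bert_name)
        self.bert = AutoModel.from_pretrained(bert_name)
        d = self.bert.config.hidden_size
        self.lstm = nn.LSTM(2 * d, 256, batch_first=True)
        self.proj = nn.Linear(256, d_latent)

    def summarise(self, text):
        enc = self.tokenizer(text, return_tensors="pt", truncation=True)
        return self.bert(**enc).last_hidden_state[:, 0]       # [CLS] summary, (1, d)

    def forward(self, dialogue, context_size=10):
        feats = []
        for i, sentence in enumerate(dialogue):
            current = self.summarise(sentence)
            past = " ".join(dialogue[max(0, i - context_size):i]) or sentence
            feats.append(torch.cat([current, self.summarise(past)], dim=-1))
        h, _ = self.lstm(torch.stack(feats, dim=1))            # (1, n_turns, 256)
        return self.proj(h)                                    # utterance-level latents

predictor = DialogueStylePredictor()
latents = predictor(["How was your trip?", "Amazing, thank you!", "Tell me everything."])
```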

Nishimura et al. also proposed a TTS model for dialogue speech. Their approach combined context information from both the text and the audio, though the contribution of each modality was not analysed. While they explored various additions to the model, we were most interested in their idea of fine-tuning a wav2vec 2.0 model for their prosody prediction task. Unfortunately, their model performed better when trained from scratch. This may be due to their choice of pre-trained model; we hope that pre-training for prosody prediction can be useful if the self-supervised pre-training loss is designed with prosody in mind.

The last three papers were all from the TTS team at Amazon, all working on expressive speech. Makarov et al. incorporated and studied three design choices in a TTS model: training with multiple speakers, using large language models, and providing wider context. We believe this is the first work to successfully improve TTS by using BERT directly for acoustic modelling (i.e. as input to the acoustic decoder); previous approaches have used language models specifically for prosody modelling. In addition, this is the first paper to show such successful results by incorporating wider context! They also analysed the impact on duration prediction, finding that BERT does not improve phone duration prediction, but it does improve pause prediction.

The next paper, from Karlapati et al., extends CopyCat, a voice conversion model. In CopyCat2, they learn to predict prosody using a pre-trained BERT model, which means the model is also capable of TTS. They demonstrate that CopyCat2 improves speaker similarity for voice conversion compared to CopyCat, suggesting that the word-level prosody representation enables better speaker disentanglement than the original 16-frame windows. They also show that CopyCat2 has higher naturalness than Kathaka, a TTS model with a sentence-level prosody representation. This suggests that a word-level prosody representation has higher capacity and is still predictable.

And our final instalment from Amazon on prosody comes from Abbas et al. They look at improving duration prediction (including pause placement) using a BERT model and two different modelling approaches. In their first model, phrase breaks are predicted using BERT, and this prediction is used to help drive duration prediction. Their second model, Cauliflow, is a normalising-flow duration model. Here, they represent phrase-break information as a “pause rate” but don’t directly supervise with a phrase-break loss. Cauliflow uses the same inputs as their first approach (including BERT) with the addition of pause rate and speaking rate. For human-in-the-loop control, pause rate should be easier to operate than placing individual phrase breaks. Finally, they show that a model without phrase-break supervision incorrectly captures a uni-modal duration distribution, while both of their models capture a duration distribution more similar to human speech.

Improving specific styles

Another promising approach to improving the utility of TTS models is to introduce new data specifically for a given use case. O’Mahony et al. demonstrate this for conversational-style speech. As discussed above, they compare a model trained on LJ Speech with a model trained on a data mix including 15 hours of conversational question-and-answer data. This new data was filtered from the Spotify podcast dataset and consists of about 15,000 speakers. They found that the data-mix model improved performance on question prosody; however, answers were unaffected. This is unsurprising, as TTS in this paper was performed on isolated sentences, meaning answers could not be delivered appropriately given the question context. The breakdown by questions and answers is an excellent demonstration of how to provide more insight about our models. We look forward to seeing further work on this question-answer dataset.

Zaïdi et al. propose a FastPitch-like model with a reference encoder that can adjust prosody and acoustic predictions. Their paper focuses on their model’s ability to achieve prosody transfer across text and uses LJ Speech, 5 neutral speakers, and 7 speakers with highly expressive performances. Through the lens of improving specific styles, we were curious about how the non-expressive speakers performed. From their samples page, we can hear that expressive speakers 1 and 2 produce a style more closely matching the characteristics of the reference compared to the non-expressive speakers. Interestingly, we thought speaker 3 had less expressive range than the other expressive speakers, and we thought LJ Speech had slightly more range than the non-expressive speakers. This suggests that, at least for this model, the capability to produce new styles is limited by the speaker’s data, which lines up with observations from our Ctrl-P model as well.

The final paper (also discussed above) provides a potential solution to this: collect a small amount of expressive speech for a single speaker and augment that speaker’s data using voice conversion (VC). Comini et al. apply this approach to low-resource expressive TTS and demonstrate results for several languages. Their F0-conditioned VC model is used to keep the target speaker’s prosody independent of the source speaker’s prosody, with the F0 values predicted by a model fine-tuned on the expressive data. In a MUSHRA test, their use of F0-conditioned VC significantly improves naturalness for 5 out of 9 speakers. Unfortunately, we can’t comment on how well this works for improving the conversational style, as no results were presented on the faithfulness of the synthesised style, and no baselines using non-augmented data were included.

Thanks to everyone who presented at Interspeech, it was great to attend the conference in person and we’re looking forward to the next one! Look out for us next time and come say hi. And if you’re interested in working with us take a look through our open roles!

Papercup Machine Learning Team
