
ADEPT: A dataset for evaluating prosody transfer

We published two papers at Interspeech this year (hooray 🎉). The first, Ctrl-P, provides interpretable, temporally precise, and disentangled control of prosodic features in TTS, and you can read more about it in its accompanying blog post here. This blog post covers our second paper, ADEPT, whose accompanying dataset can be found on Zenodo. For a step-by-step guide on how to perform an ADEPT evaluation, refer to our tutorial.

In the ADEPT paper, we propose a new system of evaluation for English prosody transfer in text-to-speech (TTS). By prosody transfer (PT) we mean the process of transferring the prosody from a natural reference speech sample onto synthesised speech, usually with the aim of making the synthesis more expressive. The aim of our proposal is twofold:

  1. We want to establish a benchmark of expected performance of PT models.
  2. We want to create a system by which PT models can be compared against each other.

But a bit of background first.

How has prosody transfer been evaluated in the past?

One common method for PT evaluation is the anchored prosody side-by-side (AXY) test, introduced in Skerry-Ryan et al. 2018, in which listeners hear a reference sample A and rate, on a 7-point scale, whether a baseline sample X or the model-generated sample Y is closer to it in prosody. We see this method in other works, including Battenberg et al. 2019 and Gururani et al. 2020. Though such a method can clearly show improvement over a baseline, it is susceptible to cherry-picked references. For example, if a model is particularly good at transferring duration, and all the reference samples are sentences in which grammatical pauses matter prosodically, this model will score highly in an AXY task, signifying that it is great at PT, even if it is poor at transferring F0.
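
For anyone who hasn't run one of these, here is a minimal sketch of how AXY ratings are commonly aggregated. It assumes the usual convention of scoring from -3 ("X is much closer to A") to +3 ("Y is much closer to A"); the exact labels vary between papers, so treat this as illustrative rather than the canonical protocol.

```python
from statistics import mean

def summarise_axy(ratings: list[int]) -> dict:
    """Summarise 7-point AXY ratings (-3..+3) for one reference sample A.

    A positive mean indicates listeners found the model output Y closer to
    the reference than the baseline X.
    """
    return {
        "mean_score": mean(ratings),
        "prefer_y": sum(r > 0 for r in ratings),
        "prefer_x": sum(r < 0 for r in ratings),
        "no_preference": sum(r == 0 for r in ratings),
    }

print(summarise_axy([2, 3, 1, 0, -1, 2, 3]))
```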

Karlapati et al. 2020 overcome this problem by using linguists to evaluate their samples on five aspects of speech related to prosody: rhythm, emphasis, syllable length, melody, and loudness. Though effective, this method is expensive; trained linguists are hard to come by, which results in fewer overall evaluators. Additionally, such a method might be hard to scale across languages other than English, if there are even fewer native-speaking linguists in the target language.

We’ve also seen some researchers use naturalness MOS scores as a proxy for successful transfer, including Sun et al. 2020. But naturalness and prosody are two different aspects of speech. For example, you can have high naturalness with limited prosody (e.g. a monotone assistant voice). Therefore we don’t believe a naturalness MOS is a sufficient metric for PT.

Perhaps my favourite subjective evaluation we’ve seen is the one proposed by Lee et al. 2019, who claim that successful transfer of a song demonstrates successful transfer of prosody. Though this is a compelling proposal (pitch is, after all, fundamental to a song’s melody), F0 is not the only acoustic cue to prosody, as we discuss in the next section.

Ultimately, such a variety of evaluation methods demonstrates a clear lack of consistency in what is meant by prosody in the context of prosody transfer. Such a lack of clarity encumbers progress in this field, because improvements are very hard to measure.

But not to worry! In the following section, we introduce in detail several prosodic phenomena that one may encounter in prosody transfer tasks, along with their perceivability. In subsequent sections, we explain how we recorded samples of these phenomena and used the natural recordings to determine appropriate evaluation designs based on that perceivability. Finally, we show how these evaluation designs can be used to evaluate PT models.

Previous research has told us a lot about prosody

In fact, linguistics has given us a good definition of prosody that we can build on for a more comprehensive evaluation system: the high-level structures that account for F0, duration, amplitude, spectral tilt, and segmental reduction patterns in speech (Shattuck-Hufnagel & Turk 1996). Using this definition, we can identify many different classes of speech with such a prosodic effect.

Speech classes with prosodic effect

For ADEPT, we found six: emotion, interpersonal attitude, propositional attitude, topical emphasis, syntactic phrasing, and marked tonicity.

Whilst emotion refers to a speaker’s inner state, attitude is towards something external. For example, a speaker can sound happy or sad (emotion), but they can also sound friendly or polite towards the listener (interpersonal attitude) or incredulous or surprised by what they are saying (propositional attitude). Previous literature has shown all three of these classes of speech to have prosodic effects on F0, amplitude, duration, and spectral tilt.

Topical emphasis refers to particular words in a sentence being emphasised as opposed to other words, for example NOT in the sentence “I will NOT go.” It has prosodic effects on F0, amplitude, and duration.

By syntactic phrasing, we mean grammatical pauses within a sentence, such as the pause after yesterday in the sentence “Yesterday, I went to the park.” This class has durational prosodic effect both on the length of pauses themselves, and the word preceding a pause.

And lastly, we use marked tonicity to refer to the phenomenon in English speech whereby one syllable always carries the greatest lexical stress across the sentence. Like topical emphasis, it has prosodic effects on F0, amplitude, and duration, but also on segmental reduction. If you are unfamiliar with segmental reduction, listen to the following sample, and notice how my pronunciation of the word to is ‘reduced’ to something very short.



If you’re familiar with IPA, note how the word is reduced from /tuː/ to /tə/.

Perceivable subcategories and interpretations

Okay, so we’ve identified all of these speech classes, but how is this helpful for evaluation? Well, each of these classes has previously been shown to have prosodically distinguishable subcategories or interpretations. For example, for topical emphasis, listen to the following samples:



In these three samples, the text is exactly the same: “Tian went to the office yesterday.” However, the topical emphasis varies: it lies either on the beginning content word Tian, the middle content word office, or the end content word yesterday. We therefore have three subcategories of topical emphasis (beginning, middle, and end) that are prosodically distinguishable: listeners can differentiate between them using prosody alone. This matters for prosody TTS research because if researchers can show that a model renders these prosodically ambiguous sentences such that listeners can perceive (and thus correctly classify) the intended subcategory, then they can claim that the model successfully produces that prosody class. So we had to find such subcategories for all the classes from previous literature:

  • emotion: anger, disgust, fear, sadness, and joy
  • interpersonal attitude: contemptuous questions, authoritative statements, and polite statements
  • propositional attitude: obviousness statements, surprise statements, sarcasm statements, doubt statements, incredulity questions, and confirmation questions
  • topical emphasis: beginning, middle, and end

With syntactic phrasing and marked tonicity, we couldn’t find such subcategories with distinct prosodic effects. Instead, we were able to identify sentences whose different interpretations resulted in different prosody. For example, consider the sentence “For my dinner I will have either pork or chicken and fries”, which has two interpretations: 1) I will have either pork, or chicken and fries; or 2) I will have either pork or chicken, and additionally, I will have fries.



The first interpretation yields a syntactic phrase boundary after pork, which is realised as an audible pause in the first sample. The second similarly yields a phrase boundary and audible pause after chicken.

How we can use this information for PT evaluation

For each of the six classes, we crafted sentences that were sufficiently ambiguous for all subcategories or interpretations within the class. We wrote 33 such sentences for topical emphasis before realising that was waaaaay more than we needed (also writing these sentences is hard!). So for the other five classes we only wrote 20 sentences each.

Recording

It was then time to record in the studio. To help our voice actors elicit the target prosody, we gave them contextual cues for each utterance. For example, for the sentence “The office is on the top floor”, which was to be elicited with disgust (a subcategory of emotion), we gave the additional context that there was no lift in the building.

For emotion, interpersonal attitude, propositional attitude, and topical emphasis, we also recorded each sentence in a neutral style, in which no subcategory was elicited. However, for syntactic phrasing and marked tonicity there was no such “neutral” style. For example, consider the sentence “For my dinner I will have either pork or chicken and fries” from above. One of the two interpretations is mandatory; it is not possible to have neither.

Therefore, per speaker we recorded:

  • (33 sentences) × (3 subcategories + neutral) = 132 topical emphasis utterances
  • (20 sentences) × (5 subcategories + neutral) = 120 emotion utterances
  • (20 sentences) × (3 subcategories + neutral) = 80 interpersonal attitude utterances
  • (20 sentences) × (6 subcategories + neutral) = 140 propositional attitude utterances
  • (20 sentences) × (2 interpretations) = 40 syntactic phrasing utterances
  • (20 sentences) × (2 interpretations) = 40 marked tonicity utterances

That’s a grand total of 1104 utterances, or 552 per speaker. Whew that’s a lot.
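
If you’d like to double-check that arithmetic, it works out like this in a few lines of Python (numbers taken straight from the list above):

```python
utterances_per_speaker = {
    "topical emphasis":       33 * (3 + 1),  # 3 subcategories + neutral
    "emotion":                20 * (5 + 1),  # 5 subcategories + neutral
    "interpersonal attitude": 20 * (3 + 1),
    "propositional attitude": 20 * (6 + 1),
    "syntactic phrasing":     20 * 2,        # 2 interpretations, no neutral
    "marked tonicity":        20 * 2,
}

per_speaker = sum(utterances_per_speaker.values())
print(per_speaker, per_speaker * 2)  # 552 1104 (two speakers)
```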

Question design

Next it was time to figure out what sort of questions to ask listeners in order to facilitate disambiguation between the subcategories or interpretations. Therefore, we designed pre-tests to determine usable question setups for each class. In designing these pre-tests, we had three considerations:

  1. Should the question be formulated with multiple stimuli (first example below) or a single stimulus (second example below)?

    In which sample is the word office most emphasised?
    a.
    b.
    c.

    Which word is most strongly emphasised in the sample?
    a. Tian
    b. office
    c. yesterday

  2. Should we ask about the speech class directly, or indirectly in context? For example, the two questions above ask directly about the emphasis class. But instead, we could think about the context that would elicit such prosody, and ask listeners to place the sample in the appropriate context. For example:


    Which question is best answered by the sample?
    a. WHO went to the office yesterday?
    b. Tian went WHERE yesterday?
    c. Tian went to the office WHEN?
  3. Should the neutral sample be included as a stimulus? Of course we couldn’t include it for syntactic phrasing and marked tonicity because it didn’t exist. But for the other four classes, including neutral could further substantiate our results.

Ultimately, we did not try every possible design for each class (and we got some flak for this from one of our Interspeech reviewers 😬). But in our defence, that would have been 8 different test setups to try for each of the four classes with neutral samples, and 4 setups each for the other two classes. In total, that’s 40 different listening pre-tests to trial (using all 1104 utterances) before actually running the real listening tests on the optimal design. Instead, we conducted pre-tests for each class, and if native listeners could disambiguate at least some of the subcategories or interpretations frequently enough, we accepted that design without exploring others.
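
For the counting-inclined, that 40 breaks down as follows:

```python
# Each pre-test design is a combination of three choices (see the list above).
stimuli  = 2  # multiple stimuli vs single stimulus
question = 2  # direct vs indirect question
neutral  = 2  # neutral sample included or not

designs_with_neutral_option = stimuli * question * neutral  # 8 per class
designs_without_neutral     = stimuli * question            # 4 per class

# Four classes have a neutral recording; syntactic phrasing and marked tonicity do not.
print(designs_with_neutral_option * 4 + designs_without_neutral * 2)  # 40
```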

You are of course welcome to perform your own pre-tests (with the samples we’ve provided or with sentences of your own)! We’d love to know if there are question setups that result in even better recognition accuracies than what we report.

Elimination of sentences and subcategories

Using the results of the pre-tests, we were able to identify the 5 sentences from each speaker/class pair with the highest recognition accuracy across all subcategories/interpretations. By this we mean the sentences whose utterances were classified as the correct subcategory or interpretation the most times.

Within these top-5 sentences for each speaker/class pair, if a subcategory was not recognised at least 60% of the time, we eliminated it from the final design. This does not mean that previous research was wrong in describing these subcategories as distinguishable. Rather, it means that they were not distinguishable in our recorded data. The following subcategories were eliminated in this way:

  • emotion: disgust
  • propositional attitude

    • female and male: obviousness statements, doubt statements, confirmation questions
    • female only: sarcasm statements
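
In case it is useful, here is a hypothetical sketch of the selection and cut-off logic described above. The data structure and function names are ours, and we apply the 60% cut-off to a subcategory’s accuracy averaged over the selected sentences, which is our reading of the procedure:

```python
def finalise_design(accuracy: dict[str, dict[str, float]],
                    n_keep: int = 5,
                    threshold: float = 0.6):
    """accuracy[sentence][subcategory] = recognition accuracy from the
    pre-test, for one speaker/class pair."""
    # Keep the n_keep sentences with the highest mean accuracy across subcategories.
    top_sentences = sorted(
        accuracy,
        key=lambda s: sum(accuracy[s].values()) / len(accuracy[s]),
        reverse=True,
    )[:n_keep]
    # Drop subcategories recognised less than `threshold` of the time
    # (averaged over the selected sentences).
    subcategories = next(iter(accuracy.values()))
    kept_subcategories = [
        sub for sub in subcategories
        if sum(accuracy[s][sub] for s in top_sentences) / len(top_sentences) >= threshold
    ]
    return top_sentences, kept_subcategories
```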

Final design for each class

The table below gives a summary of the final task design for each class/speaker pair, with a total of 6 classes × 2 speakers = 12 tasks in a full ADEPT evaluation. There is one question per sentence and subcategory/interpretation. For example, after the male propositional attitude pre-test, we were left with 3 subcategories (incredulity questions, sarcasm statements, surprise statements) × 5 sentences = 15 questions in the final design.

| Class | Number of questions (F/M) | Number of choices (F/M) | Audio stimuli | Neutral included | Direct or indirect question |
|---|---|---|---|---|---|
| emotion | 20 / 20 | 5 / 5 | multiple | yes | direct |
| interpersonal attitude | 15 / 15 | 4 / 4 | multiple | yes | direct |
| propositional attitude | 10 / 15 | 3 / 4 | multiple | yes | both |
| topical emphasis | 15 / 15 | 3 / 3 | single | no | indirect |
| syntactic phrasing | 10 / 10 | 2 / 2 | single | – | indirect |
| marked tonicity | 10 / 10 | 2 / 2 | single | – | indirect |

The number of choices per question is equal to the number of subcategories for the speaker/class pair (or number of interpretations per sentence), plus the neutral sample if it was included. We did not include neutral for topical emphasis because we ultimately went with a single stimulus question design, which would have meant questions whose single (neutral) sample had no correct answer.
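
To sanity-check the table, the question and choice counts per task fall directly out of the surviving subcategories (5 sentences per speaker/class pair, one question per sentence and subcategory/interpretation):

```python
def task_size(n_subcategories: int, neutral_included: bool, n_sentences: int = 5):
    questions = n_subcategories * n_sentences
    choices = n_subcategories + (1 if neutral_included else 0)
    return questions, choices

print(task_size(4, True))   # emotion, disgust eliminated:                    (20, 5)
print(task_size(3, True))   # interpersonal attitude, male propositional att.: (15, 4)
print(task_size(2, True))   # female propositional attitude:                  (10, 3)
print(task_size(3, False))  # topical emphasis:                               (15, 3)
print(task_size(2, False))  # syntactic phrasing, marked tonicity:            (10, 2)
```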

Natural benchmark

Okay, so finally we’re addressing one of the problems we introduced at the beginning: the fact that there is no established benchmark of expected performance for PT models. In the table below, we show the recognition accuracies (rounded to the nearest whole percent) from two tasks: the female emotion task and the male emotion task. By recognition accuracy, we mean the percentage of times a certain utterance was classified as its correct subcategory or interpretation. Each task was performed by 30 paid (self-proclaimed) native English speakers on MTurk. We verified their English ability with a transcription task.

| Class | Subcategory | Female samples | Male samples |
|---|---|---|---|
| emotion | anger | 95% | 83% |
| emotion | fear | 80% | 52% |
| emotion | joy | 90% | 88% |
| emotion | sadness | 88% | 62% |

Because there were 5 choices per question, chance performance was 20%. So all these recognition accuracies are statistically significantly above chance (one-tailed binomial test; p ≤ 0.05).
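
If you want to run the same check on your own results, scipy’s one-sample binomial test does the job. The trial count below is illustrative; the real number depends on how many listener responses each subcategory actually received:

```python
from scipy.stats import binomtest

n_responses = 150                      # illustrative: e.g. 5 questions x 30 listeners
n_correct = round(0.80 * n_responses)  # e.g. an observed 80% recognition accuracy

# One-tailed test against the 20% chance level of a 5-choice question.
result = binomtest(n_correct, n_responses, p=0.20, alternative="greater")
print(result.pvalue)  # far below 0.05, so significantly above chance
```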

These values then serve as our proposed emotion benchmark. For example, a PT model that perfectly transfers female anger from our samples should show a recognition accuracy of about 95% if the same 20-question female emotion task is performed. You can find the natural benchmark for all classes in Table 3 of the paper.

Comparing performance of PT models

Hopefully it is now clear how you can also use this corpus and our proposed evaluation framework to compare the performance of PT models. You simply use our samples as the PT reference samples (don’t let the model see them during training!) and then perform whichever tasks are relevant for your use case. A higher recognition accuracy implies that the model is better at transferring a particular subcategory or class (at least on our corpus). We demonstrate such a comparison in section 4 of the paper.
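
If it helps, here is a hypothetical sketch of turning raw listener responses for one model into per-subcategory recognition accuracies that you can line up against the natural benchmark. The data structures are ours, not part of the released corpus:

```python
from collections import defaultdict

def recognition_accuracy(responses: dict[str, list[str]],
                         answer_key: dict[str, str]) -> dict[str, float]:
    """responses[question_id] = subcategories chosen by each listener;
    answer_key[question_id] = the correct subcategory for that question."""
    correct, total = defaultdict(int), defaultdict(int)
    for question_id, choices in responses.items():
        target = answer_key[question_id]
        total[target] += len(choices)
        correct[target] += sum(choice == target for choice in choices)
    return {sub: correct[sub] / total[sub] for sub in total}

# Example with made-up responses for two anger questions:
print(recognition_accuracy(
    {"q1": ["anger", "anger", "fear"], "q2": ["anger", "joy", "anger"]},
    {"q1": "anger", "q2": "anger"},
))  # {'anger': 0.666...}
```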

For example, in Table 3 you can see that we tested our Ctrl-P model and found it was able to produce a contemptuous interpersonal attitude at least as well as our female voice actor (~50%), whereas the Tacotron-Ref model performed more poorly (29%). For the male voice, however, Ctrl-P was only successful about 35% of the time, compared to 30% for Tacotron-Ref and 52% for the voice actor.

You can also use these results to compare your model to itself across different classes of prosody. For example, our results show that the Tacotron-Ref model was much better at transferring female topical emphasis (all recognition accuracies were statistically significant) than female interpersonal attitudes (no recognition accuracies were statistically significant). Conclusions like this let researchers see specifically where their model could be improved.

Further work

Okay this is all fine and dandy but the astute Papercup enthusiast may have noticed that the ADEPT paper only covers monolingual PT in English, which isn’t particularly useful for us because we do machine dubbing from English into other languages. As Americans say when impersonating the British: righty-ho!

To tackle this, we’d next like to validate whether these classes and subcategories transfer into Spanish (and other languages). Are the sentences equally ambiguous, and can all the prosodic subcategories be elicited? If so, this could be a really valuable evaluation framework for us going forward.

Final notes

As I could not find a satisfactory photo relevant to evaluation, prosody, or speech, please enjoy the above stock photo of a puppy and kitten instead.

Alexandra Torresquintero

Data Engineer at Papercup. Enjoys the IPA and the occasional IPA.