Differences in Talker Embedding Extraction #90

rppravin · 2021-04-22T21:04:09Z

In the AutoVC paper, it seems that:

S2 is estimated from more than 20 seconds of speech for a given talker
Es(X1) is estimated from the speech segment that is input to the content encoder

Although both representations are estimated for the same talker, they are computed from different input speech. S2 is potentially based on a longer speech duration, so the two talker embeddings could differ.

However, in the codebase, S2 is reused in place of Es(X1). Any idea how much impact this has on the disentanglement of content and talker representations? Since Es(X1) could be based on a shorter speech duration, would it be useful to estimate it separately, so that the network learns to disentangle only the talker information appropriate for a given input speech segment?
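
To make the comparison concrete, here is a rough sketch of how one might measure the gap between the two embeddings. It is not the repository's code: `toy_speaker_encoder` is a hypothetical stand-in (mean-pooling plus L2 normalization), the mel-spectrogram is synthetic, and the frame counts are arbitrary placeholders for "long" and "short" inputs.

```python
import numpy as np


def toy_speaker_encoder(mel: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a speaker encoder.

    Mean-pools a (frames, n_mels) array over time and L2-normalizes the result.
    A real speaker encoder would be a trained network.
    """
    emb = mel.mean(axis=0)
    return emb / np.linalg.norm(emb)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
n_mels = 80          # assumed feature dimension
frames_long = 1600   # placeholder for the long recording behind S2
frames_short = 128   # placeholder for the segment fed to the content encoder

# Synthetic "talker": a fixed per-talker offset plus frame-level variation.
talker_offset = rng.normal(size=n_mels)
long_speech = talker_offset + rng.normal(size=(frames_long, n_mels))
segment = long_speech[:frames_short]  # X1 is a short crop of the same talker

s2 = toy_speaker_encoder(long_speech)   # embedding from the long recording
es_x1 = toy_speaker_encoder(segment)    # embedding from the short segment

similarity = cosine(s2, es_x1)
print(f"cosine(S2, Es(X1)) = {similarity:.4f}")
```

Swapping in the actual speaker encoder and real mel-spectrograms would show how far Es(X1) drifts from S2 as the segment gets shorter. A similarity close to 1 would suggest that reusing S2 changes little, while a clearly lower value would support estimating Es(X1) separately.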

Thanks,
Pravin
