Spotlight on speech synthesis – interview with Alex Perez

January 11, 2023
Yota Georgakopoulou

Speech synthesis has been in the limelight for the past couple of years, with many voice AI startups entering the market and use cases of the technology even making it into blockbuster releases. Fans marveled at young Luke Skywalker’s voice in Disney’s The Mandalorian and Val Kilmer’s voice in Paramount’s Top Gun: Maverick, both of which were recreated from earlier recordings. Although the technology has been around for over half a century, only recent advancements in machine learning and deep neural networks have made it possible to generate lifelike synthetic speech at such high levels of accuracy, completely transforming the field and expanding the applications of the software. We discuss the current state of the technology with Dr. Alex Perez, lead scientist for speech synthesis at AppTek.


Q: Tell us a bit about yourself – where did you grow up, what did you study?

A: I was born 36 years ago in València, Spain, where I still live. València is the third biggest city in Spain, with a warm and sunny climate for most of the year, all kinds of amenities and a rich cultural scene, yet it is not as crowded as bigger cities like Madrid or Barcelona, so I was never tempted to move.

Dr. Alex Perez

I discovered my passion for technology and computers at a very young age, even though personal computers were not so common back then. Like most boys, I was a keen video game player, but more than that, I found computers fascinating and wanted to learn more about them. At the age of 13 I started learning BASIC programming and web design on my brand-new Intel Pentium 200 MHz machine that shipped with Windows 95. I quickly became the computer kid in the neighborhood whom you would call to fix your computer, so studying anything other than computer science was never in question for me.

I ended up studying at the Universitat Politècnica de València and specialized in Machine Learning and AI in my senior year. It was very exciting for me to see how machines learned from data, so I decided to pursue this further. I started by working on the application of automatic speech recognition (ASR) and machine translation (MT) in online learning and education via automatic subtitling, and I spent a few years working on these technologies before moving on to neural text-to-speech (TTS) for my PhD thesis in the context of speech-to-speech translation.

Q: Why did you switch to speech synthesis for your PhD?

A: I was always intrigued by speech synthesis, so when such a position became available in our research group, I went for it. At the time it was not very common to work in text-to-speech, but to me it was more satisfying and impressive to make a computer speak and sound like a human than to make it transcribe dialogue.

My decision of course also had to do with my background in music. My father is a musician, so music has always been part of my life. I've also been playing drums for more than 15 years in various local rock bands. Because of that, I usually find myself tapping out rhythms all day long. I also spent some time learning about music production, audio processing, mixing and mastering, which now comes in quite handy when it comes to speech signal processing.

Q: Has there been a turning point in your life as a scientist so far?

A: Switching to speech synthesis was definitely a turning point for me, as I had no prior knowledge of the technology when I started working on it. I had already discovered machine learning during my bachelor years, and I joined the Machine Learning and Language Processing research group of Universitat Politècnica de València in 2012, just before the deep learning era, where I initially worked on the application of ASR and MT to educational videos.

The turning point for me came in 2017, when the main text-to-speech research scientist in our group decided to leave his position, which I then applied for. I was already very attracted to the idea of producing human-like synthetic speech, partly because of its links to music production, but mainly as the final step in building cascaded speech-to-speech translation systems that concatenate ASR, MT and TTS technologies.
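For readers less familiar with the cascaded approach, the sketch below shows how the three components chain together: the ASR transcript feeds the MT system, whose translation feeds the TTS model. The component classes and their interfaces are hypothetical placeholders for illustration, not AppTek’s actual implementation.

```python
# Minimal sketch of a cascaded speech-to-speech translation pipeline:
# ASR -> MT -> TTS. All component classes are hypothetical stand-ins.
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    language: str


class SpeechRecognizer:
    def transcribe(self, audio: bytes, language: str) -> Utterance:
        # A real system would run an ASR model over the audio here.
        raise NotImplementedError


class Translator:
    def translate(self, utterance: Utterance, target_language: str) -> Utterance:
        # A real system would run a neural MT model here.
        raise NotImplementedError


class Synthesizer:
    def synthesize(self, utterance: Utterance) -> bytes:
        # A real system would run a neural TTS model and return audio here.
        raise NotImplementedError


def speech_to_speech(audio: bytes, src_lang: str, tgt_lang: str,
                     asr: SpeechRecognizer, mt: Translator, tts: Synthesizer) -> bytes:
    """Cascade the three components: transcribe, translate, then synthesize."""
    transcript = asr.transcribe(audio, src_lang)
    translation = mt.translate(transcript, tgt_lang)
    return tts.synthesize(translation)
```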

This is when I started learning about speech synthesis, which eventually became the topic of my PhD thesis. The timing was great, as the field was moving from the traditional concatenative and statistical parametric approaches to pure deep learning-based models and the quality was improving a lot, which was very exciting and motivating for me.

Q: Why did you choose to work at AppTek?

A: I met AppTek’s Director of Science, Prof. Hermann Ney, in 2012 while working as a researcher for the EU-funded project transLectures, which focused on the development and application of ASR and MT technologies for the multilingual subtitling of large video-lecture repositories. He is without doubt one of the most renowned scientists in the field, and someone I admire both personally and professionally. Luckily for me, he has kept a close connection with the research group in València, and this is how I was introduced to AppTek when a position became available for a speech synthesis scientist.

It was also very important to me that AppTek is a world-leading company in language technologies, but as it is relatively small, it is also very flexible in terms of product design and technology research lines. I particularly like the fact that it is closer to its end customers and use-case scenarios than larger companies can be. Whilst I enjoy pure research, I find it more inspiring to build real-world services and applications that can be used and enjoyed by hundreds, thousands or even millions of people around the world.

Q: What is your assessment of the current state of TTS technology?

A: For some years now the quality and naturalness of synthetic speech has been shown to be on par with human speech under specific scenarios (e.g. neutral, reading-style, emotionless text-to-speech), though there is still a gap when it comes to more difficult settings, such as conversational spontaneous speech, voice acting, etc. This is not only because of the increased modelling challenge of such tasks, but also because it is difficult and expensive to acquire high-quality data to train such models.

Historically, the lack of naturalness in synthetic voices had greatly constrained the potential applications of TTS technology in the media and entertainment industry. I believe it has now reached a point where we will start seeing the adoption of TTS to reduce costs and improve scalability in content creation and in media localization.

Q: Other than data, what other challenges are there? What is the scientific community working on?

A: As in most machine learning tasks, there is ongoing effort invested in designing increasingly better deep learning architectures and algorithms to continuously improve the quality and naturalness of TTS models. Also, there are some advanced topics in text-to-speech that still pose a challenge in the field, such as cross-lingual voice cloning, zero-shot speaker adaptation, cross-speaker style transfer or incremental (streaming) text-to-speech.

Zero-shot speaker adaptation refers to synthesizing the voice of a new speaker without using any of that speaker’s data to train the model, by just “listening” to a reference audio clip of no more than a couple of seconds. Cross-lingual voice cloning refers to the task of synthesizing speech in the voice of a speaker who has never spoken that language before. Both speaker adaptation and cross-lingual voice cloning are already incorporated in our automatic dubbing pipeline at AppTek: we call this adaptive TTS.
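To make that more concrete, here is a minimal Python sketch of the interface described above: a reference clip of a couple of seconds is mapped to a fixed-size speaker embedding, which then conditions synthesis in the target language. Both functions are hypothetical placeholders, not AppTek’s adaptive TTS code.

```python
# Illustrative sketch of zero-shot speaker adaptation: a short reference clip
# is encoded into a speaker embedding that conditions the TTS model.
# No speaker-specific training or fine-tuning is involved.
import numpy as np


def speaker_embedding(reference_audio: np.ndarray, dim: int = 256) -> np.ndarray:
    """Encode a few seconds of reference audio into a fixed-size embedding.

    A real system would use a trained speaker encoder; this placeholder just
    returns a deterministic dummy vector of the right shape.
    """
    seed = abs(hash(reference_audio.tobytes())) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)


def synthesize(text: str, language: str, spk_emb: np.ndarray) -> np.ndarray:
    """Synthesize speech in the given language, conditioned on the embedding.

    Cross-lingual voice cloning uses the same interface: the reference clip
    may be in a language the target speaker has never spoken.
    """
    # Placeholder: a real TTS model would generate a waveform here.
    return np.zeros(16000)  # one second of silence at 16 kHz


# Usage: only a ~2-second reference clip is needed, no training data.
reference_clip = np.random.randn(2 * 16000)   # ~2 s of audio at 16 kHz
emb = speaker_embedding(reference_clip)
audio_out = synthesize("Hola, ¿qué tal?", "es", emb)
```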

Q: There is a lot of buzz in the market about voice conversion. Could you briefly explain what the difference is to adaptive TTS?

A: Voice conversion aims to change the speaker identity of a given utterance from one speaker to another, while keeping the linguistic content unchanged. While in TTS the machine is provided with text to synthesize new speech, in voice conversion the task involves going from a speech signal to another speech signal. As a result, many other attributes are readily available to be transferred to the target speaker utterance during voice conversion: the original speaker’s speaking style, intonation, stress and pacing.

With adaptive TTS, our goal is to synthesize speech from text in the voice of a new speaker, unseen during model training. To that end, the system is provided with the text and a reference utterance from this new speaker of no more than a couple of seconds. In this case, the linguistic content of the reference utterance is not constrained to match the input text, so the intonation, stress and pacing cannot be replicated as easily from one utterance to another. From a localization point of view, voice conversion involves a human in the loop, whose speech gets converted to sound like another speaker, whereas in adaptive TTS the process is fully automatic.
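The difference is perhaps easiest to see in the inputs each task takes. The sketch below contrasts the two interfaces; both functions are hypothetical placeholders used purely to illustrate the distinction drawn above.

```python
# Contrast of the two tasks: voice conversion is speech-to-speech,
# adaptive TTS is text-to-speech. Both bodies are placeholders.
import numpy as np


def voice_conversion(source_speech: np.ndarray,
                     target_reference: np.ndarray) -> np.ndarray:
    """Keep the source speaker's words, intonation, stress and pacing,
    but make the utterance sound like the target speaker."""
    return np.zeros_like(source_speech)  # placeholder output


def adaptive_tts(text: str, target_reference: np.ndarray,
                 sample_rate: int = 16000) -> np.ndarray:
    """Synthesize the text in the voice of an unseen speaker: the reference
    clip only supplies the voice identity; prosody is generated from the text."""
    return np.zeros(sample_rate)  # placeholder: one second of audio


# Voice conversion needs a human recording of the line to be converted;
# adaptive TTS starts from text plus a short reference clip and is fully automatic.
```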

Q: What do you think of the future? What will the next milestone of the technology be?

A: I think the technology will keep getting better over the years and will rapidly reach levels of naturalness that are even harder to distinguish from human speech. Leveraging large, high-quality speech datasets, which have not been available until now, will undoubtedly improve naturalness even further.

I expect we will start seeing speech synthesis used more broadly, e.g. in TV news, podcasts and so on. It can already play a role in such use cases at the levels of quality we are reaching today. In the past year, we have seen films in which speech synthesis has been used to reproduce the voice of specific actors. I think this will become more common in dubbing, as the TTS will be able to mimic the voice characteristics of the original actors. Natural-sounding emotional acting voices are not perfect yet, but I believe we will get there soon. When we reach such levels of naturalness also in emotional speech, I have no doubt we will see TTS used widely also in films, cartoons etc. I guess you could call this the next milestone.

Q: How soon do you think it will be before we see TTS used widely in films?

A: The adoption of any technology typically takes a long time. I expect the tech itself will get there pretty fast, but reaching widespread use will take a while: people first need to be convinced to try it for a specific use case, build a proof of concept that shows it really works, integrate the technology into production and then use it at scale. If we are talking about when we will see TTS used in many films, I expect this will happen within the next 5-10 years. There are much easier use cases than films, of course, such as documentaries, where I expect we will see TTS used in bulk a lot sooner.

Q: What are you working on right now? What are you rolling out next at AppTek?

A: At AppTek we are not building TTS models only to offer standalone TTS services, but also to integrate them into the company’s dubbing pipeline. As such, we need the TTS models to be able to adapt to unseen speaker voices, and we are also working on emotional speech synthesis, in order to reproduce different emotions synthetically for the dubbing of TV shows and movies.

Within the dubbing team we are also working on deploying the dubbing service. The machine learning models are already available, and we are in the process of deploying the service via an API as well as integrating it into the company’s workbench, so that end clients can customize the output to their liking.

Developing TTS models that can be used for the automatic dubbing of movies, TV shows, news or documentaries poses additional challenges to standard TTS.

Zero-shot speaker adaptation, where we build models that mimic the voice characteristics of the original (source language) speaker in the target language, is one area of focus. Cross-lingual emotion transfer is another: we are trying to build models able to reproduce the emotion perceived in the original utterance (e.g. sadness, excitement, etc.) into the translated synthetic utterance.

In the dubbing use case, the translated utterance should mostly follow the timings of the original utterance. This poses a challenge not only for the TTS but also for the MT, for which we are building length-controllable MT systems.
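As a rough illustration of that timing constraint, the sketch below checks whether a synthetic take fits the original time slot and decides between speeding up the TTS and requesting a shorter translation from a length-controllable MT system. The 10% tolerance and the 1.3 rate ceiling are illustrative assumptions, not AppTek’s actual parameters.

```python
# Illustrative isochrony check for dubbing: the dubbed line should roughly
# fit the original utterance's time slot. Thresholds are assumptions.
def fits_slot(synth_duration: float, slot_duration: float,
              tolerance: float = 0.10) -> bool:
    """True if the synthesized audio fits the original slot within tolerance."""
    return synth_duration <= slot_duration * (1.0 + tolerance)


def required_rate(synth_duration: float, slot_duration: float) -> float:
    """Speaking-rate factor (>1 means speak faster) needed to fit the slot."""
    return synth_duration / slot_duration


# Example: the original line lasts 2.4 s, the first synthetic take lasts 2.9 s.
slot, take = 2.4, 2.9
if not fits_slot(take, slot):
    rate = required_rate(take, slot)   # ~1.21: speak about 21% faster
    if rate > 1.3:
        # Excessive time-stretching hurts naturalness; better to ask the
        # length-controllable MT system for a shorter translation instead.
        print("request a shorter translation")
    else:
        print(f"adjust TTS speaking rate by a factor of {rate:.2f}")
```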

While we find that automatic dubbing can provide good results out of the box under certain settings, we believe post-editing will be required in most cases to achieve high-quality output in the short term. Thus, not only should the (text) translations be post-edited, but the TTS prosody, pitch, speaking style and emotion should also be controllable for best results. To give our clients the ability to do just that and modify the system output to their liking, we are integrating the dubbing service into the company’s workbench as well, a platform clients can use to post-edit the system output if they so wish.

Q: What are you looking forward to?

A: I very much look forward to hearing a complete film dubbed into a different language fully automatically, listening to it and complimenting the system on a job well done. That is my goal here at AppTek, so watch this space!
