Spotlight on M&E: Interview with Dr. Volker Steinbiss, Part 2

February 9, 2022
Yota Georgakopoulou

This is the second part of a two-part interview with Dr. Volker Steinbiss, Managing Director, AppTek GmbH, where the bulk of the company’s scientific team is located. We discuss the company’s focus on the media and entertainment vertical, the challenges it seeks to address and the innovative solutions that AppTek brings to the table across a range of technologies: ASR, MT, TTS and S2S. Read the first part of the interview here.



Q: What are the challenges of the media sector in your opinion, and what are the innovative solutions that AppTek can bring to the table to overcome them?

A: Media is a growing sector, as we discussed, and there is increasingly more content to localize, so people are turning to technology to do that faster. The challenging part of the work is to bridge the commercial need and the technology offering in a meaningful and smart way. While this is all about doing business and creating customer value, what fascinates me most is that you can only be successful by understanding the customer needs (to some extent at least) and then translating them into well-understood technical challenges that can be solved with the technology that exists today. We can solve them principally with our language technology and machine learning expertise, implement the solution (already having the next fifty or so languages in mind) in a massively scalable way, and sell the service for a price that won't be undercut by one of the tech giants. Frankly, getting this done makes my day!

Dr Volker Steinbiss

This is what we do at AppTek. We help companies understand where a dent in the technology would have a significant impact on the business, or on the lives of the professionals who do the actual work – the subtitlers in this case. It is about developing more user-friendly AI. My favorite example here is Intelligent Line Segmentation (ILS). If you know the technology inside out, you can solve different issues with it – the idea for us was to achieve tangible results for our clients, e.g. use our knowledge in machine learning to teach the machine how to format text in subtitles. What we did is let the machine view a lot of nicely made subtitles created by professionals and learn the patterns and the nuances in them. Our ILS system develops a ‘feel’ for appropriate text segmentation that compares well with the decisions made by professionals, as was shown by independent evaluations we commissioned.
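To make the idea concrete, here is a minimal sketch of line segmentation framed as break-point scoring. This is not AppTek's actual ILS model – a trained system would learn its preferences from professionally segmented subtitles, whereas the hand-set penalties and the 42-character limit below are illustrative assumptions:

```python
# Toy sketch: choose the best two-line split of a subtitle.
# A learned model would replace the hand-crafted cost function below.

MAX_CHARS = 42  # a common per-line character limit in subtitling

# Toy penalty list: breaking right after these words reads badly,
# because they bind syntactically to what follows.
BAD_BREAK_AFTER = {"a", "an", "the", "of", "to", "in", "and", "or"}

def segment(text, max_chars=MAX_CHARS):
    """Split text into two lines, picking the break point with the lowest cost."""
    words = text.split()
    best, best_cost = None, float("inf")
    for i in range(1, len(words)):
        line1 = " ".join(words[:i])
        line2 = " ".join(words[i:])
        if len(line1) > max_chars or len(line2) > max_chars:
            continue  # hard constraint: both lines must fit
        cost = abs(len(line1) - len(line2))    # prefer balanced lines
        if words[i - 1].lower() in BAD_BREAK_AFTER:
            cost += 100                        # discourage ugly syntactic breaks
        if cost < best_cost:
            best, best_cost = (line1, line2), cost
    return best

print(segment("He walked to the edge of the cliff and looked down at the sea"))
# → ('He walked to the edge of the cliff', 'and looked down at the sea')
```

Note how the balance term alone would split one word earlier, after "the", but the syntactic penalty pushes the break to after "cliff" – the kind of trade-off a trained ILS model makes implicitly from data.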

My other favorite example from the media sector is the metadata-informed machine translation systems that our team is working on. In domains like patent translation, where there is not much variation in the text, any customized machine translation system would perform well by analyzing larger chunks of text, i.e. going beyond the sentence level. For multimodal content such as subtitling, however, there is a lot of information that is not apparent in a phrase (and neighboring phrases) but still affects translation choices, such as the gender of the speaker. This is a common issue in languages such as Russian or Greek among many others. In subtitling, it can be hard for an MT system to gather such information, which might be hidden in the video or contextual to the piece. But if such metadata is provided, MT can do a better job. So there needs to be an easy way to provide this metadata to the MT system, either automatically or manually by using some button in the UI, which would simplify the work of a post-editor.
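One published way to feed such metadata to an MT system is to prepend a pseudo-token to the source sentence that the model was trained to consume. The tag names and the interface below are assumptions for illustration, not AppTek's implementation:

```python
# Illustrative sketch: passing speaker metadata to an MT system as a
# pseudo-token prepended to the source text. A production system may
# expose metadata through a different interface entirely.

def tag_source(sentence, speaker_gender=None):
    """Prepend a metadata pseudo-token if speaker gender is known."""
    if speaker_gender in ("male", "female"):
        return f"<gender:{speaker_gender}> {sentence}"
    return sentence  # no metadata available: fall back to plain input

# English -> Greek or Russian: "I am tired" inflects with the speaker's
# gender, so the tag lets the model choose the right adjective form.
print(tag_source("I am tired.", "female"))  # <gender:female> I am tired.
print(tag_source("I am tired."))            # I am tired.
```

In a UI, the "some button" the speaker mentions would simply set the `speaker_gender` argument, either from automatic speaker analysis or from a post-editor's click.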

Q: There were a lot of stories in the press about the misuse of technologies like MT in the media and entertainment sector and how they can lead to impoverished translations, as the Spanish union of translators pointed out with respect to the Spanish subtitles for the Netflix hit Squid Game. What is your answer to the translators who are arguing that MT should never be used in entertainment?

A: Never say never! You used the term "misuse" already and that says it all, right? My recommendation is to introduce new technologies wisely – there are many ways to go about it wrongly, so you’d better think twice before acting and make sure to involve all stakeholders. One bad idea is the all-or-nothing attitude. Like elsewhere in life, this doesn't match reality. There is content that lends itself excellently to automatic processing and there is content that is difficult, in many dimensions. Before you dive in deep, get your toes wet and try swimming where the water is still shallow.

The challenging part, as I mentioned before, is to bridge commercial need and technology in a meaningful and smart way. This must happen on both ends! It is clear that the technical people must understand the world of the customer. In media and entertainment there are speaker changes, line breaks and subtitle breaks, character limitations, and all sorts of standards that make a difference with respect to subtitle quality. This is why AppTek came up with ILS – as a reaction to a customer need that we identified.

The language technology user must also understand how to use technology in a beneficial way and where to stop. Pretending MT works perfectly where it doesn't makes no business sense, and neither does pretending every media snippet is a masterpiece of fine art that must be treated by the most skilled professionals on the planet. Understanding both the limits and the huge potential of technology is very important. From there, you experiment, you measure, you start with the low-hanging fruit and cautiously go ahead, learning on the way. "Paths are made by walking", said Franz Kafka.

The stakeholders involved have been focusing on their own point of view until now. The technologists make a call for data, scripts if possible, to build domain-specific systems that will boost the quality of automated same-language subtitles. Language service providers might be skeptical when they see all the scientific effort going toward improving transcription quality while nobody seems to care about speaker diarization. Once the two groups begin to interact more with each other, things are bound to improve further.

There is content that fits the technical possibilities of MT well and other content that doesn't. On one side of the spectrum we may have a telenovela, with hundreds of episodes and quite repetitive language; on the other, something humorous with a lot of puns and a witty script that is hard to translate. Or, in the case of ASR, a noisy recording with a lot of music and people with strong accents talking over one another will yield poor transcription results, whereas a video with a single talking head speaking in a neutral accent, at a good pace and with good sound quality will be close to perfect. For TTS, emotional speech is more challenging than the well-paced, neutral voice of a documentary.

Content profiling is the answer. Smart people pick the right content to use technology on and learn the ropes to understand how to improve. They train their resources on this type of content, and then they take the next steps. They start with the easy, low-hanging fruit, and then they progress, and in the end maybe they only process 80 or 90% of the content with technology and leave the remaining 10 or 20% for special treatment. It is the responsibility of technical folks to be clear about the range that the machine can achieve, so as to manage expectations, but also of the user to understand what this means in practice.

Q: What are your teams at AppTek working on right now? What are you planning on rolling out next?

A: We are really excited about putting all the pieces together to have an integrated dubbing and voiceover pipeline. Our working environment at AppTek is seamless: all the teams that in other companies would be separate sit under one roof and interact with one another on a daily basis.

Most of our effort goes into maintaining our current favorable competitive position on the technology front, which is the basis of everything we offer as a company. We compete against the most powerful corporations in the world, so falling behind is not an option. We just rolled out our latest English ASR system, with impressive performance gains. We next plan to roll out metadata-informed MT, where context information beyond the text, such as speaker gender, is used in the translation process.

On top of this, we are very excited to have built a pipeline that spans several technologies: automatic revoicing, the so-called Speech-to-Speech (S2S) translation – the full pipeline from, say, the German Tagesschau evening news to a US English voiceover with the respective voices, as you can see in this demo. After decades of work in the field I am quite used to futuristic technology, but this really blew my mind. I believe this is the next big thing in the field – beyond the business aspect, it's about connecting people and cultures.

Q: Any last word of advice or something else you would like to add?

A: There is a widespread perception that technology simply pops up and the customer picks it off the shelf and uses it as is, and it's either good or bad, and you either thrive or you suffer. Such thinking is a waste of opportunities. First, the technology can be good on some material and bad on other material, so the first thing to do is choose wisely – see above.

It is also important not to let external forces (e.g. research publicly funded with your tax money) be the only drivers of technology development, but to state your needs and create business opportunities for agile high-tech companies, taking the next steps and exploring together. As an industry, I believe media and entertainment needs to develop quality metrics to measure performance and best practices for the use of language technology, and ideally its stakeholders should join forces in providing the bulk of data that machine learning (AI) needs to do a better job.

Regarding the users of the technology, it is important for them to embrace it in a way that serves them and that drives development further and faster as a result of their expert counsel. This means constantly challenging the state of the technology so that it better fits their needs. The way to do this is by participating in evaluations, testing and proofs of concept, and by providing valuable feedback and data. The entire industry would benefit from this.
