We don’t see many solutions around human language technology such as speech recognition and audio fingerprinting in this market. Could you shed some light on this topic?
Sure. Human language is a vast and complex subject, and the technologies being researched and developed to understand it are just as complicated and constantly evolving. We have been working on this technology for the last 24 years. To simplify, one area within it is speech recognition. In telephony, it is used primarily for interactive voice response (IVR) systems, telephony intercepts and so on. Then there is speech recognition for dictation, which is used in several vertical markets such as healthcare and government projects. Then there are embedded speech recognition apps like Siri, where you talk to your car, radio and so on.
Within broadcast, what you are primarily looking at is machine translation (MT) for TV, where you recognise the language in which the programme is broadcast, translate that audio and also convert it to text.
How does machine translation work?
MT is primarily about translating from the source language to the target language. Historically, experts have followed the lexical, or rule-based, approach, where the technology analyses the language based on semantics, syntax, lexicons and morphology, and then translates it.
In the late 1990s, we started using the statistical approach, where we fed the source and target languages into the computer and trained it to translate intelligently. Google uses this method but it’s not foolproof.
We now have a patent for a hybrid method of machine translation, which uses both the rule-based system as well as statistical translation. It improves the translation substantially and breaks it into categories providing more information and greater fluency. Hybrid technology offers you the best of both worlds.
Where have you used your technology so far?
We have implemented this in several areas but mainly for government use.
We are now looking to commercialise this for media entities. There are multiple ways in which this can be used by broadcasters.
Media monitoring is one of them. The capability to automatically ingest tens of thousands of channels and use automatic speech recognition (ASR) to generate the text in the source language and then translate that to the target language opens up huge possibilities.
Once you have the source language, the translated language and the transcript for, say, a news story or a speech, you can load that into a search engine and it becomes searchable. So, for instance, if you type in Obama, it will show you every programme and every channel on which that name was uttered, down to the second when it was said, whether it was on a Chinese channel, an Arabic channel or an English channel. That kind of media monitoring can be extended across languages and multiple platforms including TV, web, social media, newspapers and magazines.
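The search described here can be pictured as a timestamped inverted index. The Python sketch below is illustrative only and is not AppTek’s implementation: the channel names, the sample text and the functions `build_index` and `search` are hypothetical, and it assumes the transcripts have already been translated into one language and split into timed segments.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    channel: str
    language: str  # source language of the broadcast
    second: int    # offset from the start of the programme, in seconds


def build_index(segments):
    """Map each lowercased word to the places where it was spoken.

    `segments` is an iterable of (channel, language, second, text) tuples,
    where `text` is the translated transcript for that moment.
    Punctuation handling is deliberately left out of this sketch.
    """
    index = defaultdict(list)
    for channel, language, second, text in segments:
        for word in set(text.lower().split()):
            index[word].append(Hit(channel, language, second))
    return index


def search(index, term):
    """Return every hit for `term`, ordered by channel and then by time."""
    hits = index.get(term.lower(), [])
    return sorted(hits, key=lambda h: (h.channel, h.second))


# Hypothetical data: English translations of ASR output from three channels.
segments = [
    ("Channel A", "zh", 12, "Obama arrives in Beijing"),
    ("Channel B", "ar", 47, "the president met Obama today"),
    ("Channel C", "en", 5, "weather update for the weekend"),
]
index = build_index(segments)
hits = search(index, "Obama")
for hit in hits:
    print(hit.channel, hit.language, hit.second)
```

Because each entry keeps the channel and the offset in seconds, one query returns every mention across languages together with its position in the broadcast.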
The second way in which this technology can be used is for closed captioning or subtitling. How do you create closed captions automatically as an anchor speaks or a broadcast is going on? Our technology is able to generate and type the text within five to six seconds of the speaker uttering those words. This is speaker-independent. The accuracy of that text at the moment is more than 85%. If we have the chance to adapt the system and train it a little on the profile of the speaker, we will be able to reach 95% accuracy for that speaker.
We did this with the Al Arabiya channel, where we created automatic closed captioning, which you can also view on their web site www.alarabiya.net.
If you watch the Arabic programmes, you can see auto-generated English subtitles, and there is a transcript as well. This is an evolving technology; captioning has been used in TV for a long time. But in recent years, it has become more technology-based and, therefore, there is greater consistency and the turnaround time is faster. We are able to finish a news bulletin in 15 minutes now.
People say they want to access their content anywhere and whenever they want it. Actually, if they receive content anywhere, anytime along with the ability to hear it in a language that they understand, that opens up a lot of possibilities for broadcasters to reach new markets.
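The interview does not say how these accuracy figures are measured. One common convention, assumed here purely for illustration, is word accuracy: one minus the word error rate, where the error rate is the word-level edit distance between a human reference transcript and the system output, divided by the length of the reference. The function names and sample sentences in this Python sketch are hypothetical.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # substitution, or a match if cost is 0
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference_text, hypothesis_text):
    """Return 1 - WER for a caption measured against a reference transcript."""
    reference = reference_text.lower().split()
    hypothesis = hypothesis_text.lower().split()
    if not reference:
        raise ValueError("reference transcript is empty")
    wer = word_edit_distance(reference, hypothesis) / len(reference)
    return 1.0 - wer


# Hypothetical example: the caption drops one word out of seven.
reference = "the minister will visit cairo on monday"
caption = "the minister will visit cairo monday"
accuracy = word_accuracy(reference, caption)
print(f"{accuracy:.1%}")
```

In this example one error in seven reference words gives an accuracy of about 86%, which shows how a figure in the range quoted above could arise under this metric.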
We can achieve this with the help of technology. For instance, if I’m in Italy and I’m an Arab and I want to watch something on TV, I am lost because it is in Italian. If I have an app on my mobile that allows me to hear it in Arabic, imagine how different the situation will be.
Technology is a great enabler, and this is a great example of that. Essentially, you are talking about creating multi-lingual media based on a user’s requirement.
I guess the next question has to be about metadata then?
Yes. US law makes closed captioning mandatory. A lot of the data, therefore, comes with text, so when you enter a subject, you are able to find where it is mentioned. Closed captioning doesn’t exist in most other places, so you end up with metadata that gives you only a brief summary of what was said. For instance, you may see a live interview with a dignitary in Egypt but no text to show what he said, because it is a live interview and there is no law with regard to closed captioning. Our technology takes that interview, generates the text and indexes it so that anything said in that interview becomes rich metadata, which you can search. I attended some of the Arabic TV conferences here. Broadcasters lack the capability to generate fully annotated Arabic video with its text so that it is searchable.
Closed captioning just doesn’t exist. Go to any of the web sites of the major broadcasters in the Arab world. They do not have the text of what they speak 24/7, although they do generate the text for some programmes, if there is a specific requirement for it.
But imagine if you have 24/7 closed captioning where everything on the channel can be translated as well as read in a language that you understand?
There is, for instance, a wealth of audio lying wasted on tapes here in this region, with just brief metadata if you are lucky. So there might be precious information on those tapes, but who knows about it? The goal is to create rich metadata that makes content accessible.
Content is king but if people can’t access that content because no one knows on which tape it is stored, it’s not very helpful. Arab broadcasters have hundreds of hours of audio tape here but have no idea what’s on it. That brings us to the third major use of this solution: digitisation and metadata creation.
It’s important to convert the content on those audio tapes into digital files. Once that is done, we can convert that content automatically into text so that every word spoken on that tape can be fully accessible.
But you are not involved in the digitisation process, are you?
No. That’s why we have partnered with Integracast MENA, which is a CMS affiliate in Dubai. They will undertake such projects here and provide the digitisation. Our technology primarily helps identify languages, translate them intelligently into the target language and create rich metadata around it.
Another instance of this is the work we do with a major studio in the US. They have several tapes that have videos in different languages. The person who looks at those tapes may not be able to identify whether the video is in Czech or Spanish. The technology takes care of automatic language identification to avoid people making such errors.
Are you bringing this to CABSAT?
Yes, the solution for this region is branded as IQURIOUS and is a partnership between Integracast and AppTek. It has several modules. One module has been used with Al Arabiya, as I mentioned before, to translate and transcribe. We also work with a major media entity in Abu Dhabi on media monitoring.
You mentioned the ability to extend this technology to other devices?
That’s right. We are working on some second screen concepts but they are still in the development stage.
Eighty percent of people in the US currently use a second screen while they are watching TV. We have developed a system called Tivvy, which is second-screen TV.
Tivvy works on the principle of audio and video fingerprinting. If you point your iPhone at a channel on your TV, it takes a fingerprint of that audio and video and tells you which channel you are watching, whether it is Fox, Al Jazeera or something else.
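The interview does not describe how Tivvy computes or matches fingerprints. As a rough illustration of one widely used matching idea, the Python sketch below assumes fingerprints have already been reduced to lists of (hash, time) pairs and picks the channel whose hashes line up with the recording at a single consistent time shift. The function `best_match`, the hash values and the database layout are all hypothetical.

```python
from collections import Counter


def best_match(query, database):
    """Return (channel, votes) for the channel that best aligns with the query.

    query: list of (hash_value, time_offset) pairs from the phone recording.
    database: dict mapping channel name -> list of (hash_value, time_offset).
    Returns (None, 0) when no hash in the query is found in any channel.
    """
    best = (None, 0)
    for channel, reference in database.items():
        times_by_hash = {}
        for hash_value, ref_time in reference:
            times_by_hash.setdefault(hash_value, []).append(ref_time)
        # Hashes recorded from the same broadcast share one time shift,
        # so count how often each shift occurs and keep the most common.
        shifts = Counter(
            ref_time - query_time
            for hash_value, query_time in query
            for ref_time in times_by_hash.get(hash_value, [])
        )
        votes = shifts.most_common(1)[0][1] if shifts else 0
        if votes > best[1]:
            best = (channel, votes)
    return best


# Hypothetical fingerprints: small integers stand in for real audio hashes.
database = {
    "Fox": [(101, 0), (102, 1), (103, 2), (104, 3)],
    "Al Jazeera": [(201, 0), (102, 1), (203, 2), (204, 3)],
}
query = [(102, 0), (103, 1), (104, 2)]  # recording started one step late
result = best_match(query, database)
print(result)
```

Voting on the time shift, rather than simply counting shared hashes, makes a chance collision such as hash 102 on the second channel count for little against a run of aligned matches.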
Essentially, it uses video and audio recognition. The technology recognises the pixels and the contours to generate information for the user. We can also link it to e-commerce and social media.
That can be extended further to give you details about the actor, which, in turn, can be connected to an e-commerce web site that allows you to buy the same glasses or T-shirt that a particular actor is wearing.
This is the new direction the media business will take because it will help you to share clips immediately with your social media friends and participate in an immediate poll. The purpose is to create a shared TV environment and enhance the user experience.
Mohammad Shihadah is the founder of AppTek, a US-based firm that has been developing human language technologies for the last 24 years. In an exclusive interview with Vijaya Cherian, Shihadah talks about how this technology will add value to the television business and why Middle East broadcasters will benefit from it.