A: I was born in Russia, but my family moved to Germany when I was three and a half, so I grew up and studied in Germany. As a result, my Russian is limited to everyday household conversation, but I speak German and English fluently. I am very interested in languages and wanted to learn one with a completely different structure from European languages. I started learning Chinese when I travelled to China a few years ago and it’s been fascinating. I am now able to communicate in Chinese to get things done and discuss most topics with my teachers, though I still have a lot of trouble understanding accents.
A: As a kid I loved playing computer games and wanted to make my own, so my mother started buying me programming books. I quickly discovered I had a knack for coding and it was a lot of fun for me, so I decided to study computer science – that’s how I moved to Aachen and I’ve been here ever since. Speech recognition came later, when I worked on my PhD. I had originally decided to study data mining as I’m interested in getting insights from data, but this did not work out for me, so I decided to switch to Prof. Dr.-Ing. Hermann Ney’s chair at the university and focus on the more challenging topic of human language technologies instead – I’ve never looked back.
A: Both technologies were very interesting to me, but in speech you have (at least in theory) a ground truth, i.e. the task is a lot more clearly defined. In MT you can have different equally valid translations, as language is not only about the spoken word, but also about culture and in some cases it’s very hard to bring the context across in different languages. In Chinese for instance, there are thousands of sayings and you may be able to translate them in a sentence or two, but in Chinese many of them are only four characters long. In MT you measure the output using a BLEU score, but this is of course a very imperfect measure of semantic similarity. In ASR, things are more straightforward in that respect, although on a technical level ASR is a harder task than MT simply because the amount of data that is being processed is a lot larger. This means you need to be computationally more efficient, which I enjoy, as it fits with my desire to process large amounts of data and design efficient algorithms. I like to focus on efficiency and this is part of my work here at AppTek.
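To illustrate why n-gram overlap is an imperfect measure of semantic similarity, here is a minimal sketch of clipped n-gram precision, which is only one ingredient of BLEU (the full metric also combines several n-gram orders and applies a brevity penalty). The function names and the example sentences are illustrative assumptions, not taken from the interview.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference ("clipping"). This is a simplification, not full BLEU.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical example: two acceptable renderings of the same saying.
reference = "the early bird catches the worm"
literal = "the early bird catches the worm"
paraphrase = "those who start early get ahead"

print(clipped_precision(literal, reference, 1))
print(clipped_precision(paraphrase, reference, 1))
print(clipped_precision(paraphrase, reference, 2))
```

The paraphrase carries roughly the same meaning as the reference but shares only one word with it, so an overlap-based score rates it poorly.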
A: I had been working in Prof. Ney’s scientific team for about a decade before joining AppTek full time, and I also completed an internship at Google, so I had the experience of working in a large company too. What I like best about AppTek is the freedom it offers me to explore my own ideas and the fact that I have more control over the technology stack. In a smaller company it is easier to get things done and there are still a lot of challenging tasks to solve, so I prefer working in such an environment.
I also like the fact that I can continue using RWTH Aachen’s speech toolkit, RASR, which I’ve been maintaining for a while and which I know how to optimize well. In fact, I can carry over all the tools that I used at university, where I was building an automation pipeline, and this makes my life easier.
A: Until recently, the priority at AppTek had been language expansion, so as to cover all the major languages our clients request. Now that we cover a large selection of languages, we are focusing on adding more automation to the pipeline, so we can iterate on languages regularly and not just on the basis of client demand.
We have just completed a major redesign of our streaming server that increases parallelization and extensibility. My vision is to make the speech team scalable, so everybody can train models and rapidly update them. The goal is to update all our models on a monthly basis, as soon as we crawl or get new data in, so we are always at the cutting-edge for any language we offer. Another thing we work on is bringing together our batch and streaming models.
A: Yes, it’s the same models used for both, but the difference is in how fast one collapses the search space. The machine will recognize the ending of a word very shortly after a word has ended in the speech signal, but this word end is only one out of many hypotheses. You might need to hear the next word to be sure that the previous word is correct, that it makes sense in the context of the next word, so it takes some time for the search space to collapse in the past. For ASR applications like live captioning there is a limit, because you need to output the words as quickly as possible. We look at the best hypothesis at each time frame, and although the last couple of words might still change, everything else is fixed, so we prune away all other hypotheses. This means we might lose some accuracy, as we could have found another half-a-word sequence that fits the data better, but if there is a latency requirement, we need to collapse the search space faster.
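The trade-off described here can be sketched in a few lines. The code below is a toy illustration under simplifying assumptions: hypotheses are plain word lists with a log-score, and the names `stable_prefix`, `force_collapse` and `keep_last` are hypothetical. It is not how any particular decoder, including AppTek's, is implemented.

```python
def stable_prefix(hypotheses):
    """Words that every active hypothesis agrees on.

    Without a latency constraint, only this shared prefix is safe to
    output, because the search space has already collapsed there.
    """
    prefix = []
    for words in zip(*hypotheses):
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break
    return prefix


def force_collapse(scored_hypotheses, keep_last=2):
    """Commit early under a latency constraint.

    Take the currently best-scoring hypothesis and fix everything except
    its last `keep_last` words, discarding all competing hypotheses.
    """
    best_words, _ = max(scored_hypotheses, key=lambda h: h[1])
    cut = max(len(best_words) - keep_last, 0)
    return best_words[:cut]


# Hypothetical search state: (word sequence, log-score).
hyps = [
    (["i", "want", "to", "recognize", "speech"], -4.1),
    (["i", "want", "to", "wreck", "a", "nice", "beach"], -4.3),
    (["i", "want", "two", "wrecks"], -6.0),
]

print(stable_prefix([words for words, _ in hyps]))  # ['i', 'want']
print(force_collapse(hyps, keep_last=2))            # ['i', 'want', 'to']
```

Forcing the collapse emits one more word sooner, at the risk that a competing hypothesis would have turned out to fit later audio better.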
A: What adds to the latency is that there are a number of things that need to happen in the pipeline. When we get the speech signal, first we perform feature extraction. We put the signal through our voice-activity detector to cut out the speech parts. Then we put it through the neural network, which gives us the scores for each phoneme, and then we try to find the best path for each word. So by the time the decoder looks at the word end, a second might have passed already from the time the audio was actually spoken.
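As a rough illustration of one of these stages, the sketch below cuts out speech regions with a simple frame-energy threshold. The interview does not say how AppTek's voice-activity detector works; energy thresholding is a textbook stand-in, and the function names, frame length and threshold are all assumptions chosen for the example.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_segments(energies, threshold):
    """Return (start_frame, end_frame) pairs where energy >= threshold.

    The end index is exclusive.
    """
    segments = []
    start = None
    for i, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = i
        elif energy < threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments


# Hypothetical signal: silence, a short burst of "speech", silence.
samples = [0.0] * 4 + [0.5, -0.5] * 4 + [0.0] * 4
energies = frame_energies(samples, frame_len=4)

print(energies)                        # [0.0, 0.25, 0.25, 0.0]
print(speech_segments(energies, 0.1))  # [(1, 3)]
```

Only the frames inside the returned segments would be passed on to the neural network and the decoder, and each stage adds its own share to the overall delay.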
At AppTek we have traditionally prioritized quality over speed or latency. Thus our current generation of streaming models has a latency of around 1.6 seconds, which works fine for many applications. In live captioning, for example, the accepted latency can be quite a few seconds. In other applications like interpreting, you can overcome some of the latency by pushing through recognition updates in real time. We’re working on bringing the latency down further without compromising quality by integrating a new transformer neural-network model into our streaming system, which has the capacity for lower latency as it requires less future context.
A: The better question here is how closely related these languages are. An ASR model includes the spelling and the pronunciation, and also multiple pronunciation variants. So it’s not a problem to have a model that includes data from different language variants, as long as the model is large enough to learn from the data. If the variation is too large for the given size of the model, it might degrade the recognition accuracy. I prefer to work with larger models, for example a global English model to include all English variants, as this fits more commercial use cases.
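The idea of one model holding several pronunciation variants per word can be pictured as a small lexicon data structure. This is a minimal sketch: the phoneme symbols are loosely ARPAbet-style, and the entries and the helper name `words_for` are illustrative assumptions rather than part of any real AppTek lexicon.

```python
# Each word maps to one or more pronunciation variants (phoneme tuples).
lexicon = {
    "tomato": [
        ("t", "ah", "m", "ey", "t", "ow"),  # one regional variant
        ("t", "ah", "m", "aa", "t", "ow"),  # another regional variant
    ],
    "data": [
        ("d", "ey", "t", "ah"),
        ("d", "ae", "t", "ah"),
    ],
    "speech": [
        ("s", "p", "iy", "ch"),
    ],
}


def words_for(phonemes, lexicon):
    """Return all words that list `phonemes` as one of their variants."""
    target = tuple(phonemes)
    return sorted(
        word for word, variants in lexicon.items() if target in variants
    )


print(words_for(["t", "ah", "m", "aa", "t", "ow"], lexicon))  # ['tomato']
print(words_for(["d", "ae", "t", "ah"], lexicon))             # ['data']
```

Both variants of a word lead to the same spelling, which is why data from several language variants can share a single model as long as the model is large enough.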
A: One could write a whole book about the topic of hybrid versus end-to-end, but, in short, end-to-end systems do not use a pronunciation dictionary and are thus to some extent easier to build, but often underperform on tasks with small amounts of data. The decision for which architecture to use can depend on many factors and the distinction between hybrid and end-to-end is not black and white. For example, RNN-T (a neural-network architecture used in many end-to-end systems) can also be used in hybrid systems (i.e. together with a pronunciation lexicon) where it also performs well. I think the more important aspect is the choice of NN architecture and the preparation of your data.
Neural network acoustic models have been around for a long time, and only with advances in computing power did they start to overtake probabilistic models like Gaussian mixture models. And even then improvements were gradual. The trend of using more and more computing resources and data will certainly continue and is probably the most important factor in the advances still to be made in the field. One of the current hot topics in the community is unsupervised training on vast amounts of unlabeled data, which promises nice gains but also requires significant infrastructure and data to work well.
For use-cases where you have a lot of high-quality data to train systems with, ASR quality is already very good as measured by word error rate. When it comes to improving the ASR for specific domains or noisy environments though, there is still significant room for improvement. Then there are also textual features that are important to customers, but do not enjoy as much popularity in the research community, like punctuation and speaker diarization. This is one of the areas where AppTek distinguishes itself from many of our competitors.
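For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the recognizer output and a reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name and example sentences are illustrative, and real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
print(word_error_rate(reference, "the cat sat on the mat"))        # 0.0
print(word_error_rate(reference, "the cat sat on mat"))            # one deletion
print(word_error_rate(reference, "a cat sat on the mat today"))    # one substitution, one insertion
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.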
AppTek.ai is a global leader in artificial intelligence (AI) and machine learning (ML) technologies for automatic speech recognition (ASR), neural machine translation (NMT), natural language processing/understanding (NLP/U), large language models (LLMs) and text-to-speech (TTS) technologies. The AppTek platform delivers industry-leading solutions for organizations across a breadth of global markets such as media and entertainment, call centers, government, enterprise business, and more. Built by scientists and research engineers who are recognized among the best in the world, AppTek’s solutions cover a wide array of languages/dialects, channels, domains and demographics.