AppTek Blog | AppTek at INTERSPEECH 2023 Dublin


The AppTek team is in Dublin, Ireland, this week for the 24th INTERSPEECH Conference. INTERSPEECH is the world’s largest and most comprehensive conference on the science and technology of spoken language processing. The AppTek Science team will be participating across multiple tracks, including oral and paper presentations as well as a keynote speech. Visit the AppTek team at the event at one of the following sessions:


End-to-End Models – Friend or Foe of Speech Research?

Wednesday, 23 August – 08:30 – 09:30 – The Auditorium

Ralf Schlüter (Keynote Speaker); moderated by Roger Moore

End-to-end architectures have revolutionized performance in many areas of speech technology. You no longer need to be an expert in speech to build, for example, an ASR system with performance our community only dreamed of a decade ago. INTERSPEECH has always valued the symbiotic relationship between speech science and speech technology, with linguists, phoneticians, computer scientists and engineers all learning from one another. Put simply, we need each other, or so we have always liked to believe. But where now? Does the dominance of end-to-end architectures, coupled with vast amounts of speech data and compute power, mean we can learn anything we need to directly from a speech signal, without needing to understand what’s going on? Can the speech technologists go it alone? Do speech scientists care? Can speech technology and speech science working alongside one another achieve greater research outcomes than apart?


Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text

Wednesday, 23 August - P6.11 13:30 - 15:30

Parnia Bahar, Mattia Di Gangi, Nick Rossenbach, Mohammad Zeineldeen

Automatic Arabic diacritization is useful in many applications, ranging from reading support for language learners to accurate pronunciation prediction for downstream tasks like speech synthesis. While most previous work has focused on models that operate on raw non-diacritized text, production systems can gain accuracy by first letting humans partly annotate ambiguous words. In this paper, we propose 2SDiac, a multi-source model that can effectively support optional diacritics in input to inform all predictions. We also introduce Guided Learning, a training scheme to leverage given diacritics in input with different levels of random masking. We show that the hints provided at test time affect more output positions than those annotated. Moreover, experiments on two common benchmarks show that our approach i) greatly outperforms the baseline even when evaluated on non-diacritized text; and ii) achieves state-of-the-art results while reducing the parameter count by over 60%.
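To make the idea of random masking of input diacritics concrete, here is a minimal Python sketch. It is an illustration under our own assumptions, not the authors' implementation: the function names `mask_diacritics` and `strip_diacritics`, the `keep_prob` parameter, and the choice to treat the Unicode range U+064B to U+0652 as the set of diacritics are all hypothetical.

```python
import random

# Assumption: treat the Arabic marks U+064B through U+0652 as the diacritics.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def mask_diacritics(text: str, keep_prob: float, rng: random.Random) -> str:
    """Keep each diacritic with probability keep_prob and drop it otherwise.

    Base letters and all other characters are always kept, so the result is
    a partially diacritized version of the input.
    """
    kept = []
    for ch in text:
        if ch in DIACRITICS and rng.random() >= keep_prob:
            continue  # drop this diacritic: no hint at this position
        kept.append(ch)
    return "".join(kept)


def strip_diacritics(text: str) -> str:
    """Remove every diacritic, giving the raw non-diacritized form."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


# Hypothetical usage: one fully diacritized word, masked at several levels.
word = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf, ta, ba, each with a fatha
rng = random.Random(0)
for keep_prob in (0.0, 0.5, 1.0):
    hinted_input = mask_diacritics(word, keep_prob, rng)
    print(keep_prob, len(hinted_input))
```

With `keep_prob` at 0.0 the input carries no hints, and at 1.0 it is fully diacritized; values in between give partially diacritized text of the kind a human annotator might supply.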


Mixture Encoder for Joint Speech Separation and Recognition

Wednesday, 23 August - S2.16 13:30 - 15:30

Simon Berger, Peter Vieting, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

Multi-speaker automatic speech recognition (ASR) is crucial for many real-world applications, but it requires dedicated modeling techniques. Existing approaches can be divided into modular and end-to-end methods. Modular approaches separate speakers and recognize each of them with a single-speaker ASR system. End-to-end models process overlapped speech directly in a single, powerful neural network. This work proposes a middle-ground approach that leverages explicit speech separation similarly to the modular approach but also incorporates mixture speech information directly into the ASR module in order to mitigate the propagation of errors made by the speech separator. We also explore a way to exchange cross-speaker context information through a layer that combines information of the individual speakers. Our system is optimized through separate and joint training stages and achieves a relative improvement of 7% in word error rate over a purely modular setup on the SMS-WSJ task.
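For readers less familiar with the metric, a relative improvement in word error rate (WER) is the reduction expressed as a fraction of the baseline WER, not a difference in percentage points. The short Python sketch below shows the arithmetic; the function name and the two WER values are hypothetical and are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers chosen for illustration only, NOT taken from the paper.
baseline = 20.0  # WER in percent for an assumed modular baseline
improved = 18.6  # WER in percent for an assumed improved system
print(f"{relative_wer_improvement(baseline, improved):.1%}")
```

In this made-up case the absolute gain is 1.4 percentage points, which corresponds to a relative improvement of 7%.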


RASR2: The RWTH ASR Toolkit for Generic Sequence to Sequence Speech Recognition

Wednesday, 23 August - O15.4 17:00-17:20

Wei Zhou, Eugen Beck, Simon Berger, Ralf Schlüter, Hermann Ney

Modern public ASR tools usually provide rich support for training various sequence-to-sequence (S2S) models, but rather simple support for decoding open-vocabulary scenarios only. For closed-vocabulary scenarios, public tools supporting lexical-constrained decoding are usually only for classical ASR, or do not support all S2S models. To eliminate this restriction on research possibilities such as modeling unit choice, we present RASR2 in this work, a research-oriented generic S2S decoder implemented in C++. It offers strong flexibility/compatibility for various S2S models, language models, label units/topologies and neural network architectures. It provides efficient decoding for both open- and closed-vocabulary scenarios based on a generalized search framework with rich support for different search modes and settings. We evaluate RASR2 with a wide range of experiments on both Switchboard and LibriSpeech corpora.


Competitive and Resource Efficient Factored Hybrid HMM Systems are Simpler Than You Think

Thursday, 24 August - P3.9 - 10:00 - 12:00

Tina Raissi, Christoph Lüscher, Moritz Gunz, Ralf Schlüter, Hermann Ney

Building competitive hybrid hidden Markov model (HMM) systems for automatic speech recognition (ASR) requires a complex multi-stage pipeline consisting of several training criteria. The recent sequence-to-sequence models offer the advantage of having simpler pipelines that can start from-scratch. We propose a purely neural based single-stage from-scratch pipeline for a context-dependent hybrid HMM that offers similar simplicity. We use an alignment from a full-sum trained zero-order posterior HMM with a BLSTM encoder. We show that with this alignment we can build a Conformer factored hybrid that performs even better than both a state-of-the-art classic hybrid and a factored hybrid trained with alignments taken from more complex Gaussian mixture based systems. Our finding is confirmed on Switchboard 300h and LibriSpeech 960h tasks with comparable results to other approaches in the literature, and by additionally relying on a responsible choice of available computational resources.

