Condensation of Spoken Text, or How to Create Readable Automatic Captions

March 31, 2025
Dr. Evgeny Matusov & Dr. Yota Georgakopoulou

Closed captions, live and offline, have been around for half a century. They are the text version of the spoken part of an audiovisual production, formatted in a couple of lines of text at the bottom of the screen. Though their use today is widespread, they were originally developed to serve the needs of the deaf and hard-of-hearing community, as well as people with developmental and learning challenges, who rely on them to make sense of the hundreds of hours of new video content produced every day. From movies and TV series to unscripted live content such as news and sports broadcasts, closed captions make content accessible to all, allowing people with a hearing impairment to stay informed and entertained via the same channels the wider population uses.

The production of live closed captions was originally a labor-intensive and costly process involving highly trained stenocaptioners. These were gradually replaced by voice writers, i.e. captioners using Automatic Speech Recognition (ASR) technology to create closed captions with significantly less training and hence at lower cost. As ASR systems continued to improve, fully automatic captions have made their appearance and are claiming their place in the market alongside human-in-the-loop production workflows. The pandemic pushed their use forward, as they provided a much-needed instant communication solution to the deaf and hard-of-hearing community during the lockdowns, and people found ways to cope with automatic captions and use them despite their flaws.

The challenges of using ASR output for automated captioning

Clearly, one of the main challenges of ASR output for any use case lies in its accuracy. The more high-profile the content and the higher the risk of its misinterpretation, the higher the accuracy bar. While ASR output is not perfect – no transcript is, not even one created by humans – the technology has made such large strides recently that for certain languages, such as English, its accuracy is impressive. Some would say on par with professional transcription quality.

Captions, however, are more than just strings of words on a screen. ASR systems produce uncapitalized and unpunctuated text, which is hard to read and certainly not acceptable for broadcast captioning. Additionally, in unscripted content, speakers can talk faster than the audience could possibly read if the same amount of dialogue were written out, and they often introduce hesitations, filler words and other meaningless or less important features of speech when they talk.

In order to create a useful transcript out of ASR output, a process called Inverse Text Normalization (ITN) needs to be applied. This involves adding capitalization and punctuation to the text and converting it into a more human-readable form, with expressions like dates, currencies, mathematical operations, etc. rendered as digits and non-text symbols. The same goes for acronyms, which are not readable in their phonetic spelling.
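As a purely illustrative sketch of what ITN does, the toy Python rules below capitalize a sentence, add terminal punctuation and rewrite a spelled-out dollar amount as digits. A production ITN system covers far more phenomena (dates, times, ordinals, acronyms and so on) and is typically model-based rather than hand-written, so this is an assumption-laden miniature, not the real thing.

import re

# Toy vocabulary: a real ITN system covers far more number words and phenomena.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "twenty": 20,
    "eighty": 80, "hundred": 100, "thousand": 1_000, "million": 1_000_000,
}

# Matches a run of number words (optionally joined by "and") followed by "dollars".
CURRENCY_PATTERN = re.compile(
    r"\b((?:(?:%s|and)\s+)+)dollars\b" % "|".join(NUMBER_WORDS)
)

def words_to_number(phrase: str) -> int:
    """Convert a spelled-out number such as 'six hundred and eighty million' to 680000000."""
    total, current = 0, 0
    for word in phrase.split():
        value = NUMBER_WORDS.get(word)
        if value is None:            # skip 'and' and anything unknown
            continue
        if value == 100:
            current *= value
        elif value >= 1_000:         # a scale word closes the current group
            total += current * value
            current = 0
        else:
            current += value
    return total + current

def apply_itn(text: str) -> str:
    """Capitalize, punctuate and rewrite spelled-out dollar amounts as digits."""
    text = CURRENCY_PATTERN.sub(lambda m: f"${words_to_number(m.group(1)):,}", text)
    text = text.strip()
    return text[:1].upper() + text[1:] + "."

print(apply_itn("they will try to collect about six hundred and eighty million dollars owed"))
# -> They will try to collect about $680,000,000 owed.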

On top of this, captions introduce additional constraints on the text, whose length is limited to a specific number of lines and characters per line and is further determined by the exact duration of the corresponding utterance in the audio track. This is because there is only so much space available on the screen for displaying the captions and only so much time for someone to read the text. In the case of pop-on captions, the segmentation of the text within the two caption lines and across captions also needs to take semantic units into account, so that the text is displayed in a way that facilitates reading.
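To illustrate these constraints, the following sketch checks a single caption against a set of assumed example limits (two lines, 42 characters per line, a reading speed of 17 characters per second). Actual limits vary by broadcaster, platform and style guide; the values here are placeholders for illustration only.

from dataclasses import dataclass

# Assumed example limits; real style guides differ per client and platform.
MAX_LINES = 2
MAX_CHARS_PER_LINE = 42
MAX_READING_SPEED_CPS = 17.0

@dataclass
class Caption:
    lines: list[str]
    start: float  # display start time in seconds
    end: float    # display end time in seconds

def violations(caption: Caption) -> list[str]:
    """Return a list of constraint violations for a single caption."""
    problems = []
    if len(caption.lines) > MAX_LINES:
        problems.append(f"too many lines: {len(caption.lines)}")
    for i, line in enumerate(caption.lines, 1):
        if len(line) > MAX_CHARS_PER_LINE:
            problems.append(f"line {i} has {len(line)} characters (> {MAX_CHARS_PER_LINE})")
    duration = caption.end - caption.start
    chars = sum(len(line) for line in caption.lines)
    if duration > 0 and chars / duration > MAX_READING_SPEED_CPS:
        problems.append(f"reading speed {chars / duration:.1f} cps (> {MAX_READING_SPEED_CPS})")
    return problems

caption = Caption(
    lines=["And we believe that this will result in",
           "more revenue gains across the board."],
    start=12.0,
    end=16.5,
)
print(violations(caption))  # -> [] (the caption fits all three constraints)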

While all of the above challenges have been tackled to a greater or lesser degree by various providers, the biggest hurdle with automatic captions still lies in their verbatim nature. ASR systems by design are built to produce a word-for-word rendering of the audio track they are fed. In the case of unscripted content, this includes all the hesitations, fillers, stuttering, incomplete phrases and other speech elements that are typical of oral discourse. On top of this, the conversion of speech into text that people will read is hampered by the fact that we cannot read as fast as we hear, which is problematic for captions of high-speech-rate videos.

The solution implemented in manual workflows to overcome this issue is to condense the text: remove speech elements that do not contribute to the overall meaning, such as hesitations and repetitions, omit words or phrases that are low in information content, and rephrase text into shorter alternatives with the same or equivalent meaning. This editing down (or out) of text is the core and most frequent technique applied by captioners and subtitlers, whether captioning in the same language as the audio or translating it into another language.

How to automatically condense text for captioning purposes

In order to address this challenge, we built a neural sequence-to-sequence translation system capable of removing words and phrases with low information content from ASR output, and even of performing some rephrasing, to ensure the length of each caption segment is appropriate for a comfortable reading speed while retaining as much of the information content as possible. In other words, we have trained a neural machine translation (NMT) system to ‘translate’ within the same language while applying length constraints to its output.
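In practice, a training example for such a system can be thought of as a monolingual ‘translation’ pair: the source side is the verbatim (normalized) transcript and the target side is the condensed, caption-ready text a professional would produce. The Python dictionary below, built from Example 1 further down, only illustrates the idea of the data format; it is not our internal representation.

# An illustrative monolingual training pair (field names are for illustration only).
training_pair = {
    "source": "and eh we believe that this will result in more revenue gains "
              "across the board as one might expect you know",
    "target": "And we believe that this will result in more revenue gains "
              "across the board.",
}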

To train such a model we used various types of data: manually annotated data created by professional subtitlers, synthetically created data produced out of a two-way (round-trip) translation with length control applied in the NMT output, as well as synthetically created data produced via the automatic alignment of subtitle files with different levels of verbosity for the same piece of content.
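A sketch of the round-trip idea is shown below: a sentence is translated into a pivot language and back with a length-control setting that asks for shorter output, and the (original, shortened back-translation) pair becomes synthetic training data. The forward and backward callables and the length_ratio argument are hypothetical placeholders standing in for real NMT models, and the lambda stand-ins exist only so the sketch runs.

from typing import Callable

def make_round_trip_pair(
    sentence: str,
    forward: Callable[[str], str],            # e.g. an EN->DE model (placeholder)
    backward: Callable[[str, float], str],    # a DE->EN model with length control (placeholder)
    length_ratio: float = 0.8,
) -> dict:
    """Round-trip a sentence and keep (original, shortened back-translation) as a training pair."""
    pivot = forward(sentence)
    condensed = backward(pivot, length_ratio)
    return {"source": sentence, "target": condensed}

# Trivial stand-ins so the sketch runs; real NMT systems would be plugged in here.
pair = make_round_trip_pair(
    "we believe that this will result in more revenue gains across the board",
    forward=lambda s: s,                          # identity instead of EN->pivot
    backward=lambda s, r: s[: int(len(s) * r)],   # crude truncation instead of length-controlled NMT
)
print(pair["target"])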

Additionally, we extended our model with specific length control features so as to have more fine-grained control over the desired output length. By computing the length ratio between a source sentence and its target language translation, the NMT system learned to produce ‘short’ or ‘extra short’ sentences as required. We also used a length-difference positional encoding, which counts down from the desired output length to zero so that the model learns to stop at the right position. The target sentence length can be set by the user.
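As an illustration of the countdown idea, the sketch below builds a standard sinusoidal encoding driven by the remaining length budget (desired length minus decoding step) rather than the absolute position, so the representation reaches the ‘zero budget’ point exactly at the desired output length. It is a minimal numpy illustration of the mechanism, not our model code, and the dimensions are arbitrary example values.

import numpy as np

def length_difference_encoding(desired_length: int, num_steps: int, d_model: int = 512) -> np.ndarray:
    """Sinusoidal encoding of the remaining length budget at each decoding step."""
    remaining = desired_length - np.arange(num_steps)        # L, L-1, ..., possibly negative
    div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    enc = np.zeros((num_steps, d_model))
    enc[:, 0::2] = np.sin(remaining[:, None] * div[None, :])
    enc[:, 1::2] = np.cos(remaining[:, None] * div[None, :])
    return enc

pe = length_difference_encoding(desired_length=75, num_steps=80)
print(pe.shape)  # (80, 512); steps beyond 75 encode a negative (overshot) budget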

To ensure the output text fits the captioning constraints in terms of reading speed, we apply a variable desired length value that adjusts the length of each condensed sentence to the desired reading speed. The system can then compute the goal text length in characters on the basis of the set reading speed and the actual duration of the utterance.
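A sketch of that computation is shown below: the character budget follows directly from the utterance duration and the configured reading speed, and the required compression ratio can then be mapped onto a coarse length class. The 17.5 cps value, the thresholds and the class names are assumed example settings, not fixed system parameters.

def target_length_chars(duration_sec: float, reading_speed_cps: float = 17.5) -> int:
    """Character budget that keeps the caption within the configured reading speed."""
    return int(duration_sec * reading_speed_cps)

def length_class(source_chars: int, budget_chars: int) -> str:
    """Map the required compression onto a coarse length-control class."""
    ratio = budget_chars / max(source_chars, 1)
    if ratio >= 1.0:
        return "normal"       # no condensation needed
    if ratio >= 0.8:
        return "short"        # mild condensation
    return "extra short"      # aggressive condensation

source = ("and eh we believe that this will result in more revenue gains "
          "across the board as one might expect you know")
budget = target_length_chars(duration_sec=4.6)
print(budget, length_class(len(source), budget))  # -> 80 extra short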

Consider the following example of raw ASR output versus the condensed and formatted transcript that our condensation model produces in English.


EXAMPLE 1

Raw ASR output:
and eh we believe that this will result in more revenue gains across the board as one might expect you know
Condensed and formatted transcript:
And we believe that this will result in more revenue gains across the board.

In order to further improve the system, we employed a neural architecture that uses the extended context of the preceding sentences. This allows the system to achieve a higher or lower level of compression depending on whether, for instance, a certain entity is mentioned in the preceding sentence and can thus be omitted or automatically replaced with a pronoun, as would be the case in professionally created captions. We have also provided for a user-customizable list of important terms/words that should not be dropped or altered in any way during text compression; aside from proper nouns, it can include other words that are important in context for the correct understanding of an utterance, such as ‘deny’, ‘agree’, etc.
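One simple way to picture the protected-term safeguard is a post-check like the sketch below, which flags condensed sentences that lost a protected word and would therefore need to be regenerated or left uncondensed. The function, the sample sentences and the rejection policy are illustrative only and are not how the constraint is enforced inside the model.

def dropped_protected_terms(source: str, condensed: str, protected: set[str]) -> set[str]:
    """Return protected terms that occur in the source but are missing from the condensed text."""
    src_tokens = {t.strip(".,!?").lower() for t in source.split()}
    out_tokens = {t.strip(".,!?").lower() for t in condensed.split()}
    return {term for term in protected
            if term.lower() in src_tokens and term.lower() not in out_tokens}

missing = dropped_protected_terms(
    source="the committee did not deny the report on biogen",
    condensed="The committee discussed the report on Biogen.",
    protected={"deny", "Biogen"},
)
print(missing)  # -> {'deny'}: this condensation should be rejected or regenerated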

Neural text condensation is the new way to produce automatic captions

One of the distinguishing features of our approach as compared to other automatic text correction systems is that it does not require a large amount of hand-corrected data for the system to be trained. Nor does it require parsing a given sentence and identifying the parts of the syntactic parse that carry little information and which can be removed according to criteria defined by rules. Instead, no explicit rules are required in our approach, while parsing, which would have been problematic on data that includes recognition errors, is avoided altogether.

The additional benefit of such an automatic text condensation system is that it can learn to correct recognition errors or ignore them if they happen to be in a part of the sentence that is removed during condensation. This can further improve the user experience in the case of a fully automatic scenario where the output of the text condensation system is directly presented to the user in the form of captions or subtitles. It would also improve the efficiency of human post-editing for human-in-the-loop scenarios.

Below are several examples that showcase the capabilities of our system to produce succinct and effectively condensed sentences out of (sometimes disfluent) speech input. Note, for instance, how the system correctly chooses to omit the lexical items that carry the least amount of information in Examples 2 and 4; how inverse text normalization is used in Example 3 to create a more edited and readable version of the speech output; and how ASR errors or speech disfluencies are ignored or corrected in Examples 5 and 8.

EXAMPLE 2

Raw ASR output:
and i'm interested about consumer demand right here right now particularly for consumer

Condensed and formatted transcript:
I'm interested about consumer demand right here, particularly for consumer.

EXAMPLE 3

Raw ASR output:
block fi will try to collect about six hundred and eighty million dollars owed by ft excess

Condensed and formatted transcript:
Block Fi will try to collect about $680 million owed by FT excess.

EXAMPLE 4

Raw ASR output:
investors feeling a bit more optimistic by the u s relative to to rest the

Condensed and formatted transcript:
Investors feeling more optimistic by the U.S. relative to rest.

EXAMPLE 5

Raw ASR output:
the subway and the railway workers both are demanding they want more work forces at at at at the

Condensed and formatted transcript:
The subway and the railway workers are demanding they want more work forces.

EXAMPLE 6

Raw ASR output:
and apple's manufacturing plant a sigh saying that the drug is developing with biogen isn't to blame for two deaths involving brain bleeding toyota producing seven hundred and seventy one thousand three hundred and eighty two vehicles in october watch those automakers this is bloomberg

Condensed and formatted transcript:
And Apple's manufacturing plant saying the drug is developing with Biogen isn't to blame for two deaths involving brain bleeding, Toyota producing 771,382 vehicles in October.

EXAMPLE 7

Raw ASR output:
with the protests we see many many people want this economy

Condensed and formatted transcript:
With the protests, many people want this economy.

EXAMPLE 8

Raw ASR output:
over iran star forward christian pulisic was taken to hospital after crashing into iran's goalkeeper while scoring

Condensed and formatted transcript:
Christian Pulisic was taken to hospital after crashing into Iran's goalkeeper.

Advancing captioning automation to the next level

With significant progress in automatic caption accuracy, the next frontier is to optimize caption speed to match different program genres and audience needs. Automatic captions have been strongly criticized for their fast presentation speeds in verbose programming, as this can become a barrier to viewer comprehension. By refining automatic captioning, we can expand accessibility to a broader range of content, benefiting the d/Deaf and hard-of-hearing community as well as the wider audience.

AppTek’s cutting-edge system is configured to tackle condensation by selectively omitting less critical lexical items rather than paraphrasing. This approach aligns with user expectations, keeping captions closely synchronized with the audio for a seamless experience. Maintaining verbatim accuracy (to the extent possible) enhances accessibility for those who rely on lip reading, experience audio processing disorders, or have some hearing and prefer the flexibility of moving freely between lip reading and listening.

AppTek’s condensation system operates in the cloud to allow for independent deployment, scalability and easy replacement. If you would like to find out more as to how you can use it to produce appropriately condensed automatic captions, please contact info@apptek.com.
