Smart Machine Translation Control with Glossaries and Markup

December 18, 2024
Evgeny Matusov, Lead Science Architect

If you’ve ever used general-purpose machine translation tools, you will have noticed that they’re far from perfect. While they handle simple sentences and general text reasonably well, things get messy with technical terms, brand names, or formatting such as italics or bold text. Translators are often brought in to clean up such errors, a task that can be onerous and tedious. This is part of the reason why translators describe this kind of work as janitorial and still view machine translation with skepticism.

Machine Translation (MT) has made significant improvements in the fluency of its output, even for the most demanding domains, yet challenges such as those above persist. They are being addressed one by one, improving MT output incrementally. At AppTek, our researchers have been working on systems that allow translators to control how specific terms and text styles are handled. Let’s explore how these improvements are making machine translations more accurate, customizable, and user-friendly.


The Challenges of Traditional Machine Translation

Machine translation has come a long way, especially with neural networks, which produce more fluent and natural-sounding sentences. However, these systems still have limitations:

1. Term Consistency: Translating specific words, such as product names ("Windows"), place names ("Paris"), or set phrases, can go wrong depending on the context.
2. Handling Tags and Formatting: Machine translations often strip out or misplace formatting tags like <b> for bold or <i> for italics instead of ensuring their proper placement in the generated sentence.
3. Placeholders: In technical or software manuals, you’ll often see placeholders such as %(brand) or [1] that stand for specific names or events. These must stay untouched, yet machine translation systems sometimes "translate" or mishandle them instead of handling them according to their semantic role.

These issues are particularly troublesome for translation professionals, yet the implementation of solutions to them is not as straightforward as one might expect.

The problem lies in the attention mechanism that is found at the very core of every neural machine translation (NMT) system. It does not know how to recognize terms or tags; it simply processes words or tokens as numerical representations, using matrices and weights to generate sequences in the target language. It is just crunching numbers. LLMs follow the same Transformer architecture as the smaller NMT models and face the same issues. Even when instructed, they will not always keep the tags or use glossary translations, and such issues are even more pronounced for low-resource language pairs.



The AppTek Solution: Smarter Neural Machine Translation

At AppTek, we’ve tackled these challenges head-on by developing innovative approaches to explicitly model glossaries and markup tags within our NMT systems. This involved making the systems glossary- and markup-aware at training time. To do this, we extended the word or token information with additional information found in annotations of the training data. This enabled the precise handling of glossaries, placeholders, and tags while maintaining overall translation quality:

1. Glossary Integration: Users can provide specific words or terms that the machine translation system must translate a certain way.
2. Markup Awareness: The system now understands formatting tags (e.g., <b> for bold or <i> for italics) and places them correctly in the output.
3. Placeholder Handling: Placeholders like %(brand) are recognized and left untouched to ensure consistency in software or manuals.

By teaching our NMT system to "pay attention" to these special terms and formatting, translations become far more reliable and require less post-editing by professionals.



How It Works: Glossaries to the Rescue

Imagine translating a sentence about Vienna. Is it the capital of Austria or Vienna, a small town in Virginia? Machine translation doesn’t always know the difference.

With AppTek’s glossary-based override feature, you can specify how terms like "Vienna" should be translated. For example here, from English into German:

Once you arrive in {{{ Vienna | Vienna }}}, you must give the money to {{{ John Doe | Max Mustermann }}} or whatever is his name.

Sobald du in Vienna angekommen bist, musst du Max Mustermann das Geld geben, oder wie auch immer er heißt.

In this case, Vienna remains unchanged because it’s a U.S. town, and the placeholder name "John Doe" is replaced with the correct German equivalent "Max Mustermann."
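
To make the inline notation concrete, here is a minimal Python sketch of how a client might read such annotations before sending text for translation. The function name parse_annotations and the regular expression are illustrative assumptions based only on the {{{ source | target }}} form shown above, not AppTek's actual parser.

```python
import re

# Assumed pattern for the inline form shown above: {{{ source term | forced translation }}}
ANNOTATION = re.compile(r"\{\{\{\s*(.*?)\s*\|\s*(.*?)\s*\}\}\}")

def parse_annotations(text):
    """Return the plain source sentence and the list of (source, target) overrides."""
    pairs = [(m.group(1), m.group(2)) for m in ANNOTATION.finditer(text)]
    # Replace each annotation with its source term to recover the plain sentence.
    plain = ANNOTATION.sub(lambda m: m.group(1), text)
    return plain, pairs

sentence = ("Once you arrive in {{{ Vienna | Vienna }}}, you must give the money "
            "to {{{ John Doe | Max Mustermann }}} or whatever is his name.")
plain, pairs = parse_annotations(sentence)
print(plain)
print(pairs)
```

Separating the plain sentence from the override pairs keeps the annotation syntax out of any downstream text processing, such as logging or translation memory lookup.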

However, the system doesn’t just replace words with their correct equivalents. For morphologically rich languages such as Croatian, whose grammar is more complex, it correctly declines the terms even though they are listed in the glossary in their base forms. Take this example from English into Croatian:

You must give the money to {{{ John Doe | Ivan Horvat }}} or whatever is his name.
Moraš dati novac Ivanu Horvatu ili kako se već zove.

The system correctly adjusts "Ivan Horvat" to its dative case ("Ivanu Horvatu") even though the glossary provides the term in the nominative case as is standard. This kind of grammatical flexibility makes translations far more accurate and reduces the need for post-editing.

An even more complex case involves prepositions in verbal phrases, which in some languages depend on the noun. This matters when the noun is replaced by a placeholder tag, as in the example below from English into German.

If you have a general question about {{{ Facebook | %(platform_name) }}} please email us at {{{ example@abc.com | %(contact_email) }}}.

If you have a general question about {{{ life | %(concept) }}} please email us at {{{ example@abc.com | %(email) }}}.

Wenn Sie eine allgemeine Frage zu %(platform_name) haben, senden Sie uns bitte eine E-Mail an %(contact_email).

Wenn Sie eine allgemeine Frage über %(concept) haben, senden Sie uns bitte eine E-Mail an %(email).

Here, we are using a representative noun as the source word but forcing its translation to a placeholder with the glossary override mechanism. This results in the correct German preposition (“zu” for a concrete entity like a social media platform, “über” for an abstract concept like life). Without this functionality, the wrong preposition is generated in the translation.
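
One simple sanity check a client could run on the output is to confirm that every placeholder forced through the glossary survives verbatim. The sketch below assumes placeholders of the %(name) form used in these examples; the helper name missing_placeholders is hypothetical.

```python
import re

# Assumed placeholder syntax, as in the examples above: %(name)
PLACEHOLDER = re.compile(r"%\(\w+\)")

def missing_placeholders(expected, translation):
    """Return the expected placeholders that do not appear verbatim in the translation."""
    found = PLACEHOLDER.findall(translation)
    return [p for p in expected if p not in found]

expected = ["%(concept)", "%(email)"]
good = ("Wenn Sie eine allgemeine Frage über %(concept) haben, "
        "senden Sie uns bitte eine E-Mail an %(email).")
bad = ("Wenn Sie eine allgemeine Frage über %(concept) haben, "
       "senden Sie uns bitte eine E-Mail an %(E-Mail).")

print(missing_placeholders(expected, good))
print(missing_placeholders(expected, bad))
```

A check like this only verifies that placeholders are present; it says nothing about whether the surrounding preposition is correct, which still needs the system behavior described above.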

Training an NMT system to generate glossary terms correctly in context is no mean feat. It would be possible with explicitly tagged parallel data; however, such data is rarely available. At AppTek, automatically aligned words and phrases in non-annotated parallel data are randomly tagged and thus “connected” with each other. This tagged data is then mixed with plain-text parallel data for glossary-aware NMT training.
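
The idea of randomly tagging aligned words can be sketched roughly as follows. This is not AppTek's training pipeline: the function name tag_aligned_pairs, the tag_prob parameter, the restriction to one-to-one word links, and the reuse of the inline {{{ source | target }}} notation as the tagging format are all assumptions made for illustration.

```python
import random

def tag_aligned_pairs(src_tokens, tgt_tokens, alignment, tag_prob=0.3, rng=None):
    """Randomly annotate source tokens with their aligned target tokens.

    alignment is a list of (source_index, target_index) links, for example
    produced by an automatic word aligner.
    """
    rng = rng or random.Random()
    # Count links per token so that only unambiguous one-to-one links are used.
    src_counts, tgt_counts = {}, {}
    for i, j in alignment:
        src_counts[i] = src_counts.get(i, 0) + 1
        tgt_counts[j] = tgt_counts.get(j, 0) + 1
    one_to_one = {i: j for i, j in alignment
                  if src_counts[i] == 1 and tgt_counts[j] == 1}

    tagged = []
    for i, token in enumerate(src_tokens):
        if i in one_to_one and rng.random() < tag_prob:
            tagged.append("{{{ %s | %s }}}" % (token, tgt_tokens[one_to_one[i]]))
        else:
            tagged.append(token)  # untagged tokens stay as plain text
    return tagged

src = ["You", "must", "give", "the", "money"]
tgt = ["Du", "musst", "das", "Geld", "geben"]
links = [(0, 0), (1, 1), (2, 4), (3, 2), (4, 3)]
print(" ".join(tag_aligned_pairs(src, tgt, links, rng=random.Random(0))))
```

Because each token is tagged only with some probability, many sentences come out partly or wholly untagged, which loosely mirrors the mixing of tagged and plain data described above.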

Customers of our MT systems can annotate words or phrases directly in the input sentence, or they can upload a glossary directly to AppTek’s MT API. An uploaded glossary is given an ID to be used with specific translation requests, along with settings that specify what types of matches are allowed, e.g. singular vs. plural or case-sensitive vs. case-insensitive. When a glossary is uploaded, it is important to keep it clean and lean, containing only domain-specific terms that the NMT system might struggle with and no ambiguous terms that might result in inaccurate translations.
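
As a rough illustration of what a match option such as case sensitivity means in practice, the following sketch applies a small glossary locally and emits the inline notation. It is not the AppTek MT API; annotate_with_glossary and its case_sensitive parameter are hypothetical names, and singular vs. plural matching is left out.

```python
import re

def annotate_with_glossary(text, glossary, case_sensitive=True):
    """Wrap glossary matches in the inline {{{ source | target }}} notation."""
    if not glossary:
        return text
    flags = 0 if case_sensitive else re.IGNORECASE
    # Try longer terms first so that "John Doe" is preferred over "John".
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(t) for t in terms), flags)
    lookup = glossary if case_sensitive else {k.lower(): v for k, v in glossary.items()}

    def wrap(match):
        found = match.group(0)
        key = found if case_sensitive else found.lower()
        return "{{{ %s | %s }}}" % (found, lookup[key])

    return pattern.sub(wrap, text)

glossary = {"John Doe": "Max Mustermann"}
print(annotate_with_glossary("Give the money to john doe.", glossary, case_sensitive=False))
print(annotate_with_glossary("Give the money to john doe.", glossary, case_sensitive=True))
```

The second call leaves the sentence unchanged, which shows why an ambiguous or loosely matched glossary entry can silently change which terms get overridden.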


Tagging and Markup: Keeping the Style Intact

If you’ve worked with HTML or software manuals, you know how important formatting is. Words in italics, bold text, or tags like <a:link> can’t simply disappear in the translation. At AppTek, we leveraged the same factored approach to train our system so as to accurately transfer markup while accounting for potential reordering during translation.

Here’s a simple example of AppTek’s NMT with tag modelling from English into German:

When applying for a visa, a <a:link>foreign national</a:link> <b>must</b> bring all the necessary documents.

Bei der Visumbeantragung <b>muss</b> ein <a:link>Ausländer</a:link> alle notwendigen Dokumente mitbringen.

Notice how the markup-aware system passes the tags on unchanged and places them around the correct words in the translation (“must” -> “muss”, “foreign national” -> “Ausländer”) without the translator lifting a finger. It also swaps the order of the tags when the structure of the sentence requires it.
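
A client-side check for this behavior could compare the tags on both sides while ignoring their order. The sketch below is only an assumption about how one might verify the output; the tag pattern covers simple tags like those in the example and the helper name tags_preserved is hypothetical.

```python
import re
from collections import Counter

# Assumed pattern for simple opening and closing tags such as <b> or </a:link>
TAG = re.compile(r"</?[A-Za-z][\w:.-]*>")

def tags_preserved(source, target):
    """True if both sides contain the same tags the same number of times, in any order."""
    return Counter(TAG.findall(source)) == Counter(TAG.findall(target))

source = ("When applying for a visa, a <a:link>foreign national</a:link> "
          "<b>must</b> bring all the necessary documents.")
target = ("Bei der Visumbeantragung <b>muss</b> ein <a:link>Ausländer</a:link> "
          "alle notwendigen Dokumente mitbringen.")

print(tags_preserved(source, target))
print(TAG.findall(source))
print(TAG.findall(target))
```

Comparing tag counts rather than tag sequences is deliberate: as the example shows, a correct translation may legitimately reorder the tagged spans.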

The seamless integration of the glossary override and markup transfer features in AppTek’s system also allows it to apply tags around translations that are matched with glossary entries, a feature known as “term highlighting”. This can prove especially useful in interactive settings, where users can view both the source sentence and its target translation, enabling them to quickly identify, memorize, or act upon the translations of specific, important terms.
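
On the display side, an application receiving such a tagged translation needs to turn the tags into highlight positions. The following sketch assumes a hypothetical <term> tag around glossary matches; the actual tag name and output format of AppTek's system are not specified here.

```python
import re

# Hypothetical tag name for highlighted glossary translations
TERM = re.compile(r"<term>(.*?)</term>")

def extract_highlights(tagged):
    """Strip <term> tags and return the plain text plus (start, end) character spans."""
    parts, spans = [], []
    pos = 0   # length of the plain text built so far
    last = 0  # index in the tagged string just after the previous match
    for m in TERM.finditer(tagged):
        before = tagged[last:m.start()]
        parts.append(before)
        pos += len(before)
        term = m.group(1)
        spans.append((pos, pos + len(term)))
        parts.append(term)
        pos += len(term)
        last = m.end()
    parts.append(tagged[last:])
    return "".join(parts), spans

tagged = "Moraš dati novac <term>Ivanu Horvatu</term> ili kako se već zove."
plain, spans = extract_highlights(tagged)
for start, end in spans:
    print(plain[start:end])
```

Because the tags come from the translation system itself, the highlighted span covers the inflected form ("Ivanu Horvatu") rather than the base form stored in the glossary, which a plain string search for the glossary entry would miss.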



Real-World Example: Helping Doctors and Patients Communicate

One of the most exciting use cases of AppTek’s term highlighting feature comes from a project funded by the German Health Ministry. The goal was to help non-professional medical interpreters in remote interpretation settings communicate between German-speaking doctors and patients who spoke other languages. The speech of the latter was automatically transcribed and translated in real time and the interpreters could choose to use the automatic translations or not via an app.

Doctors often use complex medical terms, which interpreters need to get right. AppTek’s system highlighted the glossary-based translations of medical terms in color within the translation app, making it easier for the interpreters to quickly spot the translation as shown in the image:



On the left, you see the transcript and translation of the conversation in alternating turns between the German-speaking doctor and the Arabic-speaking patient, with the glossary-based translations and the corresponding source terms highlighted in green. On the right is a different view in the app where the glossary terms can be added or edited.

Improved machine translation with glossaries and markup isn’t just useful to amateur interpreters or doctors. Term highlighting has wide-ranging applications in fields such as education and intelligence/analytics, where precise term recognition and translation are of great significance.



More Than Just Machine Translation

Machine translation has come a long way, but it’s still evolving. By giving users control over glossaries, formatting, and placeholders, AppTek is making translations more accurate, reliable, and useful across industries. Whether you’re translating subtitles, software manuals, or helping doctors communicate, these innovations ensure that machine translation works for you, not the other way around.

For businesses, this means faster, more accurate translations with fewer errors. For translators, it means less time fixing mundane errors and more time focusing on more critical aspects of the translation task.

If you’re curious to see how smarter NMT can streamline your translation workflow, now is the time to explore the possibilities. After all, the future of machine translation isn’t just about understanding language; it’s about understanding context, style, and the little details that matter.
