Surprisingly good, but not as good as DeepL or Google Translate. Yet.
ChatGPT has taken the world by storm. 100 million users just two months after OpenAI launched it: that’s not something others will accomplish again any time soon. Whether in the news, on tech blogs, or on LinkedIn, new use cases surface almost every day. Translating texts is one of them. That’s why ChatGPT is also interesting for us as a translation agency.
The translation industry has been dealing with different language models for quite some time, including the various GPT versions. However, the general assessment of the systems is still very different.
In their latest report on machine translation, for example, Intento already counts GPT-4 and ChatGPT among the ten best machine translation systems, at least for some language pairs. At Google, the performance of at least the predecessor model GPT-3 and similar systems seems to be assessed somewhat more cautiously.
Our own test runs with ChatGPT fall right in between. Stylistically, ChatGPT is often better and more flexible than DeepL and Google Translate. In terms of content, however, it regularly “mangles” the texts.
The quality of ChatGPT translations:
a case study
For our test run, we chose a text excerpt from a German online shop. The excerpt consisted of 763 words and contained product names, feature lists and product descriptions.
To fully exploit the possibilities of ChatGPT, we created three different prompts. We asked for (1) a simple translation from German into English, (2) a pre-edit followed by a translation, and (3) a pre-edit plus translation followed by a post-edit. In prompt 2, ChatGPT was to first optimise the source text for machine translation and only then translate it. The post-edit in prompt 3 was to clean up stylistic and conceptual inconsistencies and localise units of measurement.
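As an illustration, the three prompt variants might be set up along the following lines in a script. The wording below is our own reconstruction for illustration purposes, not the exact prompts used in the test.

```python
# Hypothetical reconstructions of the three prompt variants from the case
# study; the exact wording used in the test is not part of this article.
SOURCE_LANG, TARGET_LANG = "German", "English"

def build_prompt(variant: int, source_text: str) -> str:
    """Return one of the three test prompts for a given source text."""
    if variant == 1:  # simple translation
        task = f"Translate the following text from {SOURCE_LANG} into {TARGET_LANG}."
    elif variant == 2:  # pre-edit, then translate
        task = (f"First optimise the following {SOURCE_LANG} text for machine "
                f"translation, then translate it into {TARGET_LANG}.")
    elif variant == 3:  # pre-edit, translate, post-edit
        task = (f"First optimise the following {SOURCE_LANG} text for machine "
                f"translation, then translate it into {TARGET_LANG}, and finally "
                "post-edit the result: fix stylistic and conceptual "
                "inconsistencies and localise units of measurement.")
    else:
        raise ValueError("variant must be 1, 2 or 3")
    return f"{task}\n\n{source_text}"
```

Keeping the three variants in one place like this also makes it easy to run the same source text through all of them for comparison.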
We used DeepL as a basis for comparison. Especially for German-English, it is considered the leading machine translation system. Finally, we subjected all versions to a quality check in which we counted the type and frequency of errors.
How does ChatGPT work?
Some basics for a better understanding.
Before we describe our experiences with ChatGPT, it is worth taking a brief look at the technology itself. ChatGPT is built on a large language model, or LLM for short. Such models can generate amazingly human-sounding text by calculating a plausible continuation of the preceding text, the prompt.
Considering that there is an almost infinite number of possible questions and prompts from users, the question is: How is this possible? The answer: by training the system on a lot of data covering many different topics – millions of web pages, books, and so on. GPT-3, for example, has around 175 billion parameters and was trained on hundreds of billions of words of text.
A special kind of neural network, a transformer, recognises patterns and correlations in this data. Using these patterns, the language model then calculates probabilities for how a text fragment can be continued. Unlike conventional auto-completion systems, such as those used on mobile phones or in other word-processing programmes, it takes into account not only a sentence, but also longer sections of text and sometimes even the entire text.
An analogy may be helpful here: Imagine you are reading a crime novel. To find out who is responsible for a mysterious murder, you pay attention to every possible clue while reading. You collect these clues and gradually put the pieces of the puzzle together. Perhaps there is a clear prime suspect, but at least initially there are many possible culprits. Transformers handle text in a similar way to predict the next word – only on a mathematical level.
The system does not simply choose the most likely word, the prime suspect. Instead, it considers many options with different probabilities and regularly chooses lower ranked words as well. This makes the texts seem less predictable and therefore more creative. More human. For this reason, ChatGPT outputs different texts even in response to the same prompt. It is precisely this flexibility that makes the system so interesting for translations. Conventional MT systems such as DeepL or Google Translate are more rigid and hardly take the context into account.
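A minimal sketch of this sampling behaviour, assuming a toy vocabulary and invented scores (the real model works over tens of thousands of tokens, and the scores come from the network itself):

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens
    the distribution, giving lower-ranked words a better chance."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_word(candidates, logits, temperature=1.0, rng=random):
    """Pick the next word at random, weighted by its probability,
    instead of always taking the 'prime suspect' (the argmax)."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

With a very low temperature the model almost always picks the top candidate; with a higher one, it regularly surprises you – which is exactly why the same prompt can yield different texts.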
By the way: If you want to understand the technology in more detail, we recommend Stephen Wolfram’s “What Is ChatGPT Doing … and Why Does It Work?” as an introduction.
The quality of ChatGPT translations:
In general, it’s striking how many mistakes both DeepL and ChatGPT make. 67 to 101 errors in about three pages of text would not be acceptable for any company. So there is no getting around professional post-editing by a human translator.
In our example text, ChatGPT was able to reduce the number of serious errors slightly compared to DeepL, but this came at a price: a significant increase in minor and medium errors. In this case, however, pre-editing was also able to reduce the number of medium errors somewhat.
The evaluation above took into account how serious the errors were, but it did not yet distinguish between error types. In the context of a quality check, these are usually weighted. For most customers, for example, the correct spelling of their products is more important than correctly placed commas. If we include our standard weighting, we get the following error values:
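The idea behind weighting can be sketched roughly as follows. The category weights below are invented for illustration; they are not our actual standard weighting.

```python
# Illustrative error weighting: each error found in the QA check is
# multiplied by a severity weight and an error-type weight.
# These weights are INVENTED for this sketch, not the agency's real ones.
SEVERITY_WEIGHT = {"minor": 1, "medium": 3, "serious": 10}
TYPE_WEIGHT = {"terminology": 2.0, "spelling": 1.0, "punctuation": 0.5}

def weighted_error_score(errors):
    """errors: list of (error_type, severity) tuples from the QA check."""
    return sum(TYPE_WEIGHT[t] * SEVERITY_WEIGHT[s] for t, s in errors)
```

The effect is that one serious terminology error can outweigh a whole cluster of misplaced commas, which matches how most customers actually perceive quality.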
Here, too, we see that although serious errors cannot be avoided by ChatGPT, they can be reduced somewhat compared to DeepL.
Style and text flow
An advantage with disadvantages.
Almost 40 percent of the segments translated by ChatGPT matched DeepL’s translation exactly – a surprise for us as well. Almost 80 percent were very similar. Pre-editing reduced this similarity to 67 percent, post-editing to 55 percent – for better or worse.
ChatGPT excelled in product descriptions that were available as continuous text. The texts sounded more fluent and less clumsy than those from DeepL. The language level increased significantly compared to DeepL, especially with post-editing. While DeepL remained very close to the source text in terms of sentence structures, ChatGPT occasionally broke away from it and thus produced a text that sounded more natural to a native speaker.
However, this flexible approach to the source text came at a price. Some sentences were omitted, presumably because ChatGPT considered them redundant. This can be good for readability, but it is often problematic in a web shop, for example when the texts belong to different sections of the website and have to appear there.
Another problem ChatGPT had was the capitalisation of list entries. The language model sometimes wrote materials like “90% cotton”, “10% viscose” in capital letters, even if the entries were written in lower case before and after. Such inconsistencies are easy to fix, as they do not require rewriting entire sentences, but in total they can be tedious, time-consuming, and costly.
ChatGPT vs. DeepL:
style and terminology
The amount of work and cost required to raise ChatGPT’s translation to a professional level also depends on the number of errors. After all, even many minor errors add up to a lot of work if the text is to be smoothed out. But the nature of errors is also important. Adding a comma is one thing, having to rewrite a sentence is another.
Compared to DeepL, ChatGPT was particularly convincing stylistically. It did not make any serious mistakes and made errors of medium severity much less frequently. On the other hand, it made quite a few minor errors. The errors shown include our standard weighting.
You can see very clearly that both pre-editing and post-editing have a positive effect on serious terminological errors. At the same time, the number of minor errors increased. ChatGPT tended to handle product names better, but material names and product properties worse. For anyone who wants an error-free text, the revision effort required is hardly reduced.
A technical challenge
How CAT software deals with machine translation systems.
ChatGPT cannot and should not be blamed for every mistake made in these translations. Whether and how new products should be translated, for example, often cannot be decided without consulting the client. For this reason, we often develop term databases, which we then use in combination with machine translation, if the client wants such a translation. In this way, a fair number of serious terminological errors can be avoided with both DeepL and ChatGPT. However, this also increases the effort.
The maintenance and integration of terminology databases and translation memories is usually done using CAT software (CAT = computer aided translation). However, if machine translation systems such as DeepL and ChatGPT are controlled via CAT tools, a technical problem arises: CAT tools such as memoQ or Trados usually send the source text to the MT system segment by segment. The MT system then translates each segment individually.
This hardly reduces DeepL’s quality; even the online version only seems to take the previous sentence into account. For a large language model like GPT-4, however, it is poison: important context is lost. The better style and linguistic flexibility are due precisely to the ability to process context beyond individual sentences. The way CAT software works thus occasionally reduces the quality of ChatGPT’s output and increases the likelihood of continuity errors and inconsistencies.
What to do? Translation agencies like DialogTicket hope, of course, that the developers of CAT tools will adjust the software accordingly. If they did, we could integrate large language models into our translation processes in ways that are similar to how we integrated MT systems. For the time being, however, those who want to fully exploit the strengths of ChatGPT should consider alternative solutions, for example via an API interface.
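For illustration, an API-based workflow could carry the preceding segments along as context when translating each new segment – something the segment-by-segment CAT workflow cannot do. The sketch below only builds the request text; the actual API call (endpoint, model name, authentication) is left out because it depends on the provider.

```python
def build_context_request(segments, index, context_window=3):
    """Build a translation request for segment `index` that includes the
    preceding segments as context. A generic sketch, not a vendor API."""
    context = segments[max(0, index - context_window):index]
    parts = ["Translate the last segment from German into English."]
    if context:
        parts.append("Context (preceding segments, for reference only):")
        parts.extend(context)
    parts.append("Segment to translate:")
    parts.append(segments[index])
    return "\n".join(parts)
```

Even a small sliding window like this would preserve much of the cross-sentence context that gives ChatGPT its stylistic edge.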
ChatGPT vs. DeepL:
Omission errors and spelling
As mentioned above, ChatGPT tends to delete passages that it “perceives” as redundant, at least if you do not transmit the segments individually via a CAT tool. Even when using CAT software, regardless of the translation system used, it can happen that certain passages are overlooked when loading a document and are then not translated. The weighted omission errors listed in the evaluations below are not of this kind, however.
If ChatGPT is only asked to translate, spelling and punctuation deteriorate compared to DeepL. Pre-editing and post-editing make a real difference. It is a pity that omission errors occurred during pre-editing and post-editing. After all, both systems were able to avoid serious and moderately serious grammatical errors. MT systems certainly have improved quite a bit during the last ten years or so.
Conclusion: ChatGPT versus MT industry standard
Advantages and disadvantages at a glance.
We would like to provide you with a list of the strengths and weaknesses of ChatGPT. This list goes beyond the case study we presented and includes experiences we have had with the system over the past few months as well as points made in the literature. Let’s start with the advantages:
- Flexibility: Unlike DeepL or Google Translate, ChatGPT is able to take into account specifications that go beyond a mere translation. Pre-editing and post-editing can be integrated directly into the translation process. Stylistic changes can also be implemented quickly.
- Style: ChatGPT can produce texts that sound fluent and natural. We see opportunities here, especially with marketing texts, where conventional systems often fail. Where a literal translation is not important or where it would be inappropriate, ChatGPT has the edge.
- Context: Unlike DeepL, ChatGPT can take the context of a text into account and include it in the translation – but only if it is not controlled via one of the common CAT tools. This can avoid continuity errors that classic MT systems are prone to make.
- The ability to correct and learn: If there are problems with a translation, you can follow up directly with a prompt. This does not always work, but sometimes it does. And “sometimes” is still better than “never”, because classic MT systems do not offer this possibility at all.
- Background knowledge: ChatGPT can make use of information that goes beyond the source text when translating. This sometimes improves the translation, but sometimes also creates problems.
This brings us to the disadvantages:
- Compliance: Currently, many texts should not be uploaded for compliance reasons. Anyone who wants to use ChatGPT for translations in a corporate setting thus needs an approval process. Many texts may not be forwarded to third parties as a matter of principle, especially not to parties outside Europe.
- Language support: The quality of a ChatGPT translation varies greatly from language to language. German, English and Spanish seem to work reasonably well, other languages not so much. DeepL and Google offer much broader language support.
- Accuracy: Precisely because ChatGPT utilises information external to the text, incorrect information sometimes creeps in. And because it pays attention to a good flow of language, it sometimes deletes passages.
- Speed: DeepL and Google Translate can translate large amounts of text quickly. ChatGPT is comparatively slow, especially if you consider the time needed to adjust prompts etc. In addition, it does not yet accept longer texts in one go.
We hope that ChatGPT and similar language models will run on local servers in the near future. This would allow companies to better navigate the risks associated with these systems. ChatGPT is stylistically very good, but in its current version it is only partly suitable as a translation system. We currently see it more as a useful addition to existing systems than as a replacement. Much depends on the improvements future updates bring.