Opening the black box, how machine translation works (in broad strokes)

I never really made it a secret that I am very critical of machine translation, and can get rather heated when someone suggests using it for “serious” applications. I will admit that some of it is probably my pride as a translator, knowing that translation is an act that requires a level of human ability that computers won’t be able to touch for many years to come.

However, the largest part is because I’ve studied (although very briefly) the science and theory behind these systems as a part of my Masters coursework. From that experience, I’ve seen first hand what sort of ugly little engineering problems stand in the way of even the best systems today. Much of machine translation’s appeal is in its magical black-box properties. The following quote probably sums up the sentiment:

“If you put garbage in a computer nothing comes out but garbage. But this garbage, having passed through a very expensive machine, is somehow ennobled and none dare criticize it.” – Anonymous

The problem is that these systems are very large, quite complex, and appear like magic to the typical user. Because it looks like magic, they put more trust in it than is warranted, and this is dangerous. So, as a translator that admits that machine translation has limited but valid uses, I’d like to spread information on how these things work, so people can decide for themselves without seeing just magic.

The goal

As with all engineering problems, machine translation starts with a goal: we want to take a text in a given language (the source language) and convert it into a second text in another language (the target language), such that the meaning is preserved. It’s more or less the broad goal of translation in general, though you can spend centuries arguing over specific points, like what exactly does that “meaning” part really mean? (I’ll try to write about that some other day…)

Don’t let the ‘well duh’ simplicity fool you; that’s really just one way to state the goal. I’ve seen theoretical models of machine translation that have some novel interpretations of what is equivalent to the goal. For example, one way to interpret the Hidden Markov Model approach to machine translation is that you meant to express your meaning in the target language, but somehow it came out in the source language. The system then tries to find the sentence in the target language that has the greatest probability of generating that sentence in the source language, using probability models of the sequence of words and such.

HMMs are but one out of the many ways to approach machine translation, and I’ll discuss some broad approaches later on. For now, just appreciate the fact that the very simple and supposedly “clear” goal we had can take on some pretty crazy expressions as mathematical/computational models.

Broad categories of engines

For this piece, I’ll be taking Wikipedia as my broad outline source since I’m not a formal expert in the field by any stretch of the imagination. I do have enough training to read through and convert the formulas and notions into something more easily digested, as well as applying my own experiences and knowledge where applicable.

In very broad sweeps, you can classify the engines that do machine translation into a few categories.

  1. Rule-based systems – translate using linguistic rules
  2. Statistical systems – translate using statistical methods on bilingual corpora
  3. Example-based systems – translate using a corpus of bilingual examples and extrapolating from them

Because my coursework was more focused on the area of information retrieval (that is, search engines) I am most familiar with the statistical systems, which are currently the most popular and powerful according to Wikipedia. I do have some passing knowledge of rule-based systems, and almost none of example-based systems.

Rule-based systems

The foundation for rule-based systems is very easy to understand intuitively. Languages can be considered to have two big pieces: the meaning of the words (the semantics), and then the rules for putting the words together to create the meaning of a sentence as a whole (the syntax, morphology, etc.). This is all very intuitive because we learn grammar in school, so we are intimately familiar with the notion that languages come with certain rules that must be followed if you want to be understood.

Looking at various descriptions of rule-based systems, one very common notion is the use of an intermediate language, an interlingua. The idea is that you take your source text, extract as much information from it as you can (the forms of the words, how they’re ordered, the part-of-speech relationships, and so on), then store all of this in an abstract representation. Later, you take this representation and, applying another set of rules, convert it to the target language, producing your translation.


On the whole, this is a very intuitive way of thinking about translation. But as with most intuitive things, the details get really scary. Rule-based systems face some permanently open questions: exactly what you want to capture in your interlingua (and how you can capture it), what the rules are, and the ever-present “Word Sense Disambiguation” problem. There are always many more issues large and small, but those three stand out as the most obvious ones to me. Of those, Word Sense Disambiguation is a general problem with machine translation, so I’ll cover it at the end in its own section.

The hardest job in rule-based systems, I think, is the generation of rules. I can’t think of any natural language off the top of my head that is 100% regular. Artificial languages like Esperanto, which are strictly designed to be regular, don’t count. We are all familiar with situations where a given language rule has exceptions: “you can make a noun plural by adding an ‘s’ at the end, unless it ends in ‘y’, then it’s ‘-ies’, or unless it’s a special word with an irregular plural like ‘1 virus’ and ‘20 viruses’ (or, as some argue virulently, 20 virii!!)…”
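To make the rule-and-exception problem concrete, here is a minimal sketch of a hand-written pluralizer in Python. Everything in it (the function name `pluralize`, the `IRREGULAR_PLURALS` table, the particular rules) is my own illustration, not taken from any real translation system.

```python
# Hypothetical toy rule set for illustration only; a real system would need
# far more rules and a much longer exception list.
IRREGULAR_PLURALS = {"child": "children", "mouse": "mice", "sheep": "sheep"}


def pluralize(noun: str) -> str:
    """Pluralize an English noun using a handful of hand-written rules."""
    # Exceptions have to be checked before any general rule.
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    # Consonant + y becomes -ies (city -> cities); vowel + y keeps the y (day -> days).
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    # Words ending in a sibilant take -es (virus -> viruses, box -> boxes).
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Default rule: just add -s.
    return noun + "s"


for word in ["car", "city", "day", "virus", "child", "potato"]:
    print(word, "->", pluralize(word))
```

The last word shows the problem: this sketch turns "potato" into "potatos", because nobody wrote a rule for it yet. Every fix like that is one more rule or one more dictionary entry, and that is for a single small corner of a single language.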

Then there are just word usage rules that are really difficult to deal with, like “you can drive a car, but you pilot a plane.” Then there are cases where people don’t even agree on whether a rule is valid! Different dialects often can have this problem, even native speakers matched by demographics can disagree. The list of difficult rules in a language goes on and on and on, so you can get a sense of just how difficult the process can become.

Capturing things into the interlingua then springboards off the rule problem, because you might want to record some language feature, or you may actually have recorded it in the source language, but you aren’t able to record/encode the language feature easily in the target language. Meanwhile, you can’t store everything or else your system would be impossibly slow. You have to pick and choose, and that introduces even more places to make mistakes.

Statistical systems

These systems are based on statistical methods, using huge corpora of data to power the underlying probabilities that drive the system. They can vary greatly in how they are implemented; Google’s translation engine is classified under here. These systems are often based on the probability of chains of words appearing, and so use the huge corpora of text to figure out how likely a given sentence is to appear.

A bit of math

If you look at the Wikipedia page for this you’ll see Bayes’ Theorem as the foundation. To clarify, I’ll use t for the target language, and s for the source language.

P(t|s) ∝ P(s|t) P(t)

The probability P(t|s) (the probability of seeing t given that we see s) is our target: we want to find the sentence t that gives us the highest probability, the most likely translation of s into t. In order to do that, we look at the other side of the formula.

On the other side, we have the translation model P(s|t), how often we see s when we see t. We also have a language model P(t), how often we see t in general, which would supposedly throw out illegal and awkward sentences.

Note: For those who are more familiar with Bayes’ Theorem, you’ll notice that a term is missing on the right side, the division by P(s), which is why the formula says “proportional to” (∝) rather than “equals.” This is because when we’re translating, we’re given s, so P(s) is the same for every candidate t and has no effect on which t scores highest.
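Here is a minimal Python sketch of what “find the t with the highest probability” looks like once you have numbers. The candidate sentences and every probability below are made up by me purely for illustration, and a real system searches over vastly more candidates than three.

```python
# Made-up numbers for one fixed source sentence s; none come from real data.
# Each candidate translation t maps to (P(s|t), P(t)).
candidates = {
    "where is the cake": (0.30, 0.020),
    "the cake is where": (0.30, 0.001),
    "where cake is": (0.35, 0.0005),
}


def score(t: str) -> float:
    """Translation model times language model: P(s|t) * P(t)."""
    translation_prob, language_prob = candidates[t]
    return translation_prob * language_prob


# The "translation" is simply the candidate with the highest score.
best = max(candidates, key=score)

# Dividing every score by the same P(s) (a hypothetical value here)
# cannot change which candidate wins.
p_s = 0.5
best_with_denominator = max(candidates, key=lambda t: score(t) / p_s)

print(best)
```

Notice that "where cake is" has the best translation-model number in this toy table, but the language model drags it down because it is an unlikely English sentence. That interplay is the whole point of having two models.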

Now, you’re probably looking at this and going “What the heck? Where do the numbers come in?” This is the complicated part. The formula I just described above is just the general outline of statistical models; there can be other general models depending on your approach. In order to get actual numbers, we have to do parameter estimation, where we take our data and make a best-guess estimate of what each value is. There are countless ways of estimating parameters, and depending on how you do it, and how they’re put together, your system can do better or worse. Even if many people are following the same general recipe, there’s more than enough room to be creative.


First let’s look at the translation model, P(s|t). Here, we’re looking at the relationship between s and t, so we’re going to want to use information that takes both into account. One way to do that is to have two bodies of text, one a translation of the other, and use that as data to help figure out P(s|t).

Let’s consider a single word for illustration purposes. The English word “where” (in ‘where is the cake’) translates to the Japanese word “doko” (in ‘Keeki wa doko desu ka?’). Note that the word order is totally different, but for the sake of keeping things simple, let’s say you magically (maybe through your handy dictionary) know that “where” and “doko” are a translation pair. Now, you look at a few million other texts, and find that “where” and “doko” appear in parallel translations some 70% of the time (I made that number up) because in English we can use ‘where’ in the sense of ‘this is where we eat’ which is more ‘koko wa watashi-tachi ga taberu tokoro desu’.

With the above information, you can get a sense of just how often the word ‘where’ should be used when you see ‘doko.’ Things get much more complicated when you have to start accounting for more than one word; you might have to look at 2-word sequences, or 3-word ones, but this is the basic idea. Also, you can see that there’s the problem where the word orders are totally different, so you can’t assume word 1 in s associates with word 1 in t.
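The counting idea can be sketched in a few lines of Python. I am assuming the hard part is already done, namely that someone has handed us word pairs that are known to line up; the pairs and their counts below are invented to match my made-up 70% figure.

```python
from collections import Counter

# Hypothetical, already-aligned (English word, Japanese word) pairs.
# Real systems have to learn these alignments from sentence pairs.
aligned_pairs = (
    [("where", "doko")] * 7
    + [("where", "tokoro")] * 3
    + [("cake", "keeki")] * 5
)

pair_counts = Counter(aligned_pairs)
english_counts = Counter(english for english, _ in aligned_pairs)


def translation_prob(japanese: str, english: str) -> float:
    """Relative-frequency estimate of P(japanese | english)."""
    if english_counts[english] == 0:
        return 0.0  # never saw this English word at all
    return pair_counts[(english, japanese)] / english_counts[english]


print(translation_prob("doko", "where"))    # 7 of the 10 "where" pairs
print(translation_prob("tokoro", "where"))  # 3 of the 10 "where" pairs
```

This is just counting and dividing, which is the simplest possible estimate. It says nothing about why "where" sometimes comes out as "tokoro"; it only records that it does.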

Next, let’s look at the language model P(t). Often, these are represented as n-gram models, which just means “we look at a sequence n words long, and see how likely it is we see such a sequence.” n can be 1 or more, though by about 4-5, it becomes extremely rare to find matches on a practical level. Why is that? Well, think about how many words you know (a loaded question, the definition of a ‘word’ is HARD, we’re not getting into that unless you guys want me to write about it), but let’s say you use about 60k words. If you were looking at a 1-gram (unigram) model, you’d just go and count how often you see a given word, divide by all the words you’ve seen so far, and presto! you have your probability model.
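The unigram recipe is short enough to write out directly. This Python sketch uses a one-sentence toy corpus that I made up; a real model would count over millions of sentences.

```python
from collections import Counter

# Toy corpus, invented for illustration.
words = "the cake is on the table and the tea is hot".split()

counts = Counter(words)
total = len(words)

# Unigram model: how often each word appears, divided by all words seen.
unigram_prob = {word: count / total for word, count in counts.items()}

for word, prob in sorted(unigram_prob.items(), key=lambda item: -item[1]):
    print(f"{word}: {prob:.3f}")
```

Here "the" shows up 3 times out of 11 words, so its probability is 3/11, and a word that never appears in the corpus has no entry at all.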

For 2-grams (bigrams) you’d have to consider all pairs of words, so that’s 60k*60k = 3.6 billion possibilities, and most of them are probably rarely seen (how often do you see ‘electric anthill’ in a normal sentence?). You’d have to go through a huge amount of data to get a sense of how often something pops up, because who knows, once in a while you might see something about electric anthills (Google sees 25 for me). For 3-grams, the number is 216 trillion. At this point you simply run out of data, and most counts will be 0, even though the true probabilities should be slightly nonzero.
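The arithmetic, and the sparsity it causes, can be checked with a short Python sketch. The 60k vocabulary is my rough figure from above, and the one-sentence corpus is made up.

```python
from collections import Counter

VOCAB_SIZE = 60_000  # rough vocabulary size assumed in the text

possible_bigrams = VOCAB_SIZE ** 2   # all ordered pairs of words
possible_trigrams = VOCAB_SIZE ** 3  # all ordered triples of words

print(f"{possible_bigrams:,}")   # 3,600,000,000
print(f"{possible_trigrams:,}")  # 216,000,000,000,000

# Count the bigrams that actually occur in a toy corpus.
words = "the cake is on the table and the tea is hot".split()
bigram_counts = Counter(zip(words, words[1:]))

# A Counter returns 0 for anything it never saw.
unseen = bigram_counts[("electric", "anthill")]
print(len(bigram_counts), unseen)
```

Eleven words give only ten bigrams, against billions of possible ones. Real corpora are enormously bigger than this, but the gap between what is possible and what is observed never closes.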

Get to this point, and you can then come up with little mathematical tricks to make sure that nothing is purely 0, and everything starts out with a sliver of probability, and you can then argue for a few years over exactly how much is enough, or how to adjust things so that it comes closer to reality as you train on data.
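One of the simplest such tricks is add-one (Laplace) smoothing: pretend you saw every possible pair one extra time. I am picking it here only because it is easy to show, not because any particular engine uses it. The toy corpus in this Python sketch is made up.

```python
from collections import Counter

# Toy corpus, invented for illustration.
words = "the cake is on the table and the tea is hot".split()
vocab = sorted(set(words))

bigram_counts = Counter(zip(words, words[1:]))
# How many times each word appears as the first half of a bigram.
context_counts = Counter(words[:-1])


def smoothed_bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) with add-one smoothing over the toy vocabulary."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


print(smoothed_bigram_prob("the", "cake"))  # seen once after "the"
print(smoothed_bigram_prob("the", "hot"))   # never seen after "the", but not 0
```

With 8 distinct words, and "the" starting 3 bigrams, the seen pair gets (1 + 1) / (3 + 8) = 2/11 and the unseen pair gets 1/11. Whether one extra imaginary count is too generous is exactly the kind of thing people argue about for years.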

As a side note, the HMMs I mentioned earlier are yet another way to estimate parameters, and the technique can be applied to some form of the estimations above such as the n-gram models, which look for the probability of a stream of words, maybe to estimate how likely a sequence is for something you’ve never seen before.


After the above description, notice that I made absolutely 0 mention of the “meanings” of words. It’s just probabilities and things appearing within a few words of each other. There are places where you can put notions of meaning, to help make searching for possible sentences faster and better, but the general form doesn’t require it.

Now do you begin to see why I think machine translation won’t catch up to humans for many years, if at all? There’s no real magic inside the black box. It’s all very fascinating and sometimes brilliant pieces of mathematical formulas and probabilities dancing across the processor, but in the end, it’s all symbol manipulation with no understanding. The fact that machine translation has come as far as it has using just these notions is the real miracle, and the fact that we have trouble passing certain barriers is a testament to the complexity of what humans do in their lives.

Word Sense Disambiguation (WSD)

Earlier, I had mentioned that the problem of WSD is common to all translation systems. This is something of an understatement, since it’s true of all translation, period. The problem is simple: words have more than one meaning, and it makes a difference which meaning you pick. Normally, from the context, readers will be able to figure out the meaning they’re supposed to use; however, this isn’t always true. Sometimes, it’s just plain bad writing, and other times there’s a disagreement on what something can possibly mean.

Human translators handle this in two ways. First, they understand the language, so they can use the context and pick the right meaning. Machines are horrible at understanding context, in fact, figuring out the topic of a paragraph or pulling out an answer to a question from a sentence (topic extraction) is still an open problem as far as I know. Second, human translators can do research. If they don’t know how to handle something, they can contact people, look up history, do all sorts of work to figure out exactly what a word is supposed to mean. Machines can’t do this of course.


So, above I’ve sketched, in extremely broad strokes, how machine translations work as I understand them. As technology moves onward, I imagine that this would eventually become obsolete. I’ve simplified many things and glossed over lots more because the cutting edge requires much more mathematical expertise than I have, plus it would require reading the literature on the topics, which I don’t have the training for to be honest.

I might have gotten some details wrong, so I’ve tried to avoid getting too deep into those details, so it shouldn’t be too bad. I still don’t recommend citing this as any authority, but I feel there’s not nearly enough explanation for what goes on with technological systems. Without decent explanations of what goes on, people have no way of making educated judgments on how far to trust a given system, and I find that to be extremely dangerous.

I hope that readers here now have a better sense of what happens under the hood of popular translation systems such as Babelfish (powered by SYSTRAN), or Google’s statistical translator, or the Fujitsu ATLAS-II. If by exposing the guts of the systems, I’ve convinced just one person out in the world that machine translation isn’t a panacea, then I’ll be overjoyed.

Next time, I think I’ll write a bit about my opinions on where, and when machine translation can be important.

