[ from South China Morning Post ]
Linguist Teaches Machines the Mysteries of Language
Comparing the patterns in blocks of text the key to De Kai Wu's programs for Chinese-English computer translation
Kanglei Wang
Imagine learning how to translate from Chinese to English by reading millions of sentences from Hong Kong's bilingual Legislative Council transcripts.
You look at the Chinese. You look at the English below.
Actually, you don't know either language: you are a cluster of 75 computers in Professor De Kai Wu's computational linguistics and musicology lab at the University of Science and Technology.
But as a machine, you are not looking at the unfamiliar sentences two at a time, as a person would. Instead, you are using statistics to relate huge heaps of data to one another simultaneously.
You notice that in thousands of instances, the English "government building" appears in the same chunk of text as the Chinese phrase for it, so it is highly probable that these chunks mean the same thing.
You study these bilingual patterns, billions of them, cranking away at your algorithms.
Mostly, you work unguided, making your own dictionary as you go along based on the multitude of connections you detect between groups of words. When you make a mistake, a human researcher may correct you with a few programming keystrokes, and over time, you learn to make the right associations in the right context.
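To make the idea concrete, here is a minimal sketch of that kind of co-occurrence counting, written against a tiny invented parallel corpus. The data, scoring formula and function names are illustrative only, not Wu's actual software; real systems estimate far more sophisticated probabilistic models over millions of sentence pairs.

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned corpus: each pair is (English sentence, Chinese sentence).
# Real systems read millions of such pairs, e.g. Legco transcripts.
parallel_corpus = [
    ("the government building", "政府 大樓"),
    ("the government said", "政府 說"),
    ("a new building", "新 大樓"),
]

def cooccurrence_counts(corpus):
    """Count how often each English word appears in the same
    sentence pair as each Chinese word."""
    pair_counts = Counter()
    en_counts = Counter()
    zh_counts = Counter()
    for en_sent, zh_sent in corpus:
        en_words = set(en_sent.split())
        zh_words = set(zh_sent.split())
        for e, z in product(en_words, zh_words):
            pair_counts[(e, z)] += 1
        en_counts.update(en_words)
        zh_counts.update(zh_words)
    return pair_counts, en_counts, zh_counts

def likely_translations(corpus, top=3):
    """Rank Chinese words by a simple association score for each English word -
    a crude stand-in for the probabilities a real translation model learns."""
    pair_counts, en_counts, zh_counts = cooccurrence_counts(corpus)
    table = {}
    for (e, z), c in pair_counts.items():
        score = c / (en_counts[e] + zh_counts[z] - c)  # Jaccard-style overlap
        table.setdefault(e, []).append((score, z))
    return {e: sorted(zs, reverse=True)[:top] for e, zs in table.items()}

print(likely_translations(parallel_corpus)["government"])
# "政府" scores highest because it co-occurs with "government" in both pairs
# where either appears - exactly the kind of association described above.
```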
This is the world of computational linguistics: a field that strives to model natural human language on the computer. Google Translate and Siri are some of the recent products of these hi-tech linguists.
But even a decade ago (which in the cyberage is more like a century), Wu was already achieving accuracy rates of 86 to 96 per cent on English-Chinese translations from a set of computers that, yes, really did read Legco documents.
His program did not just operate on individual words, but on chunks built up from other, smaller chunks in relation to each other - a mathematical model called inversion transduction grammars, which enormously sped up the process of learning to translate.
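As a rough illustration of that idea (a sketch under simplifying assumptions, not Wu's implementation), an inversion transduction grammar derives both sentences from one shared tree of chunks, where each node either keeps its two sub-chunks in the same order in both languages or inverts their order on one side:

```python
# Each node is either a leaf carrying an (English, Chinese) chunk pair,
# or an internal node that combines two sub-nodes "straight" (same order
# in both languages) or "inverted" (order flipped on the Chinese side).

def leaf(en, zh):
    return ("LEAF", en, zh)

def straight(left, right):
    return ("STRAIGHT", left, right)

def inverted(left, right):
    return ("INVERTED", left, right)

def read_off(node):
    """Walk one shared tree and read off both sentences at once."""
    kind = node[0]
    if kind == "LEAF":
        return [node[1]], [node[2]]
    _, left, right = node
    en_l, zh_l = read_off(left)
    en_r, zh_r = read_off(right)
    if kind == "STRAIGHT":
        return en_l + en_r, zh_l + zh_r
    # Inverted: English keeps left-right order, Chinese swaps it.
    return en_l + en_r, zh_r + zh_l

# Toy derivation for "the building of the government" / "政府 的 大樓",
# where the possessive phrase order is reversed between the two languages.
tree = inverted(
    inverted(leaf("the building", "大樓"), leaf("of", "的")),
    leaf("the government", "政府"),
)

en, zh = read_off(tree)
print(" ".join(en))  # the building of the government
print(" ".join(zh))  # 政府 的 大樓
```

Because every chunk is built from smaller chunks in only these two ways, the space of possible word reorderings a translator has to consider shrinks drastically, which is what makes learning from huge corpora tractable.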
For this work, he was last month honoured as one of only 17 founding fellows worldwide of the Association for Computational Linguistics, and the only one from China.
He pioneered the computational study of English-Chinese language pairs, which no one else was doing at the time; the US first put funding into Chinese translations around 1999, years after Wu started his work. In 1995 he launched Silc, a multilingual engine that handled the first web translations from English to Chinese.
"First-mover technology takes time," he said, pointing out that Google Translate is still not making money. He said developers in Asia had to be patient investing in new fields.
But computational linguistics is gaining ground: Wu is just about to close multimillion-dollar translation technology research projects with the European Union and the Defence Advanced Research Projects Agency, the US military arm that funds most US computer science research.
Decades ago, machines would try to learn English grammar by simply processing millions of sentences and trying to find a common structure.
"That's like tying a child to a chair, blindfolding the child, and making them hear millions of sentences of English," Wu said.
His computers are not trying to "learn" English or Chinese, per se - at least not separately. For translation to work, what the machines need to do is figure out the relationships between the two languages and then match them.
That's how humans learn language, too: a child from birth to six or seven years old is constantly matching what they sense in their environment to the spoken language they hear.
They learn that the round thing on the ground is a ball because they hear their parents say "ball" repeatedly around the object, even if it is mixed up with other words, and so they eventually associate the image with the sound. They, like the computers, are making correlations, not between Chinese and English, but between the language of their environment and the language of words.
In this sense, we are all translators. A child translates an action into a meaning. Wu's machines translate from one language to another. A newspaper reader translates the text on the page into a narrative.
"All your thinking and cognition is taking the world as you see it and translating to an internal language," Wu said.
The original meaning of translate is to move or carry across from one place or form to another. To be translatable, then, means to be able to be shifted, to be transformed.
Wu is used to translating. He grew up in the US Midwest, but from the age of seven went back to China, including Hong Kong, in the summer. He remembers seeing the disparity between the US and post-Cultural Revolution China, and how much of it had to do with language and culture.
"You see the cultural disconnects, that English speakers aren't understanding something about the Chinese situation and vice versa.
"This is where we can make a difference," Wu said.
dekai@cs.ust.hk
Last updated: 2013.01.31