(Presented at the Chinese Language Processing Workshop, University of Pennsylvania, Philadelphia, 30 June to 2 July 1998. Current draft at http://www.cs.ust.hk/~dekai/papers/segmentation.html.)

A Position Statement on Chinese Segmentation

Dekai Wu
HKUST, Human Language Technology Center, Dept. of Computer Science, Hong Kong
dekai@cs.ust.hk
1 July 1998
 

The naive notion of a word boundary is essentially alien to Chinese. The notion is borrowed from languages whose orthographies separate words with whitespace. There is no absolute need to insert whitespace between Chinese characters.

Given this, our research group takes a pragmatic approach to defining segments, arising from computational concerns. Our approach is methodologically different from, but possibly complementary to, theoretical linguistic attempts to characterize segments.1

Assume the following computationally-motivated conditions:

Based on these assumptions, I suggest a single general principle, from which the rest of the paper follows:
 
Monotonicity Principle for segmentation. A valid basic segmentation unit (segment or token) is a substring that no processing stage after the segmenter needs to decompose.

This means a segmenter must not prematurely commit to long segments. Segmentation should be conservative, identifying only those segments it is certain of. A later stage always has the option to further group consecutive segments if needed.2
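To make the asymmetry concrete, here is a minimal Python sketch (the segments, the application lexicon, and the regroup function are all invented for illustration): a later stage can always merge consecutive conservative segments against its own lexicon, whereas splitting an overcommitted segment would mean redoing the segmenter's work.

```python
# Sketch only: a downstream stage regroups conservative segments.
# The lexicon and the input segmentation below are hypothetical.

def regroup(segments, app_lexicon, max_group=4):
    """Greedily merge runs of consecutive segments that form a unit
    known to the application; existing segments are never split."""
    out, i = [], 0
    while i < len(segments):
        best = 1
        # Try the longest grouping first (up to max_group segments).
        for n in range(min(max_group, len(segments) - i), 1, -1):
            if "".join(segments[i:i + n]) in app_lexicon:
                best = n
                break
        out.append("".join(segments[i:i + best]))
        i += best
    return out

# A conservative segmentation of a date expression, regrouped for an
# application that happens to treat dates as single units:
segments = ["1998", "年", "7", "月", "1", "日"]
app_lexicon = {"1998年", "7月", "1日"}
print(regroup(segments, app_lexicon))   # ['1998年', '7月', '1日']
```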

I'll completely avoid using the term "word" here, for two reasons:

Can segment-hood be defined independently of the application?

Under the Monotonicity Principle, good Chinese segmentation is always defined relative to one or more applications. Segmentation must not impair the accuracy of the application(s).

Can there be a universal gold standard?

The Monotonicity Principle dictates that any universal gold standard for segmentation include at least the following condition:
 
A substring constitutes a valid segment only if no possible application would ever need to decompose it for any reason.

Criteria for corpus annotation of segments

For human corpus annotators, the Monotonicity Principle implies a guideline:
 
A sharable general-purpose corpus should only commit to segments that no target application would ever need to decompose for any reason, whether structural or statistical.

A corpus should be re-usable for many applications, and therefore needs to be chunked into tokens short enough that they will not impair accuracy for any of those applications.

Since the corpus' segments should not be prematurely overcommitted from any application's standpoint, the annotator must be aware of all kinds of target applications that will use the corpus. As a minimum set of target applications, we suggest MT, IR, information extraction, summarization, and language modeling for ASR.

This requires well-educated human annotators. Even so, there is a serious problem with lack of inter-annotator agreement. See Wu and Fung (ANLP-94) for a study employing nk-blind evaluation.

Note that the prohibition against decomposing segments applies to statistical models as well. This means, for example, that the segmentation of a corpus should not contain long segments that prevent correct collection of n-gram statistics (where a "gram" is a segment).
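As a small illustration of the statistical side of the prohibition (the two segmentations below are invented), bigram counts collected over an overcommitted segmentation simply never contain the pairs that a conservative segmentation exposes:

```python
from collections import Counter

def bigram_counts(segmented_sentences):
    """Count segment bigrams, treating each segment as one 'gram'."""
    counts = Counter()
    for segs in segmented_sentences:
        counts.update(zip(segs, segs[1:]))
    return counts

# Hypothetical corpus fragment, segmented two ways.
conservative  = [["北京", "大学", "学生"]]
overcommitted = [["北京大学", "学生"]]

print(bigram_counts(conservative))    # includes the bigram ('北京', '大学')
print(bigram_counts(overcommitted))   # that bigram can never be counted here
```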

I also suggest a second guideline for annotating segments in corpora:
 
The criteria for annotating corpus segments should not require presence in a reference dictionary or corpus.

The reason for this anti-criterion is Zipf's Law. Many valid segments can be absent from a reference dictionary or corpus, even with today's large corpus sizes. This should not be allowed to artificially reduce annotation accuracy. Dictionaries should be built from corpora, not the other way around.
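The point is easy to check on any corpus. The following sketch (the token stream is invented purely for illustration) estimates the hapax rate, the fraction of segment types seen only once, which remains large in natural corpora even at very large corpus sizes; any fixed reference dictionary or corpus will therefore miss many valid segments.

```python
from collections import Counter

def hapax_rate(tokens):
    """Fraction of segment types occurring exactly once: a rough proxy for
    how many valid segments any fixed reference resource will miss."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Invented token stream; on real corpora this rate stays high as the
# corpus grows, which is the Zipfian point.
tokens = ["的", "香港", "分词", "的", "研究", "语料库", "光纤陀螺"]
print(f"hapax rate: {hapax_rate(tokens):.2f}")
```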

Criteria for evaluation of automatic segmenters

The criteria are different for general-purpose versus application-specific segmenters. General-purpose segmenters are subject to the same criteria as general-purpose corpora.
 
A general-purpose automatic segmenter should only commit to segments that no target application would ever need to decompose for any reason, whether structural or statistical.

In contrast, criteria for application-specific segmenters should not be the same as criteria for general-purpose corpora. Ideally, an automatic segmenter should be specialized for its intended application, and for efficiency's sake should find the maximal-length tokens that don't impair accuracy on that application. This allows higher performance than with general-purpose segmenters, which are bound by the concerns of irrelevant applications.
 
An application-specific automatic segmenter should commit to the longest possible segments that will never need to be decomposed by later stages in the target application.

General-purpose segmenters produce minimal segments that are safe; they may compromise efficiency but not accuracy. Application-specific segmenters are more dangerous: being greedier, they are susceptible to premature commitment, leading to unrecoverable "too-long segment" or "crossing-segment" errors.
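For concreteness, the classic way to obtain maximal-length tokens is greedy longest match against an application lexicon. The sketch below (lexicon and input sentence are invented) also shows exactly where the danger lies, since the greedy choice can commit across a boundary that the sentence actually needs:

```python
def maximal_match(text, lexicon, max_len=6):
    """Greedy left-to-right longest-match segmentation: fast and maximal,
    but prone to overcommitment (too-long and crossing-segment errors)."""
    segments, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                segments.append(piece)
                i += n
                break
    return segments

# Hypothetical lexicon; the longest match swallows a needed boundary here.
lexicon = {"研究", "研究生", "生命", "起源"}
print(maximal_match("研究生命起源", lexicon))   # ['研究生', '命', '起源']
```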

A different approach to application-specific segmentation that eliminates the danger of premature commitment is task-driven segmentation. Task-driven segmentation is performed in tandem with the application (parsing, translating, named-entity labeling, etc.) rather than as a preprocessing stage. To optimize accuracy, modern systems make use of integrated statistically-based scores to make simultaneous decisions about segmentation and parsing/translation.
 
Application-specific segmentation is most accurately performed by task-driven segmentation.
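As a hedged sketch of what integrated scoring can mean in practice (the scoring function below is only a stand-in for whatever the real parser, translator, or labeler would supply), segmentation decisions can be folded into a single dynamic program that keeps whichever segmentation the task model scores highest, instead of being fixed in advance:

```python
import math

def task_driven_segment(text, task_score, max_len=6):
    """Viterbi-style search over all segmentations of text; task_score(piece)
    is a log-score provided by the downstream application's model."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)   # (score, backpointer) per position
    best[0] = (0.0, 0)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            score = best[i][0] + task_score(text[i:j])
            if score > best[j][0]:
                best[j] = (score, i)
    segments, j = [], n                 # recover segmentation from backpointers
    while j > 0:
        i = best[j][1]
        segments.append(text[i:j])
        j = i
    return list(reversed(segments))

# Stand-in scores (invented): in a real system these would come from the
# application model itself, so segmentation and the task are decided jointly.
demo_scores = {"研究": -2.0, "生命": -2.5, "起源": -2.5, "研究生": -3.0}
score = lambda piece: demo_scores.get(piece, -10.0 * len(piece))
print(task_driven_segment("研究生命起源", score))   # ['研究', '生命', '起源']
```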

Systems employing task-driven segmentation include:

I now consider some constructs that have been particularly tricky to assess with respect to segment-hood.

Especially for strings generated by productive "derivational" processes (reduplication, affixation, contraction), it can be quite unclear what the later stages should be expected to handle. Although many consider such "derived" strings to be segments, our principle implies an important caveat. Consider, for example, marking a reduplicative construct such as gao1gao1xing4xing4 as a single segment; assuming gao1gao1xing4xing4 is not in the dictionary, how could a later processing stage know that this is gao1xing4 with a modified meaning, unless it breaks up the segment so that it can look up gao1xing4 in the dictionary?

The Monotonicity Principle allows two approaches.

  1. Do not mark "derivational" constructs as segments.
  2. Mark "derivational" constructs as segments, subject to the following proviso: for segments not in the dictionary, the segment annotation should include the base form and the type of derivational process.

This is similar to the way morphological analyzers for English output a stem plus tense/aspect/number features. It relieves later stages from needing to resegment and re-analyze.

Of course, if the derived construct (reduplicative compound, affixed form, or contraction) is in the lexicon, there is no problem.
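One possible encoding of the proviso (a sketch of my own; the field names and process labels are not a standard annotation scheme) simply attaches the base form and the derivational process type to the segment, so that a later stage can go straight to the lexicon:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    surface: str                    # the string as it occurs in the text
    base: Optional[str] = None      # dictionary form, if the surface is derived
    process: Optional[str] = None   # e.g. "AABB-reduplication", "affixation", "contraction"

# gao1gao1xing4xing4 annotated with its base form gao1xing4 (the other
# tokens are invented context):
tokens = [
    Segment("gao1gao1xing4xing4", base="gao1xing4", process="AABB-reduplication"),
    Segment("de"),
    Segment("hui2jia1"),
]

for t in tokens:
    lookup_form = t.base or t.surface   # what a later stage sends to the lexicon
    print(t.surface, "->", lookup_form)
```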

Convergence with theoretical notions of segment-hood

Computational standpoints like the Monotonicity Principle tend to fit well with cognitive modeling concerns. The Monotonicity Principle reflects a hypothesis about sentence processing:
 
Determinism Hypothesis for segmentation. In sentence interpretation, humans employ a fast preprocessing segmentation stage that tokenizes input sentences into substrings that no later processing stage needs to decompose, except for garden paths.
 
There is a linguistic notion of a Chinese morpheme. Usually these are single syllables/characters, but some argue that strings like pu2tao2 are morphemes, since their characters "never" (i.e., rarely) occur in any other strings. Such strings are always segments by the Monotonicity Principle.

Feng (1997) suggests the notion of a prosodic word, defined by criteria of prosody and maximum character length. It is plausible that prosodic patterns evolved to maximize efficient yet accurate listener interpretation. This would be consistent with the Monotonicity Principle.

Conventionalized strings like chi1fan4 would be in the lexicon and are legitimate segment candidates. I do not consider constructions like chi1 ...yadda yadda... fan4 to be words, but rather lexical constructions in Fillmore's sense.
 

Notes

  1. This paper may refer to participants at the Chinese Language Processing Workshop, but the statements contained are not intended to represent the opinions of anyone but the author.
  2. The opposite tack is sometimes advocated: first segment into long strings, and then resegment in later stages. This is not a real solution, since it simply passes the buck to a later stage without actually obtaining any leverage from the idea of segmentation.