The naive notion of a word boundary is essentially alien to Chinese; it is borrowed from languages whose orthographies separate words with whitespace. There is no absolute need to insert whitespace between Chinese characters.
Given this, our research group takes a pragmatic approach to defining segments, arising from computational concerns. Our approach is methodologically different from, but possibly complementary to, theoretical linguistic attempts to characterize segments.1
Assume the following computationally motivated conditions:
Monotonicity Principle for segmentation. A valid basic segmentation unit (segment or token) is a substring that no processing stage after the segmenter needs to decompose.
This means a segmenter must not prematurely commit to long segments. Segmentation should be conservative, identifying only segments it is certain of. A later stage always has the option to further group consecutive segments if needed.2
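To make the monotonicity requirement concrete, here is a minimal sketch (the toy segmenter, the grouping rule, and the example units are my own illustration, not any existing system): a later stage may concatenate adjacent segments into larger units, but it never splits a segment the segmenter has committed to.

```python
# Minimal sketch of the Monotonicity Principle: a later processing stage may
# GROUP consecutive segments into larger units, but never decomposes them.

def conservative_segment(syllables):
    """Stand-in for a conservative segmenter: commit only to units it is
    certain of (here, trivially, one segment per syllable)."""
    return list(syllables)

def later_stage_group(segments, known_units):
    """A later stage: greedily joins adjacent segments that form a known
    larger unit, but never splits a segment it receives."""
    out, i = [], 0
    while i < len(segments):
        for j in range(len(segments), i, -1):      # try the longest grouping first
            candidate = "".join(segments[i:j])
            if j - i > 1 and candidate in known_units:
                out.append(candidate)
                i = j
                break
        else:
            out.append(segments[i])
            i += 1
    return out

segs = conservative_segment(["gao1", "xing4", "de5", "chi1", "fan4"])
print(later_stage_group(segs, {"gao1xing4", "chi1fan4"}))
# -> ['gao1xing4', 'de5', 'chi1fan4']
```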
I'll completely avoid using the term "word" here, for two reasons:
A substring constitutes a valid segment only if no possible application would ever need to decompose it for any reason.
A sharable general-purpose corpus should only commit to segments that no target application would ever need to decompose for any reason, whether structural or statistical.
A corpus should be re-usable by many applications, and so needs to be chunked into tokens short enough not to impair accuracy for any of them.
Since the corpus's segments must not be prematurely overcommitted from the standpoint of any application, the annotator must be aware of the full range of target applications that will use the corpus. As a minimum set of target applications, we suggest MT, IR, information extraction, summarization, and language modeling for ASR.
This requires well-educated human annotators. Even so, there is a serious problem with lack of inter-annotator agreement. See Wu and Fung (ANLP-94) for a study employing nk-blind evaluation.
Note that the prohibition against decomposing segments applies to statistical models as well. This means, for example, that the segmentation of a corpus should not contain long segments that prevent correct collection of n-gram statistics (where a "gram" is a segment).
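As a hedged illustration of why this matters (the toy data and function are my own, not from any cited work), compare segment-level n-gram counts over a conservative segmentation with counts over a segmentation that has overcommitted to one long segment; the interior statistics of the long segment are simply unrecoverable.

```python
from collections import Counter

def segment_ngrams(segments, n):
    """Collect n-gram counts where each 'gram' is a segment, not a character."""
    return Counter(tuple(segments[i:i + n]) for i in range(len(segments) - n + 1))

# Two segmentations of the same (toy, pinyin-transcribed) sentence:
conservative  = ["ta1", "gao1", "gao1", "xing4", "xing4", "de5", "chi1", "fan4"]
overcommitted = ["ta1", "gao1gao1xing4xing4de5chi1fan4"]   # one too-long segment

print(segment_ngrams(conservative, 2))    # interior bigrams are all observable
print(segment_ngrams(overcommitted, 2))   # the long segment hides its interior statistics
```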
I also suggest a second guideline for annotating segments in corpora:
The criteria for annotating corpus segments should not require presence in a reference dictionary or corpus.
The reason for this anti-criterion is Zipf's Law. Many valid segments can be absent from a reference dictionary or corpus, even with today's large corpus sizes. This should not be allowed to artificially reduce annotation accuracy. Dictionaries should be built from corpora, not the other way around.
A general-purpose automatic segmenter should only commit to segments that no target application would ever need to decompose for any reason, whether structural or statistical.
In contrast, criteria for application-specific segmenters should not be the same as criteria for general-purpose corpora. Ideally, an automatic segmenter should be specialized for its intended application, and for efficiency's sake should find the maximal-length tokens that don't impair accuracy on that application. This allows higher performance than with general-purpose segmenters, which are bound by the concerns of irrelevant applications.
An application-specific automatic segmenter should commit to the longest possible segments that will never need to be decomposed by later stages in the target application.
General-purpose segmenters produce minimal segments that are safe; they may compromise efficiency but not accuracy. Application-specific segmenters are more dangerous, since they are greedier and are susceptible to premature commitment, leading to unrecoverable "too-long segments" or "crossing-segments" errors.
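For concreteness, a "crossing-segments" error can be detected by comparing character spans against a reference segmentation. The following sketch is my own illustration (the function names and toy data are assumptions, not taken from any cited system):

```python
def spans(segments):
    """Character spans (start, end) covered by each segment of a string."""
    out, pos = [], 0
    for seg in segments:
        out.append((pos, pos + len(seg)))
        pos += len(seg)
    return out

def crossing_errors(candidate, reference):
    """Pairs of segments that partially overlap without either containing the
    other -- the unrecoverable 'crossing-segments' case."""
    errs = []
    for a, b in spans(candidate):
        for c, d in spans(reference):
            if a < c < b < d or c < a < d < b:
                errs.append(((a, b), (c, d)))
    return errs

# Toy example (pinyin syllables stand in for characters; illustrative only):
reference = ["ta1", "gao1xing4", "de5", "chi1fan4"]
too_greedy = ["ta1gao1", "xing4de5chi1", "fan4"]
print(crossing_errors(too_greedy, reference))   # three crossing-segment errors
```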
A different approach to application-specific segmentation that eliminates the danger of premature commitment is task-driven segmentation. Task-driven segmentation is performed in tandem with the application (parsing, translating, named-entity labeling, etc.) rather than as a preprocessing stage. To optimize accuracy, modern systems make use of integrated statistically-based scores to make simultaneous decisions about segmentation and parsing/translation.
Application-specific segmentation is most accurately performed by task-driven segmentation.
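The following is only a schematic sketch of the idea, not any particular system's method: a dynamic program considers every segmentation of the input and picks the one with the best integrated score. Here the score is a toy unigram table of my own invention; in a real task-driven system, the parser's or translator's own model would supply the scores at that point.

```python
import math

# Hypothetical log-probabilities standing in for an integrated task score.
TASK_SCORES = {
    "gao1": -4.0, "xing4": -4.5, "gao1xing4": -3.0,
    "chi1": -3.5, "fan4": -3.8, "chi1fan4": -2.5, "de5": -1.5,
}

def best_segmentation(syllables):
    """Dynamic program over all segmentations of a syllable sequence,
    choosing the one with the highest total task score."""
    n = len(syllables)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, 0)
    for j in range(1, n + 1):
        for i in range(j):
            unit = "".join(syllables[i:j])
            score = TASK_SCORES.get(unit)
            if score is None:
                continue
            cand = best[i][0] + score
            if cand > best[j][0]:
                best[j] = (cand, i)
    # Recover the best segmentation by following back-pointers.
    segs, j = [], n
    while j > 0:
        i = best[j][1]
        segs.append("".join(syllables[i:j]))
        j = i
    return list(reversed(segs))

print(best_segmentation(["gao1", "xing4", "de5", "chi1", "fan4"]))
# -> ['gao1xing4', 'de5', 'chi1fan4'] under the toy scores above
```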
Systems employing task-driven segmentation include:
Especially for strings generated via productive "derivational" processes (reduplication, affixation, contraction), it can be quite unclear what the later stages should be expected to handle. Although many consider such "derived" strings to be segments, our principle implies an important caveat. Suppose, for example, that we mark a reduplicative construct such as gao1gao1xing4xing4 as a single segment; assuming gao1gao1xing4xing4 is not in the dictionary, how could a later processing stage know that this is gao1xing4 with a modified meaning, unless it breaks up the segment so that it can look up gao1xing4 in the dictionary?
The Monotonicity Principle allows two approaches: either such a derived string is not treated as a single segment in the first place, or it is kept as a single segment but annotated with enough information that later stages never need to decompose it.
For segments not in the dictionary, the segment annotation should include the base form and type of derivational process.
This is similar to the way morphological analyzers for English work, outputting a stem plus tense/aspect/number features. It relieves later stages of the need to resegment and re-analyze.
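A possible shape for such an annotation (the field names and the derivation label are my own illustration, not a proposed standard) is sketched below.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    surface: str          # the string as it appears in the corpus
    base: str             # dictionary (base) form, usable for lexicon lookup
    derivation: str = ""  # e.g. "AABB-reduplication", "affixation", "contraction"

seg = Segment(surface="gao1gao1xing4xing4",
              base="gao1xing4",
              derivation="AABB-reduplication")

# A later stage can look up seg.base in the dictionary directly,
# without ever breaking up seg.surface.
```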
Determinism Hypothesis for segmentation. In sentence interpretation, humans employ a fast preprocessing segmentation stage that tokenizes input sentences into substrings that no later processing stage needs to decompose, except for garden paths.
Feng (1997) suggests the notion of a prosodic word, defined by criteria of prosody and maximum character length. It is plausible that prosodic patterns evolved to maximize efficient yet accurate listener interpretation. This would be consistent with the Monotonicity Principle.
Conventionalized strings like chi1fan4 would be in the lexicon and are legitimate segment candidates. I do not consider constructions like chi1 ...yadda yadda... fan4 to be words, but rather lexical constructions in Fillmore's sense.