May–June 1999


The Web of Words

Brian Hayes

A Theory of Knowledge

The tree structures within the lexical graph make WordNet into a rough taxonomy of the world. The database codifies a theory of knowledge. Concepts accorded a place near the root of a tree are identified as central, basic, primary; those near the leaves are marginal or peripheral. But who determines the branching pattern of the noun and verb trees? Are the fundamental categories of thought inherent in the language, or are they inventions of the lexicographer?

For the noun hierarchy, WordNet posits 11 "unique beginners"—nouns or noun phrases that have no hypernyms. The unique beginners are entity, abstraction, psychological feature, natural phenomenon, activity, event, group, location, possession, shape and state. Miller does not argue that these particular choices are the only possible ones, but neither are the 11 categories entirely arbitrary or idiosyncratic. They reflect an analysis by Philip N. Johnson-Laird of Princeton of the classes of nouns that can be modified by various adjectives. The categories also satisfy another important criterion: The hierarchy has a place for every English noun.
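The claim that the hierarchy has a place for every English noun amounts to saying that every noun's chain of hypernyms terminates in one of the 11 unique beginners. A toy sketch makes the idea concrete; the hypernym entries below are illustrative, not WordNet's actual data:

```python
# A toy sketch of the noun hierarchy described above: each noun maps
# to its hypernym, and the 11 "unique beginners" have no hypernym.
# The specific chain dog -> ... -> entity is illustrative only.

UNIQUE_BEGINNERS = {
    "entity", "abstraction", "psychological feature", "natural phenomenon",
    "activity", "event", "group", "location", "possession", "shape", "state",
}

# Hypothetical fragment of the hypernym relation (child -> parent).
HYPERNYM = {
    "dog": "canine",
    "canine": "mammal",
    "mammal": "animal",
    "animal": "organism",
    "organism": "entity",
}

def hypernym_chain(noun):
    """Follow hypernym links upward until reaching a unique beginner."""
    chain = [noun]
    while chain[-1] not in UNIQUE_BEGINNERS:
        chain.append(HYPERNYM[chain[-1]])
    return chain

print(hypernym_chain("dog"))
# ['dog', 'canine', 'mammal', 'animal', 'organism', 'entity']
```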

In practice, no single, immutable hierarchy can possibly capture the structure of the entire lexicon. Conflicts over the classification of words emerge not only at the root but at all levels of the tree. Consider the small subtree of nouns that denote close family relations. The generic term relative can serve as the root of the subtree, but what are its immediate hyponyms? In one scheme the relative node has two subordinate nodes, kinswoman and kinsman. At the next level the subordinates of kinswoman include mother, sister, daughter; those of kinsman include father, brother, son. But the same words could equally well be organized another way, giving relative three hyponyms, namely parent, sibling and child; then each of these nodes divides by gender into mother and father, sister and brother, daughter and son. As it happens, WordNet is inconsistent in its treatment of the lexical family tree. Sister and brother follow the first model: They are listed as hyponyms of kinswoman and kinsman respectively. But mother and father are hyponyms of parent, and similarly daughter and son are grouped under child.
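The two competing classifications can be written out as parent-to-children tables; the point is that both trees bottom out in exactly the same six words, so neither can be preferred on coverage alone. (This is a sketch of the structure discussed above, not of WordNet's storage format.)

```python
# Two equally defensible hyponym trees for the kinship terms,
# written as parent -> children mappings.

BY_GENDER = {
    "relative": ["kinswoman", "kinsman"],
    "kinswoman": ["mother", "sister", "daughter"],
    "kinsman": ["father", "brother", "son"],
}

BY_GENERATION = {
    "relative": ["parent", "sibling", "child"],
    "parent": ["mother", "father"],
    "sibling": ["sister", "brother"],
    "child": ["daughter", "son"],
}

def leaves(tree, root):
    """Collect the leaf nouns beneath a given root."""
    children = tree.get(root)
    if not children:
        return {root}
    result = set()
    for child in children:
        result |= leaves(tree, child)
    return result

# Both hierarchies cover the same six words; they differ only in
# which distinction -- gender or generation -- is drawn first.
assert leaves(BY_GENDER, "relative") == leaves(BY_GENERATION, "relative")
```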

Of course neither solution is clearly right or wrong. Whether the more natural division is by gender or by generation depends on context, and people have no trouble keeping both hierarchies in mind at the same time. Furthermore, both schemes lose some of their appealing symmetry when they are extended to other relatives, such as cousin, which carries no indication of gender in English. (But I can't resist noting that cousin derives from the Latin consobrinus, which originally referred only to a cousin on the mother's side, and which derives in turn from soror, sister.)

These tangled hierarchies are not the only oddities lurking in the lexical trees. One might reasonably suppose that hyponymy, meronymy and other relations between synsets would be transitive. In many cases they are. A mouse is a mammal, a mammal is an animal, and sure enough a mouse is an animal. Often, however, language fails to obey the Aristotelian rules. A house has a door and a door has a knob, but most people would find it odd to say that a house has a knob.
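The mouse-to-animal and house-to-knob examples can both be expressed as transitive closures over small relations. The code is a sketch with made-up tables; note that the machinery happily computes "house has knob," even though a speaker would balk at saying it:

```python
# Transitivity in two lexical relations. The tables are illustrative.

# "is-a" (hyponymy): each term maps to its immediate superordinate.
ISA = {"mouse": "mammal", "mammal": "animal"}

# "has-a" (meronymy): each whole maps to its immediate parts.
HAS_PART = {"house": ["door"], "door": ["knob"]}

def is_a(x, y):
    """Follow is-a links from x; True if y is ever reached."""
    while x in ISA:
        x = ISA[x]
        if x == y:
            return True
    return False

def all_parts(whole):
    """Transitively expand the part-of relation."""
    found, stack = [], [whole]
    while stack:
        for part in HAS_PART.get(stack.pop(), []):
            found.append(part)
            stack.append(part)
    return found

print(is_a("mouse", "animal"))   # True: hyponymy chains compose cleanly
print(all_parts("house"))        # ['door', 'knob'] -- logically valid,
                                 # yet "a house has a knob" sounds odd
```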

In citing these inconsistencies and peculiarities I don't mean to suggest that the graph-theoretical approach to the lexicon is fundamentally wrong. On the contrary, I would argue that these bugs are features! It seems to me that WordNet is most illuminating just where the construction of the graph runs into difficulties. This is where we may have a chance to learn something about the underlying structure of the language.

Discourse in theoretical linguistics proceeds largely by example and counterexample, by constructing sentences that the prototypical native speaker would or wouldn't find acceptable. Much of this discourse has been carried on with rather small specimens of language—with "toy" grammars and lexicons that generate only a tiny subset of all possible sentences. But building a lexicon of 50 or 100 words and leaving the rest of the language as an exercise for the reader risks missing something important. Indeed, it is characteristic of graphs that some properties do not emerge until the last nodes and edges are added.
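The closing observation about graphs can be illustrated directly: connectivity, for instance, is a global property that can hinge on the very last edge added. The graph below is invented for the illustration.

```python
# A global graph property -- connectivity -- emerges only when the
# final edge is added. Vertices are 0..n-1; edges are undirected.

def connected(n, edges):
    """Breadth-first check that all n vertices are reachable from 0."""
    adj = {v: [] for v in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen, frontier = {0}, [0]
    while frontier:
        for w in adj[frontier.pop()]:
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return len(seen) == n

edges = [(0, 1), (1, 2), (3, 4)]
print(connected(5, edges))             # False: two separate pieces
print(connected(5, edges + [(2, 3)]))  # True: the last edge joins them
```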


