American Scientist, May-June 1999


The Web of Words

Brian Hayes

Six Degrees of Lexicography

Looking on language as a graph invites the kinds of questions that graph theorists ask. Many of these questions have to do with connectivity—with the number of edges linking pairs of nodes.

The ultimate in connectivity is a clique, which is a graph where every node is directly linked to every other node. But a clique is not a plausible architecture for a lexical graph. (For one thing, there are not enough words to uniquely name all the relations between words.)
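The parenthetical remark can be checked with a little arithmetic. A clique on n nodes has n(n-1)/2 edges, and that count overtakes n as soon as n exceeds 3. The Python sketch below is only an illustration: the helper name `clique_edges` and the vocabulary size are assumptions chosen for the example, not figures from the article.

```python
def clique_edges(n: int) -> int:
    """Number of edges in a complete graph on n nodes (n choose 2)."""
    return n * (n - 1) // 2

# Illustrative vocabulary size; an assumption, not a number from the article.
vocabulary = 100_000

edges = clique_edges(vocabulary)
print(edges)                # 4999950000 edges in the clique
print(edges // vocabulary)  # 49999 relations for every word available to name them
```

With these assumed numbers, each word would have to serve as the name of nearly fifty thousand distinct relations.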

At the opposite end of the connectivity spectrum is a graph with no edges at all, just isolated nodes. Again the lexical graph can look nothing like this. Suppose a word occupied such a lonely outpost. With no relations to any other words—no synonyms, no hypernyms, no antonyms—what could it possibly mean? What could one say about it—or say with it? Even pairs of nodes linked only to each other are problematic. They would be no more useful than the dictionary—famous in the lore of lexicography although I don't know if it really exists—that defines furze as gorse and then defines gorse as furze.

A more realistic question is whether the lexical graph consists of a single connected component. Can you find at least one continuous path from any given node to any other? For WordNet in its present form the answer is clearly No. With the exception of some adverb-adjective links, there are no edges between different parts of speech, and even within the noun and verb hierarchies the graph breaks into several disconnected pieces. But the mind's lexical graph is surely richer in relations than WordNet and may well be connected. Trying to find a plausible path of relations between two randomly chosen words is easy enough that it doesn't even make a very good parlor game. (And suppose you found a pair of words that stumped everybody, that seemed to have nothing whatever in common. They would then be related by their shared membership in the unusual class of words that are not otherwise related.)
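The question of whether a graph forms a single connected component can be answered mechanically with a breadth-first search. The sketch below is a minimal Python illustration on a made-up toy graph; the words, the edges and the helper names `add_edge` and `connected_components` are assumptions for the example and are not drawn from the WordNet database.

```python
from collections import deque

def add_edge(graph, a, b):
    """Record an undirected edge between words a and b."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def connected_components(graph):
    """Return a list of sets of words, one set per connected component."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# Toy lexical graph (invented for illustration).
toy = {}
add_edge(toy, "furze", "gorse")   # a pair linked only to each other
add_edge(toy, "dog", "animal")    # a fragment of a noun hierarchy
add_edge(toy, "cat", "animal")
add_edge(toy, "run", "move")      # a fragment of a verb hierarchy
toy.setdefault("lonely", set())   # an isolated node with no relations

components = connected_components(toy)
print(len(components))  # 4

# One edge between parts of speech merges the noun and verb fragments.
add_edge(toy, "dog", "run")
print(len(connected_components(toy)))  # 3
```

The graph is connected exactly when the function returns a single component, so every new kind of edge that is admitted can only lower the count or leave it unchanged.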

Figure 2. Words for common relations

In the end the question of connectivity comes down to what kinds of relations between words qualify as edges of the graph. Within the mental lexicon there are surely many more kinds of links than WordNet admits. Consciously or unconsciously, we form word associations based on common etymology (river and arrive), based on assonance or rhyme (slumber and encumber), based on pairing in familiar phrases (law and order). None of these relations are likely candidates for inclusion in WordNet, but a few other kinds of edges could be important additions to the graph.

Selection rules operating between parts of speech might be the most valuable enhancements. As noted above, adjectives can supply useful clues to the classification of nouns. For example, the adjectives living and dead can describe only biological organisms (except in figurative uses). In a similar way, many verbs are restricted to certain classes of subjects and objects. You can count sheep or noses, but you can't count water; you can open a door but not a contradiction. Would adding such constraints to the graph be a practical undertaking? That depends on whether the constraints follow the hyponymy and troponymy hierarchies. If the selection rule for the object of eat could be encoded by creating a single link to the node for food, which would then subsume all the hyponyms of food, the number of edges added to the graph would be manageable. But English is surely not quite that tidy; in the worst case, inserting a separate edge between every verb and each of its potential objects would bring a combinatorial explosion in the number of edges, on the order of the number of verbs times the number of nouns.
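The difference between the tidy case and the untidy one can be seen by counting edges in a toy hierarchy. In the Python sketch below, the hyponym tree under food and the helper name `all_hyponyms` are invented for illustration; a real hierarchy would be far larger.

```python
def all_hyponyms(hyponyms, word):
    """Collect every word below `word` in a hyponym hierarchy."""
    found = set()
    stack = [word]
    while stack:
        current = stack.pop()
        for child in hyponyms.get(current, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

# Invented fragment of a hyponym hierarchy: parent -> direct hyponyms.
hyponyms = {
    "food": ["fruit", "bread"],
    "fruit": ["apple", "pear"],
    "bread": ["rye", "sourdough"],
}

# Tidy case: one edge from "eat" to "food", inherited by everything below it.
inherited_edges = 1

# Untidy case: a separate edge from "eat" to each possible object.
explicit_edges = len(all_hyponyms(hyponyms, "food"))

print(inherited_edges, explicit_edges)  # 1 6
```

Even in this tiny tree the explicit encoding costs six edges for one verb where inheritance costs one, and the gap widens with every level added to the hierarchy.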

Even though WordNet does not currently include selection-rule information, Philip Resnik of the University of Maryland has used the WordNet database, along with a large corpus of English prose, to derive probabilistic selection rules. For example, he notes that if you choose a sentence at random from the corpus, the subject is more likely to be a hyponym of person than a hyponym of insect. But if the verb in the sentence happens to be buzz, the probability of finding an insect in the subject position rises considerably. The magnitude of the change in probability conveys information about the strength of the selection rule.
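The shift in probability described here can be illustrated with a few lines of Python. Everything in the sketch is an assumption made for the example: the counts are invented rather than taken from any corpus, the class list is truncated to three entries, and the Kullback-Leibler divergence is used as one plausible way of summarizing the overall shift, which may differ in detail from Resnik's own formulation.

```python
import math

# Invented counts of subject classes across all sentences in a corpus.
subject_counts = {"person": 900, "insect": 20, "artifact": 80}
# Invented counts of subject classes in sentences whose verb is "buzz".
buzz_counts = {"person": 5, "insect": 30, "artifact": 15}

def distribution(counts):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def preference_strength(prior, posterior):
    """Kullback-Leibler divergence (in bits) of posterior from prior."""
    return sum(
        p * math.log2(p / prior[cls])
        for cls, p in posterior.items()
        if p > 0
    )

prior = distribution(subject_counts)   # P(class)
given_buzz = distribution(buzz_counts) # P(class | verb = buzz)

# How much more (or less) likely each class becomes once the verb is known.
ratio = {cls: given_buzz[cls] / prior[cls] for cls in prior}
strength = preference_strength(prior, given_buzz)

print(ratio)     # insect rises about thirtyfold with these invented counts
print(strength)  # larger values mean a stronger selection rule
```

A verb that accepts almost any subject would leave the distribution nearly unchanged and score close to zero, whereas a choosy verb such as buzz scores high.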

Another possible elaboration of WordNet would address what Roger Chaffin of the University of Connecticut has called "the tennis problem." Miller writes: "Suppose you wanted to learn the specialized vocabulary of tennis and asked where in WordNet you could find it. The answer would be everywhere and nowhere. Tennis players are in the noun.person file, tennis equipment is in noun.artifact, the tennis court is in noun.location, the various strokes are in noun.act, and so on. Other topics have similarly dispersed vocabularies. At least part of the dissatisfaction with a purely hierarchical organization of nouns can be attributed to this neglect of co-occurrence relations."
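The tennis problem is easy to reproduce in miniature. The Python sketch below uses the file names that Miller mentions, but the particular words and the topic labels are hypothetical additions for illustration; WordNet as described here carries no such topic field.

```python
from collections import defaultdict

# (word, lexicographer file, topic); the topic column is a hypothetical addition.
entries = [
    ("tennis player", "noun.person", "tennis"),
    ("racket", "noun.artifact", "tennis"),
    ("tennis court", "noun.location", "tennis"),
    ("backhand", "noun.act", "tennis"),
    ("pianist", "noun.person", "music"),
]

by_file = defaultdict(list)   # the purely hierarchical organization
by_topic = defaultdict(list)  # the co-occurrence view that is missing

for word, file_name, topic in entries:
    by_file[file_name].append(word)
    by_topic[topic].append(word)

# The tennis vocabulary is scattered across four separate files...
tennis_files = {file_name for _, file_name, topic in entries if topic == "tennis"}
print(len(tennis_files))   # 4

# ...but a topic link gathers it in one place.
print(by_topic["tennis"])  # ['tennis player', 'racket', 'tennis court', 'backhand']
```

Grouping by file alone puts the tennis player next to the pianist; only the extra topic relation brings the racket and the backhand together.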
