Belles lettres Meets Big Data

Quantitative analysis of poetry and prose has roots deep in the 19th century.

Brian Hayes

The Shrinking English Sentence

[Illustration: 2014-07CompsciFp264top.jpg]

My second precocious digital humanist is Lucius Adelno Sherman (1847–1933), who was born and educated in New England. After graduating from Yale in 1871, he stayed in New Haven to teach Greek, Latin, and English at the Hopkins Grammar School. He also published a translation of the Swedish poem Frithiof’s Saga, and worked toward a Yale Ph.D., which was awarded in 1875. His doctoral dissertation, “A Grammatical Analysis of the Old English Poem, ‘The Owl and the Nightingale,’” already suggests a quantitative bent to his literary work. For example, he lists the frequencies of various prepositions, conjunctions, and forms of negation in the poem.

In 1882 Sherman accepted an appointment in the English department at the University of Nebraska in Lincoln. At the time, the Lincoln campus consisted of a single building, but both the city and the university were growing at a frenzied pace and would soon become a cultural capital of the prairies. Sherman remained in Lincoln through the rest of his life and career, serving in due course as department chair and dean. He published extensively on Shakespeare and other Elizabethan dramatists, and wrote plays of his own. And then there was the peculiar book that concerns me here: Analytics of Literature: A Manual for the Objective Study of English Prose and Poetry, published in 1893.

The first half of Analytics is a fairly conventional introduction to rhetoric and poetics, with chapters on meter and rhyme, figures of speech, the emotional force of words—a lot of close reading. Then Sherman suddenly goes all quantitative, launching into a discussion of sentence length. Sherman was motivated by broader questions than the authorship puzzles that concerned Mendenhall. While teaching the historical development of English literature, Sherman took note of pervasive changes in sentence structure. In the chronological progression from Chaucer in the 14th century to Shakespeare in the 17th to Emerson in the 19th, sentences seemed to grow simpler, to lose much of their “heaviness” and intricacy. Of course Sherman was hardly the first to notice that modern English syntax differs from medieval and Elizabethan practice, but his approach was novel: He believed the nature of the change should be susceptible to scientific inquiry. “The right way and the only way to learn the facts and principles of English prose development was plainly to study the literature objectively, with scalpel and microscope in hand.”

An obvious way to begin this inquiry was simply to measure the lengths of sentences, and thus Sherman undertook a great counting project. By experiment he found that a sample of 500 sentences was enough to characterize an author’s habits, and so he tallied such samples for a dozen writers. Some basic facts quickly emerged. Robert Fabyan, writing circa 1500, produced sentences with an average length of 63 words. Edmund Spenser, a century later, wrote 50-word sentences. By the time we come to Ralph Waldo Emerson in the middle of the 19th century, the average sentence length has dropped to 20.5 words. Comparing the overall averages for early and modern writers “furnish[es] evidence that the English prose sentence had dropped something like half its weight since Shakespeare’s times.”
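A count of the kind Sherman made by hand can be approximated today in a few lines of Python. The sketch below is a modern illustration, not a reconstruction of his procedure. It assumes that a sentence ends at a period, exclamation point, or question mark followed by whitespace, and that words are whitespace-separated tokens; the function names and the sample text are hypothetical.

```python
import re

def sentence_lengths(text):
    """Return the word count of each sentence in text.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    Abbreviations such as 'Mr.' would be miscounted by this simple rule.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [len(s.split()) for s in sentences if s]

def mean_sentence_length(text):
    """Average number of words per sentence, or 0.0 for empty text."""
    lengths = sentence_lengths(text)
    return sum(lengths) / len(lengths) if lengths else 0.0

# Made-up sample text, used only to exercise the functions.
sample = "The sea was calm. We sailed at dawn, and no one spoke! Why?"
print(sentence_lengths(sample))
print(mean_sentence_length(sample))
```

Applied to a 500-sentence sample from each author, a function like mean_sentence_length would yield averages comparable in kind to the figures quoted above, subject to how sentence boundaries are defined.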

As his sentence-length data accumulated, Sherman began noticing other patterns. Some of his observations are mere curiosities; for example, he remarked on an excess of odd numbers in the sentence lengths of Thomas Babington Macaulay and an excess of prime numbers in those of Thomas De Quincey. But Sherman also explored the distribution of sentence lengths—what Mendenhall might have called the sentence spectrum—which conveys much more information than a simple average. (See illustration above.) He also remarked on rhythms created by variations in length from one sentence to the next. (See illustration below.) And he went on to examine the evolution of subtler linguistic properties such as the number of verbs per sentence and the linkage between clauses in complex sentences.
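Given a list of sentence lengths, the patterns mentioned here are easy to tabulate. The following Python sketch is offered only as an illustration under stated assumptions: the helper names are hypothetical, the input numbers are invented rather than drawn from Sherman's tables, and the "rhythm" measure (the difference between successive lengths) is one simple choice among many.

```python
from collections import Counter

def length_distribution(lengths):
    """Map each sentence length to the number of sentences of that length."""
    return dict(sorted(Counter(lengths).items()))

def is_prime(n):
    """True if n is a prime number (trial division, fine for small n)."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def share(lengths, predicate):
    """Fraction of lengths satisfying predicate, or 0.0 for an empty list."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if predicate(n)) / len(lengths)

def successive_differences(lengths):
    """Change in length from each sentence to the next."""
    return [b - a for a, b in zip(lengths, lengths[1:])]

# Invented sentence lengths, not data from Analytics of Literature.
lengths = [4, 8, 1, 13, 8, 21, 7]
print(length_distribution(lengths))
print(share(lengths, lambda n: n % 2 == 1))  # share of odd lengths
print(share(lengths, is_prime))              # share of prime lengths
print(successive_differences(lengths))
```

Whether an apparent excess of odd or prime lengths means anything would depend on comparing these shares with what chance alone would produce for a sample of the same size.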

[Illustration: 2014-07CompsciFp264bot.jpg]


