First Links in the Markov Chain
Probability and poetry were unlikely partners in the creation of a computational tool
Counting Vowels and Consonants
Markov first addressed the issue of dependent variables and the law of large numbers in 1906. He began with a simple case—a system with just two states. On the assumption that all four transition probabilities are greater than 0 and less than 1, he was able to prove that as the system evolves over time, the frequency of each state converges to a fixed average value. Over the next few years Markov extended and generalized the proof, showing that it applies to a broad class of models.
This series of results achieved at least one of Markov’s goals: It forced Nekrasov to retreat from his claim that the law of large numbers implies free will. But the wider world of mathematics did not take much notice. One thing lacking was any hint of how these ideas might be applied to practical events. Markov was proudly aloof from such matters. He wrote to Chuprov: “I am concerned only with questions of pure analysis.… I refer to the question of the applicability of probability theory with indifference.”
By 1913, however, Markov had apparently had a change of heart. His paper on Onegin was certainly a work of applied probability theory. It made a lasting impression, perhaps in part because of the novelty of applying mathematics to poetry. Perhaps too because the poem he chose is a treasured one, which Russian schoolchildren recite.
From a linguistic point of view, Markov’s analysis was at a very superficial level. It did not address the meter or rhyme or meaning of Pushkin’s verse. It treated the text as a mere stream of letters. Simplifying further still, the letters were lumped into just two classes, vowels and consonants.
Markov’s sample comprised the first 20,000 letters of the poem, which is about an eighth of the total. He eliminated all punctuation and white space, jamming the characters into one long, unbroken sequence. In the first phase of his analysis he arranged the text in 200 blocks of 10×10 characters, then counted the vowels in each row and column. From this tabulation he was able to calculate both the mean number of vowels per 100-character block and the variance, a measure of how widely samples depart from the mean. Along the way he tallied up the total number of vowels (8,638) and consonants (11,362).
In a second phase Markov returned to the unbroken sequence of 20,000 letters, combing through it to classify pairs of successive letters according to their pattern of vowels and consonants. He counted 1,104 vowel-vowel pairs and was able to deduce that there were 3,827 double consonants; the remaining 15,069 pairs must consist of a vowel and a consonant in one order or the other.
With these numbers in hand, Markov could estimate to what extent Pushkin’s text violates the principle of independence. The probability that a randomly chosen letter is a vowel is 8,638/20,000, or about 0.43. If adjacent letters were independent, then the probability of two vowels in succession would be (0.43)2, or about 0.19. A sample of 19,999 pairs would be expected to have 3,731 double vowels, more than three times the actual number. Thus we have strong evidence that the letter probabilities are not independent; there is an exaggerated tendency for vowels and consonants to alternate. (Given the phonetic structure of human language, this finding is not a surprise.)
Markov did all of his counting and calculating with pencil and paper. Out of curiosity, I tried repeating some of his work with an English translation of Onegin. Constructing 10×10 tables on squared paper was tedious but not difficult. Circling double vowels on a printout of the text seemed to go quickly—10 stanzas in half an hour—but it turned out I had missed 62 of 248 vowel-vowel pairs. Markov was probably faster and more accurate than I am; even so, he must have spent several days on these labors. He later undertook a similar analysis of 100,000 characters of a memoir by another Russian writer, Sergei Aksakov.
A computer reduces the textual analysis to triviality, finding all double vowels in four milliseconds. The result of such an analysis, shown in the illustration above, suggests that written English is rather vowel-poor (or consonant-rich) compared with Russian, and yet the structure of the transition matrix is the same. The probability of encountering a vowel depends strongly on whether the preceding letter is a vowel or a consonant, with a bias toward alternation.