Science Needs More Moneyball
Baseball's data-mining methods are starting a similar revolution in research
The Moneyball story, in book and film, champions a data-mining revolution that changed professional baseball. On the surface, Moneyball is about Billy Beane, the general manager of the Oakland A’s, who found a way to lead his cash-strapped club to success against teams with much bigger payrolls. Beane used data to challenge what everyone else managing baseball “knew” to be true from intuition, experience and training. He pioneered methods to identify outstanding players he could afford because they were undervalued by the traditional statistics used by the baseball elite.
This film was marketed as a sports movie. When I saw it, I knew right away what Moneyball is really about: the thrill and triumph of data mining. It’s an instructive tale of how existing data can be examined for meaning in ways that were never intended or imagined when they were originally collected. Beane and his colleagues challenged the time-honored trinity of batting average, home runs and runs batted in (RBIs) as the essence of the offensive value of a player, replacing these statistics with newer measures based on the same data. They worked off theories developed by baseball writer and historian Bill James, who posited in the 1970s that the traditional stats were really imperfect measurements. James’s approach didn’t just replace one intuition with another. He let the game decide which stats did the best job of predicting offensive output.
This approach is not easy. Trying to directly predict the number of games won would confound the skill of a team’s offense with its pitching and fielding. James figured that one could test each offensive stat by trying to predict the total number of runs produced by each team over the course of a season, thus eliminating any effects of defense. It turned out that on-base percentage and slugging percentage were far superior to any individual offensive statistics used up to that point. James and others similarly devised statistics for pitching and fielding that were more independent of context.
Beane’s use of the new statistics is appealing because it defeated the wisdom and training of other industry experts. His approach is summed up in one of the best scenes from the Moneyball film. Armed with his new data-mining methods, Beane challenges other talent evaluators about a player they all deem “good.” A scout counters him, praising the player’s swing. Beane’s reply: “If he’s such a good hitter, why doesn’t he hit good?”
In other words, expert intuition aside, the data don’t lie.