All sciences are based on empirical evidence. Data quantify the observed or measured phenomena; they can stimulate new hypotheses, support or disprove existing theories, and ultimately lead us to a better understanding of the world. The information revolution has dramatically intensified this process by enabling vast growth in the volume, rate, and quality of data, and thus in the ways in which science is conducted. Astronomy has been on the front lines of these developments for the past three or four decades, and it is still pushing the boundary today.
In every field of research, the volume and acquisition rates of data have been increasing exponentially, doubling every year or two. Fortunately, our ability to process the data has been increasing at the same pace. The devices that generate data (such as digital detectors) and those that help us process the data (such as computers) are both described by Moore’s law. Although this rapid growth in data quantity is the most visible consequence of the information technology revolution, there has also been a great increase in the complexity, quality, and richness of the data. These areas present both the most daunting tests and the greatest opportunities on the path from data to knowledge.
Astronomers have always been quick to embrace new technologies for the detection and measurement of radiation from the sky, starting with photographic plates in the 19th century and continuing through a plethora of digital detectors and computers in the 20th century. In the 1980s astronomers helped develop revolutionary new imaging arrays, such as charge-coupled devices (CCDs), for collecting visible light from distant objects; these devices were soon followed by their counterparts for collecting infrared and other wavelengths. Ever since, astronomical data have been “born digital.” Even the old photographic plates, such as those from the Palomar Observatory Sky Surveys, have been scanned and converted to a digital format suitable for computer processing. One consequence of this transformation is that the astronomy community has become both computationally savvy and data savvy.
In the 1980s, astronomical images were measured in kilobytes and megabytes. By the 1990s, a new generation of digital sky surveys—for example, the Sloan Digital Sky Survey, which covered large areas of the sky at multiple wavelengths—entered the terabyte regime, collecting trillions of bytes of data. These surveys detected hundreds of millions of sources, such as stars and galaxies, and they recorded tens to hundreds of descriptive numbers for each of them.
To cope with this data avalanche, astronomers started developing data-processing pipelines that transformed the raw bits coming from the detectors into fully calibrated images, spectra, and catalogs of sources and their properties. Today, every modern astronomical data-gathering system has such a pipeline. Researchers have implemented new data-management techniques to organize and access these vast data sets, and they have begun using machine-learning tools to help analyze the information.
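The essential logic of such a pipeline can be sketched in miniature. The example below is a deliberately simplified, hypothetical illustration (the array names, calibration frames, and detection threshold are invented for this sketch, not taken from any survey's actual software): it performs the two classic CCD calibration steps—bias subtraction and flat-field correction—and then flags pixels bright enough above the background to count as detected sources.

```python
import numpy as np

def calibrate(raw, bias, flat):
    """Basic CCD calibration: subtract the bias frame, then divide by the
    normalized flat field to correct pixel-to-pixel sensitivity variations."""
    return (raw - bias) / (flat / flat.mean())

def detect_sources(image, nsigma=5.0):
    """Flag pixels brighter than nsigma standard deviations above the
    background -- a crude stand-in for real source-extraction software."""
    background = np.median(image)
    noise = np.std(image)
    return np.argwhere(image > background + nsigma * noise)

# Tiny synthetic example: a 100x100 frame of noise with one bright "star".
rng = np.random.default_rng(0)
raw = rng.normal(100.0, 5.0, (100, 100))
raw[40, 60] += 500.0                      # inject a bright source
bias = np.full((100, 100), 50.0)          # constant bias frame (illustrative)
flat = np.ones((100, 100))                # perfectly uniform flat field

sources = detect_sources(calibrate(raw, bias, flat))
print(sources)  # pixel coordinates of the detected sources
```

Real pipelines add many more stages—cosmic-ray rejection, astrometric and photometric calibration, catalog cross-matching—but each is, at heart, this same kind of array transformation applied at enormous scale.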
Where the Data Avalanche Begins
Most data in astronomy do not come from the familiar large telescopes on the ground, such as the twin Keck 10-meter telescopes in Hawaii, or from the high-profile space-based telescopes, such as the Hubble Space Telescope; both of these observatories produce data of high quality and depth, but they focus on small, specific targets. Most data come instead from the more modest telescopes, such as the 48-inch Samuel Oschin telescope at Palomar Observatory, which survey large fractions of the sky and detect vast numbers of sources. Data mining is then used to select the most interesting sources for follow-up by the flagship facilities.
This data avalanche continues with the current generation of synoptic (big-picture) sky surveys, such as the Catalina Sky Survey in Arizona, the Zwicky Transient Facility at Palomar, and many others. These facilities repeatedly scan the sky, looking for objects that move, such as asteroids, or that change in brightness, such as cosmic explosions or variable stars. We have thus moved from massive panoramic photography to massive panoramic cinematography of the sky; from massive data sets to massive data streams; and from the terabyte regime to a petabyte regime, with exabyte (quintillion bytes) data sets on the horizon.
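The core operation behind this "cinematography" is comparing repeated observations of the same patch of sky and flagging anything that changed. A toy version of that comparison, using made-up catalog values and an arbitrary threshold (both are illustrative assumptions, not any survey's actual criteria), looks like this:

```python
import numpy as np

def find_variables(mags_epoch1, mags_epoch2, threshold=0.5):
    """Return indices of sources whose brightness (in magnitudes)
    changed by more than `threshold` between two observation epochs."""
    change = np.abs(mags_epoch2 - mags_epoch1)
    return np.flatnonzero(change > threshold)

# Five cataloged sources observed twice. Source 2 brightens by two
# magnitudes (say, an outbursting variable star); the rest barely change.
epoch1 = np.array([15.2, 18.7, 16.0, 19.3, 17.1])
epoch2 = np.array([15.3, 18.6, 14.0, 19.2, 17.2])

print(find_variables(epoch1, epoch2))  # → [2]
```

Surveys such as the Zwicky Transient Facility run analogous (vastly more sophisticated) comparisons across hundreds of millions of sources every night, which is precisely why the data stream, rather than the data set, becomes the fundamental object.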
By the late 1990s, as digital sky surveys became the largest data producers in astronomy, it became obvious to a number of astronomers—including Alex Szalay at Johns Hopkins University, Tom Prince and myself at the California Institute of Technology (Caltech), and many others—that we needed a fresh approach to conduct science with such enormous and growing data sets. In addition to tackling the data gathered by the individual observatories, sky surveys, and space missions, each of which formed its own archive, there was a need to combine data sets (for example from different wavelengths) to uncover knowledge that might be present but not recognizable in each of the separate data sets.
Realizing that the traditional desktop-scale approaches were no longer adequate to keep up with the large digital sky surveys, scientists began developing tools to deal with the fast-expanding data volume. Through discussions among these scientists from the late 1990s to the early 2000s, the concept of a virtual observatory (VO) was born.
The VO was intended to be a complete, distributed, online framework for astronomy involving massive and complex data sets; it would enable scientists to discover, access, combine, and analyze data from a broad range of observatories, sky surveys, space missions, and even theoretical simulations. In the United States, a strong endorsement by the Decadal Survey of Astronomy and Astrophysics (used to set priorities for federal astronomy spending) and sponsorship by the National Science Foundation and NASA led to the creation of the National Virtual Observatory and its successor, the Virtual Astronomical Observatory. Other countries quickly followed suit. Today we have an International Virtual Observatory Alliance that serves as a coordinating body for these national organizations.
The Age of the VO
What the VO has evolved into is a global data grid for astronomy. Essentially all of the astronomical data that matter (unless protected by a proprietary period), whether from the ground or space, on any wavelength, from nearly every observatory, space mission, or sky survey, are now accessible to anyone in the world with an internet connection. The data’s accessibility to amateur astronomers has proven especially valuable. Amateur-led research projects have discovered supernovas, measured positions of newfound asteroids, and monitored outbursts of interesting variable sources.
All of the astronomical literature has also been digitized and placed online. That literature is increasingly being cross-linked to the data sets used for the publications, through electronic journals, services such as the NASA/IPAC Extragalactic Database (which was developed with NASA by Caltech’s Infrared Processing and Analysis Center), NASA’s Astrophysics Data System (essentially the comprehensive, global library of astronomy), and the arXiv preprint server. This leveling of the playing field is enormously empowering, as it enables talented scientists and students everywhere, no matter how remote, to perform first-rate science and to make notable discoveries. Today, you can be a successful observational astronomer and never see a telescope in your life: The archives are the new sky, and the algorithms are the new instruments.
Although the VO does provide some data exploration tools, it doesn’t yet fulfill the scientific potential enabled by this wealth of data. Rather, many such tools are being developed by individual researchers and groups to mine knowledge from the vast data resources. The emerging bridge field of astroinformatics—analogous to bioinformatics, geoinformatics, and other “x-informatics”—connects astronomy with computer science and engineering, statistics, and other sources of data exploration and knowledge-discovery methodologies. Over the past decade, the growing astroinformatics community has developed many such tools and shared them broadly, leading to numerous new insights.
The sheer size of modern astronomical data sets enables systematic studies of unprecedented scope. We can investigate the structure of our galaxy, the organization and evolution of families of galaxies and large-scale structures in the universe, quasars and other objects powered by supermassive black holes, the nature of dark matter and dark energy, planetary systems around other stars, the population of potentially hazardous asteroids, and much more.
Because the progress in information technology does not show any signs of slowing down, the scope of astronomy continues to expand. Through our measurements, we are mapping the physical universe into a vast data cyberspace. We are standing at a frontier, but definitely not the final one.
Astronomy is not unique. A similar story is unfolding in essentially every other field of science. All sciences are being transformed by big data, with similar challenges and solutions. The emerging methodology serves as the new universal language of science, akin to the roles played by mathematics and statistics in the past. Collectively, we are redeveloping the scientific method for the computationally enabled, data-intensive science of the 21st century. Who knows what great discoveries will emerge from this technology-driven evolutionary transformation?