Harnessing the Web to Track the Next Outbreak

Innovations in data science and disease surveillance are changing the way we respond to public health threats.

This Article From Issue

November-December 2016

Volume 104, Number 6
Page 346

DOI: 10.1511/2016.123.346

On March 14, 2014, a French news site reported a strange fever in Macenta, a small village in Guinea. The story described a “new disease whose name is unknown” that caused victims to bleed from their orifices, and had already “killed eight people and contaminated several others.”

The date of this first media report preceded official government reports on the emerging Ebola outbreak by almost 10 days. Eventually, the virus infected more than 28,000 people and killed more than 11,300, devastating West Africa and inciting fear around the world. Digital surveillance, enabled by recent advances in big data and related technologies, can help governments respond more quickly to public health threats. The online disease alert system HealthMap picked up on that first media report and relayed it to the public the same day.

This knowledge is critical in a world where factors such as climate change, population growth, urbanization, and global travel fuel the spread of infectious diseases and pandemics, and are exacerbated by forces such as the anti-vaccine movement and increasing antimicrobial resistance.

Traditional Disease Monitoring

Public health surveillance relies on systematically and continuously collecting, analyzing, and disseminating information as quickly as possible to guide disease prevention and control. These tracking systems serve a central role in informing the decisions of public health officials, particularly when it comes to tackling communicable illnesses. Standard disease monitoring has relied on reports from traditional sources, such as health ministries and health-care agencies. Such monitoring uses two types of surveillance, passive and sentinel, both of which have limitations. Passive surveillance, which involves a national health system gathering information from health-care providers and laboratories, often suffers from incomplete and delayed reporting, particularly in resource-poor settings. Sentinel surveillance is a more laborious approach, which involves monitoring the rate of specific diseases in a small cohort of people to estimate trends in the general population.

As a result, traditional public health approaches are often delayed by the communication chain, as data make the journey from patient, to health practitioner, to confirmatory laboratories, to local administrative bodies within regional and national ministries of health, before finally reaching authoritative world bodies such as the World Health Organization (WHO). The process can take days, weeks, or months, during which time a disease can spread globally. One report showed that in 1996, it could take as long as 167 days from an outbreak’s start to its discovery. Newer, electronic surveillance methods allow data to go directly from citizens to public health agencies in near real time.

Big Data and Technology

The proliferation of big data in public health has created new opportunities for understanding and visualizing complex global health risks. Digital epidemiology has transformed the public health community’s ability to detect diseases, risks, patterns, and outcomes, by aggregating and filtering online and mobile data from news, social media, blogs, and other informal sources. These applications complement traditional sentinel surveillance methods by providing near real-time, geospatial data about emerging risks, often visualized using interactive maps and dashboards.

Web-based electronic information sources have the potential to play an important role in early event detection and in increasing public awareness of a situation. The current, highly local information they produce can help identify events that may go unreported in the public health system. Food poisoning is a great example, because those sickened by food often do not visit a doctor. Although someone ill might tweet or take to Yelp warning others to avoid a restaurant, there is often no formal reporting. Web sources such as these are potentially useful for detection of food safety events. The challenge remains, however, in distinguishing which reports are actually relevant from the large volume of unrelated chatter on the Internet.

Over the past decade, several Web-based early-warning systems have emerged that collect disease-specific data from informal sources: The Medical Information System (MedISys) mines the Internet for public health hazards; ProMED-mail enables public health experts to report and disseminate disease risk information; and HealthMap, BioCaster, and EpiSPIDER crawl Web-based news and other resources to visualize emerging infectious disease outbreaks through mapping interfaces. The use of news media and other nontraditional sources of surveillance data can facilitate early detection of outbreaks and increase public awareness of health concerns prior to their formal recognition. Combined, these systems enhance both the timeliness and sensitivity of early disease detection. One study that gathered data from 1996 to 2006 found a lag between informal communication of an outbreak and official reporting by WHO Disease Outbreak News of 16 days, an interminable amount of time for a highly communicable disease to circulate undetected by health authorities.
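The timeliness comparison described above comes down to simple date arithmetic. The following is a minimal sketch, assuming each outbreak is recorded as a pair of dates (first informal report, first official report); the function names and the sample dates are hypothetical and are not drawn from the study or from any of the surveillance systems named here.

```python
from datetime import date
from statistics import median


def reporting_lags(events):
    """Return the lag in days between informal and official reports.

    `events` is a list of (informal_date, official_date) pairs.
    """
    return [(official - informal).days for informal, official in events]


def median_lag(events):
    """Median number of days by which informal reports led official ones."""
    return median(reporting_lags(events))


# Synthetic example dates, for illustration only (not real outbreak data).
sample_events = [
    (date(2020, 1, 1), date(2020, 1, 17)),
    (date(2020, 3, 1), date(2020, 3, 11)),
    (date(2020, 6, 5), date(2020, 6, 25)),
]
```

Calling `median_lag(sample_events)` summarizes how far ahead informal sources ran in this toy sample; a median is used here because a few very late official reports would skew a mean.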

In 2006, our team developed HealthMap, an online global disease alert system, with researchers at Boston Children’s Hospital and Harvard Medical School. The aim of the website is to provide a comprehensive, real-time overview of infectious disease activity by geographic location for a diverse audience, from public health officials to international travelers.

The system incorporates automation and expert review to aggregate reports on new and ongoing infectious disease outbreaks in more than 15 languages. It processes information from more than 200,000 disparate sources, including international online news aggregators, eyewitness reports (approved alerts created by community members through the site’s mobile app), expert-curated discussions, and validated official reports (such as ones from ministries of health websites).

The software captures the inherently geospatial nature of disease, along with time, case counts, and notable characteristics of the outbreak. Relevant articles are filtered from noise using natural language processing (NLP), a family of computer algorithms that analyze text patterns. Human experts curate information before it is displayed on the HealthMap dashboard, correcting misclassified alerts (false positives, such as an automatically generated outbreak report that is not actually disease related) or changing meta tags automatically assigned by the system. These misclassifications may occur when the system detects words that would typically be disease related, as in “(Justin) Bieber Fever” or “an outbreak of crime.” In turn, analyst corrections are used to improve the automated processes.
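To make the filtering problem concrete, here is a deliberately simple, rule-based sketch of how disease-like words can be separated from look-alike phrases such as the ones mentioned above. It is not HealthMap's actual algorithm, which the article describes only as NLP with expert review; the term list, the exclusion patterns, and the function name are all assumptions made for illustration.

```python
import re

# Hypothetical vocabulary of disease-related words (illustrative only).
DISEASE_TERMS = {"fever", "outbreak", "ebola", "dengue", "cholera", "influenza"}

# Phrases that contain disease-like words but are not about disease,
# modeled on the false positives mentioned in the text.
FALSE_POSITIVE_PATTERNS = [
    re.compile(r"\bbieber fever\b"),
    re.compile(r"\boutbreak of (crime|violence)\b"),
]


def is_disease_related(headline):
    """Return True if the headline still contains a disease term
    after known look-alike phrases have been stripped out."""
    text = headline.lower()
    for pattern in FALSE_POSITIVE_PATTERNS:
        text = pattern.sub(" ", text)
    words = set(re.findall(r"[a-z0-9]+", text))
    return bool(words & DISEASE_TERMS)
```

Stripping the look-alike phrases first, rather than rejecting the whole headline, means a story that mentions both “Bieber Fever” and a real dengue outbreak would still be kept. A production system would more plausibly learn such distinctions from analyst corrections than from a hand-written list.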

Following automated processing, alerts are posted directly to the public map and partner pages. The system detects common, seasonal, or endemic conditions, as well as outbreak and epidemic situations. Global health authorities such as the WHO and the U.S. Department of Defense, as well as major news outlets, have credited the software with early-warning detection of new and recurrent infectious disease outbreaks.

Diseases of Direct Contact

Novel data sources and technologies are improving public health surveillance for a variety of infectious diseases, particularly vector-borne diseases (transmitted through direct contact, such as mosquito, tick, or flea bites), zoonotic diseases (transmitted through animals, such as bird flu), and foodborne diseases. In addition to earlier detection of outbreaks, this monitoring can help policymakers, especially in resource-strapped nations, make decisions on the best course of response based on the potential spread of an infectious disease, saving time and lives. Retrospective reviews of data from past outbreaks offer insights into how fast a disease spreads and help identify other factors that might have been instrumental in the outbreak, such as travel patterns or the lack of sufficient health care.

The need for a better understanding of vector-borne diseases has been driving innovative methods for the early detection of several notable illnesses, such as dengue, malaria, and Zika. According to the WHO, vector-borne diseases account for more than 17 percent of all infectious diseases and cause more than one million deaths per year.

Dengue hemorrhagic fever is one of the most widespread vector-borne diseases. It is endemic in more than 100 countries, notably in Southeast Asia, the Americas, and the Western Pacific Islands, and it affects an estimated 2.5 billion people. Our group collaborated with Google to develop Dengue Trends, an application for real-time detection of dengue activity based on Google search queries. In a study published in 2011 in PLoS Neglected Tropical Diseases, our team evaluated whether such searches are a viable data source for the early detection and monitoring of dengue epidemics. Specifically, we examined queries aggregated from Bolivia, Brazil, India, Indonesia, and Singapore, and found that they provided information nearly in real time, as compared to official sources.

Additionally, we have evaluated the use of such unofficial data sources for tracking the recent geographic expansion of dengue across Latin America, and have compared unofficial reports against the areas that the U.S. Centers for Disease Control and Prevention (CDC) has declared endemic. We found that disease data from online media, when used in combination with traditional case reporting, not only improve the timeliness of outbreak discovery and knowledge dissemination but also provide value for public health decision making and forecasting models. Models using dengue-related queries from Google searches adequately estimated true dengue activity measured by the WHO and ministries of health. Our system additionally contributes to DengueMap, part of the CDC’s online dengue information resource, and produces its own Dengue Viral Global Consensus Outbreak Map to highlight geographic areas with endemic risk of dengue.
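One simple way to relate search activity to reported cases is a straight-line fit, sketched below. This is an illustration under stated assumptions, not the model used in the dengue studies: the article does not specify the model form, the function names are hypothetical, and the weekly numbers are synthetic values constructed to lie exactly on a line.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def estimate_cases(query_volume, intercept, slope):
    """Estimate case counts for a week from its search query volume."""
    return intercept + slope * query_volume


# Synthetic training data, for illustration only (not real surveillance data).
weekly_query_volume = [10, 20, 30, 40, 50]
weekly_official_cases = [25, 45, 65, 85, 105]  # built as 5 + 2 * volume

intercept, slope = fit_line(weekly_query_volume, weekly_official_cases)
```

Once fitted on weeks where official counts exist, `estimate_cases` can be applied to the current week's query volume, which is available long before official reports arrive. Real query data are noisy, so any such model would need to be validated against held-out official counts.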

Another vector-borne disease with a high global burden is malaria. The WHO indicates that nearly half the world’s population—approximately 3.2 billion people—is at risk of contracting the illness. It is a leading cause of death and disease in developing countries, where young children and pregnant women are the most vulnerable. Over the past 15 years, malaria incidence has fallen by 37 percent around the world, and mortality has dropped by 60 percent. In an effort to reduce the global incidence of malaria cases by a further 90 percent, the World Health Assembly adopted what it calls a “Global Technical Strategy for Malaria 2016–2030.”

The proliferation of digital data and the elimination of traditional hierarchical communication barriers have accelerated responses to outbreaks.

One of the challenges to eliminating malaria is the lack of health care infrastructure to effectively identify and treat infected individuals, underscoring the importance of digital surveillance. Leapfrog technology (where areas with lagging technology skip intermediate steps and adopt modern technology, such as jumping to cell phones rather than installing land lines) and mobile devices have shown the potential for enhancing the coverage, timeliness, and transparency of public health reporting. Our project experimented with the use of micromonetary incentives to increase public reports of malaria illness in urban centers of India. Self-reports about malaria diagnosis status and related information were solicited online via Amazon’s Mechanical Turk, a market in which anyone can post micro-tasks and responders (“Turkers”) receive a stated fee for completed tasks. The study found that the prevalence of self-reported diagnoses of malaria was comparable to official prevalence reports found in the literature. This work marked the first use of micromonetary incentives and online reporting for public health surveillance, and highlighted the effective use of online systems such as Mechanical Turk to complement and even enhance traditional survey methods. To further advance digital surveillance for malaria detection, the HealthMap project has built models from Google search queries to estimate malaria activity trends in Thailand, and contributed to malaria forecasting models for endemic countries, such as Uganda.

In May 2015, locally acquired cases of the frightening Zika virus were confirmed in Brazil. There, a surge in infants born with microcephaly (small heads and underdeveloped brains) and other neurological disorders prompted the WHO to declare these Zika-related disorders a Public Health Emergency of International Concern. Early in the outbreak, an international team of researchers used HealthMap data to build models projecting the international spread of Zika virus from Brazil, drawing on digital disease, climatic, and traveler data. Results published in the journal The Lancet indicated that Zika had the potential to rapidly spread across Latin America and the Caribbean, with seasonal transmission in many parts of the United States and year-round transmission in certain areas, including parts of Florida and Texas. The HealthMap project has been tracking the Zika outbreak through a prospective timeline, and interpreting incidence data alongside maps of the distribution of Aedes aegypti and Aedes albopictus, the principal mosquito vectors of the Zika virus. Health authorities and the public can use the application to track new cases. As there is currently no vaccine or treatment available, seeing where Zika cases are in near real time can aid in the public’s decision making on such topics as travel or family planning.

Animals Harboring Disease

According to the CDC, 6 out of every 10 infectious diseases in humans are spread from animals. Such zoonotic diseases make up the majority of emerging infectious diseases, and their prevalence has been associated with a variety of factors, including climate change and the encroachment of human settlements and agriculture on natural ecosystems.

In 2009, the pandemic influenza A (H1N1) outbreak demonstrated the importance of new digital methods for zoonotic disease tracking and response. On April 1, 2009, HealthMap flagged a news story from Mexico detailing a mysterious respiratory illness in Veracruz that killed two people. In collaboration with the New England Journal of Medicine’s H1N1 Influenza Center, our group created an interactive map of worldwide cases, posted frequent Twitter updates on the outbreak, and rapidly disseminated breaking news alerts to users (an estimated one million people used HealthMap to monitor H1N1 activity). During the two major waves of the H1N1 pandemic, HealthMap collected more than 87,000 reports from both informal and official sources. We also tracked the rise in the number of countries with informal reports of suspected or confirmed cases.

In March 2013, avian influenza A (H7N9), a subtype of influenza viruses previously detected only in birds, emerged for the first time in humans in China. During this outbreak, HealthMap reported a steady increase in H7N9 cases throughout the country (all associated with exposure to live poultry or potentially contaminated environments), including 38 cases and 10 deaths. Notably, when a Chinese hospital employee shared a picture of the medical record of a patient with H7N9 on the Chinese social media website Sina Weibo, the action was credited with accelerating the government’s acknowledgement of new cases.

More recently, our work on zoonotic diseases has focused on Ebola, the hemorrhagic fever that swept through West Africa in 2014. Within six months of the outbreak, HealthMap aggregated, classified, and visualized more than 13,000 alerts. The project has since explored, for West Africa, how the incidence of digital Ebola reports correlates with reported acts of aggravation (such as riots) and, conversely, with positive public health actions. This qualitative analysis indicated that local aggravating events and regional interventions, as reported in real time by media outlets, were effective proxy measures of changes in Ebola incidence. Further, the software has examined the velocity of the spread of the virus and embedded predictive modeling into the platform. Responding to diseases such as Ebola that spread by human contact requires the cooperation of the public. Patients presenting with symptoms need to be quickly isolated, and burials should be done safely. If the affected population is better informed, public health personnel can focus less on disease control and more on treating those already affected.

Foodborne Diseases

Diseases can also spread to humans from food tainted with viruses, bacteria, or parasites. The CDC attributes 48 million illnesses, 128,000 hospitalizations, and 3,000 deaths annually to food-based pathogens and unspecified agents, but many foodborne illness outbreaks go unreported through official channels. Though the news-making E. coli outbreak at the Chipotle restaurant chain caused 55 people to fall ill in 2016, many foodborne cases do not result in a health care interaction, meaning there are no formal reports from health care providers to track.

To address this problem, our group tested whether restaurant reviews on Yelp.com (a publicly available business review site) could support foodborne illness surveillance efforts. We obtained reviews from 5,824 food services businesses from 2005 to 2012, and compared digital reports of foodborne illness episodes to official outbreak reports from the CDC. We saw a very similar distribution of foodborne illness reports by implicated foods between CDC and Yelp reports. These findings suggest that social media can provide information on foodborne illnesses, as well as implicated foods and locations, and could complement traditional foodborne disease reporting. Health authorities such as the Chicago Department of Public Health have adopted the use of social media mining for similar signal detection.

The U.S. Food and Drug Administration (FDA) has also incorporated digital surveillance into its efforts. In January 2011, President Barack Obama signed into law the FDA Food Safety Modernization Act, a sweeping reform that emphasizes the need to enhance surveillance and prevention efforts. In collaboration with the FDA’s Office of International Programs, we used HealthMap’s core technology as the foundation for a new application that can identify, map, and describe potential contamination in the food supply chain. SupplyChainMap monitors online news and social media for early-warning signals of microbial, chemical, and fungal food contamination among global food suppliers. The application uses automated text processing to tag information regarding location, contaminant type, food group, and company or brand name. Events are categorized as concerning food safety, food fraud, food quality, or food defense. The tool monitors the food supply chain from China to the United States and detects consumer-reported food safety and quality events prior to an outbreak. Preliminary analyses have demonstrated that the system generates timely information regarding food contamination and can effectively trace digitally reported events to commercial trade risks. Dynamic analytics, such as a Sankey diagram (a flow diagram that depicts movement from source to destination), allow users to explore the relationship between implicated products, contaminants, and source location; they allow regulatory agencies to decide which manufacturers and importers need inspection and what products require further scrutiny.

Other Threats to Public Health

Researchers are increasingly recognizing the value of social media for assessing health-related behaviors and sentiments relevant to disease control. Vaccine hesitancy is a well-known issue in public health and is a driver of vaccine-preventable disease. When a certain proportion of the population is not protected against vaccine-preventable, communicable illnesses such as whooping cough and measles, diseases once subdued by modern medicine can reemerge.

To keep abreast of vaccine-hesitant conversations online and enable proactive and targeted communication by public health professionals, we adapted our technology so that it uses a vaccine-specific search taxonomy and categorizes content by sentiment toward more than 30 vaccines. This software, called Vaccine Sentimeter, analyzes online conversations on the topic and enables researchers to follow specific events associated with changes in public sentiment toward vaccines. For example, when an episode of the American TV show Katie featured potential severe side effects of the human papillomavirus vaccine, data showed that the immediate reaction on mainstream and social media was critical of the show, citing a lack of scientific and balanced information, which indicated support for the vaccine. This immediate vaccine-positive reaction waned quickly, however, while vaccine-negative reactions persisted on social media.

In addition to the anti-vaccine movement, the growth of antimicrobial resistance has emerged as a major global threat to public health. The danger is growing in every region of the world, and in some cases has rendered useless antibiotics once considered to be very strong and effective. In the United States, the CDC reports that an estimated 2 million illnesses and 23,000 deaths are attributed to antibiotic-resistant bacteria or fungi each year. With few replacements for existing drugs on the horizon, some scientists warn we could be entering a post-antibiotic era.

Our group has delved into this topic, using online open-source data and applying a search taxonomy that includes not only mentions of drug resistance cases but also specific pathogens and mechanisms of resistance. Further, public hospital antibiograms, lab tests performed to determine the sensitivity of isolated bacterial strains to drugs, have been manually collected and entered into the system. Aggregating these data, the system displays rates of antibiotic resistance for various drugs and pathogens, specific to user-selected areas.
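The aggregation step described above can be sketched as a grouped ratio. This is a minimal illustration, assuming each antibiogram entry has been entered as a record with an area, pathogen, drug, number of isolates tested, and number found resistant; the field names, function name, and sample values are hypothetical, and real antibiograms often report percent susceptible rather than raw counts.

```python
from collections import defaultdict


def resistance_rates(records, area):
    """Pool isolate counts across hospitals in one area and return
    the resistant fraction for each (pathogen, drug) pair."""
    tested = defaultdict(int)
    resistant = defaultdict(int)
    for record in records:
        if record["area"] != area:
            continue
        key = (record["pathogen"], record["drug"])
        tested[key] += record["tested"]
        resistant[key] += record["resistant"]
    # Skip pairs with no isolates tested to avoid dividing by zero.
    return {key: resistant[key] / tested[key] for key in tested if tested[key] > 0}


# Synthetic records, for illustration only (not real antibiogram data).
sample_records = [
    {"area": "A", "pathogen": "E. coli", "drug": "ciprofloxacin", "tested": 100, "resistant": 20},
    {"area": "A", "pathogen": "E. coli", "drug": "ciprofloxacin", "tested": 300, "resistant": 100},
    {"area": "B", "pathogen": "E. coli", "drug": "ciprofloxacin", "tested": 50, "resistant": 5},
]
```

Pooling counts before dividing weights each hospital by the number of isolates it tested, so a small hospital with a handful of isolates does not dominate the rate shown for its area.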

Challenges of Disease Surveillance

The use of informal data sources for digital surveillance, though game-changing, presents several challenges. The sheer volume of Web-based information is daunting, and can make it difficult to pluck a signal from noise. These data are unstructured, requiring computational methods such as machine learning and natural language processing to make sense of them. Additionally, there is the potential for false reports, which can include misinformation, disinformation, or reporting bias. All of these challenges are a good reminder that HealthMap data complement, rather than replace, traditional public health data. Although digital surveillance can make the public a stakeholder in outbreaks, it can also complicate risk communication. Nonetheless, the value of news and social media for digital disease detection is indisputable.

Important questions remain about how to systematically ensure patient confidentiality in an environment where social media users post their information publicly. Our group takes care to present data in aggregate or otherwise de-identify public data as well as offer opt-out options in our social media data aggregation. The industry as a whole, along with academia and government, will need to establish more guidelines and policies as the field moves forward.

Based on our 10-year experience with HealthMap, we anticipate that digital surveillance will produce even more comprehensive views and interactive analyses of distilled data and insights to further advance population health. Maps with a multitude of juxtaposed data layers—including weather, geography, land use, and endemic diseases—will give the military insight into how risky it is to drop a paratrooper on the ground, for example, or tell an emergency-response worker whether it is wise to enter a questionable zone. As a changing world and overwhelming data make for a more complicated picture of public health, it is our hope that a data-driven approach will create a clearer picture of health threats, and help everyone better manage their own risk and exposure to disease.

Bibliography

  • Anema, A., et al. 2014. Digital surveillance for enhanced detection and response to outbreaks. Lancet Infectious Diseases 14:1035–1037.
  • Bahk, C. Y., D. A. Scales, S. R. Mekaru, J. S. Brownstein, and C. C. Freifeld. 2015. Comparing timeliness, content, and disease severity of formal and informal source outbreak reporting. BMC Infectious Diseases 15:135.
  • Bahk, C. Y., M. Cumming, L. Paushter, L. C. Madoff, A. Thomson, and J. S. Brownstein. 2016. Publicly available online tool facilitates real-time monitoring of vaccine conversations and sentiments. Health Affairs 35:341–347.
  • Bhatt, S., et al. 2013. The global distribution and burden of dengue. Nature 496:504–507.
  • Bogoch, I. I., et al. 2016. Anticipating the international spread of Zika virus from Brazil. Lancet 387:335–336.
  • Brownstein, J. S., C. C. Freifeld, and L. C. Madoff. 2009. Digital disease detection: Harnessing the Web for public health surveillance. New England Journal of Medicine 360:2153–2157.
  • Chan, E. H., et al. 2010. Global capacity for emerging infectious disease detection. Proceedings of the National Academy of Sciences of the U.S.A. 107:21701–21706.
  • Chunara, R., et al. 2012. Online reporting for malaria surveillance using micro-monetary incentives in urban India, 2010–2011. Malaria Journal 11:43.
  • Freifeld, C. C., et al. 2010. Participatory epidemiology: Use of mobile phones for community-based health reporting. PLoS Medicine 7:e1000376.
  • Gluskin, R. T., M. A. Johansson, M. Santillana, and J. S. Brownstein. 2014. Evaluation of Internet-based dengue query data: Google Dengue Trends. PLoS Neglected Tropical Diseases 8:e2713.
  • Majumder, M. S., S. Kluberg, M. Santillana, S. Mekaru, and J. S. Brownstein. 2015. 2014 Ebola outbreak: Media events track changes in observed reproductive number. PLoS Current Outbreaks 28:7.
  • Nsoesie, E. O., S. A. Kluberg, and J. S. Brownstein. 2014. Online reports of foodborne illness capture foods implicated in official foodborne outbreak reports. Preventive Medicine 67:264–269.
  • Ocampo, A. J., R. Chunara, and J. S. Brownstein. 2013. Using search queries for malaria surveillance, Thailand. Malaria Journal 12:390.
  • Salathé, M., C. C. Freifeld, S. R. Mekaru, A. F. Tomasulo, and J. S. Brownstein. 2013. Influenza A (H7N9) and the importance of digital epidemiology. New England Journal of Medicine 369:401–404.
