Digital Disease Detection: Can Wikipedia Monitor Diseases Globally?

For better or worse, the Internet has become the world’s number one source for health-related information. Think about it: how many times have you Googled symptoms when you were sick? Or how often have you used the Internet to search for answers to questions you think of after visiting a doctor? Because we can pick up and quantify these health information-searching behaviors, it is possible to estimate the levels of disease. Researchers have demonstrated this using search trends from WebMD, Google, Yahoo and most recently Wikipedia.

In a recent Digital Disease Detection post, Dave McIver describes a study suggesting that we could use Wikipedia to track influenza-like illness (ILI) in the United States by looking at influenza-related page views. But why limit ourselves to tracking flu in the United States or other developed nations? What other diseases can we track using Wikipedia? Where and how can we track them? These novel digital methods of disease detection and monitoring potentially offer the greatest improvements in regions of the world lacking the kind of high-quality ground truth data found in the United States.

Wikipedia, unlike Twitter and Google, makes its complete data freely available for anyone to download. Every hour, access logs containing the number of views each Wikipedia page receives in each language are released. However, unlike Twitter and Google, Wikipedia data does not contain any explicit geo-location information.
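As a rough illustration of working with these hourly logs, here is a minimal Python sketch that sums views for a handful of disease-related pages. It assumes each log line has four space-separated fields (project or language code, page title, view count, bytes transferred); that layout, the function names, and the sample lines are assumptions made for illustration and are not taken from our study.

```python
from collections import defaultdict


def parse_pagecount_line(line):
    """Split one log line into (project, title, views); return None if malformed.

    Assumed layout: "<project> <page_title> <view_count> <bytes>".
    """
    parts = line.split()
    if len(parts) != 4:
        return None
    project, title, views, _bytes = parts
    try:
        return project, title, int(views)
    except ValueError:
        return None


def count_views(lines, wanted):
    """Sum views for each (project, title) pair listed in `wanted`."""
    totals = defaultdict(int)
    for line in lines:
        record = parse_pagecount_line(line)
        if record is None:
            continue  # skip lines that do not match the assumed layout
        project, title, views = record
        if (project, title) in wanted:
            totals[(project, title)] += views
    return dict(totals)


# Made-up sample lines, for illustration only.
sample = [
    "en Influenza 120 4500000",
    "pl Grypa 35 900000",
    "en Main_Page 9000 123456789",
    "en Influenza 80 3000000",
    "garbage line",
]
wanted = {("en", "Influenza"), ("pl", "Grypa")}
hourly_totals = count_views(sample, wanted)
```

Repeating this over every hourly file and summing by day or week would give one time series per page and language.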

Wikipedia data are released and aggregated at the language level. If the speakers of a language are mostly clustered in a single location, then one can assume that most page views in that language come from that location. For example, most Thai speakers are in Thailand; therefore, we can assume that data from the Thai Wikipedia are coming mostly from Thailand. The same is true for Polish speakers (concentrated in Poland), and many other language-location pairings.

Another way to geo-locate Wikipedia data applies when the disease or outbreak of interest is confined to a single location among the speakers of a particular language. For example, although Portuguese is spoken in many places, the only Portuguese-speaking nation where dengue fever is overwhelmingly prevalent is Brazil. Thus, looking at the Portuguese Wikipedia for dengue can give us a good sense of the dengue activity in Brazil.
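The two geo-location heuristics described above can be sketched as a simple lookup. The table names and the function below are hypothetical; only the pairings themselves (Thai with Thailand, Polish with Poland, Portuguese dengue with Brazil) come from the examples in the text.

```python
# Hypothetical lookup tables built from the examples in the text.
# Keys use Wikipedia language codes ("th", "pl", "pt").
LANGUAGE_TO_COUNTRY = {
    "th": "Thailand",
    "pl": "Poland",
}

# Disease-specific pairings for languages spoken in many places.
DISEASE_OVERRIDES = {
    ("pt", "dengue"): "Brazil",
}


def infer_location(language, disease):
    """Return the assumed country for a language/disease pair, or None.

    A disease-specific pairing takes priority over the language-only one.
    """
    key = (language, disease)
    if key in DISEASE_OVERRIDES:
        return DISEASE_OVERRIDES[key]
    return LANGUAGE_TO_COUNTRY.get(language)
```

Returning `None` for an unknown pairing makes it explicit that no location can be assumed, rather than guessing.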

To see if we could track disease levels around the world using the aforementioned methods, we downloaded the entire history of page views for seven different language Wikipedias: Portuguese (Brazil), Chinese (China), Japanese (Japan), Polish (Poland), Norwegian (Norway), Thai (Thailand) and English (United States). We then built models from time series of disease-related Wikipedia pages. We trained linear models for influenza, dengue, tuberculosis, HIV/AIDS, bubonic plague, cholera, and Ebola in nine countries using the official data from each nation’s respective governmental health organization (e.g., CDC, Thai Ministry of Health).
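To give a feel for what training a linear model on page views involves, here is a deliberately simplified sketch: ordinary least squares with a single predictor (views of one page) against official case counts. This is not our actual modeling pipeline; the function names and the numbers are made up for illustration.

```python
def fit_line(x, y):
    """Fit y ≈ slope * x + intercept by ordinary least squares."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted line to a series of page-view counts."""
    return [slope * xi + intercept for xi in x]


# Made-up weekly series, for illustration only.
weekly_views = [100, 150, 200, 250, 300]
official_cases = [12, 17, 22, 27, 32]

slope, intercept = fit_line(weekly_views, official_cases)
estimated_cases = predict(weekly_views, slope, intercept)
```

A model built from several pages at once would use multiple regression instead, but the idea is the same: find the weights that best map page views onto the official counts.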

So, how well did our models perform at monitoring diseases around the world? All our flu and dengue models did well; our flu models were able to track flu in Poland, Thailand, Japan, and the United States with high accuracy (the model fit, r², ranged from 0.80 to 0.92, where 1 is best). We found our success in the United States surprising, given that English is spoken all over the world. Similarly, our dengue models for Brazil and Thailand performed well (r² of 0.86 and 0.74, respectively). Of the three tuberculosis models we built, the Chinese and Thai models showed promise (r² of 0.78 and 0.69, respectively) whereas the Norwegian one did not (r² of 0.48).
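For readers unfamiliar with r², the sketch below shows one common way to compute it: one minus the ratio of the model's squared error to the total variation in the official data. The function name and sample numbers are assumptions for illustration, not values from our study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values must not be constant")
    return 1 - ss_res / ss_tot


# Made-up official counts and model estimates, for illustration only.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
fit = r_squared(observed, predicted)
```

A value near 1 means the model's estimates follow the official data closely; a value near 0 means the model does little better than always guessing the average.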

Perhaps our model failures are more interesting than our model successes. We had less success with our HIV/AIDS, plague, cholera, and Ebola models. For the models of plague in the United States, Ebola in Uganda/Democratic Republic of Congo, and cholera in Haiti, we suspect that the overall volume of views of the disease-related pages drowns out the signal from actual cases of the disease. Plague and Ebola are widely known but extremely rare (e.g., plague has had fewer than a handful of cases in the United States in the previous few years). Furthermore, especially in the cases of cholera in Haiti and Ebola in Africa, these outbreaks occur in regions with low Internet penetration, further limiting the chances that direct observations are being logged by Wikipedia. Our HIV/AIDS models in China and Japan faced a different problem: the disease incubation period is so long (years, even decades) that the variation we observed in the official data likely does not accurately reflect the true incidence of the disease.

What happens when we do not have “gold standard” data with which to build and validate a model? This is a very real concern in many developing nations where the health infrastructure does not have sufficient resources to accurately monitor diseases. It is precisely in these cases that digital disease detection methods can potentially be transformative, since they would allow us to cheaply and sustainably track diseases in regions that have minimal public health surveillance.

To assess this, we wanted to see if any of our models are transferable to different locations without having to train them using “gold standard” data. Presumably, people in different locations behave similarly when they get sick and will visit the same types of Wikipedia pages. If that is indeed the case, then it may be possible to train a model in one country where we have good ground truth data and use that model in another country that does not have trustworthy ground truth data. By looking at whether the page views from the same Wikipedia pages across different languages correlated with disease levels, it is possible to get a sense of whether we can transfer models from one location to another. For example, if people in the United States, Thailand, and Poland all look at the same Wikipedia page when sick and you can use this page to infer flu levels, then it is conceivable that you can do this in other countries as well without building a new model. Among our flu models, we found that transferability of models may be possible.
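One simple way to picture this check is to take the same article in several languages and compute, for each language, how strongly its page views correlate with that country's official case counts. The sketch below uses the Pearson correlation; the function name and all numbers are made up for illustration, and this is not necessarily the exact procedure used in our study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("series must not be constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly views of the same flu article in three languages,
# paired with hypothetical official case counts for each country.
views_by_language = {
    "en": [100, 150, 200, 250],
    "pl": [40, 60, 50, 80],
    "th": [30, 20, 45, 60],
}
cases_by_language = {
    "en": [10, 15, 20, 25],
    "pl": [4, 7, 5, 9],
    "th": [5, 3, 6, 9],
}

correlations = {
    lang: pearson(views_by_language[lang], cases_by_language[lang])
    for lang in views_by_language
}
```

If the same page correlates well with disease levels in every language where ground truth exists, that is some evidence that a model built on it might carry over to a country without such data.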

Wikipedia, one of the Internet’s most popular websites, is an exciting and novel data source that shows great potential for disease surveillance worldwide. Unlike other Internet data sources such as Google, Twitter, or Facebook, which may release a limited dataset free of charge, complete Wikipedia data are available freely to anyone. This is an extremely important fact if we are to further develop Wikipedia data into an operational disease monitoring system. While Wikipedia is currently limited because its data do not contain any explicit geo-location information (i.e., where the page views are coming from), this could be fixed by the Wikimedia Foundation, which could release the data aggregated not only by language, but also by geographical location. Maybe then we will be able to develop a Wiki Flu Trends!

For more information, see our pre-print article, Global disease monitoring and forecasting with Wikipedia at



Nick Generous is a post-masters research associate at Los Alamos National Laboratory. His work focuses on Internet disease detection and operation tools for biosurveillance.

Reid Priedhorsky is a postdoctoral research associate at Los Alamos National Laboratory. His work focuses on large-scale data analysis and collaborative computing, with a focus on empowering communities to make better decisions in pursuit of a sustainable and just global future.

Geoffrey Fairchild is a computer science Ph.D. student at the University of Iowa and a graduate research assistant at the Los Alamos National Laboratory. His research applies computer science, geographic information systems, and mathematics to epidemiological problems in order to simulate, analyze, and predict disease spread. 

Sara Del Valle is a scientist and project leader at Los Alamos National Laboratory. Her research focuses on developing mathematical and computational models for mitigating the spread of infectious diseases with a special interest in using social media to model and forecast human behavior.
