Can we predict diseases with data science?

Advances in forecasting are helping improve health care.


In the early part of 2020, the onset and rapid spread of the COVID-19 pandemic saw epidemiologists taking center stage, with forecasts on how rapidly the pandemic would spread. 

The term R0, the basic reproduction number, which indicates how contagious a disease is, moved from the pages of journals and research papers into drawing room conversations.

In today’s “new normal,” where awareness and fear of infectious diseases are at their highest, the question “Can we predict diseases like the weather?” deserves examination.

“All models are wrong, but some are useful.” – George E. P. Box, British statistician

Forecasting is a hazardous business to be in. Even with a wealth of macroeconomic and financial data, most financial economists failed to see the onset of the 2008 Great Recession. 

Weather forecasting involves some of the most complicated systems, with reams of satellite and sensor data being processed by supercomputers, yet we are all too familiar with the predicted thunderstorm that never shows up.

Despite the inherent risk of errors and failures, forecasting has tremendous utility, especially when it comes to diseases.

Knowing the timing, magnitude, and trajectory of a seasonal disease like the flu helps governments not only mobilize, allocate and manage resources to contain the spread but also ensure proper and timely communication to the general public.

Mechanics of disease forecasting

All forecasting efforts involve the following: deciding what to forecast, acquiring and processing data relevant to what needs to be forecasted, choosing suitable methods or models, and measuring the effectiveness of the forecast. 

These steps can be illustrated with FluSight, the influenza forecasting challenge hosted by the U.S. Centers for Disease Control and Prevention (CDC), which runs from late October/early November to mid-May of the next year.

Forecast targets: These include the onset of the “flu season,” peak week, peak intensity, season duration, the weekly percentage of outpatient visits for influenza-like illness (ILI) 1–4 weeks ahead, peak weekly hospitalization rate, weekly hospitalization rates 1–4 weeks ahead, etc.

Data used: Participants in the challenge relied on historical flu datasets, social media posts, search engine trends as well as weather data to build their models. 

Generally, the CDC uses a wide variety of standard data sources (for example, prescription drug sales, school absence records, outpatient illness reports, etc.) as well as additional data sources (for example, satellite imagery, news reports crawled from the Internet, etc.) for its efforts.

Models used: A wide variety of approaches have been used ranging from simple time series models (predict the future value based on past values) to compartmental models (divide the population into compartments such as susceptible, infectious, recovered, etc. and define rates of movement between compartments) to ensemble models (combine forecasts from multiple models into one forecast). 

Each of these models has its own pros and cons, depending on how simple or complex the method is, what assumptions it makes, how much computing power it requires, etc.
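To make the compartmental idea concrete, here is a minimal sketch of a susceptible-infectious-recovered (SIR) model stepped forward one day at a time. It is an illustration only, not the method used by any FluSight participant; the function name `simulate_sir`, the population size, and the rates `beta` (transmission) and `gamma` (recovery) are all assumed values chosen for the example.

```python
def simulate_sir(population, initial_infected, beta, gamma, days):
    """Step a simple SIR model forward in daily increments.

    beta  : assumed transmission rate per day (hypothetical value)
    gamma : assumed recovery rate per day (hypothetical value)
    Returns a list of (susceptible, infectious, recovered) tuples,
    one per day, starting with day 0.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        # People move from susceptible to infectious ...
        new_infections = beta * s * i / population
        # ... and from infectious to recovered.
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Illustrative parameters only; beta / gamma = 3 plays the role of R0 here.
history = simulate_sir(
    population=1_000_000, initial_infected=10, beta=0.3, gamma=0.1, days=200
)

# The "peak" is the day with the most people infectious at once.
peak_day = max(range(len(history)), key=lambda day: history[day][1])
print(f"Simulated peak on day {peak_day}")
```

In this sketch the ratio `beta / gamma` corresponds to the reproduction number: raising it makes the simulated outbreak peak earlier and higher, which is the kind of trajectory a forecaster would then compare against observed data.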

Effectiveness measurement: Simple error-based metrics (was the forecast correct or not?) are not very useful in disease forecasting. 

FluSight uses several evaluation techniques, one of which is allowing forecasts to be tied to a probability, similar to weather forecasts that say there’s a 20% chance of rain today.

Peak week forecasts could say that the probability of the peak occurring in week 1 is 10%, in week 2 is 70%, and in week 3 is 20%, meaning there’s a 100% chance of the peak occurring between weeks 1 and 3 but a 0% chance of it occurring before or after.
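One common way to score a probabilistic forecast like this is the logarithmic score: take the natural log of the probability the forecast assigned to what actually happened. The sketch below applies it to the peak week example above. The function name `log_score` and the dictionary layout are assumptions made for illustration, and this is not necessarily FluSight's exact scoring rule.

```python
import math


def log_score(forecast, observed):
    """Return the log of the probability assigned to the observed outcome.

    Scores are at most 0; closer to 0 is better. An outcome the forecast
    gave no probability to scores negative infinity.
    """
    probability = forecast.get(observed, 0.0)
    return math.log(probability) if probability > 0 else float("-inf")


# The peak week forecast from the example above.
peak_week_forecast = {"week 1": 0.10, "week 2": 0.70, "week 3": 0.20}

# A valid probabilistic forecast must sum to 1.
assert abs(sum(peak_week_forecast.values()) - 1.0) < 1e-9

for outcome in ("week 1", "week 2", "week 4"):
    print(outcome, log_score(peak_week_forecast, outcome))
```

If the peak lands in week 2, the forecast earns the best score of the three outcomes (about -0.36); a week 1 peak scores about -2.30, and a week 4 peak, which the forecast ruled out entirely, scores negative infinity. This is why a confident forecast is rewarded when right and penalized heavily when wrong.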

The future of disease forecasting

In the coming years, disease forecasting will only get better, for three major reasons:

* Richer and better sources of data, thanks to wearable devices, improved ambient computing (smart speakers now pick up data on what’s happening inside houses), people signing up for genetic tests and thereby providing gene data, etc.

* Advances in forecasting methodologies where deep neural networks, specifically innovations around recurrent neural networks, are increasingly being utilized and are proving their effectiveness compared to traditional statistical techniques

* Wider awareness and demand both from governments and individuals for accurate information on the spread of diseases 

As is the case in any endeavor around data these days, regulations around privacy, data sharing, and storage will continue to be challenges that need to be solved before we get accurate disease forecasts available on our smartphones, much like how we get weather updates.