Published by De Gruyter, September 16, 2020

The Data Science of COVID-19 Spread: Some Troubling Current and Future Trends

Rex W. Douglass, Thomas Leo Scherer, and Erik Gartzke


One of the main ways we try to understand the COVID-19 pandemic is through time series cross section counts of cases and deaths. Observational studies based on these kinds of data have concrete and well-known methodological issues that suggest significant caution for both consumers and producers of COVID-19 knowledge. We briefly enumerate some of these issues in the areas of measurement, inference, and interpretation.

1 Introduction

The SARS-CoV-2 global pandemic has exposed weaknesses throughout our institutions, and the sciences are no exception. Given the deluge of official statistics and 300+ new COVID-19 working papers posted each day,[1] it is imperative for both consumers and producers of COVID-19 knowledge to be clear on what we do and do not know. In this brief review, we enumerate ways that data science has highlighted these weaknesses and is helping to address them.

In terms of understanding where we are, how we got here, and what is likely to follow, here are some things we need to know. We need to know the rate of spread of COVID-19 in a population R, over time Rt, across different political and demographic communities Rct, and prior to any non-pharmaceutical interventions Rc0. We need to know how many cases of active infection exist in a community Ict and how many of those infections resulted in death Dct. We need to know the causal effect of interventions Xct on rate of spread, between the observed treated populations Rct1 and the counterfactual populations had they not been treated Rct0. To do so, we need some plausible causal identification strategy that allows us to account for the fact that interventions are themselves chosen and implemented in response to changes in Rct, and that many outside factors likely drive both Rct and Xct simultaneously. These unknowns give rise to fundamental problems of measurement, inference, and interpretation.
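The causal quantity described above can be written compactly in potential-outcomes form. This is a sketch that assumes the trailing 1 and 0 in Rct1 and Rct0 mark the treated and untreated potential outcomes; the symbol \tau and the superscript layout are assumptions of this sketch rather than notation from the text.

```latex
% Effect of intervention X_{ct} on the rate of spread in community c at time t.
% Only R_{ct}^{1} is observed for treated communities; R_{ct}^{0} is counterfactual.
\tau_{ct} = R_{ct}^{1} - R_{ct}^{0}
```

Because only one of the two terms is ever observed for a given community and time, any estimate of this difference rests on the identification strategy used to stand in for the missing term.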

2 Measurement

For the first several months of the pandemic, and still in most countries now, there has been no direct measure of Ict. Very few countries have implemented an ideal, regularly timed national survey like the U.K.’s Office for National Statistics COVID-19 infection survey (Pouwels et al. 2020). More typically, we rely on serological estimates of Cumulative Infections CIct, which measure the presence of antibodies indicative of infection at some point in the past. These are also still rare, and their false positive rates make them inappropriate for populations with low infection rates (Peeling et al. 2020).
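The point about false positives at low infection rates follows from Bayes' rule. The sketch below computes the share of positive serology results that reflect true past infections; the sensitivity, specificity, and prevalence values are hypothetical and chosen only for illustration, not taken from any cited study.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of positive results that are true positives (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical test characteristics (assumptions, not measured values).
SENSITIVITY = 0.95
SPECIFICITY = 0.95

for prevalence in (0.01, 0.05, 0.20):
    ppv = positive_predictive_value(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"prevalence={prevalence:.2f}  positive predictive value={ppv:.2f}")
```

Under these assumed values, the same test becomes far less informative as prevalence falls, because false positives from the large uninfected group swamp the true positives.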

More commonly available are Confirmed Cases CCct. COVID-19 tests are administered in a jurisdiction, and positive results are anonymized, tabulated, reported, and aggregated by increasingly nested bureaucracies. These bureaucracies are concerned primarily with releasing legally required contemporary measurements and not maintaining consistent historical time series. This has resulted in the world’s largest, most desperate scavenger hunt to scrape, transcribe, and translate counts disseminated in oral briefings, public websites, PDFs, and even static images (Alamo et al. 2020). Teams from every country are working in often uncoordinated and duplicated efforts to compile government reporting into consistent panel data; these teams include newspapers (Sun et al. 2020), nonprofits (USAFacts 2020), large private companies (Wolf, Ary, and Firooz 2020; Zhang, Donthini, and Microsoft Open Source 2020), consortiums of volunteers (COVID-19 India Org Data Operations Group 2020; Yang et al. 2020; Zhang and Donthini 2020), and Wikipedians.

The resulting ecosystem of panel datasets varies in spatial and temporal coverage, has little metadata about sources or changing definitions, and generally does not handle revisions to past counts from reporting sources. Direct comparisons between sources reveal worrying disagreements and temporal artifacts like reporting delays, seasonalities, discontinuities, and sudden revisions in counts both upwards and downwards (Wang et al. 2020). It is not obvious how to correctly account for these problems or adjudicate between conflicting sources without a clear ground truth. There is also no permanent archive of the raw source material, meaning that reconstructing the full chain of evidence may no longer be possible.
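Some of these artifacts can at least be flagged mechanically. The following is a minimal sketch, assuming each source provides a cumulative count series for the same place and dates; the function names and the two toy series are hypothetical, and the checks do not say which source is correct.

```python
def daily_increments(cumulative):
    """Day-over-day changes in a cumulative count series."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def downward_revision_days(cumulative):
    """Indices of days on which a cumulative count fell, which a true
    cumulative series cannot do without a revision or a definition change."""
    return [day + 1 for day, change in enumerate(daily_increments(cumulative)) if change < 0]


def max_relative_disagreement(source_a, source_b):
    """Largest per-day gap between two sources, relative to the larger count."""
    return max(abs(a - b) / max(a, b, 1) for a, b in zip(source_a, source_b))


# Toy cumulative confirmed-case series for one county from two hypothetical sources.
source_a = [10, 15, 22, 20, 30]
source_b = [10, 14, 22, 25, 30]

print("downward revisions in source A on days:", downward_revision_days(source_a))
print("max relative disagreement:", max_relative_disagreement(source_a, source_b))
```

Checks like these only locate problems; resolving them still requires the missing ground truth discussed above.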

Likewise, we do not have direct measures of Deaths Dct but only Confirmed Deaths CDct. CDct suffers from all of the problems of Confirmed Cases CCct, except possibly for less under-reporting, depending on whether the person died at home or in medical care. Choosing CDct as the lesser of two evils, many projects take plausible values of the Infection Fatality Rate IFR = D/I to back out an estimate of Ict (Meyerowitz-Katz and Merone 2020). Others have turned to estimating Excess Deaths EDct, the number of total deaths reported in an area above what would be expected given the number of deaths reported in previous years (Weinberger et al. 2020). EDct is also not a direct estimate of Dct: it can include deaths that were not caused by COVID-19 directly, e.g. from other health conditions that received inadequate care during this period, and it can undercount COVID-19 deaths because lockdowns reduce the mobility and economic activity that typically lead to other deaths, e.g. car accidents.
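Both workarounds reduce to simple arithmetic, which makes their sensitivity to assumptions easy to see. The sketch below is illustrative only: the death counts, the IFR values, and the use of a plain multi-year average as the expected baseline are all assumptions, and it ignores the lag between infection and death as well as seasonal and demographic adjustment.

```python
def implied_infections(confirmed_deaths, infection_fatality_rate):
    """Back out infections from deaths using IFR = D / I, so I = D / IFR."""
    return confirmed_deaths / infection_fatality_rate


def excess_deaths(observed_deaths, baseline_year_deaths):
    """Observed all-cause deaths minus the average of prior years
    (a deliberately crude expected-deaths baseline)."""
    expected = sum(baseline_year_deaths) / len(baseline_year_deaths)
    return observed_deaths - expected


# Hypothetical inputs for one area and one period.
confirmed_deaths = 100
for ifr in (0.005, 0.01):
    print(f"IFR={ifr}: implied infections = {implied_infections(confirmed_deaths, ifr):.0f}")

print("excess deaths:", excess_deaths(1200, [1000, 950, 1050]))
```

Halving the assumed IFR doubles the implied number of infections, so uncertainty in the IFR passes straight through to the estimate of Ict.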

Confirmed case and death counts mechanically depend on testing, but records of tests administered Tct are even worse. In the U.S., much of what we know about trends in testing patterns comes from journalistic efforts like the Covid-Tracking Project (Lipton et al. 2020). They encountered all of the regular problems plus additional ones specific to ambiguity about what kind of test count is being reported (testing encounters, number of people tested, number of swabs tested, etc.). The type of test performed (and its false positive and false negative rates) is almost never included as metadata. Nor are the rules about how tests are rationed and distributed recorded systematically.

The general failure to track COVID-19 spread directly has led to a proliferation of innovative attempts to use other signals such as web searches, searches of medical databases, social media posts, fevers reported by home thermometer, and traditional flu symptom surveillance networks (Kogan et al. 2020). While promising, proxy measures require ground truthing and regular calibration using something like regularly timed serological surveys on smaller geographic samples of the population. It is precisely the lack of such capabilities that is motivating the search for alternatives in the first place.

Finally, non-pharmaceutical interventions are tracked by several academic and nonprofit teams (Cheng et al. 2020; Hale et al. 2020). These interventions are intended to limit human mobility, which is measured more directly by cell phone data provided by companies like Google, Apple, and SafeGraph.

3 Inference

The workhorse theoretical model for infectious disease spread is the Susceptible, Exposed, Infectious, and Removed (SEIR) compartmental model (Brauer and Castillo-Chavez 2012). The intuition behind the SEIR model is that there are mechanical relationships, such as previous infections or deaths removing candidates from infection, the time between exposure and the next possible transmission, and the degree to which immunity may exist in the population, which induce nonlinearities in disease spread. Disease spreads slowly at first, accelerates, and then burns out if left to its own devices. SEIR should be considered the theoretical floor for analysis, and an entire menagerie of extensions accounts for demography, testing, mobility, social networks, etc.
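The slow start, acceleration, and burnout described above can be reproduced with a few lines of simulation. This is a minimal sketch of a textbook SEIR model integrated with simple Euler steps; the parameter values (transmission rate, incubation rate, removal rate, population size) are hypothetical and are not estimates for COVID-19.

```python
def simulate_seir(beta, sigma, gamma, population, initial_infectious, days, steps_per_day=10):
    """Return one (S, E, I, R) tuple per day for a closed population.

    beta: transmission rate, sigma: rate of leaving the exposed state,
    gamma: removal rate. All rates are per day.
    """
    s = float(population - initial_infectious)
    e, i, r = 0.0, float(initial_infectious), 0.0
    dt = 1.0 / steps_per_day
    history = []
    for _ in range(days):
        history.append((s, e, i, r))
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_removed = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
    return history


# Hypothetical parameters chosen only to show the shape of the curve.
history = simulate_seir(beta=0.3, sigma=0.2, gamma=0.1,
                        population=1_000_000, initial_infectious=10, days=300)
peak_day = max(range(len(history)), key=lambda day: history[day][2])
print("peak day:", peak_day)
print("infectious at peak:", round(history[peak_day][2]))
print("infectious on final day:", round(history[-1][2]))
```

No intervention appears anywhere in this code, yet the infectious count rises, peaks, and falls, which is the baseline any analysis of interventions has to be compared against.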

The necessity of directly including testing in models of disease spread cannot be overstated. Per capita cases are so temporally correlated with per capita testing rates that they are more a proxy for testing availability than for infections (Kaashoek and Santillana 2020). Spatially, per capita testing rates correlate with urbanity and a wide range of comorbidities (Souch and Cossman 2020). How many tests are given, and to whom, varies systematically in response to conditions on the ground, with periods of both rationing and blitzes.

Measuring the effect of interventions is difficult because they are assigned endogenously in response to both local conditions and national signals. Similarly, populations responded to both government orders and local conditions, often reducing their activity prior to being ordered to and also increasing their activity prior to being officially allowed to. Governments, the public, and the disease are all responding simultaneously to each other in often nonlinear and unobserved ways. Statistical instruments that cause government interventions but do not directly cause testing rates or rate of spread except through the government intervention are few and far between. Further, interventions are often implemented simultaneously or in a rolling cumulative pattern directly in response to changes in cases and testing results, making isolating the effect of any one treatment exceptionally difficult.

Even if we had an exogenous intervention, its treatment effect on the rate of spread is still unlikely to be identified since almost any intervention will affect both cases and testing. Estimating an effect on just spread requires imposing additional assumptions, e.g. sharp constraints on some parameters and informative priors on the relationship between the number of tests and the number of cases (Kubinec and Carvalho 2020).

4 Interpretation

One promising development is rigorous forecast evaluations (Reich et al. 2020). Notoriously, many early simple growth models fit to the takeoff period of infections performed well right up until the curve broke and then failed entirely. A parade of predicted peaks in cases has since continued the tradition, with groups celebrating success on uninteresting short-term autocorrelations while ignoring failures on actually interesting shifts in trends. All we can do is develop a very long memory of predictions and constantly hold models accountable for their long run out-of-sample performance on unseen future data.
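The failure mode of early growth models is easy to reproduce on synthetic data. The sketch below fits an exponential trend to the takeoff phase of a logistic curve and then scores it on later, unseen days; the logistic curve stands in for a real epidemic, and its parameters and the train/test split are assumptions made for illustration.

```python
import math


def logistic_curve(day, capacity=10000.0, rate=0.2, midpoint=40.0):
    """Synthetic cumulative case curve that grows exponentially, then saturates."""
    return capacity / (1.0 + math.exp(-rate * (day - midpoint)))


def fit_log_linear(days, counts):
    """Ordinary least squares of log(count) on day, i.e. an exponential growth fit."""
    n = len(days)
    logs = [math.log(count) for count in counts]
    mean_day = sum(days) / n
    mean_log = sum(logs) / n
    slope = (sum((d - mean_day) * (l - mean_log) for d, l in zip(days, logs))
             / sum((d - mean_day) ** 2 for d in days))
    return mean_log - slope * mean_day, slope


def mean_abs_pct_error(actual, predicted):
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


train_days = list(range(0, 25))   # takeoff period used for fitting
test_days = list(range(25, 60))   # unseen future used for evaluation

intercept, slope = fit_log_linear(train_days, [logistic_curve(d) for d in train_days])


def predict(day):
    return math.exp(intercept + slope * day)


in_sample_error = mean_abs_pct_error([logistic_curve(d) for d in train_days],
                                     [predict(d) for d in train_days])
out_of_sample_error = mean_abs_pct_error([logistic_curve(d) for d in test_days],
                                         [predict(d) for d in test_days])
print(f"in-sample error: {in_sample_error:.3f}")
print(f"out-of-sample error: {out_of_sample_error:.3f}")
```

The in-sample fit looks excellent because the early curve really is close to exponential; only the held-out period reveals that the model has no way to anticipate the bend.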

Other trends in initial COVID-19 work and reporting are less promising. Especially concerning is observational work that presents correlations as evidence of causation. Without identification, correlations on short highly autocorrelated time series are as likely to be misleading as informative. The SEIR model expects a nonlinear and highly autocorrelated pattern of an increasing infection rate that then levels off independent of any interventions. An unscrupulous, or naive, analyst can easily find interventions that increased (or decreased) spread solely by where those interventions land in the natural disease cycle, completely independent of the intervention’s actual effect.
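The timing artifact can be demonstrated with a placebo exercise. This sketch simulates an epidemic with no intervention at all, using a simple daily-step SIR model as a stand-in for the SEIR dynamics, and then applies a naive before-and-after comparison around two arbitrary dates; the parameters, dates, and 14-day window are hypothetical.

```python
def simulate_daily_new_infections(beta=0.3, gamma=0.1, population=1_000_000,
                                  initial_infectious=10, days=200):
    """Daily new infections from an SIR model with NO intervention of any kind."""
    susceptible = float(population - initial_infectious)
    infectious = float(initial_infectious)
    new_infections_by_day = []
    for _ in range(days):
        infections = beta * susceptible * infectious / population
        removals = gamma * infectious
        susceptible -= infections
        infectious += infections - removals
        new_infections_by_day.append(infections)
    return new_infections_by_day


def naive_before_after_change(series, event_day, window=14):
    """Mean of the `window` days after an event minus the mean of the days before."""
    before = series[event_day - window:event_day]
    after = series[event_day:event_day + window]
    return sum(after) / len(after) - sum(before) / len(before)


new_cases = simulate_daily_new_infections()

# Two placebo "interventions" that by construction did nothing.
early_effect = naive_before_after_change(new_cases, event_day=30)
late_effect = naive_before_after_change(new_cases, event_day=100)
print(f"placebo on day 30:  change in mean daily cases = {early_effect:+.0f}")
print(f"placebo on day 100: change in mean daily cases = {late_effect:+.0f}")
```

The early placebo appears to have increased spread and the late one appears to have reduced it, although neither changed anything in the simulation.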

Another concern is the pursuit of statistically distinguishable correlations over actually attempting to explain variation in COVID-19 outcomes themselves. Papers that can show a particular political party or demographic group is ‘worse’ on some COVID-19 dimension receive much attention. Such results lack strong explanatory power or clear policy recommendations, and so while great for making headlines, they do little to help us end the current pandemic.

Perhaps our greatest concern is the tendency to set up straw man null hypotheses and then present the inability to reject them as positive evidence for medical and safety decisions, e.g. arguing that social distancing might not be required because a model was unable to statistically distinguish a large uptick in cases following a specific mass meeting. In the best of circumstances, absence of evidence is not evidence of absence. Our underfit, undertheorized, and underperforming observational models are not the best of circumstances, and they are not sufficiently sensitive to evaluate more than macro-level general trends.
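One way to make the sensitivity problem concrete is a power calculation. The sketch below uses a normal approximation for a two-sided test at the 5% level and asks how often a real effect would be detected, given the ratio of the true effect to the estimate's standard error; the ratios shown are hypothetical and the approximation is an assumption of this sketch, not a method from the studies discussed.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_sided_power(true_effect, standard_error, z_critical=1.959964):
    """Probability of rejecting a zero-effect null at the 5% level
    when the estimator is normal around the true effect."""
    z = true_effect / standard_error
    return normal_cdf(z - z_critical) + normal_cdf(-z - z_critical)


# Hypothetical ratios of true effect to standard error.
for ratio in (0.5, 1.0, 2.0, 4.0):
    print(f"effect / standard error = {ratio}: power = {two_sided_power(ratio, 1.0):.2f}")
```

When the true effect is about the size of the standard error, the test misses it most of the time, so a failure to reject says little about whether the effect exists.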

5 Conclusion

This necessarily brief review omitted positive developments in studying COVID-19 outside of macro-observational settings. There has been remarkable progress in areas of diagnosis, clinical treatment, and phylogenetic tracking. Data science has contributed to the rapid collaboration, development, and dissemination of research in a way not seen in prior disease outbreaks. We also neglected topics like tracing, and the accompanying contributions from the tech field such as monitoring through mobile apps and social media. Further, our review is overly U.S.-centric, with other countries like South Korea monitoring the disease so effectively they succeeded at containment without having to resort to difficult mitigation.

Any policy prescription toward COVID-19 should be viewed with a healthy respect for how little we actually know about the history of this pandemic. Practitioners working on these questions and with these data will be deeply familiar with many of these concerns, but some may be especially subtle or less prominently discussed within one’s main field of study. At a minimum, there is research being produced today which ignores many of these known methodological problems and subsequently generates confusion for novice consumers of analysis. We hope that this partial enumeration of challenges in COVID-19 measurement, inference, and interpretation is compelling justification for intellectual caution among consumers and producers of COVID-19 knowledge alike.

Funding source: Office of Naval Research

Award Identifier / Grant number: N00014-19-1-2491

Funding source: Charles Koch Foundation


Our thanks to the Center for Peace and Security Studies and its members, and to the Office of Naval Research [N00014-19-1-2491] and the Charles Koch Foundation for financial support. Thank you to all who provided feedback on the early draft, including two anonymous reviewers.

    Author contributions: Conceptualization, R.W.D., T.L.S., and E.G.; Investigation, R.W.D.; Writing–Original Draft, R.W.D.; Writing–Review & Editing, R.W.D. and T.L.S.; and Funding–E.G.


Alamo, T., D. G. Reina, M. Mammarella, and A. Abella. 2020. “Covid-19: Open-Data Resources for Monitoring, Modeling, and Forecasting the Epidemic.” Electronics 9 (5): 827. Multidisciplinary Digital Publishing Institute.

Brauer, F., and C. Castillo-Chavez. 2012. Mathematical Models in Population Biology and Epidemiology, Vol. 2, New York, NY: Springer.

Cheng, C., J. Barceló, A. S. Hartnett, R. Kubinec, and L. Messerschmidt. 2020. “COVID-19 Government Response Event Dataset (CoronaNet V.1.0).” Nature Human Behaviour 4 (7): 756–68. Nature Publishing Group.

COVID-19 India Org Data Operations Group. 2020. “Dataset for Tracking COVID-19 Spread in India.” COVID-19 India Org Data Operations Group. (Accessed August 20, 2020).

Hale, T., A. Petherick, T. Phillips, and S. Webster. 2020. Variation in Government Responses to COVID-19. Blavatnik School of Government. Working Paper 31.

Kaashoek, J., and M. Santillana. 2020. COVID-19 Positive Cases, Evidence on the Time Evolution of the Epidemic or an Indicator of Local Testing Capabilities? A Case Study in the United States. SSRN Scholarly Paper ID 3574849. Rochester, NY: Social Science Research Network.

Kogan, N. E., L. Clemente, P. Liautaud, J. Kaashoek, N. B. Link, A. T. Nguyen, F. S. Lu, P. Huybers, B. Resch, C. Havas, A. Petutschnig, J. Davis, M. Chinazzi, B. Mustafa, W. P. Hanage, A. Vespignani, and M. Santillana. 2020. An Early Warning Approach to Monitor COVID-19 Activity with Multiple Digital Traces in Near Real-Time. arXiv:2007.00756 [Q-Bio, Stat], July.

Kubinec, R., and L. Carvalho. 2020. A Retrospective Bayesian Model for Measuring Covariate Effects on Observed COVID-19 Test and Case Counts. April. SocArXiv.

Lipton, Z., J. Ellington, Smike, J. Ouyang, K. Riley, J. Ellinger, J. Hammerbacher, O. Lacan, J. Crane, and space-buzzer. 2020. The Covid-Tracking Project. Zenodo.

Meyerowitz-Katz, G., and L. Merone. 2020. A Systematic Review and Meta-Analysis of Published Research Data on COVID-19 Infection-Fatality Rates. medRxiv, May. Cold Spring Harbor Laboratory Press.

Peeling, R. W., C. J. Wedderburn, P. J. Garcia, D. Boeras, N. Fongwen, J. Nkengasong, A. Sall, A. Tanuri, and D. L. Heymann. 2020. “Serology Testing in the COVID-19 Pandemic Response.” The Lancet Infectious Diseases 20 (9): e245–9. Elsevier.

Pouwels, K. B., T. House, J. V. Robotham, P. Birrell, A. B. Gelman, N. Bowers, I. Boreham, H. Thomas, J. Lewis, I. Bell, J. I. Bell, J. Newton, J. Farrar, I. Diamond, P. Benton, and S. Walker. 2020. Community Prevalence of SARS-CoV-2 in England: Results from the ONS Coronavirus Infection Survey Pilot. medRxiv, July. Cold Spring Harbor Laboratory Press.

Reich, N. G., J. Niemi, K. House, A. Hannan, E. Cramer, S. Horstman, S. Xie, Y. Gu, N. Wattanachit, J. Bracher, S. Y. Wang, C. Gibson, S. Woody, M. L. Li, R. Walraven, har96, X. Zhang, jinghuichen, G. Espana, X. Xinyue, H. Biegel, L. Castro, Y. Wang, qjhong, E. Lee, A. Baxter, S. Bhatia, E. Ray, abrennen, and ERDC CV19 Modeling Team. 2020. Reichlab/Covid19-Forecast-Hub: Pre-publication Snapshot. Zenodo.

Souch, J. M., and J. S. Cossman. 2020. “A Commentary on Rural-Urban Disparities in COVID-19 Testing Rates Per 100,000 and Risk Factors.” The Journal of Rural Health, (00): 1–3.

Sun, A., T. Fehr, A. Tse, Rachel, and W. Andrews. 2020. New York Times Coronavirus (Covid-19) Data in the United States. Zenodo.

USAFacts. 2020. US Coronavirus Cases and Deaths. Zenodo.

Wang, G., Z. Gu, X. Li, Y. Shan, M. Kim, Y. Wang, L. Gao, and L. Wang. 2020. Comparing and Integrating US COVID-19 Daily Data from Multiple Sources: A County-Level Dataset with Local Characteristics. arXiv:2006.01333 [Stat], June.

Weinberger, D. M., J. Chen, T. Cohen, F. W. Crawford, F. Mostashari, D. Olson, V. E. Pitzer, N. G. Reich, M. Russi, L. Simonsen, A. Watkins, and C. Viboud. 2020. “Estimation of Excess Deaths Associated with the COVID-19 Pandemic in the United States, March to May 2020.” JAMA Internal Medicine.

Wolf, A., A. Ary, and H. Firooz. 2020. Yahoo Knowledge Graph COVID-19 Datasets. Zenodo.

Yang, T., K. Shen, S. He, E. Li, P. Sun, P. Chen, L. Zuo, J. Hu, Y. Mo, W. Zhang, H. Zhang, J. Chen, and Y. Guo. 2020. CovidNet: To Bring Data Transparency in the Era of COVID-19. arXiv:2005.10948 [Cs, Q-Bio], July.

Zhang, C., C. Donthini, and Microsoft Open Source. 2020. Bing-COVID-19-Data. Zenodo.

Zohrab, J., R. Block, C. Chamberlain, L. Davis, M. Nguyễn, A. Gifillan, A. Hughes, B. Wolfgang, and andys1376. 2020. COVID Atlas Li. Zenodo.

Received: 2020-08-20
Accepted: 2020-08-28
Published Online: 2020-09-16

© 2020 Walter de Gruyter GmbH, Berlin/Boston