

Official Journal of the Society to Improve Diagnosis in Medicine (SIDM)

Editor-in-Chief: Graber, Mark L. / Plebani, Mario

Ed. by Argy, Nicolas / Epner, Paul L. / Lippi, Giuseppe / Singhal, Geeta / McDonald, Kathryn / Singh, Hardeep / Newman-Toker, David

Editorial Board: Basso, Daniela / Crock, Carmel / Croskerry, Pat / Dhaliwal, Gurpreet / Ely, John / Giannitsis, Evangelos / Katus, Hugo A. / Laposata, Michael / Lyratzopoulos, Yoryos / Maude, Jason / Sittig, Dean F. / Sonntag, Oswald / Zwaan, Laura


Beyond Dr. Google: the evidence on consumer-facing digital tools for diagnosis

Michael L. Millenson (ORCID iD: http://orcid.org/0000-0001-8364-1927) / Jessica L. Baldwin
  • Center for Innovations in Quality, Effectiveness and Safety, Michael E. DeBakey VA Medical Center, Houston, TX, USA
  • Department of Medicine, Baylor College of Medicine, Houston, TX, USA
/ Lorri Zipperer / Hardeep Singh
  • Center for Innovations in Quality, Effectiveness and Safety, Michael E. DeBakey VA Medical Center, Houston, TX, USA
  • Department of Medicine, Baylor College of Medicine, Houston, TX, USA
Published Online: 2018-07-23 | DOI: https://doi.org/10.1515/dx-2018-0009


Over a third of adults go online to diagnose their health condition. Direct-to-consumer (DTC), interactive, diagnostic apps with information personalization capabilities beyond those of static search engines are rapidly proliferating. While these apps promise faster, more convenient and more accurate information to improve diagnosis, little is known about the state of the evidence on their performance or the methods used to evaluate them. We conducted a scoping review of the peer-reviewed and gray literature for the period January 1, 2014–June 30, 2017. We found that the largest category of evaluations involved symptom checkers that applied algorithms to user-answered questions, followed by sensor-driven apps that applied algorithms to smartphone photos, with a handful of evaluations examining crowdsourcing. The most common clinical areas evaluated were dermatology and general diagnostic and triage advice for a range of conditions. Evaluations were highly variable in methodology and conclusions, with about half describing app characteristics and half examining actual performance. Apps were found to vary widely in functionality, accuracy, safety and effectiveness, although the usefulness of this evidence was limited by a frequent failure to provide results by named individual app. Overall, the current evidence base on DTC, interactive diagnostic apps is sparse in scope, uneven in the information provided and inconclusive with respect to safety and effectiveness, with no studies of clinical risks and benefits involving real-world consumer use. Given that DTC diagnostic apps are rapidly evolving, rigorous and standardized evaluations are essential to inform decisions by clinicians, patients, policymakers and other stakeholders.

Keywords: consumerism; crowdsourcing; diagnostic error; digital health; evidence-based medicine; health apps; health information technology; mHealth; patient engagement


The 2015 National Academy of Medicine (NAM) report Improving Diagnosis in Health Care concluded that most people will experience at least one diagnostic error in their lifetime [1]. The report, noting that over a third of adults go online to diagnose a health condition [2], urged professionals to direct patients to reliable online resources. How to determine the reliability of online resources, however, remains an unresolved question.

Currently available online resources have graduated beyond keyword searches on Google. Increasingly, they include sophisticated direct-to-consumer (DTC) diagnostic tools that use algorithms, sensors and “crowdsourcing” [3] to create Web 2.0 personalization and interactivity [4] for functions ranging from triage and differential diagnosis of common ailments to detecting skin changes suggestive of cancer.

With over a quarter million health apps available in major app stores [5], popular DTC diagnostic apps have been downloaded from tens of thousands to tens of millions of times [6]. Possible benefits include faster, more convenient and more targeted information to improve diagnosis [7] and reduction of unneeded visits and tests, but there is also the potential for unintended outcomes [8] such as inappropriate treatment and diagnostic error. The Food and Drug Administration (FDA) has long exempted “low risk” apps from its approval process [9], and the current FDA commissioner has said that apps helping consumers self-diagnose are an innovation that regulations should not impede [10]. Nonetheless, there are as yet no accepted vetting processes enabling clinicians or patients to distinguish between reliable apps and “digital snake oil” [11]. Diagnostic apps specifically have received scant attention in comparison to health management ones, even in overviews of the field [12], [13].

We conducted a scoping review to characterize the current state of evidence on how interactive, DTC diagnostic apps available to consumers perform and what methods are used to evaluate them.


Funding for our work was provided by the Gordon and Betty Moore Foundation; however, the foundation had no role in study design; collection, analysis and interpretation of data; or approval of final publication. Our scoping review used Arksey and O’Malley’s five-stage methodological framework [14] summarized in Table 1.

Table 1: Steps involved in scoping review.

Formulating research questions

An initial search in PubMed, Google Scholar, and the lay literature through General Reference Center Gold revealed a highly heterogeneous literature in which information of interest was often subsumed in broader examinations of diagnostic and/or health management apps. That search generated four research questions: what clinical conditions do these apps address? What functionality is involved in producing a tentative diagnosis? What methodologies are evaluators using to assess these apps? And what are the results of app evaluations, including evidence on risks and benefits? Our findings were intended to help guide medical practice, consumer choice and health policy by identifying the strengths and weaknesses of the evidence in the current literature and by highlighting evidence gaps.

Identification of relevant studies

With a medical librarian (LZ), we conducted a structured search of PubMed and Google Scholar for the period January 1, 2014–June 30, 2017, focusing on apps suggesting an initial diagnosis and marketed DTC without FDA approval. The timeframe was chosen in an attempt to minimize the inclusion of possibly technologically irrelevant evaluations of older apps. A lack of common keywords and inconsistent indexing made a structured and reproducible PubMed search difficult, leading to an iterative search process. Moreover, as no existing U.S. National Library of Medicine MeSH terms were closely related to our topic, we used broader, related terms such as “smartphone” and “diagnostic self-evaluation”. In addition, we manually reviewed selected bibliographies, even if slightly outside the time frame. We also searched the lay literature through General Reference Center Gold and by looking more broadly at trade and general-interest publications, websites and reports from organizations active in this field [15]. We also interviewed physicians, researchers, digital health entrepreneurs and a venture capitalist.

Study selection

We included original research, descriptive studies and literature reviews related to diagnostic software applications consumers might commonly use, whether web-based or apps developed for a specific platform (e.g. iPhone) [16]. We excluded apps subject to FDA approval, those in a research phase, those using physical tests (e.g. Bluetooth-connected pregnancy tests) and static content (e.g. keyword searches).

Two authors (MLM and JLB) assessed full-text articles for relevance, given that an abstract might not accurately reflect whether an evaluation of a particular diagnostic app was performed. When there was a question about article inclusion, it was discussed with a third author (HS).

Data charting

Two authors (MLM and JLB) reviewed articles and organized information pertaining to type of digital platform(s), study design, app attributes, outcomes investigated and major findings [17].

Data summarization

Data were summarized according to app functionality; diseases evaluated; evaluation methodologies (including selection criteria, descriptions of app attributes and testing of diagnostic functionality); and study results.


Overview of selected studies

We identified 30 peer-reviewed articles and research letters (Tables 2 and 3) and six non-peer-reviewed articles [47], [48], [49], [50], [51], [52] meeting our definition. Although we focused on diagnostic apps, these were often described within broader studies evaluating medical apps.

Table 2: Peer-reviewed descriptive studies of direct-to-consumer (DTC) diagnostic apps.

Table 3: Peer-reviewed assessments of diagnostic performance of direct-to-consumer (DTC) diagnostic apps.

Conditions evaluated

The greatest number of articles (10) focused on dermatology-related diagnostic apps, primarily conditions associated with malignancy [20], [25], [28], [34], [36], [39], [40], [41], [45], [46]. Next were eight articles on apps providing diagnostic and triage advice for a broad range of conditions [6], [19], [22], [24], [26], [27], [43], [44]. Other diagnostic areas included infectious disease (one article on acute infectious conditions; one article on sexually transmitted infections [STIs]) [23], [38]; mental health issues (one article on depression) [32]; neurology (one article on Alzheimer’s disease) [30]; general oncology (two) [18], [21]; orthopedics (one on knee pain [33], one on hand surgery [37]); eye and vision issues (one) [31]; otolaryngology (one general) [35]; rheumatology (one on inflammatory arthritis) [42]; and urology (one general) [29].

App functionality

The evaluations covered three broad functional categories of apps, with some articles including apps falling into more than one category. The largest category (20 articles) involved medical symptom checkers that apply algorithms to user-answered questions to generate probable diagnoses and/or triage advice. The second most common category (12 articles) included apps that apply image processing technology and algorithms to smartphone photos. The articles we found in this category focused exclusively on conditions of the skin and eyes. Finally, five articles involved crowdsourcing using an online, distributed problem-solving model. (A prominent app in this category, CrowdMed, applies an algorithm to diagnostic suggestions submitted online by clinical and non-clinical “medical detectives” and then provides a second opinion.)

Evaluation methodologies

Most studies evaluated multiple apps. However, some focused on a specific app due to app developer funding [24], app prominence (e.g. WebMD’s symptom checker) or a desire to show the need for greater regulation [36]. Selection criteria for which apps were included in evaluations appeared somewhat arbitrary. Some studies simply described the presence or absence of particular attributes, such as whether there was a disclosed privacy policy. App cost was not consistently addressed, nor did researchers consistently note that “free” apps may sell user data.

Assessment methodologies ranged from a structured rating grid completed by two expert panels to “think-aloud” feedback from consumers during use. User characteristics that were examined included age, gender, education, income, years of home ownership, health literacy, and computer literacy. As noted in Table 3, some studies engaged multiple experts to review app content and features, while others assessed performance directly by comparing an app’s suggested diagnosis to a reference diagnosis from a clinician or other source, such as structured clinical vignettes. Although these apps are classified as low-risk devices by the FDA, it is important to note that we found no studies of accuracy or clinical risks and benefits based upon real-world use by consumers.

Quantitative studies of these apps’ accuracy most often expressed their results in terms of percentage of true positives (or percent of responses correctly assigned to a specific category), sensitivity, and/or specificity for app-generated diagnoses when compared to diagnoses from a clinician or other reference source. Less commonly reported quantitative measures included positive predictive value, negative predictive value, and nonparametric statistics (e.g. kappa, chi-square, odds ratio) (Table 3).
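As an illustration of how the measures named above relate to one another, the following minimal sketch computes them from a 2×2 comparison of app-generated diagnoses against a reference diagnosis. The class name, function names and example counts are hypothetical and are not drawn from any study in this review; the formulas are the standard definitions of each measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing an app's output with a reference diagnosis."""
    tp: int  # app positive, reference positive
    fp: int  # app positive, reference negative
    fn: int  # app negative, reference positive
    tn: int  # app negative, reference negative


def _ratio(numerator, denominator):
    # Return None instead of raising when a measure is undefined.
    return numerator / denominator if denominator else None


def accuracy_measures(c):
    """Standard accuracy measures for a binary diagnostic test."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
        "accuracy": _ratio(c.tp + c.tn, total),
    }


def cohens_kappa(c):
    """Agreement between app and reference, corrected for chance."""
    total = c.tp + c.fp + c.fn + c.tn
    if total == 0:
        return None
    observed = (c.tp + c.tn) / total
    expected = (
        (c.tp + c.fp) * (c.tp + c.fn) + (c.tn + c.fn) * (c.tn + c.fp)
    ) / total**2
    if expected == 1:
        return None
    return (observed - expected) / (1 - expected)


# Hypothetical counts chosen only to make the arithmetic easy to follow.
example = ConfusionCounts(tp=40, fp=10, fn=10, tn=40)
measures = accuracy_measures(example)
kappa = cohens_kappa(example)
```

Because overall accuracy is a prevalence-weighted average of sensitivity and specificity, reporting all three together, along with the underlying counts, lets readers check a study's figures for internal consistency.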

Evaluation results

Potential privacy and security problems were highlighted by several studies; e.g. symptom checkers for STIs were rated as “poor to very poor” on informed consent, disclosure of privacy and confidentiality policies and possible conflicts of interest [23]. A similar conclusion was reached in a study of apps for detecting Alzheimer’s disease [30].

Meanwhile, the cost of apps was difficult to ascertain. In the most comprehensive information we found, symptom checkers for both professionals and patients were said to range in price from “under $1 to $80 or more” [6]. In a study of dermatological diagnostic and management apps, app prices were given as ranging from 99 cents to $139.99 [20]. In neither study were prices for DTC diagnostic apps broken down separately. Only one of the three studies of the CrowdMed app mentioned its significant cost; i.e. users must offer a minimum $200 reward to the “crowd” of “medical detectives”.

Actual diagnostic performance varied widely. A study of 23 general symptom checkers by Semigran et al. found an 80% rate of appropriate triage advice in emergent cases, but just 33% for appropriate self-care suggestions. Still, researchers judged these interactive apps preferable to a static Google search [43]. In a non-peer-reviewed “contest”, the Babylon Check symptom checker, which was not included in the Semigran study, was pitted against a junior doctor and an experienced nurse using a standardized case and compared favorably [47]. A separate, non-peer-reviewed article by the app’s sponsor concluded that Babylon Check produced accurate triage advice in 88.2% of cases (based on pre-determined case vignettes), vs. 75.5% for doctors and 73.5% for nurses [50]. However, we also found articles calling into question some of these findings and asking for an independent evaluation and additional evidence of the app’s accuracy [53].

Peer-reviewed results of general symptom checkers for particular diseases, rather than for general medical and triage advice, showed few favorable results. In one study, the diagnosis suggested by a symptom checker matched a final diagnosis related to hand surgery just 33% of the time [37], while in another, a symptom checker provided “frequently inaccurate” advice related to inflammatory joint disease [42]. Specialty symptom checkers – like the general ones, based on answers to user questions – also fared poorly. An app for knee pain diagnoses had an accuracy rate of 58% [33]. Apps to screen for Alzheimer’s disease were all rated “poor to very poor”, and the authors noted that one tested app even concluded the user had the condition no matter what data were entered [30].

However, when specialty symptom checkers used data directly entered from sensors, they sometimes showed more promise, albeit with significant variability in the findings. For example, while one study warned of substantial potential for patient harm from a dermatology app’s misleading results [36], another study of that same app using a different methodology 2 years later found an accuracy rate of 81% in detecting melanoma, a sensitivity of 73% and a specificity of 39.3% [39]. Meanwhile, vision diagnostic apps using sensors and directly targeting consumers received cautiously positive assessments in two non-peer-reviewed articles [48], [51].

No studies examined actual patient outcomes. The closest approximation came in two studies of CrowdMed. In one study, patients said the app provided helpful guidance [27], while in another, users had fewer provider visits and lower utilization [24]. The patient’s ultimate correct diagnosis was, however, never confirmed. Evaluations of consumer characteristics related to performance yielded varying results. Luger et al. found that individuals who diagnosed their symptoms more accurately using a symptom checker were slightly younger [38], while Powley et al. concluded that neither age nor gender had a significant impact on usability [42]. Hageman et al. identified more familiarity with the Internet as contributing to “optimal use and interpretation” [37].

Some study designs raised questions of evaluator bias against the interactive apps. Among the criticisms were whether a particular evaluation overweighted relatively rare diagnoses [54] or failed to compare app use for triage to a realistic consumer alternative, such as a telephone triage line [49]. Our scoping review raised similar concerns; e.g. studies in which an orthopedist assessed whether a symptom checker could “guess” the correct diagnosis [37], a dermatologist set out to show the need for greater regulation [36] and an otolaryngologist compared a symptom checker’s diagnostic accuracy to his own [35]. This potential bias could be due to the tendency to judge algorithms differently than fellow humans [55].


Patient diagnosis is evolving “from art to digital data-driven science”, both within and outside the exam room [56]. DTC diagnostic technology is rapidly evolving: the second half of 2017, for example, witnessed the widespread online dissemination of a depression-assessment questionnaire [57], as well as the debut of smartphone enhancements utilizing sensors and AI that target the same condition [58]. The pace of change should inspire urgency to improve the evidence base on app performance. However, most of the studies we identified simply described various apps’ attributes, a finding similar to the conclusions of a broad systematic review of mHealth apps [59].

Our findings demonstrate the need to accelerate investment in evaluation and research related to consumer-facing diagnostic apps. By contrast, there appears to be some progress in evaluating physician-facing diagnostic apps, such as determining the accuracy of the Isabel clinical decision support system in diagnosing complex cases [60] and determining the test ordering and diagnostic accuracy of an app that supports testing and diagnosis for certain hematologic conditions [61]. A recent systematic review and meta-analysis concluded that differential diagnosis generators (often used as apps) “have the potential to improve diagnostic practice among clinicians” [62]. Nevertheless, the review found many studies with poor methodological quality, in addition to high between-study heterogeneity [62].

Based on our review, we make three key recommendations to advance research, policy, and practice. First, researchers should consistently name all individual apps evaluated and provide all results by individual app. Apps are medical devices, and accurate and timely diagnosis is a significant public health issue. Given that some of these publicly available apps seemed to perform far better than others, identification is central to enabling the type of clinician-patient partnership recommended by NAM’s Improving Diagnosis report, as well as the accountability that comes from policy oversight and replication of research findings. Since these products are aimed at consumers, price information should also routinely be included.

Second, evaluations of apps should explicitly address underlying technological and functional differences. These may or may not be tied to whether an app is accessed via a web browser or is downloaded. Functionally, for example, an app relying on algorithmic analysis of answers to questions, even if it is downloaded to a mobile device, is very different from one relying on algorithmic analysis of data from that device’s sensors. In turn, the technological basis of those algorithms – for example, the use of artificial intelligence (AI) – has substantial future implications. For example, current evidence suggests that the sensor-based diagnoses of DTC dermatology apps are approaching high reliability [40] and that general symptom checker accuracy might be significantly improved with AI [50]. These technological distinctions should be recognized by researchers and can inform evidence-based discussions about the clinical and economic impact of consumer use of DTC diagnostic apps and the appropriate public policy response.

Third, researchers should validate and standardize evaluation methodologies. The Standards for Universal reporting of patient Decision Aid Evaluation (SUNDAE) checklist for studies of decision aids may serve as one example [63]. In addition to ensuring that evaluations name individual apps and identify their functionality appropriately, a methodology should include agreed-upon sampling and selection criteria; characteristics related to usability and performance; and standards for assessing sensitivity, specificity, and other measures of app accuracy. These actions will help avoid bias while also ensuring that the evidence base aligns with the varying needs of clinicians, patients, researchers, private-sector entrepreneurs, and policymakers.
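One way to picture the first and third recommendations together is a tally that reports results by named individual app and by reference triage level when apps are tested against standardized vignettes. The sketch below is illustrative only: the app names, vignette identifiers, triage labels and records are hypothetical and are not data from any study in this review.

```python
from collections import defaultdict

# Hypothetical records: (app name, vignette id, reference triage, app triage).
records = [
    ("App A", "v1", "emergent", "emergent"),
    ("App A", "v2", "emergent", "emergent"),
    ("App A", "v3", "self-care", "non-emergent"),
    ("App A", "v4", "self-care", "self-care"),
    ("App B", "v1", "emergent", "emergent"),
    ("App B", "v2", "emergent", "non-emergent"),
    ("App B", "v3", "self-care", "self-care"),
    ("App B", "v4", "self-care", "self-care"),
]


def per_app_triage_accuracy(rows):
    """Share of vignettes triaged correctly, by app and reference level."""
    # tallies[app][reference level] holds [correct count, total count].
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for app, _vignette, reference, suggested in rows:
        cell = tallies[app][reference]
        cell[1] += 1
        if suggested == reference:
            cell[0] += 1
    return {
        app: {
            level: correct / total
            for level, (correct, total) in by_level.items()
        }
        for app, by_level in tallies.items()
    }


report = per_app_triage_accuracy(records)
for app, by_level in sorted(report.items()):
    for level, share in sorted(by_level.items()):
        print(f"{app}\t{level}\t{share:.0%}")
```

Keeping the app name as a key in every row of the output, rather than pooling apps into a single aggregate rate, is what would allow clinicians and consumers to see which specific apps performed well at which triage levels.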


Overall, the current evidence base on DTC, interactive diagnostic apps is sparse in scope, uneven in the information provided, and inconclusive with respect to safety and effectiveness, with no studies of clinical risks and benefits involving real-world consumer use. Although some studies we examined rigorously determined the sensitivity and specificity of app-generated diagnoses, methodologies varied considerably. Given that DTC diagnostic apps are rapidly evolving, more frequent and rigorous evaluations are essential to inform decisions by clinicians, patients, policymakers, and other stakeholders.


We thank Annie Bradford, PhD for help with the medical editing. We also thank Kathryn M. McDonald, MM, PhD and Daniel Yang, MD for their valuable comments on an earlier draft of this manuscript.


About the article

Corresponding author: Michael L. Millenson, BA, Health Quality Advisors LLC, Highland Park, IL 60035, USA; and Northwestern University Feinberg School of Medicine, Department of General Internal Medicine and Geriatrics, Chicago, IL, USA

Received: 2018-04-01

Accepted: 2018-06-01

Published Online: 2018-07-23

Published in Print: 2018-09-25

Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

Research funding: This project was funded by the Gordon and Betty Moore Foundation, Funder ID: 10.13039/100000936, Grant number: 5492. Dr. Singh is additionally supported by the VA Health Services Research and Development Service (Presidential Early Career Award for Scientists and Engineers USA 14-274), the VA National Center for Patient Safety and the Agency for Healthcare Research and Quality (R01HS022087) and in part by the Houston VA HSR&D Center for Innovations in Quality, Effectiveness and Safety (CIN13-413).

Employment or leadership: None declared.

Honorarium: None declared.

Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.

Citation Information: Diagnosis, Volume 5, Issue 3, Pages 95–105, ISSN (Online) 2194-802X, ISSN (Print) 2194-8011, DOI: https://doi.org/10.1515/dx-2018-0009.


©2018 Walter de Gruyter GmbH, Berlin/Boston.
