The 2015 National Academy of Medicine (NAM) report Improving Diagnosis in Health Care concluded that most people will experience at least one diagnostic error in their lifetime. The report, noting that over a third of adults go online to diagnose a health condition, urged professionals to direct patients to reliable online resources. How to determine the reliability of online resources, however, remains an unresolved question.
Currently available online resources have graduated beyond keyword searches on Google. Increasingly, they include sophisticated direct-to-consumer (DTC) diagnostic tools that use algorithms, sensors and “crowdsourcing” to create Web 2.0 personalization and interactivity for functions ranging from triage and differential diagnosis of common ailments to detecting skin changes suggestive of cancer.
With over a quarter million health apps available in major app stores, popular DTC diagnostic apps have been downloaded from tens of thousands to tens of millions of times. Possible benefits include faster, more convenient and more targeted information to improve diagnosis and reduction of unneeded visits and tests, but there is also the potential for unintended outcomes such as inappropriate treatment and diagnostic error. The Food and Drug Administration (FDA) has long exempted “low risk” apps from its approval process, and the current FDA commissioner has said that apps helping consumers self-diagnose are an innovation that regulations should not impede. Nonetheless, there are as yet no accepted vetting processes enabling clinicians or patients to distinguish between reliable apps and “digital snake oil”. Diagnostic apps specifically have received scant attention in comparison to health management ones, even in overviews of the field.
We conducted a scoping review to characterize the current state of evidence on how interactive, DTC diagnostic apps available to consumers perform and what methods are used to evaluate them.
Funding for our work was provided by the Gordon and Betty Moore Foundation; however, the foundation had no role in study design; collection, analysis and interpretation of data; or approval of final publication. Our scoping review used Arksey and O’Malley’s five-stage methodological framework summarized in Table 1.
Formulating research questions
An initial search in PubMed, Google Scholar, and the lay literature through General Reference Center Gold revealed a highly heterogeneous literature in which information of interest was often subsumed in broader examinations of diagnostic and/or health management apps. That search generated four research questions: what clinical conditions do these apps address? What functionality is involved in producing a tentative diagnosis? What methodologies are evaluators using to assess these apps? And what are the results of app evaluations, including evidence on risks and benefits? Our findings were intended to help guide medical practice, consumer choice and health policy by identifying the strengths and weaknesses of the evidence in the current literature and by highlighting evidence gaps.
Identification of relevant studies
With a medical librarian (LZ), we conducted a structured search of PubMed and Google Scholar for the period January 1, 2014–June 30, 2017, focusing on apps suggesting an initial diagnosis and marketed DTC without FDA approval. The timeframe was chosen in an attempt to minimize the inclusion of possibly technologically irrelevant evaluations of older apps. A lack of common keywords and inconsistent indexing made a structured and reproducible PubMed search difficult, leading to an iterative search process. Moreover, as no existing U.S. National Library of Medicine MeSH terms were closely related to our topic, we used broader, related terms such as “smartphone” and “diagnostic self-evaluation”. In addition, we manually reviewed selected bibliographies, even if slightly outside the time frame. We also searched the lay literature through General Reference Center Gold and by looking more broadly at trade and general-interest publications, websites and reports from organizations active in this field. We also interviewed physicians, researchers, digital health entrepreneurs and a venture capitalist.
We included original research, descriptive studies and literature reviews related to diagnostic software applications consumers might commonly use, whether web-based or apps developed for a specific platform (e.g. iPhone). We excluded apps subject to FDA approval, those in a research phase, those using physical tests (e.g. Bluetooth-connected pregnancy tests) and static content (e.g. keyword searches).
Two authors (MLM and JLB) assessed full-text articles for relevance, given that an abstract might not accurately reflect whether an evaluation of a particular diagnostic app was performed. When there was a question about article inclusion, it was discussed with a third author (HS).
Two authors (MLM and JLB) reviewed articles and organized information pertaining to type of digital platform(s), study design, app attributes, outcomes investigated and major findings.
Data were summarized according to app functionality; diseases evaluated; evaluation methodologies (including selection criteria, descriptions of app attributes and testing of diagnostic functionality); and study results.
Overview of selected studies
The greatest number of articles (10) focused on dermatology-related diagnostic apps, primarily conditions associated with malignancy. Next were eight articles on apps providing diagnostic and triage advice for a broad range of conditions. Other diagnostic areas included infectious disease (one article on acute infectious conditions, one on sexually transmitted infections (STIs)); mental health issues (one article on depression); neurology (one article on Alzheimer’s disease); general oncology (two articles); orthopedics (one on knee pain, one on hand surgery); eye and vision issues (one); otolaryngology (one general); rheumatology (one on inflammatory arthritis); and urology (one general).
The evaluations covered three broad functional categories of apps, with some articles including apps falling into more than one category. The largest category (20 articles) involved medical symptom checkers that apply algorithms to user-answered questions to generate probable diagnoses and/or triage advice. The second most common category (12) included apps that applied image-processing technology and algorithms to smartphone photos; the articles we found focused exclusively on conditions of the skin and eyes. Finally, five articles involved crowdsourcing using an online, distributed problem-solving model. (A prominent app in this category, CrowdMed, applies an algorithm to diagnostic suggestions submitted online by clinical and non-clinical “medical detectives” and then provides a second opinion.)
Assessment methodologies ranged from a structured rating grid completed by two expert panels to “think-aloud” feedback from consumers during use. User characteristics that were examined included age, gender, education, income, years of home ownership, health literacy, and computer literacy. As noted in Table 3, some studies engaged multiple experts to review app content and features, while others assessed performance directly by comparing an app’s suggested diagnosis to a reference diagnosis from a clinician or other source, such as structured clinical vignettes. Although these apps are classified as low-risk devices by the FDA, it is important to note that we found no studies of accuracy or clinical risks and benefits based upon real-world use by consumers.
Quantitative studies of these apps’ accuracy most often expressed their results in terms of percentage of true positives (or percent of responses correctly assigned to a specific category), sensitivity, and/or specificity for app-generated diagnoses when compared to diagnoses from a clinician or other reference source. Less commonly reported quantitative measures included positive predictive value, negative predictive value, and nonparametric statistics (e.g. kappa, chi-square, odds ratio) (Table 3).
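To make these accuracy measures concrete, the sketch below computes sensitivity, specificity, predictive values, and Cohen’s kappa from a 2×2 confusion matrix comparing app-generated diagnoses to a clinician reference standard. The counts are hypothetical illustrations, not data from any study reviewed here.

```python
# Illustrative 2x2 confusion matrix: app diagnoses vs. a clinician
# reference standard (hypothetical counts, not from any reviewed study).
tp, fn = 73, 27   # cases the reference confirmed: app flagged / app missed
fp, tn = 61, 39   # cases the reference ruled out: app flagged / app cleared

sensitivity = tp / (tp + fn)   # true-positive rate among confirmed cases
specificity = tn / (tn + fp)   # true-negative rate among ruled-out cases
ppv = tp / (tp + fp)           # positive predictive value
npv = tn / (tn + fn)           # negative predictive value

# Cohen's kappa: agreement between app and reference beyond chance
total = tp + fp + fn + tn
observed = (tp + tn) / total
expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
kappa = (observed - expected) / (1 - expected)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
print(f"PPV={ppv:.2f} NPV={npv:.2f} kappa={kappa:.2f}")
```

Note how a respectable-looking sensitivity can coexist with low specificity and a kappa near zero, which is one reason studies that report only a single “accuracy” percentage are hard to interpret.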
Potential privacy and security problems were highlighted by several studies; e.g. symptom checkers for STIs were rated as “poor to very poor” on informed consent, disclosure of privacy and confidentiality policies and possible conflicts of interest. A similar conclusion was reached in a study of apps for detecting Alzheimer’s disease.
Meanwhile, the cost of apps was difficult to ascertain. In the most comprehensive information we found, symptom checkers for both professionals and patients were said to range in price from “under $1 to $80 or more”. In a study of dermatological diagnostic and management apps, app prices were given as ranging from 99 cents to $139.99. In neither study were prices for DTC diagnostic apps broken down separately. Only one of the three studies of the CrowdMed app mentioned its significant cost; i.e. users must offer a minimum $200 reward to the “crowd” of “medical detectives”.
Actual diagnostic performance varied widely. A study of 23 general symptom checkers by Semigran et al. found an 80% rate of appropriate triage advice in emergent cases, but just 33% for appropriate self-care suggestions. Still, researchers judged these interactive apps preferable to a static Google search. In a non-peer-reviewed “contest”, the Babylon Check symptom checker, which was not included in the Semigran study, was pitted against a junior doctor and an experienced nurse using a standardized case and compared favorably. A separate, non-peer-reviewed article by the app’s sponsor concluded that Babylon Check produced accurate triage advice in 88.2% of cases (based on pre-determined case vignettes), vs. 75.5% for doctors and 73.5% for nurses. However, we also found articles calling some of the findings into question and asking for an independent evaluation and additional evidence of its accuracy.
Peer-reviewed results of general symptom checkers for particular diseases, rather than for general medical and triage advice, showed few favorable results. In one study, the diagnosis suggested by a symptom checker matched a final diagnosis related to hand surgery just 33% of the time, while in another, a symptom checker provided “frequently inaccurate” advice related to inflammatory joint disease. Specialty symptom checkers – like the general ones, based on answers to user questions – also fared poorly. An app for knee pain diagnoses had an accuracy rate of 58%. Apps to screen for Alzheimer’s disease were all rated “poor to very poor”, and the authors noted that one tested app even concluded the user had the condition no matter what data were entered.
However, when specialty symptom checkers used data directly entered from sensors, they sometimes showed more promise, albeit with significant variability in the findings. For example, while one study warned of substantial potential for patient harm from a dermatology app’s misleading results, another study of that same app using a different methodology 2 years later found an accuracy rate of 81% in detecting melanoma, a sensitivity of 73% and a specificity of 39.3%. Meanwhile, vision diagnostic apps using sensors and directly targeting consumers received cautiously positive assessments in two non-peer-reviewed articles.
No studies examined actual patient outcomes. The closest approximation came in two studies of CrowdMed. In one study, patients said the app provided helpful guidance, while in another, users had fewer provider visits and lower utilization. The patient’s ultimate correct diagnosis was, however, never confirmed. There were evaluations of consumer characteristics related to performance, with varying results. Luger et al. found that individuals who diagnosed their symptoms more accurately using a symptom checker were slightly younger, while Powley et al. concluded that neither age nor gender had a significant impact on usability. Hageman et al. identified greater familiarity with the Internet as contributing to “optimal use and interpretation”.
Some study designs raised questions of evaluator bias against the interactive apps. Among the criticisms were whether a particular evaluation overweighted relatively rare diagnoses or failed to compare app use for triage to a realistic consumer alternative, such as a telephone triage line. Our scoping review raised similar concerns; e.g. studies in which an orthopedist assessed whether a symptom checker could “guess” the correct diagnosis, a dermatologist set out to show the need for greater regulation and an otolaryngologist compared a symptom checker’s diagnostic accuracy to his own. This potential bias could be due to the tendency to judge algorithms differently than fellow humans.
Patient diagnosis is evolving “from art to digital data-driven science”, both within and outside the exam room. DTC diagnostic technology is changing rapidly: the second half of 2017, for example, witnessed the widespread online dissemination of a depression-assessment questionnaire, as well as the debut of smartphone enhancements utilizing sensors and AI that target the same condition. The pace of change should inspire urgency to improve the evidence base on app performance. However, most of the studies we identified simply described various apps’ attributes, a finding similar to the conclusions of a broad systematic review of mHealth apps.
Our findings demonstrate the need to accelerate investment in evaluation and research related to consumer-facing diagnostic apps. Conversely, there appears to be some progress in evaluating physician-facing diagnostic apps, such as studies of the Isabel clinical decision support system’s accuracy in diagnosing complex cases and of an app’s test-ordering and diagnostic accuracy for certain hematologic conditions. A recent systematic review and meta-analysis concluded that differential diagnosis generators (often used as apps) “have the potential to improve diagnostic practice among clinicians”. Nevertheless, the review found many studies with poor methodological quality, in addition to high between-study heterogeneity.
Based on our review, we make three key recommendations to advance research, policy, and practice. First, researchers should consistently name all individual apps evaluated and provide all results by individual app. Apps are medical devices, and accurate and timely diagnosis is a significant public health issue. Given that some of these publicly available apps seemed to perform far better than others, identification is central to enabling the type of clinician-patient partnership recommended by NAM’s Improving Diagnosis report, as well as the accountability that comes from policy oversight and replication of research findings. Since these products are aimed at consumers, price information should also routinely be included.
Second, evaluations of apps should explicitly address underlying technological and functional differences. These may or may not be tied to whether an app is accessed via a web browser or is downloaded. Functionally, for example, an app relying on algorithmic analysis of answers to questions, even if it is downloaded to a mobile device, is very different from one relying on algorithmic analysis of data from that device’s sensors. In turn, the technological basis of those algorithms – for example, the use of artificial intelligence (AI) – has substantial future implications. For example, current evidence suggests that the sensor-based diagnoses of DTC dermatology apps are approaching high reliability and that general symptom checker accuracy might be significantly improved with AI. These technological distinctions should be recognized by researchers and can inform evidence-based discussions about the clinical and economic impact of consumer use of DTC diagnostic apps and the appropriate public policy response.
Third, researchers should validate and standardize evaluation methodologies. The Standards for UNiversal reporting of patient Decision Aid Evaluation (SUNDAE) checklist for decision aid studies may serve as one example. In addition to ensuring that evaluations name individual apps and identify their functionality appropriately, a methodology should include agreed-upon sampling and selection criteria; characteristics related to usability and performance; and standards for assessing sensitivity, specificity, and other measures of app accuracy. These actions will help avoid bias while also ensuring that the evidence base aligns with the varying needs of clinicians, patients, researchers, private-sector entrepreneurs, and policymakers.
Overall, the current evidence base on DTC, interactive diagnostic apps is sparse in scope, uneven in the information provided, and inconclusive with respect to safety and effectiveness, with no studies of clinical risks and benefits involving real-world consumer use. Although some studies we examined rigorously determined the sensitivity and specificity of app-generated diagnoses, methodologies varied considerably. Given that DTC diagnostic apps are rapidly evolving, more frequent and rigorous evaluations are essential to inform decisions by clinicians, patients, policymakers, and other stakeholders.
We thank Annie Bradford, PhD for help with the medical editing. We also thank Kathryn M. McDonald, MM, PhD and Daniel Yang, MD for their valuable comments on an earlier draft of this manuscript.
Improving Diagnosis in Health Care. National Academies of Sciences, Engineering and Medicine. 2015. Available at: http://iom.nationalacademies.org/Reports/2015/Improving-Diagnosis-in-Healthcare.aspx. Accessed: 14 Jun 2016.
Fox S, Duggan M. Health Online 2013. 2013. Available at: http://www.pewinternet.org/2013/01/15/health-online-2013/. Accessed: 12 Jul 2017.
O’Reilly T. What is Web 2.0. 2005. Available at: http://www.oreilly.com/pub/a/web2/archive/what-is-web-20.html. Accessed: 15 Dec 2017.
Research2guidance. The mHealth App Market is Getting Crowded. 2016. Available at: https://research2guidance.com/mhealth-app-market-getting-crowded-259000-mhealth-apps-now/. Accessed: 4 Sep 2017.
U.S. Food and Drug Administration. Mobile Medical Applications. Available at: https://www.fda.gov/medicaldevices/digitalhealth/mobilemedicalapplications/default.htm. Accessed: 15 Dec 2017.
Comstock J. In past editorials, Trump’s FDA pick advocated hands-off approach for health apps. 2017. Available at: http://www.mobihealthnews.com/content/past-editorials-trumps-fda-pick-advocated-hands-approach-health-apps. Accessed: 15 Dec 2017.
AMA Wire. Medical innovations and digital snake oil: AMA CEO speaks out. 2016. Available at: https://wire.ama-assn.org/life-career/medical-innovation-and-digital-snake-oil-ama-ceo-speaks-out. Accessed: 15 Dec 2017.
Aitken M, Lyle J. Patient Adoption of mHealth: Use, Evidence and Remaining Barriers to Mainstream Acceptance. Parsippany, NJ: IMS Institute for Healthcare Informatics. 2015. Available at: https://pascaleboyerbarresi.files.wordpress.com/2015/03/iihi_patient_adoption_of_mhealth.pdf. Accessed: 12 Jul 2017.
American Medical Association. Report 6 of the Council on Medical Service (I-16). Integration of mobile health applications and devices into practice. 2016. https://www.ama-assn.org/sites/default/files/media-browser/public/about-ama/councils/Council%20Reports/council-on-medical-service/interim-2016-council-on-medical-service-report-6.pdf. Accessed: 12 July 2017.
TechTarget. Computing Fundamentals. 2007. Available at: http://searchmobilecomputing.techtarget.com/definition/app. Accessed: 12 Jul 2017.
Bender JL, Yue RY, To MJ, Deacken L, Jadad AR. A lot of action, but not in the right direction: systematic review and content analysis of smartphone applications for the prevention, detection, and management of cancer. J Med Internet Res 2013;15:e287.
Bhattacharyya M. Studying the Reality of Crowd-Powered Healthcare. Paper presented at: AAAI HCOMP 2015.
Brouard B, Bardo P, Bonnet C, Mounier N, Vignot M, Vignot S. Mobile applications in oncology: is it possible for patients and healthcare professionals to easily identify relevant tools? Ann Med 2016;48:509–15.
Cheng J, Manoharan M, Lease M, Zhang Y. Is there a Doctor in the Crowd? Diagnosis Needed! (for less than $5). Paper presented at: iConference 2015.
Gibbs J, Gkatzidou V, Tickle L, Manning SR, Tilakkumar T, Hone K, et al. ‘Can you recommend any good STI apps?’ A review of content, accuracy and comprehensiveness of current mobile medical applications for STIs and related genital infections. Sex Transm Infect 2017;93:234–5.
Juusola JL, Quisel TR, Foschini L, Ladapo JA. The impact of an online crowdsourcing diagnostic tool on health care utilization: a case study using a novel approach to retrospective claims analysis. J Med Internet Res 2016;18:e127.
Kassianos AP, Emery JD, Murchie P, Walter FM. Smartphone applications for melanoma detection by community, patient and generalist clinician users: a review. Br J Dermatol 2015;172:1507–18.
Patel S, Madhu E, Boyers LN, Karimkhani C, Dellavalle R. Update on mobile applications in dermatology. Dermatol Online J 2015;21.
Pereira-Azevedo N, Carrasquinho E, Cardoso de Oliveira E, Cavadas V, Osório L, Fraga A, et al. mHealth in urology: a review of experts’ involvement in app development. PLoS One 2015;10:e0125547.
Robillard JM, Illes J, Arcand M, Beattie BL, Hayden S, Lawrence P, et al. Scientific and ethical features of English-language online tests for Alzheimer’s disease. Alzheimers Dement (Amst) 2015;1:281–8.
Shen N, Levitan MJ, Johnson A, Bender JL, Hamilton-Page M, Jadad AA, et al. Finding a depression app: a review and content analysis of the depression app marketplace. JMIR Mhealth Uhealth 2015;3:e16.
Bisson LJ, Komm JT, Bernas GA, Fineberg MS, Marzo JM, Rauh MA, et al. How accurate are patients at diagnosing the cause of their knee pain with the help of a web-based symptom checker? Orthop J Sports Med 2016;4:2325967116630286.
Maier T, Kulichova D, Schotten K, Astrid R, Ruzicka T, Berking C, et al. Accuracy of a smartphone application using fractal image analysis of pigmented moles compared to clinical diagnosis and histological result. J Eur Acad Dermatol Venereol 2015;29:663–7.
Nabil R, Bergman W, KuKutsh NA. Poor agreement between a mobile phone application for the analysis of skin lesions and the clinical diagnosis of the dermatologist, a pilot study. Br J Dermatol 2017;177:583–4.
Ngoo A, Finnane A, McMeniman E, Tan JM, Janda M, Soyer HP. Efficacy of smartphone applications in high-risk pigmented lesions. Australas J Dermatol 2017;1–8. [Epub ahead of print].
Semigran HL, Linder JA, Gidengil C, Mehrotra A. Evaluation of symptom checkers for self diagnosis and triage: audit study. Br Med J 2015;351:h3480.
Thissen M, Udrea A, Hacking M, von Braunmuehl T, Ruzicka T. mHealth app for risk assessment of pigmented and nonpigmented skin lesions-a study on sensitivity and specificity in detecting malignancy. Telemed J E Health 2017;23:948–54.
Chapman M. A health app’s AI took on human doctors to triage patients. 2016. Available at: https://motherboard.vice.com/en_us/article/z43354/a-health-apps-ai-took-on-human-doctors-to-triage-patients. Accessed: 12 Jul 2017.
Shah V, Hemang K, Pandya MD. Smartphones for visual function testing. 2015. Available at: https://www.retinalphysician.com/issues/2015/may-2015/smartphones-for-visual-function-testing. Accessed: 18 Dec 2017.
Husain I. Self-diagnosis app study scrutinized the wrong way. 2015. Available at: https://www.imedicalapps.com/author/iltifat/#. Accessed: 12 Jul 2017.
Middleton K, Butt M, Hammerla N, Hamblin S, Mheta K, Parsa A. Sorting out symptoms: design and evaluation of the ‘Babylon Check’ automated triage system. 2016. Available at: https://arxiv.org/abs/1606.02041. Accessed: 12 Jul 2017.
Lee L. Portable vision testing kit puts an eye doctor in your smartphone. 2016. Available at: https://newatlas.com/eyeque-personal-vision-tracker/47148/. Accessed: 18 Dec 2017.
Hagan P. Can an app really help you spot a risky mole? SkinVision can help you ‘be your own doctor’ by finding irregularities. 2016. Available at: http://www.dailymail.co.uk/health/article-3845614/Can-app-really-help-spot-risky-mole-SkinVision-help-doctor-finding-irregularities.html. Accessed: 18 Dec 2017.
McCartney M. Margaret McCartney: innovation without sufficient evidence is a disservice to all. Br Med J 2017;358:j3980.
National Alliance on Mental Illness. Google partners with NAMI to shed light on clinical depression. 2017. Available at: https://www.nami.org/About-NAMI/NAMI-News/2017/Google-Partners-with-NAMI-to-Shed-Light-on-Clinica. Accessed: 12 Jul 2017.
Morse J. So how worried should we be about Apple’s Face ID? 2017. Available at: http://mashable.com/2017/09/14/apple-faceid-privacy-concerns/#oL77nLsigiqV. Accessed: 8 Dec 2017.
Meyer AN, Thompson PJ, Khanna A, Desai S, Mathews BK, Yousef E, et al. Evaluating a mobile application for improving clinical laboratory test ordering and diagnosis. J Am Med Inform Assoc 2018;25:841–7.
Riches N, Panagioti M, Alam R, Cheraghi-Sohi S, Campbell S, Esmail A, et al. The effectiveness of electronic differential diagnoses (ddx) generators: a systematic review and meta-analysis. PLoS One 2016;11:e0148991.
Sepucha KR, Abhyankar P, Hoffman AS, Bekker HL, LeBlanc A, Levin CA, et al. Standards for UNiversal reporting of patient Decision Aid Evaluation studies: the development of SUNDAE checklist. BMJ Qual Saf 2018;27:380–8.
About the article
Published Online: 2018-07-23
Published in Print: 2018-09-25
Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: This project was funded by the Gordon and Betty Moore Foundation, Funder ID: 10.13039/100000936, Grant number: 5492. Dr. Singh is additionally supported by the VA Health Services Research and Development Service (Presidential Early Career Award for Scientists and Engineers USA 14-274), the VA National Center for Patient Safety and the Agency for Healthcare Research and Quality (R01HS022087) and in part by the Houston VA HSR&D Center for Innovations in Quality, Effectiveness and Safety (CIN13-413).
Employment or leadership: None declared.
Honorarium: None declared.
Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.