Futurehealth 9 1 75
Authors: Christopher A Lovejoy,A Anmol Arora,B Varun BuchC and Ittai DayanD

Authors: Aphysician, University College London, London, UK and University College Hospital, London, UK; Bmedical student, University of Cambridge, Cambridge, UK and honorary research fellow, Moorfields Eye Hospital, London, UK; Cdirector of AI development, MGH & BWH Center for Clinical Data Science, Boston, USA; Dlecturer, Partners HealthCare, Boston, USA and chief executive officer, Rhino Health, Boston, USA

ABSTRACT

Interest in artificial intelligence (AI) has grown exponentially in recent years, attracting sensational headlines and speculation. While there is considerable potential for AI to augment clinical practice, there remain numerous practical implications that must be considered when exploring AI solutions. These range from ethical concerns about algorithmic bias to legislative concerns in an uncertain regulatory environment. In the absence of established protocols and examples of best practice, there is a growing need for clear guidance both for innovators and early adopters. Broadly, there are three stages to the innovation process: invention, development and implementation. In this paper, we present key considerations for innovators at each stage and offer suggestions along the AI development pipeline, from bench to bedside.

KEYWORDS: machine learning, innovation, technology, algorithms, data

DOI: 10.7861/fhj.2021-0128

Introduction

Despite considerable acclaim, promotion and investment around artificial intelligence (AI), the technology is simply a form of computational analysis not sharply demarcated from other kinds of modelling.1–3 AI is defined as the ability of a computer system to perform tasks that are usually thought to require human intelligence, including processes of learning, reasoning and self-correction.4 The term is sometimes used synonymously with ‘machine learning’, a methodology that allows a computer system to refine its output function by learning from input data, with minimal human intervention. The strength of AI is its ability to accentuate the flexibility and expressivity of more traditional statistical techniques, catering to problems where the inputs and outputs are highly multi-dimensional and associated in extremely complex ways. A number of guidelines currently exist for the development of AI solutions, including on concerns about transparency, reproducibility, ethics and effectiveness.5,6 Reporting guidelines have also been produced in order to assess the methodology of projects.7,8 However, real-world requirements for AI solutions are still evolving, as noted in a recent UK parliamentary research briefing on the topic of AI in healthcare.9 In the absence of any set precedent for its clinical application, it is essential that researchers have an appreciation of AI’s strengths and limitations to assist them in developing appropriate research questions. It is well established that innovation consists of three stages: invention, development and implementation.10 Here, we present key considerations at each of the major stages in the AI innovation pathway, from identifying a research question to deployment.

Identifying appropriate research questions

Before embarking on AI-based research, a fundamental and often overlooked question is: ‘Would AI really be appropriate for the research question at hand?’ There are research projects that can be enhanced with AI and others where AI can be detrimental. AI is useful in situations where the input is highly complex, where it is desirable for the question to have a complex answer or where the hypothesis space cannot easily be constrained. An example of a highly complex input is imaging data, which may contain numerous features that can represent a large number of pathologies.

The range of uses of AI varies from automated data collection to the development of diagnostic decision aids. The techniques required for each are similar and revolve around the ability of models to identify complex relationships within datasets, due to their capacity to analyse many variables and their ability to extract useful features; for example, with minimal human instruction, convolutional neural networks (CNNs) can extract features from medical images, as in recent work with histology slides.11,12 CNNs, which are modelled on the animal visual cortex, are particularly suited to the analysis of imaging data that may otherwise be difficult to analyse.

Rather than population-level scoring systems that provide generalisable but imprecise predictions for the non-existent ‘average’ patient, AI models can provide predictions that are more specific to smaller patient cohorts and more precise. Several models have illustrated this in recent months. Hilton et al’s model provides personalised predictions of adverse events, such as extended length of stay, 30-day readmission and death, while other recent studies provide personalised predictions of heart failure outcomes and of the risks of gestational diabetes or myocardial infarction.13–16
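The convolutional feature extraction described above can be made concrete with a toy sketch. The short Python example below is an illustration only: the 4×4 ‘image’ and the hand-set edge filter are invented for demonstration, not real imaging data or learned weights. It shows the sliding-filter operation at the heart of a CNN layer; in a trained network, the filter weights are learned from labelled examples rather than specified by hand.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation: the basic operation of a CNN layer.

    Slides the kernel over every position of the image and records the
    weighted sum at each position, producing a 'feature map'.
    """
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(iw - kw + 1)
        ]
        for i in range(ih - kh + 1)
    ]

# A toy 'image' containing a vertical intensity boundary between its
# dark left half and bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# A hand-set vertical-edge filter; in a real CNN these weights would be
# learned from labelled examples, with no human specifying them.
kernel = [
    [1, -1],
    [1, -1],
]

feature_map = convolve2d(image, kernel)
# The filter responds with the largest magnitude exactly where the
# intensity boundary lies, and zero over the uniform regions.
```

Stacking many such learned filters, interleaved with non-linearities and pooling, is how CNNs build up from simple edges to the complex features relevant to histology or radiology images.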
As well as being used to guide patient care, AI may be used to expand the scope of research by enabling automation of routine tasks; for example, if a team wished to explore the prevalence of pulmonary nodules, the time and financial costs required to manually annotate a large dataset of computed tomography (CT) could exceed the resources available to a small research group. However, using AI, if enough data were already accurately labelled, the researchers could train a classifier that could be used to analyse the remaining images. Similarly, AI may be applied to interpret other investigations or even to analyse clinic letters using natural language processing tools.

Unhelpful AI

Not all problems need an AI-based solution. A common pitfall in industry is to search for solutions which utilise AI rather than focusing on existing problems. Such an approach is ill-advised because, aside from the questionable clinical utility of the outputs, AI-based research has a number of inherent disadvantages. Ethical issues (such as algorithmic bias, lack of transparency and ambiguous accountability) have gathered attention as potential barriers to real-world adoption of AI systems.17

Firstly, large datasets often lack diversity and studies based on these datasets may not reflect the target population; for example, the UK Biobank excludes young people and has low numbers of several common diseases of interest, such as stroke.18 While the algorithms may demonstrate superior performance on a limited set of test data, they may underperform when subjected to external validation on unseen data. This algorithmic bias may be seen with any predictive model, but AI models are particularly vulnerable as they may discriminate against certain patient groups while still maintaining very high aggregate performance measures, such as accuracy and area under the receiver operating characteristic curve (AUC).

Secondly, common AI models, such as deep neural networks, have internal logic which is inherently difficult to interpret. This ‘black-box’ problem makes models more difficult to explain to patients, to interrogate when clinical intuition contradicts them and to improve in a systematic and rigorous manner.19 For this reason, AI attracts greater regulatory scrutiny, which can present additional hurdles and uncertainty compared with conventional solutions. There is, however, active research into the development of methods to produce AI models while avoiding the ‘black-box’ phenomenon, including using local interpretable model-agnostic explanations (LIME).20

Thirdly, there are times when clinical decisions are entrusted to healthcare professionals and use of AI may be inappropriate; examples include decisions relating to withdrawing life-supporting treatment, decisions involving particularly sensitive clinical data (such as sexual history or infection status) and decisions where there is a risk of discriminatory bias.

Engineering the model

Collection and preparation of data

How much data is needed?

While efforts have been made to predict dataset size requirements, the precise amount required for a particular task is an inexact science and varies depending on the number of variables and the outcomes being studied.21 In general, increasing the size of a training dataset increases performance, albeit with diminishing returns, whereby each incremental unit rise in dataset size produces a smaller improvement in performance. An empirical approach is generally recommended, increasing the dataset size until satisfactory clinical performance is achieved.

How to obtain the dataset?

An important guiding principle is to reduce bias and improve generalisability by sampling across the entire domain of intended use. Having data sourced from a wide-ranging demographic of patients, from multiple geographical sites and with a large variety of presentations is always preferable, but not always possible. Collecting data for validation prospectively is also preferable, to enable a greater chance of a robust and unbiased validation of the model.

Patient consent should be obtained and the risks and benefits of participation in research should be clearly explained. Notably, there are emerging risks to patient privacy from AI systems being developed that are capable of deanonymising patient data. It has already been possible to identify individuals from their electroencephalography and it has been suggested that clinical data (such as fundoscopy or electrocardiography) may contain more hidden information than humans are able to interpret.22–24 Anonymising patient data is an increasingly important and necessary task and there are emerging methods to assist with this, including generating noise in the data using generative adversarial networks.

Consultation with a data scientist prior to collection is highly recommended because there may be nuances in the data labelling strategy, or methods for optimising data collection, that are valuable to know before data collection commences.

Defining the ground truth

A ground truth is the answer to the question the model is being asked for each datapoint in the training dataset; for example, if the model is tasked with classifying skin lesions as cancerous or non-cancerous, the ground truth is the label assigned to each image. If the ground truth was established using only the labelling of an individual human, the maximal performance of the AI is limited to the accuracy of that human’s labelling. An alternative method involves labelling the data using more parameters than the algorithm is being trained with, or by consensus opinion of a committee of experts. A practical example may include training an algorithm to visually classify skin lesions but labelling the data based on more robust biopsy results rather than human visual classification. Other methods of enhancing the safety of AI models include human-in-the-loop learning, which involves a human in the training, tuning and testing of an algorithm rather than relying upon a fully automated system.

The multidisciplinary team

These projects are a multidisciplinary effort, in which two components are essential: clinical expertise and machine learning (ML) expertise. Clinicians are needed to help frame the clinical question and to collect and annotate the data. ML expertise is needed for development and assessment of the model.

ML expertise may be provided by scientists within local hospitals, associated research institutions or through collaboration with commercial organisations. Data sharing and privacy considerations
are particularly important when data are being shared beyond the boundaries of the host institution. The significant computing power required for training ML models may require collaboration with third-party cloud-based computer systems. Interestingly, it has been estimated that the computing power required to train large AI models doubled every 3.4 months between 2012 and 2018, with computing power being suggested as a potential roadblock to future AI development.25,26

Deploying the model

Generalisability

A model is only as useful as its ability to perform on novel data. While one strength of AI models is their ability to fit to complex nuances within a dataset, this places them at greater risk of finding artefactual idiosyncrasies in the dataset that are not reflective of a wider reality. Overfitting to these artefacts affects the model’s ability to maintain useful performance when applied to unseen data. To mitigate the risk of overfitting and algorithmic bias, there must be sufficient diversity within the training dataset. The training data must be at least as diverse as the population that the algorithm intends to serve. External validation on independently derived data is required in order to ensure that the systems perform effectively when exposed to novel data.

Regulation

AI models that are used as an integral part of the diagnosis and management of human disease are treated by the US Food and Drug Administration (FDA) as ‘Software as a Medical Device’ (SaMD). The regulatory standards for SaMD are still evolving, but a few principles are becoming apparent. Unlike traditional medical devices, which would not change after development, AI algorithms can be updated and improved as new data are collected. Good performance at the time of deployment does not guarantee that the model will continue to perform well. This introduces the need to regulate throughout the lifetime of an algorithm, and the need to continually demonstrate safe and effective practices. At present, it is unclear when and how often the FDA will review these algorithms, but embedding high-quality engineering practices, such as data traceability and regular performance review, will help to demonstrate safety in a clinical setting.27 The FDA has outlined key actions that it intends to take in regard to its regulatory framework, including promoting device transparency to users and developing real-world performance monitoring pilots.27

Deployment

It is important to consider how a model will be incorporated into existing clinical workflows, with disruption kept to a minimum. Ideally, this should be considered from the outset to ensure that what is designed is actually useful. Interoperability is a key determinant in ensuring that models may be integrated across different software. Interfacing with electronic health record systems can be challenging, although vendors are moving towards accommodating this.28 Input from a software engineer is likely to be of value at this stage. Good performance at the time of deployment does not guarantee that the model will continue to perform well, so measures should be put in place for detecting and responding to changes.

It is important to educate and train the workforce who will use the model, stating clearly what the model should be used for and what its limitations are.29 There may be resistance to the introduction of new technology and, thus, it should be explained as openly and clearly as possible. Clinicians should be involved in the development process and feedback should be sought throughout.

Conclusion

The development of AI to improve patient care and to enhance clinical research presents great potential and has attracted widespread attention from researchers and the public in recent years. However, this has been accompanied by the emergence of a number of potential barriers to adoption, including ethico-legal concerns. Developing clear guidance for researchers to appropriately frame and answer AI-related research questions remains a clear research priority for the medical field. ■

Acknowledgements

We thank Prof Parashkev Nachev for his thoughtful comments and suggestions on our manuscript.

References

1 Melton M. Babylon health gets $2 billion valuation with new funding that will help it expand in US. Forbes 2019. www.forbes.com/sites/monicamelton/2019/08/02/babylon-health-gets-2-billion-valuation-with-new-funding-that-will-help-it-expand-in-us [Accessed 01 September 2021].
2 Browne R. AI pharma start-up BenevolentAI now worth $2 billion after $115 million funding boost. CNBC 2018. www.cnbc.com/2018/04/19/ai-pharma-start-up-benevolentai-worth-2-billion-after-funding-round.html [Accessed 01 September 2021].
3 Tozzi J. Amazon-JPMorgan-Berkshire Health-Care venture to be called Haven. Bloomberg 2019. www.bloomberg.com/news/articles/2019-03-06/amazon-jpmorgan-berkshire-health-care-venture-to-be-called-haven [Accessed 01 September 2021].
4 Academy of Medical Royal Colleges. Artificial Intelligence in healthcare. AoMRC, 2019. www.aomrc.org.uk/reports-guidance/artificial-intelligence-in-healthcare [Accessed 02 March 2021].
5 Park Y, Jackson GP, Foreman MA et al. Evaluating artificial intelligence in medicine: phases of clinical research. JAMIA Open 2020;3:326–31.
6 Vollmer S, Mateen BA, Bohner G et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ 2020;368:l6927.
7 Moons KGM, Altman DG, Reitsma JB et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med 2015;162:W1–73.
8 Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med 2020;26:1364–74.
9 Smeaton J, Christie L. AI and healthcare. UK Parliament, 2021. https://post.parliament.uk/research-briefings/post-pn-0637 [Accessed 18 April 2021].
10 Garud R, Tuertscher P, de Ven AHV. Perspectives on innovation processes. Acad Manag Ann 2013;7:775–819.
11 Kather JN, Krisam J, Charoentong P et al. Predicting survival from colorectal cancer histology slides using deep learning: a retrospective multicenter study. PLoS Med 2019;16:e1002730.