Simplicity, Complexity and Modelling
By Mike Christie, Philip Dawid and Stephen S. Senn
About this ebook
Several points of disagreement exist between different modelling traditions as to whether complex models are always better than simpler models, as to how to combine results from different models and how to propagate model uncertainty into forecasts. This book represents the result of collaboration between scientists from many disciplines to show how these conflicts can be resolved.
Key Features:
- Introduces important concepts in modelling, outlining different traditions in the use of simple and complex modelling in statistics.
- Provides numerous case studies on complex modelling, such as climate change, flood risk and new drug development.
- Covers a variety of models, including flood risk analysis models and petroleum industry forecasts, and summarizes the evolution of water distribution systems.
- Written by experienced statisticians and engineers in order to facilitate communication between modellers in different disciplines.
- Provides a glossary giving terms commonly used in different modelling traditions.
This book provides a much-needed reference guide to approaching statistical modelling. Scientists involved with modelling complex systems in areas such as climate change, flood prediction and prevention, financial market modelling and systems engineering will benefit from this book. It will also be a useful source of modelling case histories.
This edition first published 2011
© 2011 John Wiley & Sons, Ltd
Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Simplicity, complexity, and modelling / edited by Mike Christie ... [et al.].
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-74002-6 (cloth)
1. Simulation methods. I. Christie, Mike.
T57.62.S53 2011
601.1 – dc23
2011020649
A catalogue record for this book is available from the British Library.
Print ISBN: 978-0-470-74002-6
ePDF ISBN: 978-1-119-95145-2
oBook ISBN: 978-1-119-95144-5
ePub ISBN: 978-1-119-96096-6
Mobi ISBN: 978-1-119-96097-3
Preface
In January 2006, the EPSRC held an Ideas Factory on the topic of Scientific Uncertainty and Decision Making for Regulatory and Risk Assessment Purposes. The questions posed on entry were:
‘The assessment and decision making processes within environmental, health, food and engineering sectors pose numerous challenges. Uncertainty is a fundamental characteristic of these problems. How do we account for all the uncertainties in the complex models and analyses that inform decision makers? How can those uncertainties be communicated simply but qualitatively to decision makers? How should decision makers use those uncertainties when combining the scientific evidence with more socio-economic considerations? And how can decisions be communicated so that the proper acknowledgement of uncertainty is transparent?’
In examining these questions, it became clear that many different subject areas use similar tools to tackle questions of uncertainty yet apply them in different ways. We felt that there was scope to learn from the varied applications of statistics and probability in different scientific and engineering disciplines.
This book results from our review of best practice in uncertainty quantification in subject areas as diverse as pharmaceutical statistics, climate modelling, flood risk and oil reservoirs.
Acknowledgements
This book would not have been possible without the kind assistance of many others whose help we gratefully acknowledge as follows. In setting up and running the project we received support and encouragement from Mathew Collins of the Met Office, Stuart Allen and Paul Hulme of the Environment Agency and Glyn Williams of BP, as well as practical assistance from Tanya Cottrell and Rachel Wooley of the EPSRC, Kate Nimmo of the Glasgow University Research and Enterprise Office and Jean Jackson of the Department of Statistics at Glasgow. Our thanks are also owed to Anthony O'Hagan and Peter Grindrod for making the ‘sandpit’ at which we all met happen and of course to the EPSRC for funding our research. Scientists who generously helped our understanding of modelling included Mike Branson (Novartis), David Draper (University of California), Mark Girolami (University College London), Michael Goldstein (University of Durham), Steve Jewson (Risk Management Solutions), Axel Munk (University of Göttingen) and David Spiegelhalter (University of Cambridge), who contributed papers to a very stimulating workshop we organized in Cambridge, and Val Fedorov (GlaxoSmithKline) who made helpful comments on Chapter 3. Last, but not least, we are grateful to Heather Kay and Richard Davies for patiently seeing the book through to completion and production. None of the above, of course, is responsible for any weaknesses and errors that remain.
Contributing authors
Peter Challenor
National Oceanography Centre
Empress Dock
Southampton
Hants SO14 3ZH
UK
Mike Christie
Institute of Petroleum Engineering
Heriot Watt University
Edinburgh
UK
Andrew Cliffe
School of Mathematical Sciences
University of Nottingham
Nottingham NG7 2RD
UK
Philip Dawid
Centre for Mathematical Sciences
University of Cambridge
Cambridge CB3 0WB
UK
Suraje Dessai
Geography, College of Life and Environmental Sciences
University of Exeter
Amory Building
Rennes Drive
Exeter
EX4 4RJ
UK
Jim Hall
Environmental Change Institute
University of Oxford
Oxford
UK
Zoran Kapelan
College of Engineering, Mathematics and Physical Sciences
University of Exeter
Harrison Building, North Park Road
Exeter EX4 4QF
UK
Jeremy E. Oakley
School of Mathematics and Statistics
The University of Sheffield
The Hicks Building, Hounsfield Road
Sheffield S3 7RH
UK
Stephen Senn
School of Mathematics and Statistics
University of Glasgow
Glasgow, G12 8QW
UK
Robin Tokmakian
Department of Oceanography
Graduate School of Engineering and Applied Sciences
Naval Postgraduate School
Monterey, CA 93943
USA
Jeroen P. van der Sluijs
Utrecht University Faculty of Science
Copernicus Institute
Department of Science Technology and Society
Budapestlaan 6
3584 CD Utrecht
The Netherlands
Chapter 1
Introduction
Mike Christie¹, Andrew Cliffe², Philip Dawid³ and Stephen Senn⁴
¹Institute of Petroleum Engineering, Heriot Watt University, Edinburgh, UK
²School of Mathematical Sciences, University of Nottingham, UK
³Centre for Mathematical Sciences, University of Cambridge, UK
⁴School of Mathematics and Statistics, University of Glasgow, UK
In this introductory chapter we make some brief remarks about this book: what its purpose is, how it relates to the Simplicity, Complexity and Modelling (SCAM) project and, more widely, what the purpose of modelling is and what various traditions in modelling there are.
1.1 The origins of the SCAM project
In January 2006 the Engineering and Physical Sciences Research Council (EPSRC) organized a ‘sandpit’ or ‘ideas factory’ at Shrigley Park under the directorship of Peter Grindrod with the title ‘Scientific Uncertainty and Decision Making for Regulatory and Risk Assessment Purposes’ in which scientists from a wide variety of disciplines participated. At the ideas factory there were frequent informal and formal meetings to discuss issues relevant to uncertainty in modelling. As the week progressed various themes emerged, projects were mooted and teams coalesced. These teams then competed with each other for funding from the EPSRC. Among those that were successful was a project which had the following specific objectives:
First, given that data are finite, what is the appropriate balance between simplicity and complexity required in modelling complex data?
Second, where more than one plausible candidate model is used, how should forecasts be combined?
Third, where model uncertainty exists, how should this uncertainty be propagated into predictions?
However, the project also had the more general and wider purposes of making modellers in different traditions mutually aware of what they were doing and also of making the different terminology that they employed intelligible to each other.
Funding for the project was agreed and the name Simplicity, Complexity and Modelling (SCAM) was chosen. This is the book of the SCAM project.
1.2 The scope of modelling in the modern world
Scientists working in many diverse areas are engaged in modelling the world. Obviously, the various fields in which the models they create are applied vary considerably and this is reflected in the approaches they adopt to build, fit, test and use the models they devise. Consider, for example, credit scoring and climate modelling. In the former case the data consist of billions of transactions every day. The field is data-rich and the opportunities to test the ability of the fitted models to predict (say) good and bad debts abundant. A model that is fitted today can be tested tomorrow and again the day after and so on. On the other hand, climate modellers are trying to predict a unique future. If current trends in human activity persist, will this lead to global warming and what will be the consequences? If the models suggest that the consequences of current activity are serious and if mankind acts on the warning and mends its ways then the prediction will never be validated. Climate modellers are thus cast in the role of Cassandras: if heeded they will ultimately be doubted because what they predict will not come to pass and only disaster will reveal them to have spoken the truth. This may seem somewhat fanciful, yet consider the case of the so-called millennium bug. Huge sums of money were invested in fixing computer code. The world computing network survived the arrival of the year 2000, and now some are convinced that it was all a fuss about nothing while others believe that it was only foresight and action that prevented disaster.
Yet, if one looks a little deeper even in these very different fields there are points in common. For example, in the wake of the global financial crisis of 2008 many financial analysts are no doubt pondering how well the current approach to forecasting the credit weather will serve if the credit climate is changing.
Nevertheless, some things are very different as one moves from one field to another, and it is the belief that knowledge of such differences is valuable that is one of the justifications for this book. On the other hand, some things that appear different are in fact the same or similar, and it is the vocabulary that differs from field to field and sometimes within a field, rather than the concept. For example, the terms random effects model, hierarchical model and mixed model used within the discipline of statistics are either synonyms or so readily interchangeable that they might be applied, depending on author, to exactly the same algebraic construct. However, those who work in pharmacometrics use machinery that is identical to random effects models but are likely to refer to such as population models (Sheiner et al. 1977). This reflects, of course, the fact that even within the same discipline different individuals responding to different perceived needs have stumbled across the same solution, and that as one switches discipline the scope for this phenomenon is even greater.
It is the object of this book, and of the SCAM project, to represent various modelling traditions and application areas, with a view to making researchers aware both of the rich diversity among them and of the many concerns they share.
1.3 The different professions and traditions engaged in modelling
However, it would be foolish of us to claim that the team members cover all disciplines and hence that our book encompasses the whole field. We are, in fact, three statisticians (APD, JO and SS), an applied mathematician (AC), a climate modeller (PC), a geographer (SD) and three engineers (MC, ZK and JH). Not included in the team, for example, are any computer scientists. Also absent, to name but a few scientific professions, are any econometricians, financial analysts or pharmacometricians (although SS has some interests in the latter field). The bias towards the physical sciences in the team is thus clear. In fact the application areas covered by us include topics from the physical sciences such as climate, oil exploration, flood prevention, nuclear waste disposal, water distribution networks, and simpler approximations of complex computer programs. The modelling of treatment effects in drug development is perhaps the only exception to this theme.
We do not claim that the breadth of the book is great enough to cover all fields or even all lessons that might be learned from study of such fields, but hope that it is great enough to be interesting and valuable and that it will serve to make the strange familiar by drawing parallels where they can be found and to make the familiar strange by alerting modellers in a given field to the fact that others do not necessarily do things the same way and hence that what they take for granted may be far from obvious.
1.4 Different types of models
Cox (1990) identifies two major types of model: substantive and empirical. Models of the former type arise as a result of careful consideration of some well-established or at least plausible background scientific theory. Careful thought concerning processes involved suggests a relationship between quantities of interest. The theory thus embodied may suggest some difficult or intricate mathematical work, and this receives expression in a model. We give a simple example of the thinking that might go into such a model from the field of pharmacokinetics.
Various physiological considerations may suggest that a particular pharmaceutical given by injection will be eliminated at a rate that is proportional to its concentration in the blood. Suppose we have an experiment in which a healthy volunteer is given a pharmaceutical by intravenous injection and then blood samples are drawn at regular and frequent intervals. A differential equation suggests that the concentration–time relationship can then be modelled with concentration on the log scale as a linear function of time. Of course nothing is measured perfectly, so that some random variation should be allowed for. It may thus be valuable to think in terms of data which have a signal plus some noise. The signal part of the model can then be modelled as
$$\mu_t = \mu_0 e^{-kt} \tag{1.1}$$
where μt is the ‘true’ concentration at time t after dosing, μ0 is the concentration in the blood at time 0 and k is a so-called elimination constant. One could regard such a model as being a simple (incomplete) example of a substantive model. Making it realistic using purely theory-based considerations may be difficult, however. A log transformation is particularly appealing and we can then write
$$\log \mu_t = \log \mu_0 - kt \tag{1.2}$$
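The step from proportional elimination to the exponential and log-linear forms numbered (1.1) and (1.2) can be spelled out as follows. This is only a sketch of the standard first-order argument, written with the symbols already defined in the text (μt, μ0 and k); it is not taken from the book itself.

```latex
\begin{align*}
\frac{d\mu_t}{dt} &= -k\,\mu_t, \qquad \mu_t\big|_{t=0} = \mu_0
  && \text{(elimination rate proportional to concentration)} \\
\frac{d}{dt}\log \mu_t &= \frac{1}{\mu_t}\frac{d\mu_t}{dt} = -k
  && \text{(divide both sides by } \mu_t\text{)} \\
\log \mu_t &= \log \mu_0 - kt
  && \text{(integrate from } 0 \text{ to } t\text{)} \\
\mu_t &= \mu_0 e^{-kt}
  && \text{(exponentiate)}
\end{align*}
```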
(Here we follow the usual statistician's convention of writing natural logarithms as log.) We do not, however, observe μt directly but (say) a quantity Yt. The model given in (1.1) may then be extended to represent observable quantities by proposing some simple relationship between a given observed concentration Yi taken at time ti and the true unobserved concentration $\mu_{t_i}$ that involves an unobserved random variable $\varepsilon_i$. One possible relationship is
$$\log Y_i = \log \mu_0 - k t_i + \varepsilon_i \tag{1.3}$$
However, this model is itself not complete until we specify how the $\varepsilon_i$ are distributed. If we can assume that they are identically, independently distributed with unknown variance σ² which does not vary with time (and hence with concentration) then a rather good way to estimate the unknown parameter seems to be via ordinary least squares on the log concentration scale.
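As an illustration of what estimation by ordinary least squares on the log concentration scale might look like in practice, the following minimal sketch regresses log concentration on time and reads off the elimination constant as minus the slope. The function name `fit_elimination`, the sampling times and the parameter values are hypothetical choices made for this example only, and the simulated errors are assumed independent with constant variance on the log scale.

```python
import numpy as np

def fit_elimination(times, conc):
    """OLS fit of log(conc) = log(mu0) - k * t.

    Returns (mu0_hat, k_hat, sigma2_hat), where sigma2_hat is the
    residual variance on the log scale with n - 2 degrees of freedom.
    """
    t = np.asarray(times, dtype=float)
    y = np.log(np.asarray(conc, dtype=float))
    X = np.column_stack([np.ones_like(t), t])      # intercept and time
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution
    resid = y - X @ beta
    sigma2_hat = resid @ resid / (len(t) - 2)
    return np.exp(beta[0]), -beta[1], sigma2_hat

# Simulated data under assumed (hypothetical) parameter values.
rng = np.random.default_rng(0)
t = np.linspace(0.5, 12.0, 24)                     # sampling times, hours
true_mu0, true_k, sigma = 10.0, 0.3, 0.05
conc = true_mu0 * np.exp(-true_k * t + rng.normal(0.0, sigma, t.size))

mu0_hat, k_hat, sigma2_hat = fit_elimination(t, conc)
print(f"mu0_hat={mu0_hat:.3f}  k_hat={k_hat:.4f}  sigma2_hat={sigma2_hat:.5f}")
```

Working on the log scale turns the exponential decay into a straight line, so a standard linear least-squares routine is all that is needed.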
So far, some limited subject-matter theory (to do with plausible models for drug elimination) has been used for developing the model for the signal. The model for the noise, however, is rather ‘off the peg’ but it can be refined by further considerations. For instance, the theory of ordinary least squares tells us that where such a model applies and n blood samples have been taken, the variance of the estimate of k, $\hat{k}$, is given by
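$$\operatorname{var}(\hat{k}) = \frac{\sigma^2}{\sum_{i=1}^{n} (t_i - \bar{t})^2} \tag{1.4}$$
where $\bar{t}$ denotes the mean of the n sampling times.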
This raises the question: given that a fixed number of samples should be taken, when should we choose to take them? If formula (1.4) is correct the answer is half at baseline and half at infinity, since this is the arrangement that maximizes the denominator in (1.4) for given n and hence minimizes (1.4) for given n and σ². This is, however, absurd and its absurdity can be traced to two inappropriate assumptions in the error model: first, that on the log scale the error variance is constant; and second, that the error terms are independent. Recognizing that the variance (on this log scale) is likely to increase with time makes it less reasonable to measure at high values of t. Allowing the $\varepsilon_i$ to have a correlation that decays with time will indicate that, other things being equal, measurements taken more closely together provide less information.
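The design argument can be checked numerically. The sketch below evaluates the usual least-squares slope variance, σ² divided by the sum of squared deviations of the sampling times from their mean, for two hypothetical designs with the same number of samples: evenly spaced times, and half the samples at baseline with half at the latest feasible time. The helper `var_k_hat` and the values n = 8 and T = 24 hours are assumptions made for illustration.

```python
import numpy as np

def var_k_hat(times, sigma2=1.0):
    """Slope variance under constant-variance, independent errors:
    sigma^2 / sum((t_i - t_bar)^2)."""
    t = np.asarray(times, dtype=float)
    return sigma2 / np.sum((t - t.mean()) ** 2)

n = 8
T = 24.0  # hypothetical latest feasible sampling time, hours

evenly_spaced = np.linspace(0.0, T, n)
two_point = np.array([0.0] * (n // 2) + [T] * (n // 2))

print("evenly spaced:", var_k_hat(evenly_spaced))
print("two-point    :", var_k_hat(two_point))
# Doubling the late sampling time quarters the variance, which is why the
# formal optimum under these assumptions drifts off to infinity.
print("two-point, 2T:", var_k_hat(2 * two_point))
```

Under this error model the two-point design always gives the smaller variance, and pushing the late time further out keeps improving it, which is the absurd conclusion the text traces back to the constant-variance and independence assumptions.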
Many models employed, however, are not the result of these sorts of consideration. These are models of the type Cox calls empirical. For example, in a clinical trial in adults suffering from asthma (Senn 1993) we may be measuring forced expiratory volume in one second (FEV1). We will of course have treatment given as an explanatory factor in the model. However, we know that, other things being equal, women have lower FEV1 than men and older adults have lower FEV1 than younger ones. As a first attempt at a model we might include a dummy variable for sex, taking on the value 0 for females and 1 for males, say. We could have a simple linear term for age but might consider also adding age squared and age cubed. Or perhaps we could use some other polynomial scheme such as that of so-called fractional polynomials (Royston and Altman 1994; Royston and Sauerbrei 2004). The general point here, however, is that the model we use is governed much more by what has been observed to work in the past and some general modelling habits we have, rather than by some considerations based on the physiology of the lung and (say) some biological model of how it deteriorates with age.
The choice of a suitable model may depend on context as well as purpose. Does one need to make predictions under conditions that are physically different from ones in which any of the observations have been made? To take an example from flood modelling, one may wish to predict how high the flood waters will be after construction of a dam. If one was just interested in predicting water levels next week, by which time the dam would not have been constructed, one could use a Kalman filter or a machine learning algorithm or some such, preferably rather parsimonious, empirical model. But if one wants to predict in changed circumstances one may have to go to the trouble of setting up a hydraulic model, estimating roughness parameters, and then changing the geometry to represent the future and unobserved conditions.
Of course, the distinction between these two types of model is not absolute. For instance, to return to pharmacokinetics, a modern approach builds up models of drug elimination from more fundamental models of various organ classes of the human body – liver, gut, skin, blood and so on – as well as biochemical models of the pharmaceutical (Krippendorff et al. 2009) to predict what sort of model of serum concentration in the blood will be adequate. From the perspective of this approach, adopting a model such as (1.1) directly without such background modelling is rather empirical.
One can also give examples tending in the other direction. A common approach to comparing generic formulations of a pharmaceutical to the innovator product for the purpose of obtaining a licence is to use a so-called bioequivalence study (Patterson and Jones 2006; Senn 2001). This compares the concentration–time profile in the blood of both formulations given on different occasions (the sequence being random) to healthy volunteers. Commonly these curves are compared using summary statistics such as area under the curve (AUC) and concentration maximum (Cmax) and a model is built relating AUC (say) to formulation, subject and period. From the perspective of someone who builds a model like (1.1) this is also very ad hoc and empirical. However, theoretical considerations can be produced based on a model like (1.1) to show that AUC is in fact a good measure to use to compare two concentration–time profiles.
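One way to see why AUC is a natural summary under a model like (1.1) is to integrate the concentration curve directly. The short derivation below is a sketch under the simplifying assumption that both formulations follow (1.1) with a common elimination constant k and differ only in μ0; the subscripts T (test) and R (reference) are labels introduced here for illustration.

```latex
\begin{align*}
\mathrm{AUC} &= \int_0^\infty \mu_t \, dt
              = \int_0^\infty \mu_0 e^{-kt} \, dt
              = \frac{\mu_0}{k}, \\[4pt]
\frac{\mathrm{AUC}_T}{\mathrm{AUC}_R}
             &= \frac{\mu_{0,T}/k}{\mu_{0,R}/k}
              = \frac{\mu_{0,T}}{\mu_{0,R}}.
\end{align*}
```

Under these assumptions the ratio of AUCs recovers the ratio of initial concentrations, whatever the particular sampling times used to trace the curves.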
The various examples of modelling in this book cover this spectrum pretty widely. Examples will be found of empirical modelling but also of complex models that are built up from more fundamental scientific considerations.
1.5 Different purposes for modelling
Different sciences have developed their own modelling traditions and approaches. Some use entirely deterministic models, others allow for uncertainty and random variation. Some attempt to model finely detailed structure, others a coarser ‘big picture’. The ‘fitness for purpose’ of a model will depend on many considerations. One important aspect is complexity: while incorporating more detail may allow a more accurate description, an over-complex model will be hard to identify from observations, and this can lead to poor predictions. Note, however, that a poorly identified model is not necessarily bad at prediction. For example, the parameter estimates may have high standard errors but be strongly negatively correlated. The variance of a prediction may then include a contribution not only from large variances of individual parameters but also from important negative covariance terms, which offset them. To return to the case of a clinical trial in asthma, a model that includes height, sex, age and baseline FEV1 may well yield estimates with large standard errors, since height, sex and age are all strongly predictive of FEV1. The problem is that the collinearity makes it difficult to establish the separate contribution of each covariate precisely. For a prediction for any given patient, however, it is the joint effect of them all that is needed, and this may be measured quite well.
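The point about collinearity and prediction can be illustrated with a small numerical sketch. Two hypothetical, almost identical covariates (stand-ins for, say, two strongly related patient characteristics) are entered in a linear model; the code computes the covariance matrix of the least-squares estimates and compares the standard errors of the individual coefficients with the standard error of the fitted mean at a typical covariate pattern. The sample size, error standard deviation and degree of collinearity are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 200, 1.0

# Two hypothetical, strongly collinear covariates.
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

# Covariance matrix of the OLS estimates: sigma^2 (X'X)^{-1}.
cov_beta = sigma**2 * np.linalg.inv(X.T @ X)
se_beta = np.sqrt(np.diag(cov_beta))
corr_12 = cov_beta[1, 2] / (se_beta[1] * se_beta[2])

# Standard error of the fitted mean at a covariate pattern x0
# that respects the collinearity (x2 close to x1): sqrt(x0' Cov x0).
x0 = np.array([1.0, 1.0, 1.0])
se_pred = np.sqrt(x0 @ cov_beta @ x0)

print("SE of the two slope estimates :", se_beta[1], se_beta[2])
print("correlation between them      :", corr_12)
print("SE of fitted mean at x0       :", se_pred)
```

The individual slopes are poorly determined and strongly negatively correlated, yet the negative covariance term largely cancels their variances in the prediction, so the joint effect is estimated far more precisely than either slope alone.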
Nevertheless, it is important to strike the right balance between too much simplicity (which may miss important patterns in the world and signals in the data) and too much complexity (which may lose the signal in a halo of noise). A variety of methods has been developed to tackle this subtle but vital issue.
However, whatever the science, two purposes of models are commonly encountered. One is to increase understanding of a particular field. In the field of statistics this is very much associated with causal analysis (Pearl 2000). In the hard sciences it is to use models as a means of establishing and understanding ‘laws’. A further purpose, however, is for prediction. In the hard sciences the analogy would be to work out the consequences of the laws established.
1.6 The purpose of the book
The primary purpose of this book is to make it easier for modellers in different disciplines to interact and understand each other's concerns and approaches. This is largely achieved, we hope, through the subject-specific contributions (Chapters 3–10) which provide an introduction to modelling in various fields. We hope that the reader will emerge from perusing these chapters with the same sense of surprise that we experienced through our interactions with each other throughout the course of the project, namely that there is much more to modelling than we originally thought.
What the book is not is a basic introduction to linear models, generalized linear models or statistical modelling generally. For the reader who is in search of such, excellent texts that fulfil this purpose that we can recommend are the classics on linear models by Draper et al. (1998) and Seber and Lee (1977), that on generalized linear models by McCullagh and Nelder (1999) and three more general texts on statistical modelling, with very different but valuable perspectives, by Harrell (2001), Davison (2003) and Freedman (2005). For a Bayesian approach we recommend Gelman et al. (2004).
Nevertheless, a brief technical introduction to modelling is provided in Chapter 2, and in Chapter 11 we try to draw some threads together. We also provide a glossary, which we hope will help modellers to understand each other's vocabulary.
1.7 Overview of the chapters
The book contains ten further chapters after this one, two of which are general in scope and eight of which cover specific application areas reflecting the interests of the members of the team.
Chapter 2, by Philip Dawid and Stephen Senn, is a general-purpose methodological chapter on model selection, which also includes some remarks on a matter that goes to the heart of the SCAM project. A model that is finally chosen may be a clear winner in that it seems to be the only model among many that adequately describes the data. On the other hand, it might simply be the best by a narrow margin among a wide set of candidate models. It would seem plausible that in the first case the true uncertainty in prediction is better captured by a within-model analysis than in the second. In the second case some consideration of the road or roads not taken would seem to be necessary in order to express uncertainty honestly. Yet if model selection and fitting proceed, as they often have in practice, through a first stage of selection and then a second stage of prediction using the selected model as if one knew it to be true, the true uncertainty is underestimated.
Chapter 3 is the first of the subject-matter chapters. In it Stephen Senn considers the field of drug development and, in particular, the analysis of so-called phase III trials. This is interesting not because the modelling is complex – in fact it is frequently very simple, although increasingly complex models are being used to deal, for instance, with the vexed problem of missing data (Molenberghs and Kenward 2007) – but rather because progress can often be made without complex modelling, albeit at a price.
The price is a reduction in precision. Under best conditions, randomized clinical trials yield unbiased estimates of the effect of treatments. However, including covariates in the model can often make these estimates more precise. Thus, simplicity has a price in the form of the need for larger sample sizes. On the other hand, it seems to be a psychological fact that simpler models (rightly or wrongly) are often trusted more than complex ones. Thus the reduction in