MEASURING RISK IN SCIENCE
Deyun YIN a,b, Zhao WU a, and Sotaro SHIBAYAMA c,d,*
a Harbin Institute of Technology (Shenzhen), School of Economics and Management, Shenzhen, China
b World Intellectual Property Organization, Geneva, Switzerland
c Lund University, Center for Innovation Research, Lund, Sweden
d The University of Tokyo, Institute for Future Initiative, Tokyo, Japan
* Corresponding author: sotaro.shibayama@fek.lu.se. +46 (0)46 2227812. P.O. Box 7080, S-220 07 Lund, Sweden
ABSTRACT
Risk plays a fundamental role in scientific discoveries, and thus it is critical that the level of risk can be
systematically quantified. We propose a novel approach to measuring risk entailed in a particular mode
of discovery process – knowledge recombination. The recombination of extant knowledge serves as an
important route to generate new knowledge, but attempts at recombination often fail. Drawing on
machine learning and natural language processing techniques, our approach converts knowledge
elements in the text format into high-dimensional vector expressions and computes the probability of
failing to combine a pair of knowledge elements. Testing the calculated risk indicator on survey data,
we confirm that our indicator is correlated with self-assessed risk. Further, as risk and novelty have
been confounded in the literature, we examine and suggest the divergence of the bibliometric novelty
and risk indicators. Finally, we demonstrate that our risk indicator is negatively associated with future
citation impact, suggesting that risk-taking itself may not necessarily pay off. Our approach can assist
decision making of scientists and relevant parties such as policymakers, funding bodies, and R&D
managers.
KEYWORDS
Risk; Uncertainty; Novelty; Recombination; Science; Word embedding; Support Vector Machine
Electronic copy available at: https://ssrn.com/abstract=4462160
1. INTRODUCTION
Science is a risky business by nature. Scientists explore and cultivate uncharted spaces of knowledge
through trial and error, in which their original ideas are often rejected and expected goals are not
fulfilled for various reasons (Franzoni and Stephan, 2021; Machado, 2021; OECD, 2021; Reinhilde et
al., 2022; Wang et al., 2019). Such risk and uncertainty tend to be especially high when scientists aim
at novel discoveries, which have the potential to open up new avenues and make substantial
advancement (Bourdieu, 1975; Hagstrom, 1974; Kuhn, 1970; Merton, 1973). Thus, there is a growing
concern over scientists' risk-averse behavioral patterns, and science communities and policymakers
emphasize that efforts should be made to facilitate high-risk-high-return research (Franzoni and Stephan,
2021; Gewin, 2012; Machado, 2021; OECD, 2021).
Despite their fundamental role, risk and uncertainty in science remain poorly understood (Althaus,
2005; Aven, 2011; Franzoni and Stephan, 2021; Hansson, 2018), a gap that this study aims to
address. Specifically, we aim to develop a bibliometric approach to quantify the degree of scientific risk in a
particular mode of scientific discovery process – recombination. Scientific knowledge is usually
generated on the basis of extant knowledge (Merton, 1965), and combining multiple elements of extant
knowledge is an indispensable route to generate new knowledge (Fontana et al., 2020; Uzzi et al., 2013).
Even novel discoveries often result from integrating pieces of extant knowledge that used to appear
unrelated (Dahlin and Behrens, 2005; Mednick, 1962; Simonton, 2003; Trapido, 2015; Uzzi et al., 2013;
Wang et al., 2017). Because of this fundamental role of recombination, it is of scholarly and practical
interest to quantify the degree of risk associated with recombination. In this regard, the previous literature
tends to assume that novel research is risky (Machado, 2021; Reinhilde et al., 2022).
While novel research may entail some risks (Franzoni and Stephan, 2021; Wang et al., 2017), risk and
novelty are not equivalent. Opportunities for novel recombination may be difficult to identify but may
be easily achieved once the opportunity is identified.
To quantify risk in the recombination process, we employ machine learning and natural language
processing techniques. Drawing on past trajectories of science, we develop a machine learning model
that predicts whether a certain pair of knowledge elements will be linked or not in the future. The
developed model calculates the probability that the pair of knowledge elements is combined, or put
differently, the risk in achieving or failing in recombination. This risk indicator is validated by a
questionnaire survey that we carried out, in which scientists self-assessed the anticipated risk of their
own projects. We further examine the relationship between risk and novelty.
The contribution of this study is two-fold. First, this study is the first to offer a validated indicator of
risk in science, which contributes to the underdeveloped literature on risk in science (Franzoni and
Stephan, 2021; Machado, 2021; Reinhilde et al., 2022). Second, the proposed approach offers a
practical tool for scientists, policymakers, and other parties in assessing the feasibility of research plans
and in developing research strategies.
This paper is structured as follows. Section 2 reviews previous studies on risk and uncertainty in science.
Section 3 describes our approach to quantify risk in recombination in science. Section 4 validates the
risk indicator with the questionnaire survey. Section 5 examines the relationship between risk and
novelty indicators. Section 6 summarizes the results and discusses implications.
2. LITERATURE REVIEW
2.1. Risk in Science
In general, risk is attributed to imperfect information (Marinacci, 2015), because of which one cannot know in
advance exactly what will happen and what its consequences will be (Kaplan and Garrick, 1981).
Scientific research is risky in this regard because scientists do not always know what they will discover,
whether they will discover anything at all, or what the implications of a given discovery are (Franzoni and
Stephan, 2021).
Scientists often start researching without having a clear expectation (Bourdieu, 1975; Shibayama, 2019;
Whitley, 1984). Such exploratory research could potentially lead to various discoveries, and scientists
happen to reach one or none of them. In other cases, scientists start with a concrete hypothesis (Bourdieu,
1975; Whitley, 1984). In such confirmatory research, the result may either support or reject the
hypothesis, or it may remain unclear whether the hypothesis is supported or rejected. Even when
the intended hypothesis is rejected, scientists may serendipitously encounter unexpected results (Merton
and Barber, 2004; Yaqub, 2018). Overall, scientists cannot perfectly know whether or what they will
discover. Once a discovery is made, the consequence of the discovery might also be unpredictable
(Franzoni and Stephan, 2021). A discovery may be used for various applications, potentially beyond
what is expected, or may not be used at all. It is well known that the impact of a discovery in terms of
forward citations varies substantially (Levitt and Thelwall, 2011; Macroberts and Macroberts, 1989;
Min et al., 2021).
These risks arise from various sources. One important source is the incompleteness of existing
scientific knowledge. The frontier of science is full of unknowns, which makes it difficult to predict
what will happen next. Many important discoveries are known to have been made accidentally (e.g., penicillin
and X-rays). Another source of risk is nature itself, especially in empirical research. Some
phenomena occur only stochastically (e.g., the observation of neutrinos and the discovery of pulsars), in
which case whether scientists can observe a phenomenon depends on luck. Furthermore, technologies
may determine what scientists can discover. Scientists often need the aid of technologies to observe
what they intend to observe (e.g., a microscope and a DNA sequencer), but technologies may not
work perfectly due to innate technological uncertainty or due to errors on the side of scientists. Finally,
risk is also attributable to contextual factors beyond scientists' control. For example, the consequence
of a discovery, in terms of citation impact, is subject to the degree of competition in the research area.
2.2. Risk in Recombination
Among various types of risk in science, this study highlights risk of failing in a discovery – the chance
that an attempt of discovery results in no discovery. It further focuses on risk entailed in a particular
mode of discovery process – recombination. An important route to generate new knowledge is the
recombination of extant knowledge (Fontana et al., 2020; Uzzi et al., 2013). Previous literature argues
that associating remote elements is a path to creative solutions (Mednick, 1962; Simonton, 2003), and
that combining diversified components is a major route to technological innovation (Arthur, 2007;
Fleming, 2001; Hall et al., 2001). In the context of science, for example, chemists synthesize a new
compound by reacting multiple chemicals, and molecular biologists literally recombine genetic
sequences to generate a protein with desired properties. Recombination is also suggested to be an
important source of novel discoveries (Lin et al., 2022; Uzzi et al., 2013).
Obviously, attempts at recombination are not always successful. Scientists may try to combine a pair of
knowledge elements only to realize that it is technically infeasible. Thus, risk in recombination – given
a pair of knowledge elements, whether their recombination is likely to succeed or fail – is of theoretical
and practical interest. Previous literature tends to consider attempts at more novel recombinations to
be riskier. Indeed, Franzoni et al. (2018) conducted a survey and found that the respondents' assessment
of risk is correlated with a recombinant novelty indicator. Wang et al. (2017) showed that publications
with higher novel recombination scores have higher variance in their citation impact.
2.3. Measuring Risk in Recombination
While the degree of risk can be measured subjectively on a small scale with surveys or peer reviews
(Franzoni et al., 2012; Linton, 2016), to the best of our knowledge there is no approach to measuring risk
in science on a large scale, with one exception: a few studies proposed to use the bibliometric
indicator of recombinant novelty as a proxy for scientific risk (Machado, 2021; Reinhilde et al., 2022).
Recombinant novelty indicator. In recent years efforts have been made to quantify the degree of
scientific novelty based on citation data and text data (Tu and Seng, 2012; Yang et al., 2022). To
operationalize the concept, the majority of novelty indicators drew on the rarity of or the distance
between a combination of knowledge elements. For example, a group of novelty indicators considers a
scientific document to be novel if it cites a rare combination of journals (Lin et al., 2022; Uzzi et al.,
2013; Wang et al., 2017). In this case, a journal is used as the element of knowledge, and the distance
between two journals is measured by the (in)frequency of citation between the journals. Another group
of indicators uses a cited reference as the knowledge element and considers a document to be novel if
it cites a rare combination of references (Matsumoto et al., 2020; Trapido, 2015). Yet another approach
draws on text information. For example, Boudreau et al. (2016) measure the novelty of a grant proposal
based on the first-ever combination of Medical Subject Heading (MeSH) keywords. A more recent
novelty measure is based on the word-embedding technique. Shibayama et al. (2021) assign a high-dimensional
vector to each relevant word based on the previous co-occurrences of those words, position
all documents in the high-dimensional space based on the document text, and finally calculate the
distance between a cited reference pair.
In these operationalizations, novelty may be associated with some degree of difficulty. It is plausible that
a pair of knowledge elements had rarely co-occurred because previous studies had difficulty in
combining them. It is also possible that a novel pair was difficult to identify but that combining the pair
was not itself difficult. Thus, the link between recombinant novelty and risk in recombination is not
straightforward. In fact, the previous studies proposing to use the novelty indicator as a proxy of risk
are based on mixed reasonings in terms of what they mean by risk (Machado, 2021; Reinhilde et al.,
2022). One reasoning is that the novelty indicator is associated with a high variance in citation impact
(Wang et al., 2017), which concerns risk in consequence but not risk in the discovery process. Another
is based on a validation analysis that the novelty indicator is correlated with the questionnaire score of
"high-risk and high-reward", in which two concepts are confounded (Franzoni et al., 2018; Franzoni
and Stephan, 2021). Therefore, this study aims to measure risk in the process of recombination more
directly.
Prediction of knowledge evolution. In a nutshell, risk in recombination concerns the likelihood that
a pair of knowledge elements will be successfully combined. We thus need a technique
to predict this type of knowledge evolution. The evolution of knowledge has been long studied in the
sociology of science (Bourdieu, 1975; Kuhn, 1977; Whitley, 1984), and recent studies contributed to
empirically describing knowledge evolution with large-scale bibliometric data (Foster et al., 2015; Liu
et al., 2017; Palchykov et al., 2021; Sun and Latora, 2020).
While some of these studies aim to understand the generic mechanisms and laws behind knowledge
evolution (Mazzolini et al., 2018; Tria et al., 2018), others focus on micro mechanisms with which
knowledge elements are combined (Butun and Kaya, 2020; Sebastian et al., 2015). The latter is often
reduced to link prediction problems in a network of scientific documents. Such an approach first forms
a network of documents, in which a document is a node and a citation between two nodes is a link. It
then assesses various features of nodes and the network and predicts whether a pair of nodes will be
linked or not.
For example, Shibata et al. (2012) developed a link prediction model to assess whether a pair of papers
is likely to have a citation link, though the model does not predict future citations. Sebastian et al. (2015)
also developed a model to predict whether papers in different research areas will be linked through co-citation (i.e., cited together by a paper published in the future). Drawing on a link prediction algorithm for
citation network, Butun and Kaya (2020) predicted the citation count of a paper, and Daud et al. (2017)
predicted the author network through citation links.
These link prediction algorithms draw on either node attributes or network topology. Earlier studies
tend to rely on node attributes, such as paper's keywords and author affiliations, while more recent
studies draw on topological network features such as the number of neighboring nodes, Jaccard's
coefficient and Adamic-Adar index, contending that topological information allows more precise
prediction (Butun and Kaya, 2020). These features are fed into supervised machine learning models
that predict whether a pair of nodes will be linked or not. Technically, this task draws on classification
algorithms such as support vector machine (SVM), random forest, and logistic regression (Breiman,
2001; Chen and Guestrin, 2016; Cortes and Vapnik, 1995).
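As an illustration of the topological features named above, the following minimal sketch computes the number of common neighbours, Jaccard's coefficient, and the Adamic-Adar index for one candidate node pair. The toy graph, the node names, and the helper functions are assumptions made for illustration and are not taken from the cited studies.

```python
import math

# Toy undirected network: node -> set of neighbouring nodes.
# The graph is an illustrative assumption, not data from any cited study.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C", "E"},
    "E": {"D"},
}

def common_neighbours(g, u, v):
    """Nodes adjacent to both u and v."""
    return g[u] & g[v]

def jaccard(g, u, v):
    """Shared neighbours divided by all neighbours of either node."""
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0

def adamic_adar(g, u, v):
    """Sum of 1 / log(degree) over shared neighbours.

    Shared neighbours with few links of their own contribute more.
    A shared neighbour always has degree >= 2, so log(degree) > 0.
    """
    return sum(1.0 / math.log(len(g[z])) for z in common_neighbours(g, u, v))

# Features for the candidate (not yet linked) pair B-D.
features = {
    "common": len(common_neighbours(graph, "B", "D")),
    "jaccard": jaccard(graph, "B", "D"),
    "adamic_adar": adamic_adar(graph, "B", "D"),
}
print(features)
```

In a supervised setting, such feature values would form one row of the classifier input for the pair.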
3. PREDICTION OF RECOMBINATION RISK
We similarly use link prediction algorithms to assess whether a pair of knowledge elements will be
linked or not. If a pair is predicted not to be linked, the likelihood of failing in combining the pair is
high, and we thus consider that the risk of recombination is high.
In developing our approach, we set a few requirements for the input of link prediction models. First,
our prediction is based on information about a knowledge element itself rather than information about
the network of knowledge elements. Second, we use only semantic information of knowledge elements
without requiring other information (e.g., author information). In other words, unlike previous
studies, we do not predict links between knowledge elements from citation network information or other
bibliometric information. Such information is available only after a paper is successfully published and
embedded in a citation network, which is not ideal because we would like to know the risk of
recombination before the risk is taken.
Specifically, we draw on the word embedding technique – i.e., a high-dimensional vector assigned to
each word – to capture the semantic information of knowledge elements (Mikolov et al., 2013). We
assign a vector expression to each knowledge element and predict the likelihood of a pair of elements
being linked based on the corresponding vector pair. This approach is applicable as long as a knowledge
element is given in a text format, whether it is part of published papers (e.g., the abstract of a paper) or
not (e.g., a grant proposal, a paragraph of a research idea). We test our strategy in two steps in this and
the next sections.
3.1. Predicting Future Co-citation
To build a link prediction model, we draw on a published paper as a knowledge element and consider
that two knowledge elements are combined if a pair of papers are cited together by at least one paper.
With classification algorithms we compute the probability that two papers will be co-cited. By subtracting
this probability from 1, we obtain the risk of failing to combine the knowledge elements.
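The definition of a link can be made concrete with a small sketch: a pair of papers counts as combined if at least one later paper cites both. The reference lists, paper identifiers, and function names below are hypothetical; the last function only restates the definition of risk as 1 minus the link probability.

```python
from itertools import combinations

# Hypothetical citing papers and the focal-year papers each of them cites.
# All identifiers are made up for illustration.
reference_lists = {
    "citing_1": ["p1", "p2", "p3"],
    "citing_2": ["p2", "p3"],
    "citing_3": ["p4"],
}

def co_cited_pairs(ref_lists):
    """Return the set of unordered paper pairs cited together at least once."""
    linked = set()
    for refs in ref_lists.values():
        # Sort so that (a, b) and (b, a) map to the same unordered pair.
        for a, b in combinations(sorted(set(refs)), 2):
            linked.add((a, b))
    return linked

def risk_from_probability(p_linked):
    """Recombination risk as defined in the text: 1 minus the link probability."""
    if not 0.0 <= p_linked <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 1.0 - p_linked

linked = co_cited_pairs(reference_lists)
print(sorted(linked))
```

Here p4 is never cited together with another paper, so every pair containing p4 is a non-linked pair.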
Data. We sampled papers in the field of biomedicine from the Web of Science (WoS).1 We first
identified about 520,000 papers published in the field in 2010. We chose publications in 2010 to allow
sufficient time for the papers to be cited. These papers can form $1.35 \times 10^{11}$ pairs, of which
$1.38 \times 10^{7}$ pairs were actually co-cited at least once in the following 8 years (until 2018). From these co-cited
pairs we randomly selected approximately 60,000 pairs as linked pairs. We also randomly sampled
60,000 non-linked pairs that were never co-cited. In total, we prepared a sample of 120,000 paper pairs,
half linked and half non-linked, as the training data.2 We repeated the same sampling process
to prepare the test data of the same size consisting of 60,000 linked pairs and 60,000 non-linked pairs.
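The pair count and the balanced sampling can be sketched as follows. The number of unordered pairs among n papers is n(n-1)/2, which for 520,000 papers reproduces the order of magnitude given above. The sampling function is a toy-scale illustration with hypothetical names and a fixed seed. It enumerates all pairs, which would be infeasible at the scale of the actual data, where non-linked pairs would have to be drawn differently (for example by rejection sampling, an assumption on our part).

```python
import random
from itertools import combinations

def n_pairs(n_papers):
    """Number of unordered pairs that n papers can form: n(n-1)/2."""
    return n_papers * (n_papers - 1) // 2

# About 520,000 papers give roughly 1.35e11 candidate pairs.
total_pairs = n_pairs(520_000)

def balanced_sample(papers, linked, n_per_class, seed=0):
    """Draw equally many linked and never-linked pairs (toy-scale version)."""
    rng = random.Random(seed)
    all_pairs = list(combinations(sorted(papers), 2))
    non_linked = [p for p in all_pairs if p not in linked]
    pos = rng.sample(sorted(linked), n_per_class)
    neg = rng.sample(non_linked, n_per_class)
    # Label 1 = co-cited within the citation window, 0 = never co-cited.
    return [(p, 1) for p in pos] + [(p, 0) for p in neg]

# Hypothetical papers and linked pairs (each pair in sorted order).
papers = [f"p{i}" for i in range(6)]
linked = {("p0", "p1"), ("p1", "p2"), ("p3", "p4")}
sample = balanced_sample(papers, linked, n_per_class=2)
print(total_pairs, sample)
```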
Word embedding. As discussed above, our link prediction model draws on the vector expressions of a
pair of papers. To this end we drew on the word embedding model that is trained with publication data
up to 2010 in WoS. The model provides 300-dimensional vector representations for 1.7 million unique
words. Yin et al. (2022) demonstrated that the model reasonably captures the distance between
knowledge elements (words, titles, and abstracts).3 For each paper listed in the training and test data,
we extracted its title and the abstract and assigned a word vector to each word included. Finally, we
averaged all word vectors to generate a document vector for each paper.4
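The averaging step can be sketched as below. The three-dimensional toy embedding, the whitespace tokenization, and the skipping of out-of-vocabulary words are assumptions made for illustration; the model described above has 300 dimensions, and its preprocessing is not specified here.

```python
# Toy 3-dimensional embeddings; words and values are invented for illustration.
embedding = {
    "gene": [0.9, 0.1, 0.0],
    "protein": [0.7, 0.3, 0.0],
    "survey": [0.0, 0.2, 0.8],
}

def document_vector(text, emb):
    """Average the vectors of all in-vocabulary tokens, without normalization.

    Repeated words are counted every time they occur.
    """
    tokens = [t for t in text.lower().split() if t in emb]
    if not tokens:
        raise ValueError("no token of the text is in the vocabulary")
    dim = len(next(iter(emb.values())))
    return [sum(emb[t][k] for t in tokens) / len(tokens) for k in range(dim)]

# "unknownword" is not in the vocabulary and is skipped.
doc = document_vector("Gene protein gene unknownword", embedding)
print(doc)
```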
Classifiers. We used a pair of document vectors as the input features for predicting whether paper
pairs are linked or not. The goal of the link prediction model is to classify linked and non-linked pairs.
For this classification task, we drew on several classifiers that have been commonly used for text data:
Bernoulli Naïve Bayes (BNB), Gaussian Naïve Bayes (GNB), Logistic Regression (LR), Ridge
Regression (RR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA),
Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM)
(Chen and Guestrin, 2016; Cortes and Vapnik, 1995). We ran these 9 classifiers on the training data
with Python's Scikit-learn package (Pedregosa et al., 2012) and developed 9 models. We fine-tuned the
hyperparameters of these models using grid search with 5-fold cross-validation.5
1 We focus on the field of biomedicine because our previous study suggested that recombination is well captured by word
embeddings in the field but not necessarily so in other fields (Yin et al. 2022). We include the document types article, letter,
and proceedings paper.
2 We sampled linked pairs and non-linked pairs separately, oversampling linked pairs, because relatively few linked pairs
exist due to the sparse nature of the citation network. This biases the prediction scores. Thus, the risk indicators must be
interpreted in a relative sense: higher scores mean greater risk, but risk = 0.9 does not mean a 90% probability of failure.
3 We used the word embedding model as it is without any fine-tuning. This is because our primary goal is to represent
scientific documents with a simple model and develop a generally applicable method for predicting scientific risk,
rather than achieving the highest fine-tuned scores. Moreover, considering that scientific papers newly released on a daily
basis would change word embeddings, we chose not to tune the embedding model with the selected target data.
4 In averaging the word vectors, we chose not to normalize each vector. Many previous studies have shown that simply
averaging word vectors without normalization sufficiently represents the semantic information of a document (Kenter et al.
2016).
3.2. Performance of Prediction
To assess the performance of each trained model, we applied the models to the test data. Fig.1 presents
the precision, recall, and F1 scores. The figure shows that SVM has both the highest precision (0.857)
and the highest recall (0.875). The F1 score – the harmonic mean of precision and recall – is thus highest
for SVM (0.860). We are more tolerant of false positives (predicted to be linked but not linked in
fact) than of false negatives (linked but predicted not), because it is possible that a pair that has not
been linked yet may be linked after our citation window. Overall, SVM demonstrates the most desirable
performance among the tested classifiers.
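For reference, the three performance measures can be computed from predicted and actual labels as in the sketch below. It scores the positive class only (1 = linked) on invented labels; the averaging scheme behind the reported scores may differ (for example, weighting over both classes), so the sketch illustrates the definitions rather than reproducing the reported values.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (1 = linked)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # linked, predicted linked
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not linked, predicted linked
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # linked, predicted not
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented labels: four linked pairs and four non-linked pairs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```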
Compared with previous link prediction models, our model presents reasonable performance. For
example, Sebastian et al. (2015) constructed a link prediction model to predict future co-citation with
F1 = 0.85, precision = 0.85, and recall = 0.86. Similarly, Shibata et al. (2012) predicted citation links
among existing papers with F1 scores ranging from 0.74 to 0.82. Given that the previous studies drew
on numerous features of nodes and network, our approach based solely on document vectors achieves
sufficient performance.
After finding that SVM is the best classification algorithm, we applied the trained model based on SVM
to the test data and computed the probability that each paper pair is linked. We subtracted this
probability from 1 to construct the score of recombination risk – the probability that an attempt to link
two papers ends up failing. Fig.2 illustrates the distribution of the risk scores for the linked pairs and
non-linked pairs, clearly indicating that the risk scores are higher for the non-linked pairs.
5 To optimize the major hyperparameters, we set a range of possible values for the parameters and selected the optimal
values based on the average of weighted F1-scores in cross-validation.
These results suggest that our link prediction model based on word embeddings can compute the risk
in recombining a pair of knowledge elements.
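The step from a trained classifier to a risk score can be illustrated with a deliberately simplified stand-in. The sketch below does not use an SVM: it fits a small logistic regression by gradient descent, and it summarizes each pair by the elementwise absolute difference of the two document vectors. Both choices, as well as the toy vectors and all function names, are assumptions made so that the example runs with the standard library alone. Only the final step, risk = 1 minus the predicted link probability, follows the text.

```python
import math

def pair_features(u, v):
    """Elementwise absolute difference of two document vectors (an assumption)."""
    return [abs(a - b) for a, b in zip(u, v)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss (stand-in for the SVM)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wk * xk for wk, xk in zip(w, xi)) + b) - yi
            for k, xk in enumerate(xi):
                grad_w[k] += err * xk
            grad_b += err
        w = [wk - lr * g / n for wk, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def risk_score(u, v, w, b):
    """Risk = 1 - predicted probability that the pair is linked."""
    x = pair_features(u, v)
    p_linked = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)
    return 1.0 - p_linked

# Toy training pairs: linked pairs (label 1) are close, non-linked (0) are far.
pairs = [
    ([0.9, 0.1], [0.8, 0.2], 1),
    ([0.2, 0.7], [0.3, 0.8], 1),
    ([0.5, 0.5], [0.4, 0.6], 1),
    ([0.9, 0.1], [0.1, 0.9], 0),
    ([0.2, 0.8], [0.9, 0.0], 0),
    ([0.0, 0.3], [0.8, 0.9], 0),
]
X = [pair_features(u, v) for u, v, _ in pairs]
y = [label for _, _, label in pairs]
w, b = fit_logistic(X, y)

risk_close = risk_score([0.6, 0.4], [0.5, 0.5], w, b)
risk_far = risk_score([0.9, 0.0], [0.0, 0.9], w, b)
print(round(risk_close, 3), round(risk_far, 3))
```

In this toy setting the semantically distant pair receives the higher risk score, mirroring the separation between linked and non-linked pairs described above.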
4. VALIDATING RECOMBINATION RISK INDICATOR WITH SURVEY
To further validate our recombination risk indicator, we carried out a questionnaire survey and asked
the respondents to self-assess the risk of their past project. Specifically, we considered their published
paper as a project and asked the contact authors to assess how they had initially perceived the risk of
the project when they started it.
4.1. Risk indicator
To bibliometrically compute the risk of a project, which is operationalized as a paper, we draw on the
references cited by the focal paper as knowledge elements. As a paper usually has multiple references,
we form all possible combinations from these references (for example, 10 references make 45 pairs).
For each reference pair, we compute the risk score based on the SVM model developed in Section 3.
After computing the risk scores for all reference pairs, we aggregate them at the focal paper level in the
following steps.
Suppose a focal paper has N references. Let $r_{ij} \in [0,1]$ be the risk score in combining reference i and
reference j ($i, j \in \{1, \dots, N\}$, $i \neq j$). The focal paper is characterized by a series of risk scores for the
recombination of N elements. It is debatable how these risk values should be aggregated. On the one
hand, the success of a project requires all recombinations to be realized. Therefore, the minimum risk
dictates the risk of a project. Thus, the recombination risk of a project may be given by
$$\min_{(i,j) \in \{1,\dots,N\}^2} r_{ij} \qquad (1)$$
On the other hand, the success of a project may be subject to the most challenging recombination. Then,
the recombination risk of the paper is given by
$$\max_{(i,j) \in \{1,\dots,N\}^2} r_{ij} \qquad (2)$$
Given these possibilities, we prepared a series of risk indicators by taking various percentile values:
$$\mathit{Risk}_p = p\text{-th percentile value of } r_{ij} \qquad (3)$$
where Risk0 is the minimum and Risk100 is the maximum.
Note that we computed the risk of recombination that had already been achieved because all the
reference pairs are co-cited by the focal paper, and thus, we did not observe failed recombination.6
Nonetheless, we expected that even achieved recombinations had entailed potential risks. Fig.4
illustrates the distribution of risk indicators at different percentiles (p = 0, 10, …, 100), presenting a
considerable variance in the risk scores.
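The aggregation described in this subsection can be sketched as follows: form all reference pairs, score each pair, and take a percentile of the scores. The percentile function uses linear interpolation between order statistics, which is an assumption because the interpolation rule is not stated in the text. The pair-level scores and the lookup function are hypothetical stand-ins for the classifier output.

```python
from itertools import combinations

def percentile(values, p):
    """p-th percentile with linear interpolation between order statistics."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 <= p <= 100:
        raise ValueError("p must lie in [0, 100]")
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def paper_risk(references, pair_risk, p):
    """Risk_p of a focal paper: percentile of r_ij over all reference pairs."""
    scores = [pair_risk(i, j) for i, j in combinations(references, 2)]
    return percentile(scores, p)

# Hypothetical pair-level scores standing in for the classifier output.
toy_scores = {("a", "b"): 0.1, ("a", "c"): 0.4, ("b", "c"): 0.7}

def lookup(i, j):
    return toy_scores[(i, j)]

refs = ["a", "b", "c"]
risk_0 = paper_risk(refs, lookup, 0)      # minimum over all pairs
risk_50 = paper_risk(refs, lookup, 50)    # median
risk_100 = paper_risk(refs, lookup, 100)  # maximum over all pairs

# Ten references form 45 unordered pairs, as noted in the text.
n_pairs_10_refs = len(list(combinations(range(10), 2)))
print(risk_0, risk_50, risk_100, n_pairs_10_refs)
```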
4.2. Questionnaire Survey
Sample. We sampled the survey respondents in the following steps. First, we identified contact authors
whose email addresses are available in WoS publication records. We then selected contact authors who
had published at least 4 papers written in English to ensure that the authors have sufficient research
experience and can reasonably assess the corresponding risk. For each author, we selected the paper
that was most recently published as of 2018. To avoid recall bias, we excluded authors whose latest
publication was in 2015 or before. Finally, we randomly sampled 4,625 authors. After three rounds of
requests, 397 requests bounced and 378 responses were collected (response rate = 8.9% of the 4,228 delivered requests).
6 Also note that we formed all possible reference pairs, of which some pairs represent recombinations intended by the focal
paper but others may not be intended, and the latter may cause errors in the aggregated risk indicator.
Questionnaire. We developed a questionnaire survey on various qualities of scientific papers based on
interviews of scientists and tested it with a small-scale pilot survey. Of the survey questions, this study
draws on two items concerning risk, which assess how the respondents perceived the risk of their project
in two aspects. The part of the survey started with "The following questions are about the research
project that led to your paper: [the bibliographic information of the respondent's selected paper]" and
was followed by two questions. First, we inquired into overall risk by asking "when you started the
project, how confident were you that the project would reach the expected finding or any publishable
finding at all?" (Fig.3A). The response takes three options: (0) "Confident that the expected finding
would be obtained"; (1) "Confident that some publishable finding would be obtained"; and (2) "Not
confident if any publishable finding would be obtained." Second, we inquired into technical risk by
asking "when you started the project, did you anticipate any technical / methodological challenge that
could fail your project?" (Fig.3B). The response takes three options: (0) "No, I anticipated no major
challenge."; (1) "Yes, I anticipated a challenge, which I overcame."; (2) "Yes, I anticipated a challenge,
which forced us to change the research direction."
4.3. Validation
While our bibliometric risk indicators take continuous values between 0 and 1, the survey scores are
ordinal with 3 values. To examine the correlation between the bibliometric indicators (p = 0, 10, …,
100) and the survey scores, we first regressed the bibliometric indicators on the survey scores, with
the survey scores converted into dummy variables for each ordinal level. Since our risk score is a proportion
ranging between 0 and 1, we used a generalized linear model (GLM) with the logit link function and
the binomial distribution (Hardin and Hilbe, 2018).
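To illustrate what such a model estimates, consider the simplest case in which the only regressors are an intercept and dummies for the survey levels. With the logit link and binomial family, the fitted mean of each group then equals that group's sample mean, so the coefficients follow in closed form as differences of logits. The sketch below assumes this dummy-only specification, which may omit regressors used in the actual analysis, and uses invented data and hypothetical function names.

```python
import math

def logit(m):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(m / (1.0 - m))

def fractional_logit_dummies(y, group, baseline=0):
    """Coefficients of a logit-link GLM with an intercept and group dummies.

    With dummy-only regressors the score equations force each group's
    fitted mean to equal its sample mean, so no iterative fitting is needed.
    """
    means = {}
    for g in sorted(set(group)):
        vals = [yi for yi, gi in zip(y, group) if gi == g]
        means[g] = sum(vals) / len(vals)
    coef = {"intercept": logit(means[baseline])}
    for g, m in means.items():
        if g != baseline:
            # Change in log-odds relative to the baseline survey level.
            coef[g] = logit(m) - logit(means[baseline])
    return coef, means

# Invented risk indicator values (proportions) and survey scores (0, 1, 2).
risk = [0.20, 0.30, 0.25, 0.35, 0.60, 0.50]
survey = [0, 0, 1, 1, 2, 2]
coef, means = fractional_logit_dummies(risk, survey)
print(coef, means)
```

A positive coefficient for survey level 2 means that papers whose authors reported low confidence have higher bibliometric risk scores on average than the baseline group.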
Table 1 shows the result of the regression analyses. Comparing the two survey scores, the result suggests
that our bibliometric risk indicators are correlated with overall risk (Table 1A) but not with technical
risk (Table 1B). Regarding the overall risk, the result also shows that overall risk = 2 ("not confident if
any publishable finding would be obtained") is significantly positively correlated with the bibliometric
indicators but overall risk = 1 ("confident some publishable finding would be obtained") is not, which
is as expected. Comparing the series of bibliometric risk indicators (p = 0, 10, …, 100), we find that the
correlation with the survey score is significant at both low and high p values. To assess the magnitude
of the correlations, we further computed the marginal effects of the survey scores on the bibliometric
indicators. When overall risk changes from "0 or 1" to "2", Risk0 increases by 0.56 standard deviation
(SD), Risk100 increases by 0.37 SD, and Risk10 shows the largest increase, 0.72 SD.
We also computed Pearson's correlation coefficients between the survey scores and the bibliometric
indicators, finding significant correlations for overall risk but not for technical risk (Fig.5). Concerning
overall risk, the figure shows significantly positive correlations across a broad range of p values, but
stronger correlations are observed particularly at lower p values. This implies that the risk of a project
is determined by all risks that a project faces. Fig.6 further illustrates the distribution of the risk
indicators with relatively strong correlations with overall risk (p = {5, 10, 15, 20}). Comparing the high
and low overall risk groups, it demonstrates that the high-risk group has higher risk scores.
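The correlation reported here can be computed as in the following sketch, which implements Pearson's coefficient directly. The survey scores and indicator values are invented for illustration and do not reproduce the reported results.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)

# Invented ordinal survey scores and risk indicator values for six papers.
survey = [0, 0, 1, 1, 2, 2]
risk_10 = [0.20, 0.30, 0.25, 0.35, 0.60, 0.50]
r = pearson_r(survey, risk_10)
print(round(r, 3))
```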
5. RISK AND NOVELTY
5.1. Divergent Validity
Having constructed the bibliometric risk indicator, we test how it is related to novelty. Previous studies
found novelty and risk to be correlated (Franzoni et al., 2018) and sometimes treated them
interchangeably (Machado, 2021; Reinhilde et al., 2022). To test this assumption, we examine the
relationship between risk and novelty using bibliometric indicators of the two concepts. The sample for
this analysis is the same as in Section 4.
Bibliometric indicators. We use the bibliometric risk indicators based on the SVM model. In particular,
we test Riskp, where p = {5, 10, 15, 20}, as they demonstrated relatively strong correlations with the
survey risk score. As to the novelty indicator, we draw on the recombinant novelty indicator proposed
by Shibayama et al. (2021) as it employs an operationalization that is consistent with that for our
recombination risk indicator. To assess the novelty of a focal paper, their approach first converts each
reference of the focal paper into a document vector. Then, it computes the cosine distance of document
vectors for each reference pair. The cosine distances for all reference pairs are aggregated at the focal
paper level by taking the maximum value (Novel). That is, the novelty is determined by the reference
pair with the largest distance.
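The novelty computation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of Shibayama et al. (2021): document vectors are taken as simple averages of word vectors, and the toy two-dimensional vectors and function names are hypothetical.

```python
import math
from itertools import combinations


def document_vector(word_vectors):
    """Average a list of equal-length word vectors into one document vector."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / n for i in range(dim)]


def cosine_distance(u, v):
    """Cosine distance, i.e. 1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def novelty(reference_vectors):
    """Maximum cosine distance over all pairs of reference document vectors."""
    return max(cosine_distance(u, v) for u, v in combinations(reference_vectors, 2))


# Hypothetical document vectors for three references of one focal paper.
references = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
novel = novelty(references)
print(f"Novel = {novel:.3f}")
```

Taking the maximum means that a single distant reference pair is enough to make a paper score as novel, regardless of how similar the remaining pairs are.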
Correlation analysis. The results of the correlation analyses are summarized in Table 2. First, we test the
correlation between the novelty indicator and the survey risk scores. The Pearson's correlation
coefficient for the overall risk is 0.070 (p > 0.1) and that for the technical risk is -0.078 (p > 0.1). Thus,
recombinant novelty indicators do not appear to capture the risk perceived by scientists, contrary to what some
previous studies assumed (Machado, 2021; Reinhilde et al., 2022). Second, we examine the correlation
between the novelty indicator and the risk indicators. The results suggest negative rather than positive
correlations (Risk5: -.098, p < 0.1), and the correlations are weak overall. Therefore, we do not find
compelling evidence showing that the novelty indicator captures the particular risk concept studied in
this paper.
5.2. Prediction of Impact
To further analyze the relationship between novelty and risk, we evaluate how these bibliometric
indicators are associated with future citation impact. Previous studies suggested that while novel
discoveries may attract more citations, they also may be neglected and thus have fewer citations (Wang
et al., 2017). In fact, several analyses on the relationship between bibliometric novelty indicators and
future citation counts found both a positive relationship and an inverted-U-shaped relationship
(Shibayama et al., 2021; Uzzi et al., 2013; Wang et al., 2017). Here, we investigate whether and how
our indicator of recombination risk, together with novelty, is associated with future citation impact.
Setup of analysis. For this analysis, we use "top-1% cited" (TC) in the respective field as the dependent
variable, coded 1 if the citation count of the paper is within top 1% and 0 otherwise, and regress it on
the novelty indicator (Novel) and a risk indicator (Risk15). We use Risk15 as it showed the highest
correlation with the survey score in the previous section.7
We randomly sampled 4,000 articles published in biomedicine in 2010 and evaluated their citation
impact as of 2018. We oversampled top-1%-cited papers, so that the final sample consists of
approximately 2,000 top-1% cited papers and 2,000 non-top-1% cited papers. As the dependent variable
is a dummy variable, we draw on logistic regression.
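One common way to correct for this kind of oversampling is to weight each observation by the ratio of its group's population share to its sample share. The sketch below illustrates that idea only; the paper states that a sampling weight is incorporated but does not spell out its construction, so the formula, the function name, and the rounded shares used here are assumptions.

```python
def sampling_weights(population_share, sample_share):
    """Weight per group = population share / sample share (an assumed, standard scheme)."""
    return {group: population_share[group] / sample_share[group]
            for group in population_share}


# Assumed shares: top-1% papers are 1% of the population but about half of the sample.
population_share = {"top1": 0.01, "other": 0.99}
sample_share = {"top1": 0.5, "other": 0.5}

weights = sampling_weights(population_share, sample_share)

# With these weights, the weighted share of top-1% papers in the sample returns to 1%.
weighted_top1 = sample_share["top1"] * weights["top1"]
weighted_total = sum(sample_share[g] * weights[g] for g in sample_share)
print(weights, weighted_top1 / weighted_total)
```

Under such a scheme, the weighted sample reproduces the 1% base rate of top-cited papers.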
Regression analysis. Table 3 reports the results of the logistic regressions. Models 1 and 2 test the
relationship between novelty and future citation impact, finding a positive coefficient for the linear term
and a negative coefficient for the quadratic term. Given the magnitude of the coefficients, Model 2
suggests a positive but stagnating effect of novelty on future citation impact, which is consistent with
the previous findings (Shibayama et al., 2021). Models 3 and 4 then examine the relationship between
risk and future citation impact, finding a negative coefficient for the linear term and a positive
coefficient for the quadratic term. Again, given the magnitude of the coefficients, Model 4 suggests a
negative but diminishing effect of risk on future citation impact.
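One way to read the magnitude of a linear and a quadratic coefficient together is to locate the turning point of the fitted curve on the log-odds scale, x* = -b1 / (2 b2). The sketch below applies this to the Model 2 and Model 4 coefficients reported in Table 3. The helper name is hypothetical, and the calculation ignores the controls and the intercept, which do not affect where the turning point lies.

```python
def turning_point(b_linear, b_quadratic):
    """Value of x at which b_linear * x + b_quadratic * x**2 stops rising or falling."""
    return -b_linear / (2.0 * b_quadratic)


# Coefficients as reported in Table 3.
novel_turn = turning_point(8.183, -3.276)    # Model 2: Novel, Novel squared
risk_turn = turning_point(-18.375, 17.581)   # Model 4: Risk15, Risk15 squared

print(f"Novel: log-odds peak at about {novel_turn:.3f}")
print(f"Risk15: log-odds bottom out at about {risk_turn:.3f}")
```

If most observed values of an indicator lie below its turning point, the curve is monotonic over the relevant range and merely flattens, which would be in line with the "stagnating" and "diminishing" readings above.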
As risk and novelty have been confounded in the literature, Model 5 includes both the novelty and risk
indicators. The magnitude of the coefficients changes slightly, but the overall relationships remain
qualitatively similar. Further, to assess whether the risk and novelty indicators have any interaction
effect on future citations, Models 6 and 7 introduce various interaction terms without finding a
significant effect. Thus, the novelty and risk indicators are associated with future citations through different
mechanisms, which also supports our argument that these two indicators capture different concepts.
To visually illustrate the result, Fig.7 presents the contour map of the predicted citation impact over a
range of novelty and risk values. In the horizontal direction, the figure shows that greater novelty is
associated with higher citation impact. In the vertical direction, it shows that lower risk is associated
with higher citation impact. Taken together, higher-than-average citation impact (the area below the
red contour curve) occurs only with high novelty and low risk. In particular, the result suggests that
risky recombination, even if successful, is at a disadvantage in attracting future citations, although
high-risk research has been encouraged (Machado, 2021; OECD, 2021).
7 We tested other risk indicators and obtained consistent results.
6. DISCUSSIONS AND CONCLUSIONS
As risk plays a fundamental role in science, it is of scholarly and practical interest to quantify the degree
of risk in science. Nonetheless, the concept of risk has been understudied (Althaus, 2005; Aven, 2011;
Franzoni and Stephan, 2021; Hansson, 2018), and, to the best of our knowledge, there has been no approach
to measuring risk in science on a large scale, except that bibliometric novelty indicators were suggested
as a proxy of risk (Machado, 2021; Reinhilde et al., 2022). To fill the gap, this study aimed to develop
an approach to quantify the risk in science by focusing on recombination as a particular mode of
discovery process.
We proposed an indicator of risk computed on the basis of a machine learning model. We
operationalized the recombination of knowledge elements by a pair of papers being co-cited and the
risk of recombination by the paper pair not being co-cited. We tested several classification algorithms
to develop the model, among which the SVM demonstrated the best performance. The resulting risk
indicator was then tested against survey data, which confirmed that our risk indicator is positively
correlated with self-assessed overall risk of a project. Finally, while risk and novelty have been
confounded in the literature (Franzoni and Stephan, 2021; Machado, 2021; Reinhilde et al., 2022), we
showed the divergence of the bibliometric novelty and risk indicators.
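The step from pair-level risk to a paper-level indicator Riskp can be sketched as follows. The sketch assumes that Riskp is the p-th percentile of the predicted risk across all reference pairs of a paper, which is how the percentile index p is read here. The interpolation rule, the function names, and the toy risk values are assumptions for illustration only.

```python
import math


def percentile(values, p):
    """p-th percentile (0 to 100) of `values`, using linear interpolation."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    position = (len(ordered) - 1) * p / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def paper_risk(pair_risks, p):
    """Aggregate pair-level risks of a paper's reference pairs into Risk_p."""
    return percentile(pair_risks, p)


# Hypothetical predicted risks (probability of not being co-cited) for five reference pairs.
pair_risks = [0.10, 0.40, 0.20, 0.30, 0.90]

risk0 = paper_risk(pair_risks, 0)      # least risky pair
risk50 = paper_risk(pair_risks, 50)    # median pair
risk100 = paper_risk(pair_risks, 100)  # riskiest pair
print(risk0, risk50, risk100)
```

Under this reading, low values of p emphasize the least risky pairs of a paper, while Risk100 is driven entirely by its single riskiest pair.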
Overall, this study makes a scholarly contribution to the underdeveloped literature on risk in science
(Franzoni and Stephan, 2021; Machado, 2021; Reinhilde et al., 2022) by providing the first validated
indicator of a particular type of risk. We expect that the application of the risk indicator will deepen our
understanding about risk in science – what motivates risk-taking, what is the consequence of risk-taking,
how risk-taking should be rewarded, and so forth.
Practically, the proposed approach offers a flexible tool for quantifying risks and can benefit scientists
and relevant parties such as policymakers, funding bodies, and R&D managers. The risk indicator may
also be used for policy evaluation, for example by assessing whether a certain policy encourages or
discourages scientists' risk-taking behavior. This is relevant especially because current policies tend
to emphasize high-risk high-return research projects (Machado, 2021; OECD, 2021).
We expect that the proposed method is applicable not only to scientific papers but also to other types
of scientific texts, such as a paragraph describing a research idea or a funding proposal, although such
applications need further examination. These applications should help decision-makers assess the
feasibility of a research project and identify potential risks involved in a project. To facilitate such
applications, we prioritized two features in designing our approach. First, our approach relies solely on
text data. As long as a knowledge element of interest is provided in the form of text data, we can convert
text data into word embedding vectors and compute the risk indicator. Second, our risk indicator can
be calculated ex ante. Some existing bibliometric indicators are retrospectively computed, for example
drawing on citation network data, and thus can only be computed after papers are published. This is not
ideal because one wants to know the degree of risk before taking it. Our approach overcomes such a
limitation by not requiring post-publication information.
Despite these contributions, further refinement and development of risk indicators are warranted. First,
we focused on risk in a specific type of scientific progress – recombination, but other modes of scientific
progress should also involve risk. Thus, future research should develop a method to quantify risk in
broader modes of scientific progress. Second, we tested our approach only in the biomedical field
because of the limitation of the word embedding model. To extend our approach to various disciplines,
future research needs to develop word embeddings with broader text data and examine whether they
can capture the semantic distance between words across disciplines. Third, there is room for
improvement in extracting semantic information from documents. Though our approach draws on the
title and abstract of scientific papers, other parts of documents (e.g., the method section, and even the
full text) might be informative. Further, while our approach computes a document vector by averaging
word vectors, document vectors can also be computed directly (e.g., doc2vec). Our choices in these respects
were made to prioritize the applicability of the approach, but future research could improve the indicator by
constructing document vectors differently. Fourth, though we drew on all possible reference pairs in
constructing the risk indicator of a paper, not all the pairs may represent intended recombinations, which
can cause errors. This might be addressed by looking into how references are cited in the focal paper
and focusing on the relevant reference pairs. Finally, it is of interest to investigate the source of risk.
While novelty is a plausible source of risk (Machado, 2021; Reinhilde et al., 2022), our result does not
show a correlation between risk and novelty. This suggests that risk stems from more than novelty,
but what it is remains unclear. Future research could dissect the recombination of knowledge elements
to understand the source of risk.
References
Althaus, C.E., 2005. A Disciplinary Perspective on the Epistemological Status of Risk. Risk Analysis
25, 567-588.
Arthur, W.B., 2007. The Structure of Invention. Research Policy 36, 274-287.
Aven, T., 2011. On Some Recent Definitions and Analysis Frameworks for Risk, Vulnerability, and
Resilience. Risk Analysis 31, 515-522.
Boudreau, K.J., Guinan, E.C., Lakhani, K.R., Riedl, C., 2016. Looking across and Looking Beyond the
Knowledge Frontier: Intellectual Distance, Novelty, and Resource Allocation in Science.
Management Science 62, 2765-2783.
Bourdieu, P., 1975. The Specificity of the Scientific Field and the Social Conditions for the Progress of
Reason. Social Science Information 14, 19–47.
Breiman, L., 2001. Random Forests. Machine Learning 45, 5-32.
Butun, E., Kaya, M., 2020. Predicting Citation Count of Scientists as a Link Prediction Problem. IEEE
Transactions on Cybernetics 50, 4518-4529.
Chen, T., Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System, Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Association for Computing Machinery, San Francisco, California, USA, pp. 785–794.
Cortes, C., Vapnik, V., 1995. Support-Vector Networks. Machine Learning 20, 273-297.
Dahlin, K.B., Behrens, D.M., 2005. When Is an Invention Really Radical? Defining and Measuring
Technological Radicalness. Research Policy 34, 717-737.
Daud, A., Ahmed, W., Amjad, T., Nasir, J.A., Aljohani, N.R., Abbasi, R.A., Ahmad, I., 2017. Who
Will Cite You Back? Reciprocal Link Prediction in Citation Networks. Library Hi Tech 35,
509-520.
Fleming, L., 2001. Recombinant Uncertainty in Technological Search. Management Science 47, 117-132.
Fontana, M., Iori, M., Montobbio, F., Sinatra, R., 2020. New and Atypical Combinations: An
Assessment of Novelty and Interdisciplinarity. Research Policy 49, 28.
Foster, J.G., Rzhetsky, A., Evans, J.A., 2015. Tradition and Innovation in Scientists’ Research
Strategies. American Sociological Review 80, 875-908.
Franzoni, C., Scellato, G., Stephan, P., 2012. Foreign-Born Scientists: Mobility Patterns for 16
Countries. Nature Biotechnology 30, 1250-1253.
Franzoni, C., Scellato, G., Stephan, P., 2018. Context Factors and the Performance of Mobile
Individuals in Research Teams. Journal of Management Studies 55, 27-59.
Franzoni, C., Stephan, P., 2021. Uncertainty and Risk-Taking in Science: Meaning, Measurement and
Management. National Bureau of Economic Research Working Paper Series No. 28562.
Gewin, V., 2012. Risky Research: The Sky's the Limit. Nature 487, 395-397.
Hagstrom, W.O., 1974. Competition in Science. American Sociological Review 39, 1-18.
Hall, B.H., Jaffe, A., Trajtenberg, M., 2001. The NBER Patent Citations Data File: Lessons, Insights, and
Methodological Tools. NBER Working Paper 8498.
Hansson, S.O., 2018. Risk, in: Zalta, E.N. (Ed.), The Stanford Encyclopedia of Philosophy.
Metaphysics Research Lab, Stanford University.
Hardin, J.W., Hilbe, J.M., 2018. Generalized Linear Models and Extensions, 4th ed. Stata Press, TX,
USA.
Kaplan, S., Garrick, B.J., 1981. On the Quantitative Definition of Risk. Risk Analysis 1, 11-27.
Kenter, T., Borisov, A., de Rijke, M., 2016. Siamese CBOW: Optimizing Word Embeddings for
Sentence Representations, Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pp. 941–951.
Kuhn, T.S., 1970. The Structure of Scientific Revolutions. University of Chicago Press, Chicago, IL.
Kuhn, T.S., 1977. The Essential Tension: Selected Studies in Scientific Tradition and Change.
University of Chicago Press, Chicago.
Levitt, J.M., Thelwall, M., 2011. A Combined Bibliometric Indicator to Predict Article Impact.
Information Processing & Management 47, 300-308.
Lin, Y., Evans, J.A., Wu, L., 2022. New Directions in Science Emerge from Disconnection and Discord.
Journal of Informetrics 16, 101234.
Linton, J.D., 2016. Improving the Peer Review Process: Capturing More Information and Enabling
High-Risk/High-Return Research. Research Policy 45, 1936-1938.
Liu, W.Y., Nanetti, A., Cheong, S.A., 2017. Knowledge Evolution in Physics Research: An Analysis
of Bibliographic Coupling Networks. PLoS ONE 12, 19.
Machado, D., 2021. Quantitative Indicators for High-Risk/High-Reward Research, OECD Science,
Technology and Industry Working Papers. OECD Publishing, Paris.
MacRoberts, M.H., MacRoberts, B.R., 1989. Problems of Citation Analysis: A Critical Review. Journal
of the American Society for Information Science 40, 342-349.
Marinacci, M., 2015. Model Uncertainty. Journal of the European Economic Association 13, 1022-1100.
Matsumoto, K., Shibayama, S., Kang, B., Igami, M., 2020. A Validation Study of Knowledge
Combinatorial Novelty, NISTEP Discussion Paper. NISTEP, Tokyo.
Mazzolini, A., Colliva, A., Caselle, M., Osella, M., 2018. Heaps' Law, Statistics of Shared Components,
and Temporal Patterns from a Sample-Space-Reducing Process. Physical Review E 98.
Mednick, S.A., 1962. The Associative Basis of the Creative Process. Psychological Review 69, 220-232.
Merton, R.K., 1965. On the Shoulders of Giants. University of Chicago Press, Chicago, IL.
Merton, R.K., 1973. Sociology of Science. University of Chicago Press, Chicago.
Merton, R.K., Barber, E., 2004. The Travels and Adventures of Serendipity. A Study in Sociological
Semantics and the Sociology of Science. Princeton University Press, Princeton.
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient Estimation of Word Representations in
Vector Space. arXiv.
Min, C., Bu, Y., Wu, D., Ding, Y., Zhang, Y., 2021. Identifying Citation Patterns of Scientific
Breakthroughs: A Perspective of Dynamic Citation Process. Information Processing &
Management 58, 102428.
OECD, 2021. Effective Policies to Foster High-Risk/High-Reward Research, OECD Science,
Technology and Industry Policy Papers. OECD Publishing, Paris.
Palchykov, V., Krasnytska, M., Mryglod, O., Holovatch, Y., 2021. A Mechanism for Evolution of the
Physical Concepts Network. Condensed Matter Physics 24.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Müller,
A., Nothman, J., Louppe, G., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É., 2012. Scikit-Learn: Machine
Learning in Python. arXiv.
Reinhilde, V., Jian, W., Paula, S., 2022. Do Funding Agencies Select and Enable Risky Research:
Evidence from ERC Using Novelty as a Proxy of Risk Taking. National Bureau of Economic
Research, Inc.
Sebastian, Y., Siew, E.G., Orimaye, S.O., 2015. Predicting Future Links between Disjoint Research
Areas Using Heterogeneous Bibliographic Information Network, Advances in Knowledge
Discovery and Data Mining, Part II. Springer-Verlag Berlin, Berlin, pp. 610-621.
Shibata, N., Kajikawa, Y., Sakata, I., 2012. Link Prediction in Citation Networks. Journal of the
American Society for Information Science and Technology 63, 78-85.
Shibayama, S., 2019. Sustainable Development of Science and Scientists: Academic Training in Life
Science Labs. Research Policy 48, 676-692.
Shibayama, S., Yin, D., Matsumoto, K., 2021. Measuring Novelty in Science with Word Embedding.
PLoS ONE 16, e0254034.
Simonton, D.K., 2003. Scientific Creativity as Constrained Stochastic Behavior: The Integration of
Product, Person, and Process Perspectives. Psychological Bulletin 129, 475-494.
Sun, Y., Latora, V., 2020. The Evolution of Knowledge within and across Fields in Modern Physics.
Scientific Reports 10.
Trapido, D., 2015. How Novelty in Knowledge Earns Recognition: The Role of Consistent Identities.
Research Policy 44, 1488-1500.
Tria, F., Loreto, V., Servedio, V.D.P., 2018. Zipf's, Heaps' and Taylor's Laws Are Determined by the
Expansion into the Adjacent Possible. Entropy 20.
Tu, Y.-N., Seng, J.-L., 2012. Indices of Novelty for Emerging Topic Detection. Information Processing
& Management 48, 303-325.
Uzzi, B., Mukherjee, S., Stringer, M., Jones, B., 2013. Atypical Combinations and Scientific Impact.
Science 342, 468-472.
Wang, J., Veugelers, R., Stephan, P., 2017. Bias against Novelty in Science: A Cautionary Tale for
Users of Bibliometric Indicators. Research Policy 46, 1416-1436.
Wang, Y., Jones, B.F., Wang, D., 2019. Early-Career Setback and Future Career Impact. Nature
Communications 10, 4331.
Whitley, R., 1984. The Intellectual and Social Organization of the Sciences. Oxford University Press,
New York.
Yang, J., Lu, W., Hu, J., Huang, S., 2022. A Novel Emerging Topic Detection Method: A Knowledge
Ecology Perspective. Information Processing & Management 59, 102843.
Yaqub, O., 2018. Serendipity: Towards a Taxonomy and a Theory. Research Policy 47, 169-179.
Yin, D., Wu, Z., Yokota, K., Matsumoto, K., Shibayama, S., 2022. Identify Novel Elements of
Knowledge with Word Embedding.
Figures and Tables
Fig.1 Performance of Link Prediction Classifiers
[Figure: F1, precision, and recall (vertical axis, 0.0 to 1.0) for each classifier: BNB, GNB, LR, RR, LDA, QDA, RF, XGBoost, and SVM.]
Note. BNB: Bernoulli Naïve Bayes, GNB: Gaussian Naïve Bayes, LR: logistic regressions, RR: ridge
regressions, LDA: linear discriminant analysis, QDA: quadratic discriminant analysis, RF: random forest,
XGBoost: eXtreme Gradient Boosting, and SVM: support vector machine.
Fig.2 Distribution of Recombination Risk
[Figure: density (vertical axis) of Risk (horizontal axis, 0 to 1) for co-cited and not co-cited pairs.]
Note. Based on the SVM model.
Fig.3 Survey Scores of Recombination Risk
(A) Overall Risk
(0) Confident that the expected finding would be obtained.
(1) Confident that some publishable finding would be obtained.
(2) NOT confident if any publishable finding would be obtained.
[Pie chart; shares shown: 8%, 33%, 59%.]
(B) Technical Risk
(0) No, I anticipated no major challenge.
(1) Yes, I anticipated a challenge, which I overcame.
(2) Yes, I anticipated a challenge, which forced us to change the research direction.
[Pie chart; shares shown: 18%, 37%, 45%.]
Note: (A) "When you started the project, how confident were you that the project would reach the expected
finding or any publishable finding at all?" (B) "When you started the project, did you anticipate any technical /
methodological challenge that could fail your project?"
Fig.4 Distribution of Recombination Risk of Project
[Figure: eleven density plots, one per indicator: Risk(0), Risk(10), Risk(20), Risk(30), Risk(40), Risk(50), Risk(60), Risk(70), Risk(80), Risk(90), and Risk(100). Horizontal axis: risk value; vertical axis: density.]
Note. N = 378.
Fig.5 Correlation between Bibliometric and Survey Risk Scores
[Figure: correlation coefficient (vertical axis, -0.10 to 0.20) against p (percentile, 0 to 100 in steps of 5) for overall and technical risk, with significance markers.]
Note. N = 353. Pearson's correlation coefficients. †p<0.1. *p<0.05. **p<0.01. ***p<0.001. We dichotomized
overall/technical risk by assigning 1 if overall/technical = 2 and 0 otherwise.
Fig.6 Distribution of Risk Indicators by Overall Risk
[Figure: kernel density plots of Risk(5), Risk(10), Risk(15), and Risk(20), each comparing the low risk and high risk groups.]
Note. High risk: overall risk = 2. Low risk: overall risk = 0 or 1.
Fig.7 Prediction of Impact
[Figure: contour map. Horizontal axis: Novel (in percentile); vertical axis: Risk15 (in percentile); contour levels from 0.0025 to 0.0225.]
Note. The contour map of prob.(TC=1) based on Model 5 in Table 3. The red curve indicates the base line
(prob.(TC=1) = 0.01), below which prob.(TC=1) > 0.01. The novelty and risk indicators are scaled in their
percentile values (e.g., 50 is the median of the indicators).
Table 1 Regression Analysis
Each row reports a separate regression with the listed indicator as the dependent variable.
(A) Overall Risk (base: overall risk = 0)
Indicator  Overall risk = 1   Overall risk = 2   Chi squared   Log likelihood   N
Risk0      -.039 (.232)       1.151* (.516)      27.065***     -9.057           353
Risk10     -.107 (.135)       .689* (.314)       15.704**      -32.567          353
Risk20     -.104 (.120)       .608* (.279)       15.303**      -49.449          353
Risk30     -.080 (.115)       .541* (.246)       13.902**      -66.953          353
Risk40     -.085 (.108)       .459* (.226)       11.535*       -85.503          353
Risk50     -.080 (.105)       .422* (.210)       10.468*       -105.082         353
Risk60     -.044 (.103)       .374† (.198)       8.431†        -127.024         353
Risk70     .046 (.100)        .307 (.194)        5.384         -149.233         353
Risk80     -.053 (.099)       .277 (.180)        3.784         -168.306         353
Risk90     -.064 (.105)       .227 (.177)        4.067         -175.925         353
Risk100    -.109 (.166)       .595* (.272)       22.948***     -110.451         353
(B) Technical Risk (base: technical risk = 0)
Indicator  Technical risk = 1   Technical risk = 2   Chi squared   Log likelihood   N
Risk0      .000 (.336)          -.057 (.358)         3.937         -9.194           352
Risk10     .034 (.170)          .082 (.213)          5.352         -32.767          352
Risk20     .030 (.144)          .0176 (.190)         5.737         -49.672          352
Risk30     .040 (.130)          .033 (.174)          5.833         -67.143          352
Risk40     .045 (.119)          .019 (.162)          4.864         -85.606          352
Risk50     .058 (.114)          -.013 (.154)         4.661         -105.021         352
Risk60     .090 (.110)          .014 (.148)          4.413         -126.730         352
Risk70     .131 (.107)          .053 (.142)          3.419         -148.766         352
Risk80     .133 (.106)          .003 (.137)          2.191         -167.824         352
Risk90     .205† (.112)         .000 (.136)          5.245         -175.229         352
Risk100    .216 (.174)          .182 (.216)          18.682***     -110.536         352
Note. Generalized linear model with a logit link and the binomial family. Unstandardized coefficients (robust errors in parentheses). Two-tailed test. †p<0.1. *p<0.05.
**p<0.01. ***p<0.001. The sub-fields within biomedicine are controlled for.
Table 2 Divergent Validity
                                Novel (bibliometric)
Survey          Overall risk    .070
                Technical risk  -.078
Bibliometric    Risk5           -.098†
                Risk10          -.058
                Risk15          -.017
                Risk20          .008
Note. N = 353. Pearson's correlation coefficients. †p<0.1.
Table 3 Prediction of Impact (Top-1% Cited)
                    Model 1     Model 2     Model 3     Model 4     Model 5     Model 6     Model 7
Novel               2.217***    8.183***                            6.366***    6.551***    7.896***
                    (.156)      (1.146)                             (1.173)     (1.186)     (1.708)
Novel²                          -3.276***                           -2.285***   -2.311***   -2.994**
                                (.638)                              (.656)      (.658)      (.984)
Risk15                                      -14.711***  -18.375***  -18.751***  -14.101***  7.810
                                            (1.396)     (1.374)     (1.527)     (3.268)     (13.020)
Risk15²                                                 17.581***   18.146***   17.698***   -3.513
                                                        (1.722)     (1.866)     (1.854)     (19.653)
Novel × Risk15                                                                  -4.977      -50.957
                                                                                (3.375)     (34.117)
Novel² × Risk15                                                                             22.869
                                                                                            (21.524)
Novel × Risk15²                                                                             30.728
                                                                                            (57.463)
Novel² × Risk15²                                                                            -7.093
                                                                                            (39.075)
Chi-squared stat    200.967***  216.578***  116.286***  192.494***  307.846***  330.828***  375.162***
Log likelihood      -109.684    -109.415    -108.769    -108.551    -106.422    -106.409    -106.382
N                   3903        3903        3903        3903        3903        3903        3903
Note. Logistic regressions. Unstandardized coefficients (robust errors in parentheses). Two-tailed test. †p<0.1. *p<0.05. **p<0.01. ***p<0.001. The sampling weight is
incorporated in the regression analysis. The sub-fields within biomedicine are controlled for. We also ran OLS using the citation count as the dependent variable with the same
set of independent variables and obtained similarly significant results (see Supplementary Information).
Supplementary Information
Table S1 Prediction of Impact (Citation count)
                    Model 1     Model 2     Model 3     Model 4     Model 5     Model 6     Model 7
Novel               1.611***    3.466***                            2.989***    3.205***    4.465***
                    (.107)      (.537)                              (.548)      (.567)      (.769)
Novel²                          -1.145***                           -.916**     -.983**     -1.707***
                                (.321)                              (.325)      (.328)      (.459)
Risk15                                      -2.280***   -4.650***   -4.000***   -2.780**    8.033†
                                            (.264)      (.532)      (.504)      (.885)      (4.443)
Risk15²                                                 4.092***    3.620***    3.514***    -12.653*
                                                        (.830)      (.783)      (.802)      (6.321)
Novel × Risk15                                                                  -1.540†     -29.007*
                                                                                (.875)      (12.171)
Novel² × Risk15                                                                             15.944*
                                                                                            (7.910)
Novel × Risk15²                                                                             40.330*
                                                                                            (18.505)
Novel² × Risk15²                                                                            -23.023†
                                                                                            (12.578)
F stat              133.632***  92.677***   54.220***   52.451***   79.390***   66.543***   51.255***
R2                  .116        .121        .060        .070        .157        .158        .160
N                   3903        3903        3903        3903        3903        3903        3903
Note. OLS. Unstandardized coefficients (robust errors in parentheses). Two-tailed test. †p<0.1. *p<0.05. **p<0.01. ***p<0.001. The sampling weight is incorporated in the
regression analysis. The sub-fields within biomedicine are controlled for.