MEASURING RISK IN SCIENCE

Deyun YIN a,b, Zhao WU a, and Sotaro SHIBAYAMA c,d,*

a Harbin Institute of Technology (Shenzhen), School of Economics and Management, Shenzhen, China
b World Intellectual Property Organization, Geneva, Switzerland
c Lund University, Center for Innovation Research, Lund, Sweden
d The University of Tokyo, Institute for Future Initiative, Tokyo, Japan
* Corresponding author: sotaro.shibayama@fek.lu.se. +46 (0)46 2227812. P.O. Box 7080, S-220 07 Lund, Sweden

ABSTRACT

Risk plays a fundamental role in scientific discoveries, and thus it is critical that the level of risk can be systematically quantified. We propose a novel approach to measuring the risk entailed in a particular mode of the discovery process – knowledge recombination. The recombination of extant knowledge serves as an important route to generating new knowledge, but attempts at recombination often fail. Drawing on machine learning and natural language processing techniques, our approach converts knowledge elements in text format into high-dimensional vector representations and computes the probability of failing to combine a pair of knowledge elements. Testing the calculated risk indicator on survey data, we confirm that our indicator is correlated with self-assessed risk. Further, as risk and novelty have been confounded in the literature, we examine the bibliometric novelty and risk indicators and show that they diverge. Finally, we demonstrate that our risk indicator is negatively associated with future citation impact, suggesting that risk-taking itself may not necessarily pay off. Our approach can assist the decision making of scientists and of relevant parties such as policymakers, funding bodies, and R&D managers.

KEYWORDS

Risk; Uncertainty; Novelty; Recombination; Science; Word embedding; Support Vector Machine

1. INTRODUCTION

Science is a risky business by nature. Scientists explore and cultivate uncharted space of knowledge through trial and error, in which their original ideas are often rejected and expected goals are not fulfilled for various reasons (Franzoni and Stephan, 2021; Machado, 2021; OECD, 2021; Reinhilde et al., 2022; Wang et al., 2019). Such risk and uncertainty tend to be especially high when scientists aim at novel discoveries, which have the potential to open up new avenues and make substantial advances (Bourdieu, 1975; Hagstrom, 1974; Kuhn, 1970; Merton, 1973). Thus, there is a growing concern over scientists' risk-averse behavioral patterns, and science communities and policymakers emphasize that efforts should be made to facilitate high-risk, high-return research (Franzoni and Stephan, 2021; Gewin, 2012; Machado, 2021; OECD, 2021). Despite its fundamental role, risk and uncertainty in science have been poorly understood (Althaus, 2005; Aven, 2011; Franzoni and Stephan, 2021; Hansson, 2018), which this study aims to address.

Specifically, we aim to develop a bibliometric approach to quantify the degree of scientific risk in a particular mode of the scientific discovery process – recombination. Scientific knowledge is usually generated on the basis of extant knowledge (Merton, 1965), and combining multiple elements of extant knowledge is an indispensable route to generating new knowledge (Fontana et al., 2020; Uzzi et al., 2013).
Even novel discoveries often result from integrating pieces of extant knowledge that used to appear unrelated (Dahlin and Behrens, 2005; Mednick, 1962; Simonton, 2003; Trapido, 2015; Uzzi et al., 2013; Wang et al., 2017). Because of this fundamental role of recombination, it is of scholarly and practical interest to quantify the degree of risk associated with recombination. In this regard, the previous literature seems to assume that novel research is risky (Machado, 2021; Reinhilde et al., 2022). While novel research may entail some risks (Franzoni and Stephan, 2021; Wang et al., 2017), risk and novelty are not equivalent. Opportunities for novel recombination may be difficult to identify, but the recombination itself may be easily achieved once the opportunity is identified.

To quantify risk in the recombination process, we employ machine learning and natural language processing techniques. Drawing on past trajectories of science, we develop a machine learning model that predicts whether a certain pair of knowledge elements will be linked or not in the future. The developed model calculates the probability that the pair of knowledge elements is combined, or, conversely, the risk of failing in recombination. This risk indicator is validated by a questionnaire survey that we carried out, in which scientists self-assessed the anticipated risk of their own projects. We further examine the relationship between risk and novelty.

The contribution of this study is two-fold. First, this study is the first to offer a validated indicator of risk in science, which contributes to the underdeveloped literature on risk in science (Franzoni and Stephan, 2021; Machado, 2021; Reinhilde et al., 2022). Second, the proposed approach offers a practical tool for scientists, policymakers, and other parties in assessing the feasibility of research plans and in developing research strategies.

This paper is structured as follows. Section 2 reviews previous studies on risk and uncertainty in science. Section 3 describes our approach to quantifying risk in recombination in science. Section 4 validates the risk indicator with the questionnaire survey. Section 5 examines the relationship between the risk and novelty indicators. Section 6 summarizes the results and discusses implications.

2. LITERATURE REVIEW

2.1. Risk in Science

In general, risk is attributed to imperfect information (Marinacci, 2015), because of which one cannot know in advance exactly what will happen and what its consequences will be (Kaplan and Garrick, 1981). Scientific research is risky in this regard because scientists do not always know what they will discover, whether they will discover anything at all, or what the implications of a certain discovery are (Franzoni and Stephan, 2021). Scientists often start researching without having a clear expectation (Bourdieu, 1975; Shibayama, 2019; Whitley, 1984). Such exploratory research could potentially lead to various discoveries, and scientists may reach one of them or none at all. In other cases, scientists start with a concrete hypothesis (Bourdieu, 1975; Whitley, 1984). In such confirmatory research, the result may either support or reject the hypothesis, or whether the hypothesis was supported or rejected may turn out to be unclear. Even when the intended hypothesis is rejected, scientists may serendipitously encounter unexpected results (Merton and Barber, 2004; Yaqub, 2018).
Overall, scientists cannot perfectly know whether or what they will discover. Once a discovery is made, the consequence of the discovery might also be unpredictable (Franzoni and Stephan, 2021). A discovery may be used for various applications, potentially beyond what is expected, or may not be used at all. It is well known that the impact of a discovery in terms of forward citations varies substantially (Levitt and Thelwall, 2011; MacRoberts and MacRoberts, 1989; Min et al., 2021).

These risks have various sources. One important source is the incompleteness of existing scientific knowledge. The frontier of science is full of unknowns, which makes it difficult to predict what will happen next. Many important discoveries are known to have been made accidentally (e.g., penicillin and X-rays). Another source of risk lies in nature itself, especially in empirical research. Some phenomena occur only stochastically (e.g., the observation of neutrinos and the discovery of pulsars), so whether scientists can observe a phenomenon or not depends on luck. Furthermore, technologies may determine what scientists can discover. Scientists often need the aid of technologies to observe what they intend to observe (e.g., a microscope or a DNA sequencer), but technologies may not work perfectly, due to innate technological uncertainty or due to errors on the side of scientists. Finally, risk is also attributable to contextual factors beyond scientists' control. For example, the consequence of a discovery, in terms of citation impact, is subject to the degree of competition in the research area.

2.2. Risk in Recombination

Among the various types of risk in science, this study highlights the risk of failing in a discovery – the chance that an attempt at discovery results in no discovery. It further focuses on the risk entailed in a particular mode of the discovery process – recombination. An important route to generating new knowledge is the recombination of extant knowledge (Fontana et al., 2020; Uzzi et al., 2013). Previous literature argues that associating remote elements is a path to creative solutions (Mednick, 1962; Simonton, 2003), and that combining diversified components is a major route to technological innovation (Arthur, 2007; Fleming, 2001; Hall et al., 2001). In the context of science, for example, chemists synthesize a new compound by reacting multiple chemicals, and molecular biologists literally recombine genetic sequences to generate a protein with desired properties. Recombination is also suggested to be an important source of novel discoveries (Lin et al., 2022; Uzzi et al., 2013).

Obviously, attempts at recombination are not always successful. Scientists may try to combine a pair of knowledge elements only to realize that it is technically infeasible. Thus, risk in recombination – whether, given a pair of knowledge elements, their recombination is likely to succeed or fail – is of theoretical and practical interest. Previous literature tends to consider attempts at more novel recombinations to be riskier. Indeed, Franzoni et al. (2018) conducted a survey and found that the respondents' assessment of risk is correlated with a recombinant novelty indicator. Wang et al. (2017) showed that publications with higher novel recombination scores have higher variance in their citation impact.
2.3. Measuring Risk in Recombination

While the degree of risk can be measured subjectively on a small scale with surveys or peer reviews (Franzoni et al., 2012; Linton, 2016), to the best of our knowledge there is no approach to measure risk in science on a large scale, with one exception. That is, a few studies proposed to use the bibliometric indicator of recombinant novelty as a proxy of scientific risk (Machado, 2021; Reinhilde et al., 2022).

Recombinant novelty indicator. In recent years, efforts have been made to quantify the degree of scientific novelty based on citation data and text data (Tu and Seng, 2012; Yang et al., 2022). To operationalize the concept, the majority of novelty indicators draw on the rarity of, or the distance between, combined knowledge elements. For example, one group of novelty indicators considers a scientific document to be novel if it cites a rare combination of journals (Lin et al., 2022; Uzzi et al., 2013; Wang et al., 2017). In this case, a journal is used as the element of knowledge, and the distance between two journals is measured by the (in)frequency with which the two journals have been cited together. Another group of indicators uses cited references as the knowledge elements and considers a document to be novel if it cites a rare combination of references (Matsumoto et al., 2020; Trapido, 2015). Yet another approach draws on text information. For example, Boudreau et al. (2016) measure the novelty of a grant proposal by whether it combines Medical Subject Headings (MeSH) keywords for the first time in history. A more recent novelty measure is based on the word-embedding technique. Shibayama et al. (2021) assign a high-dimensional vector to each relevant word based on the previous co-occurrences of words, position all documents in the high-dimensional space based on the document text, and finally calculate the distance between each cited reference pair.

In these operationalizations, novelty may be associated with some degree of difficulty. It is plausible that a pair of knowledge elements has rarely co-occurred because previous studies had difficulty in combining them. It is also possible, however, that a novel pair was difficult to identify but that combining the pair itself is not difficult. Thus, the link between recombinant novelty and risk in recombination is not straightforward. In fact, the previous studies proposing to use the novelty indicator as a proxy of risk are based on mixed reasoning in terms of what they mean by risk (Machado, 2021; Reinhilde et al., 2022). One reasoning is that the novelty indicator is associated with a high variance in citation impact (Wang et al., 2017), which concerns risk in the consequence of a discovery but not risk in the discovery process. Another is based on a validation analysis showing that the novelty indicator is correlated with a questionnaire score of "high-risk and high-reward", in which the two concepts are confounded (Franzoni et al., 2018; Franzoni and Stephan, 2021). Therefore, this study aims to measure risk in the process of recombination more directly.

Prediction of knowledge evolution. In a nutshell, risk in recombination concerns the likelihood with which a pair of knowledge elements is to be successfully combined or not. We thus need a technique to predict this type of knowledge evolution.
The evolution of knowledge has long been studied in the sociology of science (Bourdieu, 1975; Kuhn, 1977; Whitley, 1984), and recent studies have contributed to empirically describing knowledge evolution with large-scale bibliometric data (Foster et al., 2015; Liu et al., 2017; Palchykov et al., 2021; Sun and Latora, 2020). While some of these studies aim to understand the generic mechanisms and laws behind knowledge evolution (Mazzolini et al., 2018; Tria et al., 2018), others focus on the micro mechanisms by which knowledge elements are combined (Butun and Kaya, 2020; Sebastian et al., 2015). The latter is often reduced to a link prediction problem in a network of scientific documents. Such an approach first forms a network of documents, in which a document is a node and a citation between two nodes is a link. It then assesses various features of the nodes and the network and predicts whether a pair of nodes will be linked or not. For example, Shibata et al. (2012) developed a link prediction model to assess whether a pair of papers is likely to have a citation link, though the model does not predict future citations. Sebastian et al. (2015) also developed a model to predict whether papers in different research areas will be linked through co-citation (i.e., cited together by a paper published in the future). Drawing on a link prediction algorithm for citation networks, Butun and Kaya (2020) predicted the citation count of a paper, and Daud et al. (2017) predicted the author network through citation links.

These link prediction algorithms draw on either node attributes or network topology. Earlier studies tend to rely on node attributes, such as a paper's keywords and author affiliations, while more recent studies draw on topological network features such as the number of neighboring nodes, Jaccard's coefficient, and the Adamic-Adar index, contending that topological information allows more precise prediction (Butun and Kaya, 2020). These features are fed into supervised machine learning models to predict whether a pair of nodes will be linked or not. Technically, this task draws on classification algorithms such as support vector machines (SVM), random forests, and logistic regression (Breiman, 2001; Chen and Guestrin, 2016; Cortes and Vapnik, 1995).

3. PREDICTION OF RECOMBINATION RISK

We similarly use link prediction algorithms to assess whether a pair of knowledge elements will be linked or not. If a pair is predicted not to be linked, the likelihood of failing to combine the pair is high, and we thus consider the risk of recombination to be high. In developing our approach, we set a few requirements for the input of the link prediction model. First, our prediction is based on information about a knowledge element itself rather than information about the network of knowledge elements. Second, we use only the semantic information of knowledge elements without requiring other information (e.g., author information). In other words, unlike previous studies, we do not predict links between knowledge elements with citation network information or other bibliometric information. Such information is available only after a paper is successfully published and embedded in the citation network, which is not ideal because we would like to know the risk of recombination even before the risk is taken.
Specifically, we draw on the word embedding technique – i.e., a high-dimensional vector assigned to each word – to capture the semantic information of knowledge elements (Mikolov et al., 2013). We assign a vector representation to each knowledge element and predict the likelihood of a pair of elements being linked based on the corresponding vector pair. This approach is applicable as long as a knowledge element is given in text format, whether it is part of a published paper (e.g., the abstract of a paper) or not (e.g., a grant proposal or a paragraph describing a research idea). We test our strategy in two steps in this and the next sections.

3.1. Predicting Future Co-citation

To build a link prediction model, we draw on a published paper as a knowledge element and consider that two knowledge elements are combined if the pair of papers is cited together by at least one paper. With classification algorithms, we compute the probability that two papers are co-cited. By subtracting this probability from 1, we obtain the risk of failing to combine the knowledge elements.

Data. We sampled papers in the field of biomedicine from the Web of Science (WoS).[1] We first identified about 520,000 papers published in the field in 2010. We chose publications in 2010 to allow sufficient time for the papers to be cited. These papers can generate 1.35 × 10^11 pairs, of which 1.38 × 10^7 pairs were actually co-cited at least once in the following 8 years (until 2018). From these co-cited pairs we randomly selected approximately 60,000 pairs as linked pairs. We also randomly sampled 60,000 non-linked pairs that were never co-cited. In total, we prepared a sample of 120,000 paper pairs, half linked and half non-linked, as the training data.[2] We repeated the same sampling process to prepare test data of the same size, consisting of 60,000 linked pairs and 60,000 non-linked pairs.

[1] We focus on the field of biomedicine because our previous study suggested that recombination is well captured by word embeddings in this field but not necessarily in other fields (Yin et al., 2022). In terms of document type, we include articles, letters, and proceedings papers.
[2] We sampled linked pairs and non-linked pairs separately – that is, we oversampled linked pairs – because relatively few linked pairs exist due to the sparse nature of the citation network. This biases the prediction scores. Thus, the risk indicators must be interpreted in a relative sense – higher scores mean greater risk, but risk = 0.9 does not mean a 90% probability of failure.

Word embedding. As discussed above, our link prediction model draws on the vector representations of a pair of papers. To this end, we drew on a word embedding model trained with publication data up to 2010 in WoS. The model provides 300-dimensional vector representations for 1.7 million unique words. Yin et al. (2022) demonstrated that the model reasonably captures the distance between knowledge elements (words, titles, and abstracts).[3] For each paper listed in the training and test data, we extracted its title and abstract and assigned a word vector to each word included. Finally, we averaged all word vectors to generate a document vector for each paper.[4]

[3] We used the word embedding model as it is, without any fine-tuning. This is because our primary goal is to represent scientific documents with a simple model and to develop a generally applicable method for predicting scientific risk, rather than to achieve the highest fine-tuned scores. Moreover, considering that newly released scientific papers would change the word embeddings on a daily basis, we chose not to tune the embedding model to the selected target data.
[4] In averaging the word vectors, we chose not to normalize each vector. Many previous studies have shown that simply averaging word vectors without normalization sufficiently represents the semantic information of a document (Kenter et al., 2016).
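As an illustration of the document vector construction described above, the following is a minimal sketch assuming a pretrained word2vec-style model loaded with gensim; the model file name, the simple tokenizer, and the concatenation of two document vectors into a pair feature are our assumptions rather than details reported in the paper.

```python
import re
import numpy as np
from gensim.models import KeyedVectors

# Load a pretrained 300-dimensional embedding model (hypothetical file name; the paper
# draws on a WoS-trained model from Yin et al., 2022).
wv = KeyedVectors.load("wos_word2vec_2010.kv")

def document_vector(text: str) -> np.ndarray:
    """Average the word vectors of all in-vocabulary tokens in a title + abstract."""
    tokens = re.findall(r"[a-z]+", text.lower())           # simple tokenizer (assumption)
    vecs = [wv[t] for t in tokens if t in wv.key_to_index]
    if not vecs:                                           # no known token: return a zero vector
        return np.zeros(wv.vector_size)
    return np.mean(vecs, axis=0)                           # unnormalized average (see note [4])

def pair_features(text_a: str, text_b: str) -> np.ndarray:
    """Represent a paper pair by concatenating its two document vectors (our assumption)."""
    return np.concatenate([document_vector(text_a), document_vector(text_b)])
```

In this sketch, pair_features would supply the input rows for the classifiers described next.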
Classifiers. We then used a pair of document vectors as the input features for predicting whether paper pairs are linked or not. The goal of the link prediction model is to classify linked and non-linked pairs. For this classification task, we drew on several classifiers that have been commonly used for text data: Bernoulli Naïve Bayes (BNB), Gaussian Naïve Bayes (GNB), Logistic Regression (LR), Ridge Regression (RR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM) (Chen and Guestrin, 2016; Cortes and Vapnik, 1995). We ran these 9 classifiers on the training data with Python's Scikit-learn package (Pedregosa et al., 2012) and developed 9 models. We fine-tuned the hyperparameters of these models using grid search with 5-fold cross-validation.[5]

[5] To optimize the major hyperparameters, we set a range of possible values for each parameter and selected the optimal values based on the average weighted F1-score in cross-validation.

3.2. Performance of Prediction

To assess the performance of each trained model, we applied the models to the test data. Fig.1 presents the precision, recall, and F1 scores. The figure shows that SVM has both the highest precision (0.857) and the highest recall (0.875). The F1 score – the harmonic mean of precision and recall – is thus also highest for SVM (0.860). We are relatively tolerant of false positives (pairs predicted to be linked but not in fact linked) compared with false negatives (linked pairs predicted not to be linked), because a pair that has not been linked yet may still be linked after our citation window. Overall, SVM demonstrates the most desirable performance among the tested classifiers.

Compared with previous link prediction models, our model presents reasonable performance. For example, Sebastian et al. (2015) constructed a link prediction model to predict future co-citation with F1 = 0.85, precision = 0.85, and recall = 0.86. Similarly, Shibata et al. (2012) predicted citation links among existing papers with F1 scores ranging from 0.74 to 0.82. Given that these previous studies drew on numerous node and network features, our approach based solely on document vectors achieves sufficient performance.

After finding that SVM is the best classification algorithm, we applied the trained SVM model to the test data and computed the probability of each paper pair being linked. We subtracted this probability from 1 to construct the score of recombination risk – the probability that an attempt to link two papers ends up failing. Fig.2 illustrates the distribution of the risk scores for the linked pairs and the non-linked pairs, clearly indicating that the risk scores are higher for the non-linked pairs. These results suggest that our link prediction model based on word embeddings can compute the risk of recombining a pair of knowledge elements.
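The training and risk computation described above can be sketched as follows; this is a minimal illustration in which toy data stand in for the real pair features, and the hyperparameter grid is ours rather than the one used in the paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support

# Toy stand-ins for the real data: rows are paper pairs (concatenated document vectors),
# labels are 1 = co-cited within the citation window, 0 = never co-cited.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 600)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(100, 600)), rng.integers(0, 2, 100)

# SVM with probability outputs, tuned by grid search with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}       # illustrative grid
svm = GridSearchCV(SVC(probability=True), param_grid, cv=5, scoring="f1_weighted")
svm.fit(X_train, y_train)

# Evaluate on the held-out pairs.
y_pred = svm.predict(X_test)
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="binary")
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")

# Recombination risk of a pair = 1 - predicted probability of being linked (class 1).
risk = 1.0 - svm.predict_proba(X_test)[:, 1]
```

Because linked pairs are oversampled (note [2]), these probabilities are meaningful only in a relative sense.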
4. VALIDATING THE RECOMBINATION RISK INDICATOR WITH A SURVEY

To further validate our recombination risk indicator, we carried out a questionnaire survey and asked the respondents to self-assess the risk of a past project. Specifically, we considered a published paper as a project and asked its contact author to assess how they had initially perceived the risk of the project when they started it.

4.1. Risk Indicator

To bibliometrically compute the risk of a project, which is operationalized as a paper, we draw on the references cited by the focal paper as knowledge elements. As a paper usually has multiple references, we form all possible combinations of these references (for example, 10 references make 45 pairs). For each reference pair, we compute the risk score based on the SVM model developed in Section 3. After computing the risk scores for all reference pairs, we aggregate them at the focal paper level in the following steps.

Suppose a focal paper has N references. Let $r_{ij} \in [0,1]$ be the risk score of combining reference i and reference j ($i, j \in \{1, \dots, N\}$, $i \neq j$). The focal paper is thus characterized by a set of risk scores for the recombination of its N elements. It is debatable how these risk values should be aggregated. On the one hand, since the success of a project requires all recombinations to be realized, every pairwise risk, including the smallest, bears on the risk of the project. Thus, the recombination risk of a project may be given by

$$\min_{(i,j)\in\{1,\dots,N\}^2,\ i\neq j} r_{ij} \tag{1}$$

On the other hand, the success of a project may be dictated by the most challenging recombination. Then, the recombination risk of the paper is given by

$$\max_{(i,j)\in\{1,\dots,N\}^2,\ i\neq j} r_{ij} \tag{2}$$

Given these possibilities, we prepared a series of risk indicators by taking various percentile values:

$$\mathit{Risk}_p = p\text{-th percentile value of } \{ r_{ij} \} \tag{3}$$

where Risk0 is the minimum and Risk100 is the maximum.

Note that we computed the risk of recombinations that had already been achieved, because all the reference pairs are co-cited by the focal paper; thus, we did not observe failed recombination.[6] Nonetheless, we expect that even achieved recombinations had entailed potential risks.

[6] Also note that we formed all possible reference pairs, of which some represent recombinations intended by the focal paper while others may not be intended; the latter may cause errors in the aggregated risk indicator.

Fig.4 illustrates the distribution of the risk indicators at different percentiles (p = 0, 10, …, 100), showing considerable variance in the risk scores.
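A minimal sketch of this aggregation is given below, assuming a pairwise risk function such as the one derived from the trained classifier; the function and variable names are ours.

```python
from itertools import combinations
import numpy as np

def project_risk(ref_vectors, pair_risk, p=15):
    """Aggregate pairwise recombination risks of a paper's references into Risk_p.

    ref_vectors: one document vector per cited reference.
    pair_risk:   callable returning the risk score r_ij in [0, 1] for a vector pair,
                 e.g. 1 - P(linked) from the trained classifier (assumption).
    p:           percentile in [0, 100]; p=0 gives the minimum, p=100 the maximum.
    """
    risks = [pair_risk(vi, vj) for vi, vj in combinations(ref_vectors, 2)]
    return float(np.percentile(risks, p))

# Toy illustration: 10 references yield C(10, 2) = 45 pairwise risk scores.
refs = [np.random.default_rng(i).normal(size=300) for i in range(10)]
toy_risk = lambda a, b: float(abs(np.tanh(a @ b)))   # placeholder risk function in [0, 1]
print(project_risk(refs, toy_risk, p=15))
```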
4.2. Questionnaire Survey

Sample. We sampled the survey respondents in the following steps. First, we identified contact authors whose email addresses are available in WoS publication records. We then selected contact authors who had published at least 4 English-written papers, to ensure that the authors have sufficient research experience and can reasonably assess the corresponding risk. For each author, we selected the one paper that was most recently published as of 2018. To avoid recall bias, we excluded authors whose latest publication was in 2015 or earlier. Finally, we randomly sampled 4,625 authors. After three rounds of requests, 397 invitations bounced back and 378 responses were collected (response rate = 8.9%).

Questionnaire. We developed a questionnaire survey on various qualities of scientific papers based on interviews with scientists and tested it with a small-scale pilot survey. Of the survey questions, this study draws on two items concerning risk, which assess how the respondents perceived the risk of their project in two respects. This part of the survey started with "The following questions are about the research project that led to your paper: [the bibliographic information of the respondent's selected paper]" and was followed by two questions. First, we inquired into overall risk by asking "When you started the project, how confident were you that the project would reach the expected finding or any publishable finding at all?" (Fig.3A). The response takes three options: (0) "Confident that the expected finding would be obtained"; (1) "Confident that some publishable finding would be obtained"; and (2) "Not confident if any publishable finding would be obtained." Second, we inquired into technical risk by asking "When you started the project, did you anticipate any technical / methodological challenge that could fail your project?" (Fig.3B). The response takes three options: (0) "No, I anticipated no major challenge."; (1) "Yes, I anticipated a challenge, which I overcame."; and (2) "Yes, I anticipated a challenge, which forced us to change the research direction."

4.3. Validation

While our bibliometric risk indicators take continuous values between 0 and 1, the survey scores are ordinal with 3 values. To examine the correlation between the bibliometric indicators (p = 0, 10, …, 100) and the survey scores, we first regressed the bibliometric indicators on the survey scores, with the survey scores converted into ordinal dummy variables. Since our risk score is proportion data ranging between 0 and 1, we used a generalized linear model (GLM) with the logit link function and the binomial distribution (Hardin and Hilbe, 2018).

Table 1 shows the results of the regression analyses. Comparing the two survey scores, the results suggest that our bibliometric risk indicators are correlated with overall risk (Table 1A) but not with technical risk (Table 1B). Regarding overall risk, the results also show that overall risk = 2 ("not confident if any publishable finding would be obtained") is significantly positively correlated with the bibliometric indicators whereas overall risk = 1 ("confident that some publishable finding would be obtained") is not, which is as expected. Comparing the series of bibliometric risk indicators (p = 0, 10, …, 100), we find that the correlation with the survey score is significant at both low and high p values. To assess the magnitude of the correlations, we further computed the marginal effects of the survey scores on the bibliometric indicators. When overall risk changes from "0 or 1" to "2", Risk0 increases by 0.56 standard deviations (SD) and Risk100 increases by 0.37 SD, while Risk10 shows the largest increase of 0.72 SD.

We also computed Pearson's correlation coefficients between the survey scores and the bibliometric indicators, finding significant correlations for overall risk but not for technical risk (Fig.5). Concerning overall risk, the figure shows significantly positive correlations across a broad range of p values, but stronger correlations are observed particularly at lower p values. This implies that the risk of a project is determined by all the risks that the project faces. Fig.6 further illustrates the distribution of the risk indicators with relatively strong correlations with overall risk (p = {5, 10, 15, 20}). Comparing the high and low overall risk groups, it demonstrates that the high-risk group has greater risk scores.
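The kind of GLM used in this validation can be sketched as follows; this is a minimal illustration with toy data and our own column names standing in for the survey and bibliometric variables (the sub-field controls used in the paper are omitted here).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy data: one row per surveyed paper, a bibliometric indicator risk10 in (0, 1),
# and dummies for the ordinal survey answer (base category: overall risk = 0).
rng = np.random.default_rng(0)
answer = rng.integers(0, 3, 300)
df = pd.DataFrame({
    "risk10": np.clip(rng.beta(2, 20, 300) + 0.02 * answer, 0.001, 0.999),
    "overall_risk_1": (answer == 1).astype(int),
    "overall_risk_2": (answer == 2).astype(int),
})

# GLM with a logit link and the binomial family, robust standard errors.
model = smf.glm(
    "risk10 ~ overall_risk_1 + overall_risk_2",
    data=df,
    family=sm.families.Binomial(link=sm.families.links.Logit()),
)
print(model.fit(cov_type="HC1").summary())
```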
5. RISK AND NOVELTY

5.1. Divergent Validity

Having constructed the bibliometric risk indicator, we test how it is related to novelty. Previous studies found novelty and risk to be correlated (Franzoni et al., 2018) and sometimes treated them interchangeably (Machado, 2021; Reinhilde et al., 2022). To test this assumption, we examine the relationship between risk and novelty using bibliometric indicators of the two concepts. The sample for this analysis is the same as in Section 4.

Bibliometric indicators. We use the bibliometric risk indicators based on the SVM model. In particular, we test Riskp with p = {5, 10, 15, 20}, as these demonstrated relatively strong correlations with the survey risk score. As the novelty indicator, we draw on the recombinant novelty indicator proposed by Shibayama et al. (2021), as it employs an operationalization consistent with that of our recombination risk indicator. To assess the novelty of a focal paper, their approach first converts each reference of the focal paper into a document vector. Then, it computes the cosine distance between the document vectors of each reference pair. The cosine distances for all reference pairs are aggregated at the focal paper level by taking the maximum value (Novel). That is, the novelty is determined by the reference pair with the largest distance.
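A minimal sketch of this novelty measure, reusing document vectors as in Section 3, is given below; the function names are ours.

```python
from itertools import combinations
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus the cosine similarity of two document vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def novelty(ref_vectors) -> float:
    """Recombinant novelty of a focal paper: the largest cosine distance among its reference pairs."""
    return max(cosine_distance(vi, vj) for vi, vj in combinations(ref_vectors, 2))
```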
Correlation analysis. The results of the correlation analyses are summarized in Table 2. First, we test the correlation between the novelty indicator and the survey risk scores. The Pearson's correlation coefficient with overall risk is 0.070 (p > 0.1) and that with technical risk is -0.078 (p > 0.1). Thus, the recombinant novelty indicator does not appear to capture the risk perceived by scientists, contrary to what some previous studies assumed (Machado, 2021; Reinhilde et al., 2022). Second, we examine the correlation between the novelty indicator and the risk indicators. The results suggest negative rather than positive correlations (Risk5: -.098, p < 0.1), and the correlations are overall weak. Therefore, we do not find compelling evidence that the novelty indicator captures the particular risk concept studied in this paper.

5.2. Prediction of Impact

To further analyze the relationship between novelty and risk, we evaluate how these bibliometric indicators are associated with future citation impact. Previous studies suggested that while novel discoveries may attract more citations, they may also be neglected and thus receive fewer citations (Wang et al., 2017). In fact, several analyses of the relationship between bibliometric novelty indicators and future citation counts found both a positive relationship and an inverted-U-shaped relationship (Shibayama et al., 2021; Uzzi et al., 2013; Wang et al., 2017). Here, we investigate whether and how our indicator of recombination risk, together with novelty, is associated with future citation impact.

Setup of analysis. For this analysis, we use "top-1% cited" (TC) in the respective field as the dependent variable, coded 1 if the citation count of the paper is within the top 1% and 0 otherwise, and regress it on the novelty indicator (Novel) and a risk indicator (Risk15). We use Risk15 as it showed the highest correlation with the survey score in the previous section.[7] We randomly sampled 4,000 articles published in biomedicine in 2010 and evaluated their citation impact as of 2018. We oversampled top-1%-cited papers, so that the final sample consists of approximately 2,000 top-1%-cited papers and 2,000 non-top-1%-cited papers. As the dependent variable is a dummy variable, we draw on logistic regression.

[7] We tested other risk indicators and obtained consistent results.

Regression analysis. Table 3 reports the results of the logistic regressions. Models 1 and 2 test the relationship between novelty and future citation impact, finding a positive coefficient for the linear term and a negative coefficient for the quadratic term. Given the magnitudes of the coefficients, Model 2 suggests a positive but stagnating effect of novelty on future citation impact, which is consistent with previous findings (Shibayama et al., 2021). Models 3 and 4 then examine the relationship between risk and future citation impact, finding a negative coefficient for the linear term and a positive coefficient for the quadratic term. Again, given the magnitudes of the coefficients, Model 4 suggests a negative but diminishing effect of risk on future citation impact. As risk and novelty have been confounded in the literature, Model 5 includes both the novelty and risk indicators. The magnitudes of the coefficients change slightly, but the overall relationships remain qualitatively similar. Further, to assess whether the risk and novelty indicators have any interaction effect on future citations, Models 6 and 7 introduce various interaction terms, without finding a significant effect. Thus, the novelty and risk indicators are associated with future citations through different mechanisms, which also supports our argument that the two indicators capture different concepts.

To visually illustrate the result, Fig.7 presents a contour map of the predicted citation impact over a range of novelty and risk values. In the horizontal direction, the figure shows that greater novelty is associated with higher citation impact. In the vertical direction, it shows that lower risk is associated with higher citation impact. Taken together, higher-than-average citation impact (the area below the red contour curve) occurs only with high novelty and low risk. In particular, the result suggests that risky recombination, even if successful, causes a disadvantage in attracting future citations, although high-risk research has been encouraged (Machado, 2021; OECD, 2021).
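The quadratic specification underlying these models can be sketched as follows (corresponding roughly to Model 5); toy data and our own column names stand in for the real sample, and the sub-field controls and sampling weights used in the paper are omitted.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data: one row per sampled paper, tc = 1 if top-1% cited, plus the two indicators.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "tc": rng.integers(0, 2, 500),
    "novel": rng.uniform(0, 1, 500),
    "risk15": rng.uniform(0, 0.4, 500),
})

# Logistic regression with linear and quadratic terms for novelty and risk.
logit = smf.logit("tc ~ novel + I(novel**2) + risk15 + I(risk15**2)", data=df).fit()
print(logit.summary())
```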
6. DISCUSSION AND CONCLUSIONS

As risk plays a fundamental role in science, it is of scholarly and practical interest to quantify the degree of risk in science. Nonetheless, the concept of risk has been understudied (Althaus, 2005; Aven, 2011; Franzoni and Stephan, 2021; Hansson, 2018), and to the best of our knowledge there has been no approach to measure risk in science on a large scale, except that bibliometric novelty indicators have been suggested as a proxy of risk (Machado, 2021; Reinhilde et al., 2022).

To fill this gap, this study aimed to develop an approach to quantify risk in science by focusing on recombination as a particular mode of the discovery process. We proposed an indicator of risk computed on the basis of a machine learning model. We operationalized the recombination of knowledge elements as a pair of papers being co-cited, and the risk of recombination as the probability of the paper pair not being co-cited. We tested several classification algorithms to develop the model, among which SVM demonstrated the best performance. The resulting risk indicator was then tested against survey data, which confirmed that our risk indicator is positively correlated with the self-assessed overall risk of a project. Finally, while risk and novelty have been confounded in the literature (Franzoni and Stephan, 2021; Machado, 2021; Reinhilde et al., 2022), we showed the divergence of the bibliometric novelty and risk indicators.

Overall, this study makes a scholarly contribution to the underdeveloped literature on risk in science (Franzoni and Stephan, 2021; Machado, 2021; Reinhilde et al., 2022) by providing the first validated indicator of a particular type of risk. We expect that the application of the risk indicator will deepen our understanding of risk in science – what motivates risk-taking, what the consequences of risk-taking are, how risk-taking should be rewarded, and so forth. Practically, the proposed approach offers a flexible tool for quantifying risk and can benefit scientists and relevant parties such as policymakers, funding bodies, and R&D managers. The risk indicator may also be used for policy evaluation, for example by assessing whether a certain policy encourages or discourages scientists' risk-taking behavior. These uses are especially relevant because current policies tend to emphasize high-risk, high-return research projects (Machado, 2021; OECD, 2021).

We expect that the proposed method is applicable not only to scientific papers but also to other types of scientific text, such as a paragraph describing a research idea or a funding proposal, although such applications need further examination. These applications should assist decision-makers in assessing the feasibility of a research project and help identify the potential risks involved in a project. To facilitate such applications, we prioritized two features in designing our approach. First, our approach relies solely on text data. As long as a knowledge element of interest is provided in the form of text, we can convert it into word embedding vectors and compute the risk indicator. Second, our risk indicator can be calculated ex ante. Some existing bibliometric indicators are retrospectively computed, for example drawing on citation network data, and thus can be computed only after papers are published. This is not ideal because one wants to know the degree of risk before taking it. Our approach overcomes this limitation by not requiring post-publication information.

Despite these contributions, further refinement and development of risk indicators are warranted. First, we focused on risk in a specific type of scientific progress – recombination – but other modes of scientific progress should also involve risk. Thus, future research should develop methods to quantify risk in broader modes of scientific progress. Second, we tested our approach only in the biomedical field because of the limitations of the word embedding model.
To extend our approach to various disciplines, future research needs to develop word embeddings with broader text data and examine whether they can capture the semantic distance between words across disciplines. Third, there is room for improvement in extracting semantic information from documents. Though our approach draws on the titles and abstracts of scientific papers, other parts of documents (e.g., the method section, or even the full text) might be informative. Further, while our approach computes a document vector by averaging word vectors, document vectors could also be computed directly (e.g., with doc2vec). Our choices in these respects prioritized the applicability of the approach, but future research could improve the indicator by constructing document vectors differently. Fourth, though we drew on all possible reference pairs in constructing the risk indicator of a paper, not all pairs may represent intended recombinations, which can cause errors. This might be addressed by looking into how references are cited in the focal paper and focusing on the relevant reference pairs. Finally, it is of interest to investigate the sources of risk. While novelty is a plausible source of risk (Machado, 2021; Reinhilde et al., 2022), our results do not show a correlation between risk and novelty. This suggests that risk is attributable to more than novelty, but what that is remains unclear. Future research could dissect the recombination of knowledge elements to understand the sources of risk.

References

Althaus, C.E., 2005. A Disciplinary Perspective on the Epistemological Status of Risk. Risk Analysis 25, 567-588.
Arthur, W.B., 2007. The Structure of Invention. Research Policy 36, 274-287.
Aven, T., 2011. On Some Recent Definitions and Analysis Frameworks for Risk, Vulnerability, and Resilience. Risk Analysis 31, 515-522.
Boudreau, K.J., Guinan, E.C., Lakhani, K.R., Riedl, C., 2016. Looking across and Looking Beyond the Knowledge Frontier: Intellectual Distance, Novelty, and Resource Allocation in Science. Management Science 62, 2765-2783.
Bourdieu, P., 1975. The Specificity of the Scientific Field and the Social Conditions for the Progress of Reason. Social Science Information 14, 19-47.
Breiman, L., 2001. Random Forests. Machine Learning 45, 5-32.
Butun, E., Kaya, M., 2020. Predicting Citation Count of Scientists as a Link Prediction Problem. IEEE Transactions on Cybernetics 50, 4518-4529.
Chen, T., Guestrin, C., 2016. XGBoost: A Scalable Tree Boosting System, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, San Francisco, California, USA, pp. 785-794.
Cortes, C., Vapnik, V., 1995. Support-Vector Networks. Machine Learning 20, 273-297.
Dahlin, K.B., Behrens, D.M., 2005. When Is an Invention Really Radical? Defining and Measuring Technological Radicalness. Research Policy 34, 717-737.
Daud, A., Ahmed, W., Amjad, T., Nasir, J.A., Aljohani, N.R., Abbasi, R.A., Ahmad, I., 2017. Who Will Cite You Back? Reciprocal Link Prediction in Citation Networks. Library Hi Tech 35, 509-520.
Fleming, L., 2001. Recombinant Uncertainty in Technological Search. Management Science 47, 117-132.
Fontana, M., Iori, M., Montobbio, F., Sinatra, R., 2020. New and Atypical Combinations: An Assessment of Novelty and Interdisciplinarity. Research Policy 49, 28.
Foster, J.G., Rzhetsky, A., Evans, J.A., 2015. Tradition and Innovation in Scientists' Research Strategies. American Sociological Review 80, 875-908.
Franzoni, C., Scellato, G., Stephan, P., 2012. Foreign-Born Scientists: Mobility Patterns for 16 Countries. Nature Biotechnology 30, 1250-1253.
Franzoni, C., Scellato, G., Stephan, P., 2018. Context Factors and the Performance of Mobile Individuals in Research Teams. Journal of Management Studies 55, 27-59.
Franzoni, C., Stephan, P., 2021. Uncertainty and Risk-Taking in Science: Meaning, Measurement and Management. National Bureau of Economic Research Working Paper Series No. 28562.
Gewin, V., 2012. Risky Research: The Sky's the Limit. Nature 487, 395-397.
Hagstrom, W.O., 1974. Competition in Science. American Sociological Review 39, 1-18.
Hall, B.H., Jaffe, A., Trajtenberg, M., 2001. The NBER Patent Citations Data File: Lessons, Insights, and Methodological Tools. NBER Working Paper 8498.
Hansson, S.O., 2018. Risk, in: Zalta, E.N. (Ed.), The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University.
Hardin, J.W., Hilbe, J.M., 2018. Generalized Linear Models and Extensions, 4th ed. Stata Press, TX, USA.
Kaplan, S., Garrick, B.J., 1981. On the Quantitative Definition of Risk. Risk Analysis 1, 11-27.
Kenter, T., Borisov, A., de Rijke, M., 2016. Siamese CBOW: Optimizing Word Embeddings for Sentence Representations, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 941-951.
Kuhn, T.S., 1970. The Structure of Scientific Revolutions. University of Chicago Press, Chicago, IL.
Kuhn, T.S., 1977. The Essential Tension: Selected Studies in Scientific Tradition and Change. University of Chicago Press, Chicago.
Levitt, J.M., Thelwall, M., 2011. A Combined Bibliometric Indicator to Predict Article Impact. Information Processing & Management 47, 300-308.
Lin, Y., Evans, J.A., Wu, L., 2022. New Directions in Science Emerge from Disconnection and Discord. Journal of Informetrics 16, 101234.
Linton, J.D., 2016. Improving the Peer Review Process: Capturing More Information and Enabling High-Risk/High-Return Research. Research Policy 45, 1936-1938.
Liu, W.Y., Nanetti, A., Cheong, S.A., 2017. Knowledge Evolution in Physics Research: An Analysis of Bibliographic Coupling Networks. PLoS ONE 12, 19.
Machado, D., 2021. Quantitative Indicators for High-Risk/High-Reward Research, OECD Science, Technology and Industry Working Papers. OECD Publishing, Paris.
MacRoberts, M.H., MacRoberts, B.R., 1989. Problems of Citation Analysis: A Critical Review. Journal of the American Society for Information Science 40, 342-349.
Marinacci, M., 2015. Model Uncertainty. Journal of the European Economic Association 13, 1022-1100.
Matsumoto, K., Shibayama, S., Kang, B., Igami, M., 2020. A Validation Study of Knowledge Combinatorial Novelty, NISTEP Discussion Paper. NISTEP, Tokyo.
Mazzolini, A., Colliva, A., Caselle, M., Osella, M., 2018. Heaps' Law, Statistics of Shared Components, and Temporal Patterns from a Sample-Space-Reducing Process. Physical Review E 98.
Mednick, S.A., 1962. The Associative Basis of the Creative Process. Psychological Review 69, 220-232.
Merton, R.K., 1965. On the Shoulders of Giants. University of Chicago Press, Chicago, IL.
Merton, R.K., 1973. Sociology of Science. University of Chicago Press, Chicago.
Merton, R.K., Barber, E., 2004. The Travels and Adventures of Serendipity: A Study in Sociological Semantics and the Sociology of Science. Princeton University Press, Princeton.
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient Estimation of Word Representations in Vector Space. arXiv.
Min, C., Bu, Y., Wu, D., Ding, Y., Zhang, Y., 2021. Identifying Citation Patterns of Scientific Breakthroughs: A Perspective of Dynamic Citation Process. Information Processing & Management 58, 102428.
OECD, 2021. Effective Policies to Foster High-Risk/High-Reward Research, OECD Science, Technology and Industry Policy Papers. OECD Publishing, Paris.
Palchykov, V., Krasnytska, M., Mryglod, O., Holovatch, Y., 2021. A Mechanism for Evolution of the Physical Concepts Network. Condensed Matter Physics 24.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Müller, A., Nothman, J., Louppe, G., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É., 2012. Scikit-Learn: Machine Learning in Python. arXiv.
Reinhilde, V., Jian, W., Paula, S., 2022. Do Funding Agencies Select and Enable Risky Research: Evidence from ERC Using Novelty as a Proxy of Risk Taking. National Bureau of Economic Research, Inc.
Sebastian, Y., Siew, E.G., Orimaye, S.O., 2015. Predicting Future Links between Disjoint Research Areas Using Heterogeneous Bibliographic Information Network, Advances in Knowledge Discovery and Data Mining, Part II. Springer-Verlag Berlin, Berlin, pp. 610-621.
Shibata, N., Kajikawa, Y., Sakata, I., 2012. Link Prediction in Citation Networks. Journal of the American Society for Information Science and Technology 63, 78-85.
Shibayama, S., 2019. Sustainable Development of Science and Scientists: Academic Training in Life Science Labs. Research Policy 48, 676-692.
Shibayama, S., Yin, D., Matsumoto, K., 2021. Measuring Novelty in Science with Word Embedding. PLoS ONE 16, e0254034.
Simonton, D.K., 2003. Scientific Creativity as Constrained Stochastic Behavior: The Integration of Product, Person, and Process Perspectives. Psychological Bulletin 129, 475-494.
Sun, Y., Latora, V., 2020. The Evolution of Knowledge within and across Fields in Modern Physics. Scientific Reports 10.
Trapido, D., 2015. How Novelty in Knowledge Earns Recognition: The Role of Consistent Identities. Research Policy 44, 1488-1500.
Tria, F., Loreto, V., Servedio, V.D.P., 2018. Zipf's, Heaps' and Taylor's Laws Are Determined by the Expansion into the Adjacent Possible. Entropy 20.
Tu, Y.-N., Seng, J.-L., 2012. Indices of Novelty for Emerging Topic Detection. Information Processing & Management 48, 303-325.
Uzzi, B., Mukherjee, S., Stringer, M., Jones, B., 2013. Atypical Combinations and Scientific Impact. Science 342, 468-472.
Wang, J., Veugelers, R., Stephan, P., 2017. Bias against Novelty in Science: A Cautionary Tale for Users of Bibliometric Indicators. Research Policy 46, 1416-1436.
Wang, Y., Jones, B.F., Wang, D., 2019. Early-Career Setback and Future Career Impact. Nature Communications 10, 4331.
Whitley, R., 1984. The Intellectual and Social Organization of the Sciences. Oxford University Press, New York.
Yang, J., Lu, W., Hu, J., Huang, S., 2022. A Novel Emerging Topic Detection Method: A Knowledge Ecology Perspective. Information Processing & Management 59, 102843.
Yaqub, O., 2018. Serendipity: Towards a Taxonomy and a Theory. Research Policy 47, 169-179.
Yin, D., Wu, Z., Yokota, K., Matsumoto, K., Shibayama, S., 2022. Identify Novel Elements of Knowledge with Word Embedding.
Figures and Tables

Fig.1 Performance of Link Prediction Classifiers
[Bar chart of precision, recall, and F1 (0.0-1.0) for each classifier.]
Note. BNB: Bernoulli Naïve Bayes, GNB: Gaussian Naïve Bayes, LR: logistic regression, RR: ridge regression, LDA: linear discriminant analysis, QDA: quadratic discriminant analysis, RF: random forest, XGBoost: eXtreme Gradient Boosting, and SVM: support vector machine.

Fig.2 Distribution of Recombination Risk
[Density of risk scores (0-1) for co-cited and non-co-cited pairs.]
Note. Based on the SVM model.

Fig.3 Survey Scores of Recombination Risk
[Pie charts of the response shares for (A) overall risk and (B) technical risk.]
Note: (A) "When you started the project, how confident were you that the project would reach the expected finding or any publishable finding at all?" (0) Confident that the expected finding would be obtained; (1) Confident that some publishable finding would be obtained; (2) NOT confident if any publishable finding would be obtained. (B) "When you started the project, did you anticipate any technical / methodological challenge that could fail your project?" (0) No, I anticipated no major challenge; (1) Yes, I anticipated a challenge, which I overcame; (2) Yes, I anticipated a challenge, which forced us to change the research direction.

Fig.4 Distribution of Recombination Risk of Project
[Histograms of Risk(0) through Risk(100) in steps of 10.]
Note. N = 378.

Fig.5 Correlation between Bibliometric and Survey Risk Scores
[Correlation coefficients between Risk(p) and the overall/technical risk scores across p = 0-100.]
Note. N = 353. Pearson's correlation coefficients. †p<0.1. *p<0.05. **p<0.01. ***p<0.001. We dichotomized overall/technical risk by assigning 1 if overall/technical = 2 and 0 otherwise.

Fig.6 Distribution of Risk Indicators by Overall Risk
[Densities of Risk(5), Risk(10), Risk(15), and Risk(20) for the high and low overall risk groups.]
Note. High risk: overall risk = 2. Low risk: overall risk = 0 or 1.

Fig.7 Prediction of Impact
[Contour map of predicted citation impact over Novel (horizontal) and Risk15 (vertical) percentiles.]
Note. The contour map of prob.(TC=1) based on Model 5 in Table 3. The red curve indicates the baseline (prob.(TC=1) = 0.01), below which prob.(TC=1) > 0.01. The novelty and risk indicators are scaled in their percentile values (e.g., 50 is the median of the indicators).
Table 1 Regression Analysis

(A) Overall Risk (base category: overall risk = 0)

          Overall risk = 1   Overall risk = 2   Chi squared   Log likelihood   N
Risk0     -.039 (.232)       1.151* (.516)      27.065***     -9.057           353
Risk10    -.107 (.135)       .689* (.314)       15.704**      -32.567          353
Risk20    -.104 (.120)       .608* (.279)       15.303**      -49.449          353
Risk30    -.080 (.115)       .541* (.246)       13.902**      -66.953          353
Risk40    -.085 (.108)       .459* (.226)       11.535*       -85.503          353
Risk50    -.080 (.105)       .422* (.210)       10.468*       -105.082         353
Risk60    -.044 (.103)       .374† (.198)       8.431†        -127.024         353
Risk70    .046 (.100)        .307 (.194)        5.384         -149.233         353
Risk80    -.053 (.099)       .277 (.180)        3.784         -168.306         353
Risk90    -.064 (.105)       .227 (.177)        4.067         -175.925         353
Risk100   -.109 (.166)       .595* (.272)       22.948***     -110.451         353

(B) Technical Risk (base category: technical risk = 0)

          Technical risk = 1   Technical risk = 2   Chi squared   Log likelihood   N
Risk0     .000 (.336)          -.057 (.358)         3.937         -9.194           352
Risk10    .034 (.170)          .082 (.213)          5.352         -32.767          352
Risk20    .030 (.144)          .0176 (.190)         5.737         -49.672          352
Risk30    .040 (.130)          .033 (.174)          5.833         -67.143          352
Risk40    .045 (.119)          .019 (.162)          4.864         -85.606          352
Risk50    .058 (.114)          -.013 (.154)         4.661         -105.021         352
Risk60    .090 (.110)          .014 (.148)          4.413         -126.730         352
Risk70    .131 (.107)          .053 (.142)          3.419         -148.766         352
Risk80    .133 (.106)          .003 (.137)          2.191         -167.824         352
Risk90    .205† (.112)         .000 (.136)          5.245         -175.229         352
Risk100   .216 (.174)          .182 (.216)          18.682***     -110.536         352

Note. Generalized linear model with a logit link and the binomial family. Each row reports a separate model with the corresponding Riskp as the dependent variable. Unstandardized coefficients (robust standard errors in parentheses). Two-tailed test. †p<0.1. *p<0.05. **p<0.01. ***p<0.001. The sub-fields within biomedicine are controlled for.

Table 2 Divergent Validity

          Survey                              Bibliometric
          Overall risk    Technical risk      Risk5     Risk10    Risk15    Risk20
Novel     .070            -.078               -.098†    -.058     -.017     .008

Note. N = 353. Pearson's correlation coefficients. †p<0.1.

Table 3 Prediction of Impact (Top-1% Cited)

                    Model 1           Model 2           Model 3             Model 4             Model 5             Model 6             Model 7
Novel               2.217*** (.156)   8.183*** (1.146)                                          6.366*** (1.173)    6.551*** (1.186)    7.896*** (1.708)
Novel^2                               -3.276*** (.638)                                          -2.285*** (.656)    -2.311*** (.658)    -2.994** (.984)
Risk15                                                  -14.711*** (1.396)  -18.375*** (1.374)  -18.751*** (1.527)  -14.101*** (3.268)  7.810 (13.020)
Risk15^2                                                                    17.581*** (1.722)   18.146*** (1.866)   17.698*** (1.854)   -3.513 (19.653)
Novel × Risk15                                                                                                      -4.977 (3.375)      -50.957 (34.117)
Novel^2 × Risk15                                                                                                                        22.869 (21.524)
Novel × Risk15^2                                                                                                                        30.728 (57.463)
Novel^2 × Risk15^2                                                                                                                      -7.093 (39.075)
Chi-squared stat    200.967***        216.578***        116.286***          192.494***          307.846***          330.828***          375.162***
Log likelihood      -109.684          -109.415          -108.769            -108.551            -106.422            -106.409            -106.382
N                   3903              3903              3903                3903                3903                3903                3903

Note. Logistic regressions. Unstandardized coefficients (robust standard errors in parentheses). Two-tailed test. †p<0.1. *p<0.05. **p<0.01. ***p<0.001. The sampling weight is incorporated in the regression analysis. The sub-fields within biomedicine are controlled for. We also ran OLS using the citation count as the dependent variable with the same set of independent variables and obtained similarly significant results (see Supplementary Information).
Supplementary Information

Table S1 Prediction of Impact (Citation Count)

                    Model 1           Model 2           Model 3           Model 4           Model 5           Model 6           Model 7
Novel               1.611*** (.107)   3.466*** (.537)                                       2.989*** (.548)   3.205*** (.567)   4.465*** (.769)
Novel^2                               -1.145*** (.321)                                      -.916** (.325)    -.983** (.328)    -1.707*** (.459)
Risk15                                                  -2.280*** (.264)   -4.650*** (.532)  -4.000*** (.504)  -2.780** (.885)   8.033† (4.443)
Risk15^2                                                                   4.092*** (.830)   3.620*** (.783)   3.514*** (.802)   -12.653* (6.321)
Novel × Risk15                                                                                                -1.540† (.875)    -29.007* (12.171)
Novel^2 × Risk15                                                                                                                15.944* (7.910)
Novel × Risk15^2                                                                                                                40.330* (18.505)
Novel^2 × Risk15^2                                                                                                              -23.023† (12.578)
F stat              133.632***        92.677***         54.220***          52.451***         79.390***         66.543***         51.255***
R2                  .116              .121              .060               .070              .157              .158              .160
N                   3903              3903              3903               3903              3903              3903              3903

Note. OLS. Unstandardized coefficients (robust standard errors in parentheses). Two-tailed test. †p<0.1. *p<0.05. **p<0.01. ***p<0.001. The sampling weight is incorporated in the regression analysis. The sub-fields within biomedicine are controlled for.