QUANTITATIVE RISK ASSESSMENT MODEL FOR SOFTWARE SECURITY IN
THE DESIGN PHASE OF SOFTWARE DEVELOPMENT
Except where reference is made to the work of others, the work described in this
dissertation is my own or was done in collaboration with my advisory committee. This
dissertation does not include proprietary or classified information.
__________________________
Idongesit Okon Mkpong-Ruffin
Certificate of Approval:
_________________________________
David A. Umphress, Co-Chair
Associate Professor,
Computer Science and Software Engineering
________________________________
John A. Hamilton, Co-Chair
Associate Professor,
Computer Science and Software Engineering
_________________________________
Juan E. Gilbert
Associate Professor,
Computer Science and Software Engineering
________________________________
George T. Flowers
Dean
Graduate School
QUANTITATIVE RISK ASSESSMENT MODEL FOR SOFTWARE SECURITY IN
THE DESIGN PHASE OF SOFTWARE DEVELOPMENT
Idongesit Mkpong-Ruffin
A Dissertation
Submitted to
the Graduate Faculty of
Auburn University
in Partial Fulfillment of the
Requirements for the
Degree of
Doctor of Philosophy
Auburn, Alabama
May 9, 2009
QUANTITATIVE RISK ASSESSMENT MODEL FOR SOFTWARE SECURITY IN
THE DESIGN PHASE OF SOFTWARE DEVELOPMENT
Idongesit Mkpong-Ruffin
Permission is granted to Auburn University to make copies of this dissertation at its
discretion, upon request of individuals or institutions and at their expense. The author
reserves all publication rights.
_____________________________
Signature of Author
_____________________________
Date of Graduation
DISSERTATION ABSTRACT
QUANTITATIVE RISK ASSESSMENT MODEL FOR SOFTWARE SECURITY IN
THE DESIGN PHASE OF SOFTWARE DEVELOPMENT
Idongesit Mkpong-Ruffin
Doctor of Philosophy, May 9, 2009
(M.S. Troy University, 2007)
(M.B.A., Tennessee State University, 1992)
(B.S. Freed-Hardeman University, 1985)
186 Typed Pages
Directed by David A. Umphress and John A. Hamilton
Risk analysis is a process for considering possible risks and determining which
are the most significant for any particular effort. Determining which risks to address and
the optimum strategy for mitigating said risks is often an intuitive and qualitative process.
An objective view of the risks inherent in a development effort requires a quantitative
risk model. Quantitative risk models used in determining which risk factors to focus on
tend to use a traditional approach of annualized loss expectancy (ALE) based on
frequency of occurrence and the exposure factor (EF), which is the percentage of asset
loss due to the potential threat in question. This research uses empirical data that reflects
the security posture of each vulnerability to calculate Loss Expectancy, a risk impact
estimator. Data from open source vulnerability databases and results of predicted threat
models are used as input to the risk model. Security factors that take into account the
innate characteristics of each vulnerability are incorporated into the calculation of the risk
model. The result of this model is an assessment of the potential threats to a development
effort and a ranking of these threats based on the risk metric calculation.
ACKNOWLEDGEMENTS
I have been very blessed to have been advised, supported, and encouraged by the
different members of my committee during the different stages of my educational
journey. I thank Dr. David Umphress for his patience, deliberate guidance, and direction.
I would also like to thank Dr. John Hamilton for his insight and support. I would like to
give my thanks to Dr. Juan Gilbert for his support and encouragement throughout my
stay at Auburn.
I would like to thank the members of the Human Centered Computing Lab
(HCCL), the Information Assurance Lab and other members of the Computer Science
department for their support. Their helpful comments, suggestions and acts of kindness
are very much appreciated.
I would also like to thank my family, especially my parents, Obong Okon
Mkpong and ObongAwan Affiong Mkpong, for instilling in me the zeal for knowledge,
as well as my sisters and brothers for their continued support and belief in me and my
ability to achieve this milestone. A very special thanks to my husband, Dale, and
children, Akaninyene and Aniekan, who have been my motivation. Without their support,
patience and encouragement I would not have started nor finished this program. I
dedicate this work to them.
Most of all, special thanks to God-- for His mercies, for putting each person in my
path to strengthen me and make this journey what it has become. His protection and
guidance have been very evident, and I thank Him.
Style manual or journal used: ACM Computing Surveys
Computer software used: Microsoft Word, Microsoft Excel, Microsoft Visual Studio (C#,
SQL Server, SQL Analysis Server)
TABLE OF CONTENTS
TABLE OF CONTENTS .................................................................................................viii
LIST OF FIGURES...........................................................................................................xii
LIST OF TABLES ...........................................................................................................xiv
LIST OF EQUATIONS .................................................................................................... xv
1 INTRODUCTION............................................................................................................ 1
1.1 Problem Statement .................................................................................................... 1
1.2 Research Objective.................................................................................................... 2
1.3 Background ............................................................................................................... 2
1.3.1 Risk Assessment................................................................................................. 4
1.3.2 Threat Identification........................................................................................... 6
1.3.3 Vulnerability Identification ................................................................................ 7
1.3.4 Likelihood Determination .................................................................................. 7
1.3.5 Impact Determination......................................................................................... 8
1.3.6 Risk Determination ............................................................................................ 9
2 STATE OF RISK ASSESSMENT................................................................................. 11
2.1 Threat/Vulnerability Identification ......................................................................... 11
2.1.1 Threat/Vulnerability Categorization ................................................................ 12
2.2 Software Testing and Assessment........................................................................... 13
2.2.1 Black-Box Testing............................................................................................ 14
2.2.2 White-Box Testing ........................................................................................... 15
2.3 Stakeholder’s Goals................................................................................................. 17
2.3.1 Boehm Win-Win approach............................................................................... 17
2.3.2 RISKIT ............................................................................................................. 18
2.4 Risk Management Framework (RMF) .................................................................... 19
2.5 Software Reliability................................................................................................. 20
2.6 Traditional approach ............................................................................................... 20
2.7 Cross-Disciplinary ties to Insurance ....................................................................... 21
2.8 Problems with Existing Approaches ....................................................................... 21
3 RESEARCH DETAILS ................................................................................................. 25
3.1 Conceptual Overview ........................................................................................... 26
3.2 Data Preparation ................................................................................................... 27
3.3 Parsing the Data .................................................................................................... 27
3.3.1 Validation of Data Entry ............................................................................... 29
3.4 Cluster Determination .......................................................................................... 30
3.4.1 Validation of Clustering ................................................................................ 35
3.5 Data Classification ............................................................................................... 37
3.5.1 Data Classification Validation ...................................................................... 41
3.6 Loss Expectancy Determination ........................................................................... 42
3.6.1 Determining Fields to Predict CVSS Score .................................................. 43
3.6.1.1 Regression Analysis: Cluster Node versus Loss Type, Vulnerability Type and Exploit Range ........ 45
3.6.1.2 Regression Analysis: CVSS Score versus Loss Type, Vulnerability Type and Exploit Range ......... 47
3.6.2 Determining Loss Expectancy ...................................................................... 48
3.6.2.1 Validation of Loss Expectancy Calculations .......................................... 50
3.6.3 Validation of Loss Expectancy ..................................................................... 52
3.7 Confidence Interval of Predictions ....................................................................... 57
3.8 Summary ............................................................................................................... 58
4 APPLICATION OF MODEL TO CASE STUDY ........................................................ 61
4.1 Vulnerability Identification and Data Preparation ................................................ 62
4.2 Data Classification and Loss Expectation Determination .................................... 64
5 SUMMARY, FUTURE WORK AND CONCLUSIONS.............................................. 71
5.1 Summary ................................................................................................................. 71
5.2 Future Work ............................................................................................................ 74
5.3 Conclusions ............................................................................................................. 75
BIBLIOGRAPHY ............................................................................................................. 78
APPENDICES................................................................................................................... 84
Appendix A-1 – The NVD Schema .............................................................................. 85
Appendix A-2 – An Example of an NVD entry............................................................ 92
Appendix A-3 – Script for Parsing XML document ..................................................... 94
Appendix A-4 – Validation of Data Upload ............................................................... 112
Appendix B-1 – Creating Term Vector....................................................................... 113
Appendix B-2 – Determining Clustering Algorithm................................................... 117
Appendix B-3 – Query for Identifying Clusters in Training Data .............................. 127
Appendix B-4 – Classified Data Matrix...................................................................... 129
Appendix B-5 – Validation of Clustering Algorithms ................................................ 132
Appendix B-6 – Processing Time for Classifiers........................................................ 135
Appendix B-7 – Calculating CVSS ‘base’ Score........................................................ 136
Appendix C-1 – Stored Procedure for Calculating Impact Score Attributes .............. 138
Appendix C-2 – Stored Procedure for Determining Prioritized Listing ..................... 140
Appendix C-3 – Stored Procedure for Persisting Predictions ..................................... 141
Appendix D-1 – Singleton Request Algorithm ........................................................... 142
Appendix D-2 – Data Table Request (Code for Load Data)....................................... 146
Appendix D-3 – Data for SSRAM Validation ............................................................ 155
Appendix E-1 – Microsoft’s TAMT Listing of Vulnerabilities.................................. 160
Appendix E-2 – Prime III Data ................................................................................... 165
Appendix E-3 – Prime III Predictions......................................................................... 166
Appendix F- Vulnerability Classification ................................................................... 171
LIST OF FIGURES
Figure 1 -- Difference between Speculative and Hazard Risk [Alberts 2006].................... 3
Figure 2 - Risk Function...................................................................................................... 4
Figure 3 - Risk Management Cycle [GAO AIMD-00-33].................................................. 5
Figure 4 - Top-N Risk Item List[Boehm 2001] ................................................................ 18
Figure 5 - Definition of Risk in Riskit Method [Kontio et al. 1998] ................................ 19
Figure 6 -Example of the Riskit analysis graph (risk scenarios) [Kontio et al. 1998] ...... 19
Figure 7 -Reported Incidences [NVD] .............................................................................. 25
Figure 8 - SSRAM's Cluster Creation ............................................................................... 26
Figure 9 - Software Security Risk Assessment ................................................................. 26
Figure 10 - NVD's First Level Node ................................................................................. 28
Figure 11- NVD Description Node ................................................................................... 28
Figure 12 - Node with nested lower level tags.................................................................. 29
Figure 13 - SSRAM Schema............................................................................................. 29
Figure 14 - Independent Variable Selection Statement..................................................... 31
Figure 15 -Clustering without Description........................................................................ 33
Figure 16 - Clustering with Description............................................................................ 34
Figure 17 - Cluster Assignment Query Example .............................................................. 35
Figure 18 -Mining Accuracy Chart ................................................................................... 36
Figure 19 -Classifiers with Description Vector................................................................. 40
Figure 20 - Classifiers without Description Vector .......................................................... 40
Figure 21 - Prediction Accuracy of Classifiers ................................................................. 42
Figure 22 - CVSS base Score and Vector ......................................................................... 43
Figure 23 - Suggested Input Fields .................................................................................. 44
Figure 24 - Multiple Regression Result (Cluster Node) ................................................... 46
Figure 25 - Multiple Regression Result (CVSS score) ..................................................... 47
Figure 26-Query for Sample data to validate Calculation................................................. 50
Figure 27 - Calculation of Impact Factors Result (Stored Procedure).............................. 51
Figure 28 -Single Entry Impact Estimation....................................................................... 52
Figure 29 Vulnerability Entries (Test Data – 2002).......................................................... 53
Figure 30 - Predicted Values............................................................................................. 53
Figure 31 - Impact and Loss Expectation Estimation ....................................................... 53
Figure 32: Prime III System Architecture High-Level Overview..................................... 62
Figure 33 - Load Prime III ................................................................................................ 65
Figure 34 - Prime III De-normalized Predictions.............................................................. 65
Figure 35 - Prime III Loss Expectation (Prioritized List) – Predictions ........................... 69
Figure 36 - Prime III Loss Expectation (Prioritized List) – 2006 Predictions .................. 69
Figure 37 - Processing Time for Classification Algorithm ............................................. 135
LIST OF TABLES
Table 1 Likelihood Definitions [SP 800-30]....................................................................... 8
Table 2 Magnitude of Impact Definitions [SP 800-12]....................................................... 9
Table 3 Risk-Level Matrix ................................................................................................ 10
Table 4 Software Risk Management Techniques [Boehm 2001] ..................................... 17
Table 5 - Cluster Algorithm Cohesion Factors ................................................................. 37
Table 6 - Classifier Processing Time & Score – Cluster Classifier without Description . 39
Table 7 - Classification Confusion Matrix – Classifier without Description.................... 41
Table 8 – Calculation of Impact Factors (Excel Result) ................................................... 51
Table 9 - t-Test: Paired Two Sample for Means ............................................................... 54
Table 10 - Confidence Interval Derivation ....................................................................... 57
Table 11 - Prime III Vulnerability List ............................................................................. 63
Table 12- Prime III Normalized Tables ............................................................................ 64
Table 13 - Example of PrimeIII's Input Data ................................................................... 64
Table 14 – Comparison of SSRAM with other Risk Analysis Methodologies................. 76
Table 15 - Data Entry Validation Entries........................................................................ 112
LIST OF EQUATIONS
Equation 1 - Traditional Annual Loss Expectancy ........................................................... 21
Equation 2 - TFDIF ........................................................................................................... 31
Equation 3 – Accuracy Determination .............................................................................. 41
Equation 4 – Predicted Loss Expectancy (PLE) ............................................................... 48
Equation 5 - Predicted Impact Score (PIS) ....................................................................... 49
Equation 6 - Weighted Average CVSS Score................................................................... 49
Equation 7 - Growth Rate.................................................................................................. 49
Equation 8 - Margin of Error............................................................................................. 57
1 INTRODUCTION
1.1 Problem Statement
Assessing security risks in software is predominantly a qualitative process. Traditionally,
efforts to deal with security vulnerabilities focus on hardening networks and peripherals
that have access to computer systems. Hardening of networks and peripherals entails the
application of security measures, such as firewalls and virtual private networks, or
removing non-secure systems and services from the network. In 2002, the National
Institute of Standards and Technology (NIST) reported that $59.5 billion was spent annually on
breakdowns and repairs of faulty software [NIST 2002-10] [Mead and Stehney 2005].
NIST also found that 92% of all security vulnerabilities were due to application
vulnerabilities as opposed to network vulnerabilities [Curphey 2004]. Efforts have been
underway to deal with application vulnerabilities early in the software development life
cycle. These efforts have underscored the fact that risk management should drive the
software development process, which assures that security is made an emergent feature of
the development process [Mkpong-Ruffin and Umphress 2007].
Methodologies for performing risk management activities – risk identification, risk
assessment and risk mitigation – are largely qualitative in nature. Risk identification
approaches such as attack trees, attack nets, and attack pattern matching aid in
determining and identifying the risks that exist in a development effort [Mkpong-Ruffin
and Umphress 2007, Jurjens 2001, Sindre and Opdahl, Microsoft 2006]; whereas risk
assessment methodologies and tools aid in determining what and where resources can be
allocated to ameliorate the risks identified [Boehm 2001, SP 800-30, FIPS 2004, Kontio
1999, Voas et al. 1997]. The qualitative nature of these methodologies makes it difficult
to generalize assessment and duplicate results from other projects. Since effective risk
mitigation strategies are dependent upon the results of the risk assessment process, there
needs to be an empirical means for implementing risk assessment.
1.2 Research Objective
The objective of this research is to use historical data as a basis to categorize and
quantitatively assess risk elements by incorporating risk impact factors in the assessment
model. This objective is articulated through the creation of a software security risk
assessment model (SSRAM) that categorizes, estimates, and ranks risk elements that
have been identified during a threat modeling activity in the design phase of a
development effort. The result of the research is a quantitative risk assessment of
software security for a development effort.
1.3 Background
Risk is defined as the estimation of the probability and severity of an adverse effect
[Haimes 2004]. There are two kinds of risks: speculative and hazard risk [Voas et al.
1998; Young 2001]. A speculative risk, as in a stock investment, may produce a loss or a
profit. A hazard risk, on the other hand, always produces a loss; see Figure 1 [Alberts
2006]. In dealing with risk, the prevalent approach is to find a plan that will reduce the
loss incurred or the actual occurrence of the risk. This plan can take the form of
insurance purchased to reduce the effect of the loss if it were to happen, installation of
deterrent devices around existing structures, or the incorporation of restraints within the
construct of the system. For example, a homeowner could buy insurance, install security
systems on an existing home, or incorporate burglary bars and other theft resistance
structures into the home.
Figure 1 -- Difference between Speculative and Hazard Risk [Alberts 2006]
Risk exists when the exercise of a vulnerability produces a net negative impact. Risk
management is the process of identifying, assessing, and taking the steps to reduce risk to
a satisfactory level. It takes into account the probability and impact of the occurrence of
risk factors. This process allows operational and economic costs to be balanced while still
protecting the development effort. It minimizes the negative impact on the organization
and gives the sound basis needed for decision-making. Risk management allows for a
better security posture in that it enables stakeholders “to make well-informed risk
management decisions to justify the expenditures” that are a part of the IT budget for the
development effort. It also assists stakeholders in “authorizing (or accrediting)” the
development effort based on the documentation derived from the performance of risk
management [SP 800-30; SP 800-53].
Effective risk management gives software engineers the necessary focus to understand
the stakeholders’ objectives. It also provides a context for exploring solution approaches;
thereby reducing the risk of building the wrong system. Effective risk management also
makes it possible for risks to be resolved early, avoiding extensive rework late into the
project [Boehm 2001]. These reasons show the need for risk management to be fully
integrated into the software development life cycle [SP 800-30; SP 800-55; Boehm
2001].
1.3.1 Risk Assessment
Figure 2 below shows a high-level illustration of the risk assessment process [SP 800-100].
Figure 2 - Risk Function (risk as a function of threat, vulnerability, likelihood, and impact)
Risk assessment is the process of estimating the impact of a successful exploitation and
determining the likelihood of an attacker successfully exploiting a given vulnerability.
The estimation of impact is done by looking at the effect an exploitation can have on the
confidentiality, integrity and availability of the system [SP 800-100].
Figure 3 - Risk Management Cycle [GAO AIMD-00-33]
Risk assessment provides the foundation for the risk management process. As shown in
Figure 3, risk assessment affords the basis for the establishment of appropriate policies
upon which the other elements of risk management are directed [GAO AIMD-00-33]. It
determines the extent of the potential threat and the risks associated with the software
developed throughout its life cycle [Voas et al. 1997]. It also provides decision makers
with needed information that allows them to understand factors that can negatively
impact operations and outcomes. Additionally, risk assessment allows decision makers
to make informed judgments about actions to take and the extent to which such actions
should be taken to reduce risk. The output of this process aids in finding out what
controls will be needed to reduce or eliminate risks during the risk mitigation process.
For example, bank officials make judgments concerning loan portfolios based on the
results of risk assessment. Insurance providers use risk assessment to determine the
amount to charge for insurance provided. Nuclear power plant engineers conduct risk
assessments to ascertain risks to public health and safety. With the increased dependence
on computer systems, the growth of electronic data and the ubiquity of software, risk
assessment in the software development effort is critical.
In order to provide adequate assessment of the risks for a development effort, threats and
vulnerabilities have to be identified. The likelihood and impact of each risk occurring
have to be determined. In general, risk analysis entails [SP 800-30; McGraw 2006; GAO
AIMD-00-33; Boehm 1989]:
• Determination of the character of the system in question – understanding the system boundaries, the data sensitivity and criticality, and the stakeholders' perspectives on the system
• Identification of the threats – attackers that would likely want to attack the system
• Identification of vulnerabilities – flaws that exist or could exist in each level of the development and the operating environment
• Determination of the likelihood of a vulnerability being exploited
• Analysis of current and planned controls to mitigate the threats/vulnerabilities exposed
• Determination of the impact on the system and the organization should the risk be realized
• Determination of the risks, based on the threat and vulnerability identification, the likelihood of exploitation, and the impact on the system
• Determination of controls to mitigate the stated risks
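For concreteness, the attributes produced by these steps can be pictured as a single record per risk element. The C# sketch below is illustrative only (C# is used because it is the implementation language of this research); the type and member names are assumptions, not part of SSRAM or of SP 800-30.

// Illustrative container for the outputs of risk analysis; all names are hypothetical.
public enum Level { Low, Medium, High }

public class AnalyzedRisk
{
    public string Threat;            // likely attacker or threat-source
    public string Vulnerability;     // flaw that could be exercised
    public Level Likelihood;         // chance the vulnerability is exploited
    public Level Impact;             // harm to the system and organization if exploited
    public string PlannedControls;   // current or planned mitigations
}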
1.3.2 Threat Identification
Without a vulnerability to exploit, an attacker cannot exercise a threat. A threat-source
can be either a deliberate attack or the accidental activation of a vulnerability. It is
generally accepted practice to compile the potential threat-sources that are applicable to
the IT system under evaluation and to adapt that compilation to the organization and its
environment. Information on threats and threat-sources is now available from different
sources, such as:
• Federal Bureau of Investigation's National Infrastructure Protection Center (www.fbi.gov)
• Federal Computer Incident Response Center (FedCIRC) – www.uscert.gov
• Mass media, particularly Web-based resources such as the National Vulnerability Database (NVD), SecurityFocus.com, SecurityWatch.com, SecurityPortal.com, and SANS.org
1.3.3 Vulnerability Identification
A vulnerability is a flaw or weakness in a system's security procedures, design,
implementation, or internal controls. Its exploitation, whether intentional or accidental,
would result in a violation of the system's security policy or a security breach [SP 800-30;
SP 800-12; Steel et al. 2005].
In analyzing the threats to an IT system, the vulnerabilities associated with the system
and its environment must also be analyzed. A list of the flaws or weaknesses that could be
exploited by the probable threat-sources has to be developed. The different vulnerabilities
can be categorized to allow for ease of identification [Tsipenyuk et al. 2005; Howard et al.
2005; OWASP].
1.3.4 Likelihood Determination
Once vulnerabilities and attackers have been determined, the likelihood of a potential
vulnerability being exercised within the associated threat environment has to be derived.
The attacker's motivation and capability, the nature of the vulnerability, and the existence
and effectiveness of current controls are factors that govern likelihood determination [SP 800-30; SP 800-100; SP 800-12; Steel et al. 2005].
Most practitioners use the qualitative measures of high, medium, and low to describe
likelihood levels, as shown in Table 1.
Likelihood Level | Likelihood Definition
High | The threat-source is highly motivated and sufficiently capable, and the controls to prevent the vulnerability from being exercised are ineffective.
Medium | The threat-source is motivated and capable, but controls are in place that may impede successful exercise of the vulnerability.
Low | The threat-source lacks motivation or capability, or controls are in place to significantly impede the vulnerability from being exercised.
Table 1 Likelihood Definitions [SP 800-30]
1.3.5 Impact Determination
When a vulnerability is exercised, the impact of the security event is usually described as
a degradation of one or a combination of the following security goals: integrity,
availability, and confidentiality. Stoneburner [SP 800-30] briefly describes each security
goal and the result of the goal not being met as follows:
Loss of Integrity. System and data integrity refers to the requirement that
information be protected from improper modification. Integrity is lost if
unauthorized changes are made to the data or IT system by either intentional or
accidental acts. If the loss of system or data integrity is not corrected, continued use
of the contaminated system or corrupted data could result in inaccuracy, fraud, or
erroneous decisions. Also, violation of integrity may be the first step in a successful
attack against system availability or confidentiality. For all these reasons, loss of
integrity reduces the assurance of an IT system.
Loss of Availability. If a mission-critical IT system is unavailable to its end users,
the organization’s mission may be affected. Loss of system functionality and
operational effectiveness, for example, may result in loss of productive time, thus
impeding the end users’ performance of their functions in supporting the
organization’s mission.
Loss of Confidentiality. System and data confidentiality refers to the protection of
information from unauthorized disclosure. The impact of unauthorized disclosure of
confidential information can range from the jeopardizing of national security to the
disclosure of Privacy Act data. Unauthorized, unanticipated, or unintentional
disclosure could result in loss of public confidence, embarrassment, or legal action
against the organization. Some tangible impacts can be measured quantitatively in
lost revenue, the cost of repairing the system, or the level of effort required to correct
problems caused by a successful threat action. Other impacts (e.g., loss of public
confidence, loss of credibility, damage to an organization’s interest) cannot be
measured in specific units but can be qualified or described in terms of high,
medium, and low impacts.
Qualitative categories of high, medium and low are also used to describe the impact of
these security goals (Table 2).
Magnitude of Impact | Impact Definition
High | Exercise of the vulnerability (1) may result in the highly costly loss of major tangible assets or resources; (2) may significantly violate, harm, or impede an organization's mission, reputation, or interest; or (3) may result in human death or serious injury.
Medium | Exercise of the vulnerability (1) may result in the costly loss of some tangible assets or resources; (2) may violate, harm, or impede an organization's mission, reputation, or interest; or (3) may result in human injury.
Low | Exercise of the vulnerability (1) may result in the loss of some tangible assets or resources or (2) may noticeably affect an organization's mission, reputation, or interest.
Table 2 Magnitude of Impact Definitions [SP 800-12]
1.3.6 Risk Determination
Risk is a function of the likelihood of a given attacker’s ability to exercise a potential
vulnerability and the resulting impact of that adverse event on the organization. The
impact realized is the degree of harm that could be caused when a vulnerability is
exercised. The level of impact is based on the relative value of the resources affected
such as the sensitivity and criticality of the system's components and data [SP 800-30; SP
800-12; Steel et al. 2005]. The linking of system-level concerns with probability and
impact measures that matter to a software development organization produces superior
risk analysis [McGraw 2006; Voas et al. 1997]. As such, it is important that the threats to
a system, the probable vulnerabilities, and the controls in existence be analyzed in
tandem for the system in question.
Once the threat-sources, vulnerabilities, likelihood of occurrence and impacts have been
determined, a risk measure can then be developed. The risk for a particular
threat/vulnerability pair may be expressed as a function of:
• the probability of a particular attacker trying to exercise a particular vulnerability
• the magnitude of impact if the vulnerability is successfully exercised
• the adequacy of planned or existing security controls for removing or reducing risk [SP 800-30]
Stoneburner advocates that a risk scale and risk-level matrix, shown in Table 3, be
developed for measuring risk [SP 800-30].
Threat Likelihood | Impact: Low (10) | Impact: Medium (50) | Impact: High (100)
High (1.0) | Low (10 x 1.0 = 10) | Medium (50 x 1.0 = 50) | High (100 x 1.0 = 100)
Medium (0.5) | Low (10 x 0.5 = 5) | Medium (50 x 0.5 = 25) | High (100 x 0.5 = 50)
Low (0.1) | Low (10 x 0.1 = 1) | Medium (50 x 0.1 = 5) | High (100 x 0.1 = 10)
Risk Scale: High (>50 to 100); Medium (>10 to 50); Low (1 to 10) [SP 800-30]
Table 3 Risk-Level Matrix
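Table 3 amounts to a simple arithmetic rule: multiply the likelihood weight by the impact value and read the product off the risk scale. A minimal C# sketch of that rule is given below; the method names are illustrative assumptions, while the weights and thresholds are taken directly from the table.

// Risk score = likelihood weight (1.0, 0.5, 0.1) x impact value (100, 50, 10), per Table 3.
static double RiskScore(double likelihoodWeight, int impactValue)
{
    return likelihoodWeight * impactValue;   // e.g., 0.5 x 100 = 50
}

static string RiskLevel(double score)
{
    if (score > 50) return "High";     // >50 to 100
    if (score > 10) return "Medium";   // >10 to 50
    return "Low";                      // 1 to 10
}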
When the risk has been determined, the results of the assessments can be documented and
maintained; this allows for accountability. Since risks and threats change over time, it is
important that they be reassessed periodically. The documented report can then serve as
the basis for subsequent risk assessments and evaluations [GAO
AIMD-00-33]. This information also assists designers and developers in challenging their
own built-up assumptions about their system [McGraw 2006]. Risk assessment early in
the process molds and provides a contextual framework for the assumptions developers
have about the system.
2 STATE OF RISK ASSESSMENT
Risk management is an integral part of the software development process. Since risk
assessment is the foundation for other risk management activities, it should drive the
development process to ameliorate security issues. Developers are expected to identify,
assess, rank, mitigate and manage risk throughout the software product life cycle
[Humphrey et al. 2004; Addison and Vallabh 2002]. As noted previously, methodologies
used to allow risk to drive the development process have in large part been qualitative in
nature.
2.1 Threat/Vulnerability Identification
Several methodologies exist for identifying, early in the development process, the threats
and vulnerabilities present in a development effort. Some of them are:
• SecureUML – a modeling language that incorporates information pertinent to access control into applications modeled or defined using the Unified Modeling Language (UML) [Lodderstedt et al. 2002]. It models security requirements for "well-behaved applications in predictable environment" [McGraw 2006].
• UMLsec [Jurjens 2001] – an extension of UML to encapsulate the modeling of security-related features, such as confidentiality and access control.
• Abuse Cases [Sindre and Opdahl] – an approach that extends use-cases to include misuse-cases, showing side-by-side what behavior should be supported and/or prevented.
• Microsoft's Threat Analysis and Modeling Tool (TAMT) – a tool that generates risks based on the components, roles, data, external dependencies, and use-cases of a given development effort [Microsoft 2006].
2.1.1 Threat/Vulnerability Categorization
Different approaches have been used to categorize vulnerabilities. A representative list is
given below.
• Neural networks – these have been used to categorize security risks within software, with different levels of success [Neumann 2002; Jain et al. 1996]. In most cases they are used with software metrics, such as McCabe's complexity metrics, failure history, and lines of code, on modules that have already been written. Neumann [Neumann 2002] utilized neural networks combined with factor and component analysis to generate a set of orthogonal independent variables that could be used in representing the dependent variable, the goal being to allow testing efforts to focus on the portions of code with the largest number of faults.
• STRIDE – introduced by Michael Howard and David Leblanc [Howard and Leblanc], this approach relies heavily on cycling through a list of attacks and applying to each attack one or more of the risk categories of Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, or Elevation of Privilege (STRIDE) during analysis. McGraw finds it a good starting point that gives a useful list of things to consider when identifying risks [McGraw 2006].
• Tsipenyuk [Tsipenyuk et al. 2005] uses a taxonomy of seven "kingdoms" (input validation and representation; API abuse; security features; time and state; errors; code quality; encapsulation; and environment) to categorize risk elements. Under each kingdom are the specific elements relevant to that kingdom; for example, SQL injection and buffer overflow belong to the input-validation-and-representation kingdom.
• Others, such as the Open Web Application Security Project [OWASP] and [Howard et al. 2005], provide lists of types of vulnerabilities that can be used to facilitate adoption of vulnerability groupings. These lists tend to mix specific types of errors with groupings of vulnerabilities; for example, cross-site scripting and SQL injection are discussed at the same level of abstraction as Improper Error Handling and Trusting Network Address Information. A mapping of the different types of vulnerabilities as given by Howard, OWASP, and Tsipenyuk is shown in Appendix F.
2.2 Software Testing and Assessment
Traditionally, testing has been done to see if an application works correctly. For security
testing, emphasis is placed on identifying and removing vulnerabilities that could result
in security violations. It validates the effectiveness of the security measures that are in
place [Pan 1999]. Just as with ordinary testing, security testing methods fall into one of
two major categories: black-box or white-box testing.
What method the tester uses to carry out security assessment and testing depends on the
tester’s perspective with respect to the software component. Black-box tests are
conducted when specific knowledge about the software internals is not known, and the
test cases are constructed based on functional requirements. White-box tests are
conducted when the application’s internal structure is known.
2.2.1 Black-Box Testing
In black-box testing, the test data are taken from the specified functional requirements
[Howard et al. 2005; Hetzel 1988]. Testing approaches such as penetration, functional,
risk-based, and unit testing are commonly used to perform black-box testing [Howard et
al. 2005; Hetzel 1988; Voas et al. 1998; Michael and Radosevich].
• Penetration testing looks at how easy it would be for a component to be broken into.
• Functional testing looks to see whether software behaves as it is required to behave and whether it adheres to the given functional requirements.
• Risk-based testing is a subset of functional testing that focuses on those negative requirements that do not have a direct code base to be tested against. For example, the requirement that the application not be vulnerable to buffer overflows would be hard to map to a particular segment of code but still requires that testing be implemented to address it. Risk-based testing is usually done by testing for misuse and abuse cases [Michael and Radosevich 2005].
• Unit security testing looks at how an adversary could gain access to the software and then take control of the software after gaining access. Given this two-stage approach to unit testing, many use attack trees as a method for identifying and modeling threats, since attack trees are amenable to portraying the different stages of implementing a threat [Schneier 2000].
2.2.2 White-Box Testing
White-box approaches use test and assessment activities to show the structure and flow of
the software under review. The program structure provides the basis for the test cases
generated [Mkpong-Ruffin and Umphress 2007]. Commonly used white-box testing and
assessment methods are fault-injection, source-code analysis and profiling.
Fault-Injection
Fault-injection uses information based on test cases – flaws – to show the effects of
successfully exploiting a vulnerability by adding code that would forcefully change
program state. This approach gives insight into predictive measures of risk such as
minimum-time-to-hazard and mean-time-to-failure. Wallace defines hazard as “an unsafe
condition that may lead to an unintended event that causes an undesirable outcome”
[Wallace 1991].
Fault-injection allows for absolute worst-case prediction [Voas et al. 1998]. Jeffrey Voas
and colleagues distill the problems that software risks could stem from to the following:
“erroneous input data (from sensors, humans, or stored files), faulty code, or a
combination of the two” [Voas et al. 1997]. They postulate that accurate assessment of
liability requires worst-case predictions of software outputs if any of the aforementioned
problems would occur. To identify faulty or defective code, analytical techniques such as
formal code verification and testing can be employed. To detect wrong input data and
faulty code, fault injection should be performed.
Injecting anomalies allows the simulation of the effects that the real anomalies would
probably have and makes it feasible to build up a body of knowledge about probable
future behavior. The simulation of the effects makes it possible to quantify the risks that a
system’s component would create and allows observation of how bad things could get.
Being able to quantify the risks gives a way to determine the ‘boundaries of liability’
[Voas et al. 1997]. Like other testing techniques, fault-injection is a technique utilized
after a development effort is well under way.
Static Analysis
Static analysis tools look at the text of a program, while it is not in execution, to discover
potential vulnerabilities within the program. For example, many vulnerabilities are
known to come from reusable library functions such as C's strcpy() and stat(). A static
analyzer could scan the programs to see if they contain any calls to those functions
[Mkpong-Ruffin and Umphress 2007] and then analyze the potential vulnerabilities
uncovered to ascertain that they do not lead to security violations [Janardhanudu].
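A trivial sketch of this idea is shown below: it searches C source files for calls to functions that are commonly flagged as risky. Real analyzers parse and model the program rather than matching text; the function list and program structure here are assumptions made purely for illustration.

// Naive illustration of a static scan: flag lines that mention known-risky C library calls.
using System;
using System.IO;

static class NaiveScanner
{
    static readonly string[] RiskyCalls = { "strcpy(", "strcat(", "gets(", "stat(" };

    static void Main(string[] args)
    {
        foreach (string path in args)
        {
            string[] lines = File.ReadAllLines(path);
            for (int i = 0; i < lines.Length; i++)
                foreach (string call in RiskyCalls)
                    if (lines[i].Contains(call))
                        Console.WriteLine("{0}:{1}: possible risky call to {2}", path, i + 1, call);
        }
    }
}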
Profiling
Profiling tools allow the tester to observe the application’s performance while in
execution to see where bottlenecks occur. They also allow for observation and
understanding of the sequence of function calls and the time spent in different areas of
the software. This observation of the performance bottlenecks of the software with
profiling tools allows for a better understanding of vulnerability areas, such as areas with
memory leaks, which would not be apparent with the use of static code analyzers [Steel et
al. 2005].
In all the instances given, these software testing and assessment methodologies operate
at a lower level of abstraction in the development process.
2.3 Stakeholder’s Goals
2.3.1 Boehm Win-Win approach
Boehm advocates using a risk management cycle consisting of risk identification, risk
assessment and risk tracking. To this end, risk identification aided by software risk
management techniques [Boehm 2001] proactively identifies, as early as possible, the
possible sources of significant risks (see Table 4).
Source of Risk | Risk Management Techniques
Personnel shortfalls | Staffing with top talent; key personnel agreements; teambuilding; training; tailoring process to skill mix; walkthroughs.
Schedules, budgets, process | Detailed, multi-source cost and schedule estimation; design to cost; incremental development; software reuse; requirements descoping; adding more budget and schedule; outside reviews.
COTS, external components | Benchmarking; inspections; reference checking; compatibility prototyping and analysis.
Requirements mismatch | Requirements scrubbing; prototyping; cost-benefit analysis; design to cost; user surveys.
User interface mismatch | Prototyping; scenarios; user characterization (functionality, style, workload); identifying the real users.
Architecture, performance, quality | Simulation; benchmarking; modeling; prototyping; instrumentation; tuning.
Requirements changes | High change threshold; information hiding; incremental development (defer changes to later increments).
Legacy software | Reengineering; code analysis; interviewing; wrappers; incremental deconstruction.
Externally-performed tasks | Pre-award audits; award-fee contracts; competitive design or prototyping.
Straining computer science | Technical analysis; cost-benefit analysis; prototyping; reference checking.
Table 4 Software Risk Management Techniques [Boehm 2001]
The cycle would then continue with resolving each risk and, if not possible, addressing
the resolution in the risk management plan by re-scoping the risk. Barry Boehm [Boehm
2001] uses the ‘Top-N Risk Item List’ to monitor risk management. This process gives a
summary of the risk items that are either growing or decreasing in criticality. The
elaboration of the cycle is shown in Figure 4.
• Identify new risks
• Identify effects of risks
• Assess risk exposure; reconcile risks with project goals, constraints, and objectives
• Evaluate risk reduction alternatives and risk reduction leverage
• Take corrective action; assess decision points to invoke contingency plans
• Perform top-N Risk Item Tracking (see Table 4):
  o Identify the top-N risk items
  o Highlight these in regular project reviews (focuses review on manager-priority items)
  o Focus on new entries and slow-progress items
  o Reassess top-N risks
Figure 4 - Top-N Risk Item List [Boehm 2001]
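The top-N tracking step can be viewed as nothing more than sorting the open risk items by their risk exposure (probability of loss times size of loss) and reviewing the head of the list at each project review. The C# sketch below illustrates that reading; the class and method names are assumptions made for illustration, not Boehm's notation.

using System.Collections.Generic;
using System.Linq;

class RiskItem
{
    public string Name;
    public double Probability;   // probability the risk occurs
    public double Loss;          // magnitude of loss if it occurs
    public double Exposure { get { return Probability * Loss; } }   // risk exposure
}

static class TopNTracking
{
    // Return the n items with the highest exposure for the next project review.
    public static List<RiskItem> TopN(IEnumerable<RiskItem> items, int n)
    {
        return items.OrderByDescending(r => r.Exposure).Take(n).ToList();
    }
}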
2.3.2 RISKIT
RISKIT [Kontio 1997] is a method for risk management developed by J. Kontio. It
incorporates utility theory and supports the perspectives of different stakeholders' goals
by maintaining links between risks and stakeholders (see Figure 5). It utilizes analysis
graphs, as shown in Figure 6, for visual documentation of risks. The goal of the analysis
graph is to give a deeper understanding of the process, thereby enhancing communication
[Kontio et al. 1998].
Figure 5 - Definition of Risk in Riskit Method [Kontio et al. 1997]
Figure 6 - Example of the Riskit analysis graph (risk scenarios) [Kontio et al. 1998]
2.4 Risk Management Framework (RMF)
Introduced by Gary McGraw, this risk analysis approach is embedded within a risk
management framework [McGraw 2006]. He suggests three steps – attack resistance
analysis, ambiguity analysis, and weakness analysis – as the sub-processes performed to
accomplish risk analysis. Attack patterns and exploit graphs are explored for attack
resistance analysis. Ambiguity analysis examines the disparate viewpoints that analysts of a
particular system hold on how the system should work, to determine whether disagreements
are due to misunderstanding or to the need for further analysis [Viega and McGraw 2001].
Weakness analysis is performed by looking at the security of commonly used frameworks
such as .NET and JEE and other third-party components.
2.5 Software Reliability
The area of software reliability engineering looks at the probability of failure-free
software operation [Rosenberg et al. 1998]. It is a mature area of software engineering with
models and methods for assessing the reliability of software and, in some cases,
predicting future failures based on error history [Karunanithi et al. 1992]. Software
reliability engineering utilizes fault prevention, fault removal, fault tolerance and
fault/failure forecasting as methods to provide system dependability. These terms are
explained below [Lyu et al.1996]:
• Fault is a defect within software that causes a failure.
• Failure is the erroneous output as determined by the requirements for the system.
• Fault prevention is the avoidance of the occurrence of a fault.
• Fault removal is the usage of verification and validation to find and remove faults.
• Fault/failure forecasting is the estimation of the existence of faults, where they occur, and the consequences of failure.
2.6 Traditional approach
A traditional approach to quantitative risk analysis is to measure risk with respect to
financial loss or loss expectancy [GAO AIMD-00-33; Mkpong-Ruffin and Umphress
2007; Verdon and McGraw 2004; Steel et al. 2005]. This involves identifying all key
risk elements and assigning estimates of values associated with each risk element, such as
frequency, asset cost, potential loss value, and business impact. With these values,
estimates of potential loss can be computed and an analysis of potential threats can be
performed.
The Annual Loss Expectancy (ALE) can be computed as follows:
ALE = Single Loss Expectancy (SLE) x Annualized Rate of Occurrence (ARO)
where
ARO is the frequency of threats per year,
SLE is the asset value x exposure factor (EF), and
EF is the percentage of asset loss caused by the potential threat.
Equation 1 - Traditional Annual Loss Expectancy
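Equation 1 can be restated as a few lines of C#; the figures in the closing comment are invented solely to show the arithmetic.

// Equation 1: ALE = SLE x ARO, where SLE = asset value x exposure factor (EF).
static double AnnualLossExpectancy(double assetValue, double exposureFactor, double annualRateOfOccurrence)
{
    double singleLossExpectancy = assetValue * exposureFactor;   // SLE
    return singleLossExpectancy * annualRateOfOccurrence;        // ALE
}

// Example: a $100,000 asset, an EF of 25%, and two expected incidents per year
// give AnnualLossExpectancy(100000, 0.25, 2) = $50,000.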
2.7 Cross-Disciplinary ties to Insurance
In actuarial science, actuarial tables (known also as mortality tables or life tables) exist to
show the probability, at each age, of a person dying before their next birthday. With this
as a starting point, statistics are derived on the probability of surviving any particular year
of age. This process uses characteristics such as smoking, occupation and socioeconomic class to distinguish between different risks in determining life expectancy. To
develop predictions of future insured events, such as sickness, death, disability, etc.,
actuaries study incidence and severity of these events based on the recent past. They use
this study to develop expectations about how the factors that motivated the events in the
past will change over time. Armed with factors that affect risk, they are able to create
models that allow them to evaluate risk [Klugman et al. 2004].
2.8 Problems with Existing Approaches
Most of the work done on risk management in software development has been from a
subjective point of view [Boehm 2001; Farahmand 2003]. The main problem with
qualitative approaches is the lack of objectivity, which makes it hard to duplicate results
or generalize assessments from previous projects [Voas et al. 1997]. The approaches
enumerated above have the following limitations:
•
Threat/Vulnerability Identification models and tools have as a goal the listing of risks
that need addressing. These lists provide a starting point in dealing with security but
it is not the case that all risks identified can be addressed. Even if all risks enumerated
could be addressed, protecting against every single conceivable threat would reduce
security to “a labor-some project that gets in the way of functionality …” [Cheswick
et al. 2003]. The removal of all enumerated risk would make it very difficult, if not
impossible, to produce a system that would be useful as well as secure. Since all risks
may not be addressable, determining which vulnerabilities to address becomes a
problem.
•
The neural network approaches have had disparate results as evidenced by the
statements given by Sherer and Koshgoftar [Sherer 1995, Koshgoftar 1995]. Sherer’s
conclusion was that neural networks did not do a good job of predicting the
components that would have many faults. Koshgoftar, on the other hand, saw neural
networks as effective modeling tool that should be seriously considered for software
engineers because of its predictive capabilities.
•
The testing approaches, though effective in measuring some classes of risks
described, are used later in the development process, after most of the work for
development has been accomplished. Any changes due to the analysis would be more
expensive in terms of time and cost.
•
The traditional quantitative approach of annualized loss expectancy (ALE) and the
exposure factor (EF) carries with it the burden of knowing what asset costs to use.
22
The impact effect is measured by the cost of asset loss, which is not easily determined
during the design phase of development. If the cost of asset loss is available, it may
not adequately reflect the seriousness of the problem. This approach also suffers
from the lack of sufficient data to determine probability of loss. Unlike the insurance
or finance companies, software developers do not have probability loss tables
available to adequately determine the frequency of occurrence and the probability of a
risk occurring.
•
Another reason given for not using quantitative models is the lack of access to
reliable and current vulnerability data to support estimating probability of loss [GAO
AIMD-00-33 ]. The data on the likelihood and costs associated with information
security risk factors have been limited and the risk factors in this domain are
constantly changing. This lack of reliable and current data has made it hard to
determine which information security risks are the most significant. It is also difficult
to compare controls to determine which ones are most effective. To this end, many
organizations historically have leaned towards methods that do not rely on empirical
data for obtaining reliable results.
• Risk assessment methodologies based on being able to accurately quantify reliability have also not been found to be good approaches, because 100 percent reliability does not necessarily correlate to zero percent risk. Another problem with using reliability for liability prediction is that the existing reliability prediction models conflict with one another, making it almost impossible to know definitively the true reliability of a piece of software. Most of the error-history-based reliability models predict reliability differently using the same data, which makes it hard to determine which model is most accurate for a given system.
• Relevant work on risk management in other disciplines has not been fully embraced by the software risk management community [Kontio 1999]. Since risk management is mature in other disciplines, knowledge can be gleaned from them to strengthen the software risk management process.
Risk analysis is a process for considering possible risks and determining which are the
most significant for any particular effort. Work has been done on risk identification,
mitigation, evaluation and assessment, but not much has been done in assessing and
ranking risks during the early part of the software development life cycle [Voas et al.
1997]. Determining which risk elements should be addressed and the optimum strategy
for mitigating risks is often an intuitive process. Risk element assessment in software
security is predominantly a qualitative process. An objective view of the risks inherent in
a development effort requires a quantitative risk model.
3 RESEARCH DETAILS
A great need exists for tools to evaluate and quantify risks that are attributable to
software development, if for no other reason than to reduce litigation costs due to
software failure [Voas et al. 1997]. This research demonstrates that those attempting to
perform risk assessment can do so from an objective point of view.
As previously noted, one of the hindrances to objective measures for risk assessment is
the lack of available data. As shown in Figure 7, the rate of reported incidence of
vulnerabilities as recorded in NVD has increased dramatically. This increase in available
data provides a basis for building a database for determining and evaluating risks.
Figure 7 - Reported Incidences [NVD] (bar chart of the number of incidences reported per year from 1988 through 2006, showing a dramatic increase to several thousand reported incidences per year)
3.1 Conceptual Overview
Empirically assessing the risks inherent in a development effort with the approach applied in this research requires that the historical data be placed in a format that allows for classification. A clustering algorithm can then be applied to the data to determine the best segmentation of the underlying data, as illustrated in Figure 8. This allows new
risk elements to be classified based on their similarities to the clusters discovered. Once
the risk elements have been classified, the cluster that best represents the risk element in
question is then used to determine the impact factor for that risk element. The impact
factor value can then be used to calculate loss expectancy. This process of determining an
empirical posture for a given development effort is illustrated in the sequence diagram of
Figure 9.
Figure 8 - SSRAM's Cluster Creation (sequence diagram: SSRAM requests NVD data, cleans the returned data and updates the SSRAM database, then creates clusters of incidences)

Figure 9 - Software Security Risk Assessment (sequence diagram: a generated, unordered list of risk elements is sent to SSRAM; each risk element is classified to the risk cluster it most resembles; total incidences and impact values from that cluster are used to calculate loss expectancy; and a ranked list of elements is returned)
This model provides an empirical assessment of the potential threats to a development
effort and a ranking of these threats based on the risk metric calculation.
3.2 Data Preparation
To be able to assess risk elements, the historical data must be in a categorized form that
allows new elements to be classified (Figure 8). To this end, historical data for this
research was taken from an open-source vulnerability database provided by the U.S.
government, the National Vulnerability Database (NVD). The NVD database provides
RSS (Really Simple Syndication) feeds, in an XML format, of incidences reported from a
variety of sources. The schema for this data is presented in Appendix A-1. It has seven
sublevels with a nested hierarchy. An example of an entry is provided in Appendix A-2.
3.3 Parsing the Data
To parse the XML data, we traversed each node of the NVD XML document and used different types of SQL INSERT ... SELECT statements to accommodate the different ways that the XML feed reports node-level data. An overview of these steps is given below, accompanied by the database diagram that depicts the tables generated from the schema.
• Insert the XML documents into a table
use ssram
-- bulk-load the downloaded feed file into a single XML column
INSERT nvdtmp
SELECT CONVERT(xml, BulkColumn, 2)
FROM OPENROWSET(BULK 'F:\Dissertation\Databases\National Vulnerability Database\nvdcve-recent.xml', SINGLE_BLOB) [rowsetresults]
• Parse the data into the different tables based on the schema of the XML document. Parsing the data into each table required a different script for each table because of the differences among the tags of the XML document.
o Parsing first-level tags — The first-level tag <entry> holds all of its values in attributes, as shown in Figure 10. To parse this data, an attribute-centric approach was applied to the first-level node. A hedged sketch of this step follows Figure 10.
<entry
type="CVE"
name="CVE-2006-0948"
seq="2006-0948"
published="2006-08-21"
modified="2006-08-22"
severity="High"
CVSS_score="7.0"
CVSS_vector="(AV:L/AC:L/Au:NR/C:C/I:C/A:C/B:N)">
Figure 10 - NVD's First Level Node
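As an illustration only, a minimal T-SQL sketch of such an attribute-centric parse is given below; it assumes the feed's XML namespace has been declared or stripped beforehand, shows only a subset of the entry table's columns, and uses illustrative paths and data types rather than the exact script of Appendix A-3.

-- Hypothetical sketch: shred first-level <entry> attributes into the entry table.
-- The '//entry' path and the column types are assumptions for illustration.
INSERT INTO entry (type, name, seq, published, modified, severity, CVSS_score, CVSS_vector)
SELECT  x.e.value('@type',        'varchar(10)'),
        x.e.value('@name',        'varchar(20)'),
        x.e.value('@seq',         'varchar(20)'),
        x.e.value('@published',   'datetime'),
        x.e.value('@modified',    'datetime'),
        x.e.value('@severity',    'varchar(10)'),
        x.e.value('@CVSS_score',  'float'),
        x.e.value('@CVSS_vector', 'varchar(60)')
FROM nvdtmp
CROSS APPLY XmlCol.nodes('//entry') AS x(e)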
o Parsing lower-level tags with additional tag and attribute values — The <desc>, <ref> and <sols> tags use a combination of tag and attribute values that need to be parsed independently. For example, Figure 11 shows a <desc> tag that has the nested <descript> tag and the source attribute. The tag and the attribute have to be parsed separately. Also, since these are nested tags, parsing involved getting information from the parent tag along with information from the lower-level tag.
<desc>
<descript source="cve"> AOL 9.0 Security Edition revision 4184.2340, and probably
other versions, uses insecure permissions (Everyone/Full Control) for the "America
Online 9.0" directory, which allows local users to gain privileges by replacing
critical files.
</descript>
</desc>
Figure 11 - NVD Description Node
o Parsing lower-level tags with nested lower-level tags — The <loss_types>, <vuln_types> and <range> tag entries have nested lower-level tags that contain the data needed. If a value is not present, then the corresponding tag is not available for that particular instance. For example, Figure 12 shows two instances of the <loss_types> tag. Note that Figure 12(a) has the <int /> tag, which is not in Figure 12(b), and Figure 12(b) has the <sec_prot> tag and its associated attribute, which is not in Figure 12(a). These values are embedded as part of the meta property of the tag, and the meta property data for these tags needed to be accessed to accurately represent their information. A hedged sketch of this step follows Figure 12.
(a)
<loss_types>
  <avail />
  <int />
</loss_types>

(b)
<loss_types>
  <avail />
  <sec_prot other="1" />
</loss_types>

Figure 12 - Node with nested lower level tags
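A comparable hedged sketch for the nested <loss_types> case is shown below; the exist() calls simply flag whether a nested tag is present, and the adminSP/otherSP attribute names are assumptions modeled on the columns of Figure 13 and the sample in Figure 12, not a copy of the script in Appendix A-3.

-- Hypothetical sketch: detect nested loss_types tags and read their attributes.
SELECT  x.e.value('@name', 'varchar(20)')                           AS entry_name,
        x.e.exist('loss_types/avail')                               AS avail,      -- 1 when <avail /> is present
        x.e.exist('loss_types/int')                                 AS int_loss,   -- 1 when <int /> is present
        x.e.value('(loss_types/sec_prot/@admin)[1]', 'varchar(5)')  AS adminSP,    -- assumed attribute name
        x.e.value('(loss_types/sec_prot/@other)[1]', 'varchar(5)')  AS otherSP     -- as in Figure 12(b)
FROM nvdtmp
CROSS APPLY XmlCol.nodes('//entry') AS x(e)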
Appendix A-3 has the full script for parsing the NVD XML document into required
tables. Successful parsing of the data resulted in the population of the SSRAM database
based on the schema depicted in Figure 13. Each entry level tag is depicted as a record in
the entry table with a unique entry name that is used to associate with the lower level tags
in the other tables shown in Figure 13.
SSRAM database tables (from the database diagram):
nvdtmp: XmlCol
entry: type, name, seq, nvd_name, discovered, published, modified, severity, reject, CVSS_score, CVSS_vector
entryDesc: entry_name, descript, source
entrySols: entry_name, sol, source
vuln_types: entry_name, vuln_type, input_bound, input_buffer
lossType2: entry_name, loss_type, adminSP, userSP, otherSP
range: entry_name, exploit_range
refs: entry_name, ref, source, url, sig, adv, patch
vuln_soft: entry_name, prodName, prodVendor, versionNum, preVersion, edition
Figure 13 - SSRAM Schema
3.3.1 Validation of Data Entry
To validate that the data was extracted correctly from the XML files, a random sample of entries was chosen. [Bartlett II et al.] recommend a sample size that is less than or equal to 5% of the captive population. Given that the NVD data is captive data, in that all the data requested is in the file obtained, a file with 136 entries was downloaded on 8/26/06 and used to test the data-entry process. A systematic random sampling approach was implemented to test the data. With systematic random sampling, every k-th entry is chosen, with the first entry randomly selected. Since this test file had 136 entries, we chose 5% × 136 ≈ 7 entries to be compared with the actual XML data for discrepancies and chose 136 ÷ 7 ≈ 20 as the k factor. To this end, a random number between 1 and 19, based on a random generator, was chosen to determine the first entry to be validated. Thereafter, every 20th entry was chosen for validation. The following records were checked: the 5th, 25th, 45th, 65th, 85th, 105th, and 125th entries and, for added measure,
the first and last entries. We compared the contents of the actual XML document with the
content of the uploaded table to ascertain that they had the same information (Table 20 in
Appendix A-4). Any discrepancies between the table and the actual record information
were corrected and the data uploaded again. We again compared the contents of the first
sample of data and then using the same approach, selected seven other entries (2, 22, 42,
62, 82, 102, 122) to confirm that the data had been correctly uploaded. See Appendix A-4
for complete data upload validation information.
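The systematic sampling step could be expressed in T-SQL roughly as follows; this is only a sketch against the entry table of Figure 13, with the sampling interval and the random starting point as described above.

-- Hypothetical sketch: pick every k-th uploaded entry after a random start.
DECLARE @k int, @start int
SET @k = 20
SET @start = 1 + ABS(CHECKSUM(NEWID())) % (@k - 1)     -- random start between 1 and 19
SELECT name, published, CVSS_score
FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY name) AS rowNum
       FROM entry ) numbered
WHERE (rowNum - @start) % @k = 0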
3.4 Cluster Determination
With the data parsed into the component tables, we were able to look at the fields
necessary for clustering the data. Clustering the data allows for determining the different
kinds of natural groupings or classes that may exist within the underlying data. Given the
nature of the data collected, we chose as clustering fields those fields within each table that were independent. For example, in the entry table, we did not choose the severity field, as it depends on the CVSS_score field. Along with the scalar fields, we also chose the description text field, as it contained information about the vulnerability entry that is pertinent to determining the kinds of vulnerabilities and losses involved. Figure
14 shows the select statement of the fields chosen to determine the clusters for the
elements reported.
select entry.name, entry.discovered,entry.published, entry.cvss_score,
lossType2.loss_type, lossType2.adminSP,lossType2.userSP,
lossType2.otherSP,vuln_types.vuln_type, vuln_types.input_bound,
vuln_types.input_buffer,range.exploit_range, refs.source,
refs.sig, refs.adv, refs.patch,vuln_soft.prodName,
vuln_soft.versionNum, vuln_soft.preVersion, vuln_soft.edition
into denormNVD
from entry, lossType2, vuln_types, range, refs,vuln_soft
where
entry.name = lossType2.entry_name AND
entry.name = vuln_types.entry_name AND
entry.name = range.entry_name AND
entry.name = refs.entry_name AND
entry.name = vuln_soft.entry_name
Figure 14 - Independent Variable Selection Statement
Since the description field contained ‘comment-like’ data that cannot be clustered in the form given, it was necessary to create a table into which the nouns and noun-phrases of the description field were extracted. Each term was assigned an associated Term Frequency–Inverse Document Frequency (TFIDF) score, defined as:
TFIDF of a Term T = (frequency of T) × log( (#rows in Input) / (#rows having T) )
Equation 2 - TFIDF
Upon completion, a vector table that associates each term with the description entry was
also created. (Appendix B-1)
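A minimal sketch of Equation 2 is shown below; it assumes a hypothetical descriptionTerms(entry_name, term, frequency) table holding the extracted nouns and noun-phrases and uses the natural logarithm, so it illustrates the calculation rather than reproducing the scoring performed by the mining tool.

-- Hypothetical sketch: TFIDF of each extracted term (Equation 2).
SELECT  dt.entry_name,
        dt.term,
        dt.frequency *
          LOG( (SELECT COUNT(*) FROM entryDesc) * 1.0 /
               (SELECT COUNT(DISTINCT d2.entry_name)
                FROM descriptionTerms d2
                WHERE d2.term = dt.term) )             AS tfidf
FROM descriptionTerms dt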
Determining the clusters to use to classify risk elements required the exploration of different segmentation schemes. Some clustering algorithms produce well-separated clusters, such that an object can belong to only one cluster. Others use an overlapping approach, so that an object could belong to more than one cluster. To identify the natural groupings that may exist in the historical data, we examined one clustering algorithm for each of the segmentation schemes described above: K-Means and Expectation Maximization (EM).
K-Means is a clustering algorithm that produces k partitions or clusters of the underlying
data by trying to find the centers of natural clusters in the data. This is achieved by
minimizing the total variance within the cluster [Tang and MacLennan 2005]. Using this
approach, the K-Means algorithm assigns each data record to a specific cluster. This
approach does not allow for uncertainty in the membership record within a cluster.
The EM clustering algorithm, like K-Means, produces k partitions by finding the center
of the natural clusters in the data, but unlike the K-means, relaxes the assumption that a
data record has to belong to only one cluster. It allows for the overlapping of clusters, in
that data can belong to other clusters but with different probabilities of membership
[Bradley, Fayyad, Reina 1998].
Two mining structures were developed, one for the data without the description terms and
the other with the vector table included as a nested table. Within each mining structure,
mining models using EM and KM algorithms with different target numbers for clusters
were generated. These models were then compared for predictive accuracy. We found
that the EM mining models with the description vector table showed better predictive
accuracy than those without the description vector tables. Figure 15 and Figure 16 show
the mining accuracy charts for the different structures. The X-axis of each chart
represents the percentage of the test dataset that is used to compare the predictions, while
the Y-axis represents the percentage of predictions that are correct. In both cases, the EM
models outperformed the KM models. As shown in Figure 15 and Figure 16, the
structure with models that took into account the description vector table had better
prediction values than the mining structure without the vector table. Given 100% of the
test cases used for prediction, the best predictive model for the mining structure with the
description vector table (NVDCL-EM10) had a 71.86 % predictive accuracy score as
compared to 61.25% predictive accuracy of the best model for the mining structure
without the vector table (NVDCLwo_EM10).
Figure 15 - Clustering without Description (mining accuracy chart comparing the NVDCLwo-EM10, NVDCLwo-EM8, NVDCLwo-EM6, and NVDCLwo-KM10/KM8/KM6 models)

Figure 16 - Clustering with Description (mining accuracy chart comparing the NVDCL-EM10, NVDCL-EM8, NVDCL-EM6, NVDCL-KM10, NVDCL-KM8, and NVDCL-KM6 models)
Choosing to use the mining structures with the description vector, we decided to use the
EM clustering algorithm to segment our training data since it predominantly
outperformed the K-Means algorithms, regardless of the mining structure used, as shown
in the graphs in Figure 15 and Figure 16. To determine the cluster node size to use, we looked at the predictive results obtained from mining structures with 4, 6, 8 and 10 cluster node sizes. We started with a node size of four, based on the heuristic determination of the clustering algorithm, and then incremented the node size by two, up to the default node size of ten, for the different mining models, comparing the predictive results obtained with each model. Upon completion we chose the EM clustering algorithm based on 10 nodes, because the EM 10-node size outperformed the other node sizes across most of the population. See Figure 18.
With the clusters formed, each data item of the training model was assigned to a cluster. This was done by extracting all the data necessary for the prediction query, along with the cluster tag associated with each entry, and saving this data so that prediction algorithms could be run against it. In effect, a query such as the one shown in Figure 17 was used.
SELECT
(t.[name]), CLUSTER() FROM [NVDCL-EM-10]
PREDICTION JOIN
OPENQUERY([SSRAM],…) AS t
ON
[NVDCL-EM-10].[Cvss Score] = t.[cvss_score] AND …
Figure 17 - Cluster Assignment Query Example
3.4.1 Validation of Clustering
We separated the data from NVD into two parts so that we could use one part for training
and the other for testing our algorithm. After training the algorithm, we used two
approaches to validate the clustering algorithms. In the first approach, we used the test
data to find an indication of how likely it is for each test case to exist within a determined
cluster. A score for the model as shown in the mining legend table to the right of the
graph in Figure 18 reflects the average case likelihood of each of the training cases
existing within each model. A score closer to 1 is an indication that the training points are
close to the clusters in the model and, as such, implies that the clusters are compact and
well-defined. A score that is close to 0 implies that the training data points are scattered
and the clusters are not as well defined [Tang and MacLennan 2005].
Figure 18 - Mining Accuracy Chart (accuracy chart and mining legend comparing the NVDCL-EM10, NVDCL-EM8, NVDCL-EM6, NVDCL-KM10, NVDCL-KM8, and NVDCL-KM6 models)
Also, we examined the cohesion of each cluster with respect to its centroid by looking at the mean, standard deviation, maximum and minimum values for each of the clusters within each mining model. In essence, these statistics were used to reflect the proximity of the entries within each cluster to the mean. An example of the results obtained is shown in Table 5. For the algorithm represented here, seven clusters were derived, and the count shows the number of entries within each cluster. With the smallest number of entries in a node being 300, we assume a normal distribution for our data. The average CVSS score and the standard deviation values reflect the fact that, though the range of scores in each of the clusters was 1.9 – 10, 96.3% of the data within the clusters were within 1.95 standard deviations of the mean, implying tight cohesion. Cluster 4 had the least cohesion, with a standard deviation of 2.27, while Cluster 7 had the tightest cohesion, with entries that were exactly the same, as reflected in its standard deviation and range of 0.
Cluster   Count   Std Deviation   Avg Score   Max Score   Min Score
1         1700    1.61            3.8         10          1.9
2         8351    0.81            6.74        10          1.9
3         8761    1.34            9.29        10          1.6
4         860     2.27            4.41        10          1.9
5         429     1.91            4.45        10          2.3
6         2990    1.54            3.28        10          2.3
7         300     0               7           7           7
Table 5 - Cluster Algorithm Cohesion Factors (NVDCL-EM10)
Cluster   Count   Std Deviation   Avg Score   Max Score   Min Score
1         22717   2.08            7.44        10          1.6
2         7816    2.2             5.43        10          1.9
3         7296    2.19            5.16        10          1.9
4         3856    2.39            5.01        10          2.3
5         9306    2.33            4.75        10          2.3
6         514     0               7           7           7
Table 6 - Cluster Algorithm Cohesion Factors (NVDCL-EM8)
Looking at the cohesion values for the mining structures in Table 5 and Table 6, it is evident that the clusters formed by the mining model NVDCL-EM10 exhibited tighter cohesion than those of NVDCL-EM8. All but one of the clusters in Table 5 has data within 1.95 standard deviations of the mean. On the other hand, NVDCL-EM8 has only one cluster with data within 1.95 standard deviations of the mean (Table 6). Given that NVDCL-EM10 (Table 5) had the tightest cohesion factors, we chose this algorithm for clustering the historical data so that new data can be classified against it.
See Appendix B-5 for complete validation results.
3.5 Data Classification
Determining the general categories of the vulnerability database allows for classifying
new data. Classification is the process of assigning objects to one of several predefined
categories. To find the cluster most similar to new data given, we examined Decision
Tree, Neural Networks and Naïve Bayes classification algorithms and chose the
classification algorithm that best predicted the cluster that should be used.
Decision Tree algorithm: The decision tree algorithm is a classification algorithm for predictive modeling of continuous and discrete attributes. It uses the relationships between the input columns within the data set to make predictions. It does this by identifying the input columns that correlate with the predictable columns. Each time an input column is found to be significantly correlated with the predictable column, a node is added [Microsoft 2008-3].
Neural Networks: Neural network algorithms stem from the 1940s research of Warren McCulloch and Walter Pitts on simulating how the brain works [Tang and MacLennan 2005]. Neural network algorithms primarily address the classification and regression tasks of data mining. Like decision trees, they find non-linear relationships between input attributes, but unlike decision trees, they find smooth as opposed to discontinuous nonlinearities. Neural network algorithms use networks that are made up of three types of nodes or layers (input, hidden and output) and directed edges that show the data flow during the prediction process. The input nodes, which form the first layer of the network, are mapped to the input attributes. The hidden nodes, which are the nodes in the intermediate layer, combine the values obtained from previous layers with the weights of the associated edges, perform calculations, and pass the resulting values to the next layer. The output layer represents the predictable attribute(s) [Microsoft 2008-4].
Naïve Bayes: The Naïve Bayes algorithm, like the decision tree and neural network algorithms, is a classification algorithm for predictive modeling. It works by calculating the conditional probabilities of the input and predictive attributes. Naïve Bayes assumes that the input attributes are independent and, as such, does not take into account any dependencies that may exist between attributes. In doing so, the probabilities can be multiplied so that the likelihood of each state can be determined [Microsoft 2008-5, Tang and MacLennan 2005].
To help in determining the model to be used for classification, two mining structures
were created, one using the vector table with descriptive information and one without. A
comparison of the classification algorithms explained above was done. Figure 19 and
Figure 20 pictorially depict the result of classifying the test data with and without the
description vector. The Naïve Bayes and Neural Network algorithms outperformed the Decision Tree algorithm in each instance. Though the scores for Neural Network and Naïve Bayes
were the same (Table 7), the Naïve Bayes algorithm took less time (0:00:03 versus
0:00:32) to process the same data.
Model Name                                       Processing Duration Time   Score
NBClusterClassifier (Naïve Bayes)                3 seconds                  0.81
DTClusterClassifier (Decision Tree Classifier)   5 seconds                  0.78
NNClusterClassifier (Neural Network Classifier)  32 seconds                 0.81
Table 7 - Classifier Processing Time and Score – Cluster Classifier without Description
Figure 19 - Classifiers with Description Vector (mining accuracy chart comparing the NB, NN, and DT ClusterClassifiers)

Figure 20 - Classifiers without Description Vector (mining accuracy chart comparing the NB, NN, and DT ClusterClassifiers)
3.5.1 Data Classification Validation
Classification models are evaluated based on the number of test records that they correctly and incorrectly predict. An example of this evaluation is shown in Table 8. Columns correspond to the actual values and rows depict the predicted values. The complete matrix can be seen in Appendix B-4. Table 8 shows the confusion matrix for the Naïve Bayes model; its first row should be read as follows: Row 1 depicts the entries that were predicted for Cluster 1. Of all the entries predicted for Cluster 1, 3158 actually belonged to Cluster 1, 496 belonged to Cluster 6, and 157 entries predicted as Cluster 1 were in actuality Cluster 2’s entries. This matrix provides enough information to measure the performance of each model using the accuracy metric, defined as follows:
Accuracy = Number of Correct Predictions ÷ Total Number of Predictions
Equation 3 – Accuracy Determination
Counts for NB ClusterClassifier on [Cluster Node] (Accuracy = 0.67)

Predicted    Cluster 1   Cluster 6   Cluster 4   Cluster 3   Cluster 8   Cluster 5   Cluster 7   Cluster 2   Total
             (Actual)    (Actual)    (Actual)    (Actual)    (Actual)    (Actual)    (Actual)    (Actual)    Predicted
Cluster 1    3158        496         60          145         74          164         41          157         4295
Cluster 6    307         680         1           89          0           63          2           29          1171
Cluster 4    35          14          257         5           75          32          197         6           621
Cluster 3    19          129         10          513         0           7           4           103         785
Cluster 8    2           0           9           0           17          0           15          6           49
Cluster 5    47          89          6           15          2           1017        7           59          1242
Cluster 7    28          2           226         8           202         2           578         22          1068
Cluster 2    11          21          6           221         10          27          15          628         939

Table 8 - Classification Confusion Matrix – Classifier without Description
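Equation 3 can be computed directly from such a matrix of predictions; the sketch below assumes a hypothetical clusterPredictions(entry_name, actualCluster, predictedCluster) table pairing each test entry's actual cluster with the classifier's prediction.

-- Hypothetical sketch: accuracy = correct predictions / total predictions (Equation 3).
SELECT SUM(CASE WHEN predictedCluster = actualCluster THEN 1 ELSE 0 END) * 1.0
       / COUNT(*) AS accuracy
FROM clusterPredictions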
Figure 21 shows a comparison of the accuracy performance metric for the classifiers in both mining structures. The mining structure with the description vector tables had an average accuracy score of 0.66, while the structure without the description vector had an average accuracy score of 0.79. Though the description vector was necessary in determining the clusters, it was not necessary to have that information when classifying a new entry.
Figure 21 - Prediction Accuracy of Classifiers (chart titled “Accuracy of Classification Algorithms”; prediction accuracy with the description vector: DT Classifier 0.63, NB Classifier 0.68, NN Classifier 0.65; without the description vector: DT Classifier 0.86, NB Classifier 0.67, NN Classifier 0.85)
3.6 Loss Expectancy Determination
With the data segmented into clusters and a classification algorithm adopted, the impact score and frequency of occurrence of a given risk entry were determined and used to calculate loss expectancy. Traditionally, loss expectancy is computed by determining the
value of the percentage of asset loss caused by the potential threat and finding the
frequency of occurrence of threats within a given year, as shown in Equation 1.
Since asset value loss cannot be easily determined, we opted to use the Common
Vulnerability Scoring System (CVSS) ‘base’ score because it shows the inherent
characteristics of the vulnerability incidence reported and reflects the impact value of the
reported incident.
<entry
type="CVE"
name="CVE-2006-0948"
seq="2006-0948"
published="2006-08-21"
modified="2006-08-22"
severity="High"
CVSS_score="7.0"
CVSS_vector="(AV:L/AC:L/Au:NR/C:C/I:C/A:C/B:N)">
Figure 22 - CVSS base Score and Vector
Each NVD entry has, along with the CVSS base score, a vector that reflects how the base score was calculated. The vector, as shown in Figure 22, is used to determine the base score for a particular entry. In Figure 22, AV: Access Vector = L (Local accessibility) indicates that this vulnerability can only be exploited locally; AC: Access Complexity = L (Low) indicates that an attack based on this vulnerability would not require special access conditions, as it could be performed manually and would not necessarily require much skill or a lot of additional information to execute. Au: Authentication = NR (Not Required) shows that authentication is not required to trigger this vulnerability. The impact metrics of C: Confidentiality, I: Integrity and A: Availability all equaling C (Complete) imply a total system compromise if the vulnerability is exercised. Based on the NVD algorithm for computing the base score [First 2005], the resulting base score of 7.0 was calculated as shown in Appendix B-7.
3.6.1 Determining Fields to Predict CVSS Score
The data mining wizard supplied with Microsoft’s Visual Studio 2005 uses entropy-based analysis to suggest fields that would likely give information leading to the selected output [Tang and MacLennan 2005]. This analysis looks at the contribution of attribute values to predicting a particular attribute. In our case, the analysis looked at the contribution of all of the attributes in predicting CVSS score. Those attributes with an entropy value of 0 are evaluated as contributing nothing to the prediction decision. Those attributes with entropy values of 1 are so distinct that, in effect, each record would be a partition of its own, making each partition too small for any reliable prediction.
Figure 23 - Suggested Input Fields
This suggestion provided a basis for selecting the fields used to determine the parameters for predicting impact. Since the scores take values between 0 and 1, we removed those attributes whose entropy scores were less than 0.01, because a contribution so close to zero implies very little contribution; attributes whose entropy values were very close to 1 were also removed, as they would not help in making reliable predictions. Figure 23 depicts all the attributes for the model, their associated entropy scores, and the Input column that identifies the fields suggested for predicting CVSS score. Although administrative-level security protection (adminSP) was suggested as an input attribute, we did not include it in our model since its value is dependent on a specific loss type. The attributes grayed out in Figure 23 are those with entropy values very close to 1. We chose Cluster Node, Loss Type, Vulnerability Type
and Exploit Range as the fields for predicting CVSS score. To validate the choice of fields for predicting impact factors, we performed two multiple regression analyses, one between cluster node and loss type, vulnerability type and exploit range (Figure 24), and the other between CVSS score and loss type, vulnerability type and exploit range (Figure 25). We performed the multiple regression analysis against cluster node (Figure 24) to ascertain whether loss type, vulnerability type and exploit range could be used to determine a specific cluster for a risk element, since our algorithm requires us to classify a risk element to a specific cluster (Figure 9). The determination of the CVSS score is also based on the cluster node chosen; as such, the second multiple regression analysis was conducted to examine whether loss type, vulnerability type, and exploit range were variables that could be used to predict CVSS score (Figure 25).
3.6.1.1 Regression Analysis: Cluster Node versus Loss Type, Vulnerability Type and Exploit Range
The result of the multiple regression analysis, as shown in Figure 24, implies that loss
type, vulnerability type and exploit range are significantly related to cluster node. The
coefficient of determination (R2) and the adjusted R2 of 27.9% show that 27.9% of
variations in the dependent variable (Cluster Node) can be explained by the independent
variables (Loss Type, Vulnerability Type and Exploit Range). This indicates that Loss
Type, Vulnerability Type and Exploit Range can be used to determine the cluster node to
which a risk element could belong. Given the lack of difference between the R2 and the adjusted R2, we can infer that all the variables chosen to explain the variation of cluster nodes are necessary (see Table 9 for the interpretation of the other regression statistics).
Regression Analysis: Cluster Node versus Loss Type, Vulnerability Type and Exploit Range

Regression Statistics
Multiple R           0.528
R Square             0.279
Adjusted R Square    0.279
Standard Error       1.992
Observations         13140

ANOVA
              df       SS           MS         F           Significance F
Regression    3        20169.732    6723.244   1693.694    0
Residual      13136    52144.324    3.970
Total         13139    72314.056

                Coefficients   Standard Error   t Stat     P-value
Intercept       8.070          0.084            96.497     0
LossType        -0.927         0.014            -65.714    0
VulnType        -0.279         0.009            -31.171    7.5E-206
ExploitRange    -0.337         0.034            -10.020    1.52E-23

Figure 24 - Multiple Regression Result (Cluster Node)
The F test statistic (F) and its corresponding p-value (Significance F) indicate an overall goodness of fit for the model. The p-value (0) is considered highly significant, as it is less than 1%. The F-test statistic of 1693.694 shows that the ratio of explained variation (6723.244) to unexplained variation (3.970) is very large; more importantly, the significance value (p < 0.001) of the F-test statistic allows us to reject the null hypothesis that there is no significant relationship between Cluster Node and the independent variables Loss Type, Vulnerability Type and Exploit Range. We also looked at the influence of Loss Type, Vulnerability Type and Exploit Range on Cluster Node; as can be seen from the test statistics for LossType, VulnType and ExploitRange, along with their P-values (p < .0001), the three independent variables have a significant relationship with Cluster Node and as such can be used to classify a risk element entry to a cluster.
Regression Statistics            Interpretation
Multiple R = 0.528               R = Coefficient of Multiple Correlation = the positive square root of R-squared.
R Square = 0.279 (27.9%)         R-squared = Coefficient of Multiple Determination = the percent of the variation in the cluster node (dependent variable) that is explained by the x-variables (LossType, VulnType, ExploitRange) and the model.
Adjusted R Square = 0.279        R-squared adjusted = a version of R-squared that has been adjusted for the number of predictors in the model; R-squared tends to overestimate the strength of the association, especially when more than one independent variable is included in the model.
Standard Error = 1.992           S = Standard Error = Standard Error of the Estimate = the average squared difference of the error between the actual and the predicted values.
Observations = 13140             The number of observations in the sample.
Table 9 - Regression Statistic Interpretation
3.6.1.2 Regression Analysis: CVSS Score versus Loss Type, Vulnerability Type and Exploit Range

Regression Analysis: CVSS Score versus Loss Type, Vulnerability Type and Exploit Range

Regression Statistics
Multiple R           0.593
R Square             0.352
Adjusted R Square    0.352
Standard Error       1.943
Observations         13140

ANOVA
              df       SS             MS         F           Significance F
Regression    3        26962.24325    8987.414   2380.023    0
Residual      13136    49604.01067    3.776188
Total         13139    76566.25392

                Coefficients   Standard Error   t Stat    P-value
Intercept       0.916          0.082            11.224    4.2E-29
LossType        1.147          0.014            83.300    0
VulnType        0.108          0.009            12.335    9.15E-35
ExploitRange    0.774          0.033            23.583    1.8E-120

Figure 25 - Multiple Regression Result (CVSS score)
The result of the analysis to see if the parameters, loss type, vulnerability type and exploit
range, would aid in predicting CVSS score is summarized in Figure 25. The coefficient
of determination, R2 and the adjusted R2 = .352, showed that 35.2% of the variation of
CVSS score could be explained by the independent variables, loss type, vulnerability type
and exploit range. Also, the F-test statistic of 2380.023 with P-value (p < .001) showed
that there is a significant relationship between the dependent variable CVSS score and the
independent variables in question. The P-values of the t-test statistics for LossType, VulnType and ExploitRange (all p < .0001) also show that the independent variables (LossType, VulnType and ExploitRange) have a significant linear relationship with the dependent variable CVSS score.
3.6.2 Determining Loss Expectancy
The result of the multiple regression analysis motivated us to use the related data from a classified cluster to determine the predicted impact score and the predicted frequency of occurrence. We used the predicted impact score in the same manner that asset loss × exposure factor is used in the traditional approach to determine single loss expectancy.
Given the predicted impact score, frequency of occurrence of incidence, and growth rate
for each of these attributes, the loss expectancy for a specified time was calculated as
follows:
Predicted Loss Expectancy (PLE) = Predicted Impact Score (PIS) × Predicted Frequency
of Occurrence (PFO)
Equation 4 – Predicted Loss Expectancy (PLE)
Using the growth rate of the CVSS score and the average CVSS score, the predicted impact value was calculated as follows for each test entry:
Predicted Impact Score = x̄ CVSS Score (period) × (1 + µ Growth Rate (period))
where
period covers the evaluation time given,
x̄ = the weighted mean, with the weight being the number of incidences reported, and
µ = the arithmetic mean.
Equation 5 - Predicted Impact Score (PIS)
To find the weighted mean CVSS score for the period (x̄ CVSS Score (period)), we used the number of incidences reported for each month as the weighting factor. We used the relative weights of the number of incidences for each month as coefficients to express the weighted mean in a linear form. The coefficients αi were determined by dividing the number of incidences reported per month by the total number of incidences reported for the time period under evaluation.
x̄ CVSS Score (period) = α1x1 + α2x2 + … + αnxn
where
the real numbers αi satisfy αi ≥ 0 and α1 + α2 + … + αn = 1,
αi = the incidence-count coefficient for month i, and
xi = the average CVSS score for month i.
Equation 6 - Weighted Average CVSS Score
The CVSS score and frequency of occurrence growth rates were calculated as follows:
Growth Rate = (current month's average value − previous month's average value) ÷ current month's average value
Equation 7 - Growth Rate
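The quantities in Equations 5 – 7 can be illustrated with the sketch below, which assumes the monthly aggregates produced by the query in Figure 26 have been saved into a hypothetical monthlyStats(timeLag, avgScore, reportedEntries) table; it is an approximation of the stored procedure in Appendix C-1, not a copy of it.

-- Hypothetical sketch of Equations 5-7; timeLag counts months back from the
-- evaluation date, so the chronologically previous month of timeLag t is t + 1.
WITH growth AS (
    SELECT cur.timeLag,
           cur.avgScore,
           cur.reportedEntries,
           (cur.avgScore - prev.avgScore) / cur.avgScore                            AS scoreGrowth,  -- Equation 7
           (cur.reportedEntries - prev.reportedEntries) * 1.0 / cur.reportedEntries AS freqGrowth
    FROM monthlyStats cur
    LEFT JOIN monthlyStats prev ON prev.timeLag = cur.timeLag + 1
)
SELECT SUM(avgScore * reportedEntries) * 1.0 / SUM(reportedEntries)  AS weightedAvgScore,  -- Equation 6
       AVG(scoreGrowth)                                              AS meanScoreGrowth,
       SUM(reportedEntries)                                          AS totalIncidences,
       AVG(freqGrowth)                                               AS meanFreqGrowth,
       SUM(avgScore * reportedEntries) * 1.0 / SUM(reportedEntries)
           * (1 + AVG(scoreGrowth))                                  AS predictedImpactScore  -- Equation 5
FROM growth

Equation 4 then gives the loss expectancy as predictedImpactScore multiplied by the predicted frequency of occurrence derived from totalIncidences and meanFreqGrowth.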
3.6.2.1 Validation of Loss Expectancy Calculations
To make sure that the calculations in our algorithm were correct, we generated a query to
select certain sample data from the input set (Figure 26).
select datediff(month, published, '01/01/2002') as timeLag,
       avg(cvss_score) as avgScore,
       count(distinct name) as reportedEntries
from trainingdatawclusters96_01
where (datediff(month, published, '01/01/2002') <= 12) and
      (datediff(month, published, '01/01/2002') > 0) and
      clusternode = 'Cluster 1'
group by datediff(month, published, '01/01/2002')
order by timeLag
Figure 26 - Query for Sample Data to Validate Calculation
We used Excel to calculate the values in question (Table 10) and then compared the results to those obtained from actually running the stored procedure used within SSRAM (Figure 27). See Appendix C-1 for the complete listing of the stored procedure used to calculate the different values necessary for the computation of loss expectancy. The coefficients for the reported entries were calculated for each month by dividing the number of entries for that month by the total number of entries. For example, in Table 10, the coefficient for Dec-01 (0.15) was calculated by dividing 94 by 630. The weighted average score was then calculated using Equation 6. The growth rates for the score and frequency of occurrence were calculated using Equation 7, and the total number of incidences reported was determined by adding the reported entries for each month. These values, as noted in Table 10 and Figure 27, are the same, which validates that the algorithm works as expected.
TimeLag   12 months prior to 1/1/2002   Reported Entries   Avg Score   Coefficient of Reported Entries   Score Growth   Frequency Growth
1         Dec-01                        94                 7.58        0.15                              -0.11          -0.72
2         Nov-01                        26                 6.77        0.04                               0.06           0.96
3         Oct-01                        51                 7.16        0.08                              -0.03          -0.10
4         Sep-01                        46                 6.97        0.07                               0.23           0.83
5         Aug-01                        84                 8.57        0.13                              -0.07          -0.04
6         Jul-01                        81                 8.01        0.13                               0.09          -0.12
7         Jun-01                        71                 8.72        0.11                              -0.05          -0.39
8         May-01                        43                 8.32        0.07                              -0.33          -0.91
9         Apr-01                        4                  5.57        0.01                               0.59           7.25
10        Mar-01                        33                 8.86        0.05                               0.00           0.48
11        Feb-01                        49                 8.87        0.08                              -0.24          -0.02
12        Jan-01                        48                 6.77        0.08
Total                                   630                WeightedAvg 7.93                               0.0143         0.6564
Table 10 – Calculation of Impact Factors (Excel Result)
Figure 27 - Calculation of Impact Factors Result (Stored Procedure)
To ascertain that an impact and loss expectation prediction could be made, we created an application to determine loss expectancy for a single entry (Appendix D-1). We chose loss type, vulnerability type, and exploit range values and used that information to classify the entry to a cluster node, from which the impact factors were calculated. For our case, we chose input as the vulnerability type, security protection as the loss type, and an exploitation range of local. This selection was classified as a Cluster 1 entry, and the calculations for impact were computed (Figure 28). Note that these values are also consistent with the values shown in Figure 27 and Table 10 for Cluster 1 data.
Figure 28 - Single Entry Impact Estimation
3.6.3 Validation of Loss Expectancy
The goal of this research is to provide an empirical assessment of risk elements based on
historical data and provide a prioritized list of the risk elements based on the empirical
estimations generated. To this end we have chosen the historical data validation approach
[Sargent 2003] for validating our research.
NVD’s data was divided into two parts, the training and testing data. The data from
1996 - 2001 was used to build the impact prediction model and tested against the impact
values for January 2002 as shown in Figure 31 (Appendix D-2 for code). In essence the
test data was loaded as vulnerability entries as shown in Figure 29; each entry was then
classified to a cluster and the data from that cluster used to predict the frequency of
occurrence and the impact value for each entry (Figure 30). The mean values from the
actual data were compared to the predicted scores from the model (see Appendix D-3 for
the data) using t-Test comparison of means (Table 11).
Figure 29 - Vulnerability Entries (Test Data – 2002)
Figure 30 - Predicted Values
Figure 31 - Impact and Loss Expectation Estimation
The goal of this research is to predict the impact score of a given risk element. The predicted impact is analogous to the base CVSS score provided by NVD. Our claim is that our predicted impact scores will be statistically equivalent to the actual CVSS scores from NVD. We express this claim as µd = 0. As such, we have the null hypothesis (H0)
H0: The impact values derived from SSRAM do not reflect the risk posture of the threat element in question.
H0: µd > 0 or µd < 0, where µd = (µactual – µpredicted), the difference between the means of the actual and predicted scores,
versus
Ha: The impact values derived from SSRAM correctly reflect the risk posture of a software development effort, in that µactual – µpredicted = 0.
Ha: µd = 0
t-Test: Paired Two Sample for Means

                                Actual Score   Predicted
Mean                            5.663889       6.10243
Variance                        5.556087       2.225617
Observations                    36             36
Pearson Correlation             0.67673
Hypothesized Mean Difference    0
df                              35
t Stat                          -1.51354
P(T<=t) one-tail                0.06956
t Critical one-tail             1.689572
P(T<=t) two-tail                0.139121
t Critical two-tail             2.030108

Table 11 - t-Test: Paired Two Sample for Means
The t-test statistic was computed and compared with the critical t-value at a significance level of α = 0.05. The sample mean of the differences is d̄ = −0.4385 and the sample standard deviation is sd = 1.7385. The test statistic is
t0 = d̄ / (sd / √n) = −0.4385 / (1.7385 / √36) = −1.5135
Since this is a two-tailed test, we determined the critical t-values at the α = 0.05 level of significance with n − 1 = 36 − 1 = 35 degrees of freedom to be −t(α/2) = −t(0.025) = −2.030 and t(0.025) = 2.030. Because the test statistic t0 = −1.5135 is greater than the critical value −t(0.025) = −2.030 and |t0| = 1.5135 is less than t(0.025) = 2.030, we reject the null hypothesis that there is a significant difference between the actual and predicted scores. These statistics suggest that there is no statistically significant difference between the predicted scores given by SSRAM and the actual scores as reported to NVD.
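The paired comparison itself reduces to a few aggregates; the sketch below assumes a hypothetical scoreComparison(entry_name, actualScore, predictedScore) table holding the January 2002 test entries and the corresponding SSRAM predictions, with the t statistic computed as above.

-- Hypothetical sketch: paired t statistic for actual versus predicted scores.
SELECT COUNT(*)                                                   AS n,
       AVG(actualScore - predictedScore)                          AS meanDifference,
       STDEV(actualScore - predictedScore)                        AS stdDevDifference,
       AVG(actualScore - predictedScore)
         / (STDEV(actualScore - predictedScore) / SQRT(COUNT(*))) AS tStat
FROM scoreComparison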
To further evaluate the predictions, we compared the predictions based on the three classification schemes, as shown in Table 12, Table 13, and Table 14. The critical value for all three algorithms is about 2.04, and the t-stats of 1.06 for the decision tree and 1.633 for the neural network, as evidenced in Table 13 and Table 14, showed that the null hypothesis could be rejected for those two classifiers. This result was consistent with the predictive accuracy results shown in section 3.5.1, which reflected that there was no significant difference between neural networks and decision trees when used for predictions. There was a difference with Naïve Bayes, which is reflected in the t-stat of 2.13 shown in Table 12. This t-stat does not allow us to reject the null hypothesis for Naïve Bayes, indicating a significant difference between the actual scores and the predicted scores when using Naïve Bayes as the classification scheme for prediction. The result of this analysis shows that either the neural network or the decision tree algorithm can be used for classifying new risk elements.
                                Actual        Predicted
Mean                            5.577419      6.23709
Variance                        5.520473      2.362795
Observations                    31            31
Pearson Correlation             0.68273
Hypothesized Mean Difference    0
df                              30
t Stat                          -2.1378
P(T<=t) one-tail                0.020399
t Critical one-tail             1.697261
P(T<=t) two-tail                0.040798
t Critical two-tail             2.042272
Table 12 - t-Test Results using Naïve Bayes algorithm
                                Actual        Predicted
Mean                            5.76129       6.102877
Variance                        5.636452      2.316325
Observations                    31            31
Pearson Correlation             0.658869
Hypothesized Mean Difference    0
df                              30
t Stat                          -1.06461
P(T<=t) one-tail                0.147773
t Critical one-tail             1.697261
P(T<=t) two-tail                0.295546
t Critical two-tail             2.042272
Table 13 - t-Test: Paired Two Sample for Means using Decision Tree Classifier
                                Actual         Predicted
Mean                            5.653125       6.180540625
Variance                        5.525796371    2.337423284
Observations                    32             32
Pearson Correlation             0.629684793
Hypothesized Mean Difference    0
df                              31
t Stat                          1.633197931
P(T<=t) one-tail                0.056272384
t Critical one-tail             1.695518742
P(T<=t) two-tail                0.112544769
t Critical two-tail             2.039513438
Table 14 - t-Test: Paired Two Sample for Means using Neural Networks Classification Algorithm
3.7 Confidence Interval of Predictions
The results of our predictions, as shown in Figure 30 and Figure 31, are based on a 95% confidence that the predicted scores lie within ± the margin of error shown in Table 15. The confidence interval was computed on the training data used for the classification of new risk elements. Table 15 shows the number of distinct elements in each cluster, their average CVSS score, and the standard deviation from the mean for each cluster's score. Given the large number of entries in each cluster, we can assume a normal distribution and use these values to construct the margin of error as:
Margin of Error = 1.96 × σ / √n
Equation 8 - Margin of Error
where σ is the standard deviation, n is the size of each cluster, and the value 1.96, based on a two-tailed normal distribution at the 95% confidence level, is used as the critical value.
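Computed per cluster over the training data, Equation 8 might look like the following sketch; the table and column names are reused from the query in Figure 26 as assumptions rather than guaranteed names.

-- Hypothetical sketch: Equation 8 per cluster over the 1996-2001 training data.
SELECT clusternode,
       COUNT(DISTINCT name)                                    AS numInCluster,
       AVG(cvss_score)                                         AS avgCvssScore,
       STDEV(cvss_score)                                       AS stdDev,
       1.96 * STDEV(cvss_score) / SQRT(COUNT(DISTINCT name))   AS marginOfError
FROM trainingdatawclusters96_01
GROUP BY clusternode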
Training Data by Clusters (1996 - 2001)
Cluster     NumInCluster   CVSS_Score   StdDev   Margin of Error   95% CI Lower   95% CI Upper
Cluster 2   1257           8.2541       1.8124   0.1002            8.1539         8.3543
Cluster 4   719            7.0872       1.7984   0.1315            6.9557         7.2186
Cluster 7   107            7            0        0                 7              7
Cluster 5   765            5.5393       2.0795   0.1474            5.3919         5.6867
Cluster 1   1298           5.3409       1.9665   0.107             5.2339         5.4478
Cluster 3   1250           4.3934       2.0171   0.1118            4.2816         4.5052
Cluster 6   735            4.3105       1.6957   0.1226            4.1879         4.4331
Table 15 - Confidence Interval Derivation
Given this confidence interval derivation, we can say that we are confident that 95% of
our predictions will be within the margin of error shown in Table 15. For example, in
Figure 30 ‘CVE-1999-1081’ is classified as Cluster 3 with a predicted impact of 4.9852.
Given our confidence interval construction, we can say that we are 95% confident that
the predicted impact factor for ‘CVE-1999-1081’ is 4.9852 ± 0.1118.
3.8 Summary
One of the deterrents to creating objective measures for assessing risks in software
security was purported to be the lack of data to use as a basis for historical predictions.
There are now open source vulnerability databases, one such being the NVD, which
provide a historical data source that can be used to assess vulnerabilities. Since the data
from NVD is an RSS feed in XML format, we parsed the NVD data into SSRAM’s
database. In so doing, we were able to categorize the data using the K-Means and EM algorithms to obtain natural groupings of the entries based on the data about each entry. These two algorithms represent different approaches to categorizing data.
We compared the performance of the models between two mining structures, one with
the description entries, a comment-like attribute without discrete values, and one without.
This performance was measured by dividing the NVD data into two segments, one for
training and the other for testing. We used the test data to find an indication of how
likely it is for each test case to exist within a determined cluster. Based on the average
case likelihood scores obtained, we saw that the clustering algorithms that took into
account the description vectors performed better than those without. In addition to this,
we examined the cohesion of each cluster with respect to the centroid of the cluster by
looking at the mean, standard deviation, maximum and minimum values for each of the
clusters within each mining model. We validated our choice by comparing the cohesion
values of the algorithms.
Within each structure we compared EM versus K-Means models and chose those with
higher average case likelihood scores. EM consistently outperformed KM models for our
data. Choosing the algorithm with the best score, we labeled each entry of the training
data with the cluster it belonged to. To determine which fields should be used to classify the data, we looked at those fields with entropy values > 0.05 and < 0.95. We then ran two multiple regression analyses, which at α = 0.05 significance showed that the variables loss type, vulnerability type and exploit range were significant for predicting cluster node and CVSS score. Upon labeling each entry with its cluster, we investigated different classification algorithms to see which best classified the data under investigation. We again used two mining structures to compare the results of the classification mining models, one with the description terms and another without them. Furthermore, we looked at the accuracy of each of the classification models and, with a step-wise approach, chose the algorithm with good accuracy and a shorter processing time. We adapted the traditional loss expectancy model by using the CVSS score, as opposed to asset cost, as the impact factor to determine loss expectation.
In addition, we constructed confidence intervals for our predictions based on the training
data sets used for classifying new entries at a 95% level of confidence.
Given these results, this study, along with the statistically significant historical validation of the model, suggests an empirical means for implementing risk assessment. It does so by providing a means for categorizing historical data (NVD) based on risk factors such as loss type, vulnerability type, exploit range and other product-dependent factors. These discovered categories allow for classifying new threats and, through this classification, for predicting the impact of exercising a threat and the frequency of occurrence of such exercise. These predicted values are used to rank the risk elements identified during a threat modeling activity and provide a means for objectively justifying the approaches chosen to ameliorate the stated risks.
4 APPLICATION OF MODEL TO CASE STUDY
This chapter shows the application of the model to a case study. We looked at an electronic voting system, Prime III, developed by the Human Computer Centered Lab at Auburn University [Prime III]:
The Prime III voting system should be engineered through a human-centered
computing approach. This approach considers the users first and implements a
design that accommodates users and integrates usability with the necessary
safeguards to provide security, trust, and privacy (i.e., usable security). Prime III
should provide a naturally interactive user interface that reduces the learning
curve by using multiple modalities during the voting process. Voters will cast
their votes using touch and/or voice commands as shown in [Figure 32],
eliminating the need for specialized machines for one segment of the voting
population. Everyone votes on the same type of machine, independently without
additional assistance. The procedure for a voter should be as follows:
1. The voter steps into the voting booth and puts on the provided headset.
2. The voter begins voting as normal.
3. Each time the voter makes a selection either by voice or touch, he hears a confirmation in the
headphones and simultaneously observes the confirmation on the screen. For example, when the
voter selects candidate A, Prime III should say “selected candidate A” and concurrently display
the selection. This audio and visual confirmation should be heard and seen for every important
action that the voter takes including, selecting a candidate, unselecting a candidate, advancing
races, and submitting the ballot.
4. The Prime III visual and audio output are passed to the video recording unit which records the
voting session on some physical medium such as a video cassette.
5. At the end of the session the voter confirms his selection twice; then submits their ballot and
leaves the voting booth.
As illustrated below in Figure 32, Prime III will be composed of systems and software that are deployable at each electoral precinct. The basic functions of this system are to permit citizens to vote and to allow for the tabulation and creation of an auditable trail of the ballots cast.
Figure 32: Prime III System Architecture High-Level Overview
4.1 Vulnerability Identification and Data Preparation
To determine the vulnerabilities for this application, Microsoft’s Threat Analysis and
Modeling Tool (TAMT) was used to generate the non-prioritized list of vulnerabilities
(Appendix E-1), namely, Canonicalization, Buffer Overflow, Cryptanalysis Attacks,
Format String and Integer Underflow or Overflow.
Before entering this non-prioritized list into SSRAM, each of the threats given was evaluated for the type of security impact that would result if the threat were exercised, the kind of vulnerability the threat could exploit, and the exploitation range that could exist for it. A summary of the threats as given by TAMT and of the loss, vulnerability and exploit range evaluations performed is shown in Table 16. It should be noted that TAMT provided the threats and descriptions from which we derived the loss type, vulnerability and exploit range values.
Threat-1 – Canonicalization. Description: Only accept primitive typed identified (e.g., integers) which are mapped to filenames. Loss Type: Confidentiality, Availability, Integrity. Vulnerability: Input. Exploit Range: UserInit, Local.
Threat-2 – Buffer Overflow. Description: Use safe functions such as strncpy, strncat instead of strcpy, strcat; validation on input should be performed. Loss Type: Confidentiality, Availability, Integrity. Vulnerability: Input, Access. Exploit Range: UserInit, Local.
Threat-3 – Cryptanalysis Attacks. Description: Use well-known implementations of well-known cryptographic algorithms; use cryptographically generated random keys; utilize platform supplied feature to store secret key (e.g., DPAPI); utilize SSL or IPSec w/ Encryption to establish a secure communication channel. Loss Type: Confidentiality, Availability, Integrity, Security Protection – Admin. Vulnerability: Environment, Access. Exploit Range: Local.
Threat-4 – Format String. Description: Use a managed language. Loss Type: Confidentiality, Availability, Integrity. Vulnerability: Design. Exploit Range: UserInit, Local.
Threat-5 – Integer Overflow/Underflow. Description: Use Language features. Loss Type: Confidentiality, Availability, Integrity. Vulnerability: Input. Exploit Range: UserInit.
Table 16 - Prime III Vulnerability List
The vulnerability listing and the evaluation of threat descriptions as shown in Table 16
provided the basis for creating tables that associated each threat to its loss type,
vulnerability type and exploit range as portrayed in Table 17 (a – c). These tables were
then de-normalized using a select statement similar to that of Figure 14 to create a table
similar to Table 18. The complete de-normalized table is shown in Appendix E-2.
(a) Loss Type                (b) Vulnerability Type        (c) Exploit Range
ID         LossType          ID         VulnType           ID         ExploitRange
Threat-1   Avail             Threat-1   input              Threat-1   user_init
Threat-1   Int               Threat-2   access             Threat-1   local
Threat-1   Conf              Threat-2   input              Threat-2   user_init
Threat-2   Int               Threat-3   env                Threat-2   local
Threat-2   Avail             Threat-3   access             Threat-3   local
Threat-2   Conf              Threat-4   design             Threat-3   local
Threat-3   Avail             Threat-5   input              Threat-4   user_init
Threat-3   Int                                             Threat-4   local
Threat-3   Conf                                            Threat-5   user_init
Threat-3   Sec_Prot
Threat-4   Avail
Threat-4   Int
Threat-4   Conf
Threat-5   Avail
Threat-5   Int
Threat-5   Conf
Table 17 - Prime III Normalized Tables
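The de-normalization described above might be expressed as the sketch below, assuming the three normalized tables of Table 17 are stored as hypothetical primeLossType(ID, LossType), primeVulnType(ID, VulnType) and primeExploitRange(ID, ExploitRange) tables, with a primeThreats(ID, Name) table supplying the threat names; the join simply mirrors the style of Figure 14.

-- Hypothetical sketch: de-normalize the Prime III threat data (compare Figure 14).
SELECT t.ID, t.Name, v.VulnType, l.LossType, r.ExploitRange
INTO   denormPrimeIII
FROM   primeThreats t, primeLossType l, primeVulnType v, primeExploitRange r
WHERE  t.ID = l.ID AND
       t.ID = v.ID AND
       t.ID = r.ID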
ID         Name               VulnType   Loss Type   Exploit Range
Threat-1   Canonicalization   input      Avail       user_init
Threat-1   Canonicalization   input      Avail       local
Threat-1   Canonicalization   input      Int         user_init
Threat-1   Canonicalization   input      Int         local
Threat-1   Canonicalization   input      Conf        user_init
Threat-1   Canonicalization   input      Conf        local
Threat-2   Buffer Overflow    access     Int         user_init
Threat-2   Buffer Overflow    input      Int         user_init
Threat-2   Buffer Overflow    access     Int         local
Threat-2   Buffer Overflow    input      Int         local
Threat-2   Buffer Overflow    access     Avail       user_init
Threat-2   Buffer Overflow    input      Avail       user_init
Threat-2   Buffer Overflow    access     Avail       local
Threat-2   Buffer Overflow    input      Avail       local
Threat-2   Buffer Overflow    access     Conf        user_init
Table 18 - Example of PrimeIII’s Input Data
4.2 Data Classification and Loss Expectation Determination
As noted in section 3.1, using SSRAM to assess risk elements requires that the historical
data used be segmented into clusters (Figure 8). Since this had already been done with data
from 1996 – 2001, we imported the data shown in Table 18 into the SSRAM database and
loaded the same data into SSRAM's application (Figure 33) to assess the risk of the
vulnerabilities identified. SSRAM classified each of the de-normalized entries based on
its similarity to the clusters within SSRAM. The data within the cluster assigned to each
entry was then used to calculate the impact factor and the frequency of occurrence of
each threat within the time constraint given (Figure 34).
Figure 33 - Load Prime III
Figure 34 - Prime III De-normalized Predictions
The de-normalized vulnerabilities list, along with the impact criteria, served as the basis
for generating the predicted values. The Predicted Values data grid (Figure 34) shows not
only the vulnerability entries uploaded but also the cluster each entry is most similar to
(EntryCluster), the predicted impact based on that cluster (PredictedImpact), the number
of like incidences reported for the time period under evaluation (Incidences), the rate of
growth of those incidences (FreqGrowthRate), the predicted frequency of occurrence
(Predicted Frequency), and the loss expectation (LossExpect). The predicted impact value
derived is an estimate of the CVSS base score, and the predicted frequency of occurrence,
which is the product of the incidences and the rate of growth of the number of incidences
reported, reflects the evaluated period's predicted number of occurrences. These
predictions are based on a statistically significant, historically validated approach, as
explained in section 3.6.3 and summarized in Table 11 of the same section. The prediction
of impact is based on the confidence that 95% of predictions will fall within the margin of
error, as explained in section 3.7 of Chapter 3.
For example, in Figure 34, the first entry is assigned to Cluster 4 with a predicted impact
of 7.36 ± 0.1315 (Table 15). The loss expectation is calculated as the product of the
predicted impact and the predicted frequency of occurrence (Equation 4), following the
best practice used in calculating traditional loss expectancy (Equation 1). This
information is then aggregated for each vulnerability entry and ranked in descending
order, as seen in the resulting prioritized list (Figure 35).
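One plausible form of that aggregation, shown only as a hedged sketch (the Predictions
table name and the exact aggregation are assumptions; the column names follow the grid
in Figure 34), is:

-- Loss expectation per threat, ranked in descending order (sketch only;
-- the Predictions table and the exact aggregation are assumptions).
SELECT   Name,
         AVG(PredictedImpact)                AS PredictedImpact,
         SUM(Incidences * FreqGrowthRate)    AS PredictedFrequency,
         AVG(PredictedImpact) *
           SUM(Incidences * FreqGrowthRate)  AS LossExpect    -- Equation 4
FROM     Predictions
GROUP BY Name
ORDER BY LossExpect DESC;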
Since loss expectation is derived from both the impact factor and the frequency of
occurrence, we can identify situations in which a threat of lower impact may require more
attention because of the high predicted frequency of attacks against a particular
vulnerability. In Figure 35, the prioritized list of threats identified and predicted for
January of 2001 shows 'Format String' as the risk element of highest threat, with a loss
expectation value of 2266.78. Even though the impact factor for 'Format String' (6.15) is
less than that of 'Canonicalization' (7.42), 'Format String' would still be considered a
greater threat because of its higher predicted number of occurrences (377) when compared
to 'Canonicalization' (237). In looking at the aggregated predicted impact scores, the
clusters they most resemble, and the confidence intervals for each cluster as depicted in
Table 15 of section 3.7, we observed that 'Canonicalization', 'Buffer Overflow', and
'Integer Overflow/Underflow' would be classified to the same cluster, Cluster 4, while
'Cryptanalysis' and 'Format String' would be classified as Cluster 5 data. As can be seen
in Figure 35, the difference in priority is therefore driven by the predicted frequency with
which threats exercise these vulnerabilities. Since our prediction of impact is based on the
CVSS score, the values for the Prime III vulnerabilities warrant that all listed
vulnerabilities be addressed during the development phase of the software development
life cycle. The CVSS scores range from 0 – 9, where 0 indicates no damage potential and
9 signifies high collateral damage. All scores for the prediction depicted in Figure 35 are
above the mid-level point of 5 and would be considered at least medium- to high-risk
vulnerabilities. In order of priority based on loss expectation, 'Format String',
'Canonicalization', 'Buffer Overflow', 'Integer Overflow/Underflow', and 'Cryptanalysis'
should be addressed in the order listed. Based on the confidence intervals for the
predictions, we would surmise that 'Canonicalization', 'Buffer Overflow', and 'Integer
Overflow/Underflow' are of the same level of importance; and although 'Cryptanalysis'
and 'Format String' fall in the same classification, the higher predicted frequency with
which the 'Format String' vulnerability would be exercised means it should take
precedence in mitigation.
Figure 36 shows the prediction of impact based on the same vulnerability list but with
historical data from 2003 – 2005 used to predict January 2006 data. To obtain the clusters
for classifying this data, we applied the EM clustering algorithm (NVDCL-EM10). It can
be seen that, though the predicted impact scores are lower, the frequency of occurrence of
threats has at least tripled. It should also be noted that the vulnerability whose loss
expectation implies the greatest collateral damage has changed from 'Format String' to
'Buffer Overflow'. 'Canonicalization' and 'Integer Overflow/Underflow' now command
the same level of attention, and though their impact scores (4.56) are slightly below the
medium risk level, the high frequency of occurrence (1,784) requires that countermeasures
for these vulnerabilities be put in place even before attention is given to the 'Format
String' vulnerability, despite its higher impact score. The full prediction and confidence
interval information for Prime III is in Appendix E-3.
Training Data by Clusters (1996 – 2001)

Cluster    CVSS_Score  StdDev  Margin of Error  Lower (95% CI)  Upper (95% CI)
Cluster 2  8.2541      1.8124  0.1002           8.1539          8.3543
Cluster 4  7.0872      1.7984  0.1315           6.9557          7.2186
Cluster 7  7           0       0                7               7
Cluster 5  5.5393      2.0795  0.1474           5.3919          5.6867
Cluster 1  5.3409      1.9665  0.107            5.2339          5.4478
Cluster 3  4.3934      2.0171  0.1118           4.2816          4.5052
Cluster 6  4.3105      1.6957  0.1226           4.1879          4.4331

Table 15 - Confidence Interval Derivation
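The margins of error in Table 15 are consistent with the standard large-sample 95%
interval (z ≈ 1.96). A minimal sketch of how such per-cluster margins could be derived
from the labeled training data, assuming a table of cluster-labeled entries named
ClusteredTraining (an assumed name) with Cluster and CVSS_score columns, follows:

-- Per-cluster mean, standard deviation, and 95% margin of error
-- (sketch; ClusteredTraining and its columns are assumed names).
SELECT   Cluster,
         AVG(CVSS_score)                            AS MeanScore,
         STDEV(CVSS_score)                          AS StdDev,
         1.96 * STDEV(CVSS_score) / SQRT(COUNT(*))  AS MarginOfError
FROM     ClusteredTraining
GROUP BY Cluster
ORDER BY MeanScore DESC;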
Figure 35 - Prime III Loss Expectation (Prioritized List) – Predictions
Figure 36 - Prime III Loss Expectation (Prioritized List) – 2006 Predictions
Given this ordered list with empirical values showing the vulnerabilities with their
associated impact values and predicted numbers of occurrences, we have provided an
objective measure that can be used to justify dealing with the vulnerabilities in question.
Unlike vulnerability identification systems that list the vulnerabilities and threats of a
particular effort in no particular order of importance, SSRAM's prediction of the impact
factor and the frequency of occurrence provides an objective means for determining and
justifying which threats to ameliorate. SSRAM also provides an empirical foundation for
focusing testing later in the development process to make sure that the development
effort took steps to lessen the chances of these threats being successfully exercised.
Analysts can now determine, based on the tools at hand, what needs to be done, what
resources need to be acquired, and on which vulnerabilities they want to concentrate
efforts and resources. In essence, they can perform an analysis analogous to cost-benefit
analysis and empirically assess the opportunity cost of the risk elements they choose to
address.
5 SUMMARY, FUTURE WORK AND CONCLUSIONS
5.1 Summary
When responding to system vulnerabilities, efforts tend towards reactive measures that
support the hardening of systems and their connected networks. Even though these
efforts are necessary, they do not address the fact that the majority of the security
vulnerabilities are due to software vulnerabilities, as reported by NIST [Curphey 2004].
Efforts to ameliorate system vulnerabilities should in effect start early in the software
development life cycle, so that security is built in and not bolted onto the system upon
completion of development. To this end, risks to a software development effort have to
be determined and assessed early in the development life cycle. Attempts to date to
assess the risks that could apply to a development effort have been largely qualitative in
nature. Though helpful, these qualitative approaches have drawbacks such as the
difficulty in duplicating results and transferring assessment lessons to other projects. The
traditional quantitative approach of annualized loss expectancy (ALE) and the exposure
factor (EF) carries with it the burden of knowing what asset costs to use. The impact
effect is measured by the cost of asset loss, which is not easily determined during the
design phase of development. If the cost of asset loss is available, it may not adequately
reflect the impact that is a result of the exercise of a given threat. Other efforts that
provide objective results are implemented late in the software development life cycle,
usually during the testing phase, thereby incurring additional costs for dealing with
vulnerabilities assessed after the system has been built.
In looking at risk assessment in other fields such as insurance, banking, and finance, we
find that one of the factors that allows for prediction of risk metrics is the availability of
historical data. One of the deterrents to creating objective measures for assessing risks in
software security was purported to be the lack of data to use as a basis for historical
predictions. As it stands, there are now open source vulnerability databases, which
provide a historical data source that can be used to assess vulnerabilities. We chose the
NVD, and in so doing we were able to categorize the data using the K-Means and EM
algorithms to obtain natural groupings of the entries based on the data about each entry.
These algorithms represent different approaches to categorizing data.
We compared the performance of the models between two mining structures, one with
the description entries, a comment-like attribute without discrete values, and one without.
This performance was measured by dividing the NVD data into two segments, one for
training and the other for testing. We used the test data to find an indication of how
likely it is for each test case to exist within a determined cluster. Based on the average
case likelihood scores obtained, we saw that the clustering algorithms that took into
account the description vectors performed better than those without. In addition to this,
we examined the cohesion of each cluster with respect to the centroid of the cluster by
looking at the mean, standard deviation, maximum and minimum values for each of the
clusters within each mining model. We validated our choice by comparing the cohesion
values of the algorithms.
Within each structure we compared EM versus K-Means models and chose those with
higher average case likelihood scores. The average case likelihood score measures the
likelihood that a test case would be similar to one of the cases within the training model.
EM consistently outperformed KM models for our data. Choosing the algorithm with the
best score, we labeled each entry of the training data with the cluster it belonged to. To
determine what fields should be used to classify the data we looked at those fields with
entropy values > 0.05 and < 0.95. We then ran two multiple regression analyses, which at
the α = 0.05 significance level showed that the variables loss type, vulnerability type, and
exploit range were significant predictors of cluster node and CVSS score. Upon naming each entry
with its cluster, we investigated different classification algorithms to see which best
classified the data under investigation. We again used two mining structures to compare
results of classification mining models, one with the description terms and another
without the description terms. Furthermore, we looked at the accuracy of each of the
classification models and with a step-wise approach chose the algorithm with good
accuracy and shorter performance time. We adapted the traditional loss expectancy model
by using the CVSS score, rather than asset cost, as the impact factor to determine loss
expectation. In addition, we constructed
confidence intervals for our predictions based on the training data sets used for
classifying new entries at a 95% level of confidence. With the ability to predict CVSS
score with statistical significance, our study suggests an empirical approach to
determining the impact of a given risk or threat and provides a way to objectively
compare the impact and frequency of occurrence of identified threats.
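As an illustration of the entropy filter described above, the following sketch computes a
normalized Shannon entropy for one candidate field (loss_type in the denormNVD table of
Appendix A-3); whether the mining tool reports raw or normalized entropy, and the exact
filtering mechanics, are assumptions here:

-- Normalized Shannon entropy of one candidate field (sketch).
-- Fields with entropy > 0.05 and < 0.95 were kept (per section 5.1).
SELECT SUM(-p * LOG(p) / LOG(2.0))
         / NULLIF(LOG(COUNT(*)) / LOG(2.0), 0) AS normalized_entropy
FROM (
    SELECT CAST(COUNT(*) AS float)
             / (SELECT COUNT(*) FROM denormNVD) AS p
    FROM   denormNVD
    GROUP  BY loss_type
) AS dist;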
5.2 Future Work
Any significant research effort invariably opens more questions than it solves. This
endeavor was no different. The following are suggestions for extending the research:
•
Validate SSRAM with an industry project. Our research looked at the Prime III
development effort but did not validate the results of its work. We propose to run
SSRAM in a development environment and assess its effectiveness.
•
Investigate the usage of SSRAM to assess risk on component based systems. As
development efforts rely more and more on other components, empirically
assessing the risk of utilizing components within a development effort is
necessary.
•
Perform comparative studies. We propose to do comparative studies using other
vulnerability databases and other data mining algorithms for classification and
prediction of impact factors.
•
Strengthen predictive values. We will investigate factors, along with time, which
will strengthen the predictive values of the impact scores. Along with this, we
propose to look at the trend of vulnerabilities reported based on security factors
such as impact scores, loss type, vulnerability type, and exploit range.
•
Investigate the characteristic of the clusters within SSRAM. We will investigate
the characteristics of the clusters determined within SSRAM to better describe the
classification of each entry. As it stands, SSRAM classifies each entry to a
generic cluster – “Cluster 1 … Cluster N”.
•
Investigate other variables that can be used to determine clusters. The R² statistic
of 27.9% in Figure 24 shows that there is a need to explore other variables that can
be used to predict the cluster node. Though Loss Type, Vulnerability Type and
Exploit Range are significant predictors of cluster node, they are not enough to
explain the model.
•
Generate countermeasures based on the vulnerabilities assessed. SSRAM would
be enhanced by the provision of countermeasures to enact based on the
vulnerabilities assessed.
•
Integrate SSRAM with a vulnerability identification system. The use of SSRAM
is predicated on the identification of vulnerabilities. The integration of SSRAM
to an identification system would reduce the amount of work necessary to
implement SSRAM and overall produce a more robust risk assessment
methodology.
5.3 Conclusions
In creating a model that predicts impact factors, we found that there was no statistical
difference between the actual scores and the predicted scores produced by our research.
We were able to estimate the CVSS score and use it as the impact factor, allowing loss
expectation to be quantitatively derived from historical data. Along with that, we also
produced confidence intervals for the predicted impact scores.
This study shows that the database of vulnerabilities is adequate for use as a basis for
predicting risk metrics such as impact factor. We were also able to provide a
categorization of vulnerabilities based on development environment factors along with
descriptive terms of the vulnerabilities. This categorization allowed us to classify new
vulnerabilities based on their similarities to historical entries.
This research has shown that the predictions based on these categorizations were not
significantly different from the actual impact scores for the time under research.
Criteria: (1) Data Availability; (2) Assess Risk Elements; (3) Can be used at design level
of development; (4) Generalize & Duplicate Results; (5) Categorization based on
Development Environmental Factors; (6) Identify Risks

Methodology                                    (1)   (2)   (3)   (4)   (5)   (6)
SSRAM                                          Yes   Yes   Yes   Yes   Yes   No
Traditional Risk Impact Estimation             Yes   No    No    No    No    No
Software Reliability                           Yes   No    Yes   Yes   N/A   No
Qualitative Assessment Approaches
  (Boehm, RiskIT, Risk Management Framework)   Yes   Yes   N/A   No    ?     No
Testing & Assessment Approaches                Yes   Yes   Yes   No    No    No
Identification Approaches                      Yes   Yes   Yes   No    No    Yes

Table 19 – Comparison of SSRAM with other Risk Analysis Methodologies
This research shows that our approach (SSRAM), when compared with other risk analysis
methodologies, can be used to objectively assess risk elements early in the software
development life cycle because of its ability to classify risk elements into categories
determined by development environment factors (Table 19). SSRAM provides a basis for
objectively assessing threats and vulnerabilities early in the software development life
cycle. The prediction of both the impact value and the number of occurrences of a given
threat, presented in a prioritized format, provides objective evidence that can direct risk
mitigation during the development of software. Assessment lessons can be transferred to
other projects, and the predictions from SSRAM can be used to perform comparative
analysis of different projects with similar parameters. Since the assessment is done early
in the development cycle, these vulnerabilities can be mitigated during the development
effort without the added cost usually incurred when they are discovered and assessed
during later periods of the software development life cycle, such as the testing phase.
Given the pivotal role risk assessment has in the development of reliable and secure
systems, SSRAM provides a historically validated risk assessment model for analyzing
risks so that an objective justification of the direction and choice of risk elements can be
made.
BIBLIOGRAPHY
[Addison and Vallabh 2002] Controlling Software Project Risks – an Empirical Study of
Methods used by Experienced Project Managers. Proceedings of South African
Institute of Computer Scientists and Information Technologists SAICSIT
(September 2002), 128-140
[Alberts 2006] “Common Elements of Risk” CMU SEI 2006
“http://www.sei.cmu.edu/pub/documents/06.reports/pdf/06tn014.pdf”
[Bartlett II et al. 2001] Bartlett II, J., Kotrlik, J. W., Higgins, C. C., "Organizational Research:
Determining Appropriate Sample Size in Survey Research” Information
Technology, Learning and Performance Journal (Spring 2001), 43-50.
[Boehm 1989] Boehm B., Software Risk Management, IEEE-Computer Science Press,
1989
[Boehm 2001] Boehm, B., "Educating software engineering students to manage risk,"
Proceedings of the 23rd International Conference on Software Engineering
(12-19 May 2001), 591 – 600
[Boehm and DeMarco 1997] Boehm, B.W., and DeMarco, T., "Software Risk
Management," IEEE Software, Volume 14, Issue 3, (May-June 1997), 17 – 19
[Bradley, Fayyad, Reina 1998] Bradley, P. S., Fayyad, U., and Reina, C., “Scaling EM
(Expectation-Maximization) Clustering to Large Databases,” Technical Report
MSR-TR-98-35, Microsoft Research. (1998)
[Carr et al. 1993] Carr, M., Konda, S., Monarch, I., Ulrich, F., and Walker, C.,
"Taxonomy-Based Risk Identification," CMU/SEI-93-TR-6, Software
Engineering Institute, Pittsburgh, PA 15213, 1993.
[Cheswick et al. 2003] Cheswick, B, Kocher, P., McGraw, G., and Rubin, A.. “Bacon Ice
Cream: The Best Mix of Proactive and Reactive Security?” IEEE Security and
Privacy 1.4. (2003)
[Curphey 2004] Curphey, M. “Software Security Testing: Let's Get Back to Basics”
Security (October 2004) http://www.softwaremag.com/L.cfm?Doc=200409/2004-09software-security-testing last accessed 6/15/07
[Farahmand 2003] Farahmand, F., Navathe, S. B., Sharp, G., Enslow, P. H.,
Managing Vulnerabilities of Information Systems to Security Incidents
International Conference on Entertainment Computing ICEC Pittsburgh, PA,
(2003).
[First 2005] CVSS v1 Complete Documentation http://www.first.org/cvss/v1/guide.html
last accessed 5/2006
[First 2008] CVSS v2 Complete Documentation http://www.first.org/cvss/cvssguide.html#i2.1 last accessed 7/22/08
[FIPS 2004] Federal Information Processing Standard (FIPS) 199 http://csrc.nist.gov/publications/fips/fips199/FIPS-PUB-199-final.pdf
- last accessed 3/27/07
[GAO AIMD-00-33 ] Information Security Risk Assessment: Practices of Leading
Organizations (GAO AIMD-00-33, 1999)
http://www.gao.gov/special.pubs/ai00033.pdf last accessed 1/24/07
[Haimes 2004] Haimes, Y. V. Risk Modeling, Assessment, and Management 2nd ed.
Wiley Press, 2004
[Hetzel 1988] Hetzel, William C., The Complete Guide to Software Testing, 2nd ed.
Publication info: Wellesley, Mass. : QED Information Sciences, 1988. ISBN:
0894352423.Physical description: ix, 280 p. : ill ; 24 cm
[Howard and Leblanc] Howard, Michael & LeBlanc, David C. Writing Secure Code, 2nd
ed. Redmond, WA: Microsoft Press, 2002
[Howard et al. 2005] Howard M., LeBlanc D., and Viega J., 19 Deadly Sins of Software
Security, McGraw-Hill, (2005).
[Humphrey et al. 2004] Humphrey, W., Davis, N., Redwine Jr., S. T., Zibulski, G.,
McGraw, G., “Processes for Producing Secure Software – Summary of US
National Cybersecurity Summit Subgroup Report," IEEE Security & Privacy
(May/June 2004)
[Janardhanudu] Janardhanudu, G. “White Box Testing” https://buildsecurityin.uscert.gov/daisy/bsi/articles/best-practices/white-box/259-BSI.html last accessed
3/9/2006
[Jain et al. 1996] Jain, A.K.; Mao, J., Mohiuddin, K.M.; “Artificial neural networks: a
tutorial” Computer Volume 29, Issue 3, ( March 1996),:31 - 44
[Jurjens 2001] Jan Jurjens. “Towards Secure Systems Development with UMLsec,”
Proceedings of FASE 2001. Springer Lecture Notes in Computer Science 2001
[Karunanithi et al. 1992] Karunanithi, N., Whitley, D., Malaiya, Y.K., "Using Neural
Networks in Reliability Prediction," IEEE Software, Volume 9, Issue 4,
(July 1992), 53 – 59
[Khoshgoftar 1995] Khoshgoftar T.M., Szabo R.M., and Guasti P.J., “Exploring the
Behaviour of Neural Network Software Quality Models,” Software Engineering
Journal., (May 1995).
[Klugman et al. 2004] Klugman, S. A., Panjer, H. H., Willmot, G. E., Loss Models From
Data to Decisions 2nd edition Wiley Series in Probability and Statistics John
Wiley & Sons Inc. (2004)
[Kontio 1997] Kontio, J., The Riskit Method for Software Risk Management, version
1.00 CS-TR-3782 / UMIACS-TR- 97-38, 1997. Computer Science Technical
Reports. University of Maryland. College Park, MD
[Kontio 1999] Kontio, J.; Risk management in software development: a technology
overview and the riskit method Software Engineering, 1999. Proceedings of the
1999 International Conference (May 16-22 1999), 679 – 680
[Kontio et al. 1998] J. Kontio, G. Getto, and D. Landes. Experiences in improving risk
management processes using the concepts of the Riskit method. In Proceedings.
ACM SIGSOFT Int’l Symp. Foundations Softw. Eng., (1998), 163–174
[Lodderstedt et al. 2002] Lodderstedt, T., Basin, D., Doser, J., "SecureUML: A UML-Based Modeling Language for Model-Driven Security"
http://kisogawa.inf.ethz.ch/WebBIB/publicationssoftech/papers/2002/0_secuml_uml2002.pdf last accessed 07/03/2007
[Lyu et al. 1996] Michael R. Lyu (editor) Handbook of Software Reliability Engineering
IEEE Computer Society Press and McGraw-Hill Book Company 1996
http://www.cse.cuhk.edu.hk/~lyu/book/reliability/ last accessed 3/30/07
[McGraw 2002] McGraw, G., “Managing Software Security Risks” Computer (April
2002)
[McGraw 2006] McGraw, G., Software Security Building Security In Addison-Wesley
Software Security Series, Boston, MA.( 2006)
[Mead and Stehney 2005] Mead, N. R.; Stehney, T. “ Security Quality Requirements
Engineering (SQUARE) Methodology”
[Mead 2005] Mead, Nancy R., Hough, Eric D. , Stehney II, Theodore R. Software
Quality Requirements Engineering (SQUARE) methodology
http://www.sei.cmu.edu/pub/documents/05.reports/pdf/05tr009.pdf November
2005
[Mead 2006] Mead, Nancy R. Software Engineering Institute Carnegie Mellon University
https://buildsecurityin.us-cert.gov/daisy/bsi/articles/bestpractices/requirements/232.html. Last accessed 1/24/07
[Michael and Radosevich 2005] Michael, C., and Radosevich “Risk-Based and
Functional Security Testing” Building Security In https://buildsecurityin.uscert.gov/portal/article/bestpractices/security_testing/overview.xml#Risk-BasedTesting
[Microsoft 2006] http://www.microsoft.com/downloads/details.aspx?familyid=570dccd9596a-44bc-bed7-1f6f0ad79e3d&displaylang=en accessed last - 3/9/07
[Microsoft 2008-1] mshelp://MS.VSCC.v80/MS.VSIPCC.v80/MS.SQLSVR.v9.en/uas9/html/61eb48618a6a-4214-a4b8-1dd278ad7a68.htm accessed last – 5/20/08
[Microsoft 2008-2] ] mshelp://MS.VSCC.v80/MS.VSIPCC.v80/MS.SQLSVR.v9.en/uas9/html/3b53e0113b1a-4cd1-bdc2-456768ba31b5.htm accessed last – 5/20/08
[Microsoft 2008-3] mshelp://MS.VSCC.v80/MS.VSIPCC.v80/MS.SQLSVR.v9.en/uas9/html/95ffe66fc261-4dc5-ad57-14d2d73205ff.htm accessed last – 5/20/08
[Microsoft 2008-4] mshelp://MS.VSCC.v80/MS.VSIPCC.v80/MS.SQLSVR.v9.en/uas9/html/61eb48618a6a-4214-a4b8-1dd278ad7a68.htm accessed last – 7/1/08
[Microsoft 2008-5] mshelp://MS.VSCC.v80/MS.VSIPCC.v80/MS.SQLSVR.v9.en/uas9/html/3b53e0113b1a-4cd1-bdc2-456768ba31b5.htm accessed last – 5/20/08
[Mkpong-Ruffin and Umphress 2007] Mkpong-Ruffin, I., Umphress, D. A., "High-Leveraged Techniques for Software Security," CrossTalk: The Journal of Defense
Software Engineering (March 2007)
[Neumann 2002] D. E. Neumann. An enhanced neural network technique for software
risk analysis. IEEE Transactions Software Eng., 28(9):904–912, 2002.
[NIST 2002-10] National Institute of Standards and Technology, “Software Errors Cost
U.S. Economy $59.5 Billion Annually” (NIST 2002-10).
http://www.nist.gov/public_affairs/releases/n02-10.htm (2002).
[NVD] National Vulnerability Database nvd.nist.gov last accessed 4/7/07
[OWASP] http://www.owasp.org/index.php/Category:OWASP_Top_Ten_Project last
accessed 4/5/07
[Pan 1999] Pan, Jiantao. "Software Testing – 18-849b Dependable Embedded
Systems." Carnegie Mellon University, 1999,
www.ece.cmu.edu/~koopman/des_s99/sw_testing
[Prime III] http://www.primevotingsystem.com/ last access 7/29/08
[Rosenberg et al. 1998] Linda Rosenberg, Ted Hammer, Jack Shaw International
Symposium on Software Reliability November 1998
http://satc.gsfc.nasa.gov/support/ISSRE_NOV98/software_metrics_and_reliabilit
y.html - last accessed 3/30/07
[Sherer 1995] Sherer, S.A., “Software Fault Prediction,” J. Systems and Software, vol. 2
no. 2, (May 1995).
[Sargent 2003] Sargent, R. G., “Validation and Verification of Simulation Models”
Proceedings of the 2003 Winter Simulation Conference S. Chick, P. J. Sánchez,
D. Ferrin, and D. J. Morrice, eds.
[Schneier 2000] Schneier, B. Secrets and Lies: Digital Security in a Networked World.
New York: John Wiley & Sons, 2000
[Sindre and Opdahl] Sindre, G., and Opdahl, A.L.,. “Templates for Misuse Case
Description.” Proc. Of the Seventh International Workshop on Requirements
Engineering, Foundation for Software Quality (REFSQ 2001), (4-5 June 2001),
Switzerland.
[SP 800-30] Stoneburner, G.; Goguen, A.; Feringa, A., NIST SP-300 Risk Management
Guide for Information Technology Systems - (July 2002)
[SP 800-53] Ross, R., Katzke, S., Johnson, A., Swanson, Stoneburner, G., Rogers, G.,
Lee, A., -, “Information Security” National Institute of Standards and Technology
(NIST) Special Publication 800-53 Recommended Security Controls for Federal
Information Systems
[SP 800-55] Swanson, M., Bartol, N., Sabato, J., Hash, J., and Graffo, L., "Computer
Security" – NIST SP 800-55 Security Metrics Guide for Information
Technology Systems (July 2003)
[SP800-12] An Introduction to Computer Security – The NIST Handbook – Chapter 7
Computer Security Risk Management
[SP800-100] Bowen, P., Hash, J., Wilson, M. "Information Security Handbook: A Guide
for Managers” –National Institute of Standards and Technology (NIST) Special
Publication 800-100 (October 2006)
[Steel et al. 2005] Steel, C., Nagappan, R., Lai, R., Core Security Patterns: Best Practices
and Strategies for J2EE Web Services, and Identity Management. Prentice Hall,
(2005)
[Tang & MacLennan 2005] Tang, Z., MacLennan, J., Data Mining with SQL Server 2005
Wiley, (2005)
[Tsipenyuk et al. 2005] Tsipenyuk, K., Chess, B., and McGraw, G., "Seven Pernicious
Kingdoms: A Taxonomy of Software Security Errors,” IEEE Security and
Privacy (November/December 2005)
[Verdon and McGraw 2004] Verdon, D, McGraw, G., “Risk Analysis in Software
Design.” IEEE Security and Privacy 2.4 (2004)
[Viega and McGraw 2001] Viega, J., and McGraw, G.,. Building Secure Software: How
to Avoid Security Problems the Right Way. Addison-Wesley Boston, MA.(2001)
[Voas et al. 1997] Voas, J., McGraw, G., Kassab, L., Voas, L., "A 'Crystal Ball' for
Software Liability” IEEE Computer (June 1997) 29 – 36
[Voas et al. 1998] Voas, J., and McGraw, G.,. Software Fault Injection: Inoculating
Programs Against Errors, 47-48. New York, NY: John Wiley & Sons, 1998.
[Walker et al. 2006] Walker, Robert J., Holmes, R., Hedgeland, I., Kapur, P., Smith, A.,
“A Lightweight Approach to Technical Risk Estimation via Probabilistic Impact
Analysis” International Conference on Software Engineering Proceedings of the
2006 international workshop on Mining software repositories Shanghai, China
(2006)
[Wallace 1991] Wallace, D.R., Kuhn, D.R., Cherniavsky, J.C., "Report on a Workshop
on the Assurance of High Integrity Software," Proceedings of the Sixth Annual
Conference on Computer Assurance (COMPASS 1991), NIST, Gaithersburg, MD,
June 24-27, 1991 , The Institute of Electrical and Electronics Engineers, (1991).
[Weinberg and Schulman 1974] Weinberg, G. M., Schulman, E. M., "Goals and
Performance in Computer Programming,” Human Factors, (1974), 16(1), 70-77.
[Young 2001] Young, Peter C. and Tippins, Steven C. Managing Business Risk: An
Organization-Wide Approach to Risk Management. New York, NY: American
Management Association, (2001) (ISBN: 0-814-40461-8).
APPENDICES
Appendix A-1 – The NVD Schema
This appendix gives the NVD Schema from which SSRAM derives its schema.
Table structural data and attribute types used in SSRAM are based on the NVD Schema
shown in this appendix.
<xs:simpleType name="CVSSVector" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:annotation>
<xs:documentation>
simpleType to describe the CVSS Base Vector
</xs:documentation>
</xs:annotation>
<xs:restriction base="xs:string">
<xs:pattern value="\(AV:[RL]/AC:[HL]/Au:(R|NR)/C:[NPC]/I:[NPC]/A:[NPC]/B:[NCIA]\)( Approximated)?" />
</xs:restriction>
</xs:simpleType><xs:simpleType name="dateType" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:annotation>
<xs:documentation>
Defines date format for NVD. Dates follow the mask "yyyy-mm-dd"
</xs:documentation>
</xs:annotation>
<xs:restriction base="xs:date">
<xs:pattern value="(19|20)\d\d-((01|03|05|07|08|10|12)-(0[1-9]|[1-2]\d|3[01])|(04|06|09|11)-(0[1-9]|[1-2]\d|30)|02-(0[19]|1\d|2\d))" />
</xs:restriction>
</xs:simpleType><xs:simpleType name="trueOnlyAttribute" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:annotation>
<xs:documentation>
simpleType used for attributes that are only present when they are true.
Such attributes appear only in the form attribute_name="1".
</xs:documentation>
</xs:annotation>
<xs:restriction base="xs:NMTOKEN">
<xs:enumeration value="1" />
</xs:restriction>
</xs:simpleType><xs:simpleType name="zeroToTen" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:annotation>
<xs:documentation>
simpleType used when scoring on a scale of 0-10, inclusive
</xs:documentation>
</xs:annotation>
<xs:restriction base="xs:decimal">
<xs:minInclusive value="0" fixed="true" />
<xs:maxInclusive value="10" fixed="true" />
</xs:restriction>
</xs:simpleType><xs:simpleType name="urlType" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:annotation>
<xs:documentation>
Restricts urls in NVD beyond the xs:anyURI restrictions.
</xs:documentation>
</xs:annotation>
<xs:restriction base="xs:anyURI">
<xs:whiteSpace value="collapse" />
<xs:pattern value="((news|(ht|f)tp(s)?)://([^:]|:[^/]|:/[^/])+(:|:/)?)+" />
</xs:restriction>
</xs:simpleType><xs:element name="nvd" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:annotation>
<xs:documentation>
Root element. Contains only "entry" child elements.
Attributes for this element describe the version of the XML feed being read.
Attributes:
"nvd_xml_version" (required) => the schema and DTD version number currently supported by this document
"pub_date" (required) => the date this document was compiled
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="entry" minOccurs="0" maxOccurs="unbounded">
<xs:annotation>
<xs:documentation>
Documents one CVE entry. The child elements should always appear
in the sequence defined below. These elements are compatible with
entry elements from the CVE XML feeds.
Attributes:
"type" (required) => CVE or CAN
"name" (required) => full CVE name
"seq" (required) => sequence number from CVE name
"nvd_name" => NVD name (if it exists)
"discovered" => date discovered
"published" (required) => date published
"modified" => date modified
"severity" => severity as determined by NVD analysts: High, Medium, or Low
"reject" => indicates that this CVE entry has been rejected by CVE or NVD
"CVSS_score" => CVSS Severity Score
"CVSS_vector" => CVSS Base Vector
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="desc">
<xs:annotation>
<xs:documentation>
Description wrapper tag, parent to any documented descriptions of this CVE entry.
While the "desc" tag will always be present, there may be no "descript" child tags.
Only one "descript" tag will exist for each description source (i.e. CVE, NVD, ...).
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="descript" minOccurs="0" maxOccurs="2">
<xs:annotation>
<xs:documentation>
Contains a specific description of this CVE entry from source
indicated by the "source" attribute.
</xs:documentation>
</xs:annotation>
<xs:complexType mixed="true">
<xs:attribute name="source" use="required">
<xs:simpleType>
<xs:restriction base="xs:NMTOKEN">
<xs:enumeration value="cve" />
<xs:enumeration value="nvd" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="impacts" minOccurs="0">
<xs:annotation>
<xs:documentation>
Impact wrapper tag (may or may not be present). Only one "impact" tag will exist
for each impact explanation source.
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="impact">
<xs:annotation>
<xs:documentation>
Contains a specific impact explanation of this CVE entry from
source indicated by the "source" attribute.
</xs:documentation>
</xs:annotation>
<xs:complexType mixed="true">
<xs:attribute name="source" use="required">
<xs:simpleType>
<xs:restriction base="xs:NMTOKEN">
<xs:enumeration value="nvd" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="sols" minOccurs="0">
<xs:annotation>
<xs:documentation>
Solution wrapper tag (may or may not be present). Only one "sol" tag will exist
for each solution explanation source.
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="sol">
<xs:annotation>
<xs:documentation>
Contains a specific solution explanation of this CVE entry from
source indicated by the "source" attribute.
</xs:documentation>
</xs:annotation>
<xs:complexType mixed="true">
<xs:attribute name="source" use="required">
<xs:simpleType>
<xs:restriction base="xs:NMTOKEN">
<xs:enumeration value="nvd" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="loss_types" minOccurs="0">
<xs:annotation>
<xs:documentation>
Loss type tag (may or may not be present). Contains one loss type child for each loss
type of this CVE entry.
Potential loss types are:
"avail" => availability
"conf" => confidentiality
"int" => integrity
"sec_prot" => security protection
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="avail" minOccurs="0" />
<xs:element name="conf" minOccurs="0" />
<xs:element name="int" minOccurs="0" />
<xs:element name="sec_prot" minOccurs="0">
<xs:annotation>
<xs:documentation>
Security Protection tag with one attribute for each security protection type.
Potential security protection types are:
"admin" => gain administrative access
"user" => gain user access
"other" => other
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:attribute name="admin" type="trueOnlyAttribute" />
<xs:attribute name="user" type="trueOnlyAttribute" />
<xs:attribute name="other" type="trueOnlyAttribute" />
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="vuln_types" minOccurs="0">
<xs:annotation>
<xs:documentation>
Vulnerability type tag (may or may not be present). Contains one vulnerability type
child for each vulnerability type of this CVE entry.
Potential vulnerability types are:
"access" => Access validation error
"input" => Input validation error
"design" => Design error
"exception" => Exceptional condition error
"env" => Environmental error
"config" => Configuration error
"race" => Race condition error
"other" => other
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="access" minOccurs="0" />
<xs:element name="input" minOccurs="0">
<xs:annotation>
<xs:documentation>
Input validation error tag with one attribute for each input validation error type.
Potential input validation error types are:
"bound" => Boundary condition error
"buffer" => Buffer overflow
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:attribute name="bound" type="trueOnlyAttribute" />
<xs:attribute name="buffer" type="trueOnlyAttribute" />
</xs:complexType>
</xs:element>
<xs:element name="design" minOccurs="0" />
<xs:element name="exception" minOccurs="0" />
<xs:element name="env" minOccurs="0" />
<xs:element name="config" minOccurs="0" />
<xs:element name="race" minOccurs="0" />
<xs:element name="other" minOccurs="0" />
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="range" minOccurs="0">
<xs:annotation>
<xs:documentation>
Vulnerability range tag (may or may not be present). Contains one vulnerability range
child for each vulnerability range of this CVE entry.
Potential vulnerability ranges are:
"local" => Locally exploitable
"remote" => Remotely exploitable
"user_init" => User accesses attacker
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="local" minOccurs="0" />
<xs:element name="remote" minOccurs="0" />
<xs:element name="user_init" minOccurs="0" />
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="refs">
<xs:annotation>
<xs:documentation>
Reference wrapper tag (always present). External references to this CVE entry are contained
within this tag.
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="ref" minOccurs="0" maxOccurs="unbounded">
<xs:annotation>
<xs:documentation>
Individual reference to this CVE entry. Text is the name of this vulnerability
at this particular reference.
Attributes:
"source" (required) => Name of reference source
"url" (required) => hyperlink to reference
"sig" => indicates this reference includes a tool signature
"adv" => indicates this reference is a Security Advisory
"patch" => indicates this reference includes a patch for this vulnerability
</xs:documentation>
</xs:annotation>
<xs:complexType mixed="true">
<xs:attribute name="source" type="xs:string" use="required" />
<xs:attribute name="url" type="urlType" use="required" />
<xs:attribute name="sig" type="trueOnlyAttribute" />
<xs:attribute name="adv" type="trueOnlyAttribute" />
<xs:attribute name="patch" type="trueOnlyAttribute" />
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="vuln_soft" minOccurs="0">
<xs:annotation>
<xs:documentation>
Vulnerable software wrapper tag (may or may not be present). Software affected by this CVE
entry are listed within this tag.
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="prod" maxOccurs="unbounded">
<xs:annotation>
<xs:documentation>
Product wrapper tag. Versions of this product that are affected by this
vulnerability are listed within this tag.
Attributes:
"name" => Product name
"vendor" => Vendor of this product
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="vers" maxOccurs="unbounded">
<xs:annotation>
<xs:documentation>
Represents a version of this product that is affected by this vulnerability.
Attributes:
"num" => This version number
"prev" => Indicates that versions previous to this version number are also affected by this vulnerability
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:attribute name="num" type="xs:string" use="required" />
<xs:attribute name="prev" type="trueOnlyAttribute" />
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="name" type="xs:string" use="required" />
<xs:attribute name="vendor" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="type" use="required">
<xs:simpleType>
<xs:restriction base="xs:NMTOKEN">
<xs:enumeration value="CAN" />
<xs:enumeration value="CVE" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="name" use="required">
<xs:simpleType>
<xs:restriction base="xs:ID">
<xs:pattern value="(CAN|CVE)\-\d\d\d\d\-\d\d\d\d" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="seq" use="required">
<xs:simpleType>
<xs:restriction base="xs:NMTOKEN">
<xs:pattern value="\d\d\d\d\-\d\d\d\d" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="nvd_name" type="xs:string" />
<xs:attribute name="discovered" type="dateType" />
<xs:attribute name="published" type="dateType" use="required" />
<xs:attribute name="modified" type="dateType" />
<xs:attribute name="severity">
<xs:simpleType>
<xs:restriction base="xs:NMTOKEN">
<xs:enumeration value="High" />
<xs:enumeration value="Medium" />
<xs:enumeration value="Low" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="reject" type="trueOnlyAttribute" />
<xs:attribute name="CVSS_score" type="zeroToTen" />
<xs:attribute name="CVSS_vector" type="CVSSVector" />
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="nvd_xml_version" type="xs:NMTOKEN" use="required" />
<xs:attribute name="pub_date" type="dateType" use="required" />
</xs:complexType>
</xs:element>
Appendix A-2 – An Example of an NVD entry
This appendix shows an example of an NVD entry that had to be parsed. In all, 20,804
entries were loaded into SSRAM from 1996 – 2006 to be used for training and testing.
The historical data was divided into two groups: 1996 – 2001 for training to
predict 2002 data, and 2003 – 2005 to predict 2006.
- <entry type="CVE" name="CVE-2006-0948" seq="2006-0948" published="2006-0821" modified="2006-08-22" severity="High" CVSS_score="7.0"
CVSS_vector="(AV:L/AC:L/Au:NR/C:C/I:C/A:C/B:N)">
- <desc>
<descript source="cve">AOL 9.0 Security Edition revision 4184.2340, and probably
other versions, uses insecure permissions (Everyone/Full Control) for the "America
Online 9.0" directory, which allows local users to gain privileges by replacing critical
files.</descript>
</desc>
- <sols>
<sol source="nvd">AOL has released fixes to address this issue. These fixes can be
automatically applied by logging in to the service.</sol>
</sols>
- <loss_types>
<sec_prot admin="1" />
</loss_types>
- <vuln_types>
<design />
</vuln_types>
- <range>
<local />
</range>
- <refs>
<ref source="BID" url="http://www.securityfocus.com/bid/19583"
patch="1">19583</ref>
<ref source="BUGTRAQ"
url="http://www.securityfocus.com/archive/1/archive/1/443622/100/0/threaded"
adv="1">20060818 Secunia Research: AOL Insecure Default Directory
Permissions</ref>
<ref source="FRSIRT" url="http://www.frsirt.com/english/advisories/2006/3317"
adv="1" patch="1">ADV-2006-3317</ref>
<ref source="SECTRACK" url="http://securitytracker.com/id?1016717"
patch="1">1016717</ref>
<ref source="SECUNIA" url="http://secunia.com/advisories/18734" adv="1"
patch="1">18734</ref>
<ref source="XF" url="http://xforce.iss.net/xforce/xfdb/28445" patch="1">aol-defaultinsecure-permissions(28445)</ref>
</refs>
- <vuln_soft>
- <prod name="AOL Security Edition" vendor="AOL">
<vers num="9.0 4184.2340" />
</prod>
</vuln_soft>
</entry>
Appendix A-3 – Script for Parsing XML document
The code shown in this appendix was used to load the xml document feeds and parse the
xml document into corresponding tables in SSRAM.
use ssram
-- --------------Insert XML file into nvdtmp
INSERT nvdtmp
SELECT CONVERT(xml, BulkColumn, 2) FROM
OPENROWSET(Bulk 'F:\Dissertation\Databases\National Vulnerability Database\nvdcve-2002.xml', SINGLE_BLOB) [rowsetresults]
-- OPENROWSET(Bulk 'F:\Dissertation\Databases\National Vulnerability Database\nvdcve-recent.xml', SINGLE_BLOB) [rowsetresults]
Go
-- ------------ Drop and create SSRAM tables
USE [SSRAM]
GO
--Drop entryDesc
IF EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id =
OBJECT_ID(N'[dbo].[FK_entryDesc_entry]') AND parent_object_id =
OBJECT_ID(N'[dbo].[entryDesc]'))
ALTER TABLE [dbo].[entryDesc] DROP CONSTRAINT [FK_entryDesc_entry]
GO
USE [SSRAM]
GO
/****** Object: Table [dbo].[entryDesc]
Script Date: 02/05/2008
10:43:52 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =
OBJECT_ID(N'[dbo].[entryDesc]') AND type in (N'U'))
DROP TABLE [dbo].[entryDesc]
---Drop entrySols
USE [SSRAM]
GO
IF EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id =
OBJECT_ID(N'[dbo].[FK_entrySols_entry]') AND parent_object_id =
OBJECT_ID(N'[dbo].[entrySols]'))
ALTER TABLE [dbo].[entrySols] DROP CONSTRAINT
[FK_entrySols_entry]
GO
USE [SSRAM]
GO
/****** Object: Table [dbo].[entrySols]
Script Date: 02/05/2008
10:55:14 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =
OBJECT_ID(N'[dbo].[entrySols]') AND type in (N'U'))
DROP TABLE [dbo].[entrySols]
---drop lossType2
USE [SSRAM]
GO
IF EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id =
OBJECT_ID(N'[dbo].[FK_lossType2_entry]') AND parent_object_id =
OBJECT_ID(N'[dbo].[lossType2]'))
ALTER TABLE [dbo].[lossType2] DROP CONSTRAINT [FK_lossType2_entry]
GO
USE [SSRAM]
GO
/****** Object: Table [dbo].[lossType2]
Script Date: 02/05/2008
10:56:52 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =
OBJECT_ID(N'[dbo].[lossType2]') AND type in (N'U'))
DROP TABLE [dbo].[lossType2]
--Drop nvdtmp
USE [SSRAM]
GO
/****** Object: Table [dbo].[nvdtmp]
Script Date: 02/05/2008
10:58:49 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =
OBJECT_ID(N'[dbo].[nvdtmp]') AND type in (N'U'))
DROP TABLE [dbo].[nvdtmp]
----DROP range
USE [SSRAM]
GO
IF EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id =
OBJECT_ID(N'[dbo].[FK_range_entry]') AND parent_object_id =
OBJECT_ID(N'[dbo].[range]'))
ALTER TABLE [dbo].[range] DROP CONSTRAINT [FK_range_entry]
GO
USE [SSRAM]
GO
/****** Object: Table [dbo].[range]
Script Date: 02/05/2008
10:59:59 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =
OBJECT_ID(N'[dbo].[range]') AND type in (N'U'))
DROP TABLE [dbo].[range]
--- DROP refs
USE [SSRAM]
GO
IF EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id =
OBJECT_ID(N'[dbo].[FK_refs_entry]') AND parent_object_id =
OBJECT_ID(N'[dbo].[refs]'))
ALTER TABLE [dbo].[refs] DROP CONSTRAINT [FK_refs_entry]
GO
USE [SSRAM]
GO
/****** Object: Table [dbo].[refs]
Script Date: 02/05/2008 11:01:17
******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =
OBJECT_ID(N'[dbo].[refs]') AND type in (N'U'))
DROP TABLE [dbo].[refs]
--DROP vuln_soft
USE [SSRAM]
GO
IF EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id =
OBJECT_ID(N'[dbo].[FK_vuln_soft_entry]') AND parent_object_id =
OBJECT_ID(N'[dbo].[vuln_soft]'))
ALTER TABLE [dbo].[vuln_soft] DROP CONSTRAINT [FK_vuln_soft_entry]
GO
USE [SSRAM]
GO
/****** Object: Table [dbo].[vuln_soft]
Script Date: 02/05/2008
11:03:26 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =
OBJECT_ID(N'[dbo].[vuln_soft]') AND type in (N'U'))
DROP TABLE [dbo].[vuln_soft]
---DROP vuln_types
USE [SSRAM]
GO
IF EXISTS (SELECT * FROM sys.foreign_keys WHERE object_id =
OBJECT_ID(N'[dbo].[FK_vuln_types_entry]') AND parent_object_id =
OBJECT_ID(N'[dbo].[vuln_types]'))
ALTER TABLE [dbo].[vuln_types] DROP CONSTRAINT [FK_vuln_types_entry]
GO
USE [SSRAM]
GO
/****** Object: Table [dbo].[vuln_types]
Script Date: 02/05/2008
11:04:17 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =
OBJECT_ID(N'[dbo].[vuln_types]') AND type in (N'U'))
DROP TABLE [dbo].[vuln_types]
---DROP entry
-----ENTRY TABLE DROP AND CREATE
/****** Object: Table [dbo].[entry]
Script Date: 02/05/2008
10:42:30 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =
OBJECT_ID(N'[dbo].[entry]') AND type in (N'U'))
DROP TABLE [dbo].[entry]
/****** Object: Table [dbo].[entry]
Script Date: 02/05/2008
10:43:14 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[entry](
[type] [varchar](5) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
NULL,
[name] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
NULL,
[seq] [varchar](12) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
NULL,
[nvd_name] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[discovered] [datetime] NULL,
[published] [datetime] NOT NULL,
[modified] [datetime] NULL,
[severity] [varchar](10) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[reject] [varchar](10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[CVSS_score] [real] NULL,
[CVSS_vector] [varchar](25) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
CONSTRAINT [PK_entry] PRIMARY KEY CLUSTERED
(
[name] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
-- CREATE entryDesc
USE [SSRAM]
GO
/****** Object: Table [dbo].[entryDesc]
Script Date: 02/05/2008
10:53:52 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[entryDesc](
[entry_name] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[descript] [varchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[source] [varchar](5) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_entryDesc] PRIMARY KEY CLUSTERED
(
[entry_name] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
EXEC sys.sp_addextendedproperty @name=N'MS_Description',
@value=N'associated with entry.name (foreign key)'
,@level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE',
@level1name=N'entryDesc', @level2type=N'COLUMN',
@level2name=N'entry_name'
GO
EXEC sys.sp_addextendedproperty @name=N'MS_Description',
@value=N'Contains a specific description of this CVE entry from source
indicated by the "source" attribute' ,@level0type=N'SCHEMA',
@level0name=N'dbo', @level1type=N'TABLE', @level1name=N'entryDesc',
@level2type=N'COLUMN', @level2name=N'descript'
GO
EXEC sys.sp_addextendedproperty @name=N'MS_Description',
@value=N'enumerated type "cve" or "nvd" - required for each descript'
,@level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE',
@level1name=N'entryDesc', @level2type=N'COLUMN', @level2name=N'source'
GO
USE [SSRAM]
GO
ALTER TABLE [dbo].[entryDesc] WITH CHECK ADD CONSTRAINT
[FK_entryDesc_entry] FOREIGN KEY([entry_name])
REFERENCES [dbo].[entry] ([name])
-----------entrySols Table
USE [SSRAM]
GO
/****** Object: Table [dbo].[entrySols]
Script Date: 02/05/2008
10:55:56 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[entrySols](
[entry_name] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[sol] [varchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[source] [varchar](5) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_entrySols] PRIMARY KEY CLUSTERED
(
[entry_name] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
USE [SSRAM]
GO
ALTER TABLE [dbo].[entrySols] WITH CHECK ADD CONSTRAINT
[FK_entrySols_entry] FOREIGN KEY([entry_name])
REFERENCES [dbo].[entry] ([name])
--------------create lossType
USE [SSRAM]
GO
/****** Object: Table [dbo].[lossType2]
Script Date: 02/05/2008
10:57:14 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[lossType2](
[entry_name] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[loss_type] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[adminSP] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[userSP] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[otherSP] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_lossType2] PRIMARY KEY CLUSTERED
(
[entry_name] ASC,
[loss_type] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
USE [SSRAM]
GO
ALTER TABLE [dbo].[lossType2] WITH CHECK ADD CONSTRAINT
[FK_lossType2_entry] FOREIGN KEY([entry_name])
REFERENCES [dbo].[entry] ([name])
-----nvdtmp
USE [SSRAM]
GO
/****** Object: Table [dbo].[nvdtmp]    Script Date: 02/05/2008 10:58:18 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[nvdtmp](
[XmlCol] [xml] NULL
) ON [PRIMARY]
-----range
USE [SSRAM]
GO
/****** Object: Table [dbo].[range]
Script Date: 02/05/2008
11:00:42 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[range](
[entry_name] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[exploit_range] [varchar](15) COLLATE
SQL_Latin1_General_CP1_CI_AS NOT NULL,
CONSTRAINT [PK_range] PRIMARY KEY CLUSTERED
(
[entry_name] ASC,
[exploit_range] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
USE [SSRAM]
GO
ALTER TABLE [dbo].[range] WITH CHECK ADD CONSTRAINT [FK_range_entry]
FOREIGN KEY([entry_name])
REFERENCES [dbo].[entry] ([name])
----refs
USE [SSRAM]
GO
/****** Object: Table [dbo].[refs]    Script Date: 02/05/2008 11:01:37 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[refs](
[entry_name] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[ref] [nvarchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[source] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[url] [nvarchar](100) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
NULL,
[sig] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[adv] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[patch] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_refs] PRIMARY KEY CLUSTERED
(
[entry_name] ASC,
[url] ASC
)WITH (IGNORE_DUP_KEY = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
USE [SSRAM]
GO
ALTER TABLE [dbo].[refs] WITH CHECK ADD CONSTRAINT [FK_refs_entry]
FOREIGN KEY([entry_name])
REFERENCES [dbo].[entry] ([name])
-----vuln_soft
USE [SSRAM]
GO
/****** Object: Table [dbo].[vuln_soft]
Script Date: 02/05/2008
11:03:47 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[vuln_soft](
[entry_name] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[prodName] [nvarchar](25) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[prodVendor] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[versionNum] [nvarchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[preVersion] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[edition] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_vuln_soft] PRIMARY KEY CLUSTERED
(
[entry_name] ASC,
[prodName] ASC,
[prodVendor] ASC,
[versionNum] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
USE [SSRAM]
GO
ALTER TABLE [dbo].[vuln_soft] WITH CHECK ADD CONSTRAINT
[FK_vuln_soft_entry] FOREIGN KEY([entry_name])
REFERENCES [dbo].[entry] ([name])
--vuln_types
USE [SSRAM]
GO
/****** Object: Table [dbo].[vuln_types]
Script Date: 02/05/2008
11:04:49 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[vuln_types](
[entry_name] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[vuln_type] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NOT NULL,
[input_bound] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[input_buffer] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
CONSTRAINT [PK_vuln_types] PRIMARY KEY CLUSTERED
(
[entry_name] ASC,
[vuln_type] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
USE [SSRAM]
GO
ALTER TABLE [dbo].[vuln_types] WITH CHECK ADD CONSTRAINT
[FK_vuln_types_entry] FOREIGN KEY([entry_name])
REFERENCES [dbo].[entry] ([name])
USE [SSRAM]
GO
/****** Object: Table [dbo].[denormNVD]    Script Date: 02/05/2008 11:07:28 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id =
OBJECT_ID(N'[dbo].[denormNVD]') AND type in (N'U'))
DROP TABLE [dbo].[denormNVD]
USE [SSRAM]
GO
/****** Object: Table [dbo].[denormNVD]    Script Date: 02/05/2008 11:08:06 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[denormNVD](
[name] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
NULL,
[discovered] [datetime] NULL,
[published] [datetime] NOT NULL,
[cvss_score] [real] NULL,
[descript] [varchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[sol] [varchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[loss_type] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[adminSP] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[userSP] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[otherSP] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[vuln_type] [varchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[input_bound] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[input_buffer] [char](1) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[exploit_range] [varchar](15) COLLATE
SQL_Latin1_General_CP1_CI_AS NULL,
[ref] [nvarchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[source] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[sig] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[adv] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[patch] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[prodName] [nvarchar](25) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[versionNum] [nvarchar](15) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[preVersion] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL,
[edition] [nchar](1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
-- ------------ Parse Entry ----------------------
-- to be used by sp_xml_preparedocument to shred the nvd xml file
DECLARE @idoc int
DECLARE @doc xml
-- Get the nvd xml content for parsing
-- note that SET @doc accepts any valid expression; a SELECT
-- statement had to be put in parentheses to work
set @doc = (select XmlCol from dbo.nvdtmp)
-- create an internal representation of the XML document
EXEC sp_xml_preparedocument @idoc OUTPUT, @doc
-- Use OPENXML to provide a rowset consisting of the entries in the nvd
insert entry
select * from OPENXML(@idoc, '/nvd/entry')
with entry
EXEC sp_xml_removedocument @idoc
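The same prepare/shred/remove pattern is repeated for every node type below. For reference, the following is a minimal, self-contained sketch of that pattern (not part of the SSRAM scripts themselves), run against a small literal XML fragment whose element and attribute names are illustrative only:
-- Minimal sketch of the OPENXML shredding pattern; run as its own batch.
DECLARE @idoc int
DECLARE @doc xml
SET @doc = N'<nvd>
               <entry name="CVE-0000-0000" published="2002-01-01">
                 <desc><descript source="cve">example description</descript></desc>
               </entry>
             </nvd>'
EXEC sp_xml_preparedocument @idoc OUTPUT, @doc
-- flag 1 = attribute-centric mapping; paths such as ''../../@name'' (used later)
-- walk back up to an ancestor node when shredding child elements
SELECT *
FROM OPENXML(@idoc, '/nvd/entry', 1)
WITH (name varchar(15) '@name',
      published datetime '@published')
EXEC sp_xml_removedocument @idoc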
-- ------------------------------ Parse EntryDesc --------------------
use ssram
-- to be used by sp_xml_preparedocument to shred the nvd xml file
DECLARE @idoc int
DECLARE @doc xml
-- Get the nvd xml content for parsing
-- note that SET @doc accepts any valid expression; a SELECT
-- statement had to be put in parentheses to work
set @doc = (select top 1 XmlCol from dbo.nvdtmp)
-- create an internal representation of the XML document
EXEC sp_xml_preparedocument @idoc OUTPUT, @doc
-- Use OPENXML to provide a rowset consisting of descriptions for each entry in the nvd
-- Note the use of 'with' and '../../@name' to get the grandparent node attribute and the use
-- of ntext 'text()' to get the element value for the description
insert entryDesc
select * from OPENXML(@idoc, '/nvd/entry/desc/descript', 3)
with (entry_name varchar(15) '../../@name',
      descript ntext 'text()',
      source varchar(max) '@source')
EXEC sp_xml_removedocument @idoc
-- -------------- Parse into entrySols -----------------------------
use ssram
-- to be used by sp_xml_preparedocument to shred the nvd xml file
DECLARE @idoc int
DECLARE @doc xml
-- Get the nvd xml content for parsing
-- note that SET @doc accepts any valid expression; a SELECT
-- statement had to be put in parentheses to work
-- set @doc = (select top 1 XmlCol from dbo.nvdtmp)
set @doc = (select XmlCol from dbo.nvdtmp)
-- create an internal representation of the XML document
EXEC sp_xml_preparedocument @idoc OUTPUT, @doc
-- Use OPENXML to provide a rowset consisting of solutions for each entry in the nvd
-- Note the use of 'with' and '../../@name' to get the grandparent node attribute and the use
-- of ntext 'text()' to get the element value for the solution
insert entrySols
select * from OPENXML(@idoc, '/nvd/entry/sols/sol', 3)
with (entry_name varchar(15) '../../@name',
      sol ntext 'text()',
      source varchar(5) '@source')
EXEC sp_xml_removedocument @idoc
-- ------------------------- Parse LossType -------------------------
use ssram
-- to be used by sp_xml_preparedocument to shred the nvd xml file
DECLARE @idoc int
DECLARE @doc xml
-- Get the nvd xml content for parsing
-- note that SET @doc accepts any valid expression; a SELECT
-- statement had to be put in parentheses to work
--set @doc = (select top 1 XmlCol from dbo.nvdtmp)
set @doc = (select XmlCol from dbo.nvdtmp)
-- create an internal representation of the XML document
EXEC sp_xml_preparedocument @idoc OUTPUT, @doc
-- Note the use of meta property @mp:localname to get the value for the loss_type element
-- for availability loss type
insert lossType2
select * from OPENXML(@idoc, '/nvd/entry/loss_types/avail', 9)
with (entry_name varchar(15) '../../@name',
      loss_type varchar(15) '@mp:localname',
      adminSP char(1) '@admin',
      userSP char(1) '@user',
      otherSP char(1) '@other')
-- for confidentiality loss type
insert lossType2
select * from OPENXML(@idoc, '/nvd/entry/loss_types/conf', 9)
with (entry_name varchar(15) '../../@name',
      loss_type varchar(15) '@mp:localname',
      adminSP char(1) '@admin',
      userSP char(1) '@user',
      otherSP char(1) '@other')
-- for integrity loss type
insert lossType2
select * from OPENXML(@idoc, '/nvd/entry/loss_types/int', 9)
with (entry_name varchar(15) '../../@name',
      loss_type varchar(15) '@mp:localname',
      adminSP char(1) '@admin',
      userSP char(1) '@user',
      otherSP char(1) '@other')
-- for security protection loss type
insert lossType2
select * from OPENXML(@idoc, '/nvd/entry/loss_types/sec_prot', 9)
with (entry_name varchar(15) '../../@name',
      loss_type varchar(15) '@mp:localname',
      adminSP char(1) '@admin',
      userSP char(1) '@user',
      otherSP char(1) '@other')
----- admin attribute for security protection loss type
--update lossType2
--set adminSP = LS.adminSP
--from
--(
--select * from OPENXML(@idoc, '/nvd/entry/loss_types/sec_prot', 3)
--with (entry_name varchar(15) '../../@name',
--      adminSP nchar(10) '@admin')
--) as LS
--where lossType2.entry_name = LS.entry_name
----- user attribute for security protection loss type
--update lossType2
--set userSP = LS.userSP
--from
--(
--select * from OPENXML(@idoc, '/nvd/entry/loss_types/sec_prot', 3)
--with (entry_name varchar(15) '../../@name',
--      userSP nchar(10) '@user')
--) as LS
--where lossType2.entry_name = LS.entry_name
----- other attribute for security protection loss type
--update lossType2
--set otherSP = LS.otherSP
--from
--(
--select * from OPENXML(@idoc, '/nvd/entry/loss_types/sec_prot', 3)
--with (entry_name varchar(15) '../../@name',
--      otherSP nchar(10) '@other')
--) as LS
--where lossType2.entry_name = LS.entry_name
EXEC sp_xml_removedocument @idoc
-- ------------------- Parse Vuln Types ------------------------------
use ssram
-- to be used by sp_xml_preparedocument to shred the nvd xml file
DECLARE @idoc int
DECLARE @doc xml
-- Get the nvd xml content for parsing
-- note that SET @doc accepts any valid expression; a SELECT
-- statement had to be put in parentheses to work
--set @doc = (select top 1 XmlCol from dbo.nvdtmp)
set @doc = (select XmlCol from dbo.nvdtmp)
-- create an internal representation of the XML document
EXEC sp_xml_preparedocument @idoc OUTPUT, @doc
---- Use OPENXML to provide a rowset consisting of vulnerability types for each entry in the nvd
--insert vuln_types
--select * from OPENXML(@idoc, '/nvd/entry/vuln_types', 3)
--with (entry_name varchar(15) '../@name',
--      vuln_type varchar(10) '/access',
--      input varchar(5) '/input',
--      design varchar(10) '/design',
--      exception varchar(10) '/exception',
--      env varchar(10) '/env',
--      config varchar(10) '/config',
--      race varchar(10) '/race',
--      other varchar(10) '/other')
-- update the other nodes for vuln_types
insert vuln_types
select * from OPENXML(@idoc, '/nvd/entry/vuln_types/access', 9)
with (entry_name varchar(15) '../../@name',
      vuln_type varchar(15) '@mp:localname',
      input_bound char(1) '@bound',
      input_buffer char(1) '@buffer')
insert vuln_types
select * from OPENXML(@idoc, '/nvd/entry/vuln_types/input', 9)
with (entry_name varchar(15) '../../@name',
      vuln_type varchar(15) '@mp:localname',
      input_bound char(1) '@bound',
      input_buffer char(1) '@buffer')
insert vuln_types
select * from OPENXML(@idoc, '/nvd/entry/vuln_types/design', 9)
with (entry_name varchar(15) '../../@name',
      vuln_type varchar(15) '@mp:localname',
      input_bound char(1) '@bound',
      input_buffer char(1) '@buffer')
insert vuln_types
select * from OPENXML(@idoc, '/nvd/entry/vuln_types/exception', 9)
with (entry_name varchar(15) '../../@name',
      vuln_type varchar(15) '@mp:localname',
      input_bound char(1) '@bound',
      input_buffer char(1) '@buffer')
insert vuln_types
select * from OPENXML(@idoc, '/nvd/entry/vuln_types/env', 9)
with (entry_name varchar(15) '../../@name',
      vuln_type varchar(15) '@mp:localname',
      input_bound char(1) '@bound',
      input_buffer char(1) '@buffer')
insert vuln_types
select * from OPENXML(@idoc, '/nvd/entry/vuln_types/config', 9)
with (entry_name varchar(15) '../../@name',
      vuln_type varchar(15) '@mp:localname',
      input_bound char(1) '@bound',
      input_buffer char(1) '@buffer')
insert vuln_types
select * from OPENXML(@idoc, '/nvd/entry/vuln_types/race', 9)
with (entry_name varchar(15) '../../@name',
      vuln_type varchar(15) '@mp:localname',
      input_bound char(1) '@bound',
      input_buffer char(1) '@buffer')
insert vuln_types
select * from OPENXML(@idoc, '/nvd/entry/vuln_types/other', 9)
with (entry_name varchar(15) '../../@name',
      vuln_type varchar(15) '@mp:localname',
      input_bound char(1) '@bound',
      input_buffer char(1) '@buffer')
EXEC sp_xml_removedocument @idoc
-- ----------------- Parse Exploit Range ---------------------------
use ssram
-- to be used by sp_xml_preparedocument to shred the nvd xml file
DECLARE @idoc int
DECLARE @doc xml
-- Get the nvd xml content for parsing
-- note that SET @doc accepts any valid expression; a SELECT
-- statement had to be put in parentheses to work
--set @doc = (select top 1 XmlCol from dbo.nvdtmp)
set @doc = (select XmlCol from dbo.nvdtmp)
-- create an internal representation of the XML document
EXEC sp_xml_preparedocument @idoc OUTPUT, @doc
-- Use OPENXML to provide a rowset consisting of the vulnerability range for each entry in the nvd
--insert range
--select * from OPENXML(@idoc, '/nvd/entry/range', 3)
--with (entry_name varchar(15) '../@name',
--      local nchar(1) 'local',
--      local_network nchar(1) 'local_network',
--      network nchar(1) 'network',
--      user_init nchar(1) 'user_init')
-- update the local nodes for range -- locally exploitable range
insert range
select * from OPENXML(@idoc, '/nvd/entry/range/local', 9)
with (entry_name varchar(15) '../../@name',
      exploit_range varchar(15) '@mp:localname')
-- for local network exploitable range
insert range
select * from OPENXML(@idoc, '/nvd/entry/range/local_network', 9)
with (entry_name varchar(15) '../../@name',
      exploit_range varchar(15) '@mp:localname')
-- for remote exploitable range
insert range
select * from OPENXML(@idoc, '/nvd/entry/range/remote', 9)
with (entry_name varchar(15) '../../@name',
      exploit_range varchar(15) '@mp:localname')
-- for user initiated exploit - where the user accesses the attacker
insert range
select * from OPENXML(@idoc, '/nvd/entry/range/user_init', 9)
with (entry_name varchar(15) '../../@name',
      exploit_range varchar(15) '@mp:localname')
EXEC sp_xml_removedocument @idoc
-- ------------------ Parse ref node ------------
use ssram
-- to be used by sp_xml_preparedocument to shred the nvd xml file
DECLARE @idoc int
DECLARE @doc xml
-- Get the nvd xml content for parsing
-- note that SET @doc accepts any valid expression; a SELECT
-- statement had to be put in parentheses to work
--set @doc = (select top 1 XmlCol from dbo.nvdtmp)
set @doc = (select XmlCol from dbo.nvdtmp)
---- Used to see what the contents would be like 2/5/08
---- may have to do a nested select or join it with itself to remove duplicates
----EXEC sp_xml_preparedocument @idoc OUTPUT, @doc
----select * from OPENXML(@idoc, '/nvd/entry/refs/ref', 3)
----with (entry_name varchar(15) '../../@name',
----      ref nvarchar(max) 'text()',
----      source nvarchar(50) '@source',
----      url nvarchar(100) '@url',
----      sig nchar(1) '@sig',
----      adv nchar(1) '@adv',
----      patch nchar(1) '@patch')
-- create an internal representation of the XML document
EXEC sp_xml_preparedocument @idoc OUTPUT, @doc
-- USED TO SHRED INTO TABLE
-- Use OPENXML to provide a rowset consisting of external references for each entry in the nvd
-- Note the use of 'with' and '../../@name' to get the grandparent node attribute and the use
-- of 'text()' to get the element value for the reference
insert refs
select * from OPENXML(@idoc, '/nvd/entry/refs/ref', 3)
with (entry_name varchar(15) '../../@name',
      ref nvarchar(max) 'text()',
      source nvarchar(50) '@source',
      url nvarchar(100) '@url',
      sig nchar(1) '@sig',
      adv nchar(1) '@adv',
      patch nchar(1) '@patch')
EXEC sp_xml_removedocument @idoc
-- ------------------ Parse vuln_soft ------------
use ssram
-- to be used by sp_xml_preparedocument to shred the nvd xml file
DECLARE @idoc int
DECLARE @doc xml
-- Get the nvd xml content for parsing
-- note that SET @doc accepts any valid expression; a SELECT
-- statement had to be put in parentheses to work
set @doc = (select top 1 XmlCol from dbo.nvdtmp)
-- create an internal representation of the XML document
EXEC sp_xml_preparedocument @idoc OUTPUT, @doc
-- Use OPENXML to provide a rowset consisting of software (products and versions) affected by each entry in the nvd
-- Note the use of 'with' and '../../../@name' to get the great-grandparent node attribute (if it works)
insert vuln_soft
select distinct * from OPENXML(@idoc, '/nvd/entry/vuln_soft/prod/vers', 3)
with (entry_name varchar(15) '../../../@name',
      prodName nvarchar(25) '../@name',
      prodVendor nvarchar(50) '../@vendor',
      versionNum nvarchar(15) '@num',
      preVersion nchar(1) '@prev',
      edition nchar(1) '@edition')
EXEC sp_xml_removedocument @idoc
Appendix A-4 – Validation of Data Upload
This appendix shows the entries that were used to validate that the NVD XML data were
parsed and loaded correctly into SSRAM.

Records Validated
Record Number: 15, 25, 45, 65, 85, 105, 125, Last

Actual File          NVDtmp File
CVE-2006-0948        CVE-2006-0948
CVE-2006-3124        CVE-2006-3124
CVE-2006-4257        CVE-2006-4257
CVE-2006-4277        CVE-2006-4277
CVE-2006-4298        CVE-2006-4298
CVE-2006-4319        CVE-2006-4319
CVE-2006-4349        CVE-2006-4349
CVE-2006-4369        CVE-2006-4369
CVE-2006-4380        CVE-2006-4380
Table 20 - Data Entry Validation Entries
At the end of the first round of validation, we found that our representation of the loss
type and vulnerability type nodes did not adequately reflect the different types of security
protection loss and input vulnerabilities. Additional fields were therefore added to the
loss_type and vuln_type tables, and the parsing script was corrected to reflect the changes.
We then re-compared the previous data together with the 7 additional records given below to
confirm that the data was uploaded correctly.
Record Number    Actual File          NVDtmp File
2                CVE-2006-2122        CVE-2006-2122
22               CVE-2006-4254        CVE-2006-4254
42               CVE-2006-4274        CVE-2006-4274
62               CVE-2006-4295        CVE-2006-4295
82               CVE-2006-4316        CVE-2006-4316
102              CVE-2006-4346        CVE-2006-4346
122              CVE-2006-4366        CVE-2006-4366
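As an additional sanity check of the upload (a sketch, not part of the original validation procedure), the number of entry nodes in the staged XML can be compared with the number of rows parsed into the entry table:
-- Compare the <entry> count in the staged NVD XML with the parsed rows.
USE [SSRAM]
GO
SELECT
    (SELECT TOP 1 XmlCol.value('count(/nvd/entry)', 'int') FROM dbo.nvdtmp) AS EntriesInXml,
    (SELECT COUNT(*) FROM dbo.entry) AS EntriesParsed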
Appendix B-1 – Creating Term Vector
We felt that the description attribute of each entry contained information that would help in
clustering the NVD data; using this comment-like attribute, however, was a non-trivial problem.
To use it we had to create term vectors that associate terms with entries, so that we could look
at the frequency of occurrence of those terms. Creating a term vector so that the free-text
vulnerability descriptions could be used as part of the data mining process involved the
following steps.
1. Create a text mining dictionary for the vector table based on the terms in the
description attribute for all of the entries.
2. Build Term Vectors based on an association of entries to text mining dictionary
terms.
Creating a Text Mining Dictionary for the Vector Table
1.1 Create a new DTS (SSIS) package
1.2 Rename the package to BuildDictionary.dtsx
1.3 Go to Data Flow tab and add a new Data Flow task
1.4 In the data flow task, add an “OLE DB Source” transform
• Connection: create a new one for localhost.SSRAM
• Table: desc
• Columns:
1.5 Add a Data Conversion transform and connect from the OLE DB Source transform
1.6 Add a “Term Extraction” transform and connect from the Data Conversion transform
• Term Type: Noun and Noun Phrase
• Score Type: TFIDF – the score is the TFIDF value of the term, i.e. the product of Term
  Frequency and Inverse Document Frequency, defined as: TFIDF of a term T = (frequency of T) *
  log( (#rows in Input) / (#rows having T) ). For example, a term that occurs in only 10 of
  1,000 input rows is weighted by log(1000/10), so it counts for more than a term that appears
  in most rows.
• Parameters: Frequency = 2, Length = 10
1.7 Add a “Sort” transform and connect it.
• Sort “Term” in ascending order
• Don’t pass through Score column
1.8 Add an “OLE DB Destination” transform and connect it.
• Use the connection: localhost.SSRAM
• Click “New” and name it “TermDictionary”
• In Mappings, connect the column, “Term”
1.9 Execute the package
• It automatically enters into debugging mode
• It may take a few minutes
1.10 Stop debugging
Build Term Vectors
1.11 Create a new DTS (SSIS) package
1.12 Rename the package to BuildTermVectors.dtsx
1.13 Go to the Data Flow tab and add a new Data Flow task
1.14 In the data flow task, add an “OLE DB Source” transform
• Connection: create a new one for localhost.SSRAM
• Table: entryDesc
• Columns: entry_Name, desc only
1.15 Add a “Term Lookup” transform and connect from the previous transform
• Reference table: TermDictionary
• PassThru column: entry_name
• Lookup input column: descript
1.16 Add a “Sort” transform and connect it.
• Sort “entry_name” in ascending order, then “Term” in ascending order, no duplicates
1.17 Add an “OLE DB Destination” transform and connect it.
• Use the connection: localhost.TDM
• Click “New” and name it “TermVectors”
• In Mappings, make sure to connect all columns: “Term”, “Frequency”, “ID”
1.18 Execute the package
• It automatically enters debugging mode
• It may take a few minutes
1.19 Stop debugging
Appendix B-2 – Determining Clustering Algorithm
This appendix shows the process of preparing the data for clustering and setting up the
clustering algorithm.
o Create a new prepareSampleData by splitting the data into training and test sets. In our
case, we used the entries reported from 1996 through 2001 for training and the entries
reported in 2002 for testing.
o Setup new DataSourceView to be used for Clustering
o Create clustering data mining structure with and without description
components
With Description
• Following fields suggested (screenshots of the suggested fields omitted)
• Created four clusterings using the EM algorithm with 10, 8, 6, and 4 nodes.
o The X-axis of the lift chart represents the percentage of the test dataset used to compare
the predictions.
o The Y-axis represents the percentage of predictions that are correct.
With NVDCL-EM4 consistently underperforming the others, it was replaced with a K-Means
algorithm for further comparison. The K-Means algorithm had the same score and the same
proportion of the population predicted correctly regardless of the number of nodes used, so
it did not matter which one we used. The K-Means models consistently underperformed the EM
models, so the EM-8 model was chosen.
o Without Description (corresponding charts omitted)
Processing Time for Clustering with the Description Vector
We looked at the processing time for clustering given the description vector; across all of
the algorithms, it took 56 seconds.
126
Appendix B-3 – Query for Identifying Clusters in Training Data
The following query was run on the analysis server to predict and assign clusters to each
entry.
SELECT
t.[name],
t.[discovered],
t.[published],
t.[cvss_score],
t.[loss_type],
t.[adminSP],
t.[userSP],
t.[otherSP],
t.[vuln_type],
t.[input_bound],
t.[input_buffer],
t.[exploit_range],
t.[ref],
t.[source],
t.[sig],
t.[adv],
t.[patch],
t.[prodName],
t.[versionNum],
t.[preVersion],
t.[edition],
(Cluster()) as
[ClusterNode],
(ClusterProbability()) as
[ClusterProbability]
From
[NVDCL-EM-10]
PREDICTION JOIN
SHAPE {
OPENQUERY([SSRAM],
'SELECT
[name],
[discovered],
[published],
[cvss_score],
[loss_type],
[adminSP],
[userSP],
[otherSP],
[vuln_type],
[input_bound],
[input_buffer],
[exploit_range],
[ref],
[source],
[sig],
[adv],
[patch],
[prodName],
[versionNum],
[preVersion],
[edition]
FROM
[dbo].[TrainingData_96_01]
ORDER BY
[name]')}
APPEND
({OPENQUERY([SSRAM],
'SELECT
[Term],
[Frequency],
[entry_name]
FROM
[dbo].[TermVectors]
ORDER BY
[entry_name]')}
RELATE
[name] TO [entry_name])
AS
[TermVectors] AS t
ON
[NVDCL-EM-10].[Discovered] =
t.[discovered] AND
[NVDCL-EM-10].[Published] =
t.[published] AND
[NVDCL-EM-10].[Cvss Score] =
t.[cvss_score] AND
[NVDCL-EM-10].[Loss Type] =
t.[loss_type] AND
[NVDCL-EM-10].[Admin SP] =
t.[adminSP] AND
[NVDCL-EM-10].[User SP] =
t.[userSP] AND
[NVDCL-EM-10].[Other SP] =
t.[otherSP] AND
[NVDCL-EM-10].[Vuln Type] =
t.[vuln_type] AND
[NVDCL-EM-10].[Input Bound] =
t.[input_bound] AND
[NVDCL-EM-10].[Input Buffer] =
t.[input_buffer] AND
[NVDCL-EM-10].[Exploit Range] =
t.[exploit_range] AND
[NVDCL-EM-10].[Source] =
t.[source] AND
[NVDCL-EM-10].[Sig] =
t.[sig] AND
[NVDCL-EM-10].[Adv] =
t.[adv] AND
[NVDCL-EM-10].[Patch] =
t.[patch] AND
[NVDCL-EM-10].[Prod Name] =
t.[prodName] AND
[NVDCL-EM-10].[Version Num] =
t.[versionNum] AND
[NVDCL-EM-10].[Pre Version] =
t.[preVersion] AND
[NVDCL-EM-10].[Edition] =
t.[edition] AND
[NVDCL-EM-10].[Term Vectors].[Term] =
t.[TermVectors].[Term] AND
[NVDCL-EM-10].[Term Vectors].[Frequency] =
t.[TermVectors].[Frequency]
This effectively gives a result set, which is stored in a table.
Appendix B-4 – Classified Data Matrix
Classification models are evaluated based on the number of test records that they correctly
and incorrectly predict. Columns correspond to the actual values and rows depict the predicted
values. The metric used to measure the performance of each model is accuracy, defined as:

Accuracy = Number of correct predictions / Total number of predictions

The rest of this appendix shows the confusion matrices used to determine the accuracy of
the classification models.
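For example, the decision-tree classifier without the description vector (the first matrix below) places 6,753 of the 10,170 test records on the diagonal of its confusion matrix, which corresponds to the reported accuracy of roughly 0.66.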
Classifier without Description - Classification Confusion Matrices
(rows are predicted clusters, columns are actual clusters)

Counts for DT ClusterClassifier on [Cluster Node] (Accuracy = 0.66)
Predicted     C1     C6     C4     C3     C8     C5     C7     C2
Cluster 1   2708      5      0      6     42      4      0    102
Cluster 6    767   1243      0    256      2    162      0     54
Cluster 4     31     13    167      2     41     23     71      2
Cluster 3      0     48      0    117      0      5      7     79
Cluster 8      0      0      0      0      0      0      0      0
Cluster 5     32     92      9     18      0   1099     13     91
Cluster 7     47      6    394      5    289      2    762     25
Cluster 2     22     24      5    592      6     17      6    657

Counts for NB ClusterClassifier on [Cluster Node] (Accuracy = 0.67)
Predicted     C1     C6     C4     C3     C8     C5     C7     C2
Cluster 1   3158    496     60    145     74    164     41    157
Cluster 6    307    680      1     89      0     63      2     29
Cluster 4     35     14    257      5     75     32    197      6
Cluster 3     19    129     10    513      0      7      4    103
Cluster 8      2      0      9      0     17      0     15      6
Cluster 5     47     89      6     15      2   1017      7     59
Cluster 7     28      2    226      8    202      2    578     22
Cluster 2     11     21      6    221     10     27     15    628

Counts for NN ClusterClassifier on [Cluster Node] (Accuracy = 0.86)
Predicted     C1     C6     C4     C3     C8     C5     C7     C2
Cluster 1   3052    260     46     77     70    149     31    137
Cluster 6    421    889      0    154      0     73      1     28
Cluster 4     34      3    207      3     68     32    138      7
Cluster 3     11    146      6    566      2      5      3    174
Cluster 8      0      1      6      1      2      1      4      1
Cluster 5     50     89     16     15      3   1027     11     70
Cluster 7     27     10    285     10    226      2    662     29
Cluster 2     12     33      9    170      9     23      9    564
Classifier with Description - Classification Confusion Matrices
(rows are predicted clusters, columns are actual clusters)

Counts for DT ClusterClassifier with Desc on [Cluster Node] (Accuracy = 0.63)
Predicted     C1     C6     C4     C3     C8     C5     C7     C2
Cluster 1   2708      5      0      6     42      4      0    102
Cluster 6    774   1251      0    256      2    212      0     54
Cluster 4      0      0      0      0      0      0      0      0
Cluster 3      0      0      0      0      0      0      0      0
Cluster 8      0      0      0      0      0      0      0      0
Cluster 5     32     72      0     18      0    882      3     95
Cluster 7     79     31    570      7    330    209    843     40
Cluster 2     14     72      5    709      6      5     13    719

Counts for NB ClusterClassifier with Desc on [Cluster Node] (Accuracy = 0.68)
Predicted     C1     C6     C4     C3     C8     C5     C7     C2
Cluster 1   3158    409      0    105     44    163      3    129
Cluster 6    339    809      0    153      0     64      0     33
Cluster 4     31     13    185      2     41     23     71      2
Cluster 3      3     83      2    325      0      7      0     83
Cluster 8      0      0      0      0      0      0      0      0
Cluster 5     12     84      9     16      0   1033     10     79
Cluster 7     47      6    376      9    289      2    769     29
Cluster 2     17     27      3    386      6     20      6    655

Counts for NN ClusterClassifier with Desc on [Cluster Node] (Accuracy = 0.65)
Predicted     C1     C6     C4     C3     C8     C5     C7     C2
Cluster 1   3482    880    100    196     83    205     25    155
Cluster 6     33    347      0     41      0     22      0      2
Cluster 4      0      0      4      0      0      0      0      0
Cluster 3      2     63      8    287      0      5     15     43
Cluster 8      0      0      0      0      0      0      0      0
Cluster 5     25     85     12     16      0   1043     13     85
Cluster 7     48     11    448      7    291     13    793     22
Cluster 2     17     45      3    449      6     24     13    703
Appendix B-5 – Validation of Clustering Algorithms
Validation statistics of all models for the training data, 1996-2001. To determine the
cohesion of each model we looked at the average score for each cluster and its standard
deviation, which shows the variability around the mean. We also looked at the range of scores
within each cluster by taking the minimum (Min) and maximum (Max) score. The weighted (Wtd)
scores were based on the growth rate of the scores, as explained in Chapter 3.
EM8Clusters - All
Cluster      Count  AvgProb  StdDev  AvgScore  Max  Min  WtdStdDev  WtdAvgScore  WtdMax  WtdMin
Cluster 1    22717     0.81    2.08      7.44   10  1.6       2.25         6.09      10    0.91
Cluster 2     7816     0.54    2.2       5.43   10  1.9       1.43         2.89      10    0.95
Cluster 3     7296     0.51    2.19      5.16   10  1.9       1.64         2.66      10    1.06
Cluster 4     3856     0.51    2.39      5.01   10  2.3       1.52         2.46      10    0.83
Cluster 5     9306     0.62    2.33      4.75   10  2.3       1.84         3.02      10    1.02
Cluster 6      514     0.9     0         7       7  7         0.61         6.29       7    5.59

EM8Clusters - Probability > 0.8
Cluster      Count  AvgProb  StdDev  AvgScore  Max  Min  WtdStdDev  WtdAvgScore  WtdMax  WtdMin
Cluster 1    14101     0.94    1.67      7.69   10  1.6       1.64         7.23      10    1.6
Cluster 2      687     0.97    2.26      5.21   10  2.3       2.19         5.03      10    1.86
Cluster 3     1011     0.96    2.4       4.58   10  1.9       2.29         4.39      10    1.59
Cluster 4      631     0.99    2.15      4.51   10  2.3       2.11         4.45      10    1.9
Cluster 5     2143     0.86    2.31      6.32   10  2.3       1.88         5.35      10    1.85
Cluster 6      292     0.97    0         7       7  7         0.01         6.82       7    6.82

EM10Clusters - All
Cluster      Count  AvgProb  StdDev  AvgScore  Max  Min  WtdStdDev  WtdAvgScore  WtdMax  WtdMin
Cluster 1     9468     0.65    2.02      4.38   10  1.9       1.34         2.78      10    0.72
Cluster 2    11350     0.82    1.24      6.75   10  1.9       1.32         5.51      10    0.9
Cluster 3    11688     0.86    1.59      8.89   10  1.6       2.23         7.72      10    1.03
Cluster 4     6940     0.55    1.91      6.04   10  1.9       1.34         3.23      10    1.01
Cluster 5     4869     0.51    1.91      4.6    10  2.3       1.35         2.38      10    1.2
Cluster 6     6649     0.74    1.73      3.9    10  2.3       1.25         2.82      10    1.12
Cluster 7      541     0.87    0         7       7  7         0.69         6.11       7    5.34

EM10Clusters - Probability > 0.80
Cluster      Count  AvgProb  StdDev  AvgScore  Max  Min  WtdStdDev  WtdAvgScore  WtdMax  WtdMin
Cluster 1     1700     0.9     1.61      3.8    10  1.9       1.71         3.48      10    1.86
Cluster 2     8351     0.9     0.81      6.74   10  1.9       0.78         6.09      10    1.61
Cluster 3     8761     0.94    1.34      9.29   10  1.6       1.46         8.72      10    1.59
Cluster 4      860     0.94    2.27      4.41   10  1.9       2.17         4.13      10    1.57
Cluster 5      429     0.98    1.91      4.45   10  2.3       1.87         4.35      10    2.25
Cluster 6     2990     0.86    1.54      3.28   10  2.3       1.48         2.85      10    1.85
Cluster 7      300     0.96    0         7       7  7         0.02         6.73       7    6.73

EM12Clusters - All
Cluster      Count  AvgProb  StdDev  AvgScore  Max  Min  WtdStdDev  WtdAvgScore  WtdMax  WtdMin
Cluster 1    15406     0.85    1.79      8.08   10  1.6       2.3          6.86      10    1.49
Cluster 2     6655     0.73    2.07      4.18   10  2.3       1.33         2.87      10    0.96
Cluster 3     2072     0.5     1.93      4.2    10  2.3       1.95         2.26      10    0.61
Cluster 4     2324     0.56    2.79      7.23   10  1.9       1.68         3.51      10    1.16
Cluster 5     4484     0.5     1.06      3.57   10  1.9       1.17         1.81      10    0.82
Cluster 6     5843     0.53    1.37      6.71   10  1.9       0.91         3.46      10    1.19
Cluster 7     5191     0.53    2.36      5.11   10  2.3       1.58         2.74      10    0.89
Cluster 8     9016     0.41    1.97      5.87   10  2.7       1.17         2.48    9.98    0.84
Cluster 9      514     0.84    0         7       7  7         0.87         5.88       7    4.89

EM12Clusters - Probability > 0.8
Cluster      Count  AvgProb  StdDev  AvgScore  Max  Min  WtdStdDev  WtdAvgScore  WtdMax  WtdMin
Cluster 1    11589     0.96    1.73      8.1    10  1.6       1.69         7.81      10    1.6
Cluster 2     3107     0.9     1.56      3.31   10  2.3       1.56         3.03      10    1.85
Cluster 3      471     0.97    2.22      4.85   10  2.3       2.14         4.71      10    2.14
Cluster 4      673     0.95    2.44      4.9    10  1.9       2.4          4.7       10    1.55
Cluster 5      518     0.99    1.93      4.15   10  1.9       1.9          4.11      10    1.9
Cluster 6      435     0.9     1.25      5.84   10  1.9       1.29         5.26      10    1.89
Cluster 7      266     0.98    2.53      4.81   10  2.3       2.54         4.74      10    2.23
Cluster 8       71     0.95    2.16      5.62   10  2.7       2.06         5.35    9.98    2.69
Cluster 9      292     0.95    0         7       7  7         0.02         6.64       7    6.64

KM10Clusters - All
Cluster      Count  AvgProb  StdDev  AvgScore  Max  Min  WtdStdDev  WtdAvgScore  WtdMax  WtdMin
Cluster 1    12020        1    1.19      7.01   10  2.3       1.19         7.01      10     2.3
Cluster 10    2126        1    1.75      4.14   10  1.9       1.75         4.14      10     1.9
Cluster 2    12774        1    1.64      8.68   10  1.9       1.64         8.68      10     1.9
Cluster 3     6422        1    0.55      3.38   10  1.9       0.55         3.38      10     1.9
Cluster 4     2642        1    0.77      3.41   10  1.9       0.77         3.41      10     1.9
Cluster 5     4423        1    0.73      6.99   10  2.3       0.73         6.99      10     2.3
Cluster 6     3136        1    0.62      2.41   10  1.9       0.62         2.41      10     1.9
Cluster 7     2257        1    1.02      5.25   10  1.6       1.02         5.25      10     1.6
Cluster 8     3522        1    2         4.64   10  1.9       2            4.64      10     1.9
Cluster 9     2183        1    2.04      6.78   10  2.3       2.04         6.78      10     2.3

Training Data by Clusters (1996 - 2001), with 95% confidence intervals
Cluster     NumInCluster  CVSS_Score  StdDev  Margin of Error    Lower    Upper
Cluster 1           1395      7.9726  1.7142           0.09     7.8826   8.0625
Cluster 6            867      6.7191  1.4718           0.098    6.6211   6.817
Cluster 8            338      6.1498  2.6781           0.2855   5.8643   6.4353
Cluster 2            651      5.5779  2.266            0.1741   5.4038   5.752
Cluster 3            573      4.8681  2.1551           0.1765   4.6916   5.0445
Cluster 4            431      4.6508  2.4416           0.2305   4.4203   4.8813
Cluster 5            536      4.384   2.1133           0.1789   4.2051   4.5629
Cluster 7            549      3.3     0                0        3.3      3.3
Appendix B-6 – Processing Time for Classifiers
Figure 37 - Processing Time for Classification Algorithm
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
<Parallel>
<Process xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Object>
<DatabaseID>SSRAM22608</DatabaseID>
<MiningStructureID>Classify Clusters</MiningStructureID>
</Object>
<Type>ProcessFull</Type>
<WriteBackTableCreation>UseExisting</WriteBackTableCreation>
</Process>
</Parallel>
</Batch>
Processing Mining Structure 'ClassifyClustersWoDescription' completed successfully.
Start time: 7/22/2008 7:53:20 AM; End time: 7/22/2008 7:54:00 AM; Duration: 0:00:40
Processing Mining Model 'NB ClusterClassifier' completed successfully.
Start time: 7/22/2008 7:53:27 AM; End time: 7/22/2008 7:53:30 AM; Duration: 0:00:03
Processing Mining Model 'DT ClusterClassifier' completed successfully.
Start time: 7/22/2008 7:53:28 AM; End time: 7/22/2008 7:53:33 AM; Duration: 0:00:05
Processing Mining Model 'NN ClusterClassifier' completed successfully.
Start time: 7/22/2008 7:53:28 AM; End time: 7/22/2008 7:54:00 AM; Duration: 0:00:32
Appendix B-7 – Calculating CVSS ‘base’ Score
An example of how the CVSS base score is derived, given the entry in Figure 22 (CVSS base
score and vector):
BASE METRIC               EVALUATION    SCORE
-------------------------------------------------------------------
Access Vector             [Local]       (0.395)
Access Complexity         [Low]         (0.71)
Authentication            [None]        (0.704)
Confidentiality Impact    [Complete]    (0.66)
Integrity Impact          [Complete]    (0.66)
Availability Impact       [Complete]    (0.66)
-------------------------------------------------------------------
BASE FORMULA                                           BASE SCORE
-------------------------------------------------------------------
Impact = 10.41*(1-(0.34*0.34*0.34)) == 9.83
Exploitability = 20*0.395*0.71*0.704 == 3.95
f(Impact) = 1.176
Base Score = ((0.6*9.83)+(0.4*3.95)-1.5)*1.176 == (7.09)

Note that the score values above are derived from the algorithm shown below from [First 2005]:

BaseScore = round_to_1_decimal(((0.6*Impact)+(0.4*Exploitability)-1.5)*f(Impact))
Impact = 10.41*(1-(1-ConfImpact)*(1-IntegImpact)*(1-AvailImpact))
Exploitability = 20* AccessVector*AccessComplexity*Authentication
f(Impact) = 0 if Impact=0, 1.176 otherwise

AccessVector     = case AccessVector of
                        requires local access: 0.395
                        adjacent network accessible: 0.646
                        network accessible: 1.0
AccessComplexity = case AccessComplexity of
                        high: 0.35
                        medium: 0.61
                        low: 0.71
Authentication   = case Authentication of
                        requires multiple instances of authentication: 0.45
                        requires single instance of authentication: 0.56
                        requires no authentication: 0.704
ConfImpact       = case ConfidentialityImpact of
                        none: 0.0
                        partial: 0.275
                        complete: 0.660
IntegImpact      = case IntegrityImpact of
                        none: 0.0
                        partial: 0.275
                        complete: 0.660
AvailImpact      = case AvailabilityImpact of
                        none: 0.0
                        partial: 0.275
                        complete: 0.660
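The following T-SQL fragment is a minimal sketch of the base-score formula above (it is not part of SSRAM); the weight values are those listed for the Figure 22 entry.
DECLARE @AccessVector float, @AccessComplexity float, @Authentication float,
        @ConfImpact float, @IntegImpact float, @AvailImpact float,
        @Impact float, @Exploitability float, @fImpact float
-- Weights for the Figure 22 entry: AV:Local, AC:Low, Au:None, C/I/A:Complete
SET @AccessVector     = 0.395
SET @AccessComplexity = 0.71
SET @Authentication   = 0.704
SET @ConfImpact       = 0.66
SET @IntegImpact      = 0.66
SET @AvailImpact      = 0.66
SET @Impact = 10.41 * (1 - (1 - @ConfImpact) * (1 - @IntegImpact) * (1 - @AvailImpact))
SET @Exploitability = 20 * @AccessVector * @AccessComplexity * @Authentication
SET @fImpact = CASE WHEN @Impact = 0 THEN 0 ELSE 1.176 END
SELECT ROUND(((0.6 * @Impact) + (0.4 * @Exploitability) - 1.5) * @fImpact, 1) AS BaseScore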
Appendix C-1 – Stored Procedure for Calculating Impact Score
Attributes
This appendix shows the code for calculating the impact score attributes.
set ANSI_NULLS ON
set QUOTED_IDENTIFIER ON
go
-- =============================================
-- Author:      Idongesit Mkpong-Ruffin
-- Create date: 4/14/2008
-- Description: Calculate impact factor values
-- =============================================
ALTER PROCEDURE [dbo].[spCalcImpactFactors]
-- Add the parameters for the stored procedure here
(
@endDate datetime = '01/01/2001',
@timePeriods int = 12,
@clusterNode nvarchar(155)='Cluster 1',
@weightedAvgScore float OUTPUT,
@avgScoreGrowth float OUTPUT,
@avgFreqGrowth float OUTPUT,
@SumResult int OUTPUT) -- to see if I can get the information returned
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
-- Declare the return variable here
--DECLARE @SumResult int
-- Temporary table to store interim data for computation
DECLARE @tmpClusterInfo table(timeGap int, avgScore float,
reportedEntries float)
INSERT INTO @tmpClusterInfo(timeGap, avgScore,
reportedEntries)
select datediff(month, published, @endDate)as timeGap,
avg(cvss_score) as avgScore, count(distinct name) as reportedEntries
from trainingdatawclusters96_01
where
(datediff(month, published, @endDate) > 0)and
(datediff(month, published, @endDate) <=
@timePeriods)and
clusternode = @clusterNode
group by datediff(month, published, @endDate)
order by timegap
-- Total Entries
--select sum(reportedEntries) from @tmpClusterInfo as TotalEntries --debug
set @SumResult = (select sum(reportedEntries) from
@tmpClusterInfo)
--- get the total entries for coefficientSummation
set @weightedAvgScore = (select
round(sum(((reportedEntries)/@sumResult)*avgScore),2)
from @tmpClusterInfo)
-- Score Growth
-- get the avgGrowth rate of Score for this period within cluster
set @avgScoreGrowth = (select
round(avg(((nextRow.avgScore/curRow.avgScore)-1)),2) as avgScoreGrowth
from @tmpClusterInfo curRow
left join @tmpClusterInfo nextRow
on curRow.timegap = nextRow.timeGap-1)
-- Frequency Of Occurrence Growth
-- get the avgGrowth rate of Score for this period within cluster
set @avgFreqGrowth = (select
round(avg(((nextRow.reportedEntries/curRow.reportedEntries)-1)),2)
from @tmpClusterInfo curRow
left join @tmpClusterInfo nextRow
on curRow.timegap = nextRow.timeGap-1)
-- Add the T-SQL statements to compute the return value here
return
END
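A call of the stored procedure might look as follows (a sketch; the end date, number of periods, and cluster label are illustrative):
DECLARE @wAvgScore float, @scoreGrowth float, @freqGrowth float, @totalEntries int
EXEC [dbo].[spCalcImpactFactors]
     @endDate = '01/01/2001',
     @timePeriods = 12,
     @clusterNode = 'Cluster 1',
     @weightedAvgScore = @wAvgScore OUTPUT,
     @avgScoreGrowth = @scoreGrowth OUTPUT,
     @avgFreqGrowth = @freqGrowth OUTPUT,
     @SumResult = @totalEntries OUTPUT
SELECT @wAvgScore AS WeightedAvgScore, @scoreGrowth AS AvgScoreGrowth,
       @freqGrowth AS AvgFreqGrowth, @totalEntries AS TotalEntries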
139
Appendix C-2 – Stored Procedure for Determining Prioritized
Listing
Once each entry has been de-normalized so that its component elements can be analyzed and
assigned scores, the scores are aggregated to determine an overall impact score for the entry.
This appendix shows the code used to effect this in SSRAM.
set ANSI_NULLS ON
set QUOTED_IDENTIFIER ON
go
-- =============================================
-- Author:      Idongesit Mkpong-Ruffin
-- Create date:
-- Description:
-- =============================================
ALTER PROCEDURE [dbo].[spPrioritizedList]
-- Add the parameters for the stored procedure here
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
-- Insert statements for procedure here
select distinct entryID, round(avg(predImpact),2) as PredImpact,
round(avg(predFreq),0) as PredFreq, round(avg(lossExpect),2) as
LossExpect
from predictions
group by entryID
order by LossExpect desc
END
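The procedure takes no parameters; a call such as the following (a sketch) returns one row per entry with its average predicted impact, predicted frequency, and loss expectancy, ordered from highest to lowest loss expectancy:
EXEC [dbo].[spPrioritizedList]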
140
Appendix C-3 – Stored Procedure for Persisting Predictions
Prioritizing predictions requires that the results of each prediction be stored. This appendix
shows the code used for persisting predictions.
set ANSI_NULLS ON
set QUOTED_IDENTIFIER ON
go
-- =============================================
-- Author:      Idongesit Mkpong-Ruffin
-- Create date: 5/19/08
-- Description: Creates persistent storage of predicted information
-- =============================================
ALTER PROCEDURE [dbo].[spPredictionStore]
-- Add the parameters for the stored procedure here
@entryID varChar(30),
@vulnType varChar(15),
@lossType varChar(15),
@exploitRange varChar(15),
@clusterNode varChar(15),
@predImpact float,
@predFreq float,
@lossExp float
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
-- Insert statements for procedure here
	insert into Predictions(entryID, vulnType, lossType, exploitRange, clusterNode,
	                        predImpact, predFreq, lossExpect)
	values(@entryID, @vulnType, @lossType, @exploitRange, @clusterNode,
	       @predImpact, @predFreq, @lossExp);
END
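An example call (a sketch; the entry identifier and score values below are made-up, illustrative values consistent with loss expectancy = predicted impact x predicted frequency):
EXEC [dbo].[spPredictionStore]
     @entryID = 'CVE-0000-0000',   -- hypothetical entry
     @vulnType = 'input',
     @lossType = 'avail',
     @exploitRange = 'remote',
     @clusterNode = 'Cluster 1',
     @predImpact = 6.29,
     @predFreq = 12,
     @lossExp = 75.48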
141
Appendix D-1 – Singleton Request Algorithm
To validate that SSRAM could, based on loss type, vulnerability type, and exploit range,
classify an entry and, by extension, accurately predict its CVSS score, we generated single
requests for known clusters using the code shown in this appendix.
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using Microsoft.AnalysisServices.AdomdClient;
namespace SSRAMWinApp1
{
public partial class frmSingle : Form
{
public frmSingle()
{
InitializeComponent();
}
private void linkLabel1_LinkClicked(object sender,
LinkLabelLinkClickedEventArgs e)
{
}
private void label2_Click(object sender, EventArgs e)
{
}
private void label1_Click(object sender, EventArgs e)
{
}
private void Form1_Load(object sender, EventArgs e)
{
}
private void label1_Click_1(object sender, EventArgs e)
{
}
private void btnCalc_Click(object sender, EventArgs e)
{
            string strASConnString =
                SSRAMWinApp1.Properties.Settings.Default.AS_SSRAMConnectionString;
            string strDBConnString =
                Convert.ToString(SSRAMWinApp1.Properties.Settings.Default.DB_SSRAMConnectionString);
string strClusterLbl = "";
double dblPredictImpact = 0;
double dblPredictFreq = 0;
DateTime dtPredictDate;
int intPeriods = 0;
try
{
IPredictElementCluster objRequester = new
PredictElementCluster();
this.txtResult.Text = "";
                // string strRequestCluster = objRequester.Command(strASConnString,
                //     this.tbVT.Text.Trim(), this.tbLT.Text.Trim(), this.tbER.Text.Trim());
                string strRequestCluster =
                    objRequester.Command(strASConnString,
                        this.cbVulnType.SelectedItem.ToString().Trim(),
                        this.cbLossType.SelectedItem.ToString().Trim(),
                        this.cbExploitRange.SelectedItem.ToString().Trim());
strClusterLbl = objRequester.getClusterNode;
ICalculateImpactFactor objRequestImpact = new
CalcImpactFactor();
                this.txtImpactInfo.Text = ""; // to put impact factor and loss expectation
                dtPredictDate = Convert.ToDateTime(this.dateTimePicker1.Text.Trim());
                // result = Int32.TryParse("3", NumberStyles.Integer, null, out int32Val);
                bool blconvertPeriod = int.TryParse(this.txtMonths.Text, out intPeriods);
                // bool blcp = int.TryParse(this.txtImpactInfo.Text, out intPeriods);
                string strImpactInfo =
                    objRequestImpact.Command(strDBConnString, dtPredictDate, intPeriods, strClusterLbl);
int intImpactLength = strImpactInfo.Length;
//get predicted Impact
dblPredictImpact = objRequestImpact.getPredictedImpact;
dblPredictFreq =
objRequestImpact.getPredictedOccurrences;
//display results
this.txtResult.Text = strRequestCluster +
strImpactInfo;
this.txtImpactInfo.Text = "Predicted Impact: " +
dblPredictImpact.ToString().Trim() + "\r\n" +
"Predicted Occurrence: " +
dblPredictFreq.ToString().Trim();
}
catch (Exception ex)
{
System.Windows.Forms.MessageBox.Show(ex.ToString());
                // try
                // {
                //     IPredictCustomerMDXRequester objRequester = new PredictCustomerMDXRequester();
                //     this.m_tbResult.Text = "";
                //     this.m_tbResult.Text = objRequester.Command(strASConnString,
                //         this.m_tbCity.Text.Trim(), this.m_tbContactTitle.Text.Trim());
                // }
                // catch (Exception ex)
                // {
                //     System.Windows.Forms.MessageBox.Show(ex.ToString());
                // }
}
} // btnCalcClick
private void label1_Click_2(object sender, EventArgs e)
{
}
private void textBox1_TextChanged(object sender, EventArgs e)
{
}
private void label2_Click_1(object sender, EventArgs e)
{
}
private void checkedListBox1_SelectedIndexChanged(object
sender, EventArgs e)
{
}
private void label1_Click_3(object sender, EventArgs e)
{
}
private void comboBox1_SelectedIndexChanged(object sender,
EventArgs e)
{
}
private void comboBox1_SelectedIndexChanged_1(object sender,
EventArgs e)
{
}
private void button1_Click(object sender, EventArgs e)
{
Close();
}
    }
}
Appendix D-2 – Data Table Request (Code for Load Data)
This appendix shows the code for loading a table of de-normalized entries for
classification and prediction of impact scores.
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Data.SqlClient;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using Microsoft.AnalysisServices.AdomdClient;
namespace SSRAMWinApp1
{
public partial class frmTableRequest : Form
{
// Adding the ManufactureEntryDataTable
public static void ManufactureEntryDataTable(DataSet ds)
{
DataTable entries = new DataTable("Entries");
entries.Columns.Add(new DataColumn("EntryID",
typeof(string)));
entries.Columns.Add(new DataColumn("VulnType",
typeof(string)));
entries.Columns.Add(new DataColumn("LossType",
typeof(string)));
entries.Columns.Add(new DataColumn("ExploitRange",
typeof(string)));
ds.Tables.Add(entries);
}// ManufactureEntryDataTable(DataSet ds)
// Adding the ManufacturePredictDataTable
public static void ManufacturePredictDataTable(DataSet ds)
{
DataTable predictions = new DataTable("Predictions");
predictions.Columns.Add(new DataColumn("EntryID",
typeof(string)));
predictions.Columns.Add(new DataColumn("VulnType",
typeof(string)));
predictions.Columns.Add(new DataColumn("LossType",
typeof(string)));
predictions.Columns.Add(new DataColumn("ExploitRange",
typeof(string)));
predictions.Columns.Add(new DataColumn("EntryCluster",
typeof(string)));
predictions.Columns.Add(new DataColumn("PredictedImpact",
typeof(double)));
predictions.Columns.Add(new
DataColumn("PredictedFrequency", typeof(int)));
predictions.Columns.Add(new DataColumn("LossExpect",
typeof(double)));
ds.Tables.Add(predictions);
}// ManufacturePredictDataTable(DataSet ds)
public static void ManufactureOrderedTable(DataSet ds)
{
DataTable entries = new DataTable("Ordered");
entries.Columns.Add(new DataColumn("PredictedImpact",
typeof(double)));
entries.Columns.Add(new DataColumn("PredictedFrequency",
typeof(double)));
entries.Columns.Add(new DataColumn("LossExpection",
typeof(double)));
ds.Tables.Add("ordered");
// ds.Tables.Add(prioritizedPredictions);
} // ManufactureOrderedTable(DataSet ds)
public frmTableRequest()
{
InitializeComponent();
}
private void btnLoadDta_Click(object sender, EventArgs e)
{
try
{
                //// used as test data to query - may need to change this to get those based on the client's choice
                string selectQuery = "select distinct name as EntryID, " +
                    "vuln_type as VulnType, loss_type as LossType, exploit_range as ExploitRange " +
                    "from test02classifier where month(published) = 1 " +
                    "and year(published) = 2002";
SqlConnection dtaConnection = new
SqlConnection(strDBConnString);
SqlDataAdapter daCmd = new SqlDataAdapter(selectQuery,
dtaConnection);
                ////ManufactureEntryDataTable(ds); // Add the Entries table to the dataset
daCmd.Fill(ds, "Entries");
this.vulnEntriesDataGridView.AutoGenerateColumns =
true;
this.vulnEntriesDataGridView.DataSource = ds;
this.vulnEntriesDataGridView.DataMember = "Entries";
// do I need to close the connection?
dtaConnection.Close();
ManufacturePredictDataTable(ds);
}// end try
catch (Exception ex)
{
System.Windows.Forms.MessageBox.Show(ex.ToString());
}//end catch
}
private void label1_Click(object sender, EventArgs e)
{
}
private void btnPredict_Click(object sender, EventArgs e)
{
string strClusterLbl = "";
double dblPredProb = 0;
SqlConnection conn = new SqlConnection(strDBConnString);
//delete content of predictions
string del = "delete Predictions";
SqlCommand cmdDel = new SqlCommand(del, conn);
conn.Open();
cmdDel.ExecuteScalar();
conn.Close();
foreach (DataRow row in ds.Tables["Entries"].Rows)
{
string strVulnType = row["VulnType"].ToString();
string strLossType = row["LossType"].ToString();
string strExploitRange =
row["ExploitRange"].ToString();
IPredictElementCluster objRequester = new
PredictElementCluster();
string strRequestCluster =
objRequester.Command(strASConnString, strVulnType.Trim(),
strLossType.Trim(),
strExploitRange.Trim());
strClusterLbl = objRequester.getClusterNode;
dblPredProb = objRequester.getPredictProbability;
// Get the predicted values for this entry values
ICalculateImpactFactor objRequestImpact = new
CalcImpactFactor();
// will look at this when I get data from interface
dtPredictDate =
Convert.ToDateTime(this.dateTimePicker1.Text.Trim());
bool blconvertPeriod =
int.TryParse(this.txtMonths.Text, out intPeriods);
string strImpactInfo =
objRequestImpact.Command(strDBConnString, dtPredictDate, intPeriods,
strClusterLbl);
//get predicted Impact
dblPredictImpact = objRequestImpact.getPredictedImpact;
dblPredictFreq =
objRequestImpact.getPredictedOccurrences;
double dblLossExpect = Math.Round(dblPredictImpact *
dblPredictFreq,2);
                // inserting a new row with the values into the Predictions table
DataRow predictRow = ds.Tables["Predictions"].NewRow();
predictRow["EntryID"] = row["EntryID"];
predictRow["VulnType"] = row["VulnType"];
predictRow["LossType"] = row["LossType"];
predictRow["ExploitRange"] = row["ExploitRange"];
predictRow["EntryCluster"] = strClusterLbl;
predictRow["PredictedImpact"] = dblPredictImpact;
predictRow["PredictedFrequency"] = dblPredictFreq;
predictRow["LossExpect"] = dblLossExpect;
// used to add the new row to the prediction table
ds.Tables["Predictions"].Rows.Add(predictRow);
//Persist the data gotten 8/2/2008
                string ins = @"insert into
                    Predictions(entryID,vulnType,lossType,exploitRange,clusterNode,predImpact,predFreq,lossExpect)
                    values(@entryID,@vulnType,@lossType,@exploitRange,@clusterNode,@predImpact,@predFreq,@lossExp)";
SqlCommand cmdIns = new SqlCommand(ins, conn);
// not sure if this is necessary yet
cmdIns.CommandType = CommandType.TableDirect;
                // assign values to the parameters given
                SqlParameter objParameter = null;
                objParameter = cmdIns.CreateParameter();
                // objParameter.Direction = System.Data.ParameterDirection.Input; // changed from ParameterDirection.Input
                objParameter.ParameterName = "entryID";
                objParameter.Value = row["EntryID"];
                cmdIns.Parameters.Add(objParameter);
                objParameter = cmdIns.CreateParameter();
                // objParameter.Direction = System.Data.ParameterDirection.Input;
                objParameter.ParameterName = "vulnType";
                objParameter.Value = row["VulnType"];
                cmdIns.Parameters.Add(objParameter);
                objParameter = cmdIns.CreateParameter();
                // objParameter.Direction = System.Data.ParameterDirection.Input;
                objParameter.ParameterName = "lossType";
                objParameter.Value = row["LossType"];
                cmdIns.Parameters.Add(objParameter);
                objParameter = cmdIns.CreateParameter();
                // objParameter.Direction = System.Data.ParameterDirection.Input;
                objParameter.ParameterName = "exploitRange";
                objParameter.Value = row["ExploitRange"];
                cmdIns.Parameters.Add(objParameter);
                objParameter = cmdIns.CreateParameter();
                // objParameter.Direction = System.Data.ParameterDirection.Input;
                objParameter.ParameterName = "clusterNode";
                objParameter.Value = strClusterLbl; // predictRow["EntryCluster"];
                cmdIns.Parameters.Add(objParameter);
                objParameter = cmdIns.CreateParameter();
                // objParameter.Direction = System.Data.ParameterDirection.Input;
                objParameter.ParameterName = "predImpact";
                objParameter.Value = dblPredictImpact; // predictRow["PredictedImpact"];
                cmdIns.Parameters.Add(objParameter);
                objParameter = cmdIns.CreateParameter();
                // objParameter.Direction = System.Data.ParameterDirection.Input;
                objParameter.ParameterName = "predFreq";
                objParameter.Value = dblPredictFreq; // predictRow["PredictedFrequency"];
                cmdIns.Parameters.Add(objParameter);
                objParameter = cmdIns.CreateParameter();
                // objParameter.Direction = System.Data.ParameterDirection.Input;
                objParameter.ParameterName = "lossExp";
                objParameter.Value = dblLossExpect;
                cmdIns.Parameters.Add(objParameter);
cmdIns.UpdatedRowSource = UpdateRowSource.Both;
SqlDataAdapter da = new SqlDataAdapter(cmdIns);
conn.Open();
da.InsertCommand = cmdIns;
// conn.Open();
// cmdIns.ExecuteScalar();
// cmdIns.ExecuteNonQuery();
da.Update(ds, "Predictions");
conn.Close();
}// end foreach (DataRow row in ds.Tables["Entries"].Rows)
this.predictedImpactDataGridView.AutoGenerateColumns =
true;
this.predictedImpactDataGridView.DataSource = ds;
this.predictedImpactDataGridView.DataMember =
"Predictions";
            // Aggregate each entry's prediction for a prioritized listing 8/3/2008
ManufactureOrderedTable(ds);
//Create data adapter to fill the dataset table
SqlDataAdapter daOrdered = new SqlDataAdapter();
// Setup command for data adapter
SqlCommand orderedCmd = new SqlCommand("spPrioritizedList",
conn);
orderedCmd.CommandType = CommandType.StoredProcedure;
orderedCmd.UpdatedRowSource = UpdateRowSource.None;
daOrdered.SelectCommand = orderedCmd;
daOrdered.Fill(ds, "Ordered");
this.orderedListDataGridView.AutoGenerateColumns = true;
this.orderedListDataGridView.DataSource = ds;
this.orderedListDataGridView.DataMember = "Ordered";
this.orderedListDataGridView.Visible = true;
}//private void btnPredict_Click(object sender, EventArgs e)
        private string strASConnString =
            SSRAMWinApp1.Properties.Settings.Default.AS_SSRAMConnectionString;
        private string strDBConnString =
            Convert.ToString(SSRAMWinApp1.Properties.Settings.Default.DB_SSRAMConnectionString);
private double dblPredictImpact = 0;
private double dblPredictFreq = 0;
private DateTime dtPredictDate;
private int intPeriods = 0;
private DataSet ds = new DataSet();
private void button1_Click(object sender, EventArgs e)
{
            // get the data from the Predictions data table into a persistent state in the database
SqlConnection conn = new SqlConnection(strDBConnString);
//conn.Open();
try
{
//delete content of predictions
string del = "delete Predictions";
SqlCommand cmdDel = new SqlCommand(del,conn);
conn.Open();
cmdDel.ExecuteScalar();
conn.Close();
                // string strEntryID, strVulnType, strLossType, strExploitRange, strClusterNode;
// double dblPredImpact, dblPredFreq;
foreach (DataRow row in ds.Tables["Predictions"].Rows)
{
                    string ins = @"insert into
                        Predictions(entryID,vulnType,lossType,exploitRange,clusterNode,predImpact,predFreq)
                        values(@entryID,@vulnType,@lossType,@exploitRange,@clusterNode,@predImpact,@predFreq)";
SqlCommand cmdIns = new SqlCommand(ins, conn);
// not sure if this is necessary yet
cmdIns.CommandType = CommandType.TableDirect;
                    // assign values to the parameters given
                    SqlParameter objParameter = null;
                    objParameter = cmdIns.CreateParameter();
                    objParameter.Direction = System.Data.ParameterDirection.Input; // changed from ParameterDirection.Input
                    objParameter.ParameterName = "entryID";
                    objParameter.Value = row["EntryID"];
                    cmdIns.Parameters.Add(objParameter);
                    objParameter = cmdIns.CreateParameter();
                    objParameter.Direction = System.Data.ParameterDirection.Input;
                    objParameter.ParameterName = "vulnType";
                    objParameter.Value = row["VulnType"];
                    cmdIns.Parameters.Add(objParameter);
                    objParameter = cmdIns.CreateParameter();
                    objParameter.Direction = System.Data.ParameterDirection.Input;
                    objParameter.ParameterName = "lossType";
                    objParameter.Value = row["LossType"];
                    cmdIns.Parameters.Add(objParameter);
                    objParameter = cmdIns.CreateParameter();
                    objParameter.Direction = System.Data.ParameterDirection.Input;
                    objParameter.ParameterName = "exploitRange";
                    objParameter.Value = row["ExploitRange"];
                    cmdIns.Parameters.Add(objParameter);
                    objParameter = cmdIns.CreateParameter();
                    objParameter.Direction = System.Data.ParameterDirection.Input;
                    objParameter.ParameterName = "clusterNode";
                    objParameter.Value = row["EntryCluster"];
                    cmdIns.Parameters.Add(objParameter);
                    objParameter = cmdIns.CreateParameter();
                    objParameter.Direction = System.Data.ParameterDirection.Input;
                    objParameter.ParameterName = "predImpact";
                    objParameter.Value = row["PredictedImpact"];
                    cmdIns.Parameters.Add(objParameter);
                    objParameter = cmdIns.CreateParameter();
                    objParameter.Direction = System.Data.ParameterDirection.Input;
                    objParameter.ParameterName = "predFreq";
                    objParameter.Value = row["PredictedFrequency"];
                    cmdIns.Parameters.Add(objParameter);
cmdIns.UpdatedRowSource = UpdateRowSource.None;
SqlDataAdapter da = new SqlDataAdapter(cmdIns);
conn.Open();
da.InsertCommand = cmdIns;
// conn.Open();
// cmdIns.ExecuteScalar();
// cmdIns.ExecuteNonQuery();
da.Update(ds, "Predictions");
conn.Close();
                }// foreach (DataRow row in ds.Tables["Predictions"].Rows)
}// try
catch (Exception ex)
{
//Console.WriteLine("Error: " + e);
System.Windows.Forms.MessageBox.Show(ex.ToString());
}// catch
finally
{
conn.Close();
}
Close();
}
}
}
Appendix D-3 – Data for SSRAM Validation
To validate that there was no statistically significant difference between SSRAM's predictions
and the actual results, we ran a series of t-tests of sample means. We used data reported in
January 2002 as our test cases. The following tables show the predictions and their comparison
to the actual scores under the different classification schemes.
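For reference, each comparison is a paired two-sample t-test of means: with d_i the difference between the actual and predicted score for entry i, d-bar the mean difference, s_d its standard deviation, and n the number of entries, the test statistic is t = d-bar / (s_d / sqrt(n)) with n - 1 degrees of freedom, under the null hypothesis that the mean difference is zero.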
t-Test Results using Naïve Bayes algorithm
                                 Actual     Predicted
Mean                           5.577419       6.23709
Variance                       5.520473      2.362795
Observations                         31            31
Pearson Correlation             0.68273
Hypothesized Mean Difference          0
df                                   30
t Stat                          -2.1378
P(T<=t) one-tail               0.020399
t Critical one-tail            1.697261
P(T<=t) two-tail               0.040798
t Critical two-tail            2.042272

CVE-ID (actual and predicted entry IDs were identical)   Actual   Predicted   Difference
CVE-1999-1081                                               3.3      6.292       -2.992
CVE-1999-1091                                               3.3      4.8555      -1.5555
CVE-1999-1091                                               3.3      5.7886      -2.4886
CVE-2001-0887                                               2.3      4.8555      -2.5555
CVE-2001-0891                                               7        8.0093      -1.0093
CVE-2001-1457                                               7        8.0093      -1.0093
CVE-2002-0002                                               7        8.0093      -1.0093
CVE-2002-0005                                              10        8.0093       1.9907
CVE-2002-0007                                              10        6.72         3.28
CVE-2002-0008                                               7        6.72         0.28
CVE-2002-0008                                               7        8.0093      -1.0093
CVE-2002-0009                                               3.3      5.7886      -2.4886
CVE-2002-0009                                               3.3      6.292       -2.992
CVE-2002-0010                                               8        5.7886       2.2114
CVE-2002-0010                                               8        6.72         1.28
CVE-2002-0010                                               8        8.0093      -0.0093
CVE-2002-0011                                               3.3      6.292       -2.992
CVE-2002-0038                                               3.3      3.3          0
CVE-2002-0043                                               7        6.72         0.28
CVE-2002-0044                                               4.7      4.8555      -0.1555
CVE-2002-0045                                               8        5.7886       2.2114
CVE-2002-0045                                               8        8.0093      -0.0093
CVE-2002-0046                                               3.3      5.7886      -2.4886
CVE-2002-0047                                               3.3      3.3          0
CVE-2002-0077                                               7        6.72         0.28
CVE-2002-0077                                               7        8.0093      -1.0093
CVE-2002-1594                                               7        8.0093      -1.0093
CVE-2002-1595                                               3.3      6.292       -2.992
CVE-2002-1596                                               3.3      3.3          0
CVE-2002-1597                                               3.3      3.3          0
CVE-2002-1600                                               3.3      5.7886      -2.4886
t-Test: Paired Two Sample for Means using Neural Networks Classification Algorithm

                                 Actual         Predicted
Mean                             5.653125       6.180540625
Variance                         5.525796371    2.337423284
Observations                     32             32
Pearson Correlation              0.629684793
Hypothesized Mean Difference     0
df                               31
t Stat                           1.633197931
P(T<=t) one-tail                 0.056272384
t Critical one-tail              1.695518742
P(T<=t) two-tail                 0.112544769
t Critical two-tail              2.039513438

CVE-ID            Actual Score   Predicted   Difference
CVE-1999-1081     3.3            6.292       -2.992
CVE-1999-1091     3.3            4.8555      -1.5555
CVE-1999-1091     3.3            5.7886      -2.4886
CVE-2001-0887     2.3            4.8555      -2.5555
CVE-2001-0891     7              8.0093      -1.0093
CVE-2001-1457     7              8.0093      -1.0093
CVE-2002-0002     7              8.0093      -1.0093
CVE-2002-0005     10             8.0093      1.9907
CVE-2002-0007     10             6.72        3.28
CVE-2002-0008     7              6.292       0.708
CVE-2002-0008     7              8.0093      -1.0093
CVE-2002-0009     3.3            5.7886      -2.4886
CVE-2002-0009     3.3            6.292       -2.992
CVE-2002-0010     8              5.7886      2.2114
CVE-2002-0010     8              6.72        1.28
CVE-2002-0010     8              8.0093      -0.0093
CVE-2002-0011     3.3            6.292       -2.992
CVE-2002-0038     3.3            3.3         0
CVE-2002-0043     7              6.72        0.28
CVE-2002-0044     4.7            4.8555      -0.1555
CVE-2002-0045     8              4.8555      3.1445
CVE-2002-0045     8              5.7886      2.2114
CVE-2002-0045     8              8.0093      -0.0093
CVE-2002-0046     3.3            5.7886      -2.4886
CVE-2002-0047     3.3            3.3         0
CVE-2002-0077     7              6.72        0.28
CVE-2002-0077     7              8.0093      -1.0093
CVE-2002-1594     7              8.0093      -1.0093
CVE-2002-1595     3.3            6.292       -2.992
CVE-2002-1596     3.3            3.3         0
CVE-2002-1597     3.3            3.3         0
CVE-2002-1600     3.3            5.7886      -2.4886
t-Test: Paired Two Sample for Means using Decision Tree Classifier

                                 Actual      Predicted
Mean                             5.76129     6.102877
Variance                         5.636452    2.316325
Observations                     31          31
Pearson Correlation              0.658869
Hypothesized Mean Difference     0
df                               30
t Stat                           -1.06461
P(T<=t) one-tail                 0.147773
t Critical one-tail              1.697261
P(T<=t) two-tail                 0.295546
t Critical two-tail              2.042272

CVE-ID            Actual Score   Predicted   Difference
CVE-1999-1081     3.3            5.7886      -2.4886
CVE-1999-1091     3.3            4.8555      -1.5555
CVE-1999-1091     3.3            6.292       -2.992
CVE-2001-0887     2.3            4.8555      -2.5555
CVE-2001-0891     7              8.0093      -1.0093
CVE-2001-1457     7              8.0093      -1.0093
CVE-2002-0002     7              8.0093      -1.0093
CVE-2002-0005     10             8.0093      1.9907
CVE-2002-0007     10             6.72        3.28
CVE-2002-0008     7              6.72        0.28
CVE-2002-0008     7              8.0093      -1.0093
CVE-2002-0009     3.3            5.7886      -2.4886
CVE-2002-0010     8              5.7886      2.2114
CVE-2002-0010     8              6.292       1.708
CVE-2002-0010     8              6.72        1.28
CVE-2002-0010     8              8.0093      -0.0093
CVE-2002-0011     3.3            5.7886      -2.4886
CVE-2002-0038     3.3            3.3         0
CVE-2002-0043     7              6.72        0.28
CVE-2002-0044     4.7            4.8555      -0.1555
CVE-2002-0045     8              4.8555      3.1445
CVE-2002-0045     8              5.7886      2.2114
CVE-2002-0045     8              8.0093      -0.0093
CVE-2002-0046     3.3            5.7886      -2.4886
CVE-2002-0047     3.3            3.3         0
CVE-2002-0077     7              6.72        0.28
CVE-2002-1594     7              8.0093      -1.0093
CVE-2002-1595     3.3            5.7886      -2.4886
CVE-2002-1596     3.3            3.3         0
CVE-2002-1597     3.3            3.3         0
CVE-2002-1600     3.3            5.7886      -2.4886
t-Test Results using Cluster-Based Predicted Impact

Mean       -0.4385
Std Dev     1.7385
df         35.0000
t          -1.51354

Entry-ID          Cluster Node   Actual Score   Predicted Impact   Difference
CVE-1999-1081     Cluster 2      3.3            5.1                -1.770
CVE-1999-1091     Cluster 5      3.3            3.8                -0.518
CVE-1999-1091     Cluster 3      3.3            2.4                0.893
CVE-2001-0887     Cluster 5      2.3            3.8                -1.518
CVE-2001-0887     Cluster 5      2.3            3.8                -1.518
CVE-2001-0891     Cluster 1      7              8.1                -1.058
CVE-2001-1457     Cluster 1      7              8.1                -1.058
CVE-2002-0002     Cluster 1      7              8.1                -1.058
CVE-2002-0005     Cluster 1      10             8.1                1.942
CVE-2002-0007     Cluster 6      10             5.6                4.387
CVE-2002-0008     Cluster 1      7              8.1                -1.058
CVE-2002-0008     Cluster 6      7              5.6                1.387
CVE-2002-0009     Cluster 2      3.3            5.1                -1.770
CVE-2002-0009     Cluster 2      3.3            5.1                -1.770
CVE-2002-0010     Cluster 3      8              2.4                5.593
CVE-2002-0010     Cluster 6      8              5.6                2.387
CVE-2002-0010     Cluster 2      8              5.1                2.930
CVE-2002-0010     Cluster 1      8              8.1                -0.058
CVE-2002-0011     Cluster 2      3.3            5.1                -1.770
CVE-2002-0038     Cluster 7      3.3            3.3                0.000
CVE-2002-0043     Cluster 6      7              5.6                1.387
CVE-2002-0044     Cluster 5      4.7            3.8                0.882
CVE-2002-0044     Cluster 5      4.7            3.8                0.882
CVE-2002-0045     Cluster 5      8              3.8                4.182
CVE-2002-0045     Cluster 2      8              5.1                2.930
CVE-2002-0045     Cluster 1      8              8.1                -0.058
CVE-2002-0045     Cluster 1      8              8.1                -0.058
CVE-2002-0046     Cluster 2      3.3            5.1                -1.770
CVE-2002-0047     Cluster 7      3.3            3.3                0.000
CVE-2002-0077     Cluster 6      7              5.6                1.387
CVE-2002-0077     Cluster 6      7              5.6                1.387
CVE-2002-1594     Cluster 1      7              8.1                -1.058
CVE-2002-1595     Cluster 2      3.3            5.1                -1.770
CVE-2002-1596     Cluster 7      3.3            3.3                0.000
CVE-2002-1597     Cluster 7      3.3            3.3                0.000
CVE-2002-1600     Cluster 2      3.3            5.1                -1.770
Appendix E-1 – Microsoft’s TAMT Listing of Vulnerabilities
This appendix shows the report generated by Microsoft's Threat Analysis and Modeling Tool.
Confidentiality Threats
Unauthorized disclosure of <creates a unique ballot for his/herself> using <speech user
interface> by <Validated voter>
Countermeasures
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Unauthorized disclosure of <creates a unique ballot for his/herself> using <Graphical user
interface> by <Validated voter>
Countermeasures
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Unauthorized disclosure of <completes ballot choices> using <speech user interface> by
<Validated voter>
Countermeasures
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Unauthorized disclosure of <completes ballot choices> using <Graphical user interface> by
<Validated voter>
Countermeasures
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Unauthorized disclosure of <prints ballot entry> using <Printer> by <Validated voter>
Countermeasures
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Unauthorized disclosure of <deletes his/her's ballot entry> using <speech user interface> by
<Validated voter>
Countermeasures
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Unauthorized disclosure of <deletes his/her's ballot entry> using <Graphical user interface> by
<Validated voter>
Countermeasures
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Unauthorized disclosure of <Create Ballot counter> using <Secure counter> by <ballot tally>
Countermeasures
Buffer Overflow : Use safe functions such as strncpy, strncat instead of strcpy,
strcat
Buffer Overflow : Validation on input should be performed on the input
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Cryptanalysis Attacks : Use well-known implementations of well-known
cryptographic algorithms
Cryptanalysis Attacks : Use cryptographically generated random keys
Cryptanalysis Attacks : Utilize platform supplied feature to store secret key (e.g.,
DPAPI)
Cryptanalysis Attacks : Utilize SSL or IPSec w/ Encryption to establish a secure
communication channel
Format String : Use a managed language
Integer Overflow/Underflow : Use Language features
Unauthorized disclosure of <reads ballot counter> using <Secure counter> by <ballot tally>
Countermeasures
Buffer Overflow : Use safe functions such as strncpy, strncat instead of strcpy,
strcat
Buffer Overflow : Validation on input should be performed on the input
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which
mapped to filenames
Cryptanalysis Attacks : Use well-known implementations of well-known
cryptographic algorithms
Cryptanalysis Attacks : Use cryptographically generated random keys
Cryptanalysis Attacks : Utilize platform supplied feature to store secret key (e.g.,
DPAPI)
Cryptanalysis Attacks : Utilize SSL or IPSec w/ Encryption to establish a secure
communication channel
Format String : Use a managed language
Integer Overflow/Underflow : Use Language features
Unauthorized disclosure of <updates ballot counter> using <Secure counter> by <ballot tally>
Countermeasures
Buffer Overflow : Use safe functions such as strncpy, strncat instead of strcpy,
strcat
Buffer Overflow : Validation on input should be performed on the input
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Cryptanalysis Attacks : Use well-known implementations of well-known
cryptographic algorithms
Cryptanalysis Attacks : Use cryptographically generated random keys
Cryptanalysis Attacks : Utilize platform supplied feature to store secret key (e.g.,
DPAPI)
Cryptanalysis Attacks : Utilize SSL or IPSec w/ Encryption to establish a secure
communication channel
Format String : Use a managed language
Integer Overflow/Underflow : Use Language features
Unauthorized disclosure of <creates imposter file> using <imposter file> by <ballot tally>
Countermeasures
Buffer Overflow : Use safe functions such as strncpy, strncat instead of strcpy,
strcat
Buffer Overflow : Validation on input should be performed on the input
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Cryptanalysis Attacks : Use well-known implementations of well-known
cryptographic algorithms
Cryptanalysis Attacks : Use cryptographically generated random keys
Cryptanalysis Attacks : Utilize platform supplied feature to store secret key (e.g.,
DPAPI)
Cryptanalysis Attacks : Utilize SSL or IPSec w/ Encryption to establish a secure
communication channel
Format String : Use a managed language
Integer Overflow/Underflow : Use Language features
Unauthorized disclosure of <reads imposter file> using <imposter file> by <ballot tally>
Countermeasures
Buffer Overflow : Use safe functions such as strncpy, strncat instead of strcpy,
strcat
Buffer Overflow : Validation on input should be performed on the input
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which
are mapped to filenames
Cryptanalysis Attacks : Use well-known implementations of well-known
cryptographic algorithms
Cryptanalysis Attacks : Use cryptographically generated random keys
Cryptanalysis Attacks : Utilize platform supplied feature to store secret key
(e.g., DPAPI)
Cryptanalysis Attacks : Utilize SSL or IPSec w/ Encryption to establish a secure
communication channel
Format String : Use a managed language
Integer Overflow/Underflow : Use Language features
Unauthorized disclosure of <updates imposter file> using <imposter file> by <ballot tally>
Countermeasures
Buffer Overflow : Use safe functions such as strncpy, strncat instead of strcpy,
strcat
Buffer Overflow : Validation on input should be performed on the input
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Cryptanalysis Attacks : Use well-known implementations of well-known
cryptographic algorithms
Cryptanalysis Attacks : Use cryptographically generated random keys
Cryptanalysis Attacks : Utilize platform supplied feature to store secret key (e.g.,
DPAPI)
Cryptanalysis Attacks : Utilize SSL or IPSec w/ Encryption to establish a secure
communication channel
Format String : Use a managed language
Integer Overflow/Underflow : Use Language features
Unauthorized disclosure of <Prints total tallies for ballots entered> using <Printer> by <Poll
Worker >
Countermeasures
Canonicalization : Only accept primitive typed identifiers (e.g., integers) which are
mapped to filenames
Appendix E-2 – Prime III Data
ID          Name                         VulnType   Loss Type   Exploit Range
Threat-1    Canonicalization             input      Avail       user_init
Threat-1    Canonicalization             input      Avail       local
Threat-1    Canonicalization             input      Int         user_init
Threat-1    Canonicalization             input      Int         local
Threat-1    Canonicalization             input      Conf        user_init
Threat-1    Canonicalization             input      Conf        local
Threat-2    Buffer Overflow              access     Int         user_init
Threat-2    Buffer Overflow              input      Int         user_init
Threat-2    Buffer Overflow              access     Int         local
Threat-2    Buffer Overflow              input      Int         local
Threat-2    Buffer Overflow              access     Avail       user_init
Threat-2    Buffer Overflow              input      Avail       user_init
Threat-2    Buffer Overflow              access     Avail       local
Threat-2    Buffer Overflow              input      Avail       local
Threat-2    Buffer Overflow              access     Conf        user_init
Threat-2    Buffer Overflow              input      Conf        user_init
Threat-2    Buffer Overflow              access     Conf        local
Threat-2    Buffer Overflow              input      Conf        local
Threat-3    Cryptanalysis                env        Avail       local
Threat-3    Cryptanalysis                access     Avail       local
Threat-3    Cryptanalysis                env        Avail       local
Threat-3    Cryptanalysis                access     Avail       local
Threat-3    Cryptanalysis                env        Int         local
Threat-3    Cryptanalysis                access     Int         local
Threat-3    Cryptanalysis                env        Int         local
Threat-3    Cryptanalysis                access     Int         local
Threat-3    Cryptanalysis                env        Conf        local
Threat-3    Cryptanalysis                access     Conf        local
Threat-3    Cryptanalysis                env        Conf        local
Threat-3    Cryptanalysis                access     Conf        local
Threat-3    Cryptanalysis                env        Sec_Prot    local
Threat-3    Cryptanalysis                access     Sec_Prot    local
Threat-3    Cryptanalysis                env        Sec_Prot    local
Threat-3    Cryptanalysis                access     Sec_Prot    local
Threat-4    Format String                design     Avail       user_init
Threat-4    Format String                design     Avail       local
Threat-4    Format String                design     Int         user_init
Threat-4    Format String                design     Int         local
Threat-4    Format String                design     Conf        user_init
Threat-4    Format String                design     Conf        local
Threat-5    Integer Overflow/Underflow   input      Avail       user_init
Threat-5    Integer Overflow/Underflow   input      Int         user_init
Threat-5    Integer Overflow/Underflow   input      Conf        user_init
Appendix E-3 – Prime III Predictions
This appendix shows the results of making predictions for our case study using the different classification schemes. We ran the predictions based on two separate historical data sources (1996 – 2001 and 2003 – 2005). We performed predictions for 2002 and 2006 based on the same vulnerability listing.
The resulting predictions are shown in this appendix, along with the confidence intervals for the two separate historical bases used for the predictions.
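The 95% confidence intervals reported in the cluster tables below follow from each cluster's average score, standard deviation, and membership count. The following C# sketch is a minimal illustration of that calculation; it assumes a normal-approximation interval with z = 1.96, which reproduces the tabulated margins of error, and it is not code taken from SSRAM.

    using System;

    class ClusterConfidenceInterval
    {
        // Margin of error and 95% confidence interval for a cluster's average score,
        // using the normal approximation z = 1.96 (reasonable for large clusters).
        static void Report(string cluster, double avgScore, double stdDev, int clusterSize)
        {
            double marginOfError = 1.96 * stdDev / Math.Sqrt(clusterSize);
            Console.WriteLine("{0}: MoE = {1:F3}, 95% CI = ({2:F3}, {3:F3})",
                              cluster, marginOfError,
                              avgScore - marginOfError, avgScore + marginOfError);
        }

        static void Main()
        {
            // First row of the 2003-2005 training-data cluster table.
            Report("Cluster 2", 7.599, 1.673, 2083);   // prints MoE = 0.072, CI = (7.527, 7.671)
        }
    }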
A. Using 2003-2005 data as training for 2006 Predictions with Neural Network
algorithm for classification
The prioritized list shows that Integer Overflow/Underflow has the highest loss expectation based on its impact score and predicted frequency. Countermeasures for dealing with integer overflow/underflow, such as using programming language features that minimize overflow and underflow, should therefore be employed. Canonicalization, Buffer Overflow, and Format String would be considered of the same priority given the confidence interval for the historical data (2003 – 2005) used for the prediction. Since these are above 5.0, the first four vulnerabilities would be considered medium- to high-risk elements and, as such, warrant countermeasures to ameliorate them.
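The prioritization above comes from ranking each vulnerability category by its loss expectancy. The following C# sketch is one minimal illustration of such a ranking; the impact and frequency values are hypothetical placeholders, and the simple product used to combine them is an assumption for illustration rather than a formula quoted from this appendix.

    using System;
    using System.Linq;

    class LossExpectancyRanking
    {
        static void Main()
        {
            // Hypothetical predicted impact and frequency values, for illustration only.
            var predictions = new[]
            {
                new { Vulnerability = "Integer Overflow/Underflow", Impact = 8.1, Frequency = 120.0 },
                new { Vulnerability = "Canonicalization",           Impact = 5.6, Frequency =  95.0 },
                new { Vulnerability = "Buffer Overflow",            Impact = 8.1, Frequency =  60.0 },
                new { Vulnerability = "Format String",              Impact = 5.1, Frequency =  40.0 },
                new { Vulnerability = "Cryptanalysis",              Impact = 3.8, Frequency =  25.0 }
            };

            // Loss expectancy = predicted impact x predicted frequency; rank highest first.
            foreach (var p in predictions.OrderByDescending(p => p.Impact * p.Frequency))
            {
                Console.WriteLine("{0,-28} LE = {1:F1}", p.Vulnerability, p.Impact * p.Frequency);
            }
        }
    }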
B. Using 2003-2005 data as training for 2006 Predictions with Decision Tree
algorithm for classification
Using the decision tree algorithm, Buffer Overflow would be considered the highest-risk element because of its high predicted impact, while Canonicalization and Integer Overflow/Underflow would be of the same level of priority. Although Format String has a higher impact factor, its predicted number of occurrences warrants placing it at a lower priority. Cryptanalysis was given the lowest priority on both measures.
Training Data by Clusters (2003 - 2005)

Cluster Node   Number in Cluster   Avg Score   Std Deviation   Margin of Error   95% CI Lower   95% CI Upper
Cluster 2      2083                7.599       1.673           0.072             7.527          7.671
Cluster 1      1430                6.967       0.831           0.043             6.924          7.010
Cluster 3      1616                5.379       2.075           0.101             5.277          5.480
Cluster 7      371                 4.159       2.088           0.212             3.947          4.372
Cluster 8      17                  3.300       0.000           0.000             3.300          3.300
Cluster 4      1031                3.124       0.776           0.047             3.077          3.172
Cluster 6      1764                3.013       0.551           0.026             2.987          3.039
Cluster 5      669                 2.631       0.932           0.071             2.561          2.702
Confidence Interval for Predictions on 2006 data based on 2003-2005 data
C. Using 1996-2001 data as training for 2001 Predictions with Neural Network
algorithm for classification
The prioritized list for 2001 shows Integer Overflow/Underflow as the highest vulnerability risk element for January 2001. Canonicalization, Cryptanalysis, and Buffer Overflow would be interpreted as having the same level of priority given the confidence interval data for the training period.
Using Decision Tree Classification Algorithm
Using the decision tree algorithm, Format String would be considered the element of highest risk because of its predicted number of occurrences, even though Canonicalization, Buffer Overflow, and Integer Overflow/Underflow all had higher impact predictions. These three would also be considered of the same priority level.
To predict for data not within our training set, we predicted for January 2002. Using the neural network classification algorithm, we found that Integer Overflow/Underflow, Canonicalization, and Buffer Overflow were predicted with high impact scores at the same level of priority, while Format String and Cryptanalysis, though of lower priority, still had impact scores in the medium to high range. In this case, we would recommend that countermeasures for all of the vulnerabilities be considered in the priority listed, while allowing the decision makers to understand why they should all be addressed.
Training Data by Clusters (1996 - 2001)

Cluster     NumInCluster   CVSS_Score   StdDev   Margin of Error   95% CI Lower   95% CI Upper
Cluster 2   1257           8.2541       1.8124   0.1002            8.1539         8.3543
Cluster 4   719            7.0872       1.7984   0.1315            6.9557         7.2186
Cluster 7   107            7            0        0                 7              7
Cluster 5   765            5.5393       2.0795   0.1474            5.3919         5.6867
Cluster 1   1298           5.3409       1.9665   0.107             5.2339         5.4478
Cluster 3   1250           4.3934       2.0171   0.1118            4.2816         4.5052
Cluster 6   735            4.3105       1.6957   0.1226            4.1879         4.4331
Appendix F – Vulnerability Classification

Kingdoms [Tsipenyuk et al. 2005]:
Input validation and representation; API abuse; Security features; Time and State; Errors; Code quality; Encapsulation; Environment

Vulnerability Class [NVD]:
Input validation error (boundary condition error, buffer overflow); Access validation error; Race condition; Exceptional condition error; Design error; Environmental error

19 Sins [Howard et al. 2005]:
Buffer overflows; command injection; cross-site scripting; format string problems; integer range errors; SQL injection; trusting network address information; failing to protect network traffic; failing to store and protect data; failing to use cryptographically strong random numbers; improper file access; improper use of SQL; use of weak password-based systems; unauthenticated key exchange; signal race conditions; use of “magic” URLs and hidden forms; failure to handle errors; poor usability

OWASP:
Buffer overflows; cross-site scripting flaws; injection flaws; unvalidated input; broken access control; insecure storage; broken authentication and session management; improper error handling; denial of service; information leakage; insecure configuration management; other error