Author: Veneta Haralampieva
Supervisor: Dr. Gavin Brown
Secondly, I would like to extend my thanks to my parents and my sister for always believing in me and helping me through all difficult times.
Lastly, I would like to express my gratitude to all my colleagues and friends, in particular to my
flat-mate Gaby and my boyfriend Dan, for the useful feedback regarding the seminar and later
the presentation of results.
Chapter One
1 INTRODUCTION
The focus of this chapter is to provide a short overview of the main aspects related to this
project, such as the problem being investigated, the motivation behind it and its aims and
objectives. The final section will outline the report’s structure.
1.2 MOTIVATION
In recent years the term ‘Big Data’ has appeared prominently in both research literature and popular media. Ward and Barker provide a unified definition of it which postulates that ‘Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and Machine learning.’ (Ward & Barker, 2013). It may seem counter-intuitive that this could be problematic, since one would assume that the more information there is regarding a particular problem, the easier finding a solution will be.
The high-energy physics domain can be used to illustrate this phenomenon: the Large Hadron Collider at CERN in Geneva has been colliding particles travelling close to the speed of light, and
thus driving important scientific discoveries in the field of particle physics over the last few
decades. Interestingly, the research conducted there presents numerous unique challenges not
only to physicists but also to computer scientists. The four experiments are anticipated to generate data at a rate of 25 GB/s, with a single collision producing 1 MB of data (CERN, 2016). After a
two-stage filtering process, around 15 petabytes of data are stored annually. From this
astonishing amount scientists have estimated that only one collision in a million is of interest. It
becomes apparent that without proper data filtering and storage, its management and analysis
becomes virtually impossible. For scientists in that domain selecting only the relevant particle
collisions is of vital importance.
The Big Data phenomenon is also present in other fields, such as the biomedical sciences. Since
the completion of the Human Genome Project, the price for full genome sequencing1 has
dropped rapidly, thus generating vast amounts of genomic data. Effective analysis of this data
1 The term full genome sequencing is used to describe the identification of ‘the genetic basis of traits and disease
susceptibilities using SNP microarrays that capture most of the common genetic variation in the human population’
(Ng & Kirkness, 2010). Single nucleotide polymorphism (SNP) microarray is the technique used to accomplish this.
promises to revolutionise medicine by uncovering causes for genetic diseases, as well as
developing innovative personalised drug treatments. It should be noted that the human
genome consists of around 30,000 protein-coding genes and selecting the ones associated with a
particular genetic disease such as cancer is no trivial task.
Outside the scientific domains, commercial companies such as Google, Amazon, Facebook and Apple are storing large amounts of user data in the hope of delivering a better user experience by effectively analysing and using this information. One example area is digital advertising, where Amazon, for example, employs previously stored user activity to generate recommendations for products and services. One approach the company uses for offering personalised advertisements is finding similarity between the products purchased and other available products (Linden, Smith, & York, 2003). One should realise that a product on Amazon, such as a
book, is described by a number of characteristics such as genre, publisher, product dimensions,
author, price, average review score, etc. Evidently, publisher or product dimensions might not
be of vital importance to whether a person will purchase a book; however, genre and author
might be. Thus, selecting a good set of characteristics is important.
All of the above-mentioned examples are characterised by high dimensionality and require extraction of the most important attributes describing the target concept. In the field of Machine
Learning this is typically accomplished by means of feature selection, which focuses on identifying a subset of features which best achieves a certain goal, such as discovering the main genes in which mutations can lead to cancer.
It should be noted that obtaining a robust subset of features, also known as ‘feature preferences’
(Kalousis, Prados, & Hilario, 2007), is a necessity in the areas mentioned above. Since experts will invest a considerable amount of time, research effort and money into further exploring the selected set of ‘preferences’, it is paramount that small variations in the data do not drastically change these. This links closely with the notion of stability of a feature selection algorithm,
defined as the “robustness of the feature preferences it produces to differences in training sets
drawn from the same generating distribution” (Kalousis, Prados, & Hilario, 2007).
Surprisingly, this area of Machine Learning has received little attention from the research
community. This has motivated the investigation of various stability measures and their
applications in the empirical evaluation of feature selection criteria.
As identified above, the aim of this project is to research and evaluate the stability of different
feature selection criteria. In order to achieve this, detailed research has been conducted,
followed by experimental work and consolidation of results. The process involved critical
analysis of every criterion’s strengths and limitations, followed by careful examination of their
relationship with each other. To facilitate this a custom implementation of the criteria and
stability measures has been developed, followed by extensive testing and validation.
The project work has been supported by a set of project objectives, which are mirrored closely by the user stories presented in Chapter Three.
The remainder of the report is organised as follows: Chapter Two provides an overview of the
feature selection criteria and stability measures explored in this project. Chapter Three describes
the project structure and technologies used. Chapter Four focuses on the details of the
implementation, followed by a thorough explanation of the testing performed. Chapter Five
outlines the experimental work undertaken, the characteristics of the datasets used, followed by
analysis on the results. Chapter Six serves as a conclusion and highlights achievements and
future work.
This chapter introduced the notion of feature selection and stability, and their importance in relation to Big Data and Machine Learning applications. It then set out the aims and objectives and presented a brief overview of the report’s structure.
Chapter Two
2 BACKGROUND
This chapter will provide a review of the research already conducted with respect to feature
selection and stability, which aims to introduce the reader to the vital concepts explored in
detail in this project. An outline of the strengths and weaknesses of each of the techniques shall
be presented.
In Machine Learning, the task of identifying the category or class of an unseen instance based on past data, also known as classification, plays a prominent role. It has been established that in
high-dimensional problems feature reduction could improve classification accuracy as
discussed by Das (Das, 2001) and Xing, Jordan and Karp (Xing, Jordan, & Karp, 2001). As a
result the notion of feature selection stability has traditionally been associated with it. However,
this provides no means of distinguishing the behaviour of the feature selection algorithm with
respect to changes in the data, and that of the classifier. One can easily infer that this distinction
is quite important, since a low stability score in that case presents no information as to whether
the feature selection algorithm produces a non-optimal feature subset or the classifier’s
sensitivity to the data is too high.
Kalousis was one of the first to investigate feature selection stability separately
(Kalousis, Prados, & Hilario, 2007). He proposed several similarity measures, which can be
employed independently of any learning model and cater for the different outputs produced by
feature selection algorithms. These algorithms do one of three things: assign a weight or a score to features, rank them, or select a small subset. As a result, Kalousis proposed three separate
measures: Pearson’s correlation coefficient for weight-scoring outputs, Spearman’s rank
correlation coefficient for ranking and Tanimoto distance for measuring set similarity. Measures
operating on subsets of features are the focus of this research and will be explored in detail
shortly.
Each of the similarity measures 𝑆𝑥 presented below operates on two sets alone. They can be
used to measure the total similarity 𝑆 of 𝑚 sequences 𝐴 = {𝐴1 , 𝐴2 , 𝐴3 , … 𝐴𝑚 } by averaging the
sum of the similarity scores for every two subsets.
This leads to a generalised approach for computing the overall stability regardless of the
similarity measure employed and is defined as:
$$S(A) = \frac{\sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S_x(A_i, A_j)}{m(m-1)/2} \quad (1)$$
2.2.1 Tanimoto distance
An adaptation of Tanimoto distance has been proposed by Kalousis as a possible measure for
set similarity (Kalousis, Prados, & Hilario, 2007). It relies on computing the number of elements
in the intersection of two sets s and s’ divided by the number of elements in the union and can
be expressed as follows:
$$S_t(s, s') = \frac{|s \cap s'|}{|s \cup s'|} \quad (2)$$
The values are in the range of [0, 1], where 0 characterises two completely different subsets, and
1, two subsets which are identical. This measure has advantages, such as computational
efficiency and simplicity of implementation.
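To make the pairwise computation concrete, the following minimal R sketch (function names are illustrative and not those of the project’s codebase) implements the Tanimoto similarity from equation (2) and the averaged overall stability from equation (1):

```r
# Tanimoto similarity between two feature subsets (equation 2)
tanimoto <- function(s1, s2) {
  length(intersect(s1, s2)) / length(union(s1, s2))
}

# Overall stability of m subsets (equation 1): average over all pairs
stability <- function(subsets, similarity = tanimoto) {
  m <- length(subsets)
  scores <- combn(m, 2, function(ij) similarity(subsets[[ij[1]]], subsets[[ij[2]]]))
  sum(scores) / (m * (m - 1) / 2)
}

# Example: three runs of a feature selection algorithm over the same data
stability(list(c(1, 3, 5), c(1, 3, 7), c(1, 5, 7)))
```

In the example call every pair of subsets shares two features out of four in their union, so the overall stability is 0.5.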
2.2.2 Hamming distance
When evaluating the stability of different feature selection algorithms, Dunne proposed another
similarity measure based on the Hamming distance (Dunne, Cunningham, & Azuaje, 2002). It
employs the masks generated by the feature selection algorithm, 1 denoting a feature has been
selected and 0 that it has not. Later Kuncheva expressed the Hamming distance in terms of two
sets in the following manner (Kuncheva, 2007), where ‘\’ denotes the set difference operation
and 𝑛 is the total number of features in the dataset.
$$S_h(s, s') = 1 - \frac{|s \setminus s'| + |s' \setminus s|}{n} \quad (3)$$
Much like Tanimoto distance this similarity measure does not factor for chance.
2.2.3 Kuncheva’s consistency index
$$S_c(s, s') = \frac{rn - k^2}{k(n - k)} \quad (4)$$

where:
$r = |s \cap s'|$ - the number of features common to both subsets
$k$ - number of selected features
$n$ - total number of features
$s, s'$ - subsets of features with the same size
The values range between [-1, 1], where 1 can be obtained for two identical sets ($r = k$). The minimal value is produced when $k = n/2$ and $r = 0$. As noted by Kuncheva the value of the index
is not defined for 𝑘 = 0 and 𝑘 = 𝑛. Nevertheless, for completeness these can be implemented as
0. By taking into account the total number of features the consistency index satisfies all three
requirements introduced above.
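As an illustration (assuming the convention above of returning 0 for the undefined cases), equation (4) translates directly into R:

```r
# Kuncheva's consistency index (equation 4); both subsets have the same
# size k, drawn from n features in total
consistency_index <- function(s1, s2, n) {
  k <- length(s1)
  r <- length(intersect(s1, s2))   # number of features common to both
  if (k == 0 || k == n) return(0)  # undefined cases, implemented as 0
  (r * n - k^2) / (k * (n - k))
}
```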
2.2.4 Pearson’s correlation coefficient
Pearson’s correlation coefficient has been widely used in statistics and economics as a measure
for correlation between two variables. In the context of feature selection stability Kalousis
proposed its use for measuring similarity of two separate weighting-scoring outputs produced
by feature selection algorithms. It can be formulated as follows, where 𝜇𝑤 and 𝜇𝑤′ are the
sample means of 𝑤 and 𝑤 ′ respectively.
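$$S_p(w, w') = \frac{\sum_i (w_i - \mu_w)(w'_i - \mu_{w'})}{\sqrt{\sum_i (w_i - \mu_w)^2 \sum_i (w'_i - \mu_{w'})^2}} \quad (5)$$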
2.3 FEATURE SELECTION CRITERIA
Feature selection has been quite a prominent research area in Machine Learning due to the
increasing number of datasets, characterised by high-dimensionality, as highlighted previously.
Such datasets consist of a high number of features and a low number of samples, which greatly
reduces classification accuracy and hinders computational performance. To mitigate these
problems feature selection techniques, aimed at reducing the number of dimensions, have been
widely employed. Huang provides a detailed overview of different feature selection methods
and their distinction into filter, wrapper and embedded methods (Huang, 2015), each of which
will be explained in the following paragraphs.
Wrapper methods evaluate the feature quality by exploring the feature subset space and its
effect on a particular learning model’s accuracy. Their main drawbacks are computational
inefficiency and often being specific to the learning algorithm used.
Embedded methods incorporate feature selection whilst constructing the learning model.
Examples include classification and regression trees. A key advantage of these types of methods
is that they are more computationally efficient in comparison to wrapper approaches.
Nevertheless, they have considerable disadvantages as well; Huang discusses their difficulty in
finding the globally optimal solution, due to the attempt to estimate the model parameters
whilst selecting features (Huang, 2015).
Filters, on the other hand, rely on statistical measures, independent of a classifier, to estimate
the relevance of a feature to the target concept. As a result, they produce generalised subsets,
which might not in all cases yield the highest classification accuracy. They are characterised by
computational efficiency and less overfitting2. A majority of them are myopic3 measures, such as
Mutual information and Gini index. Additionally, there are also non-myopic approaches such as Relief, first introduced by Kira (Kira & Rendel, 1992). A comparison of the different filter-
based feature selection criteria is the focus of this research.
2.3.1 Mutual information maximisation
A number of widely used feature selection approaches rely on the notion of mutual
information, first defined in the field of Information Theory. To understand mutual
information, one should be familiar with the term entropy, which attempts to quantify the
amount of uncertainty present in a distribution of events of a random variable 𝑋. (Shannon,
1948). If one event is highly likely to occur then the entropy is low, since there is little
uncertainty regarding the outcome. If on the other hand, all events in the distribution are
2 In Machine learning, overfitting relates to a learning model being too complicated for the underlying data, thus
capturing noise. This leads to a model which is too specific to the data.
3 Myopic approaches consider each feature independently from the others and do not detect feature interactions.
equally probable, then the entropy is maximal, since no prevalence of a particular event is
present. Shannon defined entropy as:
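$$H(X) = -\sum_{x \in X} p(x) \log p(x) \quad (6)$$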
Furthermore, Shannon demonstrated the means of conditioning entropy based on other events.
Suppose we have another distribution of events, 𝑌 then the conditional entropy of 𝑋 given 𝑌 is
regarded as ‘the amount of uncertainty remaining in 𝑋 after we learn the outcome of 𝑌’ (Brown,
Pocock, Zhao, & Luján, 2012). Its mathematical formulation is defined in the following manner:
$$H(X \mid Y) = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x \mid y) \log p(x \mid y) \quad (7)$$
Hence the mutual information shared between X and Y can naturally be described as the difference of the two entropies, the amount of uncertainty present in X before the outcome of
event 𝑌 is known and after. It should be noted that mutual information is symmetric, hence the
name.
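In symbols, with the symmetry made explicit:

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) \quad (8)$$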
To aid understanding a practical example is presented. Suppose that 𝑋 denotes the question
whether a person has heart failure or not. We can assume that without any additional
information about the individual our uncertainty regarding the outcome is high. However, by
examining his or her family history, our uncertainty lowers. If the person’s relatives suffer from a similar heart condition, then it is more likely that he or she will experience heart failure as well. Similarly, if there is no evidence that any of his or her family members have had this diagnosis, that person might be less likely to experience heart problems. Therefore, the outcome of Y,
denoting whether cardiovascular diseases are present in the family, lowers the uncertainty. This
leads to a big difference between 𝐻(𝑋) and 𝐻(𝑋|𝑌), or high mutual information.
In the context of Machine Learning, one can infer that the higher the mutual information
between a feature 𝑋 and a class 𝑌 is, the more useful that feature will be in classification.
Therefore, computing the mutual information between all features with respect to the class, and
selecting the top ones, with the highest values, will result in a good feature selection criterion.
Brown refers to this as Mutual information maximisation criterion, which uses mutual
information to define the feature score (Brown, Pocock, Zhao, & Luján, 2012).
$$J_{mim}(X_k) = I(X_k; Y) \quad (9)$$
It is important to note that this criterion does not consider inter-feature relationships, the fact
that when grouped together certain features may be highly discriminatory even if individually
they are not, making it a myopic measure. Another disadvantage is its inability to detect feature
redundancy, which as discussed by Huang is the view that ‘two perfectly correlated features are
redundant to each other because adding one feature on top of the other will not provide
additional information; and hence, will not improve model accuracy’ (Huang, 2015). A possible
solution to this problem has been proposed in the form of the Mutual information feature
selection criterion (Battiti, 1994) which will be explored in detail in the following section.
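2.3.2 Mutual information feature selection
Battiti’s criterion scores a candidate feature $X_k$ by trading off its relevancy to the class $Y$ against its redundancy with the set $S$ of features already selected (Battiti, 1994):

$$J_{mifs}(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) \quad (10)$$

where $\beta$ is a configurable parameter controlling the strength of the redundancy penalty.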
It should be noted that Guyon argued that high correlation does not always equate redundancy
(Guyon & Elisseeff, 2003). He empirically demonstrated that two highly correlated features can
improve classification accuracy, and reached the conclusion that only perfectly correlated
features are redundant to each other. This implies that the criterion operates under the assumption that features are class-conditionally independent.
Moreover, Estévez argued that the redundancy term will increase in magnitude with respect to
the relevancy term as the cardinality of the subset of selected features grows (Estévez, 2009). He
inferred that this may result in the early selection of irrelevant features.
As can be observed, this criterion is more computationally expensive than the previous one
due to its need to estimate the mutual information between the feature in question and all
previously selected features.
2.3.3 Gini index
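A common formulation, consistent with the normalisation discussed in section 2.3.4, scores a feature $X$ by the expected increase in class purity over its values $X_j$ (the notation here is illustrative and may differ slightly from the project’s source):

$$Gini(X) = \sum_j P(X_j) \left( \sum_c P(Y_c \mid X_j)^2 \right) - \sum_c P(Y_c)^2$$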
Due to its myopic nature, the Gini index has low computational requirements, thus making it
attractive in high-dimensional data analysis. Nevertheless, as noted by Čehovin, when used on numerical data it results in an overestimation of the features’ quality (Čehovin & Bosnić, 2010). An
improvement can be obtained by discretising the data prior to analysis, which is the approach
undertaken in this project. Similarly to Mutual information maximisation, the Gini index is
unable to detect redundant features, as well as inter-feature relationships.
2.3.4 ReliefF
The biggest disadvantage of the above-described criteria is their inability to detect feature interactions. In domains such as protein folding, where amino acids’4 interactions determine the
folded structure of a protein and its functionality, discovering inter-feature relationships is
paramount.
To cater for this type of problem Kira proposed a different approach to feature selection called
Relief (Kira & Rendel, 1992). What Relief does differently is that it performs both a global and a
local search before updating the feature relevance score. It starts by selecting a random instance
and finding its nearest neighbour from the same class, called the nearest-hit, and the nearest
neighbour from the other class, nearest-miss. It then examines how well each feature
distinguishes between the classes. Using this information, it updates the feature weight. The
details of the original algorithms can be found below:
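A condensed R sketch of the procedure is given here (an illustration of the published algorithm (Kira & Rendel, 1992) rather than the project’s implementation; the per-feature difference function diff is supplied as a parameter):

```r
# Sketch of the original Relief algorithm: X is the data matrix, y the
# binary class labels, m the number of iterations, and diff(f, a, b) a
# per-feature difference function returning a value in [0, 1]
relief <- function(X, y, m, diff) {
  w <- rep(0, ncol(X))                      # one relevance weight per feature
  dist2 <- function(a, b) sum((a - b)^2)    # squared Euclidean distance
  for (i in seq_len(m)) {
    r <- sample(nrow(X), 1)                 # pick a random instance
    hits   <- setdiff(which(y == y[r]), r)  # same-class candidates
    misses <- which(y != y[r])              # other-class candidates
    h <- hits[which.min(sapply(hits, function(j) dist2(X[j, ], X[r, ])))]
    s <- misses[which.min(sapply(misses, function(j) dist2(X[j, ], X[r, ])))]
    for (f in seq_len(ncol(X)))             # penalise separation from the hit,
      w[f] <- w[f] - diff(f, X[r, ], X[h, ])^2 / m +  # reward it from the miss
                     diff(f, X[r, ], X[s, ])^2 / m
  }
  w
}
```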
As discussed by Liu, this particular difference function tends to underestimate numerical features (Liu & Motoda, 2007). The example he used to illustrate this phenomenon is:
Let $X_i \in \{1, 2, 3, 4, 5, 6, 7, 8\}$. If $X_i$ were nominal, the value obtained by the difference function would be $diff(2, 5) = 1$. However, if $X_i$ were numerical, then the value would be $diff(2, 5) = \frac{|2 - 5|}{8 - 1} \approx 0.43$. Since this is
used to update the feature weights, Relief underestimates numerical features. As a solution, Liu suggests the use of the contextual merit function proposed by Hong, which provides a generalisation of the diff function for numerical attributes (Hong, 1997).
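The generalised difference takes the form of a ramp between the two thresholds described below (reconstructed from that description; the exact presentation in Hong’s paper may differ):

$$diff(x, x') = \begin{cases} 0 & \text{if } |x - x'| \le t_{eq} \\ 1 & \text{if } |x - x'| \ge t_{diff} \\ \dfrac{|x - x'| - t_{eq}}{t_{diff} - t_{eq}} & \text{otherwise} \end{cases}$$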
Here $t_{eq}$ and $t_{diff}$ are user-defined parameters, where $t_{eq}$ is the threshold below which two values are considered equal and $t_{diff}$ is the threshold above which two values are regarded as
distinct. Hong also suggests default values for these parameters in his paper.
Kira used the Euclidean distance for the nearest neighbour search and the squared difference, as can be seen above. Interestingly, Kononenko reported that in his experiments using only the
difference seems to yield the same results (Kononenko, Šimec, & Robnik-Šikonja, 1997). He also
used the Manhattan distance in the nearest neighbour search as opposed to the Euclidean one,
though an explicit reason has not been provided. The author of this report has also
experimented with both approaches and found little difference in the results they produce.
An interesting observation about the relationship between the update rule in Relief and the Gini index has been made by Kononenko. He demonstrated that the update rule can be expressed
as an approximation of the following difference in probabilities (Kononenko, Šimec, & Robnik-
Šikonja, 1997):
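$$W[X] \approx P(\text{diff. value of } X \mid \text{nearest miss}) - P(\text{diff. value of } X \mid \text{nearest hit})$$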
where $Gini'$ is defined as:

$$Gini'(X) = \sum_j \frac{P(X_j)^2}{\sum_{j'} P(X_{j'})^2} \left( \sum_c P(Y_c \mid X_j)^2 \right) - \sum_c P(Y_c)^2 \quad (17)$$
As noted by Kononenko, the term $P(X_j)^2 / \sum_{j'} P(X_{j'})^2$ is what distinguishes it from the Gini index’s $P(X_j) / \sum_{j'} P(X_{j'})$,
providing Relief with a normalisation for multivalued features. This effectively solves the
problem of multivalued feature overestimation, which impurity functions such as Mutual
information and Gini index have (Kononenko, Šimec, & Robnik-Šikonja, 1997). A detailed
reproduction of the relationship demonstrated by Kononenko can be found in Appendix B.
It is worth noting that Relief was designed only for binary class problems. Moreover, as discussed by Kononenko, by using only the nearest-hit and nearest-miss of a chosen instance it is sensitive to noisy data (Kononenko I. , 1994). To overcome these two limitations he proposed an extension, which finds the k nearest neighbours of that instance from each class and averages their contributions. If this average is of nearest-misses, it is then weighted according to the prior
probability of that class (Kononenko, Šimec, & Robnik-Šikonja, 1997). This led to the ReliefF
algorithm. Due to its advantages over the classic Relief the author has decided to employ it in
the experimental work.
ReliefF’s asymptotic complexity is $O(m \cdot n \cdot t)$, with the selection of the k nearest instances being the most expensive operation. Therefore, employing techniques such as k-d trees and cover trees for the nearest neighbour search can improve efficiency drastically.
2.4 CHAPTER SUMMARY
This chapter introduced the reader to the existing research done in the areas of stability and
feature selection criteria central to this project. A review of several similarity measures was
presented, with their advantages and drawbacks explored, followed by a detailed description of
the different feature selection criteria used.
Chapter 3
3 METHODS AND DESIGN
This chapter will outline the project constraints, planning and the software engineering
practices employed. It will discuss the choice of programming language, associated libraries,
and the development environment that has been used.
One of the key characteristics of this project is that it is date-driven5, which has prompted the
author to employ agile software engineering planning techniques. Due to the short development time, delivering the most important aspects of the project in a timely manner has been crucial.
This has been achieved by effectively using short incremental iterations and user stories. ‘Short
iterations’ refers to a process which involves dividing the problem into smaller units, each of
which in itself represents a complete piece of functionality, and incrementally completing them.
This results in the delivery of working software at the end of each iteration, with a typical length of two weeks.
‘User stories’ facilitate the achievement of the timely delivery of project features. They consist of
a brief description in plain English of the desirable functionality of the system, and by
reprioritising them it can be ensured that the most important aspects of the software shall be
delivered first. Table 1 illustrates the ‘user stories’ identified, along with their respective
priorities. It can be observed that those are closely related to the project’s objectives.
5 Date-driven is an agile term, which Cohn uses to define a project ‘that must be released by a certain date but for which the feature set is negotiable.’ (Cohn, 2005)
Iteration 1
- (High) As a researcher, I want to be able to select the most relevant features, describing my data, based on their Mutual Information, so that I can have a clear view of the factors underpinning the problem. (Mutual information maximisation criterion)

Iteration 2
- (High) As a researcher, I would like to minimise the feature redundancy so that I can obtain a subset of relevant features. (Mutual information feature selection criterion)
- (Low) As a researcher, I would like to be able to upload .csv data files, since this is one of the most common ways my data is stored.

Iteration 3
- (High) As a researcher, I would like to be able to quantify the stability of different criteria using a basic similarity measure so that I can understand how robust the subset obtained from the method is to slight variations of the training data. (Tanimoto distance)
- (High) As a researcher, I would like to see the stability measure of the feature selection algorithm in a common correlation measure such as Pearson's correlation coefficient, so that I can easily interpret it.

Iteration 4
- (High) As a researcher, I would like to have a stability measure that factors randomness and chance in the feature selection, so that I can ensure that any stable subsets produced are done by design and not by chance. (Kuncheva's consistency index)
- (Medium) As a researcher, I would like to implement Hamming distance in order to compare it to the other existing similarity measures.
- (High) As a researcher, I would like the ability to visually interpret the stability results of each criterion.

Iteration 5
- (Medium) As a researcher, I would like to be able to use the feature selection algorithms on continuous data attributes to cater for different datasets.
- (Medium) As a researcher, I would like the continuous data to be discretised in the most optimal way so that I can be sure that no relevant information will be lost during the process.

Iteration 6
- (High) As a researcher, I would like to be able to select a feature subset based on their Gini index values, since it is computationally inexpensive and will present highly discriminatory features.

Iteration 7
- (Medium) As a researcher, I would like to be able to employ a feature selection technique which considers inter-feature relationships. (Relief)

Table 1: Iterations, user stories and their priorities, coinciding with the main objectives of this project
3.3 TECHNOLOGIES
The supporting technologies behind any software engineering or research project are vital to its
length and quality. Therefore, sufficient time has been dedicated to exploring several different languages, their strengths and limitations.
3.3.1 R
The focus area of this project is in the domain of Machine Learning, which relies heavily on
statistical calculations. With this in mind, key capabilities of the chosen language had to be excellent support and optimisation for matrix operations, good integration with visualisation libraries and a well-established status within the research community and industry.
Figure 1 provides statistical information regarding the most widely used languages within the
fields of data mining, data analysis and data science. From the six languages identified those
which are highly optimised for matrix operations, freely available and are well-established in
these domains are R and Python.
It should also be noted that, when compared to other languages like Python, which has been known for its simplicity and suitability for beginners, R is considered to have a steep learning curve, as highlighted by Theuwissen (Theuwissen, 2016). Furthermore, he commented that R has
been well-established in the research community and that most importantly its visualisation
capabilities surpass those of Python. The final point has been of vital importance to this project
since effective visualisation of results is critical. This prompted the choice of R over Python
despite its drawbacks. Furthermore, it presented the challenge of learning a new language.
6 This graph was obtained from: http://www.kdnuggets.com/polls/2013/languages-analytics-data-mining-data-science.html
3.3.2 RStudio
Another advantage of using R is the RStudio development environment, which provides access
to debugging tools, instant graph visualisation and a simple command line interface. It allows
good integration with external packages and facilitates both version control and code
organisation. Moreover, it has good support from the community and is freely available. All
these advantages have made it the preferred choice over the terminal-based environment.
3.3.3 Packages
One of the main initial concerns has been related to the need for effective means of visualising
the results. An objective of the project is to investigate and compare the stability of different
feature selection criteria. In order to achieve this, the results needed to be presented in a form
that is easily understandable and enhanced the analysis process, as opposed to hindering it.
This led to research in data visualisation with R and then to the ‘plotly’7 package. Plotly is an
open-source interactive graphic library for multiple programming languages, including R. It
provides a fast way for generation of histograms, charts and line plots. This package has a steep
learning curve and requires the comprehension of the documentation and examples to allow a
novice user to utilise its capabilities.
It should be noted that the speed of the experimental work has been quite important in such a
short project. Hence an application has been created to cater for the author’s needs. However,
since the focus of this project is not on application development the need for a package
supporting quick and effortless web development in R has become apparent. After careful
investigation, the ‘shiny’8 and ‘shiny dashboard’9 packages have been found to facilitate the
creation of a dashboard-style web interface with ease. This provided a way for producing a
general-purpose application, where datasets could be manipulated easily, the stability of
different feature selection criteria analysed and the results consolidated. Nevertheless, these two
packages have their limitations. For example, customising colours or buttons is quite difficult if
not impossible in certain cases. However, for the purposes of this project they have served well, providing a clean interface that facilitates the user’s needs.
Another package of vital importance to this project has been ‘RUnit’10, which supports the creation of test suites; since most computer science projects require extensive testing, this proved to be one of the most useful libraries. With the help of the test suite produced, the author
could safely introduce modifications and improvements in the code. The topic of testing and
validation will be explored in detail in the next chapter.
7 Plotly is available from https://plot.ly/r/, along with complimentary tutorials and start-up guides
8 Shiny is available from http://shiny.rstudio.com/, along with complimentary tutorials and start-up guides
9 Shiny dashboard can be obtained from the following GitHub repository: https://rstudio.github.io/shinydashboard/
10 RUnit is available from the CRAN repository https://cran.r-project.org/web/packages/RUnit/index.html
3.4 ARCHITECTURE
A principle employed when designing the structure of the project is ‘Separation of concerns’,
which, as Dijkstra put it, ‘even if not perfectly possible, is yet the only available technique for effective ordering of one's thoughts’ (Dijkstra, 1982). Therefore, as previously established, a
web-based tool, which aids the research process, has been created. It consists of three main
components, a thin-client, a server and a main computational core.
Shneiderman suggests several guidelines which need to be employed when designing a user interface, amongst which is that one should strive for consistency (Shneiderman & Plaisant, 2010).
To ensure this throughout the application all graphical representations have been visualised
using line plots along with a consistent colour scheme.
The dashboard-style of all pages has been chosen deliberately since it serves as a metaphor for
physical dashboards which people rely on in their everyday lives. This technique has been employed by many companies, such as Apple, as a tool which allows new users to grasp
concepts quickly (Apple, 2016).
The server’s responsibilities include graph generation, retrieval of results and communication
with the main computational core. The decision to split the server from the core has been
influenced by the need for abstraction to facilitate the independent use of the stability measures
and feature selection criteria. The server simply utilises the necessary components from the
core, passes the relevant information to the client and is responsible for generating the graphs.
The computational core is the most fundamental component of the project. It comprises the
different feature selection criteria, all the similarity measures discussed, probability estimation
functions which support the criteria, data pre-processing capabilities and file manipulation.
Consequently, the core has been divided into five main subcomponents, illustrated in Figure 2,
whose implementation will be explored in detail in the next chapter.
The separation of the subcomponents facilitates the need for flexibility and extensibility. This
implies that if a reimplementation of the probability estimation functions is required, it could be
done without affecting other parts of the code. Due to the generalised approach for stability
estimation, the addition of a new criterion is a trivial task and requires minimal effort and
change in the existing codebase.
3.5 EXPERIMENTAL DESIGN
To achieve the aim of this project several experiments have been conducted. However, since their output is used for formulating hypotheses about the behaviour of different feature selection criteria, a careful design process has been required. By constructing a design protocol
one ensures the repeatability of the work, along with verifying that the experiment will indeed
answer the question of interest. Therefore, the following protocol has been devised. A
bootstrapping technique, detailed in Appendix G, is used to introduce small variations in the
data, and the number of repetitions, which is thirty, has been chosen to provide a more
representative stability score for each feature selection criterion.
3.6 CHAPTER SUMMARY
This chapter presented the main technologies and software engineering practices utilised in the
development of this project. It described the design choices, provided an overview of the main
architecture and discussed the project planning process.
Chapter 4
4 IMPLEMENTATION
This chapter will provide a detailed review of the project implementation and the challenges
faced. Due to the main focus of this project being research and not web application
development this chapter will focus largely on the computational core but will feature details
regarding the user interface and the server as well. This chapter will also explore the different
testing techniques employed for validation.
4.2 THE USER INTERFACE
The user interface has been developed using the ‘shiny’ and ‘shiny dashboard’ packages, which
are web application frameworks. They provide basic building blocks such as menu bars,
sidebars, headers, plot outputs, table outputs and many more, which facilitate the creation of a
web application for data analysis in an efficient way. The figures below demonstrate the
interface’s capabilities.
As can be observed in Figure 3, the user has the ability to upload data and labels, discretise them if necessary, and preview them in a table format. Such a representation is quite consistent with tools such as Microsoft Excel, thus exploiting the user’s pre-existing knowledge to create a better user experience.
Figure 4 demonstrates the feature selection stability evaluation page where the user can select
the number of times the algorithms should be run and which criteria should be compared. The
graphs are interactive and facilitate the data analysis process. Furthermore, if the user wishes, he/she could exclude certain criteria on the graph directly and compare only two at a time.
Figure 5 illustrates the result consolidation capabilities. The user has the ability to examine the characteristics of the datasets, such as the number of features, classes or examples, and display graphical representations of previously run results. This page serves as a storage place for all experimental work and is accessible from multiple locations. This has proven to be valuable during the experimental analysis stage of the project.
Figure 6 presents the example recreation screen, which contains research paper reproductions
illustrating the stability measures’ performance. Lastly, Figure 7 provides an overview of the
References page which caters for the central storage of background reading.
Figure 3: User interface: data upload and preprocessing
Figure 4: User interface: stability evaluation
Figure 5: User interface: result consolidation
Figure 6: User interface: example reproductions
4.3 THE SERVER
Section 3.4.2 highlighted the server’s main responsibilities, whose implementation shall now be discussed in detail. For retrieval and storage of results the server utilises the file manipulation
subcomponent of the computational core. Similarly, when data discretisation is required the
server utilizes the data pre-processing subcomponent.
Graph generation is required in a couple of instances and has been implemented using the
‘plotly’ package. Firstly, upon conducting experimental work the results are presented in the form of a line graph. The server obtains the results from the computational core. Afterwards, they are
passed to the client for visualisation. Their update is achieved with the help of ‘reactive
expressions’11. Additionally, if the user wishes to preserve the results for future use, a special R
file is created containing the data structure of the output.
The saved experimental work could then be read from the server’s filesystem upon start-up.
Similar techniques have been employed for the example reproduction page, with all
experimental results being stored as files.
4.4 THE COMPUTATIONAL CORE
As outlined previously, the computational core has been divided into five subcomponents,
ensuring modularity and abstraction.
The first subcomponent is responsible for file manipulation, namely opening and saving files.
Those capabilities are utilized when the user uploads the data and if he/she wishes to save the
results obtained from the analysis. It is also used in the example sections for reading previously
stored data analysis results and dataset information.
The data pre-processing subcomponent provides data discretization capabilities, which have
become necessary to the performance improvement of several feature selection criteria as
mentioned in section 2.3.3.
Discretisation is the process of converting continuous attributes into discrete ones. As explained
by Clarke it usually involves the process of splitting the continuous space into 𝐾 partitions
corresponding to discrete values (Clarke & Barton, 2000). This process introduces what is
11 Reactive expressions control which parts of the application need to be updated and when, aiming to avoid unnecessary work.
known as discretisation error due to the inability to accurately represent all the states of a
variable.
Therefore, choosing the number of partitions K carefully is important to minimise this error. In 1926 Sturges proposed an approach for estimating the number of partitions given a sample size n, which later became known as ‘Sturges’ rule’ (Sturges, 1926):
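$$K = \lceil \log_2 n \rceil + 1$$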
An R package ‘infotheo’12 has been used to perform the actual discretization and the number of
partitions has been provided using the above formula.
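A typical call (illustrative; the dataset variable name is hypothetical, while the function and its arguments follow the package documentation) might look like:

```r
library(infotheo)
n <- nrow(dataset)                    # sample size
nbins <- ceiling(log2(n)) + 1         # number of partitions by Sturges' rule
disc <- discretize(dataset, disc = "equalwidth", nbins = nbins)
```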
Since many of the feature selection criteria rely on probability estimations, such as prior and
posterior probabilities, a separate subcomponent has been developed to facilitate reusability
and modularity. All methods in it rely on the same input parameters, namely the dataset and the column number of the data labels, and return matrices as output. The output type was chosen after investigating R’s performance with profiling tools, which demonstrated the superiority of matrices over other R data types such as ‘data frames’13.
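As an illustration of this interface (the function name is hypothetical), a prior probability estimator might look like:

```r
# Estimate class prior probabilities from the dataset; label_col is the
# column number holding the class labels, and a matrix is returned
estimate_priors <- function(dataset, label_col) {
  counts <- table(dataset[, label_col])
  as.matrix(counts / sum(counts))
}
```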
As per the definition of stability, its calculation involves introducing small perturbations in the data and applying the feature selection algorithm a number of times. In the field of Machine
learning, a technique for achieving this is bootstrapping, detailed in Appendix G. After a
dataset is generated using this approach a feature selection method is used to produce the
feature preferences. A user-defined parameter controls the number of times the process is
performed and afterwards the preferences are passed to the stability estimator, along with the
type of similarity measure to be used. A detailed pseudocode of the stability function is
supplied in Appendix C.
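A condensed sketch of that procedure is shown here, under the names used in the earlier examples (select_features stands in for any of the criteria, and stability is the pairwise-average estimator from equation (1)):

```r
# Repeatedly bootstrap the data, run the feature selection method and
# pass the resulting preferences to the stability estimator
estimate_stability <- function(dataset, labels, k, runs = 30) {
  preferences <- lapply(seq_len(runs), function(i) {
    idx <- sample(nrow(dataset), replace = TRUE)    # bootstrap sample
    select_features(dataset[idx, ], labels[idx], k) # top-k feature subset
  })
  stability(preferences)
}
```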
4.4.4 The feature selection criteria
The feature selection subcomponent is responsible for invoking the correct criteria, performing
the necessary computations for subset generation and returning a set of feature scores. The first
and last parts are straightforward and are implemented as a separate interface. Each of the
criteria is separated into an individual function and file, to ensure low coupling. The Mutual
information based approaches along with Gini index utilise the probability estimation
subcomponent, which provides the prior and posterior probabilities necessary.
4.5 TESTING
In order to verify the correctness of the implementation, along with the author’s understanding
of the subject matter, several testing techniques have been employed. Together they formed the
test suite of the project.
Embury provided several guidelines for ‘a good quality test suite’ (Embury, 2012). Firstly, test
cases need to be independent, in other words always produce the same result regardless of the
execution context. Secondly, they should be repeatable, meaning that the same output should be
produced every time the test is run (assuming no code modifications). Thirdly, there should be
no redundant test cases, i.e. a test case which verifies the same behaviour as another one. The
last guideline is of importance because running test suites is computationally expensive, consumes resources and introduces waste, since unnecessary test cases need to be maintained. The author
has taken the above considerations into account when designing the test suite.
4.5.1 Unit tests
In software development, unit tests are developed in such a manner that they verify the smallest testable piece of the software. They are generally fast and useful in the debugging
process since they can pinpoint a bug’s location very accurately. When designing test cases the
author has decided to employ black box testing techniques such as boundary value analysis14
and equivalence partitioning15. This ensured all necessary scenarios have been covered with an
optimal set of test cases. Overall, there are eighteen separate functions, all of which have unit tests.
4.5.2 Integration tests
This type of test is used to verify the correctness of multiple components working together. For
example, the feature selection component relies on the probability estimation one for obtaining
the necessary prior and posterior probabilities. To test the components’ interactions, a total of six such tests have been employed. The output of these, along with that from the unit tests, can
be found in Appendix F.
4.5.3 Validation
The above-mentioned approaches test the correctness of the implementation only against the author’s individual understanding of the problem. As a counter-measure, validation of the results against external tools and libraries has been performed.
To test Mutual information based criteria the FEAST16 toolbox for Matlab has been utilized. The
implementation of the ReliefF algorithm has been verified using the ‘dprep’17 package for R and
certain similarity measures like Pearson’s correlation coefficient have been validated using
online tools such as the ‘Correlation coefficient calculator’18.
4.6 CHAPTER SUMMARY
This chapter presented in detail the architectural components and their use. It explored various testing techniques and their practical application in the context of this project.
14 As defined in the testing standards glossary, boundary value analysis is: ‘A test case design technique for a
component in which test cases are designed which include representatives of boundary values’ (Standards, 2016).
15 As defined in the testing standards glossary, equivalence partitioning is: ‘A test case design technique for a
component in which test cases are designed to execute representatives from equivalence classes’ (Standards, 2016).
16 FEAST has been created by G. Brown, A. Pocock, M.-J. Zhao, M. Lujan and is available at
http://www.cs.man.ac.uk/~gbrown/software/
17 ‘dprep’ is an R package available from the CRAN https://cran.r-project.org/web/packages/dprep/index.html
Chapter 5
5 EXPERIMENTS AND ANALYSIS
This chapter will explore in detail the experiments conducted, the datasets used and the techniques applied. It will discuss the choice of a similarity measure and will investigate the
stability of the different criteria by providing an analysis of the experimental results.
In order to identify the best similarity measure to use, the author has conducted several
experiments which aim to explore the strengths and weaknesses of each one.
5.2.1 Experiments
In Chapter 2, four different similarity measures have been presented. In order to compare these,
two examples from Kuncheva’s paper have been reproduced. They focus on Tanimoto distance,
Hamming distance and the Consistency index (Kuncheva, 2007). Additionally, in section 2.2.4
another similarity measure based on Pearson’s correlation coefficient has been presented and
has been added to the reproduced examples for completeness.
Experiment one aimed to demonstrate the behaviour of the different similarity measures when applied on a small subset of selected features. The following hypothetical sequences of features, s and s', have been produced by two runs of a feature selection algorithm on a dataset with n = 10 features:

s = {f9, f7, f2, f1, f3, f10, f8, f4, f5, f6}
s' = {f3, f7, f9, f10, f2, f4, f8, f6, f1, f5}
These sets have been used to illustrate the behaviour of all measures as subsets of size 𝑘 = 1 to
10 have been examined.
Pearson’s correlation coefficient can also be applied to these subsets by assigning weight 1 to all features in the selected subset and 0 to all features which have not been chosen. These weights can then be used to compute its values.
Figure 8 demonstrates the results of this experiment; more details can be found in Appendix E.
Kuncheva’s second example aims to demonstrate the advantages of the consistency index over
Tanimoto distance and Hamming distance using the total stability obtained from ten random
sequences of size ten. Similarly to experiment one, Pearson’s correlation coefficient has been
added for completeness. Using the random sequences, a number of features have been selected
and the stability amongst all ten subsets computed using equation (1). Figure 9 illustrates the
output of this experiment.
5.2.2 Which stability measure should be used?
The most important observation that can be made from analysing the results of these
experiments is that Pearson’s correlation coefficient produces identical values to the
Consistency index. The results imply that under the assumptions outlined above Pearson’s
correlation coefficient is equivalent to the Consistency index; a detailed proof of which can be
found in Appendix A. It should be noted that this has simultaneously been observed by fellow researchers in the field, Nogueira and Brown (Nogueira & Brown, 2016).
Additionally, the first example clearly demonstrates that Tanimoto distance and Hamming
distance produce high values when the subset size 𝑘 approaches the total number of elements 𝑛.
In contrast, the Consistency index and Pearson’s correlation coefficient adjust the value
accordingly to factor for chance.
Example two demonstrates that the Consistency index and Pearson’s correlation coefficient
retain a value of 0 for all subset sizes, whilst the stability reported by Tanimoto distance and
Hamming distance varies considerably. This is particularly important since high stability scores
should be obtained only by design and not by chance.
Considering the above identified advantages the author has decided to employ Pearson’s
correlation coefficient in the stability evaluation of feature selection criteria.
The aim of this research is to evaluate the stability of different feature selection criteria, which has been achieved by conducting several experiments, outlined in detail below.
5.3.1 Experiments
The empirical evaluation of the criteria involved the use of datasets19 with various
characteristics. Table 2 provides detailed information about them. Some datasets like the breast
cancer and splice contain a small number of features and classes and a large number of samples,
whilst others, such as the lung cancer, penglung, pengcolon and penglymph, are described by a large
number of features and classes but contain only a few examples. These have been chosen
deliberately to facilitate the analysis of the feature selection criteria under various conditions.
19 The datasets used in this research have been obtained from https://archive.ics.uci.edu/ml/index.html
Data Features Examples Classes
breast 30 569 2
congress 16 435 2
heart 13 270 2
ionosphere 34 351 2
krvskp 36 3196 2
lungcancer 56 32 3
parkinsons 22 195 2
pengcolon 2000 62 2
penglung 325 73 7
penglymph 4026 96 9
sonar 60 208 2
splice 60 3175 3
waveform 40 5000 3
Since stability aims to evaluate the robustness of the feature selection algorithm to small
perturbations to the data, a way is needed to introduce these. In Machine learning, a technique
called ‘bootstrapping’ could be used, which is explained in detail in Appendix G.
By applying the design protocol described in section 3.5 the following results have been
obtained. The x-axis denotes the number of features selected and the y-axis the stability score
for each subset calculated using Pearson’s correlation coefficient, with the appropriate
confidence intervals. Blue illustrates the scores achieved by Mutual information maximization
criterion (MIM); orange corresponds to Mutual information feature selection (MIFS); green to
Gini index (GI) and red to ReliefF (RF).
Figure 10: Breast dataset stability
Figure 11: Krvskp dataset stability
Figure 12: Splice dataset stability
Figure 13: Waveform dataset stability
Figure 14: Congress dataset stability
Figure 15: Heart dataset stability
Figure 16: Ionosphere dataset stability
Figure 17: Parkinson’s dataset stability
Figure 18: Lung cancer dataset stability
Figure 19: Sonar dataset stability
Figure 20: Penglung dataset stability
Figure 21: Pengcolon dataset stability
5.3.2 How stable are the criteria?
As can be observed from the experimental results, there seems to be no criterion which is stable in all cases. However, a few generalised conclusions can be drawn.
Mutual information maximization and the Gini index criterion exhibit very similar behaviour in
all cases. This is likely due to their method of computing feature scores. Both are impurity
functions, which aim to approximate the difference between prior and posterior probabilities.
Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14 characterize the behaviour of the feature
selection criteria when the dataset contains a large number of examples and a small number of
features. In those instances Mutual information maximization (MIM) and Gini index (GI)
appear to be the most stable. This is likely due to the fact that introducing small variations in
datasets with a large number of examples does not result in changes to the prior and posterior probabilities, which MIM and GI rely on. ReliefF produces quite high results as well, which can
be associated with its relationship to GI identified in section 2.3.4. However, since ReliefF relies
on nearest neighbours’ estimation, the bootstrapping technique might affect the nearest-hits and
nearest-misses, thus resulting in lower stability scores than GI and MIM report. Mutual
information feature selection (MIFS) appears the least stable in these datasets. Since the
relevancy term is the same as in MIM, this would imply that variations in the data cause
changes in the redundancy term, or the overall correlation between features. Interestingly, MIFS
appears to achieve high stability scores on the Krvskp dataset possibly due to the high number
of examples with respect to the number of features and classes.
Additionally, the behaviour of this criterion when applied to the Waveform dataset is peculiar.
Its stability is poor initially and increases after the top twenty features have been selected, as
opposed to MIM, GI and ReliefF whose stability degrades at that point. This could indicate the
presence of highly correlated features with high discriminatory power. In these instances MIM,
GI and ReliefF will successfully select all of them even with small variations. MIFS, on the other hand, could be affected by which of the correlated ones are chosen first. After the twenty-feature mark most of the features could be of similar relevance, thus explaining the drop in
stability of MIM, GI and ReliefF. However, since these would have been selected by MIFS
earlier its stability is lower beforehand.
Figure 15, Figure 16 and Figure 17 report the results obtained from applying the experimental protocol on datasets with a relatively small number of features but far fewer samples.
Interestingly, Figure 15 seems to differ from the other two because all algorithms fail to identify
a consistent set of one to five most important features. This behaviour could be due to the
features describing the dataset. It is likely that there is no single feature which has significantly
higher discriminatory power in comparison to the others. This could lead to different features
receiving slightly higher scores when perturbations in the data are introduced. The poor
performance of ReliefF, which considers feature interactions, could suggest that most of the features play an important role in distinguishing between the classes.
Figure 16 and Figure 17 demonstrate the effect of the decreasing number of instances. MIM and
GI identify the top two features quite well but have problems producing a stable subset
afterwards. ReliefF on the other hand produces lower stability values possibly due to the
sampling technique which removes certain hits and misses and introduces new ones. MIFS
identifies the most important feature but its performance degrades significantly afterwards. As can be observed, it achieves higher stability results on the Parkinson’s dataset. Since the relevancy term is the same as MIM’s, this would imply that the redundancy term is able to compensate and likely introduces a penalty for the same highly correlated features.
Figure 18, Figure 19, Figure 20, Figure 21 and Figure 22 illustrate the behaviour of the criteria in extreme cases, that is, a large number of features and classes and a small sample size, with all
criteria performing poorly. ReliefF, which relies on nearest-hits and misses, suffers from the
small sample size since there are not enough instances belonging to each class, and often the
resampling removes key ones. Similarly the prior and posterior probability estimations are
highly affected when variations are present, resulting in the poor stability score of MIM and GI.
MIFS appears to achieve slightly higher average stability with the increase of the subset size,
perhaps due to early detection of certain highly correlated features. As noted by Brown, with
the increasing number of features the redundancy term in MIFS grows, whilst the relevancy one
remains the same, thus selecting features that minimize redundancy as opposed to the desired
trade-off between relevancy and redundancy (Brown, Pocock, Zhao, & Luján, 2012).
This chapter introduced the reader to the experiments conducted and the results obtained. A
thorough analysis of the different feature selection criteria was performed, exploring their
behaviour under various conditions.
Chapter 6
6 CONCLUSIONS
The final chapter will summarize the key findings and contributions of this project as well as
discuss future work on the subject.
6.2 ACHIEVEMENTS
This section will provide a summary of the key achievements of this project. To aid
understanding, they have been divided into two groups: scientific and personal. As can
be observed, they clearly demonstrate that the project's aim and objectives have been met.
6.3 EXPERIMENTAL ANALYSIS SUMMARY
The theoretical research and empirical evaluation of the various similarity measures have led to
the reintroduction of Pearson's correlation coefficient as a possible measure to be employed. A
peculiar finding has been its equivalence to Kuncheva's consistency index, which has been
proven both mathematically and experimentally. Consequently, this research relied on the
correlation coefficient in the stability calculations.
The results obtained from the experimental work on the stability of feature selection criteria
illustrate that impurity-based measures such as Gini index and Mutual information
maximization exhibit stable behaviour on datasets characterized by a small number of features
and classes and a large number of samples, whilst ReliefF achieves a slightly lower score.
ReliefF's stability is highly influenced by the nearest-hits and nearest-misses, which could explain its
behaviour. Mutual information feature selection appears to be quite unstable in most cases.
Moreover, the empirical evaluation of all four criteria demonstrated their poor stability when
applied to high-dimensional small-sample datasets, thus implying that no criterion appears to
be superior to the others. Therefore, the choice of a feature selection criterion should also be
based on the data characteristics to ensure maximal stability.
This research has focused on evaluating the stability of the feature selection criteria without the
use of a classifier, and while it gives insights into their behaviour it does not provide any
evidence for the classification capabilities of the subsets. In the future the topic of optimizing
both stability and classification accuracy could be explored.
It should also be noted that one of the limitations of this study is that the stability measure used
for evaluation does not cater for the possibility that different, highly correlated features could be
selected in each run. As discussed by Brown, the output from each run could contain features
with different array indices; however, due to their high correlation they are effectively identical
(Brown, Pocock, Zhao, & Luján, 2012). To address this problem Yu proposed a new stability
measure called ‘information consistency’ (Yu, Ding, & Loscalzo, 2008). Using this measure for
the stability estimation of feature selection criteria could be a future topic for research.
A further limitation of this study is that the stability of the criteria has not been examined on
datasets with more than four thousand features. Since feature selection is widely used on
genomic and proteomic datasets, which contain far more features, in the future it will be useful to
evaluate the stability across such datasets as well.
Furthermore, another possible topic for future investigation could be the stability evaluation of
approaches such as Markov blankets for detecting weakly relevant but non-redundant features
(Koller & Sahami, 1995), the Joint mutual information criterion (Yang & Moody, 1999) and the
Hilbert-Schmidt Independence criterion (Okun, 2011), all of which appear to be widely employed.
Exploring the stability of more exotic feature selection algorithms, such as those based on
cooperative game theory (Sun et al., 2012) and filter approaches relying on genetic algorithms
(Lanzi, 1997), could provide new insights into their behaviour and applicability to high-dimensional
problems.
Finally, this study does not address the problem of selecting a final set of features to be explored
further. Since the algorithms will produce slightly different preferences as an output, a method
for constructing a final set of size 𝑘, obtained from all sequences, is paramount both for domains
which employ feature selection alone and ones which utilize a classifier. In the future, the
development of a standardized approach to achieve this task will be of great benefit.
This chapter provided the reader with the concluding remarks and summarized achievements.
It also explored the limitations of the study and suggested topics for future research.
BIBLIOGRAPHY
Apple. (2016, Apr 21). iOS Human Interface Guidelines: Design principles. Retrieved from Apple:
https://developer.apple.com/library/ios/documentation/UserExperience/Conceptual/MobileHI
G/Principles.html#//apple_ref/doc/uid/TP40006556-CH4-SW1
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning.
IEEE Transactions on Neural Networks, 5(4), 537-550.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Belmont,
CA, USA: Wadsworth International Group.
Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional likelihood maximisation: a unifying
framework for information theoretic feature selection. Journal of Machine Learning Research, 13, 27-66.
Čehovin, L., & Bosnić, Z. (2010). Empirical evaluation of feature selection methods in classification.
Intelligent Data Analysis, 14(3), 265-281.
CERN. (2016, April 07). Processing What To Record. Retrieved from CERN:
http://home.cern/about/computing/processing-what-record
Chen, K. (2016, April 9). Naive Bayes Classifier. Retrieved from COMP24111 Machine Learning:
http://studentnet.cs.manchester.ac.uk/ugt/COMP24111/materials/slides/Naive-Bayes.pdf
Clarke, E. J., & Barton, B. A. (2000). Entropy and MDL discretization of continuous variables for Bayesian
belief networks. International Journal of Intelligent Systems, 15(1), 61-92.
Das, S. (2001). Filters, wrappers and a boosting-based hybrid for feature selection. Proceedings of the
18th International Conference on Machine Learning (ICML), 74-81.
Dijkstra, E. W. (1982). On the role of scientific thought. Selected writings on computing: a personal
perspective, 60-66.
Dunne, K., Cunningham, P., & Azuaje, F. (2002). Solutions to instability problems with sequential
wrapper-based approaches to feature selection. Journal of Machine Learning Research, 1-22.
Embury, S. M. (2012, Sept). What's in a Test? A Brief Introduction. Retrieved from COMP33711: Agile
Software Engineering:
https://moodle.cs.man.ac.uk/file.php/357/Handouts/AgileTesting/comp33711TestingBasics.pdf
Estévez, P. A. (2009). Normalized mutual information feature selection. IEEE Transactions on Neural
Networks, 20(2), 189-201.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of
Machine Learning Research, 3, 1157-1182.
Hong, S. J. (1997). Use of contextual information for feature ranking and discretization. IEEE
Transactions on Knowledge and Data Engineering, 9(5), 718-730.
Huang, S. H. (2015). Supervised feature selection: A tutorial. Artificial Intelligence Research, 22.
Kalousis, A., Prados, J., & Hilario, M. (2007). Stability of feature selection algorithms: a study on high-
dimensional spaces. Knowledge and Information Systems, 12(1), 95-116.
Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. Proceedings of the
ninth international workshop on Machine Learning, 249-256.
Koller, D., & Sahami, M. (1995). Toward optimal feature selection. Proceedings of the 13th International
Conference on Machine Learning, 284-292.
Kononenko, I. (1994). Estimating attributes: analysis and extensions of RELIEF. Machine Learning:
ECML-94, 171-182.
Kononenko, I., Šimec, E., & Robnik-Šikonja, M. (1997). Overcoming the myopia of inductive learning
algorithms with RELIEFF. Applied Intelligence, 7(1), 39-55.
Kuncheva, L. I. (2007). A stability index for feature selection. Proceedings of the IASTED International
Conference on Artificial Intelligence and Applications, 421-427.
Lanzi, P. L. (1997). Fast feature selection with genetic algorithms: a filter approach. Proceedings of the
IEEE International Conference on Evolutionary Computation, 537-540.
Linden, G., Smith, B., & York, J. (2003). Amazon.com recommendations: item-to-item collaborative
filtering. IEEE Internet Computing, 7(1), 76-80.
Liu, H., & Motoda, H. (2007). Computational Methods of Feature Selection. Boca Raton, FL: Chapman &
Hall/CRC Press.
Lustgarten, J. L., Gopalakrishnan, V., & Visweswaran, S. (2009). Measuring stability of feature
selection in biomedical datasets. AMIA Annual Symposium Proceedings, 406-410.
Ng, P. C., & Kirkness, E. F. (2010). Whole genome sequencing. In Genetic Variation (pp. 215-226).
Humana Press.
Nogueira, S., & Brown, G. (2016). Tutorial Stability Measures. Unpublished manuscript.
Okun, O. (2011). Feature Selection and Ensemble Methods for Bioinformatics. Hershey, PA: Medical
Information Science Reference.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 623-
656.
Shneiderman, B., & Plaisant, C. (2010). Designing the User Interface: Strategies for Effective Human-
Computer Interaction. Reading, MA: Addison-Wesley Publ. Co.
Sturges, H. A. (1926). The choice of a class interval. Journal of the American Statistical Association,
21(153), 65-66.
Sun, X., Liu, Y., Li, J., Zhu, J., Chen, H., & Liu, X. (2012). Feature evaluation and selection with cooperative
game theory. Pattern Recognition, 45(8), 2992-3002.
Theuwissen, M. (2016, Apr 04). R vs Python for Data Science. Retrieved from kdnuggets:
http://www.kdnuggets.com/2015/05/r-vs-python-data-science.html
Ward, J. S., & Barker, A. (2013). Undefined by data: a survey of big data definitions. arXiv preprint
arXiv:1309.5821.
Xing, E. P., Jordan, M. I., & Karp, R. M. (2001). Feature selection for high-dimensional genomic microarray
data. Proceedings of the 18th International Conference on Machine Learning (ICML), 601-608.
Yang, H. H., & Moody, J. E. (1999). Data visualization and feature selection: new algorithms for
nongaussian data. Advances in Neural Information Processing Systems, 687-693.
Yu, L., Ding, C., & Loscalzo, S. (2008). Stable feature selection via dense feature groups. Proceedings of
the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 803-811.
Appendices
APPENDIX A
This appendix details the proof of the equivalence between Pearson’s correlation coefficient and
the Consistency index.
Consistency index:

$$S_c(s, s') = \frac{rn - k^2}{k(n - k)}$$

Pearson's correlation coefficient:

$$S_p(w, w') = \frac{\sum_i (w_i - \mu_w)(w_i' - \mu_{w'})}{\sqrt{\sum_i (w_i - \mu_w)^2 \sum_i (w_i' - \mu_{w'})^2}}$$

Notations:
I. $r = |s \cap s'|$
II. $n$ – total number of features
III. $k$ – number of selected features
IV. $s, s'$ – two sets containing the features selected
V. $w, w'$ – two sets containing the weights of all features

Assumptions:
I. Each run selects exactly $k$ features, so each weighting set contains exactly $k$ ones.
II. The weights are binary: 1 denotes a selected feature and 0 a non-selected one.

Proof:

$\mu_w, \mu_{w'}$ are the sample means of the weighting sets $w, w'$ respectively. Since 1 denotes selected
features and 0 non-selected ones, and using Assumption I, we can infer that:

$$\mu_w = \mu_{w'} = \frac{\text{number of selected features}}{\text{total number of features}} = \frac{k}{n}$$

$$S_p(w, w') = \frac{\sum_i \left(w_i - \frac{k}{n}\right)\left(w_i' - \frac{k}{n}\right)}{\sqrt{\sum_i \left(w_i - \frac{k}{n}\right)^2 \sum_i \left(w_i' - \frac{k}{n}\right)^2}}$$

Let:

$$A = \sum_i \left(w_i - \frac{k}{n}\right)\left(w_i' - \frac{k}{n}\right), \qquad B = \sqrt{\sum_i \left(w_i - \frac{k}{n}\right)^2 \sum_i \left(w_i' - \frac{k}{n}\right)^2}$$

For $A$: taking into account Assumption II, there are four possible cases for the different values
of $w_i, w_i'$.

I. $w_i = 1, w_i' = 1$, the features selected in both sets; their number is $r$.
For those values each term takes the following form:
$$\left(1 - \frac{k}{n}\right)\left(1 - \frac{k}{n}\right) = 1 - \frac{2k}{n} + \frac{k^2}{n^2}$$

II. $w_i = 1, w_i' = 0$, the features selected in set $s$ but not in set $s'$; their number is $k - r$.
For those values each term takes the following form:
$$\left(1 - \frac{k}{n}\right)\left(0 - \frac{k}{n}\right) = -\frac{k}{n} + \frac{k^2}{n^2}$$

III. $w_i = 0, w_i' = 1$, the features selected in set $s'$ but not in set $s$; their number is $k - r$.
For those values each term takes the following form:
$$\left(0 - \frac{k}{n}\right)\left(1 - \frac{k}{n}\right) = -\frac{k}{n} + \frac{k^2}{n^2}$$

IV. $w_i = 0, w_i' = 0$, the features not selected in either set; their number is
$n - (r + (k - r) + (k - r)) = n - 2k + r$.
For those values each term takes the following form:
$$\left(0 - \frac{k}{n}\right)\left(0 - \frac{k}{n}\right) = \frac{k^2}{n^2}$$

Using these four cases we can rewrite the equation as follows:

$$A = r\left(1 - \frac{2k}{n} + \frac{k^2}{n^2}\right) + (k - r)\left(-\frac{k}{n} + \frac{k^2}{n^2}\right) + (k - r)\left(-\frac{k}{n} + \frac{k^2}{n^2}\right) + (n - 2k + r)\frac{k^2}{n^2}$$

$$= r - \frac{2rk}{n} + \frac{rk^2}{n^2} - \frac{k^2}{n} + \frac{k^3}{n^2} + \frac{rk}{n} - \frac{rk^2}{n^2} - \frac{k^2}{n} + \frac{k^3}{n^2} + \frac{rk}{n} - \frac{rk^2}{n^2} + \frac{k^2}{n} - \frac{2k^3}{n^2} + \frac{rk^2}{n^2} = \frac{rn - k^2}{n}$$

For $B$: taking into account Assumption II, there are two possible cases for each of the values of $w_i, w_i'$:
the $k$ selected features each contribute $\left(1 - \frac{k}{n}\right)^2$ and the remaining $n - k$ features
each contribute $\left(0 - \frac{k}{n}\right)^2 = \frac{k^2}{n^2}$. Using these cases we can rewrite the equation as follows:

$$B = \sqrt{\left(k\left(1 - \frac{2k}{n} + \frac{k^2}{n^2}\right) + (n - k)\frac{k^2}{n^2}\right)^2} = k\left(1 - \frac{2k}{n} + \frac{k^2}{n^2}\right) + (n - k)\frac{k^2}{n^2}$$

$$= k - \frac{2k^2}{n} + \frac{k^3}{n^2} + \frac{k^2}{n} - \frac{k^3}{n^2} = \frac{k(n - k)}{n}$$

Using the above equations we can rewrite Pearson's correlation coefficient in the following
manner:

$$S_p(w, w') = \frac{A}{B} = \frac{(rn - k^2)/n}{k(n - k)/n} = \frac{rn - k^2}{k(n - k)}$$

$$\therefore \quad S_p(w, w') = \frac{\sum_i (w_i - \mu_w)(w_i' - \mu_{w'})}{\sqrt{\sum_i (w_i - \mu_w)^2 \sum_i (w_i' - \mu_{w'})^2}} = \frac{rn - k^2}{k(n - k)} = S_c(s, s')$$
APPENDIX B
This appendix explores the relationship between Relief and Gini index.
Gini index:

$$Gini(X) = \sum_j P(X_j) \sum_c P(Y_c \mid X_j)^2 - \sum_c P(Y_c)^2$$

The quality estimate computed by Relief for an attribute $X$ is:

$$W[X] = P(\text{diff. value of } X \mid \text{nearest instance from a different class}) - P(\text{diff. value of } X \mid \text{nearest instance from the same class})$$

As noted by Kononenko, if the nearest instance condition is removed this can be rewritten in the
following manner (Kononenko, Šimec, & Robnik-Šikonja, 1997):

$$W[X] = P(\text{diff. value of } X \mid \text{different class}) - P(\text{diff. value of } X \mid \text{same class})$$

Notations:
$P_{sX|sc}$ – probability that two instances have the same value of $X$, given that they belong to the same class
$P_{sX|dc}$ – probability that two instances have the same value of $X$, given that they belong to different classes
$P_{sc}$ – prior probability that two instances belong to the same class
$P_{sX}$ – prior probability that two instances have the same value of $X$
$P_{sc|sX}$ – probability that two instances belong to the same class, given that they have the same value of $X$

Using the above defined notations the Relief update rule becomes:

$$W[X] = (1 - P_{sX|dc}) - (1 - P_{sX|sc}) = 1 - P_{sX|dc} - 1 + P_{sX|sc} = P_{sX|sc} - P_{sX|dc}$$

Applying Bayes' rule, $P_{sX|sc} = \frac{P_{sc|sX} P_{sX}}{P_{sc}}$ and $P_{sX|dc} = \frac{(1 - P_{sc|sX}) P_{sX}}{1 - P_{sc}}$. Since the
prior class probability does not depend on the attribute $X$, the factor $\frac{1}{P_{sc}(1 - P_{sc})}$ can be considered as a
constant and the weight update rule takes the following form:

$$W[X] = \frac{P_{sc|sX} P_{sX}}{P_{sc}} - \frac{(1 - P_{sc|sX}) P_{sX}}{1 - P_{sc}} = const \times P_{sX} \left(P_{sc|sX} - P_{sc}\right)$$

The probability of two instances being in the same class, $P_{sc}$, is simply, for each class, the
probability that one instance belongs to that class times the probability that the second one
belongs to the same one. As a result:

$$P_{sc} = \sum_c P(Y_c)^2$$

Similarly the prior probability that two instances have the same value can be expressed as:

$$P_{sX} = \sum_j P(X_j)^2$$

The probability that two instances have the same value of $X$ and at the same time belong to the
same class is $P_{sc \cap sX} = \sum_j P(X_j)^2 \sum_c P(Y_c \mid X_j)^2$, hence:

$$P_{sc|sX} = \frac{P_{sc \cap sX}}{P_{sX}} = \sum_j \left( \frac{P(X_j)^2}{\sum_j P(X_j)^2} \sum_c P(Y_c \mid X_j)^2 \right)$$

Using this and the above two equations the update rule can be rewritten as follows. Let:

$$Gini'(X) = \sum_j \left( \frac{P(X_j)^2}{\sum_j P(X_j)^2} \sum_c P(Y_c \mid X_j)^2 \right) - \sum_c P(Y_c)^2$$

$$\therefore \quad W[X] = const \sum_j P(X_j)^2 \, Gini'(X)$$

The similarity between $Gini'(X)$ and $Gini(X)$ is evident, with the only difference being the
$\frac{P(X_j)^2}{\sum_j P(X_j)^2}$ term in $Gini'(X)$; the original $Gini(X)$ instead uses $\frac{P(X_j)}{\sum_j P(X_j)}$. Since:

$$\sum_j P(X_j) = 1 \qquad \therefore \quad P(X_j) = \frac{P(X_j)}{\sum_j P(X_j)}$$

As noted by Kononenko, the term used for the weight update in the Relief algorithm serves as a
normalisation for multivalued features. This effectively solves the problem of multivalued
feature overestimation, which impurity functions such as Mutual information and Gini index
have (Kononenko, Šimec, & Robnik-Šikonja, 1997).
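The multivalued overestimation can also be observed empirically. The following R sketch (illustrative only, not part of the project code) compares the plain Gini index of two equally irrelevant features, one binary and one with fifty distinct values; the many-valued feature receives a noticeably higher score purely by chance:

set.seed(1)
y  <- sample(1:2, 1000, replace = TRUE)   # two random classes
x2 <- sample(1:2, 1000, replace = TRUE)   # irrelevant binary feature
xk <- sample(1:50, 1000, replace = TRUE)  # irrelevant 50-valued feature

# Gini(X) = sum_j P(Xj) * sum_c P(Yc|Xj)^2 - sum_c P(Yc)^2
giniFeature <- function(x, y) {
  pX <- prop.table(table(x))
  gi <- 0
  for (j in names(pX)) {
    pYgivenX <- prop.table(table(y[x == j]))
    gi <- gi + pX[[j]] * sum(pYgivenX^2)
  }
  gi - sum(prop.table(table(y))^2)
}

giniFeature(x2, y)  # close to zero
giniFeature(xk, y)  # noticeably larger, despite xk being equally irrelevant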
APPENDIX C
This appendix presents the pseudocode used for implementing stability estimation and the
feature selection criteria, crucial to this project.
Algorithm 2: Mutual information
Input: condProb – a 2d array containing the conditional probability between the variables
       priorProbX – a 1d array containing all the prior probabilities for X
       priorProbY – a 1d array containing all the prior probabilities for Y
Output: mi – the mutual information between X and Y
1 mi = 0
2 for valueX = 1 to length(priorProbX)
3     for valueY = 1 to length(priorProbY)
4         mi += condProb[valueX, valueY] * priorProbY[valueY] * log2(condProb[valueX, valueY] / priorProbX[valueX])
5     end for
6 end for
7 return mi
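A direct R translation of Algorithm 2 might look as follows. This is a minimal sketch with illustrative names rather than the project's actual implementation, and it skips empty cells so that log2 is never applied to zero:

# Mutual information between X and Y.
# condProb[x, y] holds P(X = x | Y = y); priorProbX, priorProbY hold the marginals.
mutualInformation <- function(condProb, priorProbX, priorProbY) {
  mi <- 0
  for (x in seq_along(priorProbX)) {
    for (y in seq_along(priorProbY)) {
      joint <- condProb[x, y] * priorProbY[y]  # P(X = x, Y = y)
      if (joint > 0) {                         # by convention 0 * log(0) = 0
        mi <- mi + joint * log2(condProb[x, y] / priorProbX[x])
      }
    }
  }
  mi
}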
Algorithm 4: Mutual information feature selection criterion
Input: data – a 2d array containing all m samples and n features
       dataClass – the column number of the data class
Output: preferences – the features, ordered by the algorithm's preference
1  miClass = [0 … 0] (a 1d array of n zeros)
2  miFeatures = an n × n array of zeros
3  beta = 1
4  priorClass = priorProbVariable(data, dataClass)
5  priorFeatures = priorProbAll(data[, -dataClass])
6  for f = 1 to n
7      condClass = condProbBetweenVariables(data, f, dataClass)
8      miClass[f] = mutualInformation(condClass, priorClass, priorFeatures[f])
9  end for
10 for i = 1 to n
11     for j = 1 to n
12         condProb = condProbBetweenVariables(data, i, j)
13         miFeatures[i, j] = mutualInformation(condProb, priorFeatures[i, ], priorFeatures[j, ])
14     end for
15 end for
16 selectionCriteria = miClass
17 for featureToSelect = 1 to n
18     preferences[featureToSelect] = selectNumberOfFeatures(selectionCriteria, 1)
19     miSelected = miFeatures[, preferences]
20     rowSum = rowSums(miSelected)
21     selectionCriteria = [0 … 0]
22     for feature = 1 to n
23         selectionCriteria[feature] = miClass[feature] − beta * rowSum[feature]
24     end for
25 end for
26 return preferences
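The greedy selection loop of Algorithm 4 can be sketched in R as below, assuming the relevancy scores I(Xf; Y) and the pairwise matrix I(Xi; Xj) have already been computed (for instance with the mutualInformation sketch after Algorithm 2). Unlike the compact pseudocode, this sketch explicitly removes already selected features from the candidate pool; all names are illustrative:

# Greedy MIFS ranking: relevancy is a length-n vector of I(Xf; Y),
# redundancy is an n x n matrix of pairwise I(Xi; Xj).
mifsRanking <- function(relevancy, redundancy, beta = 1) {
  n <- length(relevancy)
  preferences <- integer(0)
  remaining <- seq_len(n)
  for (step in seq_len(n)) {
    score <- relevancy[remaining]
    if (length(preferences) > 0) {
      # subtract the redundancy penalty towards the features selected so far
      score <- score - beta * colSums(redundancy[preferences, remaining, drop = FALSE])
    }
    best <- remaining[which.max(score)]
    preferences <- c(preferences, best)
    remaining <- setdiff(remaining, best)
  }
  preferences
}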
Algorithm 5: Gini index
Input: priorProbX – a 1d array containing all the prior probabilities for X
       priorProbY – a 1d array containing all the prior probabilities for Y
       condProb – a 2d array containing the conditional probability between the variables
Output: gi – the Gini index between the variables
1  gi = 0
2  for y = 1 to length(priorProbY)
3      xCondSquared = 0
4      for x = 1 to length(priorProbX)
5          xCondSquared += condProb[x, y]^2
6      end for
7      gi += priorProbY[y] * xCondSquared
8  end for
9  xPriorSquared = 0
10 for x = 1 to length(priorProbX)
11     xPriorSquared += priorProbX[x]^2
12 end for
13 gi -= xPriorSquared
14 return gi
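Algorithm 5 reduces to a few vectorised operations in R; a minimal sketch with illustrative names:

# Gini index between X and Y; condProb[x, y] holds P(X = x | Y = y).
giniIndex <- function(condProb, priorProbX, priorProbY) {
  gi <- sum(priorProbY * colSums(condProb^2))  # sum over y of P(y) * sum over x of P(x|y)^2
  gi - sum(priorProbX^2)                       # subtract sum over x of P(x)^2
}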
APPENDIX D
This appendix explores the ‘zero conditional probability problem’, which is often encountered
in probability estimation and refers to a dataset not containing an example of class 𝑌𝑐 with
value $X_i$ for an attribute $X$. This leads to $P(X_i \mid Y_c) = 0$, which yields incorrect results. This is
particularly common for datasets consisting of a small number of samples but characterised by
a high number of classes and attribute values. To mitigate this problem an approach discussed by
Chen has been implemented. It relies on virtual examples in the conditional probability
estimation (Chen, 2016).
$$P(X_i \mid Y_c) = \frac{n_c + mp}{n + m} \qquad (19)$$

where:
$n_c$ – number of training examples for which $X = X_i$ and $Y = Y_c$
$n$ – number of training examples for which $Y = Y_c$
$m$ – number of virtual examples, usually $m \geq 1$
$p$ – prior estimate, usually $p = \frac{1}{t}$, where $t$ is the number of possible values for $X$
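A minimal R sketch of equation (19) follows; the project's own version is exercised by test.calculateCondProbWithM in Appendix F, but the signature below is assumed for illustration:

# m-estimate of P(Xi | Yc); nc and n are counts from the training data,
# t is the number of possible values of X and m the number of virtual examples.
condProbWithM <- function(nc, n, t, m = 1) {
  p <- 1 / t                 # uniform prior estimate over the t values of X
  (nc + m * p) / (n + m)
}

condProbWithM(nc = 0, n = 5, t = 4)  # ~0.042 instead of an exact zero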
APPENDIX E
This appendix provides calculations on example one reproduced from Kuncheva’s paper.
Let 𝑠 and 𝑠 ′ be two hypothetical sequences of features produced by two runs of a feature
selection algorithm on a dataset with 𝑛 = 10 features.
𝑠 = {𝑓9 , 𝑓7 , 𝑓2 , 𝑓1 , 𝑓3 , 𝑓10 , 𝑓8 , 𝑓4 , 𝑓5 , 𝑓6 }
𝑠 ′ = {𝑓3 , 𝑓7 , 𝑓9 , 𝑓10 , 𝑓2 , 𝑓4 , 𝑓8 , 𝑓6 , 𝑓1 , 𝑓5 }
𝑠(3) = {𝑓9 , 𝑓7 , 𝑓2 }
𝑠 ′ (3) = {𝑓3 , 𝑓7 , 𝑓9 }
Using this information we can compute the similarities according to the different measures.
Tanimoto distance:

$$S_t(s, s') = \frac{|s \cap s'|}{|s \cup s'|} = \frac{2}{4} = 0.5$$
Hamming distance:
Consistency index:

$$S_c(s, s') = \frac{rn - k^2}{k(n - k)} = \frac{2 \cdot 10 - 3^2}{3(10 - 3)} = \frac{11}{21} \approx 0.5238$$
Considering this example, by assigning weight 1 to all features in the selected subset and 0 to all
features which have not been chosen, we obtain the following weights for subsets 𝑠 and 𝑠′:
𝑤 = {0,1,0,0,0,0,1,0,1,0}
𝑤 ′ = {0,0,1,0,0,0,1,0,1,0}
$$\mu_w = \mu_{w'} = \frac{3}{10}$$

$$S_p(w, w') = \frac{\sum_i (w_i - \mu_w)(w_i' - \mu_{w'})}{\sqrt{\sum_i (w_i - \mu_w)^2 \sum_i (w_i' - \mu_{w'})^2}} = \frac{11}{21} \approx 0.5238$$
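The same numbers can be verified in R with the built-in cor function; a quick sketch using the weights above:

w1 <- c(0, 1, 0, 0, 0, 0, 1, 0, 1, 0)  # weights for s(3)  = {f9, f7, f2}
w2 <- c(0, 0, 1, 0, 0, 0, 1, 0, 1, 0)  # weights for s'(3) = {f3, f7, f9}
n <- 10; k <- 3; r <- 2

cor(w1, w2)                            # Pearson's correlation: 0.5238095
(r * n - k^2) / (k * (n - k))          # consistency index:     0.5238095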
APPENDIX F
The following appendix contains the output of the test suite for this project.
RUNIT TEST PROTOCOL -- Sun May 01 14:19:12 2016
***********************************************
Number of test functions: 24
Number of errors: 0
Number of failures: 0
1 Test Suite :
final-year-project - 24 test functions, 0 errors, 0 failures
Details
***************************
Test Suite: final-year-project
Test function regexp: ^test.+
Test file regexp: ^\d+\.R
Involved directory:
tests
---------------------------
Test file: tests/1.R
test.calculatePriorProbability: (2 checks) ... OK (0.03 seconds)
---------------------------
Test file: tests/10.R
test.estimateHammingDistance: (7 checks) ... OK (0 seconds)
---------------------------
Test file: tests/11.R
test.estimateNumBinSturgesRule: (5 checks) ... OK (0 seconds)
---------------------------
Test file: tests/12.R
test.estimateNumBinScottsRule: (6 checks) ... OK (0 seconds)
---------------------------
Test file: tests/13.R
test.calculateGiniIndexFeature: (4 checks) ... OK (0.01 seconds)
---------------------------
Test file: tests/14.R
test.differenceFuntion: (3 checks) ... OK (0 seconds)
---------------------------
Test file: tests/15.R
test.calculateMinMaxDiffValues: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/16.R
test.mutualInformationFeatureSelection: (4 checks) ... OK (0.02 seconds)
---------------------------
Test file: tests/17.R
test.mutualInformationMaximisation: (4 checks) ... OK (0.01 seconds)
---------------------------
Test file: tests/18.R
test.calculateMutualInformationSingleFeature: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/19.R
test.mutualInfoCriterionBetweenSelectedFeatures: (8 checks) ... OK (0 seconds)
---------------------------
Test file: tests/2.R
test.calculateNumFeatureValues: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/20.R
test.calculateGiniIndexSingleFeature: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/21.R
test.featureSelectionBasedOnGiniIndex: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/22.R
test.calculatePriorProbabilityFeatures: (16 checks) ... OK (0 seconds)
---------------------------
Test file: tests/23.R
test.calculateCondProbBetweenVariables: (30 checks) ... OK (0 seconds)
---------------------------
Test file: tests/24.R
test.findNearestImproved: (8 checks) ... OK (0.01 seconds)
---------------------------
Test file: tests/3.R
test.calculateCondProbWithM: (20 checks) ... OK (0 seconds)
---------------------------
Test file: tests/4.R
test.calculateMutualInformation: (4 checks) ... OK (0 seconds)
---------------------------
Test file: tests/5.R
test.selectNumberOfFeatures: (3 checks) ... OK (0 seconds)
---------------------------
Test file: tests/6.R
test.estimateTanimotoDistance: (7 checks) ... OK (0 seconds)
---------------------------
Test file: tests/7.R
test.estimateConsistencyIndex: (10 checks) ... OK (0.03 seconds)
---------------------------
Test file: tests/8.R
test.estimatePearsonsCorrelationCoefficient: (8 checks) ... OK (0 seconds)
---------------------------
Test file: tests/9.R
test.weightsFromSet: (22 checks) ... OK (0 seconds)
APPENDIX G
Suppose there is a dataset containing 𝑁 items. ‘Bootstrapping’ performs sampling of 𝑁 items
with replacement from the original dataset. This method can be used to generate a new dataset
with similar characteristics to the original one. Therefore, it can be successfully used in the
evaluation of the stability of different feature selection criteria.
Figure 23 illustrates this approach. As can be observed, the first bootstrap set contains samples
4 and 7 more than once, whilst samples 5 and 8 are missing. The second one contains a
duplicate of sample 8, with sample 9 missing.
[Figure 23: A dataset and two bootstrap sets drawn from it by sampling with replacement.]
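Generating such a bootstrap set takes a single call in R; the toy data frame below is purely illustrative:

set.seed(42)                                   # reproducible resampling
dataset <- data.frame(id = 1:10, x = rnorm(10))
idx <- sample(nrow(dataset), replace = TRUE)   # N draws with replacement from 1..N
bootstrap <- dataset[idx, ]
table(idx)                                     # duplicated samples have count > 1
setdiff(dataset$id, idx)                       # samples missing from this bootstrap set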