
Review on Wrapper Feature Selection Approaches

Naoual El aboudi and Laila Benhlima
Université Mohammed V, Ecole Mohammadia d'ingénieurs, Rabat, Maroc
Email: nawal.elaboudi@gmail.com, benhlima@emi.ac.ma

Abstract—The main objective of the feature selection process is to find the feature subset that yields better classification quality at a lower computational cost than keeping the whole initial feature set. The problem has been researched extensively since the early days of machine learning. Even though many methods based on a variety of techniques have been proposed, it is difficult to single out one of them as the best fit for the feature subset selection problem. In this study, we provide an overview of existing wrapper methods, pointing out their strengths and weaknesses.
Fig. 1. Wrapper Feature Selection Method
I. INTRODUCTION

Feature selection constitutes an important preprocessing phase in machine learning. Reducing the dimensionality of the original feature space generally yields gains in both classification accuracy and computational cost. Such a step is therefore highly recommended in the presence of high dimensional data, which seems to be the expected trend in the future. To reach maximum classification accuracy with minimum computational effort, it is crucial to select the most appropriate feature subset by separating relevant features from irrelevant ones in the initial feature set, while also detecting redundant attributes [1].

In order to label a given instance correctly during classification, it is necessary to employ relevant features, which are highly informative with respect to the classification process. Irrelevant features, on the other hand, have no impact on classification results, while redundant features are equivalent to other relevant attributes. The objective of feature selection is to identify the relevant features and to eliminate irrelevant and redundant attributes, seeking the ideal situation where the selected features are the most adequate ones to achieve good classification results.

In general, a given feature selection (FS) method belongs to one of two basic families: filters and wrappers [2]. A third family, hybrid methods, derives from the two previous ones. Filter methods do not employ a learning algorithm to evaluate a feature subset; instead, they adopt statistical measures to rank the available features, and those scoring below a predetermined threshold are automatically rejected. Users often choose filters because they are simple to design and do not require significant computational resources, especially when handling large datasets. F-score [3], mutual information [4] and information gain are examples of popular filters. Wrapper methods, on the contrary, involve a particular learning algorithm that evaluates the accuracy of each candidate feature subset, which generally leads to better solutions. Nevertheless, they are difficult to deploy in the presence of high dimensional data, as a heavy computational burden is incurred even when using simple learning algorithms. Integrating the complementary aspects of filters and wrappers has led to the third family, hybrid methods, which attempt to combine the strengths of both in order to design more efficient solutions.

In this study, we present a summary of existing wrapper feature selection techniques, with a special emphasis on recent developments that achieved promising results. Figure 1 illustrates the general principle of a wrapper feature selection model.
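To make the wrapper principle of Figure 1 concrete, the following minimal sketch (prepared for this overview, not taken from the cited works) shows how a candidate feature subset can be scored by the learning algorithm itself. It assumes scikit-learn is available and uses a k-NN classifier with cross-validated accuracy as the evaluation criterion; the boolean mask is a hypothetical candidate produced by some search strategy.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate_subset(X, y, mask, estimator=None, cv=5):
    """Wrapper evaluation: score a candidate feature subset with the
    learning algorithm itself, as in Figure 1."""
    if not np.any(mask):
        return 0.0  # an empty subset carries no information
    estimator = estimator or KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(estimator, X[:, mask], y, cv=cv).mean()

# Example: score one random candidate subset on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
mask = rng.random(10) < 0.5          # boolean vector: which features are kept
print(mask, evaluate_subset(X, y, mask))

A search strategy, discussed in Section III, decides which candidate masks are generated and passed to such an evaluation routine.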
II. RELATED WORKS

Since it represents an essential component of data preprocessing, the feature selection problem has been the subject of extensive research, especially in the machine learning framework. In this context, a review of feature selection approaches based on evolutionary techniques was presented by Abd-Alsabour in [5]. Kumar proposed a comprehensive overview of existing feature selection methods in [6] in order to classify and compare them. Recently, Xue conducted in [7] an interesting survey of feature selection methods focused on evolutionary techniques, highlighting their key aspects, which include the model representation, the benefit mechanisms and the fitness function design. Furthermore, Jovic presented in [8] a review covering most of the commonly used FS techniques; in that study, the emphasis is on the application aspects of the reviewed solutions. In this paper, we propose a review of feature selection techniques focused precisely on wrapper approaches, as they allow better results to be achieved.
III. FEATURE SELECTION BASED ON SEARCH STRATEGY

Feature selection based on a search strategy is performed by selecting an algorithm that looks for the optimal feature subset according to a specific objective function. In other words, the feature selection problem is translated into an optimization problem in which the user generally attempts to maximize the classification accuracy while minimizing the size of the corresponding feature subset. Many search strategies have been proposed in the literature, each of them belonging to one of the three following families.

A. Exponential Complexity Methods

As their name suggests, these methods have an exponential complexity: the number of evaluated subsets grows dramatically even for medium sized feature spaces, a phenomenon known as the curse of dimensionality. For instance, the exhaustive search presented in [9] is an intuitive exponential complexity method. Because it considers all possible subsets, it is guaranteed to find the optimal feature subset; however, its deterrent computational cost makes it impractical to run, even on moderately sized feature sets.
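As a simple illustration of this combinatorial blow-up (a sketch written for this overview, not the branch and bound procedure of [9]), an exhaustive wrapper search must evaluate every one of the 2^n - 1 non-empty subsets of n features; the score function stands in for a wrapper criterion such as cross-validated accuracy.

from itertools import combinations

def exhaustive_search(n_features, score):
    """Evaluate every non-empty feature subset; only feasible for small n,
    since the number of candidates grows as 2**n - 1."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, n_features + 1):
        for subset in combinations(range(n_features), k):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score

# Toy score: prefer subsets containing features 0 and 3, penalise size.
toy_score = lambda subset: len({0, 3} & set(subset)) - 0.1 * len(subset)
print(exhaustive_search(6, toy_score))   # 2**6 - 1 = 63 candidates

Already for n = 30 this amounts to more than a billion wrapper evaluations, which motivates the approximate strategies discussed next.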
B. Population Based Approaches for Feature Selection

Population based methods for feature selection are very popular; their success may be explained by the good tradeoff they achieve between computational effort and the quality of the provided solution [7]. These methods adopt a population based optimization metaheuristic whose common principle is to mimic the way populations evolve toward better solutions in nature. Each population based algorithm maintains a set of candidate solutions which are updated iteratively, following a specific mechanism, so as to obtain better solutions according to a given fitness function. Once a stop criterion is reached, the algorithm stops and returns the best solution found in its population. In general, every population based metaheuristic has several parameters that strongly influence the way the search is performed; in the absence of default values, it is up to the user to tune them for the specific problem being handled. Furthermore, this kind of algorithm has to be fed with an initial population built according to a certain initialisation strategy; a random initialisation strategy, for instance, may lead to results different from a more biased initialisation policy (see Xue [16]). Genetic Algorithms (GA) and Particle Swarm Optimization (PSO) are the population based techniques most widely adopted in the context of feature selection, so these two kinds of algorithms are examined closely in the next sections.

1) Genetic Algorithm for Feature Selection: Genetic Algorithms (GAs) represent a very popular class of optimization metaheuristics that have achieved satisfactory results in many applications, including the feature selection problem, since their introduction by Holland in the 1970s. Figure 2 illustrates the principle of a GA applied to feature selection, and many authors have proposed variants of GAs to handle the feature selection issue [10], [11], [12], [13], [14].

Fig. 2. GA for Feature Selection

In a genetic algorithm, the population is composed of candidates called chromosomes, which are generally coded as a binary sequence in which each digit represents a gene. During each iteration, a series of operations is applied to the candidates in order to enhance the quality of the individuals forming the next population (generation). The following operators are applied consecutively during every iteration:
Selection: based on fitness function scores, this operator randomly selects candidates for the next step and organizes them into couples.
Crossover: this operator imitates reproduction; each couple represents parents that give birth to two children, each of which shares a part of its sequence with one of its parents. This sequence is obtained by splitting the parents' sequences at a given position named the crossover point.
Mutation: the newly formed individuals are subjected to another transformation that modifies their genes according to a predefined probability.
When addressing the feature selection problem with GAs, the model is straightforward, since a binary string naturally represents a candidate feature subset. Each feature subset may be coded as a binary chromosome whose length equals the size of the original feature set, where 0 means that a given feature is ignored and 1 indicates that the corresponding feature is selected. As the search proceeds, the quality of individuals improves over generations, since the algorithm keeps the candidates achieving high scores with respect to the fitness function. The main challenge when adopting GAs is to maintain diversity in the population, so as to avoid narrowing the search through a too strong elitist strategy, which would leave the search trapped in a local minimum.
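The binary chromosome encoding and the three operators described above can be sketched as follows. This is an illustrative implementation written for this overview, not the exact algorithm of any cited work; the fitness function is assumed to be supplied by the caller, typically a wrapper score such as cross-validated accuracy.

import random

def genetic_feature_selection(n_features, fitness, pop_size=20,
                              generations=30, crossover_rate=0.9,
                              mutation_rate=0.02, seed=0):
    """Minimal binary-chromosome GA: 1 = feature kept, 0 = feature dropped.
    `fitness` maps a chromosome (list of 0/1) to a score."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Selection: binary tournament, forming parent couples.
        def pick():
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        children = []
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            # Crossover: a single point splits both parents' sequences.
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            # Mutation: flip each gene with a small probability.
            for child in (c1, c2):
                for i in range(n_features):
                    if rng.random() < mutation_rate:
                        child[i] = 1 - child[i]
                children.append(child)
        pop = children[:pop_size]
        best = max(pop + [best], key=fitness)
    return best

# Toy fitness: reward features 1 and 4, penalise subset size.
fit = lambda chrom: chrom[1] + chrom[4] - 0.1 * sum(chrom)
print(genetic_feature_selection(8, fit))

Keeping track of the best individual across generations is a mild form of elitism; as noted above, too much elitism reduces diversity and can trap the search in a local minimum.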
Few studies have been dedicated to the topic of population diversity in the context of GA based feature selection. For instance, AlSukker proposed in [13] a solution called diverse genetic algorithm, which tackles the issue of diversity by modifying the selection and crossover operators. This particular issue deserves more research effort, given the promising potential of GAs for the feature selection problem. Moreover, one of the classical problems when dealing with GAs is their slow convergence; in this context, Jung proposed in [14] a variant of GA for feature selection that minimizes the number of fitness evaluations thanks to a dynamic cost function, achieving encouraging outcomes. Another powerful evolutionary algorithm is the particle swarm optimization metaheuristic, which is discussed in the following section.
2) Particle Swarm Optimization for Feature Selection: Particle swarm optimization is a very popular optimization technique; its simple benefit mechanism makes it easy to deploy in several fields. The feature selection problem has taken advantage of PSO, where it has been studied extensively, often achieving encouraging results. Depending on the particle representation, PSO has two versions, continuous or binary, the latter being referred to as binary PSO (BPSO).
Particles in PSO represent candidate solutions for the problem. Each of them is associated with two main characteristics: velocity and position. During each iteration, the position of a given particle is updated depending on its velocity, combined with its own best previous position and the position of the best element in the global population of particles.
The representation of each particle in PSO for feature selection is the same as the GA one: it consists of a bit-string whose dimension equals the cardinality of the feature set. The bit-string is composed of binary numbers in the BPSO variant, or of real values in the case of continuous PSO. In the BPSO studied in [15], 1 means the corresponding feature is selected and 0 means it is eliminated. When adopting continuous PSO, a threshold value must be defined in order to decide whether a feature is selected or discarded; such a parameter is difficult to set, since its value strongly influences the performance of the resulting solution.
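A compact sketch of the BPSO scheme described above is given below. It was written for this overview and uses the common sigmoid-of-velocity transfer rule, which is an assumption and may differ in detail from the variant studied in [15]; the fitness function is again a caller-supplied wrapper score.

import math
import random

def bpso_feature_selection(n_features, fitness, n_particles=15,
                           iterations=40, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO sketch: each particle is a bit-string (1 = feature kept).
    Velocities are real-valued; a sigmoid of the velocity gives the
    probability that the corresponding bit is set to 1."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # best position of each particle
    gbest = max(pbest, key=fitness)[:]               # best position of the swarm
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Sigmoid transfer: velocity -> probability of selecting the feature.
                pos[i][d] = 1 if rng.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            if fitness(pos[i]) > fitness(pbest[i]):
                pbest[i] = pos[i][:]
                if fitness(pbest[i]) > fitness(gbest):
                    gbest = pbest[i][:]
    return gbest

# Same style of toy fitness as before: reward features 1 and 4, penalise size.
fit = lambda bits: bits[1] + bits[4] - 0.1 * sum(bits)
print(bpso_feature_selection(8, fit))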

Thanks to its satisfactory results, the PSO algorithm has been the subject of steady research, which has led to variants better adapted to the feature selection problem than the classical version. For instance, Xue proposed in [16] a PSO feature selection model with new benefit mechanisms that take into consideration both the classification accuracy and the number of features. These mechanisms were used in conjunction with new initialisation strategies inspired by sequential selection algorithms, achieving some of the best results in the literature. In a recent contribution [17], Moradi obtained good results by designing a BPSO based feature selection solution in which a local search is introduced to guide the search more efficiently.

C. Sequential Selection Strategies for Feature Selection

These algorithms belong to the family of greedy algorithms; they are named sequential due to their iterative functioning.
1) The Sequential Forward Selection (SFS) algorithm: This algorithm starts its search with an empty set, then looks for the single feature that yields the best classification accuracy; once identified, this feature is added to the set, which becomes the pursued feature subset. The procedure is repeated until no further improvement of the classification accuracy is possible by adding any of the remaining features [18]. Although this algorithm returns a solution in a reasonable amount of time, the quality of the provided solution is expected to be modest, since the search is very limited: a selected feature cannot be removed in later iterations.
2) The Sequential Backward Selection (SBS) algorithm: This algorithm operates similarly to SFS, except that the search is performed in the opposite direction. The algorithm starts with the set containing all the available features, then removes the feature whose elimination increases the classification accuracy the most. This step is repeated until no improvement of the classification performance is possible by removing any of the remaining features [19]. Similarly to its SFS counterpart, SBS offers no guarantee of finding the optimal feature subset; nevertheless, its quick convergence toward a solution is assured. Table I recaps the previous section by summarizing some of the well known contributions in the field of feature selection methods based on search strategies.

TABLE I
FEATURE SELECTION METHODS BASED ON SEARCH STRATEGY

Search strategy      Algorithms
Exponential          Branch and Bound Algorithm [9]
Population based     Genetic Algorithm [12]
                     Particle Swarm Algorithm [20]
                     Bee Colony Algorithm [21]
                     Advanced Binary Ant Colony Optimisation [22]
Sequential           Sequential Forward Selection [18]
                     Sequential Backward Selection [19]
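The two greedy procedures described above can be sketched as follows. This illustrative version was written for this overview; the score function is assumed to be a wrapper criterion, such as the cross-validated accuracy obtained with the candidate subset.

def sequential_forward_selection(n_features, score):
    """SFS: start empty, greedily add the feature that improves the wrapper
    score the most; stop when no addition improves it."""
    selected, best = [], float("-inf")
    while True:
        candidates = [(score(selected + [f]), f)
                      for f in range(n_features) if f not in selected]
        if not candidates:
            break
        new_score, f = max(candidates)
        if new_score <= best:
            break                      # no remaining feature helps
        selected.append(f)
        best = new_score
    return selected, best

def sequential_backward_selection(n_features, score):
    """SBS: start with all features, greedily drop the feature whose removal
    improves the score the most; stop when no removal improves it."""
    selected = list(range(n_features))
    best = score(selected)
    while len(selected) > 1:
        new_score, f = max((score([g for g in selected if g != f]), f)
                           for f in selected)
        if new_score <= best:
            break
        selected.remove(f)
        best = new_score
    return selected, best

# Toy wrapper score: reward features 2 and 5, penalise subset size.
toy = lambda subset: len({2, 5} & set(subset)) - 0.05 * len(subset)
print(sequential_forward_selection(8, toy))
print(sequential_backward_selection(8, toy))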
IV. FEATURE SELECTION APPROACHES FOR HIGH DIMENSIONAL DATA

The amount of data available in almost all applications has become huge in recent years. Data tends to include not only a large number of instances but also a high dimensional feature space. Such a trend, known as the Big Data phenomenon, is expected to continue in the years ahead. When facing data of such volume, conventional approaches fail to deliver acceptable results within a reasonable time due to their poor scalability. In order to overcome these limitations, several solutions have been proposed by researchers. In particular, parallel computing has confirmed itself as a leading option in the domain of Big Data. It consists in splitting the procedure performed by an algorithm into multiple tasks that can be processed simultaneously on different machines; for instance, generating new candidates and evaluating the fitness function are tasks that can be parallelized [23].

A. Parallel GA Approach for Feature Selection

Due to their inherent properties, GAs are well suited to parallelization. The implementation of a parallel GA for feature selection through MapReduce can be realized according to one of the three following approaches:

• Global Parallelisation Model
• Fine-grain Parallelisation Model
• Coarse-grained Parallelisation Model

In the Global Parallelisation Model (fitness evaluation level), only the fitness evaluation of each individual is parallelized; the other operations remain centralized. In the Fine-grain Parallelisation Model (individual level, or grid model), each solution (individual) is given a position on a grid, and the GA operations take place in parallel by computing the fitness values simultaneously, while the selection operator is applied only within a small adjacent neighbourhood. Finally, in the Coarse-grained Parallelisation Model (population level, or island model), the population is split into islands and the GA runs independently on each of them.
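A minimal single-machine sketch of the global (fitness level) model is given below. It was written for this overview and uses Python's multiprocessing module rather than the Hadoop/MapReduce setting of [24]; the fitness function here is only a placeholder for a real wrapper evaluation.

from multiprocessing import Pool
import random

def fitness(chromosome):
    """Stand-in wrapper fitness; in practice this would train and
    cross-validate a classifier on the selected features."""
    return chromosome.count(1) - 0.2 * len(chromosome)  # placeholder score

def evaluate_population_in_parallel(population, workers=4):
    """Global parallelisation model: the GA loop stays centralized,
    only the (expensive) fitness evaluations run in parallel."""
    with Pool(processes=workers) as pool:
        return pool.map(fitness, population)

if __name__ == "__main__":
    rng = random.Random(0)
    population = [[rng.randint(0, 1) for _ in range(20)] for _ in range(50)]
    scores = evaluate_population_in_parallel(population)
    print(max(scores))

Only the expensive scoring step is distributed; selection, crossover and mutation stay on the coordinating process, exactly as in the global model described above.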
Seeking a suitable feature subset, Ferrucci applies in [24] an island model through the following steps:

• Initialisation phase
• Generation phase

The initialisation phase consists of generating a random population of candidate feature subsets, represented in the classical manner by a string of binary variables indicating whether each feature is selected or eliminated. During the generation phase, the fitness of each candidate solution is computed by measuring its classification accuracy; GA operators are then applied until a target accuracy is obtained or a maximum number of iterations is reached. At the end of the algorithm, the final population is evaluated on the test dataset.
existing feature selection methods face difficulties due to a
B. Feature Selection for Streamming Data lack of scalability. In fact, most of them obtain modest results
In the majority of available work on feature selection, regarding feature selection process for large scale classification
authors assume that the whole feature space is already at the when the total number of features exceeds thousands or tens of
user’s disposal, which may contrasts with practical applica- thousands. Therefore, it seems crucial that this area of research
tions where data is made available in a streamming way [25]. deserves more efforts in order to be capable of handling
In such cases, it is challenging to determine a suitable feature the constantly increasing amount of data especially in the
subset as the interaction between current and future features streamming case.
V. DISCUSSIONS AND CHALLENGES

A. Classifier Choice

When designing a wrapper feature selection model, the choice of the classifier strongly influences the quality of the obtained feature subset. A recent study by Xue [27] focused on the computational aspects of wrappers using different classifiers through extensive experimentation. This investigation draws very helpful conclusions: it recommends adopting the support vector machine algorithm [28] in the presence of a moderately sized feature space, especially when time constraints are in force, which seems to be the predominant case in real world applications. On the contrary, when dealing with a large feature space, the KNN classifier proves to perform better.

B. Scalability

In the presence of a high dimensional feature space, existing feature selection methods face difficulties due to a lack of scalability. In fact, most of them obtain modest results on large scale classification problems where the total number of features reaches thousands or tens of thousands. This area of research therefore deserves more effort in order to handle the constantly increasing amount of data, especially in the streaming case.

C. Stability

Stability is defined as the sensitivity of a method to variations in the training set. The stability issue is observed in some of the well-known feature selection algorithms: these methods lose part of their efficiency as soon as a small perturbation is introduced in the training data. Even though feature selection methods have recently become more tailored to the problems they address, these techniques still lack resilience with regard to stability. In fact, each feature selection method is based on a particular optimization metaheuristic with several parameters of its own that need to be tuned, for instance the population size. A deeper understanding of how those values interact with a particular dataset may improve the ability of the resulting model.

VI. CONCLUSION AND FUTURE WORKS

Feature selection represents a crucial phase in the field of machine learning, since it plays an important role in designing predictors capable of achieving satisfactory classification results. The objective of this overview is not to rank the available wrapper feature selection methods; the aim is rather to shed light on the main issues regarding different aspects of this category of models, while highlighting the challenges this technology will face in the near term, which will certainly come under the pressure of Big Data.

REFERENCES

[1] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," J. Mach. Learn. Res., vol. 5, pp. 1205–1224, Dec. 2004.
[2] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491–502, April 2005.
[3] S. Ding, "Feature selection based F-score and ACO algorithm in support vector machine," in Knowledge Acquisition and Modeling (KAM '09), Second International Symposium on, vol. 1, Nov 2009, pp. 19–23.
[4] S. Lee, Y.-T. Park, and B. J. d'Auriol, "A novel feature selection method based on normalized mutual information," Applied Intelligence, vol. 37, no. 1, 2012.
[5] N. Abd-Alsabour, "A review on evolutionary feature selection," in Proceedings of the 2014 European Modelling Symposium, ser. EMS '14, 2014, pp. 20–26.
[6] V. Kumar and S. Minz, "Feature selection: A literature review," Smart CR, vol. 4, pp. 211–229, 2014.
[7] B. Xue, M. Zhang, W. N. Browne, and X. Yao, "A survey on evolutionary computation approaches to feature selection," IEEE Transactions on Evolutionary Computation, vol. PP, no. 99, pp. 1–1, 2016.
[8] A. Jovic, K. Brkic, and N. Bogunovic, "A review of feature selection methods with applications," in Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2015 38th International Convention on, 2015, pp. 1200–1205.
[9] P. M. Narendra and K. Fukunaga, "A branch and bound algorithm for feature subset selection," IEEE Transactions on Computers, vol. C-26, no. 9, pp. 917–922, 1977.
[10] P. L. Lanzi, "Fast feature selection with genetic algorithms: a filter approach," in Evolutionary Computation, 1997, IEEE International Conference on, Apr 1997, pp. 537–540.
[11] R. L. Haupt and S. E. Haupt, Practical Genetic Algorithms. Wiley-Interscience, 2004.
[12] J. Lu, T. Zhao, and Y. Zhang, "Feature selection based-on genetic algorithm for image annotation," Knowledge-Based Systems, vol. 21, no. 8, pp. 887–891, 2008.
[13] A. AlSukker, R. N. Khushaba, and A. Al-Ani, "Enhancing the diversity of genetic algorithm for improved feature selection," in Systems Man and Cybernetics (SMC), 2010 IEEE International Conference on, Oct 2010, pp. 1325–1331.
[14] M. Jung and J. Zscheischler, "A guided hybrid genetic algorithm for feature selection with expensive cost functions," Procedia Computer Science, vol. 18, pp. 2337–2346, 2013.
[15] L.-Y. Chuang, H.-W. Chang, C.-J. Tu, and C.-H. Yang, "Improved binary PSO for feature selection using gene expression data," Comput. Biol. Chem., vol. 32, no. 1, pp. 29–38, Feb. 2008.
[16] B. Xue, M. Zhang, and W. N. Browne, "Particle swarm optimisation for feature selection in classification: Novel initialisation and updating mechanisms," Applied Soft Computing, vol. 18, pp. 261–276, 2014.
[17] P. Moradi and M. Gholampour, "A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy," Applied Soft Computing, vol. 43, pp. 117–130, 2016.
[18] A. W. Whitney, "A direct method of nonparametric measurement selection," IEEE Transactions on Computers, vol. C-20, no. 9, pp. 1100–1103, 1971.
[19] T. Marill and D. M. Green, "On the effectiveness of receptors in recognition systems," IEEE Transactions on Information Theory, vol. 9, no. 1, pp. 11–17, 1963.
[20] B. Tran, B. Xue, and M. Zhang, Overview of Particle Swarm Optimisation for Feature Selection in Classification. Springer International Publishing, 2014, pp. 605–617.
[21] A. K. Rana Forsati, Alireza Moayedikia, "A novel approach for feature selection based on the bee colony optimization," International Journal of Computer Applications, vol. 43, no. 8, pp. 13–16, April 2012.
[22] S. Kashef and H. Nezamabadi-pour, "A new feature selection algorithm based on binary ant colony optimization," in Information and Knowledge Technology (IKT), 2013 5th Conference on, May 2013, pp. 50–54.
[23] J. Silva, A. Aguiar, and F. Silva, "A parallel computing hybrid approach for feature selection," in Computational Science and Engineering (CSE), 2015 IEEE 18th International Conference on, Oct 2015, pp. 97–104.
[24] F. Ferrucci, M. T. Kechadi, P. Salza, and F. Sarro, "A framework for genetic algorithms based on Hadoop," CoRR, vol. abs/1312.0086, 2013.
[25] X. Wu, K. Yu, H. Wang, and W. Ding, "Online streaming feature selection," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, June 2010, pp. 1159–1166.
[26] S. Perkins and J. Theiler, "Online feature selection using grafting," in International Conference on Machine Learning, 2003, pp. 592–599.
[27] B. Xue, M. Zhang, and W. N. Browne, "A comprehensive comparison on evolutionary feature selection approaches to classification," International Journal of Computational Intelligence and Applications, vol. 14, no. 02, p. 1550008, 2015.
[28] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
