Construct Offline and Online Membership Functions Based On SVM

Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

Construct Offline and Online Membership Functions

Based on SVM
Xiangsheng Rong1 Ping Ling2 Ming Xu1
1
Xuzhou Air Force College of P. L. A, Xuzhou 221000, P. R. China
2
School of Computer Science, Xuzhou Normal University, Xuzhou 221116, P. R. China

Abstract along which data are well separated. For a query, it is


tested by offline MFs firstly. If the decision is not
The classification algorithm presented in this paper confident sufficiently, the query will be addressed by
consists of Offline and Online Membership Functions, online MFs. Online MFs are developed by integrating
named as OOMF. They cooperated with each other to neighborhood size and distance from query and class.
provide qualified class label of confidence. The offline OOMF uses some strategies to save computation.
membership function is derived from decision Firstly, dataset is reduced by a Support Vector
functions yielded by a weighted SVMs approach Clustering (SVC) [5]. But in this paper Kernel scale of
(WSVM). The online membership function works in SVC is tuned adaptively, so named as TSVC. Secondly,
the scenario where offline membership function is of hyper parameters concerned with support-vector
low discrimination. And it is designed by a new kNN procedure are learned from data context. Thirdly, a
(NkNN) that is encoded with a class-wise metric. heuristic is presented to specify the size of the
Some strategies bring computational ease: hyper neighborhood where NkNN works. Experiments on
parameters concerned are tuned context-dependently; real datasets demonstrate the better performance of
training dataset is reduced by a tuning support vector OOMF over the state-of-the-art fuzzy classification
clustering (TSVC); and working set of NkNN is methods and other popular classification approaches.
pre-specified. We describe experimental evidence of
classification performance improved by our schema
over state of the arts on real datasets. 2. Related Knowledge
Firstly, we review SVM. For l samples: (x1, y1), (x2,
Keywords: Offline and online membership function, y2)…(xl, yl) sampling from X×Y, where X = Rn, Y =
SVM, Weighted schema, Parameter tuning {1, -1}. The optimal classification interface is
determined by:
1. Introduction g ( x ) = Σ α i yi K ( xi , x ) − b (1)
i

Common fuzzy classifiers predict data label according The orientation vector α and offset vector b are
to membership functions [1]. Its flexibility to assign obtained by optimizing: t
data belonging to multi classes with different degrees 1
max Σ α i − Σ Σ α iα j yi y j K ( xi , x j ) (2)
makes this kind of methods popular in many α i 2 i j
applications. But much priori knowledge is required to
s.t. 0 ≤ α i ≤ Csvm , Σ α i yi = 0 .
define traditional membership functions (MFs). This i
paper proposes a framework by combining offline Points with 0 < α i < Csvm are nbSV. bSV is
MFs and online MFs (OOMF). These two types of MF
are constructed by two hard classifiers, with aim to points with α i = C svm .
take advantage of their well-founded analysis k nearest neighbor (kNN) finds query’s k nearest
procedure to improve classification accuracy. The two neighbors and predicts it as the most frequent one
hard models are WSVM and NkNN, stemming from occurring of neighbors. This paper uses it to deal with
1-vs-r SVMs [2]-[3] and kNN [4]. WSVM modifies the case that memberships of all classes are below some
SVMs into a weighted version, and weights of basic threshold.
classifiers are integrated with decision function values Then it proceeds to SVC. It finds the smallest
to define offline MFs. NkNN explores query’s hyper sphere containing all data. The produced nbSVs
neighborhood under the guidance of a class-wise form cluster contours. It corresponds to below
metric. This metric is derived from SVM decision optimization:
interfaces; for they hold most dicriminant direction
max γ Σγ i K ( xi , xi ) − ΣΣγ i γ j K ( xi , x j ) (3) 12
inner-cluster points
nbSVs
10

s.t. ∑ γ i = 1 , 0 ≤ γ i ≤ C svc 8

i
6

Y
2

3. OOMF Schema -2

-4

For the M-classification problem, OOMF does: -6


-6 -4 -2 0 2 4
X
6 8 10 12 14

1) Reduce training dataset using TSVC; Fig.2: SVC with 1/σ2 = 0.537.
2) Create SVMs and confidence weights {βIA};
3) Define offline MFs: hj (j = 1…M) based on
SVMs decision functions and {βIA}; 3.2. Offline MFs and WSVM
4) hmax = max { hj };
Offline MFs are defined during WSVM training
5) hsec = second-max { hj };
process. Different from traditional SVMs, where basic
6) If (hmax – hsec) < ε
classifiers are weighted equally, WSVM seeks
7) Formulate class-wise metrics;
coefficients βIA for basic SVM (class I-vs-r) (I = 1…M)
8) Create online MFs hj with NkNN;
to show its decision capacity on class A. Weights of all
9) Else label(q) = hmax (q);
SVMs form a matrix β = (βIA)M×M. Set βII = 1, which is
The decision whether the result of offline MFs is
natural that SVM (I-vs-r) is absolutely confident to
confident or not, is controlled by threshold ε, which is
declare query’s membership to class I. For point x, it
he difference between its top value of and the second
corresponds to a series of function values with respect
value. In this paper, ε is set as 0.3·hmax.
to basic SVM (I-vs-r), and these values from a row
vector: Fx = (f 1(x), f 2 (x)…f M (x)), where f I is the
3.1. TSVC decision function of basic SVM (I-vs-r). The offline
MF with respect to class A, hA (x), is developed as:
TSVC is conducted on each class respectively to
extract data representatives. Csvc is set as 1. For point x h A ( x ) = Fx ⋅ β • A (5)
TSVC sets its scale factor σx = ||x-xr||. Affinity between With β
M
= Σ I =1β IA β IA ≥ 0 (6)
•A
x and y is scaled by their scale factors’ product, that is:
Weight βIA is designed based on the distance from
|| x − y||2 || x − y||2
k ( x, y ) = exp( − σ ⋅σ ) = exp( − || x − x ||⋅|| y − y || ) (4) f1 to class A:
x y r r
xr is the rth nearest neighbor of x. For the given r, if
⎧ 1 I =A
||x- xr||<|| y-yr||, it means the density of x’s neighborhood ⎪
is denser than that of y. Here, r is set as the max gap in β IA = ⎨− exp[ dis ( Center _ A, f I )]
I≠A
(7)
⎪ ΣM
the list of distances from x to other points: r = max j ⎩ J =1, J ≠ I exp[ dis ( Center _ J , f I )]
{d(x, xj) - d(x, xj-1)}. Rows of Euclidean distance matrix Exponential mechanism is used to keep βIA stable.
d(x, xj) are sorted in an ascending order. Minus in I≠A case is introduced to match the negative
This tuning produces more nbSVs than traditional value of f I (x) when x belongs to rest classes except I.
SVC. These nbSVs are located on both boundaries and dis(Center_A, fI ) computes the distance between the
important positions where sharp variance of density center of class A and fI, as shown in (8), where
happens. So an informative sketch of dataset is Center_A is the average of A’s data representatives.
described by nbSVs, which act as data representatives. The first term is the margin of fI, where ||w||I is the
Fig. 1 and 2 show results of the tuning SVC and weight vector of fI. The second term is the distance
fixed-scaled SVC. from class center to the nearest nbSV of fI.
1 + min
12

dis (Center _ A, f I ) ≈ s∈{nbSVs } || Center _ A − s ||


inner-cluster points

||wI ||2
10 nbSVs

6
(8)
4
Y

0
3.3. Basic SVM Construction
-2

-4

-6
Tune SVM Kernel scale. Firstly for class I, its Kernel
-6 -4 -2 0 2 4
X
6 8 10 12 14
scale factor is designed as:
Fig. 1: Tuning SVC. τ I = ave{|| x − xr ||} with x ∈ I (9)
Gaussian Kernel of SVM I-vs-r sets its scale as:
2
σ I = τ I ⋅ τ rest ( I ) with τ rest ( I ) = ave{τ J | J ≠ I } (10)
Tune penalty parameter individually. This risk minimization. Viewed from geometry light, to
paper equips the individual Csvm for each point to point x on level curve fA (x) = 0, the gradient vector fA’
express its individual demand for slack variable. To (x) indicates the perpendicular orientation along which
those outliers or bSVs, they hope a big Csvm to emphasis data can be well separated over x’s neighborhood.
the slack, but to inner-class-points, they need a small That is, fA’ (x) tells the local relevance of input features
one to highlight maximum margin. Therefore we set in the sense of identifying class A.
Csvm(x) for x in (11). We probe a representative point PA from class A to
||x − x || generate the discriminant direction with respect to A.
Csvm ( x ) = r r (11) The point closest to curve fA is selected as PA. Clearly,
Check tuning effect. We perform tuned SVM and PA comes from support vector set, so it can be found by
standard SVM on real datasets [6]. Standard SVM sets following optimization:
its hyper parameters by 5-fold cross validation. From min x fA (x), with x∈A. (13)
Table 1, tuned SVM gives competitive results with the Denote fA’(PA) = gA = (gA1, gA2,…, gAn,). Then
optimal results of standard SVM, while consuming less magnitude of each component reveals the importance
computation time. This indicates the quality of tuning of the corresponding dimension when identifying class
strategies. A. Based on this idea, class-wise metric special for class
A is defined as:
tuned SVM standard SVM
Data
Error(%) Time(s) Error(%) Time(s)
Cd A ( x, y ) = ( x − y )T μ A ( x − y ) (14)
Iris
0 0.672 0 1.108
(class 1-vs 2, 3) A exp(|g Ai |)
Iris μ = (15)
4.2 0.702 4.1 1.131 i Σ j =1 exp(| g Aj |)
n
(class 2-vs 1, 3)
Iris Introduce the center of sNEIA of class A, x(A), to
3.27 0.691 3.26 1.107
(class 3-vs 1, 2) yield the distance from x to A:
Breast Cancer 2.52 3.87 2.41 7.05
Table 1: Classification accuracies and time cost comparison
Cd A ( x, A) = ( x − x ( A ) )T μ A ( x − x ( A ) ) (16)
on the average of 20 runs. (%) (20% data are randomly
sampled for training).
4.3. sNEIA Specification
4. Online MFs and NkNN The investigation of sNEIA is pricy. This paper uses
the max gap of distance information from query to
A-class members to set that size. That is:
4.1. Define Online MFs from NkNN Cd A ( Q , x j +1 ) −Cd A ( Q , x j )
t A = max j { Cd A ( Q , x j ) −Cd A ( Q , x j −1 ) | x j ∈ A} (17)
NkNN considers sub neighborhood in each class A:
Here distance list CdA (Q, xj) is sorted in the
sNEIA. sNEIA is developed under the guidance of the
ascending order. Set CdA (Q, x0) = 0, and CdA (Q, x1) = 0,
metric customized to A. This metric also helps to
then tA tells the number of sNEIA members including Q
compute the distance between query and A. Sub
itself.
neighborhood size is taken as class frequency and the
distance as the weight. Denote the class frequency tA =
| sNEIA |. Let CdA (x, A) be the distance between x and 5. Experimental Results
class A, then online MF is defined as:
Cd A ( x , A ) Data SVM1r SVM11 FCM FSVM OOMF
hA ( x ) = (1 −
ΣM
) ⋅ tA (12)
Thyroid 4.37 4.42 4.92 5.14 4.26
I =1 Cd I ( x , I )
Heart 5.76 6.14 5.27 7.11 5.31
Diabetes 11.12 12 10.9 11.2 8.14
4.2. Class-Wise Metric Wine 28.4 26.2 27.3 26.5 30.67
Waveform 3 2.6 3.6 3.3 3
Class-wise metric is expected to reveal more of class’s
Liver 8.12 8.3 8.2 7.6 7.92
intrinsic data features, and consequently produce small Table 2: Comparison on classification error (%) (30% data
inner-class distance values and big inter-class ones. are sampled randomly for training.)
Another reason to define a new metric is to overcome
the curse-of-dimensionality that all kNN-based First six datasets are taken from UCI Machine
methods have to deal with. The new metric is derived Learning Repository [6]. In Table 2, OOMF is
from SVM decision interface, the byproduct of compared with two SVM-based classifiers: SVMs of
WSVM, which facilitates computation greatly. For 1-vs-r version (SVM1r) [7] and SVMs of 1-vs-1
decision function of SVM (A-vs-r), fA, viewed under version (SVM11) [8]; and two fuzzy classifiers: FCM
theoretical light, it is optimal in the sense of structural
[9], and FSVM [10]. These classifiers set hyper 6) {NG1 (50), NG2 (100), NG7 (150), NG8 (50)}
parameters by cross validation. 7) {NG7 (100), NG8 (50), NG12 (200), NG16 (50),
Compared with two SVM-based classifiers, NG17 (100)}
OOMF’s improvement is obvious due to its soft
decision fashion. Two SVM-based approaches are a) b) c) d) e) f) g) h) i)
competitive. In three of six datasets OOMF achieves 1) 32.0 30.8 31.9 37.7 33.1 34.4 35.8 30.2 30.2
best result and this behavior is better over two fuzzy 2) 33.7 30.8 31.2 34.9 31.0 29.0 31.4 31.2 31.2
classifiers, which comes from the employment of local 3) 17.5 16.4 15.9 17.1 15.9 15.3 16.7 15.7 16.1
MFs. OOMF behaves not so well in Wine dataset, 4) 15.9 15.1 16.3 18.3 16.5 15.2 14.8 14.5 14.3
5) 15.8 13.2 12.9 14.4 13.8 13.7 13.6 12.5 12.6
because in this set 178 data cover 13 dimensions and
6) 14.9 13.8 13.9 13.6 13.7 12.6 13.0 13.8 13.6
the neighborhood information is too weak to facilitate
7) 15.6 12.3 12.3 14.7 12.9 12.0 11.6 12.5 12.2
online MF’s job. For two fuzzy methods, FSVM Table 3: Classification error comparison on News Group (%)
exhibits better behaviors on average. It relies on the
nonlinear decision interface produced by SVM, while From Table 3, it finds that in the subsets where
FCM depends on regular partition based on Euclidean class boundaries are not distinct, like {NG2, NG3,
metric, so the lack of adaptation to datasets leads to NG4}, {NG8, NG9, NG10}, OOMF shows a unique
high error ratios. good job among its peers. This is attributed to the
Then OOMF is performed on News group [11]. contribution of local MFs to identify data’s label
This dataset is a compilation of about 20,000 articles context-dependently. In the subsets where class
(email messages) evenly divided among the 20 boundaries are distinct, OOMF yields steady and
categories like religion, politics and sports. We label moderate results and usually its result follows
each newsgroup as follows: secondly the optimal result. Here, SVM1r does a better
NG1: alt.atheism; NG2: comp.graphics; NG3: job than SVM11. C4.5 and Machete work poorly in
comp.os.ms.windows.misc; NG4: some sets due to their greedy idea. Scythe modifies the
comp.sys.ibm.pc.hardware; NG5: greedy nature used by Machete and thereby achieves
comp.sys.mac.hardware; NG6: comp.windows.x; NG7: higher accuracy. DANN work well, but the metric it
misc.forsale; NG8: rec.autos; NG9: rec.motorcycles; employs approximates the weighted Chi-squared
NG10: rec.sport.baseball; NG11: rec.sport.hockey; distance, which will causes its failure in datasets of
NG12: sci.crypt; NG13: sci.electronics; NG14: sci.med; non-Gaussian distribution. Adamenn works well in
NG15: sci.space; NG16: soc.religion.christian; NG17: most cases, but it requires huge cost to tune six
talk.politics.guns; NG18: talk.politics.mideast; NG19: parameters. If cost is considered, OOMF is a fine
talk.politics.misc; NG20: talk.religion.misc. choice.
We apply tf.idf weighting schema to express
documents. We delete the stop words and words that
appear too few times, and then normalize each 6. Conclusions
document vector to have unit Euclidean length. Some
other approaches are considered on the average A fuzzy classification algorithm OOMF is described in
classification error rates of 10 runs: Simple kNN; C4.5 this paper. Its offline MFs are defined by WSVM.
decision tree [12]; Machete [4], a recursive partitioning WSVM modifies SVMs schema by equipping basic
procedure, where the dimension used for splitting at classifier with decision weights, which are integrated
each step is the one that maximizes the estimated local into decision function to form MFs. If the offline
relevance; Scythe [4], a generalization and model presents poor confidence, the online MFs are
modification of Machete method; DANN, an adaptive created by NkNN. NkNN also use a weighted strategy
nearest neighbor method [13]; and Adamenn, another to do label assignment. The neighborhood where
adaptive nearest neighbor approach proposed in [14]. NkNN works is formulated under the class-wise
For the clearness of table, these methods are denoted metric that is derived from SVM decision function.
as: a) kNN, b) SVM1r, c) SVM11, d) C4.5, e) Machete, f) Training dataset size is reduced and hyper parameters
Scythe, g) DANN, h) Adamenn, i) OOMF. We sample are learned data-dependently, which bring
from some classes to form experiment subsets, and computational benefit. Experiments on real datasets
these subsets are listed below, where numbers in evidence fine performance and efficiency of OOMF.
bracket are sampling size. Experiment results are
recorded in Table 3. References
1) {NG2, NG3, NG4} (300)
2) {NG2 (150), NG3 (50), NG4 (200)} [1] T. Inoue, S. Abe, Fuzzy support vector machine
3) {NG6, NG7, NG8} (150) for pattern classification, Proc. of International
4) {NG7 (200), NG8 (150), NG9 (350)} Joint Conference on Neural Networks, pp.
5) {NG1, NG2, NG7, NG8} (200) 1449-1455, 2001.
[2] N. Cristianini, J. Shawe-Taylor, An Introduction
to Support Vector Machines, Cambridge
University Press, London, 2000.
[3] L. P. Wang, Support Vector Machines: Theory
and Application, Springer, Berlin Heidelberg New
York, 2005.
[4] J. H. Friedman, Flexible Metric Nearest Neighbor
Classification, Tech. Report, Dept. of Statistics,
Stanford University, 1994.
[5] A. Ben-Hur, D. Horn, H. T. Siegelmann, „Support
Vector Clustering, Journal of Machine Learning
Research pp. 125-137, 2001.
[6] http://www.uncc.edu/knowledgediscovery.
[7] V. Vapnik, Statistical Learning Theory, Wiley
New York, 1998.
[8] T. J. Hastie, R. J. Tibshirani, Classification by
Pairwise Coupling. Advances in Neural
Information Processing Systems, MIT Press,
Cambridge, 10:507-513, 1998.
[9] J. C. Bezdek, Pattern Recognition with Fuzzy
Objective Function Algorithms, New York,
Plenum Press, 1981.
[10] H. Han-Pang, L. Yi-Hung, Fuzzy Support Vector
Machines for Pattern Recognition and Data
Mining, Fuzzy Systems , 4(3):826-835, 2002.
[11] http://www.cs.cmu.edu/afs/cs/project/theo-11/ww
w/naive-bayes.html.
[12] J. R. Quinlan, C4.5: Programs for Machine
Learning, Morgan-Kaufmann Publishers, 1993.
[13] T. Hastie, R. Tibshirani, Discriminant Adaptive
Nearest Neighbor Classification, IEEE
Transaction on Pattern Analysis and Machine
Intelligence, 18(6):607-615, 1996.
[14] C. Domeniconi, J. Peng, D. Gunopulos, An
Adaptive Metric Machine for Pattern
Classification, Advances in Neural Information
Processing Systems, 13, 2000.

You might also like