LNKnet User's Guide
Linda Kukolich
Richard Lippmann
Printed:
June 1993
Revision 1, June 1994
Revision 2, April 1995
Revision 3, May 1999
Revision 4, February 2004
Acknowledgments
Support for the development of LNKnet was provided by a number of government
agencies and MIT Lincoln Laboratory. Individuals from many organizations helped
support this work. They include Peter Blankenship, Jim Cupples, Laurie Fenstermacher,
John Hoyt, Barbara Yoon, Tod Luginbuhl, and Roy Streit. The first user interface and
some of the first LNKnet classifiers were written by Dave Nation during a six-month
stay at MIT Lincoln Laboratory. Dave also contributed many ideas that helped structure
LNKnet software. Many classification and feature selection algorithms were first programmed and tested by Yuchun Lee, Kenney Ng, Eric Chang, William Huang, and
Charles Jankowski before they were rewritten for incorporation into LNKnet. Valuable
feedback concerning the user interface and LNKnet software was provided by Mike
Richard, students at the Air Force Institute of Technology, and by many members of the
Speech Systems and Technology, Information Systems Technology, and Machine Intelligence Groups at MIT Lincoln Laboratory. The selection of algorithms included in
LNKnet was strongly influenced by experiments performed on speech, computer intrusion detection, and other data bases at MIT Lincoln Laboratory, by results presented at
the annual Neural Information Processing Systems - Natural and Synthetic conference, and by the results of studies performed as part of the ARPA Neural Network program directed by Barbara Yoon.
Table of Contents

CHAPTER 1  Introducing LNKnet
    1.1 Overview
    1.2 Algorithms

CHAPTER 2  A LNKnet tutorial
    2.1 UNIX Setup

CHAPTER 3  Classifiers

CHAPTER 4  Clustering
    4.1 K-Means

CHAPTER 5

CHAPTER 6  Plots

CHAPTER 7

CHAPTER 8

Appendix A
Appendix B
Appendix C
Appendix D
Appendix E

BIBLIOGRAPHY 193
SUBJECT INDEX 197
CHAPTER 1
Introducing LNKnet
1.1 Overview
LNKnet software was developed to simplify the application of the most important statistical, neural network, and machine learning pattern classifiers. The acronym LNK
stands for the first initials of three principal programmers (Richard Lippmann, Dave
Nation, and Linda Kukolich). An introductory article to LNKnet, which is meant to supplement this user's guide, is available in [27]. This article reviews approaches to pattern
classification and illustrates how LNKnet was applied to three different applications.
LNKnet software was originally developed under the Sun Microsystems Solaris 2.5.1
(SunOS 5.5.1) UNIX operating system under Sun OpenWindows. It was then ported to
Solaris 2.6 (SunOS 5.6) and to Red Hat Linux. It was also recently modified to run
under Microsoft Windows operating systems using the Cygwin environment. Binary
versions of LNKnet are provided for Red Hat Linux, Solaris 2.6 and higher, and the
Windows Cygwin environment. Source code is also provided, and it is relatively easy to
recompile LNKnet under other versions of Linux and UNIX because the GNU autoconfiguration tools are used to control compilation. All illustrations and descriptions in this
guide show windows and plots as they appear under Solaris 2.5.1 using Open Windows
except for Support Vector Machine windows and plots which are as they appear under
Red Hat Linux. Windows and plots appear slightly different under the other operating
systems. LNKnet includes a graphical user interface to over 22 pattern classification,
clustering, and feature selection algorithms. Decision region plots, scatter plots, histograms, structure plots, receiver operating characteristics plots, and other types of visual
outputs are provided. Experiment log files and plots can be reviewed from the LNKnet
graphical user interface. Classifiers can be trained on data bases with thousands of input
features and millions of training patterns.
The three primary approaches to using LNKnet are shown in Figure 1.1. Experimenters
can use the LNKnet point-and-click user interface, manually edit shell scripts that contain LNKnet commands to run batch jobs, or embed generated C versions of trained
LNKnet classifiers in application programs. The point-and-click graphical user interface, listed on the top of Figure 1.1, can be used to rapidly and interactively experiment
with classifiers on new data bases. This makes it relatively easy to explore the effectiveness of different pattern classification algorithms, to perform feature selection, and to
select algorithm parameters appropriate for different problems. A new data base must
first be put into a simple ASCII format that can be hand-edited using a text editor. Users
then make selections on LNKnet windows using a mouse and keyboard, and run experiments by pushing buttons using the mouse. A complex series of experiments on a new
moderate-sized data base (10,000s of patterns) can be completed in less than an hour.
FIGURE 1.1 [Three ways to run LNKnet experiments: the point-and-click graphical user interface; batch mode using UNIX shell scripts; generate C routines and embed the C routines in user application programs]
knn2c, etc.) can be used to generate C source code to implement LNKnet classifiers and
how this source code can be embedded in a users program (see Section 7.2).
This guide assumes that the reader is familiar with the basic concepts of pattern classification. The article mentioned above [27], provides a brief introduction to LNKnet and
pattern classification. Recent reviews of pattern classification techniques including neural networks and machine learning approaches are available in [2,7,14,40]. Good older
discussions of pattern classification are available in [1,9,30]. Algorithmic descriptions
of the classifiers included in LNKnet are included in the references listed in Table 1.1.
Many of these algorithms are also described in [5,7,14].
TABLE 1.1: Algorithms Included in LNKnet
    SUPERVISED TRAINING, NEURAL NETWORK ALGORITHMS:
        Back-Propagation (BP) [21,25]
        Cross-Entropy BP [36]
        Top-2-Difference BP [10,11]
        Committee [7]
    UNSUPERVISED TRAINING (Clustering):
        Leader Clustering [12,28]
    CONVENTIONAL PATTERN CLASSIFICATION ALGORITHMS
    FEATURE REDUCTION:
        Principal Components Analysis [7,9]
1.2 Algorithms
Table 1.1 lists the static pattern classification, clustering, and feature selection algorithms that are available in LNKnet. Algorithms include classifiers trained using supervised training with labeled training data, clustering algorithms trained without
supervision using unlabeled training data, and classifiers that use clustering to initialize
internal parameters and then are trained further with supervised training. Canonical linear discriminant and principal components analyses are provided to reduce the number
of input features using new features that are linear combinations of old features. Forward and backward searches are provided to select a small number of features from
among the existing features. These searches can be performed using any LNKnet classifier with N-fold cross validation or using a nearest neighbor classifier and leave-one-out
cross validation. Bracketed references after algorithm names in Table 1.1 are to references in the bibliography that provide detailed descriptions of algorithms. Overall summaries and comparisons of these algorithms are available in [2,11,14,18,
21,22,24,25,28,29,36,40].
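The leave-one-out procedure mentioned above can be sketched as follows. This is illustrative Python, not LNKnet source code: each pattern is held out in turn, the remaining patterns act as the stored training set for a nearest neighbor classifier, and the held-out pattern is classified.

```python
def nearest_neighbor_class(train, query):
    """Return the class label of the stored pattern closest to query."""
    best = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query)))
    return best[1]

def leave_one_out_error(patterns):
    """Fraction of patterns misclassified when each is held out in turn."""
    errors = 0
    for i, (features, label) in enumerate(patterns):
        rest = patterns[:i] + patterns[i + 1:]
        if nearest_neighbor_class(rest, features) != label:
            errors += 1
    return errors / len(patterns)

# Two well-separated classes: every held-out pattern is classified correctly
data = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((1.0, 1.0), 1), ((0.9, 1.0), 1)]
print(leave_one_out_error(data))  # → 0.0
```

Leave-one-out cross validation is attractive with a nearest neighbor classifier because no retraining is needed for each held-out pattern; only the distance computations change.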
FIGURE 1.2 [The experiment cycle: select classifier and classifier structure; train using training data; test using evaluation data; change classifier structure and repeat]
case, the split often assigns 60% of the patterns to training data and 40% to test data.
When only tens of patterns are available, only the training data is used with 10-fold
cross validation. LNKnet automatically performs 10-fold (or more general k-fold) cross
validation, but it does not partition the initial database into training, evaluation, and test
sets. This partitioning must be performed prior to using LNKnet. It was not automated
because the number of partitions depends on the number of patterns, partitioning is
often predefined, and partitioning often depends on ancillary pattern characteristics that
are not included as pattern features.
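A 60/40 split of the kind described above can be produced outside LNKnet with a few lines of code. The sketch below is a hypothetical helper, not an LNKnet utility; it shuffles the pattern list with a fixed seed so the partition is reproducible.

```python
import random

def split_patterns(patterns, train_fraction=0.6, seed=0):
    """Shuffle a pattern list and split it into training and test portions."""
    rng = random.Random(seed)          # fixed seed: the same split every run
    shuffled = patterns[:]
    rng.shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

train, test = split_patterns(list(range(100)))
print(len(train), len(test))  # → 60 40
```

Any ancillary pattern characteristics (talker identity, session, and so on) should be respected when splitting real data, which is exactly why LNKnet leaves this step to the user.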
can be skipped, allowing the classifier to use any or all features of the raw data or the
normalized data.
FIGURE 1.3 [Classifier input data flow: raw input data from file; then simple normalization, PCA, or LDA (or no normalization); then select features (or use all features); then the LNKnet internal classifier input]
scripts to run LNKnet experiments in a batch mode, a review of advanced LNKnet features, and a description of input and output data formats and files. The Appendix contains a list of common problems and questions, instructions for installing LNKnet,
listings of the shell scripts created during the tutorial, descriptions of installed data
bases, and a short tutorial which describes how to use the mouse in Sun OpenWindows.
Detailed descriptions of LNKnet programs are available in man pages which are
accessed using the UNIX man(1) command. The page LNKnet(1) lists all LNKnet programs and classifier(1) lists flags common to all classifier programs.
CHAPTER 2
A LNKnet tutorial
This tutorial introduces some of the general LNKnet classifiers and options. With LNKnet you will solve a speech classification problem using a Multi-Layer Perceptron and a
K Nearest Neighbor algorithm. You will generate diagnostic plots. You will continue an
experiment by restoring LNKnet windows using the experiment's screen file. You will
use feature selection and normalization to reduce the number of input features in an
experiment and you will use cross validation to experiment on a small data base. The
tutorial assumes you are using the C-shell (csh) and running under the Solaris operating
system.
FIGURE 2.1 [Main window callouts: the classifier should be MLP; the full experiment name will be X1mlp; test using evaluation data; always hit carriage return or tab after changing any LNKnet text field]
to be changed from the default value, the note pointing to the control is surrounded by a
box. For example, the Train and Test button on the left side of Figure 2.1 is not normally
depressed by default and it must be depressed to run this tutorial.
the plane or another, dividing the input space in half. These half spaces are combined
and smoothed in upper layers to assign classes to regions of the input space.
To train network weights, training data is presented to the classifier several times. On
the parameter popup shown in Figure 2.2, set the number of epochs to 20. An epoch is
a full pass through the training data, so each pattern will be presented 20 times over the
course of training. Specify the network to have 2 inputs, 25 hidden nodes, and 10 outputs by entering 2,25,10 on the Nodes/Layer field on the second line in the window. Do
not add spaces before or after the commas in the 2,25,10 or other comma delimited
lists. The step size, which is the rate at which the weights are changed, must also be set.
Change the step size to 0.2, remembering to hit carriage return when you have done so.
Changes to LNKnet text fields do not take effect unless carriage return is hit afterwards.
Other MLP parameters are set on three additional MLP parameter windows which can
be displayed using three buttons on the main MLP window. There are explanations of
the parameters on these windows in Section 3.1.1 in this Users Guide and on the
mlp(1) manual page.
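The role of the step size can be seen in a stripped-down sketch of one back-propagation weight update. This is illustrative Python for a single sigmoid node trained on squared error, not the LNKnet mlp implementation; the 0.2 step size matches the value set above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(weights, inputs, target, step_size=0.2):
    """One gradient-descent update of a single sigmoid node (squared error)."""
    out = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    # d(error)/d(weight) = (out - target) * out * (1 - out) * input
    grad = (out - target) * out * (1.0 - out)
    return [w - step_size * grad * x for w, x in zip(weights, inputs)]

weights = [0.5, -0.3]
for _ in range(20):  # 20 presentations, as in a 20-epoch run over one pattern
    weights = train_step(weights, [1.0, 1.0], target=1.0)
# repeated presentations move the node's output toward the target
```

A larger step size moves the weights faster but can overshoot; the 0.2 used in this tutorial is a moderate choice for the vowel data.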
FIGURE 2.2 [MLP parameter window callouts: network topology of 2 inputs, 25 hidden nodes, 10 outputs; number of epochs; change Step size to 0.2]
late 50s from 67 men, women, and children. Each talker said the ten words, spectrograms were made from the waveforms, and resonant or formant frequencies for the
vowels were selected. The features of the vowel data base come from the first two formant frequencies. The LNKnet data base pbvowel has the original data. Do not continue
unless the bottom of the data base window appears as it does in Figure 2.4.
FIGURE 2.4 [Data base window callout: if necessary, change Data Path to fill Data Base List]
2.4.3 Normalization
For many classifiers, classification results are improved when the data has been normalized in some way. Although this vowel data has already been normalized to range from
zero to one, better results are achieved when the data is given zero mean and unit variance
using simple normalization. Display the normalization window by selecting Feature
Normalization... on the main window. The normalization window in Figure 2.5 will
appear. Check that Simple Normalization is selected. If LNKnet cannot find the normalization file it will report an error at the bottom of the normalization window and
show a small stop sign on the main window. Normalization files are stored in the data
directory, so check the data path on the data base window. If the file really does not
exist, Section 2.14 in this tutorial describes how a normalization file can be created
using LNKnet.
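Simple normalization amounts to shifting and scaling each feature independently. The following is a sketch of the computation, not the code LNKnet uses to build its normalization files:

```python
import math

def simple_normalize(data):
    """Shift and scale each input feature to zero mean and unit variance."""
    n, dims = len(data), len(data[0])
    means = [sum(row[d] for row in data) / n for d in range(dims)]
    # `or 1.0` guards against a constant feature with zero spread
    stds = [math.sqrt(sum((row[d] - means[d]) ** 2 for row in data) / n) or 1.0
            for d in range(dims)]
    return [[(row[d] - means[d]) / stds[d] for d in range(dims)] for row in data]

normalized = simple_normalize([[0.0, 10.0], [0.5, 20.0], [1.0, 30.0]])
# each column of `normalized` now has mean 0 and variance 1
```

The means and standard deviations must come from the training data and then be applied unchanged to evaluation and test data, which is why LNKnet stores them in a file in the data directory.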
FIGURE 2.5
Normalization Window
Three two dimensional plots are controlled from the Decision Region Plot parameter
window. They are the decision region plot, the scatter plot, and the internals plot. Push
the top most Parameters... button to bring up the Decision Region Plot window shown
in Figure 2.7. For this experiment, you should change Number of Intervals per
Dimension from 50 to 100 on the Decision Region Plot window. This will cause the
plotting program to use a finer grid for generating the decision region plot. It will take
longer to generate, but the plot will look better. Figure 2.7 shows the Decision Region
Plot window ready for the experiment.
FIGURE 2.6
Two one-dimensional plots are controlled from the profile plot parameter window. They
are the profile plot and the histogram plot. Push the second Parameters... button to
bring up the Profile Plot Parameters window shown in Figure 2.8. The one dimensional
plots are available for classifiers with continuous outputs including the MLP classifier.
No profile plot parameters need to be changed from their default settings.
FIGURE 2.7 [callout: Do Color Plots]
It can be informative to see the node structure of a trained classifier. Each LNKnet classifier for which it is appropriate has a structure plot. Depending on the classification algorithm, the structure plots show the input and output nodes of the classifier
and the connections between them. If these connections have weights, the weights can
be displayed. Explanations of the structure plots can be found on the manual pages for
each one and in Section 6.3 of this Users Guide. Select the third Parameters... button
to bring up the structure plot window. Select Autoscale Plot To Fit on Screen, Show
Weight Magnitudes, and Display Bias Nodes, as shown in Figure 2.9.
FIGURE 2.8
FIGURE 2.9 [callout: Autoscale]
While training a classifier by cycling through the data, a classification error file is created. The accuracy of the classifier during training can be plotted in two ways, using a
cost plot or a percent error plot. The cost is the function being minimized by the classifier. These plots are not available for algorithms that train in a single pass through the
data.
If this is the first time you have run LNKnet, the Cost Plot and Percent Error Plot
parameter windows should be ready for the experiment. They should appear as in Figure
2.10.
FIGURE 2.10
During the testing portion of the experiment, a testing error file is produced. If the classification algorithm produces continuous outputs, as the MLP algorithm does, this test
error file can be used to produce several plots. Some plot parameters need to be set for
this experiment. On the Posterior Probability plot window, shown in Figure 2.11, set the
target class to 2 (hod) and select Binned Probability Plot. On the ROC plot window
shown in Figure 2.12, set the target class to 2 (hod). On the rejection plot window,
shown in Figure 2.13, set the table step to 10.
FIGURE 2.11
FIGURE 2.13
FIGURE 2.12
the function being minimized by the classifier. After the 20 epochs are over, a summary
of the training errors is printed. The shell script then calls the mlp program to test the
classifier using the evaluation data. The results of that test are below. Finally, the shell
script displays the requested plots. Each plot is displayed in its own plotting window.
Figure 2.14 shows the screen of a workstation after running this experiment.
FIGURE 2.14 [callouts: results in shell window; structure plot; profile plot; detection (ROC) plot; MLP parameters]

TABLE 2.1:
    X1mlp.run                 Shell script
    X1mlp.screen
    LNKnet.note.screen
    X1mlp.param
    X1mlp.log
    X1mlp.err.train
    X1mlp.err.eval
  Files Created by Plot Programs
    X1mlp.region.plot.eval    2-Dimensional plots
    X1mlp.profile.plot.eval   1-Dimensional plots
    X1mlp.struct.plot         Structure plot
    X1mlp.cost.plot           Cost plot
    X1mlp.perr.plot           Percent error plot
    X1mlp.prob.plot           Probability plot
    X1mlp.detect.plot         Detection (ROC) plot
    X1mlp.reject.plot         Rejection plot
[10-by-10 confusion matrix, desired class versus computed class, for the 166 evaluation patterns after the first 20 epochs. Summary: 166 patterns, 54 errors, 32.53% error (standard deviation 3.6), RMS error 0.207]
Figure 2.15 shows the set of three overlaid 2D plots. There is a decision region plot (the
solid regions), a scatter plot of the evaluation data (the small white rimmed squares),
and an internals plot (the black lines). The decision region plot shows the class that
would be selected at each point on the plot. The values for the two selected input
dimensions are as you see them. Any other input dimensions are held constant either to 0 or to
values specified on the decision region plots window. The scatter plot shows the evaluation data, color coded to show the class. All patterns within a set distance of the decision
region plot plane are shown. Classification errors can be found by looking for those patterns whose color does not match the background color from the decision region plot.
The form of the internals plot depends on the type of algorithm being used for classification. In this case, the multi-layer perceptron, the lines represent hyperplanes defined
by the nodes of the first hidden layer. The hidden nodes which generate particular borders between decision regions can often be identified using this plot.
FIGURE 2.16 [callouts: total of outputs; misclassified hoods]
Figure 2.16 shows the two 1D plots. There is a profile plot (the black and colored lines
and the solid bars below them), and a histogram of the evaluation data (the squares at
the bottom).
For the profile plot, all of the input features but one are held constant while one feature
is varied. The output levels for each class are plotted. These output level lines are the
colored lines. The total of these output levels is plotted as a black line. This line should
be close to 1.0 for a well trained classifier that estimates posterior class probabilities,
like the MLP classifier. Below the output level lines is something like a one dimensional
decision region plot. It shows the class which would be chosen for a pattern with the
given generated inputs. Where the class changes, a dotted vertical line is drawn.
The histogram plot at the bottom of Figure 2.16 is in two parts. The points shown are
either all the patterns in the evaluation data set, or those which are within some distance
of the line being sampled for the profile plot. The squares above the line represent those
patterns which are correctly classified by the current model. The squares below represent misclassified patterns. These squares are color coded by class, as in the scatter plot.
Figure 2.17 shows a structure plot for the trained multi-layer perceptron. At the bottom
of the plot there are two small black circles representing input nodes and a small black
square representing the input bias node. Below each input node is the input label for that
node. From the bottom to the middle is a set of lines of varying thicknesses. These lines
represent the weighted connections from the input layer to the hidden layer. The thickness of these and other lines is proportional to the magnitude of the connecting weight.
Some of the lines are orange, indicating that connecting weights are negative. The large
white circles represent the hidden nodes where weighted sums of the inputs are calculated and passed through a sigmoid function. The hidden layer also has a bias node. At
the top of the plot are large white circles representing the output nodes. Another set of
lines shows the weighted connections from the hidden layer to the output layer. Above
the output nodes are the class labels for the output classes.
FIGURE 2.17
During training, each pattern is tested by the MLP classifier. The classification results
from these tests are stored in the file X1mlp.err.train. The average percent error
in classification during each epoch of training is shown in Figure 2.18.
FIGURE 2.18
The cost of each pattern tested during training is also stored in the file
X1mlp.err.train. The average of these values for each epoch is shown in Figure
2.19.
FIGURE 2.19
The cost here is the square root of the mean squared error of the outputs normalized by
the number of classes. A desired output of 1 for the correct class and 0 for the other
classes is subtracted from the actual outputs for a pattern. These values are then squared,
averaged over the number of classes, and stored as the cost. This plot, then, averages the
costs for each epoch and takes the square root to get each point on the plot.
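The cost computation just described can be written out directly. This is an illustrative Python sketch of the arithmetic, not LNKnet code:

```python
import math

def pattern_cost(outputs, correct_class):
    """Squared error against a 1-of-N target, averaged over the classes."""
    return sum((o - (1.0 if c == correct_class else 0.0)) ** 2
               for c, o in enumerate(outputs)) / len(outputs)

def epoch_cost(per_pattern_costs):
    """Root of the average per-pattern cost: one point on the cost plot."""
    return math.sqrt(sum(per_pattern_costs) / len(per_pattern_costs))

# perfect outputs for class 1 give zero cost
print(pattern_cost([0.0, 1.0, 0.0], correct_class=1))  # → 0.0
```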
Figure 2.20 shows a posterior probability plot for class 2, hod. To generate the plot, each
evaluation pattern is binned according to its output for class 2. Five bins were used to
create this plot. They represent class 2 outputs of 0.0 to 0.2, 0.2 to 0.4, 0.4 to 0.6, 0.6 to
0.8, and 0.8 to 1.0. The average class 2 output values for the patterns in each bin are
shown as Xs. The actual percentage of class 2 patterns in a bin is drawn as a filled circle. Lines above and below the circle represent two standard deviations about the actual
percentages. The total number of patterns and the number of class 2 patterns in each bin
are displayed above the upper limit mark. For example, the numbers 2/6 over the 0.4 to 0.6 bin mean two patterns in this bin were from class 2 and there were six patterns in this
bin. If all the Xs are within the 2 standard deviation limits, the classifier provides
accurate posterior probability estimates. A table of the values in the plot is printed in the
log file and a Chi Squared test and significance values are printed in the experiment
notebook.
FIGURE 2.20
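The binning just described can be sketched as follows. This is illustrative Python, not the LNKnet plotting code, and the two-standard-deviation bound shown is the usual binomial estimate for a proportion, which is assumed here to be what the plot draws:

```python
import math

def binned_probability(outputs, is_target, n_bins=5):
    """Per bin: average output (the X), actual target fraction (the circle),
    and a two-standard-deviation bound on that fraction."""
    bins = [[] for _ in range(n_bins)]
    for out, target in zip(outputs, is_target):
        bins[min(int(out * n_bins), n_bins - 1)].append((out, target))
    rows = []
    for patterns in bins:
        if not patterns:
            continue  # empty bins are left off the plot
        n = len(patterns)
        mean_out = sum(out for out, _ in patterns) / n
        frac = sum(1 for _, target in patterns if target) / n
        two_sd = 2.0 * math.sqrt(frac * (1.0 - frac) / n)
        rows.append((mean_out, frac, two_sd))
    return rows
```

A well-calibrated classifier yields average outputs (the Xs) close to the actual class fractions in every bin.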
Figure 2.21 shows a receiver operating characteristics or ROC curve for class 2, hod.
This plot shows the detection rate (hod patterns labeled as hod) versus the false alarm
rate (other patterns labeled as hod) for a varying threshold value on the classifier output
for the hod class. To generate the plot, the evaluation patterns are sorted by their class
2 output values. For each point on the plot, a threshold value is set. All patterns which
have a class 2 output greater than the threshold are labeled as belonging to the class and
all other patterns are labeled as not in the target class. The detection rate and false alarm
rate that result from this labeling give the position of the plotted point. The plot in Figure 2.21 shows that the detection accuracy for the hod class is higher than 95% correct
when 10% false alarms are allowed. The quality of the ROC curve can sometimes be
judged by the area under the curve. In this case it is 98.7%, which is good. A perfect
area of 100% is achieved if there is a threshold value such that all patterns above the
threshold are in the target class and all patterns below the threshold are not. If the classifier output is random and contains no information, the ROC area is near 50% and the
ROC is close to the diagonal line %Detect = %False Alarms. A table of the values in
the ROC plot curve is printed in the log file and the area under the curve is printed in the
experiment notebook.
FIGURE 2.21
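The threshold sweep behind an ROC curve can be sketched as below. This is illustrative Python, not LNKnet code; it assumes both target and non-target patterns are present, and it computes the area with the trapezoidal rule:

```python
def roc_points(scores, is_target):
    """Detection rate vs. false-alarm rate for every threshold on the scores."""
    ranked = sorted(zip(scores, is_target), reverse=True)  # high scores first
    n_target = sum(1 for t in is_target if t)
    n_other = len(is_target) - n_target
    points, hits, false_alarms = [(0.0, 0.0)], 0, 0
    for _, target in ranked:  # lower the threshold one pattern at a time
        if target:
            hits += 1
        else:
            false_alarms += 1
        points.append((false_alarms / n_other, hits / n_target))
    return points

def roc_area(points):
    """Trapezoidal area under the ROC curve (1.0 = perfect separation)."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# perfect separation: every target outscores every non-target
pts = roc_points([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
print(roc_area(pts))  # → 1.0
```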
Figure 2.22 shows a rejection plot. To generate the plot, all evaluation patterns are
sorted by their highest output value across all classifier outputs. Patterns whose highest
outputs are below a rejection threshold are rejected and not classified. The error rate of
the classifier on the non-rejected patterns is plotted versus the percentage of the patterns
rejected. If all the patterns which cause errors have low maximum output values, the
percent error will drop until all the incorrectly classified patterns have been rejected. For
the current experiment, rejecting more than 20% of the patterns substantially reduces
the error rate on remaining patterns. The curve is erratic above 70% rejection because so
few patterns remain.
FIGURE 2.22
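The rejection computation can be sketched as follows; this is illustrative Python, not the LNKnet plotting code:

```python
def rejection_curve(max_outputs, correct):
    """Percent error on retained patterns vs. percent of patterns rejected,
    rejecting lowest-confidence patterns first."""
    ranked = sorted(zip(max_outputs, correct))  # lowest maximum output first
    n = len(ranked)
    curve = []
    for n_reject in range(n):
        kept = ranked[n_reject:]
        errors = sum(1 for _, ok in kept if not ok)
        curve.append((100.0 * n_reject / n, 100.0 * errors / len(kept)))
    return curve

# the two errors have the lowest confidence, so error falls to 0 by 50% rejection
curve = rejection_curve([0.3, 0.4, 0.8, 0.9], [False, False, True, True])
print(curve[0], curve[2])  # → (0.0, 50.0) (50.0, 0.0)
```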
On the KNN parameter window set K to 3. If you type 3, don't forget to hit carriage
return to enter the new value. Select START on the main window to start the experiment.

FIGURE 2.23 [KNN parameter window: Set K to 3]

Once again a shell script is written. The order of the commands is again train,
evaluate, and plot. The files created during this experiment are shown in Table 2.2.
TABLE 2.2:
    X1knn.run              Shell script
    X1knn.screen
    LNKnet.note.screen
    X1knn.log
    X1knn.err.eval
                           2-Dimensional plots
2.10.2 Plots
Because KNN trains in a single pass, the cost plot and percent error plot are not available. KNN takes a vote among a pattern's nearest neighbors to determine the class of
that pattern. This does not produce continuous outputs so there is no profile plot, posterior probability plot, detection plot, or rejection plot. There are no connections between
the stored KNN parameters, so little information would be gained from a KNN structure
plot. This leaves us with only the decision region plot displayed in Figure 2.25.
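The KNN vote can be sketched in a few lines. This is illustrative Python, not the LNKnet knn program; the class labels are two of the tutorial's vowel classes:

```python
from collections import Counter

def knn_class(train, query, k=3):
    """Classify query by a vote among its k nearest stored training patterns."""
    dist = lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query))
    neighbors = sorted(train, key=dist)[:k]       # k closest stored patterns
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]             # majority class wins

train = [((0.0, 0.0), "hod"), ((0.1, 0.1), "hod"),
         ((1.0, 1.0), "hood"), ((0.9, 1.0), "hood")]
print(knn_class(train, (0.05, 0.0), k=3))  # → hod
```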
FIGURE 2.25 [callouts: position of a stored training pattern; misclassified hood]
The KNN decision region plot is generated in the same way as the MLP plot. The classifier is tested at every point in a 100 by 100 grid. The classification results are shown
by drawing color coded regions for the class returned for each tested grid point. The
overlaid scatter plot is identical to that in the MLP plot. Because the classification algorithm is different, the overlaid internals plot is different for the KNN classifier. Small
black squares are drawn which show the positions of the stored training patterns.
On the MLP parameter window, the number of epochs was set to 20. Selecting CONTINUE Current Exper. will create a shell script that trains the previous MLP model
for 20 more epochs. A notifier window will appear which says "Shell file exists: OK to
overwrite?" LNKnet by default will use the same name for the shell script that continues the experiment. Either select Overwrite or hit Return to replace the old contents of
X1mlp.run with the new script. The new shell script differs from the old one on only
two lines. The create flag is not included in the new training call and the new training
results are appended to X1mlp.log and X1mlp.err.train. When training is
complete, the new model parameters will be stored in X1mlp.param, replacing the
old ones. New versions of the plots will be created which will overwrite the existing
plot files. An entry with the new experiment results will be added to the experiment
notebook file, LNKnet.note.
2.11.2 MLP Eval Results
These are the classification results on the evaluation data after a total of 40 epochs of
training. The MLP classifier now provides an error rate of 19.88%, with a standard
deviation of 3.1%. Given this standard deviation, the new error rate is about the same as KNN's. The
error rate might be improved with more training, but the amount of improvement for
each epoch of training becomes increasingly small.
Classification Confusion Matrix - X1mlp.err.eval
[10-by-10 confusion matrix, desired class versus computed class, for the 166 evaluation patterns after 40 total epochs. Summary: 166 patterns, 33 errors, 19.88% error (standard deviation 3.1), RMS error 0.181]
The profile plot is shown in Figure 2.27. The new profile plot shows that the outputs for
the dominant class in each region of the input space have gotten closer to 1. The outputs
for the other classes in those regions are closer to 0, making the transitions between
classes sharper. The total line is less smooth but is still near one.
Because the weight magnitudes have not changed much, the new structure plot is almost
identical to the old one. The structure plot is in Figure 2.28. Because the pattern by pattern results from the continuation training are appended to X1mlp.err.train, the
error rate from all training is shown in the new cost plot and the new percent error plot,
not just the rate from the twenty new epochs. These new plots are in Figure 2.29 and
Figure 2.30 on page 37. On the probability plot, shown in Figure 2.31, the value in the
second bin (output values of 0.2 to 0.4) has improved. The bin for values from 0.4 to 0.6
has been combined with the 0.6 to 0.8 bin. The bins which cover the middle of the range
have been eliminated because they contain too few patterns. Figure 2.32 shows the new
ROC plot which has changed very little. The ROC area has increased to 99.0%. Figure
2.33 shows the new rejection plot. There are more high scoring correctly classified pat-
FIGURE 2.27
FIGURE 2.28
terns now. This can be seen because there is a downward slope in this curve with few
rejections.
FIGURE 2.29
FIGURE 2.30
FIGURE 2.31
FIGURE 2.32
FIGURE 2.33
is important because they can be very large. The command rm X1*.err* will
remove the error files generated during the previous experiments in this tutorial. These
files can be recreated by re-running a LNKnet experiment and are only necessary if you
want to continue training an incrementally trained classifier or if you want to generate
additional plots that depend on these files such as the error rate versus training time plot.
(1,...,1), class 2 on (2,...,2), and so on. Gaussian noise is added to the centers to generate
the patterns for each class. The variance of the noise depends on the input feature number. It is lowest for feature number 7 and highest for feature number 0. The variance of
the data in the eighth input dimension is 0.25. The variance in the lower dimensions
increases by 0.25 every dimension giving a variance of 2 for the first dimension. The
high numbered features thus provide more information than the lower numbered features because they have less variance.
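The variance structure described above can be sketched as a generator for one pattern. This is illustrative Python reproducing the stated recipe, not the program that built the gnoise_var data base:

```python
import random

def gnoise_pattern(class_index, rng, n_features=8):
    """One pattern: the class center (c, c, ..., c) plus Gaussian noise whose
    variance falls from 2.0 at feature 0 to 0.25 at feature 7."""
    return [class_index + rng.gauss(0.0, (0.25 * (n_features - d)) ** 0.5)
            for d in range(n_features)]

rng = random.Random(0)
pattern = gnoise_pattern(2, rng)
# feature 7 stays close to the center 2; feature 0 wanders much further
```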
The gnoise_var data base has 8 input dimensions. If the current plot parameter settings
are used, scatter plots may not show all of the data. Go to the Plot Selection window
and bring up the Decision Region Plots window. On the decision region plot window,
make sure that Show All Data Points is selected. For the remainder of this tutorial, only
the decision region plots will be shown. Select the check boxes for the other plots again
to remove the checks and deselect the plots.
Because this data base was generated using a Gaussian distribution for each class, a
Gaussian classifier should be used to solve this problem. On the main window, select
the classification algorithm Gauss. Go to the algorithm parameter window to check the
variables for the Gaussian classifier. Each class has the same variance, so make sure that
Same for All Classes (Grand) is selected. The variance in each direction is independent, so make sure that Diagonal Covariance Matrix is selected.
Your first experiment uses all of the input features to obtain a base error rate. Select
START to run the first feature selection experiment. The classifier should make one
error classifying the evaluation data. Figure 2.34 shows the decision region plot for the
first experiment. Because all of the evaluation data is displayed, the decision regions do
FIGURE 2.34
Internals plot. Because these are grand variances, all of the ellipses representing the Gaussians have the same shape. Because the covariance matrices are diagonal, the axes of the ellipses are parallel to the input dimensions.
not seem to match the scatter plot data. The internals plot for the Gaussian classifier is
the set of ellipses shown over the scatter plot. These ellipses represent the Gaussian
functions that model each class. The length and width of the ellipses are proportional to
the variances of the Gaussians. More plots could be generated showing the other dimensions by changing the Input Dimensions to Plot on the Decision Region Plot window.
Select PLOT ONLY on the main screen to write a shell script that generates the
requested plots without retraining or retesting the classifier first.
Each feature in this data base is a noisy estimate of the class number. All eight features
may not be necessary to get the right answers. You can try using only the first feature.
On the main window, change Experiment Name to N1. Bring up the Feature Selection window by selecting Feature Selection... on the main window. On the feature
selection window, select First N and change N to 1. Select START to run the experiment. The error rate using just the first feature should be 78% on the evaluation data.
FIGURE 2.35
Figure 2.35 shows the decision region plot for the classifier using only the first input
feature. Because there is really only one dimension being plotted, there is no variation along the Y direction of the decision region plot: the scatter plot data is all shown on the line y=0 and the internals plot uses circles to represent the variance of the Gaussians.
For the next experiment, change the experiment name to N2. On the feature selection
window change N to 2. Now repeat the experiment using the first two inputs. The error
rate should be 62%. Finally, change the experiment name to N4 and change N to 4 on
the feature selection window. Repeat the experiment using the first 4 inputs. The error
rate should be 48%. The 2D plots for these experiments will still be generated, but they
are not shown.
The variance of the data in the first few features is too high for these features to be useful in discriminating the classes. Perhaps the error rate can be reduced by picking out
particular features rather than just taking them in order. One approach to feature selection is to create a list of features in order of presumed importance. Any of the feature
search algorithms can be used to create such a list. On the Feature Selection window
select Read Feature List from File. Because the feature list file has not been created
yet, an error sign should appear on the feature selection window and beside the Feature
Selection button on the main window. Select the Generate Feature List File... button at
the bottom of the window. This brings up the Generate Feature List File window
shown in Figure 2.36. Select Nearest-Neighbor Leave-One-Out CV as the search
LNKnet Users Guide (Revision 4, February 2004)
algorithm and Forward as the search direction. Select Start Feature Search to start the
search for the best set of input features to use.
FIGURE 2.36
In this search, the program feat_sel tests each feature to find the one which is most
effective in classifying the data by itself. The remaining features are then paired with the
first and the best is selected as the second feature. Features are added this way until
there are no more left. The feature sets are tested using a nearest neighbor classifier
using leave-one-out cross validation. They could also have been tested using the current
Gaussian classifier with ten-fold cross validation. The results of each step of the search
and the best set of features is printed to the screen and to a log file. The plot in Figure
2.37 shows the cross validation error rate achieved as each feature is added. We can see
that most of the features actually increase the error rate and that using features 7, 5, and
6 achieves a good error rate.
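The forward search just described can be sketched as follows. This is a small numpy illustration of the strategy used by feat_sel, not its actual implementation.

```python
import numpy as np

def loo_1nn_error(X, y):
    """Leave-one-out error rate of a 1-nearest-neighbor classifier."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)          # a pattern may not match itself
    return float(np.mean(y[np.argmin(d, axis=1)] != y))

def forward_search(X, y):
    """Greedy forward search: repeatedly add the single feature that
    gives the lowest leave-one-out 1-NN error, until none remain."""
    remaining = list(range(X.shape[1]))
    chosen, errors = [], []
    while remaining:
        best = min(remaining, key=lambda f: loo_1nn_error(X[:, chosen + [f]], y))
        chosen.append(best)
        remaining.remove(best)
        errors.append(loo_1nn_error(X[:, chosen], y))
    return chosen, errors
```

The returned `chosen` list, in order, plays the role of the feature list file written by the search.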
FIGURE 2.37
To use a subset of these features in the selected order, change the experiment name on
the main window to last3. Return to the feature selection window. Set Use to First N
and N to 3 so that you are using the first 3 features from the list in the new feature list
file. If the feature list file did not previously exist, there will still be an error message
saying so. Click the mouse on the error message or select Read Feature List From File
again to erase the message. An alternate way to use these features is to select Check by
Hand as the selection method. Then type in the following list: 7,5,6. These comma
delimited lists are used in many places in LNKnet. The list is made from integers separated by commas with no spaces or tabs. See Problem 2.12 on page 131 for more information about comma delimited lists. Figure 2.38 shows the feature selection window
with the last three features selected. This experiment should produce an error rate of 3%
on the evaluation data. The shell script and log file for this experiment, last3gauss, can
be found in Appendix C. Table 2.3 shows the results of the feature selection experi-
FIGURE 2.38
ments. These results can also be found in the notebook file, LNKnet.note. A copy of the
notebook file is found in Appendix C.
TABLE 2.3  Error rate of Gaussian Classifier on evaluation data using different features of gnoise_var data

    Experiment Name    Number of features    Eval Error Rate
    all                All 8                  1%
    N1                 first 1               78%
    N2                 first 2               62%
    N4                 first 4               48%
    last3              3 picked               3%
in training and produces a smaller number of output features when there are fewer
classes than input features. The number of output features with LDA is the minimum of
M-1 and D, where M is the number of classes and D is the number of original input features. Because PCA and LDA are applied to raw data before the input vectors are
handed to the classifiers, they are included as normalization methods.
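As a sketch of how PCA produces the rotated features and eigenvalues discussed below, the following uses a plain eigendecomposition of the covariance matrix. This is an illustration, not LNKnet's normalization code; an LDA version would additionally use the class labels and yield at most min(M-1, D) useful features.

```python
import numpy as np

def pca(X, n_keep):
    """Rotate mean-removed data onto the eigenvectors of its covariance
    matrix, ordered by decreasing eigenvalue, and keep n_keep features."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(evals)[::-1]          # largest eigenvalue first
    return Xc @ evecs[:, order[:n_keep]], evals[order]
```

The relative sizes of the returned eigenvalues correspond to the EV % plotted for each rotated feature.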
To try PCA and LDA on the gnoise_var problem, display the normalization window by
selecting Feature Normalization... on the main window. Select Principal Components
as the normalization. Because the normalization file has not been created yet, an error
will appear on this window and beside the normalization button on the main window.
Select Generate Normalization File... to bring up the window which creates normalization files. The Generate Normalization File window is shown in Figure 2.39. Select
Run on this window to calculate the PCA parameters. Now select Linear Discriminant
on the normalization window. Select Run again on the normalization file generation
window to calculate the LDA parameters.
FIGURE 2.39
A plot is generated for each normalization method. The plots show the relative sizes of
the eigenvalues for each of the features in the rotated space. This can be taken as a measure of the importance of each feature. The plots in Figure 2.40 show that with both
PCA and LDA the eight input features can be replaced by one feature that accounts for
most of the variance.
FIGURE 2.40
Eigenvalue percentage (EV %) versus feature number for PCA normalization (left) and LDA normalization (right).
To continue the feature reduction experiments, change the experiment name to pca. Go
to the Feature Selection window and use only the first two features by selecting First
N as the selection method and changing N to 2. Select Principal Components as the
normalization method on the normalization window. There should be an error rate of
25% on the evaluation data when you run the experiment with PCA. Change the experiment name to lda, select Linear Discriminant as the normalization method and run the
experiment one final time. You should get no errors on the evaluation data when normalizing with LDA. Figure 2.41 shows the decision region plots for these two experiments. The dimensions being plotted are the first two input dimensions after
normalization. It is possible to plot using the original dimensions by selecting Do Not
Normalize Data for Plot on the Decision Region Plot window.
FIGURE 2.41
Decision Region Plots when First Two Rotated Dimensions are Used (PCA and LDA)
The gnoise_var data base is unusual in that many features are noisy and contribute little
to discriminating the classes. Because principal components analysis looks for the greatest variance, it favored the lower dimensions and rotated the space to accentuate them.
An interesting exercise is to do a feature search on the gnoise_var data base with PCA
set. This will find the best set of rotated features. When this search is run, the best set
achieves an error rate of 21%. Although this is not as good as using the unrotated
dimensions with smaller variances, it is still better than using the original large variance
dimensions alone.
LDA assumes that the classes and their means can be modelled by unimodal Gaussian
distributions. Because this is correct in the case of gnoise_var, normalizing the data with
LDA produces a good classification result.
You need to select a small data base for the cross validation experiment. The iris data
base has the fewest patterns of any of the real data bases provided with LNKnet.
Select it on the Data Base window. The classes in the iris data base are three kinds of
iris flowers. The inputs are the sepal length and width and the petal length and width.
This data base was collected by R.A. Fisher [8] in the 1930s.
For the cross validation experiment, use the Radial Basis Function (RBF) classifier.
The RBF classifier uses a set of Gaussian basis functions to map the input space into
data clusters. In assigning a class to a pattern, the output for each class is a weighted
sum of the basis function outputs for the pattern. The RBF program trains the weights
connecting the basis functions to the outputs. LNKnet has another RBF program, IRBF
or incremental RBF, that also trains the means and variances of the basis functions. Both
programs use clustering algorithms to specify initial basis function locations.
This experiment will use the K-Means algorithm for clustering. The K-Means algorithm
generates a set of K cluster centers and assigns training patterns to these centers. It uses
these sets of patterns to iteratively improve the positions of the centers and to calculate
the final variances of the clusters.
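The K-Means procedure described above can be sketched as follows: assign each pattern to its nearest center, move each center to the mean of its patterns, and repeat. The centers are seeded here with a simple deterministic farthest-point rule; LNKnet's own initialization may differ.

```python
import numpy as np

def kmeans(X, k, n_iter=20):
    """Sketch of the K-means clustering used to place RBF centers."""
    # farthest-point seeding keeps this sketch deterministic
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d, axis=1)        # nearest center for each pattern
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # per-cluster variances, later used to size the Gaussian basis functions
    var = np.array([X[labels == j].var(axis=0) for j in range(k)])
    return centers, labels, var
```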
Select RBF as the current classification algorithm. The K-Means program will be run
automatically before the RBF program, if desired. On the RBF Parameter window,
shown in Figure 2.43, select Create clusters first and select Kmeans as the clustering
algorithm.
FIGURE 2.43
Callout: check to create clusters.
Now check the K-Means parameters. On the RBF window, select Clustering Parameters... to bring up the Kmeans Parameters window shown in Figure 2.44. On the
Kmeans window select Cluster by class, K equal for all classes, and set K to 2 centers
per class.
FIGURE 2.44
Select START to write the cross validation shell script. This script first generates five
sets of K-means clusters. The training data for these clusters is the same as will be used
to train the classifier. The clustering program will generate two clusters for each of the
three classes, for a total of six for each of the five cross validation folds. After the clustering is finished, the RBF program is called to train and test the five classifiers. A confusion matrix and error summary is generated for each of the classifiers. At the end, a
confusion matrix and error summary is displayed for the results of all of the testing. The
combined error rate for the cross validation experiment is 4%. This result is also
appended to the notebook file which is in Appendix C. The shell script and log file from
the cross validation experiment are also in Appendix C.
FIGURE 2.45
Exiting LNKnet
CHAPTER 3
Classifiers
FIGURE 3.1
MLP Parameters. Callouts: structure of network (first entry is the number of inputs, last is the number of classes; these two values are set automatically; other values are the number of nodes in each hidden layer; the bias node in each layer is not included); number of times to cycle through all training patterns.
This algorithm has the most options of any LNKnet program. MLP classifiers examine
all the data many times in training. The first option to set is the number of times to
examine the data. The next is the structure of the network which is contained in a
comma delimited list with the number of nodes in each layer of the network. The first
and last entries are the number of inputs and the number of classes. They are set automatically when a data base is chosen. Any other entries are the number of nodes in each hidden layer. There is a constant bias node in each layer of the network. It is not included in the list for the network structure. A gradient descent algorithm needs a step size, which is a multiplier applied to the gradient when the weights are updated. The main MLP parameter window is shown in Figure 3.1. Other LNKnet parameters are set on three additional parameter windows. These are displayed by selecting the three buttons on the main MLP window. These other options do not normally need to be changed and are included primarily for pedagogical purposes.

FIGURE 3.2
MLP Weight Parameters. Callouts: update weights after each trial or at the end of each epoch; if batch weight update, adapt the step size for each weight (set on START only); subtract a fraction of the step size for a weight if batch changes change direction; multiply weights by one minus the weight decay fraction each update; if the error of an output is less than the tolerance, do not train its weights; momentum of change in weights.
Parameters associated with training the weights are found on the MLP Weight parameters window shown in Figure 3.2. For most problems, the default settings for these
parameters are appropriate. In our MLP classifier, there are three options for changing
the step size during training. The step size for all weights can be held constant throughout each training run, the step size for all weights can be automatically reduced after a
set number of training epochs, or the step size of each weight can be adapted automatically. The step size change type selection must be coordinated with the weight update
mode, as described in the paragraph below. The initial step sizes can be the same for all
the weights in the network or a different initial step size can be set for each layer. In the
first case the initial step size is the one on the main MLP window. In the second, step
sizes for each layer are taken from the step size list on the MLP weight parameter window. Using this list, you can initialize the input weights of a network and then prevent
training of those weights by setting their step size to zero. There is a momentum term
which often reduces training time by moving weights in the direction of previous
changes. The weights can be systematically reduced by setting a weight decay parameter. This has the effect of pruning small weights. All weights are multiplied by one
minus the decay parameter on every trial. This is equivalent to adding a penalty term to
the cost function that penalizes large weights. There is a tolerance parameter in the
error, which turns off back-propagation if the output is within the tolerance limit of the
desired output. Finally, the magnitude of the random initial weights can be set.
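The weight update described in this paragraph, with its step size, momentum, and weight decay options, can be sketched as follows. Parameter names and default values are illustrative, not LNKnet's defaults.

```python
import numpy as np

def update_weights(w, grad, velocity, step=0.1, momentum=0.6, decay=0.0):
    """One gradient-descent weight update with momentum and weight decay."""
    # momentum moves weights in the direction of previous changes
    velocity = momentum * velocity - step * grad
    # decay multiplies all weights by one minus the decay parameter,
    # which has the effect of pruning small weights
    w = (1.0 - decay) * w + velocity
    return w, velocity
```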
Weight updates can be performed after each trial or in batches at the end of each epoch.
Fastest training typically is obtained by updating weights every trial after each training
pattern is presented. To automatically reduce the step sizes for all weights after a set
number of epochs of training, weight updates must be performed after each trial. If a
batch update is being used, it is possible to automatically set a step size for each network weight using the multiple adaptive step size algorithm. When the total correction
for a weight in one batch is in the same direction as in the previous batch, the step size
for that weight is increased. If the direction changes, the step size is reduced by a set
factor. Another factor in the speed of weight training is the order in which training patterns are presented. Remember to randomize the order of the patterns when using the
MLP classifier. The random order flag is set on the main LNKnet window.
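The multiple adaptive step size rule for batch training can be sketched as follows; the growth and reduction factors here are illustrative, not LNKnet's actual values.

```python
import numpy as np

def adapt_step_sizes(step, grad, prev_grad, increase=0.05, decrease=0.5):
    """Grow a weight's step size while its total batch correction keeps
    the same direction; subtract a fraction of the step size when the
    direction flips."""
    same = np.sign(grad) == np.sign(prev_grad)
    return np.where(same, step * (1.0 + increase), step * (1.0 - decrease))
```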
Several different versions of back propagation are available in this version of the MLP
classifier. Most differ in the cost function used to determine the error of the outputs. The
squared-error, maximum likelihood and cross-entropy cost functions are described in
[36]. Cross-entropy and maximum likelihood cost functions sometimes provide better
posterior probability estimates than a squared error cost function. The top-two difference cost function has been called the classification figure of merit by Hampshire [10].
It attempts to minimize the number of errors on training data and can be used with all
networks. It should normally be used with linear output nodes. A steepness of 1 uses a
maximally sharp sigmoid with the difference term and a steepness of 0 uses a maximally smooth sigmoid. The perceptron convergence procedure, which is an implementation of Rosenblatt's original single-layer perceptron, differs from the other cost
functions. It trains a single plane for each class which separates that class from all others. All of the patterns on one side of the plane for a class are considered to be in one
class and all patterns on the other side are in the other class. The perceptron convergence procedure can only be used when there are no hidden layers. This procedure is
normally only defined for two-class problems. In the LNKnet implementation, if there
are more than two classes, multiple perceptrons are trained simultaneously to discriminate each class from the others. The classification decision is made by determining the
perceptron with the highest unclipped output. The MLP Cost Function parameter window is shown in Figure 3.3.
When a squared error or top two difference cost function is being used, there are three
choices of output function. This output function is applied to the weighted sum calculated for the output layer. The output functions are a standard sigmoid, which goes from
0 to 1 with an output of 0.5 for an input of 0, a symmetric sigmoid which goes from -1 to
1, and a linear output, which simply gives the weighted sums as the final outputs of the
network. The hidden node sigmoid functions can be either standard or symmetric. There
is a steepness parameter for these node functions. This steepness parameter can be the
same for all nodes in the network or it can be set for each layer. A higher steepness
FIGURE 3.3
Callout: steepness parameter for the Differential cost.
value for the first hidden layer can sharpen the decision region boundaries for an MLP
classifier that has been initialized using bintree2mlp, which is explained in Chapter 7.
The MLP node function parameter window is shown in Figure 3.4.
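The two sigmoid node functions and the steepness parameter can be sketched directly:

```python
import numpy as np

def standard_sigmoid(x, steepness=1.0):
    """Standard sigmoid: runs from 0 to 1, with output 0.5 at x = 0."""
    return 1.0 / (1.0 + np.exp(-steepness * x))

def symmetric_sigmoid(x, steepness=1.0):
    """Symmetric sigmoid: the same shape rescaled to run from -1 to 1."""
    return 2.0 * standard_sigmoid(x, steepness) - 1.0
```

A higher steepness value sharpens the transition, which is why it can sharpen an initialized network's decision region boundaries.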
FIGURE 3.4
Callout: sigmoid function applied to hidden layer sums.
training, rather than update each of the Noutputs*Nnodes connection weights for each pattern, only the weights connecting the hidden nodes with the highest outputs are updated. If enough hidden nodes are used in training each pattern, the classification results with fast training are equivalent to those without it. Fast training is not normally required and should not be used.
FIGURE 3.5
RBF Parameters. Callout: run the clusterer to create new clusters (if not checked, read previously stored ones).
using gradient descent which tries to minimize the squared error in the final outputs.
Each of the three variables being trained, the weights, means, and variances, has its own
step size. There is one other difference between the LNKnet RBF and IRBF classifiers.
In the IRBF classifier, the variances in each dimension can be averaged, as they are in
the Gaussian classifier using grand variances.
FIGURE 3.6
IRBF Parameters
FIGURE 3.8
Untied Mixtures
ture or if all of the classes share a single set of tied Gaussian mixtures. Figure 3.8 illustrates the two types of Gaussian Mixtures. The lower dots in this figure represent the
Gaussian components and the upper dots represent outputs for each class. As with the
Gaussian classifier, the Gaussians in a mixture can have either diagonal or full covariance matrices. Similarly, there can be a separate variance for each Gaussian in the classifier model or the variances can be averaged giving a grand variance. The averaging
can be done over all of the Gaussians, so only one is estimated, or the Gaussians in each
mixture can be averaged separately, giving one variance per class.
FIGURE 3.9
FIGURE 3.10
Histogram Parameters. Callouts: assign N bins per input feature (if different bins for each class, assign N bins per class); specify the number of bins to assign for each class; start of first bin and end of last bin for all inputs for a uniform hypercube; increase the range of the histograms by multiplying by this factor.
3.2.3 Histogram
A histogram classifier [7] estimates the likelihood of each class by creating a set of histograms for each input feature. Input features are continuous-valued and each input feature is divided into a number of bins. The likelihood assigned to each bin is proportional
to the number of training patterns that fall in that bin divided by the bin width. In testing, the likelihoods for each input dimension are multiplied to give an overall likelihood
for each class. An optional per class diagonal Gaussian classifier can be used to determine the class of all patterns that fall outside histogram bins. Unlike the naive Bayes
classifier, the histogram classifier is designed for continuous valued inputs and provides
many alternative approaches to categorize continuous data by forming bins.
The LNKnet histogram classifier provides several options for dividing the input space
into bins. A fixed set of bins can be defined which evenly divides the space into smaller
hypercubes (uniformly segmented hypercube). This works best when all the input features have been normalized to have the same ranges. The bins can be autoscaled by calculating one set for each input feature (Separate bins for each input feature). This
allows for more variability across input dimensions. Finally, separate bins can be found
for each class. This allows the greatest flexibility in the histogram parameters, binning
each class and input feature. The range covered by the histogram can be multiplied by
FIGURE 3.11
the histogram range factor to classify test patterns found near the edges of the range
seen during training. Two methods are used for finding the edges of histogram bins. In
the first, the bins uniformly segment the covered range. This is usually better for classification. In the second method, the bins segment the space to give uniform numbers of
patterns in each bin. Where the data is denser the bins are thinner. This is usually better
for likelihood estimation.
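The two bin-edge rules can be sketched as follows. This is illustrative, not the histogram program's code; in particular, applying the range factor symmetrically about the midpoint of the covered range is one plausible reading of "increase the range of the histograms by multiplying by this factor".

```python
import numpy as np

def uniform_width_bins(x, n_bins, range_factor=1.0):
    """Edges that uniformly segment the covered range (usually better
    for classification), optionally stretched by the range factor to
    catch test patterns near the edges of the range seen in training."""
    lo, hi = x.min(), x.max()
    mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0 * range_factor
    return np.linspace(mid - half, mid + half, n_bins + 1)

def uniform_count_bins(x, n_bins):
    """Edges placed so each bin holds roughly the same number of
    patterns (usually better for likelihood estimation): bins are
    thinner where the data is denser."""
    return np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
```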
3.2.4 Naive Bayes Classifier
Unlike the histogram classifier, the naive Bayes classifier is explicitly designed for categorical data. It has become a popular classifier for processing the large amounts of data typical of data mining applications and is not necessarily naive or simple. Although its approach is straightforward, it often provides good performance that rivals that of more complex classifiers. The LNKnet graphical interface, shown in Figure
3.11, makes it possible to change the number of bins or values for each input feature.
This can be the same for all input features or it can be specified for each feature. Other
parameters, described below, can be edited by hand in the shell script produced by
LNKnet.
This classifier is designed for use with only categorical features and the categories must
be indicated by input features that take on integers ranging from zero to nvalues-1,
where nvalues is the number of different values for the input feature. Categorical features take on values that are not ordered in a meaningful way. An example would be an
input feature used to classify Internet web servers that was the name of the web server
host computer operating system. If there are 12 different types of operating systems,
then this input feature would take on 12 values. For use with LNKnet, the operating system input feature values must range from 0 to 11. Note that input features must be preprocessed to take on these integer values and they should not be further normalized
within LNKnet. For example, simple normalization as assigned in the LNKnet Feature Normalization window should not be used. This will change input feature values
to be non-integers that do not range from 0 to nvalues-1. Likewise, other forms of normalization should not be used.
Every implementation of naive Bayes classifiers must address three subtle issues. The
first is how to assign probabilities to bins containing values not seen in any training patterns. For example, if a feature can take values from 0 to 11, but the value 3 is never
seen during training, a non-zero probability must be assigned to the value 3 seen during
testing. The Laplace correction is used in this program because it often works well [19].
A less common variant can be selected by adding the -unity_laplace flag for the
nbayes_lnk command in the shell script that LNKnet produces. The second issue is how
to assign bin probabilities for categorical features when training patterns take on values
that are outside the expected range. For example, if the number of bins for a feature is set to 12, then feature values should range from 0 to 11. Other input feature values such as 12
or 21 are outside this range. This program creates an extra unseen bin for any feature
where this occurs. All patterns that fall outside the expected range are counted as falling
in this bin. These patterns can be ignored by adding the -ignore_unseen flag to the
nbayes_lnk command in the shell script that LNKnet produces. The third issue is how to
treat features in testing that take on values that are outside the expected range. This program ignores such features. If there are many features, and at least one takes on an
expected value, but all others take on unexpected values, then classification is still possible and will be based on the one feature. If all features take on unexpected values, then
no class will be selected and no classification decision will be made.
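The Laplace correction for bins never seen in training can be sketched for a single feature within a single class as follows. This illustrates the standard correction only; the -unity_laplace variant is not shown.

```python
import numpy as np

def laplace_bin_probs(values, n_values):
    """Laplace-corrected bin probabilities for one categorical feature
    within one class: add one to every bin count so that values never
    seen in training still receive a non-zero probability."""
    counts = np.bincount(values, minlength=n_values)
    return (counts + 1.0) / (counts.sum() + n_values)
```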
3.2.5 Parzen Window
For a Parzen window classifier [7,39], kernel functions are placed over each training
pattern. Kernel functions can be Gaussians or rectangular pulse functions. Kernel functions can be uniform, that is circular or square functions, or the length of each side can
be proportional to the variance of each input feature, that is elliptical or rectangular. All
kernel functions can have the same shape or there can be separate kernel function
shapes for each class. The class likelihood of an input pattern is the sum of the likelihoods for each kernel function in the class normalized by the number of training patterns in the class. The Parzen window classifier can map very complicated likelihood
functions with little training. The variance of all kernel functions is initially set equal to
the variance of the training data. This variance can be reduced or increased using the
variance multiplier.
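The class likelihood computation for the Gaussian-kernel case can be sketched as follows; the rectangular-pulse kernels and the per-class kernel shapes are not shown.

```python
import numpy as np

def parzen_likelihood(X_train, x, var):
    """Gaussian-kernel Parzen window likelihood for one class: place a
    diagonal-covariance Gaussian over every training pattern of the
    class and average the kernel outputs at the test point x."""
    d = ((X_train - x) ** 2 / var).sum(axis=1)
    norm = np.prod(np.sqrt(2.0 * np.pi * var))   # Gaussian normalizing constant
    return float(np.mean(np.exp(-0.5 * d) / norm))
```

A pattern is assigned to the class whose likelihood, computed this way over that class's training patterns, is highest.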
FIGURE 3.12
terns which pass the test are assigned to one node and those which fail are assigned to
another. Tests for the two new nodes are found and training continues until there are no
nodes that have training patterns from more than one class. Before testing, the tree can
be pruned to a set number of non-terminal nodes. This reduces the size of the tree and
can improve classification error rates in testing. To prune, the non-terminal node which
least affects the error rate on all the training data is found. It is made into a terminal
node and its children are removed from the tree. Nodes are cut until the desired number
of non-terminal nodes in the tree is reached. The BINTREE parameter window is shown
in Figure 3.17. The Split Using Linear Feature Combinations option should not be
used except for pedagogical purposes because the power of the BINTREE classifier
comes from the simpler single-feature splits performed by default.
FIGURE 3.17
ditions. The value of cbound for a particular problem must be selected empirically
using cross-validation. Figure 3.18 shows the SVM LNKnet window. The value for
cbound is set using the upper box labeled Lagrange Multiplier Upper Bound.
FIGURE 3.18
SVM Parameters. Callouts: Lagrange multiplier upper bound (cbound); output approximates posterior probability or is raw and unprocessed; Gaussian kernel standard deviation; polynomial kernel power; numerical tolerance used to decide when Lagrange multipliers are considered 0.0 and when KKT conditions are considered +/- 1.0; polynomial kernel scale factor, usually set to the number of inputs.
The kernel type determines whether an SVM is a simple linear discriminator or whether
it maps the inputs to a higher-order space. Kernel types are selected in the left middle of
the SVM window shown in Figure 3.18. It is possible to use linear kernels, Gaussian
kernels, polynomial kernels (x·y)^n, and inhomogeneous polynomial kernels (x·y + 1)^n.
Some kernels have free parameters; these are selected on the right middle of the SVM window. The standard deviation has to be selected for the Gaussian kernel and the order has to be selected for the polynomial kernels. In addition, the inner terms in the polynomial and inhomogeneous kernels can be divided by a scale factor before being raised to the power. This improves numerical stability if there are many input features. For example,
you could divide by 256 if there were 256 input features and the data was normalized to
a mean of zero and standard deviation of one. This scale factor is entered in the bottom
right box shown in Figure 3.18. Kernel locations are normally not stored for linear SVM
classifiers because they are not required for classification. To force storage of linear SVM kernels so that they can be plotted with the internals plot, check the box in the bottom middle of the SVM window.
SVM classifiers only discriminate between two classes and extensions are required for
multi-class problems. Two approaches can be selected using check boxes in the upper
left of the SVM window. The upper check box constructs M component binary classifiers which separate each class from all the remaining classes. During testing, the classification decision corresponds to the class of the component classifier with the highest
output (before the clipping nonlinearity). The lower check box constructs many more
simpler binary classifiers that separate all possible combinations of classes taken two at
a time. This results in M*(M-1)/2 simple classifiers. During testing, the class with the
most votes across all binary component classifiers is selected. In the case of ties, outputs
(before the clipping nonlinearity) for each class are scanned across all pairwise classifiers that include that class, and the minimum is found. These minimum values are compared to find the class with the highest minimum value. The final classification decision
corresponds to that class. The second pairwise approach sometimes provides better performance. Although it requires many more classifiers, they are simpler, and overall
training time is often similar across both approaches. For reference, the total number of
classifiers that will be created is printed in the upper middle of the SVM window.
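The pairwise voting and tie-breaking rule described above can be sketched as follows. This is a hypothetical helper, not part of LNKnet; it assumes the unclipped output of each pairwise classifier is positive for the first class of the pair and negative for the second:

```python
import numpy as np

def pairwise_vote(outputs, n_classes):
    """Combine M*(M-1)/2 pairwise SVM outputs into a class decision.

    outputs: dict mapping (i, j) with i < j to the unclipped output of
    the classifier separating class i (output > 0) from class j (< 0).
    """
    votes = np.zeros(n_classes, dtype=int)
    for (i, j), f in outputs.items():
        votes[i if f > 0 else j] += 1
    winners = np.flatnonzero(votes == votes.max())
    if len(winners) == 1:
        return winners[0]

    # Tie: for each tied class, scan outputs across all pairwise
    # classifiers that include that class and find the minimum; the
    # class with the highest minimum wins.
    def min_output(c):
        vals = []
        for (i, j), f in outputs.items():
            if i == c:
                vals.append(f)    # output for the first class is +f
            elif j == c:
                vals.append(-f)   # output for the second class is -f
        return min(vals)

    return max(winners, key=min_output)
```

For M = 4 classes this combines M*(M-1)/2 = 6 component classifiers, matching the count printed in the upper middle of the SVM window.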
Classical SVM classifiers provide zero/one outputs that indicate only whether the input
pattern belongs to class A or B. They do not provide posterior probabilities that can be
used to adjust differences in prior probabilities between training and testing, assign
costs to different types of errors, reject patterns, and form complete ROC curves. LNKnet software approximates posterior probabilities using an approach motivated by [4]
but simplified to use only training data. The output of each component SVM (before the
clipping nonlinearity) is fed into a sigmoid function with an output ranging from 0 to
1.0 and constrained to produce an output of 0.5 when the input is at the decision region
boundary (input = 0.0). This constraint preserves the error rate for component binary
classifiers when errors have equal costs. The slope of the sigmoid is selected during
training to minimize the mean squared error between the sigmoid output and desired
outputs of zero and one for the two classes. Training patterns with unclipped outputs
near +/- 1 (mainly support vectors) are weighted much less in this minimization because
internal parameters in the classifier have been tuned to produce outputs of +/- 1 for these
patterns. For multi-class problems, posterior probabilities are computed from the component classifiers. When M classifiers are generated for an M-class problem, posterior
probabilities are the M outputs for each class from the M component classifiers. When
M*(M-1)/2 pairwise classifiers are generated, the posterior probability for each class is
the minimum posterior probability output for that class across all pairwise classifiers. A
sigmoid is always fit to the output of every component classifier. This fit can be used or
ignored depending on the -sigmoid_fit flag. This flag should normally be used to provide an output that approximates posterior probabilities.
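The constrained sigmoid fit can be illustrated with a minimal sketch. This is not LNKnet's code: the down-weighting function for outputs near +/- 1 and the grid search over slopes are illustrative assumptions, but the sigmoid is constrained through 0.5 at an input of 0.0 exactly as described above:

```python
import numpy as np

def fit_sigmoid_slope(f, y, slopes=np.linspace(0.1, 10.0, 100)):
    """Fit the slope a of a sigmoid 1/(1+exp(-a*f)) mapping unclipped
    SVM outputs f to posterior estimates for targets y in {0, 1}.

    Only the slope is free: the sigmoid always passes through 0.5 at
    f = 0, preserving the component classifier's error rate when errors
    have equal costs. Patterns with outputs near +/-1 (mostly support
    vectors) are weighted much less in the fit.
    """
    f = np.asarray(f, float)
    y = np.asarray(y, float)
    # Illustrative weighting: near zero when |f| is close to 1.
    w = np.abs(np.abs(f) - 1.0) + 1e-3

    def err(a):
        p = 1.0 / (1.0 + np.exp(-a * f))
        return np.sum(w * (p - y) ** 2)  # weighted mean squared error

    return min(slopes, key=err)
```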
This implementation of SVMs uses an efficient, fast algorithm that scales well to problems with many features and many training patterns. It uses John Platt's Sequential Minimal Optimization (SMO) algorithm [34] as improved by Keerthi and Shevade [17]. The
core algorithm examines pairs of patterns (one from each class) and modifies Lagrange
multipliers using an analytic solution when patterns violate KKT conditions. Training
involves two-pass sweeps. In the first pass of a sweep, all patterns are examined one at a
time to find violations of KKT conditions. Lagrange multipliers are adapted when a violation is found. In the second pass, the subset of patterns found in the first pass that violate KKT conditions is examined repetitively and their Lagrange multipliers are
adjusted until such patterns satisfy KKT conditions. Adjustments always involve pairs
of patterns that do not satisfy KKT conditions. Another examination of all patterns to
find KKT condition violations begins the next sweep. Training stops when all patterns
satisfy KKT conditions. During training, two bias values (bounds on the bias of the hyperplane) are maintained and used by the algorithm. These high and low bias values initially differ and converge toward each other as training proceeds. After the algorithm completes, a
final independent check is made to make sure the solution satisfies all KKT conditions.
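The two-pass sweep structure can be sketched as follows. This is a structural sketch only; `violates_kkt` and `take_step` are hypothetical stand-ins for the real SMO routines (the first tests the KKT conditions for a pattern, the second pairs it with another violating pattern and analytically updates their Lagrange multipliers, returning True if anything changed):

```python
def smo_sweeps(patterns, violates_kkt, take_step):
    """Sketch of the two-pass sweeps described above (not LNKnet's code)."""
    while True:
        # First pass: examine every pattern once, adapting Lagrange
        # multipliers whenever a KKT violation is found, and remember
        # the violating subset.
        violators = [p for p in patterns if violates_kkt(p) and take_step(p)]
        if not violators:
            return  # all patterns satisfy KKT conditions: training stops
        # Second pass: repetitively sweep the violating subset until all
        # of its patterns satisfy KKT conditions, then start a new sweep.
        changed = True
        while changed:
            changed = sum(take_step(p) for p in violators if violates_kkt(p)) > 0
```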
LNKnet User's Guide (Revision 4, February 2004)
CHAPTER 3: Classifiers
A warning is printed along with diagnostics if the solution does not satisfy KKT conditions.
During training, information is printed out during each pass of every sweep when the
log file verbosity set in the Report Files and Verbosity window shown in Figure 2.3 is
greater than the lowest Overall Error Rate setting. The following is an example of a
table for a component classifier which separates classes 9 and 8 (digits 9 versus 8)
for the ocrdigit data base. A linear kernel was used, cbound was 1.0, and there were 120
training patterns.
TABLE 3.1

NSweeps  TChanged  KernelEvals  UnBounded  AtUpper  HiBias  LoBias  DeltaBias
1        58        3140         53         0        1.73    0.07    1.655375
1        60        3357         50         0        1.77    0.29    1.478128
2        124       9528         53         0        1.45    0.80    0.644528
2        124       9528         53         0        1.45    0.80    0.644528
3        186       16061        47         0        1.45    1.07    0.379679
3        204       17594        37         0        1.36    1.19    0.175545
4        246       20861        40         0        1.36    1.24    0.122306
4        655       50742        33         0        1.27    1.27    0.001997
5        655       50742        33         0        1.27    1.27    0.001997
The first column indicates the sweep number. As noted above, there are two passes per
sweep. The first pass examines all training patterns and the second examines only the
subset of patterns found in the first pass that violates the KKT conditions. The second
column indicates the number of patterns in a pass that violate KKT conditions. For
example, on the first pass through all 120 training patterns, 58 patterns violated the
KKT conditions. Lagrange multipliers for these patterns are all updated or changed.
The third column indicates the cumulative number of patterns whose Lagrange multipliers were adapted (the total number of changed Lagrange multipliers). For example, on the first pass, 58 adaptations occurred. A total of 655 adaptations were required to complete training. The fourth column shows the cumulative number of kernel evaluations required during training. When the number of input features is large, most of the
computation in this algorithm involves kernel evaluations. For this problem, more than
50,000 kernel evaluations were required to complete training. The fifth column shows
the number of support vectors that have non-zero Lagrange multipliers that are below
cbound. After training is complete, there are 33 non-zero support vectors below cbound
and none at the upper bound. All support vectors (unbounded and at the upper bound)
must be stored and used for classification. Support vectors at the upper bound correspond to patterns whose outputs are not +/- 1.0. These patterns may or may not be misclassified. The final three columns show the upper bias bound, the lower bias bound, and the difference between these bounds. See [17] for a description of these bounds
and how they are computed. After training is complete, the difference between these
bias bounds should be small and less than the KKT tolerance.
Any implementation of SVMs must address numerical precision limitations and the
desired accuracy of fit to KKT conditions. LNKnet software is designed for input features that have been normalized to have zero mean and unit variance. This is achieved in
LNKnet using simple normalization in the Feature Normalization window. In addition, the accuracy desired for KKT conditions can be adjusted. KKT conditions specify
that the unclipped component classifier output for non-zero support vectors below
cbound must be +/- 1. In practice, exactly producing outputs of +/- 1 may take excessively long and have little effect on classification performance. The tolerance (absolute
difference between actual and desired outputs) allowed around desired outputs of +/- 1
can be set on the bottom left of the SVM window. This value defaults to 0.001. It can be
increased, for example to 0.01, to reduce convergence time. It is also possible to set a
lower limit on Lagrange multipliers in the lower left of the SVM window. Lagrange
multipliers below this limit are set to zero. This defaults to 0.001. It can be lowered
when Lagrange multiplier adjustments are small and below the threshold. Evidence of small Lagrange multiplier adjustments below this limit is that the algorithm converges rapidly to a bad solution that doesn't satisfy KKT conditions without changing Lagrange multipliers on any training patterns. A warning will be printed with recommended changes if this occurs. This tolerance can also be increased if there are too
many small Lagrange multipliers.
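A minimal check of the standard SVM KKT conditions, in the spirit of the verification described above, can be sketched as follows. This is an illustrative sketch, not LNKnet's actual routine:

```python
def kkt_violations(alpha, y, f, cbound, tol=0.001):
    """Count KKT violations for a trained two-class SVM.

    alpha: Lagrange multipliers, y: labels in {-1, +1}, f: unclipped
    classifier outputs, cbound: upper bound C on the multipliers, tol:
    allowed absolute deviation of y*f from the desired value of 1.
    """
    count = 0
    for a, yi, fi in zip(alpha, y, f):
        m = yi * fi  # margin: should be >=1, ==1, or <=1 depending on alpha
        if a == 0 and m < 1 - tol:
            count += 1   # non-support vector inside the margin
        elif 0 < a < cbound and abs(m - 1) > tol:
            count += 1   # unbounded support vector not on the margin
        elif a == cbound and m > 1 + tol:
            count += 1   # bounded support vector outside the margin
    return count
```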
This algorithm converges (usually rapidly) to a good solution. Good solutions are found
for a wide range of parameter values. Extensive error checking is performed to verify
the final solution, and warnings and corrective suggestions are provided if KKT conditions are not satisfied. This occurs only when there are numerical precision problems. Such problems usually don't occur if (1) the data is normalized to zero mean and unit variance using simple normalization, (2) the Gaussian kernel standard deviation isn't too large compared to the number of input features, (3) the polynomial kernel divisor is roughly equal to the number of input features, and (4) there are no severe outlier data patterns that are far away from other patterns of the same class but among patterns of some other class. When KKT conditions can't be satisfied, the algorithm will still converge and warnings will be printed out stating why KKT conditions weren't met and how serious
this is. These warnings can sometimes be ignored because when they occur, classifiers
are created and they typically work reasonably well. The extent of KKT violation is
printed out and small violations of KKT conditions don't affect classification performance significantly. If warnings are printed out and KKT violations are substantial, try
a different kernel (e.g. Gaussian or polynomial instead of linear), try increasing cbound,
try a different approach to building multi-class classifiers, increase the polynomial divisor scale factor, or decrease the standard deviation of Gaussian kernels. In addition try
searching for obvious extreme outlier patterns that might be due to mislabeled data. For
high-order polynomial kernels and Gaussian kernels with large standard deviations,
lowering the Lagrange multiplier tolerance may help. For multi-class problems, warnings are printed out for each component classifier and the total number of warnings is
printed out when training is complete. Search for the string WARNING in the training
log file. This software has been successfully applied to large problems with many input
features and many training patterns. Memory requirements increase roughly linearly in
the number of input features and number of training patterns.
CHAPTER 4
Clustering
Several of the LNKnet classifiers initialize hidden nodes or other parameters using
pre-trained clusters. Each cluster has a mean and a diagonal covariance matrix. The
clusters can be trained on labeled or unlabeled training data. That is, a separate set of
clusters can be trained for each class, or a single set of clusters can be trained for all of
the training data. When clustering labeled data, a different number of clusters can be
generated for each class.
4.1 K-Means
The K-Means clustering algorithm [7] positions a set of K centers in order to minimize
the total squared error distance between each training pattern and its nearest center. It is
trained using multiple passes through all training patterns. During a single training
epoch each training pattern is assigned to its nearest center. The position of that center is
then moved to the mean of the patterns assigned to it.
In this implementation, the K centers are initialized using a binary splitting algorithm
first described in [4]. The program first places a single center at the mean of all of the
training data. This center is then split in two, with the resulting centers being moved
slightly away from the original center's position. These centers are then trained for a set
number of epochs or until the total error goes below a threshold. The algorithm then
splits the existing centers and proceeds as before. If, during training, a center ever has
no patterns assigned to it, that center is moved near the center which accounts for the
largest amount of the total error and training proceeds as before. When a non-binary number of centers is requested, the algorithm finds 2^⌈log2(K)⌉ centers, the power of two just above the requested number of centers, K. This set of centers is then pruned to bring
the number back down to K. Pruning eliminates first those clusters which account for
the least total variance.
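The binary-splitting procedure above can be sketched as follows. This is a sketch under stated assumptions, not LNKnet's C implementation: the split offset is taken as a fraction of the per-feature standard deviation, and the repositioning of empty clusters near the highest-error center is omitted for brevity:

```python
import numpy as np

def binary_split_kmeans(x, k, epochs=10, eps=0.01):
    """K-Means with binary-splitting initialization (illustrative sketch).

    Starts from one center at the data mean, repeatedly splits every
    center into two slightly separated copies and retrains, then prunes
    back to k centers by removing those accounting for the least variance.
    """
    centers = x.mean(axis=0, keepdims=True)
    spread = eps * x.std(axis=0)  # how far apart split copies are nudged
    while len(centers) < k:
        # Split each center into two copies moved slightly apart.
        centers = np.vstack([centers + spread, centers - spread])
        for _ in range(epochs):
            # Assign each pattern to its nearest center.
            d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            nearest = d.argmin(axis=1)
            # Move each center to the mean of its assigned patterns.
            for c in range(len(centers)):
                pts = x[nearest == c]
                if len(pts):
                    centers[c] = pts.mean(axis=0)
    # Prune: drop the centers accounting for the least total error.
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    nearest = d.argmin(axis=1)
    err = np.array([d[nearest == c, c].sum() for c in range(len(centers))])
    return centers[np.argsort(err)[::-1][:k]]
```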
FIGURE 4.1 K-Means Parameters (window annotation: when splitting a cluster, the two resulting centers are moved apart by this percentage of the variance of the cluster)
FIGURE 4.2 (window annotations: minimum variance of a cluster during training; stop a round of K-Means training early if the centers stop moving)
above K and then prunes, just as KMEANS does. This program can also use KMEANS
to initialize the clusters. In this case, the EM algorithm is only used at the end to adjust
clusters found by kmeans.
(Figure annotation: radii for clusters in each class, given as a comma-delimited list of floating point numbers)
FIGURE 4.4 (annotation: turn on randomization of training data to get random cluster centers)
CHAPTER 5
General LNKnet
Parameters
The design of LNKnet separates those parameters which are algorithm specific from
those which are general across most classification algorithms. This chapter discusses
those general LNKnet features which are available to most classification programs.
When the experiment action is N-fold cross validation, cross validation parameters can
be set below the action list. The user chooses whether to automatically divide the data
into training and testing folds or to read those fold assignments from a file. The format
of that file is described in Section 5.7 on page 89. The patterns can be randomized
before assignment to training and testing folds. Selecting Randomize patterns before
assigning to folds and changing the random number seed lets the user perform a series
of cross validation experiments to find an average classification error rate on the data.
Finally, at the bottom of the left side of the main window, the user can request to present training patterns to classifiers in a random order. The user can also set the random number generation seed. Changing this seed changes the values for random initial weights in some classifiers and the presentation order of randomized training patterns for all classifiers.
FIGURE 5.1 (experiment action choices: Train only; Test using the test file only; Train then test using the test file; Perform N-fold Cross validation on the training file)
The first button on the right side of the main window has a menu which selects the classification or clustering algorithm to use in the current experiment. The current algorithm
is displayed beside the menu button. Below the menu is a button which displays the
parameter window for the current algorithm. This window sets parameters specific to
the classification or clustering algorithm. The algorithm parameter windows are
described in Chapter 3 and Chapter 4.
Below the algorithm parameters button is a text field for setting the Experiment name
prefix. The experiment name is used for naming the files generated during an experiment. These include the shell script, screen file, log file, error files, plot files, and classifier parameter files. The prefix set here is added to the algorithm name to create the full
experiment name. For the window in Figure 5.2 the full experiment name is X1mlp.
Next on the right side of the main window is a column of buttons which display other
LNKnet popup windows. These windows are described in this chapter and in following
chapters. The first six are typically accessed in an experiment in the order they appear
on this window from top to bottom. The next three buttons display windows for performing further processing after an experiment has finished running. The last two buttons are for saving and restoring screen settings in a defaults file. This file, ~/.lnknetrc,
is read when LNKnet is started. A new set of defaults can be created by selecting Save
Screens as Default Initialization. The screens can be reinitialized to the settings in the
current defaults file by selecting Reinitialize screens from defaults.
FIGURE 5.2 Algorithm Menu (window annotations: current algorithm is MLP; display parameter window for current algorithm; display Report Files and Verbosities window; display Data Base Selection window; display Normalization window; display Feature Selection window; display Adjust A Priori Class Probabilities window; display the Plot Selection window, see Chapter 6)
FIGURE 5.3 Experiment Notebook (window annotations: directory where experiment and plot files will be stored; screen file produced by LNKnet containing settings for all parameters on all windows; setting to control the amount of data stored in the error files; produce test error files but not training error files)
the Data Base List scroll list. These description files all have the suffix .defaults in
their names. A new data base which does not yet have a description file will not be
listed in the Data Base list. Use the description file generation window to create the
missing description file. A data base can be selected from the scroll list or its name can
be typed (without the .defaults suffix) in the data file prefix field below the scroll list.
When a data base is selected, information about the data base is read from the description file. If LNKnet cannot find the description file, an error appears at the bottom of the
screen and a stop sign will appear beside the Data Base... button on the main window.
The description file can be created using the Description File Generation window
shown in Figure 5.5. When a data base is selected, LNKnet also finds the data files
included in the data base. The file name extensions for training, evaluation, and test
files are specified at the bottom of the data base window. The actual file names are obtained by appending the extension to the data base name. If LNKnet finds these files it
counts the total number of patterns and the number of patterns assigned to each class
in the file. To use fewer patterns than are present in a file, change the number of patterns
field. To reset the field, cause LNKnet to reread the file by reselecting the data base or
by putting the cursor on the file name extension and hitting return.
The description file generation window shown in Figure 5.5 allows the user to create or
modify a description file for a LNKnet data base. The user specifies the number of
input features, number of output classes, and labels for the input features and classes.
Selecting Generate writes the description file and adds it to the data base list on the data
base selection window. The user can select Cancel to leave the description file generation window without creating a description file. It is important to get this right. An error
in the data base description file can cause serious problems when an experiment is run.
FIGURE 5.4
FIGURE 5.5
5.4 Normalization
When a data file is read by a classifier, it is possible to perform preprocessing for normalization. The preprocessing methods available in LNKnet either scale or rotate the
input space. The normalization parameters for a data base are calculated using only
training data.
Simple normalization rescales each input feature independently to have a mean of 0
and a variance of 1. This compensates for the differences in the means and variances of
the input dimensions. This should always be used for MLP and SVM classifiers.
Principal components analysis (PCA) rotates the input space to make the direction of
greatest variance the first dimension. The remaining orthogonal dimensions correspond
to directions of decreasing variance in the original input space. PCA can be used to
reduce the number of input dimensions by first performing PCA and then selecting only
the top N most important PCA features.
Linear discriminant analysis (LDA) assumes that classes and class means can be
modeled using Gaussian distributions. It rotates the input space to make the first dimension the direction along which the classes can be most easily discriminated. The remaining dimensions are ordered by decreasing ability to be used to discriminate the classes.
The number of features after LDA normalization is the minimum of D and M-1 where D
is the original number of input features and M is the number of classes in the data base.
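The scaling and rotation normalizations above can be sketched with NumPy. This is illustrative only; LNKnet computes these parameters with its own normalization programs, always from the training data alone:

```python
import numpy as np

def simple_norm_params(xtrain):
    """Simple normalization parameters: per-feature mean and standard
    deviation, computed from training data only. Each feature is then
    rescaled as (x - mean) / std to have zero mean and unit variance."""
    return xtrain.mean(axis=0), xtrain.std(axis=0)

def pca_rotation(xtrain):
    """PCA rotation: eigenvectors of the training covariance matrix,
    ordered by decreasing variance. Keeping only the first n columns
    reduces the input to the n highest-variance directions. (LDA is
    similar in spirit but rotates to maximize class discrimination and
    yields at most min(D, M-1) useful dimensions.)"""
    x = xtrain - xtrain.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(x, rowvar=False))
    order = np.argsort(vals)[::-1]  # sort directions by decreasing variance
    return vecs[:, order], vals[order]
```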
The normalization method used in a LNKnet experiment is selected on the normalization window shown in Figure 5.6. Selecting a normalization method sets the normalization file name. This file is stored in the data base directory which is set on the data
base window. If this file does not exist an error will appear at the bottom of the window
and beside the Feature Normalization... button on the main window. The normalization file can be created on the Normalization File Generation window shown in Figure
5.7.
FIGURE 5.6 (window annotations: select normalization method; normalization file in data directory; select to generate a normalization file)
FIGURE 5.7 (window annotation: normalization method and file name from the normalization window; the file will be stored in the data directory)
The feature list file generation window writes and runs shell scripts that create and plot
feature lists. When Run is selected on this window a shell script is written to the experiment directory and run. If Only store shell script, do not run is selected on the main
window, the shell script is not run. When the shell script is run, status information from
the feature search is printed to a log file and to the window LNKnet was originally
started in. The shell script creates a feature list file. The features in this file are plotted if
Generate Plot is checked in the lower half of the window. This plot can also be generated without repeating the feature search by selecting the Plot Only button. Selecting
Cancel stops the feature selection shell script and removes the generation window. The
shell script, log file, and plot file names are automatically generated based on the feature
list file name which is set on the feature selection window.
To run a feature search, the program generates a series of feature lists and for each list
performs a cross validation test on a classifier. The classification algorithm used in the
tests can be a nearest neighbor algorithm with leave-one-out cross validation or any
LNKnet classification algorithm with N-fold cross validation. The classifier used in the
second case is the one selected on the main LNKnet window. If the classification algorithm uses a clustering algorithm for initialization, the clustering algorithm for feature
searches is Kmeans, not the algorithm selected on the classifier's parameter window.
There are three directions for the selection of features for inclusion in the feature lists
tested by the classifier. The search can go forward, backward, or forward and back. In a
forward search, each feature is tried singly and the feature which gets the best classification rate is selected as the first feature. The remaining features are tested in combination
with the first feature and the best of them is added as the second feature. Features are
added this way until none are left to add.
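The greedy forward search described above can be sketched as follows, where `score` stands in for the cross-validation classification accuracy of a classifier trained on a given feature subset:

```python
def forward_search(features, score):
    """Greedy forward feature selection (illustrative sketch).

    Repeatedly adds the feature that, combined with those already
    selected, gives the highest score, until none are left to add.
    Returns the features in the order they were added."""
    selected, remaining = [], list(features)
    while remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

A backward search runs the same loop in reverse, repeatedly removing the feature the classifier performs best without.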
FIGURE 5.9
In a backward search, the program starts with all of the features selected and tries leaving each one out. The feature the classifier performs best without is removed and becomes the last feature in the list. The program goes on taking features away until none are left. The idea of a
backward search is that there may be some set of features which do well when they are
together but which do poorly individually. This set of features would not be found by a
forward search.
A forward and backward search combines the two search methods above. The program
starts searching forward with no features selected. When it has added two features, it
searches for one to take away. It continues then, adding two and taking away one, until it
has added all of the available features. This forward and backward search can find some
interdependencies in the input features which are not found using the other two
searches.
It is possible to stop a feature search early, when there are N features on the list. In the
case of a forward search this is when N have been selected. For a backward search this
is when there are N features left.
The feature selection plot shows the error rate for sets of features found during a feature
search. The X and Y limits of the plot can be chosen by the user or the plot program can
choose them using the autoscale flag. The X dimension is the feature added to the feature list to generate the classification error rate given in the Y direction. An example of
a feature selection plot is in Figure 2.37 on page 43.
FIGURE 5.10

TABLE 5.1 Percent Error Rates with two sets of data from the same Two Class Problem

Training File Ratio     Priors Adjustment     Test Error Rate with given Ratio of
of Class A to Class B   to Training Data      Class A to Class B during Testing
                                              1:1         10:1
1:1                     No Adjustment         16.7        15.05
1:1                     Priors Adjustment                  6.5
1:1                     Priors Adjustment                  6.55
10:1                    No Adjustment         28.7        6.5
10:1                    Priors Adjustment     16.9
10:1                    Priors Adjustment     16.6
Table 5.1 shows that the testing error rates can be greatly reduced by priors adjustment
when testing and training priors differ substantially. With two evenly sampled Gaussian
classes, an error rate of 16.7% is expected with decision boundaries equidistant from the
means of the classes as shown in the first row of Table 5.1 in the column labeled 1:1.
When there are considerably fewer patterns from one class, the overall error rate can be
88
improved to 6.5% by moving the boundary closer to the undersampled class's center.
This greatly reduces the error rate for the more common class and increases the error
rate for the undersampled class. When evenly sampled data is used for training and 10 to
1 unevenly sampled data is used for testing, the error rate is near 15%, as shown in the
10:1 column, unless some adjustment is made. Either method of priors adjustment can
be used to bring the overall error rate down to 6.5% on the unevenly sampled data as
shown in the second and third row of Table 5.1 under the column labeled 10:1. Conversely, when 10 to 1 unevenly sampled data is used in training and evenly sampled
classes are used for testing the error rate is above 28%, as shown in the fourth row of
Table 5.1 under the column labeled 1:1. Priors adjustment by sampling the training
data uniformly or scaling the outputs brings the class error rates back to roughly 16.7%
on evenly sampled data as shown in the bottom two rows of Table 5.1.
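Priors adjustment by scaling the outputs can be sketched as follows. This is a minimal sketch of the scaling method only (the sampling method changes the training data instead), and it is not LNKnet's exact computation:

```python
import numpy as np

def adjust_priors(posteriors, train_priors, test_priors):
    """Rescale classifier posteriors for a prior-probability mismatch.

    Each class output is multiplied by test_prior / train_prior for
    that class and the result is renormalized to sum to one.
    """
    p = np.asarray(posteriors, float) * (
        np.asarray(test_priors, float) / np.asarray(train_priors, float))
    return p / p.sum(axis=-1, keepdims=True)
```

For example, a pattern scored 0.6 / 0.4 by a classifier trained on evenly sampled classes shifts strongly toward class A when the test priors are 9 to 1.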
FIGURE 5.11 (window annotations: do cross validation; automatically assign data to cross validation folds OR read assignments of patterns to folds from the cross validation file, vowel.train.cv)
The most significant task in cross validation is the assignment of patterns to their training and testing folds. This can be performed automatically or by hand. The algorithm
which does the automatic fold assignments attempts to preserve class prior probabilities while keeping the size of test folds constant. If Randomize Patterns Before
Assignment is selected, the fold assignments depend on the random number seed. The
user can test a classifier several times by rerunning an experiment with different seeds.
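The automatic fold assignment can be sketched as follows. This is a sketch assuming round-robin dealing of each class's patterns across folds, which roughly preserves class priors while keeping fold sizes constant; LNKnet's exact algorithm may differ:

```python
import random

def assign_folds(labels, n_folds, randomize=False, seed=0):
    """Assign each pattern to a cross-validation fold, stratified by class."""
    rng = random.Random(seed)
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    folds = [0] * len(labels)
    for idxs in by_class.values():
        if randomize:
            rng.shuffle(idxs)  # randomize within class before assignment
        for j, i in enumerate(idxs):
            folds[i] = j % n_folds  # deal patterns round-robin to folds
    return folds
```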
If the training data is collected from different places or at different times, it can be
important that the data from different collection conditions is split up evenly during
cross validation. In such a situation, the training data can be split into folds by hand.
Because the specifications for these divisions are complicated, they are stored in a cross
validation file. Patterns are divided into partitions called splits and then the splits are
assigned to cross validation folds for training and for testing. The name of the cross validation file is set by appending the cross validation file extension, .cv, to the training
data file name, which is set on the data base window shown in Figure 5.4.
In the example below, there is a speaker independent speech recognizer which is being
trained on twenty-one patterns taken from four speakers. To complicate matters, speaker
1 and speaker 3 sound very similar, so speaker 3 should not provide training data for speaker 1's test and vice versa. Also, there is some data for three of the speakers which I don't want to include in the tests. To make the testing folds easier to understand, I have added an empty split to the middle of the fourth speaker's data.
Figure 5.12 shows the speakers for each data pattern. Figure 5.13 is a list which is used
to split the data up by speaker while identifying the patterns which will not be tested.
Finally, Figures 5.14 and 5.15 are bit vectors for the train and test folds which identify
the splits to use for each. Figure 5.16 shows the cross validation file 4SPEAK.train.cv
which specifies the fold assignments for this cross validation experiment. There are
backslashes at the end of the first three lines to indicate that there are more flags on the
following line. The backslashes are immediately followed by a carriage return. When
using the backslashes, you must be careful to remember to put spaces after the comma
delimited lists. The backslash character and carriage return do not count as spaces.
FIGURE 5.12
FIGURE 5.13
Splits: (first and last pattern in each split, -1:-1 is an empty split)
0:1,2:2,3:4,5:5,6:6,7:8,9:11,12:12,13:14,15:17,-1:-1,18:20
FIGURE 5.14
Testing folds: (do not test on middle split for each speaker)
101000000000,000101000000,000000101000,000000000101
FIGURE 5.15
FIGURE 5.16
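The split list and fold bit vectors shown above can be parsed as follows. These are hypothetical helpers, not part of LNKnet:

```python
def parse_splits(spec):
    """Parse a comma-delimited split list like '0:1,2:2,-1:-1' (first and
    last pattern in each split; -1:-1 is an empty split) into lists of
    pattern indices."""
    splits = []
    for item in spec.split(','):
        first, last = map(int, item.split(':'))
        splits.append([] if first < 0 else list(range(first, last + 1)))
    return splits

def fold_patterns(bits, splits):
    """Expand one fold's bit vector (one character per split) into the
    pattern indices that fold uses."""
    return [p for b, s in zip(bits, splits) for p in s if b == '1']
```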
CHAPTER 6
Plots
Some of the most visible and useful features of LNKnet are the many types of plots produced. All of the classifiers and clusterers can produce decision region plots that can be
overlaid with a scatter plot of the data and an internals plot of classifier parameters.
Those classifiers which have continuous outputs can produce a profile plot and a histogram plot of the data. Error files from these classifiers can be used to produce posterior
probability plots, receiver operating characteristics (ROC) curve or detection plots, and
rejection plots. Incrementally trained classifiers, those which go over the training data
multiple times, can use the cost plot or percent error plot. Many classifiers have a structure plot which shows connections between classifier nodes. Figure 6.1 shows the LNKnet window used to select these plots. Once a plot file has been generated, it can be
redisplayed or printed from the Preview and Print window described in Section 7.1.
There are also plots for showing the results from a normalization run or from a feature
search. These plots are generated on the Normalization File Generation window and the
Feature List File Generation window respectively. The normalization plot and feature
list plot are explained in Chapter 5.
FIGURE 6.1
to identify the classes of the patterns. Increasing the Level of Detail to 2 changes the
plot symbols to capital letters for each class. If color is used, and there are two input features, squares with the same color as the background region are classified correctly and
those with differing colors are classified incorrectly. If there are more than two input
features, it may be necessary to limit the number of patterns displayed in the scatter plot
to get this same result. By NOT selecting Show All Data and setting the distance limit,
it is possible to plot only those patterns that fall close to the decision region plane. When
highlight misclassified data points is selected, these misclassified points are shown as
grey and correct patterns are shown colored. In a black and white plot, misclassified
points are shown normally and correct patterns are shown as tiny dots.
Finally, an internals plot can be overlaid on the scatter plot and decision region plot.
The form of the internals plot depends on the algorithm. There are three basic types.
Three classifiers use lines to show the internals. The multi-layer perceptron shows the
planes defined by the first hidden layer. The binary tree classifier shows the node tests
for each non-terminal node. The histogram classifier shows the edges of the histogram
bins. The second type of internals plot uses ovals, circles, or rectangles to show the size
and position of Gaussians, spheres, or hyper-rectangles used in classification or clustering. RBF, GAUSS, and HYPER are examples of algorithms which have this type of
internals plot. The global scale factor can be used to alter the size of these figures.
Finally, the nearest neighbor algorithms which use only the positions of centers to determine the class show small squares for each stored center. KNN and LVQ are examples
of this type of algorithm. When the level of detail is raised to 3, the internals plot elements are labeled by class or by node number.
There are two other features that are related to plotting with many-dimensional data
bases. First, two plotting dimensions can be selected. The dimension numbers are
counted from zero. The plot axes limits can be specified by the user or they can be set
automatically based on the range of the scatter plot data. Second, the values for the
non-plotted dimensions can be set using a comma delimited list. The list must have
values for all dimensions, including X and Y. The X and Y settings will be ignored. For
example, if there are five input features and dimensions 0 and 4 are plotted, then the list
0,1,-75,0.5,0 sets the second, third and fourth dimensions to 1, -75, and 0.5 when
decision regions are plotted for dimensions 0 and 4. If no settings are provided, all of the
other features are set to 0 when the decision region plot is generated. Combining these
two features, selection of the plotted features and setting values for non-plotted features,
it is possible to gain some understanding of the shapes of multi-dimensional decision
regions.
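The way the comma-delimited list supplies values for the non-plotted dimensions can be sketched as follows. Here make_probe_pattern is a hypothetical helper, not a LNKnet routine, that builds the full input vector for one point of the decision region grid.

```python
def make_probe_pattern(fixed_values, x_dim, y_dim, x, y):
    """Build one classifier input for a decision region plot: start
    from the comma-delimited list of per-dimension values (the entries
    for the plotted X and Y dimensions are ignored) and overwrite the
    plotted dimensions with the grid coordinates."""
    pattern = list(fixed_values)
    pattern[x_dim] = x
    pattern[y_dim] = y
    return pattern

# Five input features, plotting dimensions 0 and 4, as in the text:
# the list 0,1,-75,0.5,0 fixes features 1, 2, and 3.
fixed = [0, 1, -75, 0.5, 0]
print(make_probe_pattern(fixed, 0, 4, 0.25, 0.75))  # → [0.25, 1, -75, 0.5, 0.75]
```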
There is one final feature that relates to the plotting dimensions. The plots can be generated for data before or after normalization. In the case of simple normalization, this
will change the values on the axes. When PCA or LDA normalization is being used, this
means that the decision regions can be generated using the original input dimensions or
using the rotated dimensions generated by the normalization. When the plots are being
generated for un-normalized data, there will be no internals plot. The internals plots are
derived from classifier parameters that were trained in the normalized data space and
they cannot easily be translated back into the un-normalized space.
FIGURE 6.2
A color coded histogram plot is displayed in the lower half of the plot window. The
line being sampled for the profile plot is divided into a number of bins. The number of
bins is the Number of Intervals per Dimension. Each pattern is tested for its distance
from the plotted line. If Show All data Points is selected, all patterns are included in the
histogram. If Show all data is NOT set, only those patterns closer to the profile plot line
than the distance limit are plotted. Each included pattern is assigned to a bin based on
its X value. A colored square is drawn for it in that bin. If the profile plot has been selected, the patterns are also tested using the current classifier. If the pattern is classified correctly, its square is drawn above the histogram baseline. If the pattern is misclassified, the square is drawn below the baseline.
When the autoscale flag is set, the horizontal or X axis limits are set according to the
range of the input data in the dimension being plotted. The two vertical or Y axes are
scaled by the range of the profile plot outputs and histogram bin heights. The user has
the option of specifying the horizontal axis limits and the profile plot vertical axis limits.
As in the decision region plot, the values for the non-plotted dimensions can be set
using a comma delimited list. The list must have values for all dimensions, including the
plotted dimension X. The X setting will be ignored. For example, if there are five input
features and dimension 0 is plotted, then the list 0,1,-75,0.5,0 sets the second, third,
fourth, and fifth dimensions to 1, -75, 0.5, and 0 when a profile is plotted for dimension
0. If no settings are provided, all of the other features are set to 0 when the profile plot is
generated.
Also as in the decision region plot, the profile and histogram plots can be generated for
data before or after normalization. In the case of simple normalization, this will
change the values on the X axis. When PCA or LDA normalization is being used, this
means that the output profiles can be generated using the original input dimensions or
using the rotated dimensions generated by the normalization.
FIGURE 6.3
Figure 6.5 shows a decision region plot and a structure plot for a binary tree classifier
trained on the XOR problem.
FIGURE 6.5
FIGURE 6.6
FIGURE 6.7
Support vector machine structure plots for the 10-class vowel problem using the each class versus others multi-class mode on the left and the all two-class combinations multi-class mode on the right.
Support vector machine internals plots show the location of support vectors. An example for the vowel problem is shown in Figure 6.8. Support vectors that are at the
Lagrange multiplier upper bound (cbound) are shown as circles and support vectors that
are below this bound are shown as circles around an x. Internals plots with a linear kernel will show support vector locations only if the Store linear support vectors check box is filled in on the SVM parameters window shown in Figure 3.18.
FIGURE 6.8
Decision regions for a support vector machine classifier for the vowel problem where the internals plot shows the locations of support vectors.
magnitude of the weights can be shown by increasing the thickness of the lines for large
weights. The maximum thickness of these lines can be set by the user. This can reveal
the importance of particular hidden nodes. Negative weights are drawn as hollow tubes
or are colored orange when weight magnitudes are shown. The plot can also only display those connections with weight magnitudes above a certain value. This can help
clarify plots for classifiers with many hidden nodes.
For the Gaussian mixture classifier, the hidden nodes can be combined into one tied
mixture shared by all the class output nodes or each class can have its own mixture of
Gaussian nodes. In the first case all the hidden nodes will be connected to all the output
nodes. In the second case the nodes assigned to each class mixture will be connected
only to the output node for that class. Figure 6.9 shows a structure plot and decision
region plot for a Gaussian mixture classifier trained on the XOR problem.
FIGURE 6.9
For the two radial basis function classifiers, all hidden nodes are connected to all output
nodes. The RBF classifiers can also have a constant bias hidden node. If so, it is represented as a small square beside the other hidden nodes. Figure 6.10 shows a structure
plot and decision region plot for a Radial Basis Function classifier trained on the XOR
data base.
6.3.5 Multi-Layer Perceptron Structure Plot
For the multi-layer perceptron, there can be up to 10 layers of nodes with an input layer
at the bottom of the plot, as before, and an output layer at the top. All the connections
between all the layers are weighted. The weight magnitudes can be shown and connections with very small weights can be made invisible. Negative weights can be shown as
hollow lines or they can be colored orange. Each MLP layer has a constant bias node
which can be displayed as a small black square beside the weights for that layer.
FIGURE 6.10
Figure 6.11 shows a structure plot and decision region plot for a Multi-Layer Perceptron classifier trained for 300 epochs on the XOR problem.
FIGURE 6.11
Cost and percent error plots show the cost or percent error over training. Examples of these plots are in Figure 2.18 and Figure 2.19 on page 27 in the tutorial. For a percent error plot, the classification error rate for each successive group of N patterns is calculated and plotted. A cost plot does the same with the cost information stored in the training error file. The default value for N is the number of patterns in each training epoch. These plots can be scaled automatically, based on the range of the data and the number of patterns represented, or the user can manually set the axes limits.
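The underlying computation can be sketched in a few lines (a plain illustration, not LNKnet code): given one misclassification flag per training pattern, each plotted value is the percent error over one group of N patterns.

```python
def percent_error_curve(errors, n):
    """errors: one flag per pattern, 1 = misclassified, 0 = correct.
    Returns the percent error for each successive group of n patterns."""
    return [100.0 * sum(errors[i:i + n]) / len(errors[i:i + n])
            for i in range(0, len(errors), n)]

# Two epochs of four patterns: one error in the first epoch, none in
# the second.
print(percent_error_curve([0, 1, 0, 0, 0, 0, 0, 0], 4))  # → [25.0, 0.0]
```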
FIGURE 6.12
Autoscale plot
FIGURE 6.13
There are two formats for the plot. There is a scatter plot, which plots the observed posterior probabilities in each bin against the average output value for the patterns in the bin. A line is drawn along the diagonal (posterior=output) to indicate where perfect posterior probability outputs should lie. The second form of the plot displays a pair of values for each bin. The observed posterior probabilities are drawn as blue circles with lines indicating plus and minus two standard deviations. The average bin output value is indicated with an X. For both versions of the plot, if the observed probabilities are within two standard deviations of the average bin output, indicated by the lines or the Xs, then the classifier is adequately modeling the posterior class probabilities.
To generate the plot, test patterns are assigned to bins according to their output values
for the given class. These bins can be uniformly placed from zero to 100 or the ends of
the bins can be specified using a list of floating point numbers. When specifying the bin
ends, remember that there is one more end than there are bins. The quality of the posterior probability fit is determined using a chi-squared fit. To ensure that there are enough patterns in each bin to make that number valid, a minimum number of patterns per bin (typically 5) is enforced. Bins with too few patterns are combined with neighboring bins until the minimum is met in all the remaining bins.
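The bin-merging step can be sketched as follows. merge_small_bins is a hypothetical illustration (LNKnet's own merging order may differ): it folds each undersized bin into its right neighbor, the last into its left, until every remaining bin meets the minimum.

```python
def merge_small_bins(counts, min_count=5):
    """Fold each undersized bin into its right neighbor (the last bin
    folds left) until every remaining bin holds at least min_count
    patterns.  A simplified sketch of the merging described above."""
    merged = list(counts)
    i = 0
    while len(merged) > 1 and i < len(merged):
        if merged[i] >= min_count:
            i += 1
        elif i + 1 < len(merged):
            merged[i] += merged.pop(i + 1)   # fold right neighbor in
        else:
            merged[i - 1] += merged.pop(i)   # last bin folds left
    return merged

print(merge_small_bins([2, 7, 1, 1, 6]))  # → [9, 8]
```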
The plotted values can be printed to a table in the log file. A small table showing the chi
square value for the bins and the quality or significance of the fit can be added. For this
plot, higher significance values indicate better fits. Significance values of less than 0.05
are labeled poor. These chi and significance values are also printed to the experiment
notebook file. The axes limits of the plot can be changed to examine smaller sections of the plot area. The default settings are 0-100% probability on both axes.
The actual number of patterns and the number of patterns in the target class for each bin
are printed above the upper two standard deviation indicator. This label text can be left
off by selecting No Text on Plot. To use the posterior probability plot with likelihood
classifiers, it is necessary to normalize the outputs to sum to one. Figure 6.14 shows
the LNKnet parameter window for the posterior probability plot. Figure 2.20 shows an
example of a posterior probability plot.
FIGURE 6.14 The posterior probability plot parameter window. Callouts: plot type; split range evenly into bins or specify the ends of each bin.
ROC curve for a random classifier, equally likely to return a given output value for any
class. Because an ROC curve depends on only one output value, the ROC area does not
necessarily indicate the quality of the classifier in classification, where all outputs are
compared and the class of the maximum output is chosen.
A table of ROC plot data can be printed to the experiment log file. This table can
include interpolated values or it can include the values of all the points in the plot. The
ROC area is also printed to the log file and to the experiment notebook file. A fraction
of the patterns can be rejected, eliminating from the plot those patterns with the lowest
output values. The plot axes limits can be altered to more closely examine a certain section of the ROC curve. Figure 6.15 shows the LNKnet parameter window for the ROC
plot. Figure 2.21 shows an example ROC plot generated during the tutorial.
FIGURE 6.15 The ROC plot parameter window. Callout: target class.
If the patterns which cause errors have low maximum output values, the error rate on the
remaining patterns can be reduced by setting the threshold to reject those low scoring
patterns. As with the ROC plot, the rejection plot can print a table of plot values to the
log file. The outputs for each pattern can be normalized to sum to one, which is important if a likelihood classifier is being used. The scale of the plot can be changed to focus
on a particular section of the curve. Figure 6.16 shows the parameter window for a
rejection plot. An example rejection plot can be found in Figure 2.22.
FIGURE 6.16 The rejection plot parameter window. Callout: verbosity of table printed to the log file.
FIGURE 6.17
Internals Plots and Scatter Plot for MLP training on Gap.train, 1 epoch per batch, 10 epochs total
Figure 6.17 was imported using a MIF file. Alternatively, the plot2ps tool can be used to convert a .plot file to a PostScript file, which can be imported into other document preparation programs or printed on a PostScript printer. PostScript files can also be converted to many other graphics formats using standard plotting utilities. MIF and PostScript files can be created on the Preview and Print window described in Section 7.1.
A final approach to including plots in reports is to edit the shell script written by LNKnet, adding the -mac flag and the other flags beginning with -mac to all plotting commands. This produces new files containing the x,y coordinates of all points in the plots, suitable for importing into a spreadsheet on a Macintosh or IBM PC. These points can be used by programs such as Delta Graph or Excel to create carefully formatted and annotated plots.
Olxplot Commands (menu items; the original table also listed a keyboard command and function for each):
File->Print->(printer name)
File->Print->Printer...
File->Save
File->Quit (Quit olxplot)
Clear
Next
Prev
Overlay->Enable
Overlay->Disable
Help
CHAPTER 7
This chapter describes additional features and tools that are available with the LNKnet
package. Some of these tools (the Preview and Print window, C code generation from a
parameter file, and committee data base creation) are available from the LNKnet graphical user interface. The others (batch file creation from LNKnet shell scripts, multi-layer
perceptron initialization using binary tree parameters, data file creation with normalized
patterns, and data exploration with xgobi) must be run from a shell or in a shell script.
FIGURE 7.1 The Preview and Print window. Callouts: select an action to take on the files below; select files to view, print, or translate; select an alternate printer (the print command used here is lpr -Pbw1).
Classify() takes a raw input pattern and a pointer to an output vector. It normalizes the
inputs and performs any feature selection that was used with LNKnet to train the classifier. For this, the classify routine uses the function normalize() which is included in the
generated C file. The routine then calculates classifier outputs using the classifier
parameters taken from the algorithm parameter file. These outputs are copied into the
output vector and the index of the output with the largest output value is returned as the
class of the input pattern.
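In outline, the generated routine behaves like the following Python sketch. The real output is C code; the simple normalization and the linear per-class outputs here are placeholder assumptions standing in for the trained classifier parameters.

```python
def classify(raw_pattern, means, std_devs, weights, biases, outputs):
    """Sketch of the generated classify() routine: normalize the raw
    inputs (simple normalization assumed), fill in one output per
    class, and return the index of the largest output as the class."""
    normed = [(x - m) / s for x, m, s in zip(raw_pattern, means, std_devs)]
    for k, (w_row, b) in enumerate(zip(weights, biases)):
        outputs[k] = b + sum(w * x for w, x in zip(w_row, normed))
    return max(range(len(outputs)), key=lambda k: outputs[k])

# Hypothetical two-input, two-class linear parameters:
outs = [0.0, 0.0]
cls = classify([1.0, 0.0], [0.0, 0.0], [1.0, 1.0],
               [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], outs)
# cls → 0, outs → [1.0, -1.0]
```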
To create a C subroutine file, first train a classifier as described in the tutorial. Bring up
the C File generation window, shown in Figure 7.2, by selecting C Code Generation...
on the main LNKnet window. The subroutine name suffix field sets an extension for the
classify routine name. For the window in Figure 7.2 the subroutine would be
classify_XORgauss(). To have no subroutine suffix, make the field blank. Select Generate C Code File to write and run a shell script that creates the C subroutine file. An
example parameter file for a Gaussian classifier trained on the XOR problem, the C subroutine classify_XORgauss() produced from it, and a short program that uses the subroutine to generate a decision region plot are included in Appendix C.6 on page 159.
FIGURE 7.2
C code window
and no training files have been created yet. Selecting Generate Committee Data Files
writes and runs a shell script that generates a committee data file for each of the
requested file types. If any files are missing or the wrong size, the shell script is not run
and an error appears. A description file for the data base is also created. The number of
classes in the data base is taken from the number of output classes field near the bottom
of the committee data base generation window. The class labels are copied from the current data base selected on the data base selection window. The number of input features
in the data base is the number of classes times the number of classifiers in the committee. The input labels are generated from the experiment names and output numbers. The
input labels are more fully described in Section 8.4 on page 127. The data files and data
base description file are stored in the experiment directory.
FIGURE 7.3
from another script, just like any other C-shell script. Note that this script is different
from normal LNKnet shell scripts because outputs are stored in the log file but are not
printed to the shell window and plot files are created but the plots are not displayed on
your workstation screen.
An example of using a script in a batch mode would be an experiment which tests different weight step sizes for a Multi-Layer Perceptron. This can be done interactively
from the LNKnet interface, but it might be faster to make a single script. The user could
then edit the script to make the step size parameter a variable and put the classifier training and testing commands in a loop. The new script would cycle through a list of step
size values, training and testing a classifier for each value.
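The step-size loop described above would ordinarily be written in the C-shell script itself; as a language-neutral illustration, this Python sketch prints one train/test command pair per step size. The command strings and flag names below are placeholders, not actual LNKnet syntax — the real commands come from the shell script LNKnet writes for the experiment.

```python
# Sketch of the batch experiment: loop over step sizes, emitting one
# hypothetical training and testing command pair per value.  The flag
# names below are placeholders, not actual LNKnet syntax.
step_sizes = [0.01, 0.05, 0.1, 0.2]
for step in step_sizes:
    train_cmd = f"mlp -create -stepsize {step} ..."  # placeholder
    test_cmd = "mlp -test ..."                       # placeholder
    print(train_cmd)
    print(test_cmd)
```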
class 1 squares are often misclassified. A binary tree classifier with 6 non-terminal
nodes can achieve an error rate of 1.2% on the disjoint testing data. The structure and
decision region plots for this classifier are shown in Figure 7.4.
FIGURE 7.4
A multi-layer perceptron classifier with 6 hidden nodes requires 1000 epochs of training to achieve a similar error rate of 1.6%. This training time can be cut to 10 epochs by initializing the multi-layer perceptron using the binary tree parameters. The structure plot and decision region plot for this classifier are shown in Figure 7.5.
FIGURE 7.5
To perform this experiment, first select the LNKnet disjoint data base on the data base
selection window. The LNKnet data base directory is $LNKHOME/data/class. Next,
change the experiment name prefix to disjoint and select the binary tree classifier. On
the BINTREE parameter window, select Maximum Number of Nodes during Testing
and set the number to 6. Train and test the binary tree classifier.
Next select the multi-layer perceptron classifier. On the MLP parameter window set the
number of epochs to 1000 and the node structure to 2,6,2. Train and test the multi-layer
perceptron classifier by selecting START. It may be interesting to display training plots
using movie mode. A plot every 50 or 100 epochs should be sufficient.
Now, initialize a multi-layer perceptron classifier using the binary tree classifier. First,
run the following command in your shell window:
bintree2mlp -bin_fparam disjointbintree.param \
-prune_tree -max_nodes 6 \
-mlp_fparam disjointmlp.param -nodes 2,6,2
The parameter file disjointmlp.param now holds a multi-layer perceptron with first layer
weights that match the binary tree decision node lines. The other weights in the network
are set to random values and need to be trained. On the MLP parameter window, set the
number of epochs to 10. Display the MLP weight parameter window and select Use
Step size list for weights in each layer. Set the Step size list to 0,.1. This freezes the
first layer weights to the initialized values while allowing the other weights to be
trained. Display the MLP node parameter window and select Specify sigmoid steepness for each layer. Set the Sigmoid Steepness List to 50,1. This makes the sigmoid
functions for the hidden layer act like the non-terminal node tests used in the binary tree
classifier. Now train the initialized multi-layer perceptron by selecting CONTINUE on
the main window. The pre-initialized MLP classifier achieves the same error rate as the
randomly initialized MLP classifier using 1/100 the epochs of training.
vowel.train.colors
vowel.train.row
vowel.train.glyphs
CHAPTER 8
The LNKnet system uses many files to store classification data, normalization and feature selection parameters, experiment commands and results, classification algorithm
parameters, plots, generated C subroutines and committee data bases. This chapter
describes the default names and the formats of the files created and used by LNKnet
programs.
FIGURE 8.1 A LNKnet data file for a two-class problem with two input features. Each line holds one pattern's class (0 or 1) followed by its input feature values (0, 0.1, 1, and 1.1 here), from the first pattern to the last pattern.
File Type     Suffix    File Name
Training      .train    pbvowel.train
Evaluation    .eval     pbvowel.eval
Testing       .test     pbvowel.test
FIGURE 8.2
FIGURE 8.3
FIGURE 8.2
FIGURE 8.3
describe -class \
-ninputs 2 -input_labels X0,X1 \
-noutputs 2 -labels EVEN,ODD
Callouts: number of input features; number of output classes; optional class labels; optional input feature labels.
A description file has the same general format as the .lnknetrc file. There is a dummy
command name followed by a list of flags and their values. The flags in the description
file must match those in the example below. The flag -ninputs is followed by the number of input features. The flag -noutputs is followed by the number of classes. The flag
-labels is followed by a comma delimited list containing the names of all classes beginning with class zero. The flag -input_labels is followed by a similar comma delimited
list of labels for the input features starting from feature zero. The delimiter for the label lists can be a comma (,), colon (:), or dash (-). A label list ends at the first space encountered. None of these characters can be used in class labels or input feature labels.
Table 8.2 gives examples of acceptable and unacceptable labels.
TABLE 8.2
Acceptable      Unacceptable    Reason for Unacceptability
10/jan/94       10:jan:94       colons (:)
1cepstra        1 cepstra       space
delta_cepstra   delta-cepstra   dash (-)
July_4_95       July 4, 1995    spaces and comma (,)
A data base description
file can also have a flag for the type of data base. The data base type flags are -class for
static pattern classification data bases, -map for input/output mapping data bases, and
-seq for sequence classification data bases. Only static pattern classification data bases
can be used with the programs described in this Users Guide. Description files are read
using the command line argument parsing routines used by all LNKnet programs. Like a
UNIX command, the description file flag list must either be one line long or every line
but the last must end with a backslash (\) immediately followed by a carriage return.
When using backslashes, it is important to remember to put spaces at the end of each
comma delimited list. The backslash and carriage return are not interpreted as spaces.
For LNKnet to find the description of a particular data base, the name must be
<data_base>.defaults. For example, the description file for pbvowel is
pbvowel.defaults.
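A minimal reader for this flag format can be sketched as follows. read_description is an illustrative helper, not the LNKnet parser; it handles only the backslash splicing and flag/value pairing described above.

```python
def read_description(text):
    """Parse a .defaults-style flag list: a dummy command name followed
    by -flag value pairs, with backslash-newline line continuations."""
    joined = text.replace("\\\n", "")   # splice continued lines
    tokens = joined.split()[1:]         # drop the dummy command name
    flags = {}
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            flags[tokens[i]] = tokens[i + 1]
            i += 2
        else:
            flags[tokens[i]] = True     # bare flag such as -class
            i += 1
    return flags

# The example description file from Figure 8.3:
text = ("describe -class \\\n"
        "-ninputs 2 -input_labels X0,X1 \\\n"
        "-noutputs 2 -labels EVEN,ODD\n")
desc = read_description(text)
# desc["-ninputs"] → "2"; desc["-class"] → True
```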
FIGURE 8.4
Normalization Type   Extension   File Name (pbvowel data base)
simple               .simple     pbvowel.norm.simple
PCA                  .pca        pbvowel.norm.pca
LDA                  .lda        pbvowel.norm.lda
TABLE 8.4
Direction          Extension   Normalization Type   Normalization Character   File Name
forward            .for        none                 .N                        pbvowel.for.N.param
forward            .for        simple               .S                        pbvowel.for.S.param
backward           .back       PCA                  .P                        pbvowel.back.P.param
forward and back   .for_bk     LDA                  .D                        pbvowel.for_bk.D.param
FIGURE 8.5 Files used in the example experiment Test3mlp (XOR data base):
Description File: XOR.defaults
Training Data: XOR.train
Testing Data: XOR.test
Normalization Data: XOR.norm.pca
Shell Script: Test3mlp.run
Screen File: Test3mlp.screen
Log File: Test3mlp.log
Notebook File: LNKnet.note
Parameter File: Test3mlp.param
Training Results: Test3mlp.err.train
Testing Results: Test3mlp.err.test
Structure Plot File: Test3mlp.struct.plot
Train/Test program: mlp
Profile Plot: mlp_plot_bound
Cost Plot: plot_cost
Structure Plot: plot_mlp
Rejection Plot: plot_reject
TABLE 8.5
Log File Entry                Verbosity Level
=========<classifier> Begin   All Levels
Confusion Matrix              Verbosity 2 or over
Error Summary                 Verbosity 1 or over
========<classifier> End      All Levels
Entries printed at Verbosity 3 or over appear for incrementally trained classifiers only.
The file name extension for log files is .log. Thus the full log file name for the example
experiment is Test3mlp.log. Section C.1.2 gives an example log file from the LNKnet
tutorial.
8.2.5 Algorithm Parameter Files
After a classifier has been trained, it is saved in a parameter file. The first items in the
parameter file are all program flags and their settings and the date and time that training
was started. Following this is information on any normalization performed on the training data before it was presented to the classifier. Any data presented to this classifier for
testing will use these same normalization parameters. Finally, the classifier parameters
are stored. If this parameter file was generated during N-fold cross validation, it will
have several sets of classifier parameters.
The file name extension for parameter files is .param. Thus the full parameter file name
for the example experiment is Test3mlp.param. A parameter file for the Gaussian classifier is found in Section C.6.1.
FIGURE 8.6 An example error file for the XOR problem. Each line gives the pattern number, the true class, the classification results, the classifier outputs, and the input pattern.
file is written. If Error File Verbosity is set to Classification Results (-verror 1), for each pattern which is tested, the entries shown in Table 8.6 are written to the error file. Each tested pattern generates a line in this file.
TABLE 8.6
Correct Class   Classifier's Class   Classification Error   Cost
If the Error File Verbosity is Results+Outputs (-verror 2), the classifier outputs are written to the file after the results entries, as shown in Table 8.7. Finally, if the Error File Verbosity is Results+Outputs+Inputs (-verror 3), the normalized input pattern is written to the file after the outputs, as shown in Table 8.8.
TABLE 8.7
TABLE 8.8
The file name extension for error files is .err. The data base file extension is appended to indicate which data base file the results are for. Thus, if the example multi-layer perceptron classifier stores pattern by pattern classification results during training, they go into Test3mlp.err.train. Table 8.9 shows the default extensions for data file types and the resulting default error file names. The .test_on_train extension cannot be changed from the LNKnet graphical user interface.
TABLE 8.9
File Type          File Type Extension   Error File Name
Train on train     .train                Test3mlp.err.train
Test on train      .test_on_train        Test3mlp.err.test_on_train
Test on eval       .eval                 Test3mlp.err.eval
Test on test       .test                 Test3mlp.err.test
Cross validation   .cv                   Test3mlp.err.cv
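Assuming the field order of Tables 8.6 through 8.8 — the four results entries first, then the outputs, then the normalized inputs — one line of a -verror 3 error file can be split apart with a sketch like this. split_error_line is a hypothetical helper, not a LNKnet routine, and the example line is made up.

```python
def split_error_line(line, noutputs, ninputs):
    """Split one -verror 3 error file line into the four results
    entries (correct class, classifier's class, classification error,
    cost), the classifier outputs, and the normalized input pattern."""
    fields = line.split()
    results = fields[:4]
    outputs = [float(v) for v in fields[4:4 + noutputs]]
    inputs = [float(v) for v in fields[4 + noutputs:4 + noutputs + ninputs]]
    return results, outputs, inputs

# A made-up line for a two-class, two-input experiment:
results, outputs, inputs = split_error_line(
    "0 1 1 0.287 0.287 0.713 0.1 1.1", noutputs=2, ninputs=2)
# results → ['0', '1', '1', '0.287']; outputs → [0.287, 0.713]
```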
The names of normalization and feature selection plots depend on the name of the parameter file on which the plot is based. For normalization plots, add .plot to the parameter file name. For feature selection plots, replace the extension .param with .plot. Table 8.10 shows the extensions and plot names for the example classification experiment. The decision region and profile plot names are for plots using the evaluation data file.
TABLE 8.10 Plot file names for Experiment Test3mlp and Evaluation data file
Plot Type               Extension                   File Name
Decision region         .region.plot<file type>     Test3mlp.region.plot.eval
Profile                 .profile.plot<file type>    Test3mlp.profile.plot.eval
Structure               .struct.plot                Test3mlp.struct.plot
Cost                    .cost.plot                  Test3mlp.cost.plot
Percent error           .perr.plot                  Test3mlp.perr.plot
Posterior probability   .prob.plot                  Test3mlp.prob.plot
ROC (detection)         .detect.plot                Test3mlp.detect.plot
Rejection               .reject.plot                Test3mlp.reject.plot
These files can be displayed using olxplot under OpenWindows or xplot under MIT X.
One way to print plot files is to first convert them to PostScript using plot2ps. Plots can
be added to FrameMaker documents if they are first translated into Maker Interchange
Format using plot2mif. Both of these programs are available on the Print window
described in Section 7.1.
FIGURE 8.7
APPENDIX A
2.2. Problem: There are no files in the data base scroll list on the data base window or the
desired data base is not on the list.
Solution: There are several possibilities here:
1. The path to the data base directory is wrong. Change the data base directory path.
2. There are no data base description files for the data bases in this directory. Enter the data base
names in the data file prefix field and generate description files for the data bases on the Description File Generation window.
3. The description files use the wrong suffix. Data base description files must be named <data-
base>.defaults. Only file names which include the string .defaults are displayed in the data base
scrolling list.
2.3. Problem: On the data base window, under patterns per class it says Not classification
data
Solution: Check that the data base selected is a static pattern classification data base, not an
input/output mapping or sequence classification data base.
2.4. Problem: When I select a data base I get a warning on my shell window, Noutputs from
defaults file doesnt match data file
Solution: There are two possible problems here:
1. The training, testing, or evaluation file being read is not a LNKnet classification data file.
2. There is a pattern in the data file with a class label which is out of range. The class numbers at the start of each pattern must run from zero to the number of classes minus one.
2.5. Problem: There are red stop signs beside some of the buttons on the main window.
Solution: There are important errors on these windows. Unless these errors are cleared, an
experiment started now will not run correctly. Select the buttons and clear the errors before
starting the experiment.
2.6. Problem: File names and parameters entered in LNKnet windows are not updated during
an experiment.
Solution: A carriage return or a tab must be entered after typing anything in a LNKnet window before the new entry is read in. Play it safe and hit carriage return after typing anything.
2.7. Problem: The number of patterns in a data base file is not set when the data base is
selected.
Solution: If the data base directory is correct, check the data file extension.
2.8. Problem: There is an error on the normalization window, Normalization file does not
exist.
Solution: If the normalization file DOES exist, check the data base directory, data base
selection, and the normalization selection. Otherwise, create the normalization file on the
normalization file generation window.
2.9. Problem: There is an error on the feature selection window, bad feature list or not enough
labels.
Solution: There are two possibilities here:
1. Check the feature selection parameters.
2. Check that the number of input features on the data base window is the same as the number of
inputs. Change the input feature list on the description file generation window.
2.10. Problem: There is a File does not exist error on one of the windows after the missing file has been created.
Solution: Click the mouse on the error Stop sign to erase the message.
2.11. Problem: Options are grayed out on a parameter window and cannot be selected.
Solution: Options that are inconsistent with previous selections are grayed out and cannot
be selected. For example, the perceptron convergence procedure cost function of the multi-layer perceptron classifier is only available if there are no hidden layers. When it is selected,
the field for specifying network topology is grayed out. Select a parameter value that is consistent with the desired option.
2.12. Problem: A comma delimited list is not read correctly.
Solution: There are two possibilities here:
1. Do not leave spaces between the commas. If this is a list of strings, there are three delimiter char-
acters that can be used: comma (,), colon (:), and dash (-). Check that these characters are not
included inside any of your desired list strings. For example first_formant is a valid label because
an underscore is used to indicate the space between the words, but first-formant is not valid
because a dash is used within the label.
2. Always put a space after a comma-delimited list. Because dash is used to mark flags and is also a list delimiter, forgetting the space after a list can cause LNKnet to add the name of the next flag to the list. That flag will not be set, since it has already been parsed as part of a list. This is most often a problem when flags are put on multiple lines in a file with backslashes (\) at the ends of lines. When the file is read, the first character of the second line is placed directly after the last character of the first line. No extra spaces are inserted.
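The continuation pitfall in item 2 is easy to reproduce in a Bourne-style shell. The flag names below are only illustrative, not a real LNKnet command:

```shell
# A flag list written with a backslash at the end of the line and no
# trailing space: the shell removes the backslash-newline pair and
# joins the two lines with no separator, so the next flag is absorbed
# into the list.
joined="-features 0,1,2\
-normalize"
echo "$joined"
```

The printed value is `-features 0,1,2-normalize`: the `-normalize` flag has become part of the list and would never be set.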
2.13. Problem: Per-epoch error information is not printed out when training an MLP classifier.
Solution: Set the log file verbosity flag on the Report Files and Verbosities window to print
out Summary+Confusion+Flags+Epochs.
3. Misc.
3.1. Problem: When I keep the same experiment name, it is annoying to have to move my mouse after starting each new experiment to click the button that says it is OK to overwrite the old experiment.
Solution: Experts move the small window that queries you about overwriting so that it is located over the button used to start a new experiment. You can then dismiss the verification window with a second mouse click in the same location.
3.2. Problem: It is difficult to select features using a comma-separated list of feature numbers because I keep forgetting which numbers correspond to which feature names.
Solution: Most of us get around this by keeping a listing of the feature names along with their numbers. This list is provided at the bottom of the feature selection window if you select all features. Once you guess at the feature numbers and hit carriage return, the feature names are displayed at the bottom of the feature selection window. Also remember that feature numbers start at zero and not at one.
3.3. Problem: When I have many input features, the feature selection window extends way off the screen and I can't see the whole window at once.
Solution: Either use shorter feature names, or drag the window to the left or right using the mouse. You can also resize this window after performing feature selection.
4. Known Limitations
4.1. Problem: I think I found a problem or want a new feature.
Solution: Send questions, requests, and bug reports to Linda Kukolich
(kukolich@sst.ll.mit.edu) or Richard Lippmann (rpl@sst.ll.mit.edu).
4.2. Problem: LNKnet windows come up all black or with black writing on black buttons.
Solution: LNKnet was developed on a color Sparc station. It has not been debugged on
black and white terminals and may not work on them. The problem may be solved in newer
versions of OpenWindows.
4.3. Problem: When specifying a large number of cross validation folds in a file, the following
error occurs: Number of labels has exceeded 255. The list has been truncated.
Solution: The maximum number of entries in the -cv_splits, -cv_train_mask, and
-cv_test_mask arguments of a cross validation file is 255. Because a split is defined by pairs
of entries, the maximum number of splits is 127. The maximum number of cross validation
folds is 255. This may be changed in a future version of LNKnet.
5. MLP Training
5.1. Problem: MLP training is slow
Solution: There are several things that can be tried to speed up MLP training:
1. Make sure that random presentation order is selected on the main window.
2. Make sure that the weights are being updated after every trial.
3. Increase the step size.
6. Plots
6.1. Problem: Decision region plots are not in color.
Solution: Check that color plots are selected on the Decision Region Plot window.
132
6.2. Problem: Decision region plots are too jagged and blocky.
Solution: Increase the number of points per dimension on the Decision Region Plot window.
6.3. Problem: Scatter plot or Histogram plot does not show patterns.
Solution: Turn on autoscale, adjust the setting for Xmin, Xmax, Ymin, and Ymax, turn on
show all patterns, or increase the distance limit for pattern display.
6.4. Problem: I have problems looking at plots while running MIT X.
Solution: Rename the binary file xplot to olxplot, after saving the original olxplot. Xplot is designed to run under MIT X; olxplot is designed to run under Sun OpenLook.
6.5. Problem: The screen rapidly gets cluttered with plots.
Solution: Turn plots off, or eliminate existing plots rapidly by typing 'q' while the mouse is over a plot window.
6.6. Problem: How do I make hard copies of plots?
Solution: See Section 7.1 on page 109 or Section 6.9 on page 107.
6.7. Problem: Plots do not run and generate the error:
ld.so: Undefined symbol: _XtQString
Solution: The program which displays plots, olxplot, was written using the OpenLook
Intrinsics library, olit. Olit uses X11R4, as does the rest of the OpenWindows environment.
If your environment variable $LD_LIBRARY_PATH includes the X11R5 libraries, olxplot
will not run because of incompatibilities between X11R4 and X11R5. Remove the X11R5
libraries from the LD_LIBRARY_PATH environment variable in your terminal window
shell before starting LNKnet. The following commands can be used to display and correct
the LD_LIBRARY_PATH variable:
> echo $LD_LIBRARY_PATH
/usr/local/X11R5/lib:/src/openwin3.0/lib:/usr/local/lib:/usr/local/lib/X11:/usr/lib
> setenv LD_LIBRARY_PATH /src/openwin3.0/lib:/usr/local/lib:/usr/local/lib/X11:/usr/lib
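The sessions in this guide use csh, but the same cleanup can be scripted. This POSIX-shell sketch (the starting path value is just an example) drops every X11R5 component from the path:

```shell
# Example starting value; a real session would use the current setting.
LD_LIBRARY_PATH="/usr/local/X11R5/lib:/src/openwin3.0/lib:/usr/local/lib:/usr/lib"

# Split on colons, drop entries mentioning X11R5, and rejoin.
cleaned=$(printf '%s\n' "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -v X11R5 | tr '\n' ':')
cleaned=${cleaned%:}   # remove the trailing colon left by tr
echo "$cleaned"
```

In csh the cleaned value would then be installed with setenv LD_LIBRARY_PATH before starting LNKnet.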
APPENDIX B
Installing LNKnet
changes/
lib/
RCS/
data/
loading_instructions
README
demo/
src/
bin/
man/
At an absolute minimum, you need the bin/ and man/ directories. In order to do the tutorial, you will also need the data/ directory. If space is tight, the other directories may be
deleted. This would save 34 Mbytes of space.
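The trimming described above can be sketched as follows, using a throwaway directory in place of a real installation (the directory names are taken from the listing above):

```shell
# Build a stand-in install tree, then remove the optional directories,
# keeping bin/, man/, and data/ (data/ is needed for the tutorial).
LNKHOME=$(mktemp -d)
mkdir -p "$LNKHOME/bin" "$LNKHOME/man" "$LNKHOME/data" \
         "$LNKHOME/changes" "$LNKHOME/lib" "$LNKHOME/RCS" \
         "$LNKHOME/demo" "$LNKHOME/src"
for d in changes lib RCS demo src; do
    rm -rf "$LNKHOME/$d"
done
ls "$LNKHOME"
```

On a real installation you would of course run the loop against the actual LNKHOME rather than a temporary copy.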
Directory                       Result of make
$LNKHOME                        all binaries
$LNKHOME/src                    all binaries
$LNKHOME/src/lib                library nnlib.a
$LNKHOME/src/algorithm          all classifiers and clusterers
$LNKHOME/src/algorithm/mlp      multi-layer perceptron program, mlp
$LNKHOME/src/plot               plot programs
$LNKHOME/src/gclass             LNKnet GUI
The command make creates the binary. The command make copy creates the binary and copies it into
the bin directory, $LNKHOME/bin.
If there are any problems with compilation, call or e-mail Linda Kukolich
(KUKOLICH@LL.MIT.EDU) for help.
APPENDIX C
The experiments shown here were performed as part of the LNKnet tutorial in Chapter 2
and as examples in Chapter 6.
C.1 MLP
This experiment was run in two parts. During the first part, the MLP classifier was
trained for 20 epochs on 338 samples from the vowel data base. When evaluated, the
resulting classifier obtained an error rate of 30%. The training was then continued for
another 20 epochs which brought the evaluation error rate down to 20%. The shell script
shows the calls for the first half of the training. It differs from the script for the second
half only in that the calls to MLP use the -create flag. The log file shown below has the
results from both halves of the experiment. Because this classifier looks at the same
training patterns multiple times, there are classification results for the training as well as
for the evaluation portions of the experiment.
C.1.1 MLP Shell Script
#!/bin/csh -ef
# ./X1mlp.run
set loc=`pwd`
#train
(time mlp\
-train -create -pathexp $loc -ferror X1mlp.err.train -fparam X1mlp.param\
-pathdata /u/kukolich/Tutorial -finput vowel.train\
-fdescribe vowel.defaults -npatterns 338 -ninputs 2 -normalize\
-fnorm vowel.norm.simple -cross_valid 0 -fcross_valid vowel.train.cv\
-random_cv -random -seed 0 -priors_npatterns 338 -debug 0 -verbose 3\
-verror 2 \
-nodes 2,25,10 -alpha 0.6 -etta 0.2 -etta_change_type 0 -epsilon 0.1\
-kappa 0.01 -etta_nepochs 0 -decay 0 -tolerance 0.01 -hfunction 0\
-ofunction 0 -sigmoid_param 1 -cost_func 0 -cost_param 1 -epochs 20\
-batch 1,1,0 -init_mag 0.1 \
)|& nn_tee -h X1mlp.log
echo -n "X1mlp.run " >> /u/kukolich/Tutorial/LNKnet.note
grep "LAST TRAIN EPOCH" X1mlp.log | tail -1 >> /u/kukolich/Tutorial/LNKnet.note
#test
(time mlp\
-create -pathexp $loc -ferror X1mlp.err.eval -fparam X1mlp.param\
-pathdata /u/kukolich/Tutorial -finput vowel.eval\
-fdescribe vowel.defaults -npatterns 166 -ninputs 2 -normalize\
-fnorm vowel.norm.simple -cross_valid 0 -fcross_valid vowel.train.cv\
Profile plots
Cost plot
#cost plot
plot_cost -pathexp $loc -ferror X1mlp.err.train -fplot X1mlp.cost.plot -autoscale\
-xmin 0 -xmax 10000 -ymin 0 -ymax 5 -xstep 1000 -ystep 1 -trials 338\
-title Norm:Simple Net:2,25,10 Step:0.2 -cost_func 0 -fmaclines cost.mac\
-class
Structure plot
#structure plot
plot_mlp -fparam X1mlp.param -fplot X1mlp.struct.plot\
-fdescribe /u/kukolich/Tutorial/vowel.defaults -autoscale \
-threshold 0.000000 -show_weight_magnitude -max_line_width 10\
-show_bias
#prob plot
plot_prob -bin_plot -target 2 -nbins 5 -min_bin_count 5 -chi_square -pathexp $loc\
-ferror X1mlp.err.eval -fplot X1mlp.prob.plot -no_graphics -noutputs 10\
-npatterns 166 -xmin 0 -xmax 100 -ymin 0 -ymax 100 -xstep 10 -ystep 10\
-verbose 1 -title Norm:Simple Net:2,25,10 Step:0.2 -fmaclines prob.mac \
|& nn_tee -h -a X1mlp.log
olxplot -geometry 500x480 -title prob_plot X1mlp.prob.plot&
echo -n "X1mlp.run " >> /u/kukolich/Tutorial/LNKnet.note
grep CHI X1mlp.log | tail -1 >> /u/kukolich/Tutorial/LNKnet.note
#detect plot
plot_detect -target 2 -pathexp $loc -ferror X1mlp.err.eval -fplot X1mlp.detect.plot\
-noutputs 10 -reject 0 -xmin 0 -xmax 100 -ymin 0 -ymax 100 -xstep 10\
-ystep 10 -title Norm:Simple Net:2,25,10 Step:0.2 -table_begin 0\
-table_end 100 -table_step 2 -verbose 2 -fmaclines detect.mac\
-no_graphics \
|& nn_tee -h -a X1mlp.log
olxplot -geometry 500x480 -title detect_plot X1mlp.detect.plot&
echo -n "X1mlp.run " >> /u/kukolich/Tutorial/LNKnet.note
grep "ROC AREA" X1mlp.log | tail -1 >> /u/kukolich/Tutorial/LNKnet.note
#reject plot
plot_reject -pathexp $loc -ferror X1mlp.err.eval -fplot X1mlp.reject.plot -noutputs 10\
-npatterns 0 -xmin 0 -xmax 100 -ymin 0 -ymax 100 -xstep 10 -ystep 10\
-title Norm:Simple Net:2,25,10 Step:0.2 -table_begin 0 -table_end 100\
-table_step 10 -verbose 1 -fmaclines reject.mac -no_graphics \
|& nn_tee -h -a X1mlp.log
olxplot -geometry 500x480 -title reject_plot X1mlp.reject.plot&
echo current directory: >> X1mlp.log
echo $loc >> X1mlp.log
Reading /u/kukolich/Tutorial/vowel.train
EPOCH   %error
    1     92.0
    2     81.4
    3     63.6
    4     58.0
    5     51.2
    6     46.7
    7     47.0
    8     42.9
    9     43.5
   10     43.2
   11     43.5
   12     39.3
   13     39.3
   14     42.0
   15     37.3
   16     36.4
   17     34.6
   18     36.1
   19     32.8
   20     33.7   0.2175
LAST TRAIN EPOCH: 20 33.73 % Err   18.01 secs
[Confusion matrix (Desired Class vs. Computed Class) for the training data: table entries garbled and omitted.]
Error summary: 6760 patterns, 3193 errors, 47.23 % Err ( 0.6), 0.245 RMS Err
================================================================ mlp
17.9u 0.2s 0:23 78% 0+440k 5+111io 50pf+0w
END
Reading /u/kukolich/Tutorial/vowel.eval
[Confusion matrix (Desired Class vs. Computed Class) for the evaluation data: table entries garbled and omitted.]
Error summary: 166 patterns, 54 errors, 32.53 % Err ( 3.6), 0.207 RMS Err
TEST: vowel.eval 32.53 % Err 0.207 RMS Err   0.54 secs
================================================================ mlp
0.5u 0.0s 0:00 91% 0+376k 1+3io 1pf+0w
END
------------------------------------------------------------------------------
Chi = 2.446746
Degrees of Freedom = 2
Significance = 0.294236
TARGET 2 CHI 2.446746 DOF 2 SIGNIFICANCE 0.294236
Created file /u/kukolich/Tutorial/X1mlp.prob.plot
ROC-x   ROC-y
----------------------------------------------------------------
target class = 2
reject = 0.00%
number of patterns from target class = 20
----------------------------------------------------------------
# PATTERNS  # CORRECT   % CORRECT    # FALSE  % FALSE_ALARM  THRESHOLD
----------  ---------  ----------  ---------  -------------  ---------
         0          0       0.000          0          0.000      1.000
        20         16      82.500          4          2.000      0.564
        25         18      92.500          7          4.000      0.350
        35         19      97.500         16          6.000      0.199
       166         20     100.000        146         12.000      0.000
TARGET 2 ROC AREA = 98.732872
Created file X1mlp.detect.plot
EPOCH   %error
    1     34.6
    2     35.2
    3     31.4
    4     32.0
    5     30.5
    6     29.0
    7     29.6
    8     27.8
    9     28.4
   10     29.9
   11     27.2
   12     28.7
   13     27.2   0.1973
   14     25.4   0.1964
   15     26.3   0.1959
   16     28.7   0.1973
   17     24.6   0.1935
   18     24.3   0.1932
   19     25.1   0.1933
   20     25.4   0.1935
LAST TRAIN EPOCH: 20 25.44 % Err   17.80 secs
[Confusion matrix (Desired Class vs. Computed Class) for the second training pass: table entries garbled and omitted.]
Error summary: 6760 patterns, 1931 errors, 28.57 % Err ( 0.5), 0.202 RMS Err
================================================================ mlp
17.7u 0.1s 0:23 77% 0+440k 4+112io 49pf+0w
END
[Confusion matrix (Desired Class vs. Computed Class) for the second evaluation: table entries garbled and omitted.]
Error summary: 166 patterns, 33 errors, 19.88 % Err ( 3.1), 0.181 RMS Err
TEST: vowel.eval 19.88 % Err 0.181 RMS Err   0.48 secs
================================================================ mlp
0.4u 0.0s 0:00 92% 0+376k 0+3io 0pf+0w
END
------------------------------------------------------------------------------
target class = 2
total number of patterns = 166
------------------------------------------------------------------------------
     BIN #   # PATTERNS  PREDICTED %     ACTUAL %            RANGE
----------  -----------  -----------  -----------  ------------------
         0          142         1.36         1.41    -0.57 -    3.39
         1            5        30.01        20.00   -15.78 -   55.78
         2            9        63.01        77.78    50.06 -  105.49
         3           10        88.56       100.00   100.00 -  100.00
C.2 KNN
This experiment was run on the same training and evaluation data as the MLP experiment above. The KNN classifier obtained an 18% error rate on the evaluation data. Because this is a single-pass classifier, there are no classification results from the training. Because this classifier does not produce continuous outputs, there are no tables from the posterior probability plot, ROC plot, or rejection plot.
C.2.1 KNN Shell Script
#!/bin/csh -ef
# ./X1knn.run
set loc=`pwd`
Train KNN
#train
(time knn\
-train -create -pathexp $loc\
-pathdata /u/kukolich/Tutorial\
-ferror X1knn.err.train\
-finput vowel.train\
-fparam X1knn.param\
END
Wed Apr  5 14:15:57 1995
Reading /u/kukolich/Tutorial/vowel.eval
Confusion matrix
[Desired Class vs. Computed Class: table entries garbled and omitted.]
Error summary: 166 patterns, 30 errors, 18.07 % Err ( 3.0), 0.163 RMS Err
TEST: vowel.eval 18.07 % Err 0.163 RMS Err   0.42 secs
================================================================ knn
0.3u 0.0s 0:01 35% 0+392k 1+3io 1pf+0w
current directory:
/u/kukolich/Tutorial
END
#train
(time gauss\
-train -create -pathexp $loc -ferror last3gauss.err.train\
-fparam last3gauss.param -pathdata /u/kukolich/Tutorial\
-finput gnoise_var.train -fdescribe gnoise_var.defaults -npatterns 200\
-ninputs 3 -features 7,5,6 -normalize -fnorm gnoise_var.norm.simple\
-cross_valid 0 -fcross_valid gnoise_var.train.cv -random_cv -random\
-seed 0 -priors_npatterns 200 -debug 0 -verbose 3 -verror 2 \
-minvar 1e-05 -max_ratio 1e+06 \
)|& nn_tee -h last3gauss.log
Evaluate classifier
#test
(time gauss\
-create -pathexp $loc -ferror last3gauss.err.eval -fparam last3gauss.param\
-pathdata /u/kukolich/Tutorial -finput gnoise_var.eval\
-fdescribe gnoise_var.defaults -npatterns 100 -ninputs 3 -features 7,5,6\
-normalize -fnorm gnoise_var.norm.simple -cross_valid 0\
-fcross_valid gnoise_var.train.cv -random_cv -random -seed 0\
-priors_npatterns 200 -debug 0 -verbose 3 -verror 2 \
-minvar 1e-05 -max_ratio 1e+06 \
)|& nn_tee -h -a last3gauss.log
echo -n "last3gauss.run " >> /u/kukolich/Tutorial/LNKnet.note
grep TEST last3gauss.log | tail -1 >> /u/kukolich/Tutorial/LNKnet.note
Evaluate classifier
Confusion matrix
[Desired Class vs. Computed Class: table entries garbled and omitted.]
Error summary: 100 patterns, 3.00 % Err ( 1.7), -0.478
END
#cross validation
(time kmeans\
-create -pathexp $loc -ferror cvrbf.err.cv -fparam cvkmeans.param\
-pathdata /u/kukolich/Tutorial -finput iris.train\
-fdescribe iris.defaults -npatterns 150 -ninputs 4 -normalize\
-fnorm iris.norm.simple -cross_valid 5 -fcross_valid iris.train.cv\
-random_cv -random -seed 0 -priors_npatterns 150 -debug 0 -verbose 3\
-verror 2 \
-cluster_by_class -ncenters 2 -split_percentage 1 -add_random_offset\
-max_iteration 10 -stop_percentage 1 -reduce_step 1 \
)|& nn_tee -h cvrbf.log
(time rbf\
-create -pathexp $loc -ferror cvrbf.err.cv -fparam cvrbf.param\
-pathdata /u/kukolich/Tutorial -finput iris.train\
-fdescribe iris.defaults -npatterns 150 -ninputs 4 -normalize\
-fnorm iris.norm.simple -cross_valid 5 -fcross_valid iris.train.cv\
-random_cv -random -seed 0 -priors_npatterns 150 -debug 0 -verbose 3\
-verror 2 \
-fclparam cvkmeans.param -hspread 1 -exhspread 1 -max_ratio 1e+06\
-minvar 1e-06 -fast_nhidden 0 \
)|& nn_tee -h -a cvrbf.log
echo -n "cvrbf.run " >> /u/kukolich/Tutorial/LNKnet.note
grep CV cvrbf.log | tail -1 >> /u/kukolich/Tutorial/LNKnet.note
echo current directory: >> cvrbf.log
echo $loc >> cvrbf.log
Reading /u/kukolich/Tutorial/iris.train
>>>>>> FOLD 0 <<<<<<
training centers for class 0
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       1.068         2
    3       0.469         2
    4       0.469         2
training centers for class 1
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       1.006         2
    3       0.520         2
    4       0.511         2
    5       0.511         2
training centers for class 2
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       1.441         2
    3       0.789         2
    4       0.785         2
>>>>>> FOLD 1 <<<<<<
training centers for class 0
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       1.031         2
    3       0.499         2
    4       0.499         2
training centers for class 1
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       0.971         2
    3       0.470         2
    4       0.466         2
training centers for class 2
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       1.195         2
    3       0.696         2
    4       0.692         2
>>>>>> FOLD 2 <<<<<<
training centers for class 0
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       0.894         2
    3       0.432         2
    4       0.432         2
training centers for class 1
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       1.090         2
    3       0.546         2
    4       0.545         2
training centers for class 2
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       1.431         2
    3       0.795         2
    4       0.795         2
>>>>>> FOLD 3 <<<<<<
training centers for class 0
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       0.891         2
    3       0.390         2
    4       0.384         2
    5       0.363         2
    6       0.356         2
    7       0.356         2
training centers for class 1
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       1.053         2
    3       0.510         2
    4       0.510         2
training centers for class 2
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       1.250         2
    3       0.725         2
    4       0.724         2
>>>>>> FOLD 4 <<<<<<
training centers for class 0
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       0.778         2
    3       0.358         2
    4       0.358         2
training centers for class 1
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       0.942         2
    3       0.510         2
    4       0.510         2
training centers for class 2
EPOCH  ave.sq.err  ncenters  (40 patterns/epoch)
    1       0.000         1
    2       1.291         2
    3       0.730         2
    4       0.725         2
END
              Computed Class
            0      1      2   Total
         ----   ----   ----   -----
    0      10                    10
    1             10             10
    2                    10      10
         ----   ----   ----   -----
Total      10     10     10      30
30   0.00   ( 0.0)   0.057
              Computed Class
            0      1      2   Total
         ----   ----   ----   -----
    0      10                    10
    1              9      1      10
    2                    10      10
         ----   ----   ----   -----
Total      10      9     11      30
30   3.33   ( 3.3)   39.669
              Computed Class
            0      1      2   Total
         ----   ----   ----   -----
    0      10                    10
    1             10             10
    2                    10      10
         ----   ----   ----   -----
Total      10     10     10      30
30   0.00   ( 0.0)   39.669
              Computed Class
            0      1      2   Total
         ----   ----   ----   -----
    0      10                    10
    1              8      2      10
    2              2      8      10
         ----   ----   ----   -----
Total      10     10     10      30
30   13.33   ( 6.2)   0.577
              Computed Class
            0      1      2   Total
         ----   ----   ----   -----
    0      10                    10
    1              9      1      10
    2                    10      10
         ----   ----   ----   -----
Total      10      9     11      30
30   3.33   ( 3.3)   0.577
              Computed Class
            0      1      2   Total
         ----   ----   ----   -----
    0      50                    50
    1             46      4      50
    2              2     48      50
         ----   ----   ----   -----
Total      50     48     52     150
150   4.00   ( 1.6)   25.092
CV 5: iris.train 4.00 % Err   25.1 RMS Err   0.41 secs
================================================================ rbf
0.3u 0.0s 0:04 8% 0+396k 2+3io 48pf+0w
current directory:
/u/kukolich/Tutorial
END
KNN experiment
X1knn.run vowel simple knn -k 3
X1knn.run TEST: vowel.eval 18.07 % Err   0.42 secs
cvrbf.run iris simple -cross_valid 5 kmeans -cluster_by_class -ncenters 2
-split_percentage 1 -add_random_offset -max_iteration 10 -stop_percentage 1
-reduce_step 1 rbf -fclparam cvkmeans.param -hspread 1 -exhspread 1 -max_ratio
1e+06 -minvar 1e-06 -fast_nhidden 0
cvrbf.run CV 5: iris.train 4.00 % Err   25.1 RMS Err   0.41 secs
cvrbf.run {kmeans -ncenters 4 } {rbf -fast_train -fast_nhidden 2 }
cvrbf.run CV 5: iris.train 4.67 % Err   17.7 RMS Err   0.51 secs
FIGURE C.1
gauss
-train -create -pathexp /u/kukolich/Tutorial\
-ferror XORgauss.err.train -fparam XORgauss.param\
-pathdata /u/kukolich/lnknet/data/class -finput XOR.train\
-fdescribe XOR.defaults -npatterns 16 -ninputs 2 -normalize\
-fnorm XOR.norm.simple -cross_valid 0 -fcross_valid XOR.train.cv\
-random_cv -random -seed 0 -priors_npatterns 16 -debug 0 -verbose 3\
-verror 2 -full -per_class -minvar 1e-05 -max_ratio 1e+06
Fri Apr 7 09:49:19 1995
normalization data
<BEGIN_PARAM>
normalization 1
ninputs 2
nclasses 2
total_trials 16
max_features 2
<END_PARAM>
<BEGIN_PARAM>
{BEGIN_VECTOR
means 2
0 0.5
1 0.5
END_VECTOR}
{BEGIN_VECTOR
sdevs 2
0 0.504975
1 0.504975
END_VECTOR}
<END_PARAM>
Gauss parameters
Gauss constants
No feature selection
gauss model
<BEGIN_PARAM>
ninputs 2
noutputs 2
total_trials 16
full_covar 1
per_class 1
<END_PARAM>
<BEGIN_PARAM>
{BEGIN_VECTOR
features 1
0 0
END_VECTOR}
{BEGIN_VECTOR
class_trials 2
0 8
1 8
END_VECTOR}
<END_PARAM>
<BEGIN_PARAM>
{BEGIN_VECTOR
mean 2
0 -1.49012e-08
1 -7.45058e-09
END_VECTOR}
<END_PARAM>
<BEGIN_PARAM>
{BEGIN_VECTOR
mean 2
0 0
1 -7.45058e-09
END_VECTOR}
<END_PARAM>
<BEGIN_PARAM>
log_det -1.41082
{BEGIN_MATRIX
inv_covar 2 2
0 0  25.7526
0 1 -25.2477
1 0 -25.2477
1 1  25.7526
END_MATRIX}
<END_PARAM>
<BEGIN_PARAM>
log_det -1.41082
{BEGIN_MATRIX
inv_covar 2 2
0 0  25.7524
0 1  25.2475
1 0  25.2475
1 1  25.7524
END_MATRIX}
<END_PARAM>
FIGURE C.2
-suffix XORgauss
>! XORgauss.c
FIGURE C.3
/* XORgauss.param */
/* Per Class, Full */
/* norm: SIMPLE */
/* 2 (all) features */
#include <math.h>
/* macro definitions */
#ifndef SQR
#define SQR(x) ((x) * (x))
#endif
#ifndef M_LOG10E
#define M_LOG10E 0.43429448190325182765
#endif
#ifndef M_LN10
#define M_LN10 2.30258509299404568402
#endif
#ifndef LOG_2PI
#define LOG_2PI 0.798179868358
#endif
/* function declarations */
extern int classify_XORgauss(/* float *,float * */);
extern int normalize_XORgauss(/* float * */);
#define NRAW_XORgauss 2
#define NINPUTS_XORgauss 2
#define NCLASSES_XORgauss 2
classify_XORgauss
normalize_XORgauss
int normalize_XORgauss(inputs)
float *inputs;
{
int n;
float new_inputs[NRAW_XORgauss];
    /* Do feature selection on new_inputs, copying selected features
       back into inputs. Return the number of normalized selected
       features. */
}
FIGURE C.4
Main routine for generating decision region patterns and classes.
#undef NRAW_XORgauss
#undef NINPUTS_XORgauss
#undef NCLASSES_XORgauss
[Decision region plot for XOR: X0 vs. X1, axes from -1.1 to 2.1; class labels A and B.]
APPENDIX D
FIGURE D.1
angle (lnknet/data/class/angle.train)
[Scatter plot of the training data: X0 vs. X1, classes 0 and 1. Norm:Simple Diagonal Grand, Ninputs 2]
Angle is a generated data base with two classes. Each class is made of points uniformly
sampled from a pair of ovals which are identical except for the positions of their centers.
After the points were generated the data was rotated 60 degrees to put the classes at an
angle to the origin.
FIGURE D.2
bulls (lnknet/data/class/bulls.train)
[Scatter plot of the training data: X0 vs. X1, classes 0 and 1. Norm:Simple Diagonal Grand, Ninputs 2]
This is bulls-eye data. There are two classes. One contains patterns uniformly distributed in a disk of radius 1.0 and the other class contains patterns uniformly distributed in
an annulus with an inner radius of 1.0 and an outer radius of 5.0.
166
FIGURE D.3
cross (lnknet/data/class/cross.train)
[Scatter plot of the training data: X0 vs. X1, classes 0 and 1. Norm:Simple Diagonal Grand, Ninputs 2]
Class zero contains patterns from a 2-D Gaussian distribution with positively correlated features. Class one contains patterns from a similar 2-D Gaussian distribution with negatively correlated features. These distributions overlap and form a cross, as shown. The data base was intended for use in testing the Gaussian classifier with full covariance matrices.
167
FIGURE D.4
daisy (lnknet/data/class/daisy.train)
[Scatter plot of the training data: X0 vs. X1, 12 classes labeled A through L. Norm:Simple Diagonal Grand, Ninputs 2]
This data base has 12 classes. The points in each class are in Gaussian clusters which
radiate out from the origin. There are one, two, or three Gaussians per class. This data
base was generated to test the Gaussian Mixture Classifier.
168
FIGURE D.5
digit1 (lnknet/data/class/digit1.train)
[Scatter plot of the training data: X0 (c0 vs c1), seven digit classes. Norm:Simple Diagonal Grand, Ninputs 22]
The classes in this speech data base are the first seven monosyllabic digits from the TI
digit data base. A version of the TI data base was sampled at 12kHz and processed to
extract 15 mel cepstra from 10 msec frames. Eleven of the low cepstral values were
used from two frames of each word. One frame was taken where the energy was highest
and the other frame is from 30 msec before the highest energy frame. This data base was
generated by Richard Lippmann of MIT Lincoln Laboratory [28].
169
FIGURE D.6
disjoint (lnknet/data/class/disjoint.train)
[Scatter plot of the training data: X0 vs. X1, classes 0 and 1. Norm:Simple Diagonal Grand, Ninputs 2]
Class 0 contains patterns uniformly sampled from a 6 by 3 unit rectangle. There are two square regions with no patterns from class 0. The first square lies between (0,0) and (1,1). The second square lies between (2,0) and (3,1). Class 1 contains patterns uniformly sampled within these squares.
170
FIGURE D.7
disjoint_tail (lnknet/data/class/disjoint_tail.train)
[Histogram of the training data: X0, classes A and B plus Total. Norm:Simple Diagonal Grand, Ninputs 1]
Class A contains patterns with a bimodal distribution. Most of the patterns are found in a uniform distribution covering the range (-0.5,0.5). An additional 10% of the class A patterns are found in a second uniform distribution covering the range (99.5,100.5). The class B patterns have a Gaussian distribution with a mean at 2 and a standard deviation of 1.0.
171
FIGURE D.8
Disk (lnknet/data/class/Disk.train)
[Scatter plot of the training data: X0 vs. X1, classes 0 and 1. Norm:Simple Diagonal Grand, Ninputs 2]
This data base has two uniformly sampled ellipses which have been rotated to put them
at a 45 degree angle to the input features.
172
FIGURE D.9
DiskOut (lnknet/data/class/DiskOut.train)
[Scatter plot of the training data: X0 vs. X1, classes 0 and 1. Norm:Simple Diagonal Grand, Ninputs 2]
This data base has two uniformly sampled ellipses which have been rotated to put them
at a 45 degree angle to the input features. Class 1 has an additional set of patterns separated from the main ellipse by 10 units.
173
FIGURE D.10
Gap (lnknet/data/class/Gap.train)
[Scatter plot of the training data: X0 vs. X1, classes 0 and 1. Norm:Simple Diagonal Grand, Ninputs 2]
The data for each class is uniformly sampled from a rectangle with height 1. The width
of the rectangle for class 0 is 1, the width for class 1 is 10. Each class has the same number of patterns. This data base was generated while testing the Perceptron Convergence
Procedure cost function of the MLP classifier.
174
FIGURE D.11
gmix (lnknet/data/class/gmix.train)
[Scatter plot of the training data: X0 vs. X1, classes 0 and 1. Norm:Simple Diagonal Grand, Ninputs 2]
The patterns for each class are taken from Gaussian mixture distributions. Each Gaussian mixture distribution has three clusters, as shown, with one half of the patterns in the
central cluster and one quarter of the patterns in each of the other two clusters. This data
base was generated to test the Gaussian mixture classifier using diagonal covariance
matrices.
175
FIGURE D.12
gmix_close (lnknet/data/class/gmix_close.train)
[Scatter plot of the training data: X0 vs. X1, classes 0 and 1. Norm:Simple Diagonal Grand, Ninputs 2]
The patterns for each class are taken from Gaussian mixture distributions. Each Gaussian mixture distribution has three clusters, as shown, with one half of the patterns in the central cluster and one quarter of the patterns in each of the other two clusters. This data base is very similar to the gmix data base. The class distributions, however, are considerably closer together and overlap.
176
FIGURE D.13
gnoise (lnknet/data/class/gnoise.train)
[Scatter plot of the training data: X0 vs. X1, classes 0 through 9. Norm:Simple Diagonal Grand, Ninputs 8]
Each pattern was generated by adding Gaussian noise to the center of each class in all eight input dimensions. The standard deviation of the noise is 0.5. The class means are found along the line x_d = j (0 ≤ d ≤ 7, 0 ≤ j ≤ 9), where x_d is the value of the dth feature and j is the class.
177
FIGURE D.14
gnoise_var (lnknet/data/class/gnoise_var.train)
[Scatter plot of the training data: X0 vs. X7, classes 0 through 9. Norm:Simple Diagonal Grand, Ninputs 8]
This is a modified version of gnoise in which the variance is smaller for the higher input features. Each pattern was generated by adding Gaussian noise to the center of its class in all eight input dimensions. The standard deviation decreases as the dimension increases: σ_d = (8 - d) ⋅ 0.25, where d is the number of the dimension (0 ≤ d ≤ 7). The class means are found along the line x_d = j (0 ≤ d ≤ 7, 0 ≤ j ≤ 9). The plot shows the first, noisiest, dimension plotted against the last, least noisy, dimension.
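A sketch of this noise schedule, assuming the formula reads σ_d = (8 - d) ⋅ 0.25, so the standard deviation falls from 2.0 in dimension 0 to 0.25 in dimension 7:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed noise schedule: sigma_d = (8 - d) * 0.25.
sigmas = (8 - np.arange(8)) * 0.25

def sample_gnoise_var(j, n_patterns):
    """Class j has mean j in every dimension; the noise standard
    deviation shrinks as the dimension index grows."""
    noise = rng.normal(0.0, sigmas, size=(n_patterns, 8))
    return float(j) + noise

X9 = sample_gnoise_var(9, 100)
```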
FIGURE D.15 HalfDisk (lnknet/data/class/HalfDisk.train; Ninputs: 2; Norm: Simple Diagonal Grand; scatter plot of inputs X0 vs X1 for classes 0 and 1)
The patterns in this data base were all uniformly sampled from an ellipse which is ten
times longer in the first direction than in the second. All the sampled patterns with
X1 > 0 were assigned to class 0. All other patterns were assigned to class 1.
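A sketch of this sampling scheme, using rejection sampling; the semi-axis lengths of 1.0 and 0.1 are assumptions read off the plot, not values stated in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_half_disk(n_patterns, a=1.0, b=0.1):
    """Rejection-sample points uniformly from an ellipse whose first
    semi-axis is ten times the second, then assign class 0 to points
    with X1 > 0 and class 1 to all others."""
    points = []
    while len(points) < n_patterns:
        x0 = rng.uniform(-a, a)
        x1 = rng.uniform(-b, b)
        if (x0 / a) ** 2 + (x1 / b) ** 2 <= 1.0:
            points.append((x0, x1))
    X = np.array(points)
    labels = np.where(X[:, 1] > 0, 0, 1)
    return X, labels

X, y = sample_half_disk(500)
```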
FIGURE D.16 high_tail (lnknet/data/class/high_tail.train; Ninputs: 1; Norm: Simple Diagonal Grand; plot of input X0 for classes A and B and the class total)
The first class was generated using a single Gaussian distribution with a mean of 5 and a variance of 1. The second class was generated by sampling two overlapping uniform distributions of differing lengths. The probability of a pattern in the second class falling in the first segment is 0.9. The first segment covers the range (-0.5, 0.5); the second segment covers the range (-50, 50).
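The two classes can be sketched like this; the number of patterns per class is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # assumed patterns per class

# Class A: a single Gaussian with mean 5 and variance 1.
class_a = rng.normal(5.0, 1.0, size=n)

# Class B: a mixture of two overlapping uniform segments; a pattern
# falls in the short segment with probability 0.9.
in_first = rng.random(n) < 0.9
class_b = np.where(in_first,
                   rng.uniform(-0.5, 0.5, size=n),
                   rng.uniform(-50.0, 50.0, size=n))
```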
FIGURE D.17 iris (lnknet/data/class/iris.train; Ninputs: 4; Norm: Simple Diagonal Grand; scatter plot of sepal_length (X0) vs sepal_width (X1) for the classes Setosa, Versicolour, and Virginica)
This is R. A. Fisher's iris data [8]. The data set contains three classes with 50 patterns per class. Each class is a type of iris plant. The inputs are the sepal length and width and the petal length and width, all in centimeters.
FIGURE D.18 ocrdigit (Ninputs: 64)
This is the little 1200 data base, which was collected at AT&T Bell Laboratories by Isabelle Guyon [35] and her collaborators. Twelve people wrote the 10 digits several times each. The data was mapped to a 64 by 64 grid and then smoothed and fit into an 8 by 8 grid. This smoothed data was provided by John Hampshire.
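The reduction from a 64 by 64 grid to an 8 by 8 grid can be illustrated by block averaging; the exact smoothing used for ocrdigit is not specified here, so this is only a rough analogue:

```python
import numpy as np

def downsample_64_to_8(bitmap):
    """Collapse a 64x64 bitmap to an 8x8 grid by averaging
    each 8x8 block of pixels."""
    return bitmap.reshape(8, 8, 8, 8).mean(axis=(1, 3))

# A toy "image": the top half of the grid is on, the bottom half off.
img = np.zeros((64, 64))
img[:32, :] = 1.0
small = downsample_64_to_8(img)
```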
FIGURE D.19 pbvowel (lnknet/data/class/pbvowel.train; Ninputs: 5; Norm: Simple Diagonal Grand; scatter plot of the first formant (X1) vs the second formant (X2) for the classes heed, hid, head, had, hud, hod, hawed, hood, whod, and heard)
This is most of the original Peterson and Barney [32] vowel data, which was collected in the 1950s. The original data was collected from 67 speakers, each of whom said the 10 words given in the legend. The inputs are the pitch and first three formant frequencies of the vowel in each word and whether the speaker was a man, woman, or child.
FIGURE D.20 uniform_1_1 (lnknet/data/class/uniform_1_1.train; Ninputs: 2; Norm: Simple Diagonal Grand; scatter plot of inputs X0 vs X1 for classes 0 and 1)
The three data bases, uniform_1_1, uniform_2_1, and uniform_10_1, were all generated by sampling the same pair of Gaussian distributions. The means of the Gaussians are one standard deviation apart along the line x = y. The differences among the three data bases are in the number of patterns sampled for each class. There are 500 patterns in each class in uniform_1_1. Uniform_2_1 has 666 patterns from class 0 and 333 patterns from class 1, giving a 2 to 1 ratio in the a priori probabilities of the classes. Uniform_10_1 has 1000 patterns from class 0 and 100 from class 1, giving a 10 to 1 ratio in the class probabilities. These data bases were generated to test the a priori probability adjustment features of the LNKnet classifiers.
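The three data bases differ only in how many patterns are drawn per class, as this sketch shows. The exact Gaussian means here are assumptions; only the one-standard-deviation separation along x = y is stated:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pair(n0, n1, sigma=1.0):
    """Two identical spherical Gaussians whose means lie one standard
    deviation apart along the line x = y; n0 and n1 set the a priori
    class probabilities."""
    d = sigma / np.sqrt(2.0)  # offset so the means are sigma apart
    X0 = rng.normal(0.0, sigma, size=(n0, 2))
    X1 = np.array([d, d]) + rng.normal(0.0, sigma, size=(n1, 2))
    return X0, X1

uniform_1_1 = sample_pair(500, 500)    # equal priors
uniform_2_1 = sample_pair(666, 333)    # 2 to 1 priors
uniform_10_1 = sample_pair(1000, 100)  # 10 to 1 priors
```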
FIGURE D.21 uniform_2_1 (lnknet/data/class/uniform_2_1.train; Ninputs: 2; Norm: Simple Diagonal Grand; scatter plot of inputs X0 vs X1 for classes 0 and 1)
Data taken from two identical Gaussian distributions with means one standard deviation
apart. There are twice as many patterns from class 0 as from class 1.
FIGURE D.22 uniform_10_1 (lnknet/data/class/uniform_10_1.train; Ninputs: 2; Norm: Simple Diagonal Grand; scatter plot of inputs X0 vs X1 for classes 0 and 1)
Data taken from two identical Gaussian distributions with means one standard deviation
apart. There are 10 times as many patterns from class 0 as from class 1.
FIGURE D.23 vowel (lnknet/data/class/vowel.train; Ninputs: 2; Norm: Simple Diagonal Grand; scatter plot of the normalized first formant (X0) vs second formant (X1) for the classes heed, hid, head, had, hud, hod, hawed, hood, whod, and heard)
This data is a normalized version of some of the Peterson and Barney[32] vowel data.
The inputs are the first and second formant frequencies of the vowel in the 10 words
given in the legend. The frequency data is normalized to be between 0 and 1.
FIGURE D.24 XOR (lnknet/data/class/XOR.train; Ninputs: 2; Norm: Simple Diagonal Grand; scatter plot of inputs X0 vs X1 for classes A and B)
This data base contains hand-generated patterns from the XOR problem. It is intended for testing algorithms by hand, to verify the outputs of calculations.
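The idealized XOR problem behind this data base is small enough to write out in full; which class is labeled A and which B is an assumption here:

```python
# The four canonical XOR patterns: the class is the exclusive-or
# of the two binary inputs (the A/B class assignment is assumed).
xor_patterns = [
    ((0.0, 0.0), "A"),  # 0 xor 0 = 0
    ((0.0, 1.0), "B"),  # 0 xor 1 = 1
    ((1.0, 0.0), "B"),  # 1 xor 0 = 1
    ((1.0, 1.0), "A"),  # 1 xor 1 = 0
]

# No single linear boundary separates the two classes, which is why
# XOR is a standard hand check for nonlinear classifiers.
for (x0, x1), label in xor_patterns:
    assert (int(x0) ^ int(x1) == 1) == (label == "B")
```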
APPENDIX E
Using OpenWindows
SUN has an excellent tutorial describing the use of the OpenLook window manager.
The tutorial can be found in $OPENWINHOME/bin/helpopen. SUN also publishes an
OpenWindows User Guide which provides information on the use of OpenLook style
applications. If these sources of information are unavailable, this appendix covers those
parts of OpenWindows that affect the use of LNKnet. This appendix assumes that you
can already start a window manager. For more information on how to do this, contact
your system administrator.
FIGURE E.1 (a Sun mouse with Select, Adjust, and Menu buttons, and the mouse pointer)
The mouse controls a pointer, also shown in Figure E.1. When the mouse is moved on
its pad, the pointer moves on the screen. To select a button on a LNKnet window, you
must first use the mouse to move the pointer over the button. It may also be necessary to
click the mouse on the bar at the top of the window to bring the LNKnet window to
the attention of the window manager. To do this, move the pointer to the bar at the top of
the window. Press and release the select button on the mouse.
E.1.1 Menus
In OpenWindows, a menu is indicated by a small triangle on a button or a small box. To
select something from a menu you must first move the mouse pointer to the triangle.
Press and hold the menu button on the mouse. The menu attached to the triangle should
now appear. Still holding down the mouse menu button, drag the pointer down the menu
to the item you want to select, then let go. It is possible that the item you want also has a
triangle next to it, indicating that there is another menu that must be selected from. If so,
do not let go of the menu button. Drag the mouse pointer in the direction the triangle
points to bring up the second menu and proceed as before.
FIGURE E.2 Scrolling List (top anchor: select to scroll to the top; elevator: top scrolls up, middle scrolls up or down, bottom scrolls down; bottom anchor: select to scroll to the bottom)
up windows, or perform some function inside LNKnet. The setting objects set a LNKnet or algorithm variable to one of a short list of settings. The check boxes are graphical binary flags; they turn LNKnet features on or off. Figure E.4 shows a set of LNKnet selection objects.
FIGURE E.4 LNKnet Objects that are Set using the Select Mouse Button (for example, a button that brings up the C Code Generation window)
E.3 Windows
LNKnet is built around many windows. There is a main window and many popup windows. When olwm is your window manager, these windows come up automatically.
With some other window managers, each window must be placed when it is displayed.
In olwm, the main difference between the main window and the popup windows is that
the popup windows are displayed with a push pin in the upper left corner of the window.
If you select the push pin, the popup window will disappear. It can be redisplayed by selecting again the button that originally brought it up. If you select the small
square in the upper left of the main window, LNKnet will be iconified. That is, the
LNKnet main window and all of its popup windows will disappear and a square icon
will appear somewhere on your screen. Double clicking on the LNKnet icon will redisplay the main LNKnet window and its popups. Figure E.6 shows the bar at the top of the
LNKnet main window and two popup windows as well as the LNKnet icon.
FIGURE E.6 (the bar at the top of the LNKnet main window, two popup windows, and the LNKnet icon)
BIBLIOGRAPHY
1. Bruce G. Batchelor, ed. Pattern Recognition: Ideas in Practice. Plenum Press: New York, (1978).
2. Christopher M. Bishop, Neural Networks for Pattern Recognition. Oxford: Clarendon Press,
(1995).
3. Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone, Classification and Regression Trees. Belmont, California: Wadsworth, Inc., (1984).
4. Linde, Y., A. Buzo, and R.M. Gray, An Algorithm for Vector Quantizer Design, IEEE Transactions on Communications, COM-28, 84-95, 1980.
5. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge University Press (2000).
6. Eric I. Chang and Richard P. Lippmann, Using Genetic Algorithms to Select and Create Features
for Pattern Classification, MIT Lincoln Laboratory, Technical Report 892, 1991.
7. R.O. Duda, P.E. Hart, and David Stork, Pattern Classification (Second Edition). New York: Wiley
(2000).
8. R. A. Fisher, The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics,
7, 179-188, 1936.
9. K. Fukunaga, Introduction to Statistical Pattern Recognition (Second Edition). New York, NY:
Academic Press (1990).
10. John B. Hampshire and B. V. K. Vijaya Kumar. Why Error Measures are Sub-Optimal for Training Neural Network Pattern Classifiers, in IEEE Proceedings of the 1992 International Joint Conference on Neural Networks. IEEE, 1992.
11. John B. Hampshire and Alexander H. Waibel, A Novel Objective Function for Improved Phoneme Recognition Using Time-Delay Neural Networks, in IEEE Transactions on Neural Networks,
216-228, 1990.
12. J. A. Hartigan, Clustering Algorithms. New York: John Wiley and Sons (1975).
13. J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. AddisonWesley (1991).
14. Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning,
28. Kenney Ng and Richard P. Lippmann, A Comparative Study of the Practical Characteristics of
Neural Network and Conventional Pattern Classifiers, MIT Lincoln Laboratory, Technical Report
894, 1991.
29. Kenney Ng and Richard P. Lippmann, A Comparative Study of the Practical Characteristics of Neural Network and Conventional Pattern Classifiers, in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, (Eds.), Morgan Kaufmann: San Mateo, California, 970-976, 1990.
30. Nils J. Nilsson, Learning Machines. McGraw Hill, N.Y. (1965).
31. T. W. Parsons, Voice and Speech Processing. New York: McGraw-Hill (1986).
32. Gordon E. Peterson and Harold L. Barney, Control Methods Used in a Study of Vowels, in The Journal of the Acoustical Society of America, 175-184, 1952.
33. Platt, J., Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized
Likelihood Methods, in Advances in Large Margin Classifiers, A. Smola, et al., Editors. 2000, MIT
Press, http://www.research.microsoft.com/users/jplatt/SVMprob.ps.gz.
34. Platt, J., Fast Training of Support Vector Machines Using Sequential Minimal Optimization, in
Advances in Kernel Methods - Support Vector Learning, B. Scholkopf, C. Burges, and A. Smola, Editors. 1998, MIT Press, http://www.research.microsoft.com/~jplatt/smo.html.
35. I. Guyon, I. Poujaud, et al., Comparing Different Neural Network Architectures for Classifying Handwritten Digits, in Proceedings International Joint Conference on Neural Networks, Washington DC, II.127-II.132, 1989.
36. Mike D. Richard and Richard P. Lippmann, Neural Network Classifiers Estimate Bayesian a Posteriori Probabilities, Neural Computation, 3, 461-483, 1992.
37. Elliot Singer and Richard P. Lippmann, Improved Hidden Markov Model Speech Recognition
Using Radial Basis Function Networks, in Neural Information Processing Systems 4, J. Moody, S.
Hanson, and R. Lippmann, (Eds.), Morgan Kaufmann: San Mateo, California, 1992.
38. Elliot Singer and Richard P. Lippmann. A Speech Recognizer Using Radial Basis Function Neural
Networks in an HMM Framework. in Proceedings International Conference on Acoustics Speech and
Signal Processing. San Francisco: IEEE, 1992.
39. Donald Specht, Probabilistic Neural Networks, in Neural Networks, Volume 3, pp. 109-118, Pergamon Press, 1990.
40. Sholom M. Weiss and Casimir A. Kulikowski, Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. San Mateo,
California: Morgan Kaufmann (1991).
SUBJECT INDEX
back propagation 51
backward feature search 86
bad flags 129
batch files 112
binary splitting clustering 73
binary tree classifier (BINTREE) 64
initializing MLP 113
internals plot 97
structure plot 96
BINTREE 64
bugs 129
bullseye data base 166
EM_CLUS 73
environment variable
LD_LIBRARY_PATH 133
LNKHOME 136
MANPATH 129, 136
PATH 129, 136
error file 80, 125
format 125
verbosity 13
error summary 24, 143
LD_LIBRARY_PATH 133
LDA 83
leader clustering (LEAD_CLUS) 75
learning vector quantizer (LVQ) 64
leave-one-out cross validation 62, 86
likelihood classifiers 56
linear classifier, gauss 57
linear discriminant analysis (LDA) 44, 83
LNK2gobi 115
LNKHOME 136
LNKnet defaults file (.lnknetrc) 19, 79
log file 21, 80, 123, 141
verbosity 80
lpr 109
LVQ 64
main window 77
maker interchange format 109
MANPATH 129, 136
manual pages 129
maximum likelihood 53
MIF 107, 109
missing file 131
MLP 51
movie mode 106
multi-layer perceptron classifier (MLP) 11, 51, 139
cost window 54
initialization by bintree 113
internals plot 101
main window 51
node window 54
output sigmoid 53
parameters 12
slow training 132
structure plot 100
weight window 52
ROC 103
scatter 91
structure 96
plot file 107, 126
plot only 41, 77
plot selection 15, 92
plot un-normalized data 46, 93
plot2mif 107
plot2ps 108
plotting dimensions 40, 93
posterior class probabilities 51
posterior probability plot 28, 102
window 20
PostScript 107, 108, 109
preview 109
principal components analysis (PCA) 44, 83
printing 107, 109
prior class probabilities 87
priors window 87
profile plot 25, 93
window 18
gmix 99
irbf 99
mlp 100
rbf 99
window 18
support vector machine 3, 7, 65
SVM 7, 65
xgobi 115
XOR data base 159, 164, 188
xplot 106, 126, 133