Self-Organizing Map: Machine Learning Data Mining

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 10

Self-organizing map

From Wikipedia, the free encyclopedia


Jump to navigationJump to search
This article may require cleanup to meet Wikipedia's quality
standards. No cleanup reason has been specified. Please help improve this
article if you can. (June 2011) (Learn how and when to remove this template message)

Machine learning and


data mining

Problems[show]

Supervised learning
(classification • regression)

[show]

Clustering[show]

Dimensionality reduction[show]

Structured prediction[show]

Anomaly detection[show]

Artificial neural networks[show]

Reinforcement learning[show]

Theory[show]

Machine-learning venues[show]
Glossary of artificial intelligence[show]

Related articles[show]

 Machine learning portal

 v
 t
 e

A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural
network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically
two-dimensional), discretized representation of the input space of the training samples, called
a map, and is therefore a method to do dimensionality reduction. Self-organizing maps differ from
other artificial neural networks as they apply competitive learning as opposed to error-correction
learning (such as backpropagation with gradient descent), and in the sense that they use a
neighborhood function to preserve the topological properties of the input space.

A self-organizing map showing U.S. Congressvoting patterns. The input data was a table with a row for each
member of Congress, and columns for certain votes containing each member's yes/no/abstain vote. The SOM
algorithm arranged these members in a two-dimensional grid placing similar members closer together. The
first plot shows the grouping when the data are split into two clusters. The second plot shows average
distance to neighbours: larger distances are darker. The third plot predicts Republican (red)
or Democratic (blue) party membership. The other plots each overlay the resulting map with predicted values
on an input dimension: red means a predicted 'yes' vote on that bill, blue means a 'no' vote. The plot was
created in Synapse.

This makes SOMs useful for visualization by creating low-dimensional views of high-dimensional
data, akin to multidimensional scaling. The artificial neural network introduced by
the Finnish professor Teuvo Kohonen in the 1980s is sometimes called a Kohonen
map or network.[1][2] The Kohonen net is a computationally convenient abstraction building on
biological models of neural systems from the 1970s[3] and morphogenesis models dating back
to Alan Turing in the 1950s.[4]
While it is typical to consider this type of network structure as related to feedforward networks where
the nodes are visualized as being attached, this type of architecture is fundamentally different in
arrangement and motivation.
Useful extensions include using toroidal grids where opposite edges are connected and using large
numbers of nodes.
It has been shown that while self-organizing maps with a small number of nodes behave in a way
that is similar to K-means, larger self-organizing maps rearrange data in a way that is fundamentally
topological in character.
It is also common to use the U-Matrix.[5] The U-Matrix value of a particular node is the average
distance between the node's weight vector and that of its closest neighbors.[6] In a square grid, for
instance, we might consider the closest 4 or 8 nodes (the Von Neumannand Moore neighborhoods,
respectively), or six nodes in a hexagonal grid.
Large SOMs display emergent properties. In maps consisting of thousands of nodes, it is possible to
perform cluster operations on the map itself.[7]

Contents

 1Structure and operations


 2Learning algorithm
o 2.1Variables
o 2.2Algorithm
o 2.3SOM Initialization
 3Examples
o 3.1Fisher's Iris Flower Data
 4Interpretation
 5Alternatives
 6Applications
 7See also
 8Notes
 9References

Structure and operations[edit]


Like most artificial neural networks, SOMs operate in two modes: training and mapping. "Training"
builds the map using input examples (a competitive process, also called vector quantization), while
"mapping" automatically classifies a new input vector.
The visible part of a self-organizing map is the map space, which consists of components called
nodes or neurons. The map space is defined beforehand, usually as a finite two-dimensional region
where nodes are arranged in a regular hexagonal or rectangular grid.[8] Each node is associated with
a "weight" vector, which is a position in the input space; that is, it has the same dimension as each
input vector. While nodes in the map space stay fixed, training consists in moving weight vectors
toward the input data (reducing a distance metric) without spoiling the topology induced from the
map space. Thus, the self-organizing map describes a mapping from a higher-dimensional input
space to a lower-dimensional map space. Once trained, the map can classify a vector from the input
space by finding the node with the closest (smallest distance metric) weight vector to the input space
vector.

Learning algorithm[edit]
The goal of learning in the self-organizing map is to cause different parts of the network to respond
similarly to certain input patterns. This is partly motivated by how visual, auditory or
other sensory information is handled in separate parts of the cerebral cortex in the human brain.[9]

An illustration of the training of a self-organizing map. The blue blob is the distribution of the training data, and
the small white disc is the current training datum drawn from that distribution. At first (left) the SOM nodes are
arbitrarily positioned in the data space. The node (highlighted in yellow) which is nearest to the training datum
is selected. It is moved towards the training datum, as (to a lesser extent) are its neighbors on the grid. After
many iterations the grid tends to approximate the data distribution (right).

The weights of the neurons are initialized either to small random values or sampled evenly from the
subspace spanned by the two largest principal component eigenvectors. With the latter alternative,
learning is much faster because the initial weights already give a good approximation of SOM
weights.[10]
The network must be fed a large number of example vectors that represent, as close as possible,
the kinds of vectors expected during mapping. The examples are usually administered several times
as iterations.
The training utilizes competitive learning. When a training example is fed to the network,
its Euclidean distance to all weight vectors is computed. The neuron whose weight vector is most
similar to the input is called the best matching unit (BMU). The weights of the BMU and neurons
close to it in the SOM grid are adjusted towards the input vector. The magnitude of the change
decreases with time and with the grid-distance from the BMU. The update formula for a neuron v
with weight vector Wv(s) is

,
where s is the step index, t an index into the training sample, u is the index of the BMU for the
input vector D(t), α(s) is a monotonically decreasing learning coefficient; Θ(u, v, s) is the
neighborhood function which gives the distance between the neuron u and the neuron v in step
s.[11] Depending on the implementations, t can scan the training data set systematically (t is 0, 1,
2...T-1, then repeat, T being the training sample's size), be randomly drawn from the data set
(bootstrap sampling), or implement some other sampling method (such as jackknifing).
The neighborhood function Θ(u, v, s) depends on the grid-distance between the BMU
(neuron u) and neuron v. In the simplest form, it is 1 for all neurons close enough to BMU and 0
for others, but a Gaussian function is a common choice, too. Regardless of the functional form,
the neighborhood function shrinks with time.[9] At the beginning when the neighborhood is broad,
the self-organizing takes place on the global scale. When the neighborhood has shrunk to just a
couple of neurons, the weights are converging to local estimates. In some implementations, the
learning coefficient α and the neighborhood function Θ decrease steadily with increasing s, in
others (in particular those where t scans the training data set) they decrease in step-wise
fashion, once every T steps.
This process is repeated for each input vector for a (usually large) number of cycles λ. The
network winds up associating output nodes with groups or patterns in the input data set. If these
patterns can be named, the names can be attached to the associated nodes in the trained net.
During mapping, there will be one single winning neuron: the neuron whose weight vector lies
closest to the input vector. This can be simply determined by calculating the Euclidean distance
between input vector and weight vector.
While representing input data as vectors has been emphasized in this article, it should be noted
that any kind of object which can be represented digitally, which has an appropriate distance
measure associated with it, and in which the necessary operations for training are possible can
be used to construct a self-organizing map. This includes matrices, continuous functions or even
other self-organizing maps.
Variables[edit]
These are the variables needed, with vectors in bold,

 is the current iteration

 is the iteration limit

 is the index of the target input data vector in the input data set

 is a target input data vector

 is the index of the node in the map

 is the current weight vector of node

 is the index of the best matching unit (BMU) in the map

 is a restraint due to distance from BMU, usually called the neighborhood function, and

 is a learning restraint due to iteration progress.


Algorithm[edit]

1. Randomize the node weight vectors in a map

2. Randomly pick an input vector


3. Traverse each node in the map
1. Use the Euclidean distance formula to find the similarity between the input
vector and the map's node's weight vector
2. Track the node that produces the smallest distance (this node is the best
matching unit, BMU)
4. Update the weight vectors of the nodes in the neighborhood of the BMU (including the
BMU itself) by pulling them closer to the input vector

1.

5. Increase and repeat from step 2 while


A variant algorithm:

1. Randomize the map's nodes' weight vectors


2. Traverse each input vector in the input data set
1. Traverse each node in the map
1. Use the Euclidean distance formula to find the similarity between the
input vector and the map's node's weight vector
2. Track the node that produces the smallest distance (this node is the
best matching unit, BMU)
2. Update the nodes in the neighborhood of the BMU (including the BMU itself) by
pulling them closer to the input vector

1.

3. Increase and repeat from step 2 while


SOM Initialization[edit]
Selection of a good initial approximation is a well-known problem for all iterative methods of
learning neural networks. Kohonen[12] used random initiation of SOM weights. Recently, principal
component initialization, in which initial map weights are chosen from the space of the first
principal components, has become popular due to the exact reproducibility of the results.[13]
Careful comparison of the random initiation approach to principal component initialization for
one-dimensional SOM (models of principal curves) demonstrated that the advantages of
principal component SOM initialization are not universal. The best initialization method depends
on the geometry of the specific dataset. Principal component initialization is preferable (in
dimension one) if the principal curve approximating the dataset can be univalently and linearly
projected on the first principal component (quasilinear sets). For nonlinear datasets, however,
random initiation performs better.[14]

Examples[edit]
Fisher's Iris Flower Data[edit]
This section does not cite any sources. Please help improve this
section by adding citations to reliable sources. Unsourced material may
be challenged and removed. (February 2010) (Learn how and when to remove
this template message)

This section possibly contains original research. Please improve


it by verifying the claims made and adding inline citations. Statements
consisting only of original research should be removed. (June 2017) (Learn
how and when to remove this template message)

Consider an n×m array of nodes, each of which contains a weight vector and is aware of its
location in the array. Each weight vector is of the same dimension as the node's input vector.
The weights may initially be set to random values.
Now we need input to feed the map. Colors can be represented by their red, green, and blue
components. Consequently, we will represent colors as vectors in the unit cube of the free
vector space over ℝ generated by the basis:
R = <255, 0, 0>
G = <0, 255, 0>
B = <0, 0, 255>
The diagram shown

Self organizing maps (SOM) of three and eight colors with U-Matrix.

compares the results of training on the data sets[Note 1]


threeColors = [255, 0, 0], [0, 255, 0], [0, 0, 255]
eightColors = [0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0], [0, 255, 255], [255, 0,
255], [255, 255, 255]
and the original images. Note the striking resemblance between the two.
Similarly, after training a 40×40 grid of neurons for 250 iterations with
a learning rate of 0.1 on Fisher's Iris, the map can already detect the main
differences between species.

Self organizing map (SOM) of Fisher's Iris Flower Data Set with U-Matrix. Top
left: a color image formed by the first three dimensions of the four-dimensional
SOM weight vectors. Top Right: a pseudo-color image of the magnitude of the
SOM weight vectors. Bottom Left: a U-Matrix (Euclidean distance between
weight vectors of neighboring cells) of the SOM. Bottom Right: An overlay of
data points (red: I. setosa, green: I. versicolor and blue: I. virginica) on the U-
Matrix based on the minimum Euclidean distance between data vectors and
SOM weight vectors.

Interpretation[edit]

Cartographical representation of a self-organizing map (U-Matrix) based on


Wikipedia featured article data (word frequency). Distance is inversely proportional
to similarity. The "mountains" are edges between clusters. The red lines are links
between articles.

One-dimensional SOM versus principal component analysis (PCA) for data


approximation. SOM is a red broken line with squares, 20 nodes. The first
principal component is presented by a blue line. Data points are the small grey
circles. For PCA, the fraction of variance unexplained in this example is
23.23%, for SOM it is 6.86%.[15]

There are two ways to interpret a SOM. Because in the training phase
weights of the whole neighborhood are moved in the same direction, similar
items tend to excite adjacent neurons. Therefore, SOM forms a semantic
map where similar samples are mapped close together and dissimilar ones
apart. This may be visualized by a U-Matrix(Euclidean distance between
weight vectors of neighboring cells) of the SOM.[5][6][16]
The other way is to think of neuronal weights as pointers to the input space.
They form a discrete approximation of the distribution of training samples.
More neurons point to regions with high training sample concentration and
fewer where the samples are scarce.
SOM may be considered a nonlinear generalization of Principal
components analysis(PCA).[17] It has been shown, using both artificial and
real geophysical data, that SOM has many advantages[18][19] over the
conventional feature extraction methods such as Empirical Orthogonal
Functions (EOF) or PCA.
Originally, SOM was not formulated as a solution to an optimisation
problem. Nevertheless, there have been several attempts to modify the
definition of SOM and to formulate an optimisation problem which gives
similar results.[20] For example, Elastic maps use the mechanical metaphor
of elasticity to approximate principal manifolds:[21] the analogy is an elastic
membrane and plate.

Alternatives[edit]
 The generative topographic map (GTM) is a potential alternative to
SOMs. In the sense that a GTM explicitly requires a smooth and
continuous mapping from the input space to the map space, it is
topology preserving. However, in a practical sense, this measure of
topological preservation is lacking.[22]
 The time adaptive self-organizing map (TASOM) network is an
extension of the basic SOM. The TASOM employs adaptive learning
rates and neighborhood functions. It also includes a scaling parameter
to make the network invariant to scaling, translation and rotation of the
input space. The TASOM and its variants have been used in several
applications including adaptive clustering, multilevel thresholding, input
space approximation, and active contour modeling.[23] Moreover, a
Binary Tree TASOM or BTASOM, resembling a binary natural tree
having nodes composed of TASOM networks has been proposed
where the number of its levels and the number of its nodes are adaptive
with its environment.[24]
 The growing self-organizing map (GSOM) is a growing variant of the
self-organizing map. The GSOM was developed to address the issue of
identifying a suitable map size in the SOM. It starts with a minimal
number of nodes (usually four) and grows new nodes on the boundary
based on a heuristic. By using a value called the spread factor, the data
analyst has the ability to control the growth of the GSOM.
 The elastic maps approach[25] borrows from the spline interpolation the
idea of minimization of the elastic energy. In learning, it minimizes the
sum of quadratic bending and stretching energy with the least
squares approximation error.
 The conformal approach [26][27] that uses conformal mapping to
interpolate each training sample between grid nodes in a continuous
surface. A one-to-one smooth mapping is possible in this approach.
 The oriented and scalable map (OS-Map) generalises the
neighborhood function and the winner selection[28]. The homogeneous
Gaussian neighborhood function is replaced with the matrix
exponential. Thus one can specify the orientation either in the map
space or in the data space. SOM has a fixed scale (=1), so that the
maps "optimally describe the domain of observation". But what about a
map covering the domain twice or in n-folds? This entails the
conception of scaling. The OS-Map regards the scale as a statistical
description of how many best-matching nodes an input has in the map.
Applications[edit]
 Meteorology and oceanography[29]
 Project prioritization and selection [30]
 Seismic facies analysis for oil and gas exploration [31]
 Failure mode and effects analysis [32]
 Creation of artwork [33]

You might also like