
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 1, Jan-Feb 2015

RESEARCH ARTICLE                                                      OPEN ACCESS

A Study on Algorithmic Approaches and Mining Methodologies in Data Mining

S. Padmapriya
Assistant Professor
Department of Computer Science
Srimad Andavan Arts and Science College
Tamil Nadu, India

ABSTRACT
Data mining finds valuable information hidden in the large volumes of data kept in databases. It is the analysis of data and the use of software techniques to find hidden patterns and regularities in sets of data. Knowledge discovery from large data sets is difficult, and the growing demand for finding patterns in huge data is met by data mining algorithms and techniques. Researchers have presented many approaches and algorithms for determining patterns. This paper surveys various data mining algorithms and mining methods for discovering valuable patterns from hidden information.
Keywords:- Data mining, Knowledge Discovery.

I. INTRODUCTION

Data mining is an emerging trend. The information age has enabled many organizations to gather large volumes of data. However, the usefulness of this data is negligible if meaningful information or knowledge cannot be extracted from it [4]. Data mining, otherwise known as knowledge discovery, attempts to answer this need. In contrast to standard statistical methods, data mining techniques search for interesting information without demanding a priori hypotheses. Nowadays, advances in hardware technology have led to an increase in the capability to store and record personal data.

II. BASIC TERMINOLOGY

Data:
Data is information, typically the results of measurement (numerical) or counting (categorical). Variables serve as placeholders for data. There are two types of variables, numerical and categorical. A numerical or continuous variable is one that can accept any value within a finite or infinite interval. There are two types of numerical data, interval and ratio. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided, because there is no true zero; for example, we cannot say that one day is twice as hot as another day. On the other hand, data on a ratio scale has a true zero and can be added, subtracted, multiplied or divided (e.g., weight). A categorical or discrete variable is one that can accept two or more values (categories). There are two types of categorical data, nominal and ordinal. Nominal data does not have an intrinsic ordering of its categories; for example, "gender" with two categories, male and female. In contrast, ordinal data does have an intrinsic ordering of its categories; for example, "level of energy" with three ordered categories (low, medium and high).

Figure 1: Data Mining Process
Data Preparation:
Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modelling. It is a solid practice to start with an initial dataset, to get familiar with the data, to discover first insights into it, and to gain a good understanding of any possible data quality issues. Data preparation is often a time-consuming process that is heavily prone to errors. The old saying "garbage in, garbage out" is particularly applicable to data mining projects where the gathered data contains many invalid, out-of-range and missing values. Analyzing data that has not been carefully screened for such problems can produce highly misleading results. The success of a data mining project therefore depends heavily on the quality of the prepared data.

Dataset:
A dataset is a collection of data, usually presented in tabular form. Each column represents a particular variable, and each row corresponds to a given member of the data. Datasets are classified into two types: test data and training data.

III. MINING METHODOLOGY

a. Classification
Classification problems aim to identify the characteristics that indicate the group to which each case belongs. This pattern can be used both to understand the existing data and to predict how new instances will behave [1]. Data mining creates classification models by examining already classified data (cases) and inductively finding a predictive pattern. These existing cases may come from a historical database, such as people who have already undergone a particular medical treatment or moved to a new long-distance service. They may come from an experiment in which a sample of the entire database is tested in the real world and the results are used to create a classifier.


b. Regression
Regression uses existing values to forecast what other values will be. In the simplest case, regression uses standard statistical techniques such as linear regression [1]. Unfortunately, many real-world problems are not simply linear projections of previous values. For instance, sales volumes, stock prices, and product failure rates are all very difficult to predict because they may depend on complex interactions of multiple predictor variables. Therefore, more complex techniques (e.g., logistic regression, decision trees, or neural nets) may be necessary to forecast future values. The same model types can often be used for both regression and classification. For example, the CART (Classification And Regression Trees) decision tree algorithm can be used to build both classification trees (to classify categorical response variables) and regression trees (to forecast continuous response variables). Neural nets too can create both classification and regression models.
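To make this concrete, here is a minimal sketch (not from the paper; it assumes scikit-learn's CART-based tree estimators and an invented toy dataset) that builds one tree of each kind:

```python
# Illustrative sketch: CART-style trees for classification and regression
# (assumes scikit-learn; the data is a tiny made-up example).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy predictor matrix: [age, income]
X = [[25, 30000], [40, 60000], [35, 45000], [50, 80000]]
y_class = ["no", "yes", "no", "yes"]         # categorical response
y_value = [1200.0, 3100.0, 2000.0, 4500.0]   # continuous response

clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)  # classification tree
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_value)   # regression tree

print(clf.predict([[45, 70000]]))  # class label for a new case
print(reg.predict([[45, 70000]]))  # forecast of a continuous value
```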

c. Logistic regression
Logistic regression is a generalization of linear regression. It is used primarily for predicting binary variables (with values such as yes/no or 0/1) and occasionally multi-class variables [1]. Because the response variable is discrete, it cannot be modeled directly by linear regression. Therefore, rather than predict whether the event itself (the response variable) will occur, we build the model to predict the logarithm of the odds of its occurrence. This logarithm is called the log odds or the logit transformation. The odds of an event are:

odds = P(event occurring) / P(event not occurring)
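As a small worked illustration (the probability value is invented), the sketch below computes the odds and the logit for an event, and inverts the logit back to a probability with the logistic function:

```python
import math

p = 0.8                  # assumed probability that the event occurs
odds = p / (1 - p)       # odds = P(event) / P(not event) = 4.0
logit = math.log(odds)   # log odds: the logit transformation
print(odds, logit)

# Inverting the logit recovers the probability (the logistic function):
p_back = 1 / (1 + math.exp(-logit))
print(p_back)            # 0.8
```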

d. Neural networks
Neural networks are of particular interest because they offer a means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions. (Actual biological neural networks are incomparably more complex.) Neural nets may be used in classification problems (where the output is a categorical variable) or for regression (where the output variable is continuous). A neural network starts with an input layer, where each node corresponds to a predictor variable. These input nodes are connected to a number of nodes in a hidden layer. Each input node is connected to every node in the hidden layer. The nodes in the hidden layer may be connected to nodes in another hidden layer, or to an output layer. The output layer consists of one or more response variables.

Algorithm
There are different types of neural networks, but they are generally classified into feed-forward and feed-back networks.

A feed-forward network is a non-recurrent network which contains inputs, outputs, and hidden layers; the signals can only travel in one direction. Input data is passed onto a layer of processing elements where it performs calculations. Each processing element makes its computation based upon a weighted sum of its inputs. The newly calculated values then become the input values that feed the next layer. This process continues until the data has gone through all the layers and the output is determined. A threshold transfer function is sometimes used to quantify the output of a neuron in the output layer. Feed-forward networks include Perceptron (linear and non-linear) and Radial Basis Function networks. Feed-forward networks are often used in data mining.

A feed-back network has feed-back paths, meaning that signals can travel in both directions using loops. All possible connections between neurons are allowed. Since loops are present in this type of network, it becomes a non-linear dynamic system which changes continuously until it reaches a state of equilibrium. Feed-back networks are often used in associative memories and optimization problems, where the network looks for the best arrangement of interconnected factors.
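A minimal sketch of one feed-forward pass, assuming a logistic sigmoid as the transfer function and invented weights; each layer's activations are the transfer function applied to a weighted sum of the previous layer's values, as described above:

```python
import numpy as np

def sigmoid(x):
    # A common smooth transfer function; a hard threshold could be used instead.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, 0.1, 0.9])        # input layer: one value per predictor
W1 = np.array([[0.2, -0.4, 0.7],     # weights from 3 inputs to 2 hidden nodes
               [0.6,  0.1, -0.3]])
W2 = np.array([[0.5, -0.8]])         # weights from hidden layer to 1 output node

hidden = sigmoid(W1 @ x)             # weighted sums -> hidden activations
output = sigmoid(W2 @ hidden)        # hidden activations feed the output layer
print(output)                        # signals travel in one direction only
```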

e. Support Vector Machine
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. A Support Vector Machine can also be used as a regression method, maintaining all the main features that characterize the algorithm (maximal margin). Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences. First of all, because the output is a real number, it becomes very difficult to predict the information at hand, which has infinite possibilities. In the case of regression, a margin of tolerance (epsilon) is therefore set, in approximation to the margin the SVM would already have requested from the problem; beyond this, the algorithm itself is also more complicated and must be taken into consideration [3]. However, the main idea is always the same: to minimize error by individualizing the hyperplane which maximizes the margin, keeping in mind that part of the error is tolerated.

Figure 2: Support Vector Method
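The hedged sketch below (assuming scikit-learn's SVR and an invented one-dimensional dataset) shows where the margin of tolerance enters: training points predicted within epsilon of their true value contribute no error:

```python
from sklearn.svm import SVR

X = [[1.0], [2.0], [3.0], [4.0], [5.0]]   # toy one-dimensional predictors
y = [1.2, 1.9, 3.2, 3.9, 5.1]             # toy continuous responses

# epsilon is the tolerated deviation; errors inside the tube are ignored.
model = SVR(kernel="linear", epsilon=0.2).fit(X, y)
print(model.predict([[2.5]]))
```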

IV. ALGORITHMIC APPROACHES

a. Genetic algorithms
Genetic algorithms are not used to find patterns per se, but rather to guide the learning process of data mining algorithms such as neural nets. Essentially, genetic algorithms act as a method for performing a guided search for good models in the solution space. They are called genetic algorithms because they loosely follow the pattern of biological evolution, in which the members of one generation (of models) compete to pass on their characteristics to the next generation (of models), until the best (model) is found. The information to be passed on is contained in chromosomes, which contain the parameters for building the model.
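A minimal sketch of this evolutionary loop, with an invented toy fitness function (counting one-bits in a bit-string "chromosome"); in a real application the chromosome would instead encode the parameters for building a model, as the text describes:

```python
import random

def fitness(chrom):
    # Toy objective: prefer chromosomes with more 1-bits.
    return sum(chrom)

def evolve(pop_size=20, length=10, generations=30):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]        # selection: the best models compete
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, length)   # crossover: mix parent chromosomes
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:           # mutation: small random change
                i = random.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)                # the best "model" found

print(evolve())
```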

b. Artificial Neural Networks
An ANN is comprised of a network of artificial neurons (also known as "nodes"). These nodes are connected to each other, and the strength of each connection is assigned a value: inhibition (maximum being -1.0) or excitation (maximum being +1.0). If the value of the connection is high, then it indicates a strong connection. Within each node's design, a transfer function is built in. There are three types of neurons in an ANN: input nodes, hidden nodes, and output nodes. The input nodes take in information in a form which can be numerically expressed. The information is presented as activation values, where each node is given a number; the higher the number, the greater the activation. This information is then passed throughout the network. Based on the connection strengths (weights), inhibition or excitation, and transfer functions, the activation value is passed from node to node. Each node sums the activation values it receives and then modifies the value based on its transfer function. The activation flows through the network, through hidden layers, until it reaches the output nodes. The output nodes then reflect the input in a meaningful way to the outside world.
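The computation at a single node can be sketched as follows (activations and weights are invented, and tanh merely stands in for the built-in transfer function; negative weights act as inhibition, positive weights as excitation):

```python
import math

activations = [0.9, 0.3, 0.6]   # activation values arriving at this node
weights = [0.8, -1.0, 0.4]      # excitatory (+) and inhibitory (-) strengths

total = sum(a * w for a, w in zip(activations, weights))  # sum incoming activation
output = math.tanh(total)       # modify the sum with the transfer function
print(output)                   # activation passed on toward the output layer
```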

c. The k-means Algorithm
The k-means algorithm is an iterative algorithm that gains its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It assigns each observation to a cluster based upon the observation's proximity to the mean of the cluster. The cluster's mean is then recomputed and the process begins again. Here is how the algorithm works:

1. The algorithm arbitrarily selects k points as the initial cluster centers (means).
2. Each point in the dataset is assigned to the closest cluster, based upon the Euclidean distance between each point and each cluster center.
3. Each cluster center is recomputed as the average of the points in that cluster.
4. Steps 2 and 3 repeat until the clusters converge. Convergence may be defined differently depending upon the implementation, but it normally means that either no observations change clusters when steps 2 and 3 are repeated, or that the changes do not make a material difference in the definition of the clusters [2].
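A compact NumPy sketch of steps 1 through 4 (the data and k are invented; ties and empty clusters are ignored for brevity):

```python
import numpy as np

def kmeans(points, k, iters=100):
    rng = np.random.default_rng(0)
    # Step 1: arbitrarily select k points as the initial centers (means).
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each point to the closest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the average of its points.
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat until no center moves (convergence).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

pts = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
print(kmeans(pts, k=2))
```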

Choosing the Number of Clusters
One of the main disadvantages of k-means is the fact that you must specify the number of clusters as an input to the algorithm. As designed, the algorithm is not capable of determining the appropriate number of clusters and depends upon the user to identify this in advance. For example, if you had a group of people that were easily clustered based upon gender, calling the k-means algorithm with k=3 would force the people into three clusters when k=2 would provide a more natural fit. Similarly, if a group of individuals were easily clustered based upon home state and you called the k-means algorithm with k=20, the results might be too generalized to be effective.

d. Nearest Neighbour Algorithms
In pattern recognition, the k-Nearest Neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression [1]. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object; this value is the average of the values of its k nearest neighbors [2]. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms. For both classification and regression, it can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones; for example, a common weighting scheme consists of giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data [2].
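A small sketch of distance-weighted k-NN classification (the data is invented; each neighbor votes with weight 1/d as in the example above, with a tiny constant guarding against division by zero):

```python
import numpy as np
from collections import defaultdict

def knn_predict(train_X, train_y, query, k=3):
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]                   # the k closest training examples
    votes = defaultdict(float)
    for i in nearest:
        votes[train_y[i]] += 1.0 / (dists[i] + 1e-9)  # nearer neighbors count more
    return max(votes, key=votes.get)                  # weighted majority vote

X = np.array([[1.0, 1.0], [1.5, 1.2], [4.0, 4.2], [4.2, 3.9]])
y = ["a", "a", "b", "b"]
print(knn_predict(X, y, np.array([1.2, 1.1])))        # -> "a"
```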

e. K-nearest neighbor and memory-based reasoning (MBR)
When trying to solve new problems, people often look at solutions to similar problems that they have previously solved. K-nearest neighbor (k-NN) is a classification technique that uses a version of this same method. It decides in which class to place a new case by examining some number (the k in k-nearest neighbor) of the most similar cases, or neighbors. It counts the number of cases for each class, and assigns the new case to the same class to which most of its neighbors belong.

f. Bayesian Algorithms
Bayesian approaches are a fundamentally important data mining technique. Given the probability distribution, the Bayes classifier can provably achieve the optimal result. The Bayesian method is based on probability theory: Bayes' Rule is applied to calculate the posterior from the prior and the likelihood, because the latter two are generally easier to calculate from a probability model. One limitation that Bayesian approaches cannot overcome is the need to estimate probabilities from the training dataset. It is noticeable that in some situations, such as when the decision is clearly based on certain criteria or when the dataset has a high degree of randomness, Bayesian approaches will not be a good choice. Bayes' theorem plays a critical role in probabilistic learning and classification. A generative model that approximates how data is produced uses the prior probability of each category given no information about an item; categorization then produces a posterior probability distribution over the possible categories given a description of the item:

P(C, D) = P(C|D) P(D) = P(D|C) P(C)

P(C|D) = P(D|C) P(C) / P(D)
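A worked numeric sketch of Bayes' Rule with invented numbers, where C is a category and D is a description of an item:

```python
# Invented example: C = "spam", D = "message contains the word 'offer'".
p_c = 0.3          # prior: P(C)
p_d_given_c = 0.6  # likelihood: P(D|C)
p_d = 0.25         # evidence: P(D)

p_c_given_d = p_d_given_c * p_c / p_d   # Bayes' Rule: posterior P(C|D)
print(p_c_given_d)                      # 0.72
```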

V. MODEL EVALUATION

Model evaluation is an integral part of the model development process. It helps to find the best model that represents our data and indicates how well the chosen model will work in the future. Evaluating model performance with the data used for training is not acceptable in data mining because it can easily generate overoptimistic and overfitted models. There are two methods of evaluating models in data mining, Hold-Out and Cross-Validation. To avoid overfitting, both methods use a test set (not seen by the model) to evaluate model performance.

Hold-Out:
In this method, the (usually large) dataset is randomly divided into three subsets. The training set is a subset of the dataset used to build predictive models. The validation set is a subset of the dataset used to assess the performance of the model built in the training phase; it provides a test platform for fine-tuning the model's parameters and selecting the best-performing model. Not all modelling algorithms need a validation set. The test set, or unseen examples, is a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
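A hedged sketch of the three-way hold-out split (the proportions are invented; only NumPy is assumed):

```python
import numpy as np

data = np.arange(100)             # stand-in for a dataset of 100 rows
rng = np.random.default_rng(42)
idx = rng.permutation(len(data))  # random division of the dataset

train = data[idx[:60]]            # 60% to build predictive models
valid = data[idx[60:80]]          # 20% to tune parameters / select a model
test = data[idx[80:]]             # 20% unseen, to estimate future performance
print(len(train), len(valid), len(test))
```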

Cross-Validation
When only a limited amount of data is available, we use k-fold cross-validation to achieve an unbiased estimate of the model performance. In k-fold cross-validation, we divide the data into k subsets of equal size. We build models k times, each time leaving out one of the subsets from training and using it as the test set. If k equals the sample size, this is called "leave-one-out".
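A minimal k-fold sketch (NumPy assumed; model building is left as a placeholder since the paper does not fix a particular model):

```python
import numpy as np

def k_fold_indices(n_samples, k):
    idx = np.random.default_rng(0).permutation(n_samples)
    folds = np.array_split(idx, k)    # k subsets of (nearly) equal size
    for i in range(k):
        test = folds[i]               # one subset held out as the test set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

for train, test in k_fold_indices(n_samples=10, k=5):
    # build the model on `train`, evaluate on `test` (placeholder)
    print(len(train), len(test))

# With k equal to the sample size (k=10 here), this becomes leave-one-out.
```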

Time series
Time series forecasting predicts unknown future values based on a time-varying series of predictors. Like regression, it uses known results to guide its predictions. Models must take into account the distinctive properties of time, especially the hierarchy of periods (including such varied definitions as the five- or seven-day work week, the thirteen-month year, etc.), seasonality, and calendar effects such as holidays, date arithmetic, and special considerations such as how much of the past is relevant [1].

VI. CONCLUSION

Due to the increasing demand for valuable data and accuracy, new methods and techniques need to be identified to improve quality. This paper described different methodologies associated with the algorithms used to handle huge data, and gave an overview of various techniques and algorithms used on big data sets.

REFERENCES

[1] http://www.twocrows.com/intro-dm.pdf
[2] http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
[3] http://research.cs.queensu.ca/home/xiao/dm.html
[4] Mohammed Younus, Dr. Ahmad A. Alhamed, Khazi Mohammed Farooq, Fahmida Begum, "Data Mining Modeling Techniques and Algorithm Approaches in Privacy Data," IJARCSSE, Volume 4, 2014.
[5] Rachna Somkunwar, "A Study on Various Data Mining Approaches of Association Rules," IJARCSSE, Volume 2, 2012.
